Public Health

Data analysis of the National Medical Residency Commission

Public health analytics application to organize, validate and explore national records of medical residency certificates in Brazil.

Python, BigQuery, Public Health, Medical Residency, Workforce, Analytics, Dashboard

Context

Medical residency is one of the main pillars of specialist training in Brazil. However, public data on certificates, programs, institutions, specialties and territorial distribution need to be processed and organized before they can support useful analyses for education planning, regulation and health workforce assessment.

Problem addressed

The original dataset required standardization of names, dates, programs, institutions, states, specialties, prerequisites and validation rules. There were also limitations related to invalid records, physician name homonyms and changes in institutional and program nomenclature.

Project objective

Build an interactive application to explore CNRM medical residency certificate records, enabling analyses by time, territory, program, institution, training type, inferred sex, specialty and training cycle.

Situation

Public medical residency certificate data were available, but required standardization, validation and organization to support analyses of medical training, geographic distribution and specialized workforce.

Task

Build an analytical application to explore CNRM records, allowing users to filter, validate and visualize certificates, programs, institutions, states, regions, specialties and training cycles.

Action

Structured a pipeline for extraction, column standardization, date normalization, text field cleaning, hash identifier creation, regional enrichment and record validation based on consistency rules.

Result

The application consolidated more than 333k valid certificates, identified more than 135k physicians and delivered interactive dashboards for national analysis of medical residency in Brazil.

Key metrics

- Valid certificates: 333,518
- Invalid certificates: 21,128
- Identified physicians: 135,828
- States: 27
- Regions: 5
- Programs: 132
- Average certificates per physician: 1.31
- Average training duration: 3.51 years

Dataset scope

The scope included medical residency certificate records from the National Medical Residency Commission, focusing on valid and invalid certificates, programs, institutions, physicians, states, regions, training duration, start year, end year and certificate issue year.

Methodological pipeline

The pipeline followed these steps:
- Acquisition of public data by state.
- Stacking of state-level files.
- Column renaming and standardization.
- Normalization of residency start date, residency end date and certificate issue date.
- Standardization of text fields such as program, institution and state.
- Creation of unique hash identifiers for physician name, medical license and certificate.
- Generation of derived columns such as start year and end year.
- Regional enrichment based on state.
- Specialty standardization according to CFM rules.
- Standardization of training duration, prerequisites and area of practice.
- Creation of a validation field to separate valid and invalid records.
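A few of these steps (text standardization, date normalization, hash identifiers, derived years and regional enrichment) can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the column names (`physician_name`, `license`, `start_date` and so on), the `DD/MM/YYYY` date format, and the choice to hash name and license together are all assumptions made for the example.

```python
import hashlib
import unicodedata

import pandas as pd

# Brazil's 27 federative units grouped into the 5 official regions.
STATES_BY_REGION = {
    "North": ["AC", "AP", "AM", "PA", "RO", "RR", "TO"],
    "Northeast": ["AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"],
    "Central-West": ["DF", "GO", "MT", "MS"],
    "Southeast": ["ES", "MG", "RJ", "SP"],
    "South": ["PR", "RS", "SC"],
}
REGION_BY_STATE = {uf: region for region, ufs in STATES_BY_REGION.items() for uf in ufs}


def clean_text(value) -> str:
    """Strip accents, collapse whitespace and upper-case a text field."""
    if pd.isna(value):
        return ""
    decomposed = unicodedata.normalize("NFKD", str(value))
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split()).upper()


def stable_hash(*parts) -> str:
    """SHA-256 hex digest of the cleaned parts joined by '|'."""
    joined = "|".join(clean_text(part) for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def standardize(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the illustrative standardization steps to a stacked raw table."""
    df = raw.copy()
    for col in ("physician_name", "program", "institution", "state"):
        df[col] = df[col].map(clean_text)
    # Assumed input format DD/MM/YYYY; unparseable dates become NaT.
    for col in ("start_date", "end_date"):
        df[col] = pd.to_datetime(df[col], format="%d/%m/%Y", errors="coerce")
    # Hashing name and license together (an assumption) keeps homonyms apart.
    df["physician_id"] = [
        stable_hash(name, lic) for name, lic in zip(df["physician_name"], df["license"])
    ]
    df["start_year"] = df["start_date"].dt.year
    df["end_year"] = df["end_date"].dt.year
    df["region"] = df["state"].map(REGION_BY_STATE)
    return df


# Made-up sample rows, for illustration only.
raw = pd.DataFrame({
    "physician_name": ["  José da  Silva ", "JOSE DA SILVA", "Ana Souza"],
    "license": ["12345-SP", "12345-SP", "67890-BA"],
    "program": ["Clínica Médica", "Cardiologia", "Pediatria"],
    "institution": ["Hospital A", "Hospital A", "Hospital B"],
    "state": ["sp", "SP", "ba"],
    "start_date": ["01/03/2015", "01/03/2017", "01/03/2018"],
    "end_date": ["28/02/2017", "28/02/2019", "31/02/2021"],  # last date is impossible
})
clean = standardize(raw)
```

In this sample the first two rows resolve to the same `physician_id` despite different spelling and spacing, and the impossible end date is coerced to a missing value so that a later validation step can flag it.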

Validation rules

Validation rules were created to identify invalid records, including:
- Rows without start or end dates.
- Rows without program, institution or physician name.
- Specialties or areas of practice not identified according to CFM rules.
- End dates incompatible with the standard training duration.
- Records without prerequisite criteria when applicable.

Invalid rows were excluded from the main analyses.
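Rules of this kind can be expressed as boolean masks combined into a single validation field. The sketch below is only an illustration under stated assumptions: the columns `expected_years`, `specialty_known`, `requires_prerequisite` and `prerequisite`, as well as the three-month tolerance on duration, are hypothetical and do not come from the project.

```python
import pandas as pd

TEXT_COLS = ["program", "institution", "physician_name"]
DATE_COLS = ["start_date", "end_date"]


def flag_valid(df: pd.DataFrame, tolerance_years: float = 0.25) -> pd.DataFrame:
    """Add a boolean `valid` column; a row is invalid if any rule fails."""
    out = df.copy()
    # Rule: program, institution and physician name must be present and non-blank.
    missing_text = out[TEXT_COLS].fillna("").apply(lambda s: s.str.strip() == "").any(axis=1)
    # Rule: start and end dates must be present.
    missing_dates = out[DATE_COLS].isna().any(axis=1)
    # Rule: observed duration must match the standard duration within a tolerance.
    years = (out["end_date"] - out["start_date"]).dt.days / 365.25
    bad_duration = (years - out["expected_years"]).abs() > tolerance_years
    # Rule: the specialty must have been recognized by the (hypothetical) CFM mapping.
    unknown_specialty = ~out["specialty_known"]
    # Rule: a prerequisite must be recorded when the program requires one.
    missing_prereq = out["requires_prerequisite"] & out["prerequisite"].isna()
    out["valid"] = ~(
        missing_text | missing_dates | bad_duration | unknown_specialty | missing_prereq
    )
    return out


# Made-up sample rows: only the first one satisfies every rule.
sample = pd.DataFrame({
    "physician_name": ["ANA", "BRUNO", "CARLA", "DIEGO", "ELISA"],
    "program": ["CARDIOLOGIA", "PEDIATRIA", "  ", "PEDIATRIA", "CARDIOLOGIA"],
    "institution": ["HOSPITAL A"] * 5,
    "start_date": pd.to_datetime(["2019-03-01"] * 5),
    "end_date": pd.to_datetime(["2021-02-28", None, "2021-02-28", "2020-02-29", "2021-02-28"]),
    "expected_years": [2, 2, 2, 3, 2],
    "specialty_known": [True, True, True, True, True],
    "requires_prerequisite": [True, False, False, False, True],
    "prerequisite": ["CLINICA MEDICA", None, None, None, None],
})
flagged = flag_valid(sample)
valid_rows = flagged[flagged["valid"]]
```

Keeping each rule as its own named mask makes it straightforward to report how many records fail each criterion, rather than only the combined total.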

Application features

The application allows users to filter and explore data by:
- Program
- Institution
- Region
- State
- Inferred physician sex
- Training type
- Basic specialty
- Direct entry
- Training duration
- Certificate issue year
- Training start year
- Training end year
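One way such filters can be applied to a pandas table is sketched below. It assumes the selections arrive as a mapping from column name to chosen values (for example, from multiselect widgets) and that an empty selection means "no filter"; the function name `apply_filters` and the column names are hypothetical.

```python
from typing import Iterable, Mapping, Optional, Tuple

import pandas as pd


def apply_filters(
    df: pd.DataFrame,
    selections: Mapping[str, Iterable],
    year_range: Optional[Tuple[int, int]] = None,
    year_col: str = "issue_year",
) -> pd.DataFrame:
    """Keep rows matching every non-empty selection and the optional year range."""
    mask = pd.Series(True, index=df.index)
    for column, values in selections.items():
        values = list(values)
        if values:  # an empty selection leaves the column unfiltered
            mask &= df[column].isin(values)
    if year_range is not None:
        low, high = year_range
        mask &= df[year_col].between(low, high)  # inclusive on both ends
    return df[mask]


# Made-up sample rows, for illustration only.
records = pd.DataFrame({
    "region": ["Southeast", "Southeast", "South", "Northeast"],
    "state": ["SP", "RJ", "RS", "BA"],
    "issue_year": [2018, 2020, 2020, 2021],
})
subset = apply_filters(
    records,
    {"region": ["Southeast", "South"], "state": []},
    year_range=(2019, 2021),
)
```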

Visualizations and analyses

The dashboard includes visualizations such as:
- Time series of certificates issued by year.
- Pareto chart of certificates by region.
- Pareto chart of certificates by state.
- Ranking of programs with the most certificates.
- Ranking of institutions with the most certificates.
- Distribution by inferred sex.
- Entries and exits by residency year.
- Matrix by training cycle and year.
- Indicators for valid certificates, invalid certificates, programs, regions, states and trained physicians.
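The data behind a Pareto chart is a count per category sorted in descending order, plus the cumulative share. A minimal pandas sketch is shown below; the function name `pareto_table` and the `region` column are assumptions for illustration, and the charting step itself is left out.

```python
import pandas as pd


def pareto_table(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Count rows per category, sort descending and add cumulative percentages."""
    counts = (
        df.groupby(group_col)
        .size()
        .sort_values(ascending=False, kind="stable")
        .rename("certificates")
    )
    table = counts.to_frame()
    table["share_pct"] = 100 * table["certificates"] / table["certificates"].sum()
    table["cumulative_pct"] = table["share_pct"].cumsum()
    return table.reset_index()


# Made-up sample: ten certificates spread over three regions.
certificates = pd.DataFrame(
    {"region": ["Southeast"] * 5 + ["Northeast"] * 3 + ["South"] * 2}
)
by_region = pareto_table(certificates, "region")
```

The resulting table can feed a bar series (`certificates`) and a line series (`cumulative_pct`) on a secondary axis in any charting library.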

Technologies used

- Python
- BigQuery
- Streamlit
- Pandas
- Plotly
- ETL
- Interactive dashboard
- Public data processing and validation

Known limitations

- Presence of records with invalid data.
- Physician name homonyms.
- Changes in institutional and program nomenclature.
- Need for continuous validation when public datasets are updated.
- Biological sex inferred from names, subject to error and used only for aggregate analysis.

Useful links

- Official source: CNRM Portal
- Official reference: CFM Resolution
- BigQuery table: escolap2p.base_siscnrm.residentes_applications
- Interactive application: open analysis

Deliverables

- Interactive analytics application.
- Data processing and standardization pipeline.
- Analytical table in BigQuery.
- Certificate validation rules.
- National indicators on medical residency.
- Temporal, territorial and institutional visualizations.
- Filters by program, institution, state, region, sex, training type and period.
- Methodological documentation inside the app.

Learnings

- Public health data may require substantial standardization before generating reliable analyses.
- Analytical quality depends on both visualization and validation rules.
- Hashes and standardization help protect identifiers and enable consistent counts.
- Educational and regulatory datasets require semantic enrichment to support decisions.
- Public health dashboards should make scope, limitations and methodological criteria explicit.
