Dropout prediction in UNA-SUS courses
CatBoost model to estimate dropout risk in healthcare training courses and support data-driven retention actions.
Project context
The project started from a practical need: understanding and reducing dropout in UNA-SUS healthcare training courses. Before modeling, the educational process was mapped, covering course creation, offering, enrollment, start date, completion and student classification as approved, dropped out, under administrative process or still able to complete.
Problem addressed
Dropout reduces the impact of training initiatives, makes public workforce development policies harder to manage and limits the strategic use of educational resources. The challenge was to identify students at higher dropout risk in advance, supporting active contact, tutoring, communication reinforcement and possible deadline flexibility.
Methodological approach
The analysis followed a data project logic based on CRISP-DM:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
The work included process mapping, definition of completion criteria, data-driven investigation of Dengue courses, data preparation, feature selection, model training, metric evaluation and app deployment.
Situation
UNA-SUS healthcare training courses needed to identify dropout signals before students left, in a process involving high enrollment volume, different course timelines and multiple completion statuses.
Task
Structure a predictive analysis capable of estimating completion probability and dropout risk, supporting tutoring, retention and follow-up actions.
Action
Mapped the educational process, defined completion rules, investigated Dengue course data, trained a CatBoost model with 9 features and evaluated performance using ROC, PR, confusion matrix, thresholds and SHAP.
Result
The model reached a ROC AUC of 0.778, AP of 0.84 and accuracy of 0.714, and was deployed in an app for individual and batch prediction.
Key metrics
Process understanding
The project began with mapping the educational flow. Different time levels were considered: course creation, course offering, enrollment period, class start, class end, student entry and completion date. This understanding was necessary to distinguish approved students, dropouts, students under administrative process and students still able to complete the course.
Data-driven investigation
The initial investigation focused on Dengue courses and organized the student flow across enrollment, completion, approval, dropout, administrative process and students still able to complete. This step transformed a broad educational problem into an operational predictive question: which students are at higher risk of dropping out?
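As a rough illustration of how the four student statuses could be turned into a binary modeling target, the sketch below maps each status to a label. The status names, the `to_target` helper and the decision to leave administrative and still-open cases out of training are assumptions made for this example; the project's actual completion rules are not detailed here.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    # Hypothetical status codes mirroring the four groups named in the text.
    APPROVED = "approved"
    DROPPED_OUT = "dropped_out"
    ADMINISTRATIVE_PROCESS = "administrative_process"
    STILL_ABLE_TO_COMPLETE = "still_able_to_complete"


def to_target(status: Status) -> Optional[int]:
    """Map a status to the binary target, or None when the outcome is not settled."""
    if status is Status.APPROVED:
        return 1  # class 1: completion
    if status is Status.DROPPED_OUT:
        return 0  # class 0: dropout
    # Assumption: administrative and still-open cases are excluded from training.
    return None


statuses = [Status.APPROVED, Status.STILL_ABLE_TO_COMPLETE, Status.DROPPED_OUT]
labels = [to_target(s) for s in statuses if to_target(s) is not None]
print(labels)
```

Keeping unsettled cases out of the target avoids labeling a student as a dropout while completion is still possible.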
Proposed solution
A CatBoost classifier was developed to estimate the probability of course completion. Dropout risk is interpreted as the complement of this probability, that is, 1 minus the probability of completion. The application allows users to evaluate model performance, adjust operational thresholds and interpret the factors associated with predictions.
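The complement rule can be written in a few lines. This is a minimal sketch: the function names are hypothetical, and applying the threshold to the completion probability (rather than to the dropout risk) is an assumption about how the app works.

```python
def dropout_risk(p_completion: float) -> float:
    """Dropout risk as the complement of the completion probability."""
    if not 0.0 <= p_completion <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 1.0 - p_completion


def predicted_class(p_completion: float, threshold: float = 0.50) -> int:
    """Return 1 (completion) when the probability reaches the threshold, else 0 (dropout)."""
    return 1 if p_completion >= threshold else 0


p = 0.62  # illustrative probability, not a real student
print(round(dropout_risk(p), 2))            # 0.38
print(predicted_class(p))                   # 1 at the default 0.50 threshold
print(predicted_class(p, threshold=0.656))  # 0 at a stricter threshold
```

The same student moves from "completion" to "dropout" when the threshold rises from 0.50 to 0.656, which is why the threshold is treated as an operational choice.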
Why CatBoost
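CatBoost was chosen because it works well on real-world tabular data, handles categorical variables natively and offers class weighting for imbalanced datasets. As a gradient boosting method, it builds decision trees sequentially, with each new tree correcting the errors of the trees built before it. Its built-in categorical encoding reduces the need for one-hot encoding, that is, artificially expanding each category into multiple columns.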
Final model features
The final model used 9 features:
- status_formulario_inicial
- matricula_ingresso_inicio_dias
- aluno_profissional_profissao_descricao
- hist_12m_conclusoes
- hist_total_conclusoes
- hist_total_abandonos
- form_inicial_atual_q03_nivel_conhecimento
- aluno_acesso_acesso_uf
- aluno_profissional_escolaridade_descricao
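A training call over these nine columns might look like the sketch below. Several parts are assumptions: the split into categorical and numeric columns is inferred from the column names only, the target column name `concluiu` is hypothetical, and the default hyperparameters are not the project's settings. The function needs the `catboost` package and a pandas DataFrame.

```python
# Assumed split, inferred from the column names (not confirmed by the project).
CATEGORICAL = [
    "status_formulario_inicial",
    "aluno_profissional_profissao_descricao",
    "form_inicial_atual_q03_nivel_conhecimento",
    "aluno_acesso_acesso_uf",
    "aluno_profissional_escolaridade_descricao",
]
NUMERIC = [
    "matricula_ingresso_inicio_dias",
    "hist_12m_conclusoes",
    "hist_total_conclusoes",
    "hist_total_abandonos",
]
FEATURES = CATEGORICAL + NUMERIC


def fit_completion_model(train_df, target_col="concluiu"):
    """Fit a CatBoost classifier on a pandas DataFrame (target_col is a hypothetical name)."""
    # Imported here so the feature lists above can be used without catboost installed.
    from catboost import CatBoostClassifier

    X = train_df[FEATURES].copy()
    # CatBoost rejects float NaN in categorical columns; casting to str turns
    # missing values into the string "nan", which is treated as its own category.
    X[CATEGORICAL] = X[CATEGORICAL].astype(str)
    model = CatBoostClassifier(cat_features=CATEGORICAL, verbose=False)
    model.fit(X, train_df[target_col])
    return model
```

Passing the column names through `cat_features` lets CatBoost encode the categories internally, so no one-hot expansion is needed before training.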
Classification report
At the 0.50 threshold, the model achieved:
Class 0, dropout:
- Precision: 0.606
- Recall: 0.453
- F1-score: 0.518
- Support: 74,668
Class 1, completion:
- Precision: 0.750
- Recall: 0.848
- F1-score: 0.796
- Support: 144,893
Total:
- Accuracy: 0.714
- Macro F1: 0.657
- Weighted F1: 0.702
- Total support: 219,561
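The aggregate figures in the report follow from the per-class precision, recall and support. The short check below recomputes them from the reported values. The variable names are illustrative, and small differences in the last digit can come from rounding in the inputs.

```python
# Per-class values copied from the classification report above.
report = {
    "dropout": {"precision": 0.606, "recall": 0.453, "support": 74_668},
    "completion": {"precision": 0.750, "recall": 0.848, "support": 144_893},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {name: f1(row["precision"], row["recall"]) for name, row in report.items()}
total = sum(row["support"] for row in report.values())

macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[n] * report[n]["support"] for n in report) / total
# recall * support is the number of correctly classified students in each class.
accuracy = sum(report[n]["recall"] * report[n]["support"] for n in report) / total

print({name: round(value, 3) for name, value in f1_scores.items()})
print(total, round(macro_f1, 3), round(weighted_f1, 3), round(accuracy, 3))
```

The gap between the macro F1 and the weighted F1 reflects the weaker performance on the smaller dropout class, which the weighted average partly hides.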
Threshold interpretation
The model does not provide an automatic "will complete" or "will drop out" answer. It provides a probability. The operational threshold must be defined according to the team's strategy:
- Lower threshold: increases sensitivity and captures more at-risk students, useful when the cost of missing a student is high.
- Higher threshold: increases precision and concentrates actions on higher-risk groups, useful when tutoring, scholarships or follow-up resources are limited.
- The threshold suggested by Youden's J statistic was 0.656, but the final threshold should be chosen according to the team's operational capacity.
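Youden's J statistic picks the threshold that maximizes sensitivity plus specificity minus 1, which equals the true positive rate minus the false positive rate. The sketch below shows the calculation on a small invented dataset. The function name and the data are illustrative, and treating completion as the positive class is an assumption consistent with the classification report.

```python
def youden_threshold(y_true, p_completion):
    """Return (threshold, J) maximizing J = TPR - FPR, with completion (1) as positive."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(p_completion)):  # each observed score is a candidate threshold
        tp = sum(1 for y, p in zip(y_true, p_completion) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, p_completion) if y == 0 and p >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy data for illustration only, not project data.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
p_completion = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

best_t, best_j = youden_threshold(y_true, p_completion)
print(best_t, best_j)  # 0.4 0.75
```

Youden's J weighs both error types equally, so it serves as a starting point. A team with limited tutoring capacity may still prefer a different cut-off.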
Explainability
Model interpretation was performed using global feature importance and SHAP values. SHAP values show how each variable pushes an individual prediction toward a higher chance of completion or a higher dropout risk. In the technical material, for example, a larger number of days since enrollment start is associated with higher dropout risk, while answering the initial form increases the chance of completion.
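SHAP values are additive: a base value plus the per-feature contributions gives the model output for one student. For tree-based classifiers this sum is commonly expressed in log-odds and converted to a probability with the logistic function. The numbers below are invented to show the mechanics and are not taken from the project's model. Only the direction of the two effects follows the text above.

```python
import math

# Illustrative numbers only: NOT values from the project's model.
base_value = 0.5  # average model output in log-odds
contributions = {
    "status_formulario_inicial": 0.8,        # initial form answered: pushes toward completion
    "matricula_ingresso_inicio_dias": -0.6,  # many days since enrollment start: pushes toward dropout
}


def sigmoid(x: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


log_odds = base_value + sum(contributions.values())
p_baseline = sigmoid(base_value)
p_completion = sigmoid(log_odds)

print(round(p_baseline, 3), round(p_completion, 3))
```

Reading the contributions this way shows which factors moved a given student above or below the average prediction, which is what supports the follow-up conversation with tutors.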
Deployment
The solution was deployed in Streamlit, with features for:
- Individual student prediction
- Batch prediction for multiple students
- Model performance evaluation
- Operational threshold adjustment
- ROC and Precision-Recall curve visualization
- Confusion matrix
- Classification report
- Feature importance analysis
- SHAP analysis
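For batch prediction, one simple way to turn scores into action is to rank students by dropout risk and keep as many as the team can contact. The sketch below assumes the batch arrives as `(student_id, completion probability)` pairs. The `prioritize` helper, the identifiers and the probabilities are all hypothetical.

```python
def prioritize(students, capacity):
    """Return the `capacity` students with the highest dropout risk.

    `students` is a list of (student_id, p_completion) pairs.
    """
    ranked = sorted(students, key=lambda s: s[1])  # lowest completion probability first
    return [(sid, round(1.0 - p, 3)) for sid, p in ranked[:capacity]]


# Invented batch for illustration.
batch = [("A", 0.91), ("B", 0.35), ("C", 0.62), ("D", 0.48)]
contact_list = prioritize(batch, capacity=2)
print(contact_list)  # [('B', 0.65), ('D', 0.52)]
```

Ranking by capacity avoids fixing a single threshold: the cut-off becomes whatever risk level the available tutoring hours can cover.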
About the Average Precision metric
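Average Precision (AP) summarizes the Precision-Recall curve in a single number: the average of the precision values obtained at each threshold, weighted by the increase in recall from the previous threshold. It is useful for imbalanced classification problems because, unlike accuracy, it focuses on how well the positive class is ranked. As a reference point, a classifier with no ranking ability has an AP close to the share of the positive class, about 0.66 here if completion is the positive class (144,893 of 219,561 students), against the model's 0.84.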
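To make the metric concrete, the sketch below computes AP by hand: students are sorted by score, and precision is recorded each time a true positive is found. The data is invented, the function name is illustrative, and the shortcut assumes no tied scores.

```python
def average_precision(y_true, scores):
    """AP as the mean of the precision at each rank where a positive is found.

    Assumes no tied scores; libraries handle ties by grouping equal thresholds.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank where recall increases
    return total / n_pos


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.6, 0.8, 0.4, 0.2]

ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.9167
```

In this example the three positives are found at ranks 1, 2 and 4, giving precisions of 1, 1 and 0.75, whose mean is about 0.917.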
Deliverables
- Interactive web application
- Trained CatBoost model
- Data preparation and feature selection pipeline
- Technical performance report
- Operational threshold analysis
- ROC and Precision-Recall curves
- Confusion matrix
- Classification report
- Feature importance analysis
- SHAP analysis
- Technical project presentation
Learnings
- Predictive projects need to start with process understanding, not with the model.
- Dropout definition depends on clear operational rules.
- Threshold selection is as important as the model's global metric.
- In imbalanced datasets, AP, PR curve and class-level analysis are essential.
- SHAP helps turn the model into a management tool.
- The model should support prioritization, not replace human decision-making.