Predictive Models

Breast lesion risk prediction

Logistic Regression model to stratify breast lesion risk and support prevention, screening and early diagnosis pathways in Primary Care.

Python · Logistic Regression · Machine Learning · Oncology · Private Health Insurance · Primary Care · Decision Support

Clinical and care context

Breast cancer is one of the leading causes of mortality among women and is highly curable when detected early. In private health insurance, Primary Care can support care coordination, identify risk factors and target prevention, screening and diagnostic pathways more precisely.

Problem addressed

The payer observed frequent and costly use of mammography and breast ultrasound, including among young patients without identified risk factors and without later positive diagnoses. At the same time, it was necessary not to miss beneficiaries at increased risk, especially outside traditional screening criteria.

Project objective

Develop a prediction and risk stratification algorithm for breast problems, supporting individualized Primary Care pathways for prevention, early diagnosis and more targeted referral.

Situation

A healthtech company in the health insurance sector needed to identify beneficiaries at higher risk of breast problems and better direct prevention, screening and early diagnosis actions.

Task

Build a prediction and risk stratification algorithm to support individualized Primary Care pathways and reduce poorly targeted opportunistic screening.

Action

Consolidated real data from electronic health records, medical claims, questionnaires, exams and gym access logs, transforming risk factors into 21 Boolean features and comparing models such as Logistic Regression, Random Forest, SVM and Neural Network.

Result

The best model was Logistic Regression, with a mean validation AUC of 0.8165 (95% CI 0.8056 to 0.8229) and an AUC of 0.823 on the ROC curve displayed in the demonstration app.

Key metrics

Visual ROC AUC in the app: 0.823
Mean validation AUC: 0.8165
95% CI for AUC: 0.8056 – 0.8229
Boolean features: 21
Final model: Logistic Regression
Hyperparameters: class_weight=balanced · max_iter=1000 · penalty=l2

Data sources

The project used real private datasets from a health insurance organization, including:
- Electronic health records
- Self-administered questionnaires
- ERP and medical claims data
- Exam results
- Gym access logs from partner gyms

Building the analytical dataset

The datasets were integrated to build a final risk factor matrix. Twenty-one Boolean features were identified, representing the presence or absence of factors associated with breast problems. The target grouped benign and malignant breast alterations because isolated malignancy had low occurrence in the sample.
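As an illustration of how a risk factor can become a Boolean feature, the sketch below maps one consolidated beneficiary record to presence/absence flags. The field names, the gym visit cutoff and the choice of these four features are assumptions made for the example, not the project's actual business rules; the BMI cutoff of 30 follows the usual definition of obesity.

```python
def to_boolean_features(record):
    """Map one consolidated beneficiary record to presence/absence flags.

    All keys and the gym-visit cutoff are hypothetical, for illustration only.
    """
    bmi = record.get("bmi")
    gym_visits = record.get("gym_visits_last_90d", 0)
    return {
        # Missing information is treated as "factor not present".
        "family_history_breast_cancer": bool(
            record.get("family_history_breast_cancer", False)
        ),
        "obesity": bmi is not None and bmi >= 30.0,
        "smoking": record.get("smoking_status") == "current",
        # Hypothetical rule: fewer than 12 gym visits in 90 days.
        "sedentary": gym_visits < 12,
    }


example = {"bmi": 31.2, "smoking_status": "never", "gym_visits_last_90d": 3}
features = to_boolean_features(example)
print(features)
```

Treating missing information as "absent" is itself a business rule and would need to be agreed with the clinical team in a real pipeline.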

Examples of risk factors analyzed

The factors considered included:
- Family and personal history of breast cancer
- Ovarian neoplasia
- Biopsy history
- Breast-related diagnoses
- Breast density
- Obesity
- Smoking
- Alcohol use
- Sedentary behavior
- Inadequate diet
- Hormonal and gynecological-obstetric factors
- Other relevant clinical conditions

Modeling

Different machine learning algorithms were compared:
- Logistic Regression
- Random Forest
- Support Vector Machines
- Sequential Neural Network

The comparison used cross-validation, hyperparameter optimization and evaluation by area under the ROC curve.
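To make the evaluation metric concrete, the following sketch computes the area under the ROC curve as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, then averages it over validation folds. It is a plain-Python illustration with made-up labels and scores, not the project's evaluation code; `roc_auc` and `mean_cv_auc` are hypothetical helper names.

```python
def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_cv_auc(fold_results):
    """Average the AUC over a list of (y_true, scores) validation folds."""
    aucs = [roc_auc(y, s) for y, s in fold_results]
    return sum(aucs) / len(aucs), aucs


# Two toy validation folds with invented labels and scores.
folds = [
    ([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]),
    ([0, 1, 0, 1], [0.20, 0.90, 0.30, 0.60]),
]
mean_auc, fold_aucs = mean_cv_auc(folds)
print(fold_aucs, mean_auc)
```

The pairwise loop is quadratic in the number of cases, which is fine for a demonstration; a rank-based formulation would be preferable on a full validation set.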

Final model

The best model was a Logistic Regression with the following hyperparameters:
- class_weight: balanced
- max_iter: 1000
- penalty: l2

This choice favored a balance between performance, interpretability and applicability in a healthcare decision support context.
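The hyperparameter names match scikit-learn's `LogisticRegression`; assuming that library was used, `class_weight: balanced` weights each class by `n_samples / (n_classes * class_count)`, so the rare positive class counts more during fitting. The sketch below reproduces that formula in plain Python on an invented label vector; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following the 'balanced' heuristic:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented example: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

In this toy case each positive example weighs 10.0 and each negative about 0.53, which is why a balanced model tends to trade more false positives for fewer false negatives.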

Confusion matrix and visual performance

In the demonstration app, the confusion matrix shows:
- TN: 1,264
- FP: 737
- FN: 9
- TP: 49

These results reinforce that the operating point must balance sensitivity and precision according to care capacity. In prevention and screening, threshold selection should consider the cost of false negatives, the cost of false positives and care availability.
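The trade-off can be read directly from the four counts reported for the demonstration app. The sketch below derives the usual rates from them; only the counts come from the project, and the variable names are illustrative.

```python
# Counts reported for the demonstration app's confusion matrix.
tn, fp, fn, tp = 1264, 737, 9, 49

total = tn + fp + fn + tp
sensitivity = tp / (tp + fn)         # share of true cases that are flagged
specificity = tn / (tn + fp)         # share of non-cases that are not flagged
precision = tp / (tp + fp)           # share of flagged beneficiaries who are cases
prevalence = (tp + fn) / total       # share of cases in this sample
flagged_share = (tp + fp) / total    # share of beneficiaries sent for follow-up

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("prevalence", prevalence),
    ("flagged share", flagged_share),
]:
    print(f"{name}: {value:.3f}")
```

At this operating point sensitivity is about 0.84 and specificity about 0.63, while precision is about 0.06 because cases make up under 3% of the sample. Roughly 38% of beneficiaries are flagged, which is the figure to compare against care capacity.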

Comparison with literature

The technical report notes that published breast cancer risk prediction models using epidemiological, demographic and clinical data reported a mean AUC around 0.73, with 95% CI from 0.66 to 0.80. The developed model reached a mean AUC of 0.8165, suggesting competitive performance in an applied setting using real payer datasets.

How to use the output

The model should be interpreted as a decision support tool, not as a replacement for clinical judgment. Its proposed use is to support prioritization of beneficiaries for Primary Care pathways, clinical investigation, preventive guidance, individualized screening and eventual diagnostic referral.
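One simple way to turn scores into a prioritization list is to rank beneficiaries by predicted risk and keep as many as the care team can absorb. The sketch below assumes a mapping from beneficiary id to predicted probability and a fixed capacity; the ids, scores and the `prioritize` helper are hypothetical, and the project may have used a different rule.

```python
def prioritize(scores_by_id, capacity):
    """Return the ids of the `capacity` highest-risk beneficiaries, highest first.

    Ties are broken by id so the result is deterministic.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(scores_by_id.items(), key=lambda item: (-item[1], item[0]))
    return [beneficiary_id for beneficiary_id, _ in ranked[:capacity]]


# Invented ids and predicted probabilities.
scores = {"B001": 0.12, "B002": 0.81, "B003": 0.47, "B004": 0.81, "B005": 0.05}
shortlist = prioritize(scores, capacity=3)
print(shortlist)
```

The shortlist is an input to clinical review in Primary Care, not a referral decision in itself.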

Limitations and caveats

- The target grouped benign and malignant alterations, reducing clinical specificity for isolated malignant cancer.
- The low occurrence of malignancy required a broader breast problem modeling approach.
- Data came from a specific payer and require external validation.
- The model should be periodically recalibrated.
- The operating point should be defined according to care capacity and tolerance for false positives and false negatives.
- The tool does not replace clinical assessment, care guidelines or shared decision-making.

Deliverables

- Demonstration web application
- Logistic Regression model
- Feature matrix with 21 Boolean variables
- Comparison between machine learning algorithms
- ROC curve
- Precision-Recall curve
- Confusion matrix
- Hyperparameter evaluation
- Technical project material
- Technical presentation

Learnings

- Health risk prediction needs to be connected to a clear care pathway.
- In screening, threshold selection is as important as the global metric.
- Real payer datasets require integration of multiple sources and well-defined business rules.
- Grouping benign and malignant events may be necessary when isolated malignancy occurrence is low, but it changes the clinical interpretation of the model.
- The greatest value lies in supporting care coordination, not automating diagnosis.
