This repository contains my solution and experimentation workflow for the SPR 2026 Mammography Report Classification Kaggle competition, focused on predicting the BI-RADS category (0–6) from the text of mammography reports, using only the indication and findings sections of each report.
The project explores transformer-based NLP approaches for clinical text classification, primarily using the pretrained models pucpr/biobertpt-clin and BERTimbau, with the goal of inferring radiological assessment from descriptive mammography findings alone.
Best competition score: 0.83
Kaggle competition: (https://www.kaggle.com/competitions/spr-2026-mammography-report-classification)
The competition was organized by the Radiology Society of São Paulo (SPR - Sociedade Paulista de Radiologia) as part of its Artificial Intelligence Challenge for Breast Cancer. The objective is to develop a model that accurately predicts the BI-RADS category from mammography report text, excluding the impression/conclusion section where the target label is explicitly stated.
This task is clinically relevant because it simulates a decision-support setting in which a model must infer the final assessment from the descriptive content of the report. Potential applications include:
- improving consistency across radiology reports,
- supporting automated quality control,
- providing educational feedback for trainees,
- and enabling large-scale research in breast imaging NLP.
The competition metric is macro F1-score over the seven BI-RADS classes.
BI-RADS (Breast Imaging Reporting and Data System) is a standardized framework developed by the American College of Radiology (ACR) for breast imaging reporting. In this challenge, the target is a 7-class classification problem:
- 0 – Incomplete, needs additional imaging
- 1 – Negative
- 2 – Benign finding
- 3 – Probably benign
- 4 – Suspicious abnormality
- 5 – Highly suggestive of malignancy
- 6 – Known biopsy-proven malignancy
The target label was extracted from the original report’s impression/conclusion section, which was removed from the input text provided to participants.
The dataset consists of de-identified mammography radiology reports collected from multiple Brazilian institutions. Each report typically contains three sections:
- Indication / Reason for Exam
- Findings
- Impression / Conclusion
For this competition, participants are given only the indication and findings sections, while the impression/conclusion is excluded because it contains the ground-truth BI-RADS label.
-
train.csv
Contains labeled reports with:idreporttarget
-
test.csv
Contains unlabeled reports with:idreport
-
submission.csv
Submission file with:idtarget
The competition data is subject to strict privacy and usage restrictions. According to the challenge rules, participants agree not to copy, redistribute, or attempt to re-identify any part of the dataset. For that reason:
- the original competition dataset is not included in this repository,
- no raw or processed competition data is shared here,
- and this repository is intended to provide the code, project structure, and reproducible workflow only, in accordance with the competition’s data access constraints.
To reproduce the experiments, you must obtain access to the dataset directly through the official Kaggle competition page and comply with all competition rules.
This project focused on transformer-based clinical NLP classification, primarily using:
pucpr/biobertpt-clin- BERTimbau
Among the tested backbones, pucpr/biobertpt-clin performed slightly better overall, although the gap versus BERTimbau was not large.
- 5-fold cross-validation
- typically 8 to 12 epochs
- early stopping patience of 3 epochs
- model selection based on macro F1
The goal was to evaluate how well Portuguese biomedical and general-domain transformer models could infer BI-RADS classes from the descriptive sections of mammography reports alone.
Preprocessing was intentionally minimal.
Based on exploratory data analysis, the reports were already well-structured and consistently organized, so no heavy text normalization pipeline was necessary. The main preprocessing step focused on leakage cleaning, ensuring that no direct target information from the impression/conclusion section remained in the modeling input.
This design choice helped preserve the original clinical language while minimizing the risk of artificial performance inflation.
- Best competition score: 0.83
- Validation strategy: 5-fold cross-validation
- Best-performing backbone:
pucpr/biobertpt-clin - Alternative backbone explored: BERTimbau
- Core metric: macro F1
This project showed that pretrained transformer models can perform strongly on radiology report classification with relatively limited preprocessing, provided that leakage is handled carefully and validation is done properly.
This repository is being organized progressively into a more modular format. Initially, the experimentation workflow is centered around two notebooks:
01_eda.ipynb– exploratory data analysis02_modeling.ipynb– feature engineering, modeling, validation, ensembling, and inference
birads-mammography-report-classification/
├── README.md
├── requirements.txt
├── .gitignore
├── configs/
│ └── config.yaml
├── notebooks/
│ ├── 01_eda.ipynb
│ └── 02_modeling.ipynb
├── src/
│ ├── __init__.py
│ ├── preprocess.py
│ ├── features.py
│ ├── modeling.py
│ ├── inference.py
│ └── utils.py
├── results/
│ ├── figures/
│ ├── metrics/
│ └── submissions/
├── images/
│ └── README_assets/
└── data/
Because the dataset cannot be redistributed, this repository is designed to be reproducible once the user has access to the official competition data.
Join the Kaggle competition and accept its rules. Download the competition data from Kaggle. Place the files locally in the expected directory structure. Run the notebooks and/or modularized scripts. Generate predictions and submission files.
More detailed setup instructions can be added once the final project structure is completed.
This competition was organized by:
Sociedade Paulista de Radiologia Douglas Racy Gustavo Corradi Eduardo Farina Vanderlei Silva Aline Sessino Giuliano Giovanetti Almir Bittencourt Felipe Kitamura Lilian Mallagoli Sheila Costa Donating Institutions AC Camargo Almir Bittencourt Soraia Damião Hapvida Alexandre Bialowas Eduardo Caminha Unifesp Eduardo Farina Nitamar Abdala Citation
If you use or reference the competition, please cite:
Eduardo Farina and Felipe Kitamura, MD, PhD. SPR 2026 Mammography Report Classification. Kaggle, 2026. (https://www.kaggle.com/competitions/spr-2026-mammography-report-classification)
This repository is an independent portfolio project built around a public Kaggle competition. It is intended for educational, research, and reproducibility purposes only. All dataset rights, competition rules, and privacy restrictions remain with the official organizers and hosting platform. ├── sample/ └── metadata/