A comprehensive comparative study of BART and T5 transformer models for abstractive text summarization on the CNN/Daily Mail dataset, exploring pre-trained model performance and advanced fine-tuning techniques.
This project investigates three key aspects of text summarization using Large Language Models (LLMs):
- How do pre-trained models initially perform on the CNN/Daily Mail dataset?
- What techniques can be used to effectively fine-tune LLMs on a relatively small dataset?
- How do different models react to the training process, and what are their performances after fine-tuning?
Six different variations of BART and T5 models were tested and fine-tuned:
| Model | Layers (per encoder/decoder stack) | Attention Heads | Hidden Size | Parameters | Vocabulary Size |
|---|---|---|---|---|---|
| BART-BASE | 6 | 12 | 768 | 139M | 50,265 |
| BART-LARGE | 12 | 16 | 1024 | 406M | 50,265 |
| BART-LARGE-CNN | 12 | 16 | 1024 | 406M | 50,265 |
| T5-SMALL | 6 | 8 | 512 | 60M | 32,128 |
| T5-BASE | 12 | 12 | 768 | 220M | 32,128 |
| T5-LARGE | 24 | 16 | 1024 | 770M | 32,128 |
- BART (Bidirectional and Auto-Regressive Transformers): A sequence-to-sequence model with a bidirectional encoder (BERT-inspired) and an autoregressive decoder (GPT-inspired)
- T5 (Text-To-Text Transfer Transformer): A versatile transformer-based model that frames all NLP tasks as text-to-text problems
CNN/Daily Mail Dataset (v3.0.0)
- One of the most popular datasets for text summarization tasks
- Contains 311,971 long articles from CNN and Daily Mail
- Each article includes a human-written summary
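A minimal sketch of loading the dataset with the Hugging Face `datasets` library follows. The dataset identifier `cnn_dailymail` is an assumption (recent Hub versions also list it as `abisee/cnn_dailymail`), and the helper names are hypothetical; the notebooks may load the data differently.

```python
# Published split sizes for CNN/Daily Mail v3.0.0.
EXPECTED_SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}


def total_articles():
    """Total number of article-summary pairs across all splits."""
    return sum(EXPECTED_SPLIT_SIZES.values())


def load_cnn_dailymail(name="cnn_dailymail", version="3.0.0"):
    """Load the dataset and verify split sizes (downloads data on first use).

    `name` is an assumed Hub identifier; adjust it if your `datasets`
    version requires the namespaced form.
    """
    # Imported lazily so the constants above work without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(name, version)
    for split, expected in EXPECTED_SPLIT_SIZES.items():
        actual = dataset[split].num_rows
        if actual != expected:
            raise ValueError(f"{split}: expected {expected} rows, found {actual}")
    return dataset
```

Each record exposes an `article` field, a `highlights` field holding the reference summary, and an `id`.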
├── 1_code_data_visualization.ipynb # Data exploration and visualization
├── 2_code_text_summarization.ipynb # Model training and evaluation
├── 3_code_experiment_analysis.ipynb # Results analysis and comparison
└── README.md
- Load and preprocess CNN/Daily Mail dataset
- Tokenization with BART and T5 tokenizers
- Comprehensive data visualization:
  - Word clouds of most common tokens
  - Top 20 common tokens analysis
  - Distribution of article lengths
  - Distribution of summary lengths
  - Article-summary length correlation analysis
- Qualitative analysis of sample articles and summaries
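The token-frequency and length-correlation analyses listed above can be sketched in plain Python. This is an illustration only: it assumes lowercase whitespace tokenization as a stand-in for the BART and T5 tokenizers, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase whitespace tokenization (a stand-in for the model tokenizers)."""
    return text.lower().split()


def top_tokens(texts, n=20):
    """Return the n most frequent tokens across all texts with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


def length_correlation(articles, summaries):
    """Correlation between article length and summary length, in tokens."""
    article_lengths = [len(tokenize(a)) for a in articles]
    summary_lengths = [len(tokenize(s)) for s in summaries]
    return pearson(article_lengths, summary_lengths)
```

The same `article_lengths` and `summary_lengths` lists could feed a histogram for the length-distribution plots.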
- Experiment Design: 6-stage pipeline from data preparation to inference
- Pre-training Evaluation: Baseline performance metrics of the released checkpoints before any fine-tuning
- Model Fine-tuning:
  - Hyperparameter tuning
  - Advanced fine-tuning techniques (layer freezing, discriminative fine-tuning)
  - Regularization methods
- Performance Evaluation: ROUGE score computation
- Pipeline Building: Create production-ready summarization pipeline
- Real-world Testing: Evaluate on latest news articles
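As a rough illustration of the inference stage, the sketch below generates a summary with the `transformers` library. It is not the project's pipeline: the function names, the beam size, and the length limits are assumptions chosen for the example. One relevant detail is that T5 checkpoints expect a `summarize: ` task prefix, while BART checkpoints take the raw article.

```python
def prepare_input(checkpoint, article):
    """Add the T5 task prefix; BART checkpoints receive the article unchanged."""
    if "t5" in checkpoint.lower():
        return "summarize: " + article
    return article


def summarize(checkpoint, article, max_input_tokens=512, max_new_tokens=128):
    """Generate a summary with beam search (downloads the checkpoint on first use)."""
    # Imported lazily so prepare_input works without `transformers` installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(
        prepare_input(checkpoint, article),
        return_tensors="pt",
        truncation=True,  # long articles are cut to max_input_tokens
        max_length=max_input_tokens,
    )
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `summarize("facebook/bart-large-cnn", article_text)` would then return a single summary string.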
- Comparison of baseline (pre-fine-tuning) evaluation results
- Training and validation metrics visualization
- Model training time analysis
- Performance evaluation across all models
- Improvement rates calculation
- Ablation studies on fine-tuning techniques
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Mean Length |
|---|---|---|---|---|---|
| BART-BASE | 30.41 | 11.96 | 18.92 | 25.04 | 117.0 |
| BART-LARGE | 28.68 | 9.98 | 18.64 | 24.40 | 117.1 |
| BART-LARGE-CNN | 40.37 | 16.99 | 28.98 | 34.36 | 50.2 |
| T5-SMALL | 28.83 | 8.67 | 20.12 | 24.28 | 45.1 |
| T5-BASE | 35.60 | 14.49 | 26.49 | 30.99 | 42.7 |
| T5-LARGE | 31.59 | 11.04 | 22.23 | 26.75 | 31.5 |
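The analysis notebook reports improvement rates. One common definition, assumed here rather than taken from the notebook, is the relative change between two scores. The sketch applies it to two ROUGE-1 values from the table above to quantify the gap between BART-LARGE and BART-LARGE-CNN.

```python
def relative_improvement(baseline, improved):
    """Relative change from `baseline` to `improved`, in percent."""
    return (improved - baseline) / baseline * 100.0


# ROUGE-1 values copied from the results table.
BART_LARGE_ROUGE1 = 28.68
BART_LARGE_CNN_ROUGE1 = 40.37

gain = relative_improvement(BART_LARGE_ROUGE1, BART_LARGE_CNN_ROUGE1)
print(f"BART-LARGE-CNN vs. BART-LARGE (ROUGE-1): {gain:.1f}%")
```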
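- T5 vs BART: At comparable size, T5 models generally outperform the general-purpose BART checkpoints (for example, T5-BASE reaches 35.60 ROUGE-1 against 30.41 for BART-BASE), likely helped by their versatile text-to-text formulation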
- Pre-training Matters: BART-LARGE-CNN shows the best performance because the released checkpoint was already fine-tuned on the CNN/Daily Mail dataset
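- Model Size: Larger models generally achieve better results due to increased capacity and expressive ability, although this does not hold everywhere: in the table above, T5-BASE outscores T5-LARGE and BART-BASE outscores BART-LARGE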
- Generation Length: Output length significantly impacts performance, particularly evident in T5-LARGE
- Layer Freezing: Selectively freeze lower layers to reduce overfitting
- Discriminative Fine-tuning: Apply different learning rates to different layers
- Hyperparameter Tuning: Optimize batch size, learning rate, and training epochs
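Layer freezing and discriminative fine-tuning can both be driven by parameter names. The sketch below is one possible approach, not the project's implementation: it assumes Hugging Face naming, where BART parameters contain `.layers.<i>.` and T5 parameters contain `.block.<i>.`, and the decay factor and base learning rate are illustrative values.

```python
import re

# Matches the layer index in names such as
# "model.encoder.layers.3.self_attn.k_proj.weight" (BART) or
# "encoder.block.3.layer.0.SelfAttention.q.weight" (T5).
_LAYER_INDEX = re.compile(r"\.(?:layers|block)\.(\d+)\.")


def layer_index(param_name):
    """Return the transformer layer index in a parameter name, or None."""
    match = _LAYER_INDEX.search(param_name)
    return int(match.group(1)) if match else None


def should_freeze(param_name, num_frozen_layers):
    """Freeze parameters belonging to the lowest `num_frozen_layers` layers."""
    index = layer_index(param_name)
    return index is not None and index < num_frozen_layers


def layer_learning_rate(param_name, num_layers, base_lr=5e-5, decay=0.9):
    """Top layer trains at `base_lr`; each layer below is scaled by `decay`."""
    index = layer_index(param_name)
    if index is None:
        # Embeddings, final norms, and the LM head keep the base rate here.
        return base_lr
    return base_lr * decay ** (num_layers - 1 - index)


def build_param_groups(named_parameters, num_layers, num_frozen_layers=0,
                       base_lr=5e-5, decay=0.9):
    """Freeze the lower layers and return optimizer groups for the rest."""
    groups = []
    for name, param in named_parameters:
        if should_freeze(name, num_frozen_layers):
            param.requires_grad = False
            continue
        lr = layer_learning_rate(name, num_layers, base_lr, decay)
        groups.append({"params": [param], "lr": lr})
    return groups
```

The resulting list can be passed to a PyTorch optimizer, for example `torch.optim.AdamW(build_param_groups(model.named_parameters(), num_layers=6, num_frozen_layers=2))`. Encoder and decoder layers with the same index are treated alike in this sketch.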
torch
datasets
transformers
rouge-score
accelerate
numpy
pandas
matplotlib
seaborn
wordcloud
nltk
pip install torch datasets transformers rouge-score accelerate
pip install numpy pandas matplotlib seaborn wordcloud nltk
jupyter notebook 1_code_data_visualization.ipynb
jupyter notebook 2_code_text_summarization.ipynb
jupyter notebook 3_code_experiment_analysis.ipynb
The project uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics:
- ROUGE-1: Unigram overlap between generated and reference summaries
- ROUGE-2: Bigram overlap between generated and reference summaries
- ROUGE-L: Longest common subsequence between generated and reference summaries
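- ROUGE-Lsum: ROUGE-L computed at the summary level, where the texts are split into sentences on newlines and the longest-common-subsequence matches are aggregated over the whole summary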
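To make the definitions above concrete, here is a from-scratch sketch of ROUGE-N and ROUGE-L F1 scores. It is for illustration only: the project uses the `rouge-score` package, which additionally offers stemming and its own tokenizer, so its numbers can differ from this simplified whitespace-based version.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0:
        return 0.0  # also covers empty candidates or references
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For the pair "the cat sat on the mat" and "the cat lay on the mat", five of six unigrams and three of five bigrams match, and the longest common subsequence has length five.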