A large-scale, balanced emotion classification dataset constructed by unifying 10 publicly available emotion corpora spanning diverse text domains, including social media, clinical cancer forums, news headlines, fan fiction, and daily dialogues. It is designed to support robust emotion recognition across heterogeneous text domains when training and evaluating large language models (LLMs).
- Overview
- Dataset Statistics
- Source Datasets
- Processing Pipeline
- Emotion Taxonomy
- File Structure
- Quick Start
- Data Schema
- Techniques Used
- Analysis Results
- Citation
- License
- Acknowledgements
Emotion recognition is a core task in affective computing and natural language understanding. However, individual emotion datasets are often small, domain-specific, or use incompatible label sets, making it difficult to train general-purpose emotion classifiers.
This repository unifies 10 public datasets under a common 13-class emotion taxonomy (based on Plutchik's wheel of emotions extended with moral emotions). Key challenges addressed:
| Challenge | Solution Applied |
|---|---|
| Inconsistent label names across datasets | Manual label mapping to unified taxonomy |
| Severe class imbalance (up to 78× ratio) | Semantic clustering + synonym augmentation balancing |
| Cross-dataset duplicates | Text-level deduplication within and across sources |
| Noisy social media text | Tweet cleaning: URL/mention removal, emoji preservation |
| Domain shift | Multi-domain sources intentionally retained for generalisability |
| Property | Value |
|---|---|
| Total samples | 70,757 |
| Emotion classes | 13 |
| Samples per class | ~5,000 |
| Average text length | ~98 characters / ~19 words |
| Vocabulary size | 132,629 unique tokens |
| Average lexical diversity | 0.916 |
| Average Flesch readability | 70.0 (standard) |
| Missing texts | 0 |
| Cross-source duplicates | removed |
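This README does not state how the length and diversity figures were computed. The sketch below shows one plausible way to obtain them, assuming lexical diversity is the per-text type-token ratio (unique tokens divided by total tokens) averaged over all texts, and that tokens are lowercased, whitespace-separated words. The function name `text_stats` and the tokenisation are illustrative assumptions, not the analysis notebook's code.

```python
from statistics import mean

def text_stats(texts):
    """Return simple length and vocabulary statistics for a list of strings.

    Assumption: tokens are lowercased, whitespace-separated words; the
    analysis notebook may tokenise differently.
    """
    tokenised = [t.lower().split() for t in texts]
    vocab = {tok for toks in tokenised for tok in toks}
    return {
        "avg_chars": mean(len(t) for t in texts),
        "avg_words": mean(len(toks) for toks in tokenised),
        "vocab_size": len(vocab),
        # Per-text type-token ratio, averaged; empty texts are skipped.
        "avg_lexical_diversity": mean(
            len(set(toks)) / len(toks) for toks in tokenised if toks
        ),
    }

stats = text_stats(["i feel so happy today", "so so tired"])
print(stats)
```

Applied to the `text` column of the final CSV, the same function would give figures comparable to the table only if the tokenisation assumptions match the ones used in the analysis.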
| Emotion | Count | % |
|---|---|---|
| anger | 5,000 | 7.7 % |
| anticipation | 5,000 | 7.7 % |
| disgust | 5,000 | 7.7 % |
| fear | 5,000 | 7.7 % |
| guilt | 5,000 | 7.7 % |
| joy | 5,000 | 7.7 % |
| love | 5,000 | 7.7 % |
| optimism | 5,003 | 7.7 % |
| pessimism | 5,000 | 7.7 % |
| sadness | 5,000 | 7.7 % |
| shame | 5,001 | 7.7 % |
| surprise | 5,000 | 7.7 % |
| trust | 5,000 | 7.7 % |
| Total | 70,757 | 100% |
The combined, unbalanced dataset (unified_emotions_raw_merged.csv) merges the following 10 sources:
| # | Dataset | Domain | Original Labels | Samples (processed) | License | Reference |
|---|---|---|---|---|---|---|
| 1 | CARER / dair-ai Emotion | Social media posts | 6 (joy, sadness, anger, fear, love, surprise) | 416,809 | MIT | Saravia et al., 2018 |
| 2 | SemEval-2018 Task 1-E-c | Twitter | 11 emotions | 10,936 | CC-BY | Mohammad et al., 2018 |
| 3 | WASSA-2017 | Twitter | anger, fear, joy, sadness | 6,861 | Research use | Mohammad & Bravo-Marquez, 2017 |
| 4 | CancerEMO | Online cancer forums | 8 Plutchik emotions | 9,960 | CC-BY | Wang et al., 2021 |
| 5 | ISEAR | Self-report questionnaire | joy, fear, anger, sadness, disgust, shame, guilt | 7,666 | Research use | Scherer & Wallbott, 1994 |
| 6 | DailyDialog | Multi-turn conversations | happiness, surprise, sadness, anger, disgust, fear | 13,429 | CC-BY-NC-SA | Li et al., 2017 |
| 7 | Emotion Stimulus | News / fiction | happy, anger, fear, sad, shame, surprise, disgust | 820 | Research use | Ghazi et al., 2015 |
| 8 | FanFic | Fan fiction stories | 8 Plutchik emotions | 1,742 | CC-BY | Bhatt et al., 2019 |
| 9 | SSEC | Social media (US elections) | 8 Plutchik + No Emotion | 2,905 | CC-BY-NC-SA | Schuff et al., 2017 |
| 10 | TEC (Twitter Emotion Corpus) | Twitter (hashtag-based) | joy, sadness, surprise, fear, anger, disgust | 20,522 | Research use | Mohammad, 2012 |
Note on GoEmotions: GoEmotions was processed separately in the code with a 27→4 emotion mapping. It is not included in the current version of the final balanced dataset (unified_emotions_v1.0.0.csv) or the combined dataset (unified_emotions_raw_merged.csv). However, the preprocessing code is retained in the notebooks, as its full integration and mapping into the 13-class taxonomy is a primary goal for the next major release (v2.0) of this dataset.
Raw Source Datasets (10 datasets, varied formats)
│
▼
┌───────────────────────────────────────────────────┐
│ Notebook 01: Per-Dataset Preprocessing │
│ - Load (CSV / TXT / TSV / XML / JSON) │
│ - Tweet cleaning (remove @mentions, URLs) │
│ - Emoji preservation │
│ - Duplicate removal (text-level) │
│ - Label normalisation to unified taxonomy │
│ - Save → Prep_files/<dataset>.csv │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Notebook 02: Dataset Union │
│ - Load all Prep_files/*.csv │
│ - Concatenate into single DataFrame │
│ - Add sentiment polarity column │
│ - Save → unified_emotions_raw_merged.csv (~499K rows, 13 classes) │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Notebook 03: Dataset Balancing │
│ - For majority classes: KMeans semantic │
│ clustering + representative sampling │
│ - For minority classes: Synonym augmentation │
│ (WordNet) + paraphrase generation │
│ - Quality filtering (sentence embeddings) │
│ - Target: 5,000 samples per class │
│ - Save → unified_emotions_v1.0.0.csv (70,757 rows) │
└───────────────────────────────────────────────────┘
The 13 unified emotion classes with their mappings from source label names:
| Unified Label | Maps From (source labels) | Plutchik Category |
|---|---|---|
| anger | anger, angry, Anger | Primary |
| anticipation | anticipation, Anticipation | Primary |
| disgust | disgust, Disgust | Primary |
| fear | fear, Fear, scared | Primary |
| joy | joy, happiness, happy, Happy, HAPPY | Primary |
| surprise | surprise, Surprise, surprised | Primary |
| sadness | sadness, sad, Sadness | Primary |
| trust | trust, Trust | Primary |
| love | love | Secondary (joy + trust) |
| optimism | optimism | Secondary (anticipation + joy) |
| pessimism | pessimism | Secondary (anticipation + disgust) |
| guilt | guilt | Moral emotion |
| shame | shame, Shame | Moral emotion |
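The mapping in the table above can be expressed as a plain lookup. The following is a minimal sketch, not the code from the notebooks: it assumes that case variants (for example `Happy` and `HAPPY`) can be handled by lowercasing before the lookup, and that labels with no entry (such as SSEC's "No Emotion") yield `None`. The names `LABEL_MAP` and `normalise_label` are hypothetical.

```python
# Lowercased source label -> unified label, taken from the taxonomy table.
LABEL_MAP = {
    "anger": "anger", "angry": "anger",
    "anticipation": "anticipation",
    "disgust": "disgust",
    "fear": "fear", "scared": "fear",
    "joy": "joy", "happiness": "joy", "happy": "joy",
    "surprise": "surprise", "surprised": "surprise",
    "sadness": "sadness", "sad": "sadness",
    "trust": "trust",
    "love": "love",
    "optimism": "optimism",
    "pessimism": "pessimism",
    "guilt": "guilt",
    "shame": "shame",
}

def normalise_label(raw_label):
    """Map a source label to the unified taxonomy, or None if unmapped."""
    return LABEL_MAP.get(raw_label.strip().lower())

print(normalise_label("HAPPY"))       # joy
print(normalise_label("No Emotion"))  # None
```

How the notebooks treat unmapped labels is not stated here; returning `None` simply makes such rows easy to inspect or drop.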
Note on source datasets: Raw source data is not included in this repository (to respect original licenses and avoid redistribution). See DATA_SOURCES.md for download links for all 12 sources.
Data_Prep/
├── README.md ← This file
├── DATA_SOURCES.md ← Download links for all 12 source datasets
├── LICENSE ← CC-BY-4.0
├── CITATION.cff ← Machine-readable citation
├── .gitignore
├── .gitattributes ← Git LFS for unified_emotions_v1.0.0.csv
│
├── unified_emotions_v1.0.0.csv ← ✅ FINAL balanced dataset (70,757 rows, Git LFS)
│
├── notebooks/
│ ├── 01_data_preprocessing.ipynb ← Per-dataset loading, cleaning, label mapping
│ ├── 02_data_union.ipynb ← Merging all sources → unified_emotions_raw_merged.csv
│ ├── 03_dataset_balancing.ipynb ← KMeans + augmentation → unified_emotions_v1.0.0.csv
│ └── 04_data_analysis.ipynb ← Quality analysis and visualisations
│
├── src/
│ └── dataset_balancer.py ← Standalone DatasetBalancer class (CLI-ready)
│
├── Analysis_Results/ ← Per-dataset analysis charts & reports
│ ├── summary_report.txt
│ ├── *_label_distribution.png
│ ├── *_text_length.png
│ └── *_samples.txt
│
├── llm_dataset_analysis/ ← Final dataset analysis
│ ├── analysis_report.txt
│ ├── class_distribution.png
│ ├── lexical_diversity.png
│ ├── readability_analysis.png
│ ├── text_length_analysis.png
│ └── wordcloud_*.png
│
└── docs/
    ├── preprocessing_guide.md ← Step-by-step reproduction guide
    └── dataset_cards/ ← Per-source dataset documentation
        ├── carer.md
        ├── semeval2018.md
        ├── wassa2017.md
        ├── cancer_emo.md
        ├── isear.md
        ├── daily_dialogue.md
        ├── emotion_stimulus.md
        ├── fanfic.md
        ├── ssec.md
        ├── tec.md
        └── go_emotions.md
pip install pandas numpy scikit-learn sentence-transformers transformers torch nltk emoji tqdm

import pandas as pd
# Load final balanced dataset
df = pd.read_csv('unified_emotions_v1.0.0.csv')
print(df.head())
# Output:
# text emotion sentiment cluster
# 0 i am swimming weekly which feels amazing but... joy positive 3.0
# 1 i feel excited for it joy positive NaN
# ...
print(df['emotion'].value_counts())
# anger 5000
# anticipation 5000
# disgust 5000
# ...
print(f"Dataset shape: {df.shape}")
# Dataset shape: (70757, 4)

# Load the combined, unbalanced dataset (requires Git LFS or manual download)
df_combined = pd.read_csv('unified_emotions_raw_merged.csv')
print(df_combined['emotion'].value_counts())
# joy 167586
# sadness 131834
# anger 64755
# ...

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('unified_emotions_v1.0.0.csv')
# Encode labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['emotion'])
# Split
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['label'],
test_size=0.2, random_state=42, stratify=df['label']
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 56605, Test: 14152

| Column | Type | Description |
|---|---|---|
| text | str | The input text sample |
| emotion | str | One of 13 emotion classes (see taxonomy) |
| sentiment | str | Coarse sentiment polarity: positive, negative, or neutral |
| cluster | float | KMeans cluster ID assigned during balancing (NaN for augmented samples) |
| Column | Type | Description |
|---|---|---|
| text | str | The input text sample |
| emotion | str | Normalised emotion label |
| sentiment | str | Coarse sentiment polarity |
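A quick check that a loaded file matches these schemas can catch a wrong file or a truncated download early. The sketch below is illustrative only: `check_schema`, `FINAL_COLUMNS`, and `MERGED_COLUMNS` are hypothetical names, and the rule that `text` and `emotion` must be non-null is an assumption based on the "Missing texts: 0" statistic.

```python
import pandas as pd

FINAL_COLUMNS = ["text", "emotion", "sentiment", "cluster"]
MERGED_COLUMNS = ["text", "emotion", "sentiment"]
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def check_schema(df, expected_columns):
    """Return a list of problems; an empty list means the frame matches."""
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for col in ("text", "emotion"):
        if df[col].isna().any():
            problems.append(f"null values in '{col}'")
    # Sentiment should be one of the three coarse polarity values.
    unexpected = set(df["sentiment"].dropna()) - ALLOWED_SENTIMENTS
    if unexpected:
        problems.append(f"unexpected sentiment values: {sorted(unexpected)}")
    return problems

sample = pd.DataFrame({
    "text": ["i feel excited for it"],
    "emotion": ["joy"],
    "sentiment": ["positive"],
    "cluster": [float("nan")],  # NaN is allowed in the cluster column
})
print(check_schema(sample, FINAL_COLUMNS))  # []
```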
Most files use text, label schema. Exceptions:
- ISEAR.csv: adds ID column
- wassa2017.csv: adds intensity (0–1 float, continuous emotion intensity score)
Applied to Twitter-sourced datasets (TEC, WASSA-2017, SemEval-2018):
- Remove @mentions and http://... URLs using regex
- Tokenise with NLTK's TweetTokenizer (handles hashtags, contractions)
- Preserve emojis (extracted and appended back after cleaning)
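The cleaning steps above can be sketched with the standard library alone. This is an approximation rather than the notebook's implementation: the regular expressions are assumptions, the emoji character ranges are deliberately rough, and the `TweetTokenizer` step is left out so the example has no external dependencies. The name `clean_tweet` is hypothetical.

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges (pictographs, emoticons, dingbats); an approximation.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text):
    """Strip mentions and URLs, then append the original emojis at the end."""
    emojis = EMOJI_RE.findall(text)      # extract emojis before cleaning
    text = EMOJI_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = " ".join(text.split())        # collapse leftover whitespace
    if emojis:
        text = f"{text} {''.join(emojis)}".strip()
    return text

print(clean_tweet("@sam this is great 😀 http://t.co/abc #happy"))
# this is great #happy 😀
```

Hashtags are kept on purpose, since the list above says the tokeniser handles them rather than removing them.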
Each source dataset uses different naming conventions. Labels were manually mapped to the 13-class unified taxonomy (e.g., happiness → joy, sad → sadness, Anger → anger).
- Within each dataset: exact text-level deduplication (keep first occurrence)
- Cross-dataset: merged file retains cross-source duplicates for transparent tracking; users can apply further deduplication if needed
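Both deduplication levels can be illustrated with pandas. This is a sketch under stated assumptions: the `source` column and the `cross_source_dup` flag are hypothetical (the documented merged schema has only `text`, `emotion`, and `sentiment`), and the function names are made up for illustration.

```python
import pandas as pd

def dedupe_within(df):
    """Drop exact text duplicates inside one source, keeping the first row."""
    return df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

def flag_cross_source(merged):
    """Mark texts that occur in more than one source; no rows are removed."""
    n_sources = merged.groupby("text")["source"].transform("nunique")
    return merged.assign(cross_source_dup=n_sources > 1)

a = pd.DataFrame({"text": ["i am sad", "i am sad", "so happy"], "source": "A"})
b = pd.DataFrame({"text": ["so happy", "what a day"], "source": "B"})

merged = pd.concat([dedupe_within(a), dedupe_within(b)], ignore_index=True)
flagged = flag_cross_source(merged)
print(flagged)
```

Users who want a strictly unique merged file could follow this with `flagged.drop_duplicates(subset="text")`, at the cost of losing the record of which sources shared a text.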
For majority classes (e.g., joy: 167K → 5K):
- TF-IDF vectorisation of text
- KMeans clustering to identify semantic sub-groups
- Representative sampling within each cluster (preserves diversity)
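The three steps for majority classes can be sketched with scikit-learn as follows. The details are assumptions: this README does not say how many clusters are used or how representatives are chosen, so the sketch gives each cluster a quota proportional to its size and prefers the texts closest to the cluster centroid. `downsample_by_cluster` and its parameters are hypothetical, and the actual logic lives in `src/dataset_balancer.py`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def downsample_by_cluster(texts, target, n_clusters=3, seed=42):
    """Pick at most `target` texts, spread across KMeans clusters of TF-IDF vectors."""
    vectors = TfidfVectorizer().fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(vectors)
    dists = km.transform(vectors)  # distance of every text to every centroid
    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Quota proportional to cluster size, at least one per cluster.
        quota = max(1, round(target * len(members) / len(texts)))
        # Assumption: "representative" means closest to the centroid.
        closest = members[np.argsort(dists[members, c])][:quota]
        chosen.extend(closest.tolist())
    chosen = sorted(chosen)[:target]  # rounding can overshoot; trim to target
    return [texts[i] for i in chosen], km.labels_[chosen].tolist()

texts = [
    "i feel happy about the trip", "so happy with the new job",
    "happy and grateful this morning", "the concert made me feel joyful",
    "joyful news from my sister", "what a joyful little surprise party",
    "i love sunny weekends at home", "sunny days always lift my mood",
]
kept, clusters = downsample_by_cluster(texts, target=4)
print(kept, clusters)
```

The returned cluster IDs correspond to the kind of value stored in the `cluster` column of the final file.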
For minority classes (e.g., pessimism: 375 → 5K):
- Synonym augmentation via WordNet (random word replacement)
- Back-translation paraphrasing using a Seq2Seq transformer model
- Quality filtering using cosine similarity of sentence embeddings (all-MiniLM-L6-v2) to ensure augmented samples remain semantically close to originals
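The quality filter in the last step reduces to a row-wise cosine similarity between each original and its augmented variant. The sketch below uses toy 2-D vectors in place of real sentence embeddings so that it runs without downloading a model. In practice the vectors would presumably come from `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`. The threshold `min_sim=0.8` and the name `filter_by_similarity` are illustrative assumptions, not values taken from the notebooks.

```python
import numpy as np

def filter_by_similarity(orig_vecs, aug_vecs, min_sim=0.8):
    """Return a boolean mask: True where an augmented vector stays close to its original."""
    orig = np.asarray(orig_vecs, dtype=float)
    aug = np.asarray(aug_vecs, dtype=float)
    # Row-wise cosine similarity between each original and its augmentation.
    dots = np.sum(orig * aug, axis=1)
    norms = np.linalg.norm(orig, axis=1) * np.linalg.norm(aug, axis=1)
    sims = dots / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return sims >= min_sim

# Toy vectors standing in for sentence embeddings.
originals = [[1.0, 0.0], [1.0, 0.0]]
augmented = [[0.9, 0.1], [0.0, 1.0]]
keep = filter_by_similarity(originals, augmented)
print(keep)  # [ True False]
```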
Per-dataset analysis charts are in Analysis_Results/. Final dataset analysis in llm_dataset_analysis/.
| Dataset | Rows | Classes | Avg Text Length | Class Imbalance Ratio | Duplicates |
|---|---|---|---|---|---|
| CARER | 416,809 | 6 | 97 chars | 9.4× | 22,987 |
| TEC Tweets | 20,522 | 6 | 82 chars | 10.7× | 0 |
| DailyDialog | 13,429 | 6 | 59 chars | 78.4× | 662 |
| ISEAR | 7,666 | 7 | 115 chars | 1.0× (balanced) | 163 |
| SemEval-2018 | 7,869 | 11 | 96 chars | 9.4× | 4,685 |
| WASSA-2017 | 6,861 | 4 | 89 chars | 1.7× | 0 |
| CancerEMO | 9,960 | 8 | 83 chars | 13.9× | 1,494 |
| SSEC | 2,905 | 9 | 80 chars | 28.9× | 16 |
| FanFic | 1,742 | 8 | 943 chars | 4.3× | 120 |
| Emotion Stimulus | 820 | 7 | 113 chars | 5.6× | 0 |
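The "Class Imbalance Ratio" column is not defined explicitly in this README. A common definition, and one consistent with ISEAR being reported as 1.0× (balanced), is the size of the largest class divided by the size of the smallest. The sketch below assumes that definition; `imbalance_ratio` is a hypothetical helper.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Largest class count divided by smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

labels = ["joy"] * 6 + ["anger"] * 3 + ["fear"] * 2
print(f"{imbalance_ratio(labels):.1f}×")  # 3.0×
```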
If you use this dataset in your research, please cite:
@dataset{jamil2025unifiedemotion,
author = {Jamil, Luqman},
title = {Unified Emotion Dataset for LLM Training: A Multi-Source Benchmark with 13 Emotion Classes},
year = {2025},
version = {1.0.0},
license = {CC-BY-4.0},
url = {https://github.com/luqmanjamilch/Unified_Emotions-Dataset}
}

Please also cite the individual source datasets you use. Full BibTeX entries for all sources:
@inproceedings{saravia2018carer,
title = {{CARER}: Contextualized Affect Representations for Emotion Recognition},
author = {Saravia, Elvis and Liu, Hsien-Chi Toby and Huang, Yen-Hao and Wu, Junlin and Chen, Yi-Shin},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
pages = {3687--3697},
year = {2018},
publisher = {Association for Computational Linguistics}
}
@inproceedings{mohammad2018semeval,
title = {{SemEval}-2018 Task 1: Affect in Tweets},
author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
booktitle = {Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)},
pages = {1--17},
year = {2018}
}
@inproceedings{mohammad2017wassa,
title = {{WASSA}-2017 Shared Task on Emotion Intensity},
author = {Mohammad, Saif M. and Bravo-Marquez, Felipe},
booktitle = {Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
pages = {34--49},
year = {2017}
}
@inproceedings{wang2021canceremo,
title = {{CancerEMO}: A Dataset for Fine-Grained Emotion Detection in Cancer-Related Online Posts},
author = {Wang, Tao and Wan, Xiaojun and Jin, Hua},
booktitle = {Proceedings of EMNLP 2021},
year = {2021}
}
@article{scherer1994isear,
title = {Evidence for universality and cultural variation of differential emotion response patterning},
author = {Scherer, Klaus R. and Wallbott, Harald G.},
journal = {Journal of Personality and Social Psychology},
volume = {66},
number = {2},
pages = {310--328},
year = {1994}
}
@inproceedings{li2017dailydialog,
title = {{DailyDialog}: A Manually Labelled Multi-turn Dialogue Dataset},
author = {Li, Yanran and Su, Hui and Shen, Xiaoyu and Li, Wenjie and Cao, Ziqiang and Niu, Shuzi},
booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017)},
pages = {986--995},
year = {2017}
}
@article{ghazi2015stimuli,
title = {Detecting Emotions in Text: Integrating Cognitive Theories of Emotion with Neural Network Models of Language},
author = {Ghazi, Diman and Inkpen, Diana and Szpakowicz, Stan},
journal = {PloS ONE},
year = {2015}
}
@inproceedings{bhatt2019fanfic,
title = {Automatic Identification of Character Types from Film Dialogs},
author = {Bhatt, Hardik and Bhatt, Brijesh and Pimpale, Ankit},
booktitle = {Proceedings of the Workshop on Narrative Understanding},
year = {2019}
}
@inproceedings{schuff2017ssec,
title = {{SSEC}: A Human-Annotated Corpus for Fine-Grained Emotion Classification},
author = {Schuff, Hendrik and Barnes, Jeremy and Mohme, Julian and Pad{\'o}, Sebastian and Klinger, Roman},
booktitle = {Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
pages = {74--80},
year = {2017}
}
@inproceedings{mohammad2012tec,
title = {\#Emotional Tweets},
author = {Mohammad, Saif M.},
booktitle = {Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM 2012)},
pages = {246--255},
year = {2012}
}
@inproceedings{demszky2020goemotions,
title = {{GoEmotions}: A Dataset of Fine-Grained Emotions},
author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
pages = {4040--4054},
year = {2020}
}

This repository (the unified dataset and processing code) is licensed under CC BY 4.0.
Important: The individual source datasets retain their original licenses. Some (e.g., DailyDialog, SSEC) are licensed under CC-BY-NC-SA and may not be used for commercial purposes. Please review docs/dataset_cards/ for the specific license of each source before use.
This dataset was compiled as part of PhD research. We thank the authors of all source datasets for making their work publicly available. Special thanks to the organisers of the SemEval 2018 and WASSA 2017 shared tasks.
Last updated: June 2025 | Version 1.0.0