🧠 Unified Emotion Dataset for LLM Training

A large-scale, balanced emotion classification dataset built by unifying 10 publicly available emotion corpora that span diverse text domains: social media, clinical cancer forums, news headlines, fan fiction, daily dialogues, and more. It is designed to support robust emotion recognition across heterogeneous text domains when training and evaluating large language models (LLMs).

Overview

Emotion recognition is a core task in affective computing and natural language understanding. However, individual emotion datasets are often small, domain-specific, or use incompatible label sets, making it difficult to train general-purpose emotion classifiers.

This repository unifies 10 public datasets under a common 13-class emotion taxonomy (based on Plutchik's wheel of emotions extended with moral emotions). Key challenges addressed:

| Challenge | Solution Applied |
| --- | --- |
| Inconsistent label names across datasets | Manual label mapping to unified taxonomy |
| Severe class imbalance (up to 78× ratio) | Semantic clustering + synonym augmentation balancing |
| Cross-dataset duplicates | Text-level deduplication within and across sources |
| Noisy social media text | Tweet cleaning: URL/mention removal, emoji preservation |
| Domain shift | Multi-domain sources intentionally retained for generalisability |

Dataset Statistics

Final Balanced Dataset (`unified_emotions_v1.0.0.csv`)

| Property | Value |
| --- | --- |
| Total samples | 70,757 |
| Emotion classes | 13 |
| Samples per class | ~5,000 |
| Average text length | ~98 characters / ~19 words |
| Vocabulary size | 132,629 unique tokens |
| Average lexical diversity | 0.916 |
| Average Flesch readability | 70.0 (standard) |
| Missing texts | 0 |
| Cross-source duplicates | removed |

Class Distribution (Final Dataset)

| Emotion | Count | % |
| --- | --- | --- |
| anger | 5,000 | 7.7% |
| anticipation | 5,000 | 7.7% |
| disgust | 5,000 | 7.7% |
| fear | 5,000 | 7.7% |
| guilt | 5,000 | 7.7% |
| joy | 5,000 | 7.7% |
| love | 5,000 | 7.7% |
| optimism | 5,003 | 7.7% |
| pessimism | 5,000 | 7.7% |
| sadness | 5,000 | 7.7% |
| shame | 5,001 | 7.7% |
| surprise | 5,000 | 7.7% |
| trust | 5,000 | 7.7% |
| Total | 70,757 | 100% |

Source Datasets

The combined, unbalanced dataset (unified_emotions_raw_merged.csv) merges the following 10 sources:

| # | Dataset | Domain | Original Labels | Samples (processed) | License | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | CARER / dair-ai Emotion | Social media posts | 6 (joy, sadness, anger, fear, love, surprise) | 416,809 | MIT | Saravia et al., 2018 |
| 2 | SemEval-2018 Task 1-E-c | Twitter | 11 emotions | 10,936 | CC-BY | Mohammad et al., 2018 |
| 3 | WASSA-2017 | Twitter | anger, fear, joy, sadness | 6,861 | Research use | Mohammad & Bravo-Marquez, 2017 |
| 4 | CancerEMO | Online cancer forums | 8 Plutchik emotions | 9,960 | CC-BY | Sosea & Caragea, 2020 |
| 5 | ISEAR | Self-report questionnaire | joy, fear, anger, sadness, disgust, shame, guilt | 7,666 | Research use | Scherer & Wallbott, 1994 |
| 6 | DailyDialog | Multi-turn conversations | happiness, surprise, sadness, anger, disgust, fear | 13,429 | CC-BY-NC-SA | Li et al., 2017 |
| 7 | Emotion Stimulus | News / fiction | happy, anger, fear, sad, shame, surprise, disgust | 820 | Research use | Ghazi et al., 2015 |
| 8 | FanFic | Fan fiction stories | 8 Plutchik emotions | 1,742 | CC-BY | Bhatt et al., 2019 |
| 9 | SSEC | Social media (US elections) | 8 Plutchik + No Emotion | 2,905 | CC-BY-NC-SA | Schuff et al., 2017 |
| 10 | TEC (Twitter Emotion Corpus) | Twitter (hashtag-based) | joy, sadness, surprise, fear, anger, disgust | 20,522 | Research use | Mohammad, 2012 |

Note on GoEmotions: The code preprocesses GoEmotions separately with a 27→4 emotion mapping, but in this version the corpus is included in neither the final balanced dataset (unified_emotions_v1.0.0.csv) nor the combined dataset (unified_emotions_raw_merged.csv). The preprocessing code stays in the notebooks because fully integrating GoEmotions into the 13-class taxonomy is a primary goal for the next major release (v2.0).

Processing Pipeline

Raw Source Datasets (10 datasets, varied formats)
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 01: Per-Dataset Preprocessing           │
 │  - Load (CSV / TXT / TSV / XML / JSON)            │
 │  - Tweet cleaning (remove @mentions, URLs)        │
 │  - Emoji preservation                             │
 │  - Duplicate removal (text-level)                 │
 │  - Label normalisation to unified taxonomy        │
 │  - Save → Prep_files/<dataset>.csv                │
 └───────────────────────────────────────────────────┘
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 02: Dataset Union                       │
 │  - Load all Prep_files/*.csv                      │
 │  - Concatenate into single DataFrame              │
 │  - Add sentiment polarity column                  │
 │  - Save → unified_emotions_raw_merged.csv         │
 │    (~499K rows, 13 classes)                       │
 └───────────────────────────────────────────────────┘
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 03: Dataset Balancing                   │
 │  - For majority classes: KMeans semantic          │
 │    clustering + representative sampling           │
 │  - For minority classes: Synonym augmentation     │
 │    (WordNet) + paraphrase generation              │
 │  - Quality filtering (sentence embeddings)        │
 │  - Target: 5,000 samples per class                │
 │  - Save → unified_emotions_v1.0.0.csv             │
 │    (70,757 rows)                                  │
 └───────────────────────────────────────────────────┘

Emotion Taxonomy

The 13 unified emotion classes with their mappings from source label names:

| Unified Label | Maps From (source labels) | Plutchik Category |
| --- | --- | --- |
| `anger` | anger, angry, Anger | Primary |
| `anticipation` | anticipation, Anticipation | Primary |
| `disgust` | disgust, Disgust | Primary |
| `fear` | fear, Fear, scared | Primary |
| `joy` | joy, happiness, happy, Happy, HAPPY | Primary |
| `surprise` | surprise, Surprise, surprised | Primary |
| `sadness` | sadness, sad, Sadness | Primary |
| `trust` | trust, Trust | Primary |
| `love` | love | Secondary (joy + trust) |
| `optimism` | optimism | Secondary (anticipation + joy) |
| `pessimism` | pessimism | Secondary (anticipation + sadness) |
| `guilt` | guilt | Moral emotion |
| `shame` | shame, Shame | Moral emotion |
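
As a rough illustration of how such a mapping can be applied, the sketch below uses a subset of the table's entries. The names `LABEL_MAP` and `normalise_label` are hypothetical, and folding case before lookup is an assumption made here for brevity; the notebooks may instead list each case variant explicitly.

```python
# Illustrative subset of the source-label -> unified-label table above.
# LABEL_MAP and normalise_label are hypothetical names, not the repository's API.
LABEL_MAP = {
    "anger": "anger", "angry": "anger",
    "fear": "fear", "scared": "fear",
    "joy": "joy", "happiness": "joy", "happy": "joy",
    "surprise": "surprise", "surprised": "surprise",
    "sadness": "sadness", "sad": "sadness",
    "shame": "shame",
}


def normalise_label(raw_label: str) -> str:
    """Map a source label to the unified taxonomy.

    Assumption: case is folded before lookup, so "Happy" and "HAPPY"
    both resolve through the "happy" entry.
    """
    key = raw_label.strip().lower()
    if key not in LABEL_MAP:
        # Fail loudly rather than silently inventing a class.
        raise KeyError(f"No unified mapping for source label: {raw_label!r}")
    return LABEL_MAP[key]


print(normalise_label("HAPPY"))  # joy
print(normalise_label("Anger"))  # anger
```

Raising on unmapped labels, rather than passing them through, keeps a new source from quietly adding a fourteenth class.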

File Structure

Note on source datasets: Raw source data is not included in this repository (to respect original licenses and avoid redistribution). See DATA_SOURCES.md for download links for all 12 sources.

Data_Prep/
├── README.md                          ← This file
├── DATA_SOURCES.md                    ← Download links for all 12 source datasets
├── LICENSE                            ← CC-BY-4.0
├── CITATION.cff                       ← Machine-readable citation
├── .gitignore
├── .gitattributes                     ← Git LFS for unified_emotions_v1.0.0.csv
│
├── unified_emotions_v1.0.0.csv                ← ✅ FINAL balanced dataset (70,757 rows, Git LFS)
│
├── notebooks/
│   ├── 01_data_preprocessing.ipynb    ← Per-dataset loading, cleaning, label mapping
│   ├── 02_data_union.ipynb            ← Merging all sources → unified_emotions_raw_merged.csv
│   ├── 03_dataset_balancing.ipynb     ← KMeans + augmentation → unified_emotions_v1.0.0.csv
│   └── 04_data_analysis.ipynb         ← Quality analysis and visualisations
│
├── src/
│   └── dataset_balancer.py            ← Standalone DatasetBalancer class (CLI-ready)
│
├── Analysis_Results/                  ← Per-dataset analysis charts & reports
│   ├── summary_report.txt
│   ├── *_label_distribution.png
│   ├── *_text_length.png
│   └── *_samples.txt
│
├── llm_dataset_analysis/              ← Final dataset analysis
│   ├── analysis_report.txt
│   ├── class_distribution.png
│   ├── lexical_diversity.png
│   ├── readability_analysis.png
│   ├── text_length_analysis.png
│   └── wordcloud_*.png
│
└── docs/
    ├── preprocessing_guide.md         ← Step-by-step reproduction guide
    └── dataset_cards/                 ← Per-source dataset documentation
        ├── carer.md
        ├── semeval2018.md
        ├── wassa2017.md
        ├── cancer_emo.md
        ├── isear.md
        ├── daily_dialogue.md
        ├── emotion_stimulus.md
        ├── fanfic.md
        ├── ssec.md
        ├── tec.md
        └── go_emotions.md

Quick Start

Requirements

pip install pandas numpy scikit-learn sentence-transformers transformers torch nltk emoji tqdm

Load the Final Balanced Dataset

import pandas as pd

# Load final balanced dataset
df = pd.read_csv('unified_emotions_v1.0.0.csv')

print(df.head())
# Output:
#                                                 text emotion sentiment  cluster
# 0  i am swimming weekly which feels amazing but...     joy  positive      3.0
# 1                      i feel excited for it    joy  positive      NaN
# ...

print(df['emotion'].value_counts())
# anger          5000
# anticipation   5000
# disgust        5000
# ...

print(f"Dataset shape: {df.shape}")
# Dataset shape: (70757, 4)

Load the Unbalanced Combined Dataset

# Load the combined, unbalanced dataset (requires Git LFS or manual download)
df_combined = pd.read_csv('unified_emotions_raw_merged.csv')
print(df_combined['emotion'].value_counts())
# joy         167586
# sadness     131834
# anger        64755
# ...

Use for Training (HuggingFace style)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('unified_emotions_v1.0.0.csv')

# Encode labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['emotion'])

# Split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'],
    test_size=0.2, random_state=42, stratify=df['label']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 56605, Test: 14152

Data Schema

`unified_emotions_v1.0.0.csv` (Final Balanced Dataset)

| Column | Type | Description |
| --- | --- | --- |
| `text` | str | The input text sample |
| `emotion` | str | One of 13 emotion classes (see taxonomy) |
| `sentiment` | str | Coarse sentiment polarity: `positive`, `negative`, or `neutral` |
| `cluster` | float | KMeans cluster ID assigned during balancing (NaN for augmented samples) |
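
A quick sanity check against this schema might look like the sketch below. It uses only the standard library and three hypothetical rows, not real dataset samples. `validate_row` is an illustrative helper, and treating an empty `cluster` cell as NaN assumes the file was written with pandas' default NaN representation.

```python
import csv
import io

EMOTIONS = {
    "anger", "anticipation", "disgust", "fear", "guilt", "joy", "love",
    "optimism", "pessimism", "sadness", "shame", "surprise", "trust",
}
SENTIMENTS = {"positive", "negative", "neutral"}


def validate_row(row):
    """Return a list of schema problems for one CSV row (empty list = valid)."""
    problems = []
    if not (row.get("text") or "").strip():
        problems.append("empty text")
    if row.get("emotion") not in EMOTIONS:
        problems.append(f"unknown emotion: {row.get('emotion')!r}")
    if row.get("sentiment") not in SENTIMENTS:
        problems.append(f"unknown sentiment: {row.get('sentiment')!r}")
    cluster = (row.get("cluster") or "").strip()
    if cluster:  # an empty cell is read as NaN, which the schema allows
        try:
            float(cluster)
        except ValueError:
            problems.append(f"non-numeric cluster: {cluster!r}")
    return problems


# Hypothetical rows for illustration only.
SAMPLE_CSV = """text,emotion,sentiment,cluster
what a lovely morning,joy,positive,3.0
i did not expect that,surprise,neutral,
something is off,bliss,positive,abc
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for index, row in enumerate(rows):
    print(index, validate_row(row))
```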

`unified_emotions_raw_merged.csv` (Combined Unbalanced Dataset)

| Column | Type | Description |
| --- | --- | --- |
| `text` | str | The input text sample |
| `emotion` | str | Normalised emotion label |
| `sentiment` | str | Coarse sentiment polarity |

Per-Source CSVs in `Prep_files/`

Most files use a two-column `text`, `label` schema. Exceptions:

- `ISEAR.csv`: adds an `ID` column
- `wassa2017.csv`: adds an `intensity` column (float in 0–1, continuous emotion intensity score)

Techniques Used

1. Tweet Cleaning (`tweet_cleaner()` in Notebook 01)

Applied to Twitter-sourced datasets (TEC, WASSA-2017, SemEval-2018):

- Remove @mentions and http://... URLs using regex
- Tokenise with NLTK's TweetTokenizer (handles hashtags, contractions)
- Preserve emojis (extracted and appended back after cleaning)
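
The steps above can be sketched with regular expressions alone. This is not the notebook's `tweet_cleaner()`. The function name `clean_tweet_sketch` is hypothetical, whitespace splitting stands in for NLTK's TweetTokenizer, and the emoji ranges are a rough approximation chosen for the example.

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges; an assumption for this sketch, not a full emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet_sketch(text: str) -> str:
    """Strip URLs and @mentions, then append any emojis found to the end."""
    emojis = EMOJI_RE.findall(text)      # extract emojis first
    text = EMOJI_RE.sub(" ", text)       # take them out of the running text
    text = URL_RE.sub(" ", text)         # drop URLs
    text = MENTION_RE.sub(" ", text)     # drop @mentions
    text = " ".join(text.split())        # collapse leftover whitespace
    if emojis:
        text = f"{text} {''.join(emojis)}".strip()
    return text


print(clean_tweet_sketch("@sam loved it 😍 http://t.co/abc so good"))
# loved it so good 😍
```

URLs are removed before mentions so that an `@` inside a link is not processed twice.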

2. Label Normalisation

Each source dataset uses different naming conventions. Labels were manually mapped to the 13-class unified taxonomy (e.g., happiness → joy, sad → sadness, Anger → anger).

3. Deduplication

- Within each dataset: exact text-level deduplication (keep first occurrence)
- Cross-dataset: the merged file retains cross-source duplicates for transparent tracking; users can apply further deduplication if needed
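
Both steps are simple to reproduce. The sketch below, with hypothetical helper names and toy rows, shows keep-first exact deduplication within one source and a way to list texts that occur in more than one source, for users who want to deduplicate the merged file further.

```python
from collections import defaultdict


def deduplicate(rows):
    """Keep the first occurrence of each exact text; drop later repeats."""
    seen = set()
    unique = []
    for text, label in rows:
        if text in seen:
            continue
        seen.add(text)
        unique.append((text, label))
    return unique


def cross_source_duplicates(sources):
    """Map each text found in more than one source to the sorted source names."""
    owners = defaultdict(set)
    for name, rows in sources.items():
        for text, _label in rows:
            owners[text].add(name)
    return {text: sorted(names) for text, names in owners.items() if len(names) > 1}


# Toy rows for illustration only.
source_a = [("i am happy", "joy"), ("i am happy", "joy"), ("so scared", "fear")]
source_b = [("so scared", "fear"), ("what a day", "surprise")]

print(deduplicate(source_a))
# [('i am happy', 'joy'), ('so scared', 'fear')]
print(cross_source_duplicates({"a": source_a, "b": source_b}))
# {'so scared': ['a', 'b']}
```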

4. Class Balancing (`DatasetBalancer` class in Notebook 03 / `src/dataset_balancer.py`)

For majority classes (e.g., joy: 167K → 5K):

- TF-IDF vectorisation of text
- KMeans clustering to identify semantic sub-groups
- Representative sampling within each cluster (preserves diversity)
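
The sampling step can be sketched independently of the clustering. The example below assumes cluster IDs have already been assigned (for instance by KMeans over TF-IDF vectors) and draws from each cluster in proportion to its size. The function name `sample_by_cluster`, the proportional quota, and random selection within a cluster are all assumptions of this sketch; the `DatasetBalancer` class may choose representatives differently.

```python
import random
from collections import defaultdict


def sample_by_cluster(texts, cluster_ids, target, seed=42):
    """Downsample to `target` items, drawing from each cluster proportionally."""
    total = len(texts)
    if target >= total:
        return list(texts)

    clusters = defaultdict(list)
    for text, cid in zip(texts, cluster_ids):
        clusters[cid].append(text)

    # Floor of each cluster's proportional share of the target.
    quotas = {cid: (len(members) * target) // total
              for cid, members in clusters.items()}

    # Hand leftover slots to the clusters with the largest remainders.
    leftover = target - sum(quotas.values())
    by_remainder = sorted(
        clusters,
        key=lambda cid: (len(clusters[cid]) * target) % total,
        reverse=True,
    )
    for cid in by_remainder[:leftover]:
        quotas[cid] += 1

    rng = random.Random(seed)
    sampled = []
    for cid, members in clusters.items():
        sampled.extend(rng.sample(members, quotas[cid]))
    return sampled


# Toy data: 100 texts in three clusters of sizes 70, 20 and 10.
texts = [f"text {i}" for i in range(100)]
cluster_ids = [0] * 70 + [1] * 20 + [2] * 10
subset = sample_by_cluster(texts, cluster_ids, target=10)
print(len(subset))  # 10
```

Proportional quotas keep small sub-groups represented, which a plain random sample of the whole class would not guarantee.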

For minority classes (e.g., pessimism: 375 → 5K):

- Synonym augmentation via WordNet (random word replacement)
- Back-translation paraphrasing using a Seq2Seq transformer model
- Quality filtering using cosine similarity of sentence embeddings (all-MiniLM-L6-v2) to ensure augmented samples remain semantically close to originals
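
The augment-then-filter loop can be illustrated with standard-library stand-ins. In the sketch below a tiny hand-written `SYNONYMS` dictionary replaces WordNet, bag-of-words cosine similarity replaces sentence embeddings, and back-translation is left out. The threshold, replacement probability, and all names are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical stand-in for WordNet synonym lookups.
SYNONYMS = {
    "sad": ["unhappy", "gloomy"],
    "hopeless": ["despairing"],
    "future": ["outlook"],
}


def augment(text, rng, replace_prob=0.5):
    """Randomly swap words that have an entry in SYNONYMS."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < replace_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def augment_with_filter(text, n_candidates=20, min_sim=0.7, seed=0):
    """Generate candidates and keep those that changed but stayed close."""
    rng = random.Random(seed)
    kept = set()
    for _ in range(n_candidates):
        candidate = augment(text, rng)
        if candidate != text and cosine(text, candidate) >= min_sim:
            kept.add(candidate)
    return sorted(kept)


original = "i feel sad and hopeless about the future"
for variant in augment_with_filter(original):
    print(variant)
```

With this toy similarity, replacing all three eligible words drops the score to 0.625, so those candidates are filtered out while lighter edits are kept.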

Analysis Results

Per-dataset analysis charts are in Analysis_Results/. The final dataset analysis is in llm_dataset_analysis/.

Per-Dataset Summary

| Dataset | Rows | Classes | Avg Text Length | Class Imbalance Ratio | Duplicates |
| --- | --- | --- | --- | --- | --- |
| CARER | 416,809 | 6 | 97 chars | 9.4× | 22,987 |
| TEC Tweets | 20,522 | 6 | 82 chars | 10.7× | 0 |
| DailyDialog | 13,429 | 6 | 59 chars | 78.4× | 662 |
| ISEAR | 7,666 | 7 | 115 chars | 1.0× (balanced) | 163 |
| SemEval-2018 | 7,869 | 11 | 96 chars | 9.4× | 4,685 |
| WASSA-2017 | 6,861 | 4 | 89 chars | 1.7× | 0 |
| CancerEMO | 9,960 | 8 | 83 chars | 13.9× | 1,494 |
| SSEC | 2,905 | 9 | 80 chars | 28.9× | 16 |
| FanFic | 1,742 | 8 | 943 chars | 4.3× | 120 |
| Emotion Stimulus | 820 | 7 | 113 chars | 5.6× | 0 |
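
The table does not define the Class Imbalance Ratio. A common reading, assumed in the sketch below, is the size of the largest class divided by the size of the smallest. The helper name and the toy labels are illustrative.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only.
toy_labels = ["joy"] * 9 + ["anger"] * 3 + ["fear"] * 6
print(imbalance_ratio(toy_labels))  # 3.0
```

A perfectly balanced dataset gives 1.0 under this definition, which matches the ISEAR row.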

Citation

If you use this dataset in your research, please cite:

@dataset{jamil2025unifiedemotion,
  author    = {Jamil, Luqman},
  title     = {Unified Emotion Dataset for LLM Training: A Multi-Source Benchmark with 13 Emotion Classes},
  year      = {2025},
  version   = {1.0.0},
  license   = {CC-BY-4.0},
  url       = {https://github.com/luqmanjamilch/Unified_Emotions-Dataset}
}

Please also cite the individual source datasets you use. Full BibTeX entries for all sources:


@inproceedings{saravia2018carer,
  title     = {{CARER}: Contextualized Affect Representations for Emotion Recognition},
  author    = {Saravia, Elvis and Liu, Hsien-Chi Toby and Huang, Yen-Hao and Wu, Junlin and Chen, Yi-Shin},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages     = {3687--3697},
  year      = {2018},
  publisher = {Association for Computational Linguistics}
}

@inproceedings{mohammad2018semeval,
  title     = {{SemEval}-2018 Task 1: Affect in Tweets},
  author    = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
  booktitle = {Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)},
  pages     = {1--17},
  year      = {2018}
}

@inproceedings{mohammad2017wassa,
  title     = {{WASSA}-2017 Shared Task on Emotion Intensity},
  author    = {Mohammad, Saif M. and Bravo-Marquez, Felipe},
  booktitle = {Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
  pages     = {34--49},
  year      = {2017}
}

@inproceedings{sosea2020canceremo,
  title     = {{CancerEmo}: A Dataset for Fine-Grained Emotion Detection},
  author    = {Sosea, Tiberiu and Caragea, Cornelia},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}
}

@article{scherer1994isear,
  title     = {Evidence for universality and cultural variation of differential emotion response patterning},
  author    = {Scherer, Klaus R. and Wallbott, Harald G.},
  journal   = {Journal of Personality and Social Psychology},
  volume    = {66},
  number    = {2},
  pages     = {310--328},
  year      = {1994}
}

@inproceedings{li2017dailydialog,
  title     = {{DailyDialog}: A Manually Labelled Multi-turn Dialogue Dataset},
  author    = {Li, Yanran and Su, Hui and Shen, Xiaoyu and Li, Wenjie and Cao, Ziqiang and Niu, Shuzi},
  booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017)},
  pages     = {986--995},
  year      = {2017}
}

@inproceedings{ghazi2015stimuli,
  title     = {Detecting Emotion Stimuli in Emotion-Bearing Sentences},
  author    = {Ghazi, Diman and Inkpen, Diana and Szpakowicz, Stan},
  booktitle = {Computational Linguistics and Intelligent Text Processing (CICLing 2015)},
  year      = {2015}
}

@inproceedings{bhatt2019fanfic,
  title     = {Automatic Identification of Character Types from Film Dialogs},
  author    = {Bhatt, Hardik and Bhatt, Brijesh and Pimpale, Ankit},
  booktitle = {Proceedings of the Workshop on Narrative Understanding},
  year      = {2019}
}

@inproceedings{schuff2017ssec,
  title     = {Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus},
  author    = {Schuff, Hendrik and Barnes, Jeremy and Mohme, Julian and Pad{\'o}, Sebastian and Klinger, Roman},
  booktitle = {Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
  pages     = {13--23},
  year      = {2017}
}

@inproceedings{mohammad2012tec,
  title     = {\#Emotional Tweets},
  author    = {Mohammad, Saif M.},
  booktitle = {Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM 2012)},
  pages     = {246--255},
  year      = {2012}
}

@inproceedings{demszky2020goemotions,
  title     = {{GoEmotions}: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  pages     = {4040--4054},
  year      = {2020}
}

License

This repository (the unified dataset and processing code) is licensed under CC BY 4.0.

Important: The individual source datasets retain their original licenses. Some (e.g., DailyDialog, SSEC) are licensed under CC-BY-NC-SA and may not be used for commercial purposes. Please review docs/dataset_cards/ for the specific license of each source before use.

Acknowledgements

This dataset was compiled as part of PhD research. We thank the authors of all source datasets for making their work publicly available. Special thanks to the organisers of SemEval 2018 and WASSA 2017 shared tasks.

Last updated: June 2025 | Version 1.0.0
