luqmanjamilch/Unified_Emotions-Dataset

🧠 Unified Emotion Dataset for LLM Training

License: CC BY 4.0 | Python 3.8+

A large-scale, balanced emotion classification dataset constructed by unifying 10 publicly available emotion corpora spanning diverse text domains, including social media, clinical cancer forums, news headlines, fan fiction, and daily dialogues. It is designed to enable robust emotion recognition across heterogeneous text domains for training and evaluating large language models (LLMs).


Overview

Emotion recognition is a core task in affective computing and natural language understanding. However, individual emotion datasets are often small, domain-specific, or use incompatible label sets, making it difficult to train general-purpose emotion classifiers.

This repository unifies 10 public datasets under a common 13-class emotion taxonomy (based on Plutchik's wheel of emotions extended with moral emotions). Key challenges addressed:

| Challenge | Solution Applied |
|---|---|
| Inconsistent label names across datasets | Manual label mapping to unified taxonomy |
| Severe class imbalance (up to 78× ratio) | Semantic clustering + synonym augmentation balancing |
| Cross-dataset duplicates | Text-level deduplication within and across sources |
| Noisy social media text | Tweet cleaning: URL/mention removal, emoji preservation |
| Domain shift | Multi-domain sources intentionally retained for generalisability |

Dataset Statistics

Final Balanced Dataset (unified_emotions_v1.0.0.csv)

| Property | Value |
|---|---|
| Total samples | 70,757 |
| Emotion classes | 13 |
| Samples per class | ~5,000 |
| Average text length | ~98 characters / ~19 words |
| Vocabulary size | 132,629 unique tokens |
| Average lexical diversity | 0.916 |
| Average Flesch readability | 70.0 (standard) |
| Missing texts | 0 |
| Cross-source duplicates | Removed |
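
The lexical diversity figure is not defined in this README. The following is a minimal sketch, assuming it is a per-text type-token ratio (unique tokens divided by total tokens) averaged over all samples. The whitespace tokenisation and the helper name `type_token_ratio` are assumptions for illustration, not the repository's code.

```python
def type_token_ratio(text):
    """Unique tokens divided by total tokens (0.0 for empty text)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenisation
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Made-up example texts, not rows from the dataset.
texts = ["i feel so so happy", "what a lovely surprise"]
average_diversity = sum(type_token_ratio(t) for t in texts) / len(texts)
print(f"{average_diversity:.3f}")  # 0.900
```

Short texts rarely repeat words, so a corpus averaging about 19 words per sample would be expected to score close to 1 under this definition.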

Class Distribution (Final Dataset)

| Emotion | Count | % |
|---|---|---|
| anger | 5,000 | 7.7% |
| anticipation | 5,000 | 7.7% |
| disgust | 5,000 | 7.7% |
| fear | 5,000 | 7.7% |
| guilt | 5,000 | 7.7% |
| joy | 5,000 | 7.7% |
| love | 5,000 | 7.7% |
| optimism | 5,003 | 7.7% |
| pessimism | 5,000 | 7.7% |
| sadness | 5,000 | 7.7% |
| shame | 5,001 | 7.7% |
| surprise | 5,000 | 7.7% |
| trust | 5,000 | 7.7% |
| Total | 70,757 | 100% |

Source Datasets

The combined, unbalanced dataset (unified_emotions_raw_merged.csv) merges the following 10 sources:

| # | Dataset | Domain | Original Labels | Samples (processed) | License | Reference |
|---|---|---|---|---|---|---|
| 1 | CARER / dair-ai Emotion | Social media posts | 6 (joy, sadness, anger, fear, love, surprise) | 416,809 | MIT | Saravia et al., 2018 |
| 2 | SemEval-2018 Task 1-E-c | Twitter | 11 emotions | 10,936 | CC-BY | Mohammad et al., 2018 |
| 3 | WASSA-2017 | Twitter | anger, fear, joy, sadness | 6,861 | Research use | Mohammad & Bravo-Marquez, 2017 |
| 4 | CancerEMO | Online cancer forums | 8 Plutchik emotions | 9,960 | CC-BY | Wang et al., 2021 |
| 5 | ISEAR | Self-report questionnaire | joy, fear, anger, sadness, disgust, shame, guilt | 7,666 | Research use | Scherer & Wallbott, 1994 |
| 6 | DailyDialog | Multi-turn conversations | happiness, surprise, sadness, anger, disgust, fear | 13,429 | CC-BY-NC-SA | Li et al., 2017 |
| 7 | Emotion Stimulus | News / fiction | happy, anger, fear, sad, shame, surprise, disgust | 820 | Research use | Ghazi et al., 2015 |
| 8 | FanFic | Fan fiction stories | 8 Plutchik emotions | 1,742 | CC-BY | Bhatt et al., 2019 |
| 9 | SSEC | Social media (US elections) | 8 Plutchik + No Emotion | 2,905 | CC-BY-NC-SA | Schuff et al., 2017 |
| 10 | TEC (Twitter Emotion Corpus) | Twitter (hashtag-based) | joy, sadness, surprise, fear, anger, disgust | 20,522 | Research use | Mohammad, 2012 |

Note on GoEmotions: GoEmotions was processed separately in the code with a 27→4 emotion mapping. It is not included in the current final balanced dataset (unified_emotions_v1.0.0.csv) or in the combined dataset (unified_emotions_raw_merged.csv). The preprocessing code is retained in the notebooks because fully integrating GoEmotions into the 13-class taxonomy is a primary goal for the next major release (v2.0).


Processing Pipeline

Raw Source Datasets (10 datasets, varied formats)
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 01: Per-Dataset Preprocessing           │
 │  - Load (CSV / TXT / TSV / XML / JSON)            │
 │  - Tweet cleaning (remove @mentions, URLs)        │
 │  - Emoji preservation                             │
 │  - Duplicate removal (text-level)                 │
 │  - Label normalisation to unified taxonomy        │
 │  - Save → Prep_files/<dataset>.csv                │
 └───────────────────────────────────────────────────┘
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 02: Dataset Union                       │
 │  - Load all Prep_files/*.csv                      │
 │  - Concatenate into single DataFrame              │
 │  - Add sentiment polarity column                  │
 │  - Save → unified_emotions_raw_merged.csv (~499K rows, 13 classes)   │
 └───────────────────────────────────────────────────┘
         │
         ▼
 ┌───────────────────────────────────────────────────┐
 │  Notebook 03: Dataset Balancing                   │
 │  - For majority classes: KMeans semantic          │
 │    clustering + representative sampling           │
 │  - For minority classes: Synonym augmentation     │
 │    (WordNet) + paraphrase generation              │
 │  - Quality filtering (sentence embeddings)        │
 │  - Target: 5,000 samples per class                │
 │  - Save → unified_emotions_v1.0.0.csv (70,757 rows)       │
 └───────────────────────────────────────────────────┘

Emotion Taxonomy

The 13 unified emotion classes with their mappings from source label names:

| Unified Label | Maps From (source labels) | Plutchik Category |
|---|---|---|
| anger | anger, angry, Anger | Primary |
| anticipation | anticipation, Anticipation | Primary |
| disgust | disgust, Disgust | Primary |
| fear | fear, Fear, scared | Primary |
| joy | joy, happiness, happy, Happy, HAPPY | Primary |
| surprise | surprise, Surprise, surprised | Primary |
| sadness | sadness, sad, Sadness | Primary |
| trust | trust, Trust | Primary |
| love | love | Secondary (joy + trust) |
| optimism | optimism | Secondary (anticipation + joy) |
| pessimism | pessimism | Secondary (sadness + anticipation) |
| guilt | guilt | Moral emotion |
| shame | shame, Shame | Moral emotion |

File Structure

Note on source datasets: Raw source data is not included in this repository (to respect original licenses and avoid redistribution). See DATA_SOURCES.md for download links for all 12 sources.

Data_Prep/
├── README.md                          ← This file
├── DATA_SOURCES.md                    ← Download links for all 12 source datasets
├── LICENSE                            ← CC-BY-4.0
├── CITATION.cff                       ← Machine-readable citation
├── .gitignore
├── .gitattributes                     ← Git LFS for unified_emotions_v1.0.0.csv
│
├── unified_emotions_v1.0.0.csv                ← ✅ FINAL balanced dataset (70,757 rows, Git LFS)
│
├── notebooks/
│   ├── 01_data_preprocessing.ipynb    ← Per-dataset loading, cleaning, label mapping
│   ├── 02_data_union.ipynb            ← Merging all sources → unified_emotions_raw_merged.csv
│   ├── 03_dataset_balancing.ipynb     ← KMeans + augmentation → unified_emotions_v1.0.0.csv
│   └── 04_data_analysis.ipynb         ← Quality analysis and visualisations
│
├── src/
│   └── dataset_balancer.py            ← Standalone DatasetBalancer class (CLI-ready)
│
├── Analysis_Results/                  ← Per-dataset analysis charts & reports
│   ├── summary_report.txt
│   ├── *_label_distribution.png
│   ├── *_text_length.png
│   └── *_samples.txt
│
├── llm_dataset_analysis/              ← Final dataset analysis
│   ├── analysis_report.txt
│   ├── class_distribution.png
│   ├── lexical_diversity.png
│   ├── readability_analysis.png
│   ├── text_length_analysis.png
│   └── wordcloud_*.png
│
└── docs/
    ├── preprocessing_guide.md         ← Step-by-step reproduction guide
    └── dataset_cards/                 ← Per-source dataset documentation
        ├── carer.md
        ├── semeval2018.md
        ├── wassa2017.md
        ├── cancer_emo.md
        ├── isear.md
        ├── daily_dialogue.md
        ├── emotion_stimulus.md
        ├── fanfic.md
        ├── ssec.md
        ├── tec.md
        └── go_emotions.md

Quick Start

Requirements

pip install pandas numpy scikit-learn sentence-transformers transformers torch nltk emoji tqdm

Load the Final Balanced Dataset

import pandas as pd

# Load final balanced dataset
df = pd.read_csv('unified_emotions_v1.0.0.csv')

print(df.head())
# Output:
#                                                 text emotion sentiment  cluster
# 0  i am swimming weekly which feels amazing but...     joy  positive      3.0
# 1                      i feel excited for it    joy  positive      NaN
# ...

print(df['emotion'].value_counts())
# anger          5000
# anticipation   5000
# disgust        5000
# ...

print(f"Dataset shape: {df.shape}")
# Dataset shape: (70757, 4)

Load the Unbalanced Combined Dataset

# Load the combined, unbalanced dataset (requires Git LFS or manual download)
df_combined = pd.read_csv('unified_emotions_raw_merged.csv')
print(df_combined['emotion'].value_counts())
# joy         167586
# sadness     131834
# anger        64755
# ...

Use for Training (scikit-learn train/test split)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('unified_emotions_v1.0.0.csv')

# Encode labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['emotion'])

# Split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'],
    test_size=0.2, random_state=42, stratify=df['label']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 56605, Test: 14152

Data Schema

unified_emotions_v1.0.0.csv (Final Balanced Dataset)

| Column | Type | Description |
|---|---|---|
| text | str | The input text sample |
| emotion | str | One of 13 emotion classes (see taxonomy) |
| sentiment | str | Coarse sentiment polarity: positive, negative, or neutral |
| cluster | float | KMeans cluster ID assigned during balancing (NaN for augmented samples) |

unified_emotions_raw_merged.csv (Combined Unbalanced Dataset)

| Column | Type | Description |
|---|---|---|
| text | str | The input text sample |
| emotion | str | Normalised emotion label |
| sentiment | str | Coarse sentiment polarity |

Per-Source CSVs in Prep_files/

Most files use a two-column schema (text, label). Exceptions:

  • ISEAR.csv: adds ID column
  • wassa2017.csv: adds intensity (0–1 float, continuous emotion intensity score)

Techniques Used

1. Tweet Cleaning (tweet_cleaner() in Notebook 01)

Applied to Twitter-sourced datasets (TEC, WASSA-2017, SemEval-2018):

  • Remove @mentions and http://... URLs using regex
  • Tokenise with NLTK's TweetTokenizer (handles hashtags, contractions)
  • Preserve emojis (extracted and appended back after cleaning)
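
The cleaning steps above can be sketched with the standard library alone. This is not the notebook's `tweet_cleaner()`: the function name `clean_tweet` and the regex patterns are assumptions, the NLTK tokenisation step is omitted, and emojis are simply left in place rather than extracted and appended back.

```python
import re

# Hypothetical patterns; the notebook's actual regexes may differ.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and @mentions, collapse whitespace, leave emojis and hashtags untouched."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("@anna so happy today 😀 https://t.co/abc123 #blessed"))
# so happy today 😀 #blessed
```

Because neither pattern matches emoji code points or the `#` character, emojis and hashtags pass through unchanged in this sketch.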

2. Label Normalisation

Each source dataset uses different naming conventions. Labels were manually mapped to the 13-class unified taxonomy (e.g., happiness → joy, sad → sadness, Anger → anger).
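
A minimal sketch of such a mapping, built from the source labels listed in the Emotion Taxonomy table. The names `SYNONYMS`, `UNIFIED_LABELS`, and `normalise_label` are hypothetical, and lowercasing before lookup is an assumption about how case variants such as Happy and HAPPY are handled.

```python
# Source-label variants (after lowercasing) that differ from the unified name.
SYNONYMS = {
    "angry": "anger",
    "scared": "fear",
    "happiness": "joy",
    "happy": "joy",
    "surprised": "surprise",
    "sad": "sadness",
}

UNIFIED_LABELS = {
    "anger", "anticipation", "disgust", "fear", "guilt", "joy", "love",
    "optimism", "pessimism", "sadness", "shame", "surprise", "trust",
}


def normalise_label(raw: str) -> str:
    """Map a source label to one of the 13 unified classes, or raise if unmapped."""
    label = raw.strip().lower()
    label = SYNONYMS.get(label, label)
    if label not in UNIFIED_LABELS:
        raise ValueError(f"unmapped source label: {raw!r}")
    return label


print([normalise_label(x) for x in ["happiness", "sad", "Anger", "HAPPY"]])
# ['joy', 'sadness', 'anger', 'joy']
```

Raising on unknown labels makes rows such as SSEC's "No Emotion" surface explicitly, so they can be dropped or handled on purpose. How the notebooks treat such rows is not stated here.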

3. Deduplication

  • Within each dataset: exact text-level deduplication (keep first occurrence)
  • Cross-dataset: merged file retains cross-source duplicates for transparent tracking; users can apply further deduplication if needed
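
Both levels of deduplication can be sketched with pandas. The toy frames below are made up, and the `source` and `cross_dup` columns are hypothetical helpers that are not part of the published schema.

```python
import pandas as pd

# Toy frames standing in for two per-source files (contents are invented).
tec = pd.DataFrame({"text": ["so happy", "so happy", "what a day"],
                    "label": ["joy", "joy", "surprise"]})
isear = pd.DataFrame({"text": ["so happy", "i felt ashamed"],
                      "label": ["joy", "shame"]})

# Within each dataset: exact text match, keep the first occurrence.
tec = tec.drop_duplicates(subset="text", keep="first")
isear = isear.drop_duplicates(subset="text", keep="first")

# Merge, tagging each row with its origin (hypothetical column).
merged = pd.concat(
    [tec.assign(source="tec"), isear.assign(source="isear")],
    ignore_index=True,
)

# Across datasets: flag later copies instead of dropping them straight away.
merged["cross_dup"] = merged.duplicated(subset="text", keep="first")
deduped = merged[~merged["cross_dup"]].drop(columns="cross_dup")

print(len(merged), int(merged["cross_dup"].sum()), len(deduped))  # 4 1 3
```

Flagging rather than dropping keeps the cross-source overlap visible, which matches the intent of retaining duplicates in the merged file for tracking.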

4. Class Balancing (DatasetBalancer class in Notebook 03 / src/dataset_balancer.py)

For majority classes (e.g., joy: 167K → 5K):

  • TF-IDF vectorisation of text
  • KMeans clustering to identify semantic sub-groups
  • Representative sampling within each cluster (preserves diversity)
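
The three steps above can be sketched with scikit-learn. This is not the `DatasetBalancer` implementation: the function name `downsample_by_cluster`, the number of clusters, and the round-robin random draw used as "representative sampling" are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def downsample_by_cluster(texts, target, n_clusters=3, seed=42):
    """Pick `target` row indices, drawing evenly from KMeans clusters of TF-IDF vectors."""
    vectors = TfidfVectorizer().fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(vectors)

    rng = np.random.default_rng(seed)
    # One shuffled pool of row indices per cluster.
    pools = [list(rng.permutation(np.flatnonzero(cluster_ids == c))) for c in range(n_clusters)]

    chosen = []
    # Round-robin over clusters so every sub-group contributes before any contributes twice.
    while len(chosen) < target and any(pools):
        for pool in pools:
            if pool and len(chosen) < target:
                chosen.append(int(pool.pop()))
    return sorted(chosen), cluster_ids


# Made-up texts for a single majority class.
texts = [
    "i feel so happy and joyful today",
    "so happy and joyful about the trip",
    "feeling joyful and happy with friends",
    "my new puppy makes me smile",
    "the puppy made me smile all morning",
    "smile every time the puppy plays",
    "we won the final match",
    "our team won the match last night",
    "cannot believe we won the final",
]
kept, cluster_ids = downsample_by_cluster(texts, target=4)
print(kept, cluster_ids)
```

Drawing across clusters rather than sampling uniformly keeps small sub-groups from being crowded out by the dominant phrasing in a large class. The returned cluster IDs correspond to the kind of value stored in the `cluster` column.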

For minority classes (e.g., pessimism: 375 → 5K):

  • Synonym augmentation via WordNet (random word replacement)
  • Back-translation paraphrasing using a Seq2Seq transformer model
  • Quality filtering using cosine similarity of sentence embeddings (all-MiniLM-L6-v2) to ensure augmented samples remain semantically close to originals
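
The quality-filtering step can be illustrated without downloading a model. In this sketch a bag-of-words vector (`toy_embed`) stands in for the all-MiniLM-L6-v2 sentence embeddings, and the 0.6 threshold is an arbitrary assumption, since no threshold is stated above. All function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words counts; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_augmented(original, candidates, embed=toy_embed, threshold=0.6):
    """Keep only candidates whose embedding stays close to the original's."""
    reference = embed(original)
    return [c for c in candidates if cosine(reference, embed(c)) >= threshold]


original = "i feel hopeless about the future"
candidates = [
    "i feel despairing about the future",  # one word swapped for a synonym
    "the weather is nice today",           # drifted away from the original
]
print(filter_augmented(original, candidates))
# ['i feel despairing about the future']
```

With real sentence embeddings the same comparison would run on dense vectors, and a paraphrase with little word overlap could still pass the filter, which a bag-of-words stand-in cannot capture.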

Analysis Results

Per-dataset analysis charts are in Analysis_Results/. The final dataset analysis is in llm_dataset_analysis/.

Per-Dataset Summary

| Dataset | Rows | Classes | Avg Text Length | Class Imbalance Ratio | Duplicates |
|---|---|---|---|---|---|
| CARER | 416,809 | 6 | 97 chars | 9.4× | 22,987 |
| TEC Tweets | 20,522 | 6 | 82 chars | 10.7× | 0 |
| DailyDialog | 13,429 | 6 | 59 chars | 78.4× | 662 |
| ISEAR | 7,666 | 7 | 115 chars | 1.0× (balanced) | 163 |
| SemEval-2018 | 7,869 | 11 | 96 chars | 9.4× | 4,685 |
| WASSA-2017 | 6,861 | 4 | 89 chars | 1.7× | 0 |
| CancerEMO | 9,960 | 8 | 83 chars | 13.9× | 1,494 |
| SSEC | 2,905 | 9 | 80 chars | 28.9× | 16 |
| FanFic | 1,742 | 8 | 943 chars | 4.3× | 120 |
| Emotion Stimulus | 820 | 7 | 113 chars | 5.6× | 0 |
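
The Class Imbalance Ratio column is not defined in this README. The following is a minimal sketch, assuming it is the size of the largest class divided by the size of the smallest class. `imbalance_ratio` is a hypothetical helper and the label counts are made up for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Invented label list: 94 joy, 40 sadness, 10 surprise.
labels = ["joy"] * 94 + ["sadness"] * 40 + ["surprise"] * 10
print(f"{imbalance_ratio(labels):.1f}×")  # 9.4×
```

Under this definition a perfectly balanced dataset scores 1.0×, which is consistent with the ISEAR row.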

Citation

If you use this dataset in your research, please cite:

@dataset{jamil2025unifiedemotion,
  author    = {Jamil, Luqman},
  title     = {Unified Emotion Dataset for LLM Training: A Multi-Source Benchmark with 13 Emotion Classes},
  year      = {2025},
  version   = {1.0.0},
  license   = {CC-BY-4.0},
  url       = {https://github.com/luqmanjamilch/Unified_Emotions-Dataset}
}

Please also cite the individual source datasets you use. Full BibTeX entries for all sources:

@inproceedings{saravia2018carer,
  title     = {{CARER}: Contextualized Affect Representations for Emotion Recognition},
  author    = {Saravia, Elvis and Liu, Hsien-Chi Toby and Huang, Yen-Hao and Wu, Junlin and Chen, Yi-Shin},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages     = {3687--3697},
  year      = {2018},
  publisher = {Association for Computational Linguistics}
}

@inproceedings{mohammad2018semeval,
  title     = {{SemEval}-2018 Task 1: Affect in Tweets},
  author    = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
  booktitle = {Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)},
  pages     = {1--17},
  year      = {2018}
}

@inproceedings{mohammad2017wassa,
  title     = {{WASSA}-2017 Shared Task on Emotion Intensity},
  author    = {Mohammad, Saif M. and Bravo-Marquez, Felipe},
  booktitle = {Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
  pages     = {34--49},
  year      = {2017}
}

@inproceedings{wang2021canceremo,
  title     = {{CancerEMO}: A Dataset for Fine-Grained Emotion Detection in Cancer-Related Online Posts},
  author    = {Wang, Tao and Wan, Xiaojun and Jin, Hua},
  booktitle = {Proceedings of EMNLP 2021},
  year      = {2021}
}

@article{scherer1994isear,
  title     = {Evidence for universality and cultural variation of differential emotion response patterning},
  author    = {Scherer, Klaus R. and Wallbott, Harald G.},
  journal   = {Journal of Personality and Social Psychology},
  volume    = {66},
  number    = {2},
  pages     = {310--328},
  year      = {1994}
}

@inproceedings{li2017dailydialog,
  title     = {{DailyDialog}: A Manually Labelled Multi-turn Dialogue Dataset},
  author    = {Li, Yanran and Su, Hui and Shen, Xiaoyu and Li, Wenjie and Cao, Ziqiang and Niu, Shuzi},
  booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017)},
  pages     = {986--995},
  year      = {2017}
}

@article{ghazi2015stimuli,
  title     = {Detecting Emotions in Text: Integrating Cognitive Theories of Emotion with Neural Network Models of Language},
  author    = {Ghazi, Diman and Inkpen, Diana and Szpakowicz, Stan},
  journal   = {PloS ONE},
  year      = {2015}
}

@inproceedings{bhatt2019fanfic,
  title     = {Automatic Identification of Character Types from Film Dialogs},
  author    = {Bhatt, Hardik and Bhatt, Brijesh and Pimpale, Ankit},
  booktitle = {Proceedings of the Workshop on Narrative Understanding},
  year      = {2019}
}

@inproceedings{schuff2017ssec,
  title     = {{SSEC}: A Human-Annotated Corpus for Fine-Grained Emotion Classification},
  author    = {Schuff, Hendrik and Barnes, Jeremy and Mohme, Julian and Pad{\'o}, Sebastian and Klinger, Roman},
  booktitle = {Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
  pages     = {74--80},
  year      = {2017}
}

@inproceedings{mohammad2012tec,
  title     = {\#Emotional Tweets},
  author    = {Mohammad, Saif M.},
  booktitle = {Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM 2012)},
  pages     = {246--255},
  year      = {2012}
}

@inproceedings{demszky2020goemotions,
  title     = {{GoEmotions}: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  pages     = {4040--4054},
  year      = {2020}
}

License

This repository (the unified dataset and processing code) is licensed under CC BY 4.0.

Important: The individual source datasets retain their original licenses. Some (e.g., DailyDialog, SSEC) are licensed under CC-BY-NC-SA and may not be used for commercial purposes. Please review docs/dataset_cards/ for the specific license of each source before use.


Acknowledgements

This dataset was compiled as part of PhD research. We thank the authors of all source datasets for making their work publicly available. Special thanks to the organisers of SemEval 2018 and WASSA 2017 shared tasks.


Last updated: June 2025 | Version 1.0.0
