Tweet Emotions Recognition

Built two classifiers (an MLP and an SVM) for Tweet emotion detection across four classes: anger, joy, sadness and optimism.
Grid-searched both the MLP and the SVM for the best hyper-parameters.
Used random undersampling, random oversampling and SMOTE to handle class imbalance.
The best SVM model had a weighted F1 score of 0.665 and was trained on the original data.
The best MLP model had a weighted F1 score of 0.601 and was trained with SMOTE applied to the training data.
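
As a rough illustration of the two random resampling strategies mentioned above, the following standard-library sketch balances a toy data set. It is not the notebooks' code: the project lists imblearn for this step, and the function name `random_resample`, the toy tweets and the seed are all hypothetical. SMOTE is not shown; instead of duplicating samples it synthesises new minority-class points by interpolating between a sample and one of its nearest minority-class neighbours.

```python
import random
from collections import Counter, defaultdict

def random_resample(samples, labels, mode="over", seed=0):
    """Balance classes by random over- or undersampling (illustrative only)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    sizes = [len(xs) for xs in by_class.values()]
    # Oversampling grows every class to the largest class size,
    # undersampling shrinks every class to the smallest one.
    target = max(sizes) if mode == "over" else min(sizes)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if mode == "over":
            # Keep all originals, then draw the extras with replacement.
            picked = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        else:
            # Draw a subset without replacement.
            picked = rng.sample(xs, target)
        out_x.extend(picked)
        out_y.extend([y] * target)
    return out_x, out_y

# Made-up toy data, only to show the class counts before and after.
tweets = ["t1", "t2", "t3", "t4", "t5", "t6"]
emotions = ["anger", "anger", "anger", "joy", "joy", "optimism"]

over_x, over_y = random_resample(tweets, emotions, mode="over")
under_x, under_y = random_resample(tweets, emotions, mode="under")
print(Counter(over_y))
print(Counter(under_y))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.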

Packages

Python Version: 3.8.3

Packages used: pyspark, pandas, numpy, scikit-learn, imblearn, matplotlib, seaborn, pytorch, skorch

Requirements: pip install -r requirements.txt

Source of data

The original data was downloaded from the tweeteval repo, from the Emotion Recognition folder.

Data cleaning

Merged all the datasets and split them into a 75% training set and a 25% test set.
Parsed the Tweets with a bag of n-grams approach.
- The best SVM model was trained with 1-grams and with stop-words removed.
- The best MLP model was trained with 1-grams and with stop-words removed.
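
A minimal sketch of what a bag of n-grams with stop-word removal looks like, using only the standard library. The tokenising regex, the tiny `STOP_WORDS` set and the function name `bag_of_ngrams` are assumptions made for illustration; the notebooks may rely on a library vectoriser and a different stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "i"}

def bag_of_ngrams(text, n=1, stop_words=STOP_WORDS):
    """Count the n-grams of a text after lower-casing and stop-word removal."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in stop_words]
    # Slide a window of n tokens over the filtered token list.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

tweet = "I am so happy and the day is so sunny"
unigrams = bag_of_ngrams(tweet, n=1)
bigrams = bag_of_ngrams(tweet, n=2)
print(unigrams)
print(bigrams)
```

With 1-grams each Tweet is reduced to word counts, so word order is lost entirely; 2-grams keep a little local order at the cost of a much larger vocabulary.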

EDA

Word-clouds for each class:

Heat-maps for different hyper-parameter combinations for the SVM (regularisation and gamma) and the MLP (number of hidden layers and the dimension of the hidden layers). The score measures the weighted F1 score:

After training the SVM and the MLP with the best hyper-parameters, two confusion matrices were plotted to look for faults in the classification. On the right, word-clouds of the most common misclassified words in each class are added:

Best models

Model	Training data	Validation score	Test score
SVM	Original data	0.665	0.665
MLP	SMOTE	0.601	0.584
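
For reference, the weighted F1 score reported in the table averages the per-class F1 scores, weighting each class by its number of true samples. The sketch below computes it by hand on made-up labels; the function name `weighted_f1` and the toy labels are hypothetical, and the notebooks presumably use a library metric instead.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Made-up labels, not taken from the project's data.
y_true = ["anger", "anger", "joy", "optimism"]
y_pred = ["anger", "joy", "joy", "anger"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because of the support weighting, a model can reach a reasonable weighted F1 while still doing poorly on a small class, which is worth keeping in mind when reading the scores above.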

SVM trained with a sigmoid kernel, regularisation of 0.25 and gamma of 0.01.
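
To make the kernel choice concrete, here is a small sketch of the sigmoid kernel, K(x, y) = tanh(gamma * <x, y> + coef0), evaluated on two toy count vectors. The value of gamma comes from the text above; `coef0 = 0.0` is an assumption (it is the scikit-learn default, but the README does not state it), and the vectors are made up.

```python
import math

def sigmoid_kernel(x, y, gamma=0.01, coef0=0.0):
    """Sigmoid (tanh) kernel between two equally sized feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)

# Two hypothetical bag-of-words count vectors over a 4-word vocabulary.
u = [1, 0, 2, 1]
v = [0, 1, 2, 3]
k = sigmoid_kernel(u, v)
print(k)
```

With a small gamma such as 0.01 the argument of tanh stays close to zero for sparse count vectors, so the kernel behaves almost linearly in the dot product.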

MLP trained with 100 hidden layers of 10 units each and a learning rate of 0.01.

Conclusions

The SVM outperformed the MLP, at the cost of underfitting the minority class, optimism.
The MLP failed to detect the context of emotion-related words such as angry, fun or sad, while the SVM made mistakes with more general words.
Given that context seemed to be ignored by both models, recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks might improve the results.

Specifications

===== Jupyter Notebooks =====

All Jupyter notebooks are available as ipynb and html files.

Cleaning: Contains the cleaning process of the original data to build the training and test sets
SVM_tuning: Contains the grid-searches for the SVM parameters with the different approaches to handle data imbalance.
MLP_tuning: Contains the grid-searches for the MLP parameters with the different approaches to handle data imbalance.
results_discussion: Contains the analysis of the grid-search results and saves the best models.
Model_Testing: Contains all the necessary code to run and test the two best models.
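
The grid-searches in the tuning notebooks follow the usual pattern of scoring every combination of hyper-parameter values and keeping the best one. The sketch below shows that pattern only: the grid values, the helper name `grid_search` and the stand-in `toy_score` function are all hypothetical, and in the notebooks the score would come from training and validating a model rather than from a formula.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the real grids live in SVM_tuning and MLP_tuning.
grid = {"C": [0.25, 1.0, 4.0], "gamma": [0.001, 0.01, 0.1]}

def toy_score(C, gamma):
    # Stand-in for "train a model, return its validation weighted F1".
    return -abs(C - 1.0) - abs(gamma - 0.01)

best, best_value = grid_search(grid, toy_score)
print(best, best_value)
```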

===== Folders =====

emotion: Contains the original txt files with the source data.
Data: Contains the CSV files with the final data.
Grid-Search: Contains the CSV files with the results for the grid-searches.
Models: Contains the files of the two best models.
