Datasets

This project collects the data needed for modeling from two sources: web scraping that adheres to ethical considerations, and publicly available datasets.

Reddit SuicideWatch Posts

Web Scraping: We send a web request to Reddit to load a list of up to 100 posts from the subreddit (i.e., category) named "SuicideWatch". The request is: curl https://reddit.com/r/SuicideWatch/new.json?limit=100 Note that a single request can fetch at most 100 records. The raw data are saved to reddit_suicidewatch.json.
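The request above can be sketched in Python using only the standard library. This is a minimal sketch, not the project's actual scraper: the function names and the User-Agent string are assumptions made for illustration. It relies on Reddit's listing format, in which each post sits under data.children[i].data.

```python
import json
import urllib.request

URL = "https://reddit.com/r/SuicideWatch/new.json?limit=100"

def fetch_listing(url=URL):
    """Download one listing page (at most 100 posts) and parse it as JSON."""
    # The User-Agent value is a hypothetical placeholder, not taken from the project.
    request = urllib.request.Request(
        url, headers={"User-Agent": "suicidewatch-research-script/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

def extract_posts(listing):
    """Return the per-post dictionaries from a Reddit listing payload."""
    return [child["data"] for child in listing["data"]["children"]]

def save_raw(listing, path="reddit_suicidewatch.json"):
    """Write the raw listing to disk, as the README describes."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(listing, handle, ensure_ascii=False, indent=2)

# Typical use (performs a network request, so it is left commented out):
# listing = fetch_listing()
# save_raw(listing)
# posts = extract_posts(listing)
```

Because each request returns at most 100 records, collecting more would require repeated requests, for example by passing the listing's "after" cursor as a query parameter.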

Data Processing

Original text is normalized before classification:

Removing emojis
Removing symbols and links - such as the hashtag (#) sign, the @ symbol, and URLs
Removing punctuation
Converting the entire text to lowercase
Lemmatization - to replace inflected words with their base forms
Word Tokenization
Removing Stopwords
Vectorizing a list of texts - using DistilBertTokenizer, so that the resulting token IDs can be fed to DistilBERT, which captures contextual information within texts.
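The cleaning steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of data_processing.py: the emoji ranges and the tiny stopword set are placeholders, and lemmatization and DistilBERT tokenization are omitted because they need external libraries.

```python
import re
import string

# Assumed emoji ranges; a real pipeline may cover more code points.
EMOJI_RE = re.compile("[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Hypothetical stopword subset, for illustration only.
STOPWORDS = {"a", "an", "the", "i", "to", "and", "is", "am", "of", "in", "it", "at"}

def normalize(text):
    """Clean one post and return its lowercase, stopword-free tokens."""
    text = URL_RE.sub(" ", text)                      # remove URLs
    text = EMOJI_RE.sub(" ", text)                    # remove emojis
    text = text.replace("#", " ").replace("@", " ")   # remove # and @ symbols
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.lower()                               # lowercase
    tokens = text.split()                             # whitespace tokenization
    return [token for token in tokens if token not in STOPWORDS]

print(normalize("I can't sleep #tired @friend https://example.com NOW"))
# ['cant', 'sleep', 'tired', 'friend', 'now']
```

URLs are stripped before punctuation so that the URL pattern still matches; removing punctuation first would break links into fragments that survive as tokens.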

Data Splitting

The training set uses data from the following datasets:

Twitter Suicidal Data
Social Media Sentiments Analysis Dataset

The validation set uses data from the following dataset:

The test set uses data from the following dataset:

Depression Tweets

Labeling:

We consider a binary classification problem with the following labels and interpretation:

0: non-suicidal
1: suicidal

Resulting Vectors

Access the generated data here

Note: The data processing and splitting scripts vectorize texts using DistilBERT's tokenizer, which is believed to preserve contextual semantic meaning well. However, the output of running data_processing.py and data_splitting_DistilBERT.py (in sequence) is not guaranteed to work with other types of models, such as the baseline or the LLM-based approach. This method is specifically tailored to the second model: fine-tuning a pre-trained DistilBERT model.

Text Embedding Vector

(Title, Post Content, Hashtags)

Shape: (n, 3, 768), where n = number of records in a particular set.
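To make the shape concrete, the sketch below stacks one 768-dimensional vector per text field for each record. The embed_placeholder function is a stand-in that returns zeros; it is an assumption used only to show the layout, and the real vectors would come from DistilBERT. The field names are also hypothetical.

```python
EMBED_DIM = 768  # hidden size of the DistilBERT base model
FIELDS = ("title", "content", "hashtags")  # hypothetical field names

def embed_placeholder(text):
    """Stand-in for a DistilBERT embedding: returns a zero vector of length 768."""
    return [0.0] * EMBED_DIM

def stack_record(record):
    """Build the (3, 768) block for one record, one row per text field."""
    return [embed_placeholder(record.get(field, "")) for field in FIELDS]

def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

records = [
    {"title": "a", "content": "b", "hashtags": ""},
    {"title": "c", "content": "d"},  # a missing field becomes an empty string
]
batch = [stack_record(record) for record in records]
print(shape(batch))  # (2, 3, 768)
```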

Metadata Embedding Vector

Post Category
Number of Comments
Hide Score
Upvote Ratio
Ups
Score
Edited
no_follow
over_18
Created Date / Timestamp
Country
Platform
Sentiment
Reposts
Number of Likes
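For the Reddit-derived fields in the list above, one possible encoding is a flat numeric vector. The sketch below is an assumption about how such a vector could be built, not the project's actual encoding: the field order, the 0/1 treatment of flags, and the function name are illustrative. The keys are the names Reddit uses in its post JSON, where "edited" is either false or an edit timestamp.

```python
# Reddit post keys, split by how this sketch encodes them (assumed grouping).
REDDIT_NUMERIC_FIELDS = ["num_comments", "upvote_ratio", "ups", "score", "created_utc"]
REDDIT_FLAG_FIELDS = ["hide_score", "edited", "no_follow", "over_18"]

def reddit_metadata_vector(post):
    """Encode one Reddit post dictionary as a list of floats."""
    # Missing or null numeric values fall back to 0.0.
    numeric = [float(post.get(name) or 0.0) for name in REDDIT_NUMERIC_FIELDS]
    # Any truthy value (True, or a timestamp for "edited") becomes 1.0.
    flags = [1.0 if post.get(name) else 0.0 for name in REDDIT_FLAG_FIELDS]
    return numeric + flags

example_post = {
    "num_comments": 3, "upvote_ratio": 0.9, "ups": 10, "score": 10,
    "created_utc": 1700000000.0, "hide_score": False,
    "edited": 1700000100.0, "no_follow": True, "over_18": False,
}
print(reddit_metadata_vector(example_post))
# [3.0, 0.9, 10.0, 10.0, 1700000000.0, 0.0, 1.0, 1.0, 0.0]
```

Fields that come from the other datasets (Country, Platform, Sentiment, Reposts, Number of Likes) are categorical or dataset-specific and are left out of this sketch.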
