This project collects the data needed for modeling from two sources: web scraping that adheres to ethical considerations, and publicly available datasets.
Web Scraping: We send a web request to Reddit (for example, curl https://reddit.com/r/SuicideWatch/new.json?limit=100) to load the 100 newest posts from the subreddit (i.e., category) named "SuicideWatch". Note that a single request can fetch at most 100 records.
The raw data are saved to reddit_suicidewatch.json.
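The request described above could be reproduced in Python roughly as follows. This is a minimal sketch, not the project's actual script: the function names `fetch_new_posts`, `extract_posts`, and `save_raw` and the User-Agent string are hypothetical, and the code assumes the response follows Reddit's usual listing layout, where each post sits under `data.children[].data`.

```python
import json
import urllib.request

URL = "https://reddit.com/r/SuicideWatch/new.json?limit=100"


def fetch_new_posts(url=URL, user_agent="research-scraper/0.1 (hypothetical)"):
    """Download one page (at most 100 posts) of the listing as a dict."""
    # A descriptive User-Agent is set because default client agents
    # are often rate-limited or rejected (assumption about the server).
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_posts(listing):
    """Return the per-post payloads from a listing dict."""
    children = listing.get("data", {}).get("children", [])
    return [child["data"] for child in children]


def save_raw(listing, path="reddit_suicidewatch.json"):
    """Write the raw listing to disk unchanged."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(listing, handle, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    listing = fetch_new_posts()
    save_raw(listing)
    print(len(extract_posts(listing)), "posts saved")
```

Keeping the download, the extraction, and the file write in separate functions means the parsing step can be checked against a saved file without sending another request to Reddit.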
The dataset is downloaded in .csv format from https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset?resource=download.
The dataset is downloaded in .csv format from https://www.kaggle.com/datasets/hosammhmdali/twitter-suicidal-data.
The dataset is downloaded in .json format from https://www.kaggle.com/datasets/senapatirajesh/depression-tweets.
Original text is normalized before classification:
- Removing emojis
- Removing symbols, such as the hashtag sign (#) and the @ symbol, as well as URLs
- Removing punctuation
- Converting the entire text to lowercase
- Lemmatization, to replace inflected words with their base forms
- Word Tokenization
- Removing Stopwords
- Vectorizing a list of texts - using DistilBertTokenizer since it caters for contextual information within texts.
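The rule-based steps in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's data_processing.py: the function name `normalize`, the emoji ranges, and the tiny stopword set are hypothetical. Lemmatization and the DistilBertTokenizer step are left out because they need external libraries and models.

```python
import re
import string

# Small illustrative stopword set (assumption); a real pipeline would
# likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "am", "are",
             "to", "of", "in", "i", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: common pictograph blocks, regional indicators,
# the variation selector, and the zero-width joiner (assumption).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# string.punctuation includes '#' and '@', so one pass strips both the
# symbols and ordinary punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Return a list of cleaned, lowercased, stopword-free tokens."""
    text = EMOJI_RE.sub(" ", text)        # remove emojis
    text = URL_RE.sub(" ", text)          # remove URLs before punctuation
    text = text.translate(PUNCT_TABLE)    # remove '#', '@', punctuation
    text = text.lower()                   # lowercase
    tokens = text.split()                 # whitespace word tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


if __name__ == "__main__":
    print(normalize("I am #sad 😢 and @user said: visit https://example.com NOW!"))
```

URLs are removed before punctuation on purpose: stripping punctuation first would turn a link into ordinary-looking letters that the URL pattern could no longer match.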
The training set uses data from the following datasets:
- Social Media Sentiment Analysis
The validation set uses data from the following dataset:
The test set uses data from the following dataset:
- Depression Tweet
We consider a binary classification problem with the following labels and interpretation:
- 0: non-suicidal
- 1: suicidal
Access the generated data here
- Note: The data splitting and processing scripts vectorize texts using DistilBERT's tokenizer, which is believed to preserve contextual semantic meaning well. However, the output of running the scripts data_processing.py and data_splitting_DistilBERT.py (in sequence) is not guaranteed to work with other types of models, such as the baseline or the LLM-based approach. This method is specifically tailored to the 2nd model: fine-tuning a pre-trained DistilBERT model.
(Title, Post Content, Hashtags)
- Shape: (n, 3, 768), where n = number of records in a particular set.
- Post Category
- Number of Comments
- Hide Score
- Upvote Ratio
- Ups
- Score
- Edited
- no_follow
- over_18
- Created Date / Timestamp
- Country
- Platform
- Sentiment
- Reposts
- Number of Likes