Purpose. The scripts on this repository provides an easy way to stream, filter, and process live tweets iteratively for researchers interested in data science, mathematical modeling, computational linguistics, historical linguistics, and/or discourse analysis. This repository lets you stream live tweets via using the Natural Language Processing Tool Kit (NLTK) developed by Bird et. al. [1]. The code is designed to capture
After successful streaming, the code will filter the tweets to reduce the memory size. Processing is done separately to choose which time range to process.
Warning. Streaming live tweets for long periods is not recommended because the filesize gets very large unless you have reasonable file storage. The streamer also can capture any public tweets (protected tweets are not captured). That means it can capture, tweets from bot users, pornography, and profanity; when viewing this tweets, it might be NSFW.
References.
- Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. " O'Reilly Media, Inc.", 2009.
Twitter Application User Interface (API). You must have keys and tokens of your Twitter API to use the streamer.
Data tabulation and subsetting. The raw Twitter data collected during live streaming was processed in five steps. The first step is when the raw data is saved into a JSON file and processed through a filter and processing steps where the unecessary information (such as hyperlinks and empty strings) are filtered away. The last two steps are the conversion from JSON data structure to a tabular data structure for easy access of information.
Using the streamer scripts streamTwitter.py, loop.streamTweets.sh, and run.loop.streamTweets.sh requires two things.
-
The NLTK python modules which is available in this website ( https://www.nltk.org/ ). You can read about it in detail in this book by Bird et. al. [1].
-
Twitter Application Progamming Interface (API) keys.
Clone repository.
git clone https://github.com/stressosaurus/raw-data-twitter-tweets-stream.gitInstall NLTK using Linux terminal.
sudo pip install -U nltkInstall NLTK Corpora.
python -m nltk.downloader allInstall Python modules.
pip3 install --user -r requirements.txtYou can follow the NLTK instructions for setting up the API here ( NLTK ) but essentially it comes down into three steps.
- Step 1. Create a Twitter account ( twitter.com )
- Step 2. Apply for a developer account - standard API ( developer.twitter.com )
- Step 3. Copy your keys in the format - shown below - while saving it into a text file named credentials.txt. Create a folder/directory twitter-files in your home directory and place the text file into that new folder.
app_key=[key]
app_secret=[secret]
oauth_token=[token]
oauth_token_secret=[secret token]Before you can stream, you need to provide a list of keywords. The NLTK streamer requires a list of keywords to capture live tweets with those keywords.
For example you need to create a text file named keywords-stream[X].txt in the keywords-stream folder where [X] is the whatever you want to name your list. Then, list all the keywords inside the text file like shown below.
normal
abnormal
supernormal
unnormal
binormal
mononormal
uninormal
homonormal
heteronormal
Next, edit the lines shown below of the bash script run.loop.streamTweets.sh. The path_twitterFiles variable is the absolute path to the credentials.txt file. The STDVID is the name of your list [X]. Below is an example where we tell the streamer to take 4000 tweets for every loop until August 14, 2018 at 6:14 pm (the time is in 24-hour format). If the entered time is in the past, then the loop will just immediately stop upon execution.
path_twitterFiles="/home/[username]/twitter-files"
STDVID=[X]
stopDay=14
stopMonth=08
stopYear=2018
stopHour=18
stopMinute=14
tweetsPerLoop=4000Save the updated run.loop.streamTweets.sh and run it using the command below on linux terminal. The streamer saves the tweets every loop and filters and compresses the tweets every day and for every month.
./run.loop.streamTweets.shMake sure you have the compressed filtered data in the [X]-filtered folder where [X] is the name of the keywords list. The streamer saves the tweets in JSON format ( you can see an overview of the JSON data structure here ( tweet-object ).
The processor will process the JSON file such that it takes in information and converts it into a python dictionary data structure for easy access. The advantage of this data structure is that we can track connected tweets (i.e. replies and retweets to the a parent tweet) more easily than dealing with the JSON files directly.
To process the streamed tweets, use the processTweets.sh bash file. This file takes in three arguments.
-
[X]
$\rightarrow$ The name of the keyword list. - An array of specified dates you want to process.
- The number of cores you want use (the code is paralellized in a multi-threading fasion where it can process
$n$ tweets at a time).
Use the example bash script below where it will process the data set using the A keyword list from September 23, 2017 to September 24, 2017 and using 3 cores. The processed tweets is saved into [X]-processed folder.
STDVID=A
dates=("09232017" "09232017")
./processTweets.py ${STDVID} "${dates[*]}" 3STDVID=A
dates=("09232017" "09242017")
./processTweets.py ${STDVID} "${dates[*]}" 3Make sure you have the processed data in the [X]-processed where [X] is the name of the keywords list.
Use the example bash script below where it will tabulate the processed data using A keyword list from September 23, 2017 to September 24, 2017. The tabulated tweets is saved into [X]-tabulated folder.
./tabulateTweets.py 'A' '09232017-09232017'
./tabulateTweets.py 'A' '09232017-09242017'Create a folder named keywords-subset and make a text file named keywords-[F]-[Y].txt where [F] is the column to filter and [Y] is for whatever you want to name your keyword subset list. Then, list all the keywords like shown below or the hashtag list in Section 2.
hashtag
love
hate
Use the example bash script below where it will subset the tabulated data set using the hashtag keyword list named a from September 23, 2017 to September 24, 2017. The tabulated tweets is saved into [X]-subset folder. The third line of the below script is where you can merge two tabulated tweets. The merged tweet is saved into [X]-merged folder.
./indexTweets.py 'A' 'hashtag' 'a' '09232017-09242017' 100000 '09232017-09242017'
./subsetTweets.py 'A' 'hashtag' 'a' '09232017-09232017' 100000 '09232017-09242017'
./subsetTweets.py 'A' 'hashtag' 'a' '09232017-09242017' 100000 '09232017-09242017'
./mergeTweets.py 'A' 'hashtag' 'a' '09232017-09242017'