GitHub - stressosaurus/raw-data-twitter-tweets-stream: Stream and tabulate live tweets from Twitter using NLTK.

Live Twitter Data Streamer, Filterer, and Processor

Alex John Quijano

Purpose. The scripts on this repository provides an easy way to stream, filter, and process live tweets iteratively for researchers interested in data science, mathematical modeling, computational linguistics, historical linguistics, and/or discourse analysis. This repository lets you stream live tweets via using the Natural Language Processing Tool Kit (NLTK) developed by Bird et. al. [1]. The code is designed to capture $n$ tweets in a way such that it streams continuously until a specified time. The streamer uses a list of keywords to capture tweets that matches those keywords.

After successful streaming, the code will filter the tweets to reduce the memory size. Processing is done separately to choose which time range to process.

Warning. Streaming live tweets for long periods is not recommended because the filesize gets very large unless you have reasonable file storage. The streamer also can capture any public tweets (protected tweets are not captured). That means it can capture, tweets from bot users, pornography, and profanity; when viewing this tweets, it might be NSFW.

References.

Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. " O'Reilly Media, Inc.", 2009.

Twitter Application User Interface (API). You must have keys and tokens of your Twitter API to use the streamer.

Data tabulation and subsetting. The raw Twitter data collected during live streaming was processed in five steps. The first step is when the raw data is saved into a JSON file and processed through a filter and processing steps where the unecessary information (such as hyperlinks and empty strings) are filtered away. The last two steps are the conversion from JSON data structure to a tabular data structure for easy access of information.

1. Stream, Filter and Process Live Tweets.

Using the streamer scripts streamTwitter.py, loop.streamTweets.sh, and run.loop.streamTweets.sh requires two things.

The NLTK python modules which is available in this website ( https://www.nltk.org/ ). You can read about it in detail in this book by Bird et. al. [1].
Twitter Application Progamming Interface (API) keys.

1.1 Dependencies.

Clone repository.

git clone https://github.com/stressosaurus/raw-data-twitter-tweets-stream.git

Install NLTK using Linux terminal.

sudo pip install -U nltk

Install NLTK Corpora.

python -m nltk.downloader all

Install Python modules.

pip3 install --user -r requirements.txt

1.2 Setting Up the Twitter Application Programming Interface (API).

You can follow the NLTK instructions for setting up the API here ( NLTK ) but essentially it comes down into three steps.

Step 1. Create a Twitter account ( twitter.com )
Step 2. Apply for a developer account - standard API ( developer.twitter.com )
Step 3. Copy your keys in the format - shown below - while saving it into a text file named credentials.txt. Create a folder/directory twitter-files in your home directory and place the text file into that new folder.

app_key=[key]
app_secret=[secret]
oauth_token=[token]
oauth_token_secret=[secret token]

1.3 Streaming and Filtering Live Tweets.

Before you can stream, you need to provide a list of keywords. The NLTK streamer requires a list of keywords to capture live tweets with those keywords.

For example you need to create a text file named keywords-stream[X].txt in the keywords-stream folder where [X] is the whatever you want to name your list. Then, list all the keywords inside the text file like shown below.

normal
abnormal
supernormal
unnormal
binormal
mononormal
uninormal
homonormal
heteronormal

Next, edit the lines shown below of the bash script run.loop.streamTweets.sh. The path_twitterFiles variable is the absolute path to the credentials.txt file. The STDVID is the name of your list [X]. Below is an example where we tell the streamer to take 4000 tweets for every loop until August 14, 2018 at 6:14 pm (the time is in 24-hour format). If the entered time is in the past, then the loop will just immediately stop upon execution.

path_twitterFiles="/home/[username]/twitter-files"

STDVID=[X]
stopDay=14
stopMonth=08
stopYear=2018
stopHour=18
stopMinute=14
tweetsPerLoop=4000

Save the updated run.loop.streamTweets.sh and run it using the command below on linux terminal. The streamer saves the tweets every loop and filters and compresses the tweets every day and for every month.

./run.loop.streamTweets.sh

1.4 Processing Live Tweets.

Make sure you have the compressed filtered data in the [X]-filtered folder where [X] is the name of the keywords list. The streamer saves the tweets in JSON format ( you can see an overview of the JSON data structure here ( tweet-object ).

The processor will process the JSON file such that it takes in information and converts it into a python dictionary data structure for easy access. The advantage of this data structure is that we can track connected tweets (i.e. replies and retweets to the a parent tweet) more easily than dealing with the JSON files directly.

To process the streamed tweets, use the processTweets.sh bash file. This file takes in three arguments.

[X] $\rightarrow$ The name of the keyword list.
An array of specified dates you want to process.
The number of cores you want use (the code is paralellized in a multi-threading fasion where it can process $n$ tweets at a time).

Use the example bash script below where it will process the data set using the A keyword list from September 23, 2017 to September 24, 2017 and using 3 cores. The processed tweets is saved into [X]-processed folder.

STDVID=A
dates=("09232017" "09232017")
./processTweets.py ${STDVID} "${dates[*]}" 3

STDVID=A
dates=("09232017" "09242017")
./processTweets.py ${STDVID} "${dates[*]}" 3

2. Tabulate and Subset Tweets.

2.1. Tabulating Tweets.

Make sure you have the processed data in the [X]-processed where [X] is the name of the keywords list.

Use the example bash script below where it will tabulate the processed data using A keyword list from September 23, 2017 to September 24, 2017. The tabulated tweets is saved into [X]-tabulated folder.

./tabulateTweets.py 'A' '09232017-09232017'
./tabulateTweets.py 'A' '09232017-09242017'

2.2. Subsetting Tabulated Tweets.

Create a folder named keywords-subset and make a text file named keywords-[F]-[Y].txt where [F] is the column to filter and [Y] is for whatever you want to name your keyword subset list. Then, list all the keywords like shown below or the hashtag list in Section 2.

hashtag
love
hate

Use the example bash script below where it will subset the tabulated data set using the hashtag keyword list named a from September 23, 2017 to September 24, 2017. The tabulated tweets is saved into [X]-subset folder. The third line of the below script is where you can merge two tabulated tweets. The merged tweet is saved into [X]-merged folder.

./indexTweets.py 'A' 'hashtag' 'a' '09232017-09242017' 100000 '09232017-09242017'
./subsetTweets.py 'A' 'hashtag' 'a' '09232017-09232017' 100000 '09232017-09242017'
./subsetTweets.py 'A' 'hashtag' 'a' '09232017-09242017' 100000 '09232017-09242017'
./mergeTweets.py 'A' 'hashtag' 'a' '09232017-09242017'

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
1gram-list		1gram-list
keywords-stream		keywords-stream
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
filterTweets.py		filterTweets.py
indexTweets.py		indexTweets.py
loop.streamTweets.sh		loop.streamTweets.sh
mergeTweets.py		mergeTweets.py
processTweets.py		processTweets.py
requirements.txt		requirements.txt
run.loop.streamTweets.sh		run.loop.streamTweets.sh
run.processTweets.sh		run.processTweets.sh
run.subsetTweets.sh		run.subsetTweets.sh
run.tabulateTweets.sh		run.tabulateTweets.sh
streamTweets.py		streamTweets.py
subsetTweets.py		subsetTweets.py
tabulateTweets.py		tabulateTweets.py
twitter.py		twitter.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Live Twitter Data Streamer, Filterer, and Processor

Alex John Quijano

1. Stream, Filter and Process Live Tweets.

1.1 Dependencies.

1.2 Setting Up the Twitter Application Programming Interface (API).

1.3 Streaming and Filtering Live Tweets.

1.4 Processing Live Tweets.

2. Tabulate and Subset Tweets.

2.1. Tabulating Tweets.

2.2. Subsetting Tabulated Tweets.

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Live Twitter Data Streamer, Filterer, and Processor

Alex John Quijano

1. Stream, Filter and Process Live Tweets.

1.1 Dependencies.

1.2 Setting Up the Twitter Application Programming Interface (API).

1.3 Streaming and Filtering Live Tweets.

1.4 Processing Live Tweets.

2. Tabulate and Subset Tweets.

2.1. Tabulating Tweets.

2.2. Subsetting Tabulated Tweets.

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages