This project uses spaCy, Geonamescache, and Instagrapi to process text data, with a focus on hashtags and graffiti-related content. The goal is to split hashtags into meaningful words, identify entities, and classify terms related to graffiti, cities, and railroad lingo.
- Geonamescache is used for city and country information.
- Instagrapi handles interactions with Instagram data.
- spaCy is used for natural language processing. Custom extensions are added to spaCy's Token class to include properties like is_city, is_graffiti_lingo, and is_railroad_lingo.
The initialize_words function loads wordlists from text files for:
- General vocabulary
- City names
- Graffiti lingo
- Railroad lingo
These wordlists are essential for parsing and classifying hashtags.
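As a rough illustration of this loading step, the sketch below reads one entry per line from a text file into a list. The name load_wordlist, the file name, and the file layout (UTF-8, one entry per line) are assumptions made for this example; the project's initialize_words function may work differently.

```python
import tempfile
from pathlib import Path


def load_wordlist(path):
    """Return the non-empty lines of a text file, stripped and lowercased."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


# Demo with a throwaway file standing in for a real wordlist (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "graffiti-lingo.txt"
    demo_file.write_text("Burner\nblackbook\n\nthrowup\n", encoding="utf-8")
    graffiti_lingo = load_wordlist(demo_file)

print(type(graffiti_lingo), len(graffiti_lingo))
```

Converting each list to a set before parsing makes membership checks much cheaper, which helps when the general vocabulary holds well over 100,000 entries.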
The pipeline processes hashtags using the following components:
- Merges tokens starting with # or @ into single tokens.
- Adds extensions to tokens, identifying if they are hashtags or mentions.
- Splits hashtags into meaningful words using a recursive approach.
- Identifies graffiti-related entities like writers and crews, and looks for specific patterns or keywords in hashtags.
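The merging and flagging steps can be illustrated without spaCy. The sketch below assumes a tokenizer has already separated # and @ from the word that follows them, glues each prefix back onto its word, and then labels the result. merge_prefixed and tag_kind are hypothetical helpers written for this example, not the project's pipeline components.

```python
def merge_prefixed(tokens, prefixes=("#", "@")):
    """Join a lone '#' or '@' token with the token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in prefixes and i + 1 < len(tokens):
            merged.append(token + tokens[i + 1])
            i += 2  # skip the word that was just merged
        else:
            merged.append(token)
            i += 1
    return merged


def tag_kind(token):
    """Label a merged token as a hashtag, a mention, or plain text."""
    if token.startswith("#") and len(token) > 1:
        return "hashtag"
    if token.startswith("@") and len(token) > 1:
        return "mention"
    return "text"


tokens = ["new", "piece", "by", "@", "mecro", "#", "freightgraffiti"]
merged = merge_prefixed(tokens)
kinds = [tag_kind(t) for t in merged]
print(list(zip(merged, kinds)))
```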
For example, the hashtag #freightgraffitiChicago is processed as follows:
- The hashtag is stripped of the # symbol.
- Words are identified sequentially from the beginning: freight, graffiti, Chicago.
- Each word is checked against predefined wordlists:
  - freight matches general vocabulary.
  - graffiti matches graffiti lingo.
  - Chicago is identified as a city using Geonamescache.
- Example: #mecro
  - The term mecro is identified as an out-of-vocabulary word (OOV).
  - If it is not part of the wordlist, it is classified as a graffiti writer.
- Example: #mskcrew
  - The term msk is identified, and the suffix crew signifies it as a graffiti crew.
- #blackbook: Refers to sketchbooks used by graffiti artists to draft designs.
- #burnersonthestreet: Highlights impressive or notable street graffiti.
- #vandalsMexico: Represents graffiti or street art associated with Mexico.
Each of these hashtags undergoes the same processing pipeline to extract meaningful words and classify entities.
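The classification rules from the examples can be sketched as a small function. This is a simplified illustration: classify_term, the category labels, and the tiny sample sets are all hypothetical, and the real pipeline may apply its rules in a different order.

```python
# Tiny stand-ins for the real wordlists (assumed contents).
GENERAL = {"freight", "street"}
CITIES = {"chicago"}
GRAFFITI = {"graffiti", "burner", "blackbook"}


def classify_term(term, general=GENERAL, cities=CITIES, graffiti=GRAFFITI):
    """Return a category label for a single hashtag term."""
    word = term.lstrip("#").lower()
    if word in cities:
        return "city"
    if word in graffiti:
        return "graffiti_lingo"
    if word in general:
        return "general"
    # From here on the term is out of vocabulary (OOV).
    if word.endswith("crew") and len(word) > len("crew"):
        return "graffiti_crew"
    return "graffiti_writer"


for tag in ("#Chicago", "#graffiti", "#mecro", "#mskcrew"):
    print(tag, classify_term(tag))
```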
The parse_tag function recursively identifies words within a hashtag:
- The input string is split by dashes or processed as a whole.
- Words are extracted iteratively from the start of the string.
- Each word is matched against the wordlist or categorized as an OOV entity.
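A minimal sketch of this kind of recursive splitting follows. It assumes a longest-match-first strategy with backtracking, which the description above does not confirm, and it lowercases the input before matching. split_hashtag and segment are hypothetical names; a chunk that cannot be fully split is returned whole as an OOV term.

```python
def segment(text, wordlist):
    """Split text into known words, or return None if no full split exists."""
    if not text:
        return []
    # Try the longest prefix first; back off to shorter ones if the rest fails.
    for end in range(len(text), 0, -1):
        head = text[:end]
        if head in wordlist:
            rest = segment(text[end:], wordlist)
            if rest is not None:
                return [head] + rest
    return None


def split_hashtag(tag, wordlist):
    """Split a hashtag on dashes, then segment each chunk against the wordlist."""
    parts = []
    for chunk in tag.lstrip("#").lower().split("-"):
        if not chunk:
            continue
        words = segment(chunk, wordlist)
        # Chunks with no complete split are kept whole as OOV terms.
        parts.extend(words if words is not None else [chunk])
    return parts


WORDS = {"freight", "graffiti", "chicago", "burners", "on", "the", "street"}
print(split_hashtag("#freightgraffitiChicago", WORDS))
print(split_hashtag("#mecro", WORDS))
```

The recursion has no memoization, which is acceptable for short hashtags but could be slow on long strings with many overlapping matches.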
Custom extensions allow the program to annotate tokens with additional metadata, such as:
- is_city: Whether the token represents a city.
- is_graffiti_lingo: Whether the token is graffiti-related.
- custom_entity_cat: Custom classification for unidentified entities.
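For illustration, extensions like these can be registered with spaCy's Token.set_extension. The sketch below is one possible setup, not the project's code: the getter-based lookups, the sample sets, and the "graffiti_writer" label are assumptions. It uses spacy.blank("en"), so no trained model needs to be downloaded.

```python
import spacy
from spacy.tokens import Token

# Small stand-ins for the real wordlists (assumed contents).
CITIES = {"chicago"}
GRAFFITI_LINGO = {"graffiti", "burner", "blackbook"}

# Getter extensions are computed on access; force=True allows re-registration.
Token.set_extension("is_city", getter=lambda t: t.lower_ in CITIES, force=True)
Token.set_extension(
    "is_graffiti_lingo", getter=lambda t: t.lower_ in GRAFFITI_LINGO, force=True
)
# A default-valued extension can be assigned per token later.
Token.set_extension("custom_entity_cat", default=None, force=True)

nlp = spacy.blank("en")  # tokenizer only, no statistical components
doc = nlp("freight graffiti Chicago mecro")
doc[3]._.custom_entity_cat = "graffiti_writer"

for token in doc:
    print(token.text, token._.is_city, token._.is_graffiti_lingo,
          token._.custom_entity_cat)
```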
The application can be containerized to create multiple instances using Docker. Below is the Dockerfile used for building the image:
FROM python:3.11.0b4-buster
ARG _USER="spacy"
ARG _UID="1001"
ARG _GID="100"
ARG _SHELL="/bin/bash"
# Install apt dependencies
RUN apt-get update && apt-get install -y \
nano \
git \
wget
RUN useradd -m -s "${_SHELL}" -N -u "${_UID}" "${_USER}"
ENV USER=${_USER}
ENV UID=${_UID}
ENV GID=${_GID}
ENV HOME=/home/${_USER}
ENV PATH="${HOME}/.local/bin/:${PATH}"
ENV PIP_NO_CACHE_DIR="true"
RUN mkdir /home/${_USER}/app && chown ${UID}:${GID} /home/${_USER}/app
USER ${_USER}
# The destination must end with / when a wildcard can match several files
COPY --chown=${UID}:${GID} config* /home/${_USER}/app/
COPY --chown=${UID}:${GID} requirements* /home/${_USER}/app/
COPY --chown=${UID}:${GID} ./* /home/${_USER}/app/
WORKDIR /home/${_USER}/app
RUN pip install -r requirements.txt
CMD bash

First, clone the repository:

git clone https://github.com/abundis-rmn2/Spacy-Hashtag-Geolocator.git

Run the following command to build the Docker image with the name hashtags (you can use any name you prefer):

docker build -t hashtags .

To create and run a container, use the following command. In this case, we name the container hash1:

docker run -it --name hash1 -d hashtags

If you want to keep the process running in a separate terminal session, you can use screen:

screen -S hash

Run a specific script or test inside the container by specifying the container name and the MUID:

# docker exec -t [container_name] python test.py -MUID=[MUID]
docker exec -t hash1 python test.py -MUID=nearaxs_1_hashtagTop_9_48c69711

After executing the process, you will see output indicating the script has started:
Initialize Words wl file
<class 'list'>
185606
Initialize Words cities file
<class 'list'>
26797
Initialize Words graffiti-lingo file
<class 'list'>
551
Initialize Words railroad-lingo file
<class 'list'>
681
Looking for caption in MUID: fr8porn_1_hashtagTop_9_3bf76f18
MUID found : 513
Running the application in Docker has several benefits:
- Scalability: Run multiple instances of the application by creating additional containers.
- Consistency: The Docker image ensures the environment remains consistent across deployments.
- Ease of Use: The containerized setup simplifies the process of setting up and running the application.
This project provides a structured way to analyze hashtags, focusing on graffiti-related content. It identifies cities, graffiti terms, and unique entities like writers and crews, enhancing the understanding of social media data in this niche.