This project is a web scraping tool built using Scrapy, designed to scrape directory listings and save the results in JSON format. It uses spiders to collect specific information from websites and outputs the data in a structured format.
Follow the instructions below to set up the environment and run the scraper.
Make sure you have the following installed:
- Python 3.11
1. Clone the repository to your local machine.
git clone <repository-url>
cd <repository-directory>
2. Create a Python virtual environment using Python 3.11.
python3.11 -m venv env
3. Activate the virtual environment.
- On macOS/Linux:
source env/bin/activate
- On Windows:
env\Scripts\activate
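To confirm that activation worked, a quick check along the following lines can help. This is a minimal sketch and not part of the project: the function names in_virtualenv and python_matches are hypothetical, and the check relies only on the standard library (sys.prefix, sys.base_prefix, sys.version_info).

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


def python_matches(required=(3, 11), version=None) -> bool:
    """Return True when the major/minor version equals `required`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == tuple(required)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("running Python 3.11:", python_matches())
```

Saved to a file and run with the interpreter from the activated environment, both lines should report True if the environment was created with Python 3.11.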
==============================================================
4a. Install the required dependencies by running:
pip install -r requirements.txt
Alternatively, skip step 4a and run step 4b instead (recommended):
4b. Install the project using its setup.py. The project is organized as a Python package.
To install it in editable mode (for development):
pip install -e .
The -e flag (editable mode) lets you change the project source code and have those changes take effect immediately, without reinstalling the package.
To install it in non-editable mode:
pip install .
Either way, this installs the dependencies and lets you import the project’s internal modules (e.g. utils.file_utils) without needing to modify sys.path.
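One way to confirm that the install made the internal modules importable is to ask the import system directly. The sketch below is an illustration only: the helper is_importable is hypothetical, and utils.file_utils is simply the example module name mentioned above.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # find_spec imports the parent package of a dotted name first;
        # this is raised when that parent package is missing.
        return False


if __name__ == "__main__":
    # Expected to print True only after step 4b has been completed.
    print("utils.file_utils importable:", is_importable("utils.file_utils"))
```

If this prints False after installation, the active interpreter is probably not the one from the virtual environment.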
==============================================================
To run an individual spider and produce an output file, use the following command:
scrapy crawl <spider_name> -o result.json
Example:
scrapy crawl mof -o output.json
To run this project, you will need to have Tesseract OCR installed on your system, specifically for running:
- spiders/kpkt.py
- Install Tesseract via Homebrew:
brew install tesseract
- Verify the installation by running:
tesseract --version
- Check the installation path (use this value for TESSERACT_PATH in your .env file):
which tesseract
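How the project reads TESSERACT_PATH is not shown here, so the following is only a sketch of one plausible lookup. It assumes the .env values have already been loaded into the process environment, prefers TESSERACT_PATH when it is set, and otherwise falls back to searching PATH, which is what `which tesseract` does. The function name resolve_tesseract_path is hypothetical.

```python
import os
import shutil


def resolve_tesseract_path(env=None):
    """Return the Tesseract binary path, or None if it cannot be found.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    configured = env.get("TESSERACT_PATH", "").strip()
    if configured:
        # Explicit configuration wins over auto-detection.
        return configured
    # Fallback: search PATH, mirroring `which tesseract`.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = resolve_tesseract_path()
    print(path if path else "Tesseract not found; set TESSERACT_PATH in .env")
```

Falling back to a PATH lookup keeps the spider usable on machines where Tesseract is installed but the variable was never set.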
To build and run the API, you will need to download and install Docker and Docker Compose.
1. Install Docker and Docker Compose.
2. Navigate to the root folder of the repo.
3. Build the container by running:
docker compose -f docker/docker-compose.yml build --no-cache
4. Run the container using:
docker compose -f docker/docker-compose.yml up
To run in detached mode, use:
docker compose -f docker/docker-compose.yml up --detach
5. Access the API by going to:
http://0.0.0.0:80
or, to access the interactive docs, go to:
http://0.0.0.0:80/docs
6. To shut down the container, run:
docker compose -f docker/docker-compose.yml down
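While the container is running, a small script can confirm that the API is answering. This is a minimal sketch using only the standard library; the helper names docs_url and is_reachable are hypothetical, and the base URL is the one given above.

```python
import urllib.request
from urllib.parse import urljoin


def docs_url(base_url: str) -> str:
    """Build the interactive docs URL from the API base URL."""
    return urljoin(base_url.rstrip("/") + "/", "docs")


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP error
        # responses; ValueError covers malformed URLs.
        return False


# Example (requires the container to be up):
#     is_reachable(docs_url("http://0.0.0.0:80"))
```

If 0.0.0.0 is not accepted by your client, http://localhost:80 is the usual equivalent, assuming the compose file publishes port 80 on the host.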