Directory Scraper

Overview

This project is a web scraping tool built with Scrapy. Its spiders crawl directory listings on target websites, extract specific fields, and save the results as structured JSON.

Getting Started

Follow the instructions below to set up the environment and run the scraper.

Make sure you have the following installed:

Python 3.11

Installation

Clone the repository to your local machine.
```
git clone <repository-url>
cd <repository-directory>
```

Create a Python virtual environment using Python 3.11.
```
python3.11 -m venv env
```
Activate the virtual environment.
- On macOS/Linux:
```
source env/bin/activate
```
- On Windows:
```
env\Scripts\activate
```
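After activating the environment, it can be worth confirming that the interpreter is the expected one. The following is a minimal sketch using only the standard library; the helper names `is_supported_python` and `in_virtualenv` are illustrative and not part of this project.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.11.x."""
    return tuple(version_info[:2]) == (3, 11)


def in_virtualenv() -> bool:
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python 3.11:", is_supported_python())
    print("Virtual environment active:", in_virtualenv())
```

If either line prints False, re-run the activation step before installing dependencies.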

==============================================================

4a. Install the required dependencies by running:
```bash
pip install -r requirements.txt
```
Alternatively, skip step 4a and run step 4b instead (recommended):

4b. Install the project using its setup.py. The project is organized as a Python package.

To install it in editable mode (for development), pass the -e flag. In editable mode, changes to the project source code take effect immediately, without reinstalling the package.
```
pip install -e .
```

To install it in non-editable mode:
```bash
pip install .
```

Either mode installs the dependencies and lets you import the project’s internal modules (e.g. utils.file_utils) without modifying sys.path.
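To confirm that the installation worked, you can check whether a module resolves without touching sys.path. This is a minimal sketch, assuming the module path from the example above; the real import path depends on how setup.py declares the packages, so adjust `MODULE_NAME` as needed. The helper `is_importable` is illustrative, not project code.

```python
import importlib.util

# Assumption: taken from the example above; the real import path depends on
# how the package is declared in setup.py.
MODULE_NAME = "utils.file_utils"


def is_importable(module_name: str) -> bool:
    """Return True if module_name can be located on the current import path."""
    try:
        # Note: for dotted names, find_spec imports the parent package first.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return False


if __name__ == "__main__":
    status = "importable" if is_importable(MODULE_NAME) else "not found"
    print(f"{MODULE_NAME}: {status}")
```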

==============================================================

Running the Spider

To run an individual spider and write its output to a file, use the following command:
```
scrapy crawl <spider_name> -o result.json
```
Example:
```
scrapy crawl mof -o output.json
```
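Once a crawl finishes, the output file can be inspected from Python. The sketch below assumes the file holds a JSON array of objects, which is what Scrapy's JSON feed export writes; the field names depend on the spider, and the helpers `parse_items` and `field_counts` are illustrative rather than part of this project.

```python
import json
from collections import Counter
from pathlib import Path


def parse_items(text: str) -> list[dict]:
    """Parse the contents of a Scrapy JSON export into a list of dicts."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def field_counts(items: list[dict]) -> Counter:
    """Count how many items carry a non-empty value for each field."""
    counts = Counter()
    for item in items:
        for key, value in item.items():
            if value not in (None, "", [], {}):
                counts[key] += 1
    return counts


if __name__ == "__main__":
    # "output.json" matches the example command above.
    items = parse_items(Path("output.json").read_text(encoding="utf-8"))
    print(f"{len(items)} items")
    for field, count in field_counts(items).most_common():
        print(f"{field}: {count}")
```

Note that in recent Scrapy versions `-o` appends to an existing file, which leaves a .json file invalid after a second run. Delete the old file first, or use `-O` to overwrite it.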

Installation of Tesseract OCR

Tesseract OCR must be installed on your system to run the spider in spiders/kpkt.py.

1. macOS

Steps:

Install Tesseract via Homebrew:
```
brew install tesseract
```
Verify the installation by running:
```
tesseract --version
```
Find the path to the Tesseract binary (use it as the value of TESSERACT_PATH in your .env file):
```
which tesseract
```
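The same lookup can be done from Python. This sketch prefers a TESSERACT_PATH environment variable and falls back to searching PATH; how the project itself loads the .env file is not shown here, and `resolve_tesseract_path` is an illustrative helper rather than project code.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tesseract_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Tesseract binary path, or None if it cannot be found."""
    # Prefer an explicit setting, mirroring the TESSERACT_PATH entry in .env.
    configured = env.get("TESSERACT_PATH")
    if configured:
        return configured
    # Fall back to the same lookup that `which tesseract` performs.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = resolve_tesseract_path()
    print(path if path else "tesseract not found; install it or set TESSERACT_PATH")
```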

Building Docker Image and Running Container

To build and run the API, you will need to download and install Docker and Docker Compose.

1. macOS

Steps:

Install Docker and Docker Compose.

Navigate to the root folder of the repository.

Build the image by running:
```
docker compose -f docker/docker-compose.yml build --no-cache
```
Run the container using:
```
docker compose -f docker/docker-compose.yml up
```
To run in detached mode, use:
```
docker compose -f docker/docker-compose.yml up --detach
```

Access the API at http://0.0.0.0:80, or open the interactive docs at http://0.0.0.0:80/docs.
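To check from a script that the container is serving requests, a small standard-library probe is enough. This is a sketch under the assumption that the API answers plain HTTP GET requests on the URLs given above; `docs_url` and `is_reachable` are illustrative names, not part of the project.

```python
import http.client
import urllib.request

# Base URL from the instructions above; on some systems http://localhost:80
# may be needed instead (assumption, depends on the platform).
BASE_URL = "http://0.0.0.0:80"


def docs_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the interactive docs page."""
    return base_url.rstrip("/") + "/docs"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to url succeeds with a 2xx/3xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs; HTTPException covers broken HTTP responses.
        return False


if __name__ == "__main__":
    for url in (BASE_URL, docs_url()):
        print(url, "->", "up" if is_reachable(url) else "not reachable")
```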

To shut down the container, run:
```
docker compose -f docker/docker-compose.yml down
```
