🤖 EthicalScraper

A Python library for ethical web scraping that automatically respects robots.txt rules, implements rate limiting, and provides advanced analysis tools for public price monitoring.

🎯 Key Features

✅ Automatic robots.txt verification before any access
⏱️ Configurable rate limiting to respect servers
🤖 Customizable user-agent for proper identification
📊 Advanced analysis of robots.txt files (crawl-delay, sitemaps, rules)
🔄 Batch processing with parallelization support
📈 Report export in CSV and JSON formats
🚨 Robust error handling for network failures
📝 Detailed logs with timestamps for auditing
💾 Smart caching to avoid unnecessary downloads

🚀 Quick Installation

# Clone the repository
git clone https://github.com/dindicoelho/scraper-etico.git
cd scraper-etico

# Install dependencies (choose one option)
pip3 install -r requirements.txt
# OR if pip3 is not found:
python3 -m pip install -r requirements.txt
# OR using conda:
# conda install requests

# Configure your credentials
cp production_config.example.py production_config.py
nano production_config.py  # Edit with your data

# Test installation - run this file:
python3 example_usage.py

🚀 What Files Can You Run?

🧪 For Testing/Learning:

# 1. Test basic functionality:
python3 example_usage.py

# 2. Run interactive tutorial:
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

# 3. Custom scraping with your own URLs:
python3 custom_scraping.py

🏭 For Production:

# 1. Configure your sites first:
cp production_config.example.py production_config.py
nano production_config.py  # Add your real sites and settings

# 2. Run production scraping:
python3 run_production.py

# 3. Analyze results:
python3 analyze_results.py

🧪 For Testing:

# Run all tests:
python3 tests/run_tests.py

# Test with your specific sites:
python3 tests/examples/my_monitoring.py

📖 Basic Usage (Code Examples)

Simple URL Verification

from src.scraper_etico import ScraperEtico

# Create scraper instance
scraper = ScraperEtico(
    user_agent="MyBot/1.0 (+http://mysite.com/contact)",
    default_delay=2.0
)

# Check if a URL can be accessed
can_access = scraper.can_fetch("https://example.com/page")
print(f"Can access: {can_access}")

# Make request with automatic rate limiting  
response = scraper.get("https://example.com/page")
if response:
    print(f"Content retrieved: {len(response.text)} characters")

Batch Processing with Export

from src.batch_processor import BatchProcessor
from src.scraper_etico import ScraperEtico

# URLs to verify
urls = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
    "https://www.gov.br/economia"
]

# Process in batch
processor = BatchProcessor()
processor.scraper = ScraperEtico(
    user_agent="MyBot/1.0",
    default_delay=5.0  # Respectful for gov sites
)

job_state = processor.process_batch(urls, max_workers=2)

# Automatic export to CSV and JSON
processor.export_to_csv(job_state, "results.csv")
processor.export_to_json(job_state, "results.json")
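
The layout of job_state is not described in this README, so the following is only a sketch of what a CSV/JSON export of per-URL results could look like with the standard library. The function names write_csv and write_json and the columns url, allowed, and status are assumptions made for illustration, not the library's API.

```python
import csv
import json


def write_csv(rows, path, fieldnames=("url", "allowed", "status")):
    """Write one CSV row per result dict (hypothetical column names)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)


def write_json(rows, path):
    """Write the same result dicts as a JSON array, keeping native types."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2, ensure_ascii=False)
```

CSV stores every value as text, so booleans and status codes come back as strings when the file is re-read; the JSON file keeps their types.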

🎓 Complete Tutorial

Run the interactive tutorial notebook:

# First install jupyter (if not already installed)
pip3 install jupyter
# OR: python3 -m pip install jupyter

# Then run the tutorial (choose one option)
cd notebooks/
jupyter notebook ethical_scraper_tutorial.ipynb
# OR if the jupyter command is not found:
python3 -m jupyter notebook ethical_scraper_tutorial.ipynb

🏗️ System Architecture

📁 Project Structure & Executable Files

scraper-etico/
├── 🟢 EXECUTABLE FILES (run these):
│   ├── example_usage.py           # 👉 python3 example_usage.py
│   ├── custom_scraping.py         # 👉 python3 custom_scraping.py  
│   ├── run_production.py          # 👉 python3 run_production.py
│   ├── analyze_results.py         # 👉 python3 analyze_results.py
│   │
├── 🔧 CONFIGURATION FILES (edit these):
│   ├── production_config.example.py  # Copy to production_config.py
│   ├── requirements.txt              # Dependencies list
│   │
├── 📚 TUTORIAL & TESTS:
│   ├── notebooks/ethical_scraper_tutorial.ipynb  # 👉 jupyter notebook
│   ├── tests/run_tests.py                       # 👉 python3 tests/run_tests.py
│   ├── tests/production_test.py                 # 👉 python3 tests/production_test.py
│   └── tests/examples/my_monitoring.py          # 👉 python3 tests/examples/my_monitoring.py
│   │
├── 📦 LIBRARY CODE (don't edit):
│   ├── src/scraper_etico.py       # Main scraping class
│   ├── src/analyzer.py            # Robots.txt analysis
│   ├── src/batch_processor.py     # Batch processing
│   └── src/utils.py               # Utility functions
│   │
└── 📊 RESULTS FOLDERS:
    ├── production_data/           # Your scraping results
    ├── data_backup/               # Automatic backups
    ├── logs/                      # Execution logs
    └── batch_states/              # Job resumption data

🚀 Quick Command Reference:

# 🧪 Testing & Learning:
python3 example_usage.py                    # Test basic functionality
python3 custom_scraping.py                  # Custom URL scraping
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

# 🏭 Production:
python3 run_production.py                   # Main production script
python3 analyze_results.py                  # Analyze results

# 🧪 Testing:
python3 tests/run_tests.py                  # Run all tests
python3 tests/production_test.py            # Test production config

Main Components

1. ScraperEtico (`src/scraper_etico.py`)

Responsibility: Individual ethical scraping with automatic compliance

Features:

✅ Automatic robots.txt verification
⏱️ Intelligent rate limiting per domain
🔄 robots.txt caching for performance
📝 Detailed logging with timestamps
🛡️ Robust network error handling
🕐 Automatic crawl-delay detection

Execution flow:

URL Input → Robots.txt Check → Rate Limiting → HTTP Request → Response
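
The flow above can be sketched with the standard library alone. This is an illustration, not the code in src/scraper_etico.py: the PoliteFetcher class, its parameters, and the injected fetch callable (a stand-in for the real HTTP request) are assumptions. The robots.txt rules are parsed from a string so the sketch runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Hypothetical illustration of: robots.txt check -> rate limit -> request."""

    def __init__(self, user_agent, robots_txt, delay, fetch,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.delay = delay
        self.fetch = fetch      # callable(url) -> response, stands in for HTTP
        self.clock = clock
        self.sleep = sleep
        self.last_request = None
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())

    def get(self, url):
        # 1. Robots.txt check: refuse URLs the rules disallow for this agent.
        if not self.parser.can_fetch(self.user_agent, url):
            return None
        # 2. Rate limiting: wait out the rest of the delay since the last request.
        if self.last_request is not None:
            remaining = self.delay - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        # 3. HTTP request, then hand the response back to the caller.
        self.last_request = self.clock()
        return self.fetch(url)
```

Returning None for a disallowed URL mirrors the `if response:` check in the basic usage example; whether the real class signals refusal the same way is an assumption.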

2. RobotsAnalyzer (`src/analyzer.py`)

Responsibility: Advanced and detailed robots.txt file analysis

Features:

🔍 Complete robots.txt parsing with validation
📊 Detailed statistics per user-agent
🗺️ Sitemap extraction and validation
⚖️ Comparison between different user-agents
📈 Restrictiveness and pattern analysis
📋 Formatted report generation

Use cases:

robots.txt policy auditing
Scraping strategy planning
robots.txt change monitoring
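
As a rough idea of what such an analysis involves, Python's built-in urllib.robotparser already exposes crawl-delay, sitemaps, and per-agent rules. The sketch below is not the RobotsAnalyzer implementation; summarize_robots, SAMPLE_ROBOTS, and the FriendlyBot agent are hypothetical names used only for this example (site_maps() requires Python 3.8+).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/

User-agent: FriendlyBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def summarize_robots(robots_txt, user_agents, probe_url):
    """Compare how one robots.txt treats several user-agents for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # site_maps() returns None when no Sitemap line is present.
        "sitemaps": parser.site_maps() or [],
        "agents": {
            agent: {
                "crawl_delay": parser.crawl_delay(agent),
                "can_fetch": parser.can_fetch(agent, probe_url),
            }
            for agent in user_agents
        },
    }


summary = summarize_robots(
    SAMPLE_ROBOTS, ["MyBot", "FriendlyBot"], "https://example.com/admin/x"
)
```

Here MyBot falls under the `*` group (5-second crawl-delay, /admin/ blocked), while FriendlyBot has its own group with no crawl-delay and full access.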

3. BatchProcessor (`src/batch_processor.py`)

Responsibility: Concurrent and efficient processing of multiple URLs

Features:

🔄 Concurrent processing with ThreadPoolExecutor
💾 Persistent state for interrupted job resumption
📊 Real-time progress bar (tqdm)
📤 Automatic export to CSV and JSON
📈 Detailed statistical reports
🛡️ Coordinated rate limiting between threads
🔍 Integration with robots.txt analysis

Threading architecture:

BatchProcessor → ThreadPoolExecutor → [Worker1, Worker2, Worker3...] → Domain Locks → ScraperEtico
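
A minimal sketch of this worker/lock arrangement, assuming one lock per domain so that workers run in parallell across domains but never hit the same host at the same time. The process_batch function here and its fetch parameter are hypothetical stand-ins, not the signatures in src/batch_processor.py.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def process_batch(urls, fetch, max_workers=2):
    """Fetch URLs concurrently, holding a per-domain lock around each request."""
    domain_locks = defaultdict(threading.Lock)
    registry_lock = threading.Lock()  # guards creation of the per-domain locks

    def worker(url):
        domain = urlparse(url).netloc
        with registry_lock:
            lock = domain_locks[domain]
        with lock:  # only one in-flight request per domain
            return url, fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `urls`.
        return dict(pool.map(worker, urls))
```

In a full pipeline, fetch would be the step that performs the robots.txt check and the delay before the HTTP request.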

Rate Limiting System

The system implements multiple layers of thread-safe rate limiting:

1. Global Rate Limiting (ScraperEtico)
   ↓
2. Domain-specific Rate Limiting (BatchProcessor)  
   ↓
3. Robots.txt Crawl-Delay Compliance
   ↓
4. Thread-safe Domain Locks
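
One way these layers could combine is shown below: the effective delay for a domain is the stricter of the configured default and the robots.txt crawl-delay, and a per-domain lock keeps the bookkeeping thread-safe. DomainRateLimiter and its wait method are hypothetical names for this sketch; the real classes may divide the work differently.

```python
import threading
import time


class DomainRateLimiter:
    """Hypothetical sketch: one lock and one last-request timestamp per domain."""

    def __init__(self, default_delay, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.clock = clock
        self.sleep = sleep
        self._locks = {}
        self._last = {}
        self._registry_lock = threading.Lock()

    def wait(self, domain, crawl_delay=None):
        """Block until `domain` may be contacted again; return seconds slept."""
        # The stricter of the configured delay and the crawl-delay wins.
        delay = max(self.default_delay, crawl_delay or 0)
        with self._registry_lock:
            lock = self._locks.setdefault(domain, threading.Lock())
        with lock:
            last = self._last.get(domain)
            elapsed = None if last is None else self.clock() - last
            pause = 0.0 if elapsed is None else max(0.0, delay - elapsed)
            if pause:
                self.sleep(pause)
            self._last[domain] = self.clock()
            return pause
```

The clock and sleep parameters exist only so the limiter can be exercised without real waiting; by default it uses time.monotonic and time.sleep.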

Complete Data Flow

URL List → BatchProcessor → JobState
     ↓
ThreadPoolExecutor → Worker Threads
     ↓
Domain Rate Limiting → ScraperEtico
     ↓
Robots.txt Check → HTTP Request → Response
     ↓
RobotsAnalyzer → BatchResult
     ↓
Persistent State → Exports (CSV/JSON) → Reports

Persistence and Resumption

The system maintains persistent state using .pkl files in batch_states/:

JobState: Job metadata and progress
Completed URLs: URLs already processed
Failed URLs: URLs that failed
Results: Detailed results for each URL
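
A minimal sketch of pickle-based resumption, assuming the state is a plain dictionary with the four parts listed above. The helper names (new_state, save_state, load_state, resume) and the dictionary keys are illustrative assumptions; the actual JobState class is not shown in this README.

```python
import pickle
from pathlib import Path


def new_state(job_id, urls):
    """Hypothetical job state: metadata, progress lists, and per-URL results."""
    return {"job_id": job_id, "pending": list(urls),
            "completed": [], "failed": [], "results": {}}


def save_state(state, directory):
    path = Path(directory) / f"{state['job_id']}.pkl"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(state))
    tmp.replace(path)  # rename over the old file so it is never half-written
    return path


def load_state(path):
    # Only unpickle files this tool wrote itself; pickle can run code on load.
    return pickle.loads(Path(path).read_bytes())


def resume(state, fetch, directory):
    """Process the remaining URLs, saving after each one so work is not lost."""
    while state["pending"]:
        url = state["pending"][0]
        try:
            state["results"][url] = fetch(url)
            state["completed"].append(url)
        except Exception as exc:  # record the failure and keep going
            state["results"][url] = repr(exc)
            state["failed"].append(url)
        state["pending"].pop(0)
        save_state(state, directory)
    return state
```

After an interruption, calling load_state on the saved file and passing the result to resume continues from the first URL that was still pending.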

Performance and Scalability

Implemented optimizations:

💾 Smart robots.txt caching
🔄 HTTP connection reuse with requests.Session
📊 Concurrent processing with ThreadPoolExecutor
🧵 Scalable thread pool (configurable)
💿 Persistent state for long jobs
🔐 Thread-safe domain locking

Recommended limits:

Max workers: 5-10 (depending on target server)
Default delay: 1-2 seconds (5s+ for gov sites)
Timeout: 30 seconds
Batch size: 100-1000 URLs

⚙️ Production Configuration

Copy the example file:

cp production_config.example.py production_config.py

Edit your settings:

# Identify your bot properly
USER_AGENT = "MyProject/1.0 (+https://mysite.com; contact@mysite.com)"

# Configure respectful delays
DEFAULT_DELAY = 5.0  # Government sites

# Add your sites
PRODUCTION_SITES = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
    # Your sites...
]
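
Before running in production it can help to sanity-check these values. The validate_config helper below is a hypothetical sketch, not part of the repository; its rules are taken from the ethical principles in this README (contact information in the user-agent, a minimum delay of 1 second).

```python
import re
from urllib.parse import urlparse


def validate_config(user_agent, default_delay, sites):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    # A contact URL or e-mail address lets site owners reach the bot operator.
    if not re.search(r"https?://|@", user_agent):
        problems.append("USER_AGENT should include a contact URL or e-mail address")
    if default_delay < 1.0:
        problems.append("DEFAULT_DELAY should be at least 1 second")
    for site in sites:
        parts = urlparse(site)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"Not an absolute http(s) URL: {site}")
    return problems
```

It could be called with the values from production_config.py, for example `validate_config(USER_AGENT, DEFAULT_DELAY, PRODUCTION_SITES)`, and the run aborted if the returned list is not empty.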

Execute production:
```
python3 run_production.py
```

🧪 Testing

Run These Files to Test:

# 1. Run all automated tests:
python3 tests/run_tests.py

# 2. Test production configuration:
python3 tests/production_test.py

# 3. Test with your specific sites (edit this file first):
python3 tests/examples/my_monitoring.py

📊 Data Analysis

Run This File to Analyze Results:

# 1. Automatic analysis of all results:
python3 analyze_results.py

# 2. Open results in Excel/Google Sheets:
open production_data/monitoring_*.csv
# OR on Linux: xdg-open production_data/monitoring_*.csv

🛡️ Ethical Principles

This library follows ethical best practices for web scraping:

✅ Always Do:

Check robots.txt before any access
Use adequate delays between requests (minimum 1s, for gov sites: 5s+)
Identify your bot with descriptive user-agent
Include contact information in user-agent
Monitor logs to detect problems
Respect rate limits and specified crawl-delays

❌ Never Do:

Ignore robots.txt rules
Make excessive simultaneous requests
Use fake browser user-agents
Scrape 24/7 without breaks
Ignore HTTP or network errors

📝 Use Cases

🏛️ Public Bidding Monitoring
💰 Government Price Analysis
📊 Academic Research on public data
🔍 Government Transparency Auditing
📈 Public Market Analysis

🤝 Contributing

1. Fork the project
2. Create a branch (git checkout -b feature/new-feature)
3. Commit your changes (git commit -am 'Add new feature')
4. Push to the branch (git push origin feature/new-feature)
5. Open a Pull Request

📄 License

MIT License - see LICENSE for details.

🔧 Troubleshooting

Common Installation Issues

Issue: pip: command not found or pip3: command not found
Solution: Use Python module syntax:

python3 -m pip install -r requirements.txt

Issue: jupyter: command not found
Solution: Use Python module syntax:

# Install jupyter if needed
python3 -m pip install jupyter
# Then run via python3 -m
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

Issue: ModuleNotFoundError: No module named 'requests'
Solution: Install the requirements from the project directory:

cd scraper-etico
python3 -m pip install -r requirements.txt

Issue: Permission errors on macOS/Linux
Solution: Use a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
python3 -m pip install -r requirements.txt

⚠️ Disclaimer

This tool was developed for ethical and educational use. Users are responsible for:

Respecting website terms of service
Checking the legality of scraping in their jurisdiction
Not overloading servers
Using only for legitimate purposes

🆘 Support

📖 Documentation: See notebooks/ and README.md
🐛 Issues: Open an issue on GitHub
💬 Discussions: Use GitHub Discussions
📧 Contact: Open an issue on GitHub

🌟 Ethical Scraping is Responsible Scraping 🌟

"With great power comes great responsibility" - Use this library to build a more respectful and collaborative internet.

Made with ❤️ for ethical web scraping
