dindicoelho/scraper-etico

🤖 EthicalScraper

A Python library for ethical web scraping that automatically respects robots.txt rules, implements rate limiting, and provides advanced analysis tools for public price monitoring.

🎯 Key Features

  • ✅ Automatic robots.txt verification before any access
  • ⏱️ Configurable rate limiting to respect servers
  • 🤖 Customizable user-agent for proper identification
  • 📊 Advanced analysis of robots.txt files (crawl-delay, sitemaps, rules)
  • 🔄 Batch processing with parallelization support
  • 📈 Report export in CSV and JSON formats
  • 🚨 Robust error handling for network failures
  • 📝 Detailed logs with timestamps for auditing
  • 💾 Smart caching to avoid unnecessary downloads

🚀 Quick Installation

# Clone the repository
git clone https://github.com/dindicoelho/scraper-etico.git
cd scraper-etico

# Install dependencies (choose one option)
pip3 install -r requirements.txt
# OR if pip3 not found:
python3 -m pip install -r requirements.txt
# OR using conda:
# conda install requests

# Configure your credentials
cp production_config.example.py production_config.py
nano production_config.py  # Edit with your data

# Test installation - run this file:
python3 example_usage.py

🚀 What Files Can You Run?

🧪 For Testing/Learning:

# 1. Test basic functionality:
python3 example_usage.py

# 2. Run interactive tutorial:
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

# 3. Custom scraping with your own URLs:
python3 custom_scraping.py

🏭 For Production:

# 1. Configure your sites first:
cp production_config.example.py production_config.py
nano production_config.py  # Add your real sites and settings

# 2. Run production scraping:
python3 run_production.py

# 3. Analyze results:
python3 analyze_results.py

🧪 For Testing:

# Run all tests:
python3 tests/run_tests.py

# Test with your specific sites:
python3 tests/examples/my_monitoring.py

📖 Basic Usage (Code Examples)

Simple URL Verification

from src.scraper_etico import ScraperEtico

# Create scraper instance
scraper = ScraperEtico(
    user_agent="MyBot/1.0 (+http://mysite.com/contact)",
    default_delay=2.0
)

# Check if a URL can be accessed
can_access = scraper.can_fetch("https://example.com/page")
print(f"Can access: {can_access}")

# Make request with automatic rate limiting  
response = scraper.get("https://example.com/page")
if response:
    print(f"Content retrieved: {len(response.text)} characters")

Batch Processing with Export

from src.batch_processor import BatchProcessor
from src.scraper_etico import ScraperEtico

# URLs to verify
urls = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
    "https://www.gov.br/economia"
]

# Process in batch
processor = BatchProcessor()
processor.scraper = ScraperEtico(
    user_agent="MyBot/1.0",
    default_delay=5.0  # Respectful for gov sites
)

job_state = processor.process_batch(urls, max_workers=2)

# Automatic export to CSV and JSON
processor.export_to_csv(job_state, "results.csv")
processor.export_to_json(job_state, "results.json")

🎓 Complete Tutorial

Run the interactive tutorial notebook:

# First install jupyter (if not already installed)
pip3 install jupyter
# OR: python3 -m pip install jupyter

# Then run the tutorial (choose one option)
cd notebooks/
jupyter notebook ethical_scraper_tutorial.ipynb
# OR if jupyter command not found:
python3 -m jupyter notebook ethical_scraper_tutorial.ipynb

πŸ—οΈ System Architecture

πŸ“ Project Structure & Executable Files

scraper-etico/
├── 🟢 EXECUTABLE FILES (run these):
│   ├── example_usage.py           # 👉 python3 example_usage.py
│   ├── custom_scraping.py         # 👉 python3 custom_scraping.py
│   ├── run_production.py          # 👉 python3 run_production.py
│   └── analyze_results.py         # 👉 python3 analyze_results.py
│
├── 🔧 CONFIGURATION FILES (edit these):
│   ├── production_config.example.py  # Copy to production_config.py
│   └── requirements.txt              # Dependencies list
│
├── 📚 TUTORIAL & TESTS:
│   ├── notebooks/ethical_scraper_tutorial.ipynb  # 👉 jupyter notebook
│   ├── tests/run_tests.py                       # 👉 python3 tests/run_tests.py
│   ├── tests/production_test.py                 # 👉 python3 tests/production_test.py
│   └── tests/examples/my_monitoring.py          # 👉 python3 tests/examples/my_monitoring.py
│
├── 📦 LIBRARY CODE (don't edit):
│   ├── src/scraper_etico.py       # Main scraping class
│   ├── src/analyzer.py            # Robots.txt analysis
│   ├── src/batch_processor.py     # Batch processing
│   └── src/utils.py               # Utility functions
│
└── 📊 RESULTS FOLDERS:
    ├── production_data/           # Your scraping results
    ├── data_backup/               # Automatic backups
    ├── logs/                      # Execution logs
    └── batch_states/              # Job resumption data

🚀 Quick Command Reference:

# 🧪 Testing & Learning:
python3 example_usage.py                    # Test basic functionality
python3 custom_scraping.py                  # Custom URL scraping
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

# 🏭 Production:
python3 run_production.py                   # Main production script
python3 analyze_results.py                  # Analyze results

# 🧪 Testing:
python3 tests/run_tests.py                  # Run all tests
python3 tests/production_test.py            # Test production config

Main Components

1. ScraperEtico (src/scraper_etico.py)

Responsibility: Individual ethical scraping with automatic compliance

Features:

  • ✅ Automatic robots.txt verification
  • ⏱️ Intelligent rate limiting per domain
  • 🔄 robots.txt caching for performance
  • 📝 Detailed logging with timestamps
  • 🛡️ Robust network error handling
  • 🕐 Automatic crawl-delay detection

Execution flow:

URL Input → Robots.txt Check → Rate Limiting → HTTP Request → Response
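
The first three stages of this flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/scraper_etico.py: `RULES`, `check_then_wait`, and the injectable `now`/`sleep` parameters are hypothetical names, and the robots.txt rules are parsed from an in-memory string so nothing touches the network.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory (no network access).
RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""


def check_then_wait(url, user_agent, parser, last_request, default_delay=2.0,
                    now=time.monotonic, sleep=time.sleep):
    """Return True if `url` may be fetched, after sleeping out the required delay."""
    if not parser.can_fetch(user_agent, url):
        return False  # robots.txt disallows this path: stop before any request
    # Use the crawl-delay when it is stricter than our own default.
    delay = max(default_delay, parser.crawl_delay(user_agent) or 0)
    remaining = delay - (now() - last_request)
    if remaining > 0:
        sleep(remaining)
    return True  # the caller would issue the HTTP request at this point


parser = RobotFileParser()
parser.parse(RULES.splitlines())
```

Passing `now` and `sleep` as parameters is only a convenience for testing the logic without real waiting.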

2. RobotsAnalyzer (src/analyzer.py)

Responsibility: Advanced and detailed robots.txt file analysis

Features:

  • 🔍 Complete robots.txt parsing with validation
  • 📊 Detailed statistics per user-agent
  • 🗺️ Sitemap extraction and validation
  • ⚖️ Comparison between different user-agents
  • 📈 Restrictiveness and pattern analysis
  • 📋 Formatted report generation

Use cases:

  • robots.txt policy auditing
  • Scraping strategy planning
  • robots.txt change monitoring
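
To give a feel for this kind of analysis, here is a minimal sketch built on Python's `urllib.robotparser` (Python 3.8+ for `site_maps()`). It does not use the RobotsAnalyzer API, which is not shown in this README: `SAMPLE` and `summarize` are hypothetical names, and the rules are an invented example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def summarize(robots_text, user_agents, probe_url):
    """Compare how each user-agent is treated for one probe URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        "sitemaps": parser.site_maps() or [],  # site_maps() returns None when absent
        "agents": {
            ua: {
                "allowed": parser.can_fetch(ua, probe_url),
                "crawl_delay": parser.crawl_delay(ua),
            }
            for ua in user_agents
        },
    }


report = summarize(SAMPLE, ["MyBot/1.0", "BadBot"], "https://example.com/data")
```

Running the same comparison on a schedule and diffing the reports is one simple way to approach robots.txt change monitoring.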

3. BatchProcessor (src/batch_processor.py)

Responsibility: Concurrent and efficient processing of multiple URLs

Features:

  • 🔄 Concurrent processing with ThreadPoolExecutor
  • 💾 Persistent state for interrupted job resumption
  • 📊 Real-time progress bar (tqdm)
  • 📤 Automatic export to CSV and JSON
  • 📈 Detailed statistical reports
  • 🛡️ Coordinated rate limiting between threads
  • 🔍 Integration with robots.txt analysis

Threading architecture:

BatchProcessor → ThreadPoolExecutor → [Worker1, Worker2, Worker3...] → Domain Locks → ScraperEtico
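
The chain above can be illustrated with a small, self-contained sketch. It is not the BatchProcessor implementation: `lock_for`, `process`, and this `process_batch` function are hypothetical, and `fetch` is an injected stand-in for the real scraper call.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

domain_locks = defaultdict(threading.Lock)  # one lock per domain, created on demand
registry_lock = threading.Lock()            # guards access to the dict itself


def lock_for(url):
    """Return the lock shared by every URL on the same domain."""
    with registry_lock:
        return domain_locks[urlparse(url).netloc]


def process(url, fetch):
    # Only one worker at a time may touch a given domain.
    with lock_for(url):
        return url, fetch(url)


def process_batch(urls, fetch, max_workers=2):
    """Run `fetch` over `urls` concurrently and map each URL to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda u: process(u, fetch), urls))


# `len` stands in for a real fetch function so the sketch runs offline.
results = process_batch(
    ["https://a.example/1", "https://a.example/2", "https://b.example/1"],
    fetch=len,
    max_workers=3,
)
```

Holding the domain lock for the whole fetch serializes requests to one host while still letting different hosts proceed in parallel.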

Rate Limiting System

The system implements multiple layers of thread-safe rate limiting:

1. Global Rate Limiting (ScraperEtico)
   ↓
2. Domain-specific Rate Limiting (BatchProcessor)  
   ↓
3. Robots.txt Crawl-Delay Compliance
   ↓
4. Thread-safe Domain Locks
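
One way to combine layers 2 to 4 is a small per-domain limiter. The sketch below is an assumption about how such layering can work, not the library's code: `DomainRateLimiter` and its `wait` method are hypothetical, and the clock and sleep functions are injectable so the logic can be checked without real delays.

```python
import threading
import time


class DomainRateLimiter:
    """Thread-safe limiter that spaces out requests to the same domain."""

    def __init__(self, default_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self._clock = clock
        self._sleep = sleep
        self._last = {}                 # domain -> time its latest request slot starts
        self._lock = threading.Lock()   # protects self._last across worker threads

    def wait(self, domain, crawl_delay=None):
        """Block until `domain` may be contacted; return the seconds waited."""
        # The effective delay is the stricter of our default and the site's crawl-delay.
        delay = max(self.default_delay, crawl_delay or 0)
        with self._lock:
            now = self._clock()
            ready_at = self._last.get(domain, now - delay) + delay
            pause = max(0.0, ready_at - now)
            self._last[domain] = now + pause  # reserve the slot before sleeping
        if pause:
            self._sleep(pause)  # sleep outside the lock so other domains are not held up
        return pause
```

A worker would call `limiter.wait(domain, crawl_delay)` immediately before each request; reserving the slot inside the lock keeps two threads from claiming the same moment.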

Complete Data Flow

URL List → BatchProcessor → JobState
     ↓
ThreadPoolExecutor → Worker Threads
     ↓
Domain Rate Limiting → ScraperEtico
     ↓
Robots.txt Check → HTTP Request → Response
     ↓
RobotsAnalyzer → BatchResult
     ↓
Persistent State → Exports (CSV/JSON) → Reports

Persistence and Resumption

The system maintains persistent state using .pkl files in batch_states/:

  • JobState: Job metadata and progress
  • Completed URLs: URLs already processed
  • Failed URLs: URLs that failed
  • Results: Detailed results for each URL
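
A resumable state of this shape can be sketched with `pickle` as follows. The dictionary layout and the helper names (`new_state`, `save_state`, `load_state`, `pending`) are assumptions made for illustration; the real JobState class may be organized differently.

```python
import pickle
import tempfile
from pathlib import Path


def new_state(job_id):
    # Hypothetical layout mirroring the four items listed above.
    return {"job_id": job_id, "completed": set(), "failed": set(), "results": {}}


def save_state(state, directory):
    """Write the state to <directory>/<job_id>.pkl via a temporary file."""
    path = Path(directory) / f"{state['job_id']}.pkl"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(state))
    tmp.replace(path)  # rename into place only once the file is fully written
    return path


def load_state(path):
    # Only unpickle files you wrote yourself: pickle can execute code on load.
    return pickle.loads(Path(path).read_bytes())


def pending(urls, state):
    """Return the URLs that still need processing after a resume."""
    done = state["completed"] | state["failed"]
    return [u for u in urls if u not in done]


with tempfile.TemporaryDirectory() as folder:
    state = new_state("job1")
    state["completed"].add("https://a.example/1")
    restored = load_state(save_state(state, folder))
```

On resume, `pending(all_urls, restored)` gives the work that remains, so completed and failed URLs are not fetched again.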

Performance and Scalability

Implemented optimizations:

  • 💾 Smart robots.txt caching
  • 🔄 HTTP connection reuse with requests.Session
  • 📊 Concurrent processing with ThreadPoolExecutor
  • 🧵 Scalable thread pool (configurable)
  • 💿 Persistent state for long jobs
  • 🔐 Thread-safe domain locking

Recommended limits:

  • Max workers: 5-10 (depending on target server)
  • Default delay: 1-2 seconds (5s+ for gov sites)
  • Timeout: 30 seconds
  • Batch size: 100-1000 URLs
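
To stay inside the recommended batch size, a long URL list can be split before processing. This is a generic sketch; `chunked` is a hypothetical helper and the URLs are placeholders.

```python
def chunked(urls, batch_size=500):
    """Yield consecutive slices of `urls`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


# Placeholder URLs, used only to show the resulting batch sizes.
urls = [f"https://example.com/item/{i}" for i in range(1200)]
batches = list(chunked(urls, batch_size=500))
```

Each batch could then be handed to `processor.process_batch(batch, max_workers=...)` as in the Batch Processing example.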

βš™οΈ Production Configuration

  1. Copy the example file:

    cp production_config.example.py production_config.py
  2. Edit your settings:

    # Identify your bot properly
    USER_AGENT = "MyProject/1.0 (+https://mysite.com; contact@mysite.com)"
    
    # Configure respectful delays
    DEFAULT_DELAY = 5.0  # Government sites
    
    # Add your sites
    PRODUCTION_SITES = [
        "https://portal.compras.gov.br",
        "https://transparencia.gov.br",
        # Your sites...
    ]
  3. Execute production:

    python3 run_production.py
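
Before running in production, it can help to sanity-check these settings against the ethical principles in this README. The check below is a hypothetical helper (`validate_config` is not part of the library) and only covers three simple rules: contact details in the user-agent, a delay of at least one second, and absolute site URLs.

```python
import re

# Values copied from the example configuration above.
USER_AGENT = "MyProject/1.0 (+https://mysite.com; contact@mysite.com)"
DEFAULT_DELAY = 5.0
PRODUCTION_SITES = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
]


def validate_config(user_agent, default_delay, sites):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # A contact URL or an email address counts as contact information here.
    if not re.search(r"https?://|@", user_agent):
        problems.append("USER_AGENT has no contact URL or email")
    if default_delay < 1.0:
        problems.append("DEFAULT_DELAY is below the 1 second minimum")
    for site in sites:
        if not site.startswith(("http://", "https://")):
            problems.append(f"not an absolute URL: {site}")
    return problems


problems = validate_config(USER_AGENT, DEFAULT_DELAY, PRODUCTION_SITES)
```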

🧪 Testing

Run These Files to Test:

# 1. Run all automated tests:
python3 tests/run_tests.py

# 2. Test production configuration:
python3 tests/production_test.py

# 3. Test with your specific sites (edit this file first):
python3 tests/examples/my_monitoring.py

📊 Data Analysis

Run This File to Analyze Results:

# 1. Automatic analysis of all results:
python3 analyze_results.py

# 2. Open results in Excel/Google Sheets (macOS):
open production_data/monitoring_*.csv
# OR on Linux: xdg-open production_data/monitoring_*.csv
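
For a quick look without a spreadsheet, an exported CSV can also be summarized with the standard library. The column names used here (`url`, `status`) are assumptions, since the export schema is not documented in this README; check the header row of your own files first.

```python
import csv
import io
from collections import Counter


def count_by(csv_file, column):
    """Count how often each value appears in `column` of an open CSV file."""
    return Counter(row[column] for row in csv.DictReader(csv_file))


# In-memory stand-in for an exported file; the columns are hypothetical.
sample = io.StringIO(
    "url,status\n"
    "https://a.example,allowed\n"
    "https://b.example,blocked\n"
    "https://c.example,allowed\n"
)
counts = count_by(sample, "status")
```

With a real file, pass an object opened as `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.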

πŸ›‘οΈ Ethical Principles

This library strictly follows web scraping ethical best practices:

βœ… Always Do:

  • Check robots.txt before any access
  • Use adequate delays between requests (minimum 1s, for gov sites: 5s+)
  • Identify your bot with descriptive user-agent
  • Include contact information in user-agent
  • Monitor logs to detect problems
  • Respect rate limits and specified crawl-delays

❌ Never Do:

  • Ignore robots.txt rules
  • Make excessive simultaneous requests
  • Use fake browser user-agents
  • Scrape 24/7 without breaks
  • Ignore HTTP or network errors

πŸ“ Use Cases

  • πŸ›οΈ Public Bidding Monitoring
  • πŸ’° Government Price Analysis
  • πŸ“Š Academic Research on public data
  • πŸ” Government Transparency Auditing
  • πŸ“ˆ Public Market Analysis

🤝 Contributing

  1. Fork the project
  2. Create a branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Open a Pull Request

📄 License

MIT License - see LICENSE for details.

🔧 Troubleshooting

Common Installation Issues

Issue: pip: command not found or pip3: command not found
Solution: Use Python module syntax:

python3 -m pip install -r requirements.txt

Issue: jupyter: command not found
Solution: Use Python module syntax:

# Install jupyter if needed
python3 -m pip install jupyter
# Then run via python3 -m
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

Issue: ModuleNotFoundError: No module named 'requests'
Solution: Install requirements properly:

cd scraper-etico
python3 -m pip install -r requirements.txt

Issue: Permission errors on macOS/Linux
Solution: Use a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
python3 -m pip install -r requirements.txt

⚠️ Disclaimer

This tool was developed for ethical and educational use. Users are responsible for:

  • Respecting website terms of service
  • Checking the legality of scraping in their jurisdiction
  • Not overloading servers
  • Using only for legitimate purposes

🆘 Support

  • 📖 Documentation: See notebooks/ and README.md
  • 🐛 Issues: Open an issue on GitHub
  • 💬 Discussions: Use GitHub Discussions
  • 📧 Contact: Open an issue

🌟 Ethical Scraping is Responsible Scraping 🌟

"With great power comes great responsibility" - Use this library to build a more respectful and collaborative internet.

Made with ❤️ for ethical web scraping
