A Python library for ethical web scraping that automatically respects robots.txt rules, implements rate limiting, and provides advanced analysis tools for public price monitoring.
- Automatic robots.txt verification before any access
- Configurable rate limiting to respect servers
- Customizable user-agent for proper identification
- Advanced analysis of robots.txt files (crawl-delay, sitemaps, rules)
- Batch processing with parallelization support
- Report export in CSV and JSON formats
- Robust error handling for network failures
- Detailed logs with timestamps for auditing
- Smart caching to avoid unnecessary downloads
# Clone the repository
git clone https://github.com/dindicoelho/scraper-etico.git
cd scraper-etico
# Install dependencies (choose one option)
pip3 install -r requirements.txt
# OR if pip3 not found:
python3 -m pip install -r requirements.txt
# OR using conda:
# conda install requests
# Configure your credentials
cp production_config.example.py production_config.py
nano production_config.py # Edit with your data
# Test installation - run this file:
python3 example_usage.py

# 1. Test basic functionality:
python3 example_usage.py
# 2. Run interactive tutorial:
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb
# 3. Custom scraping with your own URLs:
python3 custom_scraping.py

# 1. Configure your sites first:
cp production_config.example.py production_config.py
nano production_config.py # Add your real sites and settings
# 2. Run production scraping:
python3 run_production.py
# 3. Analyze results:
python3 analyze_results.py

# Run all tests:
python3 tests/run_tests.py
# Test with your specific sites:
python3 tests/examples/my_monitoring.py

from src.scraper_etico import ScraperEtico

# Create scraper instance
scraper = ScraperEtico(
    user_agent="MyBot/1.0 (+http://mysite.com/contact)",
    default_delay=2.0
)

# Check if a URL can be accessed
can_access = scraper.can_fetch("https://example.com/page")
print(f"Can access: {can_access}")

# Make request with automatic rate limiting
response = scraper.get("https://example.com/page")
if response:
    print(f"Content retrieved: {len(response.text)} characters")

from src.batch_processor import BatchProcessor
from src.scraper_etico import ScraperEtico

# URLs to verify
urls = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
    "https://www.gov.br/economia"
]

# Process in batch
processor = BatchProcessor()
processor.scraper = ScraperEtico(
    user_agent="MyBot/1.0",
    default_delay=5.0  # Respectful for gov sites
)
job_state = processor.process_batch(urls, max_workers=2)

# Automatic export to CSV and JSON
processor.export_to_csv(job_state, "results.csv")
processor.export_to_json(job_state, "results.json")

Run the interactive tutorial notebook:
# First install jupyter (if not already installed)
pip3 install jupyter
# OR: python3 -m pip install jupyter
# Then run the tutorial (choose one option)
cd notebooks/
jupyter notebook ethical_scraper_tutorial.ipynb
# OR if jupyter command not found:
python3 -m jupyter notebook ethical_scraper_tutorial.ipynb

scraper_etico/
├── EXECUTABLE FILES (run these):
│   ├── example_usage.py          # python3 example_usage.py
│   ├── custom_scraping.py        # python3 custom_scraping.py
│   ├── run_production.py         # python3 run_production.py
│   └── analyze_results.py        # python3 analyze_results.py
│
├── CONFIGURATION FILES (edit these):
│   ├── production_config.example.py   # Copy to production_config.py
│   └── requirements.txt               # Dependencies list
│
├── TUTORIAL & TESTS:
│   ├── notebooks/ethical_scraper_tutorial.ipynb   # jupyter notebook
│   ├── tests/run_tests.py                         # python3 tests/run_tests.py
│   ├── tests/production_test.py                   # python3 tests/production_test.py
│   └── tests/examples/my_monitoring.py            # python3 tests/examples/my_monitoring.py
│
├── LIBRARY CODE (don't edit):
│   ├── src/scraper_etico.py      # Main scraping class
│   ├── src/analyzer.py           # robots.txt analysis
│   ├── src/batch_processor.py    # Batch processing
│   └── src/utils.py              # Utility functions
│
└── RESULTS FOLDERS:
    ├── production_data/          # Your scraping results
    ├── data_backup/              # Automatic backups
    ├── logs/                     # Execution logs
    └── batch_states/             # Job resumption data
# Testing & Learning:
python3 example_usage.py # Test basic functionality
python3 custom_scraping.py # Custom URL scraping
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb
# Production:
python3 run_production.py # Main production script
python3 analyze_results.py # Analyze results
# Testing:
python3 tests/run_tests.py # Run all tests
python3 tests/production_test.py # Test production config

ScraperEtico (src/scraper_etico.py)
Responsibility: Individual ethical scraping with automatic compliance
Features:
- Automatic robots.txt verification
- Intelligent rate limiting per domain
- robots.txt caching for performance
- Detailed logging with timestamps
- Robust network error handling
- Automatic crawl-delay detection
Execution flow:
URL Input → Robots.txt Check → Rate Limiting → HTTP Request → Response
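The first two stages of this flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/scraper_etico.py: the function names (`build_parser`, `seconds_to_wait`, `plan_request`) and the sample robots.txt text are hypothetical, and the HTTP request itself is left out so the sketch runs without network access.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory (no network needed).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def seconds_to_wait(last_request: Optional[float], delay: float, now: float) -> float:
    """Rate-limiting step: time still to wait before the next request."""
    if last_request is None:
        return 0.0
    return max(0.0, delay - (now - last_request))


def plan_request(
    parser: RobotFileParser,
    user_agent: str,
    url: str,
    default_delay: float,
    last_request: Optional[float],
    now: float,
) -> Tuple[bool, float]:
    """Robots.txt check, then rate limiting. Returns (allowed, wait_seconds)."""
    if not parser.can_fetch(user_agent, url):
        return False, 0.0
    # Honor whichever is stricter: the configured delay or the site's crawl-delay.
    crawl_delay = parser.crawl_delay(user_agent)
    delay = max(default_delay, float(crawl_delay or 0))
    return True, seconds_to_wait(last_request, delay, now)
```

A caller would sleep for the returned wait time and only then issue the HTTP request; a disallowed URL is never requested at all.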
RobotsAnalyzer (src/analyzer.py)
Responsibility: Advanced and detailed robots.txt file analysis
Features:
- Complete robots.txt parsing with validation
- Detailed statistics per user-agent
- Sitemap extraction and validation
- Comparison between different user-agents
- Restrictiveness and pattern analysis
- Formatted report generation
Use cases:
- robots.txt policy auditing
- Scraping strategy planning
- robots.txt change monitoring
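To give a feel for this kind of analysis, here is a small sketch that compares how one robots.txt file treats different user-agents and lists its sitemaps. It relies only on `urllib.robotparser` from the standard library (`site_maps()` requires Python 3.8+); the `summarize_robots` helper, the sample file, and the bot names are assumptions made for illustration and do not reflect the analyzer's real API.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for this illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

User-agent: FriendlyBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def summarize_robots(robots_text, user_agents, probe_url):
    """Return sitemaps plus, per user-agent, access to probe_url and crawl-delay."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        # site_maps() returns None when the file declares no sitemaps.
        "sitemaps": parser.site_maps() or [],
        "agents": {
            ua: {
                "can_fetch": parser.can_fetch(ua, probe_url),
                "crawl_delay": parser.crawl_delay(ua),
            }
            for ua in user_agents
        },
    }


if __name__ == "__main__":
    report = summarize_robots(
        SAMPLE, ["FriendlyBot", "OtherBot"], "https://example.com/admin/x"
    )
    for agent, info in report["agents"].items():
        print(agent, info)
    print("Sitemaps:", report["sitemaps"])
```

Running the same comparison on a schedule and diffing the reports is one simple way to approach robots.txt change monitoring.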
BatchProcessor (src/batch_processor.py)
Responsibility: Concurrent and efficient processing of multiple URLs
Features:
- Concurrent processing with ThreadPoolExecutor
- Persistent state for interrupted job resumption
- Real-time progress bar (tqdm)
- Automatic export to CSV and JSON
- Detailed statistical reports
- Coordinated rate limiting between threads
- Integration with robots.txt analysis
Threading architecture:
BatchProcessor → ThreadPoolExecutor → [Worker1, Worker2, Worker3...] → Domain Locks → ScraperEtico
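A rough sketch of this arrangement, assuming one lock per domain so that workers targeting the same host take turns while different hosts proceed in parallel. The names `lock_for` and `process_batch` and the `fetch` callable are hypothetical stand-ins; the real BatchProcessor may organize this differently.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# One lock per domain; _guard protects the dictionary of locks itself.
_domain_locks = defaultdict(threading.Lock)
_guard = threading.Lock()


def lock_for(url):
    """Return the lock shared by every URL on the same domain."""
    domain = urlparse(url).netloc
    with _guard:
        return _domain_locks[domain]


def process_batch(urls, fetch, max_workers=3):
    """Run fetch(url) on a thread pool, at most one request per domain at a time."""

    def worker(url):
        with lock_for(url):
            return url, fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions.
        return dict(pool.map(worker, urls))
```

In the architecture above, `fetch` would be the scraper's request method, so each call still passes through the robots.txt check and rate limiting.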
The system implements multiple layers of thread-safe rate limiting:
1. Global Rate Limiting (ScraperEtico)
↓
2. Domain-specific Rate Limiting (BatchProcessor)
↓
3. Robots.txt Crawl-Delay Compliance
↓
4. Thread-safe Domain Locks
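One way to picture how these layers could combine: take the strictest (largest) of the global, per-domain, and crawl-delay values, and reserve each domain's next request slot under a lock. The `DomainRateLimiter` class below is a hypothetical sketch, not the library's implementation; the injectable `clock` and `sleep` parameters are assumptions added so the behavior can be checked without real waiting.

```python
import threading
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Thread-safe limiter that spaces out requests per domain."""

    def __init__(self, global_delay, domain_delays=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.global_delay = global_delay
        self.domain_delays = dict(domain_delays or {})
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request
        self._lock = threading.Lock()

    def effective_delay(self, domain, crawl_delay=None):
        """Strictest of the global, per-domain, and robots.txt crawl-delay values."""
        return max(
            self.global_delay,
            self.domain_delays.get(domain, 0.0),
            float(crawl_delay or 0.0),
        )

    def acquire(self, url, crawl_delay=None):
        """Reserve the next slot for the URL's domain and wait for it.

        The slot is booked while holding the lock, but the sleep happens
        outside it, so other domains are not blocked. Returns the wait time.
        """
        domain = urlparse(url).netloc
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(domain, now))
            self._next_allowed[domain] = start + self.effective_delay(domain, crawl_delay)
        wait = start - now
        if wait > 0:
            self._sleep(wait)
        return wait
```

Each worker thread would call `acquire(url)` just before its request; two threads hitting the same domain receive consecutive slots, while a thread on another domain proceeds immediately.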
URL List → BatchProcessor → JobState
↓
ThreadPoolExecutor → Worker Threads
↓
Domain Rate Limiting → ScraperEtico
↓
Robots.txt Check → HTTP Request → Response
↓
RobotsAnalyzer → BatchResult
↓
Persistent State → Exports (CSV/JSON) → Reports
The system maintains persistent state using .pkl files in batch_states/:
- JobState: Job metadata and progress
- Completed URLs: URLs already processed
- Failed URLs: URLs that failed
- Results: Detailed results for each URL
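The idea of resumable state can be sketched with `pickle` and a plain dictionary. The field names mirror the list above, but the layout, the helper names (`new_state`, `save_state`, `load_state`, `pending_urls`), and the write-to-temp-then-rename step are assumptions for illustration; the library's actual JobState format may differ. Since unpickling can execute arbitrary code, only load state files you created yourself.

```python
import pickle
from pathlib import Path


def new_state(job_id):
    """Create an empty job state (hypothetical layout)."""
    return {
        "job_id": job_id,
        "completed_urls": set(),
        "failed_urls": set(),
        "results": {},
    }


def save_state(state, directory):
    """Write the state to <directory>/<job_id>.pkl via a temporary file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{state['job_id']}.pkl"
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(state, fh)
    # Swap the finished file into place so a crash never leaves a half-written .pkl.
    tmp.replace(path)
    return path


def load_state(path):
    """Load a previously saved job state."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def pending_urls(state, urls):
    """URLs that still need processing when a job is resumed."""
    done = state["completed_urls"] | state["failed_urls"]
    return [u for u in urls if u not in done]
```

On restart, a job would load its state file and process only `pending_urls(state, urls)`, saving again after each batch of results.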
Implemented optimizations:
- Smart robots.txt caching
- HTTP connection reuse with requests.Session
- Concurrent processing with a thread pool
- Scalable thread pool (configurable number of workers)
- Persistent state for long jobs
- Thread-safe domain locking
Recommended limits:
- Max workers: 5-10 (depending on target server)
- Default delay: 1-2 seconds (5s+ for gov sites)
- Timeout: 30 seconds
- Batch size: 100-1000 URLs
- Copy the example file:
cp production_config.example.py production_config.py
- Edit your settings:
# Identify your bot properly
USER_AGENT = "MyProject/1.0 (+https://mysite.com; contact@mysite.com)"

# Configure respectful delays
DEFAULT_DELAY = 5.0  # Government sites

# Add your sites
PRODUCTION_SITES = [
    "https://portal.compras.gov.br",
    "https://transparencia.gov.br",
    # Your sites...
]
- Execute production:
python3 run_production.py
# 1. Run all automated tests:
python3 tests/run_tests.py
# 2. Test production configuration:
python3 tests/production_test.py
# 3. Test with your specific sites (edit this file first):
python3 tests/examples/my_monitoring.py

# 1. Automatic analysis of all results:
python3 analyze_results.py
# 2. Open results in Excel/Google Sheets:
open production_data/monitoring_*.csv
# OR on Linux: xdg-open production_data/monitoring_*.csv

This library strictly follows ethical web scraping best practices:

Do:
- Check robots.txt before any access
- Use adequate delays between requests (minimum 1s; 5s+ for gov sites)
- Identify your bot with a descriptive user-agent
- Include contact information in the user-agent
- Monitor logs to detect problems
- Respect rate limits and specified crawl-delays

Don't:
- Ignore robots.txt rules
- Make excessive simultaneous requests
- Use fake browser user-agents
- Scrape 24/7 without breaks
- Ignore HTTP or network errors
- Public Bidding Monitoring
- Government Price Analysis
- Academic Research on public data
- Government Transparency Auditing
- Public Market Analysis
- Fork the project
- Create a branch (git checkout -b feature/new-feature)
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/new-feature)
- Open a Pull Request
MIT License - see LICENSE for details.
Issue: pip: command not found or pip3: command not found
Solution: Use Python module syntax:
python3 -m pip install -r requirements.txt

Issue: jupyter: command not found
Solution: Use Python module syntax:
# Install jupyter if needed
python3 -m pip install jupyter
# Then run via python3 -m
python3 -m jupyter notebook notebooks/ethical_scraper_tutorial.ipynb

Issue: ModuleNotFoundError: No module named 'requests'
Solution: Install requirements properly:
cd scraper-etico
python3 -m pip install -r requirements.txt

Issue: Permission errors on macOS/Linux
Solution: Use a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
python3 -m pip install -r requirements.txt

This tool was developed for ethical and educational use. Users are responsible for:
- Respecting website terms of service
- Checking the legality of scraping in their jurisdiction
- Not overloading servers
- Using only for legitimate purposes
- Documentation: See notebooks/ and README.md
- Issues: Open an issue on GitHub
- Discussions: Use GitHub Discussions
- Contact: Open an issue
"With great power comes great responsibility" - Use this library to build a more respectful and collaborative internet.
Made with ❤️ for ethical web scraping