We release patches for security vulnerabilities in the following versions:
| Version | Supported |
|---|---|
| 0.1.x | ✅ |
We take the security of zbze-crawler seriously. If you believe you have found a security vulnerability, please report it to us as described below.
Please do not report security vulnerabilities through public GitHub issues.
Instead, please report them via email to:
- Email: a.panagoa@gmail.com
- Subject: [SECURITY] zbze-crawler vulnerability report
Please include the following information in your report:
- Type of issue (e.g., injection vulnerability, data leakage, authentication bypass)
- Full paths of source file(s) related to the manifestation of the issue
- The location of the affected source code (tag/branch/commit or direct URL)
- Any special configuration required to reproduce the issue
- Step-by-step instructions to reproduce the issue
- Proof-of-concept or exploit code (if possible)
- Impact of the issue, including how an attacker might exploit it
After you submit a report, you can expect the following:
- Acknowledgment: We will acknowledge your email within 48 hours
- Initial Assessment: We will provide an initial assessment within 5 business days
- Updates: We will keep you informed of the progress towards a fix
- Resolution: We will notify you when the issue is fixed
- Disclosure: We will coordinate public disclosure with you
When developing spiders, please follow these security guidelines:
- Input Validation:
  - Validate and sanitize all extracted data
  - Be cautious with URLs and external links
  - Don't execute extracted code
- Credential Management:
  - Never hardcode credentials in spider code
  - Use environment variables or secure configuration files
  - Add credential files to `.gitignore`
- Data Storage:
  - Ensure proper file permissions for the data directory
  - Don't store sensitive data in plain text
  - Be mindful of copyright and privacy when collecting data
- Network Security:
  - Use HTTPS when available
  - Verify SSL certificates in production
  - Respect robots.txt and rate limits
  - Don't bypass authentication mechanisms
- Code Injection Prevention:
  - Never use `eval()` on extracted data
  - Sanitize data before database insertion
  - Use parameterized queries if using SQL databases
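The URL and SQL guidelines above can be sketched as follows. This is a minimal illustration built on Python's standard `urllib.parse` and `sqlite3` modules, not the project's actual storage code. The `urls` table and the helper names `is_http_url`, `store_url`, and `make_demo_connection` are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_http_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def store_url(conn: sqlite3.Connection, url: str) -> bool:
    """Insert a validated URL using a parameterized query.

    Returns False (and stores nothing) if the URL fails validation.
    """
    if not is_http_url(url):
        return False
    # The "?" placeholder makes sqlite3 bind the value, so quotes in the
    # URL are stored as data instead of being interpreted as SQL.
    conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return True


def make_demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `urls` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urls (url TEXT NOT NULL)")
    return conn
```

With this approach, a URL containing quote characters is stored verbatim, while values such as `javascript:` or `file:` links are rejected before they reach the database.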
Bad Practice:

```python
# DON'T DO THIS
def parse_item(self, response):
    # Dangerous: executing extracted code
    code = response.css('script::text').get()
    eval(code)  # NEVER DO THIS

    # Dangerous: SQL injection risk
    url = response.url
    query = f"INSERT INTO urls VALUES ('{url}')"  # VULNERABLE
```

Good Practice:

```python
# DO THIS INSTEAD
def parse_item(self, response):
    # Safe: just extract and store data
    title = response.css('h1::text').get()
    content = ''.join(response.css('div.content ::text').getall())

    # Safe: use proper database methods
    item = {
        'url': response.url,
        'title': title,
        'content': content,
    }
    self.save_to_db(item)  # Uses upsert with query
    yield item
```

The project uses HTTP caching to avoid re-downloading content. Cached files are stored locally in:
zbze_scrapy/.scrapy/httpcache/
Security Implications:
- Cached responses may contain sensitive data
- Cache directory should have appropriate file permissions
- Consider clearing cache periodically
Mitigation:
- Cache directory is in `.gitignore`
- Users should secure their local environment
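Periodic cache clearing could look roughly like the sketch below. It is not part of the project. The function name `purge_stale_files` and the age threshold are assumptions, and it deletes files based only on their modification time.

```python
import time
from pathlib import Path


def purge_stale_files(cache_dir: Path, max_age_days: float) -> int:
    """Delete regular files under cache_dir not modified within max_age_days.

    Returns the number of files removed. Directories are left in place.
    """
    if not cache_dir.is_dir():
        return 0
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not affect iteration.
    for path in list(cache_dir.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

For example, `purge_stale_files(Path("zbze_scrapy/.scrapy/httpcache"), max_age_days=30)` would remove cached files older than 30 days; the 30-day value is arbitrary. Scrapy's `HTTPCACHE_EXPIRATION_SECS` setting is a separate mechanism that makes Scrapy re-download cached responses older than the configured age.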
Collected data is stored in the `data/` directory (JSON Lines and TinyDB files).
Security Implications:
- May contain personal data from public sources
- Files are stored unencrypted
- Sensitive information from crawled pages may be stored
Mitigation:
- Data directory is in `.gitignore`
- Users are responsible for securing their data
- Follow copyright and privacy guidelines
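One way to reduce exposure of unencrypted output files is to create them with owner-only permissions. The sketch below assumes a POSIX system (permission bits behave differently on Windows). The helper name `append_jsonl_private` is hypothetical and does not reflect how the project currently writes its files.

```python
import json
import os
from pathlib import Path
from typing import Iterable


def append_jsonl_private(path: Path, items: Iterable[dict]) -> None:
    """Append items to a JSON Lines file that only the owner can read/write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode 0o600 applies when the file is newly created; the umask can only
    # remove bits, so the file is never created group/world readable.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    with os.fdopen(fd, "a", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item, ensure_ascii=False) + "\n")
    # Tighten permissions on files that already existed with a looser mode.
    os.chmod(path, 0o600)
```

This restricts who can read the file on a shared machine but does not encrypt its contents.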
This project depends on:
- Scrapy
- BeautifulSoup4
- Requests
- TinyDB
Security Implications:
- Vulnerabilities in dependencies affect this project
Mitigation:
- Keep dependencies updated
- Monitor security advisories
- Use `pip-tools` for reproducible builds
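When checking a security advisory, it helps to know which versions are actually installed. The following sketch uses the standard `importlib.metadata` module. The distribution names are assumed to be the usual PyPI names for the dependencies listed above, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the dependencies listed above.
DEPENDENCIES = ("Scrapy", "beautifulsoup4", "requests", "tinydb")


def installed_versions(names=DEPENDENCIES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result
```

The resulting mapping can be compared by hand against the affected version ranges in an advisory.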
Important:
- Always review website terms of service
- Respect robots.txt
- Honor copyright and intellectual property rights
- Only use data for academic/research purposes as stated
Before submitting code, ensure:
- No hardcoded credentials or API keys
- No sensitive data in code or commits
- Input validation for all extracted data
- No use of `eval()` or `exec()`
- Proper error handling (no information leakage)
- Dependencies are up to date
- No SQL injection vulnerabilities
- File paths are validated
- HTTP caching doesn't expose sensitive data
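The `eval()`/`exec()` item in this checklist can be partly automated. The sketch below uses Python's standard `ast` module to flag direct calls. It is an illustrative helper, not an existing project tool, and it will not detect aliased or indirect uses such as `builtins.eval`.

```python
import ast

FORBIDDEN_CALLS = {"eval", "exec"}


def find_forbidden_calls(source: str, filename: str = "<string>") -> list:
    """Return sorted (line, name) pairs for direct calls to eval() or exec()."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FORBIDDEN_CALLS
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)
```

Running it over each spider file's source text before a commit gives a quick first pass; manual review is still needed for the other checklist items.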
Security updates will be:
- Released as patch versions (0.1.x)
- Documented in CHANGELOG.md
- Announced via GitHub releases
- Emailed to reporters
When we receive a security bug report, we will:
- Confirm the problem and determine affected versions
- Audit code to find similar problems
- Prepare fixes for all supported versions
- Release patches as soon as possible
- Publicly disclose the vulnerability
If you have suggestions on how this process could be improved, please submit a pull request or email a.panagoa@gmail.com.
Last Updated: January 2025