Web Crawler for Collecting Kabardian Language Texts
zbze-crawler is a Scrapy-based web crawler designed to collect and preserve texts in the Kabardian language from publicly available online sources. This project supports linguistic research, language preservation, and the development of natural language processing tools for the Kabardian language.
- 🕷️ Multiple Spiders - Specialized crawlers for different Kabardian news sources
- 📰 News Collection - Automated harvesting of articles, publications, and journals
- 💾 Multiple Storage Formats - JSON Lines, TinyDB, and HTML archives
- 🔄 Incremental Crawling - HTTP caching to avoid re-downloading content
- 🎯 Respectful Crawling - robots.txt compliance and rate limiting
- 📊 Structured Data - Extracts title, content, author, date, and metadata
- 🌐 Production-Ready - Tested on thousands of pages
The crawler includes dedicated spiders for:
| Source | Description | Spider Name |
|---|---|---|
| Адыгэ Псалъэ | Electronic newspaper in Kabardian | apkbr_ru |
| Адыгэ Псалъэ RSS | RSS feed crawler | apkbr_ru_feed |
| Kabardino-Balkaria | News in Kabardian language | elgkbr_ru |
| Iуащхьэмахуэ (Elbrus) | Cultural journal | oshhamaho |
- Article Extraction
  - Title and headline
  - Publication date
  - Author information
  - Full text content
  - Source URL and metadata
- Storage Options
  - JSON Lines (.jsonl) - One article per line
  - TinyDB (.json) - Queryable document database
  - Raw HTML archives for backup
- Crawler Features
  - Auto-throttling to respect server limits
  - HTTP caching to avoid duplicate requests
  - Link extraction and following
  - Duplicate detection and prevention
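
To make the extracted fields and the duplicate-prevention idea concrete, here is a minimal sketch. It is not the project's actual item or pipeline code: the `Article` class, its field names, and the `fingerprint`, `to_json_line`, and `deduplicate` helpers are assumptions made for illustration, and keying duplicates on a hash of the normalized URL is only one simple choice among several.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Article:
    """Hypothetical record mirroring the fields listed above."""
    title: str
    content: str
    author: str
    date: str
    url: str


def fingerprint(url: str) -> str:
    """Hash a URL after dropping its fragment and lowercasing scheme and host."""
    parts = urlsplit(url)
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def to_json_line(article: Article) -> str:
    """Serialize one article as a single JSON Lines row, keeping Cyrillic readable."""
    return json.dumps(asdict(article), ensure_ascii=False)


def deduplicate(articles):
    """Yield each article once, keyed by URL fingerprint (first occurrence wins)."""
    seen = set()
    for article in articles:
        key = fingerprint(article.url)
        if key not in seen:
            seen.add(key)
            yield article
```

Passing `ensure_ascii=False` matters for this corpus: without it, `json.dumps` would escape every Cyrillic character as `\uXXXX`, which is valid JSON but hard to inspect with tools like `head`.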
System Requirements:
| Tool | Description | Min Version |
|---|---|---|
| Python | Programming language | 3.10+ |
| uv | Fast Python package installer | Latest |
Install uv:
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv
```

Using Makefile (recommended):
```bash
# Clone the repository
git clone https://github.com/zbze-org/zbze-crawler.git
cd zbze-crawler

# Create venv and install dependencies
make install

# Or install with dev dependencies
make install-dev
```

Manual installation with uv:
```bash
# Create virtual environment
uv venv .venv

# Activate virtual environment
source .venv/bin/activate   # Unix/macOS
# or
.venv\Scripts\activate      # Windows

# Install dependencies
uv pip install -r requirements.in
```

```bash
# List available spiders
make list-spiders
```

Expected output:

```
apkbr_ru
apkbr_ru_feed
elgkbr_ru
oshhamaho
```
Using Makefile (recommended):
```bash
# Run a single spider
make crawl-apkbr

# Or other spiders
make crawl-elgkbr
make crawl-oshhamaho

# Run all spiders
make crawl-all
```

Direct scrapy command:
```bash
# Activate virtual environment first
source .venv/bin/activate

# Navigate to Scrapy project directory
cd zbze_scrapy

# Run a single spider
scrapy crawl apkbr_ru

# Run with specific settings
scrapy crawl elgkbr_ru -s DOWNLOAD_DELAY=1
```

Collected data is saved to the data/ directory:
```
data/
├── apkbr_ru/
│   ├── apkbr_ru.jsonl    # JSON Lines format
│   ├── apkbr_ru.json     # TinyDB database
│   └── *.html            # Raw HTML files
├── elgkbr_ru/
│   └── ...
└── oshhamaho/
    └── ...
```
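
Given this layout, a short script can summarize how much each spider has collected. The sketch below assumes every spider directory follows the `<spider>/<spider>.jsonl` pattern shown for `apkbr_ru` (the other directories are elided above, so this is a guess), and `count_articles` is a hypothetical helper name rather than part of the project.

```python
from pathlib import Path


def count_articles(data_dir):
    """Map each spider directory name to the number of non-empty lines in its .jsonl file."""
    counts = {}
    for spider_dir in sorted(Path(data_dir).iterdir()):
        jsonl_path = spider_dir / f"{spider_dir.name}.jsonl"
        if not jsonl_path.is_file():
            continue  # no JSON Lines output for this entry yet
        with jsonl_path.open(encoding="utf-8") as f:
            counts[spider_dir.name] = sum(1 for line in f if line.strip())
    return counts


if __name__ == "__main__":
    data_dir = Path("data")
    if data_dir.is_dir():
        for name, total in count_articles(data_dir).items():
            print(f"{name}: {total} articles")
```

Counting non-empty lines works because JSON Lines stores exactly one article per line, so no JSON parsing is needed just to get totals.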
```bash
# View JSON Lines (one article per line)
head data/apkbr_ru/apkbr_ru.jsonl

# Query TinyDB (Python)
python -c "
from tinydb import TinyDB
db = TinyDB('data/apkbr_ru/apkbr_ru.json')
print(f'Total articles: {len(db.all())}')
"

# Count collected articles
wc -l data/apkbr_ru/apkbr_ru.jsonl
```

Configure crawling behavior in zbze_scrapy/settings.py:
```python
# Download delay between requests (seconds)
DOWNLOAD_DELAY = 0.25

# Concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Enable HTTP caching
HTTPCACHE_ENABLED = True

# Auto-throttle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

Create a new spider for additional sources:
```bash
cd zbze_scrapy
scrapy genspider example example.com
```

Then edit zbze_scrapy/spiders/example.py to customize extraction logic.
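
Returning to the throttle values in `settings.py`, their interaction is easy to misread. According to the Scrapy documentation, AutoThrottle starts at `AUTOTHROTTLE_START_DELAY`, then after each response moves the delay halfway toward `latency / AUTOTHROTTLE_TARGET_CONCURRENCY`, does not let a non-200 response lower it, and keeps the result between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`. The function below is a simplified model of those documented rules, not Scrapy's implementation; `next_delay` is a hypothetical name, and the target concurrency of 1.0 is Scrapy's default rather than a value taken from this project.

```python
def next_delay(prev_delay, latency, status=200, *,
               target_concurrency=1.0, min_delay=0.25, max_delay=60.0):
    """Simplified model of one AutoThrottle update step (all values in seconds)."""
    target = latency / target_concurrency
    candidate = (prev_delay + target) / 2.0
    # Non-200 responses may raise the delay but never lower it.
    if status != 200 and candidate < prev_delay:
        candidate = prev_delay
    # Clamp between DOWNLOAD_DELAY (min) and AUTOTHROTTLE_MAX_DELAY (max).
    return max(min_delay, min(candidate, max_delay))


if __name__ == "__main__":
    delay = 5.0  # AUTOTHROTTLE_START_DELAY
    for latency in (0.5, 0.5, 0.5):
        delay = next_delay(delay, latency)
        print(f"latency={latency:.2f}s -> next delay={delay:.4f}s")
```

With a steady 0.5 s latency, this model steps the delay from 5 s to 2.75 s, 1.625 s, and 1.0625 s, so the 5-second start delay only affects the first few requests. On a fast server the delay settles near the latency, and `DOWNLOAD_DELAY = 0.25` acts as the floor.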
```bash
#!/bin/bash
cd zbze_scrapy

# Run all spiders sequentially
for spider in apkbr_ru elgkbr_ru oshhamaho; do
    echo "Running $spider..."
    scrapy crawl "$spider"
done
```

```bash
# Add to crontab for daily collection at 2 AM
0 2 * * * cd /path/to/zbze-crawler/zbze_scrapy && scrapy crawl apkbr_ru
```

```bash
cd zbze_scrapy

# Export to CSV format (the format is inferred from the .csv extension)
scrapy crawl apkbr_ru -o ../data/apkbr_ru/output.csv
```

```python
import json

# Read the JSON Lines file: one JSON object (article) per line
articles = []
with open('data/apkbr_ru/apkbr_ru.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # skip blank lines
            articles.append(json.loads(line))

# Print statistics
print(f"Total articles: {len(articles)}")
if articles:
    print(f"Sample article: {articles[0]['title']}")
```

```
zbze-crawler/
├── zbze_scrapy/                  # Scrapy project
│   ├── zbze_scrapy/
│   │   ├── spiders/              # Spider definitions
│   │   │   ├── apkbr_ru.py
│   │   │   ├── apkbr_ru_feed.py
│   │   │   ├── elgkbr_ru.py
│   │   │   └── oshhamaho.py
│   │   ├── items.py              # Data models
│   │   ├── pipelines.py          # Processing pipelines
│   │   ├── middlewares.py        # Custom middlewares
│   │   └── settings.py           # Project settings
│   └── scrapy.cfg                # Scrapy configuration
│
├── data/                         # Collected data (gitignored)
│   ├── apkbr_ru/
│   ├── elgkbr_ru/
│   └── oshhamaho/
│
├── requirements.in               # Python dependencies
├── README.md                     # This file
├── README.ru.md                  # Russian version
└── LICENSE                       # License information
```
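
For orientation on what a file like `pipelines.py` typically holds: as documented for Scrapy, an item pipeline is a plain Python class with a `process_item(self, item, spider)` method and optional `open_spider` / `close_spider` hooks. The sketch below shows how a JSON Lines writer could look. It is not the project's actual pipeline: the class name `JsonLinesWriterPipeline`, the `base_dir` argument, and the `<spider>/<spider>.jsonl` output path are assumptions, and a relative `base_dir` resolves against whatever directory the crawl is started from.

```python
import json
from pathlib import Path


class JsonLinesWriterPipeline:
    """Hypothetical pipeline that appends each item to <base_dir>/<spider>/<spider>.jsonl."""

    def __init__(self, base_dir="data"):
        self.base_dir = Path(base_dir)
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: make sure the output directory exists.
        out_dir = self.base_dir / spider.name
        out_dir.mkdir(parents=True, exist_ok=True)
        self.file = (out_dir / f"{spider.name}.jsonl").open("a", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: release the file handle.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Cyrillic text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

In a real project a class like this would be switched on through Scrapy's `ITEM_PIPELINES` setting. Opening the file in append mode means repeated runs add to the existing corpus instead of overwriting it.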
Source: http://www.apkbr.ru
Information from this website may be used exclusively under the following conditions:
- A link to http://www.apkbr.ru must be provided at the end of the text
- Modification of texts is not permitted; text must be copied in its original form
- Removal of the link to this website from material texts is not allowed
Source: http://www.elgkbr.ru
Information from this website may be used exclusively under the following conditions:
- A link to http://www.elgkbr.ru must be provided at the end of the text
- Modification of texts is not permitted; text must be copied in its original form
- Removal of the link to this website from material texts is not allowed
Source: https://smikbr.ru
Information from this website may be used exclusively under the following conditions:
- A link to the SMI KBR Portal is MANDATORY when reprinting materials
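
When reusing collected texts, the conditions above come down to two mechanical rules: leave the text unmodified and end it with a link to the source. A minimal sketch of a helper that follows both rules is shown here; `with_attribution` and the `Source:` label format are hypothetical and not part of the project.

```python
def with_attribution(text: str, source_url: str) -> str:
    """Return the text unchanged, followed by a link to its source."""
    if not source_url:
        raise ValueError("a source URL is required")
    if text.rstrip().endswith(source_url):
        return text  # link already present at the end; do not add it twice
    return f"{text}\n\nSource: {source_url}"
```

The function only appends to the text and never edits it, so the original wording is preserved exactly as collected.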
Materials and data presented in this project are intended exclusively for academic use and research. They may be useful for linguists, researchers, and students interested in studying the Kabardian language, its structure, history, and development.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Add New Sources - Create spiders for additional Kabardian websites
- Improve Extraction - Enhance parsing logic for better data quality
- Report Issues - Submit bug reports and feature requests
- Documentation - Improve guides and examples
This project is licensed under the MIT License - see the LICENSE file for details.
Adam Panagov
- Email: a.panagoa@gmail.com
- GitHub: @zbze-org
- Source websites for providing valuable content in the Kabardian language
- Scrapy for the excellent web crawling framework
- The Kabardian language community for supporting preservation efforts
- zbze-org contributors for testing and feedback
This crawler is part of a four-project ecosystem for Kabardian language digitization:
| Project | Purpose | Repository |
|---|---|---|
| zbze-crawler (this) | Web crawler for collecting Kabardian texts from online sources | GitHub |
| tesseract-kbd-model | Distributable Tesseract OCR models for Kabardian language | GitHub |
| zbze_ocr | Training infrastructure with Airflow, notebooks, and data preparation | GitHub |
| zbze_ocr_cli | Production-ready OCR CLI tool with advanced image processing | GitHub |
Project Workflow:
```
zbze-crawler (this) → Data Collection
    ├── Collects web texts
    └── Provides corpus data
              ↓
zbze_ocr → Training Infrastructure
    ├── Trains models
    └── Exports to tesseract-kbd-model → Models
              ↓
zbze_ocr_cli → OCR Processing
```
Which project should I use?
- 📰 Want to collect Kabardian texts? → Use this repository (zbze-crawler)
- 🎓 Want to train OCR models? → Use zbze_ocr
- 📦 Just need the OCR models? → Use tesseract-kbd-model
- 🔧 Need to process scanned documents? → Use zbze_ocr_cli
- Scrapy - Fast and powerful web crawling framework
- TinyDB - Lightweight document-oriented database
- Kabardian Language Wikipedia - Language information
- zbze-org - Organization supporting Kabardian language digitization
- UNESCO Endangered Languages - List of vulnerable languages
Made with ❤️ for the Kabardian language community