Thank you for your interest in contributing to the Gutenberg scraper! This document provides guidelines for contributing to the project.
For general openZIM contribution guidelines, see the openZIM Contributing Wiki.
The project consists of several components:
- scraper/: Python scraper that downloads books and generates ZIM files
- ui/: Vue.js frontend that provides the user interface within the ZIM
- locales/: UI translation files (multiple languages supported)
- scraper/docs/: Technical documentation
  - JSON_FILE_STRUCTURE.md: JSON schema documentation for the Vue.js UI
  - GUTENBERG_STRUCTURE.md: Project Gutenberg structure and metadata documentation
UI translations are managed through TranslateWiki. We welcome volunteers to contribute translations in their native languages.
When a new language <new_code> starts being translated, developers need to add support for it:
- Add to ui/src/plugins/i18n.ts:
  - Add the language to the supportedLanguages dictionary
  - Specify its native name and whether it's RTL (right-to-left)
- Update locale files:
  - Add languageNames.<new_code> key in locales/en.json
  - Add languageNames.<new_code> key in locales/qqq.json (documentation)
  - Add languageNames.<new_code> key in locales/<new_code>.json
Example: See commit adding Hindi support
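The locale-file step above can be checked mechanically. The following is a minimal sketch, not part of the project: `has_language_name` and `missing_language_names` are hypothetical helpers, and because this guide does not say whether `languageNames.<new_code>` is stored as a nested object or as a flat dotted key, the sketch accepts either layout.

```python
import json
from pathlib import Path


def has_language_name(data: dict, code: str) -> bool:
    """Return True if `data` defines languageNames.<code>.

    Assumption: the key may be nested ({"languageNames": {"hi": ...}})
    or flat ({"languageNames.hi": ...}); both layouts are accepted.
    """
    nested = data.get("languageNames")
    if isinstance(nested, dict) and code in nested:
        return True
    return f"languageNames.{code}" in data


def missing_language_names(locales_dir: Path, code: str) -> list[str]:
    """List the locale files that lack a languageNames.<code> entry."""
    missing = []
    for name in ("en.json", "qqq.json", f"{code}.json"):
        path = locales_dir / name
        if not path.is_file():
            # A file that does not exist yet also counts as missing the key.
            missing.append(name)
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        if not has_language_name(data, code):
            missing.append(name)
    return missing
```

Called as `missing_language_names(Path("locales"), "<new_code>")` from the repository root, it should return an empty list once all three files carry the new key.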
The scraper is located in scraper/src/gutenberg2zim/. Key files:
- entrypoint.py: CLI argument parsing
- zim.py: ZIM file creation
- download.py: Book downloading logic
- export.py: JSON generation for Vue.js UI
- rdf.py: RDF metadata parsing
Setup:
cd scraper
pip install hatch
hatch shell
Testing:
hatch run test:run
Linting:
Linux/macOS:
hatch run lint:all
Windows (hatch scripts don't work due to pty limitation):
black src
ruff check src
Type Checking:
hatch run check:all
Documentation:
Before contributing, familiarize yourself with these key documents:
- JSON File Structure: Detailed specification of the JSON schema used by the Vue.js UI. Essential reading if you're working on data export (export.py) or the Vue.js frontend. Explains the two-tier architecture (preview + detail files), file naming conventions, and loading strategies.
- Gutenberg Structure: Comprehensive overview of the project architecture, including directory structure, Python-to-Vue.js data flow, Pydantic schemas, and design decisions. Useful for understanding how the scraper and UI work together.
The UI is located in ui/src/. Key directories:
- views/: Page components (Home, Book Detail, Author List, etc.)
- components/: Reusable components
- stores/: Pinia state management
- router/: Vue Router configuration
- plugins/: i18n and Vuetify setup
Setup:
cd ui
npm install
Development:
npm run dev
Build:
npm run build
Linting:
npm run lint
When developing the UI, you need JSON assets (books.json, authors.json, etc.) generated by the scraper. Here's the recommended workflow:
1. Build the Docker image:
docker build -t local-gutenberg .
2. Generate a small ZIM file with JSON assets:
docker run --rm -it -v "$PWD/output":/output \
local-gutenberg \
gutenberg2zim --books 1,2,3 --languages en --formats html \
--zim-file gutenberg_dev --output /output
Adjust --books, --languages, and --formats to match your test dataset.
3. Extract the assets:
# Clean previous assets
find ui/public/ -mindepth 1 ! -name ".gitignore" -delete
# Extract from ZIM
docker run -it --rm -v "$(pwd)/output":/data ghcr.io/openzim/zim-tools:latest \
zimdump dump --dir=/data/gutenberg_dev /data/gutenberg_dev.zim
# Move to UI public folder
mv output/gutenberg_dev/* ui/public/
rm -rf output/gutenberg_dev
On Windows, run these commands in WSL or adapt them to PowerShell.
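For contributors who cannot run `find` directly, the cleanup command in this step can be approximated in Python. This is a rough sketch under stated assumptions, not a script shipped with the project: `clean_public_dir` is a hypothetical helper, it assumes the tree contains no symbolic links, and it keeps any directory that still holds a `.gitignore` file.

```python
import os
from pathlib import Path


def clean_public_dir(public_dir: Path, keep_name: str = ".gitignore") -> None:
    """Delete everything under `public_dir` except files named `keep_name`.

    Roughly mirrors: find ui/public/ -mindepth 1 ! -name ".gitignore" -delete
    Assumption: the tree contains no symbolic links.
    """
    # Walk bottom-up so children are handled before their parent directory.
    for root, dirs, files in os.walk(public_dir, topdown=False):
        for name in files:
            if name != keep_name:
                (Path(root) / name).unlink()
        for name in dirs:
            path = Path(root) / name
            # Only remove directories that ended up empty; a directory
            # that still contains a kept file stays in place.
            if not any(path.iterdir()):
                path.rmdir()
```

Calling `clean_public_dir(Path("ui/public"))` from the repository root would then stand in for the first command of this step. The extraction and move commands still need Docker and a shell.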
4. Start the UI development server:
cd ui
npm install
npm run dev
The UI will be available at http://localhost:5173 with hot reload.
Important: Clean ui/public/ before building the Docker image again to avoid shipping extracted assets in production.
- Follow PEP 8 style guide
- Use ruff for linting (configured in pyproject.toml)
- Use black for formatting
- Run hatch run lint:all before committing

- Follow the project's ESLint configuration
- Use Prettier for formatting (configured in .prettierrc.json)
- Run npm run lint before committing
- Use TypeScript for type safety
- Use TAB indentation (not spaces)
- Use CRLF line endings (Windows style)
- Keep keys sorted alphabetically
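The three formatting rules above can be applied in one pass when a JSON file is written programmatically. The sketch below assumes the rules refer to the JSON locale files. `format_locale_json` and `write_locale_json` are hypothetical helpers, and the trailing line ending is an assumption this guide does not state. Note that `sort_keys=True` orders keys by code point, which can differ from a case-insensitive alphabetical order.

```python
import json


def format_locale_json(data: dict) -> str:
    """Serialize `data` with TAB indentation, sorted keys and CRLF line endings."""
    text = json.dumps(data, indent="\t", sort_keys=True, ensure_ascii=False)
    # json.dumps emits "\n"; convert to CRLF and end with one final CRLF
    # (the final line ending is an assumption, not a documented rule).
    return text.replace("\n", "\r\n") + "\r\n"


def write_locale_json(path: str, data: dict) -> None:
    """Write `data` to `path` without altering the CRLF pairs."""
    # newline="" stops Python from translating line endings a second time.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(format_locale_json(data))
```

Comparing a file's current content with `format_locale_json(json.loads(content))` is one way to spot a formatting drift before committing.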
- Create a feature branch from main
- Write clear commit messages describing what and why
- Test your changes thoroughly
- Update documentation if needed (README, CONTRIBUTING, etc.)
- Ensure all tests pass and linting is clean
- Keep PRs focused - one feature or fix per PR
- Rebase and squash commits before submitting to keep history clean
- Issues: Check existing issues or create a new one
- Discussions: Use GitHub Discussions for questions
- Code Review: Maintainers will review PRs and provide feedback
By contributing, you agree that your contributions will be licensed under the GPLv3 license.