This project is no longer maintained, developed, or supported.
Development ceased in August 2024. The infrastructure has been shut down, all download links are broken, and the scrapers are likely outdated. This repository is preserved as an educational reference only.
Open-source-legislation was an ambitious attempt to democratize access to global legislative data. The vision was to scrape, process, and standardize legislation from 50+ jurisdictions into a unified SQL format with LLM-ready embeddings, making it easy for developers to build legal applications without the typical barriers of accessing primary source legislative data.
- 10 complete jurisdiction scrapers with working SQL dumps at the time (AL, AZ, CA, CT, DE, FL, ID, IL, IN, VA)
- 50+ partial scrapers in various states of completion
- Unified PostgreSQL schema with pgvector support for embeddings
- Pydantic-based data models for validation and type safety
- 3-phase scraping architecture (Read → Scrape → Process)
- Hierarchical node-based legislation modeling with cross-corpus references
- Ask Abe AI - A legal education assistant that used this data (also shut down)
Everything that matters for production use is broken or gone:
- All SQL download links are dead - The Supabase storage hosting was shut down. Every download link in the old documentation returns 404.
- Ask Abe AI is down - The companion legal education application that validated this approach has been shut down.
- Scrapers are likely outdated - Government websites change their HTML structure constantly. Scrapers that worked in 2024 may not work anymore.
- No hosted database - There's no live database with current legislation data.
- No support or maintenance - Issues won't be addressed, pull requests won't be reviewed, and the code won't be updated.
- API costs - Generating embeddings requires OpenAI API access (costs money per run).
Honest reflection on what went wrong:
- Unsustainable maintenance burden - Legislative websites change constantly, requiring ongoing scraper updates. This is a full-time job disguised as a side project.
- Scope was too ambitious - 50+ jurisdictions × constant changes = endless work for a solo developer.
- Infrastructure costs - Hosting SQL dumps and running embedding generation APIs costs real money without a revenue model.
- No validated market need - Despite the noble goal, there wasn't sufficient user adoption (12 GitHub stars) to justify the effort.
- Developer moved on - I built this to support Ask Abe AI. When I stopped needing it, I stopped maintaining it.
Why this code is still worth looking at:
This codebase contains solid patterns for legislative web scraping and data modeling that may be useful as reference material:
The 3-phase pipeline is a clean pattern for large-scale scraping:
- Phase 1 (Read): Extract top-level links from table of contents
- Phase 2 (Scrape): Recursively scrape legislative content with multiple strategies
- Phase 3 (Process): Generate embeddings and establish node relationships
See src/1_SCRAPE_TEMPLATE/ for the template structure.
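To make the three phases concrete, here is a minimal sketch that runs against a hypothetical in-memory "site" instead of a real government website. The names `FAKE_SITE`, `Node`, `read_phase`, `scrape_phase`, and `process_phase` are made up for illustration and are not taken from the repository; the placeholder embedding stands in for a real embedding API call.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical in-memory "site": URL -> (title, text, child URLs).
FAKE_SITE = {
    "toc": ("Table of Contents", "", ["title-1", "title-2"]),
    "title-1": ("Title 1", "", ["title-1/sec-1"]),
    "title-2": ("Title 2", "", []),
    "title-1/sec-1": ("Section 1", "No person shall ...", []),
}


@dataclass
class Node:
    url: str
    title: str
    text: str
    parent: Optional[str]
    embedding: List[float] = field(default_factory=list)


def read_phase(toc_url):
    """Phase 1 (Read): collect top-level links from the table of contents."""
    return list(FAKE_SITE[toc_url][2])


def scrape_phase(urls, parent):
    """Phase 2 (Scrape): recursively visit each link, one node per page."""
    nodes = []
    for url in urls:
        title, text, children = FAKE_SITE[url]
        nodes.append(Node(url, title, text, parent))
        nodes.extend(scrape_phase(children, parent=url))
    return nodes


def process_phase(nodes):
    """Phase 3 (Process): attach an embedding to every node that has text."""
    for node in nodes:
        if node.text:
            # Placeholder value; a real run would call an embedding API here.
            node.embedding = [float(len(node.text))]
    return nodes


top_level = read_phase("toc")
nodes = process_phase(scrape_phase(top_level, parent="toc"))
```

Keeping the phases separate means a failed scrape can be rerun without repeating the table-of-contents read, and embeddings (the part that costs money) are only generated once the scraped nodes look right.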
src/utils/pydanticModels.py demonstrates:
- Hierarchical identifier systems (NodeID)
- Validation for complex nested legal structures
- Cross-reference tracking (NodeText with citations)
- Type-safe data pipelines
The database schema shows how to model hierarchical legislation as a graph:
- Node-based architecture (structure vs content nodes)
- Parent-child relationships with cross-corpus references
- Vector embeddings for semantic search
- JSONB for flexible metadata
See CLAUDE.md for detailed schema documentation.
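As a rough illustration of the parent-child modeling, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so it runs without any setup. The table and column names (`node`, `parent`, `node_type`, `node_text`, `metadata`) are assumptions for this example, not the project's actual schema; the embedding column is left out because SQLite has no vector type, and metadata is stored as a JSON string where PostgreSQL would use JSONB.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        id        TEXT PRIMARY KEY,
        parent    TEXT REFERENCES node(id),
        node_type TEXT CHECK (node_type IN ('structure', 'content')),
        node_text TEXT,
        metadata  TEXT  -- JSON string here; JSONB in PostgreSQL
    )
    """
)

rows = [
    ("us/ca/statutes", None, "structure", None, "{}"),
    ("us/ca/statutes/title=1", "us/ca/statutes", "structure", None, "{}"),
    (
        "us/ca/statutes/title=1/section=3",
        "us/ca/statutes/title=1",
        "content",
        "Example section text.",
        json.dumps({"status": "active"}),
    ),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)

# Walk from a content node up to the corpus root with a recursive CTE.
ancestors = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE chain(id, parent, depth) AS (
            SELECT id, parent, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent, c.depth + 1
            FROM node n JOIN chain c ON n.id = c.parent
        )
        SELECT id FROM chain ORDER BY depth
        """,
        ("us/ca/statutes/title=1/section=3",),
    )
]
```

The same recursive query shape works in PostgreSQL, which is one reason a plain `parent` column is often enough to represent the hierarchy.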
Multiple scraping strategies in src/scrapers/:
- Regular method: Separate functions per hierarchy level
- Recursive method: Single function for nested structures
- Stack method: Stack-based traversal for single-page hierarchies
- Selenium method: For JavaScript-heavy sites
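The stack method is the least obvious of the four, so here is a minimal sketch of the idea. It assumes the page has already been flattened into `(depth, label)` pairs in document order; `FLAT_HEADINGS` and `build_ids` are hypothetical names for this example, not functions from the repository.

```python
# Hypothetical flattened page: (heading depth, label) in document order.
FLAT_HEADINGS = [
    (1, "title=1"),
    (2, "chapter=1"),
    (3, "section=1"),
    (3, "section=2"),
    (2, "chapter=2"),
    (3, "section=3"),
]


def build_ids(headings, root="us/ca/statutes"):
    """Assign each heading a full path ID using a stack of open ancestors."""
    stack = []  # (depth, node_id) for every ancestor that is still open
    ids = []
    for depth, label in headings:
        # Close every ancestor at the same depth or deeper than this heading.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent_id = stack[-1][1] if stack else root
        node_id = f"{parent_id}/{label}"
        ids.append(node_id)
        stack.append((depth, node_id))
    return ids


ids = build_ids(FLAT_HEADINGS)
```

Because the stack only ever holds the current chain of ancestors, the whole page is processed in a single pass with no recursion.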
src/utils/ contains reusable helpers:
- scrapingHelpers.py: Retry logic, node insertion, duplicate handling
- processingHelpers.py: Batch embedding generation
- utilityFunctions.py: Database operations, API clients
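For a sense of what retry logic looks like in a scraper, here is a generic sketch with exponential backoff. It is not the code in scrapingHelpers.py; `with_retries` and `flaky_fetch` are hypothetical names, and the delays are kept tiny so the example runs quickly.

```python
import time


def with_retries(fetch, attempts=3, base_delay=0.01):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # a real scraper would catch narrower types
            last_error = error
            if attempt < attempts - 1:
                # Double the wait after every failed attempt.
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical page fetch that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


page = with_retries(flaky_fetch)
```

In practice the base delay would be seconds rather than milliseconds, both to ride out transient errors and to stay polite to government servers.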
You're welcome to fork and revive this project, but be aware of what it takes. You will need:
- PostgreSQL database with pgvector extension
- OpenAI API key for embedding generation (costs money)
- Time and patience to fix broken scrapers
- Ongoing maintenance commitment - this isn't a one-time setup
Expect the following:
- Scrapers will need updates - Government websites have likely changed since 2024
- No SQL dumps available - You'll need to run all scrapers yourself
- API costs add up - Generating embeddings for 50 states isn't cheap
- You're on your own - Don't expect support or guidance
If you still want to try:
# Clone the repository
git clone https://github.com/spartypkp/open-source-legislation.git
cd open-source-legislation
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up PostgreSQL with pgvector
# CREATE EXTENSION vector;
# Create .env file with credentials
# DB_NAME, DB_HOST, DB_USERNAME, DB_PASSWORD, DB_PORT
# OPENAI_API_KEY
# Try running a scraper (may not work)
cd src/scrapers/us/\(states\)/ca/statutes/
python readCA.py # Extract table of contents
python scrapeCA.py # Scrape content
python processCA.py # Generate embeddings
See CLAUDE.md for detailed technical documentation (note: it was written when the project was active, so adjust expectations accordingly).
I don't have great answers here, but some thoughts:
Legislative data scraping is genuinely hard. The challenges that killed this project aren't unique to me:
- Government websites aren't designed for programmatic access - They change constantly, use inconsistent formats, and sometimes actively prevent scraping
- No universal standard - Every jurisdiction does things differently
- Maintenance is the real cost - Initial scraping is easy; keeping it working is brutal
Possible approaches:
- Use official APIs when they exist - Some jurisdictions offer official legislative data APIs (rare but valuable)
- Focus on one jurisdiction - Don't try to do 50+ states like I did
- Check for existing services - There may be paid legal data providers that handle maintenance for you
- Use LLMs differently - Instead of pre-processing all legislation, use on-demand scraping + LLM analysis
I don't have specific service recommendations, but I'd encourage looking for solutions that someone else is maintaining.
Legislation is modeled as a hierarchical node graph:
- Country/Jurisdiction/Corpus structure (e.g., us/ca/statutes)
- Node types: Structure (chapters, titles) vs Content (sections with text)
- Hierarchical IDs: us/ca/statutes/title=1/chapter=2/section=3
- Cross-references: Nodes can reference other nodes within or across corpora
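A hierarchical ID in this format carries its own ancestry, which a few lines of string handling can recover. The helpers below (`parse_node_id`, `parent_id`) are hypothetical and only assume the `level=number` convention shown in the example ID; the real NodeID model in pydanticModels.py may work differently.

```python
def parse_node_id(node_id):
    """Split an ID into its corpus prefix and (level, number) pairs."""
    parts = node_id.split("/")
    # Parts without "=" name the country, jurisdiction, and corpus.
    corpus = [p for p in parts if "=" not in p]
    levels = [tuple(p.split("=", 1)) for p in parts if "=" in p]
    return "/".join(corpus), levels


def parent_id(node_id):
    """Return the ID one level up, or None at the corpus root."""
    head, _, tail = node_id.rpartition("/")
    return head if "=" in tail else None


corpus, levels = parse_node_id("us/ca/statutes/title=1/chapter=2/section=3")
parent = parent_id("us/ca/statutes/title=1/chapter=2/section=3")
```

Because the parent is just the ID with its last segment removed, parent-child links can be derived from the IDs alone and checked against whatever is stored in the database.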
The schema was designed for LLM applications:
- Pre-generated embeddings (via OpenAI API)
- Vector similarity search (pgvector)
- Structured metadata for prompt engineering
- Citation tracking for source attribution
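To show what vector similarity search does conceptually, the sketch below ranks nodes by cosine similarity in plain Python. This is a stand-in for the database-side search that pgvector provides, not a replacement for it; the three-dimensional vectors in `EMBEDDINGS` are made-up values, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings keyed by node ID.
EMBEDDINGS = {
    "us/ca/statutes/title=1/section=1": [0.9, 0.1, 0.0],
    "us/ca/statutes/title=1/section=2": [0.0, 1.0, 0.0],
    "us/ca/statutes/title=2/section=1": [0.7, 0.7, 0.0],
}


def top_matches(query_vector, k=2):
    """Return the IDs of the k nodes most similar to the query vector."""
    scored = sorted(
        EMBEDDINGS.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [node_id for node_id, _ in scored[:k]]


matches = top_matches([1.0, 0.0, 0.0])
```

In an application, the query vector would come from embedding the user's question with the same model used for the stored nodes, and the matching node IDs double as citations.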
Type-safe data handling with validation:
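The sketch below illustrates the idea with standard-library dataclasses in place of Pydantic so that it runs with no dependencies. The `Node` class and its fields are hypothetical, and the rules it enforces (content nodes must carry text, structure nodes must not) are assumptions drawn from the structure-versus-content distinction above, not the project's actual models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str  # "structure" or "content"
    node_text: Optional[str] = None

    def __post_init__(self):
        # Reject anything that would leave the node graph inconsistent.
        if self.node_type not in ("structure", "content"):
            raise ValueError(f"unknown node_type: {self.node_type!r}")
        if self.node_type == "content" and not self.node_text:
            raise ValueError("content nodes must carry text")
        if self.node_type == "structure" and self.node_text is not None:
            raise ValueError("structure nodes must not carry text")


chapter = Node("us/ca/statutes/title=1/chapter=2", "structure")
section = Node(
    "us/ca/statutes/title=1/chapter=2/section=3", "content", "Example text."
)

# A content node without text is rejected at construction time.
try:
    Node("us/ca/statutes/title=1/chapter=2/section=4", "content")
    rejected = False
except ValueError:
    rejected = True
```

Failing at construction time means a scraper bug surfaces on the page that caused it, rather than later as a malformed row in the database.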
open-source-legislation/
├── src/
│ ├── scrapers/
│ │ ├── us/
│ │ │ ├── (states)/ # State-level scrapers
│ │ │ └── federal/ # Federal legislation
│ │ └── mhl/ # Marshall Islands
│ ├── 1_SCRAPE_TEMPLATE/ # Template for new scrapers
│ └── utils/
│ ├── pydanticModels.py # Core data models
│ ├── scrapingHelpers.py # Scraping utilities
│ ├── processingHelpers.py # Embedding generation
│ └── utilityFunctions.py # Database & API clients
├── docs/
├── deprecated/
├── public/ # Documentation images
├── CLAUDE.md # Detailed technical docs
├── requirements.txt
└── README.md # This file
This project represents a lot of work and good intentions. The code quality is solid, the architecture is sound, and the vision was noble. It failed not because of bad engineering, but because the problem is genuinely hard and the scope was unsustainable for unfunded solo development.
If you're interested in legislative data access, I hope this code provides useful patterns. If you want to revive the project, you have my blessing—just know what you're getting into.
If you're me from the future looking back at this: you learned a lot building this, even if it didn't work out. That's worth something.
Project Timeline:
- Active development: 2023-2024
- Last significant update: August 2024
- Infrastructure shutdown: 2024
- Officially archived: November 2025
License: MIT (use the code however you want)
Created by: @spartypkp
- CLAUDE.md - Comprehensive technical documentation (written when project was active)
- deprecated/README.md - Historical documentation
- docs/refactoring.md - Schema migration notes
For questions or interest in reviving this project, feel free to open an issue. I may not respond quickly, but I haven't disappeared entirely.




