Deterministic, evidence-linked document intelligence for Indian health insurance policy documents.
PolicyLens is a standalone document compiler for Indian health insurance policy PDFs. It compiles policy documents into structured, provenance-backed data with explicit evidence and status fields. It was built to answer a hard document-intelligence problem: legal and product PDFs are not useful unless extracted facts stay tied to source text, structure, and context.
This repository is currently optimized for:
- Indian health insurance policy wordings and adjacent official product documents
- structured extraction with explicit evidence
- evaluation-driven development on a reviewed gold corpus
- cautious, precision-first structured outputs
It is not a chat app, a generic RAG wrapper, or a universal “extract anything from any PDF” product.
The engine turns policy documents into a layered representation:
- Identity
  - corpus lockdown
  - document-type filtering
  - insurer / plan normalization
  - UIN reconciliation
- Physical parsing
  - page / block / line extraction
  - coordinates and layout metadata
- Logical parsing
  - heading scoring
  - section tree construction
  - clause segmentation
- Table handling
  - table detection and extraction
  - header lineage
  - source linkage
- Fact extraction
  - deterministic extractors for 20 priority concepts
  - evidence-backed fact candidates
  - conflict handling
- Compiled export
  - JSON export package for inspection, analysis, or downstream use
  - explicit fact_status, confidence, and evidence
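As a minimal sketch of what one evidence-linked fact in the compiled export might look like, the Python below models a single record. Only fact_status, confidence, and evidence come from the description above; the class names, the remaining fields, and the sample values are hypothetical and are not the repository's actual export contract (docs/export_contract.md is the place to check that).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EvidenceSpan:
    """Hypothetical pointer back to the source text that supports a fact."""
    page: int          # 1-based page number in the source PDF (assumed convention)
    section_path: str  # position in the section tree (assumed representation)
    text: str          # verbatim clause text the fact was derived from


@dataclass
class Fact:
    """Hypothetical compiled fact; only the last three field names come from the README."""
    concept: str
    value: Optional[str]
    fact_status: str
    confidence: float
    evidence: List[EvidenceSpan] = field(default_factory=list)


# Illustrative values only, not taken from any real policy document.
fact = Fact(
    concept="pre_existing_disease_waiting_period",
    value="36 months",
    fact_status="present",
    confidence=0.9,
    evidence=[
        EvidenceSpan(page=12, section_path="Waiting Periods", text="(sample clause text)")
    ],
)

# asdict() recurses into nested dataclasses, so the record serializes directly.
record = asdict(fact)
print(json.dumps(record, indent=2))
```

Keeping the evidence list on the same record as the value means a consumer can always trace a field back to a page and clause without consulting parser internals.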
This project started from a real document and product-truth frustration:
- policy aggregators were noisy and opaque
- important products were hidden or inconsistently surfaced
- sales flows were stronger than explanation flows
- “compare plans” UX often lacked source-backed truth
The first idea was to let a general LLM read PDFs directly. That failed for three reasons:
- cost and scaling pressure,
- weak structure fidelity on long legal documents,
- no reliable typed output layer for comparison-grade data.
So this project became a document compiler instead of a generic RAG demo.
This project is complete as a learning / research / open-source artifact.
Active feature development is closed. The repository is being preserved because the underlying work is real and useful:
- deterministic document parsing for complex regulated PDFs
- evidence-linked structured extraction
- evaluation-driven engineering discipline
- source-bundle modeling for product-truth problems
It is not being actively developed as a live insurance-comparison product.
If you are reading this as a builder:
- use the repo as a case study, reference implementation, or starting point
- do not assume the insurer corpus will stay current without ongoing document operations
- do not treat the exported facts as a production insurance recommendation service
There are three useful ways to try PolicyLens.
This is the fastest way to confirm the project is healthy locally:
PYTHONPATH=. .venv/bin/python scripts/validate_gold_corpus.py
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_v1.draft.json
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_mvp_v1.json
PYTHONPATH=. .venv/bin/python -m pytest tests/ --tb=short
If you want to understand what the engine produces, start with:
gold_corpus/policies/
data/reports/
data/manifests/product_source_bundles_v1.draft.json
data/manifests/product_source_bundles_mvp_v1.json
data/processed/product_b_export_v1/ # historical path name for the compiled export package
Recommended reading order:
- gold_corpus/policies/*/facts.json
- data/reports/dse025_source_bundle_baseline_audit.md
- data/reports/dse026_top10_top5_selection.md
- data/reports/dse027_curated_mvp_bundle_closeout_v1.md
If you want to adapt the project, the main reusable entry points are:
scripts/corpus_lockdown.py
scripts/uin_match_report.py
scripts/run_heading_scorer.py
scripts/run_section_tree.py
scripts/run_table_engine.py
scripts/run_fact_extractors.py
scripts/run_clause_store.py
scripts/run_export.py
The project is most suitable when your documents are:
- text-layer PDFs,
- structurally repetitive within a domain,
- high-stakes enough to require evidence,
- and better served by precision-first extraction than broad fuzzy retrieval.
As of the final documented state:
- 20 reviewed gold-corpus policies
- full parser and extractor pipeline implemented
- 20/20 priority deterministic concepts active
- full test suite passing
- curated MVP insurer universe selected
- curated 30-product MVP source-bundle registry built
- current-version drift against live insurer sources explicitly tracked
- public open-source release hardening completed
- project closeout documented
The most important strategic lesson so far:
Policy wording PDFs alone are not enough for launch-grade insurance comparison.
Many comparison-critical values live in:
- Product Benefit Tables (PBTs)
- Customer Information Sheets (CIS)
- brochures / prospectuses
- variant-specific tables
That is why the repository now includes a source-bundle registry and a curated MVP bundle set.
- active-policy corpus filtering
- UIN lifecycle matching
- physical layout extraction from text-layer PDFs
- heading scorer and section tree builder
- clause segmentation
- table extraction and evaluation
- source span generation
Deterministic extractors are active for 20 priority concepts, including:
- pre-existing disease waiting period
- initial waiting period
- co-pay
- deductible
- room rent limit
- ICU limit
- claim intimation timeline
- restoration benefit
- modern treatment coverage
- newborn coverage
Every exported concept uses explicit status values such as:
- present
- explicitly_not_covered
- not_applicable
- not_found
- ambiguous
- conflicting
- requires_manual_review
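One way a consumer could keep these status values honest is to parse them into a closed enumeration, so that a typo or an unexpected value fails loudly instead of being silently accepted. This is a sketch under assumptions: the FactStatus class and the parse_status helper are hypothetical names, and the README presents the status list as examples, so the real export may define more values.

```python
from enum import Enum


class FactStatus(str, Enum):
    """Status values named in the README; the real export may define others."""
    PRESENT = "present"
    EXPLICITLY_NOT_COVERED = "explicitly_not_covered"
    NOT_APPLICABLE = "not_applicable"
    NOT_FOUND = "not_found"
    AMBIGUOUS = "ambiguous"
    CONFLICTING = "conflicting"
    REQUIRES_MANUAL_REVIEW = "requires_manual_review"


def parse_status(raw: str) -> FactStatus:
    """Convert a raw string to a FactStatus; unknown values raise instead of passing through."""
    try:
        return FactStatus(raw)
    except ValueError:
        raise ValueError(f"unknown fact_status: {raw!r}") from None


status = parse_status("not_found")
print(status.value)
```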
- draft full-corpus source-bundle registry
- curated MVP source-bundle registry for 30 current/live products
- source-quality tracking:
  - complete
  - acceptable_with_known_gap
  - missing_pbt
  - missing_cis
  - stale_version
  - others
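The source-quality labels can act as a gate before any bundle is used for comparison. The sketch below is illustrative only: the source_quality and product keys, the comparison_ready helper, and the choice of which labels count as good enough are all assumptions, not the registry's actual schema or policy.

```python
# Source-quality labels named above; the README notes that others exist.
KNOWN_QUALITIES = {
    "complete",
    "acceptable_with_known_gap",
    "missing_pbt",
    "missing_cis",
    "stale_version",
}

# Assumption for illustration: only these two labels are treated as comparison-ready.
COMPARISON_READY = {"complete", "acceptable_with_known_gap"}


def comparison_ready(bundle: dict) -> bool:
    """Return True only for bundles whose label is explicitly on the allow-list."""
    # "source_quality" is a hypothetical key name; missing or unknown labels fail closed.
    return bundle.get("source_quality") in COMPARISON_READY


# Hypothetical bundles, not real products.
bundles = [
    {"product": "Plan A", "source_quality": "complete"},
    {"product": "Plan B", "source_quality": "missing_pbt"},
    {"product": "Plan C"},
]
ready = [b["product"] for b in bundles if comparison_ready(b)]
print(ready)
```

Using an allow-list rather than a deny-list means a new or misspelled label is excluded by default, which matches the precision-first stance of the project.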
- no user-facing UI
- no consumer chat product
- no embeddings/vector-search layer for end-user retrieval
- no OCR-first pipeline
- no “extract anything from any document” promise
This repository is intentionally narrower:
structured, evidence-aware compilation of regulated insurance documents
PolicyLens/
├── README.md
├── IMPLEMENTATION_PLAN.md
├── pyproject.toml
├── docs/
│ ├── architecture.md
│ ├── evaluation.md
│ ├── export_contract.md
│ ├── data_contracts.md
│ ├── database_strategy.md
│ ├── development_protocol.md
│ ├── ai_execution_protocol.md
│ ├── decisions.md
│ ├── changelog.md
│ ├── risk_register.md
│ ├── tasks.md
│ ├── glossary.md
│ └── product_b_mvp_gtm_strategy.md
├── identity/
├── pdf_parser/
├── structure_parser/
├── table_engine/
├── clause_store/
├── extractors/
├── ontology/
├── derived/
├── source_bundles/
├── scripts/
├── tests/
├── data/
│ ├── manifests/
│ ├── interim/
│ ├── processed/
│ ├── reports/
│ └── tmp/
└── runs/
  ├── sessions/
  ├── experiments/
  └── evals/
- Python 3.9+
- macOS/Linux shell
- text-layer PDFs for the main parser path
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
PYTHONPATH=. .venv/bin/python scripts/validate_gold_corpus.py
PYTHONPATH=. .venv/bin/python -m pytest tests/ --tb=short
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_v1.draft.json
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_mvp_v1.json
Reviewed benchmark corpus:
gold_corpus/policies/
data/manifests/active_policy_wordings_v1.json
data/manifests/uin_match_report_v1.json
data/manifests/product_source_bundles_v1.draft.json
data/manifests/product_source_bundles_mvp_v1.json
data/manifests/mvp_insurer_selection_v1.json
data/manifests/mvp_product_candidates_v1.json
data/manifests/mvp_product_candidates_verified_v1.json
data/processed/product_b_export_v1/
If you are new to the repo, read in this order:
- README.md
- IMPLEMENTATION_PLAN.md
- docs/architecture.md
- docs/evaluation.md
- docs/data_contracts.md
- docs/export_contract.md
- docs/project_closeout.md
- docs/tasks.md
- docs/decisions.md
- latest file in runs/sessions/
Most important public docs:
- architecture
- evaluation
- data contracts
- export contract
- database strategy
- project closeout / retrospective
This repo is being published primarily as:
- a working document-intelligence system
- a learning artifact in evaluation-driven AI engineering
- a case study in structured extraction from regulated documents
- a truthful record of where a consumer-AI product idea stopped being operationally sane
It is not being presented as:
- a complete insurance comparison business
- a fully self-updating national insurance database
- a legal/commercial source of truth for all live products
The curated MVP source-bundle work exists precisely because “parse old PDFs once and trust them forever” is not good enough.
- This is a domain-shaped engine, not a universal document extractor.
- Current insurer product truth can drift over time.
- Some product bundles are still incomplete even after latest-version verification.
- The downstream recommendation product must respect source quality.
- not_found must never be treated as not covered.
- The repo is optimized for text-layer PDFs, not scanned-image corpora.
- Clauses are source truth. Fields are derived views.
- Unknown is acceptable. Wrong is fatal.
- Precision over recall.
- No extracted fact without evidence.
- Compiled outputs are safer to consume than raw parser internals.
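Two of the rules above, that not_found must never be read as not covered and that no fact is accepted without evidence, can be sketched as a small consumer-side check. This is an illustration under assumptions: the dictionary keys and function names are hypothetical, and mapping present to "covered" only makes sense for a coverage-style concept such as newborn coverage.

```python
from typing import Optional


def is_covered(fact: dict) -> Optional[bool]:
    """Return True or False only when the document says so; None means unknown."""
    status = fact.get("fact_status")
    if status == "present":
        return True
    if status == "explicitly_not_covered":
        return False
    # not_found, ambiguous, conflicting, and any other status stay "unknown".
    return None


def check_evidence(fact: dict) -> None:
    """Reject a definite answer (covered or not covered) that carries no evidence."""
    if is_covered(fact) is not None and not fact.get("evidence"):
        raise ValueError(f"fact {fact.get('concept')!r} has no evidence")


# Hypothetical facts for a coverage-style concept.
missing = {"concept": "newborn_coverage", "fact_status": "not_found", "evidence": []}
excluded = {
    "concept": "newborn_coverage",
    "fact_status": "explicitly_not_covered",
    "evidence": [{"page": 7, "text": "(sample exclusion clause)"}],
}

print(is_covered(missing))   # None: unknown, not "not covered"
print(is_covered(excluded))  # False: the document states the exclusion
check_evidence(excluded)     # passes because evidence is attached
```

The three-valued return keeps "the document says no" separate from "the extractor found nothing", which is the distinction the limitations list insists on.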
Please read:
For substantial changes, keep the project’s operating rules:
- small verified steps
- tests/evals before claims
- session logs for meaningful work
- no silent failures
This project is licensed under the MIT License.