PolicyLens

Deterministic, evidence-linked document intelligence for Indian health insurance policy documents.

PolicyLens is a standalone document compiler for Indian health insurance policy PDFs. It turns policy documents into structured, provenance-backed data with explicit evidence and status fields. It was built to address a hard document-intelligence problem: facts extracted from legal and product PDFs are only useful if they stay tied to their source text, structure, and context.

This repository is currently optimized for:

Indian health insurance policy wordings and adjacent official product documents
structured extraction with explicit evidence
evaluation-driven development on a reviewed gold corpus
cautious, precision-first structured outputs

It is not a chat app, a generic RAG wrapper, or a universal “extract anything from any PDF” product.

What this project does

The engine turns policy documents into a layered representation:

Identity
- corpus lockdown
- document-type filtering
- insurer / plan normalization
- UIN reconciliation
Physical parsing
- page / block / line extraction
- coordinates and layout metadata
Logical parsing
- heading scoring
- section tree construction
- clause segmentation
Table handling
- table detection and extraction
- header lineage
- source linkage
Fact extraction
- deterministic extractors for 20 priority concepts
- evidence-backed fact candidates
- conflict handling
Compiled export
- JSON export package for inspection, analysis, or downstream use
- explicit fact_status, confidence, and evidence
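
To make the export shape concrete, here is a minimal sketch of what one compiled fact record could look like. Only `fact_status`, `confidence`, and `evidence` are named in this README. Every other field name (`concept`, `value`, `page`, `section_path`, `text`) and all sample values are assumptions for illustration, not the actual export contract; docs/export_contract.md is the place to check the real one.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Evidence:
    # Hypothetical evidence span: where in the source PDF the fact came from.
    page: int
    section_path: str
    text: str


@dataclass
class Fact:
    concept: str              # hypothetical concept key
    fact_status: str          # e.g. "present" or "not_found"
    confidence: float         # assumed to be in the range 0.0 to 1.0
    value: Optional[str] = None
    evidence: List[Evidence] = field(default_factory=list)


def to_export_json(fact: Fact) -> str:
    """Serialize one fact, including its nested evidence, to a JSON string."""
    return json.dumps(asdict(fact), sort_keys=True)


# Made-up sample record; the values are illustrative, not real policy data.
example = Fact(
    concept="initial_waiting_period",
    fact_status="present",
    confidence=0.95,
    value="30 days",
    evidence=[
        Evidence(
            page=12,
            section_path="Exclusions > Waiting Period",
            text="...illustrative clause text...",
        )
    ],
)
print(to_export_json(example))
```

Keeping the evidence list inside the record, rather than in a side table, means a consumer of the JSON never sees a value detached from its source span.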

Why it exists

This project started from a real frustration with insurance documents and product truth:

policy aggregators were noisy and opaque
important products were hidden or inconsistently surfaced
sales flows were stronger than explanation flows
“compare plans” UX often lacked source-backed truth

The first idea was to let a general LLM read PDFs directly. That failed for three reasons:

cost and scaling pressure,
weak structure fidelity on long legal documents,
no reliable typed output layer for comparison-grade data.

So this project became a document compiler instead of a generic RAG demo.

Repository status

This project is complete as a learning / research / open-source artifact.

Active feature development is closed. The repository is being preserved because the underlying work is real and useful:

deterministic document parsing for complex regulated PDFs
evidence-linked structured extraction
evaluation-driven engineering discipline
source-bundle modeling for product-truth problems

It is not being continued as a live insurance-comparison product.

If you are reading this as a builder:

use the repo as a case study, reference implementation, or starting point
do not assume the insurer corpus will stay current without ongoing document operations
do not treat the exported facts as a production insurance recommendation service

How to try it

There are three useful ways to try PolicyLens.

1. Verify the repository state

This is the fastest way to confirm the project is healthy locally:

PYTHONPATH=. .venv/bin/python scripts/validate_gold_corpus.py
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_v1.draft.json
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_mvp_v1.json
PYTHONPATH=. .venv/bin/python -m pytest tests/ --tb=short

2. Inspect the included artifacts

If you want to understand what the engine produces, start with:

gold_corpus/policies/
data/reports/
data/manifests/product_source_bundles_v1.draft.json
data/manifests/product_source_bundles_mvp_v1.json
data/processed/product_b_export_v1/   # historical path name for the compiled export package

3. Reuse the engine on your own text-layer PDFs

If you want to adapt the project, the main reusable entry points are:

scripts/corpus_lockdown.py
scripts/uin_match_report.py
scripts/run_heading_scorer.py
scripts/run_section_tree.py
scripts/run_table_engine.py
scripts/run_fact_extractors.py
scripts/run_clause_store.py
scripts/run_export.py

The project is most suitable when your documents are:

text-layer PDFs,
structurally repetitive within a domain,
high-stakes enough to require evidence,
and better served by precision-first extraction than broad fuzzy retrieval.

Current status

As of the final documented state:

20 reviewed gold-corpus policies
full parser and extractor pipeline implemented
20/20 priority deterministic concepts active
full test suite passing
curated MVP insurer universe selected
curated 30-product MVP source-bundle registry built
current-version drift against live insurer sources explicitly tracked
public open-source release hardening completed
project closeout documented

The most important strategic lesson from the project:

Policy wording PDFs alone are not enough for launch-grade insurance comparison.

Many comparison-critical values live in:

Product Benefit Tables (PBTs)
Customer Information Sheets (CIS)
brochures / prospectuses
variant-specific tables

That is why the repository now includes a source-bundle registry and a curated MVP bundle set.

Current capabilities

Parser / structure

active-policy corpus filtering
UIN lifecycle matching
physical layout extraction from text-layer PDFs
heading scorer and section tree builder
clause segmentation
table extraction and evaluation
source span generation

Semantic extraction

Deterministic extractors are active for 20 priority concepts, including:

pre-existing disease waiting period
initial waiting period
co-pay
deductible
room rent limit
ICU limit
claim intimation timeline
restoration benefit
modern treatment coverage
newborn coverage

Every exported concept uses explicit status values such as:

present
explicitly_not_covered
not_applicable
not_found
ambiguous
conflicting
requires_manual_review
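
The status vocabulary above can be pinned down as an enum, which also gives conflict handling a natural home. The sketch below is an assumption about how candidate values might be collapsed into one status; the `resolve` function and its rule are hypothetical and are not taken from the repository's extractors.

```python
from enum import Enum
from typing import List, Optional, Tuple


class FactStatus(str, Enum):
    # The seven status values listed in this README.
    PRESENT = "present"
    EXPLICITLY_NOT_COVERED = "explicitly_not_covered"
    NOT_APPLICABLE = "not_applicable"
    NOT_FOUND = "not_found"
    AMBIGUOUS = "ambiguous"
    CONFLICTING = "conflicting"
    REQUIRES_MANUAL_REVIEW = "requires_manual_review"


def resolve(candidate_values: List[str]) -> Tuple[FactStatus, Optional[str]]:
    """Collapse candidate values for one concept into a status and a value.

    Hypothetical rule: no candidates gives not_found, full agreement gives
    present, and any disagreement gives conflicting with no value picked.
    """
    # Normalize whitespace and case so trivial variants count as agreement.
    distinct = sorted({v.strip().lower() for v in candidate_values})
    if not distinct:
        return FactStatus.NOT_FOUND, None
    if len(distinct) == 1:
        return FactStatus.PRESENT, distinct[0]
    return FactStatus.CONFLICTING, None


print(resolve(["30 days", "90 days"]))
```

Refusing to pick a winner on disagreement is the precision-first choice: a `conflicting` record is safer to surface than a guessed value.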

Product identity / source bundles

draft full-corpus source-bundle registry
curated MVP source-bundle registry for 30 current/live products
source-quality tracking:
- complete
- acceptable_with_known_gap
- missing_pbt
- missing_cis
- stale_version
- others
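
One way a downstream consumer might use these labels is to gate which products are shown in a comparison. This is a sketch under stated assumptions: the dictionary keys `product` and `source_quality`, the function name, and the choice of which labels count as comparison-ready are all hypothetical and are not the registry's actual schema or policy.

```python
from typing import Dict, List, Tuple

# Assumption: only these two labels are treated as safe for comparison.
COMPARISON_READY = {"complete", "acceptable_with_known_gap"}


def split_by_quality(bundles: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Partition product names into (ready, held_back) by source quality.

    Unknown or missing labels are held back, so the gate fails closed.
    """
    ready: List[str] = []
    held_back: List[str] = []
    for bundle in bundles:
        if bundle.get("source_quality") in COMPARISON_READY:
            ready.append(bundle["product"])
        else:
            held_back.append(bundle["product"])
    return ready, held_back


# Made-up sample entries for illustration only.
sample = [
    {"product": "Plan A", "source_quality": "complete"},
    {"product": "Plan B", "source_quality": "missing_pbt"},
    {"product": "Plan C", "source_quality": "stale_version"},
]
print(split_by_quality(sample))
```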

What this project does not do

no user-facing UI
no consumer chat product
no embeddings/vector-search layer for end-user retrieval
no OCR-first pipeline
no “extract anything from any document” promise

This repository is intentionally narrower:

structured, evidence-aware compilation of regulated insurance documents

Repo layout

PolicyLens/
├── README.md
├── IMPLEMENTATION_PLAN.md
├── pyproject.toml
├── docs/
│   ├── architecture.md
│   ├── evaluation.md
│   ├── export_contract.md
│   ├── data_contracts.md
│   ├── database_strategy.md
│   ├── development_protocol.md
│   ├── ai_execution_protocol.md
│   ├── decisions.md
│   ├── changelog.md
│   ├── risk_register.md
│   ├── tasks.md
│   ├── glossary.md
│   └── product_b_mvp_gtm_strategy.md
├── identity/
├── pdf_parser/
├── structure_parser/
├── table_engine/
├── clause_store/
├── extractors/
├── ontology/
├── derived/
├── source_bundles/
├── scripts/
├── tests/
├── data/
│   ├── manifests/
│   ├── interim/
│   ├── processed/
│   ├── reports/
│   └── tmp/
└── runs/
    ├── sessions/
    ├── experiments/
    └── evals/

Quick start

Requirements

Python 3.9+
macOS/Linux shell
text-layer PDFs for the main parser path

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Basic validation

PYTHONPATH=. .venv/bin/python scripts/validate_gold_corpus.py
PYTHONPATH=. .venv/bin/python -m pytest tests/ --tb=short

Source bundle validation

PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_v1.draft.json
PYTHONPATH=. .venv/bin/python scripts/validate_source_bundles.py data/manifests/product_source_bundles_mvp_v1.json

Key artifacts

Gold corpus

Reviewed benchmark corpus:

gold_corpus/policies/

Full-corpus scale identity

data/manifests/active_policy_wordings_v1.json
data/manifests/uin_match_report_v1.json

Source-bundle registries

data/manifests/product_source_bundles_v1.draft.json
data/manifests/product_source_bundles_mvp_v1.json

MVP insurer/product selection

data/manifests/mvp_insurer_selection_v1.json
data/manifests/mvp_product_candidates_v1.json
data/manifests/mvp_product_candidates_verified_v1.json

Compiled export package

data/processed/product_b_export_v1/

Documentation map

If you are new to the repo, read in this order:

Most important public docs:

Open-source scope

This repo is being published primarily as:

a working document-intelligence system
a learning artifact in evaluation-driven AI engineering
a case study in structured extraction from regulated documents
a truthful record of where a consumer-AI product idea stopped being operationally sane

It is not being presented as:

a complete insurance comparison business
a fully self-updating national insurance database
a legal/commercial source of truth for all live products

The curated MVP source-bundle work exists precisely because “parse old PDFs once and trust them forever” is not good enough.

Important limitations

This is a domain-shaped engine, not a universal document extractor.
Current insurer product truth can drift over time.
Some product bundles are still incomplete even after latest-version verification.
The downstream recommendation product must respect source quality.
not_found must never be treated as not covered.
The repo is optimized for text-layer PDFs, not scanned-image corpora.
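
The `not_found` rule is easy to violate in a display layer, so here is a minimal sketch of a renderer that keeps the two cases apart. The function name and the message wording are assumptions for illustration; only the status values come from this README.

```python
def describe(concept_label: str, fact_status: str) -> str:
    """Turn a status into a user-facing sentence without overclaiming."""
    messages = {
        "present": "{c}: stated in the policy; see the linked evidence.",
        "explicitly_not_covered": "{c}: the policy explicitly excludes this.",
        "not_applicable": "{c}: does not apply to this product.",
        # not_found is an absence of evidence, never a statement of exclusion.
        "not_found": (
            "{c}: not found in the parsed documents. "
            "This does not mean it is excluded."
        ),
        "ambiguous": "{c}: the wording is ambiguous; check the source clause.",
        "conflicting": "{c}: the sources disagree; check the source clauses.",
        "requires_manual_review": "{c}: pending manual review.",
    }
    # Unrecognized statuses fall back to a review message instead of a guess.
    template = messages.get(fact_status, "{c}: unknown status; needs manual review.")
    return template.format(c=concept_label)


print(describe("Newborn coverage", "not_found"))
```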

Design principles

Clauses are source truth. Fields are derived views.
Unknown is acceptable. Wrong is fatal.
Precision over recall.
No extracted fact without evidence.
Compiled outputs are safer to consume than raw parser internals.
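
The "no extracted fact without evidence" principle can be expressed as a check that runs before export. The sketch below assumes fact records are plain dictionaries with `concept`, `fact_status`, and `evidence` keys, and that only `present` and `explicitly_not_covered` are assertions that need evidence; both are assumptions, not the repository's actual validation logic.

```python
from typing import Dict, List

# Assumption: these statuses assert something about the policy text.
NEEDS_EVIDENCE = {"present", "explicitly_not_covered"}


def evidence_violations(facts: List[Dict]) -> List[str]:
    """Return the concepts whose records assert a fact with no evidence."""
    return [
        fact.get("concept", "<unnamed>")
        for fact in facts
        if fact.get("fact_status") in NEEDS_EVIDENCE and not fact.get("evidence")
    ]


# Made-up records for illustration only.
records = [
    {"concept": "co_pay", "fact_status": "present", "evidence": [{"page": 4}]},
    {"concept": "deductible", "fact_status": "present", "evidence": []},
    {"concept": "icu_limit", "fact_status": "not_found", "evidence": []},
]
print(evidence_violations(records))
```

A non-empty result would block the export in this sketch, which matches the "Unknown is acceptable. Wrong is fatal." stance: a `not_found` record passes, an unsupported `present` record does not.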

Contributing

Please read:

For substantial changes, keep the project’s operating rules:

small verified steps
tests/evals before claims
session logs for meaningful work
no silent failures

License

This project is licensed under the MIT License.
