Analysis of a PAN dataset to surface common data quality issues and validate records against defined format rules. Covers null handling, duplicate detection, whitespace and casing inconsistencies, followed by rule-based classification and summary generation.
-
pan_validation.sql Main script covering:
- data quality checks (nulls, duplicates, spacing, casing)
- regex-based format validation
- custom functions for sequence and repetition checks
- classification into Valid / Invalid PAN
- summary query (processed, valid, invalid, incomplete)
-
PAN Validation Analysis.pdf Contains the full context of the Dataset, project motivation, and full set of validation rules.
-
PAN Number Validation Dataset.xlsx Source dataset.
Environment: PostgreSQL
- Create / connect to a database
- Load data into
pan_numbers_dataset - Run
pan_validation.sql
Outputs:
vw_valid_invalid_pan→ PAN classification- Final query → summary counts
- Validation rules are defined in PAN Validation Analysis.pdf
- Script is commented where required; follow it top-down
- Results are derived directly; no separate cleaned table maintained