Skip to content

sai-vamsi-m/pan-validation-sql

Repository files navigation

PAN Number Validation using SQL

Analysis of a PAN dataset to surface common data quality issues and validate records against defined format rules. Covers null handling, duplicate detection, whitespace and casing inconsistencies, followed by rule-based classification and summary generation.

Files to Refer

  • pan_validation.sql Main script covering:

    • data quality checks (nulls, duplicates, spacing, casing)
    • regex-based format validation
    • custom functions for sequence and repetition checks
    • classification into Valid / Invalid PAN
    • summary query (processed, valid, invalid, incomplete)
  • PAN Validation Analysis.pdf Contains the full context of the Dataset, project motivation, and full set of validation rules.

  • PAN Number Validation Dataset.xlsx Source dataset.

Running the Script

Environment: PostgreSQL

  1. Create / connect to a database
  2. Load data into pan_numbers_dataset
  3. Run pan_validation.sql

Outputs:

  • vw_valid_invalid_pan → PAN classification
  • Final query → summary counts

Notes

  • Validation rules are defined in PAN Validation Analysis.pdf
  • Script is commented where required; follow it top-down
  • Results are derived directly; no separate cleaned table maintained

About

SQL-based analysis of a PAN dataset to identify data quality issues and classify records as valid/invalid based on PAN Number format compliance.

Topics

Resources

Stars

Watchers

Forks

Contributors