Uprn integrity checks by GeoWill · Pull Request #50 · DemocracyClub/dc-data-baker

GeoWill · 2026-06-23T09:54:47Z

Previously we were comparing row counts, rather than whether we had the same addresses in two tables.

I've done a couple of things here: 1. Replaced the row count checks with a checksum comparison (checksum exists in Presto/Trino and is an order-invariant aggregate function). 2. Tried to make it clearer what we're checking, and at what stage the check happens.
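
To illustrate why an order-invariant aggregate suits this comparison, here is a minimal Python sketch. It is not Trino's actual checksum algorithm: the function name, the hash choice and the UPRN values are all assumptions made for illustration. It sums a per-value hash, so the order of rows cannot affect the result.

```python
import hashlib

MASK_64 = (1 << 64) - 1  # keep the running total within 64 bits


def order_invariant_checksum(values):
    """Sum a 64-bit hash of each value, modulo 2**64.

    Addition is commutative, so row order cannot change the result.
    Illustrative only: this is not how Trino implements checksum().
    """
    total = 0
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) & MASK_64
    return total


# Hypothetical UPRNs, purely for illustration.
table_a = [100000000001, 100000000002, 100000000003]
table_b = [100000000003, 100000000001, 100000000002]  # same rows, different order
table_c = [100000000001, 100000000002, 100000000004]  # one UPRN differs

print(order_invariant_checksum(table_a) == order_invariant_checksum(table_b))
print(order_invariant_checksum(table_a) == order_invariant_checksum(table_c))
```

The sketch uses a sum rather than XOR so that a duplicated value still changes the total.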

So if we want to check the data while it's in the table partitioned by first letter, I've put the check after the step that creates that table.

The checksum approach doesn't work on the 'by outcode' tables, because we sometimes write empty outcode files if we have no data. However, this is no worse than where we were before with row counts. I did find an edge case where we had a stale outcode, i.e. one that only existed in a previous addressbase version. I've added a step to clean these up.

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1215654066618125

Everywhere we were doing a row count, we can now make sure we have the same rows. Calculating checksum(uprn) for 2 tables takes ~3 seconds, so there's no real cost to putting this everywhere. Row counts and distinct UPRNs are also compared, which should catch duplicates.
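
As a rough sketch of the kind of query this implies, the helper below builds one SQL statement that returns the checksum, the row count and the distinct UPRN count for a table. The function name, column aliases and database name are hypothetical and not taken from this PR; only `checksum`, `count(*)` and `count(DISTINCT ...)` are standard Trino aggregates.

```python
import re

# Only allow simple identifiers, since the names are interpolated into SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def uprn_summary_query(database: str, table: str) -> str:
    """Build a query summarising the uprn column of one table.

    Hypothetical helper: the real checks in this PR may be written differently.
    """
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        "SELECT checksum(uprn) AS uprn_checksum, "
        "count(*) AS row_count, "
        "count(DISTINCT uprn) AS distinct_uprns "
        f'FROM "{database}"."{table}"'
    )


# "example_db" is a made-up database name.
print(uprn_summary_query("example_db", "current_ballots_joined_to_address_base"))
```

Running the same query against two tables gives two small result rows that can be compared directly.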

Do an order-insensitive checksum and check for any duplicates. This lets us compare two glue tables and check they contain the same set of UPRNs. They need to have a 'uprn' column. This should help us isolate where we're getting issues with duplicate UPRNs, and any future/unknown issues with missing UPRNs. Doing this on partition/unload is really just a proof of concept. The next step is to wire it into other steps.
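
The comparison itself could look something like the sketch below. `UprnSummary` and `compare_uprn_summaries` are hypothetical names, and the checksum strings are placeholder values; the logic simply mirrors the description above (same checksum, same row count, no duplicates).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UprnSummary:
    """Per-table result of checksum(uprn), count(*) and count(DISTINCT uprn)."""
    checksum: str
    row_count: int
    distinct_uprns: int


def compare_uprn_summaries(left: UprnSummary, right: UprnSummary) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for label, summary in (("left", left), ("right", right)):
        duplicates = summary.row_count - summary.distinct_uprns
        if duplicates:
            problems.append(f"{label} table has {duplicates} duplicate UPRN row(s)")
    if left.row_count != right.row_count:
        problems.append("row counts differ")
    if left.checksum != right.checksum:
        problems.append("checksum(uprn) differs, so the UPRN sets differ")
    return problems


# Placeholder values, not real query output.
source = UprnSummary(checksum="aa11", row_count=3, distinct_uprns=3)
target = UprnSummary(checksum="bb22", row_count=4, distinct_uprns=3)
print(compare_uprn_summaries(source, target))
```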

This checks the current_ballots_joined_to_address_base table. At the moment we don't define the current_elections_parquet table in cdk. That table is the final product from this stack (current elections split by outcode). However, it's also not suitable for checksum comparison because we write empty files for outcodes where there is no relevant data. I've moved the location of the check to make it clearer what is being checked. It might be worth getting current_elections_parquet into cdk so that we can run some checks on it.

Rename data quality checks (source and checksum) to make it clear the checks are happening on the table partitioned by first letter. Add a source check to the table partitioned by outcode. This means the AddressbaseSourceCheckConstruct is used twice (once directly on outcodes, once via AddressbaseDataQualityCheckConstruct). As a result, we have to namespace the states defined in the construct by construct_id, because state names need to be unique within a state machine.
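
A minimal sketch of the namespacing idea, without any CDK imports. The helper names, the construct ids and the state names below are hypothetical; the real construct may format its state names differently.

```python
from collections import Counter


def namespaced_state_name(construct_id: str, state_name: str) -> str:
    """Prefix a state name with the id of the construct that defines it."""
    return f"{construct_id}: {state_name}"


def duplicate_names(names):
    """Return the names that appear more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical states defined inside a construct that is instantiated twice.
states_in_construct = ["Run source check", "Evaluate source check"]
construct_ids = ["OutcodeSourceCheck", "FirstLetterSourceCheck"]

# Without a namespace, the second use repeats every state name.
plain = [name for _ in construct_ids for name in states_in_construct]
print(duplicate_names(plain))

# With the construct_id prefix, every state name is unique.
prefixed = [
    namespaced_state_name(construct_id, name)
    for construct_id in construct_ids
    for name in states_in_construct
]
print(duplicate_names(prefixed))
```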

If there are outcodes in the 'by outcode' table (current_boundary_reviews_parquet) that are not in the 'by first letter' table (current_boundary_reviews_joined_to_addressbase), then it is safe to assume that they are stale, presumably originating from a previous addressbase generation. Since we don't know about any addresses in such an outcode from the current addressbase, we can safely delete that outcode's files. An example of this occurring is outcode NP90 between 'addressbase/2025-12-17' and 'addressbase/2026-02-02'.
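
The stale-outcode rule amounts to a set difference, sketched below. The function name is hypothetical, and apart from NP90 (the example above) the outcodes are made up for illustration; how the two outcode lists are obtained is outside the scope of the sketch.

```python
def find_stale_outcodes(by_outcode, by_first_letter):
    """Return outcodes present in the 'by outcode' table but absent from
    the 'by first letter' table, sorted for stable output."""
    return sorted(set(by_outcode) - set(by_first_letter))


# NP90 comes from the example above; NP10 and NP20 are invented.
outcodes_in_by_outcode_table = ["NP10", "NP20", "NP90"]
outcodes_in_by_first_letter_table = ["NP10", "NP20"]

stale = find_stale_outcodes(
    outcodes_in_by_outcode_table, outcodes_in_by_first_letter_table
)
print(stale)  # ['NP90']
```

Only the difference in this direction is treated as stale; outcodes that appear in the 'by first letter' table alone are not touched.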

GeoWill had a problem deploying to development June 23, 2026 09:55 — with GitHub Actions Failure

GeoWill added 4 commits June 23, 2026 11:51

GeoWill force-pushed the uprn-integrity-checks branch from 0f6548e to 8ee3038 June 23, 2026 10:56

GeoWill temporarily deployed to development June 23, 2026 10:56 — with GitHub Actions Inactive

GeoWill had a problem deploying to development June 23, 2026 10:58 — with GitHub Actions Failure

GeoWill temporarily deployed to development June 23, 2026 12:28 — with GitHub Actions Inactive

GeoWill temporarily deployed to development June 23, 2026 12:30 — with GitHub Actions Inactive

GeoWill force-pushed the uprn-integrity-checks branch from a0b870a to 2738411 June 23, 2026 12:40

GeoWill temporarily deployed to development June 23, 2026 12:41 — with GitHub Actions Inactive

GeoWill deployed to development June 23, 2026 12:43 — with GitHub Actions Active

GeoWill marked this pull request as ready for review June 23, 2026 12:52

GeoWill wants to merge 5 commits into main from uprn-integrity-checks
