
Uprn integrity checks #50

Open
GeoWill wants to merge 5 commits into main from uprn-integrity-checks


@GeoWill GeoWill commented Jun 23, 2026

Contributor

Previously we were comparing row counts, rather than whether we had the same addresses in two tables.

I've done a couple of things here:

1. Replaced the row count checks with a checksum comparison (`checksum` exists in Presto/Trino and is an order-invariant aggregate function).
2. Tried to make it clearer what we're checking and at what stage the check happens.
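
To illustrate why an order-invariant checksum is a stronger check than a row count, here is a minimal Python sketch. It is not Trino's actual `checksum` algorithm; it only assumes the same property (a commutative combination of per-row hashes). The function name and the UPRN values are made up for the example.

```python
import hashlib

MASK_64 = (1 << 64) - 1


def order_invariant_checksum(values):
    """Sum a 64-bit hash of each value, wrapping at 2**64.

    Addition is commutative, so row order cannot change the result.
    Illustration only: this is not the algorithm Trino uses.
    """
    total = 0
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) & MASK_64
    return total


# Hypothetical UPRNs, for illustration only.
table_a = [100012345678, 100087654321, 200000000001]
table_b = [200000000001, 100012345678, 100087654321]  # same UPRNs, different order
table_c = [100012345678, 100087654321, 200000000002]  # same row count, one UPRN differs

assert order_invariant_checksum(table_a) == order_invariant_checksum(table_b)
assert order_invariant_checksum(table_a) != order_invariant_checksum(table_c)
```

`table_a` and `table_c` have the same row count, so a row count check would pass them as equal; the checksum does not.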

So if we want to check the data while it's in the table partitioned by first letter, I've put the check after the step that creates that table.

The checksum approach doesn't work on the 'by outcode' tables, because we sometimes write empty outcode files if we have no data. However, this is no worse than where we were before with row counts. I did find an edge case where we had a stale outcode, i.e. one that only existed in a previous addressbase version. I've added a step to clean these up.


GeoWill added 4 commits June 23, 2026 11:51
Everywhere we were doing a row count, we can make sure we have the same
rows. Calculating checksum(uprn) for 2 tables takes ~3 seconds, so
there's no real cost to putting this everywhere.

Row counts and distinct UPRNs are also compared, which should catch
duplicates.
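
A rough sketch of how the three measures could be gathered and compared, assuming a Trino/Athena-style query engine. `UprnSummary`, `summary_sql` and `compare_summaries` are hypothetical names for this example, not the ones used in this PR, and running the query is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UprnSummary:
    row_count: int
    distinct_uprns: int
    checksum: str  # text form of the varbinary that checksum(uprn) returns


def summary_sql(table: str) -> str:
    # `table` is assumed to be a trusted identifier, not user input.
    return (
        "SELECT count(*) AS row_count, "
        "count(DISTINCT uprn) AS distinct_uprns, "
        "checksum(uprn) AS uprn_checksum "
        f"FROM {table}"
    )


def compare_summaries(source: UprnSummary, target: UprnSummary) -> list[str]:
    """Return a list of problems; an empty list means the tables agree."""
    problems = []
    if source.checksum != target.checksum:
        problems.append("checksum mismatch")
    if source.row_count != target.row_count:
        problems.append(
            f"row count mismatch: {source.row_count} vs {target.row_count}"
        )
    for label, summary in (("source", source), ("target", target)):
        duplicates = summary.row_count - summary.distinct_uprns
        if duplicates:
            problems.append(f"{label} has {duplicates} duplicate UPRN row(s)")
    return problems
```

Comparing `row_count` with `distinct_uprns` inside each table is what flags duplicates; the checksum comparison is what flags missing or extra UPRNs.
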
Do an order-insensitive checksum and check for any duplicates.
This lets us compare two glue tables, and check they contain the same
set of UPRNs. They need to have a 'uprn' column.

This should help us isolate where we're getting issues
with duplicate UPRNs, and any future/unknown issues with missing UPRNs.

Doing this on partition/unload is really just a proof of concept.
Next step is to wire it into other steps.
This checks the current_ballots_joined_to_address_base table. At the
moment we don't define the current_elections_parquet table in cdk. This
is the final product from this stack (current elections split by
outcode). However it's also not suitable for checksum comparison because
we write empty files for outcodes where there is no relevant data.

I've moved the location of the check to make it clearer what is being
checked. It might be worth getting the current_elections_parquet into
cdk so that we can run some checks on it.
Rename data quality checks (source and checksum) to make it clear the
checks are happening on the table partitioned by first letter.

Add a source check to the table partitioned by outcode.

This means the AddressbaseSourceCheckConstruct is used twice (once
directly on outcodes, once via AddressbaseDataQualityCheckConstruct).
As a result, states defined in the construct have to be namespaced by
construct_id, because state_names need to be unique within a state
machine.
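
The namespacing idea can be shown without any CDK code. In this toy sketch, `StateMachine`, `add_source_check` and the state name "Check source" are all hypothetical stand-ins; the real constructs in this PR will look different.

```python
class StateMachine:
    """Toy stand-in for a state machine that rejects duplicate state names."""

    def __init__(self):
        self.state_names = set()

    def add_state(self, name):
        if name in self.state_names:
            raise ValueError(f"duplicate state name: {name}")
        self.state_names.add(name)
        return name


def add_source_check(machine, construct_id):
    # Prefixing with construct_id keeps the name unique when the same
    # construct is instantiated more than once in one state machine.
    return machine.add_state(f"{construct_id}: Check source")


machine = StateMachine()
add_source_check(machine, "ByOutcodeSourceCheck")
add_source_check(machine, "ByFirstLetterSourceCheck")
assert len(machine.state_names) == 2
```

Without the prefix, the second call would try to register the same state name again and fail.
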
If there are outcodes in the 'by outcode' table
(current_boundary_reviews_parquet) that are not in the 'by first letter'
table (current_boundary_reviews_joined_to_addressbase) then it is safe
to assume that they are stale, presumably originating from a previous
addressbase generation. Since we don't know about any addresses in such
an outcode from the current addressbase, we can safely delete that
outcode's files.

An example of this occurring is outcode: NP90 between:
'addressbase/2025-12-17' and 'addressbase/2026-02-02'
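
The stale outcode rule above is a set difference. A minimal sketch, assuming the outcodes from each table have already been listed; `find_stale_outcodes` is a hypothetical name, and every outcode except NP90 is invented for the example.

```python
def find_stale_outcodes(by_outcode, by_first_letter):
    """Outcodes present in the 'by outcode' table but absent from the
    'by first letter' table are treated as stale."""
    return sorted(set(by_outcode) - set(by_first_letter))


# NP90 comes from the example above; the other outcodes are made up.
outcodes_in_by_outcode_table = ["NP10", "NP20", "NP90"]
outcodes_in_by_first_letter_table = ["NP10", "NP20"]

stale = find_stale_outcodes(
    outcodes_in_by_outcode_table, outcodes_in_by_first_letter_table
)
assert stale == ["NP90"]
```

Only the outcodes returned here would be candidates for deletion; outcodes that exist in both tables are left alone.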
@GeoWill GeoWill force-pushed the uprn-integrity-checks branch from a0b870a to 2738411 Compare June 23, 2026 12:40
@GeoWill GeoWill deployed to development June 23, 2026 12:43 — with GitHub Actions Active
@GeoWill GeoWill marked this pull request as ready for review June 23, 2026 12:52