Uprn integrity checks #50
Open
GeoWill wants to merge 5 commits into
Everywhere we were doing a row count, we can now check that we have the same rows. Calculating checksum(uprn) for two tables takes ~3 seconds, so there's no real cost to putting this everywhere. Row counts and distinct UPRN counts are also compared, which should catch duplicates.
Do an order-insensitive checksum and check for any duplicates. This lets us compare two Glue tables and check they contain the same set of UPRNs. They need to have a 'uprn' column. This should help us isolate where we're getting issues with duplicate UPRNs, and any future/unknown issues with missing UPRNs. Doing this on partition/unload is really just a proof of concept. The next step is to wire it into other steps.
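A minimal sketch of the comparison described above, assuming the tables are queried through Athena/Trino, where `checksum` is an order-insensitive aggregate. The names `build_uprn_integrity_query` and `find_problems`, and the shape of the stats dictionaries, are hypothetical and not taken from this PR.

```python
def build_uprn_integrity_query(table: str) -> str:
    """Build a query returning row count, distinct UPRN count and checksum."""
    return (
        "SELECT count(*) AS row_count, "
        "count(DISTINCT uprn) AS distinct_uprns, "
        # checksum() returns varbinary; to_hex() makes it easy to compare.
        "to_hex(checksum(uprn)) AS uprn_checksum "
        f'FROM "{table}"'
    )


def find_problems(left: dict, right: dict) -> list:
    """Compare the stats rows of two tables; an empty list means they agree."""
    problems = []
    for name, stats in (("left", left), ("right", right)):
        # count(DISTINCT ...) ignores NULLs, so NULL UPRNs also show up here.
        if stats["row_count"] != stats["distinct_uprns"]:
            problems.append(f"{name} table has duplicate or NULL UPRNs")
    if left["uprn_checksum"] != right["uprn_checksum"]:
        problems.append("UPRN checksums differ")
    return problems
```

Each table is queried once and the two result rows are compared in Python, so the check stays cheap and doesn't require a join between the tables.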
This checks the current_ballots_joined_to_address_base table. At the moment we don't define the current_elections_parquet table in cdk. This is the final product from this stack (current elections split by outcode). However, it's also not suitable for checksum comparison because we write empty files for outcodes where there is no relevant data. I've moved the location of the check to make it clearer what is being checked. It might be worth getting the current_elections_parquet table into cdk so that we can run some checks on it.
Rename data quality checks (source and checksum) to make it clear the checks are happening on the table partitioned by first letter. Add a source check to the table partitioned by outcode. This means the AddressbaseSourceCheckConstruct is used twice (once directly on outcodes, once via AddressbaseDataQualityCheckConstruct), so we have to namespace the states defined in the construct by construct_id, because state names need to be unique within a state machine.
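The namespacing can be sketched roughly as follows. This is an illustration only: the helper `namespaced_state_name`, the separator, and the construct IDs and state name used below are hypothetical, and the real construct presumably builds its names inside the CDK code.

```python
def namespaced_state_name(construct_id: str, state_name: str) -> str:
    """Prefix a state name with the ID of the construct that defines it."""
    return f"{construct_id}: {state_name}"


# The same construct instantiated twice in one state machine
# (hypothetical construct IDs and state name).
first_letter_state = namespaced_state_name("FirstLetterSourceCheck", "Run source check")
outcode_state = namespaced_state_name("OutcodeSourceCheck", "Run source check")

# Without the prefix both states would be called "Run source check"
# and would collide within the state machine.
names = [first_letter_state, outcode_state]
```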
If there are outcodes in the 'by outcode' table (current_boundary_reviews_parquet) that are not in the 'by first letter' table (current_boundary_reviews_joined_to_addressbase), then it is safe to assume that they are stale, presumably originating from a previous addressbase generation. Since we don't know about any addresses in such an outcode from the current addressbase, we can safely delete that outcode's files. An example of this occurring is outcode NP90 between 'addressbase/2025-12-17' and 'addressbase/2026-02-02'.
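The stale-outcode rule above amounts to a set difference. A minimal sketch, assuming the outcodes of each table have already been collected into Python sets; the function name `find_stale_outcodes` and the sample outcode SW1A are hypothetical, and the actual step in this PR may be implemented differently.

```python
def find_stale_outcodes(by_outcode: set, by_first_letter: set) -> set:
    """Return outcodes with files in the 'by outcode' table but no rows
    in the 'by first letter' table built from the current addressbase."""
    return by_outcode - by_first_letter


# NP90 only exists in the output from a previous addressbase generation.
stale = find_stale_outcodes({"NP90", "SW1A"}, {"SW1A"})
print(sorted(stale))  # ['NP90']
```

Only outcodes missing from the current 'by first letter' table are returned, so outcodes that legitimately have empty files are left alone.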
Previously we were comparing row counts, rather than whether we had the same addresses in two tables.
I've done a couple of things here: 1. replaced the row count checks with a checksum comparison (checksum exists in presto/trino and is an order-invariant aggregate function); 2. tried to make what we're checking, and the stage at which it happens, make more sense.
So if we want to check the data while it's in the table partitioned by first letter, I've put the check after the step that creates that table.
The checksum approach doesn't work on the 'by outcode' tables, because we sometimes write empty outcode files if we have no data. However, this is no worse than where we were before with row counts. I did find an edge case where we had a stale outcode, i.e. one that only existed in a previous addressbase version. I've added a step to clean these up.