WIP: Rebuild everything into a completely new pipeline, handling both SNVs and CNVs by ifokkema · Pull Request #19 · LOVDnl/VKGL_import

ifokkema · 2026-05-18T11:25:45Z

This is a work in progress; this PR has been created for code reviews.

We'll fully automate interaction with the servers, so we'll need the SSH key passphrases. The surest way to check a passphrase as soon as it's given would be to connect to the server, but that's overkill. Instead, store a hash of the passphrase and compare against it. A matching hash doesn't guarantee that the passphrase works, but once the hash of a working passphrase has been cached, we have a quick check.
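
The hash-and-compare idea could look roughly like the sketch below. It is written in Python purely for illustration and is not the PR's code. The function names, the salt handling, and the choice of PBKDF2-HMAC-SHA256 are all assumptions; the commit message only says that a hash is stored and compared.

```python
import hashlib
import hmac


def hash_passphrase(passphrase: str, salt: bytes) -> str:
    """Derive a hex digest from the passphrase (assumed scheme: PBKDF2-HMAC-SHA256)."""
    digest = hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, 100_000)
    return digest.hex()


def passphrase_matches(passphrase: str, salt: bytes, cached_hash: str) -> bool:
    """Quick local check against a hash that was cached after a successful connection."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(hash_passphrase(passphrase, salt), cached_hash)
```

A mismatch only shows that the passphrase differs from the cached one; whether the cached one still works on the server is a separate question, as noted above.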

This requires the validator to store the statistics internally, after which the pipeline can retrieve and store them. This also removes code from the validator that loads the Settings class. Also, don't do tricks with the directory names that only work for as long as we don't update the directory structure. Use the proper variables so it always works.

- Renamed to validateAggregatedData(), which follows naming guidelines and allows us later to use the Validator class for other validations, too.
- Fix or improve some comments, explaining better what we're actually doing right now.
- Instead of creating an array just to use count() on it a few times, better get the count right away.
- Use single-use variables only if it significantly shortens a line and improves readability.

I couldn't figure out what some of the code did, so I had to remove it. Overall, we ended up with more variants. That's in part because of the improvements in the HGVS library, but it also seems that the old code sometimes rejected entries that were fine. The new code also makes sure that rejected lines are actually logged, so they end up in the error log.

ifokkema and others added 30 commits February 20, 2026 16:49

Add the basics of a Settings helper class.

b4eb8e6

When settings.json can't be loaded, try to create it.

6e8c7b8

Add method to get data from the settings.

03a92f4

Add a method to save the settings.

f19c658

Add method to set the value of a setting.

71130b2

Add type hints to all methods in the Settings class.

8162323

Add support for nested arrays in Settings::set().

6c89c78

Also add support to Settings::get() for nested arrays.

d825dcd

Rebuild the Settings class to no longer be static.

a2c23cf

I ran into problems when I wanted to use this class for multiple files at the same time. So we'll have to use it as a normal object, with a constructor that receives the file name.
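
To illustrate the design described in the commits above (one instance per settings file, nested get and set, explicit save), here is a minimal Python sketch. It is not the PR's implementation. The method signatures, the JSON layout, and the example keys are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any


class Settings:
    """Hypothetical non-static settings holder; one instance per file."""

    def __init__(self, file_name: str) -> None:
        self.path = Path(file_name)
        # When the file can't be loaded, try to create it (empty).
        if not self.path.exists():
            self.path.write_text("{}", encoding="utf-8")
        self.data: dict = json.loads(self.path.read_text(encoding="utf-8"))

    def get(self, *keys: str, default: Any = None) -> Any:
        """Walk nested keys, e.g. get('release', 'date')."""
        node: Any = self.data
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                return default
            node = node[key]
        return node

    def set(self, value: Any, *keys: str) -> None:
        """Set a nested value, creating intermediate dicts as needed."""
        if not keys:
            raise ValueError("At least one key is required.")
        node = self.data
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value

    def save(self) -> None:
        """Write the current settings back to this instance's file."""
        self.path.write_text(json.dumps(self.data, indent=4), encoding="utf-8")
```

Because the state lives in the instance rather than in static members, two files can be open at the same time without interfering with each other.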

Start the basics of the pipeline, creating some settings.

253985f

The exit codes are re-used everywhere; better create them here and add them to the settings.

Move the hard-coded release dates to the settings file.

bfab944

The settings file is not committed, so this needs to be done manually in the repo, configured to your needs.

Add a helper class to write to log files.

ccb7336

It can be configured to also write to the screen at the same time.
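
A log helper with an optional echo to the screen might be sketched as follows, in Python for illustration only. The class name, the timestamp format, and the per-line prefix for multi-line messages are assumptions, not details taken from the PR.

```python
import sys
from datetime import datetime


class Logger:
    """Hypothetical log helper: appends to a file, optionally echoing to the screen."""

    def __init__(self, file_name: str, echo: bool = False) -> None:
        self.file_name = file_name
        self.echo = echo

    def write(self, message: str) -> None:
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        # Prefix every line, so multi-line entries stay readable in the file.
        lines = [f"{stamp} {line}" for line in message.splitlines() or [""]]
        text = "\n".join(lines) + "\n"
        with open(self.file_name, "a", encoding="utf-8") as handle:
            handle.write(text)
        if self.echo:
            sys.stdout.write(text)
```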

Prepare the release folder and the status log.

45b03c8

Stick to the standard for variable naming in classes.

14c4828

Convert some older settings to a newer format.

cca414a

We'll need to store here the files that we need to create a release. That's better done per center, which means we need to rebuild how we store information on the centers in the settings.

Delete some old settings.

5eda76b

This required me to add a feature to delete settings.
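
Deleting a nested setting could work along these lines. This is a Python sketch on a plain dict; the function name, the return value, and the example keys are hypothetical.

```python
def delete_setting(data: dict, *keys: str) -> bool:
    """Remove a nested key from a settings dict; return True if something was removed."""
    if not keys:
        return False
    node = data
    # Walk down to the parent of the key that should be removed.
    for key in keys[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return False
    if keys[-1] in node:
        del node[keys[-1]]
        return True
    return False
```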

Fix bug; multi-line log entries weren't stored well.

fadd64d

Check and then init the release status.

d1f87ed

Check if we have all the files that are configured in the settings.

ecf8823

Also check for .gz files that we can decompress.
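
The check for configured files, including the fallback to a compressed copy, can be sketched in Python as below. The function names and the convention that the compressed copy is simply the file name plus `.gz` are assumptions for illustration.

```python
import gzip
import shutil
from pathlib import Path


def ensure_file(path: Path) -> bool:
    """Return True if `path` exists, decompressing `path` + '.gz' when only that is present."""
    if path.exists():
        return True
    compressed = path.with_name(path.name + ".gz")
    if not compressed.exists():
        return False
    with gzip.open(compressed, "rb") as source, open(path, "wb") as target:
        shutil.copyfileobj(source, target)
    return True


def missing_files(folder: Path, configured: list[str]) -> list[str]:
    """List the configured file names that are neither present nor decompressible."""
    return [name for name in configured if not ensure_file(folder / name)]
```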

Start rebuilding the formatter.

64e52db

Handle the NKI JSON format.

08c0b20

Process the UMCG JSON data format.

a7d1e9f

Better keep "SNV" and "CNV" data formats apart.

3373f71

I don't like those terms, but that's what they're called within the VKGL project.

Implement the basics of TSV parsing and add the Alissa format.

8a89eb1

Add support for another Alissa format.

957f66b

Rewrite the handling of the LUMC format.

44023c3

We will no longer generate VCF, but use the given HGVS. The VCF was causing issues for inversions, which were mistranslated into WT (wild-type) variants. This will also solve that.

Updating the HGVS library.

17ad14b

We need this for the Radboud format.

Rewrote the handling of the Radboud/MUMC format.

1f437a4

Used the HGVS library to parse whatever value is in the transcripts_or_dna field, which is a mixed bag of transcripts, cDNA descriptions (with or without transcripts), genomic DNA descriptions, or protein descriptions.

Add the NKI format, and clean up the code.

d890a68

ifokkema and others added 30 commits May 29, 2026 16:42

Finish rewriting the new validateAggregatedData().

7c54b7e

- We don't need an else when all if()s and elseif()s die. Not using that else will reduce the indentation and improve readability.
- Instead of using arrays that then need to be counted, just increase a simple counter by one each time.
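
The two points above can be shown with a small Python sketch. The field names and messages are made up for the example, and `sys.exit()` stands in for `die`.

```python
import sys


def validate_rows(rows: list[dict]) -> dict:
    """Count outcomes with plain counters; fatal problems end the run right away."""
    counts = {"accepted": 0, "rejected": 0}
    for row in rows:
        if "id" not in row:
            sys.exit("Fatal: row without an ID.")
        if "classification" not in row:
            sys.exit("Fatal: row without a classification.")
        # No else branch is needed here: both checks above end the run,
        # so the code below only runs for rows that passed them.
        if row["classification"]:
            counts["accepted"] += 1
        else:
            counts["rejected"] += 1
    return counts
```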

Update the HGVS library to v1.2.0.

adfe10a

Formatter: Add the source to the data when we know it.

3cf33df

Formatter: Add support for the Franklin CNV format.

3e08cf7

Formatter: Add support for the NxClinical format.

21c7bcb

We can simplify the formatting, as the newest HGVS library has new functionality that will help us later (e.g., the pter/qter recognition).

Fix all zygosity naming.

2443519

Use the HGVS library to parse the Radboud CNV data.

54d30b3

The logic has, with modifications, been migrated there.

Process the HGVS column.

e0ab21b

Don't always use the result; only when it looks valid. Also fix some issues with the pter/qter to '?' replacement.
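
As a rough illustration of the pter/qter to '?' replacement, the Python sketch below swaps the chromosome-end markers for an unknown position. The regular expression and the function name are assumptions. The PR relies on the HGVS library for the actual parsing, which is not reproduced here.

```python
import re

# Match 'pter' or 'qter' only when it is not part of a longer word.
TER_PATTERN = re.compile(r"(?<![A-Za-z])[pq]ter(?![A-Za-z])")


def replace_ter(description: str) -> str:
    """Replace pter/qter markers in a description with '?'."""
    return TER_PATTERN.sub("?", description)
```

Lookarounds are used instead of `\b`, because an underscore counts as a word character and would otherwise block a match in ranges such as `pter_12345`.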

Create the third HGVS description based on the position fields.

d779876

Clean up after the last CNV format from the old code was migrated.

4d930b6

Add support for another UMCG CSV format.

37c766e

Let the formatter collect statistics about errors already.

db319db

Add the first version of the Normalizer.

7d6a8dd

Start by formatting everything to HGVS using our library.

b3d564a

Handle more formats.

89549ae

Check for invalid results and log these well.

477ce29

Let the pipeline send custom VV options to the Caches class.

575a77e

Start building the caches, check the result, and print updates.

570f339

Handle variants with errors (e.g., EREFs).

e72a752

Update the HGVS library to v1.2.2.

a186846

Collect the statistics in the status.json for now.

f4d8535

That way, if we have to re-run parts of the pipeline, their statistics will simply get overwritten, without old and new numbers accidentally getting mixed.
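
One way to get this overwrite-on-rerun behavior is to key the statistics by pipeline step, as in the Python sketch below. The `statistics` key, the step names, and the function name are assumptions; only the file name `status.json` comes from the commit title.

```python
import json
from pathlib import Path


def store_statistics(status_file: Path, step: str, statistics: dict) -> None:
    """Replace the statistics of one pipeline step, leaving other steps untouched."""
    if status_file.exists():
        status = json.loads(status_file.read_text(encoding="utf-8"))
    else:
        status = {}
    # Overwrite the step's entry as a whole; never merge with an earlier run.
    status.setdefault("statistics", {})[step] = statistics
    status_file.write_text(json.dumps(status, indent=4), encoding="utf-8")
```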

Improve the output of the normalizer.

13344df

Collect the data from the caches and store internally.

76b6ad4

Save the normalized output and error file.

dca6817

Let the normalizer collect statistics and store them.

b0cc98a

Update the formatter; Emedgene files don't always have pathogenicity.

56eb836

Update the Normalizer; remove rejected data from the normal output.

6ce9156

ifokkema wants to merge 116 commits into master from rebuild-everything
