feat: add --cpg-only mode for "positive conversion"-based methylation methods #101
mmcguffie-twist wants to merge 3 commits into
Code Review
This pull request introduces a new --cpg-only option to bwameth.py to support converting cytosines only in the CpG context. The feedback highlights a potential bug in argument parsing for the c2t command where --cpg-only can disrupt positional indexing of fastq files. Additionally, the reviewer recommends replacing assert statements with standard conditional checks for command-line argument validation in the index and index-mem2 commands to prevent unhandled exceptions if assertions are globally disabled.
     if len(args) > 0 and args[0] == "c2t":
    -    sys.exit(convert_reads(args[1], args[2]))
    +    cpg_only = "--cpg-only" in args
    +    sys.exit(convert_reads(args[1], args[2], cpg_only=cpg_only))
When parsing arguments for the c2t command, if --cpg-only is provided before the fastq files (e.g., bwameth.py c2t --cpg-only reads_1.fq reads_2.fq), args[1] will be '--cpg-only' and args[2] will be 'reads_1.fq'. This will cause the command to fail because it treats --cpg-only as a fastq file.
Additionally, if a user runs single-end c2t manually (e.g., bwameth.py c2t reads_1.fq --cpg-only), args[2] will be '--cpg-only', which will also fail.
To make the argument parsing robust, filter out --cpg-only from the positional arguments and default the second fastq file to 'NA' if not provided.
Suggested change:

    if len(args) > 0 and args[0] == "c2t":
        cpg_only = "--cpg-only" in args
        c2t_args = [a for a in args[1:] if a != "--cpg-only"]
        if not c2t_args:
            sys.stderr.write("error: must specify at least one fastq file\n")
            sys.exit(1)
        fq1 = c2t_args[0]
        fq2 = c2t_args[1] if len(c2t_args) > 1 else "NA"
        sys.exit(convert_reads(fq1, fq2, cpg_only=cpg_only))
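To make the two failure cases in this comment easy to check, here is a minimal standalone sketch of the same filtering logic. The helper name split_c2t_args is hypothetical and not part of bwameth.py; it takes the arguments that follow the c2t subcommand and raises instead of exiting so it can be tested.

```python
def split_c2t_args(args):
    """Separate the --cpg-only flag from the positional fastq arguments.

    `args` is assumed to be the argument list *after* the "c2t" subcommand.
    Returns (fq1, fq2, cpg_only); fq2 defaults to "NA" for single-end input.
    """
    cpg_only = "--cpg-only" in args
    # Drop the flag wherever it appears so positional indexing is stable.
    positional = [a for a in args if a != "--cpg-only"]
    if not positional:
        raise ValueError("must specify at least one fastq file")
    fq1 = positional[0]
    fq2 = positional[1] if len(positional) > 1 else "NA"
    return fq1, fq2, cpg_only
```

With this shape, the flag can appear before, between, or after the fastq paths without shifting which argument is treated as a file.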
     if len(args) > 0 and args[0] == "index":
    -    assert len(args) == 2, ("must specify fasta as 2nd argument")
    -    sys.exit(bwa_index(convert_fasta(args[1])))
    +    cpg_only = "--cpg-only" in args
    +    idx_args = [a for a in args[1:] if a != "--cpg-only"]
    +    assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
    +    sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only)))
Using assert statements for command-line argument validation is discouraged in Python because assertions can be globally disabled when Python is run with the -O (optimize) flag. If disabled, the assertion is ignored, which would lead to an unhandled IndexError on the subsequent line when accessing idx_args[0].
Instead, use a standard if statement to validate the arguments and print a helpful error message to sys.stderr before exiting.
Suggested change:

    if len(args) > 0 and args[0] == "index":
        cpg_only = "--cpg-only" in args
        idx_args = [a for a in args[1:] if a != "--cpg-only"]
        if len(idx_args) != 1:
            sys.stderr.write("error: must specify fasta as 2nd argument\n")
            sys.exit(1)
        sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only)))
References
- Do not use assert statements for runtime validation of user input or command-line arguments, as assertions can be globally disabled in Python using the -O flag.
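The -O behaviour this comment relies on can be observed directly. The sketch below is illustrative only (the function name runs_past_failed_assert is made up here): it runs a one-line program containing a failing assert in a child interpreter, with and without -O, and reports whether execution continued past the assert.

```python
import subprocess
import sys

# A tiny program: a failing assert followed by a print.
SNIPPET = "assert False, 'validation failed'; print('reached')"


def runs_past_failed_assert(optimize: bool) -> bool:
    """Return True if the child interpreter got past the failing assert."""
    cmd = [sys.executable]
    if optimize:
        cmd.append("-O")  # -O strips assert statements at compile time
    cmd += ["-c", SNIPPET]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and "reached" in result.stdout
```

Under -O the assert is compiled away and the print runs, which is why an explicit if check with sys.exit(1) is the safer way to validate arguments.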
     if len(args) > 0 and args[0] == "index-mem2":
    -    assert len(args) == 2, ("must specify fasta as 2nd argument")
    -    sys.exit(bwa_index(convert_fasta(args[1]), ver = "mem2"))
    +    cpg_only = "--cpg-only" in args
    +    idx_args = [a for a in args[1:] if a != "--cpg-only"]
    +    assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
    +    sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only), ver = "mem2"))
Using assert statements for command-line argument validation is discouraged in Python because assertions can be globally disabled when Python is run with the -O (optimize) flag. If disabled, the assertion is ignored, which would lead to an unhandled IndexError on the subsequent line when accessing idx_args[0].
Instead, use a standard if statement to validate the arguments and print a helpful error message to sys.stderr before exiting.
Suggested change:

    if len(args) > 0 and args[0] == "index-mem2":
        cpg_only = "--cpg-only" in args
        idx_args = [a for a in args[1:] if a != "--cpg-only"]
        if len(idx_args) != 1:
            sys.stderr.write("error: must specify fasta as 2nd argument\n")
            sys.exit(1)
        sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only), ver = "mem2"))
References
- Do not use assert statements for runtime validation of user input or command-line arguments, as assertions can be globally disabled in Python using the -O flag.
@brentp I'll leave these Gemini suggestions to your discretion, but they all lgtm
In CpG-only mode, CpG-poor regions produce identical forward and reverse reference copies (CG→TG does nothing when there are no CpGs). BWA finds two perfect hits to the same position and assigns MAPQ=0. This causes ~280K reads to be incorrectly filtered genome-wide.

Fix: detect reads whose only alternative alignment (XA tag) is the opposite strand at the same position, then compute a principled MAPQ using BWA's own formula with the next-best non-strand alternative score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
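A rough sketch of the detection step described in this commit, not the PR's actual code. It assumes BWA's usual XA layout (chr,±pos,CIGAR,NM; for each alternative hit) and that the two converted reference copies are named by prefixing the contig with f or r. The helper names parse_xa and only_alt_is_strand_twin are made up for illustration, and the MAPQ recomputation itself is deliberately left out.

```python
def parse_xa(xa: str):
    """Parse the value of an XA:Z: tag into (contig, pos, cigar, nm) tuples.

    The strand sign on the position is dropped; only the coordinate is kept.
    """
    hits = []
    for entry in xa.strip(";").split(";"):
        if not entry:
            continue
        # rsplit keeps any commas inside the contig name intact.
        contig, signed_pos, cigar, nm = entry.rsplit(",", 3)
        hits.append((contig, abs(int(signed_pos)), cigar, int(nm)))
    return hits


def only_alt_is_strand_twin(rname: str, pos: int, xa: str) -> bool:
    """True if the single alternative hit is the same locus on the other copy.

    Assumes contigs are named "f<chrom>" / "r<chrom>" (an assumption here).
    """
    hits = parse_xa(xa)
    if len(hits) != 1:
        return False
    contig, alt_pos, _cigar, _nm = hits[0]
    same_locus = contig[1:] == rname[1:] and alt_pos == pos
    other_copy = {contig[:1], rname[:1]} == {"f", "r"}
    return same_locus and other_copy
```

A read flagged this way has no genuinely competing location, only its mirror image in the other converted copy, which is the case the commit message says should not be given MAPQ=0.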
Thanks very much. Can you document this carefully? I think it will change (likely for the better) the behavior of default bwameth usage. Can you quantify that end-to-end if you have samples available?

Also, there are some changes from last month that you'll need to pull in to get a clean PR against master.

Yes, definitely. I initially thought this PR was a simple drop in replacement, but I then realized that in "CpG deserts" this approach had issues because BWAmem assigns these loci a MAPQ of 0 (maps to fwd and rev equally). I initially didn't see this signal because I was working with TE / hybrid capture data on CpG islands. I likely still need to tweak this and test -- apologies, it seems this PR was a little premature.
Of course, once I have what I feel is a correct fix in place I will quantify and share the results. Thanks again!

Thank you!

Add --cpg-only mode for "positive conversion"-based methylation methods

Adds a --cpg-only flag that restricts bisulfite-style conversion to CpG context only, for use with methods whose conversion chemistry is the opposite of classical bisulfite sequencing (e.g., Cytosine Deaminase, TAPS+).

Motivation

Standard bisulfite/bwameth converts all C→T (forward) and G→A (reverse). Cytosine Deaminase (CDA) chemistry (and TAPS+) only deaminates methylated cytosines in CpG context, so the in-silico conversion should match: only CpG sites are converted in both the reference and reads. This preserves non-CpG sequence identity, improving alignment specificity and reducing spurious mismatches.
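As a rough illustration of what restricting the in-silico conversion to CpG context means, here is a minimal sketch. The function convert_cpg_only and its strand argument are hypothetical and simplified; this is not the code in convert_fasta() or convert_and_write_read(), and it ignores details such as soft-masked (lower-case) bases handling beyond upper-casing.

```python
def convert_cpg_only(seq: str, strand: str) -> str:
    """Convert only CpG dinucleotides, leaving all other bases untouched.

    strand "f": CG -> TG (the C of each CpG becomes T)
    strand "r": CG -> CA (the G of each CpG becomes A)
    """
    seq = seq.upper()
    if strand == "f":
        return seq.replace("CG", "TG")
    if strand == "r":
        return seq.replace("CG", "CA")
    raise ValueError("strand must be 'f' or 'r'")


if __name__ == "__main__":
    print(convert_cpg_only("ACGTCCGG", "f"))  # ATGTCTGG
    print(convert_cpg_only("ACGTCCGG", "r"))  # ACATCCAG
```

Note that a sequence containing no CG dinucleotide comes out unchanged, and therefore identical, under both conversions.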
How it works

The conversion creates 2 reference copies (same as standard mode):

- f* contigs: CG→TG (forward)
- r* contigs: CG→CA (reverse)

Reads are converted the same way before alignment:

- CG→TG
- CG→CA

This collapses methylation state at CpG sites: both methylated (TG from conversion) and unmethylated (CG) become TG in the aligner input, matching the converted reference. After alignment, the original sequence is restored from the YS:Z: tag.

Changes

- convert_fasta(): cpg_only parameter converts only CpG sites in f*/r* contigs. Uses .bwameth.cpg.c2t suffix.
- convert_and_write_read(): cpg_only parameter applies CG→TG / CG→CA instead of C→T / G→A.
- convert_fqs() / convert_reads(): thread cpg_only flag through to read conversion.
- bwa_mem(): passes cpg_only to select the correct index.
- --cpg-only flag on index, index-mem2, and alignment commands.

Usage

Let me know if you have any questions or other asks @brentp! Happy to get a testing framework in place too if you think it would be helpful. Thanks so much for this tool.