Skip to content

feat: add --cpg-only mode for "positive conversion"-based methylation methods#101

Draft
mmcguffie-twist wants to merge 3 commits into
brentp:masterfrom
mmcguffie-twist:mcg-add-cda-support
Draft

feat: add --cpg-only mode for "positive conversion"-based methylation methods#101
mmcguffie-twist wants to merge 3 commits into
brentp:masterfrom
mmcguffie-twist:mcg-add-cda-support

Conversation

@mmcguffie-twist

Copy link
Copy Markdown

Add --cpg-only mode for "positive conversion"-based methylation methods

Adds a --cpg-only flag that restricts bisulfite-style conversion to CpG context only, for use with methods that use opposite chemistry conversion from classical bisulfite sequencing methods (eg: Cytosine Deaminase, TAPS+).

Motivation

Standard bisulfite/bwameth converts all C→T (forward) and G→A (reverse). Cytosine Deaminase (CDA) chemistry (and TAPS+) only deaminates methylated cytosines in CpG context, so the in-silico conversion should match: only CpG sites are converted in both the reference and reads. This preserves non-CpG sequence identity, improving alignment specificity and reducing spurious mismatches.

How it works

The conversion creates 2 reference copies (same as standard mode):

  • f* contigs: CG→TG (forward)
  • r* contigs: CG→CA (reverse)

Reads are converted the same way before alignment:

  • Read 1: CG→TG
  • Read 2: CG→CA

This collapses methylation state at CpG sites — both methylated (TG from conversion) and unmethylated (CG) become TG in the aligner input, matching the converted reference. After alignment, the original sequence is restored from the YS:Z: tag.

Changes

  • convert_fasta()cpg_only parameter converts only CpG sites in f*/r* contigs. Uses .bwameth.cpg.c2t suffix.
  • convert_and_write_read()cpg_only parameter applies CG→TG/CG→CA instead of C→T/G→A.
  • convert_fqs() / convert_reads() — thread cpg_only flag through to read conversion.
  • bwa_mem() — passes cpg_only to select the correct index.
  • CLI--cpg-only flag on index, index-mem2, and alignment commands.

Usage

# Build CpG-only index
bwameth.py index --cpg-only ref.fa

# Align
bwameth.py --cpg-only --reference ref.fa reads_1.fq reads_2.fq

Let me know if you have any questions or other asks @brentp! Happy to get a testing framework in place too if you think it would be helpful. Thanks so much for this tool.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new --cpg-only option to bwameth.py to support converting cytosines only in the CpG context. The feedback highlights a potential bug in argument parsing for the c2t command where --cpg-only can disrupt positional indexing of fastq files. Additionally, the reviewer recommends replacing assert statements with standard conditional checks for command-line argument validation in the index and index-mem2 commands to prevent unhandled exceptions if assertions are globally disabled.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread bwameth.py
Comment on lines 545 to +547
if len(args) > 0 and args[0] == "c2t":
sys.exit(convert_reads(args[1], args[2]))
cpg_only = "--cpg-only" in args
sys.exit(convert_reads(args[1], args[2], cpg_only=cpg_only))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When parsing arguments for the c2t command, if --cpg-only is provided before the fastq files (e.g., bwameth.py c2t --cpg-only reads_1.fq reads_2.fq), args[1] will be '--cpg-only' and args[2] will be 'reads_1.fq'. This will cause the command to fail because it treats --cpg-only as a fastq file.

Additionally, if a user runs single-end c2t manually (e.g., bwameth.py c2t reads_1.fq --cpg-only), args[2] will be '--cpg-only', which will also fail.

To make the argument parsing robust, filter out --cpg-only from the positional arguments and default the second fastq file to 'NA' if not provided.

Suggested change
if len(args) > 0 and args[0] == "c2t":
sys.exit(convert_reads(args[1], args[2]))
cpg_only = "--cpg-only" in args
sys.exit(convert_reads(args[1], args[2], cpg_only=cpg_only))
if len(args) > 0 and args[0] == "c2t":
cpg_only = "--cpg-only" in args
c2t_args = [a for a in args[1:] if a != "--cpg-only"]
if not c2t_args:
sys.stderr.write("error: must specify at least one fastq file\n")
sys.exit(1)
fq1 = c2t_args[0]
fq2 = c2t_args[1] if len(c2t_args) > 1 else "NA"
sys.exit(convert_reads(fq1, fq2, cpg_only=cpg_only))

Comment thread bwameth.py
Comment on lines 533 to +537
if len(args) > 0 and args[0] == "index":
assert len(args) == 2, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(args[1])))
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert statements for command-line argument validation is discouraged in Python because assertions can be globally disabled when Python is run with the -O (optimize) flag. If disabled, the assertion is ignored, which would lead to an unhandled IndexError on the subsequent line when accessing idx_args[0].

Instead, use a standard if statement to validate the arguments and print a helpful error message to sys.stderr before exiting.

Suggested change
if len(args) > 0 and args[0] == "index":
assert len(args) == 2, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(args[1])))
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only)))
if len(args) > 0 and args[0] == "index":
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
if len(idx_args) != 1:
sys.stderr.write("error: must specify fasta as 2nd argument\n")
sys.exit(1)
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only)))
References
  1. Do not use assert statements for runtime validation of user input or command-line arguments, as assertions can be globally disabled in Python using the -O flag.

Comment thread bwameth.py
Comment on lines 539 to +543
if len(args) > 0 and args[0] == "index-mem2":
assert len(args) == 2, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(args[1]), ver = "mem2"))
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only), ver = "mem2"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert statements for command-line argument validation is discouraged in Python because assertions can be globally disabled when Python is run with the -O (optimize) flag. If disabled, the assertion is ignored, which would lead to an unhandled IndexError on the subsequent line when accessing idx_args[0].

Instead, use a standard if statement to validate the arguments and print a helpful error message to sys.stderr before exiting.

Suggested change
if len(args) > 0 and args[0] == "index-mem2":
assert len(args) == 2, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(args[1]), ver = "mem2"))
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
assert len(idx_args) == 1, ("must specify fasta as 2nd argument")
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only), ver = "mem2"))
if len(args) > 0 and args[0] == "index-mem2":
cpg_only = "--cpg-only" in args
idx_args = [a for a in args[1:] if a != "--cpg-only"]
if len(idx_args) != 1:
sys.stderr.write("error: must specify fasta as 2nd argument\n")
sys.exit(1)
sys.exit(bwa_index(convert_fasta(idx_args[0], cpg_only=cpg_only), ver = "mem2"))
References
  1. Do not use assert statements for runtime validation of user input or command-line arguments, as assertions can be globally disabled in Python using the -O flag.

@mmcguffie-twist

Copy link
Copy Markdown
Author

@brentp I'll leave these Gemini suggestions to your discretion, but they all lgtm

@mmcguffie-twist mmcguffie-twist marked this pull request as draft June 3, 2026 04:12
In CpG-only mode, CpG-poor regions produce identical forward and reverse
reference copies (CG→TG does nothing when there are no CpGs). BWA finds
two perfect hits to the same position and assigns MAPQ=0. This causes
~280K reads to be incorrectly filtered genome-wide.

Fix: detect reads whose only alternative alignment (XA tag) is the opposite
strand at the same position, then compute a principled MAPQ using BWA's own
formula with the next-best non-strand alternative score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@brentp

brentp commented Jun 4, 2026

Copy link
Copy Markdown
Owner

thanks very much. Can you carefully document _compute_rescued_mapq ? Like each line so it's transparent. I don't exactly see, on quick look, how it's using the BWA formula that the docstring suggests.

and I think that this will change (likely for the better) the behavior of default bwameth usage. Can you quantify that end-to-end if you have samples available?

@brentp

brentp commented Jun 4, 2026

Copy link
Copy Markdown
Owner

also, there are some changes from last month that you'll need to pull in to get a clean PR against master

@mmcguffie-twist

Copy link
Copy Markdown
Author

thanks very much. Can you carefully document _compute_rescued_mapq ? Like each line so it's transparent. I don't exactly see, on quick look, how it's using the BWA formula that the docstring suggests.

Yes, definitely. I initially thought this PR was a simple drop in replacement, but I then realized that in "CpG deserts" this approach had issues because BWAmem assigns these loci a MAPQ of 0 (maps to fwd and rev equally). I initially didn't see this signal because I was working with TE / hybrid capture data on CpG islands. I likely still need to tweak this and test -- apologies, it seems this PR was a little premature.

and I think that this will change (likely for the better) the behavior of default bwameth usage. Can you quantify that end-to-end if you have samples available?

Of course, once I have what I feel is a correct fix in place I will quantify and share the results.

Thanks again!

@brentp

brentp commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants