Releases: openvax/pyensembl

v2.10.1

13 May 18:38
bcfbeaa

What's new

Resolves #169: three-tier protein-coding biotype ontology.

pyensembl.Gene / pyensembl.Transcript now expose three layered flags for "does this entry make a polypeptide?":

| Flag | Includes |
| --- | --- |
| is_protein_coding (unchanged) | strict canonical protein_coding only |
| is_protein_coding_extended (new) | + IG_{C,D,J,V}_gene, TR_{C,D,J,V}_gene, polymorphic_pseudogene, translated_{processed,unprocessed}_pseudogene |
| is_translated (new) | + nonsense_mediated_decay, non_stop_decay |

The strict tier is unchanged so downstream effect predictors like varcode keep their existing behavior. Use is_protein_coding_extended when you want IG/TR gene segments and translated pseudogenes (e.g. immunology workflows). Use is_translated when you only care about ribosome occupancy regardless of stable expression (e.g. peptide search, top-variant-effect picking).

The underlying biotype sets are exported as PROTEIN_CODING_BIOTYPES, EXTENDED_PROTEIN_CODING_BIOTYPES, and TRANSLATED_BIOTYPES from pyensembl.locus_with_genome for callers who want to derive their own categorization.
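As a rough illustration of how the three tiers nest, the sketch below rebuilds the sets locally from the biotypes listed above and derives the three flags from a biotype string. It reads the "+" in the table as "everything in the previous tier, plus these", which is an assumption. The `_SKETCH` names and `tier_flags` are hypothetical stand-ins and are not the constants or code shipped in pyensembl.

```python
# Local stand-ins for illustration only; not the sets exported by pyensembl.
PROTEIN_CODING_SKETCH = frozenset({"protein_coding"})

# Assumption: each tier is a superset of the one before it.
EXTENDED_SKETCH = PROTEIN_CODING_SKETCH | frozenset(
    [f"{family}_{segment}_gene" for family in ("IG", "TR") for segment in "CDJV"]
    + [
        "polymorphic_pseudogene",
        "translated_processed_pseudogene",
        "translated_unprocessed_pseudogene",
    ]
)

TRANSLATED_SKETCH = EXTENDED_SKETCH | frozenset(
    {"nonsense_mediated_decay", "non_stop_decay"}
)


def tier_flags(biotype: str) -> dict:
    """Return the three layered flags for a single biotype string."""
    return {
        "is_protein_coding": biotype in PROTEIN_CODING_SKETCH,
        "is_protein_coding_extended": biotype in EXTENDED_SKETCH,
        "is_translated": biotype in TRANSLATED_SKETCH,
    }
```

Under this reading, an IG_V_gene entry is excluded by the strict flag but included by both of the wider ones, and a nonsense_mediated_decay transcript is included only by is_translated.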

Full Changelog: v2.10.0...v2.10.1

v2.10.0

13 May 17:53
d25332b

Closes #351 — FASTA-header versions are now preserved in SequenceData instead of stripped at parse time.

What's new

  • fasta_parse._parse_header_id keeps ENS .N version suffixes and properly splits GENCODE pipe-delimited headers.
  • SequenceData keys versioned IDs verbatim and builds a _stripped_index for bare-ID resolution (the GENCODE case).
  • New SequenceData.fasta_version(id) accessor.
  • New Transcript.fasta_version — version recorded in the cDNA FASTA header (vs transcript_version from the GTF).
  • New Protein.fasta_version — version recorded in the protein FASTA header. Protein now carries an optional genome= reference.
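The idea behind the changes above can be sketched in a few lines: keep the `.N` suffix when reading the header, key sequences by the versioned ID, and build a secondary index from bare ID to versioned ID. This is a simplified model, not the code in fasta_parse or SequenceData; the function names, the integer return type of `version_from_id`, and the "first versioned ID wins" rule are all assumptions.

```python
def parse_header_id(header: str) -> str:
    """Return the sequence ID from a FASTA header, keeping any .N version.

    Handles both space-delimited Ensembl headers and pipe-delimited
    GENCODE headers by taking the first whitespace token, then the
    first pipe-separated field of that token.
    """
    text = header[1:] if header.startswith(">") else header
    tokens = text.split()
    first = tokens[0] if tokens else ""
    return first.split("|")[0]


def build_stripped_index(versioned_ids):
    """Map bare IDs to versioned IDs so a bare-ID query can still resolve."""
    index = {}
    for versioned_id in versioned_ids:
        bare, dot, version = versioned_id.rpartition(".")
        if dot and version.isdigit():
            # Assumption: the first versioned ID seen for a bare ID wins.
            index.setdefault(bare, versioned_id)
    return index


def version_from_id(versioned_id: str):
    """Return the numeric version suffix, or None if the ID has none."""
    bare, dot, version = versioned_id.rpartition(".")
    return int(version) if dot and version.isdigit() else None
```

With an index like this, a lookup can try the ID verbatim first and consult the stripped index only on a miss.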

Compatibility

  • Existing v1 (bare-keyed) pickle caches load cleanly under the new code path. No re-index forced on upgrade.
  • Pure Ensembl callers see no behavior change.

When the GTF-derived *_version disagrees with fasta_version, the FASTA-header version is authoritative for the bytes returned by transcript.sequence / transcript.protein_sequence.

v2.9.8

13 May 15:59
339f07d

Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES from gtfparse 2.7.0 into pyensembl's read_gtf call. GENCODE GTFs (which use gene_type / transcript_type) now get those columns renamed to the Ensembl canonical gene_biotype / transcript_biotype at parse time, so Transcript.is_protein_coding and biotype-filtered queries work without a manual rename pass. Bumps gtfparse dep floor to >=2.7.0. Combined with v2.9.6 (versioned protein-ID FASTA matching), this closes the original #335 GENCODE-genome repro end-to-end.
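The renaming step can be pictured as a plain key rename over parsed columns. The sketch below is only a model of that idea on a dict of columns: `ALIASES_SKETCH` is a hypothetical stand-in whose contents are inferred from the description above, not the actual value of GENCODE_BIOTYPE_ALIASES, and `rename_biotype_columns` is not a gtfparse function. Keeping an existing canonical column over its alias is likewise an assumption.

```python
# Hypothetical alias table; the real mapping in gtfparse may differ.
ALIASES_SKETCH = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def rename_biotype_columns(columns: dict) -> dict:
    """Rename GENCODE-style biotype columns to the Ensembl names."""
    renamed = {}
    for name, values in columns.items():
        target = ALIASES_SKETCH.get(name, name)
        # Assumption: if the canonical column already exists, keep it
        # and drop the alias rather than overwrite it.
        if target != name and target in columns:
            continue
        renamed[target] = values
    return renamed
```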

v2.9.7

13 May 14:38
5e08bb6

Internal: rename the FASTA lookup helper added in v2.9.6 from sequence_lookup_with_ens_fallback to lookup_sequence_with_version_fallback. The old name was misleading — both Ensembl and GENCODE IDs start with ENS; the actual fallback is to a version-stripped form, and the ENS-prefix check is just a guard against stripping non-Ensembl .N isoform suffixes. No public API change (helper isn't exported from pyensembl/__init__.py).

v2.9.6

13 May 03:14
cb94345

Fix #335 (part 2): tolerate versioned protein/transcript IDs in FASTA lookups for GENCODE-style genomes. Transcript.protein_sequence, Transcript.sequence, Genome.protein_sequence(id), and Genome.transcript_sequence(id) now strip ENS .N suffixes on lookup miss instead of returning None.
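A minimal sketch of the fallback, assuming sequences live in a dict keyed by bare ID: try the ID as given, and only on a miss strip a trailing `.N` from IDs that start with ENS. The regular expression and the function name below are illustrative assumptions, not the helper's actual implementation.

```python
import re

# Matches an ENS-prefixed ID followed by a numeric version suffix.
_ENS_VERSIONED = re.compile(r"^(ENS\w+)\.(\d+)$")


def lookup_with_version_fallback(sequences: dict, seq_id: str):
    """Look up seq_id verbatim, then retry without an ENS .N suffix."""
    hit = sequences.get(seq_id)
    if hit is not None:
        return hit
    match = _ENS_VERSIONED.match(seq_id)
    if match:
        # Only ENS-prefixed IDs are stripped, so a ".1" that is part of
        # a non-Ensembl isoform name is left alone.
        return sequences.get(match.group(1))
    return None
```

The ENS-prefix guard means that a query such as a custom ID ending in `.1` is never silently resolved to a different, unversioned entry.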

v2.9.5

12 May 23:20
58cc69a

Follow-up to PR #334: adds Xenopus (xenopus_tropicalis) on main Ensembl with two assemblies (Xenopus_tropicalis_v9.1 for r98-106, UCB_Xtro_10.0 for r107+), adds soybean (glycine_max) on Ensembl Plants, and tightens the maize / tomato lower release bounds from r40 to r54 / r42 respectively (the assembly versions don't actually exist before those releases). All generated URLs HEAD-verified against the live FTP servers.

v2.9.4

12 May 22:17
5f0d4c0

Fix #190: Genome.merged_gene_intervals(contig, strand=None) returns the union of all gene loci on the contig as a sorted list of non-overlapping (start, end) tuples. Adjacent intervals (end+1 == next start) are merged.
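The merge rule described here (overlapping or adjacent inclusive intervals collapse into one) can be sketched as a standard sort-and-sweep. This is an illustration of the rule on plain tuples, not the body of Genome.merged_gene_intervals; `merge_intervals` is a hypothetical name.

```python
def merge_intervals(intervals):
    """Merge inclusive (start, end) intervals that overlap or are adjacent."""
    merged = []
    for start, end in sorted(intervals):
        # Adjacent means previous end + 1 == this start, so "<= end + 1"
        # covers both the overlapping and the adjacent case.
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, (10, 20), (15, 30) and (31, 40) collapse into (10, 40), while (50, 60) stays separate because a gap remains.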

v2.9.3

12 May 22:08
92e4297

Fix #186: Locus.intersect(other, ignore_strand=False) returns a new Locus covering the inclusive-inclusive overlap, or None when the loci are disjoint, on different contigs, or on opposite strands.
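The inclusive-inclusive overlap works out to max of the starts and min of the ends, with None whenever that range is empty. The sketch below models this on a small stand-in dataclass rather than pyensembl's Locus; `Interval` and `intersect` are hypothetical, and returning the first operand's strand when ignore_strand=True is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Stand-in for a locus: inclusive coordinates on one contig and strand."""
    contig: str
    start: int
    end: int
    strand: str


def intersect(a: Interval, b: Interval, ignore_strand: bool = False) -> Optional[Interval]:
    """Return the inclusive overlap of a and b, or None if there is none."""
    if a.contig != b.contig:
        return None
    if not ignore_strand and a.strand != b.strand:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    if start > end:
        return None  # disjoint
    # Assumption: the result takes the strand of the first operand.
    return Interval(a.contig, start, end, a.strand)
```

Because the coordinates are inclusive on both ends, two intervals that share exactly one base yield a one-base result rather than None.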

v2.9.2

12 May 22:00
6582355

Fix #283: Genome.nearest_gene(contig, position, end=None, strand=None) and Genome.nearest_transcript(...) return a (distance, locus) pair for the closest annotated feature, even when no feature overlaps the query.
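One way to picture a nearest-feature query is as a minimum over interval distances, where an overlapping feature has distance 0. The sketch below works on plain (name, start, end) tuples; the distance convention (difference between the nearest endpoints, so touching neighbours are 1 apart), the None result for an empty feature list, and both function names are assumptions rather than pyensembl's documented behavior.

```python
def interval_distance(q_start, q_end, f_start, f_end):
    """Distance between two inclusive intervals; 0 if they overlap."""
    if f_end < q_start:
        return q_start - f_end  # feature lies entirely to the left
    if f_start > q_end:
        return f_start - q_end  # feature lies entirely to the right
    return 0


def nearest(features, position, end=None):
    """Return (distance, feature) for the closest (name, start, end) tuple."""
    if not features:
        return None  # assumption: no features means no answer
    q_end = position if end is None else end
    best = min(
        features,
        key=lambda f: interval_distance(position, q_end, f[1], f[2]),
    )
    return interval_distance(position, q_end, best[1], best[2]), best
```

A linear scan like this is only meant to show the semantics; it makes no claim about how the library searches its annotation database.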

v2.9.1

12 May 21:52
b28e207

Fix #177: Genome.genes(), Genome.transcripts(), gene_ids(), and transcript_ids() now accept a biotype= kwarg that pushes the filter into the SQL query.
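"Pushing the filter into the SQL query" means the database applies the biotype condition, instead of Python loading every row and filtering afterwards. The sketch below shows the pattern with the standard-library sqlite3 module; the table layout, column names, demo rows, and both functions are hypothetical and do not reflect pyensembl's actual schema.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with made-up gene rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id TEXT PRIMARY KEY, gene_biotype TEXT)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?)",
        [("G1", "protein_coding"), ("G2", "lncRNA"), ("G3", "protein_coding")],
    )
    return conn


def gene_ids(conn, biotype=None):
    """Return gene IDs, optionally restricted to one biotype in SQL."""
    query = "SELECT gene_id FROM gene"
    params = ()
    if biotype is not None:
        # Parameterized WHERE clause: the database does the filtering.
        query += " WHERE gene_biotype = ?"
        params = (biotype,)
    query += " ORDER BY gene_id"
    return [row[0] for row in conn.execute(query, params)]
```

Calling `gene_ids(conn)` returns every ID, while `gene_ids(conn, biotype="protein_coding")` returns only the matching rows without materializing the rest.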