Releases: openvax/pyensembl
v2.10.1
What's new
Resolves #169: three-tier protein-coding biotype ontology.
pyensembl.Gene / pyensembl.Transcript now expose three layered flags for "does this entry make a polypeptide?":
| Flag | Includes |
|---|---|
| is_protein_coding (unchanged) | strict canonical protein_coding only |
| is_protein_coding_extended (new) | + IG_{C,D,J,V}_gene, TR_{C,D,J,V}_gene, polymorphic_pseudogene, translated_{processed,unprocessed}_pseudogene |
| is_translated (new) | + nonsense_mediated_decay, non_stop_decay |
The strict tier is unchanged so downstream effect predictors like varcode keep their existing behavior. Use is_protein_coding_extended when you want IG/TR gene segments and translated pseudogenes (e.g. immunology workflows). Use is_translated when you only care about ribosome occupancy regardless of stable expression (e.g. peptide search, top-variant-effect picking).
The underlying biotype sets are exported as PROTEIN_CODING_BIOTYPES, EXTENDED_PROTEIN_CODING_BIOTYPES, and TRANSLATED_BIOTYPES from pyensembl.locus_with_genome for callers who want to derive their own categorization.
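A minimal sketch of how the three tiers nest, assuming each tier is a superset of the previous one (as the "+" entries in the table suggest). The sets are re-declared locally so the snippet runs without pyensembl or a downloaded genome; real code would import them from pyensembl.locus_with_genome. The helper classify_biotype is hypothetical and not part of pyensembl.

```python
# Local stand-ins mirroring the release notes; real code would import these
# names from pyensembl.locus_with_genome instead of redefining them.
PROTEIN_CODING_BIOTYPES = frozenset({"protein_coding"})

EXTENDED_PROTEIN_CODING_BIOTYPES = PROTEIN_CODING_BIOTYPES | frozenset(
    [f"IG_{segment}_gene" for segment in "CDJV"]
    + [f"TR_{segment}_gene" for segment in "CDJV"]
    + [
        "polymorphic_pseudogene",
        "translated_processed_pseudogene",
        "translated_unprocessed_pseudogene",
    ]
)

TRANSLATED_BIOTYPES = EXTENDED_PROTEIN_CODING_BIOTYPES | frozenset(
    {"nonsense_mediated_decay", "non_stop_decay"}
)


def classify_biotype(biotype):
    """Return the narrowest tier that contains `biotype` (hypothetical helper)."""
    if biotype in PROTEIN_CODING_BIOTYPES:
        return "protein_coding"
    if biotype in EXTENDED_PROTEIN_CODING_BIOTYPES:
        return "extended"
    if biotype in TRANSLATED_BIOTYPES:
        return "translated"
    return "not_translated"
```

Checking the tiers from narrowest to widest means a caller can pick the strictest category that applies, or test membership in one set directly when only a yes/no answer is needed.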
Full Changelog: v2.10.0...v2.10.1
v2.10.0
Closes #351 — FASTA-header versions are now preserved in SequenceData instead of stripped at parse time.
What's new
- fasta_parse._parse_header_id keeps ENS.N version suffixes and properly splits GENCODE pipe-delimited headers. SequenceData keys versioned IDs verbatim and builds a _stripped_index for bare-ID resolution (the GENCODE case).
- New SequenceData.fasta_version(id) accessor.
- New Transcript.fasta_version: version recorded in the cDNA FASTA header (vs transcript_version from the GTF).
- New Protein.fasta_version: version recorded in the protein FASTA header. Protein now carries an optional genome= reference.
Compatibility
- Existing v1 (bare-keyed) pickle caches load cleanly under the new code path. No re-index forced on upgrade.
- Pure Ensembl callers see no behavior change.
When the GTF-derived *_version disagrees with fasta_version, the FASTA-header version is the authoritative source-of-truth for the bytes returned by transcript.sequence / transcript.protein_sequence.
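The header handling in this release can be sketched roughly as follows. This is not pyensembl's implementation: parse_header_id, build_stripped_index, and lookup are hypothetical names, and the IDs in the usage note are made up for illustration.

```python
def parse_header_id(header):
    """Return the sequence ID from a FASTA header, keeping any .N version."""
    # Drop the leading '>' and take the first whitespace-delimited token.
    token = header.lstrip(">").split()[0]
    # GENCODE headers pack several fields into one pipe-delimited token;
    # the first field is the versioned ID.
    return token.split("|")[0]


def build_stripped_index(versioned_ids):
    """Map each bare ID (numeric version removed) to its versioned key."""
    index = {}
    for key in versioned_ids:
        bare, dot, version = key.rpartition(".")
        if dot and version.isdigit():
            # If two versions share a bare ID, the last one seen wins.
            index[bare] = key
    return index


def lookup(sequences, stripped_index, seq_id):
    """Try the ID verbatim first, then resolve a bare ID via the index."""
    if seq_id in sequences:
        return sequences[seq_id]
    key = stripped_index.get(seq_id)
    return sequences.get(key) if key is not None else None
```

With sequences keyed by the made-up ID "ENST00000000001.2", both lookup(sequences, index, "ENST00000000001.2") and lookup(sequences, index, "ENST00000000001") resolve to the same entry, while the stored key still records which version the FASTA file actually contained.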
v2.9.8
Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES from gtfparse 2.7.0 into pyensembl's read_gtf call. GENCODE GTFs (which use gene_type / transcript_type) now get those columns renamed to the Ensembl canonical gene_biotype / transcript_biotype at parse time, so Transcript.is_protein_coding and biotype-filtered queries work without a manual rename pass. Bumps gtfparse dep floor to >=2.7.0. Combined with v2.9.6 (versioned protein-ID FASTA matching), this closes the original #335 GENCODE-genome repro end-to-end.
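A rough illustration of the parse-time rename, assuming a plain dict of column name to values rather than the table that gtfparse actually returns. BIOTYPE_ALIASES is a local stand-in whose shape is an assumption (the real GENCODE_BIOTYPE_ALIASES lives in gtfparse), and rename_biotype_columns is a hypothetical helper.

```python
# Local stand-in for illustration; the shape of gtfparse's real
# GENCODE_BIOTYPE_ALIASES is assumed, not confirmed.
BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def rename_biotype_columns(columns, aliases=BIOTYPE_ALIASES):
    """Return a new column dict with GENCODE names mapped to Ensembl names."""
    renamed = {}
    for name, values in columns.items():
        canonical = aliases.get(name, name)
        # If the canonical column already exists, keep it and drop the alias.
        if canonical != name and canonical in columns:
            continue
        renamed[canonical] = values
    return renamed
```

Renaming once at parse time means downstream code only ever has to look for gene_biotype and transcript_biotype, whichever annotation source produced the GTF.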
v2.9.7
Internal: rename the FASTA lookup helper added in v2.9.6 from sequence_lookup_with_ens_fallback to lookup_sequence_with_version_fallback. The old name was misleading — both Ensembl and GENCODE IDs start with ENS; the actual fallback is to a version-stripped form, and the ENS-prefix check is just a guard against stripping non-Ensembl .N isoform suffixes. No public API change (helper isn't exported from pyensembl/__init__.py).
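The fallback behavior described here might look something like the sketch below. It is an assumption-laden illustration, not the actual helper: the function name, signature, and the made-up IDs in the usage note are hypothetical.

```python
def lookup_with_version_fallback(sequences, seq_id):
    """Look up seq_id verbatim, then retry without its .N version suffix.

    Hypothetical sketch: the suffix is only stripped for IDs starting with
    'ENS', so a non-Ensembl ID such as 'P12345.2' keeps its '.2'.
    """
    if seq_id in sequences:
        return sequences[seq_id]
    bare, dot, version = seq_id.rpartition(".")
    if dot and version.isdigit() and seq_id.startswith("ENS"):
        return sequences.get(bare)
    return None
```

For example, with sequences keyed by the bare made-up ID "ENSP00000000001", a query for "ENSP00000000001.3" falls back to the bare key, whereas a query for "P12345.2" returns None instead of silently matching "P12345".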
v2.9.6
v2.9.5
Follow-up to PR #334: adds Xenopus (xenopus_tropicalis) on main Ensembl with two assemblies (Xenopus_tropicalis_v9.1 for r98-106, UCB_Xtro_10.0 for r107+), adds soybean (glycine_max) on Ensembl Plants, and tightens the maize / tomato lower release bounds from r40 to r54 / r42 respectively (the assembly versions don't actually exist before those releases). All generated URLs HEAD-verified against the live FTP servers.
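To make the release-to-assembly mapping concrete, here is a small sketch using the Xenopus ranges stated above. The table layout and the assembly_for_release helper are hypothetical and do not reflect how pyensembl stores species metadata.

```python
# (assembly name, first release, last release or None for open-ended),
# taken from the Xenopus ranges in the release note above.
XENOPUS_ASSEMBLIES = [
    ("Xenopus_tropicalis_v9.1", 98, 106),
    ("UCB_Xtro_10.0", 107, None),
]


def assembly_for_release(release, table=XENOPUS_ASSEMBLIES):
    """Return the assembly covering `release`, or None if out of range."""
    for name, first, last in table:
        if release >= first and (last is None or release <= last):
            return name
    return None
```

Returning None for a release below the lower bound mirrors the reason for tightening the maize and tomato bounds: a release with no matching assembly should not produce a URL at all.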