Match same-CAS flows correctly when scoring against EF v3.1 by ccomb · Pull Request #136 · ccomb/volca

ccomb · 2026-06-10T12:57:50Z

Why

Scoring native ecoinvent 3.11 against EF v3.1 loaded from the JRC ILCD package diverged sharply from ecoinvent's own published EF v3.1 scores (embedded in the parallel _lcia package). The cause was in how buildMethodTables routes characterization factors into the engine's lookup tables.

Two distinct defects, same root — a CF landing in the wrong table:

- Regionalized rows polluted the broadcast name tables. Per-location CF rows were keyed by flow name alongside the global row, so one arbitrary location's value won the key. A water-abundant region's 0 erased the global water credit; a high-scarcity region's factor inflated the global charge. Net water consumers could score the wrong sign.
- The CAS bridge broadcast indiscriminately. Letting same-CAS flows share a CF is correct where the name carries no discrimination: every Water, * resource shares CAS 7732-18-5, so AWARE's "river water" must reach ecoinvent's "Water, river". But it also leaked a name-distinguished CF onto a sibling the method deliberately separates: "methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too, inflating biogenic climate ~80× and distorting the toxicity categories.
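
The first defect can be reproduced in a few lines. The sketch below is illustrative only: `CFRow`, `nameTableLeaky`, `nameTableGlobal` and the sample values are hypothetical and far simpler than the real buildMethodTables. It relies on one known fact: `M.fromList` keeps the last value seen for a duplicate key.

```haskell
import qualified Data.Map.Strict as M
import Data.Maybe (isNothing)

-- Hypothetical, simplified CF row: flow name, optional location, factor.
data CFRow = CFRow
  { cfName     :: String
  , cfLocation :: Maybe String   -- Nothing marks the global row
  , cfValue    :: Double
  } deriving (Show, Eq)

-- Leaky variant: every row, regionalized or not, is keyed by name.
-- M.fromList retains the last value for a duplicate key, so whichever
-- location happens to come last silently replaces the global factor.
nameTableLeaky :: [CFRow] -> M.Map String Double
nameTableLeaky rows = M.fromList [ (cfName r, cfValue r) | r <- rows ]

-- Guarded variant: only global rows reach the broadcast name table.
nameTableGlobal :: [CFRow] -> M.Map String Double
nameTableGlobal rows =
  M.fromList [ (cfName r, cfValue r) | r <- rows, isNothing (cfLocation r) ]

-- Made-up values, not taken from AWARE or EF v3.1.
sampleRows :: [CFRow]
sampleRows =
  [ CFRow "river water" Nothing     42.0   -- global row
  , CFRow "river water" (Just "XX") 0.0    -- a water-abundant region
  ]
```

With these rows, looking up "river water" in the leaky table yields the regional 0.0, while the guarded table yields the global 42.0.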

What changed

CAS and name aren't competing fallbacks — CAS identifies the molecule, the name identifies the variant. So:

- Name/synonym now resolves before the generic CAS bridge ([uuid, name, synonym, cas]).
- Only CAS-matched CFs populate the CAS bridge table.

A CF whose name/synonym pins a specific flow ("methane (biogenic)" → existing synonym → "Methane, non-fossil") never broadcasts across same-CAS variants, so fossil methane stays uncharacterized. A CF the name can't place ("river water") still bridges every same-CAS flow. Regionalized rows are kept out of the broadcast tables entirely.
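
A minimal sketch of that routing rule, under stated assumptions: the types `CF`, `Flow`, `Tables` and the functions `buildTables` and `lookupCF` are hypothetical, the UUID step is omitted, name and synonym matches share one table, and names match by exact string. The factor values are placeholders. The point is only that a CF placed by name or synonym never enters the CAS bridge.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import Control.Applicative ((<|>))

data CF   = CF   { cfName :: String, cfCas :: String, cfValue :: Double }
data Flow = Flow { flowName :: String, flowCas :: String }

data Tables = Tables
  { byName :: M.Map String Double   -- CFs pinned to a flow by name or synonym
  , byCas  :: M.Map String Double   -- CFs that no name or synonym could place
  }

-- synonyms maps a CF name to the flow name it stands for.
buildTables :: M.Map String String -> S.Set String -> [CF] -> Tables
buildTables synonyms flowNames cfs = Tables
  { byName = M.fromList [ (n, cfValue cf) | cf <- cfs, Just n <- [placed cf] ]
  , byCas  = M.fromList [ (cfCas cf, cfValue cf) | cf <- cfs, Nothing <- [placed cf] ]
  }
  where
    -- The flow a CF is pinned to, if its name or synonym names a known flow.
    placed cf
      | cfName cf `S.member` flowNames = Just (cfName cf)
      | otherwise = case M.lookup (cfName cf) synonyms of
          Just n | n `S.member` flowNames -> Just n
          _                               -> Nothing

-- Name/synonym first, CAS bridge last.
lookupCF :: Tables -> Flow -> Maybe Double
lookupCF t f = M.lookup (flowName f) (byName t) <|> M.lookup (flowCas f) (byCas t)

-- Placeholder factors; only the routing matters here.
sampleTables :: Tables
sampleTables = buildTables
  (M.fromList [("methane (biogenic)", "Methane, non-fossil")])
  (S.fromList ["Methane, non-fossil", "Methane, fossil", "Water, river", "Water, lake"])
  [ CF "methane (biogenic)" "74-82-8"   1.0
  , CF "river water"        "7732-18-5" 2.0
  ]
```

In this sketch, "Methane, non-fossil" resolves through the synonym, "Methane, fossil" finds nothing because the methane CF never reached the bridge, and both water flows pick up the "river water" factor through their shared CAS.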

Validation

Against ecoinvent's published EF v3.1 across all 25,412 activities:

Metric	before	after
Climate change-Biogenic (median ratio)	81×	1.00
Human toxicity, cancer	failing	~1.00
Water-use sign inversions	~23,000	~1,600

No regression in the climate/ozone/eutrophication/metals categories that already matched. A hermetic SharedCASCoverageSpec reproduces the water-sign and fossil-vs-biogenic cases; full suite green (1357).
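
For readers unfamiliar with the two metrics in the table, here is one way they could be computed from (computed, published) score pairs. This is a sketch under assumptions, not the validation harness used in this PR: `median`, `medianRatio` and `signInversions` are hypothetical names, and skipping zero reference scores is a choice made here for illustration.

```haskell
import Data.List (sort)

-- Median of a list; Nothing for an empty list.
median :: [Double] -> Maybe Double
median [] = Nothing
median xs
  | odd n     = Just (sorted !! mid)
  | otherwise = Just ((sorted !! (mid - 1) + sorted !! mid) / 2)
  where
    sorted = sort xs
    n      = length xs
    mid    = n `div` 2

-- Median of computed/published ratios, skipping zero reference scores.
medianRatio :: [(Double, Double)] -> Maybe Double
medianRatio pairs = median [ c / r | (c, r) <- pairs, r /= 0 ]

-- Number of activities whose computed and published scores differ in sign.
signInversions :: [(Double, Double)] -> Int
signInversions pairs = length [ () | (c, r) <- pairs, c * r < 0 ]
```

A median ratio near 1.00 means the typical activity matches the published score; a sign inversion means the engine reports a burden where the reference reports a credit, or the reverse.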

Follow-ups (same PR)

Two residuals are data-coverage gaps, not engine bugs, to be closed with data/ commits on this branch: the land-use-change carbon CF needs a synonym to ecoinvent's land-use carbon flow (else it falls to CAS and leaks onto fossil CO₂), and freshwater-eutrophication phosphate needs broader subcompartment coverage.

…ctly

Scoring native ecoinvent against EF v3.1 (loaded from the JRC ILCD package) diverged from ecoinvent's own published EF v3.1 scores because buildMethodTables routed characterization factors into the wrong lookup tables in two ways:

- Regionalized (per-location) CF rows leaked into the broadcast name tables. One location's value then won the shared name key (a water-abundant region's 0 erased the global water credit, a high-scarcity region's factor inflated the global charge), so a net water consumer could even score the wrong sign.
- The CAS bridge that lets many same-CAS flows share one CF (every "Water, *" resource shares 7732-18-5, so AWARE's "river water" must reach ecoinvent's "Water, river") broadcast indiscriminately. It leaked a name-distinguished CF onto a sibling the method separates: "methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too, inflating biogenic climate ~80x and the toxicity categories.

CAS and name are not competing fallbacks: CAS fixes the molecule, the name fixes the variant. So name/synonym now resolves before the generic CAS bridge, and only CAS-matched CFs populate that bridge. A CF whose name or synonym pins a specific flow ("methane (biogenic)" -> via the existing synonym -> "Methane, non-fossil") never broadcasts across same-CAS variants; a CF the name cannot place ("river water") still bridges every same-CAS flow. Regionalized rows stay out of the broadcast tables entirely.

Validated against ecoinvent's published EF v3.1 across all 25,412 activities: biogenic climate 81x -> 1.00, the human-toxicity cancer family from failing to ~1.00, water-use sign inversions cut from 23k to ~1.6k, no regression in the climate/ozone/eutrophication/metals categories that already matched.

Follow-ups to the same-CAS routing fix, same defect class:

- mtUuidCF admitted regionalized rows, so one arbitrary location's value could stand for a UUID-matched flow everywhere (M.fromList keeps the last row). Filter them out like the name tables; the rows still reach mtRegionalizedCF.
- The CAS bridge ignored subcompartments: a CF pinned to a niche subcomp broadcast onto every same-CAS flow in the medium, the exact leak cfSubcompMatchesFlow guards against elsewhere. The bridge tables now carry the CF's subcomp (wildcards at "") and the read paths probe the flow's own subcomp before the wildcard slot. The regionalized fallback also unions direct rows with CAS-bridged locations instead of taking the first non-empty map, so a flow with a few direct rows still picks up bridged locations beyond them.
- preferBetter still ranked ByCAS above ByName on name-key collisions, contradicting the cascade reorder; aligned to UUID > name > synonym > CAS.
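
The strategy ordering in the last item can be sketched with a derived `Ord` instance. Everything here is an assumption for illustration: the source names only ByCAS, ByName and preferBetter, so the constructors `ByUUID` and `BySynonym`, the `Match` type, the argument order of `preferBetter` and the `collect` helper are hypothetical.

```haskell
import qualified Data.Map.Strict as M

-- Constructors are listed weakest first, so the derived Ord gives
-- ByUUID > ByName > BySynonym > ByCAS.
data Strategy = ByCAS | BySynonym | ByName | ByUUID
  deriving (Show, Eq, Ord)

-- A candidate CF value tagged with the strategy that produced it.
type Match = (Strategy, Double)

-- Keep the candidate found by the stronger strategy; on a tie, keep the
-- one that was already in the table.
preferBetter :: Match -> Match -> Match
preferBetter new old
  | fst new > fst old = new
  | otherwise         = old

-- M.fromListWith calls its function as (f newValue oldValue), so collisions
-- on the same key are resolved by strategy rank rather than by list order.
collect :: [(String, Match)] -> M.Map String Match
collect = M.fromListWith preferBetter
```

For example, `collect [("k", (ByCAS, 1)), ("k", (ByName, 2)), ("k", (BySynonym, 3))]` keeps `(ByName, 2)` under "k", regardless of where the name match appears in the list.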

lookupCFForFlow hand-rolled its own UUID/name lookup and missed the CAS bridge, so findUncharacterized flagged flows the score path now characterizes. Delegate to lookupCascadeCF (the flow as a singleton DB) so the two can't drift again; drop the duplicated normalizeMediumTop in favour of normalizeMedium.

Keying the CAS bridge by (CAS, medium, subcompartment) zeroed whole resource categories: minerals are reachable only through the bridge, and mineral/water resource flows routinely disagree with the method CFs on subcompartment after normalization, so requiring agreement left every such flow uncharacterized (minerals → 0 everywhere) and inverted Water use (net negative, ~19k sign-flips against the oracle). Revert the bridge to its (CAS, medium) keying. The niche-subcomp leak the keying guarded against is rare; the cross-subcompartment resource match it broke is the common case. The mtUuidCF regionalized-row guard and the preferBetter strategy ordering (UUID > name > synonym > CAS) are correct independently of the keying and stay.
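
The difference between the two keyings can be shown on a toy bridge. This is a sketch with hypothetical names (`lookupStrict`, `lookupLoose`), a placeholder CAS number and made-up subcompartment strings; it is not the engine's actual table layout.

```haskell
import qualified Data.Map.Strict as M
import Control.Applicative ((<|>))

type Cas     = String
type Medium  = String
type Subcomp = String

-- Reverted design: probe the flow's own subcompartment, then the "" wildcard.
lookupStrict :: M.Map (Cas, Medium, Subcomp) Double
             -> (Cas, Medium, Subcomp) -> Maybe Double
lookupStrict bridge (c, m, s) =
  M.lookup (c, m, s) bridge <|> M.lookup (c, m, "") bridge

-- Restored design: the subcompartment plays no part in the key.
lookupLoose :: M.Map (Cas, Medium) Double
            -> (Cas, Medium, Subcomp) -> Maybe Double
lookupLoose bridge (c, m, _) = M.lookup (c, m) bridge

-- Placeholder CAS and factor: a resource CF filed under one subcompartment.
strictBridge :: M.Map (Cas, Medium, Subcomp) Double
strictBridge = M.fromList [(("0000-00-0", "natural resource", "in ground"), 1.5)]

looseBridge :: M.Map (Cas, Medium) Double
looseBridge = M.fromList [(("0000-00-0", "natural resource"), 1.5)]

-- A flow whose subcompartment disagrees with the CF after normalization.
sampleFlow :: (Cas, Medium, Subcomp)
sampleFlow = ("0000-00-0", "natural resource", "unspecified")
```

Here `lookupStrict strictBridge sampleFlow` finds nothing, which is how whole resource categories dropped to zero, while `lookupLoose looseBridge sampleFlow` still returns the factor.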

EF v3.1's "Carbon dioxide / carbon monoxide / methane (land use change)" CFs carry the same CAS as their fossil counterparts (124-38-9, 630-08-0, 74-82-8). With no name or synonym match they fell to the CAS bridge and broadcast the land-use factor onto every same-CAS air flow, so fossil CO2 emissions were counted as land-use-change CO2 — the land-use category over-counted ~1490x and contaminated the climate-change total. Add synonyms pairing each with ecoinvent's distinct land-use carbon flow ("…, from soil or biomass stock"). They now match by synonym, which keeps them out of the CAS bridge and characterizes only the genuine land-use flow. Validated against ecoinvent's own published EF v3.1 scores: the land-use category goes from 1490x to ratio ~1.01 (99.7% within 20%) and the climate-change total's sign disagreements drop from ~2100 to ~70, with no other category affected.

… v3.1 model

ecoinvent splits freshwater into surface water / ground- / ground-, long-term and marine into ocean; EF v3.1 (JRC ILCD) characterizes a coarser fresh water / unspecified / unspecified (long-term) / sea water set. Unaligned, ecoinvent's freshwater emissions never reach the EF freshwater factors: freshwater eutrophication scored ~0 and freshwater ecotoxicity undercounted ~6x. Map surface water and ground- onto the freshwater factor and ocean/sea water onto the marine factor. Long-term groundwater maps to EF's unspecified (long-term) bucket rather than to fresh water on purpose: EF zeroes long-term water toxicity but keeps long-term eutrophication at full weight, so routing long-term through fresh water would multiply geological-time emission masses by the short-term toxicity factor (10-30x overcount in the toxicity categories).

Validated against ecoinvent's own published EF v3.1 scores over 25,412 activities: eutrophication freshwater 0.0001x -> 1.00x and marine 0.90x -> 1.00x; freshwater ecotoxicity 0.15x -> 1.00x; human toxicity non-cancer inorganics 0.89x -> 0.98x. No category regresses.
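
The water subcompartment alignment described in this commit amounts to a small lookup. The sketch below uses the subcompartment strings from the commit message; the function name `toEfWaterSubcomp` is hypothetical, and the real code may normalize names differently before matching.

```haskell
-- Map an ecoinvent water subcompartment onto the EF v3.1 bucket whose
-- factor should characterize it. Unknown names pass through unchanged.
toEfWaterSubcomp :: String -> String
toEfWaterSubcomp s = case s of
  "surface water"      -> "fresh water"
  "ground-"            -> "fresh water"
  "ocean"              -> "sea water"
  -- Deliberately not "fresh water": EF zeroes long-term water toxicity,
  -- so long-term emissions must not meet the short-term toxicity factor.
  "ground-, long-term" -> "unspecified (long-term)"
  other                -> other
```

Keeping the long-term case as its own branch makes the intent visible at the point where a later edit might otherwise fold it into the freshwater bucket.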

ccomb added 6 commits June 10, 2026 14:55

ccomb mentioned this pull request Jun 10, 2026

feat(lcia): characterize energy-denominated CFs via per-flow energy density #137

Open

ccomb wants to merge 6 commits into main from fix/cas-coverage-scoring
