Match same-CAS flows correctly when scoring against EF v3.1 #136

Open
ccomb wants to merge 6 commits into main from fix/cas-coverage-scoring

ccomb (Owner) commented Jun 10, 2026

Why

Scoring native ecoinvent 3.11 against EF v3.1 loaded from the JRC ILCD package diverged sharply from ecoinvent's own published EF v3.1 scores (embedded in the parallel _lcia package). The cause was in how buildMethodTables routes characterization factors into the engine's lookup tables.

Two distinct defects, same root — a CF landing in the wrong table:

  1. Regionalized rows polluted the broadcast name tables. Per-location CF rows were keyed by flow name alongside the global row, so one arbitrary location's value won the key. A water-abundant region's 0 erased the global water credit; a high-scarcity region's factor inflated the global charge. Net water consumers could score the wrong sign.

  2. The CAS bridge broadcast indiscriminately. Letting same-CAS flows share a CF is correct where the name carries no discrimination — every Water, * resource shares CAS 7732-18-5, so AWARE's "river water" must reach ecoinvent's "Water, river". But it also leaked a name-distinguished CF onto a sibling the method deliberately separates: "methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too, inflating biogenic climate ~80× and distorting the toxicity categories.
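The first defect can be reproduced in a few lines. This is a minimal sketch, not the project's code: `CFRow`, its fields, and the three table builders are hypothetical names, the location code and factor values are made up, and `containers` is assumed to be available. It relies only on the documented behaviour that `M.fromList` keeps the last value for a duplicate key.

```haskell
import qualified Data.Map.Strict as M
import Data.Maybe (isNothing)

-- Hypothetical, simplified CF row. Nothing marks the global row,
-- Just loc marks a regionalized (per-location) row.
data CFRow = CFRow
  { rowFlowName :: String
  , rowLocation :: Maybe String
  , rowValue    :: Double
  } deriving (Show, Eq)

-- Illustrative rows only; the values and the location code are made up.
rows :: [CFRow]
rows =
  [ CFRow "river water" Nothing     42.0  -- global factor
  , CFRow "river water" (Just "XX") 0.0   -- water-abundant region
  ]

-- Naive table: every row is keyed by name, so the regional 0 overwrites
-- the global factor (M.fromList keeps the last value for a duplicate key).
naiveNameTable :: [CFRow] -> M.Map String Double
naiveNameTable rs = M.fromList [ (rowFlowName r, rowValue r) | r <- rs ]

-- Guarded table: only global rows may enter the broadcast name table.
globalNameTable :: [CFRow] -> M.Map String Double
globalNameTable rs =
  M.fromList [ (rowFlowName r, rowValue r) | r <- rs, isNothing (rowLocation r) ]

-- Regionalized rows are not dropped; they go to a table keyed by location.
regionalTable :: [CFRow] -> M.Map (String, String) Double
regionalTable rs =
  M.fromList [ ((rowFlowName r, loc), rowValue r) | r <- rs, Just loc <- [rowLocation r] ]
```

With these rows, `M.lookup "river water" (naiveNameTable rows)` yields `Just 0.0`, while `globalNameTable rows` keeps `Just 42.0` and the regional value stays reachable through `regionalTable` under `("river water", "XX")`.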

What changed

CAS and name aren't competing fallbacks — CAS identifies the molecule, the name identifies the variant. So:

  • Name/synonym now resolves before the generic CAS bridge ([uuid, name, synonym, cas]).
  • Only CAS-matched CFs populate the CAS bridge table.

A CF whose name/synonym pins a specific flow ("methane (biogenic)" → existing synonym → "Methane, non-fossil") never broadcasts across same-CAS variants, so "Methane, fossil" no longer picks up the biogenic factor. A CF the name can't place ("river water") still bridges every same-CAS flow. Regionalized rows are kept out of the broadcast tables entirely.
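The resolution order can be sketched as a chain of `Maybe` lookups. This is an illustration under simplifying assumptions, not the actual `buildMethodTables` output: `Flow`, `Tables`, `lookupCF` and `exampleTables` are hypothetical names, the factor values are placeholders, and the synonym table is assumed to be keyed by the already-resolved target flow name.

```haskell
import qualified Data.Map.Strict as M
import Control.Applicative ((<|>))

-- Hypothetical, simplified flow record.
data Flow = Flow
  { flowUuid :: String
  , flowName :: String
  , flowCas  :: Maybe String
  }

-- Hypothetical lookup tables; the real method tables are richer than this.
data Tables = Tables
  { byUuid    :: M.Map String Double
  , byName    :: M.Map String Double
  , bySynonym :: M.Map String Double  -- keyed by the flow name the synonym resolves to
  , byCas     :: M.Map String Double  -- filled only from CFs matched by CAS alone
  }

-- Cascade in the order [uuid, name, synonym, cas]: the first hit wins.
lookupCF :: Tables -> Flow -> Maybe Double
lookupCF t f =
      M.lookup (flowUuid f) (byUuid t)
  <|> M.lookup (flowName f) (byName t)
  <|> M.lookup (flowName f) (bySynonym t)
  <|> (flowCas f >>= \c -> M.lookup c (byCas t))

-- Placeholder factors: the biogenic methane CF was placed by synonym, so it
-- never enters byCas; the "river water" CF could only be placed by CAS.
exampleTables :: Tables
exampleTables = Tables
  { byUuid    = M.empty
  , byName    = M.empty
  , bySynonym = M.fromList [("Methane, non-fossil", 1.0)]
  , byCas     = M.fromList [("7732-18-5", 2.0)]
  }
```

Against `exampleTables`, a flow named "Methane, non-fossil" resolves through the synonym table, a flow named "Methane, fossil" with CAS 74-82-8 finds nothing, and "Water, river" with CAS 7732-18-5 still resolves through the CAS bridge.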

Validation

Against ecoinvent's published EF v3.1 across all 25,412 activities:

| Metric | Before | After |
|---|---|---|
| Climate change-Biogenic (median ratio) | 81× | 1.00 |
| Human toxicity, cancer | failing | ~1.00 |
| Water-use sign inversions | ~23,000 | ~1,600 |

No regression in the climate/ozone/eutrophication/metals categories that already matched. A hermetic SharedCASCoverageSpec reproduces the water-sign and fossil-vs-biogenic cases; full suite green (1357).

Follow-ups (same PR)

Two residuals are data-coverage gaps, not engine bugs, to be closed with data/ commits on this branch: the land-use-change carbon CF needs a synonym to ecoinvent's land-use carbon flow (else it falls to CAS and leaks onto fossil CO₂), and freshwater-eutrophication phosphate needs broader subcompartment coverage.

ccomb added 6 commits June 10, 2026 14:55
…ctly

Scoring native ecoinvent against EF v3.1 (loaded from the JRC ILCD
package) diverged from ecoinvent's own published EF v3.1 scores because
buildMethodTables routed characterization factors into the wrong lookup
tables in two ways:

- Regionalized (per-location) CF rows leaked into the broadcast name
  tables. One location's value then won the shared name key — a
  water-abundant region's 0 erased the global water credit, a
  high-scarcity region's factor inflated the global charge — so a net
  water consumer could even score the wrong sign.

- The CAS bridge that lets many same-CAS flows share one CF (every
  "Water, *" resource shares 7732-18-5, so AWARE's "river water" must
  reach ecoinvent's "Water, river") broadcast indiscriminately. It
  leaked a name-distinguished CF onto a sibling the method separates:
  "methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too,
  inflating biogenic climate ~80x and the toxicity categories.

CAS and name are not competing fallbacks — CAS fixes the molecule, the
name fixes the variant. So name/synonym now resolves before the generic
CAS bridge, and only CAS-matched CFs populate that bridge. A CF whose
name or synonym pins a specific flow ("methane (biogenic)" -> via the
existing synonym -> "Methane, non-fossil") never broadcasts across
same-CAS variants; a CF the name cannot place ("river water") still
bridges every same-CAS flow. Regionalized rows stay out of the broadcast
tables entirely.

Validated against ecoinvent's published EF v3.1 across all 25,412
activities: biogenic climate 81x -> 1.00, the human-toxicity cancer
family from failing to ~1.00, water-use sign inversions cut from 23k to
~1.6k, no regression in the climate/ozone/eutrophication/metals
categories that already matched.

Follow-ups to the same-CAS routing fix, same defect class:

- mtUuidCF admitted regionalized rows, so one arbitrary location's value
  could stand for a UUID-matched flow everywhere (M.fromList keeps the
  last row). Filter them out like the name tables; the rows still reach
  mtRegionalizedCF.

- The CAS bridge ignored subcompartments: a CF pinned to a niche subcomp
  broadcast onto every same-CAS flow in the medium — the exact leak
  cfSubcompMatchesFlow guards against elsewhere. The bridge tables now
  carry the CF's subcomp (wildcards at "") and the read paths probe the
  flow's own subcomp before the wildcard slot. The regionalized fallback
  also unions direct rows with CAS-bridged locations instead of taking
  the first non-empty map, so a flow with a few direct rows still picks
  up bridged locations beyond them.

- preferBetter still ranked ByCAS above ByName on name-key collisions,
  contradicting the cascade reorder; aligned to UUID > name > synonym > CAS.

lookupCFForFlow hand-rolled its own UUID/name lookup and missed the CAS
bridge, so findUncharacterized flagged flows the score path now
characterizes. Delegate to lookupCascadeCF (the flow as a singleton DB)
so the two can't drift again; drop the duplicated normalizeMediumTop in
favour of normalizeMedium.

Keying the CAS bridge by (CAS, medium, subcompartment) zeroed whole
resource categories: minerals are reachable only through the bridge, and
mineral/water resource flows routinely disagree with the method CFs on
subcompartment after normalization, so requiring agreement left every
such flow uncharacterized (minerals → 0 everywhere) and inverted Water
use (net negative, ~19k sign-flips against the oracle).
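The miss described above can be illustrated with two toy bridge tables. This is a sketch under stated assumptions: the type aliases, `fineBridge`, `coarseBridge` and both lookup functions are hypothetical, the subcompartment labels "in water" and "unspecified" and the factor value are made up, and only the water CAS number comes from the text. The fine lookup probes the flow's own subcompartment and then the "" wildcard slot.

```haskell
import qualified Data.Map.Strict as M
import Control.Applicative ((<|>))

type Cas     = String
type Medium  = String
type Subcomp = String

-- One CF pinned to a specific subcompartment (illustrative value).
fineBridge :: M.Map (Cas, Medium, Subcomp) Double
fineBridge = M.fromList [ (("7732-18-5", "resource", "in water"), 1.0) ]

-- The same CF in a bridge keyed by (CAS, medium) only.
coarseBridge :: M.Map (Cas, Medium) Double
coarseBridge = M.fromList [ (("7732-18-5", "resource"), 1.0) ]

-- Probe the flow's own subcompartment first, then the "" wildcard slot.
fineLookup :: Cas -> Medium -> Subcomp -> Maybe Double
fineLookup c m s =
  M.lookup (c, m, s) fineBridge <|> M.lookup (c, m, "") fineBridge

-- The flow's subcompartment is ignored on purpose.
coarseLookup :: Cas -> Medium -> Subcomp -> Maybe Double
coarseLookup c m _ = M.lookup (c, m) coarseBridge
```

A resource flow whose subcompartment normalizes to "unspecified" misses both probes of `fineLookup` and stays uncharacterized, whereas `coarseLookup` still finds the factor.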

Revert the bridge to its (CAS, medium) keying. The niche-subcomp leak
the keying guarded against is rare; the cross-subcompartment resource
match it broke is the common case. The mtUuidCF regionalized-row guard
and the preferBetter strategy ordering (UUID > name > synonym > CAS) are
correct independently of the keying and stay.

EF v3.1's "Carbon dioxide / carbon monoxide / methane (land use change)"
CFs carry the same CAS as their fossil counterparts (124-38-9, 630-08-0,
74-82-8). With no name or synonym match they fell to the CAS bridge and
broadcast the land-use factor onto every same-CAS air flow, so fossil CO2
emissions were counted as land-use-change CO2 — the land-use category
over-counted ~1490x and contaminated the climate-change total.

Add synonyms pairing each with ecoinvent's distinct land-use carbon flow
("…, from soil or biomass stock"). They now match by synonym, which keeps
them out of the CAS bridge and characterizes only the genuine land-use
flow. Validated against ecoinvent's own published EF v3.1 scores: the
land-use category goes from 1490x to ratio ~1.01 (99.7% within 20%) and
the climate-change total's sign disagreements drop from ~2100 to ~70,
with no other category affected.

… v3.1 model

ecoinvent splits freshwater into "surface water", "ground-" and
"ground-, long-term", and marine into "ocean"; EF v3.1 (JRC ILCD)
characterizes a coarser set: "fresh water", "unspecified",
"unspecified (long-term)" and "sea water". Unaligned, ecoinvent's freshwater
emissions never reach the EF freshwater factors: freshwater eutrophication
scored ~0 and freshwater ecotoxicity undercounted ~6x.

Map surface water and ground- onto the freshwater factor and ocean/sea water
onto the marine factor. Long-term groundwater maps to EF's unspecified
(long-term) bucket rather than to fresh water on purpose: EF zeroes long-term
water toxicity but keeps long-term eutrophication at full weight, so routing
long-term through fresh water would multiply geological-time emission masses by
the short-term toxicity factor (10-30x overcount in the toxicity categories).
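
The alignment described above amounts to a small total mapping. A minimal sketch, assuming the subcompartment labels appear exactly as quoted in this message; the function name `alignWaterSubcomp` is hypothetical and the real data files may encode the mapping differently.

```haskell
-- Map an ecoinvent water subcompartment onto the EF v3.1 bucket whose
-- factor should characterize it. Nothing means "no alignment defined here".
alignWaterSubcomp :: String -> Maybe String
alignWaterSubcomp s = case s of
  "surface water"      -> Just "fresh water"
  "ground-"            -> Just "fresh water"
  -- Deliberately not "fresh water": EF zeroes long-term water toxicity,
  -- so long-term emissions must not pick up the short-term factor.
  "ground-, long-term" -> Just "unspecified (long-term)"
  "ocean"              -> Just "sea water"
  _                    -> Nothing
```

Returning `Maybe` keeps unmapped subcompartments visible to the caller instead of silently folding them into a default bucket.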

Validated against ecoinvent's own published EF v3.1 scores over 25,412
activities: eutrophication freshwater 0.0001x -> 1.00x and marine 0.90x ->
1.00x; freshwater ecotoxicity 0.15x -> 1.00x; human toxicity non-cancer
inorganics 0.89x -> 0.98x. No category regresses.