Match same-CAS flows correctly when scoring against EF v3.1 #136
Open
ccomb wants to merge 6 commits into
…ctly
Scoring native ecoinvent against EF v3.1 (loaded from the JRC ILCD
package) diverged from ecoinvent's own published EF v3.1 scores because
buildMethodTables routed characterization factors into the wrong lookup
tables in two ways:
- Regionalized (per-location) CF rows leaked into the broadcast name
tables. One location's value then won the shared name key — a
water-abundant region's 0 erased the global water credit, a
high-scarcity region's factor inflated the global charge — so a net
water consumer could even score the wrong sign.
- The CAS bridge that lets many same-CAS flows share one CF (every
"Water, *" resource shares 7732-18-5, so AWARE's "river water" must
reach ecoinvent's "Water, river") broadcast indiscriminately. It
leaked a name-distinguished CF onto a sibling the method separates:
"methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too,
inflating biogenic climate ~80x and the toxicity categories.
CAS and name are not competing fallbacks — CAS fixes the molecule, the
name fixes the variant. So name/synonym now resolves before the generic
CAS bridge, and only CAS-matched CFs populate that bridge. A CF whose
name or synonym pins a specific flow ("methane (biogenic)" -> via the
existing synonym -> "Methane, non-fossil") never broadcasts across
same-CAS variants; a CF the name cannot place ("river water") still
bridges every same-CAS flow. Regionalized rows stay out of the broadcast
tables entirely.
Validated against ecoinvent's published EF v3.1 across all 25,412
activities: biogenic climate 81x -> 1.00, the human-toxicity cancer
family from failing to ~1.00, water-use sign inversions cut from 23k to
~1.6k, no regression in the climate/ozone/eutrophication/metals
categories that already matched.
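The routing rule above can be illustrated with a minimal sketch. It is not the engine's buildMethodTables: the types Flow, CF and Tables, the functions buildTables and lookupCF, and the CF values 1.0 and 42.0 are all hypothetical, and the UUID step and regionalized rows are left out. It only shows the idea that a CF pinned by name or synonym goes into the name table and stays out of the CAS bridge, while a CF the name cannot place is the only kind that feeds the bridge.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Map.Strict as M

-- Hypothetical, simplified shapes; the real engine's types are richer.
data Flow = Flow { flowName :: String, flowCAS :: String }

data CF = CF { cfName :: String, cfCAS :: String, cfValue :: Double }

data Tables = Tables
  { byName :: M.Map String Double  -- flow name -> factor
  , byCAS  :: M.Map String Double  -- CAS number -> factor (the bridge)
  }

-- Route each CF: a name or synonym hit pins it to one flow; only CFs
-- that no name can place are allowed to populate the CAS bridge.
buildTables :: M.Map String String -> [String] -> [CF] -> Tables
buildTables synonyms flowNames = foldr place (Tables M.empty M.empty)
  where
    place cf t = case pinned cf of
      Just n  -> t { byName = M.insert n (cfValue cf) (byName t) }
      Nothing -> t { byCAS = M.insert (cfCAS cf) (cfValue cf) (byCAS t) }
    pinned cf
      | cfName cf `elem` flowNames = Just (cfName cf)
      | otherwise = case M.lookup (cfName cf) synonyms of
          Just n | n `elem` flowNames -> Just n
          _                           -> Nothing

-- Read path: the name table is probed before the CAS bridge.
lookupCF :: Tables -> Flow -> Maybe Double
lookupCF t f =
  M.lookup (flowName f) (byName t) <|> M.lookup (flowCAS f) (byCAS t)

-- Illustrative data only; the factor values are made up.
demoTables :: Tables
demoTables = buildTables
  (M.fromList [("methane (biogenic)", "Methane, non-fossil")])
  ["Methane, non-fossil", "Methane, fossil", "Water, river"]
  [ CF "methane (biogenic)" "74-82-8" 1.0
  , CF "river water" "7732-18-5" 42.0
  ]
```

With this data, looking up "Methane, fossil" finds nothing from the biogenic CF, "Methane, non-fossil" resolves through the synonym, and "Water, river" is still reached through its CAS number.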
Follow-ups to the same-CAS routing fix, same defect class:
- mtUuidCF admitted regionalized rows, so one arbitrary location's value could stand for a UUID-matched flow everywhere (M.fromList keeps the last row). Filter them out like the name tables; the rows still reach mtRegionalizedCF.
- The CAS bridge ignored subcompartments: a CF pinned to a niche subcomp broadcast onto every same-CAS flow in the medium, the exact leak cfSubcompMatchesFlow guards against elsewhere. The bridge tables now carry the CF's subcomp (wildcards at "") and the read paths probe the flow's own subcomp before the wildcard slot. The regionalized fallback also unions direct rows with CAS-bridged locations instead of taking the first non-empty map, so a flow with a few direct rows still picks up bridged locations beyond them.
- preferBetter still ranked ByCAS above ByName on name-key collisions, contradicting the cascade reorder; aligned to UUID > name > synonym > CAS.
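A small sketch of why regionalized rows must be filtered before the table is built, and of the union-style fallback. The CFRow type, the function names, the location code "XX" and the values are hypothetical; only the behaviour of M.fromList (the last value for a duplicate key is kept) and M.union (left-biased) comes from the containers library. Letting direct rows win on overlapping locations is an assumption here, since the description does not state the precedence.

```haskell
import qualified Data.Map.Strict as M
import Data.Maybe (isNothing)

-- Hypothetical row shape: Nothing marks the global row,
-- Just loc marks a regionalized (per-location) row.
data CFRow = CFRow
  { rowFlow     :: String
  , rowLocation :: Maybe String
  , rowValue    :: Double
  }

-- Defect: every row shares the flow key, so whichever row comes
-- last in the list silently replaces the global value.
naiveTable :: [CFRow] -> M.Map String Double
naiveTable rows = M.fromList [ (rowFlow r, rowValue r) | r <- rows ]

-- Fix: only global rows may enter the broadcast table.
globalTable :: [CFRow] -> M.Map String Double
globalTable rows =
  M.fromList [ (rowFlow r, rowValue r) | r <- rows, isNothing (rowLocation r) ]

-- Regionalized rows keep their own table, keyed by (flow, location).
regionalTable :: [CFRow] -> M.Map (String, String) Double
regionalTable rows =
  M.fromList [ ((rowFlow r, loc), rowValue r) | r <- rows, Just loc <- [rowLocation r] ]

-- Fallback as a union rather than "first non-empty map": locations that
-- exist only in the bridged map are still picked up. Direct rows win on
-- overlap because M.union is left-biased (an assumption of this sketch).
mergeRegional :: M.Map String Double -> M.Map String Double -> M.Map String Double
mergeRegional direct bridged = M.union direct bridged

-- Illustrative data only.
demoRows :: [CFRow]
demoRows =
  [ CFRow "Water, river" Nothing 1.0
  , CFRow "Water, river" (Just "XX") 0.0
  ]
```

Here naiveTable demoRows maps "Water, river" to the regional 0.0, while globalTable demoRows keeps the global 1.0 and the regional value remains reachable through regionalTable.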
lookupCFForFlow hand-rolled its own UUID/name lookup and missed the CAS bridge, so findUncharacterized flagged flows the score path now characterizes. Delegate to lookupCascadeCF (the flow as a singleton DB) so the two can't drift again; drop the duplicated normalizeMediumTop in favour of normalizeMedium.
Keying the CAS bridge by (CAS, medium, subcompartment) zeroed whole resource categories: minerals are reachable only through the bridge, and mineral/water resource flows routinely disagree with the method CFs on subcompartment after normalization, so requiring agreement left every such flow uncharacterized (minerals → 0 everywhere) and inverted Water use (net negative, ~19k sign-flips against the oracle). Revert the bridge to its (CAS, medium) keying. The niche-subcomp leak the keying guarded against is rare; the cross-subcompartment resource match it broke is the common case. The mtUuidCF regionalized-row guard and the preferBetter strategy ordering (UUID > name > synonym > CAS) are correct independently of the keying and stay.
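The effect of the two keyings can be sketched as follows. This is an illustration under assumptions, not the engine's bridge tables: BridgeCF, strictBridge and looseBridge are hypothetical names, the subcompartment labels and the factor value are made up, and 7440-50-8 (copper) merely stands in for a mineral resource.

```haskell
import qualified Data.Map.Strict as M

type CAS     = String
type Medium  = String
type Subcomp = String

-- Hypothetical, simplified CF record for the bridge.
data BridgeCF = BridgeCF
  { bCAS     :: CAS
  , bMedium  :: Medium
  , bSubcomp :: Subcomp
  , bValue   :: Double
  }

-- Three-part key: a flow must agree with the CF on subcompartment.
strictBridge :: [BridgeCF] -> M.Map (CAS, Medium, Subcomp) Double
strictBridge cfs =
  M.fromList [ ((bCAS c, bMedium c, bSubcomp c), bValue c) | c <- cfs ]

-- Two-part key (the reverted form): subcompartment is ignored.
looseBridge :: [BridgeCF] -> M.Map (CAS, Medium) Double
looseBridge cfs =
  M.fromList [ ((bCAS c, bMedium c), bValue c) | c <- cfs ]

-- Illustrative data: the method and the flow disagree on subcompartment.
demoCFs :: [BridgeCF]
demoCFs = [ BridgeCF "7440-50-8" "natural resource" "non-renewable" 1.0 ]

demoFlowKey :: (CAS, Medium, Subcomp)
demoFlowKey = ("7440-50-8", "natural resource", "in ground")
```

With the three-part key the demo flow finds no factor at all, which is the "minerals → 0 everywhere" failure described above; with the two-part key the same flow is characterized.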
EF v3.1's "Carbon dioxide / carbon monoxide / methane (land use change)"
CFs carry the same CAS as their fossil counterparts (124-38-9, 630-08-0,
74-82-8). With no name or synonym match they fell to the CAS bridge and
broadcast the land-use factor onto every same-CAS air flow, so fossil CO2
emissions were counted as land-use-change CO2 — the land-use category
over-counted ~1490x and contaminated the climate-change total.
Add synonyms pairing each with ecoinvent's distinct land-use carbon flow
("…, from soil or biomass stock"). They now match by synonym, which keeps
them out of the CAS bridge and characterizes only the genuine land-use
flow. Validated against ecoinvent's own published EF v3.1 scores: the
land-use category goes from 1490x to ratio ~1.01 (99.7% within 20%) and
the climate-change total's sign disagreements drop from ~2100 to ~70,
with no other category affected.
… v3.1 model
ecoinvent splits freshwater into surface water / ground- / ground-, long-term and marine into ocean; EF v3.1 (JRC ILCD) characterizes a coarser fresh water / unspecified / unspecified (long-term) / sea water set. Unaligned, ecoinvent's freshwater emissions never reach the EF freshwater factors: freshwater eutrophication scored ~0 and freshwater ecotoxicity undercounted ~6x.
Map surface water and ground- onto the freshwater factor and ocean/sea water onto the marine factor. Long-term groundwater maps to EF's unspecified (long-term) bucket rather than to fresh water on purpose: EF zeroes long-term water toxicity but keeps long-term eutrophication at full weight, so routing long-term through fresh water would multiply geological-time emission masses by the short-term toxicity factor (10-30x overcount in the toxicity categories).
Validated against ecoinvent's own published EF v3.1 scores over 25,412 activities: eutrophication freshwater 0.0001x -> 1.00x and marine 0.90x -> 1.00x; freshwater ecotoxicity 0.15x -> 1.00x; human toxicity non-cancer inorganics 0.89x -> 0.98x. No category regresses.
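The alignment described in this commit amounts to a small lookup. The sketch below assumes the subcompartment labels are exactly the strings quoted in the message; alignWaterSubcomp is a hypothetical name, and how the real mapping is stored (code or data file) is not stated here.

```haskell
-- Map an ecoinvent water subcompartment onto the EF v3.1 one.
-- Returns Nothing for labels this sketch does not cover.
alignWaterSubcomp :: String -> Maybe String
alignWaterSubcomp sub = case sub of
  "surface water"      -> Just "fresh water"
  "ground-"            -> Just "fresh water"
  -- Deliberately not "fresh water": EF zeroes long-term water toxicity,
  -- so long-term emissions must not pick up the short-term factor.
  "ground-, long-term" -> Just "unspecified (long-term)"
  "ocean"              -> Just "sea water"
  _                    -> Nothing
```

For example, alignWaterSubcomp "ground-" yields the freshwater bucket, while alignWaterSubcomp "ground-, long-term" lands in EF's long-term bucket.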
Why
Scoring native ecoinvent 3.11 against EF v3.1 loaded from the JRC ILCD package diverged sharply from ecoinvent's own published EF v3.1 scores (embedded in the parallel _lcia package). The cause was in how buildMethodTables routes characterization factors into the engine's lookup tables.
Two distinct defects, same root: a CF landing in the wrong table.
- Regionalized rows polluted the broadcast name tables. Per-location CF rows were keyed by flow name alongside the global row, so one arbitrary location's value won the key. A water-abundant region's 0 erased the global water credit; a high-scarcity region's factor inflated the global charge. Net water consumers could score the wrong sign.
- The CAS bridge broadcast indiscriminately. Letting same-CAS flows share a CF is correct where the name carries no discrimination: every Water, * resource shares CAS 7732-18-5, so AWARE's "river water" must reach ecoinvent's "Water, river". But it also leaked a name-distinguished CF onto a sibling the method deliberately separates: "methane (biogenic)" (CAS 74-82-8) landed on "Methane, fossil" too, inflating biogenic climate ~80× and distorting the toxicity categories.
What changed
CAS and name aren't competing fallbacks: CAS identifies the molecule, the name identifies the variant. So:
- Name/synonym now resolves before the generic CAS bridge (cascade order [uuid, name, synonym, cas]).
- Only CAS-matched CFs populate that bridge.
A CF whose name/synonym pins a specific flow ("methane (biogenic)" → existing synonym → "Methane, non-fossil") never broadcasts across same-CAS variants, so fossil methane stays uncharacterized. A CF the name can't place ("river water") still bridges every same-CAS flow. Regionalized rows are kept out of the broadcast tables entirely.
Validation
Against ecoinvent's published EF v3.1 across all 25,412 activities:
- biogenic climate: 81x -> 1.00
- human-toxicity cancer family: failing -> ~1.00
- water-use sign inversions: 23k -> ~1.6k
No regression in the climate/ozone/eutrophication/metals categories that already matched. A hermetic SharedCASCoverageSpec reproduces the water-sign and fossil-vs-biogenic cases; full suite green (1357).
Follow-ups (same PR)
Two residuals are data-coverage gaps, not engine bugs, to be closed with data/ commits on this branch: the land-use-change carbon CF needs a synonym to ecoinvent's land-use carbon flow (else it falls to CAS and leaks onto fossil CO₂), and freshwater-eutrophication phosphate needs broader subcompartment coverage.