Problem
The current two-stage entity mapping pipeline (positions, birthplaces, citizenships) searches Meilisearch for 100 candidates and dumps them all into the LLM context for selection. This has issues:
- High noise: Semantic ratio was lowered from 0.8 to 0.5 because of irrelevant results, but this also reduces cross-lingual recall
- Single-shot search: The query can't adapt when results are poor — no way to disambiguate (e.g. narrowing "Minister of Defence" with a country) or reformulate when the correct entity has a slightly different name
- No retry path: If the correct entity isn't in the top 100, the mapping silently fails (returns None)
- Expensive: 100 candidate descriptions (5000-8000 tokens) are fed into every mapping call
Proposal
Replace the "search then select" pattern with an agentic approach: give the mapping LLM a search tool and let it formulate its own queries.
Search tool interface:
- query (string): The search text, formulated by the LLM
- semantic_ratio (float): Balance between keyword and semantic search (0 = keyword only, 1 = semantic only)
- Returns ~25 candidates per call
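A minimal sketch of how this interface could be declared and translated into a search request. The tool name `search_entities`, the helper `build_search_params`, and the embedder name `"default"` are assumptions for illustration, not existing code; the `hybrid` / `semanticRatio` fields follow Meilisearch's hybrid search parameters.

```python
from typing import Any

RESULTS_PER_CALL = 25  # "~25 candidates per call" from the proposal

# Hypothetical tool definition in JSON Schema style; the name is illustrative.
SEARCH_TOOL = {
    "name": "search_entities",
    "description": "Search the entity index. Call again with a refined query if needed.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search text"},
            "semantic_ratio": {
                "type": "number",
                "minimum": 0.0,
                "maximum": 1.0,
                "description": "0 = keyword only, 1 = semantic only",
            },
        },
        "required": ["query", "semantic_ratio"],
    },
}


def build_search_params(args: dict[str, Any], embedder: str = "default") -> dict[str, Any]:
    """Validate LLM-supplied arguments and build a Meilisearch-style search body."""
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if "semantic_ratio" not in args:
        raise ValueError("semantic_ratio is required")
    # Clamp rather than reject, so a slightly out-of-range value does not waste a round.
    ratio = min(1.0, max(0.0, float(args["semantic_ratio"])))
    return {
        "q": query.strip(),
        "limit": RESULTS_PER_CALL,
        "hybrid": {"semanticRatio": ratio, "embedder": embedder},
    }
```

Clamping the ratio instead of raising is a design choice in this sketch; rejecting out-of-range values and returning the error to the LLM would work as well.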
What this enables:
- LLM can call the search tool multiple times with different queries — reformulating, narrowing, or broadening as needed
- LLM adds jurisdiction context (e.g. appends country name) to disambiguate
- LLM chooses semantic ratio based on the situation (high for cross-lingual, low for exact labels)
- The number of search iterations is up to the LLM based on the difficulty of the mapping
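The loop this implies can be sketched as follows, with the LLM and the search backend passed in as plain callables so the control flow is visible. All names here (`Search`, `Select`, `map_entity`, `decide`) are hypothetical, and the `max_rounds` cap is an added safety assumption: the proposal itself leaves the number of iterations to the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Search:
    query: str
    semantic_ratio: float


@dataclass
class Select:
    entity_id: Optional[str]  # None means "no suitable entity found"


Action = Union[Search, Select]
Transcript = list[tuple[Search, list[dict]]]


def map_entity(
    mention: str,
    decide: Callable[[str, Transcript], Action],   # stands in for the mapping LLM
    search: Callable[[str, float], list[dict]],    # stands in for the search tool
    max_rounds: int = 4,
) -> Optional[str]:
    """Let `decide` search repeatedly until it selects an entity or gives up."""
    transcript: Transcript = []
    for _ in range(max_rounds):
        action = decide(mention, transcript)
        if isinstance(action, Select):
            return action.entity_id
        transcript.append((action, search(action.query, action.semantic_ratio)))
    return None  # round budget exhausted without a selection


# Toy stand-ins to exercise the loop: a two-entry index and a scripted "LLM".
def fake_search(query: str, ratio: float) -> list[dict]:
    index = [
        {"id": "E1", "label": "Minister of Defence (France)"},
        {"id": "E2", "label": "Minister of Defence (Norway)"},
    ]
    words = query.lower().split()
    return [c for c in index if all(w in c["label"].lower() for w in words)]


def fake_decide(mention: str, transcript: Transcript) -> Action:
    if not transcript:
        return Search(mention, 0.3)
    results = transcript[-1][1]
    if len(results) == 1:
        return Select(results[0]["id"])
    return Search(f"{mention} France", 0.3)  # narrow with a country


print(map_entity("Minister of Defence", fake_decide, fake_search))
```

In the toy run, the first query matches both entries, the second query adds the country and matches one, and the third decision selects it.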
Cost
With cached input tokens at a 90% discount, multiple tool calls are cheap: each subsequent call pays full price only for the new search results, while the rest of the context is cached. Even with 3-4 search rounds, this is likely comparable to or cheaper than the current single call with 100 candidates.
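A rough back-of-the-envelope model of this claim, measured in full-price input-token equivalents. The 65 tokens per candidate is the midpoint of the 5000-8000 token figure for 100 candidates; the 500-token base prompt is an assumption, and output tokens and the model's own tool-call messages are ignored.

```python
TOKENS_PER_CANDIDATE = 65   # midpoint of 5000-8000 tokens / 100 candidates
BASE_PROMPT_TOKENS = 500    # assumed size of instructions + mention
CACHED_PRICE = 0.1          # cached input billed at 10% (90% discount)


def single_shot_cost(candidates: int = 100) -> float:
    """Current pipeline: one call with all candidates at full price."""
    return BASE_PROMPT_TOKENS + candidates * TOKENS_PER_CANDIDATE


def agentic_cost(rounds: int, candidates_per_round: int = 25) -> float:
    """Agentic pipeline: one initial call, then one call per search round.

    Each follow-up call pays full price for the newest result set and the
    cached price for everything that was already in the context.
    """
    new_tokens = candidates_per_round * TOKENS_PER_CANDIDATE
    total = float(BASE_PROMPT_TOKENS)  # first call: prompt only, nothing cached
    context = BASE_PROMPT_TOKENS
    for _ in range(rounds):
        total += new_tokens + CACHED_PRICE * context
        context += new_tokens
    return total


for r in (1, 2, 3, 4):
    print(r, round(agentic_cost(r)), "vs single-shot", round(single_shot_cost()))
```

Under these assumptions, three rounds come to roughly 6,000 token equivalents against 7,000 for the single call, and four rounds to roughly 8,200, which is in line with "comparable to or cheaper". The result is sensitive to the assumed prompt size and to how much output each round generates.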
Scope
This applies to all three two-stage extraction types:
- Position mapping (POSITIONS_CONFIG)
- Birthplace mapping (BIRTHPLACES_CONFIG)
- Citizenship mapping (CITIZENSHIPS_CONFIG)