Agentic entity mapping: give the LLM a search tool instead of dumping 100 candidates #149

@monneyboi

Problem

The current two-stage entity mapping pipeline (positions, birthplaces, citizenships) searches Meilisearch for 100 candidates and dumps them all into the LLM context for selection. This has issues:

  • High noise: Semantic ratio was lowered from 0.8 to 0.5 because of irrelevant results, but this also reduces cross-lingual recall
  • Single-shot search: The query can't adapt when results are poor. There is no way to disambiguate (e.g. narrowing "Minister of Defence" with a country) or to reformulate when the correct entity has a slightly different name
  • No retry path: If the correct entity isn't in the top 100, the mapping silently fails (returns None)
  • Expensive: 100 candidate descriptions (5000-8000 tokens) are fed into every mapping call

Proposal

Replace the "search then select" pattern with an agentic approach: give the mapping LLM a search tool and let it formulate its own queries.

Search tool interface:

  • query (string): The search text, formulated by the LLM
  • semantic_ratio (float): Balance between keyword and semantic search
  • Returns ~25 candidates per call
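
A minimal sketch of what that interface could look like, assuming a JSON-schema style tool definition as used by common function-calling APIs. The tool name `search_entities`, the helper `build_search_params`, and the embedder name `default` are hypothetical; only the two parameters and the ~25 result limit come from the proposal above. The request body follows Meilisearch's hybrid search parameters (`q`, `limit`, `hybrid.semanticRatio`, `hybrid.embedder`).

```python
# Hypothetical tool definition handed to the mapping LLM.
SEARCH_TOOL = {
    "name": "search_entities",  # assumed name, not from the codebase
    "description": "Search the entity index and return up to 25 candidates.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search text, formulated by the LLM.",
            },
            "semantic_ratio": {
                "type": "number",
                "minimum": 0.0,
                "maximum": 1.0,
                "description": "0 = pure keyword search, 1 = pure semantic search.",
            },
        },
        "required": ["query", "semantic_ratio"],
    },
}

RESULTS_PER_CALL = 25  # "~25 candidates per call" from the proposal


def build_search_params(query: str, semantic_ratio: float) -> dict:
    """Validate the LLM's tool arguments and build a hybrid search request body."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must be a non-empty string")
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0.0 and 1.0")
    return {
        "q": cleaned,
        "limit": RESULTS_PER_CALL,
        # "default" is an assumed embedder name; use whatever the index defines.
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": "default"},
    }
```

Validating the arguments before they reach the search backend means a malformed tool call can be returned to the LLM as an error message instead of failing the whole mapping.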

What this enables:

  • LLM can call the search tool multiple times with different queries — reformulating, narrowing, or broadening as needed
  • LLM adds jurisdiction context (e.g. appends country name) to disambiguate
  • LLM chooses semantic ratio based on the situation (high for cross-lingual, low for exact labels)
  • The number of search iterations is up to the LLM based on the difficulty of the mapping
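
The loop this implies can be sketched roughly as follows. Everything here is illustrative: `run_mapping`, the `decide` callable standing in for the LLM, the stub search function, and the entity IDs are all hypothetical. The `max_rounds` cap is an added assumption (a safety limit), not part of the proposal, which leaves the number of iterations to the LLM.

```python
from typing import Callable, Optional


def run_mapping(
    decide: Callable[[list], dict],
    search: Callable[[str, float], list],
    max_rounds: int = 4,
) -> Optional[str]:
    """Let `decide` (the LLM stand-in) search repeatedly, then select an entity ID."""
    seen = {}      # every candidate returned so far, keyed by ID
    history = []   # search calls and their results, fed back to `decide`
    for _ in range(max_rounds):
        action = decide(history)
        if action["type"] == "select":
            # Only accept IDs that actually came back from a search.
            return action["id"] if action["id"] in seen else None
        results = search(action["query"], action["semantic_ratio"])
        for candidate in results:
            seen[candidate["id"]] = candidate
        history.append({**action, "results": results})
    return None  # budget exhausted without a selection


# --- Stubs for demonstration only -------------------------------------------
FAKE_INDEX = [
    {"id": "X1", "label": "Minister of Defence", "country": "France"},
    {"id": "X2", "label": "Minister of Defence", "country": "Norway"},
]


def fake_search(query: str, semantic_ratio: float) -> list:
    """Keyword-only stub: every query word must appear in label or country."""
    words = query.lower().split()
    return [
        e for e in FAKE_INDEX
        if all(w in f"{e['label']} {e['country']}".lower() for w in words)
    ]


def scripted_decide(history: list) -> dict:
    """Scripted policy imitating an LLM that narrows an ambiguous query."""
    if not history:
        return {"type": "search", "query": "Minister of Defence", "semantic_ratio": 0.2}
    last = history[-1]["results"]
    if len(last) == 1:
        return {"type": "select", "id": last[0]["id"]}
    # Ambiguous: add jurisdiction context and search again.
    return {"type": "search", "query": "Minister of Defence France", "semantic_ratio": 0.2}


result = run_mapping(scripted_decide, fake_search)
```

In this stub run the first query matches both entries, the second query with the country appended matches one, and the third step selects it. Rejecting IDs that never appeared in a search result is one way to guard against the LLM inventing an entity.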

Cost

With cached input tokens at a 90% discount, multiple tool calls are cheap: each subsequent call pays full price only for the new search results, while the rest of the context is billed at the cached rate. Even with 3-4 search rounds, the total is likely comparable to or cheaper than the current single call with 100 candidates.
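
A rough back-of-the-envelope model of that claim, expressed in "full-price input token equivalents". The figures are assumptions for illustration: a 1000-token base prompt, 6500 tokens for 100 candidates (the midpoint of the 5000-8000 range above), and a quarter of that (1625 tokens) for 25 candidates. Output tokens, the tokens of the tool calls themselves, and any cache-write surcharge are ignored.

```python
def single_shot_input_tokens(base: int, candidates: int) -> float:
    """Current pipeline: one call, everything billed at full price."""
    return float(base + candidates)


def agentic_input_tokens(
    rounds: int, base: int, per_round: int, cache_discount: float = 0.9
) -> float:
    """Agentic pipeline: `rounds` searches, so `rounds + 1` LLM calls.

    Call 0 pays full price for the base prompt. Call k (k >= 1) pays full
    price for the k-th batch of search results and the discounted rate for
    everything already in context.
    """
    cached_rate = 1.0 - cache_discount
    total = float(base)
    for k in range(1, rounds + 1):
        already_in_context = base + (k - 1) * per_round
        total += per_round + cached_rate * already_in_context
    return total


BASE_PROMPT = 1000      # assumed size of instructions plus extracted text
HUNDRED_CANDS = 6500    # midpoint of the 5000-8000 token range
TWENTY_FIVE_CANDS = HUNDRED_CANDS // 4

current = single_shot_input_tokens(BASE_PROMPT, HUNDRED_CANDS)
three_rounds = agentic_input_tokens(3, BASE_PROMPT, TWENTY_FIVE_CANDS)
four_rounds = agentic_input_tokens(4, BASE_PROMPT, TWENTY_FIVE_CANDS)
```

Under these assumptions, three search rounds cost about 6660 token equivalents against 7500 for the current single call, while four rounds cost about 8875. That fits "comparable to or cheaper" for typical cases, but real numbers depend on the provider's cache pricing and on how many rounds hard mappings actually need.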

Scope

This applies to all three two-stage extraction types:

  • Position mapping (POSITIONS_CONFIG)
  • Birthplace mapping (BIRTHPLACES_CONFIG)
  • Citizenship mapping (CITIZENSHIPS_CONFIG)
