fix: skip redundant name_embedding computation in create_entity_node_embeddings #1457

Open
GraphiteEdgeR wants to merge 1 commit into getzep:main from GraphiteEdgeR:fix/skip-existing-node-embeddings

Conversation

@GraphiteEdgeR

Summary

create_entity_node_embeddings() unconditionally computes embeddings for all nodes with a non-empty name, even if they already have a valid name_embedding. This PR adds a simple filter to skip nodes that already have their embedding set, making the function idempotent and avoiding redundant API calls.

Problem

During add_episode(), resolved nodes (those merged into existing graph entities) pass through create_entity_node_embeddings(), which calls embedder.create_batch() for every node with a name. Because get_entity_node_return_query() deliberately excludes name_embedding from its return fields (to reduce query payload), resolved nodes arrive with name_embedding=None and are re-embedded unnecessarily.

Pre-loading the embedding (via node.load_name_embedding() or node_load_embeddings_bulk()) does not help today: the current code never checks for an existing value, so it recomputes the embedding anyway.

Change

```diff
-    # filter out falsey values from nodes
-    filtered_nodes = [node for node in nodes if node.name]
+    # Only compute embeddings for nodes that need them (have a name but no existing embedding)
+    filtered_nodes = [node for node in nodes if node.name and node.name_embedding is None]
```

Benefits

  1. Idempotent: Safe to call multiple times without redundant API calls
  2. Enables optimization: Callers can now pre-load embeddings from DB (via load_name_embedding() / node_load_embeddings_bulk()) before calling this function, and the pre-loaded values will be respected
  3. Zero risk: Nodes without embeddings still get computed as before
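
To make the idempotency claim concrete, here is a minimal, self-contained sketch. `StubNode`, `CountingEmbedder`, and `embed_missing_names` are hypothetical stand-ins written for this illustration, not the library's classes or the actual function body; only the filter expression is taken from the diff above.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for an entity node (not the library's class)."""
    name: str
    name_embedding: Optional[List[float]] = None


class CountingEmbedder:
    """Hypothetical embedder that records how many texts it was asked to embed."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    async def create_batch(self, texts: List[str]) -> List[List[float]]:
        self.texts_embedded += len(texts)
        # Fake, deterministic "embedding": one float per text.
        return [[float(len(text))] for text in texts]


async def embed_missing_names(embedder: CountingEmbedder, nodes: List[StubNode]) -> None:
    # Same filter as the diff: the node has a name and no embedding yet.
    pending = [node for node in nodes if node.name and node.name_embedding is None]
    if not pending:
        return
    vectors = await embedder.create_batch([node.name for node in pending])
    for node, vector in zip(pending, vectors):
        node.name_embedding = vector


async def demo() -> int:
    embedder = CountingEmbedder()
    nodes = [
        StubNode("Alice"),                      # needs an embedding
        StubNode("Bob", name_embedding=[0.5]),  # already has one, skipped
        StubNode(""),                           # no name, skipped
    ]
    await embed_missing_names(embedder, nodes)
    await embed_missing_names(embedder, nodes)  # second call finds nothing pending
    return embedder.texts_embedded


texts_embedded = asyncio.run(demo())
```

In this sketch only "Alice" is ever sent to the embedder, and the second call makes no request at all.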

Suggested follow-up

For maximum benefit, a follow-up PR could add a pre-loading step in extract_attributes_from_nodes() before calling create_entity_node_embeddings():

```python
# Pre-load existing name_embeddings for resolved nodes
nodes_missing = [n for n in nodes if n.name and n.name_embedding is None]
if nodes_missing:
    await clients.driver.graph_operations_interface.node_load_embeddings_bulk(
        clients.driver, nodes_missing
    )
```

This would eliminate redundant embedding API calls for all resolved nodes (typically 2-4 per episode).
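
The combined flow (pre-load first, then embed only what is still missing) might look roughly like the sketch below. Everything here is an assumption made for illustration: `STORED_EMBEDDINGS` is an in-memory dict standing in for the database, and `preload_embeddings` and `fake_create_batch` are hypothetical helpers, not the driver's or embedder's real API.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for an entity node (not the library's class)."""
    uuid: str
    name: str
    name_embedding: Optional[List[float]] = None


# Hypothetical stand-in for embeddings already stored in the graph, keyed by uuid.
STORED_EMBEDDINGS: Dict[str, List[float]] = {"n1": [0.1, 0.2]}


async def preload_embeddings(nodes: List[StubNode]) -> None:
    """Hypothetical bulk load: copy stored vectors onto nodes that lack one."""
    for node in nodes:
        if node.name_embedding is None and node.uuid in STORED_EMBEDDINGS:
            node.name_embedding = STORED_EMBEDDINGS[node.uuid]


async def fake_create_batch(texts: List[str]) -> List[List[float]]:
    """Hypothetical embedder call returning one fake vector per text."""
    return [[float(len(text))] for text in texts]


async def demo() -> List[str]:
    # "n1" was resolved to an existing entity; "n2" is new to the graph.
    nodes = [StubNode("n1", "Alice"), StubNode("n2", "Bob")]
    await preload_embeddings(nodes)
    # With the filter from this PR, only nodes still lacking an embedding remain.
    pending = [n for n in nodes if n.name and n.name_embedding is None]
    vectors = await fake_create_batch([n.name for n in pending])
    for node, vector in zip(pending, vectors):
        node.name_embedding = vector
    return [n.name for n in pending]


embedded_names = asyncio.run(demo())
```

Here the resolved node keeps its stored vector and only the new node reaches the embedder, which is the saving the follow-up aims for.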

…dding

In create_entity_node_embeddings(), filter out nodes that already have
name_embedding set. This avoids redundant embedding API calls for nodes
that were resolved to existing graph entities whose embeddings were
pre-loaded from the database.

Previously, all nodes with a non-empty name were unconditionally
re-embedded, even if they already had a valid name_embedding from
a prior computation or database load. This wasted API calls and tokens
for resolved (merged) nodes.