Term indexing extremely slow for large ontologies due to per-class SPARQL ancestor traversal #278

@alexskr

Summary

Term indexing performance degrades severely for large ontologies. The root cause is that index_doc calls retrieve_hierarchy_ids(:ancestors) individually for every class, issuing iterative SPARQL queries level-by-level to walk up the hierarchy. For an ontology with 100K+ classes at average depth ~8, this produces hundreds of thousands of SPARQL round-trips just to populate the parents field in Solr.

Current behavior

During OntologySubmissionIndexer#index, for each batch of 2,500 classes:

  1. Class.indexBatch(page_classes) is called
  2. This calls indexable_object, which in turn calls index_doc, on every class
  3. Each index_doc calls retrieve_hierarchy_ids(:ancestors) which iterates level-by-level up to 40 levels, issuing a SPARQL query at each level
  4. For a class at depth D, that's D SPARQL round-trips

Total cost: O(N × avg_depth) SPARQL queries, where N is the number of classes.
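To make the cost concrete, here is a minimal simulation of the level-by-level pattern. It is not the actual retrieve_hierarchy_ids code: the PARENTS hash is a hypothetical in-memory stand-in for the triple store, and each loop iteration stands for one SPARQL round-trip. Depth is counted with the root at depth 1.

```ruby
# Hypothetical in-memory stand-in for the triple store: child => direct parents.
PARENTS = {
  "D" => ["C"],
  "C" => ["B"],
  "B" => ["A"],
  "A" => []
}.freeze

# Walks up one level per simulated query, mirroring the pattern described above.
# Returns [ancestor_ids, number_of_simulated_queries].
def ancestors_level_by_level(class_id, max_levels: 40)
  ancestors = []
  frontier = [class_id]
  queries = 0
  max_levels.times do
    break if frontier.empty?
    queries += 1 # one simulated SPARQL round-trip per level
    # Next frontier: direct parents not seen yet (also guards against cycles).
    frontier = frontier.flat_map { |id| PARENTS.fetch(id, []) }.uniq - ancestors
    ancestors.concat(frontier)
  end
  [ancestors, queries]
end

ancestors, queries = ancestors_level_by_level("D")
puts "ancestors=#{ancestors.inspect} queries=#{queries}"

# Summed over every class, this is the O(N x avg_depth) term.
total = PARENTS.keys.sum { |id| ancestors_level_by_level(id).last }
puts "total simulated queries for #{PARENTS.size} classes: #{total}"
```

In this toy chain of four classes the walk issues 1 + 2 + 3 + 4 = 10 simulated queries, which is the same growth pattern that yields hundreds of thousands of round-trips at 100K+ classes.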

Proposed fix

Replace the per-class SPARQL ancestor traversal with a bulk precomputation:

  1. Before the indexing loop, fetch all parent-child edges in a single paginated SPARQL query
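  2. Build the transitive closure in memory using a memoized traversal (BFS or DFS): each edge is followed at most once, so traversal work is O(V + E), plus the cost of merging ancestor sets, which is bounded by the total size of the closure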
  3. Store the precomputed ancestor map as a class-level cache on LinkedData::Models::Class for the duration of bulk indexing
  4. index_doc reads ancestors from the cache instead of issuing SPARQL queries
  5. Clear the cache after indexing completes (ensure block)
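Steps 2 to 5 could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed patch: build_ancestor_map and AncestorCache are hypothetical names, the edge list is assumed to arrive as [child, parent] pairs, and an iterative memoized depth-first walk is used in place of BFS so that deep hierarchies do not overflow the Ruby stack.

```ruby
require "set"

# Builds { class_id => Set of all ancestor ids } from [child, parent] pairs.
# Each node is finalized once; finished ancestor sets are reused (memoized).
# Cycles are tolerated (the walk terminates), but ancestor sets of classes
# inside a cycle may be incomplete.
def build_ancestor_map(edges)
  parents = Hash.new { |h, k| h[k] = [] }
  edges.each do |child, parent|
    parents[child] << parent
    parents[parent] # touch so that roots also appear as keys
  end

  ancestors = {}
  parents.keys.each do |start|
    next if ancestors.key?(start)
    in_progress = Set.new
    stack = [[start, false]]
    until stack.empty?
      node, expanded = stack.pop
      if expanded
        # All reachable parents are finalized by now; merge their sets.
        acc = Set.new
        parents[node].each do |p|
          acc << p
          acc.merge(ancestors[p]) if ancestors.key?(p)
        end
        ancestors[node] = acc
        in_progress.delete(node)
      elsif !ancestors.key?(node) && !in_progress.include?(node)
        in_progress << node
        stack << [node, true] # revisit after the parents are done
        parents[node].each do |p|
          stack << [p, false] unless ancestors.key?(p) || in_progress.include?(p)
        end
      end
    end
  end
  ancestors
end

# Hypothetical stand-in for a class-level cache on the model class.
module AncestorCache
  class << self
    attr_reader :map

    # Installs the map for the duration of the block and always clears it,
    # even if indexing raises.
    def with(map)
      @map = map
      yield
    ensure
      @map = nil
    end
  end
end

edges = [["D", "B"], ["D", "C"], ["B", "A"], ["C", "A"]]
AncestorCache.with(build_ancestor_map(edges)) do
  # index_doc would read from the cache here instead of querying SPARQL.
  puts AncestorCache.map["D"].to_a.sort.inspect
end
puts AncestorCache.map.inspect # nil once the block has finished
```

The cache in this sketch is plain module-level state that is only read during indexing. If indexing threads could ever write to it, it would need synchronization.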

This replaces ~800K SPARQL round-trips (100K classes × average depth ~8) with one paginated SPARQL query (one round-trip per page) plus in-memory computation.
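The paginated edge fetch might look something like the sketch below. Several things here are assumptions rather than facts about the codebase: the hierarchy predicate is taken to be rdfs:subClassOf, the submission's triples are assumed to live in a named graph, and the query runner is passed in as a block so that no particular SPARQL client API is implied. edges_query and fetch_all_edges are hypothetical helper names.

```ruby
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Builds one page of the edge query. ORDER BY keeps LIMIT/OFFSET paging stable.
def edges_query(graph_uri, limit:, offset:)
  <<~SPARQL
    SELECT ?child ?parent WHERE {
      GRAPH <#{graph_uri}> {
        ?child <#{RDFS_SUBCLASS_OF}> ?parent .
        FILTER(isIRI(?parent))
      }
    }
    ORDER BY ?child ?parent
    LIMIT #{limit} OFFSET #{offset}
  SPARQL
end

# Fetches every [child, parent] pair, one page per round-trip.
# The block receives the query string and must return an array of pairs.
def fetch_all_edges(graph_uri, page_size: 50_000, &run_query)
  edges = []
  offset = 0
  loop do
    rows = run_query.call(edges_query(graph_uri, limit: page_size, offset: offset))
    edges.concat(rows)
    break if rows.size < page_size # short page means this was the last one
    offset += page_size
  end
  edges
end

# Usage with a fake runner that slices a fixed edge list instead of querying.
all_edges = [["D", "B"], ["D", "C"], ["B", "A"], ["C", "A"], ["E", "A"]]
calls = 0
fake_runner = lambda do |query|
  calls += 1
  limit = query[/LIMIT (\d+)/, 1].to_i
  offset = query[/OFFSET (\d+)/, 1].to_i
  all_edges[offset, limit] || []
end

edges = fetch_all_edges("http://example.org/graph", page_size: 2, &fake_runner)
puts "fetched #{edges.size} edges in #{calls} round-trips"
```

The number of round-trips then scales with the number of edges divided by the page size, not with class count times depth.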

Files involved

  • lib/ontologies_linked_data/services/submission_process/operations/submission_indexer.rb — indexing orchestration
  • lib/ontologies_linked_data/models/class.rb: index_doc, retrieve_hierarchy_ids

Additional context

There are secondary performance bottlenecks in the indexer (SPARQL page fetch inside sync block, bring_remaining in CSV writer under lock) that could be addressed in follow-up work.
