Skip to content

feat: add semantic_scholar client#420

Merged
e06084 merged 3 commits into
MigoXLab:devfrom
e06084:dev
Jun 6, 2026
Merged

feat: add semantic_scholar client#420
e06084 merged 3 commits into
MigoXLab:devfrom
e06084:dev

Conversation

@e06084

@e06084 e06084 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Semantic Scholar backend (SemanticScholarClient) for retrieval evaluation and updates the retrieval execution logic to store and expose the execution summary. The feedback highlights several critical improvements: in retrieval.py, initializing self.summary with SummaryModel() could fail if there are required fields, so initializing it to None is recommended. In the new Semantic Scholar backend, holding a lock during time.sleep() in _rate_limit_wait blocks other threads and should be avoided by sleeping outside the lock. Additionally, the backend should defensively validate that the API response is a dictionary and that papers is a list, and normalize the paper scores using the number of returned papers rather than the total database matches to ensure consistent scoring.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread dingo/retrieval/backends/semantic_scholar.py Outdated
Comment thread dingo/exec/retrieval.py
Comment on lines +53 to +56
self.summary = SummaryModel()

def get_summary(self):
return self.summary

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Initializing self.summary to SummaryModel() without arguments is highly likely to raise a TypeError or ValidationError at runtime if SummaryModel has required fields (such as task_id, task_name, etc.) and no default values. It is safer to initialize self.summary to None and type it as SummaryModel | None.

Suggested change
self.summary = SummaryModel()
def get_summary(self):
return self.summary
self.summary: SummaryModel | None = None
def get_summary(self) -> SummaryModel | None:
return self.summary

Comment on lines +131 to +133
data = resp.json()
papers = data.get("data") or []
total = data.get("total", len(papers))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the API returns a valid JSON response that is not a dictionary (e.g., a list or a string), calling data.get() will raise an AttributeError. Additionally, if data.get("data") is not a list, iterating over papers or calling len(papers) could raise a TypeError. To ensure robust error handling and defensive programming, we should explicitly verify that data is a dictionary and papers is a list before processing them.

            data = resp.json()
            if not isinstance(data, dict):
                return SearchResponse(
                    query=query,
                    results=[],
                    response_time_ms=elapsed_ms,
                    status_code=resp.status_code,
                    error="Invalid JSON response format: expected a dictionary",
                )

            papers = data.get("data")
            if not isinstance(papers, list):
                papers = []
            total = data.get("total", len(papers))

a.get("name", "") for a in authors_raw if isinstance(a, dict)
]
year = paper.get("year")
score = 1.0 - (i / max(total, 1))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the total number of search results in the entire database (total) to normalize the rank-based score makes the scores highly inconsistent and dependent on the query's popularity. For queries with a very large number of matches, the scores of all top retrieved papers will be extremely close to 1.0, compressing the score range and losing relative ranking distinction. It is much more consistent to normalize the score based on the number of returned papers (len(papers)).

Suggested change
score = 1.0 - (i / max(total, 1))
score = 1.0 - (i / len(papers)) if papers else 0.0

@e06084 e06084 merged commit 4c7cdd7 into MigoXLab:dev Jun 6, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant