feat: add google_scholar client by e06084 · Pull Request #424 · MigoXLab/dingo

e06084 · 2026-06-09T06:11:55Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces a new Google Scholar retrieval backend using SerpAPI, makes the --api-url parameter optional to support backend-specific defaults, and adds support for extracting and prefixing query instructions for specific MTEB tasks (such as IFIR tasks). The review feedback highlights several opportunities to improve robustness, including safer handling of null summary values, ensuring fallback author parsing when name keys are missing, using a reverse-token scan to more accurately extract publication years, and implementing a stricter check for empty or NaN values in query instructions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T06:13:47Z

+    def _parse_result(item: dict[str, Any], rank: int) -> PaperResult:
+        publication_info = item.get("publication_info") or {}
+        authors = GoogleScholarClient._parse_authors(publication_info)
+        year = GoogleScholarClient._parse_year(publication_info.get("summary", ""))


If summary is explicitly set to None in the publication_info dictionary, publication_info.get("summary", "") will return None instead of "". This causes _parse_year to receive None and convert it to "None". Using publication_info.get("summary") or "" is safer and more idiomatic.

Suggested change

-        year = GoogleScholarClient._parse_year(publication_info.get("summary", ""))
+        year = GoogleScholarClient._parse_year(publication_info.get("summary") or "")
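
The difference between the two forms can be seen in isolation. This is a minimal sketch with a made-up `publication_info` payload (the real SerpAPI response is not shown in this thread): `dict.get` only falls back to its default when the key is absent, not when the key is present with a `None` value.

```python
# Hypothetical payload for illustration; only the "summary" key matters here.
publication_info = {"summary": None}

# The key exists, so the "" default is ignored and None comes back.
with_default = publication_info.get("summary", "")

# None is falsy, so `or` substitutes the empty string.
with_or = publication_info.get("summary") or ""

# The default is only used when the key is missing entirely.
missing_key = {}.get("summary", "")

print(repr(with_default), repr(with_or), repr(missing_key))
```

Note that `or ""` also replaces other falsy values such as `0` or `[]`, which is acceptable here only because a summary is expected to be a string.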

gemini-code-assist · 2026-06-09T06:13:47Z

+    @staticmethod
+    def _parse_authors(publication_info: dict[str, Any]) -> list[str]:
+        authors = publication_info.get("authors") or []
+        if isinstance(authors, list) and authors:
+            return [
+                str(author.get("name") or "")
+                for author in authors
+                if isinstance(author, dict) and author.get("name")
+            ]
+
+        summary = str(publication_info.get("summary") or "")
+        if not summary:
+            return []
+        author_part = summary.split(" - ", 1)[0]
+        return [name.strip() for name in author_part.split(",") if name.strip()]


In _parse_authors, if authors is a list of dicts but none of them contain a "name" key, the list comprehension returns an empty list []. However, because the if isinstance(authors, list) and authors: block is entered, the function immediately returns [] and never falls back to parsing the summary string. We should only return the parsed list if it actually contains elements, otherwise fall back to parsing the summary.

Suggested change

     @staticmethod
     def _parse_authors(publication_info: dict[str, Any]) -> list[str]:
         authors = publication_info.get("authors") or []
         if isinstance(authors, list) and authors:
-            return [
+            parsed = [
                 str(author.get("name") or "")
                 for author in authors
                 if isinstance(author, dict) and author.get("name")
             ]
+            if parsed:
+                return parsed

         summary = str(publication_info.get("summary") or "")
         if not summary:
             return []
         author_part = summary.split(" - ", 1)[0]
         return [name.strip() for name in author_part.split(",") if name.strip()]
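
To illustrate the fallback, here is a standalone sketch of the suggested logic. The module-level name `parse_authors` and the sample payload are assumptions made for this example; in the PR the logic lives in a static method of `GoogleScholarClient`.

```python
from typing import Any


def parse_authors(publication_info: dict[str, Any]) -> list[str]:
    """Prefer structured author names, else fall back to the summary string."""
    authors = publication_info.get("authors") or []
    if isinstance(authors, list) and authors:
        parsed = [
            str(author.get("name"))
            for author in authors
            if isinstance(author, dict) and author.get("name")
        ]
        if parsed:  # only trust the structured list if it yielded names
            return parsed

    summary = str(publication_info.get("summary") or "")
    if not summary:
        return []
    # Assumes the "Authors - Venue, Year" layout implied by the original code.
    author_part = summary.split(" - ", 1)[0]
    return [name.strip() for name in author_part.split(",") if name.strip()]


# Hypothetical payload: author entries exist but carry no "name" key.
sample = {
    "authors": [{"link": "https://example.com/a"}],
    "summary": "A Smith, B Jones - Journal of Examples, 2024",
}
print(parse_authors(sample))  # ['A Smith', 'B Jones']
```

Without the `if parsed:` guard, the same payload would return `[]` because the early `return` fires on any non-empty `authors` list.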

gemini-code-assist · 2026-06-09T06:13:47Z

+    @staticmethod
+    def _parse_year(text: str) -> int | None:
+        for token in str(text).replace(",", " ").split():
+            if token.isdigit() and len(token) == 4:
+                year = int(token)
+                if 1000 <= year <= 3000:
+                    return year
+        return None


In _parse_year, iterating through the tokens from left to right can incorrectly extract a 4-digit number that is not a year (e.g., "A Author - Journal of 1200 Cases, 2024" would return 1200 instead of 2024). Since publication years are almost always at the end of the summary/citation string, iterating through the tokens in reverse order (from right to left) is a much more robust heuristic.

Suggested change

     @staticmethod
     def _parse_year(text: str) -> int | None:
-        for token in str(text).replace(",", " ").split():
+        for token in reversed(str(text).replace(",", " ").split()):
             if token.isdigit() and len(token) == 4:
                 year = int(token)
                 if 1000 <= year <= 3000:
                     return year
         return None
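
The effect of the scan direction can be checked on the example string from the comment. This is a sketch only: `first_year` and `last_year` are hypothetical standalone names, not functions from the PR.

```python
from typing import Iterable, Optional


def _scan(tokens: Iterable[str]) -> Optional[int]:
    """Return the first token that looks like a plausible 4-digit year."""
    for token in tokens:
        if token.isdigit() and len(token) == 4:
            year = int(token)
            if 1000 <= year <= 3000:
                return year
    return None


def first_year(text: str) -> Optional[int]:
    # Left-to-right scan, as in the original _parse_year.
    return _scan(str(text).replace(",", " ").split())


def last_year(text: str) -> Optional[int]:
    # Right-to-left scan, as in the suggested change.
    return _scan(reversed(str(text).replace(",", " ").split()))


summary = "A Author - Journal of 1200 Cases, 2024"
print(first_year(summary))  # 1200
print(last_year(summary))   # 2024
```

The reverse scan is still a heuristic: it assumes the year is the last standalone 4-digit token, so a summary that ends with some other 4-digit number would still be misread.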

gemini-code-assist · 2026-06-09T06:13:47Z

+def _extract_query_instructions(queries: Any, total: int) -> list[str | None]:
+    for column_name in INSTRUCTION_COLUMNS:
+        values = _get_query_column_values(queries, column_name, total)
+        if any(value not in (None, "") for value in values):
+            return [
+                str(value).strip() if value not in (None, "") else None
+                for value in values
+            ]
+    return [None] * total


In _extract_query_instructions, checking value not in (None, "") is insufficient because:

If a value is a float NaN (which is common in datasets loaded via pandas/mteb), value not in (None, "") evaluates to True, resulting in "nan" being sent as an instruction.

If a value is a whitespace-only string (e.g., " "), it will be stripped to "" but still treated as a valid instruction.

We should implement a robust helper function to check for empty/null/NaN values and use it here.

Suggested change

 def _extract_query_instructions(queries: Any, total: int) -> list[str | None]:
+    def is_empty(v: Any) -> bool:
+        if v in (None, ""):
+            return True
+        if isinstance(v, str) and not v.strip():
+            return True
+        if isinstance(v, float) and v != v:
+            return True
+        return False
+
     for column_name in INSTRUCTION_COLUMNS:
         values = _get_query_column_values(queries, column_name, total)
-        if any(value not in (None, "") for value in values):
+        if any(not is_empty(value) for value in values):
             return [
-                str(value).strip() if value not in (None, "") else None
+                str(value).strip() if not is_empty(value) else None
                 for value in values
             ]
     return [None] * total
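
Both failure modes can be reproduced with plain Python values. The following is a sketch of one possible emptiness check. `is_empty` here is a module-level variant that uses `math.isnan` instead of `v != v`, and `normalize_instructions` is a hypothetical helper name introduced only for this example.

```python
import math
from typing import Any, Optional


def is_empty(value: Any) -> bool:
    """Treat None, blank or whitespace-only strings, and float NaN as empty."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, float):
        return math.isnan(value)
    return False


def normalize_instructions(values: list[Any]) -> list[Optional[str]]:
    # Hypothetical helper: maps empty values to None, strips the rest.
    return [None if is_empty(v) else str(v).strip() for v in values]


nan = float("nan")
# The original check lets NaN through, and str() turns it into "nan".
passes_old_check = nan not in (None, "")
as_text = str(nan)

raw = [None, "", "   ", nan, "  Find survey papers  "]
print(normalize_instructions(raw))
# [None, None, None, None, 'Find survey papers']
```

This check only covers Python `float` NaN. Other missing-value sentinels that a data loader might produce would need their own handling.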

feat: add google_scholar client

6333d69

e06084 force-pushed the dev branch from 7cc7137 to 6333d69 · June 9, 2026 06:12

gemini-code-assist Bot reviewed Jun 9, 2026

fix ut

b6bf18c

e06084 merged commit bbf627b into MigoXLab:dev Jun 9, 2026
2 checks passed

e06084 merged 2 commits into MigoXLab:dev from e06084:dev