fix(community): resolve document_ids MIME types via files().get (#1867)#1868

Open
Luis Martinez (lcmartinezdev) wants to merge 1 commit into
langchain-ai:mainfrom
lcmartinezdev:fix/drive-loader-document-ids-files-get

Conversation

@lcmartinezdev

Contributor

Description

GoogleDriveLoader.load() fails with HTTP 400 whenever document_ids is set. _load_documents_from_ids resolves MIME types with files().list(q="id = '...'"), but the Drive API q parameter has no id field, so the request is rejected with "Invalid Value". Loading by document_ids has been broken against the real API in every release since #1644 (4.0.0, 5.0.0, and main); 3.x is not affected. Mocked unit tests did not catch the bug because the Drive service is stubbed.

Fix

Resolve each id with files().get(fileId=..., fields="mimeType") — the correct way to fetch a file by id. The per-type dispatch added in #1644 (Sheets → _load_sheet_from_id, PDF → _load_file_from_id, else → _load_document_from_id) is unchanged.
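A minimal sketch of the lookup and dispatch described above, not the actual diff. The helper names `resolve_mime_type` and `pick_loader` and the `FakeDriveService` stand-in are hypothetical and exist only so the example runs offline; the `files().get(fileId=..., fields="mimeType").execute()` call shape is the one this PR uses, and the loader method names are those listed in the description.

```python
from typing import Any, Dict

SHEET_MIME = "application/vnd.google-apps.spreadsheet"
PDF_MIME = "application/pdf"


def resolve_mime_type(service: Any, file_id: str) -> str:
    """Fetch only the mimeType field for one file id via files().get."""
    metadata = service.files().get(fileId=file_id, fields="mimeType").execute()
    return metadata["mimeType"]


def pick_loader(mime_type: str) -> str:
    """Return the name of the loader method the per-type dispatch selects."""
    if mime_type == SHEET_MIME:
        return "_load_sheet_from_id"
    if mime_type == PDF_MIME:
        return "_load_file_from_id"
    return "_load_document_from_id"


# Hypothetical stand-ins for the Drive client, so the sketch needs no network.
class _FakeRequest:
    def __init__(self, payload: Dict[str, str]) -> None:
        self._payload = payload

    def execute(self) -> Dict[str, str]:
        return self._payload


class _FakeFiles:
    def __init__(self, mime_by_id: Dict[str, str]) -> None:
        self._mime_by_id = mime_by_id

    def get(self, fileId: str, fields: str) -> _FakeRequest:
        return _FakeRequest({"mimeType": self._mime_by_id[fileId]})


class FakeDriveService:
    def __init__(self, mime_by_id: Dict[str, str]) -> None:
        self._files = _FakeFiles(mime_by_id)

    def files(self) -> _FakeFiles:
        return self._files
```

With a real client, `service` would be the Drive service object the loader already builds; only the stand-in classes are specific to this sketch.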

Tests

  • Updated the mock service to stub files().get() (the call the loader now makes).
  • Added a regression test asserting MIME resolution uses files().get(fileId=...) and never files().list.
  • All unit tests pass; ruff and mypy are clean.
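The regression test could look roughly like the sketch below. It is an illustration under assumptions, not the test added in this PR: `resolve_mime_types` is a hypothetical helper standing in for the loader's MIME lookup, and the Drive service is a plain `unittest.mock.MagicMock`.

```python
from unittest.mock import MagicMock


def resolve_mime_types(service, document_ids):
    """Hypothetical helper: map each id to its MIME type via files().get."""
    return {
        doc_id: service.files()
        .get(fileId=doc_id, fields="mimeType")
        .execute()["mimeType"]
        for doc_id in document_ids
    }


def test_mime_resolution_uses_files_get():
    service = MagicMock()
    files = service.files.return_value
    # Every files().get(...).execute() call returns a PDF MIME type.
    files.get.return_value.execute.return_value = {"mimeType": "application/pdf"}

    result = resolve_mime_types(service, ["id-1", "id-2"])

    assert result == {"id-1": "application/pdf", "id-2": "application/pdf"}
    # One files().get call per id, with the expected keyword arguments.
    assert files.get.call_count == 2
    files.get.assert_any_call(fileId="id-1", fields="mimeType")
    files.get.assert_any_call(fileId="id-2", fields="mimeType")
    # The invalid files().list(q="id = ...") path must never be taken.
    files.list.assert_not_called()
```

A mock-based test like this pins the call shape, but it cannot confirm that the real API accepts the request, which is the gap that let the original bug through.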

Note

This makes one files().get call per id instead of a single files().list call. The extra calls are unavoidable because files().list cannot query by id, and the cost is acceptable because document_ids lists are typically small.

Fixes #1867

…chain-ai#1867)

GoogleDriveLoader._load_documents_from_ids resolved each file's MIME type
with files().list(q="id = '...'"). The Drive API `q` parameter has no `id`
field, so that request is rejected with HTTP 400 "Invalid Value" against the
real API, and load() fails for every document_id. Mocked tests did not catch
it because the Drive service is stubbed.

Resolve each id with files().get(fileId=..., fields="mimeType") instead, which
is the correct way to fetch a file by id. The per-type dispatch from langchain-ai#1644
(Sheets -> _load_sheet_from_id, PDF -> _load_file_from_id, else ->
_load_document_from_id) is unchanged.

Fixes langchain-ai#1867
@lcmartinezdev

Contributor Author

Why files.get per id (and not a single request)

Per the Drive API docs, id is not a valid files.list query term. The supported q terms are:

name, fullText, mimeType, modifiedTime, viewedByMeTime, trashed, starred, parents, owners, writers, readers, sharedWithMe, createdTime, properties, appProperties, visibility, shortcutDetails.targetId

So there is no "fixed query" version of files.list(q="id = '...'") — you fetch a file by id with files.get, not with search. That leaves two shapes:

  1. files.get per id (this fix) — N calls, simple and standard.
  2. Batch HTTP request — one round-trip for up to 100 ids, but more error-handling code, and it does not reduce quota: the docs state "n requests batched together count as n requests, not one."

Batching would only save round-trip latency, not API usage, so it adds complexity for a benefit most callers don't need (id lists are usually short). I went with files.get per id: it is the documented way to fetch by id and keeps the change minimal. A batched variant can be added later if a caller needs to load many ids at once.
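The trade-off can be made concrete with a little arithmetic. The sketch below assumes the two figures quoted above (up to 100 requests per batch, and each batched request counting as one request against quota); `request_cost` is a hypothetical helper for illustration and does not call the Drive API.

```python
import math


def request_cost(n_ids: int, batch_size: int = 100) -> dict:
    """Compare round trips and quota usage for per-id gets vs. batching."""
    per_id = {"round_trips": n_ids, "quota_requests": n_ids}
    batched = {
        # Each batch of up to batch_size ids is a single HTTP round trip.
        "round_trips": math.ceil(n_ids / batch_size),
        # Batched requests still count individually against quota.
        "quota_requests": n_ids,
    }
    return {"per_id": per_id, "batched": batched}
```

For 250 ids this gives 250 round trips per id versus 3 batched, with 250 quota requests either way; for a handful of ids the difference is negligible.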

Docs: file-specific query terms, batching.

Development

Successfully merging this pull request may close these issues.

GoogleDriveLoader.load() fails with HTTP 400 for document_ids (invalid files.list query)
