Raise informative IOError on corrupt tfidf vectors cache file #588

Open
H2908 wants to merge 1 commit into allenai:main from H2908:fix/badzip-corrupt-cache-error

H2908 commented Jun 8, 2026

Problem
When scipy.sparse.load_npz() encounters a corrupted .npz file (typically from an interrupted download), it raises BadZipFile, EOFError, or ValueError with no indication of which file is affected or how to fix it. Users are left with a cryptic traceback and no recovery path.
This was noted in issue #534, where a maintainer suggested modifying the package to print the paths it is trying to load.
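To make the failure mode concrete, here is a minimal reproduction sketch. It relies only on the standard library: an .npz file is a zip archive, so `zipfile` stands in for `scipy.sparse.load_npz()` here. The helper name `make_truncated_archive` and the file name are made up for illustration.

```python
import os
import tempfile
import zipfile


def make_truncated_archive(directory: str) -> str:
    """Write a valid zip archive, then cut it in half to mimic a partial download."""
    path = os.path.join(directory, "vectors.npz")  # hypothetical file name
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("data.npy", b"\x00" * 4096)
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        # Dropping the second half removes the zip end-of-central-directory record.
        handle.truncate(size // 2)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    truncated = make_truncated_archive(tmp_dir)
    try:
        zipfile.ZipFile(truncated)
    except zipfile.BadZipFile as err:
        # Print what the user would see at the bottom of the traceback.
        print(f"{type(err).__name__}: {err}")
```

Cutting the archive in half removes the trailing directory record that a zip reader looks for first, which is why an interrupted download tends to surface as BadZipFile rather than as a more descriptive error.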
Fix
In load_approximate_nearest_neighbours_index:

- Extract the resolved local cache path into a named variable before calling load_npz()
- Catch BadZipFile, EOFError, and ValueError
- Re-raise as IOError with the exact file path and instructions to delete the file so it can be re-downloaded
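The steps above can be sketched roughly as follows. This is not the actual diff: `load_cache_file` and `list_npz_members` are hypothetical names, the error message wording is an assumption, and the loader is passed in as a parameter so the sketch runs without scipy. In the PR the loader would be `scipy.sparse.load_npz` and the logic sits inside load_approximate_nearest_neighbours_index.

```python
import zipfile
from typing import Any, Callable


def load_cache_file(path: str, loader: Callable[[str], Any]) -> Any:
    """Call loader(path), turning corruption errors into an IOError naming the file."""
    try:
        return loader(path)
    except (zipfile.BadZipFile, EOFError, ValueError) as err:
        # Chain the original exception so the low-level cause stays in the traceback.
        raise IOError(
            f"Could not load cache file '{path}'; it may be corrupt "
            "(for example, from an interrupted download). "
            "Delete the file and run again so it can be re-downloaded."
        ) from err


def list_npz_members(path: str) -> list:
    # Stand-in loader using only the standard library: an .npz file is a zip
    # archive, so opening garbage bytes raises zipfile.BadZipFile.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Using `raise ... from err` keeps the original exception attached as `__cause__`, so the path and the recovery hint are added without hiding the underlying error. A missing file still raises FileNotFoundError unchanged, since only the three corruption-related exceptions are caught.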

Testing
Added a test in tests/test_candidate_generation.py that writes a malformed .npz file to a temporary directory and asserts that the new IOError is raised.
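For reference, a test of this shape might look like the sketch below. It is not the test from the PR: it uses `unittest` and a stdlib-only stand-in (`load_vectors`, a hypothetical name) in place of the patched function, so it can run without scispacy or scipy installed.

```python
import os
import tempfile
import unittest
import zipfile


def load_vectors(path: str) -> list:
    # Stand-in for the patched loader (hypothetical name): opens the .npz as a
    # zip archive and re-raises corruption errors as an IOError naming the file.
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    except (zipfile.BadZipFile, EOFError, ValueError) as err:
        raise IOError(f"Corrupt cache file '{path}'; delete it and retry.") from err


class TestCorruptCache(unittest.TestCase):
    def test_malformed_npz_raises_ioerror(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            bad_path = os.path.join(tmp_dir, "vectors.npz")
            with open(bad_path, "wb") as handle:
                handle.write(b"this is not a valid npz file")
            with self.assertRaises(IOError) as ctx:
                load_vectors(bad_path)
            # The message should point the user at the offending file.
            self.assertIn(bad_path, str(ctx.exception))
```

Asserting on the path in the message, and not only on the exception type, checks the part of the change that users actually benefit from.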

When scipy.sparse.load_npz encounters a corrupted .npz file (e.g. from
an interrupted download), it raises BadZipFile or EOFError with no
indication of which file is affected or how to fix it.

This commit catches those exceptions and raises an IOError that includes
the local file path and instructs the user to delete the file so it
can be re-downloaded.

Fixes allenai#534

H2908 commented Jun 8, 2026

Hi, I'd like to work on this issue.
I've reproduced the problem: when scipy.sparse.load_npz() encounters a corrupted .npz file (typically from an interrupted download), it raises BadZipFile or EOFError with no indication of which file is affected or how to fix it.
My fix extracts the resolved local cache path into a variable before calling load_npz(), then catches BadZipFile, EOFError and ValueError, and re-raises as an IOError that includes the file path and instructs the user to delete it so it can be re-downloaded. I've also added a test that writes a malformed .npz file and asserts the new IOError is raised.
I'll open a PR shortly.
