fix: shorten dataset cache filenames to avoid NAME_MAX errors#589

Open
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/539-short-cache-filename

@Chessing234

Summary

Fixes #539.

UMLS linker setup fails with OSError: [Errno 36] File name too long when writing cached linker files on filesystems with a low NAME_MAX (e.g. eCryptfs at 143 bytes).
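To check whether a given cache directory is affected, the filesystem limit can be queried directly. The following is a minimal sketch, assuming a Unix-like system where `os.pathconf` is available; the helper names `name_max` and `fits` are hypothetical and not part of the project.

```python
import os


def name_max(directory: str = ".") -> int:
    """Return the longest file name, in bytes, allowed in `directory`."""
    # os.pathconf is Unix-only; "PC_NAME_MAX" corresponds to POSIX _PC_NAME_MAX.
    return os.pathconf(directory, "PC_NAME_MAX")


def fits(filename: str, directory: str = ".") -> bool:
    """Check a candidate file name against the directory's NAME_MAX."""
    # The limit is in bytes, so measure the encoded name, not the str length.
    return len(filename.encode("utf-8")) <= name_max(directory)
```

Calling `name_max()` on the directory that holds the cache shows the limit that hashed filenames have to stay under.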

Root cause

url_to_filename() appended the full trailing URL component after two sha256 hashes (one for the URL, one for the etag), producing filenames of around 155 characters for UMLS assets like tfidf_vectors_sparse.npz, which exceeds the 143-byte eCryptfs limit.
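The length arithmetic can be reproduced with a short sketch. It assumes the old layout was `<sha256(url)>.<sha256(etag)>.<last URL component>` with period separators; the function name `old_style_filename`, the URL, and the etag below are placeholders for illustration, not the project's actual values.

```python
from hashlib import sha256


def old_style_filename(url: str, etag: str) -> str:
    """Hypothetical reconstruction of the pre-fix naming scheme."""
    # The full trailing URL component is kept, which is what inflates the name.
    last_part = url.split("/")[-1]
    url_hash = sha256(url.encode("utf-8")).hexdigest()    # 64 hex characters
    etag_hash = sha256(etag.encode("utf-8")).hexdigest()  # 64 hex characters
    return f"{url_hash}.{etag_hash}.{last_part}"


# Placeholder URL and etag; only the basename length matters here.
name = old_style_filename(
    "https://example.com/umls/tfidf_vectors_sparse.npz", "hypothetical-etag"
)
print(len(name))  # 64 + 1 + 64 + 1 + 24 = 154
```

Under these assumptions the name is 154 characters long, which is already above the 143-byte limit before any directory prefix is considered.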

Fix

Keep only the file extension from the URL tail instead of the full basename, so hashed cache paths stay within NAME_MAX while remaining deterministic per URL/etag.
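The approach described above might look roughly like the sketch below. This is not the actual diff: `short_cache_filename` is a hypothetical stand-in for the patched `url_to_filename()`, and the period separator between the two hashes is an assumption.

```python
from hashlib import sha256
from os.path import splitext
from typing import Optional
from urllib.parse import urlparse


def short_cache_filename(url: str, etag: Optional[str] = None) -> str:
    """Hypothetical stand-in for the patched url_to_filename()."""
    # Hash the URL so the name is deterministic for a given URL.
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        # Append the etag hash so a changed remote file gets a new cache entry.
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    # Keep only the extension (e.g. ".npz") of the last URL path component.
    last_part = urlparse(url).path.split("/")[-1]
    _, extension = splitext(last_part)
    return filename + extension


# Placeholder URL and etag for illustration.
short = short_cache_filename(
    "https://example.com/umls/tfidf_vectors_sparse.npz", "hypothetical-etag"
)
print(len(short), short.endswith(".npz"))  # 133 True
```

With both hashes present the result is 64 + 1 + 64 + 4 = 133 characters, consistent with the length reported in the test plan. Keeping the extension lets code that dispatches on the file suffix continue to work.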

Test plan

  • Smoke-tested url_to_filename() for a UMLS tfidf URL: the filename is 133 characters long and ends with .npz

Made with Cursor

url_to_filename appended the full URL tail (e.g. tfidf_vectors_sparse.npz)
after double sha256 hashes, producing filenames longer than the eCryptfs
NAME_MAX and raising OSError on cache writes (fixes allenai#539).

Co-authored-by: Cursor <cursoragent@cursor.com>

Development

Successfully merging this pull request may close these issues.

File name too long
