fix: shorten dataset cache filenames to avoid NAME_MAX errors by Chessing234 · Pull Request #589 · allenai/scispacy

Chessing234 · 2026-06-10T11:00:31Z

Summary

Fixes #539.

UMLS linker setup fails with OSError: [Errno 36] File name too long when writing cached linker files on filesystems with a low NAME_MAX (e.g. eCryptfs at 143 bytes).
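To check whether a given cache directory is affected, the per-component limit can be queried at runtime. This is a minimal sketch, assuming a POSIX system where `os.pathconf` is available; the helper name `fits_name_max` is hypothetical and not part of scispacy.

```python
import os


def fits_name_max(filename: str, directory: str = ".") -> bool:
    """Return True if `filename` fits within NAME_MAX for `directory`."""
    # PC_NAME_MAX is the filesystem's limit on a single path component.
    name_max = os.pathconf(directory, "PC_NAME_MAX")
    # The limit is measured in bytes, so compare the encoded length.
    return len(os.fsencode(filename)) <= name_max
```

Pointing `directory` at the cache location on an eCryptfs mount would show whether a roughly 155-character name is rejected there.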

Root cause

url_to_filename() appended the full trailing URL component after two sha256 hashes, producing filenames around 155 characters for UMLS assets like tfidf_vectors_sparse.npz.

Fix

Keep only the file extension from the URL tail instead of the full basename, so hashed cache paths stay within NAME_MAX while remaining deterministic per URL/etag.
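The difference between the two naming schemes can be sketched as follows. This is an illustration under stated assumptions, not the code in the diff: the function names, the "." separator between the two hashes, the example URL, and the etag value are all hypothetical.

```python
import os
from hashlib import sha256
from typing import Optional


def _hashed_prefix(url: str, etag: Optional[str]) -> str:
    # sha256 of the URL, optionally followed by "." and sha256 of the etag
    # (the "." separator is an assumption for this sketch).
    name = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        name += "." + sha256(etag.encode("utf-8")).hexdigest()
    return name


def filename_with_full_tail(url: str, etag: Optional[str] = None) -> str:
    # Behaviour described under "Root cause": append the whole URL tail.
    return _hashed_prefix(url, etag) + "." + url.split("/")[-1]


def filename_with_extension_only(url: str, etag: Optional[str] = None) -> str:
    # Behaviour described under "Fix": append only the tail's extension.
    _, ext = os.path.splitext(url.split("/")[-1])  # ext includes the leading "."
    return _hashed_prefix(url, etag) + ext


# Placeholder URL and etag, not the real UMLS asset location.
url = "https://example.com/data/tfidf_vectors_sparse.npz"
etag = "hypothetical-etag"

print(len(filename_with_full_tail(url, etag)))       # 154 = 64 + 1 + 64 + 1 + 24
print(len(filename_with_extension_only(url, etag)))  # 133 = 64 + 1 + 64 + 4
```

Both variants are pure functions of the URL and etag, so the shortened name stays deterministic. Under these assumptions only the second stays below the 143-byte eCryptfs limit.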

Test plan

Smoke-tested url_to_filename() for a UMLS tfidf URL: the resulting filename is 133 characters long and ends with .npz

Made with Cursor

url_to_filename appended the full URL tail (e.g. tfidf_vectors_sparse.npz) after double sha256 hashes, producing paths over eCryptfs NAME_MAX and raising OSError on cache writes (fixes allenai#539).

Co-authored-by: Cursor <cursoragent@cursor.com>

Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/539-short-cache-filename
