Resolve tokenizer from local cache in offline mode #249
Open
sarathfrancis90 wants to merge 1 commit into
download_tokenizer_from_hf_hub crashed offline (HF_HUB_OFFLINE=1) when a branch/tag revision such as "main" was passed:

- huggingface_hub raises OfflineModeIsEnabled, a builtin ConnectionError subclass that is not a requests.* error, so the existing except tuple missed it and the local-files fallback was skipped.
- list_local_hf_repo_files only resolved refs/<DEFAULT_REVISION> when revision was None, so an explicit "main" was looked up as a literal snapshots/main directory (which never exists) and returned no files.

Catch OfflineModeIsEnabled alongside the requests errors, and resolve any branch/tag revision to its commit hash via refs/<revision> before reading the snapshot directory.
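The first bullet of the commit message is easy to reproduce in isolation. The sketch below is not the PR's code: it uses stand-in exception classes that only mirror the relevant inheritance (requests errors derive from IOError/OSError, while OfflineModeIsEnabled derives from the builtin ConnectionError), and `list_repo_files_offline` and `list_files` are hypothetical helpers named for illustration.

```python
# Stand-ins so the sketch runs without requests or huggingface_hub installed.
# They only mirror the inheritance that matters for the bug.
class RequestsConnectionError(OSError):
    """Stand-in for requests.ConnectionError (a RequestException, i.e. IOError)."""


class OfflineModeIsEnabled(ConnectionError):
    """Stand-in for huggingface_hub.errors.OfflineModeIsEnabled."""


def list_repo_files_offline():
    # Hypothetical stub: mimics the hub call failing under HF_HUB_OFFLINE=1.
    raise OfflineModeIsEnabled("offline mode is enabled")


def list_files(caught, local_files):
    """Return local_files when the remote listing fails with a caught error."""
    try:
        return list_repo_files_offline()
    except caught:
        # Fallback path: use whatever is already in the local cache.
        return local_files


# Before the fix: only requests-style errors are caught.
old_tuple = (RequestsConnectionError,)
# After the fix: the offline-mode error is caught as well.
new_tuple = (RequestsConnectionError, OfflineModeIsEnabled)
```

With `old_tuple` the offline error propagates to the caller. With `new_tuple` the call returns the local file list instead.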
Fixes #248
Bug
With `HF_HUB_OFFLINE=1` and a model fully present in the local cache, `download_tokenizer_from_hf_hub(repo_id, revision="main")` crashes instead of resolving the tokenizer from the cache. This is hit in practice via vLLM / transformers v5, which route any model shipping `tekken.json` to `MistralCommonBackend` and pass `revision="main"`.

Cause
Two issues combine in `tokens/tokenizers/utils.py`:
- `huggingface_hub` raises `OfflineModeIsEnabled`, which is a builtin `ConnectionError` subclass, not a `requests.*` error. The `except (requests.ConnectionError, requests.HTTPError, requests.Timeout)` around `list_repo_files` therefore misses it and the existing local-files fallback is skipped.
- `list_local_hf_repo_files` only consulted `refs/<DEFAULT_REVISION>` when `revision` was `None`. An explicit branch/tag like `"main"` was looked up as a literal `snapshots/main` directory, which never exists (snapshots are keyed by commit hash), so it returned `[]`.

Fix
- Add `huggingface_hub.errors.OfflineModeIsEnabled` to the caught exceptions so offline mode reaches the same local-files fallback as a dropped connection (`force_download` still re-raises, unchanged).
- Resolve any branch/tag `revision` to its commit hash via `refs/<revision>` before reading the snapshot directory.

Testing
- Added a `revision="main"` case to `test_list_local_hf_repo_files` and a new `test_download_tokenizer_from_hf_hub_offline_mode`; both fail before the change and pass after.
- `pytest tests/test_utils.py` (23 passing), `ruff check`, `ruff format --check`, and `mypy src` are all green.
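For reviewers unfamiliar with the cache layout, the revision lookup described under Fix can be sketched roughly as follows. This is an illustration, not the code in this PR: `resolve_snapshot_dir` and `list_snapshot_files` are hypothetical names, and the sketch assumes the usual layout in which `refs/<revision>` is a small text file holding a commit hash and `snapshots/<commit hash>/` holds the files.

```python
from pathlib import Path


def resolve_snapshot_dir(repo_cache: Path, revision: str) -> Path:
    """Map a branch/tag or commit hash to its snapshot directory.

    Hypothetical helper: repo_cache is the per-repo cache folder that
    contains the refs/ and snapshots/ subdirectories.
    """
    ref_file = repo_cache / "refs" / revision
    if ref_file.is_file():
        # Branch or tag: the ref file stores the commit hash it points to.
        commit = ref_file.read_text().strip()
    else:
        # No matching ref: assume the revision is already a commit hash.
        commit = revision
    return repo_cache / "snapshots" / commit


def list_snapshot_files(repo_cache: Path, revision: str) -> list[str]:
    """List file names in the resolved snapshot, or [] if it is not cached."""
    snapshot = resolve_snapshot_dir(repo_cache, revision)
    if not snapshot.is_dir():
        return []
    return sorted(p.name for p in snapshot.iterdir())
```

Looking up `"main"` directly under `snapshots/` would miss, because that directory is keyed by commit hash. Reading the ref first is what makes a branch name and its commit hash resolve to the same files.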