
Allow NUMERICALEARTH_DATA_DIRECTORY to override the Scratch.jl cache #367

Open: glwagner wants to merge 2 commits into main from data-directory-env-var


@glwagner


Summary

Adds an opt-in NUMERICALEARTH_DATA_DIRECTORY environment variable that lets users redirect where downloaded datasets are cached, as an alternative to the default Scratch.jl space.

Scratch.jl ties its scratch space to the active Julia depot and offers no per-package environment-variable override. The only knob is JULIA_DEPOT_PATH, which relocates everything (installed packages, precompile caches, all scratchspaces). That is too blunt for two common cases:

  • HPC clusters where the depot lives on a small / quota-limited filesystem (e.g. $HOME) but a large scratch filesystem is available for data;
  • sharing one cache of large datasets (GEBCO ~8 GB, IBCAO ~25 GB, ECCO, ...) across depots or users.

What changed

  • New helper DataWrangling.download_cache(key):
    • default → @get_scratch!(key) (unchanged behavior);
    • if NUMERICALEARTH_DATA_DIRECTORY is set → mkpath(joinpath(ENV[...], key)).
  • All 15 dataset/bathymetry caches now resolve through this helper in their __init__ instead of calling @get_scratch! directly, which also removes the duplicated Scratch import from each submodule.
  • Per-key subdirectories are preserved (e.g. ECCO/v4, WOA/annual), and the per-Metadata dir keyword still overrides everything for individual datasets.
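
For readers who want to see the resolution order in code, here is a minimal sketch of how such a helper could behave. It is not the implementation in this PR. `scratch_stand_in` and the `fallback` keyword are hypothetical names that replace `@get_scratch!(key)`, which only works inside a package that has a UUID. Treating an empty variable as unset is also an assumption of the sketch.

```julia
# Hypothetical sketch of the lookup order described above; not the PR's code.
const DATA_DIRECTORY_VARIABLE = "NUMERICALEARTH_DATA_DIRECTORY"

# Stand-in for `@get_scratch!(key)` so the sketch runs outside a package.
scratch_stand_in(key) = mkpath(joinpath(tempdir(), "scratchspaces-stand-in", key))

function download_cache(key; fallback = scratch_stand_in)
    root = get(ENV, DATA_DIRECTORY_VARIABLE, "")
    # Assumption: an empty value is treated the same as an unset variable.
    isempty(root) && return fallback(key)
    # Keep the per-key subdirectory (e.g. "ECCO/v4") under the configured root.
    return mkpath(joinpath(root, key))
end

# Usage: with the variable set, the cache lands under the configured directory.
mktempdir() do dir
    withenv(DATA_DIRECTORY_VARIABLE => dir) do
        path = download_cache("ECCO/v4")
        @assert path == joinpath(dir, "ECCO/v4")
        @assert isdir(path)
    end
end
```

`withenv` restores the previous environment when the block exits, so the sketch leaves the session's environment unchanged.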

Because the helper lives in DataWrangling and all submodules share NumericalEarth as their module root, @get_scratch! still resolves to the same package UUID, so existing default caches keep their current on-disk location (verified: ~/.julia/scratchspaces/904d977b-.../JRA55).

Verification

  • Env-var path: default_download_directory(...) lands under the configured directory (incl. ECCO/v4 subdir). ✅
  • Default path: unchanged Scratch location. ✅
  • All ExplicitImports QA checks (check_no_implicit_imports, check_no_stale_explicit_imports, check_all_explicit_imports_via_owners, check_all_qualified_accesses_via_owners, check_no_self_qualified_accesses) return nothing. ✅
  • New testset in test/test_metadata.jl covering both the fallback and the env-var override. ✅

Docs

Added a "Where data is cached" section to docs/src/Metadata/metadata_overview.md.
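
Since the caches are resolved in each module's `__init__`, the variable presumably has to be set before the package is loaded. The sketch below illustrates that timing with a hypothetical stand-in module, `CacheDemo`. It does not use NumericalEarth's real modules, and the `tempdir()` fallback stands in for the Scratch.jl default.

```julia
# Hypothetical stand-in for a dataset submodule; NumericalEarth's real modules differ.
first_dir = mktempdir()
ENV["NUMERICALEARTH_DATA_DIRECTORY"] = first_dir  # set before the module loads

module CacheDemo
    const download_dir = Ref{String}()

    # Assumed shape: the cache path is resolved once, when the module initializes.
    function __init__()
        root = get(ENV, "NUMERICALEARTH_DATA_DIRECTORY", tempdir())
        download_dir[] = mkpath(joinpath(root, "JRA55"))
    end
end

# Changing the variable afterwards does not move the already-resolved cache.
ENV["NUMERICALEARTH_DATA_DIRECTORY"] = mktempdir()
@assert CacheDemo.download_dir[] == joinpath(first_dir, "JRA55")

# Clean up so the demo does not leak into the rest of the session.
delete!(ENV, "NUMERICALEARTH_DATA_DIRECTORY")
```

If this assumption holds for the real package, exporting the variable in the shell or job script before starting Julia is the safest way to apply it.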

cc @monsieuralok

🤖 Generated with Claude Code

Add `DataWrangling.download_cache(key)`, a single helper that returns the
directory used to cache downloaded data. By default it resolves to a Scratch.jl
space (unchanged behavior, same package UUID, so existing caches stay put). When
the `NUMERICALEARTH_DATA_DIRECTORY` environment variable is set, data is instead
cached under a per-key subdirectory of it.

Every dataset module (and the Bathymetry cache) now resolves its cache through
this helper in `__init__` instead of calling `@get_scratch!` directly, removing
the duplicated Scratch import from each submodule.

This is useful on HPC systems where the Julia depot lives on a small or
quota-limited filesystem, or to share one cache of large datasets across depots
and users — neither of which Scratch.jl supports on its own (its only knob,
JULIA_DEPOT_PATH, relocates everything).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 19, 2026


Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/DataWrangling/DataWrangling.jl | 75.00% | 1 Missing ⚠️ |
