Mirrors Movebank study metadata and event data to a local file/folder structure. A first run catches up historical data in chunks; subsequent runs pull only updates.
Ships as both a command-line tool (cli/) and an embeddable Java library
(lib/). Built on top of movebank-api-client.
- Re-runnable: per-(tag, sensor type) cursor + state file means an interrupted run resumes cleanly.
- Backoff-aware updates: 5 min → 15 min → 1 h → 6 h → 24 h → 7 d, reset on any pass with new rows.
- Single-process safe:
<mirror-dir>/.lockblocks concurrent runs against the same directory.
Build the CLI distribution and put it on your PATH:
./gradlew :cli:installDist
export PATH="$PWD/cli/build/install/movebank-mirror/bin:$PATH"
export MOVEBANK_USER=... # Movebank username
export MOVEBANK_PASSWORD=... # see "Passing the password" below for safer alternatives
# Bootstrap a fresh mirror: pull metadata for every accessible study, then
# start the event-data sync loop (Ctrl-C / SIGTERM stops cleanly).
movebank-mirror -d /var/lib/mvb syncCommon shapes:
# Dry-run: list every accessible study without downloading.
movebank-mirror metadata --list
# Mirror two specific studies' metadata.
movebank-mirror -d /var/lib/mvb metadata --study 2911040 --study 2911107
# Daily cron job: metadata refresh + a single event-data pass, then exit.
movebank-mirror -d /var/lib/mvb sync --once
# Continuous service (e.g. systemd / Docker): event-data loop only,
# assuming metadata is already on disk.
movebank-mirror -d /var/lib/mvb eventdatamovebank-mirror [GLOBAL OPTS] <command> [COMMAND OPTS]
| Command | What it does |
|---|---|
metadata |
Pull study metadata. One JSON file per study, named %012d.json under --mirror-dir. |
eventdata |
Sync event data. Loops forever by default; use --once for cron. |
sync |
metadata pass, then eventdata (loop, or --once). The "just keep this current" mode. |
help |
Show help for a command (also: <command> --help). |
| Flag | Env var | Default | Notes |
|---|---|---|---|
-u, --user |
MOVEBANK_USER |
(required) | |
-p, --password |
MOVEBANK_PASSWORD |
Discouraged on CLI — visible in ps. |
|
--password-file FILE |
Read first line of FILE. |
||
--password-stdin |
Read first line of stdin. | ||
--base-url URL |
MOVEBANK_BASE_URL |
https://www.movebank.org/movebank |
|
-d, --mirror-dir DIR |
MOVEBANK_MIRROR_DIR |
./movebank-mirror |
Directory the mirror reads/writes. |
-v / -vv / -q |
info | -v debug, -vv trace, -q warnings only. |
|
-h, --help / -V, --version |
| Flag | Notes |
|---|---|
--study ID |
Restrict to specific study ids. Repeatable. |
--exclude ID |
Skip these study ids. Repeatable. |
--license MODE |
auto / record (default) / reject. See License modes. |
--dry-run |
Print which studies would be fetched, don't download. |
--list |
Print accessible study ids and names, then exit. Implies --dry-run. |
| Flag | Notes |
|---|---|
--once |
Single pass, then exit. Pairs with cron / systemd-timer. |
--chunk-size N |
Records per catch-up chunk. Default 50000. |
--update-sleep DURATION |
Sleep between passes once caught up. 5m, 1h30m, 250ms, PT5M. Default 5m. |
--study ID |
Restrict to specific study ids. Repeatable. |
Combines metadata + eventdata flags: --once, --chunk-size, --update-sleep,
--license, --study, plus --skip-metadata to run only the event-data side.
| Mode | Behavior |
|---|---|
auto |
Accept every license silently. Read terms first — you're agreeing on every study. |
record |
Default. Accept and write <studyId>-license.json next to the metadata for an audit trail. |
reject |
Refuse acceptance. Studies that require a license fail; useful as a permission probe. |
In order of preference:
MOVEBANK_PASSWORDenv var (clean for systemd / Docker / shell scripts).--password-file ~/.movebank.pw(file mode600).--password-stdinfor piping:pass show movebank | movebank-mirror --password-stdin sync.--password ...last resort — visible to other users viaps.
| Code | Meaning |
|---|---|
0 |
Success |
1 |
Generic error |
2 |
Usage / bad args |
3 |
Authentication |
4 |
Mirror-dir lock held by another process |
5 |
I/O error |
130 |
Interrupted (SIGINT/SIGTERM) |
<mirror-dir>/
├── 000002911040.json study metadata: study, sensors, tags,
│ deployments, individuals, attribute lists
│ per sensor type
├── 000002911040-license.json accepted-license record (license=record)
├── .lock single-process lock (managed by the CLI)
└── 000002911040/ one directory per study
└── <tagId>/ one directory per tag
├── state_<sensorTypeId>.json sync cursor + backoff state
├── <sensorTypeId>_<yyyyMMddHHmmss>.csv catch-up chunks
└── <sensorTypeId>_update_<yyyyMMddHHmmss>.csv update batches
Filenames use a 12-digit zero-padded study id so directory listings sort chronologically by Movebank's id assignment.
For each (tag, sensorType) pair the downloader alternates between two modes,
tracked in state_<sensorTypeId>.json:
Catch-up. Downloads up to --chunk-size records per chunk (default 50 000),
using the last seen timestamp as the cursor. Repeats until a chunk returns
fewer rows than the limit; then transitions to update mode.
Update. Polls for new rows using update_ts as the cursor. An exponential
backoff ladder (5 min → 15 min → 1 h → 6 h → 24 h → 7 d) extends the polling
interval after empty passes; any pass with new rows resets the interval to its
minimum.
A pass over the whole mirror yields control between passes for --update-sleep
(default 5 min) so the loop is friendly under systemd / screen.
The mirror is built on the assumption that Movebank is mostly append-only — new tags get added, new event rows stream in, old data is mostly immutable. For studies where that holds, the mirror stays in sync indefinitely. For studies where it doesn't, it can silently drift from upstream. The current release does not have automation for the drift cases; those are real gaps, documented honestly here so you know what to expect.
The metadata subcommand writes <studyId>.json for each accessible study
the first time it runs. Subsequent runs of eventdata / sync read those
JSON files but never re-fetch them. If individuals, tags, deployments, or
study attributes are added / edited / deleted upstream, the mirror won't
notice. Concretely:
- A new tag deployed on an existing study after first mirroring → its events will never download, because the sensor isn't in the cached metadata.
- A deployment window edited upstream (e.g. end timestamp extended) → the
mirror keeps the old window, so deployment-window-aware tools downstream
(like
movebank-mirror-api's event enrichment) use stale data. - A new sensor type added to a tag → its CSVs won't appear.
- Licence terms or
i_can_see_datachanges → not picked up.
Workaround for now: re-run movebank-mirror metadata periodically (e.g.
weekly via cron) to overwrite the cached JSONs. Event-data sync afterward
picks up newly-mirrored tags and sensor types automatically.
The update-mode poll queries Movebank for rows whose update_ts is newer
than the last seen value. That works for new arrivals and for late-arriving
late data from a still-active tag. It does not work for:
- Retroactive edits to historical rows. If a row's
update_tsis bumped upstream because the row was corrected, we'll download the corrected version into an_update_…csvfile — but the original version still sits in an older catch-up chunk on disk. Downstream readers see both, with no rule for which wins. - Deletions. Movebank's API does not emit tombstones. If a row is removed upstream, the mirror still has it, and there's no mechanism to detect the divergence.
- Bulk corrections (e.g. recomputed location_lat / location_long after a positional-fix-quality re-run) — same drift story as edits, at scale.
Workaround for now: for studies that are actively curated, periodically
delete the mirror's event data for the study and let catch-up re-download
from scratch. There is no built-in command for this yet; rm -rf <mirror>/<studyId>/<tagId>/ is the manual move.
There's no verify subcommand that compares local row counts to upstream
row counts to confirm the mirror is faithful. If you need that confidence
for a paper or a long-term archive, run a periodic external check (e.g.
total event-row count per (tag, sensor type) against the live API, with
-vv to log per-pair counts).
The directory tree under the mirror root has no _format_version marker.
If the format changes in a future release (currently unplanned), tooling
that consumed pre-existing mirrors would need to migrate manually.
movebank-mirror-api's enrichment logic happens to be backward-compatible
with all current mirror layouts; future tools may not be.
These are sometimes raised but are working as intended:
- Network errors mid-chunk → the chunk file is deleted if no rows were written; next sync re-tries from the same cursor.
- Concurrent runs against the same mirror dir → blocked by the
.lockfile; the second invocation exits cleanly with a clear error. - Studies you no longer have access to → the existing
<studyId>.jsonstays on disk; the next event-data pass will fail-with-log for that study but keep going with the others.
If you have a real use case that's blocked by one of these, please open an issue — it'll move up the list.
movebank-mirror metadata --refresh— re-fetch metadata for already- mirrored studies, overwriting cached JSON. Resolves the metadata-drift gap and surfaces study-level / metadata-level upstream deletions.- Detection of
update_tscollisions with already-downloaded rows, with a dedupe pass onevent_id(now that we request that column at download time). Resolves the retroactive-edits gap for most cases. movebank-mirror verify— per-(tag, sensor type) row-count comparison against the live API. Surfaces row-count discrepancies cheaply.- Event-row deletion detection —
verify --deepmode that diffsevent_idsets against the live API to find upstream deletions, with a--on-deletionpolicy (mirror/quarantine/preserve) that lets archival mirrors keep what upstream loses. The research-integrity layer. - On-disk schema-version tag — single
_format_versionfile under the mirror root so future tooling can detect old layouts. movebank-mirror prune --study <id>— clean removal of one study's event data without manualrm -rf. Useful both for the "re-download from scratch" workaround and for managing disk usage.
Detailed designs for each are in TODO.md.
The library (Maven coords de.firetail.compat.movebank:movebank-mirror) exposes
the same building blocks the CLI uses:
import com.fasterxml.jackson.databind.ObjectMapper;
import de.firetail.compat.movebank.api.client.LicenseChecker;
import de.firetail.compat.movebank.api.client.MovebankApiClient;
import de.firetail.compat.movebank.mirror.MovebankMirror;
import de.firetail.compat.movebank.mirror.Study;
import de.firetail.compat.movebank.mirror.StudyId;
import de.firetail.compat.movebank.mirror.StudyJson;
import de.firetail.compat.movebank.mirror.eventdata.EventDataDownloader;
import de.firetail.compat.movebank.mirror.eventdata.EventDataSyncLoop;
import java.io.File;
import java.util.Set;
LicenseChecker autoAccept = html -> true;
MovebankApiClient client = new MovebankApiClient(
"https://www.movebank.org/movebank",
System.getenv("MOVEBANK_USER"),
System.getenv("MOVEBANK_PASSWORD"),
autoAccept);
File baseDir = new File("/var/lib/mvb");
baseDir.mkdirs();
// 1. Pull metadata for every accessible study.
MovebankMirror mirror = new MovebankMirror(client);
ObjectMapper mapper = new ObjectMapper();
for (StudyId studyId : mirror.getAllStudyIds()) {
Study study = mirror.getStudyRefData(studyId);
File studyFile = new File(baseDir,
String.format("%012d.json", Long.parseLong(studyId.studyId())));
mapper.writerWithDefaultPrettyPrinter()
.writeValue(studyFile, new StudyJson(study));
}
// 2. Sync event data — tunable chunk size, update sleep, study filter.
EventDataDownloader downloader = new EventDataDownloader(client, baseDir, 50_000);
new EventDataSyncLoop(baseDir, downloader)
.setUpdateSleepMs(5 * 60 * 1_000L)
.setStudyIdFilter(Set.of("2911040", "2911107")::contains)
.run(); // blocking; interrupt the thread to stopImplement your own LicenseChecker to plug in a different acceptance policy
(prompt-driven, log-only, deny-by-default, etc.).
Requires JDK 21.
./gradlew build # compile + unit tests for both modules
./gradlew :lib:integrationTest # hits the Movebank API; needs MOVEBANK_USER/PASSWORD
./gradlew :cli:installDist # cli/build/install/movebank-mirror/bin/movebank-mirror
./gradlew :cli:distTar :cli:distZip # release archives in cli/build/distributions/
./gradlew :lib:publish # publish library jar (signing keys required)Integration tests self-skip when credentials are not set, so they are safe to
include in :check.
movebank-mirror/
├── lib/ de.firetail.compat.movebank:movebank-mirror — library, published to Maven Central
└── cli/ de.firetail.compat.movebank:movebank-mirror-cli — CLI distribution (application plugin)
The library has zero CLI dependencies; consumers who only want the embeddable API don't pull in picocli.
If you have movebank-api-client checked out as a sibling directory:
./gradlew -PlocalMovebankApiClient=true build…or set localMovebankApiClient=true in gradle.properties. The same pattern
applies downstream: BasicImport's gradle.properties exposes
localMovebankMirror=true to consume this repo via includeBuild.
LGPL-2.1, matching movebank-api-client. See LICENSE.