Add tx-granular ledger snapshot source using meta, RPC, and archives #1657
leighmcculloch wants to merge 48 commits into
is it possible to set it so the timestamp is also set after the snapshot has been loaded? Currently we need to manually set it after the |
|
This seems like an interesting way of handling a simulation locally. Or at least doing a read call locally. |
|
Looks awesome. Thanks, Leigh! ❤️ |
| let ledger_cache_dir = self.cache_path.join( | ||
| self.tx_hash | ||
| .map(|h| { | ||
| let tx_hash_str: String = h.iter().map(|b| format!("{b:02x}")).collect(); | ||
| format!("{}-{}-before", self.fetcher.ledger(), tx_hash_str) | ||
| }) | ||
| .unwrap_or_else(|| format!("{}-after", self.fetcher.ledger())), | ||
| ); | ||
|
|
||
| // Ensure cache directory exists | ||
| std::fs::create_dir_all(&ledger_cache_dir).expect("failed to create cache directory"); |
There was a problem hiding this comment.
This probably should be the cache crate's responsibility. If it is abstracted more, then this crate will never need to touch the filesystem. That way we could eventually compile it to Wasm and run it in the browser.
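To make the suggestion concrete, here is a minimal sketch of what such an abstraction could look like. The `CacheStore` trait, `MemoryStore`, and `get_or_fetch` are hypothetical names for illustration and are not part of this PR; a filesystem-backed or browser-backed store would implement the same trait.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical storage abstraction (not in the PR): the snapshot source
/// would only talk to this trait and never to `std::fs` directly.
pub trait CacheStore {
    fn read(&self, key: &str) -> io::Result<Option<Vec<u8>>>;
    fn write(&mut self, key: &str, bytes: &[u8]) -> io::Result<()>;
}

/// In-memory implementation, standing in for a store that needs no filesystem.
#[derive(Default)]
pub struct MemoryStore {
    entries: HashMap<String, Vec<u8>>,
}

impl CacheStore for MemoryStore {
    fn read(&self, key: &str) -> io::Result<Option<Vec<u8>>> {
        Ok(self.entries.get(key).cloned())
    }

    fn write(&mut self, key: &str, bytes: &[u8]) -> io::Result<()> {
        self.entries.insert(key.to_string(), bytes.to_vec());
        Ok(())
    }
}

/// Return the cached bytes for `key`, or run `fetch` once and store its result.
pub fn get_or_fetch<S, F>(store: &mut S, key: &str, fetch: F) -> io::Result<Vec<u8>>
where
    S: CacheStore,
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    if let Some(hit) = store.read(key)? {
        return Ok(hit);
    }
    let bytes = fetch()?;
    store.write(key, &bytes)?;
    Ok(bytes)
}
```

With this shape, directory creation becomes a detail of whichever store is backed by the filesystem, and the caller only supplies a key and a fetch closure.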
| fn fetch(&self, key: &LedgerKey) -> Option<LedgerEntry> { | ||
| // Compute cache file path: <cache_path>/<ledger>/<tx_hash_or_none>/<hash_of_key>.json | ||
| let key_xdr = key.to_xdr(Limits::none()).expect("failed to encode key"); | ||
| let key_hash = Sha256::digest(&key_xdr); | ||
| let ledger_cache_dir = self.cache_path.join( | ||
| self.tx_hash | ||
| .map(|h| { | ||
| let tx_hash_str: String = h.iter().map(|b| format!("{b:02x}")).collect(); | ||
| format!("{}-{}-before", self.fetcher.ledger(), tx_hash_str) | ||
| }) | ||
| .unwrap_or_else(|| format!("{}-after", self.fetcher.ledger())), | ||
| ); | ||
|
|
||
| // Ensure cache directory exists | ||
| std::fs::create_dir_all(&ledger_cache_dir).expect("failed to create cache directory"); | ||
|
|
||
| // Use cache function to handle reading/writing cache file | ||
| let fetch_read = cache( | ||
| ledger_cache_dir.join(format!("{:x}.json", key_hash)), | ||
| |write| -> Result<(), Box<dyn std::error::Error>> { | ||
| // Fetch the data from the underlying fetcher | ||
| let result = self.fetcher.fetch(key)?; | ||
|
|
||
| // Serialize to JSON | ||
| serde_json::to_writer_pretty(write, &result)?; | ||
|
|
||
| Ok(()) | ||
| }, | ||
| ) | ||
| .expect("failed to cache entry"); | ||
|
|
||
| // Parse the cached result | ||
| serde_json::from_reader(fetch_read).expect("failed to parse cached entry") |
There was a problem hiding this comment.
SnapshotSource::get swallows fetcher errors with .expect().
Every IO/serde/network error in the workspace-level cache layer panics: failed to create cache directory, failed to cache entry, failed to parse cached entry. The underlying LedgerEntryFetcher::fetch returns Result<_, Error>, and SnapshotSource::get itself returns Result<_, HostError> — so errors can be propagated. Panicking from inside a host call is harder to diagnose than a returned error, especially when the test runner only shows the panic backtrace and not the underlying network/cache failure.
Recommend converting these into HostErrors (or a typed wrapper), so callers can at least see the original cause in the test failure.
Tied to @willemneal's comment about abstracting fs access — the same restructuring would let the cache surface typed errors instead of forcing the caller into .expect().
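A rough sketch of the direction, assuming a hypothetical `CacheError` type and `read_cached` helper (neither exists in the PR). It only shows the `.expect()` calls replaced by `?` with the failing path attached; the conversion into `HostError` is left out because it depends on the host API.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical typed error for the cache layer; it keeps the path and the
/// original cause so a test failure can show both.
#[derive(Debug)]
pub enum CacheError {
    CreateDir { path: PathBuf, source: io::Error },
    Read { path: PathBuf, source: io::Error },
}

impl fmt::Display for CacheError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CacheError::CreateDir { path, source } => {
                write!(f, "failed to create cache directory {}: {}", path.display(), source)
            }
            CacheError::Read { path, source } => {
                write!(f, "failed to read cache file {}: {}", path.display(), source)
            }
        }
    }
}

impl std::error::Error for CacheError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            CacheError::CreateDir { source, .. } | CacheError::Read { source, .. } => Some(source),
        }
    }
}

/// Ensure the directory exists and read one cached file, returning errors
/// to the caller instead of panicking.
pub fn read_cached(dir: &Path, file_name: &str) -> Result<Vec<u8>, CacheError> {
    fs::create_dir_all(dir).map_err(|source| CacheError::CreateDir {
        path: dir.to_path_buf(),
        source,
    })?;
    let path = dir.join(file_name);
    fs::read(&path).map_err(|source| CacheError::Read { path, source })
}
```

A caller that must return `HostError` could map `CacheError` at the boundary, which keeps the original cause visible up to that point.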
| let fetch_read = cache( | ||
| ledger_cache_dir.join(format!("{:x}.json", key_hash)), | ||
| |write| { | ||
| // Fetch the data | ||
| let result = self.fetch_with_dl_cache(key, &cache_path)?; | ||
|
|
||
| // Serialize to JSON | ||
| serde_json::to_writer_pretty(write, &result)?; | ||
|
|
||
| Ok(()) | ||
| }, | ||
| )?; |
There was a problem hiding this comment.
No schema versioning on the committed cache.
fetch_with_entry_cache serializes Option<LedgerEntry> to JSON via stellar-xdr's serde representation, and the files live under tests-snapshot-source/ which is checked into the repo. If stellar-xdr's JSON representation of LedgerEntry changes (rename, field add/remove, enum variant shift), every cached file silently becomes incompatible with the new code — the deserialization will succeed-but-with-wrong-data in some cases, or panic in others, and there's no fingerprint that says "this cache was produced by xdr version X".
Two low-effort mitigations:
- Embed the stellar-xdr version and/or a schema-hash in the JSON wrapper (e.g. `{"xdr_version": "26.0.1", "entry": ...}`) and reject mismatches.
- Or pin the XDR version as `=` in Cargo.toml (you have `=26.0.1` already at the workspace level; make sure that's load-bearing here and called out).
Worth deciding before users start committing snapshot fixtures into their repos that they then have to regenerate.
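For the first mitigation, a minimal sketch of the idea using only the standard library. `wrap_entry` and `unwrap_entry` are hypothetical helpers, and the check is a fixed-format prefix match that only recognizes files written by `wrap_entry` itself; a real implementation would more likely use a serde wrapper struct around `Option<LedgerEntry>`.

```rust
/// Wrap an already-serialized entry (JSON text) in a version-stamped envelope.
/// Hypothetical helper; the envelope shape follows the example above.
pub fn wrap_entry(xdr_version: &str, entry_json: &str) -> String {
    format!("{{\"xdr_version\":\"{xdr_version}\",\"entry\":{entry_json}}}")
}

/// Return the inner entry JSON only if the file was stamped with `expected`.
/// Anything else, including a different version, is rejected rather than parsed.
pub fn unwrap_entry<'a>(file: &'a str, expected: &str) -> Result<&'a str, String> {
    let prefix = format!("{{\"xdr_version\":\"{expected}\",\"entry\":");
    file.strip_prefix(prefix.as_str())
        .and_then(|rest| rest.strip_suffix('}'))
        .ok_or_else(|| format!("cache file was not written for stellar-xdr {expected}"))
}
```

The point is only that a mismatch becomes an explicit error at read time, so a stale fixture prompts regeneration instead of deserializing into wrong data.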
| let bucket_0 = &bucket[0..=1]; | ||
| let bucket_1 = &bucket[2..=3]; | ||
| let bucket_2 = &bucket[4..=5]; | ||
| let bucket_url = | ||
| format!("{archive_url}/bucket/{bucket_0}/{bucket_1}/{bucket_2}/bucket-{bucket}.xdr.gz"); |
There was a problem hiding this comment.
Slice indexing will panic on a malformed bucket hash.
&bucket[0..=1], [2..=3], [4..=5] panic if bucket.len() < 6 or if a slice index does not fall on a UTF-8 character boundary (any non-ASCII input could trigger this, since these are byte-index slices). In practice the bucket strings come from History::current_buckets, which the archive server returns, so a malformed value would only arise from a compromised or corrupted archive response. Still, defensive validation here (if bucket.len() < 6 { return Err(...) }) is cheap and would turn a confusing panic into a typed error.
Same pattern in get_history on lines 46–48 is safe because format!("{:08x}") guarantees an 8-char ASCII string.
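Something along these lines would cover both the length and the non-ASCII case. The `bucket_url` function name and the `String` error type are placeholders for illustration; the real code would return the crate's own error type.

```rust
/// Build the bucket download URL, rejecting hashes that are too short or not
/// ASCII hex before slicing. Hypothetical helper, not the PR's function.
pub fn bucket_url(archive_url: &str, bucket: &str) -> Result<String, String> {
    // An all-ASCII string has one byte per character, so the byte-index
    // slices below cannot land inside a multi-byte character.
    if bucket.len() < 6 || !bucket.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("malformed bucket hash: {bucket:?}"));
    }
    let (b0, b1, b2) = (&bucket[0..2], &bucket[2..4], &bucket[4..6]);
    Ok(format!(
        "{archive_url}/bucket/{b0}/{b1}/{b2}/bucket-{bucket}.xdr.gz"
    ))
}
```

Checking for hex digits is stricter than the length check suggested above, but it costs nothing extra and also rules out the character-boundary panic.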
| [dependencies] | ||
| soroban-ledger-snapshot-source-tx = "23" | ||
| ``` |
There was a problem hiding this comment.
Version mismatch: the README's example dependency uses "23" but Cargo.toml has version.workspace = true → 26.0.1. The PR body also says "Requires soroban-sdk v23.4.0 or later", which is fine, but the example here should match the crate's actual published major ("26").
| [dependencies] | ||
| soroban-sdk = { workspace = true, features = ["testutils"] } | ||
| stellar-xdr = { workspace = true, features = ["curr", "std", "base64", "serde"] } | ||
| cargo_metadata = "0.19" | ||
| directories = "6.0" | ||
| sha2 = "0.10" | ||
| hex = "0.4" | ||
| serde = { version = "1", features = ["derive"] } | ||
| serde_json = "1" | ||
| thiserror = "2.0" | ||
| tracing = "0.1" | ||
| tracing-subscriber = { version = "0.3", features = ["env-filter"] } | ||
| reqwest = { version = "0.12", default-features = false, features = ["json", "blocking", "rustls-tls"] } | ||
| flate2 = { version = "1.0", features = ["rust_backend"] } | ||
| zstd = "0.13" | ||
| fs2 = "0.4" |
There was a problem hiding this comment.
Dependency hygiene concerns flagged by Socket Security.
A few items worth resolving before merge:
- `fs2 = "0.4"`: fs2 is unmaintained. Successors are fs4 (drop-in API, actively maintained) or file-lock. Given the cache module relies entirely on FileExt::lock_exclusive, this is a low-risk swap.
- `reqwest = { default-features = false, features = [..., "rustls-tls"] }`: pulls in rustls-webpki@0.101.7 (transitively via stellar-rpc-client) which has GHSA-82j2-j2ch-gfr8 (DoS via malformed CRL, High). Bumping stellar-rpc-client (if you depend on it) or pinning a newer rustls-webpki via a workspace dep would clear this.
- License-policy alerts: option-ext (MPL-2.0, via directories) and zstd-sys (GPL-2.0+, via zstd) both triggered Socket warnings. Since the deny.toml in this PR explicitly allows MPL-2.0, the option-ext one is intentional. The zstd-sys GPL note may need an explicit allow or a clarify entry in deny.toml. Currently I don't see GPL-2.0+ in the allow list, so this may be a build break under `cargo deny check licenses`. Worth verifying CI on the bans/licenses/sources job.
- Loose version specifiers: `cargo_metadata = "0.19"`, `directories = "6.0"`, `sha2 = "0.10"`, etc. are minimum-version specs. The rest of the workspace pins XDR/env crates with `=`. For a test utility this is fine, but consider whether reproducible fixture generation argues for tighter pins here too (see my comment on fetch_with_entry_cache about schema versioning).
- parse_ledger now takes the ledger sequence so an empty-batch LedgerNotFound error reports the real sequence instead of 0.
- Add publish = false to the crate (matches its exclusion from the semver-checks and publish-dry-run CI jobs).
- Tighten the unused convenience wrappers (from_meta_storage::ledger, from_history_archive::history) from pub to pub(crate).
- Name the detached meta-prefetch thread and log spawn failures so a panic inside it is attributable.
- Clarify that LedgerCloseMetaV1.tx_processing is V0-shaped, so mapping it to the V0 variant is by type and not a lossy downcast.
|
- Network::testnet now takes the SEP-54 meta_url as an argument instead of hardcoding a date-pinned testnet epoch (testnet is periodically reset and the AWS dataset partitions meta by epoch, so there is no stable default). Mirrors how mainnet() takes rpc_url.
- The RPC usability check now also requires the node to have observed the target ledger (latestLedger >= ledger), so a lagging node can no longer have a potentially-stale entry persisted to the cache. The getLedgerEntries response's latestLedger is parsed (defaulting to 0 when absent, which is conservatively treated as unusable).
|
| // modified at or after `ledger` once the node catches up, so the | ||
| // current response could be stale; fall back to meta/archive rather | ||
| // than persisting a potentially wrong answer. | ||
| let usable = entry.last_modified_ledger_seq < ledger && latest_ledger >= ledger; |
There was a problem hiding this comment.
New finding from this commit: the RPC response is cached even when it's not usable.
The latest_ledger >= ledger check correctly addresses the lagging-node correctness issue I raised previously, but the new check happens after cache() has persisted the response body to rpc-{ledger}-{key_hash}.json. The cached file contains the latestLedger value reported at the moment the cache was written.
Consequence: if the first lookup happens against a lagging node, the cached file permanently reports the same stale latestLedger. Every subsequent run reads the same cached body, gets the same latest_ledger, and the usability gate keeps rejecting it — even though the node has long since caught up and a fresh request would now succeed. The user has no signal that the RPC fast path is permanently disabled for this (ledger, key) pair; they'd have to manually delete files under ~/.cache/<bundle-id>/snapshot-source-tx/.
Two possible fixes:
- Don't commit non-usable responses to the cache. Move the usability check inside the collector closure passed to `cache()`, and return an error from the collector when the node is lagging. The atomic-write logic in `cache.rs` already deletes the temp `.dl` file on collector error and does not rename to the final path. The downside is you lose the ability to surface a `tracing::debug!` for the not-usable case from the cached path on subsequent runs, but the trade-off seems worth it.
- Delete the cache file post-hoc. After parsing, if `!usable`, drop the file. Simpler to bolt on but races with concurrent readers.
The first approach is cleaner and works with the lock semantics already in place. A small follow-up: once the file is in the cache and the node was non-lagging at the time, subsequent runs do still need to revalidate latest_ledger >= self.ledger against the (frozen) cached value — that comparison happens to be correct because latest_ledger only ever grows on the node, but it's worth a comment so the next reader doesn't worry about it.
Test coverage gap: the existing missing_latest_ledger_defaults_to_zero test confirms parsing, but there's no end-to-end test that a non-usable cached response is either (a) not persisted or (b) revalidated on next call. Worth adding once the fix is in.
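A sketch of the first fix and of the kind of test that would cover it. `cache_with` below is a simplified stand-in for the PR's `cache()` (no file locking, assumed temp-file naming), and `cache_rpc_response` with its explicit `last_modified` and `latest_ledger` parameters is hypothetical; in the real code those values would come from the parsed RPC response.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// The usability gate from the diff: the entry predates the target ledger and
/// the node has already observed the target ledger.
pub fn is_usable(last_modified: u32, latest_ledger: u32, ledger: u32) -> bool {
    last_modified < ledger && latest_ledger >= ledger
}

/// Simplified stand-in for `cache()`: write via a temp `.dl` file and rename
/// only when the collector succeeds. Locking is omitted in this sketch.
pub fn cache_with<F>(path: &Path, collect: F) -> io::Result<Vec<u8>>
where
    F: FnOnce(&mut dyn Write) -> io::Result<()>,
{
    if path.exists() {
        return fs::read(path);
    }
    let tmp = path.with_extension("dl");
    // The file handle is dropped when the closure returns, before any rename.
    let result = fs::File::create(&tmp).and_then(|mut file| collect(&mut file));
    match result {
        Ok(()) => {
            fs::rename(&tmp, path)?;
            fs::read(path)
        }
        Err(err) => {
            // Collector failed: discard the temp file, leave no final file.
            let _ = fs::remove_file(&tmp);
            Err(err)
        }
    }
}

/// Gate inside the collector, so a lagging node's response is never persisted.
pub fn cache_rpc_response(
    path: &Path,
    ledger: u32,
    last_modified: u32,
    latest_ledger: u32,
    body: &[u8],
) -> io::Result<Vec<u8>> {
    cache_with(path, |writer| {
        if !is_usable(last_modified, latest_ledger, ledger) {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "rpc response not usable for this ledger",
            ));
        }
        writer.write_all(body)
    })
}
```

An end-to-end test can then call `cache_rpc_response` once with a lagging `latest_ledger`, assert that the final path does not exist, and call it again with a caught-up value to confirm the fast path recovers.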
Revert the publish = false addition; the crate is intended to be published.
|
Network::testnet now takes testnet_start_date (YYYY-MM-DD) and builds the AWS testnet meta partition URL internally, rather than requiring the caller to assemble the whole meta_url.
|
Resolve the security advisories reported by cargo deny check advisories
with the smallest possible dependency changes:
- Bump stellar-rpc-client 25.1.0 -> 26.0.0 (dev-dep of test_fork). This
is the only available newer release; 25.x has no patched variant. It
moves jsonrpsee 0.20 -> 0.26, dropping the vulnerable rustls 0.21 /
rustls-webpki 0.101.7 stack in favor of rustls 0.23 /
rustls-webpki 0.103.13. Clears:
- RUSTSEC-2026-0104 (CRL parsing panic)
- RUSTSEC-2026-0098 (URI name constraints)
- RUSTSEC-2026-0099 (wildcard name constraints)
- RUSTSEC-2025-0134 (rustls-pemfile unmaintained; removed)
serde_json was bumped to the minimum required by jsonrpsee-core 0.26
(1.0.128 -> 1.0.142); no duplicate introduced.
- Bump rand 0.8.5 -> 0.8.6 (in-place patch) to clear RUSTSEC-2026-0097
(ThreadRng unsoundness).
- Remove the stale RUSTSEC-2024-0388 (derivative) ignore from deny.toml;
cargo-deny reports advisory-not-detected and the crate is no longer in
the lockfile. RUSTSEC-2024-0436 (paste) is still matched and retained.
cargo deny check advisories is now ok; bans/licenses/sources still ok.
Lockfile churn is confined to the rpc/jsonrpsee/rustls/webpki subtree
(plus the rand patch); no soroban/serde/quote/syn or test-wasm changes.
https://claude.ai/code/session_015kBhoioxYh6bD9o6eb9Ksh
|
This reverts commit 65ec7ed. The bump of stellar-rpc-client to 26.0.0 (to clear the rustls-webpki advisory) requires rustc 1.93.0, which exceeds this workspace's MSRV of 1.91.0. That broke the gating expand-test-wasms / MSRV build jobs (cargo expand of the test_fork crate failed under 1.91). The rustls-webpki advisory only surfaces in the non-gating cargo-deny (advisories) check, so restoring MSRV-compatible dependencies is the correct tradeoff. https://claude.ai/code/session_015kBhoioxYh6bD9o6eb9Ksh
|
The rustls-webpki advisory (RUSTSEC-2026-0104 and related) can only be cleared by stellar-rpc-client 26.0.0, which requires rustc 1.93.0. Raise the workspace MSRV 1.91.0 -> 1.93.0 to allow it and re-apply the bumps that were reverted for MSRV-compatibility:
- stellar-rpc-client 25 -> 26 (dev-dependency of test_fork), pulling the patched rustls-webpki 0.103.13 / rustls 0.23 / jsonrpsee 0.26 stack
- rand 0.8.5 -> 0.8.6 (RUSTSEC-2026-0097)
- remove the now-stale RUSTSEC-2024-0388 ignore from deny.toml
Regenerate tests-expanded/ and the wasm-embedding test_snapshots under the 1.93.0 toolchain, since cargo-expand output and contract codegen shift with the compiler version. Snapshots from ignored/network tests (fork) and feature-gated tests are toolchain-independent and unchanged.
cargo deny check advisories -> advisories ok
cargo deny check bans licenses sources -> bans ok, licenses ok, sources ok
https://claude.ai/code/session_015kBhoioxYh6bD9o6eb9Ksh
|
The test-with-soroban-examples and test-with-openzeppelin-stellar-contracts workflows installed Rust with a bare `rustup update` and no toolchain override, so a rust-toolchain.toml pin (in this repo or the external repo) could take effect. With the MSRV now at 1.93, an older pinned toolchain makes cargo's MSRV-aware resolver pick an older soroban-sdk for transitive deps (e.g. soroban-poseidon) than the path-patched local 26.0.1, compiling two copies of soroban-sdk and failing with duplicate lang item (panic_impl) errors. Force RUSTUP_TOOLCHAIN=stable so these builds always use the latest stable Rust regardless of any pin. https://claude.ai/code/session_015kBhoioxYh6bD9o6eb9Ksh
|
The MSRV 1.91 -> 1.93 bump (to clear the rustls-webpki advisory via stellar-rpc-client 26) broke ~70 downstream test-crate jobs in the soroban-examples and openzeppelin-stellar-contracts workflows: with the local soroban-sdk at rust-version 1.93, cargo's MSRV-aware resolver no longer unifies the path-patched local soroban-sdk 26.0.1 with the copy that transitive crates.io deps (e.g. soroban-poseidon) pull in, so two soroban-sdk crates compile and the wasm build fails with duplicate lang item (panic_impl) errors. Forcing RUSTUP_TOOLCHAIN=stable did not help, confirming the cause is the MSRV declaration itself, not the build toolchain version. Revert to MSRV 1.91, restoring the all-green dependency state. The rustls-webpki advisory remains covered by the existing ignore in the non-gating cargo-deny advisories check. This reverts commits 53afdac and 3f183c8. https://claude.ai/code/session_015kBhoioxYh6bD9o6eb9Ksh
|
Resolve Cargo.lock conflict by regenerating the lockfile, retaining bytes-lit 0.0.5 (tests) and 0.0.6 (soroban-sdk) as required after main's bytes-literals work. Regenerate tests-expanded/ with the MSRV toolchain (cargo-expand 1.0.116, cargo-hack 0.5.28) to stay consistent with the expand-test-wasms CI check.
|
Base the lockfile on origin/main's resolution and let cargo add only the snapshot-source-tx feature dependencies on top, instead of regenerating from scratch. The full regeneration had floated transitive deps (toml 0.9.5->0.9.12 -> indexmap ->2.14.0 -> hashbrown 0.17.1, thiserror 1.0.63->1.0.69, trybuild 1.0.115->1.0.116) past the versions deny.toml's multiple-versions skip list pins, tripping 'cargo deny check bans' on a duplicate hashbrown. Restores hashbrown 0.14.5 / thiserror 1.0.63 / trybuild 1.0.115 while keeping bytes-lit 0.0.6 for soroban-sdk and 0.0.5 for the tests.
|
The embedded WASM in test_spec_shaking_v2_tests.rs was generated against the pre-fix lockfile; regenerate it with the minimized lockfile so the expand-test-wasms CI check (git diff HEAD --exit-code) passes.
|
What
Add new crates for fetching ledger entries from multiple data sources to enable more seamless and built-in fork testing in the sdk. Add `soroban-ledger-snapshot-source-tx` for transaction-level snapshot sources.

```mermaid
sequenceDiagram
    autonumber
    actor App
    participant SelTxMeta as Query<br/>Ledger<br/>Tx Meta
    participant LedgerMeta as Query<br/>Ledger – 1..N<br/>Tx Meta
    participant Archive as History<br/>Archive
    participant RPC as RPC
    App->>SelTxMeta: Lookup with key
    Note over App,SelTxMeta: If found → use it.<br/>If not → continue.
    App->>SelTxMeta: Look in prior txs in same ledger
    Note over App,SelTxMeta: If found → use it.<br/>If not → continue.
    opt Optionally use RPC
        App->>RPC: getLedgerEntries([key])
        Note over App,RPC: Use if (lastModified < queryLedger)<br/>AND (rpcLatestSeen >= queryLedger).<br/>Otherwise continue.
    end
    App->>LedgerMeta: Look in prior ledger meta
    Note over App,LedgerMeta: If found → use it.<br/>If checkpoint ledger not reached → continue to next ledger meta<br/>If checkpoint ledger is reached → continue to archive.
    App->>Archive: Download checkpoint from history archive and search
    Note over App,Archive: If found → use it.<br/>If not → does not exist.
```

Why
The current fork testing experience using the stellar-cli is coarse-grained, exposing state only at ledger boundaries, and it requires downloading full history archives and manually identifying footprints ahead of time, which is difficult to do well and makes for a poor developer experience. This change enables the SDK to lazily fetch ledger entries on demand from the most efficient source available, caching results locally for subsequent runs. Developers will be able to fork test against any ledger and transaction without pre-identifying the footprint.
The change uses a Ledger Meta Storage (SEP-54), an RPC, and a History Archive to collect ledger entries.
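As a rough illustration of the lookup order, the sketch below tries each source in turn and returns the first hit. `Source` and `find_entry` are hypothetical names and the entries are plain strings; the actual crate also applies the RPC usability condition and walks ledger meta back to a checkpoint, which this sketch leaves out.

```rust
/// A named data source with a lookup function. Hypothetical type for
/// illustration; entries are plain strings instead of `LedgerEntry`.
pub struct Source<'a> {
    pub name: &'a str,
    lookup: Box<dyn Fn(&str) -> Option<String> + 'a>,
}

impl<'a> Source<'a> {
    pub fn new(name: &'a str, lookup: impl Fn(&str) -> Option<String> + 'a) -> Self {
        Source {
            name,
            lookup: Box::new(lookup),
        }
    }
}

/// Query the sources in order and return the first entry found, together with
/// the name of the source that produced it. `None` means no source had it.
pub fn find_entry<'a>(key: &str, sources: &[Source<'a>]) -> Option<(&'a str, String)> {
    sources
        .iter()
        .find_map(|source| (source.lookup)(key).map(|entry| (source.name, entry)))
}
```

Ordering the slice as tx meta, RPC, prior ledger meta, then archive mirrors the intent of trying cheaper sources before falling back to a checkpoint download.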
The change caches results in three layers. All raw files downloaded are cached in the system cache directory and reused across tests, across workspaces. All ledger entries found are cached in the system cache directory. All ledger entries found for the current workspace are cached in the `tests-snapshot-source` directory intended to be committed so that CI runs reproducibly without needing to collect entries. Note that the format of that cache is not a ledger snapshot json file because for many tests running concurrently one file per ledger entry is easier to manage.
Close #1448
Try it out
Add the following dependencies to your `Cargo.toml`:
Note: Requires `soroban-sdk` v23.4.0 or later.
Example
Observing State Changes
The `TxSnapshotSource` looks up state with transaction-level granularity when a transaction hash is provided, allowing developers to debug a transaction by starting at the point just before the transaction. For example:
Before a specific tx: Pass the tx hash to get state just before that tx executed:
End of ledger (after all txs): Use `None` for the tx_hash to get state at the end of the ledger:
Different ledgers/txs: Update the test to investigate the state of the balance at each of the following ledgers and transactions to see how the transactions affected the balance. Check out the links to see the operations the transactions performed and how they align with the changes in balances observed.
Debugging with RUST_LOG
Enable logging to see which data sources are being queried and what entries are found:
Example output:
TODO
Thanks
Thanks @orbitlens for sharing the idea of using transaction meta as a way to collect recent state. Thanks to all the people who provided feedback to me about how they use the existing `stellar snapshot create` functionality.