add support for incremental snapshots #103

Open
Crypt-iQ wants to merge 6 commits into oss-garage:master from Crypt-iQ:01082026/tmp_snapshot_copy

@Crypt-iQ Crypt-iQ commented Jan 9, 2026

Contributor

Posting mainly to get high-level feedback on the design before I continue any further.

Basically, this adds an IncrementalSnapshotStage that wraps the mutation stage and inserts an Operation::IncrementalSnapshot into the program (the inserted operation is not persisted to the corpus). The mutators know not to mutate anything before the snapshot point. The stage then runs for 50 iterations with the incremental snapshot.

It is pretty slow, which is something I want to look into. I can clean things up a lot (e.g. not making it default to incremental snapshots). I think some more intelligent snapshot placement could be used as well (not placing the snapshot operation so early in the program, only placing it after certain messages / operations, etc.). Another wonky thing is that the input sometimes gets evicted to disk since we're using a CachedOnDiskCorpus, which means we occasionally need to reload the input.
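To make the frozen-prefix idea above concrete, here is a minimal, std-only sketch. It is not the code in this PR: the `Operation` enum below is a simplified stand-in with a made-up `Send` variant, and `frozen_prefix_len` and `mutate_suffix` are hypothetical helper names. The only point is that a mutator picks its target index from the part of the program after the snapshot marker.

```rust
/// Simplified, hypothetical stand-in for a program operation.
/// Only the `IncrementalSnapshot` name comes from the PR description.
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    Send(u8),
    IncrementalSnapshot,
}

/// Number of operations frozen by the snapshot: everything up to and
/// including the snapshot marker, or 0 if the program has no marker.
fn frozen_prefix_len(program: &[Operation]) -> usize {
    program
        .iter()
        .position(|op| *op == Operation::IncrementalSnapshot)
        .map_or(0, |idx| idx + 1)
}

/// Replaces one operation in the mutable suffix with `new_op`.
/// `offset` is wrapped into the suffix, so the frozen prefix is never touched.
/// Returns `false` when there is nothing after the snapshot to mutate.
fn mutate_suffix(program: &mut [Operation], offset: usize, new_op: Operation) -> bool {
    let frozen = frozen_prefix_len(program);
    let suffix_len = program.len() - frozen;
    if suffix_len == 0 {
        return false;
    }
    program[frozen + offset % suffix_len] = new_op;
    true
}
```

In this sketch a program without a marker has a frozen prefix of length 0, so the same mutator also works when no incremental snapshot is in use.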

@Crypt-iQ Crypt-iQ marked this pull request as draft January 9, 2026 23:16
@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch from a5227b4 to 73d4e3d Compare January 9, 2026 23:18
Comment thread fuzzamoto-libafl/src/stages/incremental_snapshot_stage.rs Outdated
@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch from 73d4e3d to 17fb0e0 Compare January 14, 2026 20:54
@tokatoka

Contributor

This is more for a discussion.

If I understand correctly, what we do now is like this:
After we take a snapshot, we hold IncrementalSnapshotMetadata and there we manage

  • the corpus ID relevant to the snapshot
  • the prefix that is frozen in the snapshot
  • the number of times we used this snapshot

And the next time we enter the IncrementalSnapshotStage, we restore the corpus ID and return to fuzzing that corpus entry using the snapshot that we saved earlier in the IncrementalSnapshotMetadata.

I wonder why we don't do the following in the IncrementalSnapshotStage instead:

  1. Always take a tmp snapshot at the start of the stage.
  2. Then have the inner_stage use it.
  3. At the end of the stage, discard this snapshot.

This way it would be simpler to manage.

Is it because we don't want to discard the tmp snapshot too early?

@Crypt-iQ

Crypt-iQ commented Jan 16, 2026

Contributor Author

> This is more for a discussion.
>
> If I understand correctly, what we do now is like this: After we take a snapshot, we hold IncrementalSnapshotMetadata and there we manage
>
> * the corpus ID relevant to the snapshot
> * the prefix that is frozen in the snapshot
> * the number of times we used this snapshot
>
> And the next time we enter the IncrementalSnapshotStage, we restore the corpus ID and return to fuzzing that corpus entry using the snapshot that we saved earlier in the `IncrementalSnapshotMetadata`.

Yup, that's right.

> I wonder why we don't do the following in the IncrementalSnapshotStage instead:
>
> 1. Always take a tmp snapshot at the start of the stage.
> 2. Then have the inner_stage use it.
> 3. At the end of the stage, discard this snapshot.
>
> This way it would be simpler to manage.
>
> Is it because we don't want to discard the tmp snapshot too early?

I think this works if IncrementalSnapshotStage calls inner_stage.perform N times instead of once like it does now. I think this would also allow using the probe metadata instead of ignoring it when a tmp snapshot exists, and would let me get rid of the logic that overrides whatever the scheduler chose.

I also think IncrementalSnapshotMetadata should be kept so that the mutators in inner_stage know frozen_prefix_len, though it's possible to pass a variable instead of using metadata.
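For illustration, the take / run N times / discard shape discussed above could look roughly like the std-only sketch below. Everything in it is an assumption made for the example: `SnapshotBackend`, `run_with_tmp_snapshot`, and `RecordingBackend` are hypothetical names, and the inner stage is reduced to a closure. It is not the Nyx, LibAFL, or fuzzamoto API.

```rust
/// Hypothetical snapshot interface, invented for this sketch.
trait SnapshotBackend {
    fn take_incremental(&mut self);
    fn discard_incremental(&mut self);
}

/// Takes a temporary snapshot, runs `inner` up to `iterations` times
/// (stopping at the first error), then discards the snapshot whether or
/// not an iteration failed.
fn run_with_tmp_snapshot<B, F>(
    backend: &mut B,
    iterations: usize,
    mut inner: F,
) -> Result<(), String>
where
    B: SnapshotBackend,
    F: FnMut(usize) -> Result<(), String>,
{
    backend.take_incremental();
    let result = (0..iterations).try_for_each(|i| inner(i));
    backend.discard_incremental();
    result
}

/// Test double that records the order of snapshot calls.
#[derive(Default)]
struct RecordingBackend {
    events: Vec<&'static str>,
}

impl SnapshotBackend for RecordingBackend {
    fn take_incremental(&mut self) {
        self.events.push("take");
    }

    fn discard_incremental(&mut self) {
        self.events.push("discard");
    }
}
```

Because the snapshot lives only inside the one call, nothing about it has to be remembered between stage invocations in this sketch. Whether the frozen prefix length is then passed as a variable or through metadata is a separate choice.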

@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch 2 times, most recently from 2f3862a to 0444d65 Compare January 29, 2026 15:55
@Crypt-iQ Crypt-iQ changed the title from "add support for tmp snapshots" to "add support for incremental snapshots" Jan 29, 2026
@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch 2 times, most recently from a29f111 to 2548f1c Compare January 29, 2026 16:36
@Crypt-iQ Crypt-iQ marked this pull request as ready for review January 29, 2026 17:07
@Crypt-iQ

Crypt-iQ commented Jan 29, 2026

Contributor Author

Implemented @tokatoka's suggestion so that IncrementalSnapshotStage runs the snapshot iterations inside of it and then discards the snapshot at the end. This also lets us get rid of IncrementalSnapshotMetadata, which is nice. Something to note is that the probing and stability stages will work on the initial input used in IncrementalSnapshotStage and not on any of the mutated versions of this input. I wasn't really sure how to address that.

There is a memory leak which I'm investigating. I'm also going to come up with a benchmarking plan because in theory this should give better and quicker coverage than running without incremental snapshots. One TODO is to put incremental snapshots behind a flag so that they're not always enabled. Lastly, there is a commit, Crypt-iQ@1567d36, that can be used to increase the instruction limit when testing.

@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch from 2548f1c to 2ecef2d Compare February 11, 2026 13:55
…d_next

Also change the scenarios to accept runners
@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch from 2ecef2d to 10cad9c Compare February 12, 2026 21:59
@Crypt-iQ

Contributor Author

The results from benchmarking incremental snapshots against master were underwhelming. I ran the two branches on a clean machine, each with 6 cores, for 2 hours with the instruction count increased 10x to 40960. I then parsed the stdout of each to create the graphs below. I also compared the coverage with a script similar to deterministic-fuzz-coverage to see exactly where the branches differed in coverage.

[Figure: fuzzamoto-comparison-r8]

The results of the run that produced the graph were consistent with several other benchmarks in which I varied:

  • where the snapshot was placed
  • the run time (this run was the longest at 2 hours, most were ~1 hour)
  • whether the snapshot was placed at all (depending on how many instructions were present)
  • the number of iterations the snapshot is used for

In the vast majority of them, incremental snapshots had less coverage (typically in txgraph.cpp or in script/interpreter.cpp) and a smaller corpus size; they were always faster. The stability measurement varied across the benchmarks, so I can't draw any meaningful conclusion about it. Also, I think something happened to the stability of the master branch in this particular run, so I've decided to ignore it, since most of the time the master branch had ~the same stability.

Going into this, I expected incremental snapshots to have less coverage overall, but "deeper" coverage in some areas. There are some "quirky" edges that are hit (e.g. being unable to fetch a value from the index in index/blockfilterindex.cpp, having a too-large v3 child in policy/truc_policy.cpp), but I can't really say that it's hitting "deeper" branches. I think running the benchmarks for much longer (~1 week) would give me a better picture, but I sometimes use the machine for other things, which I found ruins the benchmark pretty quickly. The other thing that I think would really help is intelligent snapshot placement, because right now the placement doesn't bias towards anything: it just takes the snapshot wherever (regardless of whether the input is even interesting!). If it instead biased towards taking the snapshot after long sequences of blocks, after long sequences of txns, or in response to custom feedback (such as assertions, or maybe even reacting to the bitcoind logs), I think that could make this branch more effective.

…incremental snapshots

The IncrementalSnapshotStage is configured to run with an incremental
snapshot for a configurable number of iterations. The mutators have
awareness of the snapshot position.
This should prevent some cache thrashing when inputs are evicted to disk.
@Crypt-iQ Crypt-iQ force-pushed the 01082026/tmp_snapshot_copy branch from 10cad9c to 29be6f0 Compare April 22, 2026 15:02
@Crypt-iQ

Crypt-iQ commented Apr 22, 2026

Contributor Author

Build failures are unrelated (bitcoin core broke fuzzamoto and linter warnings).

I went over the LibAFL code and realized I was implementing the Restartable trait incorrectly. In the old code, the IncrementalSnapshotStage would run the inner TuneableMutationalStage for 128 iterations and after that return early in a loop until max_reuse_count iterations were done. After fixing this, I noticed a speed increase. Incremental snapshots are almost twice as fast and more stable, but have less coverage (my guess is because it's focusing on individual inputs for more time). Below is a visual comparison to master over 12 hours with 31 cores:
[Figure: fuzzamoto-31c-12h-comparison]

@tokatoka

tokatoka commented Apr 22, 2026

Contributor

Can you tell me what you were doing inside the Restartable trait, i.e. the before and after of your impl?

The Restartable trait is relevant if the fuzzer exits and restarts after crashes.
(Imagine those harnesses on oss-fuzz, where the fuzzer and the target code live inside the same process. The fuzzer has to restart when the target crashes, because they are in the same process.)

And for nyx, since the fuzzer process doesn't exit on crashes at all, I'm interested in what exactly made the difference.
But I agree that the Restartable trait is very confusing 🤦‍♂️ .

@tokatoka

Contributor

I see the diff here: 29be6f0
I don't really think that it's about the impl of the Restartable trait, because this nyx executor doesn't do any restarts.

How many trials did you run for your experiments?
The exec/s, corpus size, and stability differ vastly across runs, much more than coverage does.
(So comparing fuzzer performance in terms of speed, counted in execs/sec, is not easy.)
And bitcoin core being an unstable target doesn't make any of this easier.

If you can, could you run the whole experiment at least ~10 times?

@Crypt-iQ

Crypt-iQ commented Apr 22, 2026

Contributor Author

> Can you tell me what you were doing inside the Restartable trait, i.e. the before and after of your impl?

The old code called the inner_stage's (TuneableMutationalStage) should_restart and clear_progress functions directly in IncrementalSnapshotStage::should_restart and IncrementalSnapshotStage::clear_progress. So here the inner stage would stop running after 128 iterations and the outer stage (IncrementalSnapshotStage) would just keep looping and call inner_stage.perform that did no mutations. These are called from the RestartableStage here which afaict will always call 1) should_restart followed by 2) clear_progress. That's my reading of LibAFL anyways.
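A toy model of the behaviour described above, in case it helps other readers. It does not use LibAFL: `InnerStage` and `total_mutations` are hypothetical names, and the progress tracking is reduced to a counter. It only shows why an inner stage whose progress is never reset turns later rounds into no-ops. How the fix in 29be6f0 is actually structured is not shown here.

```rust
/// Hypothetical inner stage that tracks its own progress and stops
/// mutating once it reaches `limit` (128 in the discussion above).
struct InnerStage {
    limit: usize,
    done: usize,
}

impl InnerStage {
    /// Runs the remaining iterations and returns how many mutations ran.
    /// Once `done == limit`, further calls do no work.
    fn perform(&mut self) -> usize {
        let ran = self.limit - self.done;
        self.done = self.limit;
        ran
    }

    /// Resets the progress counter so the next `perform` starts over.
    fn clear_progress(&mut self) {
        self.done = 0;
    }
}

/// Calls `inner.perform()` `rounds` times and returns the total number of
/// mutations. The inner progress is reset between rounds only when
/// `reset_between_rounds` is set.
fn total_mutations(inner: &mut InnerStage, rounds: usize, reset_between_rounds: bool) -> usize {
    let mut total = 0;
    for _ in 0..rounds {
        total += inner.perform();
        if reset_between_rounds {
            inner.clear_progress();
        }
    }
    total
}
```

With a limit of 128 and 4 rounds, the model performs 128 mutations in total without the reset and 512 with it, which matches the "kept looping but did no mutations" symptom.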

> How many trials did you run for your experiments?
> The exec/s, corpus size, and stability differ vastly across runs, much more than coverage does.

Yeah, I can run more trials in the latest version. The reason the charts are different is because of the amount of time (12h vs 2h) (which is what I think you're referring to). I ran five or six smaller trials of the latest changes for around 20 minutes and the results were consistent (incremental snapshots faster, slower stability drop-off, less coverage). I ran a much longer benchmark here just to see that the incremental snapshots didn't slow down (relative to master) over time. I also ran the earlier 2h benchmark (before the latest change) many times and the results were consistent. Once I automate the benchmarking it should be easier.

@tokatoka

Contributor

> The old code called the inner_stage's (TuneableMutationalStage) should_restart and clear_progress functions directly in IncrementalSnapshotStage::should_restart and IncrementalSnapshotStage::clear_progress. So here the inner stage would stop running after 128 iterations and the outer stage (IncrementalSnapshotStage) would just keep looping and call inner_stage.perform that did no mutations. These are called from the RestartableStage here which afaict will always call 1) should_restart followed by 2) clear_progress. That's my reading of LibAFL anyways.

Ah ok I understand this. Thanks for the explanation 👍
