no merge: test benchmarking with one of claude's improvement ideas #148

Closed
dergoegge wants to merge 1 commit into oss-garage:master from dergoegge:havoc-tokens

@dergoegge

No description provided.

@dergoegge added the `needs benchmark smoke` label (Test the benchmark pipeline) on Jun 22, 2026
@github-actions

This comment was marked as outdated.

@dergoegge removed the `needs benchmark smoke` label (Test the benchmark pipeline) on Jun 23, 2026
@dergoegge added the `needs benchmark smoke` label (Test the benchmark pipeline) on Jun 23, 2026
@github-actions

This comment was marked as outdated.

All `LoadBytes` values are mutated through `LibAflByteMutator`, and they
end up as raw output/input scripts, witness stack elements and raw p2p
message payloads. The mutator ran `havoc_mutations()` over a state with
no `Tokens` metadata, so it did pure structureless havoc with no Bitcoin
awareness and rarely assembled a valid opcode sequence or a recognizable
script template by chance.

Load a curated dictionary of Script opcodes and standard script-template
fragments into the mutator's state and switch to
`havoc_mutations().merge(tokens_mutations())` so `TokenInsert`/
`TokenReplace` can splice meaningful fragments in, reaching script
interpreter branches that random bytes almost never hit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
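
The idea in the commit message can be illustrated with a small, std-only Rust sketch. It is not LibAFL's API and not the dictionary from this PR: `script_tokens`, `token_insert`, `token_replace`, and `demo` are hypothetical names, and the token list is an assumed sample built from well-known Script opcode values. It only shows why splicing template fragments can produce a recognizable script shape that random byte havoc would rarely hit.

```rust
/// Hypothetical sample dictionary: well-known Script opcodes and
/// standard-template fragments (an assumption, not the PR's actual list).
fn script_tokens() -> Vec<(&'static str, Vec<u8>)> {
    vec![
        ("p2pkh_prefix", vec![0x76, 0xa9, 0x14]), // OP_DUP OP_HASH160 PUSH(20)
        ("p2pkh_suffix", vec![0x88, 0xac]),       // OP_EQUALVERIFY OP_CHECKSIG
        ("p2sh_prefix", vec![0xa9, 0x14]),        // OP_HASH160 PUSH(20)
        ("op_equal", vec![0x87]),                 // OP_EQUAL
        ("p2wpkh_prefix", vec![0x00, 0x14]),      // OP_0 PUSH(20)
        ("p2wsh_prefix", vec![0x00, 0x20]),       // OP_0 PUSH(32)
        ("p2tr_prefix", vec![0x51, 0x20]),        // OP_1 PUSH(32)
        ("op_return", vec![0x6a]),                // OP_RETURN
        ("op_checkmultisig", vec![0xae]),         // OP_CHECKMULTISIG
    ]
}

/// Insert `token` at `offset` (clamped to the input length), growing the input.
/// Returns false and leaves the input untouched if the result would exceed `max_len`.
fn token_insert(input: &mut Vec<u8>, token: &[u8], offset: usize, max_len: usize) -> bool {
    if token.is_empty() || input.len() + token.len() > max_len {
        return false;
    }
    let at = offset.min(input.len());
    let tail = input.split_off(at);
    input.extend_from_slice(token);
    input.extend_from_slice(&tail);
    true
}

/// Overwrite bytes starting at `offset` with `token`, truncating the token
/// at the end of the input so the length never changes.
fn token_replace(input: &mut [u8], token: &[u8], offset: usize) -> bool {
    if input.is_empty() || token.is_empty() {
        return false;
    }
    let at = offset.min(input.len() - 1);
    let n = token.len().min(input.len() - at);
    input[at..at + n].copy_from_slice(&token[..n]);
    true
}

/// Two token inserts around 20 placeholder bytes yield a P2PKH-shaped script.
fn demo() -> Vec<u8> {
    let tokens = script_tokens();
    let mut script = vec![0xaa; 20]; // stand-in for a 20-byte pubkey hash
    token_insert(&mut script, &tokens[0].1, 0, 64);
    let end = script.len();
    token_insert(&mut script, &tokens[1].1, end, 64);
    script
}
```

In this sketch, `demo()` returns 25 bytes laid out like a pay-to-pubkey-hash output script. A real fuzzer would pick the token and offset at random; they are fixed here to keep the example deterministic.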
@dergoegge added the `needs benchmark` label (Run the benchmark pipeline on this PR) and removed the `needs benchmark smoke` label (Test the benchmark pipeline) on Jun 23, 2026
@github-actions

Fuzzing Evaluation Report

Baseline (A): master (73961470762d)
Experiment (B): havoc-tokens (651075736db5)

1. Summary Statistics

Evaluation window: 0.984 h · trials: 10 baseline / 10 experiment

| Metric | Baseline | Experiment |
| --- | --- | --- |
| Median final coverage (%) | 9.846 | 9.820 |
| Median AUC (coverage·h) | 9.215 | 9.205 |
| Median execs/s | 22.824 | 17.600 |
| Crashes (total) | 0 | 4 |

Experiment vs. baseline comparison:

| Statistic | Coverage | AUC (speed) |
| --- | --- | --- |
| Adj. p-value | 0.241 | 1.000 |
| Â12 | 0.340 | 0.500 |

Raw p-values and interquartile ranges (IQRs) are available in `evaluation_metrics.csv`.
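
For readers unfamiliar with Â12, the following is a minimal sketch of the standard Vargha-Delaney effect size. The report does not say how the pipeline computes it, so the definition, the argument order (experiment first, baseline second), and the function name `a12` are all assumptions here.

```rust
/// Vargha-Delaney A12: the probability that a random value from `x`
/// is larger than a random value from `y`, with ties counted as one half.
/// Returns None if either sample is empty.
fn a12(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.is_empty() || y.is_empty() {
        return None;
    }
    let mut score = 0.0;
    for &xi in x {
        for &yj in y {
            if xi > yj {
                score += 1.0; // x wins this pair
            } else if xi == yj {
                score += 0.5; // tie counts half
            }
        }
    }
    Some(score / (x.len() * y.len()) as f64)
}
```

Under this definition, 0.5 means neither side tends to be larger, and a value below 0.5 means the first sample tends to be smaller than the second. With 10 trials per side there are 100 pairs, so a value of 0.340 would correspond to a pairwise score of 34.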


📦 Download the full report (visualizations + interpretation guide, as a zip).

@dergoegge dergoegge closed this Jun 23, 2026

Labels

needs benchmark Run the benchmark pipeline on this PR
