Skip to content

fix(toolkit): emit successful txs from batch-single-tx on partial failure#1731

Open
chrispalaskas wants to merge 1 commit into
mainfrom
fix/batch-single-tx-emit-partial
Open

fix(toolkit): emit successful txs from batch-single-tx on partial failure#1731
chrispalaskas wants to merge 1 commit into
mainfrom
fix/batch-single-tx-emit-partial

Conversation

@chrispalaskas

Copy link
Copy Markdown
Contributor

Problem

generate-txs batch-single-tx builds N transfers concurrently, then discards every successfully-built tx if even one transfer fails. It returns PartialFailure before serializing anything, so --dest-file is never written:

if failed > 0 {
    return Err(BatchSingleTxError::PartialFailure { succeeded, failed });
}
Ok(SerializedTxBatches { batches: vec![txs] })   // never reached on partial failure

This wastes the whole batch. Downstream load appliers (the perf repo's tx_load_applier.pygenerate_txs_round_robin.py) see the non-zero exit, delete the batch directory, and skip submission entirely. In practice a 250-transfer batch with even one failed transfer sends zero transactions — so the load test produces no traffic at all:

Built tx ... index=250 total=250
failed to build transactions: 3 of 250 transfers failed
Error: BuildTransactions(PartialFailure { succeeded: 247, failed: 3 })
❌ batch-single-tx failed with exit code 1 and produced no transactions

All 247 valid txs are thrown away because 3 failed.

Fix

Emit every tx that built successfully, even on partial failure. Only error when the entire batch fails (AllFailed):

  • succeeded > 0: log a warning, write the successful txs, exit 0.
  • succeeded == 0: return AllFailed, exit non-zero (the wrapper already handles this as "produced no transactions").

No wrapper change is needed — the existing valid_count == 0 && returncode != 0 guard in generate_txs_round_robin.py now only trips on a fully-empty batch, and the applier submits partial batches.

Testing

  • cargo check -p midnight-node-toolkit passes.

🤖 Generated with Claude Code

…lure

batch-single-tx discarded every successfully-built tx whenever a single
transfer failed, returning PartialFailure before serializing anything to
--dest-file. Downstream load appliers (tx_load_applier.py) saw the
non-zero exit, deleted the batch, and skipped submission entirely — so a
batch of 250 transfers with even one failure sent zero transactions.

Now emit all txs that built successfully and only error when the entire
batch fails (AllFailed), which the wrapper already handles as "produced
no transactions".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chrispalaskas chrispalaskas requested a review from a team as a code owner June 19, 2026 17:37
@datadog-official

datadog-official Bot commented Jun 19, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

+audit (cargo deny + npm audit) | audit job npm   View in Datadog   GitHub Actions

CI + E2E | Push combined multi-arch manifest   View in Datadog   GitHub Actions

Changes Check | check-changes   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 5342803 | Docs | Give us feedback!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 534280335c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +256 to +258
if succeeded == 0 {
return Err(BatchSingleTxError::AllFailed { failed });
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve state when accepting partial batches

When failed > 0 but at least one transfer succeeds, this now continues with the mutated shared builder context and emits the remaining txs. For failures that occur after tx_info.prove() starts (for example shielded transfers where OfferInfo::build_inputs has already spent wallet state before proof/validation returns ProvingFailed), a failed, un-emitted transfer can leave local coins/DUST marked spent or create local change; later successes built from the same wallet may then depend on state that will not exist on-chain once only the successful txs are submitted. Please isolate or roll back per-transfer context mutations before treating partial failures as success.

Useful? React with 👍 / 👎.

@HarleysCodes HarleysCodes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix — the behavior change is correct and the rationale (don't discard 247 valid txs because 3 failed) is exactly right. A few concrete observations from reading the diff:

Behavior change is sound. The rename of PartialFailureAllFailed is a clean refactor: the variant is only constructed in one place and only matched by the BuildTxs Error impl in this same file, so there's no other call site to update. Verified via gh search code 'PartialFailure' returning only the original construction site. The new succeeded > 0 short-circuit correctly turns a partial-failure error into a successful return with the partial batch.

Suggestion: add a test for the new behavior. The PR description only mentions cargo check -p midnight-node-toolkit passing — there's no test that exercises failed > 0 && succeeded > 0 and verifies the function returns Ok with the partial batch. Even a small unit test in batch_single_tx.rs (constructing a builder with 5 mock transfers where 2 fail) would prevent regressions and document the new contract. Given the perf-repo load applier in the description relies on the new Ok path, locking the behavior in a test is high-value.

Minor: warn-message arg order. In the new tracing::warn!:

tracing::warn!(
    succeeded = succeeded,
    failed = failed,
    "Partial batch: {} of {} transfers failed; emitting {} successful txs",
    failed,
    succeeded + failed,
    succeeded
);

The format string lists arguments as failed / total / succeeded but the structured fields above are succeeded = succeeded, failed = failed. I'd swap the format placeholders to {}/{}, succeeded {}failed/total/succeeded for consistency with the structured fields, OR move the structured fields below the format string with failed = {}-style placeholders. Cosmetic, but the dual layout makes it easy to misread at a glance.

Pre-existing edge case, not a blocker. If the batch is empty (0 transfers, succeeded == 0 && failed == 0), the new code returns Ok(SerializedTxBatches { batches: vec![] }). Pre-existing behavior was the same, so this PR doesn't regress anything — but worth filing a follow-up if the team considers an empty batch an error condition (e.g. missing --num-transfers flag). Out of scope for this fix.

Optional doc nit. The PR body mentions the perf-repo wrapper (tx_load_applier.pygenerate_txs_round_robin.py) but doesn't link it. A maintainer reviewing this would have to grep for those script paths manually. A couple of <url> references would let the reviewer verify the claim that 'no wrapper change is needed.'

Overall the fix is small, focused, and addresses a real waste-of-work bug. The main ask is adding a test; the rest is polish. Happy to follow up on any of these.

@HarleysCodes HarleysCodes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small follow-up inline comments after the main PR-level review above.

return Err(BatchSingleTxError::PartialFailure { succeeded, failed });
if succeeded == 0 {
return Err(BatchSingleTxError::AllFailed { failed });
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit on the warn-message: structured fields above are succeeded = succeeded, failed = failed but the format placeholders consume them as failed, succeeded + failed, succeeded — i.e. the format string reads {failed} of {total} transfers failed; emitting {succeeded} successful txs. Either swap the format placeholders to match the structured-field ordering, or reorder the args. Cosmetic, but the dual layout is easy to misread.

@HarleysCodes HarleysCodes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional inline comment.

succeeded = succeeded,
failed = failed,
"Partial batch: {} of {} transfers failed; emitting {} successful txs",
failed,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new code path worth a test: failed > 0 && succeeded > 0 must return Ok with the partial batch, not Err. The perf-repo load applier in the PR description relies on this exact behavior. Even a small test that constructs a BatchSingleTxBuilder with 5 transfers where 2 deliberately fail and verifies the function returns Ok(SerializedTxBatches) with 3 entries would lock the contract and document the new semantics for future maintainers. Without it, a future refactor could revert to Err on partial failure and break the load-test pipeline silently.

@@ -246,8 +249,21 @@ impl<C: BuilderContext<DefaultDB>> BuildTxs for BatchSingleTxBuilder<C> {

progress.finish(format!("batch-single-tx: {} succeeded, {} failed", succeeded, failed));

// Emit every tx we managed to build, even on partial failure. Discarding the
// successful txs because a few transfers failed wastes the whole batch and starves
// downstream load appliers. Only treat a batch with zero successes as an error.
if failed > 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make sure that process returns a valid JSON for downstream instead of playing with error codes here.
As pointed by another comment: 0 of 0 is success, 0 of 1 if failure, but in both cases downstream doesn't have much to work on. It seems that return code alone isn't sufficient here.

Also, is 1 okay out of 250 a success? It hardly seems so to me.

It seems simpler to me if downstream makes decision what is enough for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants