PEN-118: add daemon dispatch job lifecycle#80

Merged: wauputr4 merged 1 commit into main from agent/dimas/f1cbfda8 on Jun 11, 2026

@wauputr4

Member

Summary

  • Add dispatch_jobs storage for accepted/leased/succeeded/failed/timed_out daemon dispatch lifecycle.
  • Add daemon-authenticated job lease and completion endpoints.
  • Let non-stream gateway chat fall back to healthy daemon nodes when no provider model route exists, preserving normal provider routes as priority.
  • Add daemon-side polling that leases one job after heartbeat, forwards it to the local OpenAI-compatible provider, and reports success/failure.
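
The lifecycle named in the first bullet can be sketched as a small state machine. This is only an illustration: the `JobStatus` type, its transition rules, and the string encoding are assumptions and are not taken from the PR diff.

```rust
/// Lifecycle states named in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Accepted,
    Leased,
    Succeeded,
    Failed,
    TimedOut,
}

impl JobStatus {
    /// Terminal states never change again.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            JobStatus::Succeeded | JobStatus::Failed | JobStatus::TimedOut
        )
    }

    /// Assumed transition rules (not confirmed by the PR):
    /// accepted -> leased | timed_out
    /// leased   -> succeeded | failed | timed_out
    pub fn can_transition_to(self, next: JobStatus) -> bool {
        use JobStatus::*;
        matches!(
            (self, next),
            (Accepted, Leased)
                | (Accepted, TimedOut)
                | (Leased, Succeeded)
                | (Leased, Failed)
                | (Leased, TimedOut)
        )
    }

    /// Column value as it might be stored in `dispatch_jobs` (assumed encoding).
    pub fn as_str(self) -> &'static str {
        match self {
            JobStatus::Accepted => "accepted",
            JobStatus::Leased => "leased",
            JobStatus::Succeeded => "succeeded",
            JobStatus::Failed => "failed",
            JobStatus::TimedOut => "timed_out",
        }
    }
}
```

Rejecting transitions out of a terminal state is one way to keep a late daemon completion from overwriting a job that the gateway has already marked as timed out.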

Scope notes

  • Streaming daemon-backed chat returns a clear unsupported error in this first path.
  • Dispatch retry/backoff and richer adapter auth remain follow-up scope.

Verification

  • git diff --check passed.
  • Could not run cargo fmt, cargo check, or tests because this runtime has no Rust toolchain on PATH (cargo, rustc, and rustfmt are unavailable).

Refs PEN-118
GitHub issue: #68

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a database-backed job dispatching system to route LLM chat completion requests to daemon nodes, adding a new dispatch module in mizan-api, a dispatch_jobs database table, and updating mizan-daemon to lease and execute these jobs. The review feedback highlights critical architectural and performance concerns: the sequential execution of jobs in the daemon blocks heartbeats and prevents concurrency; inline expiration checks on every lease request risk database write contention; and active database polling for job completion under high concurrency could exhaust the connection pool. It is recommended to run heartbeats and job execution concurrently, offload expiration checks to a background worker, and leverage Redis Pub/Sub for job completion notifications.

Comment on lines +102 to +104
if let Err(error) = lease_and_run_one_job(&client, &token, &config).await {
    warn!(error = %error, "daemon dispatch job processing failed");
}

critical

The current implementation of the daemon loop has two major architectural issues:

  1. Heartbeat Blocking: Since lease_and_run_one_job is awaited sequentially in the main loop, any long-running LLM request (which can easily take 10-60+ seconds) will block the entire loop. This will delay the heartbeat, causing the control plane to mark the daemon node as stale/offline.
  2. No Concurrency: The daemon cannot process jobs concurrently even if max_concurrency is configured to be greater than 1, because job execution is fully sequential.

Recommendation:
Separate the heartbeat and job leasing into two concurrent tasks using tokio::spawn. Additionally, when a job is leased, spawn its execution into a background task (tokio::spawn) so it doesn't block leasing of subsequent jobs, up to the configured max_concurrency limit (e.g., using a semaphore or tracking active tasks).
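
The "tracking active tasks" option in the recommendation could look roughly like the sketch below. It uses only the standard library so it has no dependencies; in the daemon itself the guard would presumably be moved into a task started with tokio::spawn. `JobSlots` and `SlotGuard` are hypothetical names, not types from the PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: counts in-flight jobs against `max_concurrency`.
#[derive(Clone)]
pub struct JobSlots {
    active: Arc<AtomicUsize>,
    max_concurrency: usize,
}

/// Releases its slot when dropped, i.e. when the job task finishes.
pub struct SlotGuard {
    active: Arc<AtomicUsize>,
}

impl JobSlots {
    pub fn new(max_concurrency: usize) -> Self {
        Self {
            active: Arc::new(AtomicUsize::new(0)),
            max_concurrency,
        }
    }

    /// Tries to reserve a slot without blocking; `None` means the node is full.
    pub fn try_acquire(&self) -> Option<SlotGuard> {
        let mut current = self.active.load(Ordering::Acquire);
        loop {
            if current >= self.max_concurrency {
                return None;
            }
            // Retry if another thread changed the counter in the meantime.
            match self.active.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => {
                    return Some(SlotGuard {
                        active: Arc::clone(&self.active),
                    })
                }
                Err(observed) => current = observed,
            }
        }
    }

    /// Number of jobs currently holding a slot.
    pub fn active(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

The leasing loop would call `try_acquire` before asking the control plane for a job and skip the lease request when it returns `None`, so the heartbeat task is never held up by job execution.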

    database_backend: DatabaseBackend,
    node_id: Uuid,
) -> Result<Option<DispatchJobLeaseResponse>, AppError> {
    mark_expired_jobs_timed_out(database, database_backend).await?;

medium

Calling mark_expired_jobs_timed_out on every single job lease request will execute a global UPDATE query on the dispatch_jobs table very frequently (on every daemon poll/heartbeat). This can cause significant database write contention, transaction log bloat, and performance degradation under load.

Recommendation:
Remove this inline call and instead run the expiration logic in a periodic background task (e.g., every 10-30 seconds) within the API server.
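
A minimal in-memory model of that periodic sweep is sketched below. The `Job` struct, its field names, and `SweepTimer` are assumptions for illustration; the real change would run the existing UPDATE from a background task instead of mutating a slice.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Accepted,
    Leased,
    Succeeded,
    Failed,
    TimedOut,
}

/// In-memory stand-in for a `dispatch_jobs` row (field names are assumptions).
#[derive(Debug)]
pub struct Job {
    pub status: Status,
    pub deadline: Instant,
}

/// Marks every non-terminal job whose deadline has passed as timed out
/// and returns how many jobs changed.
pub fn sweep_expired(jobs: &mut [Job], now: Instant) -> usize {
    let mut changed = 0;
    for job in jobs.iter_mut() {
        let open = matches!(job.status, Status::Accepted | Status::Leased);
        if open && job.deadline <= now {
            job.status = Status::TimedOut;
            changed += 1;
        }
    }
    changed
}

/// Decides whether the background task should sweep again.
pub struct SweepTimer {
    interval: Duration,
    last_run: Option<Instant>,
}

impl SweepTimer {
    pub fn new(interval: Duration) -> Self {
        Self {
            interval,
            last_run: None,
        }
    }

    /// Returns true at most once per `interval` and records the run.
    pub fn due(&mut self, now: Instant) -> bool {
        let due = match self.last_run {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= self.interval,
        };
        if due {
            self.last_run = Some(now);
        }
        due
    }
}
```

With the sweep gated this way, the number of expiry writes depends on the chosen interval rather than on how many daemons are polling for leases.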

Comment on lines +345 to +364
pub async fn wait_for_dispatch_result(
    database: &AnyPool,
    database_backend: DatabaseBackend,
    job_id: Uuid,
    timeout_duration: Duration,
) -> Result<DispatchJobResult, AppError> {
    let started_at = Instant::now();

    loop {
        if let Some(result) = fetch_terminal_result(database, database_backend, job_id).await? {
            return Ok(result);
        }

        if started_at.elapsed() >= timeout_duration {
            return Ok(DispatchJobResult::TimedOut);
        }

        sleep(Duration::from_millis(100)).await;
    }
}

medium

The wait_for_dispatch_result function polls the database every 100ms to check whether a job has completed, so each waiting request issues up to about 10 queries per second until it finishes or times out. Under high concurrency, this active polling multiplies across all in-flight requests and can exhaust the database connection pool (AnyPool).

Since the application already has a Redis client (AppState.redis), you can wait for job completion notifications through Redis Pub/Sub (keyspace notifications are one option) or a blocking command such as BLPOP. This would remove the database polling overhead and deliver completion notifications with low latency.
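
The shape of a notification-based wait can be shown without Redis. In the sketch below a standard-library channel stands in for the subscription, which is a substitution made for illustration only; the `Succeeded` and `Failed` variants are assumed, since only `DispatchJobResult::TimedOut` appears in the reviewed code. This version blocks the calling thread, so an async gateway would need an async equivalent.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Only `TimedOut` is visible in the PR; the other variants are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum DispatchJobResult {
    Succeeded(String),
    Failed(String),
    TimedOut,
}

/// Blocks until a completion message arrives or the timeout elapses.
/// The channel stands in for a Redis subscription, so no polling loop is needed.
pub fn wait_for_notification(
    completions: &Receiver<DispatchJobResult>,
    timeout_duration: Duration,
) -> DispatchJobResult {
    match completions.recv_timeout(timeout_duration) {
        Ok(result) => result,
        // Either nothing arrived in time or the notifier went away.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            DispatchJobResult::TimedOut
        }
    }
}
```

The waiting side performs a single blocking receive per request, so the cost no longer grows with the length of the wait.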

wauputr4 force-pushed the agent/dimas/f1cbfda8 branch 2 times, most recently from 4280704 to 500d77e on June 11, 2026 05:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ead040414e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/mizan-api/src/gateway.rs Outdated
api_key_id: Some(identity.api_key_id),
model: public_model.clone(),
request: upstream_request,
timeout_seconds: state.config.limit_lease_seconds.max(1).min(30),

P2: Avoid expiring jobs before the next daemon poll

With the daemon's default/documented heartbeat_interval_seconds of 30s, this caps every daemon-backed request deadline at 30s. Since run only calls lease_and_run_one_job once per heartbeat, a request that arrives just after a daemon polls can remain unleased until roughly the next heartbeat and then time out before an otherwise healthy local provider has time to respond. Consider making the dispatch deadline exceed the polling interval or decoupling job polling from the heartbeat sleep.
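
The arithmetic behind this comment is small enough to write down. The helpers below are hypothetical and only model the worst case described above, where a request arrives just after a poll and waits one full interval before it is leased.

```rust
use std::time::Duration;

/// Time left for the provider call if the job is leased only at the next poll.
/// Worst case: the request arrives just after a poll and waits one full interval.
pub fn worst_case_provider_budget(deadline: Duration, poll_interval: Duration) -> Duration {
    deadline.saturating_sub(poll_interval)
}

/// A deadline that still leaves `provider_budget` in the worst case
/// (hypothetical policy, not the one used in the PR).
pub fn deadline_covering_poll(poll_interval: Duration, provider_budget: Duration) -> Duration {
    poll_interval + provider_budget
}
```

With a 30s cap and a 30s poll interval the first function returns zero, which is the failure mode the comment describes.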

wauputr4 force-pushed the agent/dimas/f1cbfda8 branch from 500d77e to 2e75d85 on June 11, 2026 05:16
Co-authored-by: multica-agent <github@multica.ai>
wauputr4 force-pushed the agent/dimas/f1cbfda8 branch from 2e75d85 to 9b09c65 on June 11, 2026 05:18
@wauputr4

Member Author

Thanks for the review. I agree these are the right concerns for a production-grade dispatcher.

This PR is intentionally the first v0.2 local self-hosted dispatch path: database-backed lifecycle, daemon lease/complete contract, timeout marking, gateway lease release, and green Rust CI. I am not treating this as the final scalable dispatch architecture.

Decision for this PR:

  • Keep sequential daemon job execution as the conservative first runnable path; concurrent execution can build on the same lease/complete contract without changing the public API.
  • Keep inline expiration and gateway polling for v0.2 local self-hosted use; Redis wakeups/background expiry should be follow-up work before higher concurrency or multi-node production use.
  • Do not mark the milestone clean/release-ready until QA has exercised operator smoke and the remaining dispatch scale follow-ups are either accepted as follow-up scope or implemented.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b09c65160

warn!(error = %error, "daemon heartbeat failed");
}
}
if let Err(error) = lease_and_run_one_job(&client, &token, &config).await {

P2: Honor advertised daemon concurrency when polling

When a daemon is configured with max_concurrency > 1 (the documented sample uses 2), this loop still awaits exactly one lease_and_run_one_job call and then sleeps for the heartbeat interval, so the node can never run more than one dispatched request at a time and additional accepted jobs remain queued despite advertised capacity. Under concurrent daemon-backed traffic this underutilizes the daemon and can make queued requests time out even though the node reported free slots; poll and run jobs up to config.max_concurrency before sleeping or keep a worker pool active.
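
One reading of "poll and run jobs up to config.max_concurrency before sleeping" is sketched below. The in-memory queue stands in for repeated calls to the lease endpoint, and `lease_up_to_capacity` is a hypothetical helper rather than code from the PR.

```rust
use std::collections::VecDeque;

/// Takes as many queued job ids as there are free slots on this node.
/// `queue` stands in for repeated calls to the lease endpoint.
pub fn lease_up_to_capacity(
    queue: &mut VecDeque<u64>,
    max_concurrency: usize,
    active: usize,
) -> Vec<u64> {
    // Free slots never go negative, even if `active` exceeds the limit.
    let free_slots = max_concurrency.saturating_sub(active);
    let mut leased = Vec::with_capacity(free_slots);
    while leased.len() < free_slots {
        match queue.pop_front() {
            Some(job_id) => leased.push(job_id),
            None => break, // nothing left to lease
        }
    }
    leased
}
```

Each poll then fills the advertised capacity instead of taking a single job, so accepted jobs do not sit in the queue while the node reports free slots.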

wauputr4 merged commit 3e51992 into main on Jun 11, 2026
1 check passed
wauputr4 deleted the agent/dimas/f1cbfda8 branch on June 11, 2026 06:03