Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
`gh aw forecast` command for projecting workflow token usage and costs
Hey
If you'd like a hand drafting the issue plan or adding tests, assign this prompt to your coding agent:
@mnkiefer feel free to prompt further

@copilot add --repo support
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. Changes made:

Frontmatter metadata (triggers, concurrency) degrades gracefully to empty for remote repos since the Markdown source files are local-only.
@copilot integrate episode analysis

@copilot use a best-in-class predictive technique based on operations research and statistics: Monte Carlo
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. New additions:
…trap + Bernoulli CI) Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. The forecast now runs a 10,000-trial Monte Carlo simulation per workflow, modelling three independent sources of uncertainty.

New output fields:

Console table now shows P50 as the primary estimate plus a new "80% CI (P10–P90)" column, replacing the single point estimate. The Poisson sampler uses Knuth's exact algorithm for λ ≤ 15 and a Normal approximation for larger λ to keep 10,000-trial simulation time well under 10 ms. 14 unit tests cover the sampler, percentile helpers, ordering invariants, and order-of-magnitude accuracy.
@copilot focus on effective token forecast, remove cost forecasts
… projections Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Commit pushed:
🧪 Test Quality Sentinel Report

Test Quality Score: 85/100 · ✅ Excellent test quality

Test Classification Details (14 tests)
Flagged Items — Minor Guideline Gaps
| File | Test lines added | Production lines added | Ratio |
|---|---|---|---|
| forecast_montecarlo_test.go | 228 | 194 (forecast_montecarlo.go) | 1.17 ✅ |
No inflation detected (threshold: > 2:1).
Language Support
- 🐹 Go (`*_test.go`): 14 tests — unit (`//go:build !integration`) ✅
- 🟨 JavaScript (`*.test.cjs`): 0 tests
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). The test suite for `forecast_montecarlo.go` is well-designed: it covers statistical invariants (Poisson properties), helper correctness, edge/boundary conditions (nil, zero, negative inputs, single element, zero success rate), and end-to-end ordering guarantees across multiple seeds. A few assertions are missing descriptive message strings — minor and non-blocking.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does — observable outputs, state changes, error handling. They remain valid after internal refactoring and catch behavioral regressions.
Implementation Tests (Low Value) verify how the system does it — internal call counts, data structure layouts. They break during legitimate refactoring and give false assurance.
References: §25642964048
🧪 Test quality analysis by Test Quality Sentinel · ● 6.7M
🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (1,294 new lines). AI has analyzed the PR diff and generated a draft ADR (📄 Draft ADR) to help you get started; the draft has been committed to this PR branch.

What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify that the implementation matches the decision.

Why an ADR is warranted here

This PR introduces several decisions worth recording for posterity:
📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete:
All ADRs are stored in
References: §25642964043
[/tdd] The test suite covers the Monte Carlo simulation engine well (14 tests), but the core business logic in
Per the /tdd guidance, for example:

```go
func TestFormatForecastTokens(t *testing.T) {
	assert.Equal(t, "-", formatForecastTokens(0))
	assert.Equal(t, "999", formatForecastTokens(999))
	assert.Equal(t, "1.0K", formatForecastTokens(1000))
	assert.Equal(t, "1.00M", formatForecastTokens(1_000_000))
}
```

Consider adding tests along these lines.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 10.2M
@copilot review all comments

@copilot as an expert in industrial and market forecasting, review the algorithm used here
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
All reviewer comments addressed across two commits:

- Code fixes
- Help text
- Spec
- Tests
…able flag Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…e, and footer text Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. Algorithm review findings and changes: The original model treated the Poisson arrival rate λ as a known constant equal to the observed rate. This is the primary weakness in industrial/operational forecasting: with a short history window (e.g., 7 days, 5 runs), the rate estimate itself carries significant uncertainty, causing the CI to be artificially narrow.

Key improvement — Gamma–Poisson compound model (Negative Binomial): the arrival rate is now drawn from its posterior on each trial rather than held fixed.

Additional improvements:
@copilot merge main and recompile

…cli-command Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

@copilot add a flag --eval that does forecasting on past events (start forecast from a past date to be able to compare to measured data) and evaluates the quality of the forecast.
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
New Feature
Adds `gh aw forecast` — a new experimental analysis command that samples recent GitHub Actions run history and projects effective token usage and yield forward on a per-week or per-month basis using a Gamma–Poisson Monte Carlo simulation.

What does this feature do?
- Samples recent run history via `started_at`/`updated_at` timestamps; effective tokens are enriched from locally-cached `run_summary.json` files written by `gh aw logs` (no artifact re-download required)
- Workflows with fewer than 5 sampled runs get `IsReliable = false` and are marked in the console table
- `--eval` — shifts the entire training window back by one projection period so the forecast can be compared against actual runs in the most recent period. Computes accuracy metrics (`P50ErrorAbs`, `P50ErrorPct`, `InCI`) and renders a Backtesting evaluation table showing Actual Runs, Actual ET, Forecast P50, Error (abs), Error %, and whether the actual result fell within the 80% CI
- Reads triggers (`schedule`, `pull_request`, etc.) from each workflow's Markdown source
- Progress display uses `console.NewSpinner`
- `--json` emits the full `ForecastResult` struct for agent consumption, including the `monte_carlo` field with ET mean, stddev, P10/P50/P90 percentile fields, `is_reliable`, and (in eval mode) the full `evaluation` object per workflow
- `--repo owner/repo` forecasts workflows in any accessible repository; workflows are discovered via the GitHub API and run history is fetched with `gh run list --repo`
- Reuses the `buildEpisodeData` engine; surfaces per-episode token usage and episodes-per-period, and prints an episode breakdown table when orchestrator-style workflows are detected (runs/episode > 1)
- The command is labeled `(experimental)` and a warning is printed to stderr at runtime so users know the interface may change
- New spec doc at `docs/src/content/docs/reference/forecast-specification.md` (sidebar order 1355, adjacent to the MCP Gateway and Effective Tokens specs), covering command interface, workflow discovery, the Monte Carlo algorithm, episode analysis, JSON schema, error handling, and compliance test cases
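Taken together, the flags above suggest invocations like these (the repository name is made up for illustration):

```shell
# Forecast next period's effective tokens from the last 30 days of history
gh aw forecast --days 30

# Backtest: train on the window shifted back one period, score against actuals
gh aw forecast --days 30 --eval

# Machine-readable output for a remote repository
gh aw forecast --repo octo-org/octo-repo --json
```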
Implementation details

- `pkg/cli/forecast_command.go` (`--eval` flag, corrected help text)
- `pkg/cli/forecast.go`
- `pkg/cli/forecast_montecarlo.go`
- `pkg/cli/forecast_montecarlo_test.go`
- `pkg/cli/forecast_test.go`
- `cmd/gh-aw/main.go` (registers `forecast` in the `analysis` command group)
- `docs/src/content/docs/reference/forecast-specification.md`

Projection is driven by a Gamma–Poisson compound Monte Carlo simulation (the Negative Binomial model standard in actuarial science and industrial reliability). For each trial, the arrival rate λ is drawn from its Bayesian posterior
`Gamma(n + 0.5, scale = λ̂/n)` — where n is the observed run count and 0.5 is the Jeffreys non-informative prior shape — then the run count is drawn from `Poisson(λ_trial)`. Per-run effective tokens are sampled via bootstrap resampling of historical observations, and each run independently succeeds with the historical success rate (Bernoulli). This compound model naturally produces wider confidence intervals for small samples and converges to the classical Poisson estimate as n grows. The `gammaSample` function uses the Marsaglia–Tsang squeeze method. Aggregating 10,000 trials yields P10/P50/P90 effective-token estimates.
Backtesting (`--eval`) date window:
The `ForecastEvaluation` struct records `training_start_date`, `training_end_date`, `validation_end_date`, `actual_runs`, `actual_effective_tokens`, `p50_error_abs`, `p50_error_pct`, and `in_ci` (whether actual ET fell within the P10–P90 interval). Runs with missing timestamps are excluded from validation-window counting to avoid undefined bias.
The `--days` flag accepts `7` or `30` (maximum 30 days). The console table columns are: Workflow, Sampled Runs, Success Rate, Yield/Period (throughput: `success_rate × runs_per_period`), Avg ET, Proj. ET (P50), 80% CI (P10–P90), and Triggers. ET values are formatted as K/M abbreviations. Workflows with fewer than 5 sampled runs are marked `*` in the table with a footnote warning. The `--json` output includes the full `monte_carlo` summary with `mean_projected_effective_tokens`, `std_dev_effective_tokens`, all three ET percentile fields, and `is_reliable`. The `yield` JSON field represents the throughput rate (`success_rate × observed_runs_per_period`), distinct from `success_rate`.
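A plausible K/M formatter matching the abbreviations described above. The precision rules are inferred from the reviewer's suggested test cases ("-", "999", "1.0K", "1.00M"), so the real `formatForecastTokens` may differ.

```go
package main

import "fmt"

// formatForecastTokens renders token counts with K/M abbreviations:
// "-" for zero, plain digits below 1000, one decimal in the K range,
// two decimals in the M range (rules inferred, not confirmed).
func formatForecastTokens(n int64) string {
	switch {
	case n == 0:
		return "-"
	case n < 1000:
		return fmt.Sprintf("%d", n)
	case n < 1_000_000:
		return fmt.Sprintf("%.1fK", float64(n)/1000)
	default:
		return fmt.Sprintf("%.2fM", float64(n)/1_000_000)
	}
}

func main() {
	for _, v := range []int64{0, 999, 1000, 1_000_000, 2_345_678} {
		fmt.Printf("%d -> %s\n", v, formatForecastTokens(v))
	}
}
```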
When `--repo` is set, workflow discovery uses `fetchGitHubWorkflows` (GitHub API) instead of local `.lock.yml` files. Provided workflow IDs are matched case-insensitively against remote workflow display names and file-path basenames. Frontmatter metadata degrades gracefully to empty for remote repos since Markdown source files are local-only.
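The case-insensitive matching could look roughly like this; `matchesWorkflow` is a hypothetical helper illustrating the rule, not the actual code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchesWorkflow reports whether a user-provided ID matches a remote
// workflow by display name or by file-path basename (extension
// stripped), case-insensitively.
func matchesWorkflow(id, displayName, path string) bool {
	base := filepath.Base(path)
	base = strings.TrimSuffix(base, filepath.Ext(base))
	return strings.EqualFold(id, displayName) || strings.EqualFold(id, base)
}

func main() {
	// Display-name match, ignoring case.
	fmt.Println(matchesWorkflow("DAILY PLAN", "Daily Plan", "wf/x.yml"))
	// Basename match after stripping the extension.
	fmt.Println(matchesWorkflow("daily-plan.lock", "Other",
		".github/workflows/daily-plan.lock.yml"))
}
```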
Episode analysis reuses the existing `buildEpisodeData` + `classifyEpisode` engine from `logs_episode.go`. Because no artifact downloads occur during forecasting, only GitHub Actions API fields (`event`, `headSha`, `headBranch`) are used for linkage — the resulting episode count is therefore a lower-bound estimate for orchestrator-style workflows. The `ForecastEpisodeSummary` struct exposes `sampled_episodes`, `runs_per_episode`, `avg_effective_tokens_per_episode`, and `observed_episodes_per_period` in JSON output and as a console table.