dqt Detector Benchmarks

Auto-generated by benchmarks/run_benchmarks.py. Do not edit — re-run to update.

Methodology

Trials: 30 independent runs (seeds 0-29)
Sample size: N=2,000 per fixture per trial
Fixtures: 8 synthetic scenarios (normal mean-shift, lognormal tail-shift, 5% outlier injection, 10% null injection, variance explosion, gradual ramp drift, combined drift+nulls, heavy-tail contamination)
Confidence intervals: 95% via normal approximation (mean ± 1.96 × std / sqrt(n_trials), with n_trials = 30)
Anomaly rate: 50% (8 clean / 8 anomalous per trial)
Interpretation: Detectors are grouped by intended use case. Do not compare across families (an outlier detector is not competing with a distribution drift detector).
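
The confidence-interval rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in benchmarks/run_benchmarks.py: the function names `normal_ci` and `ci_from_summary` are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the report does not say which estimator is used.

```python
import math
import statistics


def ci_from_summary(mean, std, n_trials, z=1.96):
    """95% CI from summary statistics: mean +/- z * std / sqrt(n_trials)."""
    half_width = z * std / math.sqrt(n_trials)
    return mean - half_width, mean + half_width


def normal_ci(scores, z=1.96):
    """Same interval computed from raw per-trial scores.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if n > 1 else 0.0
    return ci_from_summary(mean, std, n, z)


# Reproduce the zscore_outlier_fraction row (F1 mean 0.877, F1 std 0.011, 30 trials).
low, high = ci_from_summary(0.877, 0.011, 30)
print(f"[{low:.3f}, {high:.3f}]")  # [0.873, 0.881]
```

With only 30 trials, a Student-t multiplier would give a slightly wider interval than 1.96; the sketch keeps the normal approximation because that is what the methodology states.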

A well-calibrated detector should beat both `_always_alert` (F1 > 0.667) and `_random_50pct` (F1 > 0.500).
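
The trivial-baseline scores follow directly from the 50% anomaly rate. The sketch below derives them from confusion counts for one trial of 8 clean and 8 anomalous cases. The helper `detection_metrics` is hypothetical, and scoring an undefined ratio as 0.0 is an assumption, chosen because it matches the `_never_alert` row.

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, FPR and F1 from confusion counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


# Always alerting: all 8 anomalies caught, all 8 clean cases flagged.
always = detection_metrics(tp=8, fp=8, fn=0, tn=0)
print(round(always["f1"], 3))  # 0.667

# Never alerting: nothing caught, nothing flagged.
never = detection_metrics(tp=0, fp=0, fn=8, tn=8)
print(never["f1"])  # 0.0
```

Any detector whose FPR is 1.000 in the tables below is flagging every clean case, so its F1 cannot exceed the `_always_alert` value of 0.667.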

Detector	Description	F1 mean	Recall	FPR
`_always_alert`	Always fires: trivial baseline to beat at 50% anomaly rate	0.667	1.000	1.000
`_never_alert`	Never fires — lower bound	0.000	0.000	0.000
`_random_50pct`	50% random alerting	0.486	0.500	0.508
`_naive_zscore`	Batch-mean z-score > 3 threshold	0.141	0.079	0.000

Detector	F1 mean	F1 std	95% CI	Recall	Precision	FPR
`auto_outlier`	0.926	0.023	[0.917, 0.934]	0.863	1.000	0.000
`zscore_outlier_fraction`	0.877	0.011	[0.873, 0.881]	0.875	0.879	0.121
`adjusted_boxplot_fraction`	0.860	0.052	[0.842, 0.879]	0.758	1.000	0.000
`iqr_fence`	0.841	0.036	[0.828, 0.854]	0.738	0.980	0.017
`double_mad_outlier_fraction`	0.536	0.037	[0.523, 0.549]	0.367	1.000	0.000
`grubbs`	0.526	0.078	[0.498, 0.554]	0.421	0.711	0.179
`generalized_esd`	0.398	0.064	[0.375, 0.420]	0.254	0.958	0.017
`mad_outlier_fraction`	0.222	0.000	[0.222, 0.222]	0.125	1.000	0.000

Detector	F1 mean	F1 std	95% CI	Recall	Precision	FPR
`wasserstein_1`	0.933	0.000	[0.933, 0.933]	0.875	1.000	0.000
`ks_pvalue`	0.920	0.033	[0.908, 0.932]	0.879	0.968	0.033
`js_divergence`	0.778	0.027	[0.768, 0.788]	0.637	1.000	0.000
`psi`	0.775	0.022	[0.767, 0.783]	0.633	1.000	0.000
`kl_divergence`	0.769	0.000	[0.769, 0.769]	0.625	1.000	0.000
`mmd`	0.708	0.051	[0.689, 0.726]	0.550	1.000	0.000

Detector	F1 mean	F1 std	95% CI	Recall	Precision	FPR
`holt_winters`	0.933	0.000	[0.933, 0.933]	0.875	1.000	0.000
`cusum`	0.884	0.043	[0.868, 0.899]	0.800	0.990	0.008
`page_hinkley`	0.776	0.088	[0.744, 0.807]	0.792	0.771	0.254
`monotonicity`	0.667	0.000	[0.667, 0.667]	1.000	0.500	1.000
`stl_residual_zscore`	0.545	0.000	[0.545, 0.545]	0.750	0.429	1.000

Detector	F1 mean	F1 std	95% CI	Recall	Precision	FPR
`benford_law_fit`	0.667	0.000	[0.667, 0.667]	1.000	0.500	1.000

Raw results (with full CI columns): examples/benchmarks/results.csv