Several pytest tests are flaky or weak because of unseeded randomness, long sweeps, and smoke-only plot checks #36

## Summary

Several pytest tests use random data without a fixed seed, run broad, slow parameter sweeps, or only check that plotting functions do not crash. These tests are useful as exploratory smoke coverage, but they are weak as deterministic regression tests.

## Examples

### Unseeded randomness

Many tests call `np.random.randn`, `np.random.rand`, `np.random.permutation`, or similar global RNG APIs without a fixed seed. Examples include:

- `python/tests/unit/aout/test_analyze_error_by_phase.py`
- `python/tests/unit/aout/test_analyze_error_by_value.py`
- `python/tests/unit/aout/test_decompose_harmonics.py`
- `python/tests/unit/calibration/test_verify_estimate_frequencies.py`
- several `python/tests/unit/spectrum/*` tests
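
As a rough illustration of the seeded alternative, the sketch below builds test data from a local `np.random.default_rng` generator instead of the global RNG. The helper `make_noisy_sine`, the `SEED` value, and the signal parameters are hypothetical and do not come from the test files listed above.

```python
import numpy as np

SEED = 12345  # arbitrary fixed value, chosen only for illustration


def make_noisy_sine(rng, n_samples=1024, noise_std=0.01):
    """Hypothetical helper: build a test signal from a local Generator."""
    t = np.arange(n_samples)
    signal = np.sin(2 * np.pi * 0.1 * t)
    # Noise comes from the passed-in generator, not from np.random.randn.
    return signal + noise_std * rng.standard_normal(n_samples)


def test_noisy_sine_is_reproducible():
    first = make_noisy_sine(np.random.default_rng(SEED))
    second = make_noisy_sine(np.random.default_rng(SEED))
    # Same seed gives the same draw, so a failure can be replayed exactly.
    np.testing.assert_array_equal(first, second)
```

Passing the generator in as an argument keeps each test independent of whatever other tests did to the global RNG state. In a real suite the generator could just as well come from a shared pytest fixture.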

### Calibration tests are print-heavy and sweep-heavy

`python/tests/unit/calibration/test_verify_calibration_lite.py` runs many broad sweeps and prints metrics such as weight error, SNDR, and ENOB, but several sweeps do not assert the expected bounds.

`python/tests/unit/calibration/test_verify_estimate_frequencies.py` prints whether frequency estimates are good or bad, while the actual threshold assertions are commented out.
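
A minimal sketch of what turning a printed good/bad verdict into an assertion might look like. The `estimate_frequency` function here is a hypothetical stand-in (a plain FFT peak search), not the project's estimator, and the one-bin tolerance is an assumed threshold.

```python
import numpy as np


def estimate_frequency(signal, fs):
    """Hypothetical stand-in estimator: frequency of the largest FFT bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    spectrum[0] = 0.0  # ignore the DC bin
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * fs / len(signal)


def test_estimate_frequency_within_one_bin():
    fs = 1000.0
    n_samples = 1000
    true_freq = 123.0  # coherent tone: falls exactly on an FFT bin
    rng = np.random.default_rng(0)  # seeded so the noise is reproducible
    t = np.arange(n_samples) / fs
    signal = np.sin(2 * np.pi * true_freq * t)
    signal = signal + 0.01 * rng.standard_normal(n_samples)

    estimate = estimate_frequency(signal, fs)

    # Fail the test when out of tolerance instead of printing a verdict.
    bin_width = fs / n_samples
    assert abs(estimate - true_freq) <= bin_width, (
        f"estimate {estimate:.3f} Hz is more than {bin_width:.3f} Hz "
        f"away from {true_freq:.3f} Hz"
    )
```

The assertion message carries the same numbers the print statement would have shown, so no diagnostic information is lost when the test fails.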

### Plot tests are often smoke-only

Some plotting tests only assert that a figure/file/result exists, for example:

- `python/tests/unit/dout/test_plot_residual_scatter.py`
- `python/tests/unit/spectrum/test_sweep_performance_vs_osr.py`
- many AOUT/Spectrum plot tests that only verify PNG creation

Smoke tests are useful, but they should be labeled as such and complemented with structural or numeric checks when the plotted data encodes important behavior.
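
One possible shape for such structural checks is sketched below, assuming matplotlib is the plotting backend. `plot_residual_scatter` here is a hypothetical helper written for this example. It is not the function under test in `test_plot_residual_scatter.py`, and the axis labels are invented.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the test runs in CI
import matplotlib.pyplot as plt
import numpy as np


def plot_residual_scatter(x, residual):
    """Hypothetical plotting helper, not the project's real function."""
    fig, ax = plt.subplots()
    ax.scatter(x, residual, s=4)
    ax.set_xlabel("Code")
    ax.set_ylabel("Residual")
    return fig, ax


def test_plot_residual_scatter_structure():
    x = np.arange(64)
    residual = np.linspace(-0.5, 0.5, 64)
    fig, ax = plot_residual_scatter(x, residual)
    try:
        # Structural checks: one axes, expected labels, one scatter artist.
        assert len(fig.axes) == 1
        assert ax.get_xlabel() == "Code"
        assert ax.get_ylabel() == "Residual"
        assert len(ax.collections) == 1
        # Numeric check: the plotted points match the input data.
        offsets = ax.collections[0].get_offsets()
        np.testing.assert_allclose(offsets[:, 0], x)
        np.testing.assert_allclose(offsets[:, 1], residual)
    finally:
        plt.close(fig)  # avoid leaking figures across tests
```

Checks like these inspect the figure object directly, so they catch wrong data or missing labels that a "PNG file exists" assertion would let through.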

## Why this matters

- Unseeded tests can pass locally and fail in CI with a different random draw.
- Long sweeps slow down feedback and make failures harder to isolate.
- Print-only checks do not protect against regressions.
- Smoke-only plot assertions can pass even if the plotted data is wrong.

## Suggested fixes

- Replace global RNG calls with `np.random.default_rng(seed)` or a fixed-seed `RandomState` where reproducibility matters.
- Convert printed pass/fail thresholds into explicit assertions.
- Split long calibration sweeps into a small deterministic regression test plus optional stress/performance tests.
- Mark long exploratory sweeps with a dedicated marker such as `@pytest.mark.slow`.
- For plot tests, assert returned data shape, axis count, labels, plotted line/bar counts, or representative numeric values in addition to checking that output files exist.
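
The split between a fast regression test and an opt-in sweep could look roughly like the sketch below. The metric `quantization_snr_db`, the 2 dB tolerance, and the bit range are assumptions made for illustration, and the `slow` marker name follows the suggestion above rather than any existing project configuration.

```python
import numpy as np
import pytest


def quantization_snr_db(n_bits, n_samples=4096):
    """Hypothetical metric: SNR of an ideal uniform quantizer on a sine."""
    t = np.arange(n_samples)
    signal = np.sin(2 * np.pi * 127 * t / n_samples)  # coherent tone
    step = 2.0 / (2 ** n_bits)
    quantized = np.round(signal / step) * step
    noise = quantized - signal
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))


def test_snr_single_point():
    # Fast deterministic check that stays in the default suite.
    # Ideal SNR is about 6.02 * N + 1.76 dB; the margin is an assumption.
    assert quantization_snr_db(10) == pytest.approx(6.02 * 10 + 1.76, abs=2.0)


@pytest.mark.slow
@pytest.mark.parametrize("n_bits", range(4, 17))
def test_snr_sweep(n_bits):
    # Broad sweep, meant to be opt-in rather than part of every run.
    expected = 6.02 * n_bits + 1.76
    assert quantization_snr_db(n_bits) == pytest.approx(expected, abs=2.0)
```

With the marker registered under `markers` in the pytest configuration, the default run can deselect the sweep with `pytest -m "not slow"`, and a stress run can select it with `pytest -m slow`. Parametrizing the sweep also means each swept value reports as its own test case, which makes a failure easier to isolate.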

## Expected result

The default pytest suite should be deterministic, reasonably fast, and should fail only when a meaningful behavior contract is violated.

