Self-healing test automation framework for fragile frontends. When a selector breaks, PhoenixQA doesn't crash; it heals.
Frontend tests break constantly, not because the feature is broken, but because the page underneath it changed.
A class was renamed. A data-testid rotated. A wrapper <div> appeared around a button after a refactor. A component moved into a Shadow DOM boundary.
PhoenixQA intercepts those failures, feeds the context to an LLM, and either:
- proposes a fix for human review (Safe Mode, live)
- applies the fix automatically within a confidence/budget policy and continues (Autonomous Mode, live)
Every decision, human-reviewed or autonomous, is logged today, including which provider made the call, how many tokens it cost, and how long it took. Once Healing History (Sprint 7) lands, that log becomes the basis for a self-training loop that improves future healing (Sprint 8). The loop is not yet built, but the logging that feeds it already is.
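As a rough illustration of the kind of record described above, here is a minimal sketch of a JSON-lines decision log. The field names, the `HealingDecision` class and both helper functions are assumptions made for this example, not PhoenixQA's actual schema or API.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HealingDecision:
    # All field names are illustrative assumptions, not the real schema.
    mode: str                # "safe" or "autonomous"
    provider: str            # e.g. "ollama" or "anthropic"
    original_selector: str
    proposed_selector: str
    accepted: bool
    tokens_used: int
    duration_s: float
    timestamp: float = field(default_factory=time.time)


def append_decision(log_path: Path, decision: HealingDecision) -> None:
    """Append one decision as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(decision)) + "\n")


def read_decisions(log_path: Path) -> list[HealingDecision]:
    """Load every logged decision, skipping blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [HealingDecision(**json.loads(line)) for line in fh if line.strip()]
```

An append-only, one-record-per-line format keeps each write independent, so a crashed run cannot corrupt earlier entries.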
"Test fails even though the app is fine" has more than one root cause. Real-world experience (Salesforce Lightning, enterprise SPAs) shows the most common failure isn't actually a renamed selector. It's timing: an element gets detached from the DOM mid-action because the framework re-rendered, or a spinner disappears before the frontend has actually finished updating.
PhoenixQA classifies failures into four types, but builds them in phases rather than all at once:
| Failure type | Status |
|---|---|
| selector_not_found: classic renamed/rotated selector | 🚧 Building first (Sprint 2-5) |
| detached_from_dom: framework re-render mid-action | Required, not yet built |
| not_visible: element exists but hidden/blocked | Required, not yet built |
| timeout_waiting: never reaches an actionable state | Required, not yet built |
Why phase it: better to prove the full pipeline (collect → analyze → heal → validate) end-to-end on one well-understood failure type first, then extend to the others with real lessons learned, rather than build four shallow strategies at once. The other three are committed scope, not a "maybe later". See LEARNINGS.md for the full reasoning, or docs/gaps.md for a quick-scan status table of every open architectural question.
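To make the four failure types concrete, here is a minimal sketch of a classifier that maps an error message to one of them. The enum values come from the table above; the `classify_failure` function, the substrings it looks for, and the fallback to `timeout_waiting` are all assumptions for illustration and are not verified against real Playwright error wording.

```python
from enum import Enum


class FailureType(str, Enum):
    SELECTOR_NOT_FOUND = "selector_not_found"
    DETACHED_FROM_DOM = "detached_from_dom"
    NOT_VISIBLE = "not_visible"
    TIMEOUT_WAITING = "timeout_waiting"


# Hypothetical substrings; real error messages may be worded differently.
_PATTERNS = [
    (FailureType.DETACHED_FROM_DOM, ("detached", "not attached")),
    (FailureType.NOT_VISIBLE, ("not visible", "intercepts pointer events")),
    (FailureType.SELECTOR_NOT_FOUND, ("no element", "waiting for locator", "waiting for selector")),
]


def classify_failure(message: str) -> FailureType:
    """Return the first failure type whose substrings appear in the message."""
    text = message.lower()
    for failure_type, needles in _PATTERNS:
        if any(needle in text for needle in needles):
            return failure_type
    # Assumed fallback: anything unrecognized is treated as a plain timeout.
    return FailureType.TIMEOUT_WAITING
```

The pattern order matters in this sketch: the more specific timing-related causes are checked before the generic selector lookup.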
How this project approaches quality as a whole is laid out in docs/testing-strategy.md: not just unit tests on the framework's own code, but a layered strategy covering integration, end-to-end behavior against a real LLM, regression benchmarking of healing effectiveness, and non-functional resilience to malformed model output.
Test Failure
     │
     ▼
Context Collector: DOM snapshot, weighted semantic scoring, shadow DOM piercing
     │
     ▼
LLM Analyzer: Ollama (local) or Anthropic API → structured JSON proposal
     │
     ├──► Safe Mode: human reviews full context, accepts/rejects → Ground Truth
     │
     ├──► Autonomous Mode: confidence gate + budget check, auto-applies fix, retries
     │
     ▼
Healing History: SQLite log of all decisions (Sprint 7)
     │
     ▼
Self-Training: few-shot context for better future repairs (Sprint 8)
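The "structured JSON proposal" step in the pipeline above implies a parser that tolerates malformed model output. The following is a minimal sketch under stated assumptions: the `HealingProposal` fields (`new_selector`, `confidence`, `reasoning`) and the `parse_proposal` function are hypothetical and may not match the project's real response_parser.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealingProposal:
    # Hypothetical fields; the real proposal schema may differ.
    new_selector: str
    confidence: float
    reasoning: str = ""


def parse_proposal(raw: str) -> Optional[HealingProposal]:
    """Return a validated proposal, or None if the model output is unusable."""
    # Models often wrap JSON in prose, so take the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    selector = data.get("new_selector")
    confidence = data.get("confidence")
    if not isinstance(selector, str) or not selector.strip():
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return HealingProposal(selector.strip(), float(confidence), str(data.get("reasoning", "")))
```

Returning `None` instead of raising keeps the decision about what to do with a bad response in the caller, which can then count it against the budget or surface it as a rejection.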
Note: PhoenixQA recovers the ABILITY to perform an action after a failure. It does not judge whether the resulting behavior was business-correct (e.g. "did login actually succeed"). That judgment stays with the test's own assertions, same as it always has. See docs/gaps.md Gap #11 for why this boundary is deliberate.
Autonomous Mode raises one of three distinct exception types depending on why it didn't heal: HealingRejectedError (bad or low-confidence proposal), HealingLimitExceededError (budget exhausted), or HealingFailedError (provider/API crashed). A CI failure report therefore says exactly what happened, not just "healing didn't work."
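A minimal sketch of how these three exceptions could be declared and told apart in a report. The three class names come from the text above; the shared `HealingError` base class, the `gate_proposal` and `describe` helpers, and the 0.8 threshold are assumptions for illustration only.

```python
class HealingError(Exception):
    """Hypothetical common base; only the three subclasses are named in the README."""


class HealingRejectedError(HealingError):
    """The proposal was bad or below the confidence threshold."""


class HealingLimitExceededError(HealingError):
    """A budget (attempts, tokens, or time) was exhausted."""


class HealingFailedError(HealingError):
    """The provider or API call itself crashed."""


def gate_proposal(confidence: float, min_confidence: float = 0.8) -> None:
    """Confidence gate: raise if the proposal is not trustworthy enough.

    The 0.8 default is an assumed value, not the project's real policy.
    """
    if confidence < min_confidence:
        raise HealingRejectedError(
            f"confidence {confidence:.2f} below threshold {min_confidence:.2f}"
        )


def describe(exc: HealingError) -> str:
    """Turn an exception into a short, unambiguous line for a CI report."""
    if isinstance(exc, HealingRejectedError):
        return f"proposal rejected: {exc}"
    if isinstance(exc, HealingLimitExceededError):
        return f"budget exhausted: {exc}"
    if isinstance(exc, HealingFailedError):
        return f"provider failure: {exc}"
    return f"healing error: {exc}"
```

Separate types let a caller catch only the cases it can act on, for example retrying after a provider failure but never after a rejection.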
Chaos App isn't randomized weirdness: each level isolates one variable and answers a specific research question. This is closer to a controlled experiment than a typical "Playwright + sample app" portfolio repo.
| Level | Mechanisms (cumulative) | Research question |
|---|---|---|
| LOW | selector rotation | Does the test survive a selector rename? |
| MEDIUM | + DOM structure mutation | Does the test survive a UI refactor? |
| HIGH | + async delay | Does the test survive a refactor + timing issues? |
Shadow DOM is a separate, independent flag (SHADOW_DOM_ENABLED, set as VITE_SHADOW_DOM_ENABLED in chaos_app/.env), not a 4th level. It's a different kind of difficulty (structural DOM access), combinable with any level above (e.g. HIGH + Shadow DOM tests refactor + timing + structural access at once).
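The cumulative levels plus the independent Shadow DOM flag can be sketched as a small lookup. This is only an illustration in Python of the rule described above; the mechanism identifiers and the `active_mechanisms` function are hypothetical, and the Chaos App itself is a React/Vite project with its own implementation.

```python
# Each level includes everything from the level below it (cumulative).
LEVEL_MECHANISMS = {
    "LOW": ["selector_rotation"],
    "MEDIUM": ["selector_rotation", "dom_mutation"],
    "HIGH": ["selector_rotation", "dom_mutation", "async_delay"],
}


def active_mechanisms(level: str, shadow_dom_enabled: bool) -> list[str]:
    """Return the mechanisms that would be live for a given configuration."""
    key = level.strip().upper()
    if key not in LEVEL_MECHANISMS:
        raise ValueError(f"unknown chaos level: {level!r}")
    mechanisms = list(LEVEL_MECHANISMS[key])
    # Shadow DOM is orthogonal: it is added on top of any level.
    if shadow_dom_enabled:
        mechanisms.append("shadow_dom")
    return mechanisms
```

For example, `active_mechanisms("HIGH", True)` yields all four mechanisms, while `active_mechanisms("LOW", False)` yields selector rotation only.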
Mechanisms ranked by real-world realism (most enterprise frontends break this way most often):
| Mechanism | Realism | Why |
|---|---|---|
| DOM Mutation | 10/10 | UI library upgrades, wrapper changes, component migrations |
| Selector Rotation | 9/10 | Classic: renamed class/id/data-testid |
| Async Delay | 8/10 | Lazy loading, animations, network-dependent rendering |
| Shadow DOM | 5/10 | Real, but narrower: mostly Web Components / LWC-style platforms |
Controlling the chaos level:
# chaos_app/.env
VITE_CHAOS_LEVEL=HIGH # LOW | MEDIUM | HIGH
VITE_SHADOW_DOM_ENABLED=true # true | false, independent of level

Edit chaos_app/.env, then restart npm run dev. The "Active chaos config" panel at the top of the running app confirms which mechanisms are live, so no guessing is required.
End goal (Sprint 8, Healing Benchmark Runner): run the full suite at every level, comparing No Healer vs. Heuristic Healer vs. LLM Healer. The point is not just "it works," but "here's exactly how much the LLM adds over a cheap fuzzy-match baseline, and where."
| Chaos Level | No Healer | Heuristic Healer | LLM Healer |
|---|---|---|---|
| LOW | ~72% | ?% | ~98% |
| MEDIUM | ~51% | ?% | ~95% |
| HIGH | ~29% | ?% | ~90% |
The middle column is the actual experiment: a simple fuzzy/Levenshtein selector matcher (zero LLM calls, same provider interface as Anthropic/Ollama) might already solve a surprising fraction of cases. Without this baseline, "90% healed" doesn't prove the LLM was necessary.
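A minimal sketch of what such a zero-LLM baseline could look like. Note the substitution: it uses the standard library's `difflib.SequenceMatcher` similarity ratio rather than true Levenshtein distance, and the `best_fuzzy_match` function and its 0.6 threshold are assumptions for illustration, not the planned Heuristic Healer.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_fuzzy_match(
    broken_selector: str,
    candidates: list[str],
    min_ratio: float = 0.6,  # assumed threshold, would need tuning on real data
) -> Optional[str]:
    """Return the candidate most similar to the broken selector, if any is close enough."""
    best: Optional[str] = None
    best_ratio = 0.0
    for candidate in candidates:
        ratio = SequenceMatcher(None, broken_selector, candidate).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best if best_ratio >= min_ratio else None
```

A usage example: given the broken selector `[data-testid='login-btn']` and a page that now exposes `[data-testid='login-button']`, the matcher picks the renamed selector over unrelated ones, and returns `None` when nothing on the page is similar enough.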
Autonomous Mode has hard stop conditions from day one (max_attempts_total, token budget, max_time_per_heal): no infinite LLM retry loops in CI, by design rather than as a later hardening pass. Budget is tracked in tokens and elapsed time, never in currency, because model pricing changes over time and token counts don't. See docs/architecture-decisions.md for the full reasoning.
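The stop conditions above can be sketched as a small budget tracker. The names `max_attempts_total` and `max_time_per_heal` and the exception name come from the README; the `HealingBudget` class, the `max_tokens_total` name, the method names and every default value are assumptions for illustration.

```python
from dataclasses import dataclass


class HealingLimitExceededError(Exception):
    """Raised when a stop condition is hit (exception name taken from the README)."""


@dataclass
class HealingBudget:
    max_attempts_total: int = 5       # assumed default
    max_tokens_total: int = 20_000    # assumed name and default
    max_time_per_heal: float = 30.0   # seconds, assumed default
    attempts_used: int = 0
    tokens_used: int = 0

    def check_before_attempt(self) -> None:
        """Call before each LLM request; raises if no budget is left."""
        if self.attempts_used >= self.max_attempts_total:
            raise HealingLimitExceededError("attempt limit reached")
        if self.tokens_used >= self.max_tokens_total:
            raise HealingLimitExceededError("token budget exhausted")

    def record_attempt(self, tokens: int, elapsed_s: float) -> None:
        """Call after each heal; accounts for usage and enforces the time limit."""
        self.attempts_used += 1
        self.tokens_used += tokens
        if elapsed_s > self.max_time_per_heal:
            raise HealingLimitExceededError("heal exceeded max_time_per_heal")
```

Checking before each attempt, rather than after, is what guarantees that a run can never start an LLM call it has no budget for.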
PhoenixQA/
├── LEARNINGS.md # chronological journal: problem → analysis → decision → test → conclusion
├── docs/ # thematic indexes (fast lookup by topic, not by sprint)
│   ├── gaps.md # all numbered architectural gaps, status at a glance
│   ├── architecture-decisions.md
│   ├── known-limitations.md
│   ├── future-ideas.md
│   └── testing-strategy.md # unit/integration/e2e/regression-benchmark/non-functional plan + actual state
├── chaos_app/ # React/Vite, intentionally unstable test target
│   └── src/chaos/ # selectorRotation, domMutation, asyncDelay, shadowDom
├── phoenix/
│   ├── collector/ # failure_classifier, context_collector (weighted semantic scoring)
│   ├── healing/ # healer, safe_mode, decision_logger, autonomous_mode
│   ├── ai/ # base_provider, ollama_provider, anthropic_provider,
│   │   # prompt_templates, response_parser, provider_factory
│   ├── training/ # Healing history (Sprint 7)
│   └── reporting/ # Allure Phoenix Healing Report (Sprint 9)
├── pages/ # Page Objects for Chaos App (POM pattern)
├── tests/
│   ├── chaos/ # tests running against Chaos App
│   ├── unit/ # tokenizer, classifier, parser, logger, healer tests
│   └── integration/
└── config/
| Provider | When to use |
|---|---|
| ollama | Air-gapped / NDA environments, local LLM |
| anthropic | Cloud projects, best quality healing suggestions |
Switch providers via a single environment variable; no code changes are needed.
| Sprint | Focus | Status |
|---|---|---|
| Sprint 0 | Repo scaffold, config, AI provider stubs | ✅ Done |
| Sprint 1 | Chaos App: React/Vite, selector rotation, DOM mutation, async delay, Shadow DOM | ✅ Done |
| Sprint 2 | Context Collector: selector_not_found only (DOM snapshot, weighted scoring) | ✅ Done |
| Sprint 3 | LLM Analyzer: prompt engineering, structured JSON response, confidence score | ✅ Done |
| Sprint 4 | Safe Mode: human-in-the-loop terminal review, JSON-lines decision log | ✅ Done |
| Sprint 5 | Autonomous Mode: stop conditions (attempts/tokens/time budget), confidence policy gate, distinct exception types | ✅ Done |
| Sprint 6 | Failure type expansion: detached_from_dom, not_visible, timeout_waiting | ⏳ Planned |
| Sprint 7 | Healing History: SQLite store, decision log, healing correctness definition | ⏳ Planned |
| Sprint 8 | Healing Benchmark Runner: heuristic provider baseline, few-shot self-training, Safe vs Auto metrics | ⏳ Planned |
| Sprint 9 | Allure Phoenix Report, CI/CD, demo GIF | ⏳ Planned |
# 1. Clone
git clone https://github.com/MarcinMikula/PhoenixQA.git
cd PhoenixQA
# 2. Install Python deps
pip install -r requirements.txt
playwright install chromium
# 3. Configure
cp .env.example .env
# Edit .env: choose AI provider, chaos level
# 4. Run the Chaos App (test target)
cd chaos_app
npm install
cp .env.example .env
npm run dev
# → http://localhost:5173
# 5. In a SEPARATE terminal (npm run dev keeps step 4's terminal busy):
cd ..
# Run tests against it. Both Safe Mode (Sprint 4) and Autonomous Mode
# (Sprint 5) are live. Switch via .env: HEALING_MODE=safe | autonomous
#
# -s is REQUIRED for Safe Mode: it prompts for accept/reject via input(),
# and pytest swallows stdin/stdout without -s, so the prompt never
# reaches the terminal and the run just hangs with no explanation.
# Autonomous Mode doesn't need -s (no prompts), but it doesn't hurt either.
pytest tests/chaos/ -m chaos -s

Healing is confirmed working live as of Sprint 5: Safe Mode and Autonomous Mode have both been run end-to-end against the real Chaos App and a local LLM, with selectors successfully healed and retried in place.
The actual demo artifact, though, is parked until Sprint 9. Rather than a pile of terminal screenshots, the plan is a single Allure Healing Dashboard (success rate, healing timeline, confidence distribution, top repaired selectors, failure reasons, budget usage, provider comparison), built once Sprints 6-8 (failure type expansion, healing history, benchmark runner) produce real data for it to render. See docs/future-ideas.md for the reasoning.
PhoenixQA is one piece of a larger AI-powered QA toolkit:
| Repo | Role |
|---|---|
| qa-automation-framework | POM/SOM skeleton; PhoenixQA heals its selectors |
| defect-pilot | AI bug reproduction & retest agent |
| llm-qa-toolkit | LLM-as-judge test framework for AI chatbots |
MIT