🔥 PhoenixQA

Self-healing test automation framework for fragile frontends. When a selector breaks, PhoenixQA doesn't crash — it heals.

🧠 What is this?

Frontend tests break constantly — not because the feature is broken, but because the page underneath it changed. A class was renamed. A data-testid rotated. A wrapper <div> appeared around a button after a refactor. A component moved into a Shadow DOM boundary.

PhoenixQA intercepts those failures, feeds the context to an LLM, and either:

- proposes a fix for human review (Safe Mode, live), or
- applies the fix automatically within a confidence/budget policy and continues (Autonomous Mode, live).

Every decision — human-reviewed or autonomous — is logged today, including which provider made the call, how many tokens it cost, and how long it took. Once Healing History (Sprint 7) lands, that log becomes the basis for a self-training loop that improves future healing (Sprint 8) — not yet built, but the logging that feeds it already is.
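As a rough illustration of what such a decision log could look like, here is a minimal sketch that appends one JSON object per line (the JSON-lines format named in the roadmap). The `HealingDecision` fields, `log_decision`, and `read_decisions` are hypothetical names chosen for this example, not the project's actual schema.

```python
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class HealingDecision:
    """One logged healing decision (field names are illustrative assumptions)."""
    mode: str                # "safe" or "autonomous"
    provider: str            # e.g. "ollama" or "anthropic"
    original_selector: str
    proposed_selector: str
    accepted: bool
    tokens_used: int
    latency_seconds: float


def log_decision(decision: HealingDecision, path: Path) -> None:
    """Append the decision as a single JSON line."""
    record = asdict(decision)
    record["logged_at"] = time.time()  # Unix timestamp of the log write
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_decisions(path: Path) -> list[dict]:
    """Read every logged decision back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

An append-only file like this stays readable with plain text tools and could later be bulk-loaded into a database table without changing what gets recorded.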

Scope: where this starts, and where it's going

"Test fails even though the app is fine" has more than one root cause. Real-world experience (Salesforce Lightning, enterprise SPAs) shows the most common failure isn't actually a renamed selector — it's timing: an element gets detached from the DOM mid-action because the framework re-rendered, or a spinner disappears before the frontend has actually finished updating.

PhoenixQA classifies failures into four types, but builds them in phases rather than all at once:

| Failure type | Status |
| --- | --- |
| `selector_not_found`: classic renamed/rotated selector | ✅ Built first (Sprint 2-5) |
| `detached_from_dom`: framework re-render mid-action | 🔜 Required, not yet built |
| `not_visible`: element exists but hidden/blocked | 🔜 Required, not yet built |
| `timeout_waiting`: never reaches an actionable state | 🔜 Required, not yet built |
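The four failure types above could be modeled as an enum with a small classifier in front of it. This is only a sketch: the message fragments below are hypothetical placeholders, since real browser-driver error text varies by tool and version, and `classify_failure` is an illustrative name rather than the project's actual function.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    SELECTOR_NOT_FOUND = "selector_not_found"
    DETACHED_FROM_DOM = "detached_from_dom"
    NOT_VISIBLE = "not_visible"
    TIMEOUT_WAITING = "timeout_waiting"


# Assumed message fragments, checked in order from most to least specific.
# A bare timeout is checked last because it is the least informative signal.
_PATTERNS = [
    (FailureType.DETACHED_FROM_DOM, ("detached", "not attached")),
    (FailureType.NOT_VISIBLE, ("not visible", "intercepts pointer events")),
    (FailureType.SELECTOR_NOT_FOUND, ("no element", "not found")),
    (FailureType.TIMEOUT_WAITING, ("timeout",)),
]


def classify_failure(message: str) -> Optional[FailureType]:
    """Map an error message to a failure type, or None if unrecognized."""
    text = message.lower()
    for failure_type, fragments in _PATTERNS:
        if any(fragment in text for fragment in fragments):
            return failure_type
    return None
```

Returning `None` for unrecognized errors keeps genuine test failures (for example a failed assertion) out of the healing pipeline.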

Why phase it: it is better to prove the full pipeline (collect → analyze → heal → validate) end-to-end on one well-understood failure type first, then extend to the others with real lessons learned, than to build four shallow strategies at once. The other three are committed scope, not a "maybe later". See LEARNINGS.md for the full reasoning, or docs/gaps.md for a quick-scan status table of every open architectural question.

docs/testing-strategy.md lays out how this project approaches quality as a whole: not just unit tests on the framework's own code, but a layered strategy covering integration, end-to-end behavior against a real LLM, regression benchmarking of healing effectiveness, and non-functional resilience to malformed model output.

🏗️ Architecture

Test Failure
    │
    ▼
Context Collector        ← DOM snapshot, weighted semantic scoring, shadow DOM piercing
    │
    ▼
LLM Analyzer             ← Ollama (local) or Anthropic API → structured JSON proposal
    │
    ├──► Safe Mode        ← Human reviews full context, accepts/rejects → Ground Truth
    │
    └──► Autonomous Mode  ← Confidence gate + budget check, auto-applies fix, retries
              │
              ▼
        Healing History   ← SQLite log of all decisions (Sprint 7)
              │
              ▼
        Self-Training     ← Few-shot context for better future repairs (Sprint 8)
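The LLM Analyzer step in the diagram above returns a structured JSON proposal, and model output is not guaranteed to be well formed. Below is a minimal sketch of defensive parsing. The field names `selector`, `confidence`, and `reasoning`, and the `parse_proposal` helper itself, are assumptions for illustration; the real schema lives in the project's prompt templates and response parser.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealingProposal:
    selector: str
    confidence: float
    reasoning: str


def parse_proposal(raw: str) -> Optional[HealingProposal]:
    """Return a validated proposal, or None when the output is unusable."""
    # Models often wrap JSON in prose or code fences, so cut from the
    # first opening brace to the last closing brace.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    selector = data.get("selector")
    confidence = data.get("confidence")
    if not isinstance(selector, str) or not selector.strip():
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return HealingProposal(
        selector=selector.strip(),
        confidence=float(confidence),
        reasoning=str(data.get("reasoning", "")),
    )
```

Treating every malformed response as "no proposal" lets the caller handle bad output the same way it handles a low-confidence one, instead of crashing mid-test.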

Note: PhoenixQA recovers the ABILITY to perform an action after a
failure — it does not judge whether the resulting behavior was
business-correct (e.g. "did login actually succeed"). That judgment
stays with the test's own assertions, same as it always has. See
docs/gaps.md Gap #11 for why this boundary is deliberate.

Autonomous Mode raises one of three distinct exception types depending on why it didn't heal — HealingRejectedError (bad/low-confidence proposal), HealingLimitExceededError (budget exhausted), HealingFailedError (provider/API crashed) — so a CI failure report says exactly what happened, not just "healing didn't work."
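To make the three outcomes concrete, here is a minimal sketch of how they might be raised. The exception names come from the paragraph above; the shared `HealingError` base class, the `heal` function, its parameters, and the `0.8` threshold are assumptions for illustration only.

```python
class HealingError(Exception):
    """Assumed common base so callers can catch any healing failure at once."""


class HealingRejectedError(HealingError):
    """The proposal was unusable or below the confidence threshold."""


class HealingLimitExceededError(HealingError):
    """The attempt, token, or time budget was exhausted."""


class HealingFailedError(HealingError):
    """The provider or API call itself crashed."""


def heal(ask_provider, min_confidence=0.8, attempts_left=1):
    """Return a healed selector, or raise the exception that explains why not.

    ask_provider is a hypothetical callable returning (selector, confidence).
    """
    if attempts_left <= 0:
        raise HealingLimitExceededError("no healing attempts left")
    try:
        selector, confidence = ask_provider()
    except Exception as exc:
        # Keep the original error chained for the CI report.
        raise HealingFailedError(f"provider call failed: {exc}") from exc
    if not selector:
        raise HealingRejectedError("provider returned no selector")
    if confidence < min_confidence:
        raise HealingRejectedError(
            f"confidence {confidence:.2f} below threshold {min_confidence:.2f}"
        )
    return selector
```

With distinct types, a test runner can report "budget exhausted" separately from "provider down", which usually call for different follow-ups.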

🧪 Chaos Levels — a benchmark, not just randomness

Chaos App isn't randomized weirdness — each level isolates one variable and answers a specific research question. This is closer to a controlled experiment than a typical "Playwright + sample app" portfolio repo.

| Level | Mechanisms (cumulative) | Research question |
| --- | --- | --- |
| LOW | selector rotation | Does the test survive a selector rename? |
| MEDIUM | + DOM structure mutation | Does the test survive a UI refactor? |
| HIGH | + async delay | Does the test survive a refactor + timing issues? |

Shadow DOM is a separate, independent flag (SHADOW_DOM_ENABLED), not a 4th level — it's a different kind of difficulty (structural DOM access), combinable with any level above (e.g. HIGH + Shadow DOM tests refactor + timing + structural access at once).
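The cumulative levels plus the independent Shadow DOM flag can be modeled in a few lines. This Python sketch only illustrates the semantics described above; the Chaos App itself is a React/Vite project, and `active_mechanisms` is a hypothetical helper. The mechanism names mirror the files under `chaos_app/src/chaos/`.

```python
# Levels in increasing order; each level adds one mechanism to those below it.
_LEVEL_ORDER = ["LOW", "MEDIUM", "HIGH"]
_LEVEL_ADDS = {
    "LOW": "selectorRotation",
    "MEDIUM": "domMutation",
    "HIGH": "asyncDelay",
}


def active_mechanisms(level: str, shadow_dom_enabled: bool = False) -> list[str]:
    """Return the mechanisms live for a chaos level and Shadow DOM flag."""
    normalized = level.strip().upper()
    if normalized not in _LEVEL_ORDER:
        raise ValueError(f"unknown chaos level: {level!r}")
    cutoff = _LEVEL_ORDER.index(normalized) + 1
    mechanisms = [_LEVEL_ADDS[name] for name in _LEVEL_ORDER[:cutoff]]
    # Shadow DOM is orthogonal to the level, so it is appended separately.
    if shadow_dom_enabled:
        mechanisms.append("shadowDom")
    return mechanisms
```

Keeping the flag outside the level ordering is what makes combinations such as LOW + Shadow DOM expressible without inventing extra levels.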

Mechanisms ranked by realism, that is, by how often enterprise frontends break this way in practice:

| Mechanism | Realism | Why |
| --- | --- | --- |
| DOM Mutation | 10/10 | UI library upgrades, wrapper changes, component migrations |
| Selector Rotation | 9/10 | Classic: renamed class/id/data-testid |
| Async Delay | 8/10 | Lazy loading, animations, network-dependent rendering |
| Shadow DOM | 5/10 | Real, but narrower: mostly Web Components / LWC-style platforms |

Controlling the chaos level:

# chaos_app/.env
VITE_CHAOS_LEVEL=HIGH            # LOW | MEDIUM | HIGH
VITE_SHADOW_DOM_ENABLED=true     # true | false — independent of level

Edit chaos_app/.env, then restart npm run dev. The "Active chaos config" panel at the top of the running app confirms which mechanisms are live — no guessing required.

End goal (Sprint 8 — Healing Benchmark Runner): run the full suite at every level, comparing No Healer vs. Heuristic Healer vs. LLM Healer — not just "it works," but "here's exactly how much the LLM adds over a cheap fuzzy-match baseline, and where."

| Chaos Level | No Healer | Heuristic Healer | LLM Healer |
| --- | --- | --- | --- |
| LOW | ~72% | ?% | ~98% |
| MEDIUM | ~51% | ?% | ~95% |
| HIGH | ~29% | ?% | ~90% |

The middle column is the actual experiment — a simple fuzzy/Levenshtein selector matcher (zero LLM calls, same provider interface as Anthropic/Ollama) might already solve a surprising fraction of cases. Without this baseline, "90% healed" doesn't prove the LLM was necessary.
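A baseline of this kind could be as small as the sketch below. Note the swap: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a true Levenshtein distance, and `heuristic_heal`, `similarity`, and the `0.6` cutoff are hypothetical choices for illustration, not the project's implementation.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (SequenceMatcher ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def heuristic_heal(
    broken_selector: str,
    candidates: list[str],
    min_score: float = 0.6,
) -> Optional[str]:
    """Pick the candidate most similar to the broken selector.

    Returns None when nothing scores at least min_score, so a weak guess
    is never applied silently.
    """
    best_candidate, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(broken_selector, candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate if best_score >= min_score else None
```

For example, `heuristic_heal("#login-button", ["#login-btn", "#logout-link", ".footer"])` picks `"#login-btn"`. Cases like this, where a rename keeps most of the characters, are where a cheap matcher might rival an LLM; structural changes are where it would be expected to fall short.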

Autonomous Mode has hard stop conditions from day one (max_attempts_total, token budget, max_time_per_heal) — no infinite LLM retry loops in CI, by design, not as a later hardening pass. Budget is tracked in tokens and elapsed time, never in currency — model pricing changes over time, token counts don't. See docs/architecture-decisions.md for the full reasoning.
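A budget tracker along these lines might look like the following sketch. `max_attempts_total` and `max_time_per_heal` are the names used above; `max_tokens_total`, the default values, and the method names are assumptions made for this example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealingBudget:
    """Run-wide limits expressed in attempts, tokens, and seconds."""
    max_attempts_total: int = 5       # assumed default
    max_tokens_total: int = 20_000    # hypothetical name for the token budget
    max_time_per_heal: float = 30.0   # seconds, assumed default
    attempts_used: int = 0
    tokens_used: int = 0

    def check(self) -> Optional[str]:
        """Return why healing must stop, or None if another heal may start."""
        if self.attempts_used >= self.max_attempts_total:
            return "max_attempts_total reached"
        if self.tokens_used >= self.max_tokens_total:
            return "token budget spent"
        return None

    def record(
        self,
        tokens: int,
        started_at: float,
        now: Optional[float] = None,
    ) -> Optional[str]:
        """Book one finished attempt and report any limit it crossed.

        started_at and now are time.monotonic() readings; now defaults
        to the current monotonic clock.
        """
        self.attempts_used += 1
        self.tokens_used += tokens
        elapsed = (time.monotonic() if now is None else now) - started_at
        if elapsed > self.max_time_per_heal:
            return "max_time_per_heal exceeded"
        return self.check()
```

A caller would check the budget before each heal and stop (for instance by raising `HealingLimitExceededError`) as soon as a reason string comes back. Using a monotonic clock avoids surprises if the system clock is adjusted during a CI run.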

PhoenixQA/
├── LEARNINGS.md             # chronological journal — problem → analysis → decision → test → conclusion
├── docs/                    # thematic indexes (fast lookup by topic, not by sprint)
│   ├── gaps.md              # all numbered architectural gaps, status at a glance
│   ├── architecture-decisions.md
│   ├── known-limitations.md
│   ├── future-ideas.md
│   └── testing-strategy.md  # unit/integration/e2e/regression-benchmark/non-functional plan + actual state
├── chaos_app/                # React/Vite — intentionally unstable test target
│   └── src/chaos/            # selectorRotation, domMutation, asyncDelay, shadowDom
├── phoenix/
│   ├── collector/            # failure_classifier, context_collector (weighted semantic scoring)
│   ├── healing/              # healer, safe_mode, decision_logger, autonomous_mode
│   ├── ai/                   # base_provider, ollama_provider, anthropic_provider,
│   │                         # prompt_templates, response_parser, provider_factory
│   ├── training/             # Healing history (Sprint 7)
│   └── reporting/            # Allure Phoenix Healing Report (Sprint 9)
├── pages/                    # Page Objects for Chaos App (POM pattern)
├── tests/
│   ├── chaos/                # tests running against Chaos App
│   ├── unit/                 # tokenizer, classifier, parser, logger, healer tests
│   └── integration/
└── config/

🔒 Privacy-first AI design

| Provider | When to use |
| --- | --- |
| `ollama` | Air-gapped / NDA environments, local LLM |
| `anthropic` | Cloud projects, best quality healing suggestions |

Switch providers via a single env variable, with no code changes.
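One common way to get this kind of switch is a small factory keyed on an environment variable. The sketch below is an assumption-heavy illustration: the variable name `AI_PROVIDER`, the `propose_fix` method, the stub classes, and the `ollama` default are all hypothetical; check `.env.example` for the real variable name.

```python
import os
from abc import ABC, abstractmethod
from typing import Mapping, Optional


class BaseProvider(ABC):
    """Shared interface so callers never depend on a concrete provider."""

    @abstractmethod
    def propose_fix(self, context: str) -> str:
        """Return the raw model response for a failure context."""


class OllamaProvider(BaseProvider):
    def propose_fix(self, context: str) -> str:
        return "stub response from local model"  # placeholder, no real call


class AnthropicProvider(BaseProvider):
    def propose_fix(self, context: str) -> str:
        return "stub response from cloud model"  # placeholder, no real call


_PROVIDERS = {"ollama": OllamaProvider, "anthropic": AnthropicProvider}


def create_provider(env: Optional[Mapping[str, str]] = None) -> BaseProvider:
    """Build the provider named by AI_PROVIDER (hypothetical variable name)."""
    source = os.environ if env is None else env
    # Defaulting to the local provider is an assumption made for this sketch.
    name = source.get("AI_PROVIDER", "ollama").strip().lower()
    if name not in _PROVIDERS:
        raise ValueError(f"unknown AI provider: {name!r}")
    return _PROVIDERS[name]()
```

Because the rest of the code only sees `BaseProvider`, a zero-LLM heuristic baseline could be registered under the same interface later.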

🗺️ Roadmap

| Sprint | Focus | Status |
| --- | --- | --- |
| Sprint 0 | Repo scaffold, config, AI provider stubs | ✅ Done |
| Sprint 1 | Chaos App: React/Vite, selector rotation, DOM mutation, async delay, Shadow DOM | ✅ Done |
| Sprint 2 | Context Collector: `selector_not_found` only (DOM snapshot, weighted scoring) | ✅ Done |
| Sprint 3 | LLM Analyzer: prompt engineering, structured JSON response, confidence score | ✅ Done |
| Sprint 4 | Safe Mode: human-in-the-loop terminal review, JSON-lines decision log | ✅ Done |
| Sprint 5 | Autonomous Mode: stop conditions (attempts/tokens/time budget), confidence policy gate, distinct exception types | ✅ Done |
| Sprint 6 | Failure type expansion: `detached_from_dom`, `not_visible`, `timeout_waiting` | ⏳ Planned |
| Sprint 7 | Healing History: SQLite store, decision log, healing correctness definition | ⏳ Planned |
| Sprint 8 | Healing Benchmark Runner: heuristic provider baseline, few-shot self-training, Safe vs Auto metrics | ⏳ Planned |
| Sprint 9 | Allure Phoenix Report, CI/CD, demo GIF | ⏳ Planned |

🚀 Quickstart

# 1. Clone
git clone https://github.com/MarcinMikula/PhoenixQA.git
cd PhoenixQA

# 2. Install Python deps
pip install -r requirements.txt
playwright install chromium

# 3. Configure
cp .env.example .env
# Edit .env — choose AI provider, chaos level

# 4. Run the Chaos App (test target)
cd chaos_app
npm install
cp .env.example .env
npm run dev
# → http://localhost:5173

# 5. In a SEPARATE terminal (npm run dev keeps step 4's terminal busy):
cd ..
# Run tests against it — both Safe Mode (Sprint 4) and Autonomous Mode
# (Sprint 5) are live. Switch via .env: HEALING_MODE=safe | autonomous
#
# -s is REQUIRED for Safe Mode: it prompts for accept/reject via input(),
# and pytest swallows stdin/stdout without -s — the prompt never
# reaches the terminal and the run just hangs with no explanation.
# Autonomous Mode doesn't need -s (no prompts), but it doesn't hurt either.
pytest tests/chaos/ -m chaos -s

🎬 Demo

Healing is confirmed working live as of Sprint 5 — Safe Mode and Autonomous Mode have both been run end-to-end against the real Chaos App and a local LLM, with selectors successfully healed and retried in place.

The actual demo artifact, though, is parked until Sprint 9: rather than a pile of terminal screenshots, the plan is a single Allure Healing Dashboard (success rate, healing timeline, confidence distribution, top repaired selectors, failure reasons, budget usage, provider comparison) — built once Sprint 6-8 (failure type expansion, healing history, benchmark runner) produce real data for it to render. See docs/future-ideas.md for the reasoning.

🤝 Part of the QA Ecosystem

PhoenixQA is one piece of a larger AI-powered QA toolkit:

| Repo | Role |
| --- | --- |
| qa-automation-framework | POM/SOM skeleton; PhoenixQA heals its selectors |
| defect-pilot | AI bug reproduction & retest agent |
| llm-qa-toolkit | LLM-as-judge test framework for AI chatbots |

📄 License

MIT
