MarcinMikula/PhoenixQA


🔥 PhoenixQA

Self-healing test automation framework for fragile frontends. When a selector breaks, PhoenixQA doesn't crash: it heals.



🧠 What is this?

Frontend tests break constantly, not because the feature is broken, but because the page underneath it changed. A class was renamed. A data-testid rotated. A wrapper <div> appeared around a button after a refactor. A component moved into a Shadow DOM boundary.

PhoenixQA intercepts those failures, feeds the context to an LLM, and either:

  • proposes a fix for human review (Safe Mode, live)
  • applies the fix automatically within a confidence/budget policy and continues (Autonomous Mode, live)

Every decision, human-reviewed or autonomous, is logged today, including which provider made the call, how many tokens it cost, and how long it took. Healing History (Sprint 7) and the self-training loop that will use it to improve future healing (Sprint 8) are not built yet, but the logging that will feed them already is.
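As a rough illustration of the decision log described above, here is a minimal sketch of an append-only JSON-lines log. The record fields, the `HealingDecision` class, and the function names are assumptions made for this example; the project's actual decision_logger may store different data.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class HealingDecision:
    """One logged decision. All field names are illustrative assumptions."""
    mode: str            # "safe" or "autonomous"
    provider: str        # e.g. "ollama" or "anthropic"
    old_selector: str
    new_selector: str
    accepted: bool
    tokens_used: int
    latency_ms: float


def log_decision(path: Path, decision: HealingDecision) -> None:
    """Append one decision as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(decision)) + "\n")


def read_decisions(path: Path) -> list[dict]:
    """Read the log back, one dict per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only file keeps every decision, accepted or rejected, which is what a later history store or few-shot selection step would need as input.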

Scope: where this starts, and where it's going

"Test fails even though the app is fine" has more than one root cause. Real-world experience (Salesforce Lightning, enterprise SPAs) shows the most common failure isn't actually a renamed selector. It's timing: an element gets detached from the DOM mid-action because the framework re-rendered, or a spinner disappears before the frontend has actually finished updating.

PhoenixQA classifies failures into four types, but builds them in phases rather than all at once:

| Failure type | Status |
| --- | --- |
| selector_not_found: classic renamed/rotated selector | ✅ Built first (Sprints 2-5) |
| detached_from_dom: framework re-render mid-action | 🔜 Required, not yet built |
| not_visible: element exists but hidden/blocked | 🔜 Required, not yet built |
| timeout_waiting: never reaches an actionable state | 🔜 Required, not yet built |
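The four failure types in the table can be pictured as a small classifier over the error message. The sketch below is only an illustration: the message fragments are hypothetical and not taken from Playwright, whose error text varies by version, and the project's own failure_classifier may use entirely different signals.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    SELECTOR_NOT_FOUND = "selector_not_found"
    DETACHED_FROM_DOM = "detached_from_dom"
    NOT_VISIBLE = "not_visible"
    TIMEOUT_WAITING = "timeout_waiting"


# Hypothetical fragments, ordered from most to least specific.
# Check them against real failure output before relying on them.
_PATTERNS = [
    ("not attached to the dom", FailureType.DETACHED_FROM_DOM),
    ("not visible", FailureType.NOT_VISIBLE),
    ("resolved to 0 elements", FailureType.SELECTOR_NOT_FOUND),
    ("timeout", FailureType.TIMEOUT_WAITING),
]


def classify_failure(message: str) -> Optional[FailureType]:
    """Return the first matching failure type, or None if nothing matches."""
    lowered = message.lower()
    for fragment, failure_type in _PATTERNS:
        if fragment in lowered:
            return failure_type
    return None
```

The generic timeout pattern is checked last because a more specific cause, when the message mentions one, is the more useful classification.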

Why phase it: it is better to prove the full pipeline (collect → analyze → heal → validate) end-to-end on one well-understood failure type first, then extend to the others with real lessons learned, than to build four shallow strategies at once. The other three are committed scope, not a "maybe later". See LEARNINGS.md for the full reasoning, or docs/gaps.md for a quick-scan status table of every open architectural question.

docs/testing-strategy.md lays out how this project approaches quality as a whole: not just unit tests on the framework's own code, but a layered strategy covering integration, end-to-end behavior against a real LLM, regression benchmarking of healing effectiveness, and non-functional resilience to malformed model output.


🏗️ Architecture

Test Failure
    │
    ▼
Context Collector        ← DOM snapshot, weighted semantic scoring, shadow DOM piercing
    │
    ▼
LLM Analyzer             ← Ollama (local) or Anthropic API → structured JSON proposal
    │
    ├──► Safe Mode        ← Human reviews full context, accepts/rejects → Ground Truth
    │
    └──► Autonomous Mode  ← Confidence gate + budget check, auto-applies fix, retries
              │
              ▼
        Healing History   ← SQLite log of all decisions (Sprint 7)
              │
              ▼
        Self-Training     ← Few-shot context for better future repairs (Sprint 8)

Note: PhoenixQA recovers the ABILITY to perform an action after a
failure. It does not judge whether the resulting behavior was
business-correct (e.g. "did login actually succeed"). That judgment
stays with the test's own assertions, same as it always has. See
docs/gaps.md Gap #11 for why this boundary is deliberate.
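The "structured JSON proposal" step in the diagram, together with the goal of staying resilient to malformed model output, can be sketched as a defensive parser. The JSON keys (`new_selector`, `confidence`, `reasoning`) and the `HealingProposal` type are assumptions for illustration; the real response_parser may expect a different schema.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealingProposal:
    new_selector: str
    confidence: float  # assumed to be in the range 0.0 to 1.0
    reasoning: str


def parse_proposal(raw: str) -> Optional[HealingProposal]:
    """Extract a proposal from raw model output.

    Returns None instead of raising when the output is malformed, so the
    caller can treat it as a rejected proposal.
    """
    # Models often wrap JSON in prose; take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    selector = data.get("new_selector")
    confidence = data.get("confidence")
    if not isinstance(selector, str) or not selector.strip():
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return HealingProposal(selector.strip(), float(confidence),
                           str(data.get("reasoning", "")))
```

Returning None for every malformed case keeps the decision about what to do next (retry, reject, escalate) in one place in the caller.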

Autonomous Mode raises one of three distinct exception types depending on why it didn't heal: HealingRejectedError (bad or low-confidence proposal), HealingLimitExceededError (budget exhausted), or HealingFailedError (provider/API crashed). A CI failure report therefore says exactly what happened, not just "healing didn't work."
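A minimal sketch of how those three exception types could map onto the reasons a heal did not happen. The three class names come from the text above; the shared `HealingError` base, the `heal` function, and its parameters are hypothetical and exist only for this example.

```python
class HealingError(Exception):
    """Hypothetical common base; the README names only the three below."""


class HealingRejectedError(HealingError):
    """The proposal was malformed or below the confidence threshold."""


class HealingLimitExceededError(HealingError):
    """An attempt, token, or time budget was exhausted."""


class HealingFailedError(HealingError):
    """The provider or API call itself crashed."""


def heal(call_provider, min_confidence: float, attempts_left: int) -> str:
    """Return a healed selector or raise one of the three error types.

    `call_provider` is a zero-argument callable returning
    (selector, confidence); it stands in for the real LLM call.
    """
    if attempts_left <= 0:
        raise HealingLimitExceededError("no healing attempts left")
    try:
        selector, confidence = call_provider()
    except Exception as exc:
        raise HealingFailedError(f"provider call failed: {exc}") from exc
    if confidence < min_confidence:
        raise HealingRejectedError(
            f"confidence {confidence:.2f} below threshold {min_confidence:.2f}"
        )
    return selector
```

Because each cause has its own type, a CI report can branch on the exception class instead of parsing message strings.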


🧪 Chaos Levels: a benchmark, not just randomness

Chaos App isn't randomized weirdness: each level adds one variable on top of the previous level and answers a specific research question. This is closer to a controlled experiment than a typical "Playwright + sample app" portfolio repo.

| Level | Mechanisms (cumulative) | Research question |
| --- | --- | --- |
| LOW | selector rotation | Does the test survive a selector rename? |
| MEDIUM | + DOM structure mutation | Does the test survive a UI refactor? |
| HIGH | + async delay | Does the test survive a refactor + timing issues? |

Shadow DOM is a separate, independent flag (SHADOW_DOM_ENABLED), not a 4th level. It's a different kind of difficulty (structural DOM access) and can be combined with any level above (e.g. HIGH + Shadow DOM tests refactor + timing + structural access at once).
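The cumulative levels plus the independent Shadow DOM flag can be modeled in a few lines. This is a Python sketch of the configuration logic only; the Chaos App itself is React/Vite, and the mechanism identifiers and function name here are assumptions, not the app's code.

```python
from enum import IntEnum


class ChaosLevel(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Each mechanism switches on at one level and stays on for higher levels.
_MECHANISM_MIN_LEVEL = {
    "selector_rotation": ChaosLevel.LOW,
    "dom_mutation": ChaosLevel.MEDIUM,
    "async_delay": ChaosLevel.HIGH,
}


def active_mechanisms(level: str, shadow_dom_enabled: bool) -> list[str]:
    """List the mechanisms that are live for a level plus the shadow flag."""
    current = ChaosLevel[level.upper()]
    active = [name for name, minimum in _MECHANISM_MIN_LEVEL.items()
              if current >= minimum]
    # Shadow DOM is orthogonal: it is added regardless of the level.
    if shadow_dom_enabled:
        active.append("shadow_dom")
    return active
```

For example, `active_mechanisms("HIGH", True)` lists all four mechanisms, while `active_mechanisms("LOW", True)` combines only selector rotation with Shadow DOM.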

Mechanisms ranked by realism, i.e. how often real enterprise frontends break this way:

| Mechanism | Realism | Why |
| --- | --- | --- |
| DOM Mutation | 10/10 | UI library upgrades, wrapper changes, component migrations |
| Selector Rotation | 9/10 | Classic: renamed class/id/data-testid |
| Async Delay | 8/10 | Lazy loading, animations, network-dependent rendering |
| Shadow DOM | 5/10 | Real, but narrower: mostly Web Components / LWC-style platforms |

Controlling the chaos level:

# chaos_app/.env
VITE_CHAOS_LEVEL=HIGH            # LOW | MEDIUM | HIGH
VITE_SHADOW_DOM_ENABLED=true     # true | false, independent of level

Edit chaos_app/.env, then restart npm run dev. The "Active chaos config" panel at the top of the running app confirms which mechanisms are live, so no guessing is required.

End goal (Sprint 8, Healing Benchmark Runner): run the full suite at every level, comparing No Healer vs. Heuristic Healer vs. LLM Healer. The point is not just "it works," but "here's exactly how much the LLM adds over a cheap fuzzy-match baseline, and where."

| Chaos Level | No Healer | Heuristic Healer | LLM Healer |
| --- | --- | --- | --- |
| LOW | ~72% | ?% | ~98% |
| MEDIUM | ~51% | ?% | ~95% |
| HIGH | ~29% | ?% | ~90% |

The middle column is the actual experiment: a simple fuzzy/Levenshtein selector matcher (zero LLM calls, same provider interface as Anthropic/Ollama) might already solve a surprising fraction of cases. Without this baseline, "90% healed" doesn't prove the LLM was necessary.
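To make the heuristic baseline concrete, here is a minimal sketch of a Levenshtein-based selector matcher. The function names, the normalized similarity score, and the 0.6 threshold are assumptions for illustration; the planned Heuristic Healer is not built yet and may work differently.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def heuristic_heal(broken: str, candidates: list[str],
                   threshold: float = 0.6) -> Optional[str]:
    """Return the closest candidate selector, or None if none is close enough."""
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: similarity(broken, candidate))
    return best if similarity(broken, best) >= threshold else None
```

A matcher like this costs no tokens, so whatever fraction of cases it heals is the part of the LLM's headline number that did not need an LLM.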

Autonomous Mode has hard stop conditions from day one (max_attempts_total, token budget, max_time_per_heal): no infinite LLM retry loops in CI, by design, not as a later hardening pass. Budget is tracked in tokens and elapsed time, never in currency, because model pricing changes over time and token counts don't. See docs/architecture-decisions.md for the full reasoning.
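The stop conditions above can be sketched as a small budget object that is consulted before and after every healing attempt. `max_attempts_total` and `max_time_per_heal` are named in the text; `max_tokens_total`, the `HealingBudget` class, and its methods are assumptions, and the exception class is a local stand-in that only borrows the project's name.

```python
import time
from dataclasses import dataclass


class HealingLimitExceededError(Exception):
    """Local stand-in for the project's limit-exceeded error."""


@dataclass
class HealingBudget:
    max_attempts_total: int
    max_tokens_total: int       # assumed name for the token budget
    max_time_per_heal: float    # seconds
    attempts_used: int = 0
    tokens_used: int = 0

    def start_attempt(self) -> float:
        """Reserve one attempt and return its start time."""
        if self.attempts_used >= self.max_attempts_total:
            raise HealingLimitExceededError("attempt limit reached")
        if self.tokens_used >= self.max_tokens_total:
            raise HealingLimitExceededError("token budget spent")
        self.attempts_used += 1
        return time.monotonic()

    def finish_attempt(self, started_at: float, tokens: int) -> None:
        """Record the cost of an attempt and enforce the per-heal time cap."""
        self.tokens_used += tokens
        elapsed = time.monotonic() - started_at
        if elapsed > self.max_time_per_heal:
            raise HealingLimitExceededError(f"heal took {elapsed:.1f}s")
```

Everything is counted in attempts, tokens, and seconds, so the limits keep their meaning when provider pricing changes.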

PhoenixQA/
├── LEARNINGS.md             # chronological journal: problem → analysis → decision → test → conclusion
├── docs/                    # thematic indexes (fast lookup by topic, not by sprint)
│   ├── gaps.md              # all numbered architectural gaps, status at a glance
│   ├── architecture-decisions.md
│   ├── known-limitations.md
│   ├── future-ideas.md
│   └── testing-strategy.md  # unit/integration/e2e/regression-benchmark/non-functional plan + actual state
├── chaos_app/                # React/Vite, intentionally unstable test target
│   └── src/chaos/            # selectorRotation, domMutation, asyncDelay, shadowDom
├── phoenix/
│   ├── collector/            # failure_classifier, context_collector (weighted semantic scoring)
│   ├── healing/              # healer, safe_mode, decision_logger, autonomous_mode
│   ├── ai/                   # base_provider, ollama_provider, anthropic_provider,
│   │                         # prompt_templates, response_parser, provider_factory
│   ├── training/             # Healing history (Sprint 7)
│   └── reporting/            # Allure Phoenix Healing Report (Sprint 9)
├── pages/                    # Page Objects for Chaos App (POM pattern)
├── tests/
│   ├── chaos/                # tests running against Chaos App
│   ├── unit/                 # tokenizer, classifier, parser, logger, healer tests
│   └── integration/
└── config/

🔒 Privacy-first AI design

| Provider | When to use |
| --- | --- |
| ollama | Air-gapped / NDA environments, local LLM |
| anthropic | Cloud projects, best quality healing suggestions |

Switch providers via a single environment variable; no code changes are needed.
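One way such a switch can work is a small factory keyed on an environment variable. In this sketch the variable name `AI_PROVIDER`, the default of `ollama`, and the stub classes are all assumptions; check .env.example for the real variable name and the phoenix/ai package for the real providers.

```python
import os
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    @abstractmethod
    def name(self) -> str:
        """Short identifier recorded alongside each decision."""


class OllamaProvider(BaseProvider):
    def name(self) -> str:
        return "ollama"


class AnthropicProvider(BaseProvider):
    def name(self) -> str:
        return "anthropic"


_PROVIDERS = {"ollama": OllamaProvider, "anthropic": AnthropicProvider}


def create_provider(env=None) -> BaseProvider:
    """Pick a provider from the (hypothetical) AI_PROVIDER variable."""
    env = os.environ if env is None else env
    # Defaulting to the local provider is an assumption of this sketch.
    key = env.get("AI_PROVIDER", "ollama").strip().lower()
    if key not in _PROVIDERS:
        raise ValueError(
            f"unknown AI_PROVIDER {key!r}; expected one of {sorted(_PROVIDERS)}"
        )
    return _PROVIDERS[key]()
```

Calling code depends only on `BaseProvider`, which is what lets a zero-LLM heuristic provider be dropped in later behind the same interface.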


🗺️ Roadmap

| Sprint | Focus | Status |
| --- | --- | --- |
| Sprint 0 | Repo scaffold, config, AI provider stubs | ✅ Done |
| Sprint 1 | Chaos App: React/Vite, selector rotation, DOM mutation, async delay, Shadow DOM | ✅ Done |
| Sprint 2 | Context Collector: selector_not_found only (DOM snapshot, weighted scoring) | ✅ Done |
| Sprint 3 | LLM Analyzer: prompt engineering, structured JSON response, confidence score | ✅ Done |
| Sprint 4 | Safe Mode: human-in-the-loop terminal review, JSON-lines decision log | ✅ Done |
| Sprint 5 | Autonomous Mode: stop conditions (attempts/tokens/time budget), confidence policy gate, distinct exception types | ✅ Done |
| Sprint 6 | Failure type expansion: detached_from_dom, not_visible, timeout_waiting | ⏳ Planned |
| Sprint 7 | Healing History: SQLite store, decision log, healing correctness definition | ⏳ Planned |
| Sprint 8 | Healing Benchmark Runner: heuristic provider baseline, few-shot self-training, Safe vs Auto metrics | ⏳ Planned |
| Sprint 9 | Allure Phoenix Report, CI/CD, demo GIF | ⏳ Planned |

🚀 Quickstart

# 1. Clone
git clone https://github.com/MarcinMikula/PhoenixQA.git
cd PhoenixQA

# 2. Install Python deps
pip install -r requirements.txt
playwright install chromium

# 3. Configure
cp .env.example .env
# Edit .env: choose AI provider, chaos level

# 4. Run the Chaos App (test target)
cd chaos_app
npm install
cp .env.example .env
npm run dev
# → http://localhost:5173

# 5. In a SEPARATE terminal (npm run dev keeps step 4's terminal busy):
cd ..
# Run tests against it. Both Safe Mode (Sprint 4) and Autonomous Mode
# (Sprint 5) are live. Switch via .env: HEALING_MODE=safe | autonomous
#
# -s is REQUIRED for Safe Mode: it prompts for accept/reject via input(),
# and pytest swallows stdin/stdout without -s, so the prompt never
# reaches the terminal and the run just hangs with no explanation.
# Autonomous Mode doesn't need -s (no prompts), but it doesn't hurt either.
pytest tests/chaos/ -m chaos -s

🎬 Demo

Healing is confirmed working live as of Sprint 5: Safe Mode and Autonomous Mode have both been run end-to-end against the real Chaos App and a local LLM, with selectors successfully healed and retried in place.

The actual demo artifact, though, is parked until Sprint 9: rather than a pile of terminal screenshots, the plan is a single Allure Healing Dashboard (success rate, healing timeline, confidence distribution, top repaired selectors, failure reasons, budget usage, provider comparison), built once Sprints 6-8 (failure type expansion, healing history, benchmark runner) produce real data for it to render. See docs/future-ideas.md for the reasoning.


🤝 Part of the QA Ecosystem

PhoenixQA is one piece of a larger AI-powered QA toolkit:

| Repo | Role |
| --- | --- |
| qa-automation-framework | POM/SOM skeleton; PhoenixQA heals its selectors |
| defect-pilot | AI bug reproduction & retest agent |
| llm-qa-toolkit | LLM-as-judge test framework for AI chatbots |

📄 License

MIT
