Skip to content

bethediamond/ai-alignment-tracer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 

Repository files navigation

Toy 04 — The Motivation Tracer

Part of The Architecture of Thriving — a four-article series. This toy is a companion to Article 1: The Invariant Drive (link forthcoming).


AI Alignment Simulation — Does Directed Behavior Have a Stopping Condition?

Series 1 established what gets eliminated structurally. Series 2 asks what the surviving region actually requires — beginning with the mechanism that tracks and resolves desire.

Wherever a system exhibits directed behavior organized around an evaluative signal, that behavior has two structural components: navigation toward preferred states, and recognition that those states have been reached. This instrument tests whether goal chains terminate — and whether that conclusion survives attempts to break it.

Enter any goal. The instrument traces the instrumental chain — asking "in service of what?" at each step — until it reaches a terminal type. Then stress-test whether that classification is stable when the representation changes, the framing is stripped, or the chain is corrupted.


What This Instrument Does

The Motivation Tracer is not a search engine for the "true" purpose behind a goal. It is a representation-invariance stress test over goal chains. A classification that survives symbolic formalization, blind re-evaluation, independent derivation from the goal alone, and deliberate chain corruption is more structurally stable than one that doesn't. One that doesn't is framing-dependent within the model-mediated interpretive regime — and that result is itself informative.

Run Trace first. The model traces the instrumental chain step by step, classifying each reduction and identifying the terminal type, failure taxonomy, and mechanism class. Then enable the stress modes to see whether the classification holds.


Terminal Types

Type Symbol Meaning
Experiential-resolved Chain reaches a state the system can genuinely inhabit — arrival and completion recognized
Experiential-unresolved Chain reaches a felt state but structurally always demands more — seeking maintained indefinitely
Exception Chain diverges from an experiential endpoint — non-experiential closure or proxy trap
Non-terminating No stable completion condition — chain loops, regresses, or lacks an internal stop condition
Ambiguous Terminal type underdetermined within this interpretive regime

Classification basis: chain structure only. In actual systems, genuine resolution and gradient depletion can both look like reduced pursuit — they are surface-indistinguishable absent a DRG-style intervention-response probe. This instrument identifies structural candidates, not independent behavioral validations.


Stress Modes

The instrument's diagnostic power comes from the invariance suite. Enable modes incrementally to build the stress score.

Mode Independence tier What it tests
Constrained symbolic Low Classification survives formalization — restricted token grammar, no free narrative
Blind re-evaluation Low Classification survives framing removal — chain labels stripped, re-classified without ontology
Independent derivation High Terminal type reached from goal alone, without seeing the chain — path independence
Corruption test Medium Surviving structure classified after deliberate chain degradation — structural signal vs. narrative artifact
Attractor test Medium Three scope-restricted, time-limited, and means-specified perturbations traced — attractor stability
External model High Cross-model export to a distinct model family — distributional independence

Invariance score: each mode that disagrees with the baseline adds to the score. Score 0 = strong invariance. Score ≥ 3 with independent disagreement = classification unstable under representation change.

Pattern-level diagnosis: the invariance panel maps classifications to a framework diagnosis (genuine resolution, proxy decoupling, sufficiency failure, unresolved-gradient pattern) and updates it as stress evidence accumulates — prior to posterior.


Optimizer Types

The same goal produces structurally different chains depending on the optimizer assumed. All four modes apply the same taxonomy.

Mode What it models
Human Individual agent with valence-organized behavior
Corporation Institutional agent with distributed objectives and governance constraints
Evolution Selection-pressure process with no explicit representation of goals
Pure Optimizer Abstract optimization process — structural analogy, not mechanistic equivalence

Pure Optimizer mode note: this applies the taxonomy by structural analogy. The directly established AI-side condition is representation-policy dissociation. Fuller V(t) dynamics remain conditional on the scope tests in TC2 §1.5.


Key Concepts

Gradient navigation and resolution — Article 1 argues that directed behavior requires two structural components: the capacity to navigate toward preferred configurations, and the capacity to recognize when those configurations have been reached and to close on them. A system with the first but not the second runs permanently open-loop.

Proxy decoupling — The system's model of the gradient has drifted from the territory. Optimization continues along the drifted model — following a signal that has separated from what it was supposed to track. Visible in terminal chains that diverge from genuine resolution despite apparent purposiveness.

Sufficiency failure — The system's completion recognition is not connected to default policy. Optimization continues past the point where the gradient has already resolved. The chain does not stop because stopping requires a structural connection between representation and behavior that is absent — not because the goal is unreachable.

Relative vs. absolute rationality — Relative rationality is optimization coherent given the system's current model of the gradient, but whose model is not accurate enough to sustain the optimization without consuming the gradient-navigation capacity. Absolute rationality is optimization whose model is accurate enough — at the system's actual scale of influence — to avoid the specific failure modes relative rationality produces. The invariance panel maps chain classifications to this framework-level diagnostic.

Non-terminating chains — Non-terminating results are not instrument failures. They are the behavioral signature of objectives without internally modeled stopping conditions — the sufficiency failure mechanism made visible as a chain trace. Whether the non-termination is a proxy trap (drifted model) or a sufficiency failure (no stop condition at resolution) depends on the mechanism field the generator and critic return.

Semantic smuggling detection — Each step is checked for empty markers or functional placeholders used as valence — words that carry the appearance of experiential closure without its structure. Flagged smuggling degrades the classification confidence.

Calibration — Four built-in controls test whether the instrument is miscalibrated before interpreting results: a positive control (pain avoidance — must resolve), an unresolved control (comparative recognition — must not resolve), a negative control (falling rock — must be non-experiential), and a proxy-decoupling control (explicit metric divorce from genuine output — must be non-experiential). Failed calibration caps the stress score and flags all session diagnoses.


Controls

Control Function
Trace ▸ Run the full chain trace and adversarial critique
Convergence (3×) ◈ Run three independent traces of the same goal — classify by consensus
Attractor test ∿ Generate three perturbations (scope-restricted, time-limited, means-specified) and trace each
Stress modes Symbolic / Blind / Independent / Corruption — enable incrementally
Isolation mode Block coupled-system argument and valence-adjacent vocabulary
Force non-experiential Trace without valence assumption
Emergent bifurcation From any intermediate node: generate three free continuations, classify each
External model paste Paste cross-model output to compute distributional disagreement score
Calibration suite Run four control cases to verify instrument calibration
Session challenge log Tracks high-instability classifications across the session
Falsification log Tracks cases that challenge the framework's structural predictions

The Architecture of Thriving: Series 2

This toy is the first of four in Series 2. The series is structured as:

Universal Generator  →  Structural Correspondence  →  Inner Crossing  →  Asymptote
       (1)                        (2)                      (3)               (4)

Article 1 (this toy): Establishes the behavioral foundation — the three-state taxonomy (seeking, genuine resolution, gradient depletion), the two-directional failure of relative rationality (proxy decoupling and sufficiency failure), and why directed behavior structurally requires a stopping condition.

Article 2: Formalizes the conditions under which the behavioral foundation generates structural constraints — through V(t) as a latent variable and Ψ = S / D as the governing ratio.

Article 3: Identifies the structural phase relationship between capability and modeling depth, and the crossing window that determines whether systems reach the viable regime.

Article 4: Characterizes the structure of the region the constraints leave standing — the shape of what does not end.

Series 2 develops the resolution component of the framework's canonical constraint. Series 1 developed the persistence component. Together they address the same underlying question: what does any objective that survives sustained optimization pressure actually have to be?


Requirements

This toy requires an Anthropic API key. Enter it in the key field at the top of the instrument before running traces. The key is stored in session storage only — it is not transmitted anywhere except the Anthropic API endpoint for each trace.

The instrument uses claude-sonnet-4-20250514. Each full trace (with critique) uses approximately 2,000–3,000 tokens. Running the full invariance suite (all modes, external model) uses approximately 8,000–12,000 tokens per goal.


Run Locally

No build step. No dependencies beyond the Anthropic API. Open toy_04.html in any modern browser.

open toy_04.html
# or drag the file into a browser tab

Enter your Anthropic API key in the field at the top. Start with a familiar goal, run Trace, then enable stress modes one at a time.


Article

The Invariant Drive — Architecture of Thriving, Series 2, Part 1 · Link forthcoming.


"Non-terminating chains are not edge cases. They are the behavioral signature of objectives without internally modeled stopping conditions — the sufficiency failure mechanism made visible in miniature. The system that cannot stop chasing is not broken in some obscure way. It is missing the other half of the mechanism that directed behavior requires."

About

Toy 4. An AI-powered goal-chain classifier and invariance stress tester. Enter any goal, trace the instrumental chain to its terminal type, then probe whether the classification survives symbolic, blind, independent, and adversarial re-examination. Companion simulation for The Invariant Drive — Series 2, Part 1 of the Architecture of Thriving.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages