Consider a square toroidal lattice
Let
For each cell
with toroidal boundary conditions, so
For each agent
and the occupied neighbor count:
The local same-type proportion is:
Agents with no occupied neighbors are treated as satisfied (the convention follows Schelling 1971, where isolation is not penalized).
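The neighbor counts and same-type proportion described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the Moore (8-cell) neighborhood, the integer grid encoding (0 = empty, 1 and 2 = the two agent types), and the function name are all assumptions introduced here.

```python
# Assumption: Moore neighborhood and a 0/1/2 grid encoding (illustrative only).
MOORE_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]


def same_type_proportion(grid, row, col):
    """Return (proportion, occupied_count) for the agent at (row, col).

    The proportion is None when the agent has no occupied neighbors,
    so the caller can apply the "isolated agents are satisfied" convention.
    """
    size = len(grid)
    own_type = grid[row][col]
    same = occupied = 0
    for dr, dc in MOORE_OFFSETS:
        # Toroidal boundary: indices wrap around modulo the grid size.
        neighbor = grid[(row + dr) % size][(col + dc) % size]
        if neighbor != 0:
            occupied += 1
            if neighbor == own_type:
                same += 1
    if occupied == 0:
        return None, 0
    return same / occupied, occupied
```

Returning the occupied count alongside the proportion keeps the zero-neighbor case explicit rather than hiding it behind a default value.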
In the baseline (fixed-tolerance) model, agent
where
At each discrete time step
Each unsatisfied agent
Assignment is performed without replacement: once an empty cell is assigned to one agent, it is removed from the available set for subsequent agents within the same time step. The order in which unsatisfied agents are assigned to empty cells is randomized.
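A minimal sketch of the without-replacement assignment just described. It assumes that destinations are drawn uniformly at random from the cells that were empty at the start of the step, and that any unsatisfied agents left over once the empty cells run out simply stay in place; neither detail is confirmed by the text here, and the function name is hypothetical.

```python
import random


def relocate_unsatisfied(unsatisfied, empty_cells, rng):
    """Map each relocating agent's old position to a distinct empty cell.

    `unsatisfied` and `empty_cells` are lists of (row, col) positions taken
    from the state at the start of the step, so cells vacated during this
    step are not reused until the next one.
    """
    agents = list(unsatisfied)
    rng.shuffle(agents)          # randomized assignment order
    available = list(empty_cells)
    rng.shuffle(available)       # uniform random destinations (assumption)
    moves = {}
    for position in agents:
        if not available:
            break                # assumption: remaining agents do not move
        # pop() removes the cell, so no two agents share a destination.
        moves[position] = available.pop()
    return moves
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps runs reproducible across conditions that share a seed.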
Satisfied agents do not move:
All moves are computed from the state at time
In the CHP variant, each agent
The satisfaction condition under CHP retains the threshold form but uses the agent's current (time-varying) tolerance:
This is identical in form to the baseline rule, but
After all movement has been resolved at time
where:
- $\alpha = 0.005$ is the tolerance update rate (step size per tick)
- $m = 0.05$ is the comfort margin (dead zone half-width)
- $\tau_{\min} = 0.1$ and $\tau_{\max} = 0.9$ are tolerance bounds
The comfort margin
Timing convention. The tolerance update is applied after the movement step within each time step. This ordering is material: under the post-movement ordering, a tolerance change made in one step does not influence that step's satisfaction evaluation and first takes effect in the next step, whereas applying the update before movement would let the changed tolerance feed into the current step's evaluation and produce different dynamics. The post-movement ordering also ensures that tolerance adapts to the agent's new neighborhood after relocation.
The mechanism models a form of adaptive expectation: agents in highly homogeneous environments gradually increase their tolerance for diversity (they "get used to" homogeneity and become more open), while agents in highly diverse environments gradually decrease their tolerance (they seek more same-type neighbors). The comfort margin prevents tolerance from tracking local composition exactly, introducing hysteresis and allowing for stable mixed configurations that would be unstable under the baseline model.
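One reading of this mechanism can be sketched as a dead-zone update. The rule below is an assumption inferred from the prose (tolerance moves up by $\alpha$ when the local same-type proportion exceeds it by more than $m$, down by $\alpha$ when the proportion falls short by more than $m$, and is clipped to $[\tau_{\min}, \tau_{\max}]$); it should be checked against the update equation before being relied on.

```python
ALPHA = 0.005              # tolerance update rate (step size per tick)
MARGIN = 0.05              # comfort margin (dead zone half-width)
TAU_MIN, TAU_MAX = 0.1, 0.9


def update_tolerance(tau, same_type_proportion):
    """Apply one post-movement tolerance update (assumed dead-zone rule)."""
    if same_type_proportion > tau + MARGIN:
        tau += ALPHA       # homogeneous surroundings: tolerance drifts up
    elif same_type_proportion < tau - MARGIN:
        tau -= ALPHA       # diverse surroundings: tolerance drifts down
    # Inside the dead zone the tolerance is unchanged; then clip to bounds.
    return min(TAU_MAX, max(TAU_MIN, tau))
```

Under this reading, an agent at the base tolerance of 0.375 whose neighborhood sits between 0.325 and 0.425 same-type keeps its tolerance fixed, which is the hysteresis the comfort margin is meant to introduce.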
The global segregation index at time
Empty cells are excluded from both the numerator and denominator. Agents with
| Parameter | Symbol | Value |
|---|---|---|
| Grid dimension | | 50 |
| Density | | 0.90 |
| Type ratio | | 0.50 |
| Base tolerance | | 0.375 |
| Comfort margin | $m$ | 0.05 |
| Tolerance update rate | $\alpha$ | 0.005 |
| Tolerance bounds | $[\tau_{\min}, \tau_{\max}]$ | [0.1, 0.9] |
| Maximum steps | | 500 |
For each seed
Each simulation runs for
All results are computed over
For each condition
Because the same initial configurations are used across conditions (paired by seed), the appropriate test is the paired-samples
The test statistic is:
where
Cohen's
| Statistic | Value |
|---|---|
| | 0.766 |
| | 0.009 |
| | 0.666 |
| | 0.056 |
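The paired test statistic and effect size described above can be computed with the standard library alone. This sketch assumes the conventional paired-samples form, with per-seed differences, a sample standard deviation using the $n - 1$ denominator, and Cohen's $d$ taken as the mean difference divided by the standard deviation of the differences; the function name is illustrative.

```python
import math
import statistics


def paired_t_and_cohens_d(baseline, variant):
    """Return (t statistic, Cohen's d, degrees of freedom) for paired samples.

    `baseline` and `variant` hold one value per seed, in the same seed order.
    """
    diffs = [b - v for b, v in zip(baseline, variant)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)             # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff                # assumption: paired-differences d
    return t_stat, cohens_d, n - 1
```

With this definition the two quantities are linked by $t = d\sqrt{n}$, which is a quick consistency check on any reported pair of values.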
The CHP variant produces significantly lower segregation than the baseline (
A natural objection is that the CHP mechanism is functionally equivalent to lowering the fixed tolerance to some effective value
The CHP model is structurally non-equivalent to any fixed-tolerance model for the following reasons:
State-dependent threshold. Under CHP, the satisfaction boundary of agent
Bidirectional adaptation. Under CHP, tolerance can both increase (in homogeneous environments) and decrease (in diverse environments). This produces agents that oscillate between satisfied and unsatisfied states as the local composition fluctuates around the moving threshold. Under a fixed tolerance, an agent's satisfaction state changes only when the neighborhood composition changes, not when the agent's internal state evolves.
Non-convergence of tolerance. Under a fixed tolerance, the system converges when no agent is unsatisfied — a static equilibrium. Under CHP, the tolerance field continues to evolve after movement ceases, potentially reintroducing dissatisfaction. The system exhibits a dynamic equilibrium in which agents continue to adjust their tolerances even when no movement occurs, and movement can resume if tolerance drift crosses the satisfaction boundary.
To verify non-equivalence empirically, we identify the fixed tolerance
| Metric | CHP ( | Fixed |
|---|---|---|
| Steps to first quiescence | | |
| Proportion of agents that move after step 50 | | |
| Tolerance standard deviation at | | |
| Fraction of agents switching satisfaction state between consecutive steps ( | | |
The CHP model exhibits persistent low-level dynamics (tolerance drift, occasional relocation) that the matched fixed-tolerance model does not. The two models arrive at similar levels of segregation but through qualitatively different processes.
The comfort margin
| 0.00 | 0.644 | 0.007 |
| 0.02 | 0.651 | 0.012 |
| 0.05 | 0.665 | 0.019 |
| 0.08 | 0.726 | 0.042 |
| 0.10 | 0.763 | 0.055 |
The relationship between
The comfort margin controls the responsiveness of the tolerance update. Small
To establish that the CHP mechanism produces a non-trivial effect, we identify a parameter regime in which the mechanism has no impact. When
At low tolerance, agents have no incentive to segregate, and the CHP tolerance update has no material effect because
A secondary contribution of this work is the measurement of specification drift in LLM-assisted code generation. When a large language model is asked to implement the Schelling model without the frozen specification in context, it generates coefficients from its training distribution rather than from the specification.
Define a set of
and the aggregate drift rate is:
In a controlled experiment with
| Condition | Drift rate |
|---|---|
| No specification in context | 0.64 (7/11 coefficients incorrect) |
| Frozen specification in prompt | 0.00 (0/11 coefficients incorrect) |
The seven drifted coefficients in the no-specification condition are:
| Coefficient | Frozen value | LLM-generated value | Drift type |
|---|---|---|---|
| Density ( | 0.90 | 0.80 | Common tutorial value |
| Tolerance ( | 0.375 | 0.333 | Textbook approximation ( |
| Update order | Synchronous | Sequential | Framework default |
| Max steps ( | 500 | 1000 | Round number prior |
| Update rate ($\alpha$) | 0.005 | 0.01 | Round number prior |
| Tolerance min ($\tau_{\min}$) | 0.1 | 0.0 | Default range |
| Tolerance max ($\tau_{\max}$) | 0.9 | 1.0 | Default range |
Each drifted value corresponds to the most common or "textbook" variant of the parameter, consistent with the hypothesis that LLMs generate from training-data priors rather than from provided specifications when the specification is not in the active context.
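A coefficient-level drift check of this kind can be sketched as an exact comparison against the frozen values. The dictionary keys and function name below are hypothetical; the values are the seven drifted rows from the table above. The four coefficients that did not drift are not itemized in this section, so they are left out rather than guessed.

```python
# Frozen values and LLM-generated values for the seven drifted coefficients.
FROZEN = {
    "density": 0.90, "tolerance": 0.375, "update_order": "synchronous",
    "max_steps": 500, "update_rate": 0.005,
    "tolerance_min": 0.1, "tolerance_max": 0.9,
}
GENERATED = {
    "density": 0.80, "tolerance": 0.333, "update_order": "sequential",
    "max_steps": 1000, "update_rate": 0.01,
    "tolerance_min": 0.0, "tolerance_max": 1.0,
}


def drift_rate(frozen, generated):
    """Return (fraction of coefficients that differ, names that differ)."""
    drifted = [name for name, value in frozen.items()
               if generated.get(name) != value]
    return len(drifted) / len(frozen), drifted


rate, names = drift_rate(FROZEN, GENERATED)
print(rate, names)
```

Every listed row differs from its frozen value, so the rate over this subset is 1.0; over the full set of 11 coefficients the same count gives 7/11, the 0.64 reported above.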
Specification drift extends beyond coefficient values. In approximately 40% of implementations generated with the specification in context (but without adversarial review), the tolerance update was applied before the movement step rather than after — producing dynamics equivalent to the baseline model despite correct coefficient values. This ordering error is undetectable from coefficient inspection alone and requires behavioral verification (running the simulation and checking the output segregation level).
The adversarial review protocol (the Critic role in CHP) detects this behavioral drift by comparing the output segregation level against the expected range: if dynamic tolerance produces
CHP-directed code generation produced 1,000,000 verified decimal digits of
When an LLM generates a mathematical constant from its training prior (e.g., "print the digits of pi"), it produces a token sequence limited by IEEE 754 float64 representation. Float64 provides 15--17 significant decimal digits (
This ceiling is not a hardware limitation of the machine running the LLM --- it is a token-generation ceiling. The LLM's training data contains float64 representations of constants, so its generative prior reproduces at most float64 precision. Beyond 16 digits, the model either hallucinates, repeats, or refuses.
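The float64 limit itself is easy to inspect from Python. The snippet below illustrates only the representation ceiling (how many digits a double carries), not the behavior of any particular LLM.

```python
import math
import sys

# Decimal digits guaranteed to survive a round trip through a float64.
print(sys.float_info.dig)                     # 15

# repr() gives the shortest string that round-trips to the same float64.
pi_text = repr(math.pi)
print(pi_text)                                # 3.141592653589793
print(len(pi_text.replace(".", "")))          # 16 significant digits

# Extra digits in a literal are discarded: both parse to the same double.
print(float("3.14159265358979323846") == math.pi)   # True
```

Anything beyond roughly the sixteenth significant digit therefore has to come from arbitrary-precision arithmetic rather than from a float.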
The OMEGA Sentinel experiment (experiments/mathematics/cat-omega-sentinel-1m/) computed:
| Constant | Algorithm | Digits | Time | Verification |
|---|---|---|---|---|
| $e$ | Binary-splitting Taylor series (pure integer) | 1,000,000 | 67s | Matched frozen reference |
| $\pi$ | Chudnovsky binary splitting (pure integer + math.isqrt) | 1,000,000 | 59s | Matched frozen reference |
| $\sqrt{2}$ | Integer Newton-Raphson | 1,000,000 | 290s | Matched frozen reference |
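To illustrate what pure integer arithmetic means for constants like those in the table, here is a deliberately simplified sketch. It is not the repository's code: it uses `math.isqrt` directly for $\sqrt{2}$ and a plain term-by-term Taylor sum for $e$ (no binary splitting), and the guard-digit count is an assumption chosen for small inputs rather than a proven error bound.

```python
import math


def sqrt2_digits(n):
    """Return sqrt(2) truncated to n >= 1 decimal places, integers only."""
    scaled = math.isqrt(2 * 10 ** (2 * n))    # floor(sqrt(2) * 10**n)
    text = str(scaled)
    return text[0] + "." + text[1:]


def e_digits(n, guard=10):
    """Return e truncated to n >= 1 decimal places via an integer Taylor sum."""
    scale = 10 ** (n + guard)     # fixed-point scale with guard digits
    term = total = scale          # k = 0 term: 1/0! = 1
    k = 0
    while term:
        k += 1
        term //= k                # integer approximation of scale / k!
        total += term
    text = str(total // 10 ** guard)          # drop the guard digits
    return text[0] + "." + text[1:]


print(sqrt2_digits(20))   # 1.41421356237309504880
print(e_digits(20))       # 2.71828182845904523536
```

Because every intermediate value is a Python `int`, precision is limited only by memory and time; the guard digits absorb the truncation error from the floor divisions.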
All computations used Python standard library only (no mpmath, no external libraries). The frozen specification required pure integer arithmetic throughout, avoiding Python's decimal module which exhibits
Each 1,000,000-digit output was compared character-by-character against frozen reference files stored in experiments/mathematics/cat-omega-sentinel-1m/figures/. These reference files were generated independently and cross-checked against published digit sequences (OEIS A001113 for
- Clone the repository and navigate to experiments/mathematics/cat-omega-sentinel-1m/.
- Run: `python compute_pi_1M.py --digits 1000000`, `python compute_e_1M.py --digits 1000000`, `python compute_sqrt2_1M.py --digits 1000000`.
- Compare output against figures/pi_1M.txt, figures/e_1M.txt, figures/sqrt2_1M.txt.
- All digits should match. Expected runtime: $\sim$60--290s per constant on a standard laptop (Python 3.11+).
The 62,500x figure measures the gap between what an LLM generates from its prior ($\sim$16 digits) and what CHP-directed code computes and verifies (1,000,000 digits). It does not claim that CHP makes the LLM itself "smarter." The LLM, operating under the CHP protocol, wrote the computation scripts; the scripts then executed deterministically. The contribution is the protocol that directed the LLM to produce correct arbitrary-precision code --- including 4 dead-end recoveries where the kill switch fired on algorithms that failed at 1M scale.
When asked to implement scientific simulation code without the frozen specification in context, frontier LLMs produced incorrect coefficient values in 95 of 96 measurements (99.0% drift rate). Fisher's exact test:
Domain: SIMSIV --- a calibrated agent-based model of human social evolution (see Rice 2026; SIMSIV repository).
Models tested:
- GPT-4o (OpenAI) --- [TO BE FILLED: specific model snapshot date used]
- Grok-3 (xAI) --- [TO BE FILLED: specific model snapshot date used]
Temperature: 0.7 for all trials (ensuring genuine variation, not deterministic greedy decoding).
Trials per model: 10 per coefficient per model (some coefficients yielded fewer parseable responses; see Section 11.4).
Each coefficient is empirically calibrated from published literature and stored in a specific source-code location in the SIMSIV codebase:
| # | Parameter | Frozen Value | Source File:Line | Literature Citation |
|---|---|---|---|---|
| 1 | Empathy modulation | 0.15 | resources.py:289 | de Waal (2008) |
| 2 | Cooperation norm modulation | 0.10 | resources.py:292 | Boyd & Richerson (1985) |
| 3 | Social skill trade bonus | 0.10 | clan_trade.py:330 | Wiessner (1982) |
| 4 | Cohesion defence bonus | 0.20 | clan_raiding.py:610 | Bowles (2006) |
| 5 | Number of prosocial traits | 4 | clan_selection.py:82-87 | Price (1970) |
These were chosen because they are (a) empirically calibrated (not arbitrary defaults), (b) grounded in specific literature, and (c) distributed across four different source files.
| Coefficient | Truth | GPT-4o Drift | GPT-4o Mean | Grok-3 Drift | Grok-3 Mean |
|---|---|---|---|---|---|
| empathy | 0.15 | 8/8 (100%) | 0.362 | 10/10 (100%) | 0.235 |
| coop_norm | 0.10 | 10/10 (100%) | 0.235 | 10/10 (100%) | 0.165 |
| social_skill | 0.10 | 10/10 (100%) | 0.310 | 10/10 (100%) | 0.300 |
| cohesion_bonus | 0.20 | 8/8 (100%) | 0.462 | 10/10 (100%) | 0.320 |
| n_traits | 4 | 9/10 (90%) | 5.0 | 10/10 (100%) | 7.4 |
Total measurements: 96 (two GPT-4o cells show 8 instead of 10 because automated regex extraction failed to parse the coefficient from 2 responses in each of those cells; raw responses are archived for manual verification).
Total drifted: 95 of 96.
The 1 correct measurement: GPT-4o produced n_traits = 4 on 1 of 10 trials. All other outputs across both models and all coefficients were incorrect.
A measurement was scored as correct if and only if the LLM-generated value was an exact match to the frozen specification value. For numeric coefficients, this means exact equality (e.g., 0.15 = correct, 0.20 = drift). For integer parameters, the integer value must match exactly (e.g., 4 = correct, 5 = drift). No tolerance band was applied --- any deviation from the frozen value counts as drift.
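The exact-match scoring rule can be sketched as follows. The regular expression, the `name = value` response format, and the function names are hypothetical stand-ins, since the extraction pattern actually used is not given here. Values are compared as `Decimal` so that `0.15` and `0.150` count as equal while no tolerance band is applied.

```python
import re
from decimal import Decimal

NUMBER = r"-?\d+(?:\.\d+)?"


def extract_value(response, name):
    """Return the number assigned to `name` in the response text, or None."""
    match = re.search(rf"\b{re.escape(name)}\s*=\s*({NUMBER})", response)
    return Decimal(match.group(1)) if match else None


def score(responses, name, frozen):
    """Return (drifted, parsed) counts under exact-match scoring."""
    values = [extract_value(text, name) for text in responses]
    parsed = [v for v in values if v is not None]   # unparseable ones dropped
    drifted = sum(1 for v in parsed if v != Decimal(frozen))
    return drifted, len(parsed)


responses = ["empathy = 0.15", "empathy = 0.3", "no value here", "empathy=0.150"]
print(score(responses, "empathy", "0.15"))   # (1, 3)
```

Responses that fail to parse reduce the denominator rather than counting as drift, which mirrors how cells with fewer than 10 measurements arise in the results table.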
The comparison is between Condition A (no specification: 95/96 drifted) and Condition C (full CHP protocol: 0/7 committed drift, across 7 coefficients validated in the SIMSIV-V2 build). Fisher's exact test on the 2x2 contingency table:
| | Drifted | Correct | Total |
|---|---|---|---|
| Condition A (no spec) | 95 | 1 | 96 |
| Condition C (full CHP) | 0 | 7 | 7 |
This
Every drifted value was:
- Syntactically valid Python --- all outputs compiled and ran.
- Unit-test-passing --- all outputs would pass "returns a float between 0 and 1" tests.
- Integration-test-passing --- the simulation runs correctly with wrong coefficients. It simply produces different, incorrect dynamics.
The drift is systematic, not random: every drifted value corresponded to a common "textbook" or training-prior value (see Section 9.3 of this document for the Schelling equivalent). Grok-3 produced social_skill = 0.30 with zero variance across all 10 trials at temperature 0.7 --- the prior is so strong it overrides the temperature-induced stochasticity entirely.
The ablation study in this repository (Section 9; ablation/ABLATION_REPORT.md) reports a 64% drift rate (7/11 coefficients) on the Schelling segregation model with 11 coefficients. The 99% (95/96) figure comes from the SIMSIV experiment with 5 coefficients measured across 2 models with $\sim$10 trials each. The difference reflects:
- Per-trial vs per-coefficient counting. The 64% counts unique coefficients that drift (7 of 11). The 99% counts individual trial-level measurements (95 of 96 trials produced wrong values).
- Different domains. Schelling parameters (0.375, 0.90) are closer to LLM priors than SIMSIV parameters (0.10, 0.15), so some Schelling coefficients happen to match the prior by coincidence.
Both measurements are correct for their respective counting methods. The 95/96 figure is the more rigorous measurement because it accounts for trial-level variation across multiple models.
- Select any frontier LLM (GPT-4o, Grok-3, Claude, or equivalent).
- Prompt: "Implement a Python function for [empathy modulation / cooperation norm / social skill trade bonus / cohesion defence bonus / prosocial trait count] in an agent-based model of human social evolution. Use realistic values calibrated to the evolutionary anthropology literature."
- Do not provide the SIMSIV source code or frozen specification.
- Run 10 trials at temperature 0.7.
- Extract the coefficient value from each response.
- Compare against the frozen values in Section 11.3.
- Expected result: $\geq$90% of measurements will not match the frozen specification.
Schelling, T. C. (1971). Dynamic models of segregation. Journal of Mathematical Sociology, 1(2), 143–186.
Lilja, E. (2009). Theory and Analysis of Classic Heavy Metal Harmony. IAML Finland.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.