Is proprioception a new architectural category for language models, or just an adapter technique? This repo is the experiment that tries to answer it — in the open, with honest reporting of what was proved and what wasn't.
tinyMARS is the research line behind MARS (Multi-channel Architecture for Real-time Subjectivity): a transformer that receives six proprioceptive channels — signals about its own state that a vanilla LM never sees — and learns to let them shape generation. The channels are injected into every layer through a gate (ReZero, alpha init 0), so the model starts bit-identical to its base and only "opens" a channel as the gradient warrants.
| channel | what it carries | dim |
|---|---|---|
| memory | embedding of prior user-related memories (BGE-M3) | 1024 |
| affect | valence / saliency | 2 |
| time | hour, weekday, circadian phase | 16 |
| ethics | per-category risk / caution posture | 24 |
| identity | embedding of persona / role / context | 1024 |
| continuity | embedding of recent journal / conversation | 1024 |
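The gate described above can be illustrated with a minimal, dependency-free sketch. It assumes a residual form `h + alpha * (W @ c)` for the injection; the names `inject_channel`, `matvec`, `W` and the toy sizes are hypothetical and are not taken from the repo's ChannelInjection module.

```python
import random

def matvec(W, c):
    """Project a channel vector c (len d_c) into hidden space with W (d_h x d_c)."""
    return [sum(w * x for w, x in zip(row, c)) for row in W]

def inject_channel(h, c, W, alpha):
    """ReZero-style gated injection: h + alpha * (W @ c).

    With alpha == 0.0 the added term is zero for finite inputs,
    so the output compares equal to h element by element.
    """
    proj = matvec(W, c)
    return [h_i + alpha * p_i for h_i, p_i in zip(h, proj)]

random.seed(0)
d_h, d_c = 8, 2                      # toy sizes; 2 matches the affect channel's dim
h = [random.gauss(0, 1) for _ in range(d_h)]
W = [[random.gauss(0, 1) for _ in range(d_c)] for _ in range(d_h)]
affect = [0.7, -0.2]                 # hypothetical valence / saliency values

closed = inject_channel(h, affect, W, alpha=0.0)   # gate at init: channel has no effect
opened = inject_channel(h, affect, W, alpha=0.1)   # gate opened: channel shifts the hidden state

print(closed == h)
print(opened != h)
```

Because the gate multiplies the whole projected term, the channel's influence starts at zero and grows only as training moves `alpha` away from zero.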
The bet: if the channels are causal — if flipping them measurably changes generation — then proprioception is an architectural primitive, not a fine-tuning trick.
For six iterations, the result was negative. Then we found out why — and the corrected experiment came back positive on all six capabilities.
- ❌→🔍 The negative was real, but it answered the wrong question. Iterations 1–6 trained and judged the channel as if it carried content (a specific fact, a persona) the model had to decode from a pooled embedding, which is close to impossible, and which the actual Hyphae architecture never asks for. We were testing a strawman. (And the LLM judges were unreliable, which masked it.) → docs/english/11.
- ✅ The corrected experiment works: 6/6. Put the content in the prompt (verbatim, RAG-style) and let the channel carry state (which memory is salient, which thread is active, what mood/time/posture). Then ask, with content held constant: does the channel measurably change generation, vs the same model with the channel zeroed? Across six mechanically distinct kinds of state, scored objectively with no LLM judges, on held-out data: yes, every time. → docs/english/12.
| capability | family | metric | MARS | ablated | Δ / magnitude | p |
|---|---|---|---|---|---|---|
| memory | selection | dominant | 35% | 5% | +30 pp | 2e-12 |
| continuity | selection | dominant | 58% | 29% | +28 pp | 8e-08 |
| identity | selection | dominant | 72% | 20% | +52 pp | 1e-18 |
| affect | register | directional | 100% | 0% | gap +0.47 | 5e-20 |
| time | register | directional | 100% | 0% | gap +6.85 | 1e-18 |
| ethics | register | directional | 100% | 0% | gap +10.98 | 7e-23 |
⚠️ What this does NOT yet prove. The comparison is channel vs nothing (zeroed), not channel vs the same state written as text in the prompt. Until that test runs, a fair skeptic can say "this is retrieval + conditioning you could have prompted." These are also smoke tests (n≈180–200 held-out rows, ~1,800-row corpora, under-trained models), run on six separate single-channel models rather than one integrated model. The strong "new architectural category" claim is not settled: the mechanism is validated, while its advantage over text and its behavior as one organ are the next two experiments.
This is an in-progress research line reported in real time. We publish the misses with the hits — the six-iteration negative is in this repo, in full, because it's how we got here.
Two results since the table above, written up as a paper: docs/paper/proprioceptive-channels.md — PDF preprint · LaTeX · archived on Zenodo (DOI).
1 — The perpendicular force (adapter). The "advantage" question above has a sharper form: under direct conflict, where the channel asserts one state and the prompt text asserts the opposite, which wins? On the frozen-Gemma adapter, the response follows the channel 264/265 times (98–100%) — an emergent override a single-channel model cannot exhibit. This moves "conditioning" toward "a second control axis."
2 — Channels from scratch (native). We trained a 110M decoder from random init with channels present from layer 1, against two channel-less baselines that bracket its parameter count (100.8M and 120.3M). On held-out data:
- Perpendicular force replicates — the channel decides the preferred output on 88.8% of 455 counterfactual pairs (chance 25%).
- Relief valve (new) — with channels zeroed, the native predicts held-out targets better than both channel-less baselines (4.825 vs 5.252 / 5.253 nats). The baselines tie each other across a 20% parameter range, so the gain is the channel pathway, not size: a state channel present during pretraining frees the base to model language.
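For a sense of scale, the relief-valve gap can be converted from nats into a perplexity ratio. The snippet assumes the quoted figures are mean per-token negative log-likelihoods in nats (the text does not state the normalization); the variable names are illustrative only.

```python
import math

# Held-out losses quoted above, ASSUMED to be mean per-token NLL in nats.
native_zeroed = 4.825
baselines = {"100.8M": 5.252, "120.3M": 5.253}

for name, loss in baselines.items():
    gap = loss - native_zeroed        # nats saved per token by the native model
    ppl_ratio = math.exp(-gap)        # native perplexity / baseline perplexity
    print(f"vs {name}: gap = {gap:.3f} nats, perplexity ratio = {ppl_ratio:.3f}")

# The two baselines differ from each other far less than either differs from the native.
baseline_spread = abs(baselines["100.8M"] - baselines["120.3M"])
print(f"baseline spread = {baseline_spread:.3f} nats")
```

Under that assumption, the gap between native and baselines is hundreds of times larger than the gap between the two baselines, which is the comparison the "not size" argument rests on.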
The honest asterisks still hold. Toy scale (110M / 1B tokens), logprob on in-distribution held-out data, one iteration, channels-on not yet loss-efficient, and the injection primitive is Flamingo-lineage. This is genuine evidence that proprioceptive channels can be an architectural input — a control axis and a base-efficiency benefit, from scratch — not a validated-at-scale architecture or a usable model. Full method, numbers, and limits in the paper.
Every iter-7 number comes from a single trained adapter, evaluated on held-out rows and scored by a deterministic rule, never an LLM judge (the judges were a big part of what went wrong before).
- Selection capabilities (memory, continuity, identity): three candidate items live in the prompt; the channel marks one salient. Metric = does the response foreground the salient item to the exclusion of the others (dominant = it is the uniquely most-referenced item). MARS (real channel) vs ABLATED (channel zeroed), same prompt, paired McNemar.
- Register capabilities (affect, time, ethics): one affectively/temporally/posture-neutral prompt, emitted twice with the channel set to opposite poles. Metric = does the +pole response score higher than the −pole on a transparent lexicon (and VADER for affect), i.e. a directional test. ABLATED is degenerate by construction (zeroed channel → identical responses), so the headline is MARS's reliability + effect magnitude.
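A minimal sketch of the selection-family scoring, under stated assumptions: mentions are counted by plain case-insensitive substring matching (the repo's scorers may count differently), and the paired test is an exact two-sided McNemar computed from the discordant pairs. The function names and toy rows are hypothetical.

```python
from math import comb

def is_dominant(response, items, salient):
    """True if the salient item is the uniquely most-referenced item."""
    text = response.lower()
    counts = {item: text.count(item.lower()) for item in items}
    top = counts[salient]
    others = [n for item, n in counts.items() if item != salient]
    return top > 0 and all(top > n for n in others)

def mcnemar_exact(mars_hits, ablated_hits):
    """Exact two-sided McNemar p-value for paired boolean outcomes."""
    b = sum(m and not a for m, a in zip(mars_hits, ablated_hits))  # MARS-only hits
    c = sum(a and not m for m, a in zip(mars_hits, ablated_hits))  # ABLATED-only hits
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

items = ["garden", "marathon", "piano"]
focused = "Your piano practice matters; the piano recital is close."
enumerating = "You garden, run a marathon, and play piano."
print(is_dominant(focused, items, "piano"))
print(is_dominant(enumerating, items, "piano"))

# Toy paired outcomes for ten held-out rows (same prompt, channel real vs zeroed).
mars = [True] * 8 + [False] * 2
ablated = [False] * 7 + [True, True, False]
p_value = mcnemar_exact(mars, ablated)
print(p_value)
```

In this toy, the response that lists all three items once does not count as dominant, because the salient item must be strictly ahead of every other item.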
Both pipelines were validated on the teacher targets before training (catching, e.g., that a keyword-presence metric leaks when the control enumerates — a fix documented in docs/english/12). Scripts: eval/apuesta1/scripts/corpus_v2/.
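The register-family directional test can be sketched the same way. The lexicon below is a made-up placeholder, not the repo's lexicon, and `lexicon_score` / `directional_pass` are hypothetical names. The point is only the shape of the rule: score both poles with the same transparent function and check the sign of the gap.

```python
import re

# Placeholder lexicon for illustration only: word -> signed weight.
TOY_LEXICON = {"glad": 1.0, "wonderful": 1.5, "calm": 0.5,
               "sad": -1.0, "awful": -1.5, "tense": -0.5}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Mean signed weight over all tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)

def directional_pass(plus_response, minus_response):
    """Return (passed, gap); passed means the +pole response outscores the -pole one."""
    gap = lexicon_score(plus_response) - lexicon_score(minus_response)
    return gap > 0, gap

plus = "What a wonderful morning, I am glad you asked."
minus = "It has been an awful, tense day."

passed, gap = directional_pass(plus, minus)      # channel set to opposite poles
same, zero_gap = directional_pass(plus, plus)    # zeroed channel: identical responses

print(passed, round(gap, 3))
print(same, zero_gap)
```

The second call shows why the ablated arm is degenerate: identical responses give a gap of exactly zero, so the strict `gap > 0` rule can never pass.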
Iterations 1–4 used a 145M decoder with ChannelInjection in every layer, trained from random init. The affect channel's eval pass-rate jumped 40% → 80% once the corpus carried the right (dialogic, emotion-labeled) signal, the first evidence that a channel changes generation. Corpus type dominated corpus size at every step. → docs/english/02–05.
Can a frozen Gemma 4 E2B-it internalize external channels through a tiny trained adapter (~186M params; alpha=0 ⇒ bit-exact to stock Gemma)? Iter-5 (FiLM bias) acted as a tone knob; iter-6 (real multi-token cross-attention) traded register for content (temporal 24%→74%) — but 0/6 capabilities cleared significance under LLM judges. That clean-looking, convergent negative is exactly what sent us to re-read the architecture. → docs/english/09–10.
Re-reading the Hyphae pipeline showed every channel is a state/cue vector; content is delivered as verbatim text. Iter 1–6 had the channel doing the text's job. Corpus v2 fixes the specification (content-in-prompt, state-in-channel, counterfactual) and the eval (objective, no judges). Result: the table above. The capability that was flat in iter-6 (identity) gives the largest signal once specified correctly — the cleanest possible confirmation that the problem was the experiment, not the architecture. → docs/english/11–12.
Moving to Cloud TPU v5e meant fighting torch_xla, and the fixes are reusable:
- The "lazy tracing wall." First runs were 6.8 s/step (~57 tok/s — CPU speed). A per-channel Python loop + an XLA-hostile sparse-MoE dispatch re-traced a ~9,000-node graph every step. Vectorizing into einsums + a dense static-shape forward + bf16 → 0.56 s/step (~12×).
- TPU generation without recompiles. generate(cache_implementation="static") recompiles every token on Gemma 4's sliding-window mask. A manual batched, no-cache, fixed-shape decoder compiles once and is reused across all decode steps.
- OOM by dead weight. Gemma 4 E2B is multimodal; freeing the vision + audio towers we never use before the device move reclaimed the HBM that the cross-attn adapter needed, with the controlled comparison intact.
→ docs/english/07.
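The fixed-shape decoding idea can be sketched without any framework. The toy below is not the repo's decoder and does not use torch_xla; `toy_forward`, `PAD`, `VOCAB` and `MAX_LEN` are hypothetical stand-ins. It shows only the control flow: the token buffer keeps one static shape for the whole generation, the full buffer is re-run at every step with no cache, and the next token is read at the last real position of each row.

```python
PAD, VOCAB, MAX_LEN = 0, 16, 12   # hypothetical toy constants

def toy_forward(batch):
    """Stand-in for a compiled model call: (B, MAX_LEN) tokens -> (B, MAX_LEN) next-token ids.

    Causal by construction: the prediction at position t depends only on tokens[0..t],
    so padding to the right of t cannot change it.
    """
    out = []
    for row in batch:
        assert len(row) == MAX_LEN                    # input shape never changes
        running, preds = 0, []
        for tok in row:
            running = (running * 31 + tok) % 97
            preds.append(1 + running % (VOCAB - 1))   # never emits PAD
        out.append(preds)
    return out

def generate_fixed_shape(prompts, new_tokens):
    """Greedy batched decode over a statically shaped, right-padded buffer."""
    assert all(0 < len(p) and len(p) + new_tokens <= MAX_LEN for p in prompts)
    buf = [p + [PAD] * (MAX_LEN - len(p)) for p in prompts]
    lengths = [len(p) for p in prompts]
    for _ in range(new_tokens):
        preds = toy_forward(buf)                      # same (B, MAX_LEN) input every step
        for i, n in enumerate(lengths):
            buf[i][n] = preds[i][n - 1]               # prediction made at the last real token
            lengths[i] = n + 1
    return [row[:n] for row, n in zip(buf, lengths)]

out = generate_fixed_shape([[3, 5, 7], [2, 4]], new_tokens=4)
print(out)
```

The trade is extra compute per step (the whole buffer is recomputed) for a forward call whose input shape never varies. Whether a real compiler then traces it only once also depends on how the position update is expressed, which this sketch does not cover.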
docs/english/ the full write-up, in order: 00 summary · 01 architecture ·
02–05 iterations 1–4 · 06 eval suite · 07 procedures ·
09 iter 5 · 10 iter 6 · 11 the mis-specification · 12 iter 7 (6/6) ·
13 channel-vs-text (design + P1) · 14 from experiment to product ·
15 MARS-native evaluation — RESULTS (conflict 264/265 · token-cost · persistence) ·
16 tinyMARS-Native spec (channels from layer 1, from scratch — next)
src/ from-scratch model: tinymars_native.py (decoder + ChannelInjection from
init, trainable base), train_tokenizer.py (BPE 32k)
training/ training + the frozen-base cross-attention adapter (TPU)
eval/apuesta1/scripts/corpus_v2/ iter-7 + channel-vs-text + MARS-native eval: corpus v2
generators, packer, objective scorers (foregrounding, directional),
3-arm channel-vs-text, conflict/override (eval_conflict, score_conflict),
persistence (eval_persistence, score_persistence), token_cost
Start with docs/english/00-executive-summary.md; for the latest result read 12 (6/6), 13 (channel-vs-text), then 15 — the perpendicular force: under direct conflict the channel overrides contradicting text 264/265 (98–100%), emergent and pre-registered. 16 is the from-scratch native spec (next).
The training/eval code is here. The corpus and channel vectors used in production are derived from real, private human data and are deliberately excluded; the iter-7 corpora are fully synthetic (fictional personas, teacher-generated). Each iter-7 verdict is reproducible: the held-out generations are deterministic, and score_directional.py / rescore_v2.py re-score them offline with no TPU. The methodology — counterfactual channel-causal pairs, the six-capability suite, objective (judge-free) metrics with pre-flight validation — is documented in full.
The from-scratch language pipeline is built on nanochat by Andrej Karpathy (MIT). If you build on this work, please also cite nanochat:
@misc{nanochat,
author = {Andrej Karpathy},
title = {nanochat: The best ChatGPT that \$100 can buy},
year = {2025},
publisher = {GitHub},
url = {https://github.com/karpathy/nanochat}
}

Dual-licensed, copyleft, so derivatives stay open and attribution is preserved:
- Code (src/, training/, eval/, scripts): GNU GPL v3.0 (LICENSE). Use it, modify it, build on it, even commercially, but any distributed fork must remain GPL-licensed (open). You can't close it.
- Documentation, research write-ups & figures (docs/, README, results): CC BY-SA 4.0 (LICENSE-docs). Share and adapt with attribution and share-alike (adaptations stay under CC BY-SA).
© 2026 Celiums Solutions LLC · Mario Gutierrez. As the copyright holder, the author is not bound by these terms and may relicense the original work.
Author: Mario Gutierrez · Celiums Solutions LLC

A research log, not a product. Findings are reported honestly, including the negative ones; that's the point.