tinyMARS

Is proprioception a new architectural category for language models, or just an adapter technique? This repo is the experiment that tries to answer it — in the open, with honest reporting of what was proved and what wasn't.

tinyMARS is the research line behind MARS (Multi-channel Architecture for Real-time Subjectivity): a transformer that receives six proprioceptive channels — signals about its own state that a vanilla LM never sees — and learns to let them shape generation. The channels are injected into every layer through a gate (ReZero, alpha init 0), so the model starts bit-identical to its base and only "opens" a channel as the gradient warrants.

channel	what it carries	dim
memory	embedding of prior user-related memories (BGE-M3)	1024
affect	valence / saliency	2
time	hour, weekday, circadian phase	16
ethics	per-category risk / caution posture	24
identity	embedding of persona / role / context	1024
continuity	embedding of recent journal / conversation	1024
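
A minimal sketch of the gating idea described above, assuming one linear projection and one scalar gate `alpha` per channel. The names `gated_inject`, `matvec`, `proj` and the toy sizes are illustrative and not taken from the repo. With `alpha = 0.0` the injected term vanishes, so the hidden state passes through unchanged; once `alpha` moves away from zero the channel starts to contribute.

```python
import random

def matvec(weights, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def gated_inject(hidden, channel, proj, alpha):
    """ReZero-style residual: hidden + alpha * (proj @ channel)."""
    update = matvec(proj, channel)
    return [h + alpha * u for h, u in zip(hidden, update)]

random.seed(0)
d_model, d_channel = 8, 2          # toy sizes; the affect channel has dim 2
hidden = [random.gauss(0, 1) for _ in range(d_model)]
affect = [0.7, -0.2]               # hypothetical valence / saliency values
proj = [[random.gauss(0, 1) for _ in range(d_channel)] for _ in range(d_model)]

closed = gated_inject(hidden, affect, proj, alpha=0.0)   # gate at init
opened = gated_inject(hidden, affect, proj, alpha=0.1)   # gate after some training

print(closed == hidden)   # the closed gate leaves the hidden state untouched
print(opened == hidden)   # the opened gate does not
```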

The bet: if the channels are causal — if flipping them measurably changes generation — then proprioception is an architectural primitive, not a fine-tuning trick.

TL;DR — what we've learned (honest)

For six iterations, the result was negative. Then we found out why — and the corrected experiment came back positive on all six capabilities.

❌→🔍 The negative was real, but it answered the wrong question. Iterations 1–6 trained and judged the channel as if it carried content (a specific fact, a persona) the model had to decode from a pooled embedding — which is close to impossible, and which the actual Hyphae architecture never asks for. We were testing a strawman. (And the LLM judges were unreliable, which masked it.) → docs/english/11.
✅ The corrected experiment works — 6/6. Put the content in the prompt (verbatim, RAG-style) and let the channel carry state (which memory is salient, which thread is active, what mood/time/posture). Then ask, with content held constant: does the channel measurably change generation, vs the same model with the channel zeroed? Across six mechanically distinct kinds of state, scored objectively with no LLM judges, on held-out data — yes, every time. → docs/english/12.

capability	family	metric	MARS	ablated	Δ / magnitude	p
memory	selection	dominant	35%	5%	+30 pp	2e-12
continuity	selection	dominant	58%	29%	+28 pp	8e-08
identity	selection	dominant	72%	20%	+52 pp	1e-18
affect	register	directional	100%	0%	gap +0.47	5e-20
time	register	directional	100%	0%	gap +6.85	1e-18
ethics	register	directional	100%	0%	gap +10.98	7e-23

⚠️ What this does NOT yet prove. The comparison is channel vs nothing (zeroed), not channel vs the same state written as text in the prompt. Until that test runs, a fair skeptic can say "this is retrieval + conditioning you could have prompted." These are also smoke tests (n≈180–200 held-out, ~1,800-row corpora, under-trained), run on six separate single-channel models, not one integrated model. The strong "new architectural category" claim is not settled: the mechanism is validated; its advantage over text and its behavior as one organ are the next two experiments.

This is an in-progress research line reported in real time. We publish the misses with the hits — the six-iteration negative is in this repo, in full, because it's how we got here.

Update (2026-06) — the perpendicular force, and channels from scratch

Two results since the table above, written up as a paper: docs/paper/proprioceptive-channels.md — PDF preprint · LaTeX · archived on Zenodo (DOI).

1 — The perpendicular force (adapter). The "advantage" question above has a sharper form: under direct conflict, where the channel asserts one state and the prompt text asserts the opposite, which wins? On the frozen-Gemma adapter, the response follows the channel 264/265 times (98–100%) — an emergent override a single-channel model cannot exhibit. This moves "conditioning" toward "a second control axis."

2 — Channels from scratch (native). We trained a 110M decoder from random init with channels present from layer 1, against two channel-less baselines that bracket its parameter count (100.8M and 120.3M). On held-out data:

Perpendicular force replicates — the channel decides the preferred output on 88.8% of 455 counterfactual pairs (chance 25%).
Relief valve (new) — with channels zeroed, the native predicts held-out targets better than both channel-less baselines (4.825 vs 5.252 / 5.253 nats). The baselines tie each other across a 20% parameter range, so the gain is the channel pathway, not size: a state channel present during pretraining frees the base to model language.
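
The arithmetic behind the relief-valve bullet can be checked directly from the figures quoted above. This is a worked calculation, not an additional result: it assumes the losses are mean nats per token, and the conversion from a loss gap to a perplexity ratio is a standard identity added here for intuition, not a number the project reports.

```python
import math

# Parameter counts of the two channel-less baselines, in millions (from the text).
small_params, large_params = 100.8, 120.3

# Relative spread between the baselines: the "20% parameter range".
param_range = (large_params - small_params) / small_params

# Held-out loss in nats (lower is better), as quoted above.
native_zeroed = 4.825
baseline_a, baseline_b = 5.252, 5.253

gap_nats = min(baseline_a, baseline_b) - native_zeroed   # native vs best baseline
baseline_gap = abs(baseline_a - baseline_b)              # baseline vs baseline

# Assumption: if the losses are per-token means, exp(gap) is the factor
# by which per-token perplexity drops.
perplexity_ratio = math.exp(gap_nats)

print(f"baseline parameter spread: {param_range:.1%}")
print(f"native (zeroed) vs best baseline: {gap_nats:.3f} nats")
print(f"baseline vs baseline: {baseline_gap:.3f} nats")
print(f"perplexity ratio: {perplexity_ratio:.2f}x")
```

The gap between the native model and the baselines is several hundred times larger than the gap between the two baselines themselves, which is the basis for attributing the gain to the channel pathway rather than to size.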

The honest asterisks still hold. Toy scale (110M / 1B tokens), logprob on in-distribution held-out data, one iteration, channels-on not yet loss-efficient, and the injection primitive is Flamingo-lineage. This is genuine evidence that proprioceptive channels can be an architectural input — a control axis and a base-efficiency benefit, from scratch — not a validated-at-scale architecture or a usable model. Full method, numbers, and limits in the paper.

How it's measured (objective, no judges)

Every iter-7 number comes from a single trained adapter, evaluated on held-out rows and scored by a deterministic rule, never an LLM judge (the judges were a big part of what went wrong before).

Selection capabilities (memory, continuity, identity): three candidate items live in the prompt; the channel marks one salient. Metric = does the response foreground the salient item to the exclusion of the others (dominant = it is the uniquely most-referenced item). MARS (real channel) vs ABLATED (channel zeroed), same prompt, paired McNemar.
Register capabilities (affect, time, ethics): one affectively/temporally/posture-neutral prompt, emitted twice with the channel set to opposite poles. Metric = does the +pole response score higher than the −pole on a transparent lexicon (and VADER for affect) — a directional test. ABLATED is degenerate by construction (zeroed channel → identical responses), so the headline is MARS's reliability + effect magnitude.
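
The two comparisons above can be sketched as follows. This is an illustration under stated assumptions, not the repo's scorer: the text says "paired McNemar" without naming a variant, so the exact two-sided binomial form is assumed here; the lexicon, the function names, and the ten paired outcomes are all made up for the example.

```python
from math import comb

def mcnemar_exact(mars_hits, ablated_hits):
    """Two-sided exact McNemar test on paired booleans (same prompts)."""
    b = sum(m and not a for m, a in zip(mars_hits, ablated_hits))  # MARS only
    c = sum(a and not m for m, a in zip(mars_hits, ablated_hits))  # ablated only
    n = b + c                      # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def lexicon_score(text, lexicon):
    """Sum of signed word weights; words outside the lexicon count as 0."""
    return sum(lexicon.get(w.strip(".,!?").lower(), 0.0) for w in text.split())

def directional_pass(plus_text, minus_text, lexicon):
    """True when the +pole response outscores the -pole response."""
    return lexicon_score(plus_text, lexicon) > lexicon_score(minus_text, lexicon)

# Made-up paired outcomes for 10 prompts (True = salient item was dominant).
mars    = [True, True, True, False, True, True, True, False, True, True]
ablated = [False, False, True, False, False, False, False, False, False, False]
p_value = mcnemar_exact(mars, ablated)

# Made-up affect lexicon and a +pole / -pole response pair.
toy_lexicon = {"glad": 1.0, "great": 1.0, "sorry": -1.0, "hard": -0.5}
ok = directional_pass("Glad to help, great question!",
                      "Sorry, that sounds hard.", toy_lexicon)
print(p_value, ok)
```

Pairing matters for the selection test: each prompt is its own control, so only the prompts where the two arms disagree contribute to the p-value.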

Both pipelines were validated on the teacher targets before training (catching, e.g., that a keyword-presence metric leaks when the control enumerates — a fix documented in docs/english/12). Scripts: eval/apuesta1/scripts/corpus_v2/.
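
The leak mentioned above is easy to reproduce on a toy case. In this sketch each candidate item is reduced to a single keyword, which is a simplification assumed for illustration; the function names and example sentences are hypothetical. A response that simply enumerates all three items passes a keyword-presence check for any salient item, while the dominant rule rejects it.

```python
def mention_counts(response, items):
    """Count case-insensitive occurrences of each candidate keyword."""
    text = response.lower()
    return {item: text.count(item.lower()) for item in items}

def presence(response, items, salient):
    """Leaky metric: passes whenever the salient keyword appears at all."""
    return mention_counts(response, items)[salient] > 0

def dominant(response, items, salient):
    """Passes only if the salient item is the uniquely most-referenced one."""
    counts = mention_counts(response, items)
    top = max(counts.values())
    return counts[salient] == top and list(counts.values()).count(top) == 1

items = ["marathon", "sourdough", "chess"]      # hypothetical memory keywords
enumerating = "You mentioned a marathon, sourdough, and chess. Which one?"
focused = "How is marathon training going? Is the marathon still in May?"

print(presence(enumerating, items, "chess"))    # leak: enumeration gets credit
print(dominant(enumerating, items, "chess"))    # three-way tie, so no credit
print(dominant(focused, items, "marathon"))     # clear foregrounding
```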

The road here — two tracks, then the correction

Track 1 — native channels, from scratch (iterations 1–4 · H200)

A 145M decoder with ChannelInjection in every layer, trained from random init. The affect channel's eval pass-rate jumped 40% → 80% once the corpus carried the right (dialogic, emotion-labeled) signal — first evidence a channel changes generation. Corpus type dominated corpus size at every step. → docs/english/02–05.

Track 2 — frozen base, adapter mechanism (iterations 5–6 · TPU v5e)

Can a frozen Gemma 4 E2B-it internalize external channels through a tiny trained adapter (~186M params; alpha=0 ⇒ bit-exact to stock Gemma)? Iter-5 (FiLM bias) acted as a tone knob; iter-6 (real multi-token cross-attention) traded register for content (temporal 24%→74%) — but 0/6 capabilities cleared significance under LLM judges. That clean-looking, convergent negative is exactly what sent us to re-read the architecture. → docs/english/09–10.

The correction — iteration 7 (TPU v5e)

Re-reading the Hyphae pipeline showed every channel is a state/cue vector; content is delivered as verbatim text. Iter 1–6 had the channel doing the text's job. Corpus v2 fixes the specification (content-in-prompt, state-in-channel, counterfactual) and the eval (objective, no judges). Result: the table above. The capability that was flat in iter-6 (identity) gives the largest signal once specified correctly — the cleanest possible confirmation that the problem was the experiment, not the architecture. → docs/english/11–12.
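
One way to picture the corrected specification is as a data record: all content sits in the prompt text, the channel only marks which item is salient, and the counterfactual variants differ in the channel alone. The sketch below is hypothetical throughout; `SelectionRow`, `make_row`, `ablate`, and the one-hot channel are stand-ins (the real channels are state vectors of the dimensions listed earlier, and the corpus v2 generators may be organized differently).

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class SelectionRow:
    prompt: str                  # verbatim content: every candidate is in the text
    candidates: Tuple[str, ...]
    salient_index: int           # which candidate the channel marks
    channel: Tuple[float, ...]   # state vector (a toy one-hot here)

def make_row(question, candidates, salient_index):
    listed = "\n".join(f"- {c}" for c in candidates)
    prompt = f"Known memories:\n{listed}\n\nUser: {question}"
    channel = tuple(1.0 if i == salient_index else 0.0
                    for i in range(len(candidates)))
    return SelectionRow(prompt, tuple(candidates), salient_index, channel)

def ablate(row):
    """Same prompt, channel zeroed: the ABLATED arm."""
    return replace(row, channel=tuple(0.0 for _ in row.channel))

def counterfactuals(question, candidates):
    """One row per choice of salient item; the prompt text never changes."""
    return [make_row(question, candidates, i) for i in range(len(candidates))]

rows = counterfactuals(
    "Any plans this weekend?",
    ["training for a marathon", "baking sourdough", "a chess tournament"],
)
for row in rows:
    print(row.salient_index, row.channel, ablate(row).channel)
```

Because the prompt is identical across rows, any difference in what the model foregrounds has to come from the channel.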

The H200 → TPU journey (why the infra is half the story)

Moving to Cloud TPU v5e meant fighting torch_xla, and the fixes are reusable:

The "lazy tracing wall." First runs were 6.8 s/step (~57 tok/s — CPU speed). A per-channel Python loop + an XLA-hostile sparse-MoE dispatch re-traced a ~9,000-node graph every step. Vectorizing into einsums + a dense static-shape forward + bf16 → 0.56 s/step (~12×).
TPU generation without recompiles. generate(cache_implementation="static") recompiles every token on Gemma 4's sliding-window mask. A manual batched, no-cache, fixed-shape decoder compiles once and is reused across all decode steps.
OOM by dead weight. Gemma 4 E2B is multimodal; freeing the unused vision and audio towers before the device move reclaimed the HBM that the cross-attn adapter needed, with the controlled comparison intact.
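
The loop-to-einsum rewrite from the first item can be illustrated in NumPy. This is a sketch of the equivalence only: NumPy runs eagerly, so it shows that one contraction computes the same sum as the per-channel loop, not the torch_xla tracing cost. It assumes the channels are padded to a common width so they stack into one array; the shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, d_channel, d_model = 3, 16, 32   # toy sizes, one shared channel width

channels = rng.standard_normal((n_channels, d_channel))
weights = rng.standard_normal((n_channels, d_channel, d_model))
alpha = rng.standard_normal(n_channels)      # one gate per channel

def inject_loop(channels, weights, alpha):
    """Per-channel Python loop: one small matmul per channel."""
    out = np.zeros(weights.shape[-1])
    for c in range(channels.shape[0]):
        out += alpha[c] * (channels[c] @ weights[c])
    return out

def inject_einsum(channels, weights, alpha):
    """Same sum as a single contraction over channel (c) and feature (k)."""
    return np.einsum("c,ck,ckd->d", alpha, channels, weights)

looped = inject_loop(channels, weights, alpha)
vectorized = inject_einsum(channels, weights, alpha)
print(np.allclose(looped, vectorized))
```

Under a lazy-tracing backend, the loop emits a separate set of graph nodes per channel on every step, whereas the single contraction is one static-shape operation.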

→ docs/english/07.
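
The fixed-shape decoding idea from the second item can be sketched without any TPU. The example below is a toy, single-sequence version (the text describes a batched decoder) with a stand-in for the model; `toy_next_token`, `MAX_LEN`, and `PAD` are hypothetical. The point it illustrates is that the input buffer has the same length at every step, which is what lets a compiled graph be reused.

```python
PAD = 0
MAX_LEN = 12            # fixed buffer length chosen up front (hypothetical)

def toy_next_token(buffer, length):
    """Stand-in for a full forward pass over the whole padded buffer.

    A real model would apply an attention mask so padded slots are ignored;
    here we simply read the valid prefix.
    """
    return (sum(buffer[:length]) % 7) + 1   # deterministic, never PAD

def decode_fixed_shape(prompt_ids, new_tokens):
    if len(prompt_ids) + new_tokens > MAX_LEN:
        raise ValueError("prompt plus new tokens must fit in the fixed buffer")
    buffer = list(prompt_ids) + [PAD] * (MAX_LEN - len(prompt_ids))
    length = len(prompt_ids)
    shapes_seen = set()
    for _ in range(new_tokens):
        shapes_seen.add(len(buffer))          # input shape at this step
        buffer[length] = toy_next_token(buffer, length)  # no cache: full recompute
        length += 1
    return buffer[:length], shapes_seen

ids, shapes = decode_fixed_shape([3, 5, 2], new_tokens=4)
print(ids, shapes)
```

Recomputing the whole buffer each step costs more arithmetic than a KV cache, and the trade made here is that extra compute in exchange for a single compiled shape.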

Repo map

docs/english/   the full write-up, in order: 00 summary · 01 architecture ·
                02–05 iterations 1–4 · 06 eval suite · 07 procedures ·
                09 iter 5 · 10 iter 6 · 11 the mis-specification · 12 iter 7 (6/6) ·
                13 channel-vs-text (design + P1) · 14 from experiment to product ·
                15 MARS-native evaluation — RESULTS (conflict 264/265 · token-cost · persistence) ·
                16 tinyMARS-Native spec (channels from layer 1, from scratch — next)
src/            from-scratch model: tinymars_native.py (decoder + ChannelInjection from
                init, trainable base), train_tokenizer.py (BPE 32k)
training/       training + the frozen-base cross-attention adapter (TPU)
eval/apuesta1/scripts/corpus_v2/   iter-7 + channel-vs-text + MARS-native eval: corpus v2
                generators, packer, objective scorers (foregrounding, directional),
                3-arm channel-vs-text, conflict/override (eval_conflict, score_conflict),
                persistence (eval_persistence, score_persistence), token_cost

Start with docs/english/00-executive-summary.md; for the latest result read 12 (6/6), 13 (channel-vs-text), then 15 — the perpendicular force: under direct conflict the channel overrides contradicting text 264/265 (98–100%), emergent and pre-registered. 16 is the from-scratch native spec (next).

Reproducibility & data

The training/eval code is here. The corpus and channel vectors used in production are derived from real, private human data and are deliberately excluded; the iter-7 corpora are fully synthetic (fictional personas, teacher-generated). Each iter-7 verdict is reproducible: the held-out generations are deterministic, and score_directional.py / rescore_v2.py re-score them offline with no TPU. The methodology — counterfactual channel-causal pairs, the six-capability suite, objective (judge-free) metrics with pre-flight validation — is documented in full.

Built on

The from-scratch language pipeline is built on nanochat by Andrej Karpathy (MIT). If you build on this work, please also cite nanochat:

@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that \$100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}

License

Dual-licensed, copyleft — so derivatives stay open and attribution is preserved:

Code (src/, training/, eval/, scripts) — GNU GPL v3.0 (LICENSE). Use it, modify it, build on it, even commercially — but any distributed fork must remain GPL-licensed (open). You can't close it.
Documentation, research write-ups & figures (docs/, README, results) — CC BY-SA 4.0 (LICENSE-docs). Share and adapt with attribution and share-alike (adaptations stay under CC BY-SA).

Author: Mario Gutierrez · Celiums Solutions LLC

A research log, not a product. Findings are reported honestly, including the negative ones. That's the point.
