terrizoaguimor/tinymars

tinyMARS

DOI · Preprint (PDF) · Code: GPL-3.0 · Docs: CC BY-SA 4.0

Is proprioception a new architectural category for language models, or just an adapter technique? This repo is the experiment that tries to answer it — in the open, with honest reporting of what was proved and what wasn't.

tinyMARS is the research line behind MARS (Multi-channel Architecture for Real-time Subjectivity): a transformer that receives six proprioceptive channels — signals about its own state that a vanilla LM never sees — and learns to let them shape generation. The channels are injected into every layer through a gate (ReZero, alpha init 0), so the model starts bit-identical to its base and only "opens" a channel as the gradient warrants.

| channel | what it carries | dim |
|---|---|---|
| memory | embedding of prior user-related memories (BGE-M3) | 1024 |
| affect | valence / saliency | 2 |
| time | hour, weekday, circadian phase | 16 |
| ethics | per-category risk / caution posture | 24 |
| identity | embedding of persona / role / context | 1024 |
| continuity | embedding of recent journal / conversation | 1024 |

The bet: if the channels are causal — if flipping them measurably changes generation — then proprioception is an architectural primitive, not a fine-tuning trick.
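The zero-initialized gate described above can be sketched in a few lines. This is a minimal illustration, not the repository's ChannelInjection: the class name `GatedChannel`, the linear projection, and the weight scale are assumptions made for the example. Only the channel widths and the "alpha starts at 0" property come from the text.

```python
import random

# Channel widths copied from the table above.
CHANNEL_DIMS = {
    "memory": 1024, "affect": 2, "time": 16,
    "ethics": 24, "identity": 1024, "continuity": 1024,
}


class GatedChannel:
    """Hypothetical ReZero-style gate: hidden + alpha * (W @ channel)."""

    def __init__(self, channel_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Assumed: a plain linear projection from channel space to hidden space.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(channel_dim)]
                       for _ in range(hidden_dim)]
        self.alpha = 0.0  # gate closed at init, so the base is left untouched

    def __call__(self, hidden, channel):
        projected = [sum(w * c for w, c in zip(row, channel))
                     for row in self.weight]
        return [h + self.alpha * p for h, p in zip(hidden, projected)]


gate = GatedChannel(CHANNEL_DIMS["affect"], hidden_dim=4)
hidden = [0.5, -1.0, 2.0, 0.0]
affect = [1.0, -1.0]
closed = gate(hidden, affect)   # alpha == 0: output equals the input
gate.alpha = 0.1                # stands in for what a gradient step would do
opened = gate(hidden, affect)   # the channel now shifts the hidden state
```

With `alpha` at zero the injected term is multiplied away, which is why a model built this way starts out matching its base; any influence of a channel has to be learned through the gate.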


TL;DR — what we've learned (honest)

For six iterations, the result was negative. Then we found out why — and the corrected experiment came back positive on all six capabilities.

  • ❌→🔍 The negative was real, but it answered the wrong question. Iterations 1–6 trained and judged the channel as if it carried content (a specific fact, a persona) the model had to decode from a pooled embedding — which is close to impossible, and which the actual Hyphae architecture never asks for. We were testing a strawman. (And the LLM judges were unreliable, which masked it.) → docs/english/11.
  • The corrected experiment works — 6/6. Put the content in the prompt (verbatim, RAG-style) and let the channel carry state (which memory is salient, which thread is active, what mood/time/posture). Then ask, with content held constant: does the channel measurably change generation, vs the same model with the channel zeroed? Across six mechanically distinct kinds of state, scored objectively with no LLM judges, on held-out data — yes, every time. → docs/english/12.
| capability | family | metric | MARS | ablated | Δ / magnitude | p |
|---|---|---|---|---|---|---|
| memory | selection | dominant | 35% | 5% | +30 pp | 2e-12 |
| continuity | selection | dominant | 58% | 29% | +28 pp | 8e-08 |
| identity | selection | dominant | 72% | 20% | +52 pp | 1e-18 |
| affect | register | directional | 100% | 0% | gap +0.47 | 5e-20 |
| time | register | directional | 100% | 0% | gap +6.85 | 1e-18 |
| ethics | register | directional | 100% | 0% | gap +10.98 | 7e-23 |
  • ⚠️ What this does NOT yet prove. The comparison is channel vs nothing (zeroed), not channel vs the same state written as text in the prompt. Until that test runs, a fair skeptic can say "this is retrieval + conditioning you could have prompted." These are also smoke tests (n≈180–200 held-out rows, ~1,800-row corpora, under-trained models), run on six separate single-channel models rather than one integrated model. The strong "new architectural category" claim is not settled: the mechanism is validated; its advantage over text and its behavior as one organ are the next two experiments.

This is an in-progress research line reported in real time. We publish the misses with the hits — the six-iteration negative is in this repo, in full, because it's how we got here.


Update (2026-06) — the perpendicular force, and channels from scratch

Two results since the table above, written up as a paper: docs/paper/proprioceptive-channels.md · PDF preprint · LaTeX · archived on Zenodo (DOI).

1 — The perpendicular force (adapter). The "advantage" question above has a sharper form: under direct conflict, where the channel asserts one state and the prompt text asserts the opposite, which wins? On the frozen-Gemma adapter, the response follows the channel 264/265 times (98–100%) — an emergent override a single-channel model cannot exhibit. This moves "conditioning" toward "a second control axis."

2 — Channels from scratch (native). We trained a 110M decoder from random init with channels present from layer 1, against two channel-less baselines that bracket its parameter count (100.8M and 120.3M). On held-out data:

  • Perpendicular force replicates — the channel decides the preferred output on 88.8% of 455 counterfactual pairs (chance 25%).
  • Relief valve (new) — with channels zeroed, the native predicts held-out targets better than both channel-less baselines (4.825 vs 5.252 / 5.253 nats). The baselines tie each other across a 20% parameter range, so the gain is the channel pathway, not size: a state channel present during pretraining frees the base to model language.
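To put the two native results on a more familiar scale, the following sketch converts the reported nats into perplexities and checks how far 88.8% of 455 sits from the 25% chance level. Two assumptions are made here: the loss figures are read as nats per token, and the pair count is taken as 404, the only integer that rounds to 88.8% of 455 (the exact count is not stated in this README). The helper names are illustrative.

```python
import math


def perplexity(nats_per_token):
    """Perplexity for an average negative log-likelihood given in nats."""
    return math.exp(nats_per_token)


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p), by direct summation."""
    return sum(math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


# Held-out loss with channels zeroed, as reported above.
native, baseline = 4.825, 5.252
gap_nats = baseline - native
ppl_ratio = perplexity(baseline) / perplexity(native)  # equals exp(gap_nats)

# Assumed count: 404 of 455 pairs (rounds to the reported 88.8%).
p_value = binomial_tail(404, 455, 0.25)

print(f"gap = {gap_nats:.3f} nats, perplexity ratio = {ppl_ratio:.2f}")
print(f"one-sided binomial p vs 25% chance = {p_value:.3g}")
```

The perplexity ratio is simply the exponential of the gap in nats, so a gap of about 0.43 nats corresponds to the baselines being roughly one and a half times as "surprised" per token as the native model with its channels zeroed.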

The honest asterisks still hold. Toy scale (110M / 1B tokens), logprob on in-distribution held-out data, one iteration, channels-on not yet loss-efficient, and the injection primitive is Flamingo-lineage. This is genuine evidence that proprioceptive channels can be an architectural input — a control axis and a base-efficiency benefit, from scratch — not a validated-at-scale architecture or a usable model. Full method, numbers, and limits in the paper.


How it's measured (objective, no judges)

Every iter-7 number comes from a single trained adapter, evaluated on held-out rows and scored by a deterministic rule, never by an LLM judge (the judges were a big part of what went wrong before).

  • Selection capabilities (memory, continuity, identity): three candidate items live in the prompt; the channel marks one salient. Metric = does the response foreground the salient item to the exclusion of the others (dominant = it is the uniquely most-referenced item). MARS (real channel) vs ABLATED (channel zeroed), same prompt, paired McNemar.
  • Register capabilities (affect, time, ethics): one affectively/temporally/posture-neutral prompt, emitted twice with the channel set to opposite poles. Metric = does the +pole response score higher than the −pole on a transparent lexicon (and VADER for affect) — a directional test. ABLATED is degenerate by construction (zeroed channel → identical responses), so the headline is MARS's reliability + effect magnitude.
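The selection metric and its paired test can be sketched as follows. This is a simplified stand-in, not the repository's scorer: counting case-insensitive substring matches is an assumption about how "most-referenced" might be measured, and the function names are hypothetical. The exact (binomial) form of McNemar's test uses only the discordant pairs, that is, rows where MARS and ABLATED disagree.

```python
import math


def is_dominant(response, salient, others):
    """True when the salient item is the uniquely most-referenced candidate.

    Assumption: a 'reference' is a case-insensitive substring match.
    """
    text = response.lower()
    salient_hits = text.count(salient.lower())
    return salient_hits > 0 and all(
        salient_hits > text.count(other.lower()) for other in others)


def mcnemar_exact(mars_hits, ablated_hits):
    """Two-sided exact McNemar p-value from paired boolean outcomes."""
    pairs = list(zip(mars_hits, ablated_hits))
    b = sum(m and not a for m, a in pairs)  # only MARS foregrounds the item
    c = sum(a and not m for m, a in pairs)  # only ABLATED foregrounds it
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


reply = "About the hiking trip: the hiking trip went well, unlike the recipe."
dominant = is_dominant(reply, "hiking trip", ["recipe", "job interview"])

# Toy paired outcomes: 8 rows where only MARS is dominant, 2 where neither is.
p = mcnemar_exact([True] * 8 + [False] * 2, [False] * 10)
```

Pairing matters here because both arms see the same prompt: rows where the two arms agree carry no information about the channel, so the test ignores them.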

Both pipelines were validated on the teacher targets before training (catching, e.g., that a keyword-presence metric leaks when the control enumerates — a fix documented in docs/english/12). Scripts: eval/apuesta1/scripts/corpus_v2/.
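For the register capabilities, the directional test described above reduces to scoring each pair of responses on a lexicon and checking the sign of the gap. The sketch below uses a tiny lexicon invented for illustration; the repository's lexicons and its VADER scoring are not reproduced, and the function names are hypothetical.

```python
# Toy lexicon, invented for this example only.
TOY_LEXICON = {
    "glad": 1.0, "wonderful": 1.0, "calm": 0.5,
    "sorry": -0.5, "tired": -0.5, "awful": -1.0,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum of lexicon weights over the words of a response."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    return sum(lexicon.get(w, 0.0) for w in words)


def directional_summary(pairs, lexicon=TOY_LEXICON):
    """pairs: (plus_pole_response, minus_pole_response) for the same prompt.

    Returns (share of pairs where the +pole scores higher, mean gap).
    """
    gaps = [lexicon_score(plus, lexicon) - lexicon_score(minus, lexicon)
            for plus, minus in pairs]
    rate = sum(gap > 0 for gap in gaps) / len(gaps)
    return rate, sum(gaps) / len(gaps)


pairs = [
    ("I am glad, that sounds wonderful!", "That sounds awful, sorry."),
    ("Same text.", "Same text."),  # what a zeroed channel produces
]
rate, mean_gap = directional_summary(pairs)
```

The second pair shows why the ABLATED arm is degenerate for this metric: identical responses always give a gap of zero, so they can never count as a directional success.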


The road here — two tracks, then the correction

Track 1 — native channels, from scratch (iterations 1–4 · H200)

A 145M decoder with ChannelInjection in every layer, trained from random init. The affect channel's eval pass-rate jumped 40% → 80% once the corpus carried the right (dialogic, emotion-labeled) signal — first evidence a channel changes generation. Corpus type dominated corpus size at every step. → docs/english/02–05.

Track 2 — frozen base, adapter mechanism (iterations 5–6 · TPU v5e)

Can a frozen Gemma 4 E2B-it internalize external channels through a tiny trained adapter (~186M params; alpha=0 ⇒ bit-exact to stock Gemma)? Iter-5 (FiLM bias) acted as a tone knob; iter-6 (real multi-token cross-attention) traded register for content (temporal 24%→74%) — but 0/6 capabilities cleared significance under LLM judges. That clean-looking, convergent negative is exactly what sent us to re-read the architecture. → docs/english/09–10.

The correction — iteration 7 (TPU v5e)

Re-reading the Hyphae pipeline showed every channel is a state/cue vector; content is delivered as verbatim text. Iter 1–6 had the channel doing the text's job. Corpus v2 fixes the specification (content-in-prompt, state-in-channel, counterfactual) and the eval (objective, no judges). Result: the table above. The capability that was flat in iter-6 (identity) gives the largest signal once specified correctly — the cleanest possible confirmation that the problem was the experiment, not the architecture. → docs/english/11–12.


The H200 → TPU journey (why the infra is half the story)

Moving to Cloud TPU v5e meant fighting torch_xla, and the fixes are reusable:

  • The "lazy tracing wall." First runs were 6.8 s/step (~57 tok/s — CPU speed). A per-channel Python loop + an XLA-hostile sparse-MoE dispatch re-traced a ~9,000-node graph every step. Vectorizing into einsums + a dense static-shape forward + bf16 → 0.56 s/step (~12×).
  • TPU generation without recompiles. generate(cache_implementation="static") recompiles every token on Gemma 4's sliding-window mask. A manual batched, no-cache, fixed-shape decoder compiles once and is reused across all decode steps.
  • OOM by dead weight. Gemma 4 E2B is multimodal; freeing the vision + audio towers we never use before the device move reclaimed the HBM that the cross-attn adapter needed — controlled comparison intact.

→ docs/english/07.
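The idea behind the fixed-shape, no-cache decoder can be shown without any TPU code. In the sketch below, `toy_forward` is a stand-in for a compiled causal forward pass and the token rule inside it is invented; nothing here calls torch_xla or Gemma. What it illustrates is the invariant that matters for avoiding recompiles: the forward pass always sees the same `(batch, max_len)` shape, and each step only reads the prediction at the current position.

```python
PAD_ID = 0


def toy_forward(batch):
    """Stand-in for a causal model: one next-token id per position.

    The prediction at position i depends only on tokens 0..i, so padding
    to the right of i cannot change it.
    """
    out = []
    for row in batch:
        preds, running = [], 0
        for tok in row:
            running += tok
            preds.append(running % 50 + 1)  # invented rule, never PAD_ID
        out.append(preds)
    return out


def fixed_shape_decode(prompts, max_len, forward=toy_forward):
    """Greedy decode into a buffer whose shape never changes.

    Assumes every prompt is non-empty and no longer than max_len.
    """
    buf = [p + [PAD_ID] * (max_len - len(p)) for p in prompts]
    cursors = [len(p) for p in prompts]
    seen_shapes = set()
    while min(cursors) < max_len:
        seen_shapes.add((len(buf), len(buf[0])))
        preds = forward(buf)  # full-length pass, no KV cache
        for i, cur in enumerate(cursors):
            if cur < max_len:
                buf[i][cur] = preds[i][cur - 1]
                cursors[i] = cur + 1
    return buf, seen_shapes


tokens, shapes = fixed_shape_decode([[3, 4], [5]], max_len=6)
```

The trade-off is extra compute per step, since the whole buffer is re-run every time, in exchange for a single graph shape across all decode steps.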


Repo map

docs/english/   the full write-up, in order: 00 summary · 01 architecture ·
                02–05 iterations 1–4 · 06 eval suite · 07 procedures ·
                09 iter 5 · 10 iter 6 · 11 the mis-specification · 12 iter 7 (6/6) ·
                13 channel-vs-text (design + P1) · 14 from experiment to product ·
                15 MARS-native evaluation — RESULTS (conflict 264/265 · token-cost · persistence) ·
                16 tinyMARS-Native spec (channels from layer 1, from scratch — next)
src/            from-scratch model: tinymars_native.py (decoder + ChannelInjection from
                init, trainable base), train_tokenizer.py (BPE 32k)
training/       training + the frozen-base cross-attention adapter (TPU)
eval/apuesta1/scripts/corpus_v2/   iter-7 + channel-vs-text + MARS-native eval: corpus v2
                generators, packer, objective scorers (foregrounding, directional),
                3-arm channel-vs-text, conflict/override (eval_conflict, score_conflict),
                persistence (eval_persistence, score_persistence), token_cost

Start with docs/english/00-executive-summary.md; for the latest result read 12 (6/6), 13 (channel-vs-text), then 15 — the perpendicular force: under direct conflict the channel overrides contradicting text 264/265 (98–100%), emergent and pre-registered. 16 is the from-scratch native spec (next).

Reproducibility & data

The training/eval code is here. The corpus and channel vectors used in production are derived from real, private human data and are deliberately excluded; the iter-7 corpora are fully synthetic (fictional personas, teacher-generated). Each iter-7 verdict is reproducible: the held-out generations are deterministic, and score_directional.py / rescore_v2.py re-score them offline with no TPU. The methodology — counterfactual channel-causal pairs, the six-capability suite, objective (judge-free) metrics with pre-flight validation — is documented in full.


Built on

The from-scratch language pipeline is built on nanochat by Andrej Karpathy (MIT). If you build on this work, please also cite nanochat:

@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that \$100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}

License

Dual-licensed, copyleft — so derivatives stay open and attribution is preserved:

  • Code (src/, training/, eval/, scripts) — GNU GPL v3.0 (LICENSE). Use it, modify it, build on it, even commercially — but any distributed fork must remain GPL-licensed (open). You can't close it.
  • Documentation, research write-ups & figures (docs/, README, results) — CC BY-SA 4.0 (LICENSE-docs). Share and adapt with attribution and share-alike (adaptations stay under CC BY-SA).

© 2026 Celiums Solutions LLC · Mario Gutierrez. As the copyright holder, the author is not bound by these terms and may relicense the original work.


Author: Mario Gutierrez · Celiums Solutions LLC

A research log, not a product. Findings are reported honestly, including the negative ones; that's the point.

About

Is proprioception an architectural category for language models, or an adapter technique? An open research log: channel causality proven from scratch (H200, iter 1-4), then frozen-base cross-attention iterated on TPU (iter 5-6). Honest reporting — what was proved, what wasn't.
