mandarwagh9/dvd-jepa

DVD-JEPA

A tiny, fully reproducible Joint-Embedding Predictive Architecture (JEPA) world model that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a CPU in ~10 seconds.

Paper (PDF) · Live demo · HF Space · Open in Colab · License: MIT · CPU only

[Figure: reality vs. the JEPA's rendered latent dream]

Left: reality. Right: the model's dream — rolled forward purely in latent space and decoded to pixels.


Abstract

Most attempts to learn a world model from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. JEPA (Joint-Embedding Predictive Architecture, LeCun 2022) makes a different bet: predict the representation of the future, not the pixels, and let the encoder discard whatever it cannot predict.

DVD-JEPA is the smallest honest demonstration of that idea we could build. The "world" is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained — with no labels and no decoder — to predict the next observation in a 32-dimensional representation space. We then show three things:

  1. It learned the world. A linear probe recovers the logo's (y, x) position from the frozen 32-d latent with an RMSE of 0.73 px, though the model was never given a coordinate.
  2. It can dream (once you add a decoder). Bolt an optional decoder onto the frozen latents and roll the predictor forward: it renders a correct future-frame video of the bounce, including wall reflections, for ~20 steps before latent drift sets in.
  3. It is useful. Run it as a 1-step predictive monitor and the prediction error becomes an anomaly signal: inject a teleport and surprise spikes 88× over baseline, on the right frame.
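The linear-probe idea in item 1 can be sketched in a few lines. This is a minimal illustration, not the repository's evaluation code: the helper names (`solve`, `fit_linear_probe`, `probe`) are hypothetical, and the "latents" are a synthetic 2-d linear mixing of the hidden position rather than the trained 32-d encoder output, so the fit here is exact by construction.

```python
import random

def solve(a, b):
    """Solve the square system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear_probe(latents, targets):
    """Least-squares weights (last entry is the bias) mapping a latent to one coordinate."""
    rows = [z + [1.0] for z in latents]
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(gram, rhs)

def probe(weights, z):
    """Apply the fitted probe to one latent vector."""
    return sum(w * v for w, v in zip(weights, z + [1.0]))

# Synthetic stand-in for frozen latents (assumption): a fixed mixing of the hidden (y, x).
rng = random.Random(0)
positions = [(rng.uniform(0, 15), rng.uniform(0, 15)) for _ in range(200)]
latents = [[0.5 * y - 0.2 * x + 1.0, 0.1 * y + 0.7 * x - 2.0] for y, x in positions]

w_y = fit_linear_probe(latents, [y for y, _ in positions])
sq_err = [(probe(w_y, z) - y) ** 2 for z, (y, _) in zip(latents, positions)]
rmse_y = (sum(sq_err) / len(sq_err)) ** 0.5
print(f"probe RMSE on y: {rmse_y:.2e} px")
```

The encoder stays frozen throughout: only the probe weights are fitted, so a low error says the position was already linearly readable from the latent.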

The whole thing runs client-side in your browser at dvd-jepa.vercel.app — the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.

📄 Paper

There's a full arXiv-style write-up (method, anti-collapse ablation, forecast-horizon curve, anomaly detection, references): paper/main.pdf — also attached to the latest release.

The paper is fully reproducible: paper/main.tex is the LaTeX source and paper/figures.py regenerates every figure and number in it.

python paper/figures.py     # regenerate figures + metrics.tex
tectonic paper/main.tex     # compile the PDF (any LaTeX engine works)

The idea in one picture

            ┌──────────────────────── trained without labels, without a decoder ───────────────────────┐
            │                                                                                            │
 obs_t  ──▶ │ Encoder Eθ ─▶ z_t ──▶ Predictor P ─▶ ẑ_{t+1} ─────────────▶  ‖ ẑ_{t+1} − sg(z̄_{t+1}) ‖²  │ ◀── loss is in
 (2 frames) │                                                              ▲   (prediction in latent      │     LATENT space,
            │ obs_{t+1} ─▶ Encoder E_ema (EMA, stop-grad) ─▶ z̄_{t+1} ──────┘    space, never pixels)       │     never pixels
            └────────────────────────────────────────────────────────────────────────────────────────────┘
                                              │  + VICReg variance term  →  no representation collapse
                                              ▼
        (optional, separate) Decoder D : z → 16×16 frame      ←  the "sellout" that makes the dream visible & useful

Why a bouncing logo?

It is the simplest system that still has the property that matters: the future is unreadable from a single frame (you can't tell which way a static dot is going), but perfectly predictable from two (position + velocity → the entire deterministic future, bounces included). So a context of two stacked frames is necessary and sufficient — exactly the spatio-temporal setup real video JEPAs use, minus a million hours of internet video.
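The argument can be checked with a minimal sketch of such a world. The dynamics below are an assumption made for illustration (a 1-pixel logo moving one pixel per tick per axis); the actual environment in `dvd_jepa/world.py` may differ, and `step` and `rollout` are hypothetical names.

```python
SIZE = 16  # box side in pixels, as stated above

def step(pos, vel, size=SIZE):
    """Advance the logo one tick, reflecting off the walls (illustrative dynamics)."""
    new_pos, new_vel = [], []
    for p, v in zip(pos, vel):
        q = p + v
        if q < 0 or q > size - 1:   # would leave the box: flip the velocity and bounce
            v = -v
            q = p + v
        new_pos.append(q)
        new_vel.append(v)
    return tuple(new_pos), tuple(new_vel)

def rollout(pos, vel, n):
    """Return the n positions that follow the state (pos, vel)."""
    out = []
    for _ in range(n):
        pos, vel = step(pos, vel)
        out.append(pos)
    return out

# Same current frame, different velocities: one frame cannot tell these futures apart.
future_a = rollout((5, 5), (1, 1), 20)
future_b = rollout((5, 5), (-1, 1), 20)

# Two consecutive frames give the velocity by differencing, and with it the whole future.
prev, curr = (4, 4), (5, 5)
inferred_vel = (curr[0] - prev[0], curr[1] - prev[1])
future_from_two_frames = rollout(curr, inferred_vel, 20)

print(future_a[:3], future_b[:3])
print(future_from_two_frames == future_a)
```

The two futures that share a current frame diverge immediately, while the rollout seeded from two frames matches the first one step for step, bounces included.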

Method

| Component | Shape | Role |
|---|---|---|
| Context encoder Eθ | 2·16·16 → 256 → 128 → 32 | encodes an observation (2 stacked frames) to a latent |
| Target encoder E_ema | same, EMA of Eθ, stop-grad | produces the prediction target; this is the anti-collapse asymmetry |
| Predictor P | 32 → 64 → 32 | the world model: one step forward in latent space |
| Decoder D (optional) | 32 → 64 → 256 → 256 | readout to pixels; a pure JEPA omits this |
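To get a feel for how small these networks are, the layer widths in the table can be turned into parameter counts. This is back-of-the-envelope arithmetic that assumes plain fully connected layers with one bias per output unit; the exact layer definitions in `dvd_jepa/models.py` are not shown here, and `mlp_params` is a hypothetical helper.

```python
def mlp_params(widths):
    """Parameters of a fully connected stack: weights plus one bias per output unit."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

# Layer widths copied from the table above.
shapes = {
    "encoder": [2 * 16 * 16, 256, 128, 32],
    "predictor": [32, 64, 32],
    "decoder": [32, 64, 256, 256],
}

counts = {name: mlp_params(w) for name, w in shapes.items()}
for name, n in counts.items():
    print(f"{name:9s} {n:,}")
```

Under that assumption the encoder accounts for roughly 168k parameters and the predictor for about 4k, which is consistent with training in seconds on a CPU.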

Training objective. Minimise the latent prediction error plus a variance term:

L = ‖ P(Eθ(obs_t)) − sg(E_ema(obs_{t+1})) ‖²   +   Σ_d relu(1 − std(z_d))
       └──────── predict the future in representation space ────────┘     └─ VICReg anti-collapse ─┘

The target encoder is an exponential moving average (τ = 0.99) of the online encoder with a stop-gradient — the BYOL trick. Without the variance term the embedding std starts at 0.007 (collapsing to a constant); with it, std holds at ~2.4–3.0 throughout. The decoder is trained separately on the frozen encoder, so the JEPA does all the understanding and the decoder is only a readout.
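The two loss terms and the EMA update can be written out in dependency-free Python to make the arithmetic concrete. This is a sketch under stated assumptions, not the repository's training code: the function names are hypothetical, the standard deviation is the population std across the batch, the prediction term is averaged over the batch, and the variance term is applied to the online latents. The stop-gradient has no counterpart here because nothing is differentiated; in an autodiff framework the target latents would be detached.

```python
import statistics

def variance_term(batch):
    """Sum over latent dimensions d of relu(1 - std(z_d)), std taken across the batch."""
    dims = list(zip(*batch))                 # transpose: one tuple per latent dimension
    return sum(max(0.0, 1.0 - statistics.pstdev(d)) for d in dims)

def prediction_term(pred, target):
    """Batch mean of the squared L2 distance between predicted and target latents."""
    per_sample = [sum((p - t) ** 2 for p, t in zip(ps, ts)) for ps, ts in zip(pred, target)]
    return sum(per_sample) / len(per_sample)

def jepa_loss(online_latents, pred, target):
    """Latent prediction error plus the anti-collapse variance term."""
    return prediction_term(pred, target) + variance_term(online_latents)

def ema_update(target_w, online_w, tau=0.99):
    """Move each target weight a fraction (1 - tau) towards the online weight."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_w, online_w)]

# A collapsed batch (every latent identical) pays the full penalty: 1 per dimension.
collapsed = [[0.5, 0.5]] * 4
# A batch whose per-dimension std is at least 1 pays nothing.
spread = [[-2.0, 0.0], [2.0, 4.0], [-2.0, 0.0], [2.0, 4.0]]

print(variance_term(collapsed), variance_term(spread))
print(ema_update([0.0], [1.0]))
```

A constant embedding makes the prediction term trivially zero, which is why the variance term matters: it is the only part of the loss that penalises that solution.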

Results

All numbers are produced by python -m dvd_jepa.train (seed 0, CPU, ~10 s) and saved to assets/metrics.json.

| Result | Value | What it shows |
|---|---|---|
| Linear-probe position RMSE | 0.73 px (box is 16 px) | the 32-d latent secretly encodes exact world state |
| Forecast MSE, 1 step ahead | 0.0005 | near-perfect short-horizon prediction |
| Forecast MSE, 30 steps ahead | 0.028 | graceful latent-rollout drift, not collapse |
| Anomaly peak / baseline | 88× | a teleport is detected via prediction error… |
| Anomaly detected at frame | 22 (injected at 24) | …on the correct frame (2 early: the monitor looks 2 ahead) |
| Embedding std (collapse check) | ~3.0 (not 0) | the representation never collapsed |

[Figure: predictive surprise spikes exactly on the injected anomaly]
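The anomaly rows come down to comparing each frame's prediction error against a baseline. The sketch below shows one way to do that; the median baseline, the threshold of 10, the function names, and the error values are all assumptions made for illustration, not the repository's definitions or measurements.

```python
import statistics

def surprise_ratio(errors):
    """Per-frame prediction error divided by the median error (a robust baseline)."""
    baseline = statistics.median(errors)
    return [e / baseline for e in errors]

def first_alarm(errors, threshold=10.0):
    """Index of the first frame whose ratio exceeds the threshold, or None."""
    for i, r in enumerate(surprise_ratio(errors)):
        if r > threshold:
            return i
    return None

# Synthetic errors (illustrative values only): a quiet baseline with one large jump,
# as a teleport would cause when the prediction no longer matches the observation.
errors = [0.001] * 30
errors[12] = 0.05

ratios = surprise_ratio(errors)
print(f"peak / baseline: {max(ratios):.0f}x at frame {first_alarm(errors)}")
```

Using the median rather than the mean keeps the baseline from being inflated by the anomaly it is supposed to expose.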

Try it — interactive demo

dvd-jepa.vercel.app — the trained model running entirely in your browser (no server, no GPU). Also mirrored on 🤗 Hugging Face Spaces. Things to do:

  • Toggle the decoder off. This is the pure JEPA. It understands the bounce perfectly and gives you nothing but 32 latent bars — it literally cannot draw. Toggle it back on and the dream renders. This is the whole joke, made interactive.
  • Inject an anomaly. Teleport the logo and watch the surprise meter spike past the threshold.
  • Dream 30 steps ahead. Freeze reality and let the predictor roll forward on its own — watch it imagine the future, then slowly drift.

[Figure: the interactive browser demo]

Reproduce

git clone https://github.com/mandarwagh9/dvd-jepa
cd dvd-jepa
pip install -r requirements.txt

python -m dvd_jepa.train      # trains everything, writes checkpoints/, web/weights.json, assets/
python scripts/pure_jepa.py   # the original no-decoder version: prints the ASCII latent dream

To run the browser demo locally (ES modules need a server, not file://):

cd web && python -m http.server 8000   # then open http://localhost:8000

Or open the Colab notebook and run it cell by cell.

Repository layout

dvd_jepa/            the package
  world.py           the bouncing-logo environment + observation pairs
  models.py          Encoder, Predictor, Decoder, variance term
  train.py           train, evaluate, export browser weights, render assets
web/                 the client-side interactive demo (index.html + jepa.js + weights.json)
scripts/pure_jepa.py the original decoder-free "it only does vectors" version
notebooks/           Colab notebook
assets/              rendered gif/png + metrics.json
checkpoints/         trained PyTorch weights

How this relates to real systems

DVD-JEPA is a toy, but every moving part has a full-scale counterpart:

  • I-JEPA (images) and V-JEPA / V-JEPA 2 (video) use the same predict-in-representation-space objective with an EMA target encoder, at ViT scale on real data. They predict masked regions rather than the next time step.
  • V-JEPA 2-AC makes the predictor action-conditioned and plans a real robot in latent space — the same "imagine the future, pick the best" loop, with actions added.
  • The two capabilities shown here — forecast the next frames and flag when reality diverges from the forecast — are exactly what a world model contributes to an egocentric-video data pipeline: predict what the person does next, and auto-surface the unexpected moment.

Limitations (honest)

  • Latent rollout drifts after ~20 steps: the predictor is trained for a single step, so errors compound. Multi-step rollout training or a recurrent predictor would extend the horizon.
  • It's 16×16 and deterministic. There is no stochastic latent variable for multi-modal futures (the general JEPA formulation allows for one) because the bouncing logo has exactly one future.
  • The decoder is a crutch. A pure JEPA has none; we add it only to visualise and to compute a pixel-space surprise score.
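The first limitation, compounding error in open-loop rollout, can be reproduced with a toy that has nothing to do with the trained model. In this sketch the "true" latent dynamics are a 2-d rotation and the "predictor" is the same rotation with a slightly wrong angle; both angles and all names are made up for illustration.

```python
import math

def rotate(z, angle):
    """One step of a toy 2-d latent dynamics: rotate the latent by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])

TRUE_ANGLE = 0.30    # the toy world's real update
MODEL_ANGLE = 0.31   # a predictor that is slightly wrong on every step

def rollout_errors(z0, horizon):
    """Squared error between the true latent and an open-loop rollout, per step."""
    z_true, z_hat, errs = z0, z0, []
    for _ in range(horizon):
        z_true = rotate(z_true, TRUE_ANGLE)
        z_hat = rotate(z_hat, MODEL_ANGLE)   # fed its own previous prediction
        errs.append((z_true[0] - z_hat[0]) ** 2 + (z_true[1] - z_hat[1]) ** 2)
    return errs

errs = rollout_errors((1.0, 0.0), 30)
print(f"step 1: {errs[0]:.6f}   step 30: {errs[-1]:.6f}")
```

A per-step error that looks negligible grows by orders of magnitude over 30 steps because the predictor never sees the true state again, which is the same mechanism behind the drift described above.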

References

  1. Y. LeCun. A Path Towards Autonomous Machine Intelligence. 2022.
  2. M. Assran et al. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). CVPR 2023.
  3. A. Bardes et al. Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). 2024.
  4. Meta AI. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. 2025.
  5. A. Bardes, J. Ponce, Y. LeCun. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
  6. J.-B. Grill et al. Bootstrap Your Own Latent (BYOL). NeurIPS 2020.

Citation

@software{dvdjepa2026,
  title  = {DVD-JEPA: a tiny reproducible JEPA world model of a bouncing logo},
  author = {Wagh, Mandar},
  year   = {2026},
  url    = {https://github.com/mandarwagh9/dvd-jepa}
}

License

MIT — see LICENSE. Built as the rigorous sequel to DVD Dreamer.
