qixinhu11/LongLive-RAG

🔍 LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu · Shuai Yang · Wei Huang · Song Han · Yukang Chen

💡 TL;DR

LongLive-RAG turns long video generation into a retrieval problem. Instead of attending only to the most recent sliding window, an autoregressive (AR) video generator looks back over the video it has already generated and pulls in the most relevant past latents as extra context. This cuts error accumulation, identity drift, and background flicker over long horizons, without retraining the base generator.

[Figure: LongLive-RAG framework overview]

📰 News

  • 🔥 [2026.06] We release the LongLive-RAG paper and code!

🎬 Demo

🌐 More results and video comparisons are available on the project page.

Long-horizon comparisons. The native sliding-window baseline (left) accumulates errors and drifts over time, while adding LongLive-RAG (right) preserves subject identity and visual quality.

| Native (baseline) | Native + LongLive-RAG (Ours) |
|---|---|
| native_1.mp4 | native_ours_1.mp4 |
| native_2.mp4 | native_ours_2.mp4 |

✨ Highlights

  • 🥇 First of its kind. Among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as a content-addressable retrieval memory.
  • 🔌 Plug-and-play. Works across Causal-Forcing, Self-Forcing, and LongLive with the base generator frozen.
  • 🔎 Searchable history. Retrieves the most relevant past latents as extra context for each new block.
  • 📐 Window Temporal Delta Loss. Trains the embeddings to capture meaningful temporal change rather than redundant local similarity.
  • ⚡ Consistent wins. Best average VBench-Long rank across video lengths and backbones.

🔬 Method Overview

At block t, a standard AR model attends to a sliding-window context. LongLive-RAG inserts retrieved historical entries M_t between the sink and local windows:

Sliding window:   A_sw  = [ C_sink ‖           C_loc ]
LongLive-RAG:     A_rag = [ C_sink ‖  M_t  ‖   C_loc ]

| Stage | What happens |
|---|---|
| 1. Indexing | Encode each completed latent block into a compact embedding and store it. |
| 2. Retrieval | Match the current block against past embeddings and pull in the top-K as extra context. |
| 3. Embedding training | Train the encoder offline on self-generated latents, with the base generator frozen. |
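
The indexing and retrieval stages can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the `MemoryBank` class, the use of cosine similarity, the exclusion of blocks already in the sink or local window, and the choice to return retrieved blocks in temporal order are all assumptions made for illustration. Block ids stand in for latent blocks.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class MemoryBank:
    """Hypothetical store of (block_id, embedding) pairs for completed blocks."""
    embeddings: list = field(default_factory=list)
    block_ids: list = field(default_factory=list)

    def index(self, block_id, embedding):
        # Stage 1: store the compact embedding of a completed latent block.
        self.block_ids.append(block_id)
        self.embeddings.append(list(embedding))

    def retrieve(self, query, k, exclude=()):
        # Stage 2: rank stored blocks by similarity to the query and keep the top-k.
        skip = set(exclude)
        scored = [
            (cosine(query, emb), bid)
            for emb, bid in zip(self.embeddings, self.block_ids)
            if bid not in skip
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Assumption: hand the retrieved ids back in temporal order.
        return sorted(bid for _, bid in scored[:k])


def assemble_context(sink, retrieved, local):
    """A_rag = [C_sink ‖ M_t ‖ C_loc], expressed as a list of block ids."""
    return list(sink) + list(retrieved) + list(local)


bank = MemoryBank()
bank.index(0, [1.0, 0.0])
bank.index(1, [0.0, 1.0])
bank.index(2, [0.9, 0.1])
bank.index(3, [0.1, 0.9])

sink, local = [0], [3]
retrieved = bank.retrieve(query=[1.0, 0.0], k=1, exclude=sink + local)
context = assemble_context(sink, retrieved, local)
print(context)  # [0, 2, 3]
```

With K = 0 the assembled context reduces to the sliding-window case `A_sw`, which is why the retrieval step can sit on top of a frozen generator in this sketch.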

🏁 Getting Started

πŸ“¦ Installation

LongLive-RAG shares its environment with LongLive. Just follow the upstream LongLive installation guide.

🚀 Inference

1. Download everything with two commands. All LongLive-RAG assets (AR backbones, retrieval AE, prompt files, and the toy latent set) live in a single Hugging Face repo; the base WAN VAE comes from the Wan-AI/Wan2.1-T2V-1.3B repo:

# Base WAN VAE: LongLive-RAG operates in its latent space
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B

# All LongLive-RAG assets: restores checkpoints/ and toydatasets/ in place
hf download qixinhu11/LongLive-RAG --local-dir . --include "checkpoints/*" "toydatasets/*"

Older setups can swap hf download for huggingface-cli download (same arguments).

The second command lays out:

checkpoints/
├── causal_forcing.pt              # Causal-Forcing AR backbone
├── self_forcing.pt                # Self-Forcing AR backbone
├── longlive_base.pt               # LongLive AR backbone
├── longlive_lora.pt               # LongLive LoRA (paired with longlive_base.pt)
├── ae_latent_mem.pt               # Retrieval autoencoder (default for inference)
├── moviegenbench_128_refined.txt  # 128 MovieGenBench prompts
└── vidprom_filtered_extended.txt  # Self-Forcing prompt pool (for generate_latent.py)
toydatasets/
└── latent_0000xx.pt               # tiny example latent set for the training demo

To train your own retrieval AE instead of using ae_latent_mem.pt, see Training.
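
Before running inference, it may help to confirm that the download produced the layout listed above. The helper below is a minimal sketch, not a script shipped with the repo: `missing_assets` is a hypothetical name, and the file list is copied from the listing. Because the toy set is shown only as `latent_0000xx.pt`, the check accepts any `.pt` file in `toydatasets/`.

```python
from pathlib import Path

# File names copied from the checkpoints/ listing above.
EXPECTED_CHECKPOINTS = [
    "causal_forcing.pt",
    "self_forcing.pt",
    "longlive_base.pt",
    "longlive_lora.pt",
    "ae_latent_mem.pt",
    "moviegenbench_128_refined.txt",
    "vidprom_filtered_extended.txt",
]


def missing_assets(repo_root="."):
    """Return the expected asset paths that are absent under repo_root."""
    root = Path(repo_root)
    missing = [
        f"checkpoints/{name}"
        for name in EXPECTED_CHECKPOINTS
        if not (root / "checkpoints" / name).is_file()
    ]
    # The exact toy file names are not listed, so accept any .pt file.
    toy_dir = root / "toydatasets"
    if not (toy_dir.is_dir() and any(toy_dir.glob("*.pt"))):
        missing.append("toydatasets/*.pt")
    return missing


if __name__ == "__main__":
    for path in missing_assets():
        print(f"missing: {path}")
```

Run from the repository root, the script prints nothing when every listed asset is present.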

2. Run. The repo ships a 3 × 2 grid (three backbones × two context-assembly methods) in configs/:

| Backbone \ Method | native (sliding-window) | latentmem (LongLive-RAG, ours) |
|---|---|---|
| causal_forcing | causal_forcing_native.yaml | causal_forcing_latentmem.yaml |
| self_forcing | self_forcing_native.yaml | self_forcing_latentmem.yaml |
| longlive | longlive_native.yaml | longlive_latentmem.yaml |

# Main result: Causal-Forcing backbone + LongLive-RAG retrieval
bash inference.sh causal_forcing latentmem

# Baselines: native sliding-window
bash inference.sh causal_forcing native

# GPU / port overrides
GPU=4 PORT=29510 bash inference.sh causal_forcing latentmem
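
The config names in the grid follow a `<backbone>_<method>.yaml` pattern. The sketch below maps a (backbone, method) pair to its file under that pattern. How `inference.sh` actually resolves its arguments is not shown here, so `config_path` is a hypothetical helper rather than a description of the launcher.

```python
from pathlib import Path

BACKBONES = ("causal_forcing", "self_forcing", "longlive")
METHODS = ("native", "latentmem")


def config_path(backbone, method, config_dir="configs"):
    """Map a (backbone, method) pair to its YAML file, following the grid's naming."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone {backbone!r}; expected one of {BACKBONES}")
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return Path(config_dir) / f"{backbone}_{method}.yaml"


# Enumerate the full 3 x 2 grid.
grid = [config_path(b, m).as_posix() for b in BACKBONES for m in METHODS]
print("\n".join(grid))
```

Validating both arguments up front turns a typo such as `latent_mem` into a clear error instead of a missing-file failure later in the run.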

🔁 Reproducibility

For deterministic inference, inference.py sets a fixed seed (config.seed) across random / numpy / torch and enables deterministic backends:

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
os.environ.setdefault("PYTHONHASHSEED", str(config.seed))
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True, warn_only=True)
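
The backend flags above are only half of the setup; the seeding across random / numpy / torch might look like the following sketch. `seed_everything` is a hypothetical name, not necessarily the function used in inference.py, and the guarded imports are there only so the example runs without numpy or torch installed.

```python
import os
import random


def seed_everything(seed):
    """Seed every RNG that is importable in this environment."""
    random.seed(seed)
    # Affects child processes only: the running interpreter fixed its hash seed at startup.
    os.environ.setdefault("PYTHONHASHSEED", str(seed))
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)           # seeds the CPU and CUDA generators
        torch.cuda.manual_seed_all(seed)  # explicit; ignored when CUDA is unavailable
    except ImportError:
        pass


seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)  # True
```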

⚠️ Bit-exact cross-machine reproduction is strict and hard to guarantee. Even with the settings above, identical outputs across different machines require the same GPU model, the same PyTorch / CUDA / cuDNN versions, and matching checkpoints/configs. Differences in GPU architecture (e.g. A100 vs. H100), TF32 behavior, or the torch.compile autotuned attention kernels can still produce small numerical drift. To take PYTHONHASHSEED fully into effect, export it before launching: PYTHONHASHSEED=0 bash inference.sh ....

✅ The more reliable way to validate our gains is a same-machine A/B comparison. Run the native baseline and LongLive-RAG (latentmem) back-to-back on the same GPU with the same prompts and seed, then compare the outputs directly. This isolates the effect of retrieval from any hardware/software-stack variance:

# Same backbone, same machine: compare baseline vs. ours
bash inference.sh causal_forcing native
bash inference.sh causal_forcing latentmem
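
Before attributing A/B differences to retrieval, it can be worth checking that two runs of the same config on the same machine produce identical files. The sketch below hashes every file in two output directories and lists the ones that differ. The directory names in the usage note are hypothetical, since the launcher's output location is not stated here.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_tree(root):
    """Map each file's path relative to root to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_runs(run_a, run_b):
    """Return relative paths that are missing from one run or differ in content."""
    a, b = digest_tree(run_a), digest_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

For example, `diff_runs("outputs/run1", "outputs/run2")` returns an empty list when the two runs are bit-identical. A non-empty list for two runs of the same config suggests residual nondeterminism that should be accounted for when reading the A/B results.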

πŸ‹οΈ Training

The base generator stays frozen; the only trainable component is the retrieval encoder (a small latent autoencoder). Training has two steps:

Step 1: Build a latent corpus. Run a frozen generator over a prompt pool to collect the clean latent blocks it produces; these become the training samples. The launcher shards generation across multiple GPUs.

bash generate_latent.sh
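
One common way to shard generation across GPUs is a round-robin split of the prompt pool by worker rank. The sketch below assumes that scheme and is not taken from generate_latent.py. `shard_prompts` and `latent_filename` are hypothetical helpers, and the six-digit file name only mirrors the `latent_0000xx.pt` pattern of the toy set.

```python
def shard_prompts(prompts, rank, world_size):
    """Return the (global_index, prompt) pairs handled by one worker, round-robin."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside [0, {world_size})")
    return [(i, p) for i, p in enumerate(prompts) if i % world_size == rank]


def latent_filename(index):
    """Zero-padded name keyed by the global prompt index (assumed pattern)."""
    return f"latent_{index:06d}.pt"


prompts = [f"prompt {i}" for i in range(5)]
for rank in range(2):
    names = [latent_filename(i) for i, _ in shard_prompts(prompts, rank, world_size=2)]
    print(rank, names)
# 0 ['latent_000000.pt', 'latent_000002.pt', 'latent_000004.pt']
# 1 ['latent_000001.pt', 'latent_000003.pt']
```

Keying file names on the global prompt index rather than a per-worker counter keeps outputs from different workers from colliding in a shared directory.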

Step 2: Train the retrieval autoencoder. Fit the encoder on the collected latents with a reconstruction loss plus the Window Temporal Delta and trajectory-smoothing terms. Default hyperparameters live in ae/configs/.

bash train_ae_delta.sh

Retraining the base AR backbones is out of scope; backbone checkpoints are consumed as-is. See upstream LongLive / Self-Forcing to train one from scratch.

🗂️ Repository Layout

├── ae/               # Retrieval autoencoder (model, configs, training)
├── checkpoints/      # AR backbones, AE checkpoint, prompt .txt files (gitignored)
├── configs/          # Inference YAMLs (3 backbones × 2 methods) + generate_latent
├── datasets/         # AE training latents (output of generate_latent.sh, gitignored)
├── toydatasets/      # Tiny example latent set for the training demo (from HF, gitignored)
├── pipeline/         # Causal inference pipeline (drives all backbones)
├── utils/            # Dataset, memory, scheduler, lora, wan-wrapper utilities
├── wan/, wan_models/ # WAN VAE backbone (T2V-1.3B)
├── inference.py      # Inference entry point
├── inference.sh      # Launcher: bash inference.sh <backbone> <method>
├── generate_latent.py / .sh  # Latent corpus generation (multi-GPU sharded)
└── train_ae_delta.sh         # Retrieval AE launcher

📄 Citation

📜 Paper: arXiv:2606.02553

@article{longliverag2026,
  title         = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},
  author        = {Hu, Qixin and Yang, Shuai and Huang, Wei and Han, Song and Chen, Yukang},
  journal       = {arXiv preprint arXiv:2606.02553},
  archivePrefix = {arXiv},
  eprint        = {2606.02553},
  year          = {2026}
}

🙏 Acknowledgements

LongLive-RAG builds on the codebases and ideas of:

  • LongLive: the AR long-video framework this codebase forks from.
  • Self-Forcing: causal AR training recipe and prompt pool.
  • Causal-Forcing: one of the AR backbones evaluated in this work.
  • Wan: the base video generation model and VAE latent space.

📝 License

Released under the Apache 2.0 license.

About

Official Implementation of LongLive-RAG: A general retrieval-augmented framework for long video generation.
