Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching. Byte-level (UTF-8), works with any text.
Infinigram is a corpus-based language model where the corpus IS the model — no gradient descent, no training loop. It uses suffix arrays to find variable-length patterns and predict continuations.
Key properties:
- Instant training: Build a model by pointing at text
- Variable-length patterns: Automatically uses longest matching context
- Exact matching: Every prediction traces to actual corpus occurrences
- O(m log n) queries: Binary search on suffix arrays
- Byte-level: Fixed 256-token vocabulary (UTF-8 bytes 0-255)
- Mmap-backed: Memory-mapped files for corpora from 1KB to 100GB+
pip install py-infinigramOr install from source:
git clone https://github.com/queelius/infinigram.git
cd infinigram
pip install -e .from infinigram import Infinigram
# Create model from text (byte-level, automatic UTF-8 encoding)
model = Infinigram(b"the cat sat on the mat the cat ran")
# Predict next byte probabilities
probs = model.predict(b"the cat", top_k=5)
# {32: 1.0} # space byte (32) has highest probability
# Find longest matching suffix
position, length = model.longest_suffix(b"the cat")
print(f"Matched {length} bytes at position {position}")
# String input works too (auto-encoded to UTF-8)
probs = model.predict("the cat", top_k=5)# Build and save a model
model = Infinigram.build("The quick brown fox jumps over the lazy dog.", "my_model")
# Load it later (instant — uses mmap, not RAM)
model = Infinigram.load("my_model")
probs = model.predict("The quick")Infinigram automatically finds the longest matching suffix in the corpus:
model = Infinigram(b"abcdef abcdef abcxyz")
pos, length = model.longest_suffix(b"abcd") # length=4
pos, length = model.longest_suffix(b"abc") # length=3
pos, length = model.longest_suffix(b"zzz") # length=0 (no match)# Longest suffix match
probs = model.predict(b"the cat", top_k=10)model = Infinigram(b"hello")
model.update(b" world") # Appends and rebuilds suffix array
print(model.n) # 11Infinigram(
corpus_or_path, # bytes, List[int], str, or Path
max_length: int = None, # Max suffix length (None = unlimited)
min_count: int = 1, # Min frequency threshold
)| Method | Description |
|---|---|
predict(context, top_k=50, smoothing=0.0) |
Next-token probability distribution |
predict_logprobs(context, top_k=50) |
Log-probabilities |
continuations(context, limit=10000) |
Raw continuation counts (limit=None for all) |
longest_suffix(context) |
(position, length) of longest match |
find_all_suffix_matches(context) |
All matches at all suffix lengths |
search(pattern, limit=None) |
All occurrence positions |
count(pattern) |
Number of occurrences |
update(new_tokens) |
Append tokens and rebuild index |
clear_cache() |
Clear prediction caches |
get_context(position, window=50) |
Corpus text around a position |
| Method | Description |
|---|---|
Infinigram.build(corpus, path, ...) |
Build persistent model |
Infinigram.load(path) |
Load existing model (mmap, instant) |
Infinigram.list_models() |
List models in ~/.infinigram/models/ |
Byte-level model with mmap-backed suffix arrays:
| Corpus Size | Construction | Query Latency |
|---|---|---|
| 1 KB | <1 ms | <0.1 ms |
| 1 MB | ~100 ms | <1 ms |
| 1 GB | ~30 s | <10 ms |
Memory: ~9 bytes per corpus byte (corpus + suffix array on disk, mmap'd).
Performance optimizations (v0.5.0):
- Vectorized continuation counting with
np.bincount(3-10x speedup) - LRU cache for repeated
continuations()queries (5-20x for hot paths) - Direct bytes comparison in binary search (2-5x vs Python loop)
- Batch SQLite inserts for dataset index rebuild
Python API Infinigram class (predict, search, continuations)
Suffix Array Engine MmapSuffixArray, ChunkedMmapSuffixArray (pydivsufsort)
See ARCHITECTURE.md for details.
508 tests, 79% coverage:
pytest tests/ # All tests
pytest tests/ -v # Verbose
pytest tests/ --cov=infinigram --cov-report=html # Coverage
pytest tests/test_infinigram.py -k "cache" # Keyword filterInfinigram instances are not thread-safe. Each thread should use its own model instance, or external synchronization must be used.
MIT License - see LICENSE for details.