pico-type 🔍

license: apache-2.0

language: en, multilingual

A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content from raw bytes with 7 classification heads in a single forward pass.

Built by eulogik — AI infrastructure for developers.

✨ Features

No tokenizer — operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
~200KB ONNX — deploy on edge devices, serverless functions, browser (WebAssembly)
<6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
CLI, Gradio Space, MCP server — ready for any integration
62 programming languages — Python, JS, TypeScript, Java, C, C++, Go, Rust, SQL, Bash, and 52 more
95.2% real-world accuracy — tested against 21 hand-curated inputs across all content types
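
To make the "no tokenizer" point concrete, here is a minimal sketch of how input could be prepared for a byte-level model. The helper name `to_byte_ids`, the zero padding, and the boolean mask are illustrative assumptions and not the package's actual preprocessing; the 1024-byte default mirrors the window quoted for the latency figure.

```python
def to_byte_ids(data, max_len=1024, pad_id=0):
    """Turn str or bytes into fixed-length byte IDs plus a validity mask.

    Hypothetical helper: max_len and pad_id are assumptions for illustration.
    """
    if isinstance(data, str):
        data = data.encode("utf-8")  # any language maps to values 0..255
    raw = list(data[:max_len])       # truncate to the context window
    pad = max_len - len(raw)
    ids = raw + [pad_id] * pad
    # The mask separates real bytes from padding, since 0 is also a valid byte.
    mask = [True] * len(raw) + [False] * pad
    return ids, mask


ids, mask = to_byte_ids("héllo", max_len=8)
print(ids)   # "é" takes two bytes in UTF-8, so 5 characters become 6 IDs
print(mask)
```

Because every possible input is already a sequence of values in 0..255, the same path handles source code, natural language, and binary files without a vocabulary file.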

📊 Performance

| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 100% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | n/a |
| code_lang | 62 | 39% | 100% (9/9 code samples) |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | n/a |
| risk (mAP) | 6 | 100% | n/a |

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Real-world accuracy: 95.2% (20/21). The model correctly classifies code, text, markup, config, images, binary archives, and error tracebacks. The single failure is a YAML config that is predicted as error (a fundamental byte-level ambiguity at 2KB context).
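
The risk head is scored with mAP rather than accuracy because it emits one probability per flag. As a rough sketch of how such a number is commonly computed (this is the textbook definition, not the project's evaluation script; both function names are hypothetical):

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k at every rank k that holds a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, label_rows):
    """mAP: average the per-class AP over all classes (columns)."""
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        scores = [row[c] for row in score_rows]
        labels = [row[c] for row in label_rows]
        aps.append(average_precision(scores, labels))
    return sum(aps) / n_classes


# The headline real-world figure is plain accuracy: 20 correct out of 21.
print(f"{20 / 21:.1%}")  # 95.2%
```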

🚀 Quick Start

CLI

pip install pico-type

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads

| Component | Description |
|---|---|
| ByteEmbed | `nn.Embedding(256, 96)`: a learned table with one 96-d vector per byte value, so no tokenizer or vocabulary is needed |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |
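
The Pool step can be illustrated in plain Python. This is a sketch under stated assumptions: the function name `masked_pool` is hypothetical, the mask is taken to mark real (non-padding) positions, and the standard deviation is the population form, which may differ from the trained model.

```python
import math


def masked_pool(vectors, mask):
    """Concatenate mean, max, and std over the positions where mask is True."""
    kept = [v for v, keep in zip(vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no positions")
    n, dim = len(kept), len(kept[0])
    mean = [sum(v[d] for v in kept) / n for d in range(dim)]
    peak = [max(v[d] for v in kept) for d in range(dim)]
    # Population standard deviation (divide by n); an assumption for this sketch.
    std = [math.sqrt(sum((v[d] - mean[d]) ** 2 for v in kept) / n) for d in range(dim)]
    return mean + peak + std


features = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
print(masked_pool(features, [True, True, False]))  # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
```

The pooled vector is three times as wide as the per-position features, and the padded third row has no effect on the result.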

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
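
A minimal sketch of what tier selection could look like, assuming each tier reads the first `Dim` features of the pooled vector (the usual Matryoshka layout; the README only says "tier slices"). `tier_logits` and its arguments are hypothetical names, and the weights below are toy values.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}  # from the tier table


def tier_logits(pooled, weights, bias, tier):
    """Apply one linear head to the first TIER_DIMS[tier] features of pooled.

    weights holds one row per class, each as long as the tier's slice.
    Prefix slicing is an assumption, not a confirmed implementation detail.
    """
    d = TIER_DIMS[tier]
    if len(pooled) < d:
        raise ValueError("pooled vector is shorter than the tier slice")
    x = pooled[:d]
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


pooled = [1.0] * 576                      # stand-in for the shared trunk output
tiny_w = [[0.5] * 16, [-0.25] * 16]       # toy 2-class head for the tiny tier
print(tier_logits(pooled, tiny_w, [0.0, 1.0], "tiny"))  # [8.0, -3.0]
```

Under this layout the trunk output is computed once, and each tier only changes how many features its classifier reads.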

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | none | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | none | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | none | api_key, jwt, password, email, phone, ssh_key (probabilities) |
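
The "Gated By" column suggests that a head's output is only meaningful when the coarse label matches its gate. The sketch below shows one way a caller might apply that rule to already-decoded predictions. The function `apply_gates`, the input dictionary layout, and the 0.5 risk threshold are assumptions for illustration, not the library's API.

```python
# Gate sets copied from the "Gated By" column above.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def apply_gates(predictions, risk_threshold=0.5):
    """Keep a gated head only when the coarse label is in its gate set.

    predictions maps head name -> label, plus "risk" -> {flag: probability}.
    risk_threshold is an assumed cutoff, not a documented default.
    """
    coarse = predictions["coarse"]
    out = {"coarse": coarse, "modality": predictions["modality"]}
    for head, allowed in GATES.items():
        if coarse in allowed and head in predictions:
            out[head] = predictions[head]
    risk = predictions.get("risk", {})
    out["risk"] = sorted(flag for flag, p in risk.items() if p >= risk_threshold)
    return out


raw = {
    "coarse": "code", "modality": "textual", "subtype": "json",
    "code_lang": "python", "text_lang": "en", "file_mime": "text/html",
    "risk": {"api_key": 0.9, "email": 0.1},
}
print(apply_gates(raw))
```

For a snippet classified as code, only `code_lang` survives among the gated heads, while the ungated `coarse`, `modality`, and `risk` outputs are always reported.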

🌐 Deployment

| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | `pip install pico-type` |
| Zenodo | 10.5281/zenodo.20758542 |

📚 Documentation

Model Card — detailed architecture, training, evaluation
Architecture Plan — full design document
Walkthrough — development log with all decisions

📄 License

Apache 2.0 — free for commercial and personal use.

Built with ❤️ by eulogik
