eulogik/pico-type


license: apache-2.0
language: [en, multilingual]
tags: [byte-level, content-classification, onnx, edge-ai, matryoshka, multi-head, classifier, clipboard, tiny, fast, code-detection, language-detection, open-source]
pipeline_tag: text-classification
library_name: pico-type
inference:
  parameters:
    provider: CPUExecutionProvider
pico-type

pico-type 🔍

A tiny byte-level multi-head content classifier — ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content with 7 heads (coarse type, modality, subtype, code language, text language, file MIME, and risk flags) from raw bytes in a single forward pass.


Built by eulogik — AI infrastructure for developers.


✨ Features

  • No tokenizer — operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
  • 7 heads, one forward pass — coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
  • 4 Matryoshka tiers — tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX — deploy on edge devices, serverless functions, browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
  • CLI, Gradio Space, MCP server — ready for any integration
  • 62 programming languages — Python, JS, TypeScript, Java, C, C++, Go, Rust, SQL, Bash, and 52 more
  • 95.2% real-world accuracy — tested against 21 hand-curated inputs across all content types
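
To make the no-tokenizer point concrete, here is a minimal sketch of how content might be turned into model input: every UTF-8 byte is already an integer in 0..255, so the only work is truncating or padding to a fixed length and recording a mask. The helper name `encode_bytes`, the `max_len=1024` default, and the use of 0 as the padding ID are illustrative assumptions, not pico-type's actual API.

```python
def encode_bytes(data, max_len=1024, pad_id=0):
    """Map raw content to byte IDs plus a padding mask (hypothetical helper)."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # any language works; no vocabulary needed
    raw = list(data[:max_len])       # each byte is already an int in 0..255
    n_pad = max_len - len(raw)
    ids = raw + [pad_id] * n_pad
    mask = [1] * len(raw) + [0] * n_pad  # 1 = real byte, 0 = padding
    return ids, mask


ids, mask = encode_bytes("héllo", max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Note that the accented character occupies two bytes, which is why the mask marks six real positions for a five-character string.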

📊 Performance

| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 100% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | |
| code_lang | 62 | 39% | 100% (9/9 code samples) |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | |
| risk (mAP) | 6 | 100% | |

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Real-world accuracy: 95.2% (20/21). The model correctly classifies code, text, markup, config, images, binary archives, and error tracebacks. The only failure is a YAML config that the model predicts as error (a fundamental byte-level ambiguity at 2KB context).
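
As a quick check on the headline figure, the arithmetic can be reproduced in a few lines. The helper name `pct` is purely illustrative; the counts are the ones quoted above.

```python
def pct(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


print(pct(20, 21))  # 95.2  (overall real-world accuracy)
print(pct(9, 9))    # 100.0 (code_lang on the 9 code samples)
print(pct(1, 21))   # 4.8   (weight of a single sample in a 21-item set)
```

With only 21 inputs, each individual sample moves the score by roughly 4.8 percentage points.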

🚀 Quick Start

CLI

pip install pico-type

# printf expands \n portably; plain echo leaves it literal in bash
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

🏗 Architecture

Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): tokenizer-free byte embedding, one learned 96-d vector per byte value |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector → 7 linear classifiers |

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
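
The Pool step (mean‖max‖std over masked positions) can be sketched in plain Python as follows. This is an illustration of the idea, not the project's code: the function name `masked_pool`, the list-of-lists layout, and the choice of population standard deviation are all assumptions.

```python
import math


def masked_pool(features, mask):
    """Concatenate mean, max, and std over the positions where mask == 1.

    features: list of T vectors (each a list of D floats)
    mask:     list of T ints, 1 for real bytes and 0 for padding
    Returns a single vector of length 3 * D.
    """
    kept = [vec for vec, m in zip(features, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    n = len(kept)
    mean = [sum(v[d] for v in kept) / n for d in range(dim)]
    peak = [max(v[d] for v in kept) for d in range(dim)]
    # Population standard deviation; the real model may use a different estimator.
    std = [math.sqrt(sum((v[d] - mean[d]) ** 2 for v in kept) / n) for d in range(dim)]
    return mean + peak + std


features = [[1.0, 2.0], [3.0, 6.0], [9.0, 9.0]]  # last row is padding
pooled = masked_pool(features, mask=[1, 1, 0])
print(pooled)  # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
```

Because the padded row is excluded before any statistic is computed, its large values do not leak into the max or the mean.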

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
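
One way to picture the shared-trunk design: the trunk produces a single pooled vector, and each tier scores classes from a slice of it. The sketch below assumes the slices are nested prefixes (the usual Matryoshka convention) and uses random stand-in weights without biases; `make_head` and `tier_logits` are hypothetical names, and only the tier dimensions come from the table above.

```python
import random

TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}  # from the tier table


def make_head(dim, n_classes, rng):
    """Random stand-in weights for one linear classifier (illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_classes)]


def tier_logits(pooled, head, dim):
    """Score classes using only the first `dim` entries of the pooled vector."""
    prefix = pooled[:dim]
    return [sum(w * x for w, x in zip(row, prefix)) for row in head]


rng = random.Random(0)
pooled = [rng.gauss(0, 1) for _ in range(576)]  # computed once by the shared trunk
heads = {tier: make_head(dim, 12, rng) for tier, dim in TIER_DIMS.items()}

for tier, dim in TIER_DIMS.items():
    logits = tier_logits(pooled, heads[tier], dim)
    print(tier, dim, len(logits))  # every tier yields 12 coarse-class scores
```

Under this assumption, changing tier only changes which slice and which small weight matrix are used; the trunk output is reused as is.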

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | | api_key, jwt, password, email, phone, ssh_key (probabilities) |
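
The "Gated By" column suggests that some heads are only meaningful for certain coarse classes. A minimal sketch of that post-processing is shown below. The input format, the `decode` function, the sigmoid over risk scores, and the 0.5 threshold are assumptions made for illustration; the library's own `decode_output` may behave differently.

```python
import math

# Which coarse classes unlock each gated head, per the table above.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def decode(predictions, risk_logits, risk_threshold=0.5):
    """Keep gated heads only when the coarse class enables them.

    predictions: dict mapping head name to its top label (hypothetical format)
    risk_logits: dict mapping risk flag to a raw score; sigmoid gives a probability
    """
    coarse = predictions["coarse"]
    result = {"coarse": coarse, "modality": predictions["modality"]}
    for head, allowed in GATES.items():
        if coarse in allowed:
            result[head] = predictions[head]
    # Risk is treated as multi-label: each flag gets an independent probability.
    probs = {flag: 1 / (1 + math.exp(-z)) for flag, z in risk_logits.items()}
    result["risk"] = sorted(flag for flag, p in probs.items() if p >= risk_threshold)
    return result


raw = {"coarse": "code", "modality": "textual", "subtype": "json",
       "code_lang": "python", "text_lang": "en", "file_mime": "text/html"}
print(decode(raw, {"api_key": 2.0, "jwt": -3.0}))
# {'coarse': 'code', 'modality': 'textual', 'code_lang': 'python', 'risk': ['api_key']}
```

Here the subtype, text_lang, and file_mime predictions are dropped because the coarse class is code, while the ungated heads are always reported.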

🌐 Deployment

| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install pico-type |
| Zenodo | 10.5281/zenodo.20758542 |

📚 Documentation

📄 License

Apache 2.0 — free for commercial and personal use.


Built with ❤️ by eulogik
