Audio ingestion design note (DEFERRED) — layered pipeline, use cases, video streams, licensing, dep health #6

mbachaud · 2026-04-10T22:44:58Z

mbachaud
Apr 10, 2026
Maintainer

Audio ingestion design note (DEFERRED — not roadmapped)

Status: Design exploration only. Not scheduled, not in any current milestone, not a commitment to ship. Captured so the thinking survives the session it was born in and future raude/laude/human collaborators don't have to re-derive it from scratch.

Filed by raude — Claude Code Opus 4.6 (1M context).

Why this note exists

During the 2026-04-10 session (helix-context v0.3.0b5 ship day), the user asked whether agents could "hear": ingest audio/video content directly into the genome rather than relying on text transcripts alone. A 2+1 specialist research council (two research specialists plus a lead synthesizer) was dispatched. Its findings, plus several subsequent architectural refinements by the user, landed on a concrete design direction worth capturing. The audio MVP itself is deferred because Struggle 1 (density gate at ingest) and the bench harvest bug (Issue #3) are higher priority. But when audio work un-defers, this is the blueprint.

TL;DR

The architecture: ingest audio through a 10-layer pipeline of existing, mature, Apache-2.0-compatible tools, producing a structured JSON "audio gene" content field alongside a 512D CLAP embedding for cross-modal retrieval. CLAP plays three roles simultaneously — retrieval key, zero-shot classifier, and semantic critic for confidence scoring. Video is a parent-with-streams pattern: ffmpeg demuxes into audio + frames + captions, each stream goes through its own specialist pipeline, genes link via shared source_id + video_offset_s. The bet: text is a projection of audio, not the source of truth — the primary storage is embeddings + feature JSON, the transcript is a view.

Part 1 — The layered pipeline

Every layer is an existing, mature, permissively-licensed tool. Nothing here is research code or experimental.

Layer	Tool	License	Role
1. Format decode (WAV/FLAC/OGG)	`soundfile`	BSD-3 (LGPL libsndfile via wheels)	Native audio I/O
1b. MP3 decode (optional)	`ffmpeg` external	LGPL-2.1+ (system dep)	Format conversion only, no GPL codecs required
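2. Resample / mono	`librosa`	ISC	Normalize to 44.1 kHz mono float32 (downstream models expect their own rates: 16 kHz for silero-vad and Whisper, 48 kHz for LAION-CLAP)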
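3. Loudness normalize	`pyloudnorm`	MIT	ITU-R BS.1770 loudness measurement (the basis of EBU R128), normalized to -16 LUFS; EBU R128's own broadcast target is -23 LUFS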
4. Voice activity / silence segmentation	`silero-vad`	MIT (code + weights)	Finds speech boundaries; ~50ms/call ONNX
5. Spectral features	`librosa.feature.*`	ISC	mel, MFCC, chroma, centroid, rolloff, RMS, ZCR, etc.
6. Event detection (onsets, beats, tempo)	`librosa.onset` + `librosa.beat`	ISC	"What's moving" — the signal vs steady-state
7. Classification	LAION-CLAP zero-shot	Apache-2.0	Prompted with candidate labels, pick highest cosine. Dual-use with layer 8
8. Semantic embedding	LAION-CLAP	Apache-2.0	512D audio↔text shared space — the retrieval key
9. Speech transcription (conditional)	`faster-whisper` (Whisper-tiny)	MIT (code + weights)	Only runs when layer 7 classifies as speech
10. Music analysis (optional extra)	`madmom`	BSD-3-Clause	Chord, key, beat, tempo, downbeats. Lazy load — only for music genes
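
The conditional layers in the table (layer 9 runs only for speech, layer 10 only for music genes) can be sketched as a small routing function. This is a minimal illustration, assuming the layer 7 result is available as a plain label string; `plan_layers`, the layer names, and the `music_extra_installed` flag are hypothetical and not part of helix-context.

```python
# Hypothetical sketch: decide which optional layers run for one clip.
# Layer names and label strings are illustrative assumptions.
ALWAYS_ON = [
    "decode", "resample", "loudness", "vad",
    "spectral", "events", "classify", "embed",
]

def plan_layers(top_label: str, music_extra_installed: bool = False) -> list[str]:
    """Return the ordered layers to run, given the layer 7 top label."""
    layers = list(ALWAYS_ON)
    if top_label == "speech":
        layers.append("transcribe")      # layer 9: Whisper only for speech
    if top_label == "music" and music_extra_installed:
        layers.append("music_analysis")  # layer 10: lazy, opt-in extra
    return layers

print(plan_layers("speech"))
print(plan_layers("music", music_extra_installed=True))
```

Keeping the routing decision in one place would make it easy to see which heavy models a given clip actually loads.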

Key simplification (user's insight): CLAP handles layers 7 and 8 in a single model. No separate YAMNet/PANN classifier needed. Zero-shot classification via cosine against candidate label prompts is ~5-10% less accurate than a fine-tuned classifier but drops TensorFlow (~600MB) from the install tree and makes the class ontology customizable per deployment rather than baked into a pre-trained label set.
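
The zero-shot step described above reduces to ranking candidate label prompts by cosine similarity against the audio embedding. The sketch below assumes the embeddings have already been produced by CLAP's audio and text encoders; toy 3D vectors stand in for the 512D ones, and `zero_shot` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def zero_shot(audio_vec, label_vecs):
    """Rank candidate labels by cosine similarity to the audio embedding."""
    scored = [(label, cosine(audio_vec, vec)) for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors; in the real pipeline these would be CLAP text embeddings
# of prompts such as "the sound of speech".
labels = {
    "speech": [0.9, 0.1, 0.0],
    "music": [0.1, 0.9, 0.1],
    "machinery": [0.0, 0.2, 0.9],
}
ranked = zero_shot([0.8, 0.2, 0.1], labels)
print(ranked[0][0])  # speech
```

Because the label set is just a dictionary of prompt embeddings, swapping the ontology per deployment means re-embedding a new list of prompts rather than retraining a classifier.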

Total resident RAM fully warm: ~880 MB (CLAP ~500MB + Whisper-tiny ~75MB + silero-vad ~5MB + librosa working memory ~200MB + madmom lazy ~100MB). CPU-only, zero GPU required.

Total pip install helix-context[audio] footprint: ~300 MB one-time download, ~100 MB wheel install. Madmom isolated in its own [audio-music] extra so music-specific users opt in separately.

Part 2 — License audit + fork rights

The license mix is deliberately permissive so the audio layer survives upstream abandonment. Everything in the pipeline is fully forkable if any dep rots:

License	Tools	Fork rights	Patent protection
Apache-2.0	LAION-CLAP	✅ Full, including re-license of derivatives	✅ Explicit patent grant survives fork
MIT	pyloudnorm, silero-vad, faster-whisper	✅ Full	❌ No formal protection
BSD-3-Clause	madmom, soundfile, webrtcvad	✅ Full	❌ No formal protection
ISC	librosa	✅ Full (equivalent to BSD-2)	❌ No formal protection
LGPL (external only)	ffmpeg binary, libsndfile via wheels	✅ (modifications to LGPL code must stay LGPL)	❌ Not an issue since we don't modify
AGPL	ZERO	—	—
GPL	ZERO	—	—

Fork playbook for any abandoned dep:

Fork the repo into a maintained namespace (for example SwiftWing21/*-maintained)
Preserve the original copyright + license headers; add "Modified from..." markers (required by Apache-2.0, good practice for the others) and add our own copyright line
Rename the PyPI package if the original can't be transferred (trademark + squatting risk)
Publish under the new name
Update helix-context[audio] extra to depend on the new name + add NOTICE entry

Legal effort: ~30 minutes. Engineering effort: scales with compatibility breakage.

The single Apache-2.0-patent-grant win: CLAP is the most protected piece of the pipeline, which matters because it's where the novel research lives. MIT/BSD/ISC tools are all pure DSP or small utilities where patent risk is effectively zero — OK that they lack formal patent grants.

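Essentia (AGPL) was explicitly rejected. It was the only serious contender for chord/key detection, but its AGPL network-service clause would have contaminated commercial helix-context deployments. Madmom (BSD-licensed source) replaces it. One caveat to verify before shipping: madmom's README states that its bundled model and data files are distributed under CC BY-NC-SA 4.0 rather than BSD, which may matter for commercial use.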

Part 3 — Dependency health detection

Because upstream abandonment is a real concern for audio tooling specifically (several contenders in this space are research-code with slow maintenance), a detection path for "is this dep going stale?" is part of the design:

Signal	Source	Red flag threshold
Days since last commit	GitHub API `/repos/{o}/{r}` → `pushed_at`	> 365d
Days since last PyPI release	PyPI JSON → newest `upload_time`	> 365d
Repo archived	GitHub API → `archived: bool`	`true` = hard stop
Current Python classifier support	PyPI JSON → `info.classifiers`	Missing current Python > 60d after Python release
Untriaged issues	GitHub Issues API	> 50% of recent issues no maintainer response
Unpatched security advisory	GitHub Security Advisories API	Any unpatched critical CVE > 30d
Install failure on current Python	Local `pip install --dry-run`	Fails = act now
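
Two of the signals in the table (days since last commit, days since last release) can be derived from already-fetched API payloads. The sketch below is a minimal illustration that takes parsed JSON dicts rather than making network calls; the sample payloads are made up, `freshness_signals` is a hypothetical name, and treating PyPI's offset-less `upload_time` as UTC is an assumption.

```python
from datetime import datetime, timezone

def days_since(timestamp: str, now: datetime) -> int:
    """Whole days between an ISO-8601 timestamp and `now`."""
    stamp = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if stamp.tzinfo is None:  # assumption: offset-less timestamps are UTC
        stamp = stamp.replace(tzinfo=timezone.utc)
    return (now - stamp).days

def freshness_signals(repo: dict, pypi: dict, now: datetime) -> dict:
    """Extract staleness signals from GitHub repo JSON and PyPI project JSON."""
    release_ages = [
        days_since(f["upload_time"], now)
        for files in pypi.get("releases", {}).values()
        for f in files
    ]
    commit_age = days_since(repo["pushed_at"], now)
    release_age = min(release_ages) if release_ages else None
    return {
        "archived": bool(repo.get("archived", False)),
        "days_since_commit": commit_age,
        "days_since_release": release_age,
        "commit_stale": commit_age > 365,
        "release_stale": release_age is None or release_age > 365,
    }

# Made-up sample payloads shaped like the two API responses.
repo = {"pushed_at": "2025-01-01T00:00:00Z", "archived": False}
pypi = {"releases": {
    "0.1.0": [{"upload_time": "2024-06-01T10:00:00"}],
    "0.2.0": [{"upload_time": "2025-12-01T00:00:00"}],
}}
now = datetime(2026, 4, 10, tzinfo=timezone.utc)
signals = freshness_signals(repo, pypi, now)
print(signals)
```

Passing `now` in explicitly keeps the function deterministic, which would matter for a nightly CI script that needs reproducible test fixtures.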

Bucketing:

GREEN: all healthy, no action
YELLOW: 2+ soft signals. Monitor, subscribe, prepare fallback.
RED: repo archived OR CVE unpatched OR install fails OR last commit >365d + no successor. Fork now.
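
The bucketing rules above translate directly into a small decision function. This is a sketch under assumptions: the `DepSignals` fields and the way soft signals are counted are hypothetical, since the note only specifies the three buckets and their triggers.

```python
from dataclasses import dataclass

@dataclass
class DepSignals:
    archived: bool = False
    unpatched_critical_cve: bool = False
    install_fails: bool = False
    days_since_commit: int = 0
    has_successor: bool = False
    soft_signals: int = 0  # count of non-fatal red flags, e.g. slow releases

def bucket(s: DepSignals) -> str:
    """Map dependency signals to GREEN / YELLOW / RED."""
    stale_without_successor = s.days_since_commit > 365 and not s.has_successor
    if (s.archived or s.unpatched_critical_cve
            or s.install_fails or stale_without_successor):
        return "RED"       # fork now
    if s.soft_signals >= 2:
        return "YELLOW"    # monitor, subscribe, prepare fallback
    return "GREEN"         # no action

print(bucket(DepSignals()))
print(bucket(DepSignals(soft_signals=2)))
print(bucket(DepSignals(days_since_commit=400)))
```

Checking the hard stops before the soft-signal count means a dependency can never be reported YELLOW while a RED condition holds.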

Where it lives:

scripts/check_dep_health.py (~200 LOC standalone) — CI nightly
Opens a GitHub issue tagged dependency-health if anything goes YELLOW or RED
Also runs as part of the pre-release checklist for each v0.X.Y tag
Future extension: a helix-context doctor CLI command that checks Python version, Ollama reachability, genome health, AND dependency freshness in one command

Current pipeline risk assessment (from memory, would need real-time verification):

Dep	Status
librosa	🟢 Active
pyloudnorm	🟡 Slow releases, single maintainer
silero-vad	🟢 Active
laion-clap	🟢 Active
faster-whisper	🟢 Active
soundfile	🟢 Active
madmom	🟡 ~2 years since major release — music features isolated in `[audio-music]` extra for this reason

The pipeline is deliberately structured so a single YELLOW dep (madmom) only affects users who opt into music-specific features, not the core audio ingest.

Part 4 — CLAP as semantic critic (confidence scoring)

Beyond retrieval and classification, CLAP plays a third role — it verifies the LLM's interpretation of an audio gene via cross-modal cosine similarity. The flow:

1. LLM reads audio gene → produces text interpretation
   ("sustained 200Hz hum, speech-band activity 0.3-2.1s, ends abruptly")

2. Compute: cosine( CLAP_text(interpretation), CLAP_audio(original_waveform) )

3. Bucket the cosine score:
   > 0.35  → high confidence, answer with no hedge
   0.15-0.35 → medium, answer with "likely"
   < 0.15  → low, flag as "I'm not confident about this audio"

Ingest-time confidence stamp stored as gene metadata:

Gene(
  ...,
  audio_embedding=[512 floats],           # CLAP of the waveform
  content="[librosa feature JSON]",       # LLM reasoning surface
  interpretation_text="sustained hum...", # LLM's cached first interpretation
  interpretation_confidence=0.42,         # CLAP-verified
)

Genes with interpretation_confidence < 0.15 get flagged at ingest — either the LLM misread the features, or it's out-of-distribution audio CLAP also doesn't understand. Either way, the gene is marked low-trust and downweighted at retrieval.
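
The score buckets and the ingest-time low-trust flag can be sketched as follows. The thresholds are the placeholder values from this note (0.35 and 0.15), which are expected to change after calibration; the function names and the exact hedge wording are hypothetical.

```python
def confidence_bucket(score: float, high: float = 0.35, low: float = 0.15) -> str:
    """Bucket a CLAP text/audio cosine score into high / medium / low."""
    if score > high:
        return "high"
    if score >= low:
        return "medium"
    return "low"

def hedge(answer: str, score: float) -> str:
    """Phrase an answer according to its confidence bucket."""
    level = confidence_bucket(score)
    if level == "high":
        return answer                     # no hedge
    if level == "medium":
        return f"Likely: {answer}"
    return "I'm not confident about this audio."

def is_low_trust(score: float, low: float = 0.15) -> bool:
    """Ingest-time flag: genes below the low threshold are downweighted."""
    return score < low

print(hedge("sustained 200Hz hum", 0.42))
print(hedge("sustained 200Hz hum", 0.20))
print(is_low_trust(0.09))
```

Keeping the thresholds as parameters rather than constants leaves room for the per-deployment calibration discussed in the caveats.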

This plugs into the existing ContextHealth.ellipticity pattern — audio genes contribute their confidence score to the window's health signal. A window with low-confidence audio genes returns status=sparse with reason "audio interpretation uncertain."

The speculative long-term angle: CLAP as a reward signal for fine-tuning. The LLM produces interpretations of unlabeled audio; CLAP scores each interpretation against the audio; the scores become training signal for RLHF-style updates. Nobody has published this that I know of — it's a real research direction but not MVP scope.

Honest caveats on CLAP-as-critic:

Not ground truth. Trained on LAION-Audio-630K with its own biases. Out-of-distribution audio may score low even when the LLM is correct.
Shared-space assumption. Round-trip only works if the LLM's text vocabulary overlaps with what CLAP was trained on. "Sustained 200Hz hum" works. "Harmonic content suggests Am6(add9)" might not.
Cosine scale calibration. 0.3 is "pretty similar" but there's no universal threshold — needs empirical calibration on ~50 known-good audio/text pairs to find the right operating point.
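
One possible way to do the calibration mentioned in the last caveat is to take percentiles of the cosine scores observed on the known-good pairs. This is only a sketch of one approach, not something the note prescribes: the choice of the 5th percentile for the low cutoff and the median for the high cutoff is an assumption, the scores below are made-up, and `calibrate_thresholds` is a hypothetical name.

```python
import statistics

def calibrate_thresholds(known_good_scores, low_pct=5, high_pct=50):
    """Derive (low, high) cutoffs from cosines of known-good audio/text pairs."""
    if len(known_good_scores) < 2:
        raise ValueError("need at least two scores to calibrate")
    # n=100 yields 99 cut points; index p-1 is the p-th percentile.
    # method="inclusive" keeps the cut points inside the observed range.
    cuts = statistics.quantiles(known_good_scores, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

# Made-up scores for illustration only.
scores = [0.22, 0.31, 0.28, 0.41, 0.19, 0.36, 0.33, 0.27, 0.45, 0.30]
low, high = calibrate_thresholds(scores)
print(f"low={low:.3f} high={high:.3f}")
```

Known-good pairs alone only show how low a correct interpretation can score. Estimating how often a wrong interpretation scores above the cutoff would additionally need deliberately mismatched pairs.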

Part 5 — Gene schema for audio

Gene (audio, v0.5+):
  content:          "<audio_sema_json>"              # layered pipeline output
  complement:       "[brief summary: 'meeting recording, 47 speech segments, male voice, ~10min']"
  codons:           ["speech_onset at 0.21s",
                     "silence 45.3-47.4s",
                     "speech_onset at 48.9s", ...]
  promoter.domains: ["audio", "speech", "male_voice", "inside"]  # from CLAP zero-shot
  embedding:        [20D ΣĒMA of complement]         # existing text retrieval path
  audio_embedding:  [512D CLAP]                      # new: cross-modal retrieval
  interpretation_confidence: 0.87                    # new: CLAP critic
  interpretation_text: "[LLM's cached first interpretation]"  # new: for reuse

content field format — the "audio ΣĒMA JSON" produced by the layered pipeline:

{
  "schema": "helix-audio-gene/v1",
  "source": "meeting_20260410.mp3",
  "duration_s": 602.3,
  "sample_rate": 44100,

  "classification": {
    "top_classes": [["speech", 0.94], ["male_voice", 0.78], ["inside", 0.42]],
    "model": "laion-clap-zero-shot"
  },

  "transcript": {
    "text": "the quarterly numbers came in at...",
    "segments": [{"t": 0.2, "end": 3.1, "text": "the quarterly..."}],
    "model": "whisper-tiny"
  },

  "spectral_summary": {
    "rms_mean": 0.08, "rms_std": 0.04,
    "spectral_centroid_mean": 1843,
    "spectral_rolloff_mean": 3820,
    "zero_crossing_rate_mean": 0.062,
    "dominant_freq_band_hz": [200, 3400]
  },

  "events": [
    {"t": 0.21, "type": "speech_onset"},
    {"t": 45.3, "type": "silence", "duration": 2.1},
    ...
  ],

  "vad": {"speech_ratio": 0.73, "segments_count": 47, "longest_silence_s": 8.2}
}

Size: ~2-4 KB per gene for a 10-minute recording. 10x cheaper per gene than raw spectrogram dumping. LLM-readable structured data, not opaque vectors.

Schema migration: one new audio_embedding TEXT column on the genes table, plus three new fields on the Gene dataclass. Idempotent ALTER TABLE at startup, null-safe for all existing text genes. No breaking change.
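
An idempotent startup migration of this kind can be sketched as below. It assumes the genome lives in SQLite (suggested by the TEXT column type but not stated here); the `gene_id` and `content` columns in the demo table are placeholders, and `ensure_audio_column` is a hypothetical name.

```python
import sqlite3

def ensure_audio_column(conn: sqlite3.Connection) -> bool:
    """Add genes.audio_embedding if missing. Returns True if the table was altered."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(genes)")}
    if "audio_embedding" in columns:
        return False
    conn.execute("ALTER TABLE genes ADD COLUMN audio_embedding TEXT")
    conn.commit()
    return True

# Demo against an in-memory database with one pre-existing text gene.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO genes VALUES ('g1', 'existing text gene')")
print(ensure_audio_column(conn))  # True: first run alters the table
print(ensure_audio_column(conn))  # False: second run is a no-op
print(conn.execute("SELECT audio_embedding FROM genes").fetchone())  # (None,)
```

Existing rows read back NULL for the new column, which is what makes the change null-safe for text genes.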

Part 6 — What sound memories unlock (use cases)

The three new capabilities audio genes provide that text genes cannot:

1. Temporal/event reasoning — when things happened, not just what was discussed

"When did the meeting get heated?" → audio energy + voice stress features
"What time did the machine start failing?" → onset detection + anomaly cosine

2. Non-verbal context — hesitation, tone, pauses, background, speaker identity

A transcript says "yes." An audio memory knows the "yes" was mumbled after a 3-second pause with a sigh.
This is the difference between reading a meeting summary and having been in the room.

3. Cross-modal search — three-tier retrieval (BM25 transcript + ΣĒMA 20D text + CLAP 512D audio)

Text query → audio gene via transcript OR CLAP text encoder
Audio query → audio gene via CLAP audio encoder cosine
Filter by classification (speech/music/environmental) before ranking
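
The "filter by classification, then rank" step can be sketched as below. It assumes each gene exposes its zero-shot labels and its CLAP embedding; the dict layout, the `search_audio` name, and the 2D toy vectors are illustrative assumptions rather than the helix-context data model.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_audio(query_vec, genes, required_class=None, top_k=3):
    """Filter genes by class label, then rank survivors by CLAP cosine."""
    candidates = [
        g for g in genes
        if required_class is None or required_class in g["domains"]
    ]
    ranked = sorted(
        candidates,
        key=lambda g: cosine(query_vec, g["audio_embedding"]),
        reverse=True,
    )
    return [g["id"] for g in ranked[:top_k]]

genes = [
    {"id": "a", "domains": ["audio", "speech"], "audio_embedding": [1.0, 0.0]},
    {"id": "b", "domains": ["audio", "music"], "audio_embedding": [0.9, 0.1]},
    {"id": "c", "domains": ["audio", "speech"], "audio_embedding": [0.0, 1.0]},
]
query = [1.0, 0.05]  # would come from the CLAP text or audio encoder
print(search_audio(query, genes, required_class="speech"))  # ['a', 'c']
print(search_audio(query, genes))                            # ['a', 'b', 'c']
```

Filtering first keeps a near-match from the wrong class (gene `b` here) out of a speech-only query without changing the ranking logic.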

Concrete use case table:

Capability	Without audio	With audio
Meeting recall	Text summary via transcript	"Alice said X at 14:32 with hesitation, after 8s of room silence — suggests a difficult admission. Link to audio segment."
Machine diagnostics	Impossible	"The bearing tone on pump-3 shifted 200 Hz between 2026-04-05 and 2026-04-09. Three matching recordings."
Music library reasoning	Metadata only	"17 tracks in your writing playlist are minor keys under 80 BPM. Three you skipped most this month are G minor."
Security/safety events	Impossible	Acoustic anomaly genes fire when CLAP cosine matches a stored template (glass break, impact, raised voice).
Conversation emotional trajectory	Impossible	"The client call went from collaborative to tense at 22 minutes, recovering by 28. Speaker A did the recovery work."

Part 7 — Video ingestion: parent-with-streams

Video is not a new gene type. It's a parent reference with child genes per extracted stream, using helix's existing is_fragment=True + source_id + a new video_offset_s field:

VideoGene (parent — thin reference, file path + duration + format metadata)
│
├── AudioGene(s)      ← ffmpeg extracts audio track → full audio pipeline
├── FrameGene(s)      ← ffmpeg extracts keyframes → visual pipeline (future work)
├── CaptionGene       ← ffmpeg extracts embedded subtitles → plain text ingest
└── TranscriptGene    ← Whisper output indexed separately for fast text search
                       without loading full audio feature JSON

ffmpeg handles the demux, one invocation per stream:

ffmpeg -i input.mp4 -vn -acodec copy audio.m4a              # audio only
ffmpeg -i input.mp4 -vf "fps=0.1" -q:v 2 frame_%04d.jpg     # 1 frame per 10s
ffmpeg -i input.mp4 -map 0:s:0 -f srt captions.srt          # embedded subs if any
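
For use from Python, the three commands above can be built as argument lists and handed to `subprocess.run`. The sketch below only constructs the lists and does not execute ffmpeg; `demux_commands` and the output directory layout are hypothetical.

```python
from pathlib import Path

def demux_commands(video: str, out_dir: str) -> dict[str, list[str]]:
    """Build one ffmpeg argument list per extracted stream."""
    out = Path(out_dir)
    return {
        # audio track only, stream-copied without re-encoding
        "audio": ["ffmpeg", "-i", video, "-vn", "-acodec", "copy",
                  str(out / "audio.m4a")],
        # one frame every 10 seconds
        "frames": ["ffmpeg", "-i", video, "-vf", "fps=0.1", "-q:v", "2",
                   str(out / "frame_%04d.jpg")],
        # first embedded subtitle stream, if any
        "captions": ["ffmpeg", "-i", video, "-map", "0:s:0", "-f", "srt",
                     str(out / "captions.srt")],
    }

cmds = demux_commands("input.mp4", "out")
for name, cmd in cmds.items():
    print(name, " ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. The captions command exits with an error when the file has no subtitle stream, so a caller would likely treat that stream as optional rather than failing the whole ingest.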

Each stream goes through its specialist pipeline; all resulting genes share a source_id + video_offset_s so they can be joined at query time.
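
The query-time join on source_id + video_offset_s can be sketched as an interval-overlap match. The note does not say how a gene's time span is represented, so the per-gene `duration_s` field here is an assumption, as are the dict layout and the `join_streams` name.

```python
def overlaps(a_start, a_dur, b_start, b_dur):
    """True when two [start, start + duration) intervals intersect."""
    return a_start < b_start + b_dur and b_start < a_start + a_dur

def join_streams(frames, transcripts):
    """Pair frame and transcript genes from the same video that overlap in time."""
    pairs = []
    for f in frames:
        for t in transcripts:
            if f["source_id"] == t["source_id"] and overlaps(
                f["video_offset_s"], f["duration_s"],
                t["video_offset_s"], t["duration_s"],
            ):
                pairs.append((f["label"], t["text"]))
    return pairs

frames = [
    {"source_id": "vid1", "video_offset_s": 120.0, "duration_s": 10.0,
     "label": "diagram"},
]
transcripts = [
    {"source_id": "vid1", "video_offset_s": 118.0, "duration_s": 6.0,
     "text": "this arrow is the cache"},
    {"source_id": "vid1", "video_offset_s": 300.0, "duration_s": 4.0,
     "text": "thanks for watching"},
    {"source_id": "vid2", "video_offset_s": 121.0, "duration_s": 5.0,
     "text": "unrelated video"},
]
print(join_streams(frames, transcripts))
```

The nested loop is fine for a handful of genes in one expression window. A larger genome would probably sort by offset or index on source_id first.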

Retrieval hits any stream independently:

"What video had the blue car?" → frame genes via CLIP
"What video mentioned the API key?" → transcript genes via BM25
"When in that video did the laughter happen?" → audio genes via CLAP classification
"What did the presenter say while the diagram was on screen?" → join frame + transcript genes in a single expression window, filtered by overlapping video_offset_s

Storage is proportional to content density, not runtime. You don't store 600 MB of video — you store 2-4 KB of audio feature JSON + a few KB of frame embeddings + the transcript. The original file is referenced by path + SHA256 hash, not inlined.

Each stream updates independently. New visual classifier? Re-run frame genes without touching audio. New speech model? Re-run Whisper without touching frames.

Part 8 — Output channels (what the agent produces with these memories)

Once audio/video genes exist, the agent's output channels grow in structured ways:

Text output enriched with non-verbal context — "You mentioned Shamir on 2026-04-10 at 47:03 with a long pause before committing — want to revisit?"
Timestamped recall with source links — pointers to specific moments, not paraphrases
Structured reports derived from audio — CSVs of speech events, JSONs of classified acoustic events, timelines of anomalies, chord charts from musical pieces
Cross-modal summaries — "This tutorial has 47 min audio, 31 speech segments, 12 onscreen code changes. Presenter spent 38% on topic X. Three moments of expressed uncertainty (pause + hedged phrasing)."
Action triggers — watch for acoustic patterns matching stored templates; fire skills when CLAP cosine > 0.7 against an alert template
Audio replay/synthesis (future, separate research arc) — with TTS bolted on, read-aloud summaries with prosody become possible. Speech-to-speech stays out of scope (per Specialist B, no open S2S stack verifiably bypasses text internally as of 2026-04).

Part 9 — Impact on existing helix-context features

Audio ingestion enriches several non-audio features without requiring changes to them:

Replication (learn-from-conversation loop) — a Helix server seeing spoken exchanges replicates them into the genome AS audio genes with transcripts. Extends the existing replication buffer in context_manager.py; no new subsystem.
ContextHealth / ellipticity — the existing delta-epsilon signal gains a new dimension: audio gene interpretation_confidence contributes to window health. A window with low-confidence audio returns sparse with reason "audio interpretation uncertain."
Horizontal Gene Transfer (.helix format) — audio genes serialize their feature JSON + CLAP embedding + transcript without requiring raw audio transfer. Two Helix instances can share recorded knowledge with no raw file transfer. Combines naturally with the Shamir Secret Sharing design note — audio genes could be SSS-split the same way text genes could.
Decoder prompts — existing DECODER_* prompts assume text-only. Adding audio requires one new prompt variant that teaches the big LLM how to read audio feature JSON. One new constant + one conditional.
MCP tool exposure — once audio genes exist, helix-context's MCP server exposes ingest_audio_file, ingest_video_file, query_audio_memory, describe_sound_at, extract_speech_segments, detect_acoustic_anomalies. Other MCP-speaking agents (BigEd, Claude Code sessions) gain audio memory as a first-class capability.
Benchmark infrastructure — an audio_bench_needle.py harvests audio-specific needles (find recordings where X was said, identify machine sounds matching templates, classify clips). Plugs into the existing compare_ab.py + benchmark state monitor.

Part 10 — The "learning from watching" loop, concretely

The user's original framing — "learning ability from video/sound files via direct audio signal to context, bypass human txt on storage side" — becomes a concrete 6-step flow:

Agent is given a URL or file path to a video (tutorial, meeting, podcast, lecture, demo)
ffmpeg demuxes → audio + frames + captions (if present)
Audio pipeline produces one or more AudioGene records (CLAP + librosa features + Whisper transcript + event timeline + zero-shot classification)
Frame pipeline (future work) produces FrameGene records
All genes share source_id + video_offset_s so they can be joined at query time
Future queries reference the memory without re-watching — "in that tutorial, the presenter said X while demonstrating Y" answered by joining audio+frame genes in a single expression window

The "bypass text on storage side" question, answered concretely:

The audio portion of the memory is genuinely text-free at the storage layer (CLAP 512D + feature JSON + event timeline)
The transcript is still text, because the big LLM needs some textual surface to reason over at expression time
But the transcript is a projection of the audio, not the source of truth. If we lost the transcript we could regenerate it from the audio. If we lost the audio we'd still have the CLAP embedding + feature JSON (enough for retrieval + structural reasoning, just not re-transcription).
Text is a view; audio is the source. This is the inverse of today's text-only helix-context.

Part 11 — Open questions (not answered by this note)

Things the user would need to decide before a real spec:

Granularity — one AudioGene per file, or per 30s chunk, or per VAD-detected utterance? Lead researcher recommended chunk-level with is_fragment=True. Matters more for hour-long podcasts than short clips.
Raw file backing — do audio genes include a file path, SHA256 hash, external URL, or inlined base64 chunk? Probably path + hash (same as current source_id pattern).
Retention / chromatin for audio — do audio genes age on a different schedule than text? A year-old meeting recording may be HETEROCHROMATIN on a different cadence than a year-old text note.
Cross-modal query-time budget — if an expression includes 5 text + 2 audio genes, how is the token budget split? Audio feature JSON is denser than plain text; audio genes might need their own budget tier (extending the existing TIGHT/FOCUSED/BROAD pattern).
Decoder prompt format — inject feature JSON verbatim, a prose summary, both, or selected fields (transcript + top classes + key events only)? Same split as the existing DECODER_MOE vs DECODER_CONDENSED decision for text.
A/B env override name — HELIX_DISABLE_AUDIO=1 mirrors HELIX_DISABLE_HEADROOM=1 — confirm the pattern.
MP3 input policy — require external ffmpeg for MP3 decode (Path A), or reject MP3 with a conversion hint (Path B)? Path A is more user-friendly, Path B is zero-external-dep.
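
Two of these open questions are small enough to sketch in code: the A/B env override and the MP3 input policy. Neither is decided, so this is only an illustration of what each option might look like; `audio_enabled` and `mp3_policy` are hypothetical names and the returned strings are placeholders.

```python
import os

def audio_enabled(env=None) -> bool:
    """Audio ingest is on unless HELIX_DISABLE_AUDIO=1 is set."""
    env = os.environ if env is None else env
    return env.get("HELIX_DISABLE_AUDIO", "0") != "1"

def mp3_policy(ffmpeg_path, allow_external: bool = True) -> str:
    """Choose between Path A (external ffmpeg) and Path B (reject with hint)."""
    if allow_external and ffmpeg_path:
        return "decode-with-ffmpeg"           # Path A
    return "reject-with-conversion-hint"      # Path B

print(audio_enabled({"HELIX_DISABLE_AUDIO": "1"}))
print(mp3_policy("/usr/bin/ffmpeg"))
print(mp3_policy(None))
```

In practice the `ffmpeg_path` argument could come from `shutil.which("ffmpeg")`, which returns `None` when the binary is not on the PATH. That would let Path A degrade to Path B automatically.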

What this note is NOT

Not a roadmap item. Not scheduled, not estimated, not committed.
Not a sprint-ready spec. The open questions above must be answered before implementation work can start.
Not a replacement for the upstream research. When audio un-defers, re-survey the landscape — 2026 is moving fast on audio LLMs and the pipeline may have better options by then.
Not a claim that text genes are wrong. Text remains the primary modality for helix-context. Audio is additive, not a replacement.

What this note IS

A durable design capture so future raude/laude/human sessions don't re-derive the same thinking
A bill of materials — exact tools, exact licenses, exact fork strategy
A use case list — concrete answers to "what would audio genes enable?"
An architectural north star — the 6-step learn-from-watching flow, the parent-with-streams video model, the CLAP-triple-duty pattern
An honest accounting of what's deferred — and why

Related discussions

#2 — Headroom adoption + N=20 benchmark + forensic detour — the v0.3.0b5 retrospective that preceded this research thread
#5 — Shamir Secret Sharing for HGT + multi-tenant shards — the crypto primitive discussion that combines interestingly with audio genes (SSS-split audio content for privacy-preserving sharing)
Issue #3 — bench_needle_1000.py KV harvest extracts phantom values — the measurement-integrity thread that must resolve before we can benchmark any new gene type, including audio

— raude, Claude Code Opus 4.6 (1M context)
Session: 2026-04-10 / v0.3.0b5 ship day / paired with laude on the benchmark track
Specialists consulted: general-purpose research agents for audio embedding models (Specialist A) and native audio LLMs / S2S stacks (Specialist B), with lead synthesis by a third agent
