Audio ingestion design note (DEFERRED — not roadmapped)
Status: Design exploration only. Not scheduled, not in any current milestone, not a commitment to ship. Captured so the thinking survives the session it was born in and future raude/laude/human collaborators don't have to re-derive it from scratch.
Filed by raude — Claude Code Opus 4.6 (1M context).
Why this note exists
During the 2026-04-10 session (helix-context v0.3.0b5 ship day), the user asked whether agents could "hear" — ingest audio/video content directly into the genome rather than relying on text transcripts alone. A 2+1 specialist research council was dispatched; the findings plus several subsequent architectural refinements by the user landed on a concrete design direction that's worth capturing. The audio MVP itself is deferred — Struggle 1 (density gate at ingest) and the bench harvest bug (Issue #3) are higher priority. But when audio work un-defers, this is the blueprint.
TL;DR
The architecture: ingest audio through a 10-layer pipeline of existing, mature, Apache-2.0-compatible tools, producing a structured JSON "audio gene" content field alongside a 512D CLAP embedding for cross-modal retrieval. CLAP plays three roles simultaneously — retrieval key, zero-shot classifier, and semantic critic for confidence scoring. Video is a parent-with-streams pattern: ffmpeg demuxes into audio + frames + captions, each stream goes through its own specialist pipeline, genes link via shared source_id + video_offset_s. The bet: text is a projection of audio, not the source of truth — the primary storage is embeddings + feature JSON, the transcript is a view.
Part 1 — The layered pipeline
Every layer is an existing, mature, permissively-licensed tool. Nothing here is research code or experimental.
| Layer | Tool | License | Role |
|---|---|---|---|
| 1. Format decode (WAV/FLAC/OGG) | soundfile | BSD-3 (LGPL libsndfile via wheels) | Native audio I/O |
| 1b. MP3 decode (optional) | ffmpeg (external) | LGPL-2.1+ (system dep) | Format conversion only, no GPL codecs required |
| 2. Resample / mono | librosa | ISC | Normalize to 44.1kHz mono float32 |
| 3. Loudness normalize | pyloudnorm | MIT | EBU R128 target (-16 LUFS) |
| 4. Voice activity / silence segmentation | silero-vad | MIT (code + weights) | Finds speech boundaries; ~50ms/call ONNX |
| 5. Spectral features | librosa.feature.* | ISC | mel, MFCC, chroma, centroid, rolloff, RMS, ZCR, etc. |
| 6. Event detection (onsets, beats, tempo) | librosa.onset + librosa.beat | ISC | "What's moving": the signal vs steady-state |
| 7. Classification | LAION-CLAP zero-shot | Apache-2.0 | Prompted with candidate labels, pick highest cosine. Dual-use with layer 8 |
| 8. Semantic embedding | LAION-CLAP | Apache-2.0 | 512D audio↔text shared space; the retrieval key |
| 9. Speech transcription (conditional) | faster-whisper (Whisper-tiny) | MIT (code + weights) | Only runs when layer 7 classifies as speech |
| 10. Music analysis (optional extra) | madmom | BSD-3-Clause | Chord, key, beat, tempo, downbeats. Lazy load, only for music genes |
Key simplification (user's insight): CLAP handles layers 7 and 8 in a single model. No separate YAMNet/PANN classifier needed. Zero-shot classification via cosine against candidate label prompts is ~5-10% less accurate than a fine-tuned classifier but drops TensorFlow (~600MB) from the install tree and makes the class ontology customizable per deployment rather than baked into a pre-trained label set.
Total resident RAM fully warm: ~880 MB (CLAP ~500MB + Whisper-tiny ~75MB + silero-vad ~5MB + librosa working memory ~200MB + madmom lazy ~100MB). CPU-only, zero GPU required.
Total pip install helix-context[audio] footprint: ~300 MB one-time download, ~100 MB wheel install. Madmom isolated in its own [audio-music] extra so music-specific users opt in separately.
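The zero-shot classification step (layer 7) reduces to a cosine ranking of candidate label embeddings against the audio embedding. The sketch below shows only that ranking; it assumes the embeddings have already been produced by CLAP's audio and text encoders, and it uses toy 3-D vectors and made-up labels in place of real 512D embeddings. The function names are illustrative, not part of helix-context.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_vec, label_vecs, top_k=3):
    """Rank candidate labels by cosine similarity to the audio embedding."""
    scored = [(label, cosine(audio_vec, vec)) for label, vec in label_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-D vectors standing in for 512-D CLAP embeddings (illustrative only).
label_vecs = {
    "speech": [0.9, 0.1, 0.0],
    "music": [0.1, 0.9, 0.1],
    "machine noise": [0.0, 0.2, 0.9],
}
audio_vec = [0.8, 0.2, 0.1]

ranking = zero_shot_classify(audio_vec, label_vecs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the label set is just a dictionary of prompts, swapping the ontology per deployment means changing the keys and re-encoding them, which is the customization benefit described above.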
Part 2 — License audit + fork rights
The license mix is deliberately permissive so the audio layer survives upstream abandonment. Everything in the pipeline is fully forkable if any dep rots:
| License | Tools | Fork rights | Patent protection |
|---|---|---|---|
| Apache-2.0 | LAION-CLAP | ✅ Full, including re-license of derivatives | ✅ Explicit patent grant survives fork |
| MIT | pyloudnorm, silero-vad, faster-whisper | ✅ Full | ❌ No formal protection |
| BSD-3-Clause | madmom, soundfile, webrtcvad | ✅ Full | ❌ No formal protection |
| ISC | librosa | ✅ Full (equivalent to BSD-2) | ❌ No formal protection |
| LGPL (external only) | ffmpeg binary, libsndfile via wheels | ✅ (modifications to LGPL code must stay LGPL) | ❌ Not an issue since we don't modify |
| AGPL | None (zero) | n/a | n/a |
| GPL | None (zero) | n/a | n/a |
Fork playbook for any abandoned dep:
1. Fork the repo to a SwiftWing21/*-maintained (or similar) repository
2. Preserve original copyright + license headers; add "Modified from..." markers per Apache-2.0, add own copyright line
3. Rename the PyPI package if the original can't be transferred (trademark + squatting risk)
4. Publish under the new name
5. Update the helix-context[audio] extra to depend on the new name + add NOTICE entry
Legal effort: ~30 minutes. Engineering effort: scales with compatibility breakage.
The single Apache-2.0-patent-grant win: CLAP is the most protected piece of the pipeline, which matters because it's where the novel research lives. MIT/BSD/ISC tools are all pure DSP or small utilities where patent risk is effectively zero — OK that they lack formal patent grants.
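Essentia (AGPL) was explicitly rejected: it was the only serious contender for chord/key detection, but AGPL-on-network-service would have contaminated commercial helix-context deployments. Madmom (BSD-3) replaces it cleanly at the source-code level; its bundled model files are distributed under CC BY-NC-SA 4.0, so those terms need checking before any commercial deployment.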
Part 3 — Dependency health detection
Because upstream abandonment is a real concern for audio tooling specifically (several contenders in this space are research-code with slow maintenance), a detection path for "is this dep going stale?" is part of the design:
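Signals checked: the GitHub API (/repos/{o}/{r} → pushed_at, plus archived: bool, where true is a hard stop), PyPI metadata (upload_time, info.classifiers), and pip install --dry-run.
Bucketing:
RED: repo archived OR unpatched CVE OR install fails OR (last commit >365d AND no successor). Fork now.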
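The RED rule above reads as a boolean predicate over a handful of signals. The following is a minimal sketch, assuming the signals have already been collected into a small record; the DepSignals class and its field names are hypothetical, and only the RED bucket is modeled because that is the only rule spelled out here.

```python
from dataclasses import dataclass

@dataclass
class DepSignals:
    """Hypothetical container for per-dependency health signals."""
    name: str
    archived: bool
    unpatched_cve: bool
    install_fails: bool
    days_since_last_commit: int
    has_successor: bool

def is_red(signals: DepSignals, stale_days: int = 365) -> bool:
    """True when any hard-stop condition from the RED rule holds."""
    stale_without_successor = (
        signals.days_since_last_commit > stale_days and not signals.has_successor
    )
    return (
        signals.archived
        or signals.unpatched_cve
        or signals.install_fails
        or stale_without_successor
    )

# Illustrative inputs, not real measurements of any package.
healthy = DepSignals("dep-a", False, False, False, 30, False)
stale = DepSignals("dep-b", False, False, False, 400, False)
stale_but_replaced = DepSignals("dep-c", False, False, False, 400, True)

for dep in (healthy, stale, stale_but_replaced):
    print(dep.name, "RED" if is_red(dep) else "not RED")
```

Keeping the rule in one pure function makes it easy to unit-test separately from the network calls that gather the signals.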
Where it lives:
scripts/check_dep_health.py (~200 LOC standalone) — CI nightly
Opens a GitHub issue tagged dependency-health if anything goes YELLOW or RED
Also runs as part of the pre-release checklist for each v0.X.Y tag
Future extension: a helix-context doctor CLI command that checks Python version, Ollama reachability, genome health, AND dependency freshness in one command
Current pipeline risk assessment (from memory, would need real-time verification):
| Dep | Status |
|---|---|
| librosa | 🟢 Active |
| pyloudnorm | 🟡 Slow releases, single maintainer |
| silero-vad | 🟢 Active |
| laion-clap | 🟢 Active |
| faster-whisper | 🟢 Active |
| soundfile | 🟢 Active |
| madmom | 🟡 ~2 years since major release; music features isolated in [audio-music] extra for this reason |
The pipeline is deliberately structured so a single YELLOW dep (madmom) only affects users who opt into music-specific features, not the core audio ingest.
Part 4 — CLAP as semantic critic (confidence scoring)
Beyond retrieval and classification, CLAP plays a third role — it verifies the LLM's interpretation of an audio gene via cross-modal cosine similarity. The flow:
1. LLM reads audio gene → produces text interpretation
("sustained 200Hz hum, speech-band activity 0.3-2.1s, ends abruptly")
2. Compute: cosine( CLAP_text(interpretation), CLAP_audio(original_waveform) )
3. Bucket the cosine score:
> 0.35 → high confidence, answer with no hedge
0.15-0.35 → medium, answer with "likely"
< 0.15 → low, flag as "I'm not confident about this audio"
Ingest-time confidence stamp stored as gene metadata:
Gene(
    ...,
    audio_embedding=[512 floats],               # CLAP of the waveform
    content="[librosa feature JSON]",           # LLM reasoning surface
    interpretation_text="sustained hum...",     # LLM's cached first interpretation
    interpretation_confidence=0.42,             # CLAP-verified
)
Genes with interpretation_confidence < 0.15 get flagged at ingest — either the LLM misread the features, or it's out-of-distribution audio CLAP also doesn't understand. Either way, the gene is marked low-trust and downweighted at retrieval.
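The bucketing in the flow above, together with the ingest-time low-trust flag, can be sketched as two small functions. This assumes the cosine score has already been computed from the CLAP text and audio embeddings; the function names and the exact wording of the hedged answers are illustrative, and the 0.15 and 0.35 cutoffs are the provisional values from this note, not calibrated constants.

```python
def confidence_bucket(score, low=0.15, high=0.35):
    """Map a CLAP text/audio cosine score to a confidence bucket."""
    if score > high:
        return "high"
    if score >= low:
        return "medium"
    return "low"

def hedge_answer(answer, score):
    """Wrap an interpretation according to its confidence bucket."""
    bucket = confidence_bucket(score)
    if bucket == "high":
        return answer
    if bucket == "medium":
        return f"Likely: {answer}"
    return "I'm not confident about this audio."

def is_low_trust(score, low=0.15):
    """Ingest-time flag: genes below the low cutoff get downweighted later."""
    return score < low

for score in (0.42, 0.20, 0.08):
    print(score, confidence_bucket(score), "|", hedge_answer("sustained 200Hz hum", score))
```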
This plugs into the existing ContextHealth.ellipticity pattern — audio genes contribute their confidence score to the window's health signal. A window with low-confidence audio genes returns status=sparse with reason "audio interpretation uncertain."
The speculative long-term angle: CLAP as a reward signal for fine-tuning. The LLM produces interpretations of unlabeled audio; CLAP scores each interpretation against the audio; the scores become training signal for RLHF-style updates. Nobody has published this that I know of — it's a real research direction but not MVP scope.
Honest caveats on CLAP-as-critic:
Not ground truth. Trained on LAION-Audio-630K with its own biases. Out-of-distribution audio may score low even when the LLM is correct.
Shared-space assumption. Round-trip only works if the LLM's text vocabulary overlaps with what CLAP was trained on. "Sustained 200Hz hum" works. "Harmonic content suggests Am6(add9)" might not.
Cosine scale calibration. 0.3 is "pretty similar" but there's no universal threshold — needs empirical calibration on ~50 known-good audio/text pairs to find the right operating point.
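One way to approach the calibration caveat is to sweep candidate cutoffs over scored pairs and keep the one that best separates correct interpretations from wrong ones. The sketch below is an assumption-heavy illustration: it supposes a second set of deliberately mismatched audio/text pairs exists alongside the known-good pairs (the note only mentions the latter), and every score in it is invented for the example rather than measured.

```python
def best_threshold(matched_scores, mismatched_scores):
    """Return (cutoff, accuracy) for the cutoff that best separates the two sets.

    A pair is predicted "matched" when its score is >= cutoff.
    Ties keep the lowest cutoff, since candidates are swept in ascending order.
    """
    candidates = sorted(set(matched_scores) | set(mismatched_scores))
    total = len(matched_scores) + len(mismatched_scores)
    best_cut, best_acc = None, -1.0
    for cut in candidates:
        correct = sum(s >= cut for s in matched_scores)
        correct += sum(s < cut for s in mismatched_scores)
        accuracy = correct / total
        if accuracy > best_acc:
            best_cut, best_acc = cut, accuracy
    return best_cut, best_acc

# Invented cosine scores, for illustration only.
matched = [0.41, 0.38, 0.29, 0.33, 0.45]      # known-good audio/text pairs
mismatched = [0.05, 0.12, 0.18, 0.09, 0.31]   # deliberately wrong pairings

cutoff, accuracy = best_threshold(matched, mismatched)
print(f"cutoff={cutoff} accuracy={accuracy:.2f}")
```

With roughly 50 real pairs the same sweep would give a first operating point, which could then replace the provisional 0.15 and 0.35 buckets.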
Part 5 — Gene schema for audio
Gene (audio, v0.5+):
    content: "<audio_sema_json>"    # layered pipeline output
    complement: "[brief summary: 'meeting recording, 47 speech segments, male voice, ~10min']"
    codons: ["speech_onset at 0.21s",
             "silence 45.3-47.4s",
             "speech_onset at 48.9s", ...]
    promoter.domains: ["audio", "speech", "male_voice", "inside"]    # from CLAP zero-shot
    embedding: [20D ΣĒMA of complement]    # existing text retrieval path
    audio_embedding: [512D CLAP]    # new: cross-modal retrieval
    interpretation_confidence: 0.87    # new: CLAP critic
    interpretation_text: "[LLM's cached first interpretation]"    # new: for reuse
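content field format (the "audio ΣĒMA JSON" produced by the layered pipeline):
{
  "schema": "helix-audio-gene/v1",
  "source": "meeting_20260410.mp3",
  "duration_s": 602.3,
  "sample_rate": 44100,
  "classification": {
    "top_classes": [["speech", 0.94], ["male_voice", 0.78], ["inside", 0.42]],
    "model": "laion-clap-zero-shot"
  },
  "transcript": {
    "text": "the quarterly numbers came in at...",
    "segments": [{"t": 0.2, "end": 3.1, "text": "the quarterly..."}],
    "model": "whisper-tiny"
  },
  "spectral_summary": {
    "rms_mean": 0.08,
    "rms_std": 0.04,
    "spectral_centroid_mean": 1843,
    "spectral_rolloff_mean": 3820,
    "zero_crossing_rate_mean": 0.062,
    "dominant_freq_band_hz": [200, 3400]
  },
  "events": [
    {"t": 0.21, "type": "speech_onset"},
    {"t": 45.3, "type": "silence", "duration": 2.1},
    ...
  ],
  "vad": {"speech_ratio": 0.73, "segments_count": 47, "longest_silence_s": 8.2}
}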
Size: ~2-4 KB per gene for a 10-minute recording. 10x cheaper per gene than raw spectrogram dumping. LLM-readable structured data, not opaque vectors.
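The codons in the schema look like one-line renderings of timeline events. A minimal sketch of that derivation follows, assuming each event is a dict with a start time "t", a "type", and an optional "duration" in seconds; the function name events_to_codons is hypothetical and the real pipeline may format codons differently.

```python
import json

def events_to_codons(events):
    """Render timeline events as short human- and LLM-readable codon strings."""
    codons = []
    for event in events:
        start = event["t"]
        if "duration" in event:
            # Round to avoid float noise such as 47.39999999999999.
            end = round(start + event["duration"], 2)
            codons.append(f"{event['type']} {start:g}-{end:g}s")
        else:
            codons.append(f"{event['type']} at {start:g}s")
    return codons

# A cut-down content payload with only the events field (illustrative).
content = json.dumps({
    "events": [
        {"t": 0.21, "type": "speech_onset"},
        {"t": 45.3, "type": "silence", "duration": 2.1},
        {"t": 48.9, "type": "speech_onset"},
    ]
})

codons = events_to_codons(json.loads(content)["events"])
print(codons)
```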
Schema migration: one new audio_embedding TEXT column on the genes table, plus three new fields on the Gene dataclass. Idempotent ALTER TABLE at startup, null-safe for all existing text genes. No breaking change.
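An idempotent column add can be sketched with the standard-library sqlite3 module. This assumes the genome is stored in SQLite, which the note does not state outright, and it uses a cut-down genes table invented for the example; the helper name is hypothetical.

```python
import sqlite3

def ensure_audio_embedding_column(conn):
    """Add genes.audio_embedding if it is missing; safe to call on every startup.

    Returns True when the column was added, False when it already existed.
    """
    existing = {row[1] for row in conn.execute("PRAGMA table_info(genes)")}
    if "audio_embedding" in existing:
        return False
    conn.execute("ALTER TABLE genes ADD COLUMN audio_embedding TEXT")
    conn.commit()
    return True

# Demo against an in-memory database with a simplified, made-up genes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO genes (content) VALUES ('existing text gene')")

first_run = ensure_audio_embedding_column(conn)
second_run = ensure_audio_embedding_column(conn)
row = conn.execute("SELECT content, audio_embedding FROM genes").fetchone()
print(first_run, second_run, row)
```

Existing text genes simply read back NULL for the new column, which is the null-safe behaviour described above.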
Part 6 — What sound memories unlock (use cases)
The three new capabilities audio genes provide that text genes cannot:
1. Temporal/event reasoning: when things happened, not just what was discussed
   "When did the meeting get heated?" → audio energy + voice stress features
   "What time did the machine start failing?" → onset detection + anomaly cosine
2. Non-verbal context: hesitation, tone, pauses, background, speaker identity
3. Cross-modal search: three-tier retrieval (BM25 transcript + ΣĒMA 20D text + CLAP 512D audio)
   Text query → audio gene via transcript OR CLAP text encoder
   Audio query → audio gene via CLAP audio encoder cosine
   Filter by classification (speech/music/environmental) before ranking
Concrete use case table:
| Capability | Without audio | With audio |
|---|---|---|
| Meeting recall | Text summary via transcript | "Alice said X at 14:32 with hesitation, after 8s of room silence — suggests a difficult admission. Link to audio segment." |
| Machine diagnostics | Impossible | "The bearing tone on pump-3 shifted 200 Hz between 2026-04-05 and 2026-04-09. Three matching recordings." |
| Music library reasoning | Metadata only | "17 tracks in your writing playlist are minor keys under 80 BPM. Three you skipped most this month are G minor." |
| Security/safety events | Impossible | Acoustic anomaly genes fire when CLAP cosine matches a stored template (glass break, impact, raised voice). |
| Conversation emotional trajectory | Impossible | "The client call went from collaborative to tense at 22 minutes, recovering by 28. Speaker A did the recovery work." |
Part 7 — Video ingestion: parent-with-streams
Video is not a new gene type. It's a parent reference with child genes per extracted stream, using helix's existing is_fragment=True + source_id + a new video_offset_s field:
VideoGene (parent — thin reference, file path + duration + format metadata)
│
├── AudioGene(s) ← ffmpeg extracts audio track → full audio pipeline
├── FrameGene(s) ← ffmpeg extracts keyframes → visual pipeline (future work)
├── CaptionGene ← ffmpeg extracts embedded subtitles → plain text ingest
└── TranscriptGene ← Whisper output indexed separately for fast text search
without loading full audio feature JSON
ffmpeg does the demux, one command per stream:
ffmpeg -i input.mp4 -vn -acodec copy audio.m4a # audio only
ffmpeg -i input.mp4 -vf "fps=0.1" -q:v 2 frame_%04d.jpg # 1 frame per 10s
ffmpeg -i input.mp4 -map 0:s:0 -f srt captions.srt # embedded subs if any
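For driving these commands from Python, one option is to build them as argument lists and hand each to subprocess. The sketch below only constructs the lists, mirroring the three commands above flag for flag; the function name and the output directory layout are assumptions, and nothing here runs ffmpeg.

```python
from pathlib import Path

def build_demux_commands(video_path, out_dir):
    """Build the three ffmpeg invocations as argument lists, keyed by stream."""
    src = str(video_path)
    out = Path(out_dir)
    return {
        # Copy the audio track without re-encoding.
        "audio": ["ffmpeg", "-i", src, "-vn", "-acodec", "copy",
                  str(out / "audio.m4a")],
        # Sample one frame every 10 seconds.
        "frames": ["ffmpeg", "-i", src, "-vf", "fps=0.1", "-q:v", "2",
                   str(out / "frame_%04d.jpg")],
        # Extract the first embedded subtitle stream, if the file has one.
        "captions": ["ffmpeg", "-i", src, "-map", "0:s:0", "-f", "srt",
                     str(out / "captions.srt")],
    }

commands = build_demux_commands("input.mp4", "out")
for stream, argv in commands.items():
    print(stream, " ".join(argv))

# To execute, each list could be passed to subprocess.run(argv, check=True).
# The captions command is expected to fail on files without subtitles, so a
# caller would likely treat that failure as "no caption stream" rather than an error.
```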
Each stream goes through its specialist pipeline; all resulting genes share a source_id + video_offset_s so they can be joined at query time.
Retrieval hits any stream independently:
"What video had the blue car?" → frame genes via CLIP
"What video mentioned the API key?" → transcript genes via BM25
"When in that video did the laughter happen?" → audio genes via CLAP classification
"What did the presenter say while the diagram was on screen?" → join frame + transcript genes in a single expression window, filtered by overlapping video_offset_s
Storage is proportional to content density, not runtime. You don't store 600 MB of video — you store 2-4 KB of audio feature JSON + a few KB of frame embeddings + the transcript. The original file is referenced by path + SHA256 hash, not inlined.
Each stream updates independently. New visual classifier? Re-run frame genes without touching audio. New speech model? Re-run Whisper without touching frames.
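The "what did the presenter say while the diagram was on screen?" query above is an interval join on source_id + video_offset_s. A minimal sketch under stated assumptions: each gene also carries a duration_s so that overlap is well defined (the note only names video_offset_s), and the StreamGene class is a hypothetical stand-in for the real Gene type.

```python
from dataclasses import dataclass

@dataclass
class StreamGene:
    """Hypothetical stand-in for a child gene extracted from one video stream."""
    kind: str              # "frame", "transcript", "audio", ...
    source_id: str         # shared by all genes from the same video
    video_offset_s: float  # where the gene starts in the video
    duration_s: float      # assumed field: how long the gene covers
    text: str

def overlaps(a, b):
    """True when two genes come from the same video and their time spans intersect."""
    return (
        a.source_id == b.source_id
        and a.video_offset_s < b.video_offset_s + b.duration_s
        and b.video_offset_s < a.video_offset_s + a.duration_s
    )

def join_streams(left, right):
    """Pair every gene in left with every overlapping gene in right."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]

frames = [StreamGene("frame", "vid1", 120.0, 10.0, "architecture diagram")]
transcripts = [
    StreamGene("transcript", "vid1", 118.0, 6.0, "so this diagram shows the pipeline"),
    StreamGene("transcript", "vid1", 140.0, 4.0, "moving on to the demo"),
    StreamGene("transcript", "vid2", 121.0, 5.0, "unrelated video"),
]

pairs = join_streams(frames, transcripts)
for frame, line in pairs:
    print(f"{frame.text!r} on screen while: {line.text!r}")
```

The nested loop is quadratic; for long videos a sort-and-sweep over offsets would scale better, but the overlap test itself stays the same.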
Part 8 — Output channels (what the agent produces with these memories)
Once audio/video genes exist, the agent's output channels grow in structured ways:
Text output enriched with non-verbal context — "You mentioned Shamir on 2026-04-10 at 47:03 with a long pause before committing — want to revisit?"
Timestamped recall with source links — pointers to specific moments, not paraphrases
Structured reports derived from audio — CSVs of speech events, JSONs of classified acoustic events, timelines of anomalies, chord charts from musical pieces
Cross-modal summaries — "This tutorial has 47 min audio, 31 speech segments, 12 onscreen code changes. Presenter spent 38% on topic X. Three moments of expressed uncertainty (pause + hedged phrasing)."
Action triggers — watch for acoustic patterns matching stored templates; fire skills when CLAP cosine > 0.7 against an alert template
Audio replay/synthesis (future, separate research arc) — with TTS bolted on, read-aloud summaries with prosody become possible. Speech-to-speech stays out of scope (per Specialist B, no open S2S stack verifiably bypasses text internally as of 2026-04).
Part 9 — Impact on existing helix-context features
Audio ingestion enriches several non-audio features without requiring changes to them:
Replication (learn-from-conversation loop) — a Helix server seeing spoken exchanges replicates them into the genome AS audio genes with transcripts. Extends the existing replication buffer in context_manager.py; no new subsystem.
ContextHealth / ellipticity — the existing delta-epsilon signal gains a new dimension: audio gene interpretation_confidence contributes to window health. A window with low-confidence audio returns sparse with reason "audio interpretation uncertain."
Horizontal Gene Transfer (.helix format) — audio genes serialize their feature JSON + CLAP embedding + transcript without requiring raw audio transfer. Two Helix instances can share recorded knowledge with no raw file transfer. Combines naturally with the Shamir Secret Sharing design note — audio genes could be SSS-split the same way text genes could.
Decoder prompts — existing DECODER_* prompts assume text-only. Adding audio requires one new prompt variant that teaches the big LLM how to read audio feature JSON. One new constant + one conditional.
MCP tool exposure — once audio genes exist, helix-context's MCP server exposes ingest_audio_file, ingest_video_file, query_audio_memory, describe_sound_at, extract_speech_segments, detect_acoustic_anomalies. Other MCP-speaking agents (BigEd, Claude Code sessions) gain audio memory as a first-class capability.
Benchmark infrastructure — an audio_bench_needle.py harvests audio-specific needles (find recordings where X was said, identify machine sounds matching templates, classify clips). Plugs into the existing compare_ab.py + benchmark state monitor.
Part 10 — The "learning from watching" loop, concretely
The user's original framing — "learning ability from video/sound files via direct audio signal to context, bypass human txt on storage side" — becomes a concrete 6-step flow:
1. Agent is given a URL or file path to a video (tutorial, meeting, podcast, lecture, demo)
2. ffmpeg demuxes → audio + frames + captions (if present)
3. Audio pipeline produces one or more AudioGene records (CLAP + librosa features + Whisper transcript + event timeline + zero-shot classification)
4. Frame pipeline (future work) produces FrameGene records
5. All genes share source_id + video_offset_s so they can be joined at query time
6. Future queries reference the memory without re-watching: "in that tutorial, the presenter said X while demonstrating Y" is answered by joining audio+frame genes in a single expression window
The "bypass text on storage side" question, answered concretely:
The audio portion of the memory is genuinely text-free at the storage layer (CLAP 512D + feature JSON + event timeline)
The transcript is still text, because the big LLM needs some textual surface to reason over at expression time
But the transcript is a projection of the audio, not the source of truth. If we lost the transcript we could regenerate it from the audio. If we lost the audio we'd still have the CLAP embedding + feature JSON (enough for retrieval + structural reasoning, just not re-transcription).
Text is a view; audio is the source. This is the inverse of today's text-only helix-context.
Part 11 — Open questions (not answered by this note)
Things the user would need to decide before a real spec:
Granularity — one AudioGene per file, or per 30s chunk, or per VAD-detected utterance? Lead researcher recommended chunk-level with is_fragment=True. Matters more for hour-long podcasts than short clips.
Raw file backing — do audio genes include a file path, SHA256 hash, external URL, or inlined base64 chunk? Probably path + hash (same as current source_id pattern).
Retention / chromatin for audio — do audio genes age on a different schedule than text? A year-old meeting recording may be HETEROCHROMATIN on a different cadence than a year-old text note.
Cross-modal query-time budget — if an expression includes 5 text + 2 audio genes, how is the token budget split? Audio feature JSON is denser than plain text; audio genes might need their own budget tier (extending the existing TIGHT/FOCUSED/BROAD pattern).
Decoder prompt format — inject feature JSON verbatim, a prose summary, both, or selected fields (transcript + top classes + key events only)? Same split as the existing DECODER_MOE vs DECODER_CONDENSED decision for text.
A/B env override name — HELIX_DISABLE_AUDIO=1 mirrors HELIX_DISABLE_HEADROOM=1 — confirm the pattern.
MP3 input policy — require external ffmpeg for MP3 decode (Path A), or reject MP3 with a conversion hint (Path B)? Path A is more user-friendly, Path B is zero-external-dep.
What this note is NOT
Not a roadmap item. Not scheduled, not estimated, not committed.
Not a sprint-ready spec. The open questions above must be answered before implementation work can start.
Not a replacement for the upstream research. When audio un-defers, re-survey the landscape — 2026 is moving fast on audio LLMs and the pipeline may have better options by then.
Not a claim that text genes are wrong. Text remains the primary modality for helix-context. Audio is additive, not a replacement.
What this note IS
A durable design capture so future raude/laude/human sessions don't re-derive the same thinking
A bill of materials — exact tools, exact licenses, exact fork strategy
A use case list — concrete answers to "what would audio genes enable?"
An architectural north star — the 6-step learn-from-watching flow, the parent-with-streams video model, the CLAP-triple-duty pattern
-- raude, Claude Code Opus 4.6 (1M context)
Session: 2026-04-10 / v0.3.0b5 ship day / paired with laude on the benchmark track
Specialists consulted: general-purpose research agents for audio embedding models (Specialist A) and native audio LLMs / S2S stacks (Specialist B), with lead synthesis by a third agent