feat: add AudioModalProcessor for speech-to-text transcription by liuruing · Pull Request #280 · HKUDS/RAG-Anything

liuruing · 2026-05-22T08:42:35Z

Summary

Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, WMA, AAC, OPUS) using faster-whisper for local ASR transcription.

This enables RAG-Anything to index and retrieve content from:

🎙️ Meeting recordings
📞 Phone call recordings
🎧 Podcasts and interviews
🎓 Lectures and presentations
🗣️ Voice memos

Key Features

Timestamped transcription: Output includes [MM:SS-MM:SS] text format for precise time-based retrieval
VAD filtering: Automatically skips silence using Voice Activity Detection
Lazy model loading: The Whisper model loads only when the first audio file is processed (no startup cost)
Configurable: WHISPER_MODEL (tiny/base/small/medium/large-v3) and WHISPER_LANGUAGE env vars
Optional dependency: pip install raganything[audio] — does NOT affect existing users
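To make the timestamped output concrete, here is a minimal sketch of how segments could be rendered in the `[MM:SS-MM:SS] text` format. The `Segment` dataclass and both helper names are hypothetical stand-ins for illustration and are not taken from the PR's actual code; the only assumption about the ASR output is that each segment carries a start time, an end time (both in seconds), and text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for an ASR segment (start/end in seconds)."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS; minutes keep growing past 59 for long audio."""
    total = int(seconds)  # truncate fractional seconds
    return f"{total // 60:02d}:{total % 60:02d}"


def format_segments(segments) -> str:
    """Join segments into '[MM:SS-MM:SS] text' lines, one per segment."""
    lines = []
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # skip segments that contain no text
        span = f"{format_timestamp(seg.start)}-{format_timestamp(seg.end)}"
        lines.append(f"[{span}] {text}")
    return "\n".join(lines)


transcript = format_segments([
    Segment(0.0, 4.2, " Welcome everyone."),
    Segment(65.9, 72.0, " First agenda item."),
])
print(transcript)
# [00:00-00:04] Welcome everyone.
# [01:05-01:12] First agenda item.
```

Keeping one segment per line means a retrieved chunk can be mapped back to a position in the recording by parsing the leading bracket.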

Changes

raganything/modalprocessors_audio.py — New AudioModalProcessor class
raganything/__init__.py — Optional export of AudioModalProcessor
raganything/config.py — Extended SUPPORTED_FILE_EXTENSIONS with audio formats
raganything/processor.py — Added "audio" content type in _apply_chunk_template
pyproject.toml — Added [audio] optional dependency group
env.example — Added WHISPER_MODEL / WHISPER_LANGUAGE config
tests/test_audio_processor.py — Unit tests
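The lazy loading, env var configuration, and optional dependency listed above could fit together roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `LazyWhisper`, its `loader` parameter, and the `"base"` default are hypothetical. The `faster_whisper.WhisperModel` constructor and the `vad_filter` argument of `transcribe` are part of the faster-whisper API, but the import only happens on first use, so the class can be constructed without the `[audio]` extra installed.

```python
import os


class LazyWhisper:
    """Sketch of a lazily constructed ASR model (hypothetical helper class)."""

    def __init__(self, model_name=None, language=None, loader=None):
        # Explicit arguments win; otherwise fall back to the env vars from the PR.
        # The "base" default is an assumption made for this sketch.
        self.model_name = model_name or os.getenv("WHISPER_MODEL", "base")
        self.language = language or os.getenv("WHISPER_LANGUAGE") or None
        self._loader = loader or self._load_faster_whisper
        self._model = None  # nothing is loaded at construction time

    @staticmethod
    def _load_faster_whisper(model_name):
        # Imported here so users without the [audio] extra are unaffected.
        try:
            from faster_whisper import WhisperModel
        except ImportError as exc:
            raise ImportError(
                "Audio support needs faster-whisper: pip install raganything[audio]"
            ) from exc
        return WhisperModel(model_name)

    @property
    def model(self):
        # Build the model on first access and reuse it afterwards.
        if self._model is None:
            self._model = self._loader(self.model_name)
        return self._model

    def transcribe(self, audio_path):
        # vad_filter=True asks faster-whisper to skip silent stretches.
        segments, _info = self.model.transcribe(
            audio_path, language=self.language, vad_filter=True
        )
        return list(segments)
```

Injecting `loader` keeps unit tests independent of a model download: a test can pass a fake loader and assert that it is called once, and only after the first `transcribe` call.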

Usage

from raganything import AudioModalProcessor

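# rag_instance (an initialized LightRAG) and caption_func are assumed to be defined elsewhere.
processor = AudioModalProcessor(
    lightrag=rag_instance,
    modal_caption_func=caption_func,
    whisper_model="large-v3",  # or set the WHISPER_MODEL env var
)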

result = await processor.process_multimodal_content(
    modal_content={"audio_path": "/path/to/meeting.mp3"},
    content_type="audio",
)
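Since the usage snippet awaits `process_multimodal_content`, a plain script needs an event loop to drive it. The sketch below shows one way to do that with `asyncio.run`. `StubAudioProcessor` and the shape of its return value are hypothetical placeholders so the example runs without RAG-Anything installed; in real use the `AudioModalProcessor` instance from the snippet above would be passed to `main` instead.

```python
import asyncio


class StubAudioProcessor:
    """Hypothetical stand-in; the return shape here is invented for the sketch."""

    async def process_multimodal_content(self, modal_content, content_type):
        await asyncio.sleep(0)  # placeholder for transcription and indexing
        return {
            "content_type": content_type,
            "audio_path": modal_content["audio_path"],
        }


async def main(processor):
    # Same call as in the usage snippet, wrapped in a coroutine.
    return await processor.process_multimodal_content(
        modal_content={"audio_path": "/path/to/meeting.mp3"},
        content_type="audio",
    )


result = asyncio.run(main(StubAudioProcessor()))
print(result)
```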

Next Steps

Planning a follow-up PR for VideoModalProcessor (visual + audio dual-channel) using scenedetect + moviepy + faster-whisper.

Test plan

Unit tests for timestamp formatting, segment conversion, file detection
Mock-based tests for generate_description_only
Integration test with real audio file (requires faster-whisper model download)

🤖 Generated with Claude Code

Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.) using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.

LarFii · 2026-06-01T06:42:54Z

Thanks for working on audio processing support. This is a valuable direction for RAG-Anything.

I retested this PR against the current main after the recent test fixes. The branch merges cleanly and git diff --check passes, but the focused audio test suite is still failing:

PYTHONPATH=. python -m pytest -q tests/test_audio_processor.py
# 2 passed, 6 failed, 7 errors

The failures all come from constructing AudioModalProcessor with a MagicMock LightRAG object in the tests. AudioModalProcessor inherits from BaseModalProcessor, and BaseModalProcessor.__init__() expects the LightRAG object to provide the real storage/config shape, including a dataclass-compatible object for asdict(lightrag). With the current tests, initialization fails with:

TypeError: asdict() should be called on dataclass instances

Before this can be merged, please update the tests to use a realistic fake/dataclass LightRAG object with the storage attributes required by BaseModalProcessor, or adjust the implementation/tests so processor initialization is covered in a way that matches the existing modal processor contract.
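The failure mode and the suggested fix can be reproduced in isolation. In the sketch below, `FakeLightRAG` is a hypothetical test double: its field names are assumptions chosen for illustration and are not the verified list of attributes that `BaseModalProcessor.__init__()` reads, so a real test would need to mirror whatever the base class actually accesses. The point is only that `dataclasses.asdict` rejects a `MagicMock` but accepts a dataclass instance.

```python
from dataclasses import asdict, dataclass, field
from unittest.mock import MagicMock


@dataclass
class FakeLightRAG:
    """Hypothetical test double; field names are assumptions, not the real contract."""
    working_dir: str = "./rag_storage"
    text_chunks: dict = field(default_factory=dict)
    chunks_vdb: dict = field(default_factory=dict)
    entities_vdb: dict = field(default_factory=dict)


def try_asdict(obj):
    """Return (ok, payload): the dict on success, the error message on failure."""
    try:
        return True, asdict(obj)
    except TypeError as exc:
        return False, str(exc)


# A MagicMock is not a dataclass instance, so asdict() raises TypeError.
mock_ok, mock_payload = try_asdict(MagicMock())

# A dataclass-based fake converts cleanly into a plain dict.
fake_ok, fake_payload = try_asdict(FakeLightRAG())

print(mock_ok, mock_payload)
print(fake_ok, sorted(fake_payload))
```

A pytest fixture returning such a fake, extended with the storage attributes the base class requires, would let processor construction be exercised in the tests instead of failing before any assertion runs.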

Once those tests pass, this PR will be much easier to review further. Thanks again for pushing this feature forward.

LarFii · 2026-06-01T08:45:45Z

Thanks a lot for this @liuruing 🙏 The audio processor here is fully contained in your follow-up #281, and both have now been consolidated — together with the integration design from #289 — into #292, which wires the processors into the pipeline (config flags, registration, routing) and is verified end-to-end (real faster-whisper transcription → insert → retrieve).

Closing this one as superseded by #281 / #292. Your work is credited via co-authorship in #292.

liuruing force-pushed the feature/audio-modal-processor branch from 13958c9 to ca75f97 on May 22, 2026 08:53

liuruing mentioned this pull request May 22, 2026

feat: add VideoModalProcessor with visual + audio dual-channel analysis #281

Open

3 tasks

LarFii mentioned this pull request Jun 1, 2026

feat: audio & video modal processors (integrated, end-to-end tested) #292

Open

5 tasks

LarFii closed this Jun 1, 2026

liuruing wants to merge 1 commit into HKUDS:main from liuruing:feature/audio-modal-processor
