
feat: add AudioModalProcessor for speech-to-text transcription #280

Closed
liuruing wants to merge 1 commit into HKUDS:main from liuruing:feature/audio-modal-processor


@liuruing


Summary

Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, WMA, AAC, OPUS) using faster-whisper for local ASR transcription.

This enables RAG-Anything to index and retrieve content from:

  • 🎙️ Meeting recordings
  • 📞 Phone call recordings
  • 🎧 Podcasts and interviews
  • 🎓 Lectures and presentations
  • 🗣️ Voice memos

Key Features

  • Timestamped transcription: each output line has the form [MM:SS-MM:SS] text, for precise time-based retrieval
  • VAD filtering: silence is skipped automatically using Voice Activity Detection
  • Lazy model loading: the Whisper model loads only when the first audio file is processed (no startup cost)
  • Configurable: WHISPER_MODEL (tiny/base/small/medium/large-v3) and WHISPER_LANGUAGE env vars
  • Optional dependency: pip install raganything[audio], which does NOT affect existing users
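
The lazy-loading and env-var behaviour in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: `LazyModel`, `load_whisper_from_env`, and the `"base"` fallback are illustrative assumptions. Only the `WHISPER_MODEL` variable (from the description) and the `WhisperModel` class (from faster-whisper) are taken as given.

```python
import os
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Defer an expensive model load until the first call to get()."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Double-checked locking: the factory runs at most once,
        # even if several threads request the model concurrently.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._factory()
        return self._model


def load_whisper_from_env() -> Any:
    # Imported inside the function so that importing this module
    # never requires the optional [audio] extra to be installed.
    from faster_whisper import WhisperModel

    # "base" is an assumed fallback; the PR does not state its default.
    return WhisperModel(os.environ.get("WHISPER_MODEL", "base"))


# Creating the wrapper is cheap; nothing is downloaded or loaded here.
whisper = LazyModel(load_whisper_from_env)
```

Under these assumptions, the first `whisper.get()` call pays the load cost and later calls reuse the same instance. A `WHISPER_LANGUAGE` value would plausibly be forwarded as the `language=` argument of faster-whisper's `transcribe()`, alongside `vad_filter=True`.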

Changes

  • raganything/modalprocessors_audio.py — New AudioModalProcessor class
  • raganything/__init__.py — Optional export of AudioModalProcessor
  • raganything/config.py — Extended SUPPORTED_FILE_EXTENSIONS with audio formats
  • raganything/processor.py — Added "audio" content type in _apply_chunk_template
  • pyproject.toml — Added [audio] optional dependency group
  • env.example — Added WHISPER_MODEL / WHISPER_LANGUAGE config
  • tests/test_audio_processor.py — Unit tests
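
One common way to make an export optional, as the `__init__.py` change above describes, is to check whether the backend package can be imported before advertising the class. The sketch below is an assumption about the general pattern, not a copy of this PR's code; `HAS_AUDIO` and `exported_names` are hypothetical names.

```python
import importlib.util
from typing import Iterable, List

# True only when the backend installed by the [audio] extra is importable.
HAS_AUDIO = importlib.util.find_spec("faster_whisper") is not None


def exported_names(base: Iterable[str]) -> List[str]:
    """Return the public names, adding the audio processor only if usable."""
    names = list(base)
    if HAS_AUDIO:
        names.append("AudioModalProcessor")
    return names
```

With this shape, users who never install the extra see the same public API as before, which is the point of keeping the dependency optional.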

Usage

from raganything import AudioModalProcessor

processor = AudioModalProcessor(
    lightrag=rag_instance,
    modal_caption_func=caption_func,
    whisper_model="large-v3",  # or set WHISPER_MODEL env
)

result = await processor.process_multimodal_content(
    modal_content={"audio_path": "/path/to/meeting.mp3"},
    content_type="audio",
)

Next Steps

Planning a follow-up PR for VideoModalProcessor (visual + audio dual-channel) using scenedetect + moviepy + faster-whisper.

Test plan

  • Unit tests for timestamp formatting, segment conversion, file detection
  • Mock-based tests for generate_description_only
  • Integration test with real audio file (requires faster-whisper model download)
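
For reference, the timestamp formatting and segment conversion covered by the unit tests could be sketched as below. The function names are hypothetical and the PR's real helpers may differ; only the `[MM:SS-MM:SS] text` output shape comes from the description.

```python
def format_timestamp(seconds: float) -> str:
    """Render a non-negative offset in seconds as zero-padded MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcript segment as '[MM:SS-MM:SS] text'."""
    return f"[{format_timestamp(start)}-{format_timestamp(end)}] {text.strip()}"


def test_format_segment() -> None:
    assert format_timestamp(75.4) == "01:15"
    assert format_segment(0.0, 75.4, " Hello everyone. ") == "[00:00-01:15] Hello everyone."
```

Note that in this sketch the minutes field simply grows past 59 for recordings longer than an hour (3700 seconds renders as `61:40`), which stays consistent with an MM:SS-only format.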

🤖 Generated with Claude Code

Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.)
using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.
@LarFii

LarFii commented Jun 1, 2026

Collaborator

Thanks for working on audio processing support. This is a valuable direction for RAG-Anything.

I retested this PR against the current main after the recent test fixes. The branch merges cleanly and git diff --check passes, but the focused audio test suite is still failing:

PYTHONPATH=. python -m pytest -q tests/test_audio_processor.py
# 2 passed, 6 failed, 7 errors

The failures all come from constructing AudioModalProcessor with a MagicMock LightRAG object in the tests. AudioModalProcessor inherits from BaseModalProcessor, and BaseModalProcessor.__init__() expects the LightRAG object to provide the real storage/config shape, including a dataclass-compatible object for asdict(lightrag). With the current tests, initialization fails with:

TypeError: asdict() should be called on dataclass instances

Before this can be merged, please update the tests to use a realistic fake/dataclass LightRAG object with the storage attributes required by BaseModalProcessor, or adjust the implementation/tests so processor initialization is covered in a way that matches the existing modal processor contract.
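
The failure mode and one possible fix can be reproduced with the standard library alone. In this sketch `FakeLightRAG` and its fields are placeholders chosen for illustration; the real fake would need whatever attributes `BaseModalProcessor.__init__()` actually reads.

```python
from dataclasses import asdict, dataclass, field
from unittest.mock import MagicMock


@dataclass
class FakeLightRAG:
    # Placeholder fields (assumed names); extend to match the storage
    # attributes that BaseModalProcessor.__init__() accesses.
    working_dir: str = "./rag_storage"
    text_chunks: dict = field(default_factory=dict)
    chunks_vdb: dict = field(default_factory=dict)


def can_asdict(obj: object) -> bool:
    """Report whether dataclasses.asdict() accepts the given object."""
    try:
        asdict(obj)
    except TypeError:
        # Raised with "asdict() should be called on dataclass instances".
        return False
    return True


# A MagicMock is not a dataclass instance, so asdict() rejects it;
# a small dataclass fake passes the same check.
mock_ok = can_asdict(MagicMock())
fake_ok = can_asdict(FakeLightRAG())
```

A pytest fixture returning such a fake would let the existing tests construct the processor without a real LightRAG instance.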

Once those tests pass, this PR will be much easier to review further. Thanks again for pushing this feature forward.

@LarFii

LarFii commented Jun 1, 2026

Collaborator

Thanks a lot for this @liuruing 🙏 The audio processor here is fully contained in your follow-up #281, and both have now been consolidated — together with the integration design from #289 — into #292, which wires the processors into the pipeline (config flags, registration, routing) and is verified end-to-end (real faster-whisper transcription → insert → retrieve).

Closing this one as superseded by #281 / #292. Your work is credited via co-authorship in #292.

@LarFii LarFii closed this Jun 1, 2026
