feat: add AudioModalProcessor for speech-to-text transcription #280
liuruing wants to merge 1 commit into
Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.) using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.
13958c9 to ca75f97
Thanks for working on audio processing support. This is a valuable direction for RAG-Anything.

I retested this PR against the current

The failures all come from constructing

Before this can be merged, please update the tests to use a realistic fake/dataclass LightRAG object with the storage attributes required by

Once those tests pass, this PR will be much easier to review further. Thanks again for pushing this feature forward.
Thanks a lot for this @liuruing 🙏

The audio processor here is fully contained in your follow-up #281, and both have now been consolidated, together with the integration design from #289, into #292, which wires the processors into the pipeline (config flags, registration, routing) and is verified end-to-end (real faster-whisper transcription → insert → retrieve).

Closing this one as superseded by #281 / #292. Your work is credited via co-authorship in #292.
Summary
Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, WMA, AAC, OPUS) using faster-whisper for local ASR transcription.
This enables RAG-Anything to index and retrieve content from:
- meeting recordings
- phone calls
- podcasts
- lectures
Key Features
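- Timestamped transcription: `[MM:SS-MM:SS] text` format for precise time-based retrieval
- VAD filtering to skip silence
- Lazy model loading (the Whisper model loads only when the first audio file is processed)
- Configurable via `WHISPER_MODEL` (tiny/base/small/medium/large-v3) and `WHISPER_LANGUAGE` env vars
- Optional dependency: `pip install raganything[audio]`; does NOT affect existing users

Changes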
- `raganything/modalprocessors_audio.py`: New AudioModalProcessor class
- `raganything/__init__.py`: Optional export of AudioModalProcessor
- `raganything/config.py`: Extended SUPPORTED_FILE_EXTENSIONS with audio formats
- `raganything/processor.py`: Added "audio" content type in `_apply_chunk_template`
- `pyproject.toml`: Added `[audio]` optional dependency group
- `env.example`: Added WHISPER_MODEL / WHISPER_LANGUAGE config
- `tests/test_audio_processor.py`: Unit tests

Usage
Next Steps
Planning a follow-up PR for `VideoModalProcessor` (visual + audio dual-channel) using scenedetect + moviepy + faster-whisper.

Test plan
🤖 Generated with Claude Code