A Python library to synchronize Quran Ayat with audio recitations
Originally developed by Wider Community
This project automatically synchronizes Quranic verses (ayahs) with recitation audio by generating accurate timestamps for the start and end of each ayah.
- Roadmap
- How It Works
- Algorithm Overview
- Folder Structure
- Documentation
- Demo
- Contributing
- Debugging Video
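Munajjam uses a two-stage pipeline to synchronize Quranic audio with verse timestamps: it first transcribes and segments the recitation, then aligns the transcribed segments with the canonical Quranic text: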
- Uses Tarteel AI's Whisper model (specialized for Quranic Arabic)
- Detects silence periods to segment the audio intelligently
- Identifies special segments (Isti'aza and Basmala) with pattern matching
- Supports both standard transformers and faster-whisper backends
- Optimized for Apple Silicon (MPS) GPU acceleration
- Matches transcribed segments with canonical Quranic text
- Implements smart merging when multiple segments form one ayah
- Uses buffer extension to prevent word cutoffs at boundaries
- Applies silence gap detection to identify ayah boundaries
- Handles overlapping text removal for clean alignment
The buffer system extends ayah timestamps into adjacent silence periods to capture complete recitations without cutting off words.
How it works:
- Before ayah start: Extends backward up to 0.3s into preceding silence
- After ayah end: Extends forward up to 0.3s into following silence
- Overlap prevention: Ensures no overlap with adjacent ayahs
- Adaptive: Uses actual silence data, not fixed offsets
Benefits:
- Eliminates word cutoffs at ayah boundaries
- Preserves natural pause patterns in recitation
- Maintains clean separation between ayahs
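To make the rules above concrete, here is a minimal sketch of one possible reading of the buffer logic. The function name `extend_with_buffer`, its parameters, and the rule that a silence only counts when it lies within 0.3s of the boundary are assumptions made for illustration, not Munajjam's actual implementation.

```python
from typing import List, Optional, Tuple

BUFFER_DURATION = 0.3  # seconds, the buffer duration stated in this README


def extend_with_buffer(
    start: float,
    end: float,
    silences: List[Tuple[float, float]],
    prev_end: Optional[float] = None,
    next_start: Optional[float] = None,
    buffer: float = BUFFER_DURATION,
) -> Tuple[float, float]:
    """Extend an ayah's (start, end) into nearby silences (hypothetical helper)."""
    new_start, new_end = start, end
    for s_start, s_end in silences:
        # Silence just before the ayah: move the start back, but never
        # further than `buffer` seconds or past the start of the silence.
        if s_end <= start and start - s_end < buffer:
            new_start = min(new_start, max(start - buffer, s_start))
        # Silence just after the ayah: move the end forward the same way.
        if s_start >= end and s_start - end < buffer:
            new_end = max(new_end, min(end + buffer, s_end))
    # Overlap prevention: never extend into the neighbouring ayahs.
    if prev_end is not None:
        new_start = min(start, max(new_start, prev_end))
    if next_start is not None:
        new_end = max(end, min(new_end, next_start))
    return round(new_start, 3), round(new_end, 3)


print(extend_with_buffer(20.0, 24.0, [(19.0, 19.9), (24.1, 25.0)]))
# (19.7, 24.3)
print(extend_with_buffer(20.0, 24.0, [(19.0, 19.9), (24.1, 25.0)], next_start=24.2))
# (19.7, 24.2)
```

Because the extension is clamped to the silence itself and to the neighbouring ayahs, a short silence yields a short extension, which is what "adaptive" means in this sketch.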
# Example: An ayah detected at 10.0s - 15.0s with silences at:
# - [8.5s - 9.8s] (before)
# - [15.2s - 16.0s] (after)
#
# Applied buffer extends to: 9.7s - 15.3s
# (0.3s backward into first silence, 0.3s forward into second)

Identifies ayah boundaries by detecting significant silence gaps between segments, combined with textual analysis.
Algorithm:
- Acoustic check: Look for silence gaps >= 0.18s between segments
- Textual check: Verify next segment starts the next ayah (similarity > 0.6)
- Boundary confirmation: Only treat as ayah boundary if both conditions met
Why it matters:
- Handles cases where reciter pauses mid-ayah (doesn't split incorrectly)
- Detects merged ayahs that were transcribed as one segment
- Improves alignment accuracy for complex recitation patterns
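The two-condition check described above can be sketched as follows. The README does not say which similarity metric is used, so this example assumes `difflib.SequenceMatcher` from the standard library; `is_ayah_boundary` and its parameters are hypothetical names.

```python
from difflib import SequenceMatcher

MIN_SILENCE_GAP = 0.18      # seconds, from this README
SIMILARITY_THRESHOLD = 0.6  # from this README


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def is_ayah_boundary(
    prev_end: float,
    next_start: float,
    next_segment_text: str,
    next_ayah_text: str,
) -> bool:
    """Return True only if both the acoustic and the textual check pass."""
    # Acoustic check: the pause between segments must be long enough.
    if next_start - prev_end < MIN_SILENCE_GAP:
        return False
    # Textual check: compare the segment with the opening words of the
    # next ayah, using as many words as the segment contains.
    n_words = len(next_segment_text.split())
    opening = " ".join(next_ayah_text.split()[:n_words])
    return similarity(next_segment_text, opening) > SIMILARITY_THRESHOLD


# A 0.25s pause followed by the opening words of the next ayah
print(is_ayah_boundary(12.0, 12.25, "الحمد لله رب", "الحمد لله رب العالمين"))
# True
```

Requiring both conditions is what keeps a mid-ayah pause (long gap, but the text continues the current ayah) from being split.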
Properly tracks Isti'aza and Basmala segments separately from ayahs.
Features:
- Assigns special `id = 0` and `ayah_index = -1` to these segments
- Pattern-based detection even when metadata is missing
- Excluded from ayah counting and alignment logic
- Preserved in output with proper `type` field
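As a rough illustration of pattern-based detection, the sketch below normalizes the transcribed text (removing diacritics and unifying alef forms) and checks whether it opens with one of the two phrases. The helper names, the `type` values (`"istiaza"`, `"basmala"`, `"ayah"`), and the normalization rules are assumptions; only `id = 0` and `ayah_index = -1` come from this README.

```python
import re

# Arabic diacritics, superscript alef, and tatweel
_DIACRITICS = re.compile(r"[\u064B-\u065F\u0670\u0640]")
# Map alef with hamza above/below, madda, and wasla to a bare alef
_ALEF_VARIANTS = str.maketrans("\u0623\u0625\u0622\u0671", "\u0627" * 4)


def normalize(text: str) -> str:
    """Strip diacritics, unify alef forms, collapse whitespace."""
    text = _DIACRITICS.sub("", text).translate(_ALEF_VARIANTS)
    return " ".join(text.split())


ISTIAZA = normalize("أعوذ بالله من الشيطان الرجيم")
BASMALA = normalize("بسم الله الرحمن الرحيم")


def classify(text: str) -> str:
    """Return 'istiaza', 'basmala', or 'ayah' (assumed type labels)."""
    norm = normalize(text)
    if norm.startswith(ISTIAZA):
        return "istiaza"
    if norm.startswith(BASMALA):
        return "basmala"
    return "ayah"


def tag_segment(segment: dict) -> dict:
    """Attach the special id/ayah_index to Isti'aza and Basmala segments."""
    kind = classify(segment["text"])
    if kind == "ayah":
        return {**segment, "type": "ayah"}
    return {**segment, "id": 0, "ayah_index": -1, "type": kind}


print(tag_segment({"text": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ", "start": 0.0, "end": 4.2}))
```

Downstream alignment code can then skip any segment whose `ayah_index` is `-1` while still writing it to the output.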
Uses multiple similarity checks for robust alignment:
1. Last words check (primary): Compares last N words of segment with expected ayah
   - Adaptive N: Uses 3 words for long ayahs, 2 for medium, 1 for short
   - Threshold: 0.6 similarity score
2. Full text similarity (secondary): Compares entire segment with canonical text
   - Guards against premature termination
   - Coverage ratio check ensures >= 70% of ayah is captured
3. Required tokens guard: Prevents early cutoff for specific ayahs
   - Example: Ayah 2 requires both "ارجع" and "فطور" before finalizing
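The checks above could be combined roughly as in the sketch below. Several details are assumptions made for illustration: the word-count cut-offs for "long", "medium", and "short" ayahs, the use of `difflib.SequenceMatcher` as the similarity metric, measuring coverage as a word-count ratio, and requiring all checks to pass together.

```python
from difflib import SequenceMatcher
from typing import Sequence

SIMILARITY_THRESHOLD = 0.6  # from this README
COVERAGE_REQUIREMENT = 0.7  # from this README


def last_n_for(ayah_word_count: int) -> int:
    """Adaptive N; the cut-offs 8 and 4 are hypothetical."""
    if ayah_word_count >= 8:
        return 3
    if ayah_word_count >= 4:
        return 2
    return 1


def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def ayah_is_complete(
    segment_text: str,
    ayah_text: str,
    required_tokens: Sequence[str] = (),
) -> bool:
    """Decide whether the accumulated segment text covers the whole ayah."""
    seg_words, ayah_words = segment_text.split(), ayah_text.split()
    if not seg_words or not ayah_words:
        return False
    # 1. Last words check (primary)
    n = last_n_for(len(ayah_words))
    last_ok = ratio(" ".join(seg_words[-n:]), " ".join(ayah_words[-n:])) >= SIMILARITY_THRESHOLD
    # 2. Full text similarity plus coverage ratio (secondary)
    full_ok = ratio(segment_text, ayah_text) >= SIMILARITY_THRESHOLD
    coverage_ok = len(seg_words) / len(ayah_words) >= COVERAGE_REQUIREMENT
    # 3. Required tokens guard
    tokens_ok = all(token in segment_text for token in required_tokens)
    return last_ok and full_ok and coverage_ok and tokens_ok


ayah = "الحمد لله رب العالمين"
print(ayah_is_complete("الحمد لله رب العالمين", ayah))  # True
print(ayah_is_complete("الحمد لله", ayah))              # False: only half the ayah is covered
```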
Intelligently merges segments while removing duplicate words:
Algorithm:
- Count word frequencies in first segment
- For each word in second segment:
  - If word exists in first segment, decrement counter and skip
  - Otherwise, append to merged text
- Return cleaned merged text
Prevents: "...الرحيم بسم الله..." -> "...الرحيم..."
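The steps listed above translate almost directly into Python with `collections.Counter`. The function name `merge_without_overlap` is hypothetical, and splitting on whitespace is an assumption about how words are tokenized.

```python
from collections import Counter


def merge_without_overlap(first: str, second: str) -> str:
    """Merge two segment texts, skipping words already seen in the first."""
    merged = first.split()
    remaining = Counter(merged)  # word frequencies in the first segment
    for word in second.split():
        if remaining[word] > 0:
            # Word already present in the first segment: consume one
            # occurrence from the counter and skip it.
            remaining[word] -= 1
            continue
        merged.append(word)
    return " ".join(merged)


print(merge_without_overlap("a b c", "b c d"))
# a b c d
```

Note that this frequency-based approach skips a repeated word wherever it appears in the first segment, not only at the boundary, so it trades some precision for simplicity.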
Model Caching:
- Loads model once and caches it for entire session
- Avoids expensive model reloading between surahs
- Supports both transformers and faster-whisper
Device Optimization:
- Auto-detects best device: CUDA > MPS > CPU
- Apple Silicon: Uses MPS (Metal Performance Shaders) with float32
- CUDA: Uses float16 for faster inference
- Model compilation with `torch.compile()` (when supported)
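The device priority and dtype choices above fit in a small pure function. `pick_device` is a hypothetical helper, and float32 on CPU is an assumption since this README only states the dtypes for CUDA and MPS. With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Tuple


def pick_device(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype name) following the order CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda", "float16"  # half precision for faster inference
    if mps_available:
        return "mps", "float32"   # Apple Silicon path uses float32
    return "cpu", "float32"       # assumed default for CPU


print(pick_device(cuda_available=False, mps_available=True))
# ('mps', 'float32')
```

Keeping the decision separate from the PyTorch calls makes the priority order easy to test on any machine.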
Inference Optimization:
- Greedy decoding (`num_beams=1`) instead of beam search
- Limited token generation (`max_new_tokens=128`)
- Explicit attention mask passing
Here is the high-level structure of the project directory:
Munajjam/
├───.gitignore
├───current_config.json
├───main.py
├───requirements.md
├───requirements.txt
├───README.md
├───PSEUDO CODE.md
├───ROADMAP.md
├───docs/
├───data/
├───scripts/
└───src/
- src/: Contains the main Python source code for processing.
- data/: Holds all data files, including raw audio, CSVs with timestamps, and JSON outputs.
- docs/: Contains detailed documentation and requirements.
- PSEUDO CODE.md: Contains the pseudocode and detailed documentation for the project.
- requirements.md: Lists the prerequisites and dependencies for the project.
- main.py: The main script to run the application.
The project's pseudocode can be found in the PSEUDO CODE.md file.
Munajjam Workflow of 1st Edition: View Munajjam V0.1 Workflow
Short PRD: View Full Requirements
For a more detailed explanation of the project, you can view the full documentation here: View Full Documentation
A video demonstration of the project is available at the link below.
We welcome contributions from the community!
- Check the Roadmap to see planned features
- Browse open issues for tasks
Look for issues labeled `good first issue` if you're new!
- Primary: Tarteel AI Whisper Base (Arabic Quran-specialized)
- Backend: Supports both Hugging Face Transformers and faster-whisper
- Device: CUDA, MPS (Apple Silicon), or CPU
- Silence Detection: -30dB threshold, 300ms minimum duration
- Sample Rate: 16kHz
- Format: WAV (mono recommended)
- Similarity Threshold: 0.6 (60%)
- Buffer Duration: 0.3 seconds
- Minimum Silence Gap: 0.18 seconds
- Coverage Requirement: 0.7 (70%)
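To show how the silence detection settings listed above (-30dB threshold, 300ms minimum duration, 16kHz) interact, here is a dependency-free sketch that scans fixed-size frames and reports silent stretches. It assumes samples are floats normalized to [-1, 1] and uses a 10ms frame size; both are illustrative assumptions, and the project may rely on an audio library instead.

```python
import math
from typing import List, Sequence, Tuple


def detect_silences(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    threshold_db: float = -30.0,
    min_silence_ms: int = 300,
    frame_ms: int = 10,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of sufficiently long silences."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    silences: List[Tuple[float, float]] = []
    run_start = None  # index of the first frame in the current silent run

    def close_run(end_frame: int) -> None:
        # Keep the run only if it lasts at least min_silence_ms.
        if (end_frame - run_start) * frame_ms >= min_silence_ms:
            silences.append((run_start * frame_ms / 1000, end_frame * frame_ms / 1000))

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            close_run(i)
            run_start = None
    if run_start is not None:
        close_run(n_frames)
    return silences


rate = 16_000
loud = [0.5] * rate  # one second of constant-level signal
audio = loud + [0.0] * (rate // 2) + loud + [0.0] * (rate // 5) + loud
print(detect_silences(audio))
# [(1.0, 1.5)]
```

In this usage example the 0.5s pause is reported, while the 0.2s pause is ignored because it is shorter than the 300ms minimum.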
- Tarteel AI for the Quran-specialized Whisper model
- The open source community
This project is developed and maintained by Wider Community.
All rights reserved to Wider Community.
Let's build something beautiful for the Ummah together!