Modular Multimodal Intelligence
Plug any Hugging Face LLM and vision encoder together via a learnable projector.
Supports zero-shot inference today, and adapter-based fine-tuning tomorrow.
- 🔌 Plug-and-play architecture for combining LLMs and vision encoders
- 🧠 Supports popular models such as Qwen, Mistral, LLaMA, CLIP, X-CLIP, and SAM
- 🧪 Zero-shot inference with learnable projector modules
- 🛠️ Adapter-based fine-tuning (coming soon)
- 📊 Easy benchmarking and visualization tools
pip install modu-muse

from modu_muse import Pipeline

pipe = Pipeline(
    llm_name="mistralai/Mistral-7B-Instruct-v0.2",
    vision_name="openai/clip-vit-base-patch16"
)
result = pipe.infer("path/to/image.jpg", "Describe the scene.")
print(result)

[Image/Video] → [Vision Encoder] → [Projector] → [LLM]
- Vision encoder extracts features
- Projector maps visual features to LLM-compatible embeddings
- LLM generates text conditioned on visual context
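
The three stages above can be illustrated with a small, dependency-free sketch. It is not ModuMuse's actual `projector.py`: the function names (`make_projector`, `project`, `build_llm_input`), the toy dimensions, and the choice to prepend visual tokens to the text embeddings are all assumptions made for illustration. A real projector would more likely be a trainable layer in a deep learning framework.

```python
import random


def make_projector(vision_dim, llm_dim, seed=0):
    """Create (weights, bias) for a linear map from vision_dim to llm_dim.

    Hypothetical helper: random initialisation stands in for learned weights.
    """
    rng = random.Random(seed)
    scale = 1.0 / vision_dim ** 0.5
    weights = [
        [rng.uniform(-scale, scale) for _ in range(vision_dim)]
        for _ in range(llm_dim)
    ]
    bias = [0.0] * llm_dim
    return weights, bias


def project(features, weights, bias):
    """Map each visual feature vector into the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in features
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings (assumed layout)."""
    return visual_tokens + text_embeddings


# Toy sizes; real encoders and LLMs use far larger dimensions.
VISION_DIM, LLM_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

rng = random.Random(1)
# Stand-ins for vision encoder output and LLM text embeddings.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

weights, bias = make_projector(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, weights, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)

print(len(llm_input), len(llm_input[0]))  # 7 16
```

The point of the sketch is the shape change: each of the 4 patch vectors goes from 8 to 16 dimensions, so the LLM sees 7 embeddings of equal width, 4 visual and 3 textual.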
Train your own projector using paired image-text datasets:
python train_adapter.py \
  --model llm=Qwen1.5 vision=xclip \
  --dataset_path ./data/relevance_dataset \
  --output_dir ./checkpoints

modu_muse/
├── pipeline.py          # Main multimodal pipeline
├── projector.py         # Vision-to-LLM projector
├── models/
│   ├── llm.py           # LLM loader
│   └── vision.py        # Vision encoder loader
├── examples/
│   └── quick_start.py   # Demo script
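
The training command shown earlier passes model choices as `key=value` pairs after `--model`. How `train_adapter.py` actually parses them is not documented here. The sketch below is one plausible approach using only the standard library; `parse_model_spec`, `build_parser`, and the `./checkpoints` default are hypothetical and not taken from the project.

```python
import argparse


def parse_model_spec(pairs):
    """Turn ['llm=Qwen1.5', 'vision=xclip'] into {'llm': 'Qwen1.5', 'vision': 'xclip'}."""
    spec = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        spec[key] = value
    return spec


def build_parser():
    """Hypothetical CLI mirroring the flags used in the training command."""
    parser = argparse.ArgumentParser(description="Illustrative train_adapter.py CLI")
    # nargs="+" collects every key=value token up to the next flag.
    parser.add_argument("--model", nargs="+", required=True, metavar="KEY=VALUE")
    parser.add_argument("--dataset_path", required=True)
    parser.add_argument("--output_dir", default="./checkpoints")
    return parser


# Parse the same arguments as the command line in the training example.
args = build_parser().parse_args([
    "--model", "llm=Qwen1.5", "vision=xclip",
    "--dataset_path", "./data/relevance_dataset",
    "--output_dir", "./checkpoints",
])
model_spec = parse_model_spec(args.model)
print(model_spec)  # {'llm': 'Qwen1.5', 'vision': 'xclip'}
```

Rejecting tokens without an `=` early gives a clear error for typos such as `--model llm Qwen1.5`, instead of a failure later during model loading.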
We welcome contributions! Whether it's new model support, training scripts, or documentation improvements, open a PR or start a discussion.
This project is licensed under the MIT License.
© 2025 Wissem Elkarous
ModuMuse: Where vision meets language.