ELkarousWissem/ModuMuse

🧠 ModuMuse

Modular Multimodal Intelligence
Plug any Hugging Face LLM and vision encoder together via a learnable projector.
Zero-shot inference is supported today; adapter-based fine-tuning is planned.


🚀 Features

  • 🔌 Plug-and-play architecture for combining LLMs and vision encoders
  • 🧠 Supports popular models like Qwen, Mistral, LLaMA, CLIP, XCLIP, SAM
  • 🧪 Zero-shot inference with learnable projector modules
  • 🛠️ Adapter-based fine-tuning (coming soon)
  • 📊 Easy benchmarking and visualization tools

📦 Installation

pip install modu-muse

🧬 Quick Start

from modu_muse import Pipeline

pipe = Pipeline(
    llm_name="mistralai/Mistral-7B-Instruct-v0.2",
    vision_name="openai/clip-vit-base-patch16"
)

result = pipe.infer("path/to/image.jpg", "Describe the scene.")
print(result)

🧠 Architecture

[Image/Video] → [Vision Encoder] → [Projector] → [LLM]
  • Vision encoder extracts features
  • Projector maps visual features to LLM-compatible embeddings
  • LLM generates text conditioned on visual context
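The data flow above can be illustrated with a dependency-free sketch. It is not the code in `projector.py`: it assumes the projector is a single linear layer and that projected visual tokens are simply prepended to the text embeddings, and every name and dimension below (`make_linear`, `project`, `build_llm_input`, the toy sizes) is hypothetical. A real projector would more likely be a trainable module in a deep learning framework.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create (weight, bias) for a linear map; weight is [out_dim][in_dim]."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # small uniform init, an assumption for this sketch
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, weight, bias):
    """Map each visual token from the vision width to the LLM width."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in features
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes (assumptions); real encoders and LLMs use much larger widths.
VISION_DIM, LLM_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

rng = random.Random(1)
# Stand-ins for the vision encoder output and the LLM's text embeddings.
vision_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)]
                   for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)]
                   for _ in range(NUM_TEXT_TOKENS)]

weight, bias = make_linear(VISION_DIM, LLM_DIM)
visual_tokens = project(vision_features, weight, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)

# Sequence length is patches + text tokens; every token has the LLM width.
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is the shape contract: whatever the encoder's feature width, the projector's output width must match the LLM's embedding width so both kinds of token can share one input sequence.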

🛠️ Fine-Tuning (Coming Soon)

The planned training script will train your own projector on paired image-text datasets:

python train_adapter.py \
  --model llm=Qwen1.5 vision=xclip \
  --dataset_path ./data/relevance_dataset \
  --output_dir ./checkpoints
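Since `train_adapter.py` is not yet published, the following is only a sketch of how the `--model llm=... vision=...` syntax shown above could be parsed with the standard-library `argparse` module. The helper names (`parse_key_values`, `parse_args`), the required keys, and the default for `--output_dir` are assumptions, not the project's actual interface.

```python
import argparse


def parse_key_values(pairs):
    """Turn ["llm=Qwen1.5", "vision=xclip"] into a dict."""
    result = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key] = value
    return result


def parse_args(argv=None):
    """Parse the hypothetical training CLI; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(
        description="Hypothetical projector training CLI (sketch)")
    # nargs="+" collects every key=value token up to the next option.
    parser.add_argument("--model", nargs="+", required=True,
                        metavar="KEY=VALUE")
    parser.add_argument("--dataset_path", required=True)
    parser.add_argument("--output_dir", default="./checkpoints")
    args = parser.parse_args(argv)

    try:
        args.model = parse_key_values(args.model)
    except ValueError as exc:
        parser.error(str(exc))  # prints usage and exits with status 2

    missing = {"llm", "vision"} - args.model.keys()
    if missing:
        parser.error(f"--model is missing: {', '.join(sorted(missing))}")
    return args


# Same arguments as the command above, passed as a list for demonstration.
args = parse_args([
    "--model", "llm=Qwen1.5", "vision=xclip",
    "--dataset_path", "./data/relevance_dataset",
    "--output_dir", "./checkpoints",
])
print(args.model)
```

Keeping both model choices under one `--model` option mirrors the command shown above; separate `--llm` and `--vision` flags would be an equally reasonable design.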

📁 Project Structure

modu_muse/
├── pipeline.py          # Main multimodal pipeline
├── projector.py         # Vision-to-LLM projector
├── models/
│   ├── llm.py           # LLM loader
│   └── vision.py        # Vision encoder loader
└── examples/
    └── quick_start.py   # Demo script

🤝 Contributing

We welcome contributions! Whether it's new model support, training scripts, or documentation improvements, open a PR or start a discussion.


📜 License

This project is licensed under the MIT License.
© 2025 Wissem Elkarous


ModuMuse: Where vision meets language.

Modular + Muse, for inspiring multimodal intelligence.
