Modal is the recommended cloud GPU provider for the toolkit's AI tools. It offers $30/month free compute on the Starter plan, fast cold starts, and scale-to-zero billing.
Fastest path: Run /setup in Claude Code. It handles Modal installation, deployment, and .env configuration interactively. This doc is the reference for what /setup does under the hood, and for manual setup.
- Go to modal.com and sign up
- Choose the Starter plan — $30/month free compute, just requires a payment method
- Typical toolkit usage is $1-2/month, well within the free allowance
- All apps scale to zero — no charges when idle
pip install modal
python3 -m modal setup # Opens browser to authenticate, saves token to ~/.modal.toml
modal app list # Verify it works
Each AI tool has its own Modal app. Deploy only what you need, or deploy all of them. Idle apps cost nothing.
# Speech generation (most commonly used)
modal deploy docker/modal-qwen3-tts/app.py
# Image generation & editing
modal deploy docker/modal-flux2/app.py
modal deploy docker/modal-image-edit/app.py
modal deploy docker/modal-upscale/app.py
# Music generation
modal deploy docker/modal-music-gen/app.py
# Video processing
modal deploy docker/modal-sadtalker/app.py
modal deploy docker/modal-propainter/app.py
Each deploy prints an endpoint URL like:
https://yourname--video-toolkit-qwen3-tts-ttsengine-generate.modal.run
Save each URL to your .env file:
# Add to .env (replace with your actual URLs from deploy output)
MODAL_QWEN3_TTS_ENDPOINT_URL=https://yourname--video-toolkit-qwen3-tts-...modal.run
MODAL_FLUX2_ENDPOINT_URL=https://yourname--video-toolkit-flux2-...modal.run
MODAL_IMAGE_EDIT_ENDPOINT_URL=https://yourname--video-toolkit-image-edit-...modal.run
MODAL_UPSCALE_ENDPOINT_URL=https://yourname--video-toolkit-upscale-...modal.run
MODAL_MUSIC_GEN_ENDPOINT_URL=https://yourname--video-toolkit-music-gen-...modal.run
MODAL_SADTALKER_ENDPOINT_URL=https://yourname--video-toolkit-sadtalker-...modal.run
MODAL_DEWATERMARK_ENDPOINT_URL=https://yourname--video-toolkit-dewatermark-...modal.run
Tip: /setup automates this. It runs each deploy, parses the URL, and writes it to .env for you.
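The URL-capture step can be sketched in Python. This is a minimal illustration, not the code /setup actually runs: the helper names `extract_endpoint_url` and `env_line` are hypothetical, the sample text is made up rather than real CLI output, and the regular expression assumes the endpoint appears as a plain `https://` URL ending in `.modal.run`, as in the example above.

```python
import re
from typing import Optional

# Assumption: the deploy output contains the endpoint as a bare https URL
# whose host ends in .modal.run.
MODAL_URL_RE = re.compile(r"https://[A-Za-z0-9.-]+\.modal\.run")


def extract_endpoint_url(deploy_output: str) -> Optional[str]:
    """Return the first *.modal.run URL found in the text, or None."""
    match = MODAL_URL_RE.search(deploy_output)
    return match.group(0) if match else None


def env_line(var_name: str, url: str) -> str:
    """Format a single KEY=value line suitable for a .env file."""
    return f"{var_name}={url}"


# Illustrative text only; the real deploy output may be laid out differently.
sample_output = """
deployed endpoint:
    https://yourname--video-toolkit-qwen3-tts-ttsengine-generate.modal.run
"""

url = extract_endpoint_url(sample_output)
if url:
    print(env_line("MODAL_QWEN3_TTS_ENDPOINT_URL", url))
```

Matching on the URL pattern rather than on surrounding words keeps the sketch independent of the exact wording of the deploy output.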
R2 is free file storage that bridges your local machine and cloud GPUs. Without it, tools fall back to free upload services (slower, less reliable).
R2 free tier: 10GB storage, 10 million operations/month, zero egress fees.
See the R2 section in /setup, or configure manually:
- Sign up at dash.cloudflare.com
- Go to R2 Object Storage → Create bucket (name it video-toolkit)
- Create an API token: R2 → Manage R2 API Tokens → Object Read & Write
- Add to .env:
R2_ACCOUNT_ID=your_account_id
R2_ACCESS_KEY_ID=your_access_key_id
R2_SECRET_ACCESS_KEY=your_secret_access_key
R2_BUCKET_NAME=video-toolkit
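A quick way to confirm that all four values are in place is sketched below. The helper names are hypothetical and not part of the toolkit, and the check assumes the .env values have already been loaded into the process environment. The endpoint format is Cloudflare's standard S3-compatible R2 address.

```python
import os

# The four variables listed in the .env step above.
R2_VARS = (
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_BUCKET_NAME",
)


def missing_r2_vars(env=None):
    """Return the R2 variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in R2_VARS if not env.get(name)]


def r2_endpoint_url(account_id):
    """Build the S3-compatible endpoint URL for an R2 account."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


missing = missing_r2_vars()
if missing:
    print("R2 not configured, missing:", ", ".join(missing))
else:
    print("R2 endpoint:", r2_endpoint_url(os.environ["R2_ACCOUNT_ID"]))
```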
All cloud GPU tools accept --cloud modal:
# AI voiceover
python3 tools/qwen3_tts.py --text "Hello world" --speaker Ryan --output hello.mp3 --cloud modal
# AI image generation
python3 tools/flux2.py --prompt "A sunset over mountains" --output sunset.png --cloud modal
# AI image editing
python3 tools/image_edit.py --input photo.jpg --style cyberpunk --cloud modal
# AI upscaling
python3 tools/upscale.py --input photo.jpg --output photo_4x.png --cloud modal
# AI music generation (acemusic cloud API is now default — no Modal needed)
python3 tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
# Or use Modal: python3 tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3 --cloud modal
# Talking head from portrait + audio
python3 tools/sadtalker.py --image portrait.png --audio voiceover.mp3 --output talking.mp4 --cloud modal
# Watermark removal
python3 tools/dewatermark.py --input video.mp4 --region 1080,660,195,40 --output clean.mp4 --cloud modal

| Tool | Backend | Use Case | Est. Cost |
|---|---|---|---|
| qwen3_tts | Qwen3-TTS | AI speech generation | ~$0.005-0.02 |
| flux2 | FLUX.2 Klein | AI image generation | ~$0.01-0.03 |
| image_edit | Qwen-Image-Edit | AI image editing, style transfer | ~$0.02-0.05 |
| upscale | RealESRGAN | AI image upscaling (2x/4x) | ~$0.005-0.02 |
| music_gen | ACE-Step 1.5 | AI music generation | Free (acemusic) / ~$0.02-0.10 (Modal) |
| sadtalker | SadTalker | Talking head video | ~$0.05-0.30 |
| dewatermark | ProPainter | AI video inpainting | ~$0.05-0.50 |
All apps use A10G GPUs (24GB VRAM) except image_edit, which uses an A100 for its 25GB model.
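To relate these estimates to the $30/month allowance, the arithmetic can be sketched as follows. The cost ranges are copied from the cost table above and are assumed to be per request (the table does not state the unit). The monthly request counts are invented for illustration.

```python
# (low, high) estimated cost in USD, copied from the cost table.
# Assumption: each figure is the cost of a single request.
EST_COST = {
    "qwen3_tts": (0.005, 0.02),
    "flux2": (0.01, 0.03),
    "image_edit": (0.02, 0.05),
    "upscale": (0.005, 0.02),
    "music_gen": (0.02, 0.10),  # Modal backend only; acemusic is free
    "sadtalker": (0.05, 0.30),
    "dewatermark": (0.05, 0.50),
}
FREE_ALLOWANCE_USD = 30.0


def estimate_monthly_cost(usage):
    """Return (low, high) USD totals for a {tool: request_count} mapping."""
    low = sum(EST_COST[tool][0] * count for tool, count in usage.items())
    high = sum(EST_COST[tool][1] * count for tool, count in usage.items())
    return low, high


# Hypothetical month: 40 voiceovers, 20 images, 2 talking-head clips.
usage = {"qwen3_tts": 40, "flux2": 20, "sadtalker": 2}
low, high = estimate_monthly_cost(usage)
print(f"Estimated spend: ${low:.2f} to ${high:.2f}")
print("Within free allowance:", high <= FREE_ALLOWANCE_USD)
```

With these made-up counts the estimate comes to roughly $0.50 to $2.00, in line with the typical usage figure quoted earlier.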
First request after idle triggers a cold start while Modal loads the model:
| Tool | Cold Start | Warm Request |
|---|---|---|
| qwen3_tts | ~60-90s | ~5-15s |
| flux2 | ~25-30s | ~1-3s |
| image_edit | ~5-8min | ~15-20s |
| upscale | ~25-30s | ~3-5s |
| music_gen | ~60-90s | ~10-30s |
| sadtalker | ~45-60s | ~30-60s |
| dewatermark | ~60-70s | varies by video length |
After 60 seconds of no requests, containers scale back to zero. No charges while idle.
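Because a cold start can take far longer than a warm request, a client calling these endpoints needs a generous HTTP timeout. One way to derive it from the cold start table is sketched here. The `timeout_for` helper and the 2x safety margin are assumptions for illustration, not toolkit behaviour.

```python
# Upper-bound times in seconds, taken from the cold start table.
COLD_START_S = {
    "qwen3_tts": 90,
    "flux2": 30,
    "image_edit": 8 * 60,
    "upscale": 30,
    "music_gen": 90,
    "sadtalker": 60,
    "dewatermark": 70,
}
# dewatermark is omitted: its warm time varies by video length.
WARM_S = {
    "qwen3_tts": 15,
    "flux2": 3,
    "image_edit": 20,
    "upscale": 5,
    "music_gen": 30,
    "sadtalker": 60,
}


def timeout_for(tool, warm_estimate_s=None, margin=2.0):
    """Return a request timeout covering a cold start plus one request."""
    warm = WARM_S.get(tool) if warm_estimate_s is None else warm_estimate_s
    if warm is None:
        raise ValueError(f"no warm estimate for {tool}; pass warm_estimate_s")
    return margin * (COLD_START_S[tool] + warm)


for tool in ("flux2", "image_edit"):
    print(tool, timeout_for(tool))
```

Sizing the timeout for the worst case matters most for image_edit, where a cold start is measured in minutes rather than seconds.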
# Check what's running (Tasks column should be 0 when idle)
modal app list
# Check today's spend
modal billing report --for today --json
# View container logs
modal app logs video-toolkit-upscale
# Verify your setup
python3 tools/verify_setup.py
Each tool has its own Modal app (docker/modal-*/app.py), deployed independently:
- One app per tool: independent scaling, GPU assignment, and lifecycle
- Web endpoints: HTTP POST via @modal.fastapi_endpoint, no modal pip dependency needed on the client
- R2 file transfer: large results upload to Cloudflare R2 (if configured), otherwise base64
- Scale to zero: scaledown_window=60 means containers shut down after 1 minute idle
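On the client side, the R2-or-base64 choice means a response may carry either a download link or the file itself. A minimal sketch of handling both cases follows. The field names `output_url` and `output_base64` are hypothetical, as is the `result_bytes` helper; the actual response shape used by the toolkit is not specified here.

```python
import base64


def result_bytes(response, download):
    """Return the output file's bytes from an endpoint response.

    Assumed response shapes (hypothetical field names):
      {"output_url": "..."}     result was uploaded to R2
      {"output_base64": "..."}  result was returned inline
    `download` is any callable that maps a URL to bytes.
    """
    if "output_url" in response:
        return download(response["output_url"])
    if "output_base64" in response:
        return base64.b64decode(response["output_base64"])
    raise KeyError("response has neither output_url nor output_base64")


# Inline case, using a fake payload in place of real audio.
inline = {"output_base64": base64.b64encode(b"fake-audio").decode("ascii")}
print(result_bytes(inline, download=None))
```

Passing the downloader in as an argument keeps the sketch free of network calls and easy to exercise with a stub.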
The client-side abstraction lives in tools/cloud_gpu.py, which routes call_cloud_endpoint() to either _call_runpod() (submit + poll) or _call_modal() (synchronous POST).
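The routing described above might look roughly like the following. This is a sketch of the pattern only, not the contents of tools/cloud_gpu.py: the function signatures, the payload shape, the job status strings, and the injected `post`, `submit`, and `poll` callables are all assumptions, chosen so the example runs without network access.

```python
import time


def _call_modal(endpoint_url, payload, post):
    """Modal path: a single blocking POST whose response is the result."""
    return post(endpoint_url, payload)


def _call_runpod(endpoint_id, payload, submit, poll, interval_s=1.0, max_polls=600):
    """RunPod path: submit a job, then poll its status until it finishes."""
    job_id = submit(endpoint_id, payload)
    for _ in range(max_polls):
        job = poll(endpoint_id, job_id)
        if job["status"] == "COMPLETED":  # assumed status string
            return job["output"]
        if job["status"] == "FAILED":  # assumed status string
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def call_cloud_endpoint(provider, target, payload, *, post=None, submit=None,
                        poll=None, interval_s=1.0):
    """Route one request to the chosen provider."""
    if provider == "modal":
        return _call_modal(target, payload, post)
    if provider == "runpod":
        return _call_runpod(target, payload, submit, poll, interval_s=interval_s)
    raise ValueError(f"unknown provider: {provider!r}")


# Fake transports stand in for real HTTP calls so the sketch runs offline.
def fake_post(url, payload):
    return {"echo": payload["text"]}


def fake_submit(endpoint_id, payload):
    return "job-1"


_statuses = iter([
    {"status": "IN_PROGRESS"},
    {"status": "COMPLETED", "output": {"echo": "hello"}},
])


def fake_poll(endpoint_id, job_id):
    return next(_statuses)


modal_result = call_cloud_endpoint(
    "modal", "https://example.modal.run", {"text": "hello"}, post=fake_post)
runpod_result = call_cloud_endpoint(
    "runpod", "endpoint-id", {"text": "hello"},
    submit=fake_submit, poll=fake_poll, interval_s=0)
print(modal_result, runpod_result)
```

Both paths return the same result shape to the caller, which is what lets each tool expose a single --cloud flag regardless of provider.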
RunPod is also supported as a fallback provider. Use --cloud runpod on any tool.
| Aspect | Modal | RunPod |
|---|---|---|
| Free tier | $30/mo compute | None (pay-as-you-go) |
| Setup | modal deploy | --setup flag per tool |
| Cold start | Faster (cached layers) | Slower (Docker pull) |
| Invocation | Synchronous POST | Async submit + poll |
| Auth | Token optional for web endpoints | RUNPOD_API_KEY required |
See runpod-setup.md for RunPod-specific instructions.