🤖 Jarvis

License: MIT · Python 3.11+ · Platform: macOS · AI: Gemini · Bilingual

Bilingual (Spanish / English) voice assistant for macOS, inspired by Tony Stark's JARVIS. Listens continuously through the microphone, thinks with Google Gemini (with live Google Search), and replies out loud using the local, high-quality Kokoro TTS neural engine. Detects the language of every utterance independently, so you can switch between Spanish and English mid-conversation and Jarvis follows.

Ships as two macOS apps that share the same core:

  • 🖥️ Jarvis.app — native HUD built with CustomTkinter.
  • 🌐 Jarvis Web.app — launches a local Flask + Socket.IO server and opens the HUD in your browser.

✨ Features

  • 🎙️ Continuous listening — no wake word needed; just talk.
  • 🌍 Per-utterance language detection — ASR is run in parallel as es-ES and en-US, and a marker-scoring system picks the transcript that's internally coherent with its own language. Includes an asymmetric bias that fixes Spanish-ASR hallucinations from English audio.
  • 🧠 Gemini + Google Search — real-time answers (weather, news, prices, scores). Model is gemini-2.5-flash by default; gemini-2.5-pro or any other Gemini model can be selected via .env.
  • 💬 Short-term memory — keeps the last 8 conversational turns so follow-ups work ("and what about tomorrow?").
  • 🔊 Bilingual Neural voices — high-fidelity speech using Kokoro-82M. Uses bm_fable (classic British Jarvis sound) and em_alex (Spanish neural) by default. Voices are configurable per-language in .env (e.g., af_heart, bf_isabelle, em_santa).
  • 🧼 TTS-aware output — the response is stripped of markdown, code fences, and [1] citation marks before being read, so the speech is clean.
  • ⌨️ Text input (web HUD) — type instead of talking; Jarvis still replies with voice. Useful on calls or in noisy rooms.
  • 🎨 HUD visuals — animated arc reactor that changes color per state (idle / listening / thinking / speaking) and reacts to real mic audio levels.
  • 📦 Double-click launch — ships as two .app bundles; no terminal required after setup.
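The TTS-aware cleanup mentioned in the feature list can be pictured with a few regular expressions. This is a minimal sketch, not the code in jarvis_core.py: the helper name clean_for_tts and the exact patterns are assumptions made for illustration.

```python
import re

# Hypothetical helper; the real cleanup in jarvis_core.py may differ.
_FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)   # fenced code blocks
_LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")     # [label](url) -> label
_CITATION = re.compile(r"\s*\[\d+\]")            # citation marks such as [1]
_MARKUP = re.compile(r"[*_`#>]+")                # emphasis, headings, quotes


def clean_for_tts(text: str) -> str:
    """Return text with markdown, code fences, and citation marks removed."""
    text = _FENCE.sub(" ", text)      # drop code blocks entirely
    text = _LINK.sub(r"\1", text)     # keep only the link label
    text = _CITATION.sub("", text)    # drop numeric citation marks
    text = _MARKUP.sub("", text)      # drop leftover markup characters
    return re.sub(r"\s+", " ", text).strip()
```

Note that this simple version also removes underscores inside identifiers, which is usually acceptable for spoken output but would not be for displayed text.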

📸 Screenshots

Desktop HUD — CustomTkinter window with animated arc reactor, transmission log, and linear spectrum. Color palette shifts with state (idle / listening / thinking / speaking).

[Screenshots: Desktop HUD in the listening and speaking states]

Web HUD — the same experience in your browser, with a text input box and a CORE TEMP / MODE / UPTIME footer. Shutdown overlays the screen with "Good bye, sir." before terminating the server.

[Screenshots: Web HUD while speaking and during shutdown]

📋 Requirements

  • macOS (uses Kokoro TTS via sounddevice for audio output).
  • Python 3.11+ (required for Kokoro and AI dependencies).
  • Homebrew for python-tk@3.11 (required for the desktop app).
  • A microphone (and Microphone permission granted to the app on first run).
  • A Gemini API key — free at Google AI Studio.

🚀 Quick install

git clone https://github.com/davinson-pezo/jarvis.git
cd jarvis
cp .env.example .env
# edit .env and paste your GEMINI_API_KEY
./setup.command

setup.command recreates the venv, installs dependencies, and runs PyInstaller to produce both .app bundles inside dist/. Copy them to /Applications and launch with a double click.

🛠️ Manual install (dev mode)

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env        # and paste your API key

Run directly without bundling:

python jarvis_app.py        # desktop HUD
python jarvis_web.py        # web HUD on http://localhost:5005

⚙️ Environment variables

All optional except GEMINI_API_KEY. See .env.example for the full list:

  • GEMINI_API_KEY: required, your Gemini API key.
  • FLASK_SECRET_KEY — session-signing key for the web server. If empty, a random one is generated each startup (which invalidates web sessions across restarts).
  • JARVIS_MODEL — Gemini model name (default gemini-2.5-flash). Try gemini-2.5-pro for smarter but slightly slower replies.
  • JARVIS_VOICE_EN / JARVIS_VOICE_ES — select the Kokoro voice for each language. Default: bm_fable (EN) / em_alex (ES).
  • JARVIS_NO_BROWSER=1 — skip the automatic browser open for the web app (useful for remote hosts).
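As a rough illustration of how these variables and their documented defaults fit together, the sketch below reads them from a mapping. It assumes the values from .env have already been loaded into the process environment; the JarvisSettings class and load_settings function are hypothetical names, not part of the repository.

```python
import os
import secrets
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JarvisSettings:
    """Hypothetical container for the documented environment variables."""
    api_key: str
    flask_secret: str
    model: str
    voice_en: str
    voice_es: str
    open_browser: bool


def load_settings(env: Mapping[str, str] = os.environ) -> JarvisSettings:
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required; set it in .env")
    return JarvisSettings(
        api_key=api_key,
        # An empty key means a fresh random one on every startup.
        flask_secret=env.get("FLASK_SECRET_KEY") or secrets.token_hex(32),
        model=env.get("JARVIS_MODEL", "gemini-2.5-flash"),
        voice_en=env.get("JARVIS_VOICE_EN", "bm_fable"),
        voice_es=env.get("JARVIS_VOICE_ES", "em_alex"),
        open_browser=env.get("JARVIS_NO_BROWSER") != "1",
    )
```

Passing a plain dict instead of os.environ makes the defaults easy to check without touching the real environment.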

🗣️ Available Kokoro Voices

ID           Lang  Gender  Description
bm_fable     EN    Male    Default. Deep, British, calm (The Jarvis vibe).
af_heart     EN    Female  Very natural, expressive American voice.
am_adam      EN    Male    Clear, professional American male.
bf_isabelle  EN    Female  Formal British female.
em_alex      ES    Male    Default. Standard neutral Spanish male.
ef_dora      ES    Female  Clear, friendly Spanish female.
em_santa     ES    Male    Deep, resonant Spanish male.
ef_isabel    ES    Female  Soft, natural Spanish female.

Tip

You can test these by changing the variables in your .env and restarting the app. No recompilation needed!

🎙️ Usage

On launch, Jarvis greets you in English — "Good evening, sir. Systems are online." — and immediately enters continuous listening. The loop is LISTENING → THINKING → SPEAKING → LISTENING, with a short cooldown after each reply so it doesn't hear its own voice. Phrases of up to ~20 seconds with natural pauses work fine; speak in Spanish or English and Jarvis will reply in the same language with the matching voice.

Desktop HUD (Jarvis.app): a dark window with an animated arc reactor on the left, a transmission log on the right showing everything Jarvis hears and says, a current voice / language readout, a linear audio spectrum below, and a SHUTDOWN button in the footer.

Web HUD (Jarvis Web.app): the same HUD rendered in your browser, plus a text input box in the transmission panel (Type a command, or just speak...) and a footer showing CORE TEMP, MODE: CONTINUOUS, and UPTIME. Typing + Enter (or clicking SEND) sends the message to Jarvis, which replies out loud. The Jarvis Web.app bundle runs as a macOS background agent (no Dock icon, no App Switcher entry) — its UI lives entirely in the browser tab.

Shutdown: click SHUTDOWN in either HUD. The desktop app fades out and speaks "Good bye, sir." before quitting; the web version pops a confirmation, then overlays SHUTTING DOWN / Good bye, sir. and terminates the server. You can also force-quit from Activity Monitor.

ℹ️ Wake-word scaffolding (jarvis, hey jarvis, hola jarvis, oye jarvis) is present in jarvis_core.py but is currently inactive: the voice loop processes every utterance it hears. If you'd prefer wake-word gating over continuous listening, wiring contains_wake_word into run_voice_loop is a good first PR.
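For anyone picking up that PR, a wake-word check could look roughly like the sketch below. It reuses the name contains_wake_word from the note above, but the body is an independent illustration and may not match the scaffolding that already exists in jarvis_core.py; the normalization step is an assumption.

```python
import re
import unicodedata

# Phrases taken from the note above.
WAKE_PHRASES = ("hey jarvis", "hola jarvis", "oye jarvis", "jarvis")


def _normalize(text: str) -> str:
    """Lowercase and strip accents and punctuation, so "¡Oye, Jarvis!" matches."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents)
    return " ".join(cleaned.split())


def contains_wake_word(transcript: str) -> bool:
    """Illustrative version: True if any wake phrase appears as whole words."""
    words = _normalize(transcript)
    return any(re.search(rf"\b{re.escape(p)}\b", words) for p in WAKE_PHRASES)
```

Gating would then amount to skipping any utterance for which contains_wake_word returns False before it reaches the model. Since the bare word "jarvis" already covers the longer phrases, the longer entries only matter if the short one is later removed.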

🏗️ Architecture

jarvis_core.py       Brain — Gemini client, speech recognition, language
                     scoring, voice selection, TTS, history, callbacks.

jarvis_app.py        Desktop UI — CustomTkinter HUD with animated reactor
                     canvas, spectrum bar, log, shutdown.

jarvis_web.py        Flask + Socket.IO server — same HUD over WebSocket,
                     port 5005, auto-opens the browser.

templates/           Web HUD HTML.
static/              Web HUD CSS + JS (Canvas-based reactor + spectrum).

build_desktop.sh     PyInstaller recipe for Jarvis.app.
build_web.sh         PyInstaller recipe for Jarvis Web.app (LSUIElement=true).
build_app.sh         Runs both above in sequence.
setup.command        Double-click installer: venv → pip → build.
run_jarvis.command   Quick-launch script for the Web version (no bundling needed).
run_jarvis_desktop.command  Quick-launch script for the Desktop version.

The core is UI-agnostic: it exposes two callbacks, on_status(status, log) and on_speak(text, lang), plus a level callback for audio visualization. Both UIs simply register handlers for these, so adding a new interface (CLI, menubar, iOS client over WebSocket…) only requires wiring up the same callbacks.
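To make the callback contract concrete, here is a minimal sketch of a text-only frontend. FakeCore is a stand-in written for this example, and registering callbacks as plain attributes is an assumption; the real core may take them as constructor arguments or in some other form. Only the callback names and their arguments come from the description above.

```python
from typing import Callable, List, Optional


class FakeCore:
    """Stand-in for the real core; only the callback wiring is illustrated."""

    def __init__(self) -> None:
        self.on_status: Optional[Callable[[str, str], None]] = None
        self.on_speak: Optional[Callable[[str, str], None]] = None

    def handle(self, heard: str, reply: str, lang: str) -> None:
        # Simulates one THINKING -> SPEAKING -> LISTENING cycle.
        if self.on_status:
            self.on_status("THINKING", f"heard: {heard}")
        if self.on_speak:
            self.on_speak(reply, lang)
        if self.on_status:
            self.on_status("LISTENING", "")


class CliFrontend:
    """Hypothetical CLI interface that prints whatever the core reports."""

    def __init__(self, core: FakeCore) -> None:
        self.lines: List[str] = []
        core.on_status = self.show_status
        core.on_speak = self.show_speech

    def _emit(self, line: str) -> None:
        self.lines.append(line)
        print(line)

    def show_status(self, status: str, log: str) -> None:
        self._emit(f"[{status}] {log}".rstrip())

    def show_speech(self, text: str, lang: str) -> None:
        self._emit(f"JARVIS ({lang}): {text}")
```

The frontend never imports anything UI-specific from the core, which is the property that lets the desktop and web HUDs share it.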

🔌 Extending Jarvis — OpenClaw integration (community direction)

Out of the box, Jarvis is a voice layer wrapping Gemini: great at conversation and search, but it can't touch your machine. A natural next step is to bolt Jarvis onto a local agent with tool-use capabilities — OpenClaw is a strong candidate on macOS.

🎙️  Mic  →  Jarvis (ASR + language + wake logic)
                     ↓
                  OpenClaw (local agent on port 18789, Bearer auth)
                     ↓  tool-use
          ┌──────────┼──────────┬──────────┬──────────┐
        Files     Apps       Mail      Calendar    Shell
                     ↓
🔊  Kokoro ←  Jarvis (Neural TTS in the detected language)

Concretely, this would unlock things like:

  • 📁 "Jarvis, create a folder on my Desktop called 'Taxes 2026' and move every PDF from Downloads into it."
  • 🧭 "Open Spotify and play focus music." (via Shortcuts / AppleScript bridges)
  • 📬 "Read me the latest unread email from María and draft a reply."
  • 🗓️ "What's on my calendar for Thursday? Add a 3 pm slot titled 'Dentist'."

Why this isn't merged into main: OpenClaw isn't universally installed, and forcing that dependency just to chat with Gemini would be overkill. The current repo keeps the Gemini-only path clean so anyone with a free API key can run it.

Where the integration hooks in: JarvisCore.think() in jarvis_core.py. Instead of calling self.client.models.generate_content(...), it could POST to an OpenClaw gateway and pipe the response back through the existing on_speak / on_status callbacks. Language scoring and voice routing keep working unchanged. Keep the OpenClaw path opt-in via something like JARVIS_BRAIN=openclaw in .env so existing users aren't affected. PRs welcome.
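One way to keep that path opt-in is to choose the backend once at startup. The sketch below is only an illustration of the JARVIS_BRAIN switch suggested above: select_brain and the Brain type are hypothetical, the two backends are passed in as plain callables, and no OpenClaw endpoint or request format is assumed. The fallback to Gemini when the gateway is unreachable is an extra design choice, not something the project specifies.

```python
import os
from typing import Callable, Mapping

# A "brain" takes the user's text and returns the reply text.
Brain = Callable[[str], str]


def select_brain(gemini: Brain, openclaw: Brain,
                 env: Mapping[str, str] = os.environ) -> Brain:
    """Return the backend chosen by JARVIS_BRAIN (default: Gemini)."""
    choice = env.get("JARVIS_BRAIN", "gemini").strip().lower()
    if choice != "openclaw":
        return gemini

    def openclaw_with_fallback(prompt: str) -> str:
        try:
            return openclaw(prompt)
        except OSError:
            # Gateway not running or unreachable: answer with Gemini instead.
            return gemini(prompt)

    return openclaw_with_fallback
```

Because the result is just a callable, the rest of the pipeline (language scoring, voice routing, callbacks) does not need to know which backend produced the reply.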

🌐 Language detection

Each utterance's audio is sent to Google Speech Recognition in parallel as both es-ES and en-US. Each returned transcript is then scored by how well it matches its claimed language:

  • Spanish markers: accents / ñ / ¿¡ (+3 flat), plus common function words (el, la, qué, cómo, porque, …).
  • English markers: common function words (the, is, what, can, would, please, sir, …).

When both transcripts are internally coherent (score ≥ 2 for their own language) and the Spanish lead (Spanish score minus English score) is ≤ 2 points, Jarvis picks the English one. This corrects a well-known behavior of Google's Spanish ASR: it happily returns a plausible Spanish sentence when the audio is actually English ("what is the weather like" → "qué es el wéder laik"), whereas the English ASR rarely does the opposite.

For typed input in the web HUD, the same scoring is used to pick the reply language.
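The scoring and the asymmetric tie-break described above can be sketched as follows. The marker sets contain only the words listed in this section, so they are far smaller than whatever jarvis_core.py actually uses, and the function names are hypothetical. Counting one point per matching function word and falling back to the higher score when the bias rule does not apply are both assumptions.

```python
import re
from typing import List, Tuple

# Illustrative subsets; the real marker lists are longer.
ES_WORDS = {"el", "la", "qué", "cómo", "porque"}
EN_WORDS = {"the", "is", "what", "can", "would", "please", "sir"}
ES_CHARS = re.compile(r"[áéíóúüñ¿¡]")


def _words(text: str) -> List[str]:
    return re.findall(r"[a-záéíóúüñ']+", text.lower())


def score_es(text: str) -> int:
    # Flat +3 if any accent, ñ, or inverted punctuation is present.
    flat = 3 if ES_CHARS.search(text.lower()) else 0
    return flat + sum(w in ES_WORDS for w in _words(text))


def score_en(text: str) -> int:
    return sum(w in EN_WORDS for w in _words(text))


def pick_language(es_transcript: str, en_transcript: str) -> Tuple[str, str]:
    es, en = score_es(es_transcript), score_en(en_transcript)
    both_coherent = es >= 2 and en >= 2
    if both_coherent and es - en <= 2:
        # Asymmetric bias: a small Spanish lead is not trusted.
        return "en", en_transcript
    # Assumed fallback: otherwise the higher score wins.
    return ("es", es_transcript) if es > en else ("en", en_transcript)
```

With the example from this section, the Spanish transcript scores 5 (3 flat plus "qué" and "el") and the English one scores 3, so the lead is exactly 2 and the English transcript is chosen.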

🛡️ Privacy & Security

  • Your Gemini API key lives only in .env, which is git-ignored.
  • Voice goes to Google ASR; reasoning goes to Gemini. No other third parties.
  • No telemetry, no analytics, no tracking, no local persistence beyond an in-memory 8-turn history that dies with the process.
  • The .app bundles are ad-hoc code-signed (no Apple Developer account required); on first run macOS may ask you to grant Microphone permission.

☕ Support the Project

If this tool has been helpful and you'd like to support its development, feel free to buy me a coffee!

PayPal

📄 License

MIT — see LICENSE.

🙏 Credits

Brain: Google Gemini. TTS: Kokoro-82M via kokoro-python. ASR: Google Speech Recognition via the SpeechRecognition library. Desktop UI: CustomTkinter. Web UI: Flask + Flask-SocketIO. Packaging: PyInstaller.

