feat(CORE-236): add voice input (STT) and read-aloud (TTS)#91

Draft
nraffa wants to merge 5 commits into main from feat/CORE-236-voice-input-stt

nraffa (Collaborator) commented Apr 8, 2026

Summary

  • Voice Input (STT): Mic button on chat input — speak instead of type, with real-time transcription
  • Read Aloud (TTS): Speaker button on AI messages — reads them aloud using Google Cloud TTS (Chirp 3 HD voices), with browser speechSynthesis as fallback
  • Both features behind config toggles (speechToText.enabled, textToSpeech.enabled), disabled by default
  • IaC: both GLOBAL_ENABLE_SPEECH_TO_TEXT and GLOBAL_ENABLE_TEXT_TO_SPEECH wired to Cloud Run for deployment
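
As an illustration of the toggle behaviour described above, here is a minimal sketch of how the two environment variables might map onto the `speechToText.enabled` and `textToSpeech.enabled` config keys. The PR does not show the parsing code, so `parseFlag`, `loadVoiceFeatureConfig`, and the rule that only the string "true" enables a feature are assumptions made for this example.

```typescript
// Hypothetical helper: the PR names the env vars and config keys,
// but not how they are parsed. This is one plausible mapping.
type Env = Record<string, string | undefined>;

interface VoiceFeatureConfig {
  speechToText: { enabled: boolean };
  textToSpeech: { enabled: boolean };
}

// Assumption: only an explicit "true" (case-insensitive) enables a feature,
// so a missing or malformed variable leaves it disabled by default.
function parseFlag(value: string | undefined): boolean {
  return value !== undefined && value.trim().toLowerCase() === "true";
}

function loadVoiceFeatureConfig(env: Env): VoiceFeatureConfig {
  return {
    speechToText: { enabled: parseFlag(env["GLOBAL_ENABLE_SPEECH_TO_TEXT"]) },
    textToSpeech: { enabled: parseFlag(env["GLOBAL_ENABLE_TEXT_TO_SPEECH"]) },
  };
}
```

Passing the environment in as a plain object keeps the function easy to unit-test; a real service would presumably hand it `process.env` or its own config source.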

Voice Input (STT)

  • Hybrid approach: browser Speech Recognition for real-time text display + Google Cloud STT v2 for final transcription
  • Backend endpoint: POST /speech-to-text/transcribe (stateless proxy, multipart/form-data)
  • Browser support: real-time text in Chrome, Edge, Safari (~90% of users). Firefox falls back to cloud-only with brief "Transcribing..." state
  • 60-second max recording with auto-stop, security filter on transcribed text
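
The 60-second cap and the way a finished transcription is combined with text already in the input can be sketched as two small pure functions. This is not the PR's implementation: `shouldAutoStop`, `mergeTranscript`, and the single-space separator are hypothetical. Only the 60-second limit and the "existing text is preserved, new transcription is appended" behaviour come from the PR.

```typescript
// 60-second recording cap, taken from the PR description.
const MAX_RECORDING_MS = 60_000;

// Hypothetical check a recorder could poll (or use to schedule a timer):
// returns true once the recording has reached the cap.
function shouldAutoStop(startedAtMs: number, nowMs: number): boolean {
  return nowMs - startedAtMs >= MAX_RECORDING_MS;
}

// Hypothetical merge step: keep whatever the user already typed and append
// the final transcription. The single-space separator is an assumption.
function mergeTranscript(existing: string, transcript: string): string {
  const addition = transcript.trim();
  if (addition === "") return existing; // nothing was recognized
  if (existing.trim() === "") return addition; // input was empty
  return existing.endsWith(" ") ? existing + addition : existing + " " + addition;
}
```

Keeping both steps free of browser APIs means they can be tested without `MediaRecorder` or the Web Speech API, which the component itself would still need.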

Read Aloud (TTS)

  • Backend endpoint: POST /text-to-speech/synthesize (JSON {text, language} → MP3 audio)
  • Google Cloud TTS with Chirp 3 HD voices for natural-sounding speech
  • Browser speechSynthesis as fallback if backend is unavailable (marked with TODO for future removal)
  • Loading spinner while audio is being synthesized, LRU audio cache (20 entries) to avoid redundant API calls
  • MP3 format for universal browser compatibility (including Safari)
  • Language fallback: ny-ZM → en-US (Chichewa not supported by any TTS provider)
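
To make the caching and language-fallback bullets concrete, here is a minimal sketch of a 20-entry LRU cache keyed by resolved language and text. The PR only states that `TextToSpeechService` has an LRU audio cache, so `LruCache`, `resolveTtsLanguage`, `ttsCacheKey`, and the key format are illustrative assumptions and not the PR's code.

```typescript
// Minimal LRU cache built on Map, which iterates in insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

// Fallback named in the PR: Chichewa (ny-ZM) is synthesized as en-US.
const TTS_LANGUAGE_FALLBACKS: Record<string, string> = { "ny-ZM": "en-US" };

function resolveTtsLanguage(language: string): string {
  return TTS_LANGUAGE_FALLBACKS[language] ?? language;
}

// Assumed key format: resolved language plus text, separated by a NUL character.
function ttsCacheKey(text: string, language: string): string {
  return `${resolveTtsLanguage(language)}\u0000${text}`;
}

// 20 entries, as stated in the PR; Uint8Array stands in for the MP3 payload.
const audioCache = new LruCache<string, Uint8Array>(20);
```

Keying on the resolved language means a message requested as ny-ZM and again as en-US would share one cache entry, which is one way to avoid a redundant synthesis call.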

nraffa added 5 commits April 9, 2026 11:03
Add a new POST /speech-to-text/transcribe endpoint that accepts audio uploads and returns transcribed text via Google Cloud Speech-to-Text v2. Feature is behind a configuration toggle (disabled by default), following the same pattern as CV upload.
Add mic button to chat input that uses Web Speech API for real-time interim text display while recording audio via MediaRecorder. On stop, audio is sent to the backend STT endpoint for final transcription. Existing text in the input is preserved and new transcription is appended. Feature is behind the GLOBAL_ENABLE_SPEECH_TO_TEXT toggle.
Enable speech.googleapis.com in backend required services, grant roles/speech.client to the backend service account, and pass GOOGLE_CLOUD_PROJECT env var to Cloud Run.
Add speaker button to AI chat messages that reads them aloud using Google Cloud TTS (Chirp 3 HD voices), with browser speechSynthesis as fallback if the backend is unavailable.

- Backend: POST /text-to-speech/synthesize endpoint mirroring the STT module.
- Frontend: TextToSpeechService with LRU audio cache, loading spinner state.
- IaC: wire GLOBAL_ENABLE_SPEECH_TO_TEXT and GLOBAL_ENABLE_TEXT_TO_SPEECH env vars to Cloud Run for full deployment of voice-assisted features.
nraffa force-pushed the feat/CORE-236-voice-input-stt branch from 3951c8c to b10fc79 on April 9, 2026 01:08
nraffa changed the title from "feat(CORE-236): add voice input with real-time transcription" to "feat(CORE-236): add voice input (STT) and read-aloud (TTS)" on Apr 9, 2026
nraffa marked this pull request as draft on May 28, 2026 07:12