feat: Support pluggable speech recognition engines beyond the Web Speech API #239

Description

@untemps

Context

Today both Vocal and useVocal are hard-wired to the browser's native Web Speech API (SpeechRecognition). The recognition engine lives inside @untemps/vocal: createVocal() instantiates window.SpeechRecognition/webkitSpeechRecognition internally, and isSupported() probes for that global. react-vocal only ever consumes the resulting VocalInstance.

This couples the whole library to one engine, which has real limitations:

  • No cross-browser coverage: Firefox does not expose SpeechRecognition by default, so isSupported() returns false there and the component renders nothing.
  • No offline / on-device option, e.g. Vosk, whisper.cpp / transformers.js.
  • No cloud STT option, e.g. Deepgram, Google Cloud Speech-to-Text, Azure Speech, OpenAI/Whisper API, which consumers may already pay for and want for accuracy, custom vocabulary, or diarization.

There is currently no public seam to swap the engine.

Proposal

Introduce a pluggable speech-recognition engine (adapter) abstraction so consumers can supply their own backend while keeping the existing event model, commands, timeouts and accessibility behaviour untouched. The Web Speech API stays the default, so this is purely additive and non-breaking.

The core mechanism belongs in @untemps/vocal (where the engine is wired); react-vocal then surfaces it through useVocal and Vocal.

1. @untemps/vocal — engine contract + injection

Define an engine interface that abstracts the parts createVocal currently assumes about SpeechRecognition, and emits the existing eventTypes (start, end, result, error, speechstart, speechend, nomatch, permission, …). Sketch:

export interface SpeechEngine {
  start(options?: { signal?: AbortSignal }): Promise<void>
  stop(): void
  abort(): void
  // normalized event stream consumed by Vocal core
  on<T extends EventType>(type: T, cb: EventHandlerFor<T>): void
  off<T extends EventType>(type: T, cb?: EventHandlerFor<T>): void
  cleanup(): void
  readonly isSupported: boolean
}

export type SpeechEngineFactory = (options: VocalOptions) => SpeechEngine

createVocal accepts an optional engine factory and defaults to the built-in Web Speech engine:

createVocal({ lang, grammars, maxAlternatives, continuous, engine: myEngineFactory })

The existing Web Speech behaviour is refactored into a default webSpeechEngine implementing this interface — no behaviour change when no engine is passed.
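
To make the contract concrete, here is a minimal sketch of a mock engine that satisfies a trimmed version of the interface above. Everything in it is an assumption for illustration: createMockEngine is a hypothetical name, the EventType union and Handler type are local stand-ins for the library's own types, the AbortSignal option is left out, and the result payload shape is not a decided design.

```typescript
// Local stand-ins for the library's types (assumptions for this sketch).
type EventType = 'start' | 'end' | 'result' | 'error'
type Handler = (payload?: unknown) => void

// Trimmed copy of the proposed contract (the AbortSignal option is omitted).
interface SpeechEngine {
  start(): Promise<void>
  stop(): void
  abort(): void
  on(type: EventType, cb: Handler): void
  off(type: EventType, cb?: Handler): void
  cleanup(): void
  readonly isSupported: boolean
}

type SpeechEngineFactory = (options: { lang?: string }) => SpeechEngine

// Hypothetical mock engine: emits a canned transcript instead of capturing audio.
const createMockEngine = (transcript: string): SpeechEngineFactory => () => {
  const handlers = new Map<EventType, Set<Handler>>()
  let running = false

  // Call every handler registered for the given event type.
  const emit = (type: EventType, payload?: unknown): void => {
    const set = handlers.get(type)
    if (set) set.forEach((cb) => cb(payload))
  }

  // Emits 'end' once per running session; a no-op when already stopped.
  const stop = (): void => {
    if (!running) return
    running = false
    emit('end')
  }

  return {
    isSupported: true,
    start() {
      if (!running) {
        running = true
        emit('start')
        emit('result', { transcript, isFinal: true })
      }
      return Promise.resolve()
    },
    stop,
    abort: stop,
    on(type, cb) {
      const set = handlers.get(type) ?? new Set<Handler>()
      set.add(cb)
      handlers.set(type, set)
    },
    off(type, cb) {
      // Without a callback, drop every handler for this event type.
      if (cb) handlers.get(type)?.delete(cb)
      else handlers.delete(type)
    },
    cleanup() {
      running = false
      handlers.clear()
    },
  }
}
```

Under this proposal such a factory would be passed as the engine option of createVocal. A mock like this is also a candidate for the reference engine and for injection tests, since it needs neither a microphone nor a browser.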

2. react-vocal — expose the seam

  • useVocal(...) gains a way to pass a custom engine factory, forwarded to createVocal.
  • Vocal gains an engine prop, forwarded to useVocal.
  • isSupported() becomes engine-aware: when a custom engine is provided, support is determined by the engine (so a cloud/offline engine can render the button even on Firefox).
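
The engine-aware support check could look roughly like the sketch below. It assumes the check receives an already-created engine instance (the proposed interface puts isSupported on the instance, not on the factory) and takes the global scope as a parameter so it can be exercised outside a browser. The names hasWebSpeech and SupportAwareEngine are hypothetical.

```typescript
// Only the part of the engine contract this check needs.
interface SupportAwareEngine {
  readonly isSupported: boolean
}

// Native probe: looks for the standard or webkit-prefixed constructor.
const hasWebSpeech = (scope: object): boolean =>
  'SpeechRecognition' in scope || 'webkitSpeechRecognition' in scope

// With a custom engine, the engine decides; otherwise fall back to the native probe.
const isSupported = (engine?: SupportAwareEngine, scope: object = globalThis): boolean =>
  engine ? engine.isSupported : hasWebSpeech(scope)
```

One open question this leaves is whether support should be queryable without instantiating an engine, for example through a static flag on the factory, since creating a cloud or on-device engine may be expensive.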

Key considerations

  • Result normalization. react-vocal's onResult currently receives a raw SpeechRecognitionEvent (and _onResult reads event.results). A custom engine won't produce that shape. We need either a normalized result payload emitted by all engines, or a documented adapter shape, so tryMatchCommand / useCommands keep working. This is the main design decision.
  • Permission / getUserMedia. Cloud and on-device engines manage microphone capture themselves; the permission event contract must stay meaningful (or be opt-out per engine).
  • grammars / maxAlternatives / continuous. Engine-specific support — define how unsupported options degrade.
  • Async/streaming engines. Cloud engines stream audio and return interim/final transcripts asynchronously; the engine adapter must map that onto the existing synchronous-ish event lifecycle.
  • Bundle size. Engines must be tree-shakeable / opt-in; no cloud SDK should be pulled into the default build.
  • Types. Export SpeechEngine / SpeechEngineFactory from both packages.
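
For the result-normalization question, one possible shape is sketched below. It is an illustration and not a decided design: NormalizedResult, fromWebSpeechEvent and fromCloudMessage are hypothetical names, the event types are structural stand-ins for SpeechRecognitionEvent, and the cloud message shape is invented for the example and does not describe any provider's API.

```typescript
// Structural stand-ins for the Web Speech result objects.
interface AlternativeLike { transcript: string; confidence: number }
interface ResultLike extends ArrayLike<AlternativeLike> { isFinal: boolean }
interface WebSpeechEventLike { results: ArrayLike<ResultLike> }

// Hypothetical engine-agnostic payload handed to onResult.
interface NormalizedResult {
  transcript: string
  confidence: number
  isFinal: boolean
  alternatives: string[]
}

// Adapter for the default engine: reads the most recent result of the event.
const fromWebSpeechEvent = (event: WebSpeechEventLike): NormalizedResult | null => {
  const result = event.results[event.results.length - 1]
  if (!result || result.length === 0) return null
  const alternatives: string[] = []
  for (let i = 0; i < result.length; i++) {
    alternatives.push(result[i].transcript.trim())
  }
  return {
    transcript: alternatives[0],
    confidence: result[0].confidence,
    isFinal: result.isFinal,
    alternatives,
  }
}

// Adapter for an invented cloud message shape (interim and final transcripts).
const fromCloudMessage = (msg: { text: string; final: boolean; score?: number }): NormalizedResult => ({
  transcript: msg.text.trim(),
  confidence: msg.score ?? 0,
  isFinal: msg.final,
  alternatives: [msg.text.trim()],
})
```

With both adapters producing the same payload, command matching would only need to read transcript and isFinal, whichever engine produced the result.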

Backward compatibility

Fully backward compatible: omitting engine keeps the current Web Speech API behaviour and the existing isSupported() semantics.

Acceptance criteria

  • @untemps/vocal exposes a documented SpeechEngine interface and accepts an engine factory in createVocal (defaulting to the built-in Web Speech engine).
  • useVocal accepts a custom engine factory and forwards it to createVocal; Vocal exposes it through an engine prop forwarded to useVocal.
  • isSupported is engine-aware.
  • onResult receives a normalized result usable by useCommands regardless of engine.
  • Default (no engine) behaviour is unchanged and covered by existing tests.
  • A reference/example custom engine (mock or a real one such as Deepgram/Whisper) is documented in the README.
  • Tests cover engine injection, support detection, and result normalization.

Note: the bulk of this work (engine contract + default Web Speech engine) lives in @untemps/vocal. This issue tracks the react-vocal side (props/hook seam, docs, tests) and the cross-package coordination.

Labels: enhancement (New feature or request)
