Context
Today both Vocal and useVocal are hard-wired to the browser's native Web Speech API (SpeechRecognition). The recognition engine lives inside @untemps/vocal: createVocal() instantiates window.SpeechRecognition/webkitSpeechRecognition internally, and isSupported() probes for that global. react-vocal only ever consumes the resulting VocalInstance.
This couples the whole library to one engine, which has real limitations:
- No cross-browser coverage: Firefox has no SpeechRecognition; on most platforms isSupported() returns false and the component renders nothing.
- No offline / on-device option, e.g. Vosk, whisper.cpp / transformers.js.
- No cloud STT option — e.g. Deepgram, Google Cloud Speech-to-Text, Azure Speech, OpenAI/Whisper API — which consumers may already pay for and want for accuracy, custom vocabulary, or diarization.
There is currently no public seam to swap the engine.
Proposal
Introduce a pluggable speech-recognition engine (adapter) abstraction so consumers can supply their own backend while keeping the existing event model, commands, timeouts and accessibility behaviour untouched. The Web Speech API stays the default, so this is purely additive and non-breaking.
The core mechanism belongs in @untemps/vocal (where the engine is wired); react-vocal then surfaces it through useVocal and Vocal.
1. @untemps/vocal — engine contract + injection
Define an engine interface that abstracts the parts createVocal currently assumes about SpeechRecognition, and emits the existing eventTypes (start, end, result, error, speechstart, speechend, nomatch, permission, …). Sketch:
```typescript
export interface SpeechEngine {
  start(options?: { signal?: AbortSignal }): Promise<void>
  stop(): void
  abort(): void
  // normalized event stream consumed by Vocal core
  on<T extends EventType>(type: T, cb: EventHandlerFor<T>): void
  off<T extends EventType>(type: T, cb?: EventHandlerFor<T>): void
  cleanup(): void
  readonly isSupported: boolean
}

export type SpeechEngineFactory = (options: VocalOptions) => SpeechEngine
```
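To make the contract more concrete, here is a minimal sketch of a custom engine that satisfies a simplified version of it. Everything in it is an assumption for illustration: `createScriptedEngine`, the `EngineEvents` payload shapes, and the reduced event list are hypothetical, and the types are redeclared locally (without the `AbortSignal` option) so the snippet is self-contained rather than taken from @untemps/vocal.

```typescript
// Simplified stand-ins for the types referenced by the sketch above.
// The payload shapes are assumptions, not the actual @untemps/vocal types.
type EngineEvents = {
  start: undefined
  end: undefined
  result: { transcript: string; isFinal: boolean }
}
type EventType = keyof EngineEvents
type EventHandlerFor<T extends EventType> = (payload: EngineEvents[T]) => void

interface SpeechEngine {
  start(): Promise<void>
  stop(): void
  abort(): void
  on<T extends EventType>(type: T, cb: EventHandlerFor<T>): void
  off<T extends EventType>(type: T, cb?: EventHandlerFor<T>): void
  cleanup(): void
  readonly isSupported: boolean
}

type AnyHandler = (payload: unknown) => void

// Hypothetical engine that "recognizes" a fixed list of transcripts.
function createScriptedEngine(transcripts: string[]): SpeechEngine {
  const handlers = new Map<EventType, Set<AnyHandler>>()
  const state = { running: false }

  function emit<T extends EventType>(type: T, payload: EngineEvents[T]): void {
    handlers.get(type)?.forEach((cb) => cb(payload))
  }

  // Ends the session once; repeated calls are no-ops.
  function finish(): void {
    if (!state.running) return
    state.running = false
    emit('end', undefined)
  }

  return {
    isSupported: true,
    async start() {
      if (state.running) return
      state.running = true
      emit('start', undefined)
      for (const transcript of transcripts) {
        if (!state.running) break // a handler called stop() or abort()
        emit('result', { transcript, isFinal: true })
      }
      finish()
    },
    stop: finish,
    abort: finish,
    on<T extends EventType>(type: T, cb: EventHandlerFor<T>): void {
      const set = handlers.get(type) ?? new Set<AnyHandler>()
      set.add(cb as AnyHandler)
      handlers.set(type, set)
    },
    off<T extends EventType>(type: T, cb?: EventHandlerFor<T>): void {
      if (cb) handlers.get(type)?.delete(cb as AnyHandler)
      else handlers.delete(type)
    },
    cleanup(): void {
      handlers.clear()
      state.running = false
    },
  }
}

// Usage: record the order in which the normalized events arrive.
const seen: string[] = []
const engine = createScriptedEngine(['hello', 'world'])
engine.on('start', () => { seen.push('start') })
engine.on('result', (r) => { seen.push(r.transcript) })
engine.on('end', () => { seen.push('end') })
void engine.start()
```

A real cloud or on-device engine would replace the loop in `start()` with its own capture and transcription pipeline, but the consumer-facing surface would stay the same.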
createVocal accepts an optional engine factory and defaults to the built-in Web Speech engine:
```typescript
createVocal({ lang, grammars, maxAlternatives, continuous, engine: myEngineFactory })
```
The existing Web Speech behaviour is refactored into a default webSpeechEngine implementing this interface — no behaviour change when no engine is passed.
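One way the default could be resolved is sketched here, assuming the factory is simply an optional field on the options object. `createVocalSketch` is a hypothetical stand-in for the real createVocal, the `id` field exists only to tell engines apart in this example, and the types are trimmed-down local copies.

```typescript
// Trimmed-down stand-ins; `id` is a hypothetical field used only in this example.
interface SpeechEngine {
  readonly id: string
  readonly isSupported: boolean
}
type SpeechEngineFactory = (options: VocalOptions) => SpeechEngine
interface VocalOptions {
  lang?: string
  continuous?: boolean
  engine?: SpeechEngineFactory
}

// Default engine: reports support by probing for the Web Speech globals.
const webSpeechEngine: SpeechEngineFactory = () => {
  const scope = globalThis as unknown as Record<string, unknown>
  return {
    id: 'web-speech',
    isSupported: 'SpeechRecognition' in scope || 'webkitSpeechRecognition' in scope,
  }
}

// Falls back to the built-in engine when no factory is supplied, and passes
// the remaining options through to whichever factory is used.
function createVocalSketch(options: VocalOptions = {}): { engine: SpeechEngine } {
  const { engine: factory = webSpeechEngine, ...engineOptions } = options
  return { engine: factory(engineOptions) }
}

const withDefault = createVocalSketch({ lang: 'en-US' })
const withCustom = createVocalSketch({
  lang: 'fr-FR',
  engine: (options) => ({ id: `custom:${options.lang}`, isSupported: true }),
})
```

Because the default is applied in one place, callers that never pass `engine` go through exactly the same code path as before.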
2. react-vocal — expose the seam
- useVocal(...) gains a way to pass a custom engine factory, forwarded to createVocal.
- Vocal gains an engine prop, forwarded to useVocal.
- isSupported() becomes engine-aware: when a custom engine is provided, support is determined by the engine (so a cloud/offline engine can render the button even on Firefox).
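The engine-aware check could look roughly like the sketch below. It assumes isSupported receives an engine instance (the proposal does not fix whether it takes an instance or a factory), and the `scope` parameter is a hypothetical addition that lets the example simulate different browsers.

```typescript
interface SpeechEngine {
  readonly isSupported: boolean
}

// Current behaviour: probe for the Web Speech constructors on the global scope.
function hasWebSpeech(scope: object = globalThis): boolean {
  return 'SpeechRecognition' in scope || 'webkitSpeechRecognition' in scope
}

// Engine-aware variant: a supplied engine decides; otherwise fall back to the probe.
function isSupported(engine?: SpeechEngine, scope: object = globalThis): boolean {
  return engine ? engine.isSupported : hasWebSpeech(scope)
}

// Simulated environments (assumptions for the example).
const firefoxLike = {}
const chromeLike = { webkitSpeechRecognition: class {} }
const cloudEngine: SpeechEngine = { isSupported: true }

const nativeOnFirefox = isSupported(undefined, firefoxLike)
const nativeOnChrome = isSupported(undefined, chromeLike)
const cloudOnFirefox = isSupported(cloudEngine, firefoxLike)
```

With no engine supplied, the result depends only on the global probe, which matches the existing semantics.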
Key considerations
- Result normalization. react-vocal's onResult currently receives a raw SpeechRecognitionEvent (and _onResult reads event.results). A custom engine won't produce that shape. We need either a normalized result payload emitted by all engines, or a documented adapter shape, so tryMatchCommand / useCommands keep working. This is the main design decision.
- Permission / getUserMedia. Cloud and on-device engines manage microphone capture themselves; the permission event contract must stay meaningful (or be opt-out per engine).
- grammars / maxAlternatives / continuous. Engine-specific support: define how unsupported options degrade.
- Async/streaming engines. Cloud engines stream audio and return interim/final transcripts asynchronously; the engine adapter must map that onto the existing synchronous-ish event lifecycle.
- Bundle size. Engines must be tree-shakeable / opt-in; no cloud SDK should be pulled into the default build.
- Types. Export SpeechEngine / SpeechEngineFactory from both packages.
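For the result-normalization question, one possible shape is sketched below. `NormalizedResult`, `fromWebSpeech`, `fromStreaming`, `StreamingMessage`, and `findCommand` are all hypothetical names; the Web Speech event is modelled structurally (resultIndex, results, isFinal, transcript, confidence) so the snippet does not depend on DOM typings, and `findCommand` is only a stand-in for command matching, not the real tryMatchCommand.

```typescript
// Structural subset of SpeechRecognitionEvent used by the adapter.
interface AlternativeLike { transcript: string; confidence: number }
interface ResultLike extends ArrayLike<AlternativeLike> { isFinal: boolean }
interface WebSpeechEventLike { resultIndex: number; results: ArrayLike<ResultLike> }

// Hypothetical payload every engine would emit on `result`.
interface NormalizedResult {
  transcript: string
  confidence: number | undefined
  isFinal: boolean
  alternatives: string[]
}

// Adapter for the built-in engine: picks the result at resultIndex.
function fromWebSpeech(event: WebSpeechEventLike): NormalizedResult | undefined {
  const result = event.results[event.resultIndex]
  if (!result || result.length === 0) return undefined
  const alternatives = Array.from(result)
  const best = alternatives[0]
  if (!best) return undefined
  return {
    transcript: best.transcript.trim(),
    confidence: best.confidence,
    isFinal: result.isFinal,
    alternatives: alternatives.map((a) => a.transcript.trim()),
  }
}

// Adapter for an assumed streaming message format from a cloud engine.
interface StreamingMessage { text: string; final: boolean; score?: number }
function fromStreaming(message: StreamingMessage): NormalizedResult {
  const transcript = message.text.trim()
  return { transcript, confidence: message.score, isFinal: message.final, alternatives: [transcript] }
}

// Stand-in for command matching: only final results can trigger a command.
function findCommand(result: NormalizedResult, commands: string[]): string | undefined {
  if (!result.isFinal) return undefined
  const spoken = result.transcript.toLowerCase()
  return commands.find((command) => command.toLowerCase() === spoken)
}

const webEvent: WebSpeechEventLike = {
  resultIndex: 0,
  results: [Object.assign([{ transcript: ' Turn on ', confidence: 0.9 }], { isFinal: true })],
}
const commands = ['turn on', 'turn off']
const fromNative = fromWebSpeech(webEvent)
const fromCloud = fromStreaming({ text: 'turn off', final: true })
const interim = fromStreaming({ text: 'turn', final: false })
```

With a shape like this, command matching only ever sees `NormalizedResult`, so it would not need to know which engine produced the transcript.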
Backward compatibility
Fully backward compatible: omitting engine keeps the current Web Speech API behaviour and the existing isSupported() semantics.
Acceptance criteria
- @untemps/vocal exposes a documented SpeechEngine interface and accepts an engine factory in createVocal (defaulting to the built-in Web Speech engine).
- useVocal and Vocal accept and forward a custom engine; Vocal gains an engine prop.
- isSupported is engine-aware.
- onResult receives a normalized result usable by useCommands regardless of engine.
Note: the bulk of this work (engine contract + default Web Speech engine) lives in @untemps/vocal. This issue tracks the react-vocal side (props/hook seam, docs, tests) and the cross-package coordination.