Problem
When using Patter with Twilio and the OpenAIRealtime() engine, callers hear persistent static and audio artifacts whenever the agent speaks. STT breaks as well: the model transcribes the noise as random text (e.g., URLs) because the audio is corrupted.
Root Cause
Twilio Media Streams deliver g711_ulaw audio at 8 kHz to Patter. For OpenAI Realtime, Patter requests output_audio_format: g711_ulaw in the session config. However, the OpenAI Realtime models (both gpt-realtime-mini and gpt-realtime) ignore this request and always return PCM16 at 24 kHz.
Patter 0.6.3 assumes the output format matches what was requested (it sets _input_is_mulaw = True for the openai_realtime provider), so it passes the PCM16 bytes directly to Twilio as if they were mulaw.
Result: Twilio receives 24 kHz PCM16 bytes interpreted as 8 kHz mulaw → garbled/static audio + broken STT.
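To make the mismatch concrete, here is a small back-of-the-envelope sketch. It assumes the diagnosis above is correct (mono PCM16 at 24 kHz going in, Twilio reading every byte as one 8 kHz mulaw sample); the function name is hypothetical and not part of Patter.

```python
# Byte-rate arithmetic for the suspected format mismatch.
# Assumption: mono audio on both sides.
PCM16_RATE_HZ = 24_000
PCM16_BYTES_PER_SAMPLE = 2
ULAW_RATE_HZ = 8_000
ULAW_BYTES_PER_SAMPLE = 1


def misplayed_seconds(real_seconds: float) -> float:
    """Playback length when PCM16 @ 24 kHz bytes are read as mulaw @ 8 kHz."""
    payload_bytes = real_seconds * PCM16_RATE_HZ * PCM16_BYTES_PER_SAMPLE
    return payload_bytes / (ULAW_RATE_HZ * ULAW_BYTES_PER_SAMPLE)


print(misplayed_seconds(1.0))  # 6.0
```

Under these assumptions each second of agent speech becomes six seconds of playback, and every PCM16 byte (low and high halves alike) is decoded as an unrelated mulaw sample, which would match the static callers report.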
Evidence from call transcript
Caller said static; model transcribed noise as:
Requested fix
Patter should do one of the following:
- Detect the actual output format returned by OpenAI and transcode accordingly
- Default to pcm16 for Twilio and handle the full 24 kHz → 8 kHz transcoding chain (like OpenAIRealtime2 does)
- At minimum, document this limitation
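For reference, the transcoding step in the second option could look roughly like the sketch below. This is not Patter's code and not how OpenAIRealtime2 is implemented; pcm16_24k_to_ulaw_8k and linear_to_ulaw are hypothetical names, the input is assumed to be mono little-endian PCM16, and averaging three samples is only a crude stand-in for a proper low-pass filter and resampler.

```python
import struct

ULAW_BIAS = 0x84   # standard G.711 mu-law bias (132)
ULAW_CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), ULAW_CLIP) + ULAW_BIAS
    # magnitude is in [132, 32767], so bit_length() is 8..15 and exponent is 0..7
    exponent = magnitude.bit_length() - 8
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_24k_to_ulaw_8k(pcm: bytes) -> bytes:
    """Downsample mono PCM16 @ 24 kHz by 3 and mu-law encode the result."""
    # Keep only whole groups of 3 samples (6 bytes). A streaming
    # implementation would carry the remainder into the next chunk.
    usable = len(pcm) - (len(pcm) % 6)
    samples = struct.unpack(f"<{usable // 2}h", pcm[:usable])
    out = bytearray()
    for i in range(0, len(samples), 3):
        # Crude anti-aliasing: average each group of 3 input samples.
        averaged = int(sum(samples[i:i + 3]) / 3)
        out.append(linear_to_ulaw(averaged))
    return bytes(out)


one_second_of_silence = b"\x00\x00" * 24_000
print(len(pcm16_24k_to_ulaw_8k(one_second_of_silence)))  # 8000
```

The resulting bytes would still need to be base64-encoded into Twilio media messages. A production version would likely want a real resampling filter rather than the 3-sample average used here.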
Environment
- getpatter: 0.6.3
- Model: gpt-realtime-mini (default)
- Carrier: Twilio
- Both phones affected: Zoom phone + physical cell phone