This project is a hardware retrofit of the Edifier D12 70W Bluetooth speaker, transforming it into an AI voice assistant. A Raspberry Pi 5 running a real-time API client (Gemini/OpenAI) is installed inside the original enclosure, paired with a reSpeaker XVF3800 microphone array that handles audio output, hardware beamforming, and acoustic echo cancellation (AEC).
- Safety Warning & Disclaimer
- Project Philosophy
- Architectural Choice: Real-time API vs. Traditional STT/TTS
- Hardware Architecture
- Hardware Evolution: v1 DAC+ to v2 XVF3800-only
- Hardware Implementation
- System Configuration
- Future Improvements
DANGER: High Voltage. This project involves modifying mains-powered equipment and installing an internal power supply unit (PSU) connected to AC lines.
- Risk of Electric Shock: Improper handling of high-voltage components can result in serious injury or death.
- Fire Hazard: Incorrect wiring or component selection can cause fire.
- Warranty Void: Opening the Edifier D12 enclosure will void its warranty.
The author of this project accepts no responsibility for any damage to equipment, personal injury, or property damage resulting from the replication of this project. Proceed only if you have the appropriate knowledge and experience with high-voltage electronics. Always disconnect power before working on the device.
This is a hobby project driven by two primary goals:
- To Learn: To explore the process of building a high-quality, end-to-end voice assistant from the ground up, covering hardware integration, low-level Linux audio configuration, and real-time application development.
- To Use: To create a practical home assistant that supports the Polish language, a feature still lacking in many commercial smart speakers. The aim is to build a device that is not only functional but also a permanent, useful part of a smart home ecosystem.
The project prioritizes audio quality, low latency, and a modular software design that allows for future expansion.
- Single-device Audio: The reSpeaker XVF3800 (firmware ≥ 2.0.9, 48kHz) handles both audio output, at decent quality, and microphone capture, eliminating the need for a separate DAC.
- Hardware AEC: The XVF3800's on-chip AEC cancels speaker echo using its internal reference; no software loopback is required.
- Real-time Conversational AI: Leverages the native streaming APIs from Google Gemini and OpenAI for low-latency, natural-feeling conversations.
- Robust Voice Capture: The XVF3800 DSP provides hardware beamforming and AEC. Wake word detection runs in the application via openWakeWord (ONNX model, custom-trainable).
- Smart Home Control via Function Calling: Integrates seamlessly with OpenHAB to control smart devices like lights, switches, and sensors.
- Multi-lingual: While designed for Polish, the system can be configured for any language supported by the chosen AI provider.
- Operational Monitoring: Built-in Prometheus metrics endpoint with a pre-configured Grafana dashboard.
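As an illustration of the function-calling item above, the sketch below maps a model-issued tool call onto an OpenHAB REST request. It is not the project's actual handler: the tool name `set_item_state`, its argument names, the base URL, and the item name are all hypothetical. The only external behaviour assumed is that OpenHAB accepts an item command as a plain-text POST to `/rest/items/<item>`.

```python
import json
from urllib.parse import quote
from urllib.request import Request

OPENHAB_URL = "http://openhab.local:8080"  # hypothetical address


def build_item_command(tool_call_json: str, base_url: str = OPENHAB_URL) -> Request:
    """Translate a (hypothetical) tool call into an OpenHAB item command request."""
    call = json.loads(tool_call_json)
    if call.get("name") != "set_item_state":
        raise ValueError(f"unsupported tool: {call.get('name')!r}")
    args = call["arguments"]
    item = quote(args["item"], safe="")  # escape the item name for use in the path
    return Request(
        url=f"{base_url}/rest/items/{item}",
        data=str(args["state"]).encode("utf-8"),  # command goes in the body as plain text
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_item_command(
    '{"name": "set_item_state", "arguments": {"item": "LivingRoom_Light", "state": "ON"}}'
)
print(request.get_method(), request.full_url, request.data)
```

The request is only constructed here. Passing it to `urllib.request.urlopen` would actually send the command.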
Most open-source voice assistant projects follow a standard offline pipeline:
Wake Word -> Speech-to-Text (STT) -> Intent Handling / LLM -> Text-to-Speech (TTS)
This project takes a different approach by leveraging Real-time Multimodal APIs (e.g., Gemini's Live API or OpenAI's Realtime API). Raw audio is streamed directly to the API, which handles VAD, STT, LLM interaction, and TTS in a single, continuous session.
- Simplicity & Speed: Eliminates the "pipeline latency" that accumulates at each step of a traditional flow. The round-trip time is significantly lower, resulting in a more natural, conversational feel.
- End-to-End AI: The entire interaction is managed by a single, powerful model, leading to more context-aware and human-like responses.
- Focus on Hardware/Integration: By offloading core AI tasks, this project can focus on the hardware build, audio stack optimization, and creating a reliable platform.
This architecture trades the privacy and offline capabilities of traditional systems for state-of-the-art speed and conversational quality.
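To make the streaming model concrete, the following sketch slices a buffer of raw PCM into fixed-duration frames, which is the shape of data a real-time session typically consumes. The 16 kHz, 16-bit mono format and the 20 ms frame length are assumptions for illustration, not values taken from this project or mandated by either API. The actual send call is omitted.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed capture rate
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM (assumed)
FRAME_MS = 20             # assumed frame duration


def pcm_frames(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-size frames; a trailing partial frame is dropped."""
    frame_bytes = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(pcm_frames(one_second_of_silence))
print(len(frames), len(frames[0]))
```

In a live session each frame would be handed to the provider's client as soon as it is captured, rather than collected into a list first.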
- Base Unit: Edifier D12 Stereo Speaker
- Processing: Raspberry Pi 5 (8GB) with Active Cooler
- Audio I/O: reSpeaker XVF3800 USB Microphone Array (speaker output + 4-mic array with hardware beamforming and AEC; firmware 2.0.9+ required for 48kHz operation)
- Power Delivery: Mean Well RS-25-5 Industrial Switching Power Supply (25W, 5V, 5A)
- Connectivity:
  - Shielded Cat 6a RJ45 Panel Mount
  - Industrial Metal USB 3.0 Type-A Panel Mount
  - Premium internal RCA and USB interconnects
Note: The original v1 build included a Raspberry Pi DAC+ for audio output. This has been superseded; see Hardware Evolution below.
The hardware and software architecture went through a significant simplification when Seeed Studio released firmware 2.0.9 for the XVF3800, adding native 48kHz USB audio support.
The original firmware only supported 16kHz on its USB audio interface. This forced a split audio path:

    Music / TTS ──► PipeWire combine-sink ──┬──► I2S ──► DAC+ ──► Edifier amplifier ──► speaker
                                            └──► USB ──► XVF3800 (16kHz reference for AEC)
                                                         └──► USB capture ──► application

This created several challenges:
- Two clock domains: DAC+ at 48kHz and XVF3800 at 16kHz required PipeWire to maintain a fixed quantum to prevent clock drift.
- Software AEC reference loopback: A PipeWire `combine-stream` sink had to continuously feed a downsampled (16kHz) copy of the played audio back to the XVF3800's USB playback input as the AEC reference signal.
- Precise delay calibration: Because the reference and playback took different paths, `AUDIO_MGR_SYS_DELAY` had to be precisely calibrated (within the ±5ms range of the parameter) to align the AEC reference with the acoustic echo. The `tools/respeaker_delay_tune.py` script was developed for this purpose.
- Clock drift: Even with a fixed PipeWire clock, the two-device setup could develop minor drift over long sessions.
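The delay-calibration problem above amounts to finding the lag at which the microphone signal best matches the reference. A common generic technique for that is cross-correlation, sketched below in pure Python. This illustrates the idea only and is not a description of how `tools/respeaker_delay_tune.py` works; the function name and the synthetic test signal are made up for the example.

```python
import random


def estimate_delay(reference: list[float], captured: list[float], max_lag: int) -> int:
    """Return the lag (in samples) at which captured best correlates with reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Dot product of the reference with the capture shifted back by `lag`.
        score = sum(r * c for r, c in zip(reference, captured[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 37  # samples; arbitrary value for the demonstration
captured = [0.0] * true_delay + [0.5 * s for s in reference]  # delayed, attenuated echo

# A search window of 80 samples corresponds to 5 ms at 16 kHz.
delay = estimate_delay(reference, captured, max_lag=80)
print(delay, "samples =", 1000.0 * delay / 16_000, "ms at 16 kHz")
```

A real measurement would replace the synthetic echo with a recording of a known test signal played through the speaker, and would have to cope with noise and room reverberation.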
With firmware 2.0.9, the XVF3800 operates as a native 48kHz USB audio device for both playback and capture. The entire audio path is unified:

    Music / TTS ──► PipeWire ──► USB ──► XVF3800 ──► speaker
                                         │  (internal AEC: playback → mic reference)
                                         └──► USB capture ──► application

What changed:
- The DAC+ and the PipeWire combine-sink are no longer needed.
- The XVF3800's DSP handles AEC internally: it uses its own USB playback output as the AEC reference, without any software loopback.
- There is only one clock domain. PipeWire's fixed quantum is still configured (480 samples @ 48kHz = 10ms) for consistent low-latency scheduling, but clock drift between devices is no longer a concern.
- `AUDIO_MGR_SYS_DELAY` only needs to compensate for the chip-internal acoustic path (speaker → mic), which is fixed and small (~10 samples). The AEC adaptive filter handles the bulk of the pipeline latency automatically.
The trade-off is that audio quality is now determined by the XVF3800's built-in speaker amplifier rather than a dedicated DAC. For a voice assistant the audio quality is decent and entirely fit for purpose.
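The quantum figure quoted above follows directly from the sample rate. As a quick check, this small helper (a hypothetical name, not part of the project) converts a quantum in frames to milliseconds:

```python
def quantum_ms(frames: int, sample_rate_hz: int) -> float:
    """Duration of one PipeWire quantum in milliseconds."""
    return 1000.0 * frames / sample_rate_hz


print(quantum_ms(480, 48_000))  # 10.0
```

The same function shows how the scheduling period scales: doubling the quantum to 960 frames at 48kHz would double the period to 20 ms.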
The modification focuses on internalizing the compute stack while maintaining the acoustic integrity of the Edifier chassis.
Photos of the assembly process are located in hardware/pictures/. The logical sequence is:
- `01-enclosure-disassembled.jpg`: Internal layout assessment.
- `02-internal-psu-mount-point.jpg` & `03-internal-psu-mount-point.jpg`: Preparing bracketry.
- `04-internal-psu-installed.jpg`: Mounting the Mean Well PSU and routing AC lines.
- `05-rpi-dac-plus-stack.jpg` through `07-rpi-dac-plus-stack.jpg`: Assembly of the RPi 5 + DAC+ stack (v1 build).
- `08-rpi-mounted-in-chassis.jpg` & `09-rpi-mounted-in-chassis.jpg`: Final placement near the panel mounts.
- `10-original-pcb-with-additional-wiring.jpg` & `11-original-pcb-with-additional-wiring.jpg`: Bridging the Edifier amplifier inputs with the RPi audio output.
- `12-enclosure-reassembly.jpg` & `13-enclosure-reassembly.jpg`: Final internal cable management.
- `14-final-assembly-complete.jpg`: Finished front-facing drivers.
- `15-top-view.jpg` & `16-rear-view.jpg`: Final external appearance and panel mount access.
The Raspberry Pi 5 runs a modern Linux audio stack optimized for low-latency voice processing. For details on the Python application, see the Application README.
To configure the application to start automatically on boot as a user service:
- Copy the provided systemd service file:

      mkdir -p ~/.config/systemd/user
      cp linux/home/user/.config/systemd/user/ai-smart-speaker.service ~/.config/systemd/user/

- Enable lingering (allows user services to start on boot without login):

      loginctl enable-linger $USER

- Reload and enable:

      systemctl --user daemon-reload
      systemctl --user enable --now ai-smart-speaker.service

- View logs:

      journalctl --user-unit ai-smart-speaker -f

The system uses PipeWire with the following configuration files (all under linux/):
| File | Purpose |
|---|---|
| `home/user/.config/pipewire/pipewire.conf.d/50-fixed-clock.conf` | Pins PipeWire to 48kHz, fixed quantum 480 (10ms) for consistent scheduling |
| `home/user/.config/wireplumber/wireplumber.conf.d/51-lowlatency-alsa.conf` | Keeps the XVF3800 always active (the application continuously reads mic audio for wake word detection) and tunes the ALSA buffer |
| `home/user/.config/systemd/user/pipewire.service.d/rt.conf` | Grants PipeWire the `RLIMIT_RTPRIO` needed for real-time scheduling |
| `etc/polkit-1/rules.d/50-rtkit-pipewire.rules` | Allows RTKit to promote PipeWire's audio thread to `SCHED_RR` on headless systems |
XVF3800 parameters are saved to the chip's flash via SAVE_CONFIGURATION and persist across reboots without any boot-time script. The udev rule only sets USB permissions.
After a firmware upgrade (which clears flash), re-apply settings manually:

    /opt/reSpeaker/xvf_host -e /opt/reSpeaker/init_commands.txt
    /opt/reSpeaker/xvf_host SAVE_CONFIGURATION 1

Key parameters (see `linux/opt/reSpeaker/init_commands.txt` for the full list):
| Parameter | Value | Purpose |
|---|---|---|
| `AUDIO_MGR_MIC_GAIN` | 90 | Pre-beamformer microphone gain |
| `AUDIO_MGR_REF_GAIN` | 8 | Far-end reference gain for AEC |
| `AUDIO_MGR_SYS_DELAY` | 12 | Chip-internal acoustic path delay (samples) |
| `PP_AGCONOFF` | 1 | Automatic Gain Control enabled |
| `PP_ECHOONOFF` | 1 | Echo suppression enabled |
| `PP_NLATTENONOFF` | 1 | Non-linear echo attenuation enabled |
The AEC converges automatically via its adaptive filter. AUDIO_MGR_SYS_DELAY only fine-tunes the chip-internal path and has a valid range of [-64, 256] samples.
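The range constraint above can be enforced before a value ever reaches the chip. The sketch below validates a delay and builds the argument list for `xvf_host`. It assumes the `xvf_host <PARAMETER> <VALUE>` form (the same shape as the `SAVE_CONFIGURATION 1` call shown earlier) and the install path used in this document. The helper name is hypothetical and nothing is executed.

```python
XVF_HOST = "/opt/reSpeaker/xvf_host"
SYS_DELAY_RANGE = (-64, 256)  # valid range in samples, as stated above


def sys_delay_command(samples: int) -> list[str]:
    """Build the xvf_host argument list for AUDIO_MGR_SYS_DELAY, rejecting out-of-range values."""
    low, high = SYS_DELAY_RANGE
    if not low <= samples <= high:
        raise ValueError(f"AUDIO_MGR_SYS_DELAY must be within [{low}, {high}], got {samples}")
    return [XVF_HOST, "AUDIO_MGR_SYS_DELAY", str(samples)]


print(sys_delay_command(12))
```

The returned list is suitable for `subprocess.run(..., check=True)`. A `SAVE_CONFIGURATION` call would still be needed afterwards for the value to persist across reboots.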
- Music Streaming: Integrating music streaming services (Spotify, YouTube Music).
- Custom Linux Distribution: Building a minimal Linux distribution using the Yocto Project to optimize boot time and performance.