daytonaio · daxia778 · Jun 26, 2026
diff --git a/authors/haodong_liu.md b/authors/haodong_liu.md
@@ -0,0 +1,10 @@
+Author: Haodong Liu
+Title: Software Engineer
+Description: Haodong Liu builds practical developer tooling and automation workflows for reproducible engineering environments. His work focuses on small, testable integrations that make AI-assisted development pipelines easier to run, audit, and share.
+Author Image: <https://github.com/daxia778.png>
+Author LinkedIn:
+Author Twitter:
+Company Name: Independent
+Company Description: Independent software development and open-source engineering.
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260626_definition_hugging_face_inference_provider.md b/definitions/20260626_definition_hugging_face_inference_provider.md
@@ -0,0 +1,26 @@
+---
+title: "Hugging Face Inference Provider"
+description: "A hosted model routing layer for running inference tasks through Hugging Face."
+date: 2026-06-26
+author: "Haodong Liu"
+---
+
+# Hugging Face Inference Provider
+
+## Definition
+
+A Hugging Face Inference Provider is a hosted execution backend exposed through
+Hugging Face's routed inference API. Developers call one consistent HTTP
+interface while selecting the provider and model that should run the task.
+
+## Context and Usage
+
+For speech-to-text workflows, an inference provider can run an automatic speech
+recognition model such as Whisper from a remote endpoint. Client tools send an
+audio payload, authenticate with a Hugging Face token, and receive a transcript
+response without hosting model infrastructure in the local workspace.
+
+Provider routing is useful when teams want to compare model quality, latency,
+and cost from a stable client. It still requires careful configuration because
+providers can differ in model availability, response latency, file size limits,
+timestamp support, and billing behavior.
diff --git a/guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md b/guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md
@@ -0,0 +1,330 @@
+---
+title: "Run Hugging Face ASR in Daytona"
+description: "Use Daytona, Sapat, and Hugging Face Inference Providers to transcribe media in a reproducible workspace."
+date: 2026-06-26
+author: "Haodong Liu"
+tags: ["daytona", "sapat", "speech-to-text", "hugging-face"]
+---
+
+# Run Hugging Face ASR in Daytona
+
+## Introduction
+
+Speech-to-text experiments become difficult to compare when every provider
+needs a different setup, machine image, and local script. One provider might
+need a hosted endpoint, another might need a local model, and another might
+return timestamps in a different shape. The more those details leak into the
+project repository, the harder it is for another engineer to reproduce the run.
+
+[Daytona](https://www.daytona.io/) is a good fit for this kind of work because
+the media tools, Python dependencies, environment variables, and temporary audio
+files can live inside a disposable workspace. This guide uses
+[Sapat](https://github.com/nibzard/sapat), a Python speech-to-text CLI, with a
+new [Hugging Face Inference Provider](../definitions/20260626_definition_hugging_face_inference_provider.md)
+adapter. The adapter sends audio to Hugging Face's routed automatic speech
+recognition endpoint and writes the returned transcript beside the source file.
+
+The companion Sapat implementation is available in
+[nibzard/sapat#72](https://github.com/nibzard/sapat/pull/72). While the pull
+request is under review, use the branch shown below. After it is merged, the
+same workflow can run from the upstream Sapat repository.
+
+![Hugging Face ASR workflow with Sapat in Daytona](assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg)
+
+## TL;DR
+
+- Create a Daytona workspace from the Sapat branch that adds the Hugging Face
+  provider.
+- Store `HF_TOKEN` and router settings in `.env`; never commit keys or media.
+- Run `sapat --provider huggingface` against a short test recording first.
+- Use `HUGGINGFACE_PROVIDER` to switch the routed inference backend without
+  changing the Sapat command.
+- Enable timestamp chunks only when the selected model and provider support
+  them.
+
+## Prerequisites
+
+You need the following:
+
+- Daytona installed and connected to a target that can create workspaces.
+- A GitHub account that can clone public repositories.
+- A Hugging Face token with access to the model or provider you want to use.
+- A short media file you are allowed to transcribe.
+- Basic terminal familiarity.
+
+Use a short, non-confidential recording for the first run. Provider routing,
+billing, model availability, and file size limits can vary, so it is better to
+validate the path with a safe sample before sending real project recordings.
+
+## Step 1: Create the Workspace
+
+Start the Daytona server if it is not already running:
+
+```bash
+daytona server
+```
+
+Create a workspace from the Sapat fork that contains the companion provider
+implementation:
+
+```bash
+daytona create https://github.com/daxia778/sapat --code
+```
+
+When the editor opens, check out the provider branch:
+
+```bash
+git checkout codex/huggingface-provider
+```
+
+If the provider has already been merged upstream, use the main repository
+instead:
+
+```bash
+daytona create https://github.com/nibzard/sapat --code
+```
+
+The main benefit is repeatability. Everyone who opens this workspace gets the
+same repository state, installs the same Python package, and keeps provider
+configuration off the host machine.
+
+## Step 2: Install Sapat and Check Media Tools
+
+Install Sapat in editable mode inside the workspace:
+
+```bash
+python -m pip install -e .
+```
+
+Sapat converts input media before it calls a provider, so `ffmpeg` must be
+available:
+
+```bash
+ffmpeg -version
+```
+
+If the command is missing in a Debian-based workspace image, install it:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+Run a quick CLI check:
+
+```bash
+sapat --version
+```
+
+At this point the workspace has the CLI entry point, provider registry, and
+conversion tool needed for a Hugging Face transcription run.
+
+## Step 3: Configure Hugging Face
+
+Create a `.env` file in the workspace root:
+
+```bash
+cat > .env <<'EOF'
+HF_TOKEN=replace-with-your-token
+HUGGINGFACE_PROVIDER=hf-inference
+HUGGINGFACE_RETURN_TIMESTAMPS=false
+EOF
+```
+
+The provider reads these settings:
+
+| Variable | Purpose | Default |
+| --- | --- | --- |
+| `HF_TOKEN` | Bearer token for the router request | Required |
+| `HUGGINGFACE_PROVIDER` | Routed backend segment in the Hugging Face URL | `hf-inference` |
+| `HUGGINGFACE_API_BASE` | Router base URL for advanced deployments | `https://router.huggingface.co` |
+| `HUGGINGFACE_TIMEOUT` | HTTP timeout in seconds | `120` |
+| `HUGGINGFACE_RETURN_TIMESTAMPS` | Request timestamp chunks when supported | `false` |
+
+Keep `.env` local. Do not paste real tokens into PR descriptions, issue
+comments, screenshots, or sample files.
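+
+As a rough illustration of the table above, the settings could be read from the
+environment along these lines. This is a minimal sketch, not the adapter's
+actual code: `HFSettings` and `load_settings` are hypothetical names, and the
+set of strings treated as "true" is an assumption.
+
+```python
+import os
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class HFSettings:
+    """Hypothetical container for the variables listed in the table."""
+
+    token: str
+    provider: str = "hf-inference"
+    api_base: str = "https://router.huggingface.co"
+    timeout: float = 120.0
+    return_timestamps: bool = False
+
+
+def load_settings(env=None):
+    """Read settings from a mapping (defaults to os.environ)."""
+    env = os.environ if env is None else env
+    token = env.get("HF_TOKEN", "").strip()
+    if not token:
+        # HF_TOKEN is the only variable without a default.
+        raise RuntimeError("HF_TOKEN is required")
+    # Assumption: these spellings count as "true"; anything else is false.
+    wants_chunks = env.get("HUGGINGFACE_RETURN_TIMESTAMPS", "false")
+    return HFSettings(
+        token=token,
+        provider=env.get("HUGGINGFACE_PROVIDER", "hf-inference"),
+        api_base=env.get(
+            "HUGGINGFACE_API_BASE", "https://router.huggingface.co"
+        ).rstrip("/"),
+        timeout=float(env.get("HUGGINGFACE_TIMEOUT", "120")),
+        return_timestamps=wants_chunks.strip().lower() in {"1", "true", "yes"},
+    )
+```
+
+Passing a plain dictionary instead of `os.environ` makes the defaults easy to
+check without touching a real token.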
+
+## Step 4: Pick a Model and Provider
+
+The Sapat provider defaults to `openai/whisper-large-v3`, which is a common
+starting point for multilingual ASR experiments. Hugging Face routing lets you
+combine a provider and model in the request URL. The default path is:
+
+```text
+https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3
+```
+
+If your selected provider is different, change only the environment variable:
+
+```bash
+HUGGINGFACE_PROVIDER=fal-ai
+```
+
+Then keep the CLI command the same. This makes comparisons easier because the
+transcription workflow stays stable while the provider setting changes.
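+
+The URL shape shown above can be sketched as a small helper. `build_asr_url` is
+a hypothetical name used only for illustration, and whether a given provider
+accepts this exact path layout is an assumption to verify against the Hugging
+Face documentation.
+
+```python
+def build_asr_url(
+    model,
+    provider="hf-inference",
+    api_base="https://router.huggingface.co",
+):
+    """Compose <api_base>/<provider>/models/<model> (hypothetical helper)."""
+    return f"{api_base.rstrip('/')}/{provider}/models/{model}"
+
+
+# Only the provider segment changes between runs; the model stays fixed.
+default_url = build_asr_url("openai/whisper-large-v3")
+other_url = build_asr_url("openai/whisper-large-v3", provider="fal-ai")
+```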
+
+## Step 5: Run a Safe First Transcription
+
+Copy a short sample into the workspace, for example:
+
+```bash
+mkdir -p samples
+cp ~/Downloads/demo-call.mp4 samples/demo-call.mp4
+```
+
+Run Sapat:
+
+```bash
+sapat samples/demo-call.mp4 \
+  --provider huggingface \
+  --model openai/whisper-large-v3 \
+  --language en \
+  --temperature 0 \
+  --quality M
+```
+
+Sapat will convert the file to the provider's preferred audio format, call the
+Hugging Face adapter, and write `samples/demo-call.txt`. The current adapter
+sends a JSON request with base64-encoded audio and parses either `text` or
+`generated_text` from the response. If timestamp chunks are returned, Sapat
+stores them on the in-memory result object while the CLI still writes the main
+transcript text to disk.
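+
+The request and response handling described above might look roughly like the
+following. This is a sketch under stated assumptions rather than the adapter's
+code: the function names are hypothetical, and the `inputs`, `parameters`, and
+`chunks` field names are taken from the Hugging Face automatic speech
+recognition task docs, not from the Sapat source.
+
+```python
+import base64
+
+
+def build_payload(audio_bytes, return_timestamps=False):
+    """Wrap raw audio in a JSON-serializable body (assumed field names)."""
+    payload = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
+    if return_timestamps:
+        # Only request chunks when the model and provider support them.
+        payload["parameters"] = {"return_timestamps": True}
+    return payload
+
+
+def extract_transcript(response):
+    """Return (text, chunks) from a decoded JSON response."""
+    for key in ("text", "generated_text"):
+        value = response.get(key)
+        if isinstance(value, str):
+            # Chunks are optional; fall back to an empty list.
+            return value.strip(), response.get("chunks") or []
+    raise ValueError("response has neither 'text' nor 'generated_text'")
+```
+
+A response without either key raises an error instead of writing an empty
+transcript, which keeps unexpected response shapes visible.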
+
+Open the transcript:
+
+```bash
+sed -n '1,80p' samples/demo-call.txt
+```
+
+Review the first run for missing sections, an obvious language mismatch, and
+misrecognized names. Names that are consistently wrong may need a different
+model, or a prompt strategy in a later provider-specific adapter.
+
+## Step 6: Keep the Workspace Clean
+
+Before committing anything, check the working tree:
+
+```bash
+git status --short
+```
+
+Do not commit:
+
+- `.env` files
+- source media
+- generated `.mp3`, `.wav`, or `.txt` outputs
+- API responses that contain private transcript text
+- local absolute paths
+
+For repeatable examples, document commands and expected file names rather than
+committing private audio or transcripts.
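+
+The list above can be turned into a quick check over `git status --short`
+output. The sketch below is illustrative only: `flag_risky_paths` is a
+hypothetical helper, the blocked suffixes and the `samples` directory name are
+assumptions based on this guide, and quoted paths or renames are not handled.
+
+```python
+from pathlib import PurePosixPath
+
+# Assumed patterns; adjust them to the repository's real layout.
+BLOCKED_SUFFIXES = {".mp3", ".wav", ".mp4"}
+OUTPUT_DIR = "samples"
+
+
+def flag_risky_paths(status_lines):
+    """Return paths from `git status --short` lines that should stay local."""
+    risky = []
+    for line in status_lines:
+        # Short format: two status columns, one space, then the path.
+        path = line[3:].strip()
+        if not path:
+            continue
+        parsed = PurePosixPath(path)
+        suffix = parsed.suffix.lower()
+        is_env = parsed.name == ".env" or parsed.name.startswith(".env.")
+        is_media = suffix in BLOCKED_SUFFIXES
+        is_transcript = suffix == ".txt" and OUTPUT_DIR in parsed.parts
+        if is_env or is_media or is_transcript:
+            risky.append(path)
+    return risky
+```
+
+Untracked directories are collapsed into a single entry by default, so feed the
+helper the output of `git status --short --untracked-files=all` to see
+individual files.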
+
+## Step 7: Compare Provider Runs
+
+Once the first safe sample works, use the same recording to compare provider
+settings. Keep the source file, Sapat command, language, and model fixed while
+changing one routing variable at a time. That makes the result easier to
+explain in a review or engineering note.
+
+For example, run the default route first:
+
+```bash
+HUGGINGFACE_PROVIDER=hf-inference \
+sapat samples/demo-call.mp4 \
+  --provider huggingface \
+  --model openai/whisper-large-v3 \
+  --language en \
+  --quality M
+```
+
+Then try another routed backend if your Hugging Face account and selected model
+support it:
+
+```bash
+HUGGINGFACE_PROVIDER=fal-ai \
+sapat samples/demo-call.mp4 \
+  --provider huggingface \
+  --model openai/whisper-large-v3 \
+  --language en \
+  --quality M
+```
+
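+Both runs write to `samples/demo-call.txt`, so rename each output immediately
+after the run that produced it and before starting the next one:
+
+```bash
+# Run this right after the hf-inference run.
+mv samples/demo-call.txt samples/demo-call-hf-inference.txt
+
+# Run this right after the fal-ai run.
+mv samples/demo-call.txt samples/demo-call-fal-ai.txt
+```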
+
+Compare the two runs for latency, and the two transcripts for missing words,
+punctuation, speaker names, and domain terms. If the content is sensitive,
+summarize differences in a private note instead of committing the transcripts.
+The useful artifact for a shared repository is the reproducible command, not the
+private recording or its verbatim output.
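+
+One way to summarize differences without reading both files side by side is a
+word-level diff. The sketch below uses Python's standard `difflib`; the function
+names are hypothetical, and a word-level ratio is only a rough signal, not a
+formal word error rate.
+
+```python
+import difflib
+from pathlib import Path
+
+
+def word_changes(reference, candidate):
+    """List words removed ("- ") or added ("+ ") relative to reference."""
+    delta = difflib.ndiff(reference.split(), candidate.split())
+    # Drop unchanged words and the "? " hint lines that ndiff may emit.
+    return [item for item in delta if item.startswith(("- ", "+ "))]
+
+
+def similarity(reference, candidate):
+    """Return a 0..1 similarity ratio over whitespace-separated words."""
+    matcher = difflib.SequenceMatcher(
+        None, reference.split(), candidate.split()
+    )
+    return matcher.ratio()
+
+
+def compare_files(first_path, second_path):
+    """Print a short comparison of two transcript files."""
+    first = Path(first_path).read_text(encoding="utf-8")
+    second = Path(second_path).read_text(encoding="utf-8")
+    print(f"word-level similarity: {similarity(first, second):.3f}")
+    for change in word_changes(first, second):
+        print(change)
+```
+
+Calling `compare_files` with two transcript paths from the same recording
+prints the ratio followed by each differing word. Keep that output private when
+the recording is sensitive.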
+
+## Common Issues and Troubleshooting
+
+**Problem:** Sapat says no providers are available.
+
+**Solution:** The provider registry only enables `huggingface` when `HF_TOKEN`
+is present in the process environment. Confirm `.env` is in the workspace root
+and run the command from that same directory.
+
+**Problem:** The request returns an authentication error.
+
+**Solution:** Regenerate or verify the Hugging Face token. Also confirm that the
+token has access to the selected model and provider.
+
+**Problem:** The request returns a model or provider error.
+
+**Solution:** Check that the model is available through the selected
+`HUGGINGFACE_PROVIDER`. Try the default provider first, then switch providers
+only after the basic run succeeds.
+
+**Problem:** Timestamp chunks are missing.
+
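+**Solution:** Set `HUGGINGFACE_RETURN_TIMESTAMPS=true`, and confirm that the
+selected model and provider support timestamp output. The CLI writes only the
+main transcript text to disk, so chunks never appear in the `.txt` file, and the
+transcript can still be valid when chunks are absent.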
+
+**Problem:** Conversion fails before the request is sent.
+
+**Solution:** Run `ffmpeg -version`. If it is missing, install `ffmpeg` in the
+workspace image and rerun the command.
+
+## When to Build a Dedicated Provider
+
+The Hugging Face adapter is useful when you want a general routed ASR endpoint.
+Build a dedicated Sapat provider instead when you need:
+
+- vendor-specific diarization options
+- custom vocabulary files
+- async job polling
+- per-speaker transcript formatting
+- provider billing metadata in the result
+- a response shape that does not include `text` or `generated_text`
+
+Keeping those cases in separate adapters prevents the generic Hugging Face path
+from becoming hard to reason about.
+
+## Conclusion
+
+You now have a Daytona workspace that can run Sapat against a Hugging Face
+automatic speech recognition model with a clean provider boundary. Daytona
+keeps the environment reproducible, Sapat handles media conversion and transcript
+writing, and the provider adapter contains the Hugging Face router details.
+
+This is a useful pattern for AI engineers who need to compare ASR backends
+without scattering credentials, temporary files, and one-off scripts across
+their host machines. Start with a safe sample, verify the transcript, and then
+decide whether a general routed provider is enough or a richer dedicated
+adapter is worth building.
+
+## References
+
+- [Sapat repository](https://github.com/nibzard/sapat)
+- [Companion Sapat provider pull request](https://github.com/nibzard/sapat/pull/72)
+- [Hugging Face automatic speech recognition task docs](https://huggingface.co/docs/inference-providers/en/tasks/automatic-speech-recognition)
+- [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/en/index)
+- [Daytona documentation](https://www.daytona.io/docs/)
+- [Daytona content issue: AI Transcription Tool](https://github.com/daytonaio/content/issues/13)