diff --git a/authors/haodong_liu.md b/authors/haodong_liu.md new file mode 100644 index 00000000..ba734dc8 --- /dev/null +++ b/authors/haodong_liu.md @@ -0,0 +1,7 @@ +Author: Haodong Liu Title: Software Engineer Description: Haodong Liu builds +practical developer tooling and automation workflows for reproducible +engineering environments. His work focuses on small, testable integrations that +make AI-assisted development pipelines easier to run, audit, and share. Author +Image: Author LinkedIn: Author Twitter: Company +Name: Independent Company Description: Independent software development and +open-source engineering. Company Logo Dark: Company Logo White: diff --git a/definitions/20260626_definition_hugging_face_inference_provider.md b/definitions/20260626_definition_hugging_face_inference_provider.md new file mode 100644 index 00000000..20332235 --- /dev/null +++ b/definitions/20260626_definition_hugging_face_inference_provider.md @@ -0,0 +1,26 @@ +--- +title: "Hugging Face Inference Provider" +description: "A hosted model routing layer for running inference tasks through Hugging Face." +date: 2026-06-26 +author: "Haodong Liu" +--- + +# Hugging Face Inference Provider + +## Definition + +A Hugging Face Inference Provider is a hosted execution backend exposed through +Hugging Face's routed inference API. Developers call one consistent HTTP +interface while selecting the provider and model that should run the task. + +## Context and Usage + +For speech-to-text workflows, an inference provider can run an automatic speech +recognition model such as Whisper from a remote endpoint. Client tools send an +audio payload, authenticate with a Hugging Face token, and receive a transcript +response without hosting model infrastructure in the local workspace. + +Provider routing is useful when teams want to compare model quality, latency, +and cost from a stable client. 
It still requires careful configuration because +providers can differ in model availability, response latency, file size limits, +timestamp support, and billing behavior. diff --git a/guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md b/guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md new file mode 100644 index 00000000..5af945ba --- /dev/null +++ b/guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md @@ -0,0 +1,330 @@ +--- +title: "Run Hugging Face ASR in Daytona" +description: "Use Daytona, Sapat, and Hugging Face Inference Providers to transcribe media in a reproducible workspace." +date: 2026-06-26 +author: "Haodong Liu" +tags: ["daytona", "sapat", "speech-to-text", "hugging-face"] +--- + +# Run Hugging Face ASR in Daytona + +## Introduction + +Speech-to-text experiments become difficult to compare when every provider +needs a different setup, machine image, and local script. One provider might +need a hosted endpoint, another might need a local model, and another might +return timestamps in a different shape. The more those details leak into the +project repository, the harder it is for another engineer to reproduce the run. + +[Daytona](https://www.daytona.io/) is a good fit for this kind of work because +the media tools, Python dependencies, environment variables, and temporary audio +files can live inside a disposable workspace. This guide uses +[Sapat](https://github.com/nibzard/sapat), a Python speech-to-text CLI, with a +new [Hugging Face Inference Provider](../definitions/20260626_definition_hugging_face_inference_provider.md) +adapter. The adapter sends audio to Hugging Face's routed automatic speech +recognition endpoint and writes the returned transcript beside the source file. + +The companion Sapat implementation is available in +[nibzard/sapat#72](https://github.com/nibzard/sapat/pull/72). While the pull +request is under review, use the branch shown below. 
After it is merged, the +same workflow can run from the upstream Sapat repository. + +![Hugging Face ASR workflow with Sapat in Daytona](assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg) + +## TL;DR + +- Create a Daytona workspace from the Sapat branch that adds the Hugging Face + provider. +- Store `HF_TOKEN` and router settings in `.env`; never commit keys or media. +- Run `sapat --provider huggingface` against a short test recording first. +- Use `HUGGINGFACE_PROVIDER` to switch the routed inference backend without + changing the Sapat command. +- Enable timestamp chunks only when the selected model and provider support + them. + +## Prerequisites + +You need the following: + +- Daytona installed and connected to a target that can create workspaces. +- A GitHub account that can clone public repositories. +- A Hugging Face token with access to the model or provider you want to use. +- A short media file you are allowed to transcribe. +- Basic terminal familiarity. + +Use a short, non-confidential recording for the first run. Provider routing, +billing, model availability, and file size limits can vary, so it is better to +validate the path with a safe sample before sending real project recordings. + +## Step 1: Create the Workspace + +Start the Daytona server if it is not already running: + +```bash +daytona server +``` + +Create a workspace from the Sapat fork that contains the companion provider +implementation: + +```bash +daytona create https://github.com/daxia778/sapat --code +``` + +When the editor opens, check out the provider branch: + +```bash +git checkout codex/huggingface-provider +``` + +If the provider has already been merged upstream, use the main repository +instead: + +```bash +daytona create https://github.com/nibzard/sapat --code +``` + +The main benefit is repeatability. 
Everyone who opens this workspace gets the +same repository state, can install the same Python package, and can keep +provider configuration out of the host machine. + +## Step 2: Install Sapat and Check Media Tools + +Install Sapat in editable mode inside the workspace: + +```bash +python -m pip install -e . +``` + +Sapat converts input media before it calls a provider, so `ffmpeg` must be +available: + +```bash +ffmpeg -version +``` + +If the command is missing in a Debian-based workspace image, install it: + +```bash +sudo apt-get update +sudo apt-get install -y ffmpeg +``` + +Run a quick CLI check: + +```bash +sapat --version +``` + +At this point the workspace has the CLI entry point, provider registry, and +conversion tool needed for a Hugging Face transcription run. + +## Step 3: Configure Hugging Face + +Create a `.env` file in the workspace root: + +```bash +cat > .env <<'EOF' +HF_TOKEN=replace-with-your-token +HUGGINGFACE_PROVIDER=hf-inference +HUGGINGFACE_RETURN_TIMESTAMPS=false +EOF +``` + +The provider reads these settings: + +| Variable | Purpose | Default | +| --- | --- | --- | +| `HF_TOKEN` | Bearer token for the router request | Required | +| `HUGGINGFACE_PROVIDER` | Routed backend segment in the Hugging Face URL | `hf-inference` | +| `HUGGINGFACE_API_BASE` | Router base URL for advanced deployments | `https://router.huggingface.co` | +| `HUGGINGFACE_TIMEOUT` | HTTP timeout in seconds | `120` | +| `HUGGINGFACE_RETURN_TIMESTAMPS` | Request timestamp chunks when supported | `false` | + +Keep `.env` local. Do not paste real tokens into PR descriptions, issue +comments, screenshots, or sample files. + +## Step 4: Pick a Model and Provider + +The Sapat provider defaults to `openai/whisper-large-v3`, which is a common +starting point for multilingual ASR experiments. Hugging Face routing lets you +combine a provider and model in the request URL. 
The default path is: + +```text +https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3 +``` + +If your selected provider is different, change only the environment variable: + +```bash +HUGGINGFACE_PROVIDER=fal-ai +``` + +Then keep the CLI command the same. This makes comparisons easier because the +transcription workflow stays stable while the provider setting changes. + +## Step 5: Run a Safe First Transcription + +Copy a short sample into the workspace, for example: + +```bash +mkdir -p samples +cp ~/Downloads/demo-call.mp4 samples/demo-call.mp4 +``` + +Run Sapat: + +```bash +sapat samples/demo-call.mp4 \ + --provider huggingface \ + --model openai/whisper-large-v3 \ + --language en \ + --temperature 0 \ + --quality M +``` + +Sapat will convert the file to the provider's preferred audio format, call the +Hugging Face adapter, and write `samples/demo-call.txt`. The current adapter +sends a JSON request with base64-encoded audio and parses either `text` or +`generated_text` from the response. If timestamp chunks are returned, Sapat +stores them on the in-memory result object while the CLI still writes the main +transcript text to disk. + +Open the transcript: + +```bash +sed -n '1,80p' samples/demo-call.txt +``` + +Review the first run for missing sections, obvious language mismatch, and names +that need a different model or prompt strategy in a later provider-specific +adapter. + +## Step 6: Keep the Workspace Clean + +Before committing anything, check the working tree: + +```bash +git status --short +``` + +Do not commit: + +- `.env` files +- source media +- generated `.mp3`, `.wav`, or `.txt` outputs +- API responses that contain private transcript text +- local absolute paths + +For repeatable examples, document commands and expected file names rather than +committing private audio or transcripts. + +## Step 7: Compare Provider Runs + +Once the first safe sample works, use the same recording to compare provider +settings. 
Keep the source file, Sapat command, language, and model fixed while +changing one routing variable at a time. That makes the result easier to +explain in a review or engineering note. + +For example, run the default route first: + +```bash +HUGGINGFACE_PROVIDER=hf-inference \ +sapat samples/demo-call.mp4 \ + --provider huggingface \ + --model openai/whisper-large-v3 \ + --language en \ + --quality M +``` + +Then try another routed backend if your Hugging Face account and selected model +support it: + +```bash +HUGGINGFACE_PROVIDER=fal-ai \ +sapat samples/demo-call.mp4 \ + --provider huggingface \ + --model openai/whisper-large-v3 \ + --language en \ + --quality M +``` + +Rename each output right after its run, because both runs write `samples/demo-call.txt`: + +```bash +mv samples/demo-call.txt samples/demo-call-hf-inference.txt  # after the hf-inference run +mv samples/demo-call.txt samples/demo-call-fal-ai.txt        # after the fal-ai run +``` + +Compare the runs for latency and the transcripts for missing words, punctuation, +speaker names, and domain terms. If the content is sensitive, summarize differences in +a private note instead of committing the transcripts. The useful artifact for a +shared repository is the reproducible command, not the private recording or its +verbatim output. + +## Common Issues and Troubleshooting + +**Problem:** Sapat says no providers are available. + +**Solution:** The provider registry only enables `huggingface` when `HF_TOKEN` +is present in the process environment. Confirm `.env` is in the workspace root +and run the command from that same directory. + +**Problem:** The request returns an authentication error. + +**Solution:** Regenerate or verify the Hugging Face token. Also confirm that the +token has access to the selected model and provider. + +**Problem:** The request returns a model or provider error. + +**Solution:** Check that the model is available through the selected +`HUGGINGFACE_PROVIDER`. Try the default provider first, then switch providers +only after the basic run succeeds. + +**Problem:** Timestamp chunks are missing. 
+ +**Solution:** Set `HUGGINGFACE_RETURN_TIMESTAMPS=true` only for models and +providers that support timestamp output. The main transcript can still be valid +when chunks are absent. + +**Problem:** Conversion fails before the request is sent. + +**Solution:** Run `ffmpeg -version`. If it is missing, install `ffmpeg` in the +workspace image and rerun the command. + +## When to Build a Dedicated Provider + +The Hugging Face adapter is useful when you want a general routed ASR endpoint. +Build a dedicated Sapat provider instead when you need: + +- vendor-specific diarization options +- custom vocabulary files +- async job polling +- per-speaker transcript formatting +- provider billing metadata in the result +- a response shape that does not include `text` or `generated_text` + +Keeping those cases in separate adapters prevents the generic Hugging Face path +from becoming hard to reason about. + +## Conclusion + +You now have a Daytona workspace that can run Sapat against a Hugging Face +automatic speech recognition model with a clean provider boundary. Daytona +keeps the environment reproducible, Sapat handles media conversion and transcript +writing, and the provider adapter contains the Hugging Face router details. + +This is a useful pattern for AI engineers who need to compare ASR backends +without scattering credentials, temporary files, and one-off scripts across +their host machines. Start with a safe sample, verify the transcript, and then +decide whether a general routed provider is enough or a richer dedicated +adapter is worth building. 
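As a closing reference, the request that Steps 4 and 5 describe (a router URL built from provider and model, a JSON body carrying base64-encoded audio, and a response read from `text` or `generated_text`) can be sketched in a few lines of Python. This is a minimal illustration, not the adapter's actual code: the function names are hypothetical, and the `inputs` and `parameters.return_timestamps` field names are an assumption taken from the Hugging Face automatic speech recognition task docs rather than from the Sapat pull request.

```python
import base64

# Default router base URL, as listed in the Step 3 settings table.
DEFAULT_API_BASE = "https://router.huggingface.co"


def build_router_url(provider="hf-inference",
                     model="openai/whisper-large-v3",
                     base=DEFAULT_API_BASE):
    """Combine base URL, provider segment, and model ID into one route."""
    return f"{base.rstrip('/')}/{provider}/models/{model}"


def build_payload(audio_bytes, return_timestamps=False):
    """Build a JSON-serializable body with base64-encoded audio.

    The "inputs" and "parameters" keys are assumed field names.
    """
    payload = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
    if return_timestamps:
        payload["parameters"] = {"return_timestamps": True}
    return payload


def extract_transcript(response):
    """Return the transcript from either supported response key."""
    for key in ("text", "generated_text"):
        value = response.get(key)
        if isinstance(value, str):
            return value
    raise ValueError("response has neither 'text' nor 'generated_text'")


if __name__ == "__main__":
    print(build_router_url(provider="fal-ai"))
    print(extract_transcript({"text": "hello world"}))
```

Keeping URL construction, payload construction, and response parsing as separate pure functions makes each piece easy to test without a token or a network call, which fits the guide's advice to validate the path before sending real recordings.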
+ +## References + +- [Sapat repository](https://github.com/nibzard/sapat) +- [Companion Sapat provider pull request](https://github.com/nibzard/sapat/pull/72) +- [Hugging Face automatic speech recognition task docs](https://huggingface.co/docs/inference-providers/en/tasks/automatic-speech-recognition) +- [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/en/index) +- [Daytona documentation](https://www.daytona.io/docs/) +- [Daytona content issue: AI Transcription Tool](https://github.com/daytonaio/content/issues/13) diff --git a/guides/assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg b/guides/assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg new file mode 100644 index 00000000..f43ad48e --- /dev/null +++ b/guides/assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg @@ -0,0 +1,41 @@ + + Hugging Face ASR workflow with Sapat in Daytona + Diagram showing a media file processed in a Daytona workspace by Sapat and sent to a Hugging Face Inference Provider for speech recognition. + + + Hugging Face ASR in a Daytona workspace + Keep media conversion, Sapat provider code, and inference credentials in one reproducible sandbox. + + + Audio or video + demo.mp4, call.wav + + + Daytona workspace + Sapat converts media, loads .env, + and calls the provider adapter. + + --provider huggingface + + + HF router + provider + model + transcript JSON + + + + + + convert + POST JSON + transcript.txt is written next to the source file + + + Configuration: HF_TOKEN, HUGGINGFACE_PROVIDER, HUGGINGFACE_RETURN_TIMESTAMPS + + + + + + +