7 changes: 7 additions & 0 deletions authors/haodong_liu.md
@@ -0,0 +1,7 @@
Author: Haodong Liu
Title: Software Engineer
Description: Haodong Liu builds practical developer tooling and automation workflows for reproducible engineering environments. His work focuses on small, testable integrations that make AI-assisted development pipelines easier to run, audit, and share.
Author Image: <https://github.com/daxia778.png>
Author LinkedIn:
Author Twitter:
Company Name: Independent
Company Description: Independent software development and open-source engineering.
Company Logo Dark:
Company Logo White:
26 changes: 26 additions & 0 deletions definitions/20260626_definition_hugging_face_inference_provider.md
@@ -0,0 +1,26 @@
---
title: "Hugging Face Inference Provider"
description: "A hosted model routing layer for running inference tasks through Hugging Face."
date: 2026-06-26
author: "Haodong Liu"
---

# Hugging Face Inference Provider

## Definition

A Hugging Face Inference Provider is a hosted execution backend exposed through
Hugging Face's routed inference API. Developers call one consistent HTTP
interface while selecting the provider and model that should run the task.

## Context and Usage

For speech-to-text workflows, an inference provider can run an automatic speech
recognition model such as Whisper from a remote endpoint. Client tools send an
audio payload, authenticate with a Hugging Face token, and receive a transcript
response without hosting model infrastructure in the local workspace.

Provider routing is useful when teams want to compare model quality, latency,
and cost from a stable client. It still requires careful configuration because
providers can differ in model availability, response latency, file size limits,
timestamp support, and billing behavior.
330 changes: 330 additions & 0 deletions guides/20260626_run_hugging_face_asr_with_sapat_in_daytona.md
@@ -0,0 +1,330 @@
---
title: "Run Hugging Face ASR in Daytona"
description: "Use Daytona, Sapat, and Hugging Face Inference Providers to transcribe media in a reproducible workspace."
date: 2026-06-26
author: "Haodong Liu"
tags: ["daytona", "sapat", "speech-to-text", "hugging-face"]
---

# Run Hugging Face ASR in Daytona

## Introduction

Speech-to-text experiments become difficult to compare when every provider
needs a different setup, machine image, and local script. One provider might
need a hosted endpoint, another might need a local model, and another might
return timestamps in a different shape. The more those details leak into the
project repository, the harder it is for another engineer to reproduce the run.

[Daytona](https://www.daytona.io/) is a good fit for this kind of work because
the media tools, Python dependencies, environment variables, and temporary audio
files can live inside a disposable workspace. This guide uses
[Sapat](https://github.com/nibzard/sapat), a Python speech-to-text CLI, with a
new [Hugging Face Inference Provider](../definitions/20260626_definition_hugging_face_inference_provider.md)
adapter. The adapter sends audio to Hugging Face's routed automatic speech
recognition endpoint and writes the returned transcript beside the source file.

The companion Sapat implementation is available in
[nibzard/sapat#72](https://github.com/nibzard/sapat/pull/72). While the pull
request is under review, use the branch shown below. After it is merged, the
same workflow can run from the upstream Sapat repository.

![Hugging Face ASR workflow with Sapat in Daytona](assets/20260626_run_hugging_face_asr_with_sapat_in_daytona_img1.svg)

## TL;DR

- Create a Daytona workspace from the Sapat branch that adds the Hugging Face
provider.
- Store `HF_TOKEN` and router settings in `.env`; never commit keys or media.
- Run `sapat --provider huggingface` against a short test recording first.
- Use `HUGGINGFACE_PROVIDER` to switch the routed inference backend without
changing the Sapat command.
- Enable timestamp chunks only when the selected model and provider support
them.

## Prerequisites

You need the following:

- Daytona installed and connected to a target that can create workspaces.
- A GitHub account that can clone public repositories.
- A Hugging Face token with access to the model or provider you want to use.
- A short media file you are allowed to transcribe.
- Basic terminal familiarity.

Use a short, non-confidential recording for the first run. Provider routing,
billing, model availability, and file size limits can vary, so it is better to
validate the path with a safe sample before sending real project recordings.

## Step 1: Create the Workspace

Start the Daytona server if it is not already running:

```bash
daytona server
```

Create a workspace from the Sapat fork that contains the companion provider
implementation:

```bash
daytona create https://github.com/daxia778/sapat --code
```

When the editor opens, check out the provider branch:

```bash
git checkout codex/huggingface-provider
```

If the provider has already been merged upstream, use the main repository
instead:

```bash
daytona create https://github.com/nibzard/sapat --code
```

The main benefit is repeatability. Everyone who opens this workspace gets the
same repository state, installs the same Python package, and keeps provider
configuration off the host machine.

## Step 2: Install Sapat and Check Media Tools

Install Sapat in editable mode inside the workspace:

```bash
python -m pip install -e .
```

Sapat converts input media before it calls a provider, so `ffmpeg` must be
available:

```bash
ffmpeg -version
```

If the command is missing in a Debian-based workspace image, install it:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

Run a quick CLI check:

```bash
sapat --version
```

At this point the workspace has the CLI entry point, provider registry, and
conversion tool needed for a Hugging Face transcription run.

## Step 3: Configure Hugging Face

Create a `.env` file in the workspace root:

```bash
cat > .env <<'EOF'
HF_TOKEN=replace-with-your-token
HUGGINGFACE_PROVIDER=hf-inference
HUGGINGFACE_RETURN_TIMESTAMPS=false
EOF
```

The provider reads these settings:

| Variable | Purpose | Default |
| --- | --- | --- |
| `HF_TOKEN` | Bearer token for the router request | Required |
| `HUGGINGFACE_PROVIDER` | Routed backend segment in the Hugging Face URL | `hf-inference` |
| `HUGGINGFACE_API_BASE` | Router base URL for advanced deployments | `https://router.huggingface.co` |
| `HUGGINGFACE_TIMEOUT` | HTTP timeout in seconds | `120` |
| `HUGGINGFACE_RETURN_TIMESTAMPS` | Request timestamp chunks when supported | `false` |

Keep `.env` local. Do not paste real tokens into PR descriptions, issue
comments, screenshots, or sample files.
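
As a rough illustration of how an adapter might read the settings in the table above, the following Python sketch applies the documented defaults to an environment mapping. The `HuggingFaceSettings` class and `load_settings` function are hypothetical names, not Sapat's actual code, and the accepted truthy spellings for the timestamp flag are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HuggingFaceSettings:
    """Hypothetical container for the variables listed in the table."""

    token: str
    provider: str = "hf-inference"
    api_base: str = "https://router.huggingface.co"
    timeout: float = 120.0
    return_timestamps: bool = False


def load_settings(env: Optional[Mapping[str, str]] = None) -> HuggingFaceSettings:
    """Build settings from an environment mapping, using the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # HF_TOKEN is the only variable without a default.
        raise ValueError("HF_TOKEN is required")
    # Assumption: these spellings count as "true" for the timestamp flag.
    wants_timestamps = (
        env.get("HUGGINGFACE_RETURN_TIMESTAMPS", "false").strip().lower()
        in {"1", "true", "yes"}
    )
    return HuggingFaceSettings(
        token=token,
        provider=env.get("HUGGINGFACE_PROVIDER", "hf-inference"),
        api_base=env.get("HUGGINGFACE_API_BASE", "https://router.huggingface.co").rstrip("/"),
        timeout=float(env.get("HUGGINGFACE_TIMEOUT", "120")),
        return_timestamps=wants_timestamps,
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in a unit test without touching real credentials.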

## Step 4: Pick a Model and Provider

The Sapat provider defaults to `openai/whisper-large-v3`, which is a common
starting point for multilingual ASR experiments. Hugging Face routing lets you
combine a provider and model in the request URL. The default path is:

```text
https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3
```

If your selected provider is different, change only the `HUGGINGFACE_PROVIDER`
line in `.env`:

```bash
HUGGINGFACE_PROVIDER=fal-ai
```

Then keep the CLI command the same. This makes comparisons easier because the
transcription workflow stays stable while the provider setting changes.
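
The default path above suggests that the request URL is composed from the router base URL, the provider segment, and the model ID. Here is a minimal Python sketch of that composition. It assumes every provider follows the same `/<provider>/models/<model>` pattern as the default path, which may not hold for all routed backends. `build_router_url` is a hypothetical helper, not a function from Sapat or a Hugging Face library.

```python
def build_router_url(api_base: str, provider: str, model: str) -> str:
    """Join base URL, provider segment, and model ID into one endpoint URL.

    Assumption: the path shape mirrors the default route shown in this guide.
    """
    base = api_base.rstrip("/")
    return f"{base}/{provider.strip('/')}/models/{model.strip('/')}"


if __name__ == "__main__":
    print(build_router_url("https://router.huggingface.co", "hf-inference", "openai/whisper-large-v3"))
```

If a provider rejects the composed URL, check its route in the Hugging Face documentation rather than assuming the pattern carries over.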

## Step 5: Run a Safe First Transcription

Copy a short sample into the workspace, for example:

```bash
mkdir -p samples
cp ~/Downloads/demo-call.mp4 samples/demo-call.mp4
```

Run Sapat:

```bash
sapat samples/demo-call.mp4 \
--provider huggingface \
--model openai/whisper-large-v3 \
--language en \
--temperature 0 \
--quality M
```

Sapat will convert the file to the provider's preferred audio format, call the
Hugging Face adapter, and write `samples/demo-call.txt`. The current adapter
sends a JSON request with base64-encoded audio and parses either `text` or
`generated_text` from the response. If timestamp chunks are returned, Sapat
stores them on the in-memory result object while the CLI still writes the main
transcript text to disk.
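
To make the request and response handling described above more concrete, here is a hedged Python sketch. The `inputs` and `parameters.return_timestamps` field names follow the Hugging Face automatic speech recognition task documentation as best understood here, so treat them as assumptions and verify them against the adapter. `build_payload` and `parse_response` are hypothetical helper names, not Sapat's actual functions.

```python
import base64
import json
from typing import Any, Dict, List, Optional, Tuple


def build_payload(audio: bytes, return_timestamps: bool = False) -> str:
    """Serialize audio bytes into a JSON body with base64-encoded input."""
    body: Dict[str, Any] = {"inputs": base64.b64encode(audio).decode("ascii")}
    if return_timestamps:
        # Only request chunks when the caller explicitly opts in.
        body["parameters"] = {"return_timestamps": True}
    return json.dumps(body)


def parse_response(data: Dict[str, Any]) -> Tuple[str, Optional[List[Dict[str, Any]]]]:
    """Return (transcript, chunks), accepting `text` or `generated_text`."""
    text = data.get("text")
    if text is None:
        text = data.get("generated_text")
    if not isinstance(text, str):
        raise ValueError("response has neither 'text' nor 'generated_text'")
    raw_chunks = data.get("chunks")
    chunks = raw_chunks if isinstance(raw_chunks, list) else None
    return text.strip(), chunks
```

Returning the chunks separately from the text mirrors the behavior described above: the transcript goes to disk, while timestamp data stays available to callers that want it.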

Open the transcript:

```bash
sed -n '1,80p' samples/demo-call.txt
```

Review the first transcript for missing sections, obvious language mismatches,
and misrecognized names. Names that are consistently wrong may need a different
model, or a prompt strategy in a later provider-specific adapter.

## Step 6: Keep the Workspace Clean

Before committing anything, check the working tree:

```bash
git status --short
```

Do not commit:

- `.env` files
- source media
- generated `.mp3`, `.wav`, or `.txt` outputs
- API responses that contain private transcript text
- local absolute paths

For repeatable examples, document commands and expected file names rather than
committing private audio or transcripts.
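
The checklist above can be partly automated. The following Python sketch scans `git status --short` output and flags paths that look like secrets, media, or transcripts. The extension list and the rule that `.txt` files only count under `samples/` are assumptions made for this example. `risky_paths` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumption: media extensions likely to appear in this workflow.
MEDIA_SUFFIXES = {".mp3", ".wav", ".mp4", ".m4a", ".mov"}


def risky_paths(status_lines: Iterable[str]) -> List[str]:
    """Return paths from `git status --short` lines that should stay uncommitted."""
    flagged = []
    for line in status_lines:
        if len(line) < 4:
            continue
        # Short-format lines are two status characters, a space, then the path.
        # For renames ("old -> new"), keep the new path.
        raw = line[3:].split(" -> ")[-1].strip().strip('"')
        path = PurePosixPath(raw)
        is_env = path.name == ".env" or path.name.startswith(".env.")
        is_media = path.suffix.lower() in MEDIA_SUFFIXES
        # Assumption: generated transcripts live under a samples/ directory.
        is_transcript = path.suffix.lower() == ".txt" and "samples" in path.parts
        if is_env or is_media or is_transcript:
            flagged.append(str(path))
    return flagged
```

Git collapses untracked directories into a single entry by default, so feed this function the output of `git status --short -uall` if individual files inside a new directory need to be listed.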

## Step 7: Compare Provider Runs

Once the first safe sample works, use the same recording to compare provider
settings. Keep the source file, Sapat command, language, and model fixed while
changing one routing variable at a time. That makes the result easier to
explain in a review or engineering note.

For example, run the default route first:

```bash
HUGGINGFACE_PROVIDER=hf-inference \
sapat samples/demo-call.mp4 \
--provider huggingface \
--model openai/whisper-large-v3 \
--language en \
--quality M
```

Then try another routed backend if your Hugging Face account and selected model
support it:

```bash
HUGGINGFACE_PROVIDER=fal-ai \
sapat samples/demo-call.mp4 \
--provider huggingface \
--model openai/whisper-large-v3 \
--language en \
--quality M
```

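Rename each output immediately after the run that produced it, before starting
the next run, so the second transcript does not overwrite the first:

```bash
# After the hf-inference run:
mv samples/demo-call.txt samples/demo-call-hf-inference.txt
# After the fal-ai run:
mv samples/demo-call.txt samples/demo-call-fal-ai.txt
```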

Compare the two runs for latency, and the two transcripts for missing words,
punctuation, speaker names, and domain terms. If the content is sensitive,
summarize differences in a private note instead of committing the transcripts.
The useful artifact for a shared repository is the reproducible command, not
the private recording or its verbatim output.
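
For the transcript side of the comparison, a word-level diff is often easier to read than a line diff, because ASR output rarely has stable line breaks. The sketch below uses Python's standard `difflib` module. `word_diff` and `similarity` are hypothetical helper names, and whitespace tokenization is a simplifying assumption that treats punctuation as part of the adjacent word.

```python
import difflib
from typing import List, Tuple


def word_diff(a: str, b: str) -> List[Tuple[str, str, str]]:
    """Return (operation, words_from_a, words_from_b) for each differing span."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return changes


def similarity(a: str, b: str) -> float:
    """Return a 0..1 word-sequence similarity ratio between two transcripts."""
    return difflib.SequenceMatcher(a=a.split(), b=b.split(), autojunk=False).ratio()


if __name__ == "__main__":
    first = "please call Daytona support today"
    second = "please call day tona support today"
    for change in word_diff(first, second):
        print(change)
```

The list of changed spans can be summarized in a review note without copying the full transcripts into the repository.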

## Common Issues and Troubleshooting

**Problem:** Sapat says no providers are available.

**Solution:** The provider registry only enables `huggingface` when `HF_TOKEN`
is present in the process environment. Confirm `.env` is in the workspace root
and run the command from that same directory.

**Problem:** The request returns an authentication error.

**Solution:** Regenerate or verify the Hugging Face token. Also confirm that the
token has access to the selected model and provider.

**Problem:** The request returns a model or provider error.

**Solution:** Check that the model is available through the selected
`HUGGINGFACE_PROVIDER`. Try the default provider first, then switch providers
only after the basic run succeeds.

**Problem:** Timestamp chunks are missing.

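**Solution:** Set `HUGGINGFACE_RETURN_TIMESTAMPS=true` only for models and
providers that support timestamp output. Even when chunks are returned, Sapat
keeps them on the in-memory result object and the CLI writes only the transcript
text to disk, so they will not appear in the `.txt` file. The main transcript
can still be valid when chunks are absent.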

**Problem:** Conversion fails before the request is sent.

**Solution:** Run `ffmpeg -version`. If it is missing, install `ffmpeg` in the
workspace image and rerun the command.

## When to Build a Dedicated Provider

The Hugging Face adapter is useful when you want a general routed ASR endpoint.
Build a dedicated Sapat provider instead when you need:

- vendor-specific diarization options
- custom vocabulary files
- async job polling
- per-speaker transcript formatting
- provider billing metadata in the result
- a response shape that does not include `text` or `generated_text`

Keeping those cases in separate adapters prevents the generic Hugging Face path
from becoming hard to reason about.

## Conclusion

You now have a Daytona workspace that can run Sapat against a Hugging Face
automatic speech recognition model with a clean provider boundary. Daytona
keeps the environment reproducible, Sapat handles media conversion and transcript
writing, and the provider adapter contains the Hugging Face router details.

This is a useful pattern for AI engineers who need to compare ASR backends
without scattering credentials, temporary files, and one-off scripts across
their host machines. Start with a safe sample, verify the transcript, and then
decide whether a general routed provider is enough or a richer dedicated
adapter is worth building.

## References

- [Sapat repository](https://github.com/nibzard/sapat)
- [Companion Sapat provider pull request](https://github.com/nibzard/sapat/pull/72)
- [Hugging Face automatic speech recognition task docs](https://huggingface.co/docs/inference-providers/en/tasks/automatic-speech-recognition)
- [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/en/index)
- [Daytona documentation](https://www.daytona.io/docs/)
- [Daytona content issue: AI Transcription Tool](https://github.com/daytonaio/content/issues/13)