From ca16f767f43c1fb4a0fe7412f348089124f3962e Mon Sep 17 00:00:00 2001
From: "Rohit M."
Date: Fri, 19 Jun 2026 19:54:12 +0400
Subject: [PATCH] Add SAPAT provider benchmarking guide

---
 authors/rohit_m.md | 10 +
 ...definition_speech_to_text_transcription.md | 23 ++
 ...apat_transcription_providers_in_daytona.md | 359 ++++++++++++++++++
 ...cription_providers_in_daytona_workflow.svg | 37 ++
 4 files changed, 429 insertions(+)
 create mode 100644 authors/rohit_m.md
 create mode 100644 definitions/20260619_definition_speech_to_text_transcription.md
 create mode 100644 guides/20260619_benchmark_sapat_transcription_providers_in_daytona.md
 create mode 100644 guides/assets/20260619_benchmark_sapat_transcription_providers_in_daytona_workflow.svg

diff --git a/authors/rohit_m.md b/authors/rohit_m.md
new file mode 100644
index 00000000..22d3e76d
--- /dev/null
+++ b/authors/rohit_m.md
@@ -0,0 +1,10 @@
+Author: Rohit M.
+Title: Technical Writer and Builder
+Description: Rohit writes practical, hands-on developer guides focused on clear setup steps, reproducible workflows, and useful troubleshooting notes. He enjoys turning messy tool setup into simple instructions that engineers can follow without guesswork.
+Author Image: ![rohit-m](https://github.com/rohitmulani63-ops.png)
+Author LinkedIn:
+Author Twitter:
+Company Name: Independent
+Company Description: Independent contributor focused on practical developer tooling guides.
+Company Logo Dark:
+Company Logo White:
\ No newline at end of file
diff --git a/definitions/20260619_definition_speech_to_text_transcription.md b/definitions/20260619_definition_speech_to_text_transcription.md
new file mode 100644
index 00000000..c9db636a
--- /dev/null
+++ b/definitions/20260619_definition_speech_to_text_transcription.md
@@ -0,0 +1,23 @@
+---
+title: "Speech-to-Text Transcription"
+description: "The process of converting spoken audio into written text with software."
+date: 2026-06-19
+author: "Rohit M."
+---
+
+# Speech-to-Text Transcription
+
+## Definition
+
+Speech-to-text transcription is the process of converting spoken words from an
+audio or video recording into written text. It can be performed by cloud APIs,
+local machine-learning models, or hybrid workflows that extract audio first and
+then send it to a transcription engine.
+
+## Context and Usage
+
+In development workflows, speech-to-text transcription is often used to turn
+demos, meetings, interviews, support calls, lectures, and screen recordings into
+searchable notes. Tools such as SAPAT combine media processing with provider
+APIs so engineers can convert videos into transcript files from the command
+line.
diff --git a/guides/20260619_benchmark_sapat_transcription_providers_in_daytona.md b/guides/20260619_benchmark_sapat_transcription_providers_in_daytona.md
new file mode 100644
index 00000000..8c2ad19a
--- /dev/null
+++ b/guides/20260619_benchmark_sapat_transcription_providers_in_daytona.md
@@ -0,0 +1,359 @@
+---
+title: "Benchmark SAPAT Transcription Providers in Daytona"
+description: "Compare SAPAT transcriptions from OpenAI, Groq, and Azure OpenAI inside a repeatable Daytona workspace."
+date: 2026-06-19
+author: "Rohit M."
+tags: ["daytona", "sapat", "transcription", "benchmarking", "ai"]
+---
+
+# Benchmark SAPAT Transcription Providers in Daytona
+
+SAPAT is a Python command-line tool that turns video files into written
+transcripts. It extracts audio with `ffmpeg`, sends that audio to a supported
+speech-to-text provider, and writes a `.txt` transcript beside the input file.
+The repository currently exposes three CLI provider choices: `openai`, `groq`,
+and `azure`.
+
+A simple setup guide is useful, but AI engineers usually need one step more:
+they need to know which provider is best for their recordings. One provider may
+be faster for short demos. Another may handle accents or noisy audio better. A
+team already on Azure may care more about operational fit and data governance
+than raw speed. This guide shows how to benchmark SAPAT providers inside a
+Daytona workspace so the comparison is repeatable, safe, and easy to review.
+
+The workflow below uses the same source clip, the same prompt, and the same
+quality setting across providers. You will produce separate transcripts, inspect
+basic quality signals, and fill in a lightweight scorecard. The goal is not to
+claim that one provider is always best. The goal is to create a clean Daytona
+workflow that lets your team decide using its own audio and requirements.
+
+## TL;DR
+
+- Create a Daytona workspace from the SAPAT repository.
+- Install `ffmpeg`, Python dependencies, and the SAPAT wheel.
+- Configure only the provider keys you plan to test in `.env`.
+- Run the same `.mp4` through `openai`, `groq`, and `azure` where available.
+- Save each transcript separately so results do not overwrite each other.
+- Compare accuracy, terminology, latency, setup effort, and operational fit.
+- Keep API keys, private recordings, and generated transcripts out of Git.
+
+## Materials checklist
+
+You need the following before starting:
+
+- Daytona installed and connected to your preferred IDE.
+- Python available in the Daytona workspace.
+- `ffmpeg` installed inside the workspace.
+- One short `.mp4` sample that is safe to use for testing.
+- API access for at least one of OpenAI, Groq, or Azure OpenAI.
+- A basic understanding of [APIs](../definitions/20241212_definition_api.md),
+  [environment variables](../definitions/20241126_definition_environment_variables.md),
+  and [speech-to-text transcription](../definitions/20260619_definition_speech_to_text_transcription.md).
+
+## Benchmark workflow overview
+
+![SAPAT provider benchmark workflow](assets/20260619_benchmark_sapat_transcription_providers_in_daytona_workflow.svg)
+
+The benchmark has four stages: prepare one clean input clip, run SAPAT with each
+provider, save each transcript under a provider-specific name, and review the
+outputs with the same scorecard.
+
+| Stage | What you do | Why it matters |
+| --- | --- | --- |
+| Prepare | Choose one representative `.mp4` file. | Every provider receives the same input. |
+| Run | Use SAPAT with one `--api` value at a time. | Each transcript has a clear source. |
+| Preserve | Rename outputs into a `transcripts/` folder. | Results do not overwrite each other. |
+| Review | Compare quality, speed, and setup effort. | The final choice is evidence-based. |
+
+## Step 1: Create the Daytona workspace
+
+Create a workspace from the current SAPAT repository:
+
+```bash
+daytona create https://github.com/nibzard/sapat --code
+```
+
+Some older references use `https://github.com/nkkko/sapat`. GitHub currently
+redirects that repository to `nibzard/sapat`, so this guide uses the current
+repository URL directly.
+
+When Daytona opens the project, stay at the repository root. You should see
+files such as `README.md`, `requirements.txt`, `.env.example`, and the
+`src/sapat` package directory.
+
+## Step 2: Install dependencies
+
+SAPAT needs `ffmpeg` to extract audio from video files before transcription. In
+a Debian or Ubuntu-based Daytona workspace, install it with:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+Confirm that it is available:
+
+```bash
+ffmpeg -version
+```
+
+Install the Python dependencies and build the package:
+
+```bash
+python -m pip install --upgrade pip
+python -m pip install -r requirements.txt
+python -m build
+python -m pip install dist/sapat-*.whl
+```
+
+If your shell does not expand `dist/sapat-*.whl`, list the `dist` folder and
+install the exact wheel filename.
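Before configuring providers, it can help to confirm that the tools installed in Step 2 are actually on the workspace `PATH`. The sketch below is an illustration only and not part of SAPAT: the function name `check_tools` is hypothetical, and it relies solely on the POSIX `command -v` lookup.

```shell
#!/bin/sh
# Hypothetical helper (not part of SAPAT): report which of the given
# commands cannot be found on PATH inside the workspace.
check_tools() {
  missing=""
  for tool in "$@"; do
    # command -v exits non-zero when the command cannot be resolved.
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all tools found"
}
```

A call such as `check_tools ffmpeg python sapat` should print `all tools found` once Step 2 has succeeded, assuming the wheel installs a `sapat` entry point as the later commands in this guide imply.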
+
+## Step 3: Configure only the providers you will test
+
+Copy the environment template:
+
+```bash
+cp .env.example .env
+```
+
+Then add only the credentials you need. Do not paste every possible key into the
+file. For OpenAI, use:
+
+```bash
+OPENAI_API_KEY=your_openai_api_key_here
+```
+
+For Groq, use:
+
+```bash
+GROQ_API_KEY=your_groq_api_key_here
+```
+
+For Azure OpenAI, use:
+
+```bash
+AZURE_OPENAI_API_KEY=your_azure_api_key_here
+AZURE_OPENAI_ENDPOINT=https://DEPLOYMENTENDPOINTNAME.openai.azure.com
+AZURE_OPENAI_STT_MODEL_NAME=whisper
+AZURE_OPENAI_STT_API_VERSION=2024-06-01
+```
+
+SAPAT's `.env.example` lists many possible provider names, but the current CLI
+path should be treated as `openai`, `groq`, and `azure` unless new provider code
+has been added. This distinction keeps your guide accurate and avoids promising
+support that is not active in the CLI yet.
+
+**Note:** Never commit `.env`. Keep real keys, private recordings, generated
+transcripts, and payout details out of the content repository.
+
+## Step 4: Prepare a repeatable sample folder
+
+Use one short test video that represents the type of recording your team cares
+about. For example, a product demo, design review, support clip, or narrated
+screen recording.
+
+Create folders for inputs and outputs:
+
+```bash
+mkdir -p samples transcripts
+```
+
+Copy your test file into `samples`:
+
+```bash
+cp ~/Downloads/demo.mp4 samples/demo.mp4
+```
+
+If your original file has a private or customer-specific name, rename it before
+running the benchmark.
+
+## Step 5: Run the OpenAI benchmark
+
+Run SAPAT with the OpenAI provider:
+
+```bash
+sapat samples/demo.mp4 \
+  --api openai \
+  --quality M \
+  --language en \
+  --prompt "The recording discusses Daytona workspaces, SAPAT, ffmpeg, provider APIs, and speech-to-text transcription." \
+  --temperature 0.2
+```
+
+Move the generated transcript to a provider-specific filename:
+
+```bash
+mv samples/demo.txt transcripts/demo_openai.txt
+```
+
+Record your observed runtime manually. A simple note is enough:
+
+```bash
+date >> transcripts/benchmark_notes.txt
+echo "openai: completed" >> transcripts/benchmark_notes.txt
+```
+
+## Step 6: Run the Groq benchmark
+
+Run the same source file with Groq:
+
+```bash
+sapat samples/demo.mp4 \
+  --api groq \
+  --quality M \
+  --language en \
+  --prompt "The recording discusses Daytona workspaces, SAPAT, ffmpeg, provider APIs, and speech-to-text transcription." \
+  --temperature 0.2
+```
+
+Save the transcript separately:
+
+```bash
+mv samples/demo.txt transcripts/demo_groq.txt
+```
+
+Use the same prompt, quality setting, and language code wherever possible. That
+makes the comparison fairer.
+
+## Step 7: Run the Azure OpenAI benchmark
+
+If your workspace is configured for Azure OpenAI, run:
+
+```bash
+sapat samples/demo.mp4 \
+  --api azure \
+  --quality M \
+  --language en \
+  --prompt "The recording discusses Daytona workspaces, SAPAT, ffmpeg, provider APIs, and speech-to-text transcription." \
+  --temperature 0.2
+```
+
+Save that output too:
+
+```bash
+mv samples/demo.txt transcripts/demo_azure.txt
+```
+
+Azure setup is usually more sensitive to deployment names, endpoint format, and
+API version. If Azure fails while OpenAI or Groq works, check the deployment
+configuration before changing the SAPAT command.
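Steps 5 to 7 repeat the same command with a different `--api` value, so the run, rename, and timing note can be wrapped in one helper. This is a minimal sketch under stated assumptions rather than anything SAPAT ships: `SAPAT_CMD`, `PROMPT`, `transcript_path`, and `run_provider` are hypothetical names, the flags are the ones already used in this guide, and `date +%s` is assumed to be available (it is on GNU and BSD `date`, but it is not strictly POSIX).

```shell
#!/bin/sh
# Sketch only: SAPAT_CMD, PROMPT, transcript_path, and run_provider are
# hypothetical names. The sapat flags mirror the commands in Steps 5-7.
SAPAT_CMD="${SAPAT_CMD:-sapat}"
PROMPT="The recording discusses Daytona workspaces, SAPAT, ffmpeg, provider APIs, and speech-to-text transcription."

# Map an input video and a provider name to a provider-specific path,
# for example samples/demo.mp4 + groq -> transcripts/demo_groq.txt.
transcript_path() {
  base=$(basename "$1")
  echo "transcripts/${base%.*}_$2.txt"
}

# Run one provider, then move the transcript written beside the input
# file and append the elapsed wall-clock seconds to the notes file.
run_provider() {
  video=$1
  provider=$2
  start=$(date +%s)
  "$SAPAT_CMD" "$video" --api "$provider" --quality M --language en \
    --prompt "$PROMPT" --temperature 0.2 || return 1
  end=$(date +%s)
  mkdir -p transcripts
  mv "${video%.*}.txt" "$(transcript_path "$video" "$provider")"
  echo "$provider: $((end - start))s" >> transcripts/benchmark_notes.txt
}
```

With the functions loaded, `for p in openai groq azure; do run_provider samples/demo.mp4 "$p"; done` would run each configured provider in turn. Whole-second timing is coarse, so treat the recorded numbers as rough runtime notes rather than precise latency measurements.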
+
+## Step 8: Compare outputs with a scorecard
+
+Create a small scorecard file:
+
+```bash
+cat > transcripts/scorecard.md <<'EOF'
+# SAPAT Provider Benchmark Scorecard
+
+| Provider | Setup effort | Terminology accuracy | Speaker/name handling | Formatting cleanup needed | Runtime notes | Best fit |
+| --- | --- | --- | --- | --- | --- | --- |
+| OpenAI | | | | | | |
+| Groq | | | | | | |
+| Azure OpenAI | | | | | | |
+EOF
+```
+
+Then inspect the transcript starts:
+
+```bash
+sed -n '1,40p' transcripts/demo_openai.txt
+sed -n '1,40p' transcripts/demo_groq.txt
+sed -n '1,40p' transcripts/demo_azure.txt
+```
+
+Look for practical quality signals:
+
+- Did the provider spell product names correctly?
+- Did it preserve technical terms such as Daytona, SAPAT, `ffmpeg`, and API?
+- Did it hallucinate section breaks, names, or actions?
+- Did it handle pauses, accents, and background noise well enough?
+- How much manual cleanup would be needed before publication?
+- Was the provider easy to configure in the Daytona workspace?
+
+For a quick size comparison, count words:
+
+```bash
+wc -w transcripts/demo_*.txt
+```
+
+A much shorter transcript can mean missed speech. A much longer transcript can
+mean repeated phrases, filler, or extra model cleanup text. Always inspect the
+actual text before making a decision.
+
+## Step 9: Choose a provider by workflow, not hype
+
+A useful provider choice depends on the job:
+
+| Use case | What to prioritize | What to test |
+| --- | --- | --- |
+| Internal demo notes | Speed and low cleanup effort. | Short engineering demos. |
+| Customer research | Accuracy and privacy controls. | Realistic call audio with safe test data. |
+| Developer tutorials | Technical vocabulary and formatting. | Screen recordings with product names. |
+| Enterprise workflows | Governance and existing cloud setup. | Azure deployment and access policies. |
+| Batch processing | Reliability over many files. | A folder of short `.mp4` clips. |
+
+This is why Daytona is valuable for the benchmark. You can keep the same input,
+same commands, same dependencies, and same review file in one workspace. Another
+teammate can repeat the benchmark without guessing which packages or environment
+variables were used.
+
+## Common issues and troubleshooting
+
+**Problem:** `ffmpeg` is not found.
+
+**Solution:** Install it inside the Daytona workspace and confirm with
+`ffmpeg -version`. Installing it only on your host machine may not help the
+workspace.
+
+**Problem:** SAPAT says the API choice is unsupported.
+
+**Solution:** Use the current CLI choices: `openai`, `groq`, or `azure`. Extra
+provider names in `.env.example` need matching implementation before they are
+safe to document as active CLI options.
+
+**Problem:** Authentication fails.
+
+**Solution:** Check the matching key in `.env`. For Azure OpenAI, also verify
+the endpoint, speech-to-text deployment name, and API version.
+
+**Problem:** The second provider overwrites the first transcript.
+
+**Solution:** Move `samples/demo.txt` into `transcripts/` after every run and
+rename it with the provider name before starting the next run.
+
+**Problem:** The transcript misses technical terms.
+
+**Solution:** Add a focused `--prompt` with expected terms. Keep the prompt short
+and factual. Do not include private information or secrets.
+
+**Problem:** A provider is faster but less accurate.
+
+**Solution:** Use the scorecard. Fast output is valuable for rough internal
+notes, but publication-quality transcripts may need the provider with fewer
+technical mistakes.
+
+## Conclusion
+
+You now have a repeatable benchmark for SAPAT transcription providers inside a
+Daytona workspace. Instead of choosing a provider by reputation, you can test
+OpenAI, Groq, and Azure OpenAI against the same source clip, preserve each
+transcript, and compare results with a practical scorecard.
+
+This workflow is intentionally conservative. It documents only the provider
+choices currently exposed by SAPAT's CLI, keeps secrets in `.env`, avoids
+committing private recordings, and gives reviewers a clear way to reproduce the
+comparison. If SAPAT adds more provider implementations later, the same benchmark
+structure can be reused: add a new provider row, run the same clip, and compare
+the output against the existing transcripts.
+
+## References
+
+- [Daytona content issue #13: AI Transcription Tool](https://github.com/daytonaio/content/issues/13)
+- [Daytona content contribution guide](https://github.com/daytonaio/content/blob/main/CONTRIBUTING.md)
+- [Daytona guide template](https://github.com/daytonaio/content/blob/main/guides/YYYYMMDD_guide_template.md)
+- [SAPAT repository](https://github.com/nibzard/sapat)
+- [SAPAT README](https://github.com/nibzard/sapat/blob/main/README.md)
+- [SAPAT CLI source](https://github.com/nibzard/sapat/blob/main/src/sapat/script.py)
+- [SAPAT environment example](https://github.com/nibzard/sapat/blob/main/.env.example)
diff --git a/guides/assets/20260619_benchmark_sapat_transcription_providers_in_daytona_workflow.svg b/guides/assets/20260619_benchmark_sapat_transcription_providers_in_daytona_workflow.svg
new file mode 100644
index 00000000..ac05dd93
--- /dev/null
+++ b/guides/assets/20260619_benchmark_sapat_transcription_providers_in_daytona_workflow.svg
@@ -0,0 +1,37 @@
+
+ SAPAT provider benchmark workflow in Daytona
+ One video clip is transcribed through multiple SAPAT providers in a Daytona workspace, then compared with a shared scorecard.
+
+
+
+
+
+
+
+
+
+
+ Benchmark SAPAT transcription providers in Daytona
+ Same video, same prompt, separate transcripts, one scorecard.
+
+
+ Input clip
+ samples/demo.mp4
+
+
+ SAPAT runs
+ --api openai
+ --api groq
+ --api azure
+
+
+ Transcripts
+ provider-named .txt
+
+
+ Score
+ choose
+
+ Keep .env, private recordings, generated transcripts, and payout details out of Git.
+
+