Transcribe YouTube videos into formatted Markdown articles. Supports subtitle download or Deepgram speech-to-text (with multi-speaker recognition).
Start here:
- If you want the canonical terminology, see Canonical Terms below.
- If you want the reusable execution order, see Validation Matrix and Minimum Commands.
- 🎯 Smart Subtitle Fetching: Prioritizes YouTube official/auto-generated subtitles
- 🎙️ Speech-to-Text: Auto-transcribes via Deepgram Nova-3 when no subtitles available
- 👥 Multi-speaker Recognition: Automatically distinguishes different speakers
- 🌐 Bilingual Support: Auto-translate and side-by-side formatting
- 🤖 AI Enhancement: Auto punctuation, paragraph splitting, error correction
- 📝 Markdown Output: Formatted articles with metadata
- yt-dlp: Download YouTube videos/audio/subtitles
- ffmpeg: Audio processing (splitting, silence detection)
- python3: Text processing
- curl: Call Deepgram API
- Deepgram Account: For speech transcription
# macOS
brew install yt-dlp python3 ffmpeg
# or via pip
pip install yt-dlp
- Copy the config template:
cp config.example.yaml config.yaml
- Edit config.yaml with your settings:
deepgram_api_key: "your_api_key_here"
deepgram_model: "nova-3"
deepgram_enable_utterances: true
deepgram_prefer_structured_output: true
output_dir: "~/Downloads"
# LLM API for long video chunk processing (optional)
# Format: "openai" or "anthropic"
llm_api_format: "openai"
llm_api_key: "your_llm_api_key"
llm_base_url: "https://api.deepseek.com"
llm_model: "deepseek-v4-pro"
llm_timeout_sec: 180
llm_max_retries: 3
llm_backoff_sec: 1.5
llm_stream: "auto"
llm_reasoning_probe_enabled: true
llm_chunk_recovery_attempts: 1
llm_chunk_recovery_backoff_sec: 1.0
# yt-dlp network hardening (safe defaults)
yt_dlp_socket_timeout_sec: 15
yt_dlp_retries: 1
yt_dlp_extractor_retries: 1
yt-transcript now uses this policy by default:
- Start anonymously (do not read browser cookies up front)
- If yt-dlp returns "Sign in to confirm you're not a bot", automatically retry (up to 3 attempts) with: --cookies-from-browser chrome
- If that Chrome retry also fails, surface a clear error and tell you how to provide a cookies.txt file
This means local desktop setups may recover automatically, while remote/container setups remain explicit and safe. The JSON returned by scripts/download.sh now also includes a yt_dlp_runtime object so callers can see which timeout, retry, and auth strategy actually ran.
If the automatic Chrome retry fails, the most portable fix is an exported Netscape-format cookies.txt:
yt_dlp_cookies_file: "~/.config/yt-transcript/youtube_cookies.txt"
Or for a one-off run:
YT_DLP_COOKIES_FILE=~/.config/yt-transcript/youtube_cookies.txt bash scripts/download.sh "$URL" metadata
You can still force browser-cookie mode explicitly:
yt_dlp_cookies_from_browser: "chrome"
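The default authentication policy described above can be sketched as a small decision function. This is an illustration only, not the logic in scripts/download.sh: the function name, its parameters, and the returned action labels are all hypothetical, and only the error text and the three-attempt limit come from this README.

```python
# Error text that yt-dlp prints when YouTube asks for a login (quoted in this README).
BOT_CHECK_MARKER = "Sign in to confirm you're not a bot"


def next_auth_step(stderr: str, chrome_retries_done: int, max_chrome_retries: int = 3) -> str:
    """Pick the next step after a failed anonymous or cookie-based attempt.

    The returned labels are hypothetical names used only for this sketch.
    """
    if BOT_CHECK_MARKER not in stderr:
        # Not an auth problem: report the original error unchanged.
        return "surface_error"
    if chrome_retries_done < max_chrome_retries:
        # Retry with --cookies-from-browser chrome.
        return "retry_with_chrome_cookies"
    # Chrome retries are exhausted: ask the operator for a cookies.txt file.
    return "ask_for_cookies_file"
```

Keeping the decision separate from the process invocation makes the retry budget easy to test without touching the network.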
Recommended cookies.txt import flow:
- Open youtube.com in a logged-in browser on your local machine
- Export cookies for YouTube in Netscape cookies.txt format
- Copy that file to the machine/container running this skill
- Set yt_dlp_cookies_file in config.yaml or YT_DLP_COOKIES_FILE in the environment
In remote or container environments, yt_dlp_cookies_file is usually more reliable than --cookies-from-browser chrome.
Note:
- deepgram_api_key is only required when the video has no usable subtitles and audio transcription is needed.
- Deepgram transcription defaults to nova-3. Legacy nova-2 / nova-2-* settings are automatically upgraded to nova-3 before any request is sent.
- deepgram_enable_utterances and deepgram_prefer_structured_output now default to true; set either to false only when you need to compare against the legacy flat-transcript path.
- Deepgram requests now use bounded automatic retries for transient timeout/network failures before surfacing an error.
- LLM API config is only needed for long video chunk processing, or when bilingual translation is required.
- llm_base_url can be either a provider root URL or a /v1 URL. The tool normalizes both.
- llm_stream: "auto" prefers SSE streaming when the provider supports it.
- With llm_reasoning_probe_enabled: true, unknown OpenAI-compatible models get one non-streaming OK probe; if the response exposes reasoning metadata, chunk calls switch to non-streaming for recovery telemetry.
- bash scripts/preflight.sh --require-llm now performs a real low-cost LLM probe and reports latency.
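Two of the normalizations mentioned in these notes can be sketched in a few lines. This is a hedged illustration, not the tool's code: both function names are hypothetical, and the choice of a trailing /v1 as the normalized URL form is an assumption, since the README only says that both forms are accepted.

```python
def normalize_base_url(url: str) -> str:
    """Accept a provider root URL or a /v1 URL and return one canonical form.

    Assumption for this sketch: the canonical form ends in exactly one /v1.
    """
    trimmed = url.strip().rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"


def upgrade_deepgram_model(model: str) -> str:
    """Map legacy nova-2 and nova-2-* settings to nova-3; leave other names unchanged."""
    name = model.strip()
    if name == "nova-2" or name.startswith("nova-2-"):
        return "nova-3"
    return name
```

Normalizing once at config load keeps every later request builder free of per-provider string handling.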
- Place this directory in any Claude skills directory
- Provide a YouTube link in your Claude conversation
- Claude will automatically execute the transcription workflow
The scripts resolve config.yaml relative to the skill directory, so the skill is no longer tied to ~/.claude/skills/yt-transcript.
Please transcribe this video: https://www.youtube.com/watch?v=xxxxx
You can provide multiple links at once. They will be processed serially (one at a time) to ensure quality and context isolation:
Please transcribe these videos:
- https://www.youtube.com/watch?v=xxxxx
- https://www.youtube.com/watch?v=yyyyy
- https://www.youtube.com/watch?v=zzzzz
After completion, a summary table will be provided with status and output paths for each video.
- plan-optimization still records the raw duration_bucket (short vs long).
- If a short-duration transcript is too large for reliable single-pass prompting, the planner now escalates it to chunked processing and returns video_path=long with routing_reason=oversized_short_input.
- Workflow callers should follow video_path, operations, and routing_reason from the planner rather than duration alone.
- Final output filenames must format date fragments as yyyy-mm-dd; do not use raw metadata upload_date (yyyymmdd) in filenames.
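The filename date rule can be illustrated with a short conversion helper. The function name is hypothetical; the input format (yyyymmdd, as found in upload_date metadata) and the output format (yyyy-mm-dd) are the ones stated above.

```python
from datetime import datetime


def format_upload_date(upload_date: str) -> str:
    """Convert a yyyymmdd string into yyyy-mm-dd.

    Raises ValueError if the input is not a valid yyyymmdd date,
    so a malformed value never reaches a filename.
    """
    return datetime.strptime(upload_date, "%Y%m%d").strftime("%Y-%m-%d")
```

Parsing through datetime rather than slicing the string means an impossible date such as 20241340 fails loudly instead of producing a plausible-looking filename.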
yt-transcript/
├── SKILL.md # Claude Skill workflow guide (main entry point)
├── SYSTEM_DESIGN.md # Single authoritative system design document
├── workflows/ # Modular workflow files
├── prompts/ # Single-task prompt templates
├── scripts/ # Helper shell scripts
├── yt_transcript_utils.py # Main Python entry; imports the two kernel layers directly
├── kernel/ # Two-layer kernel package
│ ├── task_runtime/ # Generic task runtime layer
│ │ ├── runtime.py # Ownership, command envelopes, telemetry append
│ │ ├── api.py # Stable runtime-facing API for create/inspect/advance/control/finalize
│ │ ├── contracts.py # Runtime contracts for task/run/action/artifact envelopes
│ │ ├── lifecycle.py # Lifecycle shell and transition summaries for runtime stages
│ │ ├── policy.py # Allowed-action derivation and budget-pressure policy checks
│ │ ├── evaluator.py # Quality-gated evaluator reports and action recommendations
│ │ ├── decision.py # Rule-first action selection and decision records
│ │ ├── ledger.py # Runtime budget and action accounting summaries
│ │ ├── recovery.py # Resume-safe recovery summaries and processing sub-states
│ │ ├── artifacts.py # Artifact-graph helpers for persisted runtime outputs
│ │ ├── state.py # Manifest/runtime persistence and control files
│ │ ├── controller.py # Owned mutation and bounded control-loop helpers
│ │ └── telemetry.py # Telemetry query and summary helpers
│ └── long_text/ # Long-text transformation layer
│ ├── glossary.py # Glossary extraction and terminology checks
│ ├── semantic.py # Semantic anchor extraction and checks
│ ├── contracts.py # Control contracts and policy state
│ ├── autotune.py # Chunk autotune and token-source summarization
│ ├── lifecycle.py # Manifest lifecycle and resume-state helpers
│ ├── prompting.py # Prompt assembly and chunking-context helpers
│ ├── llm.py # LLM request loop and retry helpers
│ ├── processing.py # Chunk-processing and replan execution loops
│ ├── chunking.py # Chunking command surfaces
│ ├── merge.py # Merge and chapter-plan command surfaces
│ └── execution.py # Execution, resume, and replan command surfaces
├── tests/ # Regression test suite
├── config.yaml # Local config (gitignored)
├── config.example.yaml # Config template
└── README.md # This document
README.md is the operator-facing quickstart and command guide. SYSTEM_DESIGN.md is the single authoritative design document for the system architecture.
At a high level, yt-transcript is a local-first, script-first system that turns a YouTube URL into a Markdown article through:
- preflight and configuration checks
- metadata and subtitle availability detection
- subtitle download or Deepgram fallback transcription
- state synchronization and normalized document creation
- optimization planning
- short-path direct transformation or the long-text transformation subsystem
- final assembly and quality gates
The codebase mirrors that design through a two-layer kernel split: kernel/task_runtime/* owns generic long-running job control, while kernel/long_text/* owns long-text transformation behavior. yt_transcript_utils.py remains the main CLI and workflow façade, but it now delegates into those two layers directly.
Phase 1 of the runtime-upgrade path also introduces kernel/task_runtime/contracts.py, which normalizes task, run-state, action, artifact, and quality-report envelopes without changing the nominal workflow behavior.
Phase 2 adds kernel/task_runtime/lifecycle.py, which wraps runtime-sensitive commands in an explicit lifecycle shell so state transitions become observable before richer policy logic is introduced.
Phase 3 adds kernel/task_runtime/policy.py, decision.py, and ledger.py so command envelopes can expose allowed actions, budget-pressure summaries, and rule-first decision records in a uniform way.
Phase 4 adds kernel/task_runtime/recovery.py and artifacts.py so long-text runs expose processing sub-states, recovery recommendations, and artifact-graph views without changing the core chunk-processing algorithms.
Phase 5 adds kernel/task_runtime/evaluator.py plus constrained llm-assisted ranking hooks so quality-gated recommendations and optional model-assisted action selection can coexist without bypassing the allowed-action contract.
Phase 6 adds kernel/task_runtime/api.py, making create-run, inspect-run, advance-run, apply-control, resume-run, and finalize-run the preferred outer runtime contract while preserving the older control commands as compatibility helpers.
The hardest internal subsystem is long-text transformation. It activates only when the planning layer determines that the input is long enough to require chunking, continuity control, consistency protection, verification, repair / replan, and deterministic merge.
For the full design, read the separate System Design Document (SYSTEM_DESIGN.md).
For new outer-agent integrations, prefer the stable runtime-facing commands over the older path-oriented helpers:
- create-run <work_dir> initializes or refreshes the persisted runtime task record
- inspect-run <work_dir> returns stable task/run state, recovery summary, and allowed actions
- advance-run <work_dir> selects the bounded next runtime action and dispatches it
- apply-control <work_dir> --signal pause|cancel applies operator control signals
- resume-run <work_dir> resumes a paused run
- finalize-run <work_dir> returns a final summary and can optionally call merge-content
The default migration mode is runtime_api. Set YT_TRANSCRIPT_RUNTIME_API_MODE=legacy_cli only when you need to force the older compatibility framing.
# Base checks only: subtitles / metadata workflows
bash scripts/preflight.sh
# Require Deepgram before audio transcription
bash scripts/preflight.sh --require-deepgram
# Require LLM config before long-video chunk processing
bash scripts/preflight.sh --require-llm
This project is script-first: helper commands emit machine-readable JSON on stdout so routing, validation, and execution decisions stay in code instead of drifting inside prompt prose:
- scripts/download.sh "$URL" metadata
- scripts/download.sh "$URL" subtitle-info
- scripts/download.sh "$URL" subtitles
- scripts/download.sh "$URL" audio
- python3 yt_transcript_utils.py get-chapters "$URL"
- python3 yt_transcript_utils.py chunk-segments /tmp/${VIDEO_ID}_segments.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
- python3 yt_transcript_utils.py chunk-document /tmp/${VIDEO_ID}_normalized_document.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
- python3 yt_transcript_utils.py prepare-resume /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
- python3 yt_transcript_utils.py build-chapter-plan /tmp/${VIDEO_ID}_chapters.json /tmp/${VIDEO_ID}_chunks /tmp/${VIDEO_ID}_chunks/chapter_plan.json
- python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript
- python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage <stage>
- python3 yt_transcript_utils.py normalize-document /tmp/${VIDEO_ID}_state.md
- python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md
- python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt
Here <RAW_STAGE_PROMPT> comes from plan-optimization:
- cleanup_zh for Chinese monolingual runs
- structure_only for bilingual English-source runs
This keeps workflow logic in scripts instead of ad-hoc shell parsing inside the prompt instructions.
plan-optimization also emits the canonical chunk execution contract.
At the whole-project level, it is the routing boundary between source acquisition and text transformation. At the long-text subsystem level, it defines the execution contract that downstream chunk processing must follow:
- operations[*].execution.supports_auto_replan
- operations[*].execution.recommended_cli_flags
- operations[*].execution.on_replan_required
normalize-document materializes /tmp/${VIDEO_ID}_normalized_document.json from either raw text or timed segments.json, and plan-optimization auto-materializes it when source artifacts already exist. When the source segments came from subtitle cleanup, the normalized document now also preserves lightweight cleanup diagnostics under diagnostics.subtitle_cleanup and an explainable subtitle-path quality summary under diagnostics.subtitle_quality.
plan-optimization now separates duration routing from source routing: routing_reason still explains short-vs-long execution, while source_route_reason, subtitle_quality_score, reroute_recommended, and reroute_target explain whether the current subtitle source should be kept, manually reviewed, or replaced with Deepgram.
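A minimal sketch of the duration-and-size side of this routing follows. It is not the planner's implementation: the function name, the estimated_tokens input, the single_pass_token_limit default, and the two routing_reason labels other than oversized_short_input are assumptions made for illustration. The 1800-second boundary and the oversized_short_input escalation are the ones this README describes.

```python
def route_by_duration_and_size(
    duration_sec: float,
    estimated_tokens: int,
    single_pass_token_limit: int = 8000,  # hypothetical budget, not a documented default
) -> dict:
    """Return duration_bucket, video_path, and routing_reason for one video."""
    # Raw bucket: below 1800 s is short, 1800 s and above is long.
    duration_bucket = "short" if duration_sec < 1800 else "long"

    if duration_bucket == "long":
        video_path, reason = "long", "long_duration"  # label is illustrative
    elif estimated_tokens > single_pass_token_limit:
        # A short video whose transcript is too large is escalated to chunking.
        video_path, reason = "long", "oversized_short_input"
    else:
        video_path, reason = "short", "short_duration"  # label is illustrative

    return {
        "duration_bucket": duration_bucket,
        "video_path": video_path,
        "routing_reason": reason,
    }
```

Note that duration_bucket stays short even when video_path is escalated to long, which is why callers should follow video_path rather than duration alone.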
For long-video chunking, plan-optimization now also emits a canonical chunking block; when normalization exists, chunk-document is the preferred driver and it keeps chunk boundary / continuity assumptions explicit in manifest.json.
The current design also has explicit resume semantics: prepare-resume repairs stale manifest state manually, while process-chunks runs the same repair step automatically before execution continues.
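The repair step can be pictured as a pure function over manifest entries. This is a sketch under stated assumptions rather than the code behind prepare-resume: the status and output_file field names, the pending status, and the exact mapping rules are hypothetical. Only the idea that stale running or missing-output checkpoints become done or interrupted comes from this README.

```python
from typing import Callable, Dict, List


def repair_stale_chunks(
    chunks: List[Dict], output_exists: Callable[[str], bool]
) -> List[Dict]:
    """Return repaired copies of chunk entries without mutating the input.

    Assumed rules for this sketch:
    - running with an output file present becomes done
    - running without an output file becomes interrupted
    - done without an output file becomes interrupted
    - every other status is left unchanged
    """
    repaired = []
    for chunk in chunks:
        entry = dict(chunk)  # shallow copy keeps the caller's manifest intact
        has_output = output_exists(entry["output_file"])
        if entry["status"] == "running":
            entry["status"] = "done" if has_output else "interrupted"
        elif entry["status"] == "done" and not has_output:
            entry["status"] = "interrupted"
        repaired.append(entry)
    return repaired
```

Passing output_exists as a callable (for example os.path.isfile) keeps the rule deterministic and testable without a real work directory.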
Current policy is intentional and explicit:
- raw_path chunk stages use process-chunks --auto-replan
- processed_path chunk stages do not auto-replan; if replan_required=true, stop and review manually
- bilingual means English source text plus Chinese translation, not subtitle file merging
- If Chinese subtitles exist, they take precedence as the single subtitle source track; English is used only when no usable Chinese subtitle track can be downloaded
- Chinese-source monolingual optimization now uses a dedicated cleanup_zh prompt that preserves meaning while repairing punctuation, paragraphing, duplicate subtitle fragments, and obvious spacing artifacts
- config.yaml is intentionally limited to flat top-level key/value entries; nested or multi-line YAML is not supported
- YAML frontmatter values are always quoted on purpose to favor safe parsing over prettier formatting
- Markdown header text is escaped and link destinations are encoded so edge-case titles/channels do not break output structure
- chunk-document is now the canonical long-video chunking entrypoint when normalized_document.json exists; it follows preferred_chunk_source instead of blindly preferring timed segments, so Chinese YouTube-subtitle long paths now chunk from cleaned text while still retaining segments for timing metadata
- chunk-text force-splits very long unpunctuated passages to stay within downstream LLM chunk budgets
- download.sh metadata now prefers a single yt-dlp -J fetch when available, and subtitle/audio modes reuse metadata-derived video IDs before falling back to extra probes
- transcribe-deepgram defaults to Deepgram nova-3; legacy nova-2 / nova-2-* model settings are upgraded to nova-3 at runtime
- transcribe-deepgram --output-segments can emit time-aligned segments for downstream timed chunking + YouTube chapter mapping
- transcribe-deepgram now defaults to utterance-first transcript assembly; --disable-utterances --legacy-flat-output remains available for compatibility/debugging checks
- transcribe-deepgram now also reports lightweight observability fields such as paragraph/sentence/word counts, per-chunk transcript metadata, and fallback warnings in its result JSON
- chunk-segments produces timed chunk manifests, and build-chapter-plan maps YouTube chapters onto chunk boundaries for merge-content
- parse-vtt / parse-vtt-segments now use subtitle-aware cleanup so CJK subtitle fragments are not re-joined with stray ASCII spaces
- parse-vtt-segments now also emits lightweight cleanup diagnostics such as duplicate/overlap trimming counters, and normalize-document carries those subtitle-cleanup signals plus a deterministic subtitle_quality_score into normalized_document.json
- plan-optimization now exposes explainable source-route fields such as source_route_reason, reroute_recommended, reroute_target, and reroute_reasons; critically poor Chinese subtitle paths can recommend Deepgram fallback without silently changing the current workflow shell
- merge-content now runs a deterministic post-merge cleanup pass on the merged body only: it repairs chunk seams, conservatively rejoins continuation-like split fragments, preserves any explicit prefixed header/frontmatter verbatim, and drops immediately duplicated heading/body seams without asking the LLM to rewrite the document
- chunk-segments --chapters can force chunk boundaries at YouTube chapter starts to reduce heading drift
- chunk-text now defaults to token-aware planning when --prompt is provided, while an explicit --chunk-size without --prompt keeps legacy character sizing for workflow compatibility
- prompt names are validated eagerly for chunk planning, so typos fail fast instead of silently falling back to generic budgets
- process-chunks now assigns prompt-specific max_output_tokens from the same planning budget instead of using one large shared default
- manifest.json now records explicit plan.chunk_contract and plan.continuity; process-chunks follows that plan-owned continuity policy instead of silently drifting with later config changes
- chunk execution now also has explicit resume semantics: stale running / missing-output checkpoints are repaired deterministically into done or interrupted before work resumes
- process-chunks also injects a short continuity context from the previous chunk (tail sentence + optional section title) without enabling body overlap, and chunk budgeting now reserves a small token allowance for that carry-over context
- process-chunks now treats transient gateway disconnects such as "Remote end closed connection without response" as retryable transport failures, and can auto-rerun suspiciously short / malformed chunk outputs before keeping a warning
- process-chunks --dry-run validates prompts, manifests, and chunk budgets without requiring live LLM credentials; actual execution still requires llm_api_key, llm_base_url, and llm_model
- download.sh now writes subtitle and audio artifacts into per-video isolated temp directories under /tmp/${VIDEO_ID}_downloads/... and exposes download_dir in JSON for deterministic selection and cleanup
- download.sh subtitles now requests the exact selected subtitle language codes, so regional variants such as en-GB / zh-TW work instead of being dropped by a hard-coded whitelist
- download.sh subtitles now tries one source-family candidate at a time: Chinese first, then English only as a fallback when no usable Chinese track can be downloaded
- download.sh subtitles now distinguishes detection vs downloadability more explicitly: listed_candidates shows tracks exposed by YouTube/yt-dlp, while attempted_candidates, blocked_candidates, and fallback_used show what the current runtime could actually fetch
- when a preferred subtitle candidate fails with an auth-like error such as HTTP 429, download.sh subtitles now retries the same candidate with Chrome cookies before it gives up and falls back to the next candidate
- subtitle-driven workflows still support Chinese-source monolingual mode and English-source bilingual mode; Chinese-source runs now prefer cleanup_zh, while English-source runs still use structure_only -> translate_only; when neither usable Chinese nor English subtitles can be downloaded, the workflow should stop and fall back to audio transcription
- plan-optimization is the canonical short/long router with < 1800s = short and >= 1800s = long; the Quick Mode shortcut from SKILL.md is a narrower < 900s subset for subtitle-friendly videos
- manifest.json now separates immutable plan metadata from runtime state, and process-chunks records attempt-level telemetry (attempt_logs) in addition to chunk-level fields
- process-chunks no longer rewrites the current batch budget on the fly; when canary chunks or retry history show the plan is unhealthy, it aborts with replan_required=true so replan-remaining can generate a new plan for unfinished raw chunks
- process-chunks --auto-replan preserves that architecture boundary while automating the orchestration loop (process -> replan-remaining -> resume) for raw-path plans
- build-glossary --mode transcript now mines glossary terms from raw chunks plus any available normalized-document title/channel, chapter titles, and optional metadata/description context; cleanup_zh auto-builds that glossary when the work_dir does not have one yet
- glossary selection and verification now use boundary-aware matching for simple ASCII terms, so acronyms like API do not get falsely selected from unrelated words such as rapid or capital
- run_kernel_command(...) is the stable Python envelope API for kernel commands, and python3 yt_transcript_utils.py --api-envelope ... emits the same yt_transcript.command_result/v1 envelope on the CLI without breaking legacy flat JSON output
- envelope-producing kernel commands append local yt_transcript.telemetry_event/v1 records to telemetry.jsonl when a stable nearby sink path can be inferred
- runtime.status now distinguishes completed, completed_with_errors, and aborted, and raw-path replans remap existing chapter_plan.json chunk starts so merged chapter headers still land on valid chunk boundaries
- runtime token estimation remains heuristic by default; test-token-count / preflight.sh --require-llm probe provider-side token counting and clearly fall back to local estimates when unavailable
- chunk_hard_cap_multiplier is constrained to a conservative 1.0-2.0 range so misconfiguration cannot silently blow up chunk envelopes
- preflight.sh is staged so subtitle-only workflows do not require Deepgram or LLM credentials up front, while --require-llm now performs both reachability and token-count capability probes
- transcribe-deepgram is the only supported Deepgram entry point; split / merge behavior is owned by the Python utility
- verify-quality is a hard gate only when hard_failures is non-empty; warnings are advisory review signals, and checks now expose explainable readability metrics such as chunk_seam_warning_count, cjk_space_ratio, duplicate_ngram_ratio, short_paragraph_ratio, header_density, punctuation_density, glossary_drift_count, and glossary_preservation_ratio
- bilingual verify-quality now detects adjacent English-to-Chinese paragraph pairs from the full body stream, so short intro paragraphs before paired EN/ZH blocks no longer trigger a false hard failure
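The boundary-aware glossary matching mentioned in this list can be illustrated with a short regex sketch. It is an assumption-laden example, not the matcher in kernel/long_text/glossary.py: the function name, the definition of a "simple ASCII term" as ASCII letters and digits only, and the case-insensitive comparison are all choices made for this illustration.

```python
import re

# Assumption: a "simple ASCII term" consists only of ASCII letters and digits.
_SIMPLE_ASCII_TERM = re.compile(r"[A-Za-z0-9]+")


def contains_term(text: str, term: str) -> bool:
    """Report whether term occurs in text, using boundaries for simple ASCII terms."""
    if _SIMPLE_ASCII_TERM.fullmatch(term):
        # Reject matches glued to other ASCII letters or digits, so API is not
        # found inside rapid or capital. ASCII-only lookarounds are used instead
        # of \b so that a term directly adjacent to CJK characters still matches.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    # Non-ASCII terms (for example Chinese glossary entries) use plain substring matching.
    return term in text
```

The explicit lookarounds matter for mixed-script transcripts: in Python's Unicode mode, \b sees no boundary between a CJK character and an ASCII letter, so a plain \bAPI\b pattern would miss a term written without surrounding spaces.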
- bilingual: English source text plus Chinese translation. It is not subtitle-file merging.
- preflight base: bash scripts/preflight.sh for metadata, subtitle inspection, and subtitle-driven paths.
- preflight deepgram: bash scripts/preflight.sh --require-deepgram immediately before audio transcription.
- preflight llm: bash scripts/preflight.sh --require-llm only when plan-optimization says long-video chunk processing requires it.
- Deepgram unified entry: python3 yt_transcript_utils.py transcribe-deepgram ...
- Deepgram result observability: /tmp/${VIDEO_ID}_deepgram_result.json now exposes per-chunk chunk_reports, aggregate structure counts, and warnings when segment extraction falls back from richer structures
- quality gate: verify-quality JSON where hard_failures means STOP, warnings means review before proceeding, and checks carries the readability metrics behind those warnings. Pass --work-dir or --glossary-path when available to enable glossary drift checks.
- source routing: the planning-layer decision about whether the current subtitle/deepgram source should continue as-is. routing_reason covers duration/size routing; source_route_reason covers source-quality routing.
- runtime reroute action: when plan-optimization / verify-quality reports reroute_recommended=true with reroute_target=deepgram, runtime contracts normalize that into recommended_action=fallback_to_deepgram so policy, evaluator, and decision outputs stay aligned
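The runtime reroute normalization can be sketched as a single mapping. The function name and the use of None for "no reroute" are hypothetical; the field names and the fallback_to_deepgram value are the ones used in this README.

```python
from typing import Optional


def normalize_reroute(report: dict) -> Optional[str]:
    """Map reroute fields from a planner or quality report to a recommended action.

    Returns "fallback_to_deepgram" only when both conditions hold, otherwise None.
    """
    if report.get("reroute_recommended") is True and report.get("reroute_target") == "deepgram":
        return "fallback_to_deepgram"
    return None
```

Deriving the action in one place is what lets policy, evaluator, and decision outputs agree on the same label.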
| Scenario | Minimum command sequence | Stop/go rule |
|---|---|---|
| Short video with subtitles | preflight.sh → download.sh metadata → create state → validate-state --stage metadata → download.sh subtitle-info → download.sh subtitles → validate-state --stage post-source → optimize → verify-quality | Stop only if validate-state or verify-quality returns non-empty hard_failures |
| Video without usable subtitles | preflight.sh → download.sh metadata → download.sh subtitle-info → preflight.sh --require-deepgram → transcribe-deepgram → validate-state --stage post-source → optimize → verify-quality | Stop on any command failure or non-empty hard_failures |
| Long video | validate-state --stage post-source → plan-optimization → if requires_llm_preflight=true, run preflight.sh --require-llm → chunk → raw-path process-chunks --auto-replan → optional processed-path translation → merge → verify-quality → validate-state --stage pre-assemble | warnings alone do not block; hard_failures block |
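A caller can apply the stop/go rules in the table to the JSON that verify-quality prints. The sketch below is illustrative: the function name and the three returned labels are hypothetical, while the hard_failures and warnings keys are the ones described in this README.

```python
import json


def gate_decision(verify_quality_json: str) -> str:
    """Turn a verify-quality JSON document into a stop / review / proceed label."""
    report = json.loads(verify_quality_json)
    if report.get("hard_failures"):
        # Non-empty hard_failures is the only blocking condition.
        return "stop"
    if report.get("warnings"):
        # Warnings are advisory: review, but do not block.
        return "review"
    return "proceed"
```

For example, a report of {"hard_failures": [], "warnings": ["short_paragraph_ratio high"]} yields review rather than stop, matching the rule that warnings alone do not block.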
# 1. Base checks
bash scripts/preflight.sh
# 2. Metadata + subtitle availability
bash scripts/download.sh "$URL" metadata
bash scripts/download.sh "$URL" subtitle-info
# 3. State validation
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage metadata
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage post-source
# 4. Optimization planning
python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md
# 4b. Long-video raw chunk stages follow the plan contract
# use process-chunks --auto-replan for raw_path,
# but stop-and-review for processed_path replan_required
# 5. Audio fallback when needed
bash scripts/preflight.sh --require-deepgram
python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --output-text "/tmp/${VIDEO_ID}_raw_text.txt"
# optional parity check against the legacy flat path:
# python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --disable-utterances --legacy-flat-output
# 6. Optional explicit transcript glossary build for chunked cleanup paths
python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript
# 7. Final quality gate
python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt
# add --work-dir /tmp/${VIDEO_ID}_chunks when a glossary/work_dir exists and you want glossary drift checks
# `checks` now includes advisory readability signals such as chunk seam duplication,
# Chinese spacing anomalies, repeated phrase density, short paragraph ratio,
# header density, punctuation density, and glossary drift metrics.
MIT License
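Transcribe YouTube videos into formatted Markdown articles. Supports subtitle download or Deepgram speech-to-text (including multi-speaker recognition).
Start here:
- To confirm the canonical terminology, see Canonical Terms below
- To reuse the execution order, see Validation Matrix and Minimum Commands
- 🎯 Smart Subtitle Fetching: Prioritizes YouTube official/auto-generated subtitles
- 🎙️ Speech-to-Text: Auto-transcribes via Deepgram Nova-3 when no subtitles are available
- 👥 Multi-speaker Recognition: Automatically distinguishes different speakers
- 🌐 Chinese-English Bilingual Support: Auto-translate and side-by-side formatting
- 🤖 AI Enhancement: Auto punctuation, paragraph splitting, error correction
- 📝 Markdown Output: Formatted articles with metadata
- yt-dlp: Download YouTube videos/audio/subtitles
- ffmpeg: Audio processing (splitting, silence detection)
- python3: Text processing and formatting
- curl: Call the Deepgram API
- Deepgram Account: For speech transcription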
# macOS
brew install yt-dlp python3 ffmpeg
# or via pip
pip install yt-dlp
- Copy the config template:
cp config.example.yaml config.yaml
- Edit config.yaml and fill in your settings:
deepgram_api_key: "your_api_key_here"
deepgram_model: "nova-3"
deepgram_enable_utterances: true
deepgram_prefer_structured_output: true
output_dir: "~/Downloads"
# LLM API config for long-video chunk processing (optional)
# Format: "openai" or "anthropic"
llm_api_format: "openai"
llm_api_key: "your_llm_api_key"
llm_base_url: "https://api.deepseek.com"
llm_model: "deepseek-v4-pro"
llm_timeout_sec: 180
llm_max_retries: 3
llm_backoff_sec: 1.5
llm_stream: "auto"
llm_reasoning_probe_enabled: true
llm_chunk_recovery_attempts: 1
llm_chunk_recovery_backoff_sec: 1.0
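Note:
- deepgram_api_key is only required when no usable subtitles exist and audio transcription is needed.
- Deepgram transcription defaults to nova-3. Legacy nova-2 / nova-2-* settings are automatically upgraded to nova-3 before any request is sent.
- deepgram_enable_utterances and deepgram_prefer_structured_output now both default to true; setting either to false is recommended only when you need to compare against the legacy flat-transcript path.
- LLM API config is only needed for long-video chunk processing, or when bilingual translation is required.
- llm_base_url can be either a provider root URL or a /v1 URL; the tool normalizes both.
- llm_stream: "auto" prefers streaming responses when the provider supports them.
- With llm_reasoning_probe_enabled: true, unknown OpenAI-compatible models first get one non-streaming OK probe; if the response contains reasoning metadata, subsequent chunk calls switch to non-streaming to preserve the usage information needed for recovery.
- bash scripts/preflight.sh --require-llm now performs a real low-cost liveness probe and reports latency.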
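- Place this directory in any Claude skills directory
- Provide a YouTube link in your Claude conversation
- Claude will automatically execute the transcription workflow
The scripts resolve config.yaml relative to the skill directory, so the skill is no longer tied to ~/.claude/skills/yt-transcript.
Please transcribe this video: https://www.youtube.com/watch?v=xxxxx
You can provide multiple links at once. They will be processed serially (one at a time) to ensure quality and context isolation:
Please transcribe these videos:
- https://www.youtube.com/watch?v=xxxxx
- https://www.youtube.com/watch?v=yyyyy
- https://www.youtube.com/watch?v=zzzzz
After completion, a summary table will be provided with the status and output path of each video.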
yt-transcript/
├── SKILL.md # Claude Skill workflow guide (main entry point)
├── SYSTEM_DESIGN.md # Single authoritative system design document
├── workflows/ # Modular workflow files
├── prompts/ # Single-task prompt templates
├── scripts/ # Helper shell scripts
├── yt_transcript_utils.py # Main Python entry; now depends directly on the two kernel subpackages
├── kernel/ # Two-layer kernel package
│ ├── task_runtime/ # Generic task runtime layer
│ │ ├── runtime.py # Ownership, command envelopes, telemetry append
│ │ ├── api.py # Stable external runtime API: create/inspect/advance/control/finalize
│ │ ├── contracts.py # Runtime contracts: task/run/action/artifact envelopes
│ │ ├── lifecycle.py # Runtime lifecycle shell and transition summaries
│ │ ├── policy.py # Allowed-action derivation and budget-pressure policy
│ │ ├── evaluator.py # Quality-gated evaluator reports and recommended actions
│ │ ├── decision.py # Rule-first action selection and decision records
│ │ ├── ledger.py # Runtime budget and action accounting summaries
│ │ ├── recovery.py # Processing sub-states and recovery summaries
│ │ ├── artifacts.py # Artifact graph and references to persisted outputs
│ │ ├── state.py # Manifest/runtime persistence and control files
│ │ ├── controller.py # Owned mutation and bounded control-loop helpers
│ │ └── telemetry.py # Telemetry query and summary helpers
│ └── long_text/ # Long-text transformation layer
│ ├── glossary.py # Glossary extraction and terminology checks
│ ├── semantic.py # Semantic anchor extraction and checks
│ ├── contracts.py # Control contracts and policy state
│ ├── autotune.py # Chunk autotune and token-source summarization
│ ├── lifecycle.py # Manifest lifecycle and resume-state helpers
│ ├── prompting.py # Prompt assembly and chunking-context helpers
│ ├── llm.py # LLM request loop and retry helpers
│ ├── processing.py # Chunk-processing and replan execution loops
│ ├── chunking.py # Chunking command surfaces
│ ├── merge.py # Merge and chapter-plan command surfaces
│ └── execution.py # Execution, resume, and replan command surfaces
├── tests/ # Regression test suite
├── config.yaml # Local config (gitignored)
├── config.example.yaml # Config template
└── README.md # This document
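`README.md` is the operator-facing quick-start and command guide; `SYSTEM_DESIGN.md` is the single authoritative design document for the system architecture.

At a high level, yt-transcript is a local-first, script-first system that turns a YouTube URL into a Markdown article through the following stages:

- preflight and configuration checks
- metadata and subtitle availability probing
- subtitle download or Deepgram fallback transcription
- state synchronization and normalized document generation
- optimization planning
- short-path direct transformation, or entry into the long-text transformation subsystem
- final assembly and quality gate

The code follows the same design and is split into two kernel layers: `kernel/task_runtime/*` handles generic long-running task control, and `kernel/long_text/*` handles long-text transformation behavior. `yt_transcript_utils.py` remains the main CLI and workflow façade, but it now delegates directly to these two layers.

Phase 6 adds `kernel/task_runtime/api.py`, which consolidates `create-run`, `inspect-run`, `advance-run`, `apply-control`, `resume-run`, and `finalize-run` into the preferred external runtime interface. Older commands such as `runtime-status`, `process-chunks`, `pause-run`, and `cancel-run` are still available, but they are now positioned as compatibility helpers.

The hardest internal subsystem is long-text transformation. It activates only when the planning layer decides the input is long enough to require chunk processing, and it is responsible for chunking, continuity, consistency protection, verification, repair / replan, and deterministic merge.

For new outer-agent integrations, prefer the stable runtime-facing commands over the older path-oriented helpers: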
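- `create-run <work_dir>`: initialize or refresh the persisted runtime task record
- `inspect-run <work_dir>`: return the stable task/run state, recovery summary, and allowed actions
- `advance-run <work_dir>`: select a bounded next runtime action and execute it
- `apply-control <work_dir> --signal pause|cancel`: apply an operator-level control signal
- `resume-run <work_dir>`: resume a paused run
- `finalize-run <work_dir>`: return the final summary, optionally invoking `merge-content`

The default migration mode is `runtime_api`. Set `YT_TRANSCRIPT_RUNTIME_API_MODE=legacy_cli` only when you must force the old compatibility wrappers.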
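
As an illustration of the default described above, here is a minimal sketch of how a wrapper script might resolve the mode. The environment variable and the two mode names come from this README; the function name and the choice to reject unknown values are assumptions of the sketch, not documented behavior of the tool.

```python
import os

# Mode names taken from the README; everything else here is illustrative.
DEFAULT_MODE = "runtime_api"
KNOWN_MODES = {"runtime_api", "legacy_cli"}


def resolve_runtime_api_mode(env=None):
    """Return the migration mode, falling back to runtime_api when unset."""
    env = os.environ if env is None else env
    mode = env.get("YT_TRANSCRIPT_RUNTIME_API_MODE", "").strip() or DEFAULT_MODE
    if mode not in KNOWN_MODES:
        # Assumption: this sketch rejects unknown values; the real tool may differ.
        raise ValueError(f"unsupported runtime API mode: {mode!r}")
    return mode
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real process environment.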
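# Check base dependencies only: subtitle / metadata workflow
bash scripts/preflight.sh
# Require Deepgram before audio transcription
bash scripts/preflight.sh --require-deepgram
# Require a complete LLM configuration before long-video chunk processing
bash scripts/preflight.sh --require-llm

This project is script-first: helper commands emit parseable JSON on stdout, so routing, validation, and execution decisions stay in code as much as possible instead of drifting into prompt wording: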
- `scripts/download.sh "$URL" metadata`
- `scripts/download.sh "$URL" subtitle-info`
- `scripts/download.sh "$URL" subtitles`
- `scripts/download.sh "$URL" audio`
- `python3 yt_transcript_utils.py get-chapters "$URL"`
- `python3 yt_transcript_utils.py chunk-segments /tmp/${VIDEO_ID}_segments.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>`
- `python3 yt_transcript_utils.py chunk-document /tmp/${VIDEO_ID}_normalized_document.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>`
- `python3 yt_transcript_utils.py prepare-resume /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>`
- `python3 yt_transcript_utils.py build-chapter-plan /tmp/${VIDEO_ID}_chapters.json /tmp/${VIDEO_ID}_chunks /tmp/${VIDEO_ID}_chunks/chapter_plan.json`
- `python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript`
- `python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage <stage>`
- `python3 yt_transcript_utils.py normalize-document /tmp/${VIDEO_ID}_state.md`
- `python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md`
- `python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt`
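
Here, `<RAW_STAGE_PROMPT>` is determined by `plan-optimization`:

- Chinese monolingual output uses `cleanup_zh`
- English-source bilingual output uses `structure_only`

This keeps the workflow document limited to call order, while the decision logic lives in the scripts.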
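
The mapping above can be pictured as a small lookup. This is only a sketch: the two prompt names come from this README, while the mode keys (`zh_monolingual`, `en_source_bilingual`) and the function name are hypothetical labels, and the real decision is made inside `plan-optimization`.

```python
# Prompt names come from the README; the mode keys are hypothetical labels.
RAW_STAGE_PROMPTS = {
    "zh_monolingual": "cleanup_zh",
    "en_source_bilingual": "structure_only",
}


def raw_stage_prompt(output_mode):
    """Look up the raw-stage prompt for an output mode, failing loudly on typos."""
    try:
        return RAW_STAGE_PROMPTS[output_mode]
    except KeyError:
        raise ValueError(f"unknown output mode: {output_mode!r}") from None
```

Failing on an unknown key instead of returning a default mirrors the stated goal of not silently falling back to a generic prompt.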
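
`plan-optimization` now also emits a standardized chunk execution contract.

At the project level, it is the routing boundary between source acquisition and text transformation; at the long-text subsystem level, it defines the execution contract that subsequent chunk execution must follow:

- `operations[*].execution.supports_auto_replan`
- `operations[*].execution.recommended_cli_flags`
- `operations[*].execution.on_replan_required`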
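
`normalize-document` materializes `/tmp/${VIDEO_ID}_normalized_document.json` from raw text or a timestamped `segments.json`; when the source artifact already exists, `plan-optimization` performs this step automatically. If the segments come from the subtitle cleanup stage, the normalized document also passes the lightweight cleanup diagnostics through to `diagnostics.subtitle_cleanup` and writes an explainable subtitle source quality summary under `diagnostics.subtitle_quality`.

`plan-optimization` now expresses duration/size routing and source-path quality routing separately: `routing_reason` still explains only the short/long execution path, while `source_route_reason`, `subtitle_quality_score`, `reroute_recommended`, and `reroute_target` explain whether the current subtitle source should be kept, kept with manual review only, or switched to Deepgram.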
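
This README states that the runtime contract normalizes `reroute_recommended=true` with `reroute_target=deepgram` into `recommended_action=fallback_to_deepgram`. A minimal sketch of that normalization follows. It assumes the plan has already been parsed into a dict with those keys at the top level; the function name and the `None` result for "no reroute" are illustrative choices, not the tool's actual API.

```python
def recommended_action(plan):
    """Map reroute fields to a single action name, or None when no reroute applies."""
    # Assumption: reroute fields sit at the top level of the parsed plan dict.
    if plan.get("reroute_recommended") is True and plan.get("reroute_target") == "deepgram":
        return "fallback_to_deepgram"
    return None


# routing_reason is deliberately ignored: it explains short/long routing only.
example_plan = {
    "routing_reason": "long_input",  # hypothetical value
    "reroute_recommended": True,
    "reroute_target": "deepgram",
}
action = recommended_action(example_plan)
```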
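
For long-video chunking, `plan-optimization` now also emits an explicit chunking contract; once normalization exists, prefer `chunk-document` and record the chunk boundary / continuity assumptions explicitly in `manifest.json`.

The current design also includes explicit resume semantics: `prepare-resume` manually repairs a stale manifest, while `process-chunks` performs the same repair automatically before it continues.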
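
The current convention is fixed explicitly:

- The `raw_path` stage always uses `process-chunks --auto-replan`
- The `processed_path` stage does not replan automatically; if it returns `replan_required=true`, stop and inspect manually before continuing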
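
The two rules above can be sketched as a small decision helper. The stage names and the `replan_required` flag come from this README; the function name and the returned step labels are hypothetical and exist only to make the branching explicit.

```python
def on_replan_required(stage, replan_required):
    """Return an illustrative next step for a chunk-processing result."""
    if not replan_required:
        return "continue"
    if stage == "raw_path":
        # raw_path runs with --auto-replan, so recovery is orchestrated automatically.
        return "auto_replan"
    if stage == "processed_path":
        # processed_path never replans on its own; a human must inspect first.
        return "stop_for_manual_inspection"
    raise ValueError(f"unknown stage: {stage!r}")
```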
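
- `bilingual` means "English source text + Chinese translation", not a direct merge of two subtitle files
- If usable Chinese subtitles exist, one Chinese subtitle track is preferred as the sole source text; English subtitles are used only when Chinese subtitles are unavailable
- Chinese monolingual optimization now uses a dedicated `cleanup_zh` prompt that fixes punctuation, paragraphing, repeated subtitle fragments, and obvious Chinese spacing problems without changing the meaning
- `config.yaml` is deliberately restricted to flat top-level key-value pairs; nested structures and multi-line YAML are not supported
- YAML frontmatter values are always quoted, prioritizing parse safety over the most compact presentation
- Title/channel text in the Markdown header is escaped and link targets are encoded, so boundary characters cannot break the structure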
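- `chunk-document` is now the canonical long-video chunking entry point once `normalized_document.json` exists; it honors `preferred_chunk_source` instead of always favoring `segments` whenever timed segments are present. As a result, the long Chinese-subtitle path now chunks the body from the cleaned `text` while keeping `segments` for timeline / chapter mapping
- `chunk-text` force-splits overlong paragraphs that lack punctuation, and enables token-aware planning by default when `--prompt` is provided
- `download.sh metadata` now prefers a single `yt-dlp -J` fetch; subtitle/audio modes also reuse the video id from metadata first, then fall back to extra probing
- `transcribe-deepgram` uses Deepgram `nova-3` by default; legacy `nova-2` / `nova-2-*` model settings are promoted to `nova-3` at runtime
- `transcribe-deepgram --output-segments` optionally emits timestamp-aligned segments for later timed chunking and YouTube chapter mapping
- `transcribe-deepgram` now uses utterance-first assembly by default; `--disable-utterances --legacy-flat-output` is kept as a compatibility/troubleshooting fallback
- `transcribe-deepgram` now also writes lightweight observability fields into the result JSON, such as paragraph/sentence/word counts, per-chunk transcript metadata, and structured-output fallback warnings
- `download.sh subtitles` now tries one source subtitle track at a time in "Chinese first, English fallback" order; it moves on to English subtitles only when the visible Chinese subtitles fail to download
- The subtitle-driven workflow still supports two paths, "Chinese subtitles, monolingual output" and "English subtitles, bilingual output"; Chinese monolingual now prefers `cleanup_zh`, while English subtitles still go through `structure_only -> translate_only`
- `download.sh subtitles` now explicitly distinguishes "which tracks the platform lists" from "which track the current environment actually downloaded": `listed_candidates` describes the visible candidates, while `attempted_candidates` / `blocked_candidates` / `fallback_used` describe the actual download outcome
- When the preferred subtitle track fails because of an auth/rate-limit problem such as `HTTP 429`, `download.sh subtitles` now retries the same track with Chrome cookies before deciding whether to fall back to the next candidate
- `chunk-segments` builds a timed manifest with a timeline from segments; `build-chapter-plan` can map YouTube chapters onto chunk boundaries so that `merge-content` can inject headings
- `parse-vtt` / `parse-vtt-segments` now both perform subtitle-aware cleanup, so CJK subtitle fragments do not get stray ASCII spaces inserted when they are rejoined
- `parse-vtt-segments` now also emits lightweight cleanup diagnostics, such as duplicate cue / overlap trim counts; `normalize-document` passes these subtitle cleanup signals, together with the deterministic `subtitle_quality_score`, through to `normalized_document.json`
- `plan-optimization` now explicitly emits source-path explanation fields such as `source_route_reason`, `reroute_recommended`, `reroute_target`, and `reroute_reasons`; when the Chinese-subtitle path is of very poor quality it recommends switching to Deepgram, but it does not silently rewrite the current workflow shell
- `merge-content` now runs deterministic post-merge cleanup on the merged body only: it fixes chunk seam duplication, rejoins split body text using more conservative continuation-fragment rules, preserves explicitly passed header/frontmatter verbatim, and removes immediately repeated heading/body seams, instead of leaving these mechanical problems to the LLM
- `chunk-segments --chapters` optionally forces chunk cuts at YouTube chapter starts to reduce chapter heading drift
- If only an explicit `--chunk-size` is passed without `--prompt`, `chunk-text` keeps interpreting it as a legacy character size, so existing workflows are not silently changed
- The chunking stage validates the prompt name up front, so a misspelled prompt cannot silently fall back to the generic budget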
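- `process-chunks` now sets `max_output_tokens` per prompt budget instead of reusing a single large default
- `manifest.json` now explicitly records `plan.chunk_contract` and `plan.continuity`; `process-chunks` follows the plan-owned continuity policy instead of being silently changed by later config drift
- Chunk execution now also has explicit resume semantics: before continuing, stale `running` / missing-output checkpoints are deterministically repaired to `done` or `interrupted`
- `process-chunks` also injects lightweight continuity context from the previous chunk (tail sentence + optional section title) without enabling body overlap; the chunking budget reserves a small token cost for this carry-over context
- `process-chunks` now treats transient gateway disconnects such as `Remote end closed connection without response` as retryable transport errors, and can rerun a chunk once when its output is abnormally short or structurally abnormal before deciding whether to keep the warning
- `manifest.json` now separates the immutable `plan` from the mutable `runtime` state, and records request observability data at the `attempt_logs` level for every chunk
- `process-chunks` no longer quietly changes the budget within the current batch; if the canary or retry history shows the current plan is unhealthy, it terminates with `replan_required=true` and a new plan for the remaining raw chunks is generated through `replan-remaining`
- `process-chunks --auto-replan` automatically orchestrates the `process -> replan-remaining -> resume` recovery chain without breaking the boundary above (`raw_path` plans only)
- `build-glossary --mode transcript` now extracts terms from raw chunks, the available normalized document title/channel, chapter titles, and optional metadata/description context; when the `cleanup_zh` work_dir has no glossary yet, `process-chunks` generates one automatically
- Glossary filtering and validation now use boundary-aware matching for simple ASCII terms, so an abbreviation such as `API` is no longer falsely matched by unrelated words such as `rapid` or `capital`
- `runtime.status` now distinguishes `completed` / `completed_with_errors` / `aborted`, and a raw replan also remaps the chunk starts in an existing `chapter_plan.json`, so chapter headings still land on valid chunk boundaries at merge time
- Runtime token estimation still defaults to the local heuristic fallback; `test-token-count` / `preflight.sh --require-llm` probe provider-level token counting and explicitly fall back to the local estimate when it is unavailable
- `chunk_hard_cap_multiplier` is clamped to the conservative range `1.0-2.0`, so a configuration mistake cannot silently enlarge the chunk envelope
- `preflight.sh` validates in layers, so the subtitle-only path does not require Deepgram or LLM credentials up front; `--require-llm` probes both connectivity and token count capability
- `transcribe-deepgram` is the only supported unified Deepgram entry point; splitting and merging are handled by the Python tool
- `verify-quality` blocks the flow only when `hard_failures` is non-empty; `warnings` are manual-review hints only, and `checks` now additionally exposes explainable metrics such as `chunk_seam_warning_count`, `cjk_space_ratio`, `duplicate_ngram_ratio`, `short_paragraph_ratio`, `header_density`, `punctuation_density`, `glossary_drift_count`, and `glossary_preservation_ratio`
- Bilingual `verify-quality` now detects adjacent "English paragraph -> Chinese paragraph" pairs across the full body stream, so a short introduction ahead of the side-by-side body is no longer misjudged as having no bilingual pairs at all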
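
To make the boundary-aware glossary matching mentioned in the list above concrete, here is one possible sketch. It is not the project's implementation: the function name, the case-insensitive comparison, and the choice of ASCII-only lookarounds are assumptions. ASCII lookarounds are used instead of `\b` because Python treats CJK characters as word characters, which would stop `API` from matching when it sits directly between Chinese characters.

```python
import re


def term_in_text(term, text):
    """Boundary-aware match for simple ASCII terms; plain substring otherwise."""
    if re.fullmatch(r"[A-Za-z0-9]+", term):
        # Reject matches glued to other ASCII letters or digits ("rapid", "capital").
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    # Non-ASCII or multi-token terms: fall back to a plain substring test.
    return term in text
```

With this sketch, `term_in_text("API", "rapid growth of capital")` is false, while `term_in_text("API", "调用API接口")` is true.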
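
- `bilingual`: English source text + Chinese translation, not a merge of two subtitle files.
- Base preflight: `bash scripts/preflight.sh`, used for metadata, subtitle probing, and the subtitle path.
- Deepgram preflight: `bash scripts/preflight.sh --require-deepgram`, run only before audio transcription.
- LLM preflight: run `bash scripts/preflight.sh --require-llm` only when `plan-optimization` reports that long-video chunk processing needs it.
- Unified Deepgram entry point: `python3 yt_transcript_utils.py transcribe-deepgram ...`
- Deepgram result observability: `/tmp/${VIDEO_ID}_deepgram_result.json` now exposes per-chunk `chunk_reports`, aggregate structure counts, and `warnings` for structured-output fallback.
- Quality gate: read the JSON from `verify-quality`; `hard_failures` means you must STOP, `warnings` means manual review is needed, and `checks` gives the readability metrics behind those warnings. When a work_dir or an existing glossary is available, also pass `--work-dir` or `--glossary-path` to enable glossary drift checks.
- Source-path routing: the planning layer decides whether the current subtitle / Deepgram source is kept. `routing_reason` explains duration/input-size routing; `source_route_reason` explains source-quality routing.
- Runtime reroute action: when `plan-optimization` / `verify-quality` returns `reroute_recommended=true` and `reroute_target=deepgram`, the runtime contract normalizes it to `recommended_action=fallback_to_deepgram`, so the policy, evaluator, and decision layers use the same action vocabulary.
- Output file naming: the date fragment in the final Markdown filename must use `yyyy-mm-dd`; do not put the raw metadata `upload_date` (`yyyymmdd`) directly into the filename.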
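
For the file-naming rule in the last item, a minimal sketch of the date conversion could look like this. The helper name is hypothetical; it only assumes that `upload_date` arrives as an eight-digit `yyyymmdd` string, as described above.

```python
from datetime import datetime


def filename_date(upload_date):
    """Convert a yyyymmdd upload_date string into the yyyy-mm-dd filename form."""
    # strptime also validates the input, so an impossible date raises ValueError.
    return datetime.strptime(upload_date, "%Y%m%d").strftime("%Y-%m-%d")
```

For example, `filename_date("20240305")` returns `"2024-03-05"`.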

| Scenario | Minimum command sequence | Stop/go rule |
|---|---|---|
| Short video with subtitles | `preflight.sh` → `download.sh metadata` → create state → `validate-state --stage metadata` → `download.sh subtitle-info` → `download.sh subtitles` → `validate-state --stage post-source` → optimize → `verify-quality` | Stop only when `validate-state` or `verify-quality` returns non-empty `hard_failures` |
| Video without usable subtitles | `preflight.sh` → `download.sh metadata` → `download.sh subtitle-info` → `preflight.sh --require-deepgram` → `transcribe-deepgram` → `validate-state --stage post-source` → optimize → `verify-quality` | Stop if any command fails or `hard_failures` is non-empty |
| Long video | `validate-state --stage post-source` → `plan-optimization` → if `requires_llm_preflight=true`, run `preflight.sh --require-llm` → chunk → use `process-chunks --auto-replan` for the `raw_path` stage → run the `processed_path` translation stage when needed → merge → `verify-quality` → `validate-state --stage pre-assemble` | `warnings` do not block automatically; `hard_failures` block |
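
The stop/go rules in the table can be sketched as a small gate over the JSON that `verify-quality` prints. This is an illustration under assumptions: it treats `hard_failures` and `warnings` as top-level JSON lists, and the function name and the `stop` / `review` / `go` labels are hypothetical.

```python
import json


def quality_gate(report_json):
    """Classify a verify-quality style JSON report as stop, review, or go."""
    report = json.loads(report_json)
    # Assumption: both fields are top-level lists; missing fields count as empty.
    hard_failures = report.get("hard_failures") or []
    warnings = report.get("warnings") or []
    if hard_failures:
        return "stop", hard_failures
    if warnings:
        # Warnings never block; they only flag the output for manual review.
        return "review", warnings
    return "go", []


decision, details = quality_gate('{"hard_failures": [], "warnings": ["short output"]}')
```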
# 1. Base checks
bash scripts/preflight.sh
# 2. Metadata and subtitle availability
bash scripts/download.sh "$URL" metadata
bash scripts/download.sh "$URL" subtitle-info
# 3. State validation
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage metadata
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage post-source
# 4. Optimization plan
python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md
# 4b. The long-video raw chunk stage follows the plan contract:
# raw_path uses process-chunks --auto-replan;
# processed_path stops for manual inspection if replan_required appears
# 5. Audio fallback when needed
bash scripts/preflight.sh --require-deepgram
python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --output-text "/tmp/${VIDEO_ID}_raw_text.txt"
# To compare against the legacy flat path:
# python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --disable-utterances --legacy-flat-output
# 6. Optional transcript glossary build, chunk cleanup path only
python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript
# 7. Final quality gate
python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt
# If a glossary/work_dir exists and you want glossary drift checks, also pass --work-dir /tmp/${VIDEO_ID}_chunks
# `checks` now also includes advisory metrics such as chunk seam duplication, abnormal Chinese spacing,
# repeated-phrase density, short-fragment ratio, header density, punctuation density, and glossary drift.

MIT License