williamwang-ty/yt-transcript

yt-transcript

English | 中文


English

Transcribe YouTube videos into formatted Markdown articles. Supports subtitle download or Deepgram speech-to-text (with multi-speaker recognition).

Start here:

  • If you want the canonical terminology, see Canonical Terms below.
  • If you want the reusable execution order, see Validation Matrix and Minimum Commands.

✨ Features

  • 🎯 Smart Subtitle Fetching: Prioritizes YouTube official/auto-generated subtitles
  • 🎙️ Speech-to-Text: Auto-transcribes via Deepgram Nova-3 when no subtitles are available
  • 👥 Multi-speaker Recognition: Automatically distinguishes different speakers
  • 🌐 Bilingual Support: Automatic translation with side-by-side formatting
  • 🤖 AI Enhancement: Auto punctuation, paragraph splitting, error correction
  • 📝 Markdown Output: Formatted articles with metadata

📋 Prerequisites

  • yt-dlp: Download YouTube videos/audio/subtitles
  • ffmpeg: Audio processing (splitting, silence detection)
  • python3: Text processing
  • curl: Call Deepgram API
  • Deepgram Account: For speech transcription

Installation

# macOS
brew install yt-dlp python3 ffmpeg

# or install yt-dlp via pip (ffmpeg still has to be installed separately)
pip install yt-dlp

⚙️ Configuration

  1. Copy the config template:

    cp config.example.yaml config.yaml
  2. Edit config.yaml with your settings:

    deepgram_api_key: "your_api_key_here"
    deepgram_model: "nova-3"
    deepgram_enable_utterances: true
    deepgram_prefer_structured_output: true
    output_dir: "~/Downloads"
    
    # LLM API for long video chunk processing (optional)
    # Format: "openai" or "anthropic"
    llm_api_format: "openai"
    llm_api_key: "your_llm_api_key"
    llm_base_url: "https://api.deepseek.com"
    llm_model: "deepseek-v4-pro"
    llm_timeout_sec: 180
    llm_max_retries: 3
    llm_backoff_sec: 1.5
    llm_stream: "auto"
    llm_reasoning_probe_enabled: true
    llm_chunk_recovery_attempts: 1
    llm_chunk_recovery_backoff_sec: 1.0
    
    # yt-dlp network hardening (safe defaults)
    yt_dlp_socket_timeout_sec: 15
    yt_dlp_retries: 1
    yt_dlp_extractor_retries: 1
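    Because config.yaml is limited to flat top-level key/value entries (see Intentional Design Decisions), a file like the one above can be read without a YAML library. The following is a minimal sketch under that assumption, not the project's actual loader; parse_flat_config and _coerce are hypothetical names, and inline trailing comments are deliberately not handled.

```python
def _coerce(value: str):
    """Turn a raw scalar into str, bool, int, or float (hypothetical helper)."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]  # quoted values always stay strings
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_flat_config(text: str) -> dict:
    """Parse flat `key: value` lines; indented (nested) entries are rejected."""
    config = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.rstrip()
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and full-line comments
        if line[0] in " \t":
            raise ValueError(f"line {lineno}: nested entries are not supported")
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"line {lineno}: expected 'key: value'")
        config[key.strip()] = _coerce(value.strip())
    return config
```

    Rejecting indented lines outright makes an accidentally nested config fail loudly instead of being half-parsed.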

    YouTube "Sign in to confirm you're not a bot"

    yt-transcript now uses this policy by default:

    1. Start anonymously (do not read browser cookies up front)
    2. If yt-dlp returns Sign in to confirm you're not a bot, automatically retry (up to 3 attempts) with:
      • --cookies-from-browser chrome
    3. If that Chrome retry also fails, surface a clear error and tell you how to provide a cookies.txt file

    This means local desktop setups may recover automatically, while remote/container setups remain explicit and safe. The JSON returned by scripts/download.sh now also includes a yt_dlp_runtime object so callers can see which timeout, retry, and auth strategy actually ran.
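    The anonymous-first policy above can be sketched as follows. This is an illustration only, not the logic inside scripts/download.sh: fetch_with_cookie_fallback, subprocess_runner, and the returned strategy strings are hypothetical, and the runner is injected so the policy can be exercised without network access. The yt-dlp options used (--socket-timeout, --retries, --extractor-retries, -J, --cookies-from-browser) are standard yt-dlp flags.

```python
import subprocess
from typing import Callable, List, Tuple

# Match on a prefix only; the apostrophe style in the real message may differ.
BOT_CHECK_MARKER = "Sign in to confirm"

Runner = Callable[[List[str]], Tuple[int, str]]  # returns (exit_code, stderr)


def subprocess_runner(cmd: List[str]) -> Tuple[int, str]:
    """Default runner: execute the command and capture stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def fetch_with_cookie_fallback(url: str, run: Runner = subprocess_runner,
                               max_cookie_attempts: int = 3) -> str:
    """Return the auth strategy that succeeded, or raise with guidance."""
    base = ["yt-dlp", "--socket-timeout", "15", "--retries", "1",
            "--extractor-retries", "1", "-J", url]
    code, stderr = run(base)  # step 1: anonymous, no browser cookies
    if code == 0:
        return "anonymous"
    if BOT_CHECK_MARKER not in stderr:
        raise RuntimeError(f"yt-dlp failed: {stderr.strip()}")
    # step 2: retry with Chrome cookies, bounded by max_cookie_attempts
    cookie_cmd = base[:1] + ["--cookies-from-browser", "chrome"] + base[1:]
    for _ in range(max_cookie_attempts):
        code, stderr = run(cookie_cmd)
        if code == 0:
            return "cookies_from_browser"
    # step 3: surface a clear, actionable error
    raise RuntimeError(
        "Chrome cookie retry failed; export a Netscape-format cookies.txt and "
        "set yt_dlp_cookies_file (or YT_DLP_COOKIES_FILE)."
    )
```

    Errors other than the bot check are raised immediately, so browser cookies are only read when YouTube actually asks for a sign-in.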

    If the automatic Chrome retry fails, the most portable fix is an exported Netscape-format cookies.txt:

    yt_dlp_cookies_file: "~/.config/yt-transcript/youtube_cookies.txt"

    Or for a one-off run:

    YT_DLP_COOKIES_FILE=~/.config/yt-transcript/youtube_cookies.txt \
      bash scripts/download.sh "$URL" metadata

    You can still force browser-cookie mode explicitly:

    yt_dlp_cookies_from_browser: "chrome"

    Recommended cookies.txt import flow:

    1. Open youtube.com in a logged-in browser on your local machine
    2. Export cookies for YouTube in Netscape cookies.txt format
    3. Copy that file to the machine/container running this skill
    4. Set yt_dlp_cookies_file in config.yaml or YT_DLP_COOKIES_FILE in the environment

    In remote or container environments, yt_dlp_cookies_file is usually more reliable than --cookies-from-browser chrome.

    Note:

    • deepgram_api_key is only required when the video has no usable subtitles and audio transcription is needed.
    • Deepgram transcription defaults to nova-3. Legacy nova-2 / nova-2-* settings are automatically upgraded to nova-3 before any request is sent.
    • deepgram_enable_utterances and deepgram_prefer_structured_output now default to true; set either to false only when you need to compare against the legacy flat-transcript path.
    • Deepgram requests now use bounded automatic retries for transient timeout/network failures before surfacing an error.
    • LLM API config is only needed for long video chunk processing, or when bilingual translation is required.
    • llm_base_url can be either a provider root URL or a /v1 URL. The tool normalizes both.
    • llm_stream: "auto" prefers SSE streaming when the provider supports it.
    • With llm_reasoning_probe_enabled: true, unknown OpenAI-compatible models get one non-streaming OK probe; if the response exposes reasoning metadata, chunk calls switch to non-streaming for recovery telemetry.
    • bash scripts/preflight.sh --require-llm now performs a real low-cost LLM probe and reports latency.
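    Two of the normalizations in the notes above are easy to picture in code. This is a hedged sketch, not the tool's implementation: both function names are hypothetical, and mapping every llm_base_url to the provider root (rather than to the /v1 form) is an assumption about which canonical form is used.

```python
def upgrade_deepgram_model(model: str) -> str:
    """Map legacy nova-2 / nova-2-* settings to nova-3; default to nova-3."""
    name = model.strip()
    if not name:
        return "nova-3"  # documented default
    lowered = name.lower()
    if lowered == "nova-2" or lowered.startswith("nova-2-"):
        return "nova-3"
    return name


def normalize_base_url(url: str) -> str:
    """Reduce a root URL or a /v1 URL to the same provider root (assumed form)."""
    root = url.strip().rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    return root.rstrip("/")
```

    With this sketch, "https://api.deepseek.com" and "https://api.deepseek.com/v1/" resolve to the same value, so later request code only has to handle one form.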

🚀 Usage

As a Claude Skill

  1. Place this directory in any Claude skills directory
  2. Provide a YouTube link in your Claude conversation
  3. Claude will automatically execute the transcription workflow

The scripts resolve config.yaml relative to the skill directory, so the skill is no longer tied to ~/.claude/skills/yt-transcript.

Single Video Example

Please transcribe this video: https://www.youtube.com/watch?v=xxxxx

Multiple Videos (Batch Processing)

You can provide multiple links at once. They will be processed serially (one at a time) to ensure quality and context isolation:

Please transcribe these videos:
- https://www.youtube.com/watch?v=xxxxx
- https://www.youtube.com/watch?v=yyyyy
- https://www.youtube.com/watch?v=zzzzz

After completion, a summary table will be provided with status and output paths for each video.

Routing Notes

  • plan-optimization still records the raw duration_bucket (short vs long).
  • If a short-duration transcript is too large for reliable single-pass prompting, the planner now escalates it to chunked processing and returns video_path=long with routing_reason=oversized_short_input.
  • Workflow callers should follow video_path, operations, and routing_reason from the planner rather than duration alone.
  • Final output filenames must format date fragments as yyyy-mm-dd; do not use raw metadata upload_date (yyyymmdd) in filenames.
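The routing notes above can be illustrated with a small sketch. It is not the planner's code: plan_route, filename_date, the character-based size limit, and the "duration_threshold" reason string are hypothetical. Only the 1800-second boundary, the field names, and the oversized_short_input reason come from this README.

```python
from datetime import datetime

LONG_THRESHOLD_SEC = 1800  # README: < 1800s = short, >= 1800s = long


def plan_route(duration_sec: int, transcript_chars: int,
               single_pass_char_limit: int = 20000) -> dict:
    """Keep the raw duration bucket, but escalate oversized short inputs."""
    bucket = "long" if duration_sec >= LONG_THRESHOLD_SEC else "short"
    if bucket == "short" and transcript_chars > single_pass_char_limit:
        # Short by duration, yet too large for one prompt: chunk it anyway.
        return {"duration_bucket": "short", "video_path": "long",
                "routing_reason": "oversized_short_input"}
    return {"duration_bucket": bucket, "video_path": bucket,
            "routing_reason": "duration_threshold"}  # hypothetical value


def filename_date(upload_date: str) -> str:
    """Convert raw metadata upload_date (yyyymmdd) to yyyy-mm-dd."""
    return datetime.strptime(upload_date, "%Y%m%d").strftime("%Y-%m-%d")
```

A caller that branches on video_path rather than duration_bucket picks up the escalation automatically, which is why the notes tell callers to follow the planner's fields.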

📁 Project Structure

yt-transcript/
├── SKILL.md                 # Claude Skill workflow guide (main entry point)
├── SYSTEM_DESIGN.md         # Single authoritative system design document
├── workflows/               # Modular workflow files
├── prompts/                 # Single-task prompt templates
├── scripts/                 # Helper shell scripts
├── yt_transcript_utils.py   # Main Python entry; imports the two kernel layers directly
├── kernel/                 # Two-layer kernel package
│   ├── task_runtime/       # Generic task runtime layer
│   │   ├── runtime.py      # Ownership, command envelopes, telemetry append
│   │   ├── api.py          # Stable runtime-facing API for create/inspect/advance/control/finalize
│   │   ├── contracts.py    # Runtime contracts for task/run/action/artifact envelopes
│   │   ├── lifecycle.py    # Lifecycle shell and transition summaries for runtime stages
│   │   ├── policy.py       # Allowed-action derivation and budget-pressure policy checks
│   │   ├── evaluator.py    # Quality-gated evaluator reports and action recommendations
│   │   ├── decision.py     # Rule-first action selection and decision records
│   │   ├── ledger.py       # Runtime budget and action accounting summaries
│   │   ├── recovery.py     # Resume-safe recovery summaries and processing sub-states
│   │   ├── artifacts.py    # Artifact-graph helpers for persisted runtime outputs
│   │   ├── state.py        # Manifest/runtime persistence and control files
│   │   ├── controller.py   # Owned mutation and bounded control-loop helpers
│   │   └── telemetry.py    # Telemetry query and summary helpers
│   └── long_text/          # Long-text transformation layer
│       ├── glossary.py     # Glossary extraction and terminology checks
│       ├── semantic.py     # Semantic anchor extraction and checks
│       ├── contracts.py    # Control contracts and policy state
│       ├── autotune.py     # Chunk autotune and token-source summarization
│       ├── lifecycle.py    # Manifest lifecycle and resume-state helpers
│       ├── prompting.py    # Prompt assembly and chunking-context helpers
│       ├── llm.py          # LLM request loop and retry helpers
│       ├── processing.py   # Chunk-processing and replan execution loops
│       ├── chunking.py     # Chunking command surfaces
│       ├── merge.py        # Merge and chapter-plan command surfaces
│       └── execution.py    # Execution, resume, and replan command surfaces
├── tests/                   # Regression test suite
├── config.yaml              # Local config (gitignored)
├── config.example.yaml      # Config template
└── README.md                # This document

🏗️ Architecture & Design Overview

README.md is the operator-facing quickstart and command guide. SYSTEM_DESIGN.md is the single authoritative design document for the system architecture.

At a high level, yt-transcript is a local-first, script-first system that turns a YouTube URL into a Markdown article through:

  • preflight and configuration checks
  • metadata and subtitle availability detection
  • subtitle download or Deepgram fallback transcription
  • state synchronization and normalized document creation
  • optimization planning
  • short-path direct transformation or the long-text transformation subsystem
  • final assembly and quality gates

The codebase mirrors that design through a two-layer kernel split: kernel/task_runtime/* owns generic long-running job control, while kernel/long_text/* owns long-text transformation behavior. yt_transcript_utils.py remains the main CLI and workflow façade, but it now delegates into those two layers directly.

The runtime-upgrade path proceeds in six phases:

  • Phase 1 introduces kernel/task_runtime/contracts.py, which normalizes task, run-state, action, artifact, and quality-report envelopes without changing the nominal workflow behavior.
  • Phase 2 adds kernel/task_runtime/lifecycle.py, which wraps runtime-sensitive commands in an explicit lifecycle shell so state transitions become observable before richer policy logic is introduced.
  • Phase 3 adds kernel/task_runtime/policy.py, decision.py, and ledger.py so command envelopes can expose allowed actions, budget-pressure summaries, and rule-first decision records in a uniform way.
  • Phase 4 adds kernel/task_runtime/recovery.py and artifacts.py so long-text runs expose processing sub-states, recovery recommendations, and artifact-graph views without changing the core chunk-processing algorithms.
  • Phase 5 adds kernel/task_runtime/evaluator.py plus constrained LLM-assisted ranking hooks so quality-gated recommendations and optional model-assisted action selection can coexist without bypassing the allowed-action contract.
  • Phase 6 adds kernel/task_runtime/api.py, making create-run, inspect-run, advance-run, apply-control, resume-run, and finalize-run the preferred outer runtime contract while preserving the older control commands as compatibility helpers.

The hardest internal subsystem is long-text transformation. It activates only when the planning layer determines that the input is long enough to require chunking, continuity control, consistency protection, verification, repair / replan, and deterministic merge.

For the full design, see the separate system design document, SYSTEM_DESIGN.md.

🔁 Preferred Runtime API

For new outer-agent integrations, prefer the stable runtime-facing commands over the older path-oriented helpers:

  • create-run <work_dir> initializes or refreshes the persisted runtime task record
  • inspect-run <work_dir> returns stable task/run state, recovery summary, and allowed actions
  • advance-run <work_dir> selects the bounded next runtime action and dispatches it
  • apply-control <work_dir> --signal pause|cancel applies operator control signals
  • resume-run <work_dir> resumes a paused run
  • finalize-run <work_dir> returns a final summary and can optionally call merge-content

The default migration mode is runtime_api. Set YT_TRANSCRIPT_RUNTIME_API_MODE=legacy_cli only when you need to force the older compatibility framing.
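For outer agents that shell out to these commands, a thin argv builder keeps the call sites uniform. This is a sketch under a stated assumption: it presumes the runtime commands are subcommands of yt_transcript_utils.py, which this README does not state explicitly. runtime_command and runtime_env are hypothetical helper names.

```python
import os
from typing import Dict, List, Optional

RUNTIME_COMMANDS = {"create-run", "inspect-run", "advance-run",
                    "apply-control", "resume-run", "finalize-run"}


def runtime_command(name: str, work_dir: str,
                    signal: Optional[str] = None) -> List[str]:
    """Build the argv for one runtime-facing command."""
    if name not in RUNTIME_COMMANDS:
        raise ValueError(f"unknown runtime command: {name}")
    # Assumed entry point; adjust if the commands are exposed elsewhere.
    cmd = ["python3", "yt_transcript_utils.py", name, work_dir]
    if name == "apply-control":
        if signal not in ("pause", "cancel"):
            raise ValueError("apply-control needs --signal pause|cancel")
        cmd += ["--signal", signal]
    elif signal is not None:
        raise ValueError(f"{name} does not take a signal")
    return cmd


def runtime_env(legacy: bool = False,
                base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment, optionally forcing the legacy_cli framing."""
    env = dict(os.environ if base is None else base)
    if legacy:
        env["YT_TRANSCRIPT_RUNTIME_API_MODE"] = "legacy_cli"
    return env
```

The resulting list and environment can be passed straight to subprocess.run(cmd, env=env).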

🔧 Preflight Modes

# Base checks only: subtitles / metadata workflows
bash scripts/preflight.sh

# Require Deepgram before audio transcription
bash scripts/preflight.sh --require-deepgram

# Require LLM config before long-video chunk processing
bash scripts/preflight.sh --require-llm

🧩 Structured Script Outputs

This project is script-first: helper commands emit machine-readable JSON on stdout so routing, validation, and execution decisions stay in code instead of drifting inside prompt prose:

  • scripts/download.sh "$URL" metadata
  • scripts/download.sh "$URL" subtitle-info
  • scripts/download.sh "$URL" subtitles
  • scripts/download.sh "$URL" audio
  • python3 yt_transcript_utils.py get-chapters "$URL"
  • python3 yt_transcript_utils.py chunk-segments /tmp/${VIDEO_ID}_segments.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py chunk-document /tmp/${VIDEO_ID}_normalized_document.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py prepare-resume /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py build-chapter-plan /tmp/${VIDEO_ID}_chapters.json /tmp/${VIDEO_ID}_chunks /tmp/${VIDEO_ID}_chunks/chapter_plan.json
  • python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript
  • python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage <stage>
  • python3 yt_transcript_utils.py normalize-document /tmp/${VIDEO_ID}_state.md
  • python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md
  • python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt

Here <RAW_STAGE_PROMPT> comes from plan-optimization:

  • cleanup_zh for Chinese monolingual runs
  • structure_only for bilingual English-source runs

This keeps workflow logic in scripts instead of ad-hoc shell parsing inside the prompt instructions.
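A caller that wants to consume these JSON outputs from Python might wrap each helper like this. It is a minimal sketch: run_json is a hypothetical name, and treating any non-zero exit code as a failure is an assumption about how the scripts report errors.

```python
import json
import subprocess
from typing import Any, List


def run_json(cmd: List[str]) -> Any:
    """Run a helper command and parse the JSON it prints on stdout."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{cmd[0]} did not emit valid JSON on stdout") from exc
```

For example, run_json(["bash", "scripts/download.sh", url, "metadata"]) would return the parsed metadata object, and the caller can then branch on its fields instead of scraping text.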

plan-optimization also emits the canonical chunk execution contract.

At the whole-project level, it is the routing boundary between source acquisition and text transformation. At the long-text subsystem level, it defines the execution contract that downstream chunk processing must follow:

  • operations[*].execution.supports_auto_replan
  • operations[*].execution.recommended_cli_flags
  • operations[*].execution.on_replan_required

normalize-document materializes /tmp/${VIDEO_ID}_normalized_document.json from either raw text or timed segments.json, and plan-optimization auto-materializes it when source artifacts already exist. When the source segments came from subtitle cleanup, the normalized document now also preserves lightweight cleanup diagnostics under diagnostics.subtitle_cleanup and an explainable subtitle-path quality summary under diagnostics.subtitle_quality.

plan-optimization now separates duration routing from source routing: routing_reason still explains short-vs-long execution, while source_route_reason, subtitle_quality_score, reroute_recommended, and reroute_target explain whether the current subtitle source should be kept, manually reviewed, or replaced with Deepgram.

For long-video chunking, plan-optimization now also emits a canonical chunking block; when normalization exists, chunk-document is the preferred driver and it keeps chunk boundary / continuity assumptions explicit in manifest.json.

The current design also has explicit resume semantics: prepare-resume repairs stale manifest state manually, while process-chunks runs the same repair step automatically before execution continues.

Current policy is intentional and explicit:

  • raw_path chunk stages use process-chunks --auto-replan
  • processed_path chunk stages do not auto-replan; if replan_required=true, stop and review manually
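The policy above, combined with the operations[*].execution fields, might be applied by a workflow caller roughly as follows. This is a sketch only: plan_chunk_stage and its return shape are hypothetical, and it assumes supports_auto_replan is a boolean and recommended_cli_flags is a list of strings.

```python
from typing import Any, Dict


def plan_chunk_stage(operation: Dict[str, Any],
                     replan_required: bool = False) -> Dict[str, Any]:
    """Decide how to run one chunk stage from its execution contract."""
    execution = operation.get("execution", {})
    if replan_required and not execution.get("supports_auto_replan", False):
        # processed_path stages: never replan silently, hand back to a human.
        return {"action": "stop_and_review", "flags": []}
    # raw_path stages: follow the flags the planner recommends,
    # for example --auto-replan.
    flags = list(execution.get("recommended_cli_flags", []))
    return {"action": "process-chunks", "flags": flags}
```

Reading the flags from the plan rather than hard-coding them keeps the caller aligned with whatever contract plan-optimization emitted for that run.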

🧭 Intentional Design Decisions

  • bilingual means English source text plus Chinese translation, not subtitle file merging
  • If Chinese subtitles exist, they take precedence as the single subtitle source track; English is used only when no usable Chinese subtitle track can be downloaded
  • Chinese-source monolingual optimization now uses a dedicated cleanup_zh prompt that preserves meaning while repairing punctuation, paragraphing, duplicate subtitle fragments, and obvious spacing artifacts
  • config.yaml is intentionally limited to flat top-level key/value entries; nested or multi-line YAML is not supported
  • YAML frontmatter values are always quoted on purpose to favor safe parsing over prettier formatting
  • Markdown header text is escaped and link destinations are encoded so edge-case titles/channels do not break output structure
  • chunk-document is now the canonical long-video chunking entrypoint when normalized_document.json exists; it follows preferred_chunk_source instead of blindly preferring timed segments, so Chinese YouTube-subtitle long paths now chunk from cleaned text while still retaining segments for timing metadata
  • chunk-text force-splits very long unpunctuated passages to stay within downstream LLM chunk budgets
  • download.sh metadata now prefers a single yt-dlp -J fetch when available, and subtitle/audio modes reuse metadata-derived video IDs before falling back to extra probes
  • transcribe-deepgram defaults to Deepgram nova-3; legacy nova-2 / nova-2-* model settings are upgraded to nova-3 at runtime
  • transcribe-deepgram --output-segments can emit time-aligned segments for downstream timed chunking + YouTube chapter mapping
  • transcribe-deepgram now defaults to utterance-first transcript assembly; --disable-utterances --legacy-flat-output remains available for compatibility/debugging checks
  • transcribe-deepgram now also reports lightweight observability fields such as paragraph/sentence/word counts, per-chunk transcript metadata, and fallback warnings in its result JSON
  • chunk-segments produces timed chunk manifests, and build-chapter-plan maps YouTube chapters onto chunk boundaries for merge-content
  • parse-vtt / parse-vtt-segments now use subtitle-aware cleanup so CJK subtitle fragments are not re-joined with stray ASCII spaces
  • parse-vtt-segments now also emits lightweight cleanup diagnostics such as duplicate/overlap trimming counters, and normalize-document carries those subtitle-cleanup signals plus a deterministic subtitle_quality_score into normalized_document.json
  • plan-optimization now exposes explainable source-route fields such as source_route_reason, reroute_recommended, reroute_target, and reroute_reasons; critically poor Chinese subtitle paths can recommend Deepgram fallback without silently changing the current workflow shell
  • merge-content now runs a deterministic post-merge cleanup pass on the merged body only: it repairs chunk seams, conservatively rejoins continuation-like split fragments, preserves any explicit prefixed header/frontmatter verbatim, and drops immediately duplicated heading/body seams without asking the LLM to rewrite the document
  • chunk-segments --chapters can force chunk boundaries at YouTube chapter starts to reduce heading drift
  • chunk-text now defaults to token-aware planning when --prompt is provided, while an explicit --chunk-size without --prompt keeps legacy character sizing for workflow compatibility
  • prompt names are validated eagerly for chunk planning, so typos fail fast instead of silently falling back to generic budgets
  • process-chunks now assigns prompt-specific max_output_tokens from the same planning budget instead of using one large shared default
  • manifest.json now records explicit plan.chunk_contract and plan.continuity; process-chunks follows that plan-owned continuity policy instead of silently drifting with later config changes
  • chunk execution now also has explicit resume semantics: stale running / missing-output checkpoints are repaired deterministically into done or interrupted before work resumes
  • process-chunks also injects a short continuity context from the previous chunk (tail sentence + optional section title) without enabling body overlap, and chunk budgeting now reserves a small token allowance for that carry-over context
  • process-chunks now treats transient gateway disconnects such as Remote end closed connection without response as retryable transport failures, and can auto-rerun suspiciously short / malformed chunk outputs before keeping a warning
  • process-chunks --dry-run validates prompts, manifests, and chunk budgets without requiring live LLM credentials; actual execution still requires llm_api_key, llm_base_url, and llm_model
  • download.sh now writes subtitle and audio artifacts into per-video isolated temp directories under /tmp/${VIDEO_ID}_downloads/... and exposes download_dir in JSON for deterministic selection and cleanup
  • download.sh subtitles now requests the exact selected subtitle language codes, so regional variants such as en-GB / zh-TW work instead of being dropped by a hard-coded whitelist
  • download.sh subtitles now tries one source-family candidate at a time: Chinese first, then English only as a fallback when no usable Chinese track can be downloaded
  • download.sh subtitles now distinguishes detection vs downloadability more explicitly: listed_candidates shows tracks exposed by YouTube/yt-dlp, while attempted_candidates, blocked_candidates, and fallback_used show what the current runtime could actually fetch
  • when a preferred subtitle candidate fails with an auth-like error such as HTTP 429, download.sh subtitles now retries the same candidate with Chrome cookies before it gives up and falls back to the next candidate
  • subtitle-driven workflows still support Chinese-source monolingual mode and English-source bilingual mode; Chinese-source runs now prefer cleanup_zh, while English-source runs still use structure_only -> translate_only; when neither usable Chinese nor English subtitles can be downloaded, the workflow should stop and fall back to audio transcription
  • plan-optimization is the canonical short/long router with < 1800s = short and >= 1800s = long; the Quick Mode shortcut from SKILL.md is a narrower < 900s subset for subtitle-friendly videos
  • manifest.json now separates immutable plan metadata from runtime state, and process-chunks records attempt-level telemetry (attempt_logs) in addition to chunk-level fields
  • process-chunks no longer rewrites the current batch budget on the fly; when canary chunks or retry history show the plan is unhealthy, it aborts with replan_required=true so replan-remaining can generate a new plan for unfinished raw chunks
  • process-chunks --auto-replan preserves that architecture boundary while automating the orchestration loop (process -> replan-remaining -> resume) for raw-path plans
  • build-glossary --mode transcript now mines glossary terms from raw chunks plus any available normalized-document title/channel, chapter titles, and optional metadata/description context; cleanup_zh auto-builds that glossary when the work_dir does not have one yet
  • glossary selection and verification now use boundary-aware matching for simple ASCII terms, so acronyms like API do not get falsely selected from unrelated words such as rapid or capital
  • run_kernel_command(...) is the stable Python envelope API for kernel commands, and python3 yt_transcript_utils.py --api-envelope ... emits the same yt_transcript.command_result/v1 envelope on the CLI without breaking legacy flat JSON output
  • envelope-producing kernel commands append local yt_transcript.telemetry_event/v1 records to telemetry.jsonl when a stable nearby sink path can be inferred
  • runtime.status now distinguishes completed, completed_with_errors, and aborted, and raw-path replans remap existing chapter_plan.json chunk starts so merged chapter headers still land on valid chunk boundaries
  • runtime token estimation remains heuristic by default; test-token-count / preflight.sh --require-llm probe provider-side token counting and clearly fall back to local estimates when unavailable
  • chunk_hard_cap_multiplier is constrained to a conservative 1.0-2.0 range so misconfiguration cannot silently blow up chunk envelopes
  • preflight.sh is staged so subtitle-only workflows do not require Deepgram or LLM credentials up front, while --require-llm now performs both reachability and token-count capability probes
  • transcribe-deepgram is the only supported Deepgram entry point; split / merge behavior is owned by the Python utility
  • verify-quality is a hard gate only when hard_failures is non-empty; warnings are advisory review signals, and checks now expose explainable readability metrics such as chunk_seam_warning_count, cjk_space_ratio, duplicate_ngram_ratio, short_paragraph_ratio, header_density, punctuation_density, glossary_drift_count, and glossary_preservation_ratio
  • bilingual verify-quality now detects adjacent English-to-Chinese paragraph pairs from the full body stream, so short intro paragraphs before paired EN/ZH blocks no longer trigger a false hard failure
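One decision from the list above, boundary-aware matching for simple ASCII glossary terms, is easiest to see in code. The sketch below is an illustration rather than the project's matcher: term_in_text is a hypothetical name, and case-insensitive matching is an assumption inferred from the API versus rapid / capital example.

```python
import re

SIMPLE_ASCII = re.compile(r"[A-Za-z0-9]+")


def term_in_text(term: str, text: str) -> bool:
    """Return True when a glossary term genuinely occurs in the text."""
    if SIMPLE_ASCII.fullmatch(term):
        # ASCII-only lookarounds instead of \b: in Python, \b treats CJK
        # characters as word characters, which would hide "API" in "使用API接口".
        pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    # Non-ASCII or punctuated terms fall back to plain substring search.
    return term in text
```

With this check, "API" is found in "call the API now" and in "使用API接口", but not in "a rapid capital change".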

📘 Canonical Terms

  • bilingual: English source text plus Chinese translation. It is not subtitle-file merging.
  • preflight base: bash scripts/preflight.sh for metadata, subtitle inspection, and subtitle-driven paths.
  • preflight deepgram: bash scripts/preflight.sh --require-deepgram immediately before audio transcription.
  • preflight llm: bash scripts/preflight.sh --require-llm only when plan-optimization says long-video chunk processing requires it.
  • Deepgram unified entry: python3 yt_transcript_utils.py transcribe-deepgram ...
  • Deepgram result observability: /tmp/${VIDEO_ID}_deepgram_result.json now exposes per-chunk chunk_reports, aggregate structure counts, and warnings when segment extraction falls back from richer structures
  • quality gate: verify-quality JSON where hard_failures means STOP, warnings are advisory signals to review that do not block on their own, and checks carries the readability metrics behind those warnings. Pass --work-dir or --glossary-path when available to enable glossary drift checks.
  • source routing: the planning-layer decision about whether the current subtitle/deepgram source should continue as-is. routing_reason covers duration/size routing; source_route_reason covers source-quality routing.
  • runtime reroute action: when plan-optimization / verify-quality reports reroute_recommended=true with reroute_target=deepgram, runtime contracts normalize that into recommended_action=fallback_to_deepgram so policy, evaluator, and decision outputs stay aligned
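The reroute normalization in the last term can be sketched in a few lines. normalize_reroute is a hypothetical name, and returning None when no reroute applies is an assumption; the field names and the fallback_to_deepgram value come from this README.

```python
from typing import Any, Dict, Optional


def normalize_reroute(report: Dict[str, Any]) -> Optional[str]:
    """Map planner or quality-report reroute fields to one recommended_action."""
    if (report.get("reroute_recommended") is True
            and report.get("reroute_target") == "deepgram"):
        return "fallback_to_deepgram"
    return None  # assumed: no reroute, keep the current source
```

Funnelling both plan-optimization and verify-quality reports through one mapping is what keeps policy, evaluator, and decision outputs from disagreeing on the action name.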

🧪 Validation Matrix

| Scenario | Minimum command sequence | Stop/go rule |
| --- | --- | --- |
| Short video with subtitles | preflight.sh → download.sh metadata → create state → validate-state --stage metadata → download.sh subtitle-info → download.sh subtitles → validate-state --stage post-source → optimize → verify-quality | Stop only if validate-state or verify-quality returns non-empty hard_failures |
| Video without usable subtitles | preflight.sh → download.sh metadata → download.sh subtitle-info → preflight.sh --require-deepgram → transcribe-deepgram → validate-state --stage post-source → optimize → verify-quality | Stop on any command failure or non-empty hard_failures |
| Long video | validate-state --stage post-source → plan-optimization → if requires_llm_preflight=true, run preflight.sh --require-llm → chunk → raw-path process-chunks --auto-replan → optional processed-path translation → merge → verify-quality → validate-state --stage pre-assemble | warnings alone do not block; hard_failures block |
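The stop/go column reduces to one small rule. A minimal sketch, assuming the validate-state and verify-quality reports expose hard_failures and warnings as lists; stop_go and its three return values are hypothetical.

```python
from typing import Any, Dict


def stop_go(report: Dict[str, Any]) -> str:
    """Classify a validate-state or verify-quality report."""
    if report.get("hard_failures"):
        return "stop"    # any hard failure blocks the workflow
    if report.get("warnings"):
        return "review"  # advisory: does not block by itself
    return "go"
```

A batch runner could record the returned value per video and include it in the final summary table.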

🛠️ Minimum Commands

# 1. Base checks
bash scripts/preflight.sh

# 2. Metadata + subtitle availability
bash scripts/download.sh "$URL" metadata
bash scripts/download.sh "$URL" subtitle-info

# 3. State validation
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage metadata
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage post-source

# 4. Optimization planning
python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md

# 4b. Long-video raw chunk stages follow the plan contract
#     use process-chunks --auto-replan for raw_path,
#     but stop-and-review for processed_path replan_required

# 5. Audio fallback when needed
bash scripts/preflight.sh --require-deepgram
python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --output-text "/tmp/${VIDEO_ID}_raw_text.txt"
# optional parity check against the legacy flat path:
# python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --disable-utterances --legacy-flat-output

# 6. Optional explicit transcript glossary build for chunked cleanup paths
python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript

# 7. Final quality gate
python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt
# add --work-dir /tmp/${VIDEO_ID}_chunks when a glossary/work_dir exists and you want glossary drift checks

# `checks` now includes advisory readability signals such as chunk seam duplication,
# Chinese spacing anomalies, repeated phrase density, short paragraph ratio,
# header density, punctuation density, and glossary drift metrics.

📄 License

MIT License

🔗 Links


中文

将 YouTube 视频转录为格式化的 Markdown 文章。支持字幕下载或 Deepgram 语音转录(包含多角色识别)。

建议先看:

  • 想确认统一术语口径,直接看下方 术语口径。
  • 想复用执行顺序,直接看 验证矩阵 和 最小命令集。

✨ 功能特点

  • 🎯 智能字幕获取:优先使用 YouTube 官方/自动字幕
  • 🎙️ 语音转录:无字幕时自动使用 Deepgram Nova-3 转录
  • 👥 多说话者识别:自动区分不同讲者
  • 🌐 中英双语支持:自动翻译并对照排版
  • 🤖 AI 智能优化:自动添加标点、分段、纠错
  • 📝 Markdown 输出:带元数据的格式化文章

📋 前置依赖

  • yt-dlp:下载 YouTube 视频/音频/字幕
  • ffmpeg:音频处理(分割、静音检测)
  • python3:处理文本格式化
  • curl:调用 Deepgram API
  • Deepgram 账号:用于语音转录

安装依赖

# macOS
brew install yt-dlp python3 ffmpeg

# 或使用 pip
pip install yt-dlp

⚙️ 配置

  1. 复制配置模板:

    cp config.example.yaml config.yaml
  2. 编辑 config.yaml,填入你的配置:

    deepgram_api_key: "your_api_key_here"
    deepgram_model: "nova-3"
    deepgram_enable_utterances: true
    deepgram_prefer_structured_output: true
    output_dir: "~/Downloads"
    
    # 长视频 chunk 处理的 LLM API 配置(可选)
    # 格式: "openai" 或 "anthropic"
    llm_api_format: "openai"
    llm_api_key: "your_llm_api_key"
    llm_base_url: "https://api.deepseek.com"
    llm_model: "deepseek-v4-pro"
    llm_timeout_sec: 180
    llm_max_retries: 3
    llm_backoff_sec: 1.5
    llm_stream: "auto"
    llm_reasoning_probe_enabled: true
    llm_chunk_recovery_attempts: 1
    llm_chunk_recovery_backoff_sec: 1.0

    注意

    • deepgram_api_key 仅在没有可用字幕、需要音频转录时才必需。
    • Deepgram 转录默认使用 nova-3。旧版 nova-2 / nova-2-* 配置会在发起请求前自动提升为 nova-3。
    • deepgram_enable_utterances 和 deepgram_prefer_structured_output 现在默认都是 true;只有在需要和旧的 flat transcript 路径做对照排查时,才建议改成 false。
    • LLM API 配置仅用于长视频 chunk 处理,或需要双语翻译时。
    • llm_base_url 可以填写服务根地址或带 /v1 的地址,工具会自动归一化。
    • llm_stream: "auto" 会在 provider 支持时优先走流式响应。
    • llm_reasoning_probe_enabled: true 时,未知 OpenAI-compatible 模型会先走一次非流式 OK 探测;如果响应里有 reasoning 元数据,后续 chunk 调用会切到非流式以保留恢复所需的 usage 信息。
    • bash scripts/preflight.sh --require-llm 现在会执行一次低成本真实探活并输出延迟。

🚀 使用方法

作为 Claude Skill 使用

  1. 将此目录放入任意 Claude skills 目录
  2. 在 Claude 对话中提供 YouTube 链接
  3. Claude 将自动执行转录流程

脚本会相对于 skill 目录查找 config.yaml,不再强绑定 ~/.claude/skills/yt-transcript

单个视频示例

请帮我转录这个视频:https://www.youtube.com/watch?v=xxxxx

多个视频(批量处理)

可以一次提供多个链接,将串行处理(逐个处理)以确保质量和上下文隔离:

请帮我转录这些视频:
- https://www.youtube.com/watch?v=xxxxx
- https://www.youtube.com/watch?v=yyyyy
- https://www.youtube.com/watch?v=zzzzz

处理完成后会提供汇总表格,显示每个视频的状态和输出路径。

📁 Project Structure

yt-transcript/
├── SKILL.md                 # Claude Skill workflow guide (main entry point)
├── SYSTEM_DESIGN.md         # Single authoritative system design document
├── workflows/               # Modular workflow files
├── prompts/                 # Single-task prompt templates
├── scripts/                 # Shell helper scripts
├── yt_transcript_utils.py   # Main Python entry point; now depends directly on the two kernel subpackages
├── kernel/                 # Two-layer kernel package
│   ├── task_runtime/       # Generic task runtime layer
│   │   ├── runtime.py      # ownership, command envelope, telemetry append
│   │   ├── api.py          # Stable external runtime API: create/inspect/advance/control/finalize
│   │   ├── contracts.py    # runtime contracts: task/run/action/artifact envelope
│   │   ├── lifecycle.py    # runtime lifecycle shell and transition summary
│   │   ├── policy.py       # allowed-action derivation and budget-pressure policy
│   │   ├── evaluator.py    # quality-gated evaluator report and suggested actions
│   │   ├── decision.py     # rule-first action selection and decision record
│   │   ├── ledger.py       # runtime budget and action accounting summary
│   │   ├── recovery.py     # processing substate and recovery summary
│   │   ├── artifacts.py    # artifact graph and persisted artifact references
│   │   ├── state.py        # manifest/runtime persistence and control files
│   │   ├── controller.py   # owned mutation and bounded control-loop helpers
│   │   └── telemetry.py    # telemetry query and aggregation helpers
│   └── long_text/          # Long-text transformation layer
│       ├── glossary.py     # glossary extraction and term checks
│       ├── semantic.py     # semantic anchor extraction and checks
│       ├── contracts.py    # control contract and policy state
│       ├── autotune.py     # chunk autotune and token source summary
│       ├── lifecycle.py    # manifest lifecycle and resume state helpers
│       ├── prompting.py    # prompt assembly and chunking context helpers
│       ├── llm.py          # LLM request loop and retry helpers
│       ├── processing.py   # chunk processing and replan execution loop
│       ├── chunking.py     # chunking command surface
│       ├── merge.py        # merge and chapter-plan command surface
│       └── execution.py    # execution, resume, and replan command surface
├── tests/                   # Regression test suite
├── config.yaml              # Local configuration (gitignored)
├── config.example.yaml      # Configuration template
└── README.md                # This document

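🏗️ Architecture Overview

README.md is the operator-facing quick-start and command guide; SYSTEM_DESIGN.md is the single authoritative design document for the system architecture.

At a high level, yt-transcript is a local-first, script-first system that turns a YouTube URL into a Markdown article through the following stages:

  • preflight and configuration checks
  • metadata and subtitle availability probing
  • subtitle download or Deepgram fallback transcription
  • state synchronization and normalized document generation
  • optimization planning
  • direct short-path transformation, or entry into the long-text transformation subsystem
  • final assembly and quality gate

The code follows the same design and is split into two kernel layers: kernel/task_runtime/* handles generic long-running task control, and kernel/long_text/* handles long-text transformation behavior. yt_transcript_utils.py remains the main CLI and workflow façade, but it now delegates directly to these two layers.

Phase 6 additionally introduced kernel/task_runtime/api.py, which consolidates create-run, inspect-run, advance-run, apply-control, resume-run, and finalize-run as the preferred external runtime interface. The earlier commands such as runtime-status, process-chunks, pause-run, and cancel-run are still available, but they are positioned as compatibility helpers.

The hardest internal subsystem is long-text transformation. It activates only when the planning layer determines that the input is long enough to require chunk processing, and it is responsible for chunking, continuity, consistency protection, verification, repair / replan, and deterministic merge.

Read the detailed system design document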

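🔁 Preferred Runtime API

For new outer-agent integrations, prefer the stable runtime-facing commands over the older path-oriented helpers:

  • create-run <work_dir>: initialize or refresh the persisted runtime task record
  • inspect-run <work_dir>: return the stable task/run state, recovery summary, and allowed actions
  • advance-run <work_dir>: select a bounded next runtime action and execute it
  • apply-control <work_dir> --signal pause|cancel: apply an operator-level control signal
  • resume-run <work_dir>: resume a paused run
  • finalize-run <work_dir>: return the final summary and optionally invoke merge-content

The default migration mode is runtime_api. Set YT_TRANSCRIPT_RUNTIME_API_MODE=legacy_cli only when you must force the old compatibility wrapper.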
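
As a rough illustration of how an outer agent might assemble these calls, the sketch below builds argument lists without running anything. It assumes the runtime commands are subcommands of python3 yt_transcript_utils.py, like the other helper commands in this README; the helper names runtime_command and runtime_env are hypothetical.

```python
import os
from typing import Dict, List, Optional

# Command names as listed in this README.
RUNTIME_ACTIONS = (
    "create-run", "inspect-run", "advance-run",
    "apply-control", "resume-run", "finalize-run",
)


def runtime_command(action: str, work_dir: str, signal: Optional[str] = None) -> List[str]:
    """Build an argv list for one runtime-facing command (assumed CLI prefix)."""
    if action not in RUNTIME_ACTIONS:
        raise ValueError(f"unknown runtime action: {action}")
    argv = ["python3", "yt_transcript_utils.py", action, work_dir]
    if action == "apply-control":
        if signal not in ("pause", "cancel"):
            raise ValueError("apply-control needs --signal pause|cancel")
        argv += ["--signal", signal]
    elif signal is not None:
        raise ValueError("--signal is only valid for apply-control")
    return argv


def runtime_env(force_legacy: bool = False) -> Dict[str, str]:
    """Copy the environment; opt into the legacy wrapper only on request."""
    env = dict(os.environ)
    if force_legacy:
        env["YT_TRANSCRIPT_RUNTIME_API_MODE"] = "legacy_cli"
    return env


if __name__ == "__main__":
    print(runtime_command("inspect-run", "/tmp/demo_chunks"))
    print(runtime_command("apply-control", "/tmp/demo_chunks", signal="pause"))
```

The resulting lists can be passed to subprocess.run together with env=runtime_env() once the work directory exists.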

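🔧 Preflight Modes

# Basic dependency check only: subtitle / metadata workflow
bash scripts/preflight.sh

# Require Deepgram before audio transcription
bash scripts/preflight.sh --require-deepgram

# Require complete LLM configuration before long-video chunk processing
bash scripts/preflight.sh --require-llm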
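
The layered preflight idea can be sketched as a small selector that adds a stricter check only when the corresponding path is actually taken. The function and parameter names are hypothetical; the three commands are the ones listed above.

```python
from typing import List


def preflight_commands(needs_audio_transcription: bool,
                       requires_llm_preflight: bool) -> List[List[str]]:
    """Return the preflight invocations for one run, basic check first."""
    base = ["bash", "scripts/preflight.sh"]
    commands = [list(base)]
    if needs_audio_transcription:
        # Deepgram credentials are checked only on the audio fallback path.
        commands.append(base + ["--require-deepgram"])
    if requires_llm_preflight:
        # LLM configuration is checked only before long-video chunk processing.
        commands.append(base + ["--require-llm"])
    return commands


if __name__ == "__main__":
    for argv in preflight_commands(needs_audio_transcription=False,
                                   requires_llm_preflight=True):
        print(" ".join(argv))
```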

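🧩 Structured Script Output

This project is script-first: helper commands emit parseable JSON on stdout so that routing, validation, and execution decisions stay in code instead of drifting into prompt text: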

  • scripts/download.sh "$URL" metadata
  • scripts/download.sh "$URL" subtitle-info
  • scripts/download.sh "$URL" subtitles
  • scripts/download.sh "$URL" audio
  • python3 yt_transcript_utils.py get-chapters "$URL"
  • python3 yt_transcript_utils.py chunk-segments /tmp/${VIDEO_ID}_segments.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py chunk-document /tmp/${VIDEO_ID}_normalized_document.json /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py prepare-resume /tmp/${VIDEO_ID}_chunks --prompt <RAW_STAGE_PROMPT>
  • python3 yt_transcript_utils.py build-chapter-plan /tmp/${VIDEO_ID}_chapters.json /tmp/${VIDEO_ID}_chunks /tmp/${VIDEO_ID}_chunks/chapter_plan.json
  • python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript
  • python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage <stage>
  • python3 yt_transcript_utils.py normalize-document /tmp/${VIDEO_ID}_state.md
  • python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md
  • python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt

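Here <RAW_STAGE_PROMPT> is determined by plan-optimization:

  • Chinese monolingual output uses cleanup_zh
  • English-source bilingual output uses structure_only

This way the workflow documents keep only the call order, and the actual decision logic lives in the scripts.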
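
To make the prompt choice concrete, here is a minimal sketch that fills <RAW_STAGE_PROMPT> into the chunk-document command shown earlier. The mode labels zh_monolingual and en_bilingual are hypothetical placeholders; in the real workflow the value comes from plan-optimization output.

```python
from typing import List

# Hypothetical labels for the two output modes described in this README.
RAW_STAGE_PROMPTS = {
    "zh_monolingual": "cleanup_zh",
    "en_bilingual": "structure_only",
}


def chunk_document_command(video_id: str, output_mode: str) -> List[str]:
    """Build the chunk-document argv with the raw-stage prompt filled in."""
    if output_mode not in RAW_STAGE_PROMPTS:
        raise ValueError(f"unknown output mode: {output_mode}")
    return [
        "python3", "yt_transcript_utils.py", "chunk-document",
        f"/tmp/{video_id}_normalized_document.json",
        f"/tmp/{video_id}_chunks",
        "--prompt", RAW_STAGE_PROMPTS[output_mode],
    ]


if __name__ == "__main__":
    print(" ".join(chunk_document_command("demo", "zh_monolingual")))
```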

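plan-optimization now also emits a normalized chunk execution contract.

At the project level, it is the routing boundary between source acquisition and text transformation; at the long-text subsystem level, it defines the execution contract that subsequent chunk execution must follow: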

  • operations[*].execution.supports_auto_replan
  • operations[*].execution.recommended_cli_flags
  • operations[*].execution.on_replan_required
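
A caller could read these three fields from the plan JSON roughly as follows. The sample payload and its values (for example "stop") are invented for illustration; only the field paths come from this README.

```python
import json
from typing import Any, Dict, List


def execution_contracts(plan: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the execution contract of every operation in a plan."""
    contracts = []
    for operation in plan.get("operations", []):
        execution = operation.get("execution", {})
        contracts.append({
            "supports_auto_replan": bool(execution.get("supports_auto_replan", False)),
            "recommended_cli_flags": list(execution.get("recommended_cli_flags", [])),
            "on_replan_required": execution.get("on_replan_required"),
        })
    return contracts


# Hypothetical plan-optimization output; the real schema may carry more fields.
SAMPLE_PLAN = json.loads("""
{"operations": [
  {"execution": {"supports_auto_replan": true,
                 "recommended_cli_flags": ["--auto-replan"],
                 "on_replan_required": "stop"}}
]}
""")

if __name__ == "__main__":
    for contract in execution_contracts(SAMPLE_PLAN):
        print(contract)
```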

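normalize-document materializes /tmp/${VIDEO_ID}_normalized_document.json from raw text or a timestamped segments.json; when the source artifact already exists, plan-optimization performs this step automatically as well. If the segments come from the subtitle cleanup stage, the normalized document also passes the lightweight cleanup diagnostics through to diagnostics.subtitle_cleanup and writes an explainable subtitle source quality summary under diagnostics.subtitle_quality.

plan-optimization now expresses "duration/size routing" and "source-path quality routing" separately: routing_reason still explains only the short/long execution path, while source_route_reason, subtitle_quality_score, reroute_recommended, and reroute_target explain whether the current subtitle source should be kept, flagged for manual review only, or switched to Deepgram.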
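
The reroute fields can be interpreted with a small helper such as the one below. It is a sketch under assumptions: the label keep_current_source is hypothetical, and how a "manual review only" outcome is signalled is not specified here, so the sketch covers only the keep/switch decision. fallback_to_deepgram is the normalized action name this README uses for the runtime contract.

```python
from typing import Any, Dict


def source_decision(plan: Dict[str, Any]) -> str:
    """Map reroute fields from a plan to a single action label."""
    if plan.get("reroute_recommended") is True and plan.get("reroute_target") == "deepgram":
        return "fallback_to_deepgram"
    # Hypothetical label for "no reroute requested".
    return "keep_current_source"


if __name__ == "__main__":
    print(source_decision({"reroute_recommended": True, "reroute_target": "deepgram"}))
    print(source_decision({"reroute_recommended": False}))
```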

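For long-video chunking, plan-optimization now also emits an explicit chunking contract; once normalization exists, prefer chunk-document and record the chunk boundary / continuity assumptions explicitly in manifest.json.

The current design also includes explicit resume semantics: prepare-resume manually repairs a stale manifest, while process-chunks performs the same repair automatically before continuing.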
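
One plausible reading of "repairing a stale manifest" is sketched below. The exact rule is an assumption, not the project's documented behavior: a chunk left in running is treated as done if its output file exists and as interrupted otherwise, and a done chunk whose output is missing is downgraded to interrupted. The function name is hypothetical.

```python
def repair_chunk_status(status: str, output_exists: bool) -> str:
    """Deterministically repair one chunk status (assumed rule, see lead-in)."""
    if status == "running":
        # A stale in-flight marker: trust the output file on disk.
        return "done" if output_exists else "interrupted"
    if status == "done" and not output_exists:
        # The checkpoint claims success but the output is missing.
        return "interrupted"
    return status


if __name__ == "__main__":
    for status, exists in (("running", True), ("running", False), ("done", False)):
        print(status, exists, "->", repair_chunk_status(status, exists))
```

Because the result depends only on the recorded status and on whether the output exists, running the repair twice gives the same answer, which is what makes it safe to apply automatically before resuming.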

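The current convention is fixed explicitly:

  • The raw_path stage always uses process-chunks --auto-replan
  • The processed_path stage does no automatic replan; if it returns replan_required=true, you must stop for manual inspection first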
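
The convention above can be expressed as two small helpers. This is a sketch only: the function names are hypothetical, and the result dictionary stands in for the JSON that process-chunks prints.

```python
from typing import Any, Dict, List


def process_chunks_flags(stage: str) -> List[str]:
    """Extra process-chunks flags for a stage, following the fixed convention."""
    if stage == "raw_path":
        return ["--auto-replan"]
    if stage == "processed_path":
        return []  # no automatic replan on processed chunks
    raise ValueError(f"unknown stage: {stage}")


def must_stop_for_review(stage: str, result: Dict[str, Any]) -> bool:
    """True when a processed_path run asks for a replan and needs a human."""
    return stage == "processed_path" and result.get("replan_required") is True


if __name__ == "__main__":
    print(process_chunks_flags("raw_path"))
    print(must_stop_for_review("processed_path", {"replan_required": True}))
```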

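🧭 Deliberate Design Choices

  • bilingual means "English source text + Chinese translation"; it does not mean merging two subtitle files directly
  • If usable Chinese subtitles exist, one of the Chinese subtitle tracks is preferred as the sole source text; the workflow falls back to English subtitles only when Chinese subtitles are unavailable
  • Chinese monolingual optimization now uses the dedicated cleanup_zh prompt, which fixes punctuation, paragraphing, duplicated subtitle fragments, and obvious Chinese spacing problems without changing the original meaning
  • config.yaml is deliberately restricted to flat top-level key-value settings; nested structures and multi-line YAML are not supported
  • YAML frontmatter values are always quoted, prioritizing parse safety over the most compact presentation
  • Title/channel text in the Markdown header is escaped and link targets are encoded, so that edge-case characters cannot break the structure
  • chunk-document is now the canonical long-video chunking entry point when normalized_document.json already exists; it honors preferred_chunk_source instead of always favoring segments whenever timed segments are present, so the Chinese-subtitle long path now prefers the cleaned text for body chunking while keeping segments for timeline / chapter mapping
  • chunk-text force-splits overlong paragraphs that lack punctuation and enables token-aware planning by default when --prompt is provided
  • download.sh metadata now prefers a single yt-dlp -J fetch; the subtitle/audio modes also reuse the video id from metadata first before falling back to extra probing
  • transcribe-deepgram defaults to Deepgram nova-3; legacy nova-2 / nova-2-* model settings are promoted to nova-3 at runtime
  • transcribe-deepgram --output-segments optionally emits timestamped aligned segments for later timed chunking and YouTube chapter mapping
  • transcribe-deepgram now uses utterance-first assembly by default; --disable-utterances --legacy-flat-output remains available as a compatibility/troubleshooting fallback
  • transcribe-deepgram now also emits lightweight observability fields in the result JSON, such as paragraph/sentence/word counts, per-chunk transcript metadata, and structured-output fallback warnings
  • download.sh subtitles now tries only one source subtitle track at a time, in "Chinese first, English fallback" order; it moves on to English subtitles only when a visible Chinese subtitle track fails to download
  • Subtitle-driven workflows still support both the "Chinese subtitles, monolingual output" and "English subtitles, bilingual output" paths; Chinese monolingual now prefers cleanup_zh, and English subtitles still go through structure_only -> translate_only
  • download.sh subtitles now distinguishes explicitly between "which tracks the platform lists" and "which track the current runtime environment actually downloaded": listed_candidates describes the visible candidates, while attempted_candidates / blocked_candidates / fallback_used describe the actual download outcome
  • When the preferred subtitle track fails because of an authentication/rate-limit problem such as HTTP 429, download.sh subtitles now first retries the same track with Chrome cookies and only then decides whether to fall back to the next candidate
  • chunk-segments generates a timed manifest with a timeline from segments; build-chapter-plan can map YouTube chapters onto chunk boundaries so that merge-content can inject headings
  • parse-vtt / parse-vtt-segments now both perform subtitle-aware cleanup, so that CJK subtitle fragments do not get ASCII spaces wrongly inserted when they are rejoined
  • parse-vtt-segments now also emits lightweight cleanup diagnostics, such as duplicate cue / overlap trim counts; normalize-document passes these subtitle cleanup signals, together with a deterministic subtitle_quality_score, through into normalized_document.json
  • plan-optimization now explicitly emits source-path explanation fields such as source_route_reason, reroute_recommended, reroute_target, and reroute_reasons; when the Chinese subtitle path is of very poor quality it recommends switching to Deepgram, but it does not silently rewrite the current workflow shell
  • merge-content now runs deterministic post-merge cleanup on the merged body only: it repairs duplicated chunk seams, rejoins split body text using more conservative continuation-fragment rules, preserves any explicitly passed header/frontmatter verbatim, and removes immediately adjacent duplicated heading/body seams, instead of leaving these mechanical problems to the LLM
  • chunk-segments --chapters optionally forces chunk cuts at YouTube chapter starts to reduce chapter heading drift
  • If only an explicit --chunk-size is passed without --prompt, chunk-text keeps interpreting it as a legacy character size, so that existing workflows are not changed silently
  • The chunking stage validates prompt names up front, so that a misspelled prompt cannot silently fall back to the generic budget
  • process-chunks now sets max_output_tokens per prompt budget instead of reusing a single large default
  • manifest.json now explicitly records plan.chunk_contract and plan.continuity; process-chunks follows the plan-owned continuity policy rather than being silently altered by later config drift
  • Chunk execution now also has explicit resume semantics: before execution continues, stale running / missing-output checkpoints are deterministically repaired to done or interrupted
  • process-chunks also injects lightweight continuity context from the previous chunk (tail sentence + optional section title) but does not enable body overlap; the chunking budget also reserves a small token cost for this carry-over context
  • process-chunks now treats transient gateway disconnects such as Remote end closed connection without response as retryable transport errors, and can automatically rerun once when a chunk comes back abnormally short or structurally abnormal before deciding whether to keep the warning
  • manifest.json now separates the immutable plan from the mutable runtime state and records attempt_logs-level request observability data for every chunk
  • process-chunks no longer quietly changes the budget within the current batch; if the canary or the retry history indicates that the current plan is unhealthy, it terminates with replan_required=true, and replan-remaining generates a new plan for the remaining raw chunks
  • process-chunks --auto-replan automatically orchestrates the process -> replan-remaining -> resume recovery chain without breaking the boundary above (raw_path plans only)
  • build-glossary --mode transcript now extracts terms from raw chunks, the available normalized document title/channel, chapter titles, and optional metadata/description context; when a cleanup_zh work_dir has no glossary yet, process-chunks generates this glossary automatically
  • Glossary filtering and validation now use boundary-aware matching for simple ASCII terms, so an abbreviation like API no longer produces false hits on unrelated words such as rapid or capital
  • runtime.status now distinguishes completed / completed_with_errors / aborted, and a raw replan also remaps the chunk starts in any existing chapter_plan.json so that chapter headings in the merge stage still land on valid chunk boundaries
  • Runtime token estimation still defaults to the local heuristic fallback; test-token-count / preflight.sh --require-llm probe provider-level token counting and explicitly fall back to the local estimate when it is unavailable
  • chunk_hard_cap_multiplier is clamped to a conservative 1.0-2.0 range, so that a configuration mistake cannot silently inflate the chunk envelope
  • preflight.sh uses layered validation, so a subtitle-only path does not need Deepgram or LLM credentials configured in advance; --require-llm probes both connectivity and token-count capability
  • transcribe-deepgram is the only supported unified Deepgram entry point; the splitting and merging logic is owned entirely by the Python tool
  • verify-quality blocks the flow only when hard_failures is non-empty; warnings serve only as manual-review hints, and checks now additionally exposes explainable metrics such as chunk_seam_warning_count, cjk_space_ratio, duplicate_ngram_ratio, short_paragraph_ratio, header_density, punctuation_density, glossary_drift_count, and glossary_preservation_ratio
  • Bilingual verify-quality now recognizes adjacent "English paragraph -> Chinese paragraph" pairs within the full body stream, so a short introduction ahead of the side-by-side body no longer causes a false verdict that bilingual pairs are missing entirely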
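
The boundary-aware glossary matching mentioned in the list above can be illustrated with a short sketch. It is not the project's implementation: the function name is hypothetical, and case-insensitive matching is an assumption made so that the rapid / capital false hits are visible. ASCII-only lookarounds are used instead of \b because Python's \b treats CJK characters as word characters, which would hide a term embedded in Chinese text.

```python
import re


def term_occurs(term: str, text: str) -> bool:
    """Return True if a glossary term occurs in text.

    Simple ASCII terms must not touch another ASCII letter or digit;
    any other term falls back to a plain substring test.
    """
    if re.fullmatch(r"[A-Za-z0-9]+", term):
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    return term in text


if __name__ == "__main__":
    print(term_occurs("API", "a rapid increase in capital"))
    print(term_occurs("API", "调用API接口"))
```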

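📘 Canonical Terms

  • bilingual: English source text + Chinese translation, not a merge of two subtitle files.
  • Basic preflight: bash scripts/preflight.sh, used for metadata, subtitle probing, and the subtitle path.
  • Deepgram preflight: bash scripts/preflight.sh --require-deepgram, run only before audio transcription.
  • LLM preflight: run bash scripts/preflight.sh --require-llm only when plan-optimization reports that long-video chunk processing is needed.
  • Unified Deepgram entry point: python3 yt_transcript_utils.py transcribe-deepgram ...
  • Deepgram result observability: /tmp/${VIDEO_ID}_deepgram_result.json now exposes per-chunk chunk_reports, aggregate structure counts, and warnings raised on structured-output fallback.
  • Quality gate: read the JSON from verify-quality; hard_failures means you must STOP, warnings means manual review is needed, and checks provides the readability metrics behind those warnings. When a work_dir or an existing glossary is available, also pass --work-dir or --glossary-path to turn on the glossary drift check.
  • Source-path routing: the planning layer decides whether the current subtitle / Deepgram source should be kept. routing_reason explains duration/input-size routing, and source_route_reason explains source-quality routing.
  • Runtime reroute action: when plan-optimization / verify-quality returns reroute_recommended=true and reroute_target=deepgram, the runtime contract normalizes it to recommended_action=fallback_to_deepgram, so that the policy, evaluator, and decision layers all use the same action vocabulary.
  • Output file naming: the date portion of the final Markdown filename must use yyyy-mm-dd; do not put the raw metadata upload_date (yyyymmdd) directly into the filename.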
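
The file-naming rule amounts to a small date conversion, sketched here with the standard library. The function name filename_date is hypothetical.

```python
from datetime import datetime


def filename_date(upload_date: str) -> str:
    """Convert a yyyymmdd upload_date into the yyyy-mm-dd form used in filenames."""
    if len(upload_date) != 8 or not upload_date.isdigit():
        raise ValueError(f"expected yyyymmdd, got {upload_date!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(upload_date, "%Y%m%d").strftime("%Y-%m-%d")


if __name__ == "__main__":
    print(filename_date("20240131"))
```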

🧪 Validation Matrix

| Scenario | Minimum command sequence | Stop/go rule |
| --- | --- | --- |
| Short video with subtitles | preflight.sh → download.sh metadata → create state → validate-state --stage metadata → download.sh subtitle-info → download.sh subtitles → validate-state --stage post-source → optimize → verify-quality | Stop only when validate-state or verify-quality returns non-empty hard_failures |
| Video with no usable subtitles | preflight.sh → download.sh metadata → download.sh subtitle-info → preflight.sh --require-deepgram → transcribe-deepgram → validate-state --stage post-source → optimize → verify-quality | Stop if any command fails or hard_failures is non-empty |
| Long video | validate-state --stage post-source → plan-optimization → if requires_llm_preflight=true, run preflight.sh --require-llm → chunk → use process-chunks --auto-replan in the raw_path stage → run the processed_path translation stage if needed → merge → verify-quality → validate-state --stage pre-assemble | warnings do not block automatically; hard_failures blocks |

🛠️ Minimum Commands

# 1. Basic checks
bash scripts/preflight.sh

# 2. Metadata and subtitle availability
bash scripts/download.sh "$URL" metadata
bash scripts/download.sh "$URL" subtitle-info

# 3. State validation
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage metadata
python3 yt_transcript_utils.py validate-state /tmp/${VIDEO_ID}_state.md --stage post-source

# 4. Optimization plan
python3 yt_transcript_utils.py plan-optimization /tmp/${VIDEO_ID}_state.md

# 4b. The long-video raw chunk stage follows the plan contract:
#     raw_path uses process-chunks --auto-replan;
#     processed_path stops for manual inspection if replan_required appears

# 5. Audio fallback when needed
bash scripts/preflight.sh --require-deepgram
python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --output-text "/tmp/${VIDEO_ID}_raw_text.txt"
# To compare against the legacy flat path:
# python3 yt_transcript_utils.py transcribe-deepgram "$AUDIO_FILE" --language "$LANGUAGE" --disable-utterances --legacy-flat-output

# 6. Optional transcript glossary build, only on the chunk cleanup path
python3 yt_transcript_utils.py build-glossary /tmp/${VIDEO_ID}_chunks --mode transcript

# 7. Final quality gate
python3 yt_transcript_utils.py verify-quality /tmp/${VIDEO_ID}_optimized.txt --raw-text /tmp/${VIDEO_ID}_raw_text.txt
# If a glossary/work_dir exists and you want the glossary drift check, also pass --work-dir /tmp/${VIDEO_ID}_chunks

# `checks` now also includes advisory metrics for chunk seam duplication, abnormal Chinese spacing,
# repeated-phrase density, short-fragment ratio, heading density, punctuation density, and glossary drift.
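
The final quality gate can be consumed by a caller along these lines. This is a sketch under assumptions: it treats hard_failures and warnings as JSON lists, the sample payload is invented, and the labels stop / review / go are hypothetical names for the three outcomes described in this README.

```python
import json


def quality_gate(report_json: str) -> str:
    """Classify verify-quality output: stop, review, or go."""
    report = json.loads(report_json)
    if report.get("hard_failures"):
        return "stop"      # non-empty hard_failures blocks the flow
    if report.get("warnings"):
        return "review"    # warnings only ask for a manual check
    return "go"


if __name__ == "__main__":
    # Hypothetical report; real output also carries a `checks` object.
    sample = '{"hard_failures": [], "warnings": ["short_paragraph_ratio high"]}'
    print(quality_gate(sample))
```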

📄 License

MIT License

🔗 Related Links
