fix(codex): rebuild traces from transcript#69
Conversation
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Major architectural improvement: replaces the old codex-log input (which relied on OTel plugin JSONL and accumulated hook events) with a new codex-transcript input that reads directly from Codex rollout transcripts as the single source of truth. The codex-hook-processor.mjs is greatly simplified from 657 lines to 65 lines — Stop is now just a best-effort wakeup marker, eliminating fragile transcript-flush timing dependencies.
Overall verdict: solid design, well-executed. The new architecture is cleaner, safer (hooks can no longer disrupt agent operation by failing mid-transcript-parse), and more robust (the transcript tailer picks up changes via polling even if a wakeup marker is missed). Comprehensive test coverage across hook processor, transcript parser, both input modules, integration converter, and trace flusher turn-boundary logic. Good documentation including the aborted-turn recovery design doc.
CLA status could not be verified (no CLA context found on the head commit). Posting as COMMENT — please ensure CLA is signed before merge.
Findings
See inline comments below. Summary of severity:
- [Warning] 1 — silent error swallowing in wakeup marker
- [Info] 3 — sort stability, config migration, no-op hook documentation
Suggestions
- Add stderr logging to the
writeWakeupMarkercatch block for debuggability. - Add a backward-compatible config fallback for
codex-log->codex-transcriptmigration. - Consider a secondary sort key in
compareTimelineEventsfor deterministic ordering.
Automated review by github-manager-bot
| fs.mkdirSync(directory, { recursive: true }); | ||
| fs.writeFileSync(temporary, JSON.stringify(payload), 'utf8'); | ||
| fs.renameSync(temporary, marker); | ||
| } catch { |
There was a problem hiding this comment.
[Warning] writeWakeupMarker silently swallows all errors with catch {}. While best-effort is appropriate for a hook (it must never disrupt Codex), consider logging to stderr for debuggability — a failed wakeup marker delays telemetry export until the next poll interval, and without any log output this failure mode would be invisible.
| process.stdout.write('{}\n'); | ||
| function main() { | ||
| const subcommand = (process.argv[2] || '').trim(); | ||
| if (subcommand !== 'stop') return; |
There was a problem hiding this comment.
[Info] main() silently returns for all subcommands except stop. The old processor handled session-start, user-prompt-submit, pre-tool-use, post-tool-use. If agents.d/codex.json still dispatches these subcommands, they are now no-ops. This is intentional (transcript is the source of truth), but a brief comment here or in codex.json would help future maintainers understand why these hooks fire without doing anything.
| ...(turn.cwd ? { 'agent.codex.cwd': turn.cwd } : {}), | ||
| }; | ||
| const records: AgentActivityEntry[] = []; | ||
| const timeline = [...turn.timeline].sort(compareTimelineEvents); |
There was a problem hiding this comment.
[Info] [...turn.timeline].sort(compareTimelineEvents) — JavaScript's Array.prototype.sort is not guaranteed stable across all engines (though V8/Node ≥ 11 is stable). If two timeline events share the same timestamp, their relative order could vary on non-V8 engines. Consider adding a secondary sort key (e.g., original array index) for deterministic ordering.
| 'cursor-hook': 'cursor', | ||
| 'claude-code-log': 'claude-code', | ||
| 'codex-log': 'codex', | ||
| 'codex-transcript': 'codex', |
There was a problem hiding this comment.
[Info] Config key changes from codex-log to codex-transcript. Existing users with codex-log listener config entries (pollInterval, enabled) will silently lose their customization. Consider adding a backward-compatible fallback that reads codex-log config if codex-transcript is absent, or a one-time deprecation warning.
|
This PR has conflicts with the git fetch origin
git checkout yunshen/codex-aborted-turn-recovery
git rebase origin/main
# resolve conflicts, then:
git push --force-with-leaseThis is a one-time reminder. Feel free to @mention me for a re-review after conflicts are resolved. Automated notification by github-manager-bot |
linrunqi08
left a comment
There was a problem hiding this comment.
Code Review — PR #69 Codex Telemetry: Hook → Transcript Migration
Reviewed the full 5700-line diff across 38 changed files. Overall this is a well-structured migration from hook-based to transcript-based telemetry collection. Solid architecture with the wakeup-marker pattern, deterministic IDs via SHA-256, and the new BaseInput.cyclePromise serialization. Below are findings ranked by severity.
1. 🔴 Massive code duplication between codex-transcript and codex-aborted-turn modules
Files: src/inputs/codex-transcript/ and src/inputs/codex-aborted-turn/
The following functions are duplicated near-identically across both modules:
asRecord,stringValue,timestampMs— both modules define local copiestoJsonValue,sameUsage— extractor files both define thesereadJsonLines,readJsonLineAt— both input files have independent implementationssessionIdFromTranscriptPath,hashId,buildEntry— builders and extractors duplicate thesecollectRolloutFiles/discoverSessionFiles— both inputs walk the same~/.codex/sessionsdirectory
That's 400+ lines of duplicated utility code that must be kept in sync manually. The codex-aborted-turn-utils.ts file exists as a shared utilities module but only exports 3 of these functions — and even those aren't consumed by the transcript module.
Suggestion: Extract shared utilities into a codex-shared/ module (e.g. codex-shared/transcript-utils.ts and codex-shared/io-utils.ts).
2. 🟠 timestampMs inconsistency: returns 0 vs undefined on parse failure
Files:
src/inputs/codex-transcript/codex-transcript-input.tsline 432 — returnsDate.now()on failuresrc/inputs/codex-transcript/codex-transcript-extractor.tsline 45 — initializesstartedAtMs = 0(implicit "no timestamp")src/inputs/codex-aborted-turn/codex-aborted-turn-utils.tsline 11 — returnsundefinedon failure
Three different failure modes for the same semantic operation:
- Transcript input: falls back to
Date.now()(never produces a sentinel) - Transcript extractor: uses
0as a sentinel →timestampToUnixNanos(0)produces'0'(Unix epoch 1970-01-01) - Aborted-turn utils: returns
undefined→ callers do explicit null-checks
A transcript record with a missing or malformed timestamp silently gets epoch-0 timestamps in the output telemetry, which may confuse downstream consumers or create silent data quality issues.
3. 🟠 Double cycle-serialization guard — collecting promise redundant with BaseInput.cyclePromise
Files:
src/inputs/codex-transcript/codex-transcript-input.tslines 95-100src/inputs/codex-aborted-turn/codex-aborted-turn-input.tslines 73-78
Both inputs override collect() with:
if (this.collecting) return this.collecting;
this.collecting = this.collectOnce().finally(() => {
this.collecting = null;
});
return this.collecting;This is identical in purpose to BaseInput.runCycle() which now has:
if (this.cyclePromise) return this.cyclePromise;
this.cyclePromise = this.runCycleOnce().finally(() => {
this.cyclePromise = null;
});The outer cyclePromise guard already prevents concurrent cycles, making the inner collecting guard dead code. It's not harmful but adds confusion about which guard is load-bearing.
4. 🟠 readJsonLines allocates full-range buffer for entire unscanned range
Files:
src/inputs/codex-transcript/codex-transcript-input.tsline 356src/inputs/codex-aborted-turn/codex-aborted-turn-input.ts(same pattern)
const buffer = Buffer.alloc(endOffset - startOffset);During baselineFile() on a transcript that hasn't been scanned yet, startOffset = 0 and endOffset = stat.size. A Codex session with extensive tool use can produce transcripts of tens of MB. This allocates the entire file as a single buffer every 30 seconds per-transcript until fully consumed.
Consider chunked reading (e.g. 1MB at a time) to bound memory usage, especially since multiple transcripts may be processed in the same cycle.
5. 🟡 Hook processor silently swallows all errors and drops stdout response
File: assets/hooks/codex-hook-processor.mjs lines 46-51
try {
fs.mkdirSync(directory, { recursive: true });
fs.writeFileSync(temporary, JSON.stringify(payload), 'utf8');
fs.renameSync(temporary, marker);
} catch {
try { fs.unlinkSync(temporary); } catch {}
}Two issues:
- Silent error swallowing: The old code called
logHookError()on failures; the new code catches all errors (disk full, permissions, path too long) without any logging. If the wakeup directory is not writable, the Stop hook silently fails and the transcript tailer must wait for the next poll cycle (up to 30s delay) instead of waking immediately. - Dropped stdout response: The old code wrote
process.stdout.write('{}\n')on every exit path. The new code writes nothing to stdout. If Codex expects a JSON response on stdout (some hook protocols do), this could cause the hook to be treated as failed by the caller.
6. 🟡 Both inputs independently scan ~/.codex/sessions if both are registered
Files:
src/inputs/codex-transcript/codex-transcript-input.ts— registered in orchestratorsrc/inputs/codex-aborted-turn/codex-aborted-turn-input.ts— exported fromsrc/index.ts
Both inputs:
- Call
discoverSessionFiles()to walk the same~/.codex/sessionsdirectory - Both handle
turn_abortedevents from the same transcripts CodexTranscriptInputis registered in the orchestrator;CodexAbortedTurnInputis exported but not registered
If an external consumer registers both inputs (possible since both are exported), every aborted turn would produce duplicate telemetry entries. Consider documenting the mutual-exclusion constraint, or adding a runtime guard.
aad8877 to
a4dc50d
Compare
a4dc50d to
09d4e21
Compare
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Re-review after new commits. The latest commit (fix(codex): harden transcript retry and hook cleanup) is clean and well-tested:
- The scan-offset advancement was fixed so a terminal turn that must be retried is no longer skipped on the next poll — it now uses
line.endOffsetof the last processed turn instead oflines.nextOffset(correctness fix). - Hook uninstall now removes legacy
replaceHookCommandsand deletes the array key when it becomes empty (good cleanup + migration support).
The earlier catch {} in writeWakeupMarker now attempts temp-file cleanup before swallowing, which addresses the resource-leak aspect of the previous warning. CI is green. Good hardening pass.
Notes
- The
catch {}still does not log (silent swallow). For a hook that must never disrupt Codex this is a defensible design choice; if you want debuggability, a singlelogger.debugline would help triage wakeups that silently no-op. - The
codex-transcriptrefactor (commit 09d4e21) is substantial (~1200 lines across builder/extractor/input). CI passes and the parts reviewed look correct; a maintainer may want a deeper pass on the transcript ordering/recovery logic before merge.
Automated review by github-manager-bot
linrunqi08
left a comment
There was a problem hiding this comment.
Review Update — 第二次检查(commit a0641f6 之后)
感谢第二个提交的修复工作,将 codex-aborted-turn 模块合并进了 codex-transcript,消除了大部分重复代码和重复扫描问题。以下是 6 个原始发现的当前状态:
✅ 已修复
Finding #1 (部分) — 代码大量重复
codex-aborted-turn 的 builder/extractor/input/types/utils 5 个独立文件已从 PR 移除,aborted turn 处理合并到 codex-transcript 模块中。大幅消除了重复代码。
Finding #6 — 两个 input 独立扫描同一目录导致重复遥测
CodexAbortedTurnInput 已从 orchestrator 移除,仅 CodexTranscriptInput 处理所有场景(含 turn_aborted),重复遥测风险消除。
⚠️ 仍未修复
Finding #2 — timestampMs 不一致:返回 0 vs Date.now()
仍然有两个 timestampMs 实现,失败行为不同:
codex-transcript-extractor.ts— 解析失败返回0return Number.isFinite(parsed) ? parsed : 0;
codex-transcript-input.ts— 解析失败返回Date.now()return Number.isFinite(parsed) ? parsed : Date.now();
同一模块内两个同名函数对 parse 失败的语义不同:0 会导致 timestampToUnixNanos(0) 输出 Unix 纪元(1970-01-01),Date.now() 则用当前时间兜底。建议统一策略,或者至少合并为一个带参数的函数(如 timestampMs(record, fallback))。
Finding #3 — 双重 cycle 序列化守卫
CodexTranscriptInput.collect() 仍保留 this.collecting 守卫:
// codex-transcript-input.ts
protected override async collect(): Promise<AgentActivityEntry[]> {
if (this.collecting) return this.collecting;
this.collecting = this.collectOnce().finally(() => { this.collecting = null; });
return this.collecting;
}这与 BaseInput.runCycle() 中新增的 cyclePromise 守卫功能完全重复:
// base-input.ts
private runCycle(): Promise<void> {
if (this.cyclePromise) return this.cyclePromise;
this.cyclePromise = this.runCycleOnce().finally(() => { this.cyclePromise = null; });
return this.cyclePromise;
}两层守卫不会导致 bug,但会让读者困惑哪层是真正生效的。建议移除子类的 collecting 守卫,依赖 BaseInput.cyclePromise。
Finding #4 — readJsonLines 一次性分配全文件大小的 Buffer
// codex-transcript-input.ts
const buffer = Buffer.alloc(endOffset - startOffset);baselineFile() 首次扫描时 startOffset = 0、endOffset = stat.size,整个 transcript 文件被一次性读入内存。长时间运行的 Codex session 可能产生几十 MB 的 transcript,每个 poll 周期(30s)对未消费完的文件都会触发这种分配。建议分块读取(如每次 1MB),或在 baseline 时直接 seek 到文件末尾只记录 offset 而不解析历史内容。
Finding #5 — Hook processor 静默吞错 + 丢弃 stdout 响应
writeWakeupMarker 仍然无日志地吞掉所有错误:
} catch {
try { fs.unlinkSync(temporary); } catch {}
}旧代码在每个 catch 分支都调用 logHookError() 记录错误,新代码完全静默。如果 wakeup 目录不可写(权限/磁盘满),唯一的症状是遥测延迟从即时变为 30 秒,难以排查。
同时 main() 不再向 stdout 写入 {}\n。如果 Codex hook 协议期望 stdout JSON 响应,这可能导致 hook 被认为执行失败。
|
timestampMs 统一 移除子类重复 cycle 守卫 readJsonLines 改为分块读取 Codex hook 恢复 fail-open 诊断与 stdout ack |
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Re-review after commit 2bcf21eb ("fix(codex): harden transcript parsing and wakeup hook"). This commit directly resolves the two findings from the previous review round and adds solid hardening:
- [Resolved] Silent
catch {}inwriteWakeupMarker— now useslogHookError()for stdin-parse, wakeup-write, and cleanup errors, providing debuggability without risking hook stability. - [Resolved] Non-
stopsubcommands produce no stdout —main()now wraps intry/finallyand always writes{}to stdout, ensuring the agent receives acknowledgment regardless of subcommand. - Chunked file reading —
readJsonLinesnow reads in 1 MB chunks with apendingbuffer for cross-chunk line boundaries, replacing the previous single-buffer allocation that could cause memory pressure on large transcripts. - Unified
timestampMs— moved to a sharedcodex-transcript-utils.tswith an explicitfallbackparameter, eliminating the two divergent implementations (0 vsDate.now()). - Removed redundant
collectingdedup — safe becauseBaseInput.runCycle()already serializes viacyclePromise, making the subclass-level guard unnecessary. - New tests cover stdout acknowledgment and error-logging paths.
CI is green across all build/test checks. CLA signed. No new findings on this commit.
Notes
- The overall PR is substantial (~2000 additions across 20 files). The transcript ordering/recovery logic was reviewed in earlier rounds and looks correct; maintainers should do a final confidence pass before merge.
Automated review by github-manager-bot
Summary
Verification