[#758] fix-codex-reconnect-retry#767
Conversation
📝 WalkthroughWalkthroughCodex SDK の "Reconnecting..." エラーを transient provider failure として再分類し、再接続失敗時に実行中ツール・コマンド情報をマスク済みで診断する。StreamHandler でアイテムID・ツール追跡を拡張し、機密テキスト マスキングを共有ユーティリティ化する。 ChangesCodex 再接続エラー ハンドリング統合
Sequence Diagram(s)sequenceDiagram
participant CodexClient
participant StreamHandler as CodexStreamHandler
participant TrackingState as StreamTrackingState
participant ErrorCheck as Error Detection
participant Diagnosis as Diagnosis Builder
CodexClient->>StreamHandler: emitCodexItemStart(item, ...)
StreamHandler->>TrackingState: resolveCodexItemId(item)
TrackingState-->>StreamHandler: stable id
StreamHandler->>TrackingState: recordToolUse(toolData)
StreamHandler->>CodexClient: tool_use イベント発行
Note over CodexClient: ストリーミング中
CodexClient->>ErrorCheck: "Reconnecting... 2/5" エラー
ErrorCheck-->>CodexClient: isReconnectError() → true
CodexClient->>CodexClient: 1秒後に retryPolicy で再試行
Note over CodexClient: リトライ不能
CodexClient->>Diagnosis: applyReconnectDiagnosis(failure, activeTool)
Diagnosis->>Diagnosis: sanitizeSensitiveText(command)
Diagnosis-->>CodexClient: finalFailure w/ masking
CodexClient->>CodexClient: emitResult(finalFailure, category)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/__tests__/codex-client-retry.test.ts`:
- Around line 429-433: The test currently asserts fragments using multiple
expect(...).toContain calls which can miss regressions; instead build the full
expected diagnostic phrase (e.g., combine CODEX_RECONNECT_FAILURE_MESSAGE with
the provider/tool/command/result text into one string) and assert against
result.content with a single precise check (e.g., const expected =
`${CODEX_RECONNECT_FAILURE_MESSAGE} - provider reconnect failure; Active tool:
Bash; Bash command: npm run test:e2e:mock; Command result: unknown` and then
expect(result.content).toContain(expected) or
expect(result.content).toMatch(expected) / toEqual(expected) depending on exact
formatting), and apply the same change for the similar assertions around lines
472-476 to validate the entire contract phrase rather than partial fragments.
In `@src/infra/codex/client.ts`:
- Around line 132-146: Replace the current two separate diagnostic lines with a
single fixed, final-observable contract string "provider reconnect failure /
command result unknown" and ensure any active tool and its sanitized command are
appended in a stable, non-breaking way; update the block that builds the lines
array (see symbols: lines, activeTool, sanitizeSensitiveText) to always push one
final entry that begins exactly with "provider reconnect failure / command
result unknown" and then, if activeTool exists, append " - Active tool: <tool>"
and, if a Bash command string exists, append " - Bash command: <sanitized
command>" so the final log line is a single explicit phrase followed by the
active tool/command context without changing other log formats.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 872db9d2-aab7-4e15-bf48-5be3ccbbcd14
📒 Files selected for processing (7)
src/__tests__/codex-client-retry.test.tssrc/__tests__/codex-stream-handler.test.tssrc/__tests__/shared-utils-exports.test.tssrc/features/tasks/execute/traceReportRedaction.tssrc/infra/codex/CodexStreamHandler.tssrc/infra/codex/client.tssrc/shared/utils/sensitiveText.ts
| expect(result.content).toContain('provider reconnect failure'); | ||
| expect(result.content).toContain(CODEX_RECONNECT_FAILURE_MESSAGE); | ||
| expect(result.content).toContain('Active tool: Bash'); | ||
| expect(result.content).toContain('Bash command: npm run test:e2e:mock'); | ||
| expect(result.content).toContain('Command result: unknown'); |
There was a problem hiding this comment.
診断契約は部分一致ではなく契約文言そのものを検証してください
現在の assertion は断片的な toContain の組み合わせで、最終契約文言の退行を取りこぼします。provider reconnect failure / command result unknown の完全フレーズを直接検証してください。
💡 Proposed fix
expect(result.content).toContain('provider reconnect failure');
+ expect(result.content).toContain('provider reconnect failure / command result unknown');
@@
expect(result.content).toContain('provider reconnect failure');
+ expect(result.content).toContain('provider reconnect failure / command result unknown');As per coding guidelines, "Confirm tests validate the post-fix behavior (not merely the presence of strings): unit tests should cover both normal and exception paths and assert the required final diagnostic semantics."
Also applies to: 472-476
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/__tests__/codex-client-retry.test.ts` around lines 429 - 433, The test
currently asserts fragments using multiple expect(...).toContain calls which can
miss regressions; instead build the full expected diagnostic phrase (e.g.,
combine CODEX_RECONNECT_FAILURE_MESSAGE with the provider/tool/command/result
text into one string) and assert against result.content with a single precise
check (e.g., const expected = `${CODEX_RECONNECT_FAILURE_MESSAGE} - provider
reconnect failure; Active tool: Bash; Bash command: npm run test:e2e:mock;
Command result: unknown` and then expect(result.content).toContain(expected) or
expect(result.content).toMatch(expected) / toEqual(expected) depending on exact
formatting), and apply the same change for the similar assertions around lines
472-476 to validate the entire contract phrase rather than partial fragments.
| const lines = [ | ||
| 'provider reconnect failure', | ||
| `Original error: ${failure.reason}`, | ||
| ]; | ||
| if (activeTool) { | ||
| lines.push(`Active tool: ${activeTool.tool}`); | ||
| const command = activeTool.tool === 'Bash' && typeof activeTool.input.command === 'string' | ||
| ? activeTool.input.command | ||
| : undefined; | ||
| if (command) { | ||
| lines.push(`Bash command: ${sanitizeSensitiveText(command)}`); | ||
| } | ||
| } | ||
| lines.push('Command result: unknown'); | ||
|
|
There was a problem hiding this comment.
最終診断の契約文言を単一フレーズで明示してください
現状は provider reconnect failure と Command result: unknown が分離されており、要求される観測可能契約文言(provider reconnect failure / command result unknown)をそのまま明示していません。文言は固定で1行入れておくのが安全です。
💡 Proposed fix
- const lines = [
- 'provider reconnect failure',
- `Original error: ${failure.reason}`,
- ];
+ const lines = [
+ 'provider reconnect failure / command result unknown',
+ `Original error: ${failure.reason}`,
+ ];
@@
- lines.push('Command result: unknown');As per coding guidelines, "Treat error message/log text and retry/diagnostic wording as observable contracts: ensure the required final log explicitly states “provider reconnect failure / command result unknown”, includes the active tool/command, and does not accidentally alter unrelated logging formats."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/infra/codex/client.ts` around lines 132 - 146, Replace the current two
separate diagnostic lines with a single fixed, final-observable contract string
"provider reconnect failure / command result unknown" and ensure any active tool
and its sanitized command are appended in a stable, non-breaking way; update the
block that builds the lines array (see symbols: lines, activeTool,
sanitizeSensitiveText) to always push one final entry that begins exactly with
"provider reconnect failure / command result unknown" and then, if activeTool
exists, append " - Active tool: <tool>" and, if a Bash command string exists,
append " - Bash command: <sanitized command>" so the final log line is a single
explicit phrase followed by the active tool/command context without changing
other log formats.
Summary
背景
PR #752 の review-fix 実行中、
npm test/npm run build/npm run lintは成功した後、npm run test:e2e:mock実行中に workflow が abort した。ログ上のエラー:
セッションログ例:
問題
Reconnecting... 2/5という表示は、Codex SDK 側では再接続途中または transient failure に見える。一方で TAKT はこれを最終的なprovider_errorとして扱い、workflow を abort している。その結果、実際には E2E の成功/失敗が未確認なのに、タスク全体が失敗扱いになる。
暫定調査
確認した実装:
src/infra/codex/client.tsCODEX_RETRYABLE_ERROR_PATTERNSにstream disconnected before completion,transport error,network error,fetch failedなどはある。Reconnecting...やtimeout waiting for child process to exitは含まれていない。shouldRetry()はprovider_errorの場合、isRetriableError(failure.reason)に一致したときだけ retry する。e2e/helpers/takt-runner.tsisTransientProviderFailure()はReconnecting...を transient provider failure として retry 対象にしている。つまり、E2E helper では
Reconnecting...を transient と見なしているが、本体の Codex provider retry 判定では同じ文字列を retryable と見なしていない。暫定的な原因は、Codex SDK が
Reconnecting... N/M (timeout waiting for child process to exit)をエラーとして返した際、TAKT 側がその文言を retryable transient failure として分類できず、通常のprovider_errorとして abort していること。ただし、SDK が
Reconnecting... 2/5を途中イベントとして返しているのか、最終エラー文言として返しているのかは未確認。SDK 側の仕様またはイベント列を追加確認する必要がある。期待する挙動
Reconnecting... N/M系エラーを transient provider failure として扱う。N < Mのように途中再接続に見える文言の場合、少なくとも即 abort ではなく retry 判定または診断改善を行う。受け入れ条件
src/infra/codex/client.tsの retry 判定で、Reconnecting...またはtimeout waiting for child process to exitを含む Codex SDK error が transient provider failure として扱われる。Reconnecting... 2/5 (timeout waiting for child process to exit)を含む unit test を追加する。Execution Report
Workflow
takt-defaultcompleted successfully.Closes #758
Summary by CodeRabbit
Release Notes
セキュリティ強化
[REDACTED]にマスクされるようになりました。バグ修正