import-jsonl (and live capture) hangs indefinitely on observations containing multibyte/CJK text #969

Description

@spegas

import-jsonl (and live capture) hangs indefinitely on observations containing multibyte/CJK text — request never completes, no CPU activity

Summary

A single observation whose content contains CJK (Korean) text causes the import/ingestion request to stall forever (the CLI gives up at its hard-coded 2-minute AbortSignal.timeout(12e4)). During the stall both the Node worker and the native iii engine are completely idle — this is not a CPU/regex blowup; it is a deadlock / lost-response stall. Removing the CJK characters from the offending line makes the exact same import complete in ~1 second.

  • Version: @agentmemory/agentmemory v0.9.27, iii-engine v0.11.2 (pinned)
  • Platform: macOS (Darwin arm64), Node v24.16.0
  • Provider: no LLM key (no-op provider), local embeddings enabled

Impact

  • agentmemory import-jsonl cannot ingest transcripts that contain such a line (the whole file/chunk aborts at 2 min).
  • More importantly, the same ingestion path is used by the live auto-capture hooks, so a normal session containing such content stalls capture of that observation for ~120s before the hook's own timeout fires and drops it.

Reproduction

The triggering line is an assistant tool_use (Bash) whose command contains Korean text mixed with an auth-header/token pattern, e.g.:

# 한글 주석: Python으로 JSON 생성 후 curl 전송 테스트  (Korean comment)
export MY_BOT_TOKEN=$(grep MY_BOT_TOKEN ~/.env | cut -d= -f2)
JSON_PAYLOAD=$(python3 -c "
import json
message = ('🤖 **작업 완료 알림** ...\n' ...)   # contains Hangul + emoji
print(json.dumps({'content': message}))
")
HTTP_CODE=$(curl -s ... -H "Authorization: Bot ${MY_BOT_TOKEN}" ...)

Note: the actual reproduction only requires an assistant tool_use whose command
string mixes Korean (Hangul) text with a typical auth-header/token pattern. The
exact tool/service is irrelevant; values above are anonymized.

Steps:

  1. Put that single JSONL line in one.jsonl.
  2. agentmemory import-jsonl one.jsonl → hangs, fails with import timed out after 2 minutes.
  3. Strip the Korean (Hangul) characters from command only → re-run → imported 1 file(s), N observation(s) in ~1s.
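Step 3 can be scripted so that only the Hangul is removed and everything else (emoji, token pattern, structure) stays as it was. The sketch below is a helper I am assuming for illustration, not part of agentmemory: it strips Hangul from every string stored under a key named "command", wherever that key sits in the record, because the exact transcript schema is not spelled out here. Note that re-serializing with json.dumps may change insignificant whitespace in the line.

```python
import json
import re

# Hangul Jamo, Compatibility Jamo, Jamo Extended-A, Syllables, Jamo Extended-B.
HANGUL = re.compile(
    "[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]"
)


def strip_hangul_from_commands(node):
    """Return a copy of node with Hangul removed from strings under a "command" key."""
    if isinstance(node, dict):
        return {
            key: HANGUL.sub("", value)
            if key == "command" and isinstance(value, str)
            else strip_hangul_from_commands(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_hangul_from_commands(item) for item in node]
    return node


def strip_line(jsonl_line: str) -> str:
    """Parse one JSONL line, strip Hangul from command strings, and re-serialize.

    ensure_ascii=False keeps other non-ASCII text (e.g. emoji) as raw UTF-8,
    so the only difference from the input is the missing Hangul.
    """
    record = json.loads(jsonl_line)
    return json.dumps(strip_hangul_from_commands(record), ensure_ascii=False)
```

Writing strip_line(line) for the single line in one.jsonl to a second file gives the control input for the re-run in step 3.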

The line is small (~2.3 KB; longest unbroken string ~59 chars), so it is not a size issue.

What I ruled out (with isolated timing)

  • Secret-redaction regexes (SECRET_PATTERN_SOURCES / stripPrivateData): all 14 patterns run in 0 ms on the line.
  • BM25 tokenizer (SearchIndex.tokenize, segmentCjk/segmentHangul/stem): reproduced in isolation, 0–1 ms.
  • @node-rs/jieba dict load: ~70 ms (and the line is Hangul, not Han, so jieba isn't even on the path).
  • LESSON_PATTERNS in deriveCrystalAndLessons: 0 ms, and they don't run for a tool_use-only line (empty toText).

Profiling evidence (the key finding)

Output of sample taken while the import is actually stalled (using a correctly extracted line, not a shell-mangled one):

  • Node worker: main thread in uv__io_poll / kevent (waiting on I/O); V8 workers idle. Essentially no JS executing.
  • iii engine: all tokio-rt-worker threads in park_internal / wait_for_task; only ~10 samples in hyper keeping the HTTP connection open.

So the request is dispatched, the HTTP connection is held open, but neither process does any work — it waits forever. Combined with the fact that removing CJK characters fixes it, this points to a UTF-8 / message-framing bug in the worker ↔ native-engine RPC (a multibyte char landing on a buffer/chunk boundary corrupting framing → response never delivered → deadlock), rather than anything CPU-bound.
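To make the hypothesized failure class concrete, here is a minimal, self-contained illustration. It is not the agentmemory or iii-engine code, and whether the real RPC layer does anything like this is unconfirmed. It shows two ways multibyte text can break a framed stream: a length counted in characters is smaller than the length in bytes, and decoding each chunk on its own corrupts a character that straddles a chunk boundary.

```python
import codecs


def decode_per_chunk(chunks):
    """Naive: decode every chunk independently; split characters become U+FFFD."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_streaming(chunks):
    """Boundary-safe: an incremental decoder buffers incomplete byte sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text + decoder.decode(b"", final=True)


payload = '{"command": "# 한글 주석"}'
data = payload.encode("utf-8")

# A length prefix computed from the string would undercount the bytes on the wire.
char_len, byte_len = len(payload), len(data)

# Cut the stream one byte into the first Hangul character (3 bytes in UTF-8).
cut = data.index("한".encode("utf-8")) + 1
chunks = [data[:cut], data[cut:]]

naive = decode_per_chunk(chunks)   # contains replacement characters
safe = decode_streaming(chunks)    # identical to payload
```

If a receiver waits for byte_len bytes while the sender announced char_len (or the reverse), one side keeps waiting for data that never arrives, which would look like the idle-on-both-ends stall in the samples above.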

Suggested directions

  • Audit the worker↔engine message (de)serialization / chunking for non-ASCII (multibyte UTF-8) payload handling.
  • As a safety net, give the server-side ingestion of a single observation its own timeout so one stalled observation can't hang the whole request indefinitely.
  • Consider making the CLI's 2-minute timeout configurable (currently hard-coded AbortSignal.timeout(12e4) in cli.mjs).
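The per-observation timeout idea can be sketched as follows. This only illustrates the pattern in Python with asyncio; the real server is a Node worker, and ingest_one, the observation values, and the timeout are placeholders I am assuming, not agentmemory APIs.

```python
import asyncio


async def ingest_all(observations, ingest_one, per_item_timeout_s):
    """Ingest observations one by one, each with its own deadline.

    Returns (done, timed_out) so a single stalled observation is reported
    and skipped instead of hanging the whole request.
    """
    done, timed_out = [], []
    for obs in observations:
        try:
            await asyncio.wait_for(ingest_one(obs), timeout=per_item_timeout_s)
            done.append(obs)
        except asyncio.TimeoutError:
            # wait_for cancels the stalled task before raising.
            timed_out.append(obs)
    return done, timed_out
```

With a guard like this, the import could finish the remaining observations and name the one that stalled, which would also remove the need to bisect by hand.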

Workaround for users

Import transcripts file-by-file; if one file stalls, bisect it down to the offending line (10-line → 1-line chunks) and drop/neutralize that line. Re-importing the same path duplicates observations (sessions dedupe by id, observations do not), so always reset the store between attempts.
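The bisection can be automated. The sketch below is a hypothetical helper, not something shipped with agentmemory. It assumes one offending line is enough to stall any batch that contains it, and that a 30-second subprocess timeout is a safe stall signal given that a clean import finishes in about a second. How to reset the store is not covered here, so that step is left as a comment.

```python
import os
import subprocess
import tempfile
from typing import Callable, List, Optional


def find_stalling_line(
    lines: List[str], stalls: Callable[[List[str]], bool]
) -> Optional[int]:
    """Return the index of one line that makes stalls() true, or None."""
    if not stalls(lines):
        return None
    lo, hi = 0, len(lines)  # invariant: lines[lo:hi] contains a stalling line
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if stalls(lines[lo:mid]):
            hi = mid
        else:
            lo = mid  # first half is clean, so the culprit is in the second half
    return lo


def import_stalls(lines: List[str], timeout_s: float = 30.0) -> bool:
    """Write lines (without trailing newlines) to a temp file and try to import it."""
    # Reset the store here before each attempt; observations are not deduplicated.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as handle:
        handle.write("\n".join(lines) + "\n")
        path = handle.name
    try:
        subprocess.run(
            ["agentmemory", "import-jsonl", path],
            timeout=timeout_s,
            capture_output=True,
            check=False,
        )
        return False  # finished (successfully or not) before the timeout
    except subprocess.TimeoutExpired:
        return True
    finally:
        os.unlink(path)
```

Usage would be along the lines of find_stalling_line(open("session.jsonl", encoding="utf-8").read().splitlines(), import_stalls), where session.jsonl is a placeholder name. If a file has several offending lines, this finds one of them per run.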
