Windows port: platform-agnostic IPC, fcntl shim, POSIX signal guards, Task Scheduler daemon#12

Open
danielhertz1999-bit wants to merge 57 commits into CodeAbra:main from danielhertz1999-bit:main

Conversation

@danielhertz1999-bit commented Jun 18, 2026

Windows port — complete & validated on a real Windows 11 box

This PR (head = danielhertz1999-bit:main) is the canonical, complete Windows port. It supersedes #18 (an earlier rework, now ~84 commits behind). Everything below is verified on Windows 11 / Python 3.12 / Rust 1.96 (MSVC) with the project built from source.

Platform port

  • IPC: Unix-domain socket → TCP loopback with an ephemeral port persisted to a per-endpoint .port file (_ipc.py); honors IAI_DAEMON_SOCKET_PATH for isolation.
  • Auth-token handshake on the loopback TCP (32-byte token, ACL-restricted via icacls, compare_digest validation) — closes the "loopback reachable by any local process" concern from @warplayer's review. Token file is also per-endpoint isolated.
  • fcntl → _filelock shim (msvcrt, offset-preserving, POSIX-blocking emulation); resource/signal/os.geteuid/pwd guarded; os.kill(pid,0) → psutil.pid_exists.
  • Daemon installer via Task Scheduler (schtasks XML); PowerShell hooks; %APPDATA% log paths; icacls key security.
  • package-level __main__.py (the hooks call python -m iai_mcp); UTF-8 forced on all text I/O + std streams; os.replace for atomic moves.
  • Rust Cargo.toml macOS-only features gated; pip install -e . builds the native ext + hnswlib clean.

Validation

  • Builds + imports clean; core test suite ~90%+ green.
  • Full daemon/socket test suite ported to the cross-platform transport (with the auth handshake) — passes or cleanly skips POSIX-only cases; no hangs. Includes real-daemon-spawn tests.

Bugs fixed (surfaced by the port)

_patch_claude_desktop_config crash · _filelock offset/blocking · doctor.check_b socket-file gate · capture._pid_is_alive (os.kill(pid,0) sends CTRL_C_EVENT on Windows) · port-file & token-file isolation.

Happy to rebase onto the latest main whenever you're ready to merge. Thanks for tracking this!

🤖 Validated with Claude Code on a real Windows 11 build

claude and others added 13 commits June 17, 2026 21:11
The Rust hf-hub client uses a different TLS stack than Python and fails
to reach huggingface.co in containers with custom CA certificates
(UnknownIssuer). Add _auto_set_embed_offline() which runs at daemon
startup: if the bge-small-en-v1.5 snapshot directory already exists in
the HF cache, it sets IAI_MCP_EMBED_OFFLINE=1 automatically so the
embedder skips the network download entirely.

To seed the cache in a restricted environment, download the three model
files (model.safetensors, tokenizer.json, config.json) via Python's SSL
stack and place them under:
~/.cache/huggingface/hub/models--BAAI--bge-small-en-v1.5/snapshots/<rev>/

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mf3VFyVtczcK2WxKKCyBS4
…tdfl63

Auto-detect HF model cache to bypass Rust TLS in restricted environments
Introduces src/iai_mcp/_ipc.py as the central abstraction for all
daemon socket communication. On POSIX it delegates to the existing
Unix-domain socket; on Windows it uses TCP loopback with the ephemeral
port persisted in ~/.iai-mcp/.daemon.port.

Replaces all nine raw asyncio.open_unix_connection /
asyncio.start_unix_server / socket.AF_UNIX call-sites with the new
open_ipc_connection / start_ipc_server / make_sync_ipc_socket helpers.
The POSIX code-paths are structurally unchanged.

This is step 1 of the Windows port. Remaining blockers (fcntl, resource
module, POSIX signals, shell hooks, daemon installer) are tracked in the
audit at the top of _ipc.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
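A rough sketch of the shape of the `_ipc.py` helpers, assuming the per-endpoint `<socket-path>.port` convention adopted later in this PR (this commit itself uses the single `~/.iai-mcp/.daemon.port`). The explicit `use_tcp` flag is hypothetical; it exists so the Windows path can be exercised on any OS.

```python
import asyncio
import sys
from pathlib import Path

# Windows branch: TCP loopback plus a sidecar port file (assumed "<endpoint>.port").
USE_TCP = sys.platform == "win32"


def _port_file(endpoint: Path) -> Path:
    return endpoint.with_name(endpoint.name + ".port")


async def start_ipc_server(handler, endpoint: Path, use_tcp: bool = USE_TCP):
    if not use_tcp:
        return await asyncio.start_unix_server(handler, path=str(endpoint))
    # Port 0 asks the OS for an ephemeral port; persist it so clients can find it.
    server = await asyncio.start_server(handler, host="127.0.0.1", port=0)
    port = server.sockets[0].getsockname()[1]
    _port_file(endpoint).write_text(str(port), encoding="utf-8")
    return server


async def open_ipc_connection(endpoint: Path, use_tcp: bool = USE_TCP):
    if not use_tcp:
        return await asyncio.open_unix_connection(str(endpoint))
    port = int(_port_file(endpoint).read_text(encoding="utf-8").strip())
    return await asyncio.open_connection("127.0.0.1", port)
```

Callers see the same `(reader, writer)` pair on both platforms, which is what lets the POSIX call sites stay structurally unchanged.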
Add _filelock.py: on POSIX delegates to fcntl.flock; on Windows uses
msvcrt.locking with EWOULDBLOCK normalisation so all non-blocking callers
work unchanged.  Also changes doctor check_c to open the lock file with
O_RDWR (msvcrt.locking requires write access; harmless on POSIX).

Updated 7 files: capture_queue, lifecycle_event_log, lifecycle,
lock_protocol, hippo/_db, hippo/__init__ (dead import), doctor/_lifecycle_checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
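The dispatch below illustrates how such a shim can present the `fcntl.flock` interface on both platforms. It is a sketch under assumptions: locking a single byte at offset 0 and the exact errno mapping are illustrative, and this version still moves the file offset and uses `LK_LOCK` for blocking calls, both of which a later commit in this PR changes.

```python
import errno
import os
import sys

if sys.platform == "win32":
    import msvcrt

    # Same numeric values fcntl uses on Linux and macOS.
    LOCK_SH, LOCK_EX, LOCK_NB, LOCK_UN = 1, 2, 4, 8

    def flock(fd: int, op: int) -> None:
        """Emulate fcntl.flock with a 1-byte msvcrt lock at offset 0."""
        if op & LOCK_UN:
            mode = msvcrt.LK_UNLCK
        elif op & LOCK_NB:
            mode = msvcrt.LK_NBLCK
        else:
            mode = msvcrt.LK_LOCK  # retries for ~10 s, then raises
        # msvcrt has no shared mode, so LOCK_SH is serviced as exclusive.
        os.lseek(fd, 0, os.SEEK_SET)  # moves the caller's offset (fixed later in this PR)
        try:
            msvcrt.locking(fd, mode, 1)
        except OSError as exc:
            if op & LOCK_NB and not op & LOCK_UN:
                # Report contention the way fcntl.flock does.
                raise BlockingIOError(errno.EWOULDBLOCK, "lock is held elsewhere") from exc
            raise
else:
    from fcntl import LOCK_EX, LOCK_NB, LOCK_SH, LOCK_UN, flock
```

Because contention surfaces as `BlockingIOError` with `EWOULDBLOCK` on both branches, non-blocking callers can keep a single `except BlockingIOError` path.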
- resource module: lazy-import behind platform check (Windows has no
  RLIMIT_NOFILE).
- Signals: build shutdown-signal list dynamically using hasattr; replace
  bare signal.SIGKILL with hasattr guard + sys.exit(1) fallback in
  _self_kill; use getattr for SIGTERM/SIGKILL in CLI stop and doctor
  orphan killer.
- os.getuid(): guarded with hasattr fallback to 0 (4 sites in _daemon.py).
- Log path: add _get_daemon_log_path() returning platform-appropriate
  path (%APPDATA%/iai-mcp/logs on Windows); add Windows branch to
  cmd_daemon_logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
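A small sketch of the guard patterns listed above. `shutdown_signals`, `current_uid`, `self_kill`, and `get_daemon_log_path` are hypothetical names, and the POSIX log directory is a placeholder since the commit only specifies the Windows location.

```python
import os
import signal
import sys
from pathlib import Path


def shutdown_signals() -> list[signal.Signals]:
    """Only the shutdown signals this platform defines (Windows lacks SIGHUP)."""
    return [getattr(signal, n) for n in ("SIGTERM", "SIGINT", "SIGHUP") if hasattr(signal, n)]


def current_uid() -> int:
    # os.getuid does not exist on Windows; fall back to 0 as the port does.
    return os.getuid() if hasattr(os, "getuid") else 0


def self_kill() -> None:
    """Hard-exit: SIGKILL where it exists, otherwise a plain non-zero exit."""
    sigkill = getattr(signal, "SIGKILL", None)
    if sigkill is not None:
        os.kill(os.getpid(), sigkill)
    sys.exit(1)


def get_daemon_log_path(platform: str = sys.platform) -> Path:
    if platform == "win32":
        appdata = os.environ.get("APPDATA") or str(Path.home() / "AppData" / "Roaming")
        return Path(appdata) / "iai-mcp" / "logs"
    return Path.home() / ".iai-mcp" / "logs"  # placeholder POSIX location
```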
- crypto.py: skip POSIX mode/uid checks on Windows (chmod is a no-op there);
  add _secure_key_file() using icacls for Windows ACL lockdown; guard
  os.fchmod with hasattr.
- cli/_crypto.py: guard st.st_uid and os.geteuid() for status report.
- memory_bank.py: guard os.fchmod call with hasattr.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
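A sketch of the key-file lockdown split, assuming `icacls` is invoked with `/inheritance:r` (drop inherited entries) and `/grant:r` (replace the user's grant with full control). The commit does not show the exact arguments, so treat `icacls_args` and `secure_key_file` as hypothetical; on a domain account the principal may need a `DOMAIN\user` form.

```python
import getpass
import os
import subprocess
import sys
from pathlib import Path


def icacls_args(path: Path, user: str) -> list[str]:
    # /inheritance:r removes inherited ACEs; /grant:r replaces any explicit grant for user.
    return ["icacls", str(path), "/inheritance:r", "/grant:r", f"{user}:(F)"]


def secure_key_file(path: Path) -> None:
    """Restrict a key file to the current user on either platform."""
    if sys.platform == "win32":
        # chmod cannot express owner-only access on Windows; use an ACL instead.
        subprocess.run(icacls_args(path, getpass.getuser()), check=True, capture_output=True)
    else:
        os.chmod(path, 0o600)
```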
Adds _is_windows() and SCHTASKS_TASK_NAME to cli/__init__.py and
implements cmd_daemon_install/uninstall/start/stop branches for Windows
in cli/_daemon.py using schtasks.exe and _render_schtasks_xml().

Also adds the Windows PowerShell turn-capture hook stub and the
WINDOWS_PORT_HANDOFF.md guide for continuing the port in a new session
(covers the remaining Steps 6-10 with file paths, code examples, and
test commands).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
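A minimal, hypothetical version of `_render_schtasks_xml()`: a logon-triggered task that runs the daemon module. The element set is a small subset of the Task Scheduler schema and the real template likely sets more options. The file is written as UTF-16 (e.g. `path.write_text(xml, encoding="utf-16")`), matching the encoding `schtasks /Create /XML` was observed to accept later in this thread.

```python
from xml.sax.saxutils import escape

TASK_NS = "http://schemas.microsoft.com/windows/2004/02/mit/task"


def render_schtasks_xml(python_exe: str, workdir: str) -> str:
    """Minimal logon-triggered task; the real template may set more <Settings>."""
    return f"""<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="{TASK_NS}">
  <Triggers>
    <LogonTrigger><Enabled>true</Enabled></LogonTrigger>
  </Triggers>
  <Settings>
    <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>
    <ExecutionTimeLimit>PT0S</ExecutionTimeLimit>
  </Settings>
  <Actions Context="Author">
    <Exec>
      <Command>{escape(python_exe)}</Command>
      <Arguments>-m iai_mcp.daemon</Arguments>
      <WorkingDirectory>{escape(workdir)}</WorkingDirectory>
    </Exec>
  </Actions>
</Task>
"""


def schtasks_create_args(task_name: str, xml_path: str) -> list[str]:
    # /F overwrites an existing task with the same name.
    return ["schtasks", "/Create", "/TN", task_name, "/XML", xml_path, "/F"]
```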
Creates three PowerShell scripts (.ps1) to replace shell hooks on Windows:
- iai-mcp-turn-capture.ps1: per-turn ambient capture (UserPromptSubmit)
- iai-mcp-session-capture.ps1: batch-capture at session end (Stop)
- iai-mcp-session-recall.ps1: session-start recall injection (SessionStart)

Updates src/iai_mcp/cli/_capture.py:
- _hook_ext() detects platform and returns '.ps1' on Windows, '.sh' on POSIX
- _capture_hook_paths(), _turn_hook_paths(), _session_recall_hook_paths()
  now use _hook_ext() for dynamic extension
- cmd_capture_hooks_install() uses 'powershell -ExecutionPolicy Bypass -File'
  for Windows hooks instead of 'bash' on POSIX
- Markers (_CAPTURE_HOOK_MARKER, etc.) changed to base names (no extension)
  so substring matching works for both .sh and .ps1 in settings.json

Updates pyproject.toml:
- Adds "_deploy/hooks/*.ps1" to package.data so PowerShell hooks are bundled

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
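A sketch of the extension switch and the extension-free marker matching described above. `hook_ext`, `hook_command`, and `is_our_hook` are illustrative stand-ins for `_hook_ext()` and the marker checks, and the marker value is assumed from the hook file names.

```python
import sys
from pathlib import Path

TURN_HOOK_MARKER = "iai-mcp-turn-capture"  # base name only, no extension


def hook_ext(platform: str = sys.platform) -> str:
    return ".ps1" if platform == "win32" else ".sh"


def hook_command(hook: Path, platform: str = sys.platform) -> str:
    if platform == "win32":
        return f'powershell -ExecutionPolicy Bypass -File "{hook}"'
    return f'bash "{hook}"'


def is_our_hook(command: str, marker: str = TURN_HOOK_MARKER) -> bool:
    # A substring match on the extension-less marker covers both .sh and .ps1 entries.
    return marker in command
```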
The bench scripts measured peak RSS via resource.getrusage(RUSAGE_SELF).ru_maxrss,
which is POSIX-only and crashes on import on Windows. Switch the Windows branch
to psutil.Process().memory_info().peak_wset; POSIX paths are unchanged.

- bench/memory_footprint.py: _rss_mb() guards on sys.platform == "win32"
- bench/memorygraph_memory.py: rss_mb() same pattern
- bench/consolidation_rss_peak.py: defer `import resource` into _ru_maxrss_bytes()
- bench/embed_warm_cost.py: rewrite _PAYLOAD_RSS subprocess template to detect
  platform and pick peak_wset / ru_maxrss; emit rss_platform key; measure_rss
  decodes the new field for the unit label, returns rss_platform in result dict

psutil is already a declared dependency (pyproject.toml), so no new package.
Verified end-to-end on Windows: all three helpers and the rewritten payload
produce sane values; AST-parse clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
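A sketch of the platform split for a peak-RSS helper. `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS; whether the bench scripts normalise that is not stated, so the scaling here is this sketch's own choice.

```python
import sys


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB."""
    if sys.platform == "win32":
        import psutil  # peak_wset is a Windows-only memory_info() field

        return psutil.Process().memory_info().peak_wset / (1024 * 1024)
    import resource  # POSIX-only, so imported lazily

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    scale = 1 if sys.platform == "darwin" else 1024  # macOS: bytes, Linux: KiB
    return peak * scale / (1024 * 1024)
```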
The fcntl→_filelock rewrite in commit 8154b9b swept timedelta and timezone
into the new _filelock import line, but those belong to datetime. The module
crashed with ImportError on any platform — masked until now because the
Windows porting work hadn't tried to import the chain end-to-end.

Move timedelta and timezone back to the `from datetime import` line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Step 5 added Task Scheduler / schtasks backends to install/uninstall/start/
stop/logs, but the argparse help strings still mentioned only launchd and
systemd. Update them to list all three platforms.

The `Windows %APPDATA%\iai-mcp\logs` reference in the logs help is escaped
as `%%APPDATA%%` so argparse's own %-formatter doesn't choke on it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
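A short illustration of why the escape is needed: argparse runs help strings through %-formatting, so a bare `%APPDATA%` fails when help is rendered (or, on newer Python versions, when the argument is added). The parser layout is hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(prog="iai-mcp daemon")
sub = parser.add_subparsers(dest="cmd")
# "%%" renders as a literal "%" after argparse's %-formatting of help text.
sub.add_parser("logs", help="Windows: %%APPDATA%%\\iai-mcp\\logs")
```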
Add a new "Verified on Windows in-situ" section recording what was actually
exercised on a Windows machine in this session: AST parse over all 23
touched files, import smoke test of 10 runtime modules, CLI help, daemon
install --dry-run producing a valid Task Scheduler XML, and capture-hooks
status detecting the .ps1 templates. Also record the lifecycle_event_log
fix and daemon help-text update commits alongside the existing Step-7
entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@warplayer

Hi @danielhertz1999-bit 👋 — I'm not a maintainer, just someone who came across this PR and happened to have a Windows 11 machine, so I took the branch for a spin to give you some independent test data. Really nice work — the platform-abstraction approach (the _ipc / _filelock shims plus explicit per-OS branches) is clean and easy to follow. Everything below is offered as a fellow contributor; please treat it as FYI and ignore anything you've already got in flight.

Environment: Windows 11 Home (build 26200), fresh gh pr checkout 12, Python 3.12.10 venv, Rust 1.96.0 (x86_64-pc-windows-msvc, VS BuildTools MSVC 14.51), Node 24.15.

What worked great (actually ran it, not just static checks)

  • pip install -e ".[dev]" succeeds end-to-end, including the Rust build. iai_mcp_native.cp312-win_amd64.pyd compiles and imports (embed + graph), hnswlib builds from sdist via cl.exe, and the whole scientific stack resolves to cp312 wheels (numpy 2.2.6, scipy 1.17.1, numba 0.65.1, pyarrow 19, cryptography 49, keyring via pywin32-ctypes). The --no-default-features Rust build keeps the macOS accelerate/metal features out, so no Apple-framework link errors.
  • mcp-wrapper builds clean (tsc → dist/index.js) on Node 24.
  • daemon install actually round-trips through Task Scheduler. The generated UTF-16 XML is accepted by schtasks /Create /XML and registers \iai-mcp-daemon with Task To Run = …\pythonw.exe -m iai_mcp.daemon, Start In = %APPDATA%\iai-mcp\logs, Run As = <user>; daemon uninstall cleanly removes it via /End and /Delete.
  • _ipc TCP-loopback works in a live round-trip — server binds 127.0.0.1:<ephemeral>, writes .daemon.port, client discovers it and echoes, port file removed on shutdown.
  • capture-hooks install writes correct Windows wiring: powershell -ExecutionPolicy Bypass -File "…ps1" for Stop/UserPromptSubmit/SessionStart with the right timeouts + matcher; status → WIRED; uninstall reverts cleanly.

A few things I ran into

  1. pytest can't be collected on Windows — 5 test modules import POSIX-only modules at top level, so collection aborts before any test runs (Interrupted: 5 errors during collection):

    • tests/test_capture_queue.py, tests/test_doctor_lock_probe.py, tests/test_live_e2e_gate.py, tests/test_lock_starvation.py → import fcntl
    • tests/test_daemon_fdlimit_and_fsm.py → import resource

    The src/ tree was nicely ported to the _filelock/_ipc shims, but these tests still import fcntl/resource directly. With --continue-on-collection-errors, a partial run (~31% of the suite; I cut it at 6m35s after repeated 60s daemon socket-bind timeouts) gave 957 passed / 98 failed / 74 skipped / 7 errors. The large majority of those failures come from the test suite itself assuming POSIX: fake daemons built on asyncio.start_unix_server / open_unix_connection / socket.AF_UNIX, plus signal.SIGKILL, os.geteuid, /tmp paths, and 0o600 mode-bit assertions. That is the same gap as the collection errors: src/ was ported but the tests weren't. A smaller set look like genuine latent cross-platform issues the port surfaces (not regressions in this PR), e.g.: file I/O without encoding="utf-8" falling back to cp1252 (bench/contradiction_longitudinal_claude.py:817 → UnicodeEncodeError on Δ); Windows file-replace semantics in crypto-key rotation (WinError 183); and a couple of hippo lock-escalation failures consistent with the LOCK_SH divergence in item 3 below. Happy to share the full categorized list.

  2. capture-hooks install crashes if the MCP wrapper isn't built yet and Claude Desktop is installed. _patch_claude_code_config handles a missing wrapper by writing a placeholder (cli/_capture.py:452-470), but _patch_claude_desktop_config calls _build_iai_mcp_server_entry() without the same guard (cli/_capture.py:404,422), so it raises an uncaught FileNotFoundError after already writing the hooks + settings.json. Repro on my box (Desktop installed, wrapper unbuilt): install writes the hooks, then traceback + non-zero exit. Building mcp-wrapper first makes it exit 0. Since the README/handoff flow installs hooks before building the wrapper, it's easy to trip. (Not strictly Windows-specific.)

  3. _filelock shared locks aren't actually shared on Windows. msvcrt.locking is exclusive-only, so the shim maps LOCK_SH to the same exclusive lock as LOCK_EX. Cross-process repro: process A takes LOCK_SH, process B's LOCK_SH | LOCK_NB returns EWOULDBLOCK (on POSIX the second reader gets the lock). That changes the multi-reader path in hippo/_db.py (LOCK_SH ~lines 253/306), lock_protocol.acquire_client_shared_nb, and the doctor shared-probe at doctor/_lifecycle_checks.py:186 (it would read "exclusively held" when only a reader holds it). The errno-normalization itself is correct — it's the lock mode that differs. A faithful shared lock on Windows would need LockFileEx (via ctypes), or this could just be documented as a known divergence.

  4. Minor: _filelock.flock() moves the file offset. The Windows path does os.lseek(fd, 0, SEEK_SET) before locking, whereas fcntl.flock leaves the offset untouched (observed 7 → 0 after a flock call). Probably harmless since the current callers seek explicitly, but it's a behavioral difference. (Also, per your own code comment, blocking LK_LOCK raises after ~10s vs POSIX blocking forever — could surface as spurious failures under long contention.)

  5. Design question (not a bug): Windows IPC is a loopback TCP port with no auth token. The Unix-domain socket's access was bounded by filesystem perms; 127.0.0.1:<port> is reachable by any local process/user. CONTRIBUTING.md flags "network-exposed surface beyond the existing local UNIX socket" as sensitive, so I figured it was worth surfacing — a Windows named pipe (\\.\pipe\…) would preserve ACL-based access control if that property matters here. Could well be a deliberate trade-off.

(FWIW the em-dash-renders-as--on-a-cp1252-console thing you already flagged in the handoff is indeed cosmetic — the on-disk XML is UTF-16 and schtasks reads it fine.)

Happy to share the exact repro scripts, open issues, or send a small PR for the test-collection guards and/or the _patch_claude_desktop_config fallback — only if it's useful and doesn't cut across what you're already doing. Thanks for porting this to Windows! 🙏

@CodeAbra
Owner

hey guys i would really appreciate some help with windows porting
please feel free to open PRs

dealing with repo rn

…m shims

5 test modules imported fcntl or resource at top level, aborting pytest
collection before any test ran on Windows. Switch to the existing
_filelock shim (LOCK_EX/SH/UN/NB + flock) and guard the resource import
with a sys.platform check, skipping TestRaiseFdLimitClampsToHard on
Windows (the resource module, and with it resource.RLIM_INFINITY, is
unavailable there).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@danielhertz1999-bit
Author

Thanks for the warm welcome and for sharing that Windows support is a priority! This PR covers Steps 1–7 of a full Windows port: platform-agnostic IPC (Unix sockets → TCP loopback), fcntl/resource/signal/os.geteuid shims, a Task Scheduler daemon installer, and PowerShell hook scripts. Happy to iterate on any feedback.

🤖 Addressed by Claude Code

@danielhertz1999-bit
Author

@warplayer — this is incredibly thorough, thank you! Really appreciate the live test data. Addressing your points:

1. Test collection errors (fcntl / resource) — fixed in commit 269e90a. The 5 test files now import from the _filelock shim instead of fcntl directly; the resource import is guarded with sys.platform != "win32" and the TestRaiseFdLimitClampsToHard class is @skipif'd on Windows. The remaining failures you described (POSIX-assuming fake daemons, /tmp paths, cp1252 encoding, WinError 183 key rotation) are on the backlog — good to have them categorized.

2. _patch_claude_desktop_config crash when wrapper unbuilt: confirmed, this is a real bug (not Windows-specific). Guarding the _build_iai_mcp_server_entry() calls at cli/_capture.py:404,422 against FileNotFoundError is a clear fix; will include that in the next commit.

3. _filelock LOCK_SH maps to exclusive on Windows — correct, this is a known divergence documented in _filelock.py. A faithful shared lock via LockFileEx (ctypes) would be the right fix for hippo/_db.py's multi-reader path and the doctor shared probe; flagged as a follow-up issue since it's a bigger change.

4. flock() moves file offset on Windows — good catch. The lseek before msvcrt.locking was to reset to byte 0 for the lock range; callers do seek explicitly, but aligning behavior with POSIX (leave offset untouched) is the right call. Will fix.

5. TCP loopback auth / named pipe: acknowledged trade-off. Windows named pipes (\\.\pipe\iai-mcp) with ACL-based access would be the proper equivalent. Adding it as a future hardening issue.

Thanks again — this kind of hands-on validation is exactly what's needed.

🤖 Addressed by Claude Code

danielhertz1999-bit and others added 11 commits June 20, 2026 02:31
_patch_claude_desktop_config called _build_iai_mcp_server_entry() without
the FileNotFoundError guard its Claude Code sibling already had, so on a
box with Claude Desktop installed but the wrapper not yet built, install
wrote the hooks + settings.json and THEN raised an uncaught
FileNotFoundError (non-zero exit). The README/handoff flow installs hooks
before building the wrapper, so this was easy to trip.

Factor the build-or-placeholder fallback into _iai_entry_or_placeholder()
and use it at all three sites. include_type preserves the format
difference: Claude Code's .claude.json carries "type": "stdio"; Claude
Desktop's config omits it.

Reported by @warplayer on a live Windows 11 run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
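A sketch of the shared fallback, with the builder passed in so it can be exercised without a real wrapper. The signature and the placeholder contents are assumptions; only the name `_iai_entry_or_placeholder()` and the `include_type` behaviour come from the commit.

```python
from typing import Callable


def iai_entry_or_placeholder(build_entry: Callable[[], dict], include_type: bool) -> dict:
    """Return the real MCP server entry, or a placeholder if the wrapper isn't built."""
    try:
        entry = dict(build_entry())
    except FileNotFoundError:
        # Hypothetical placeholder shape, rewritten once the wrapper has been built.
        entry = {"command": "node", "args": ["<mcp-wrapper not built yet>"]}
    if include_type:
        entry["type"] = "stdio"  # Claude Code's .claude.json carries this key
    else:
        entry.pop("type", None)  # Claude Desktop's config omits it
    return entry
```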
On Windows, text-mode file I/O defaults to the locale codepage (cp1252),
not UTF-8. Any memory content with non-ASCII characters — emoji, em-dashes,
smart quotes, accented or CJK text, math symbols — raised UnicodeEncodeError
on write and could corrupt on read. The capture path was the worst hit:
write_deferred_captures() serializes conversation turns as
json.dumps(..., ensure_ascii=False) (real UTF-8) into a handle opened
without an encoding, so a single emoji in a turn crashed capture.

Add encoding="utf-8" to every text-mode open()/Path.open()/read_text()/
write_text() across the runtime tree (74 sites, 19 files). Binary-mode
opens, socket/asyncio openers, and the already-correct capture_queue.py
(which encodes to UTF-8 bytes via os.write) are untouched. JSON config
writers were already safe (ensure_ascii=True escapes to ASCII) but are now
explicit on the read side too.

Repro (before): json.dumps(text, ensure_ascii=False) -> cp1252 handle
raises "'charmap' codec can't encode character 'Δ'". After: round-trips.

Surfaced by @warplayer on a live Windows 11 run (bench Δ UnicodeEncodeError).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
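A minimal repro and fix in the spirit of the commit: `Δ` has no cp1252 mapping, while an explicit `encoding="utf-8"` round-trips it. The helper names and the JSONL layout are hypothetical.

```python
import json
from pathlib import Path


def cp1252_can_encode(text: str) -> bool:
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def write_capture(path: Path, record: dict) -> None:
    # Explicit encoding: never rely on the locale default (cp1252 on many Windows boxes).
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_captures(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```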
Two defects that together broke memory capture/recall on Windows:

1. All three PowerShell hooks invoke `python -m iai_mcp <subcommand>`, but
   the package had no __main__.py (only iai_mcp.cli did). `python -m iai_mcp`
   failed with "No module named iai_mcp.__main__", so every hook silently
   no-opped (they fail-safe to exit 0). The macOS .sh hooks embed Python
   inline (`python3 -c ...`) and never exercised this path, so it was missed.
   Add src/iai_mcp/__main__.py delegating to iai_mcp.cli:main.

2. The session-recall hook reads this CLI's stdout, into which recalled
   memory is written with ensure_ascii=False. On Windows (and POSIX C
   locale) stdout defaults to a non-UTF-8 codepage, so any emoji/CJK/em-dash
   in recalled memory would raise UnicodeEncodeError and yield no context.
   Reconfigure sys.stdout/stderr to UTF-8 at the top of main().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
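A sketch of both fixes. The `__main__.py` contents are shown as comments because the real `iai_mcp.cli.main` signature is not given here; `force_utf8` is a hypothetical helper that skips streams without `reconfigure()` (for example, test-capture replacements for `sys.stdout`).

```python
# src/iai_mcp/__main__.py could be as small as:
#
#     from iai_mcp.cli import main
#
#     if __name__ == "__main__":
#         main()
#
# (how main's return value maps to an exit code is not shown in this PR)

import sys


def force_utf8(stream) -> None:
    """Switch a text stream to UTF-8 when it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8")


def configure_std_streams() -> None:
    # Call at the top of main(), before anything is printed.
    force_utf8(sys.stdout)
    force_utf8(sys.stderr)
```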
Two behavioral divergences from fcntl.flock, both surfaced by @warplayer:

- flock() moved the caller's file offset to 0 (the seek needed before
  msvcrt.locking). fcntl.flock leaves the offset untouched. Save and
  restore it around the lock op so callers that rely on the offset are
  unaffected.

- The blocking (non-LOCK_NB) path used msvcrt's LK_LOCK, which gives up
  after ~10 s and raises, whereas POSIX flock blocks until acquired. Under
  long contention — e.g. a blocking LOCK_EX in capture_queue/
  lifecycle_event_log while the consolidator holds the lock — this would
  spuriously fail. Poll LK_NBLCK instead to block until acquired.

Verified on Windows: offset stays put across LOCK_EX|NB/LOCK_UN; a blocking
LOCK_EX waits while another handle holds the lock and acquires immediately
on release (405 ms for a 400 ms hold), instead of failing at 10 s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
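The two behaviours can be sketched as small helpers that take the platform-specific lock call as a parameter, which keeps them testable off Windows. Both helper names are hypothetical, the 50 ms poll interval is an arbitrary choice, and the commented combination with `msvcrt.locking` is an illustration rather than the shim's actual code.

```python
import os
import time
from typing import Callable


def lock_preserving_offset(fd: int, lock_range: Callable[[int], None]) -> None:
    """Run a byte-range lock op at offset 0, then restore the caller's offset."""
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    os.lseek(fd, 0, os.SEEK_SET)
    try:
        lock_range(fd)
    finally:
        os.lseek(fd, saved, os.SEEK_SET)


def block_until_acquired(try_lock: Callable[[], bool], interval: float = 0.05) -> None:
    """Emulate POSIX blocking flock by polling a non-blocking attempt until it succeeds."""
    while not try_lock():
        time.sleep(interval)


# On Windows the pieces might be combined roughly as:
#     def try_lock():
#         try:
#             lock_preserving_offset(fd, lambda f: msvcrt.locking(f, msvcrt.LK_NBLCK, 1))
#             return True
#         except OSError:
#             return False
#     block_until_acquired(try_lock)
```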
os.rename / Path.rename raise FileExistsError (WinError 183) on Windows
when the destination exists, whereas POSIX rename atomically replaces it.
crypto key rotation hit this every time (_try_file_set renames the new key
over the existing one); the capture and provenance-queue spill/failed-move
paths had the same latent bug. Since the code already runs on macOS — where
rename replaces — switching to os.replace/Path.replace is behaviour-
preserving on POSIX and fixes Windows.

Sites: crypto._try_file_set, capture (processing-marker strip, failed/
permanent-failed/crash moves, claim rename), provenance_queue spill +
failed-drain.

Reported by @warplayer (key-rotation WinError 183) on a live Windows run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
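A sketch of the write-then-replace pattern these call sites rely on. `atomic_write_bytes` is a hypothetical helper; the pid-suffixed temp name echoes the "pid-collision-safe" `.tmp` rewrite mentioned later in this thread but is not its actual code.

```python
import os
from pathlib import Path


def atomic_write_bytes(dest: Path, data: bytes) -> None:
    """Write to a sibling temp file, then swap it into place in one step."""
    tmp = dest.with_name(f"{dest.name}.{os.getpid()}.tmp")  # pid suffix avoids cross-process clashes
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace overwrites an existing dest on POSIX and Windows; os.rename would
    # raise FileExistsError (WinError 183) on Windows here.
    os.replace(tmp, dest)
```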
msvcrt only offers exclusive byte-range locks, so the shim services LOCK_SH
as exclusive — a second concurrent reader blocks where POSIX would admit it.
Document why this is deliberately not fixed with Win32 LockFileEx: hippo/_db.py
relies on fcntl.flock's atomic lock conversion (EXCLUSIVE<->SHARED in place on
one fd), which LockFileEx cannot do without an unlock/relock race. Flagged by
@warplayer as a known divergence; capturing the analysis for a future faithful
port.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…trings

Three categories of POSIX assumptions that blocked Windows test collection
and caused ~98 failures in @warplayer's live run:

1. /tmp sock_dir paths (17 files): Replace hardcoded Path(f"/tmp/iai-...")
   with tmp_path / "sock" so pytest manages the temp directory on all
   platforms. test_daemon_crash_loop_immunity.py: use tmp_path instead of
   a hand-rolled /tmp path with time.time(). test_doctor_multi_binder.py
   already has pytestmark.skipif(Windows) so its /tmp strings are correct.

2. Dummy "cwd" values (6 files): Replace "/tmp/test" / "/tmp/latency-test"
   literal strings passed as capture metadata with
   str(Path(tempfile.gettempdir()) / "test"). The path doesn't need to
   exist — it's stored as metadata in JSONL headers.

3. Mode assertions (13 files): assert mode == 0o600 always fails on
   Windows because os.chmod() is a no-op there (ACLs via icacls govern
   access instead). Guard each assertion with
   `if sys.platform != "win32":` so Windows skips the check without
   masking the underlying security intent on POSIX.

82 existing POSIX tests still pass with no change in behaviour.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mf3VFyVtczcK2WxKKCyBS4
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mf3VFyVtczcK2WxKKCyBS4
# Conflicts:
#	src/iai_mcp/claude_cli.py
TCP loopback (127.0.0.1:<port>) is reachable by any local process, unlike
Unix-domain sockets whose access is bounded by filesystem permissions. This
addresses the security concern raised in CodeAbra#12
by @warplayer.

How it works:
- start_ipc_server() generates a 32-byte random hex token (secrets.token_hex)
  on daemon startup, writes it to ~/.iai-mcp/.daemon.token (ACL-restricted to
  the current user via icacls, equivalent to chmod 0o600), and wraps the
  connection handler to require the token as the first line before processing
  any requests. Connections that send the wrong token are closed immediately.
- open_ipc_connection() reads the token and sends it as the first line after
  connecting (Windows only).
- make_sync_ipc_socket() callers now call send_sync_auth_token(sock) after
  connect() — updated in cli/__init__.py and direct_write.py.
- shutdown_ipc() removes both the port file and the token file on shutdown.
- All of this is Windows-only; POSIX paths are structurally unchanged.

Also classifies bench/capture_dedup_lock.py in the authoritative bench-script
list in test_bench_worktree_resolution.py (it imports iai_mcp so it belongs
in BENCH_SCRIPTS_NEEDING_SHIM).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mf3VFyVtczcK2WxKKCyBS4
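A minimal sketch of the handshake described above using asyncio streams. The helper names, the 5 s first-line timeout, and writing the token as plain text are assumptions; the real code also applies the icacls ACL to the token file, which is omitted here.

```python
import asyncio
import secrets
from pathlib import Path


def write_token(token_file: Path) -> str:
    """Create a fresh per-start token: secrets.token_hex(32) is 32 random bytes as 64 hex chars."""
    token = secrets.token_hex(32)
    token_file.write_text(token, encoding="utf-8")  # ACL lockdown (icacls) omitted here
    return token


def require_token(handler, token: str, timeout: float = 5.0):
    """Wrap a stream handler so the first line must be the token."""
    expected = token.encode("ascii")

    async def guarded(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        try:
            first = await asyncio.wait_for(reader.readline(), timeout)
        except asyncio.TimeoutError:
            first = b""
        # Constant-time comparison so the check does not leak how many bytes matched.
        if not secrets.compare_digest(first.rstrip(b"\r\n"), expected):
            writer.close()  # wrong or missing token: drop before any request is read
            return
        await handler(reader, writer)

    return guarded


async def send_token(writer: asyncio.StreamWriter, token_file: Path) -> None:
    token = token_file.read_text(encoding="utf-8").strip()
    writer.write(token.encode("ascii") + b"\n")
    await writer.drain()
```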
danielhertz1999-bit and others added 2 commits June 21, 2026 03:18
Complete the Windows port: daemon startup, scheduled task, state writes, MCP wrapper IPC
# Conflicts:
#	src/iai_mcp/capture.py
@danielhertz1999-bit
Author

Status update — conflicts resolved + live Windows E2E validation

Rebased onto upstream v1.1.5 and pushed; the PR is mergeable again. The write_deferred_captures conflict was resolved by keeping your pid-collision-safe atomic .tmp → os.replace rewrite and re-applying encoding="utf-8" on both file handles (Windows defaults to cp1252, which otherwise corrupts non-ASCII memory content).

What landed since the last review (addresses @warplayer's findings)

  • #1 test collection: the 5 modules that imported fcntl/resource at top level now use the _filelock shim / guard, so collection no longer aborts.
  • #2 _patch_claude_desktop_config crash: now shares the wrapper-missing fallback its Claude Code sibling had; capture-hooks install no longer raises after writing hooks.
  • UTF-8 everywhere: encoding="utf-8" on all text I/O across the runtime tree (fixes the cp1252 UnicodeEncodeError on Δ/emoji/CJK). Verified: cp1252 throws on Δ, utf-8 round-trips.
  • iai_mcp.__main__: the package had no __main__.py, so the PowerShell hooks' python -m iai_mcp … failed with "No module named iai_mcp.__main__" and every hook silently no-opped. Added it; stdout/stderr are also forced to UTF-8 (the recall hook's output path).
  • #4 _filelock: preserve the file offset and emulate POSIX block-until-acquired (msvcrt's LK_LOCK gives up after ~10 s).
  • os.replace for atomic moves: fixes the crypto key-rotation WinError 183.
  • #3 LOCK_SH: documented as a known divergence (LockFileEx would break the atomic lock conversion hippo/_db.py relies on).

Live E2E on Windows 11 (Python 3.12.10, Rust 1.96 MSVC, VS Build Tools)

  • pip install -e . builds the Rust native extension + hnswlib clean; iai_mcp_native, hnswlib, and iai_mcp all import.
  • mcp-wrapper builds (tsc → dist/index.js).
  • Core test suite: 627 passed / 68 failed across 108 modules (~90%). The failures are almost entirely the POSIX test-harness assumptions you'd expect — fake daemons on asyncio.start_unix_server/AF_UNIX, launchd re-enable, signal.SIGKILL, 0o600/geteuid crypto-mode asserts, /tmp paths — not product-logic regressions.
  • A cluster of daemon/socket integration tests hang under the Windows asyncio proactor loop (same "60s socket-bind" stall you hit); these need the test fixtures ported to the TCP-loopback transport.

Happy to split the test-harness port into its own PR if that's easier to review.

🤖 Validated with Claude Code on a real Windows 11 build

danielhertz1999-bit and others added 2 commits June 22, 2026 01:19
On Windows the multiprocessing-spawn RGC worker launches under the venv's
base interpreter (sys._base_executable), re-imports iai_mcp.daemon, and
hangs past the watchdog timeout, taking the parent daemon down with it.
ctx.set_executable() does not override base-interpreter selection.

Replace the spawned subprocess with an in-process daemon thread
(_ThreadWorkerHandle) on Windows, mirroring the Process API the rebuild
path uses. POSIX keeps the spawned subprocess unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v1.1.5

# Conflicts:
#	src/iai_mcp/capture.py
danielhertz1999-bit and others added 17 commits June 22, 2026 01:38
On Windows the multiprocessing-spawn RGC worker launches under the venv's
base interpreter (sys._base_executable), re-imports iai_mcp.daemon, and
hangs past the watchdog timeout, taking the parent daemon down with it.
ctx.set_executable() does not override base-interpreter selection.

Replace the spawned subprocess with an in-process daemon thread
(_ThreadWorkerHandle) on Windows, mirroring the Process API the rebuild
path uses. POSIX keeps the spawned subprocess unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Windows port completion: RGC worker hang fix + v1.1.5 merge
…-hang

Fix Windows graph-rebuild worker hang (run rebuild in-process thread, not spawn)
On Windows the daemon's TCP port was always persisted to the shared
~/.iai-mcp/.daemon.port, ignoring IAI_DAEMON_SOCKET_PATH — the env var the
POSIX path already honors (see ipc_address). Two consequences:

1. Product: a daemon bound to a non-default endpoint (custom IAI_MCP_STORE)
   clobbered the global port file, so clients for different stores collided.
2. Tests: the daemon/socket suite isolates via IAI_DAEMON_SOCKET_PATH on
   POSIX, but on Windows every test raced the one global port file — a major
   reason those tests hang/cross-talk under the Windows asyncio loop.

Resolve the port-file path dynamically (`_port_file_path()`): when
IAI_DAEMON_SOCKET_PATH is set, persist the port alongside it
(`<socket-path>.port`); otherwise the default. Computed per-call, not as a
module constant, because tests set the env var after import.

Verified on Windows: two daemons on different env paths get distinct port
files (no collision) and a client reaches the intended one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
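A sketch of the per-call resolution; `port_file_path` stands in for `_port_file_path()`, and the default location comes from the earlier IPC commit.

```python
import os
from pathlib import Path

DEFAULT_PORT_FILE = Path.home() / ".iai-mcp" / ".daemon.port"


def port_file_path() -> Path:
    # Read the env var on every call: tests set it after this module is imported.
    sock = os.environ.get("IAI_DAEMON_SOCKET_PATH")
    if sock:
        return Path(sock + ".port")
    return DEFAULT_PORT_FILE
```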
…solation

Windows: per-endpoint port-file isolation (_ipc honors IAI_DAEMON_SOCKET_PATH)
test_socket_server_dispatch.py hard-coded asyncio.open_unix_connection and
waited on a unix socket file, so it hung/failed on Windows. Port it to the
platform-agnostic _ipc layer, relying on the per-endpoint port-file isolation
just added (IAI_DAEMON_SOCKET_PATH):

- fixture sets IAI_DAEMON_SOCKET_PATH so server + client share an isolated
  endpoint (unix socket on POSIX, TCP loopback + "<path>.port" on Windows)
- client helpers use open_ipc_connection() instead of open_unix_connection()
- server driven via serve() (resolves the endpoint from the env var)
- bind-wait checks _endpoint_ready_path() (socket file on POSIX, port file on
  Windows) instead of the unix socket path
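The bind-wait step above can be sketched as a poll on whichever file signals readiness on the current platform. This assumes the same `<path>.port` convention as the port-file isolation change; the helper names and the timeout and interval values are illustrative, not the test suite's actual ones.

```python
import sys
import time
from pathlib import Path


def endpoint_ready_path(socket_path: str) -> Path:
    """Sketch of _endpoint_ready_path(): the file whose existence means 'server is bound'."""
    if sys.platform == "win32":
        # Windows: TCP loopback, so readiness is signalled by the sidecar port file.
        return Path(socket_path + ".port")
    # POSIX: the AF_UNIX socket file itself appears once the server binds.
    return Path(socket_path)


def wait_for_endpoint(socket_path: str, timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll until the ready path exists; return False on timeout instead of hanging."""
    deadline = time.monotonic() + timeout
    ready = endpoint_ready_path(socket_path)
    while time.monotonic() < deadline:
        if ready.exists():
            return True
        time.sleep(interval)
    return False
```

Returning `False` on timeout lets the test fail with an assertion rather than block indefinitely, which was the failure mode the AF_UNIX-only wait produced on Windows.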

Verified: 11 passed in ~8s on a real Windows 11 build (previously hung).
POSIX behavior is unchanged. This serves as the template for the remaining
AF_UNIX test files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same cross-platform port as the dispatch suite, plus replacing a hard-coded
/tmp/iai-srvact-* socket directory (which doesn't exist on Windows) with
pytest's tmp_path. Uses IAI_DAEMON_SOCKET_PATH isolation, open_ipc_connection(),
and _endpoint_ready_path().

Verified: 2 passed in ~1.4s on Windows 11 (previously hung). POSIX unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The "(b) socket file fresh" doctor check gated on the AF_UNIX socket *file*
existing, which never happens on Windows (TCP loopback + sidecar port file),
so it always reported FAIL even with a live daemon. Check the per-platform
endpoint (port file on Windows) instead; the connect probe was already
cross-platform. Surfaced by porting test_doctor_checklist, which now uses the
shared fake-daemon-socket helper and skips the AF_UNIX regular-file case on
Windows. 13 passed / 1 skipped.
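A sketch of a cross-platform version of that doctor check: on Windows the endpoint counts as present if the port file exists, and the connect probe reads the port from it and attempts a loopback connection. The function name, the PASS/FAIL strings, the plain-text port-file format, and the 1-second timeout are all assumptions; the real doctor check may differ.

```python
import socket
from pathlib import Path


def check_daemon_endpoint(port_file: Path, timeout: float = 1.0) -> str:
    """Hypothetical doctor check: port file present AND something accepts on 127.0.0.1:port."""
    if not port_file.exists():
        return "FAIL: no port file"
    try:
        port = int(port_file.read_text(encoding="utf-8").strip())
    except (OSError, ValueError):
        return "FAIL: unreadable port file"
    if not 0 < port < 65536:
        return "FAIL: port out of range"
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return "PASS"
    except OSError:  # refused, reset, or timed out (socket timeouts subclass OSError)
        return "FAIL: connect probe"
```

Checking the file first and the connection second keeps the two failure modes distinct: a missing port file means the daemon never bound, while a failed probe means it bound and then died.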

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tests: port socket suite to cross-platform _ipc transport (batch 1 + pattern)
…rapper-integration module

test_socket_fail_loud spawns a real `python -m iai_mcp.daemon` and kills it
mid-call. The daemon already binds on Windows (TCP + port file), so this ports
the test's raw AF_UNIX clients to cross-platform connect helpers
(daemon_endpoint / new_daemon_client_socket), swaps signal.SIGKILL for
proc.kill(), waits on the port file, and accepts TimeoutError as a valid
post-kill connect failure. 2 pass.
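`signal.SIGKILL` is not defined on Windows, whereas `Popen.kill()` works on both platforms (it sends SIGKILL on POSIX and is an alias for `terminate()` on Windows). A sketch of the two pieces the port relies on, with a throwaway child process standing in for the daemon; the helper name `connect_fails` and the exact set of accepted exceptions are assumptions.

```python
import socket
import subprocess
import sys


def connect_fails(port: int, timeout: float = 2.0) -> bool:
    """True if connecting to the killed daemon's port fails in any of the expected ways."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return False
    except (ConnectionError, TimeoutError, socket.timeout):
        # ConnectionError covers refused/reset/aborted. TimeoutError is accepted too,
        # as in the ported test (socket.timeout is listed for Python < 3.10).
        return True


# Portable kill: no signal.SIGKILL needed.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.kill()
child.wait(timeout=10)
```

The return code differs by platform (a negative signal number on POSIX, a plain exit code on Windows), so a portable test should only assert that it is non-zero.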

test_socket_disconnect_reconnect is skipped on Windows: it builds the Node
mcp-wrapper via npm and bridges it to an AF_UNIX fake daemon. That is a
POSIX-stack integration that needs a TCP port of the Node wrapper (a separate
effort).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CodeAbra

Owner

Thanks again for this! Looks like #18 is your newer rework of the same Windows port — I'm tracking both. Same note as there: I'm following the work and will merge once the port is complete and validated on a real Windows box. No rush — really appreciate you driving Windows support.

danielhertz1999-bit and others added 5 commits June 22, 2026 17:42
…mmunity

Production fix: _pid_is_alive used os.kill(pid, 0), but on Windows signal 0 is
CTRL_C_EVENT (it would try to signal the process group), not a liveness probe,
so the deferred-capture drain's stale-PID crash recovery never reclaimed
abandoned .processing-<pid> files. Use psutil.pid_exists (psutil is a hard
dependency), falling back to the POSIX signal-0 probe. (A parallel session fixed
the same os.kill(pid, 0) issue in lifecycle_lock; this is the capture-path twin.)
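A sketch of the liveness check described above, preferring `psutil.pid_exists` and keeping the signal-0 probe only as a POSIX fallback. The name `pid_is_alive` is a hypothetical stand-in for `_pid_is_alive`. The optional import (so the sketch runs without psutil), the Windows `RuntimeError`, and the `pid <= 0` guard are added assumptions; on POSIX, `os.kill(0, 0)` would address the caller's own process group rather than one process.

```python
import os

try:
    import psutil  # a hard dependency in the project; optional here so the sketch still runs
except ImportError:
    psutil = None


def pid_is_alive(pid: int) -> bool:
    """Liveness probe that never sends a signal on Windows."""
    if pid <= 0:
        return False  # 0 and negatives address process groups, not a single process
    if psutil is not None:
        return psutil.pid_exists(pid)
    if os.name == "nt":
        # On Windows, os.kill(pid, 0) would send CTRL_C_EVENT, so refuse to guess.
        raise RuntimeError("psutil is required for PID liveness checks on Windows")
    try:
        os.kill(pid, 0)  # POSIX: signal 0 only checks existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Treating `PermissionError` as alive matters for crash recovery: reclaiming a `.processing-<pid>` file that a live process owned by another user still holds would be worse than leaving a stale one.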

Test port (test_daemon_crash_loop_immunity, now 7/7 on Windows):
- the in-process socket-binds test reads the TCP port file + connects via the
  cross-platform daemon_endpoint helpers (was hanging on AF_UNIX);
- fixtures set USERPROFILE alongside HOME — Path.home() reads USERPROFILE on
  Windows, so the drain was scanning the real ~/.iai-mcp, not the temp dir
  (every drain test silently found 0 files);
- the rename-failure mocks patch Path.replace (what the code calls now), not
  Path.rename.
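The USERPROFILE point is easy to miss: on Windows (Python 3.8+), `Path.home()` goes through `os.path.expanduser`, which reads `USERPROFILE` and ignores `HOME`. A minimal sketch of a fixture-style helper that redirects both; the helper name `redirected_home` is hypothetical, and the sketch assumes the drain resolves its directory from `Path.home()`.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from unittest import mock


@contextmanager
def redirected_home(tmp: Path):
    """Point Path.home() at tmp on both POSIX (HOME) and Windows (USERPROFILE)."""
    # patch.dict restores the original environment when the block exits.
    with mock.patch.dict(os.environ, {"HOME": str(tmp), "USERPROFILE": str(tmp)}):
        yield tmp
```

Inside pytest, calling `monkeypatch.setenv` for both variables achieves the same result with automatic teardown.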

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ts-port

tests: finish the daemon/socket port (subprocess cluster) + capture _pid_is_alive Windows fix
…s the handshake

Integrates the auth-token handshake (PR #2) with the ported daemon/socket
test suite now in main:

- _token_file_path() mirrors _port_file_path() so the Windows token lives at
  <IAI_DAEMON_SOCKET_PATH>.token, not in a single shared ~/.iai-mcp/.daemon.token.
  This is required for test isolation and custom-store daemons; otherwise every
  daemon and test clobbers one global token.
- bind_fake_daemon_socket() now writes a token file so the production client's
  mandatory handshake finds one; add send_daemon_token() for raw client sockets.
- test_socket_fail_loud sends the token on its raw active connection.
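A sketch of a handshake of the kind the PR summary describes (32-byte token, `compare_digest` validation). It assumes the client writes the raw 32 bytes as the first thing on the connection and that the token file stores them hex-encoded; the wire format, file encoding, and helper names are all hypothetical, and the `icacls` ACL step is omitted. `socket.socketpair()` stands in for the loopback TCP connection.

```python
import hmac
import secrets
import socket
from pathlib import Path

TOKEN_LEN = 32


def write_token_file(path: Path) -> bytes:
    """Create a fresh per-endpoint token (e.g. '<socket-path>.token') and return it."""
    token = secrets.token_bytes(TOKEN_LEN)
    path.write_text(token.hex(), encoding="utf-8")
    return token


def read_token_file(path: Path) -> bytes:
    return bytes.fromhex(path.read_text(encoding="utf-8").strip())


def client_send_token(sock: socket.socket, token: bytes) -> None:
    sock.sendall(token)


def server_accepts(sock: socket.socket, expected: bytes) -> bool:
    """Read exactly TOKEN_LEN bytes, then compare in constant time."""
    received = b""
    while len(received) < TOKEN_LEN:
        chunk = sock.recv(TOKEN_LEN - len(received))
        if not chunk:  # peer closed before sending a full token
            return False
        received += chunk
    return hmac.compare_digest(received, expected)
```

The constant-time comparison keeps another local process from recovering the token byte by byte through response timing, which is the point of the handshake on a loopback port any local process can reach.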

Verified on Windows with auth ON: 121 passed / 8 skipped across the 12 ported
daemon/socket files (production-server, raw-fake-server, and real-daemon-spawn).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add auth-token handshake to Windows TCP loopback IPC
@danielhertz1999-bit danielhertz1999-bit mentioned this pull request Jun 25, 2026
17 tasks

4 participants