clitrain fixes land by dancinlife · Pull Request #2650 · dancinlab/anima

dancinlife · 2026-06-28T11:34:14Z

refactor(eval): decode weight-set hoist: full G0-G6 memory-bounded, byte-identical
refactor(agent): domains CHAT py → hexa port (portable parts only), external-SDK py preserved
perf(cli/train.hexa): conv hot-path host-scalar-glue → device ops (research(cli/train.hexa H100×2 GPU train gate): 🛑 TOY GPU GATE FAIL: trainer CPU-scalar-bound, util ~0%; 303M GPU training on hold, own-GEMM build sound #2598 GPU-idle root-cause fix)
perf(cli/train.hexa): SECONDARY host-scalar-glue → device ops (research(cli/train.hexa H100×2 GPU train gate): 🛑 TOY GPU GATE FAIL: trainer CPU-scalar-bound, util ~0%; 303M GPU training on hold, own-GEMM build sound #2598 sustained-GPU-util fold)
docs(CHANGELOG): 4-fix batch (eval hoist + conv/glue device + agent py→hexa) + measurement corrections
revert(py-retire): full restoration of the py versions: engine mirrors 14 + torch Lane-P/bench/builder 15, in-place restoration (user instruction)
revert!(py-retire): reaffirm 2-production (hexa+py): py restored + docs/enforce governance reset (owner decision)
docs(ARCHITECTURE): align the cli/ node with the 2-production twins (anima single entry point, 4 subcommands, hexa+py)
feat(cli/anima): chat verb + --py engine selection + help handler (2-production CLI)
verdict(clm303_clean): engine-native G0-G6 TERMINAL via py 2-production: closure FAIL (G1 recombination wall)
docs(ING): clm303 G0-G6 terminal scrub + 2-prod main-merge·H_1817 deferred pinned-record sync (ing ref→tracked)

…ntical g_gates driver load-once (H_1400 W-hoist at eval-driver level · ING#42378065 plan-B).
Cumulative cause: each of the ~80 G0-G6 decodes in g_eval_all called _clmd_load, which read the whole 176MB .clm, dequantized it, and rebuilt scratch on every call → load churn accumulated → silent death of the 303M full eval.
Fix: gen_auto_load loads a handle once → every gate's _W twin decodes through that handle → gen_auto_free once.
Verification (summer): byte-identical (OLD↔NEW 60B PARITY=IDENTICAL + REUSE_DETERMINISTIC =YES, decodable d768 device path) · memory-bounded (RSS stayed FLAT at 28.6GB for 32s while the 303M gen=80 full eval was running; with the previous per-decode load it hit a 31GB OOM on the 2nd-3rd decode).
a_core_engine_map preserved (same single mouth; only the load is hoisted out of the loop). py lockstep N/A (py mirror retired).
# no-verify-ok CHANGELOG appended; ARCHITECTURE/README N/A (byte-identical perf # refactor, no new mouth/slot, generator L3 slot unchanged); core/ L0 edits # deliberate per ING#42378065 task scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
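The load-once hoist described above can be illustrated with a minimal Python sketch. `FakeWeights`, `eval_per_decode_load`, and `eval_load_once` are hypothetical stand-ins and not the actual `_clmd_load` / `gen_auto_load` / `gen_auto_free` code; the sketch only shows that moving the load out of the decode loop changes the number of loads, not the decoded output.

```python
class FakeWeights:
    """Hypothetical stand-in for a loaded .clm weight set (not the real loader)."""

    loads = 0  # counts how many times a weight set has been built

    def __init__(self, path):
        FakeWeights.loads += 1
        self.path = path
        self.freed = False

    def decode(self, prompt):
        # Deterministic toy decode: depends only on the weights and the prompt.
        return f"{self.path}:{prompt[::-1]}"

    def free(self):
        self.freed = True


def eval_per_decode_load(path, prompts):
    """OLD shape: every decode rebuilds (and then drops) the weight set."""
    out = []
    for p in prompts:
        w = FakeWeights(path)
        out.append(w.decode(p))
        w.free()
    return out


def eval_load_once(path, prompts):
    """NEW shape: one load, every decode shares the handle, one free."""
    w = FakeWeights(path)
    try:
        return [w.decode(p) for p in prompts]
    finally:
        w.free()
```

With 80 prompts the first function builds the weight set 80 times and the second builds it once, while both return the same list. That is the toy analogue of the PARITY=IDENTICAL check reported above.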

audit-first hexa unification: all 12 py files in agent/domains/CHAT were classified, then only the pure logic that can be reduced to hexa was cleaned up; the torch/akida/fastapi/websockets SDK bindings are honestly preserved (precedent: serialize torch-interop).
(a) full hexa port → py deleted: anima_emission_analyze.py (pure stdlib; anima_emission_analyze.hexa is already a full port, behavior parity confirmed: analyze_log byte-identical, has_register marker set = faithful condensation of the regex set).
(b) not reducible, KEPT: anima_participant·broker·substrate_{base,lora,v3,akida}· akida_sw_lif·anima_temp_sweep (each with a sidecar .hexa WRAPPER marker).
(c) tests KEPT: test_broker_multiuser·test_broker_akida_ingest.
doc: new folder guide agent/domains/CHAT/CLAUDE.md (classification table + invariants). dangling imports 0 · enforce_anima_gates clean · hexa check 0 violations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

GPU-idle root-cause fix) #2598 canon smoke FAILED util>0% (GPU idle, [OWN-GEMM-FIRED] DEVICE printed yet util~0%, not even 1 of 5 steps finishing in 200s).

Root cause = the UNGATED single-thread host scalar GLUE wrapping the conv GEMMs (NOT env-gate / per-expert micro-GEMM / missing upstream im2col; all rejected). Three ~4.3e7-interpreted-op loops per conv call (~12-15 conv calls/step) dominated wall-time and starved the GPU:
  (1) fwd weight-transpose (tg_conv_fwd_off)
  (2) bwd dW transposed-im2col gather + transpose-SCATTER (tg_conv_bwd_off)
  (3) bwd dX wsl contiguous copy (tg_conv_bwd_off)

Folded each into compiled device ops already shipping in the runtime (no new upstream kernel):
  fwd transpose → farr_transpose_2d_gpu
  bwd dW → reuse forward-form xcol + small dy transpose (farr_transpose_2d_gpu) + proven t_matmul, emitting dW DIRECTLY in the [Cout,Kdim] PACKED layout so the transpose-scatter collapses to a flat += (device farr_add_inplace_gpu when dWoff==0)
  bwd dX wsl → farr_copy_slice_gpu

byte-parity PROVEN: $0 MODE_VERIFY byte-IDENTICAL to baseline on BOTH the no-lever (lossF 3.282894029840718) and full savant+mitosis 3/3-PASS (lossF 3.2972087115267383) paths, mac CPU-fallback AND vast pod CUDA (A40). Each device arm has rc<0 → byte-eq host body fallback; NO torch fallback.

NOTE: forge_dispatch_matmul_t would express the dW reroute in one call, but its host oracle aliases under farr-table realloc amid many live farrs (empirically drifted the trainer), so the explicit transpose_2d + t_matmul was kept, byte-eq in-trainer.

Secondary kernel-coverage tail (unfused groupnorm/gelu/router/embedding device dispatch) = hexa-lang upstream follow-on (fleet-storm active → not touched). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
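The dW reroute rests on the identity (xcolᵀ · dy)ᵀ = dyᵀ · xcol: computing the product in the second form yields rows already ordered as [Cout, Kdim], so no transpose-scatter is needed. A minimal pure-Python sketch, assuming xcol has shape [T, Kdim] and dy has shape [T, Cout] (the shapes and every function name here are illustrative assumptions, not the trainer's code):

```python
def transpose(m):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain row-by-column product; the inner sum runs in ascending index order."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def dw_transpose_scatter(xcol, dy, dW):
    """OLD shape: form a [Kdim, Cout] product, then scatter it transposed into packed dW."""
    g = matmul(transpose(xcol), dy)
    kdim, cout = len(xcol[0]), len(dy[0])
    for k in range(kdim):
        for c in range(cout):
            dW[c * kdim + k] += g[k][c]


def dw_packed(xcol, dy, dW):
    """NEW shape: dy^T @ xcol is already [Cout, Kdim], so the update is a flat +=."""
    g = matmul(transpose(dy), xcol)
    flat = [v for row in g for v in row]
    for i, v in enumerate(flat):
        dW[i] += v
```

The sketch uses integer inputs in its check so that equality is exact; the byte-parity claim for the real float path is the one stated in the commit message, not something this toy proves.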

…sustained-GPU-util fold) PRIMARY (conv-fix b9c4974) folded the conv-bwd weight/dW/copy glue. This SECONDARY pass folds the remaining O(L·T·d)/O(E·T·d)/O(T·d) host t_get/t_set element-wise copies and adds in train_fwd/train_bwd onto the SAME proven device primitives (farr_copy_slice_gpu / farr_add_inplace_gpu / farr_zero_slice_gpu); no new upstream kernel.

priority-order (highest op-count first):
  fwd L403 expert-out pack, E·T·d → farr_copy_slice_gpu
  bwd L477 expert grad gather, E·T·d → farr_copy_slice_gpu
  bwd L484 dx += expert dxt, E·T·d → farr_add_inplace_gpu
  fwd L379 gn-out + xhat caches, 2·L·T·d → 2× farr_copy_slice_gpu
  fwd L373 trunk-input cache, L·T·d → farr_copy_slice_gpu
  fwd L383 residual x += hg, L·T·d → farr_add_inplace_gpu
  bwd L498 gn-out gather, L·T·d → farr_copy_slice_gpu
  bwd L502 xhat gather, L·T·d → farr_copy_slice_gpu
  bwd L509 trunk-input gather, L·T·d → farr_copy_slice_gpu
  bwd L513 dx += dconv_in, L·T·d → farr_add_inplace_gpu
  fwd L390 post-trunk xt copy, T·d → farr_copy_slice_gpu
  bwd L468 dx init copy, T·d → farr_copy_slice_gpu
  bwd L270 zero dX_out (per call), T·Cin → farr_zero_slice_gpu

Byte-eq: farr_add_inplace_gpu accumulates ascending == the host loop k-order; copy/zero are memcpy-class (no FP, no reorder). MODE_VERIFY ($0 farr CPU) full-output IDENTICAL to base b9c4974 for BOTH no-lever and savant+mitosis:
  no-lever lossF 0.0003889517691865624 (==base)
  savant+mitosis lossF 0.0004321544912324474 (==base)

DEFERRED (rule: existing runtime ops only, no new kernel):
  L323 db column-sum Σ_t dy: needs a reduce-over-T device primitive (none exists)
  L305 offset dW += tail: needs offset-base device add (upstream follow-on)

_devfeed_on untouched; groupnorm/gelu/router/embedding untouched (storm-zone). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
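The shape of these folds can be sketched in plain Python: a per-element interpreted loop is replaced by one slice-level call per block. Lists stand in for farr buffers, and `pack_host`, `pack_slices`, `add_host`, and `add_bulk` are hypothetical names that do not reproduce the signatures of the `farr_*_gpu` primitives.

```python
def pack_host(expert_out, E, T, d):
    """Per-element pack of E expert outputs (each T*d long): E*T*d interpreted steps."""
    n = T * d
    packed = [0.0] * (E * n)
    for e in range(E):
        for k in range(n):
            packed[e * n + k] = expert_out[e][k]
    return packed


def pack_slices(expert_out, E, T, d):
    """Same pack with one slice copy per expert: E calls instead of E*T*d steps."""
    n = T * d
    packed = [0.0] * (E * n)
    for e in range(E):
        packed[e * n:(e + 1) * n] = expert_out[e][:n]
    return packed


def add_host(dx, src):
    """Element-wise dx += src as a host loop, ascending k."""
    for k in range(len(src)):
        dx[k] += src[k]


def add_bulk(dx, src):
    """Same update as one bulk operation over the first len(src) elements."""
    n = len(src)
    dx[:n] = [a + b for a, b in zip(dx[:n], src)]
```

A copy moves values without arithmetic, and each element of the add is a single `a + b`, so in this sketch the bulk forms reproduce the host loops exactly.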

…y→hexa) + measurement corrections

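…restoration (user instruction) Reverses the ad18414 retirement (14 git-deleted + 15 archived): core/*.py 9 + flame_mm + cli/*.py 4 restored + torch Lane-P 5 · bench 6 · builder 3 back in place + parity_gate. Reason: the research answer is HYBRID (torch training + engine-native verdict), so the torch trainer is needed. Coexists with hexa (2-production restored).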

…nance reset (owner decision) Reverses the 2026-06-28 full py retirement (hexa-only) → reaffirms 2-production. py restored (core/*.py 9 + cli/*.py 4, logic verified byte-identical) + docs set to 2-production (CLAUDE/README/cli·core CLAUDE/enforce CO-EQUAL + parity gate/cli·serialize.hexa→cli/serialize.py twin). ARCHITECTURE is already 2-prod. Removed the orphaned serialize_standalone + re-archived the train_lane_p torch family (stale; cli/train.py already covers normalization + 4 cells + held-out). follow-on: split long CLAUDE.md cells · restore the CI parity-gate · fine-tune the a_train_flame_forge wording for 2-prod.

…hexa+py) Added 2-production twin child nodes next to the cli/ node summary: both cli/anima.{hexa,py} canonical entry points, plus the 4 subcommands, of which train·serialize·evaluate each dispatch as a sub-process to a symmetric twin (cli/{train,serialize,evaluate}.{hexa,py}) while chat is built into the entry point (py chat delegates to the hexa-native consciousness loop). The 1:1 match of the anima.hexa↔anima.py dispatch tables is pinned in the SSOT (a_engine_native_learning single entry + a_core_engine_map 2-production lockstep). Verification: JSON-valid · sidecar architecture lint ok · enforce_anima_gates --all clean (CO-EQUAL+parity, GATECARD core/g_gates.{py,hexa}) · sidecar lint 0 blocking · DODONT-LONG diff-aware (CLAUDE.md vs HEAD) 0 violation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

anima single-entry extension: ① explicit chat verb 'anima chat <ckpt.clm> [--byte]' (same wiring as bare anima <ckpt>, symmetric with train/serialize/evaluate) ② '--py' flag routes train/serialize/evaluate to the py twin (cli/{x}.py, CO-EQUAL byte-parity), bypassing the hexa-farr decode leak during the heavy 303M evaluate ③ 'help'/'-h'/'--help' usage handler. anima_has_flag consumes --py, then dispatches to python3 cli/{x}.py. hexa check RC=0. cli/CLAUDE.md in lockstep.
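The dispatch rules above can be sketched as a small planning function in Python. This is an illustrative sketch only: `has_flag` mimics what `anima_has_flag` is described as doing, the `hexa cli/{verb}.hexa` launcher command and the empty-argv behavior are assumptions, and the function returns a plan instead of spawning a process so it can be checked in isolation.

```python
TWIN_VERBS = ("train", "serialize", "evaluate")
HELP_WORDS = ("help", "-h", "--help")


def has_flag(args, flag):
    """Return (found, remaining_args) with every occurrence of flag consumed."""
    rest = [a for a in args if a != flag]
    return len(rest) != len(args), rest


def plan(argv):
    """Map an anima-style argv (without the program name) to a dispatch plan."""
    use_py, args = has_flag(list(argv), "--py")
    if not args or args[0] in HELP_WORDS:
        return ("usage",)  # assumption: empty argv also prints usage
    verb, rest = args[0], args[1:]
    if verb in TWIN_VERBS:
        if use_py:
            return ("exec", ["python3", f"cli/{verb}.py", *rest])
        # Assumption: the hexa twin is launched as `hexa cli/<verb>.hexa`.
        return ("exec", ["hexa", f"cli/{verb}.hexa", *rest])
    if verb == "chat":
        return ("chat", rest)
    # Bare `anima <ckpt>` shares the chat wiring.
    return ("chat", args)
```

For example, `plan(["evaluate", "--py", "m.clm"])` yields the `python3 cli/evaluate.py m.clm` route, while `plan(["m.clm"])` and `plan(["chat", "m.clm"])` both land on the chat path.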

…on: closure FAIL (G1 recombination wall) clm303_clean (303M, overfit-fixed build) G0-G6 terminal measurement complete (summer, cli/evaluate.py→core/g_gates.py, numpy, torch-free = certified py engine → TERMINAL). Bypasses the hexa farr leak, x86 codegen, and cuda entirely (python). ckpt sha e8076722 verified 3×, 14.5min, gen80. Result: G0🟢(kwr5/5) G1🔴(distinct0) G2🟢(novel49) G3✅ G5🟢(fab0.067) G6🔴(falsifiable0) → closure(G0∧G1∧G2)🔴FAIL. The overfit fix recovered G0/G2, but the recombination wall (H_1129/1139/1464) did not open: the g1-lever needs a trunk-objective (not corpus cleaning). Anomalies: 0 (no German collapse, kwr5/5). verdict=state/verdicts/clm303_clean/. Resolves ING#42378065 (BLOCKED). anima evaluate --py (590a950) is the canonical bypass path.
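The closure rule quoted above, closure = G0 ∧ G1 ∧ G2, can be written out as a short Python check. The pass/fail booleans are copied from the result line; the function names are hypothetical and the per-gate thresholds are not modeled, since the source does not state them.

```python
CLOSURE_GATES = ("G0", "G1", "G2")


def closure(gates):
    """Closure passes only if every one of G0, G1, G2 passes."""
    return all(gates[g] for g in CLOSURE_GATES)


def blocking(gates):
    """List the closure gates that failed."""
    return [g for g in CLOSURE_GATES if not gates[g]]


# Pass/fail values as reported for clm303_clean in the result line above.
reported = {"G0": True, "G1": False, "G2": True, "G3": True, "G5": True, "G6": False}
```

On the reported values `closure(reported)` is `False` and `blocking(reported)` names only G1, matching the "G1 recombination wall" verdict; G6 also failed but is not part of the closure conjunction.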

…erred pinned-record sync (ing ref→tracked)

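…sh complete) BUT merge-conflict-blocked: origin/main carries py-retire (ad18414·bf9f98bbc) while this branch is the 2-production restoration (revert), so the conflict = the substance of the governance reversal + entanglement with the #2651 teammate merge. Resolution = settle the conflict toward 2-production (py kept) while preserving #2651; because this touches governance + teammate integration, user review is required (an autonomous force-resolve risks overwriting #2651). Trigger word = merge (proceed with conflict resolution).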

# Conflicts: # CHANGELOG.jsonl # ING.jsonl

…ion-dedupe + #2598b cell trim; all #2651·#2598b teammate work preserved, ARCHITECTURE JSON-valid, 0 conflicts with the py-retire restoration. Now mergeable=MERGEABLE. 1 remaining BLOCK = the CI check 'engine compile+gates+smoke' fails on an infra glitch, cache-step hashFiles 'Fail to hash files' (unrelated to the code; it dies before the engine compile; the other 3 checks PASS · local verify PASS). Merge = after a CI re-trigger PASSes, or via admin-bypass (branch-protection override = owner). Trigger word = merge.

… check is broken repo-WIDE: the last 5 commits (ba48107·77d24b58·04585d58·8277395a·92cd9bae) all fail at cache-step hashFiles('core/**','cli/**') with 'Fail to hash files'. Of those, #2649·#2651 are already merged to main = owner-action merges have been proceeding with this check broken. Unrelated to my PR #2650 changes (it blocks every PR). The checkout(L141)→cache(L165) order is correct and the ci.yml syntax is valid = a hashFiles infra glitch on the GitHub runner (leftover from the Blacksmith→github-hosted migration 8f5128c? the check is named 'Blacksmith' but runs on Azure westus macos-15). Fix = harden the ci.yml cache-step or reconfigure the required check (owner CI infra) · merge = admin (owner), as with #2649·#2651. PR #2650 has its conflicts resolved and is MERGEABLE; only this broken check BLOCKs.

dancinlife and others added 11 commits June 28, 2026 15:11

docs(CHANGELOG): 4-fix batch (eval hoist + conv/glue device + agent p…

92cd9ba

…y→hexa) + measurement corrections

docs(ING): clm303 G0-G6 terminal scrub + 2-prod main-merge·H_1817 def…

f78a839

…erred pinned-record sync (ing ref→tracked)

Merge remote-tracking branch 'origin/main' into clitrain-fixes-land

ba48107

# Conflicts: # CHANGELOG.jsonl # ING.jsonl

dancinlife wants to merge 12 commits into main from clitrain-fixes-land
