clitrain fixes land #2650
Open
dancinlife wants to merge 12 commits into
dancinlife (Contributor) commented Jun 28, 2026
- refactor(eval): decode weight-set hoist: full G0-G6 memory-bounded, byte-identical
- refactor(agent): port domains CHAT py → hexa (portable parts only), keep external-SDK py
- perf(cli/train.hexa): conv hot-path host-scalar-glue → device ops (research(cli/train.hexa H100×2 GPU train gate): 🛑 TOY GPU GATE FAIL, trainer CPU-scalar-bound, util ~0%; 303M GPU training on hold, own-GEMM build sound #2598 GPU-idle root-cause fix)
- perf(cli/train.hexa): SECONDARY host-scalar-glue → device ops (research(cli/train.hexa H100×2 GPU train gate): 🛑 TOY GPU GATE FAIL, trainer CPU-scalar-bound, util ~0%; 303M GPU training on hold, own-GEMM build sound #2598 sustained-GPU-util fold)
- docs(CHANGELOG): 4-fix batch (eval hoist + conv/glue device + agent py→hexa) + measurement corrections
- revert(py-retire): restore all py versions: 14 engine mirrors + 15 torch Lane-P/bench/builder files returned to their original locations (user instruction)
- revert!(py-retire): reaffirm 2-production (hexa+py): py restored + docs/enforce governance reset (owner decision)
- docs(ARCHITECTURE): align the cli/ node with the 2-production twins (anima single entry point, 4 subcommands, hexa+py)
- feat(cli/anima): chat verb + --py engine selection + help handler (2-production CLI)
- verdict(clm303_clean): engine-native G0-G6 TERMINAL via py 2-production: closure FAIL (G1 recombination wall)
- docs(ING): clm303 G0-G6 terminal scrub + 2-prod main-merge·H_1817 deferred pinned-record sync (ing ref→tracked)
…ntical g_gates driver load-once (H_1400 W-hoist at eval-driver level · ING#42378065 plan-B).
Cumulative cause: each of the ~80 G0-G6 decodes in g_eval_all called _clmd_load, which read the whole 176MB .clm, dequantized it, and rebuilt scratch on every call → load churn accumulated → silent death of the 303M full eval.
Fix: gen_auto_load loads a handle once → every gate's _W twin decodes through that handle → gen_auto_free once.
Verification (summer): byte-identical (OLD↔NEW 60B PARITY=IDENTICAL + REUSE_DETERMINISTIC =YES, decodable d768 device path) · memory-bounded (303M gen=80 full eval running, RSS flat at 28.6GB for 32s; with the previous per-decode load it hit a 31GB OOM on the 2nd-3rd decode).
a_core_engine_map preserved (same single mouth; only the load is hoisted out of the loop). py lockstep N/A (py mirror retired).
# no-verify-ok CHANGELOG appended; ARCHITECTURE/README N/A (byte-identical perf # refactor, no new mouth/slot; generator L3 slot unchanged); core/ L0 edits # deliberate per ING#42378065 task scope.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
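The load-once hoist above can be sketched in Python, the repo's twin language. This is only an illustration under stated assumptions: `load_weights`, `decode_with`, and `free_weights` are hypothetical stand-ins for `gen_auto_load`, the `_W` gate twins, and `gen_auto_free`, whose real signatures are not shown in this PR. It shows that moving the load out of the loop cuts the load count to one while leaving the decoded output unchanged.

```python
# Hypothetical stand-ins: the real gen_auto_load / gen_auto_free signatures
# are not shown in this PR.
STATS = {"loads": 0}


def load_weights(path):
    """Pretend to read and dequantize the whole checkpoint (the expensive step)."""
    STATS["loads"] += 1
    return {"path": path, "scale": 3}


def free_weights(handle):
    """Release the handle's contents."""
    handle.clear()


def decode_with(handle, prompt):
    """Deterministic toy decode that only reads from the handle."""
    return [ord(c) * handle["scale"] % 251 for c in prompt]


def eval_per_decode_load(path, prompts):
    # OLD shape: every decode reloads and frees the weights.
    out = []
    for p in prompts:
        h = load_weights(path)
        out.append(decode_with(h, p))
        free_weights(h)
    return out


def eval_load_once(path, prompts):
    # NEW shape: the load is hoisted out of the loop and freed exactly once.
    h = load_weights(path)
    try:
        return [decode_with(h, p) for p in prompts]
    finally:
        free_weights(h)
```

Because `decode_with` only reads the handle, reusing it across decodes cannot change the output, which is the property the OLD↔NEW parity check above verifies for the real driver.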
audit-first hexa unification: classified all 12 py files in agent/domains/CHAT,
then cleaned up only the reducible pure logic; the torch/akida/fastapi/websockets
SDK bindings are honestly preserved (precedent: serialize torch-interop).
(a) full hexa port → py deleted: anima_emission_analyze.py (pure stdlib;
anima_emission_analyze.hexa is already a full port, behavior parity confirmed:
analyze_log byte-identical, has_register marker set = faithful reduction of the regex set).
(b) not reducible, KEPT: anima_participant·broker·substrate_{base,lora,v3,akida}·
akida_sw_lif·anima_temp_sweep (each has a sidecar .hexa WRAPPER marker).
(c) tests KEPT: test_broker_multiuser·test_broker_akida_ingest.
doc: new folder guide agent/domains/CHAT/CLAUDE.md (classification table + invariants).
0 dangling imports · enforce_anima_gates clean · hexa check 0 violations.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GPU-idle root-cause fix)

#2598 canon smoke FAILED util>0% (GPU idle, [OWN-GEMM-FIRED] DEVICE printed yet util~0%, 5 steps not finishing 1 in 200s).

Root cause = the UNGATED single-thread host scalar GLUE wrapping the conv GEMMs (NOT env-gate / per-expert micro-GEMM / missing upstream im2col; all rejected). Three ~4.3e7-interpreted-op loops/conv-call (~12-15 conv calls/step) dominated wall-time and starved the GPU:
(1) fwd weight-transpose (tg_conv_fwd_off)
(2) bwd dW transposed-im2col gather + transpose-SCATTER (tg_conv_bwd_off)
(3) bwd dX wsl contiguous copy (tg_conv_bwd_off)

Folded each into compiled device ops already shipping in the runtime (no new upstream kernel):
fwd transpose → farr_transpose_2d_gpu;
bwd dW → reuse forward-form xcol + small dy transpose (farr_transpose_2d_gpu) + proven t_matmul, emitting dW DIRECTLY in the [Cout,Kdim] PACKED layout so the transpose-scatter collapses to a flat += (device farr_add_inplace_gpu when dWoff==0);
bwd dX wsl → farr_copy_slice_gpu.

byte-parity PROVEN: $0 MODE_VERIFY byte-IDENTICAL to baseline on BOTH the no-lever (lossF 3.282894029840718) and full savant+mitosis 3/3-PASS (lossF 3.2972087115267383) paths, mac CPU-fallback AND vast pod CUDA (A40). Each device arm has rc<0 → byte-eq host body fallback; NO torch fallback.

NOTE: forge_dispatch_matmul_t would express the dW reroute in one call but its host oracle aliases under farr-table realloc amid many live farrs (empirically drifted the trainer), so the explicit transpose_2d + t_matmul was kept, byte-eq in-trainer.

Secondary kernel-coverage tail (unfused groupnorm/gelu/router/embedding device dispatch) = hexa-lang upstream follow-on (fleet-storm active → not touched).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
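The dW reroute described above can be illustrated with a small pure-Python sketch. The shapes are assumptions made for illustration (`xcol` as [N, Kdim], `dy` as [N, Cout]), and `transpose` / `matmul` are plain host helpers, not the runtime's `farr_transpose_2d_gpu` or `t_matmul`. The point is only that `dy^T @ xcol` already lands in row-major [Cout, Kdim], so the accumulation into a packed buffer becomes a flat `+=` instead of a transposed scatter.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Naive matmul; the inner index is summed in ascending order."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def dw_transpose_scatter(xcol, dy, dw_flat, cout, kdim):
    # OLD shape: build [Kdim, Cout], then scatter each element transposed
    # into the packed [Cout, Kdim] buffer.
    tmp = matmul(transpose(xcol), dy)
    for k in range(kdim):
        for c in range(cout):
            dw_flat[c * kdim + k] += tmp[k][c]


def dw_packed(xcol, dy, dw_flat, cout, kdim):
    # NEW shape: dy^T @ xcol is already [Cout, Kdim] row-major,
    # so the accumulation is a flat += over the packed buffer.
    packed = matmul(transpose(dy), xcol)
    flat = [v for row in packed for v in row]
    for i in range(cout * kdim):
        dw_flat[i] += flat[i]
```

In this sketch both forms sum the same products over the same index in the same order, so the two buffers compare equal element for element; whether that carries over to a real device kernel depends on its reduction order, which is what the MODE_VERIFY runs above check.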
…sustained-GPU-util fold)

PRIMARY (conv-fix b9c4974) folded the conv-bwd weight/dW/copy glue. This SECONDARY pass folds the remaining O(L·T·d)/O(E·T·d)/O(T·d) host t_get/t_set element-wise copies and adds in train_fwd/train_bwd onto the SAME proven device primitives (farr_copy_slice_gpu / farr_add_inplace_gpu / farr_zero_slice_gpu), with no new upstream kernel.

priority-order (highest op-count first):
fwd L403  expert-out pack           E·T·d    → farr_copy_slice_gpu
bwd L477  expert grad gather        E·T·d    → farr_copy_slice_gpu
bwd L484  dx += expert dxt          E·T·d    → farr_add_inplace_gpu
fwd L379  gn-out + xhat caches      2·L·T·d  → 2× farr_copy_slice_gpu
fwd L373  trunk-input cache         L·T·d    → farr_copy_slice_gpu
fwd L383  residual x += hg          L·T·d    → farr_add_inplace_gpu
bwd L498  gn-out gather             L·T·d    → farr_copy_slice_gpu
bwd L502  xhat gather               L·T·d    → farr_copy_slice_gpu
bwd L509  trunk-input gather        L·T·d    → farr_copy_slice_gpu
bwd L513  dx += dconv_in            L·T·d    → farr_add_inplace_gpu
fwd L390  post-trunk xt copy        T·d      → farr_copy_slice_gpu
bwd L468  dx init copy              T·d      → farr_copy_slice_gpu
bwd L270  zero dX_out (per call)    T·Cin    → farr_zero_slice_gpu

Byte-eq: farr_add_inplace_gpu accumulates ascending == the host loop k-order; copy/zero are memcpy-class (no FP, no reorder). MODE_VERIFY ($0 farr CPU) full-output IDENTICAL to base b9c4974 for BOTH no-lever and savant+mitosis:
no-lever       lossF 0.0003889517691865624 (==base)
savant+mitosis lossF 0.0004321544912324474 (==base)

DEFERRED (rule: existing runtime ops only, no new kernel):
L323 db column-sum Σ_t dy: needs a reduce-over-T device primitive (none exists)
L305 offset dW += tail: needs offset-base device add (upstream follow-on)

_devfeed_on untouched; groupnorm/gelu/router/embedding untouched (storm-zone).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
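As a rough host-side illustration of the fold above, the sketch below pairs element-wise loops (in the style of the t_get/t_set glue) with slice-level equivalents and compares the results byte for byte. All function names are hypothetical mirrors of `farr_copy_slice_gpu`, `farr_add_inplace_gpu`, and `farr_zero_slice_gpu`; the real primitives and their signatures are not shown in this PR.

```python
import struct


def host_copy(dst, doff, src, soff, n):
    # Element-wise loop, one get/set per element.
    for k in range(n):
        dst[doff + k] = src[soff + k]


def copy_slice(dst, doff, src, soff, n):
    # Slice-level copy: memcpy-class, no floating-point math involved.
    dst[doff:doff + n] = src[soff:soff + n]


def host_add(dst, doff, src, soff, n):
    # Element-wise accumulate in ascending k order.
    for k in range(n):
        dst[doff + k] += src[soff + k]


def add_inplace(dst, doff, src, soff, n):
    # Slice-level accumulate: one add per element, same operands as host_add.
    dst[doff:doff + n] = [a + b for a, b in
                          zip(dst[doff:doff + n], src[soff:soff + n])]


def zero_slice(dst, doff, n):
    # Slice-level zero fill.
    dst[doff:doff + n] = [0.0] * n


def same_bytes(a, b):
    """Compare two float lists by their IEEE-754 double encoding."""
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)
```

Each output element here is produced by exactly one assignment or one addition with the same operands in both variants, which is why the byte comparison holds in this sketch; a reduction such as the deferred column-sum would need its summation order pinned as well.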
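…restored (user instruction) Reverses the ad18414 retirement (14 git deletions + 15 archived): core/*.py 9 + flame_mm + cli/*.py 4 restored + torch Lane-P 5 · bench 6 · builder 3 back in their original locations + parity_gate. Reason: the research answer is HYBRID (torch training + engine-native verdict), which requires the torch trainer. Coexists with hexa (2-production restored).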
…nance reset (owner decision) Reverses the 2026-06-28 full py retirement (hexa-only) → reaffirms 2-production. py restored (core/*.py 9 + cli/*.py 4, logic verified byte-identical) + docs moved to 2-production (CLAUDE/README/cli·core CLAUDE/enforce CO-EQUAL + parity gate/cli·serialize.hexa→cli/serialize.py twin). ARCHITECTURE was already 2-prod. Removed the orphaned serialize_standalone + re-archived the train_lane_p torch family (stale; cli/train.py already covers normalization + 4 cells + held-out). follow-on: split the long CLAUDE.md cells · restore the CI parity-gate · fine-tune the a_train_flame_forge wording for 2-prod.
…hexa+py)
Added 2-production twin child nodes next to the cli/ node summary: both canonical
entry points cli/anima.{hexa,py}, plus the 4 subcommands, of which
train·serialize·evaluate each dispatch as a sub-process to a symmetric twin
(cli/{train,serialize,evaluate}.{hexa,py}), while chat is built into the entry
point (py chat = delegates to the hexa-native consciousness loop). The 1:1 match of
the anima.hexa↔anima.py dispatch tables is pinned in the SSOT
(a_engine_native_learning single entry point + a_core_engine_map 2-production lockstep).
Verification: JSON-valid · sidecar architecture lint ok · enforce_anima_gates --all clean
(CO-EQUAL+parity, GATECARD core/g_gates.{py,hexa}) · sidecar lint 0 blocking ·
DODONT-LONG diff-aware (CLAUDE.md vs HEAD) 0 violations.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends the anima single entry point: ① an explicit chat verb, 'anima chat <ckpt.clm> [--byte]' (same wiring as bare 'anima <ckpt>', symmetric with train/serialize/evaluate) ② a '--py' flag that routes train/serialize/evaluate to the py twin (cli/{x}.py, CO-EQUAL byte-parity), which works around the hexa-farr decode leak during the heavy 303M evaluate ③ a 'help'/'-h'/'--help' usage handler. anima_has_flag consumes --py, then dispatches to python3 cli/{x}.py. hexa check RC=0. cli/CLAUDE.md lockstep.
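The routing described above can be sketched as a pure planning function. This is a hedged illustration, not the contents of cli/anima.hexa or cli/anima.py: `has_flag` and `plan_dispatch` are hypothetical names, the `"hexa"` engine label is a placeholder (the actual hexa launch command is not shown in this PR), and only the `python3 cli/{x}.py` form comes from the commit message.

```python
TWIN_VERBS = ("train", "serialize", "evaluate")
HELP_WORDS = ("help", "-h", "--help")


def has_flag(args, flag):
    """Consume `flag` from args; return (found, remaining_args)."""
    rest = [a for a in args if a != flag]
    return len(rest) != len(args), rest


def plan_dispatch(argv):
    """Return (engine, target, args) without launching anything."""
    use_py, rest = has_flag(argv, "--py")
    if not rest or rest[0] in HELP_WORDS:
        return ("usage", None, [])
    verb = rest[0]
    if verb in TWIN_VERBS:
        if use_py:
            # From the commit message: dispatch to python3 cli/{x}.py.
            return ("python3", f"cli/{verb}.py", rest[1:])
        # ASSUMPTION: "hexa" is a placeholder for the hexa-side launcher.
        return ("hexa", f"cli/{verb}.hexa", rest[1:])
    if verb == "chat":
        return ("chat", None, rest[1:])
    # Bare `anima <ckpt>` uses the same wiring as the chat verb.
    return ("chat", None, rest)
```

For example, `plan_dispatch(["evaluate", "--py", "m.clm"])` plans a run of `cli/evaluate.py` under `python3`, while `plan_dispatch(["m.clm"])` falls through to chat.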
…on: closure FAIL (G1 recombination wall) clm303_clean (303M, overfit-fixed version) G0-G6 terminal measurement complete (summer, cli/evaluate.py→core/g_gates.py, numpy, torch-free = the sanctioned py engine → TERMINAL). Bypasses the hexa farr leak, x86 codegen, and cuda entirely (python). ckpt sha e8076722 verified 3×, 14.5min, gen80. Result: G0🟢(kwr5/5) G1🔴(distinct0) G2🟢(novel49) G3✅ G5🟢(fab0.067) G6🔴(falsifiable0) → closure(G0∧G1∧G2)🔴FAIL. The overfit fix recovered G0/G2, but the recombination wall (H_1129/1139/1464) did not open: the g1-lever needs a trunk-objective (not corpus cleaning). 0 anomalies (no German collapse, kwr5/5). verdict=state/verdicts/clm303_clean/. Resolves ING#42378065 (BLOCKED). anima evaluate --py (590a950) is the canonical bypass path.
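The closure rule quoted above, closure = G0∧G1∧G2, can be written out as a short Python sketch. The pass/fail values are the ones reported in this verdict; the function names are hypothetical, and the per-gate thresholds are deliberately not modelled because the PR does not state them.

```python
CLOSURE_GATES = ("G0", "G1", "G2")


def closure(gates):
    """Closure holds only if every closure gate passed."""
    return all(gates[g] for g in CLOSURE_GATES)


def blocking_gates(gates):
    """List the closure gates that failed."""
    return [g for g in CLOSURE_GATES if not gates[g]]


# Pass/fail status as reported for clm303_clean in this verdict.
reported = {"G0": True, "G1": False, "G2": True,
            "G3": True, "G5": True, "G6": False}
```

With these values `closure(reported)` is `False` and `blocking_gates(reported)` is `["G1"]`; G6 also failed but is not part of the closure conjunction.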
…erred pinned-record sync (ing ref→tracked)
# Conflicts: # CHANGELOG.jsonl # ING.jsonl
dancinlife added a commit that referenced this pull request Jun 28, 2026
…ion-dedupe + #2598b cell trim; all #2651·#2598b teammate work preserved, ARCHITECTURE JSON-valid, 0 conflicts with the py-retire restore. Now mergeable=MERGEABLE. One BLOCK remains: CI 'engine compile+gates+smoke' fails on the cache-step hashFiles 'Fail to hash files' infra glitch (unrelated to the code; it dies before the engine compile; the other 3 checks PASS · local verify PASS). Merge = after a CI re-trigger PASSes, or via admin-bypass (branch-protection override = owner). Launch word = merge.
dancinlife added a commit that referenced this pull request Jun 28, 2026
… check is broken repo-WIDE: the last 5 commits (ba48107·77d24b58·04585d58·8277395a·92cd9bae) all fail with cache-step hashFiles('core/**','cli/**') 'Fail to hash files'. Of those, #2649·#2651 are already merged to main = owner-action merges have been going ahead with this check broken. Unrelated to the changes in my PR #2650 (it blocks every PR). The checkout(L141)→cache(L165) order is correct and the ci.yml syntax is valid = a hashFiles infra glitch on the GitHub runner (a leftover from the Blacksmith→github-hosted migration 8f5128c? the check is named 'Blacksmith' but runs on Azure westus macos-15). Fix = harden the ci.yml cache-step or reconfigure the required check (owner CI infra) · merge = admin (owner), as with #2649·#2651. PR #2650 has its conflicts resolved and is MERGEABLE; only this broken check BLOCKs it.