clitrain fixes land #2650

Open
dancinlife wants to merge 12 commits into main from clitrain-fixes-land

Conversation

@dancinlife

Contributor

dancinlife and others added 11 commits June 28, 2026 15:11
…ntical

g_gates driver load-once (H_1400 W-hoist at eval-driver level · ING#42378065
plan-B). Cumulative cause: each of the ~80 G0-G6 decodes in g_eval_all called
_clmd_load, which re-read the whole 176MB .clm, dequantized it and rebuilt
scratch → load churn accumulated → the 303M full eval died silently. Fix:
gen_auto_load loads a handle once → every gate's _W twin decodes through that
handle → gen_auto_free once.

Verification (summer): byte-identical (OLD↔NEW 60B PARITY=IDENTICAL +
REUSE_DETERMINISTIC=YES, decodable d768 device path) · memory bounded (during a
303M gen=80 full eval, RSS stayed FLAT at 28.6GB for 32s; the previous
per-decode load hit a 31GB OOM by the 2nd-3rd decode). a_core_engine_map
preserved (same single mouth, only the load is hoisted out of the loop).
py lockstep N/A (py mirror retired).
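
The load-once hoist described in this commit can be illustrated with a minimal Python sketch. Every name in it (`CountingLoader`, `decode`, `eval_per_call`, `eval_load_once`) is hypothetical and only stands in for `_clmd_load`, `gen_auto_load` and `gen_auto_free`, whose real signatures are not shown in this PR. It demonstrates only the shape of the change: hoisting the load out of the loop leaves the outputs unchanged while reducing the number of loads from one per decode to one in total.

```python
class CountingLoader:
    """Hypothetical stand-in for the checkpoint loader; counts full loads."""

    def __init__(self):
        self.loads = 0

    def load(self, path):
        self.loads += 1
        # A real loader would read and dequantize the whole checkpoint here.
        return {"path": path, "weights": [1, 2, 3]}


def decode(handle, prompt):
    """Toy deterministic decode that depends only on the handle and prompt."""
    return f"{prompt}:{sum(handle['weights'])}"


def eval_per_call(loader, path, prompts):
    """Old shape: every decode reloads the checkpoint."""
    return [decode(loader.load(path), p) for p in prompts]


def eval_load_once(loader, path, prompts):
    """New shape: load one handle, decode everything through it, free once."""
    handle = loader.load(path)
    try:
        return [decode(handle, p) for p in prompts]
    finally:
        handle.clear()  # stands in for the single free at the end


prompts = [f"p{i}" for i in range(80)]
old_loader, new_loader = CountingLoader(), CountingLoader()
old_out = eval_per_call(old_loader, "model.clm", prompts)
new_out = eval_load_once(new_loader, "model.clm", prompts)
```

Comparing `old_out` with `new_out` plays the role of the parity check, and the two `loads` counters play the role of the load-churn measurement.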

# no-verify-ok CHANGELOG appended; ARCHITECTURE/README N/A (byte-identical perf
# refactor, no new mouth/slot — generator L3 slot unchanged); core/ L0 edits
# deliberate per ING#42378065 task scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
audit-first hexa unification: classified all 12 py files under
agent/domains/CHAT, then cleaned up only the reducible pure logic; the
torch/akida/fastapi/websockets SDK bindings are honestly preserved
(precedent: serialize torch-interop).

(a) full hexa port → py deleted: anima_emission_analyze.py (pure stdlib;
    anima_emission_analyze.hexa is already a full port, behavior parity confirmed:
    analyze_log byte-identical, has_register marker set = faithful reduction of the regex set).
(b) irreducible, KEPT: anima_participant·broker·substrate_{base,lora,v3,akida}·
    akida_sw_lif·anima_temp_sweep (each carries a sidecar .hexa WRAPPER marker).
(c) tests KEPT: test_broker_multiuser·test_broker_akida_ingest.

doc: new folder guide agent/domains/CHAT/CLAUDE.md (classification table + invariants).
dangling import 0 · enforce_anima_gates clean · hexa check 0 violations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
 GPU-idle root-cause fix)

#2598 canon smoke FAILED util>0% (GPU idle, [OWN-GEMM-FIRED] DEVICE printed yet
util~0%, 5 steps not finishing 1 in 200s). Root cause = the UNGATED single-thread
host scalar GLUE wrapping the conv GEMMs (NOT env-gate / per-expert micro-GEMM /
missing upstream im2col — all rejected). Three ~4.3e7-interpreted-op loops/conv-call
(~12-15 conv calls/step) dominated wall-time and starved the GPU:
  (1) fwd weight-transpose (tg_conv_fwd_off)
  (2) bwd dW transposed-im2col gather + transpose-SCATTER (tg_conv_bwd_off)
  (3) bwd dX wsl contiguous copy (tg_conv_bwd_off)

Folded each into compiled device ops already shipping in the runtime (no new upstream
kernel): fwd transpose → farr_transpose_2d_gpu; bwd dW → reuse forward-form xcol +
small dy transpose (farr_transpose_2d_gpu) + proven t_matmul, emitting dW DIRECTLY in
the [Cout,Kdim] PACKED layout so the transpose-scatter collapses to a flat += (device
farr_add_inplace_gpu when dWoff==0); bwd dX wsl → farr_copy_slice_gpu.
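
The dW reroute above can be illustrated with a small pure-Python sketch. It is not the trainer's code: `transpose`, `matmul`, `dw_transpose_scatter` and `dw_packed` are hypothetical helpers, and the real `farr_transpose_2d_gpu` / `t_matmul` signatures are not shown in this PR. Assuming `xcol` has shape [N, Kdim] and `dy` has shape [N, Cout], it shows why producing dW directly in the packed [Cout, Kdim] layout turns the transpose-scatter into a flat `+=` with identical results.

```python
def transpose(m):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Row-major matmul; each dot product accumulates in ascending inner index."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def dw_transpose_scatter(xcol, dy, grad, cout, kdim):
    """Old shape: compute [Kdim, Cout], then scatter each element transposed."""
    dw_t = matmul(transpose(xcol), dy)
    for k in range(kdim):
        for c in range(cout):
            grad[c * kdim + k] += dw_t[k][c]


def dw_packed(xcol, dy, grad, cout, kdim):
    """New shape: compute [Cout, Kdim] directly, then accumulate flat."""
    dw = matmul(transpose(dy), xcol)
    flat = [v for row in dw for v in row]
    for i in range(cout * kdim):
        grad[i] += flat[i]


N, KDIM, COUT = 3, 4, 2
xcol = [[0.1 * (n + k) for k in range(KDIM)] for n in range(N)]
dy = [[0.3 * (n - c) for c in range(COUT)] for n in range(N)]

grad_old = [1.0] * (COUT * KDIM)
grad_new = [1.0] * (COUT * KDIM)
dw_transpose_scatter(xcol, dy, grad_old, COUT, KDIM)
dw_packed(xcol, dy, grad_new, COUT, KDIM)
```

Both paths sum over the same index in the same order and floating-point multiplication is commutative, so in this sketch `grad_old` and `grad_new` compare equal element for element.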

byte-parity PROVEN: $0 MODE_VERIFY byte-IDENTICAL to baseline on BOTH the no-lever
(lossF 3.282894029840718) and full savant+mitosis 3/3-PASS (lossF 3.2972087115267383)
paths, mac CPU-fallback AND vast pod CUDA (A40). Each device arm has rc<0 → byte-eq
host body fallback; NO torch fallback.

NOTE: forge_dispatch_matmul_t would express the dW reroute in one call but its host
oracle aliases under farr-table realloc amid many live farrs (empirically drifted the
trainer) — kept the explicit transpose_2d + t_matmul, byte-eq in-trainer. Secondary
kernel-coverage tail (unfused groupnorm/gelu/router/embedding device dispatch) =
hexa-lang upstream follow-on (fleet-storm active → not touched).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sustained-GPU-util fold)

PRIMARY (conv-fix b9c4974) folded the conv-bwd weight/dW/copy glue. This
SECONDARY pass folds the remaining O(L·T·d)/O(E·T·d)/O(T·d) host t_get/t_set
element-wise copies and adds in train_fwd/train_bwd onto the SAME proven device
primitives (farr_copy_slice_gpu / farr_add_inplace_gpu / farr_zero_slice_gpu) —
no new upstream kernel. priority-order (highest op-count first):

  fwd  L403 expert-out pack         E·T·d  → farr_copy_slice_gpu
  bwd  L477 expert grad gather      E·T·d  → farr_copy_slice_gpu
  bwd  L484 dx += expert dxt        E·T·d  → farr_add_inplace_gpu
  fwd  L379 gn-out + xhat caches  2·L·T·d  → 2× farr_copy_slice_gpu
  fwd  L373 trunk-input cache       L·T·d  → farr_copy_slice_gpu
  fwd  L383 residual x += hg        L·T·d  → farr_add_inplace_gpu
  bwd  L498 gn-out gather           L·T·d  → farr_copy_slice_gpu
  bwd  L502 xhat gather             L·T·d  → farr_copy_slice_gpu
  bwd  L509 trunk-input gather      L·T·d  → farr_copy_slice_gpu
  bwd  L513 dx += dconv_in          L·T·d  → farr_add_inplace_gpu
  fwd  L390 post-trunk xt copy        T·d  → farr_copy_slice_gpu
  bwd  L468 dx init copy              T·d  → farr_copy_slice_gpu
  bwd  L270 zero dX_out (per call)  T·Cin  → farr_zero_slice_gpu

Byte-eq: farr_add_inplace_gpu accumulates ascending == the host loop k-order;
copy/zero are memcpy-class (no FP, no reorder). MODE_VERIFY ($0 farr CPU)
full-output IDENTICAL to base b9c4974 for BOTH no-lever and savant+mitosis:
  no-lever       lossF 0.0003889517691865624 (==base)
  savant+mitosis lossF 0.0004321544912324474 (==base)
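
The byte-equality argument above can be sketched in plain Python. The `host_*` functions mimic the element-wise t_get/t_set loops and the `slice_*` functions mimic bulk copy/add/zero primitives; all six names are hypothetical, since the real `farr_copy_slice_gpu` / `farr_add_inplace_gpu` / `farr_zero_slice_gpu` signatures are not shown in this PR.

```python
def host_copy(dst, dst_off, src, src_off, n):
    """Element-wise copy loop (the host glue being replaced)."""
    for k in range(n):
        dst[dst_off + k] = src[src_off + k]


def host_add(dst, dst_off, src, src_off, n):
    """Element-wise accumulate loop, ascending k."""
    for k in range(n):
        dst[dst_off + k] += src[src_off + k]


def host_zero(dst, dst_off, n):
    """Element-wise zero loop."""
    for k in range(n):
        dst[dst_off + k] = 0.0


def slice_copy(dst, dst_off, src, src_off, n):
    """Bulk copy: no arithmetic, so no rounding can differ."""
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]


def slice_add(dst, dst_off, src, src_off, n):
    """Bulk add: one addition per element, the same operands as the loop."""
    dst[dst_off:dst_off + n] = [
        a + b for a, b in zip(dst[dst_off:dst_off + n], src[src_off:src_off + n])
    ]


def slice_zero(dst, dst_off, n):
    """Bulk zero fill."""
    dst[dst_off:dst_off + n] = [0.0] * n


src = [0.1 * i for i in range(12)]
loop_buf = [1.0] * 12
bulk_buf = [1.0] * 12

host_copy(loop_buf, 0, src, 4, 4)
slice_copy(bulk_buf, 0, src, 4, 4)
host_add(loop_buf, 4, src, 0, 4)
slice_add(bulk_buf, 4, src, 0, 4)
host_zero(loop_buf, 8, 4)
slice_zero(bulk_buf, 8, 4)
```

Because each output element sees exactly one operation with the same operands in both variants, `loop_buf` and `bulk_buf` end up identical in this sketch.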

DEFERRED (rule: existing runtime ops only, no new kernel):
  L323 db column-sum Σ_t dy — needs a reduce-over-T device primitive (none exists)
  L305 offset dW += tail    — needs offset-base device add (upstream follow-on)
_devfeed_on untouched; groupnorm/gelu/router/embedding untouched (storm-zone).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…restore (user directive)

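Reverses the ad18414 retirement (14 git deletions + 15 archived): restores 9 core/*.py + flame_mm + 4 cli/*.py, puts torch Lane-P ×5 · bench ×6 · builder ×3 back in place, plus parity_gate. Reason: the research answer is HYBRID (torch training + engine-native verdict), which needs the torch trainer. Coexists with hexa (2-production restored).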
…nance reset (owner decision)

2026-06-28: reverses the full py retirement (hexa-only) → 2-production reaffirmed. py restored (9 core/*.py + 4 cli/*.py, logic verified byte-identical) + docs moved to 2-production (CLAUDE/README/cli·core CLAUDE/enforce CO-EQUAL + parity gate/cli·serialize.hexa→cli/serialize.py twin). ARCHITECTURE was already 2-prod. Removed the orphaned serialize_standalone + re-archived the train_lane_p torch family (stale; cli/train.py already covers normalization + 4 cells + held-out). follow-on: split the long cells in CLAUDE.md · restore the CI parity-gate · fine-tune the a_train_flame_forge wording for 2-prod.
…hexa+py)

Adds 2-production twin child nodes next to the cli/ node summary: both canonical
entry points cli/anima.{hexa,py}, and of the 4 subcommands, train·serialize·evaluate
each dispatch as a sub-process to a symmetric twin
(cli/{train,serialize,evaluate}.{hexa,py}), while chat is built into the entry
point (py chat delegates to the hexa-native consciousness loop). The 1:1 match of
the anima.hexa↔anima.py dispatch tables is pinned in the SSOT
(a_engine_native_learning single entry + a_core_engine_map 2-production lockstep).

Verification: JSON-valid · sidecar architecture lint ok · enforce_anima_gates --all clean
(CO-EQUAL+parity, GATECARD core/g_gates.{py,hexa}) · sidecar lint 0 blocking ·
DODONT-LONG diff-aware (CLAUDE.md vs HEAD) 0 violations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
anima single-entry extension:
① explicit chat verb 'anima chat <ckpt.clm> [--byte]' (same wiring as bare anima <ckpt>, symmetric with train/serialize/evaluate)
② '--py' flag routes train/serialize/evaluate to the py twin (cli/{x}.py, CO-EQUAL byte-parity), bypassing the hexa-farr decode leak during the heavy 303M evaluate
③ 'help'/'-h'/'--help' usage handler.
anima_has_flag consumes --py, then dispatches to python3 cli/{x}.py. hexa check RC=0. cli/CLAUDE.md in lockstep.
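
The `--py` routing in this commit can be sketched as a small argv planner in Python. This is an illustration under stated assumptions, not the contents of cli/anima.py: `has_flag` and `plan_dispatch` are hypothetical names, and `hexa_launcher` is a placeholder because the PR does not show how a .hexa twin is actually launched. The function only returns the command it would spawn, so nothing is executed.

```python
SUBCOMMANDS = ("train", "serialize", "evaluate")
USAGE_FLAGS = ("help", "-h", "--help")


def has_flag(args, flag):
    """Return (found, remaining_args) with every occurrence of flag removed."""
    rest = [a for a in args if a != flag]
    return len(rest) != len(args), rest


def plan_dispatch(argv, hexa_launcher=("hexa", "run")):
    """Map an anima argv to the command that would be spawned.

    hexa_launcher is a placeholder: the real hexa invocation is an assumption.
    """
    use_py, rest = has_flag(argv, "--py")
    if not rest or rest[0] in USAGE_FLAGS:
        return ["usage"]
    verb = rest[0]
    if verb in SUBCOMMANDS:
        if use_py:
            # --py was consumed above; route to the Python twin.
            return ["python3", f"cli/{verb}.py", *rest[1:]]
        return [*hexa_launcher, f"cli/{verb}.hexa", *rest[1:]]
    # 'anima chat <ckpt>' and bare 'anima <ckpt>' share the same wiring.
    ckpt_args = rest[1:] if verb == "chat" else rest
    return ["chat", *ckpt_args]


py_eval = plan_dispatch(["evaluate", "--py", "ckpt.clm"])
explicit_chat = plan_dispatch(["chat", "ckpt.clm", "--byte"])
bare_chat = plan_dispatch(["ckpt.clm", "--byte"])
```

Returning the planned command instead of spawning it keeps the routing table easy to compare between the two entry points.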
…on: closure FAIL (G1 recombination wall)

clm303_clean (303M, overfit-fixed build): G0-G6 terminal measurement complete (summer; cli/evaluate.py→core/g_gates.py, numpy, torch-free = the certified py engine → TERMINAL). Bypasses the hexa farr leak, x86 codegen and cuda entirely (python). ckpt sha e8076722 verified 3×, 14.5min, gen80.

Result: G0🟢(kwr5/5) G1🔴(distinct0) G2🟢(novel49) G3✅ G5🟢(fab0.067) G6🔴(falsifiable0) → closure(G0∧G1∧G2)🔴FAIL. The overfit fix recovered G0/G2, but the recombination wall (H_1129/1139/1464) did not open: the g1-lever needs a trunk-objective (not corpus cleaning). 0 anomalies (no German collapse, kwr5/5). verdict=state/verdicts/clm303_clean/. ING#42378065 (BLOCKED) resolved. anima evaluate --py (590a950) is the canonical bypass path.
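
The closure rule quoted in the result, closure = G0∧G1∧G2, can be written out as a minimal Python sketch. The pass/fail booleans are transcribed from the reported markers (🟢/✅ as pass, 🔴 as fail); the per-gate thresholds behind them are not given in this PR, and `closure` / `blocking_gates` are hypothetical helper names.

```python
# Pass/fail per gate as reported for clm303_clean (G4 is not listed in the result).
gates = {"G0": True, "G1": False, "G2": True, "G3": True, "G5": True, "G6": False}

REQUIRED = ("G0", "G1", "G2")


def closure(results, required=REQUIRED):
    """Closure holds only if every required gate passed."""
    return all(results[g] for g in required)


def blocking_gates(results, required=REQUIRED):
    """List the required gates that failed, in order."""
    return [g for g in required if not results[g]]


verdict = closure(gates)
blockers = blocking_gates(gates)
```

G6 also failed, but it is outside the conjunction, so only G1 blocks closure here.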
dancinlife added a commit that referenced this pull request Jun 28, 2026
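…sh done) BUT merge-conflict-blocked: origin/main carries the py-retire commits (ad18414·bf9f98bbc) while this branch restores (reverts to) 2-production, so the conflict is the governance reversal itself, entangled with the #2651 teammate merge. Resolution: settle the conflicts toward 2-production (keep py) while preserving #2651. Because this combines governance with teammate integration, user review is required (an autonomous force-resolve risks overwriting #2651). Trigger word = 머지 (merge), which proceeds with conflict resolution.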
dancinlife added a commit that referenced this pull request Jun 28, 2026
…ion-dedupe + #2598b cell trim; all #2651·#2598b teammate work preserved, ARCHITECTURE JSON-valid, 0 conflicts from the py-retire restore. Now mergeable=MERGEABLE. One BLOCK remains: the CI check 'engine compile+gates+smoke' fails on the cache-step hashFiles 'Fail to hash files' infra glitch (unrelated to the code; it dies before engine compile; the other 3 checks PASS and local verify PASS). Merge after a CI re-trigger PASSes, or via admin-bypass (branch-protection override = owner). Trigger word = 머지 (merge).
dancinlife added a commit that referenced this pull request Jun 28, 2026
… check is broken repo-WIDE: the 5 most recent commits (ba48107·77d24b58·04585d58·8277395a·92cd9bae) all fail at the cache-step hashFiles('core/**','cli/**') with 'Fail to hash files'. Of those, #2649·#2651 are already merged into main, i.e. owner-action merges have been proceeding with this check broken. Unrelated to the changes in my PR #2650 (it blocks every PR). The checkout (L141) → cache (L165) order is correct and the ci.yml syntax is valid, so this is a hashFiles infra glitch on the GitHub runner (a leftover of the Blacksmith→github-hosted migration 8f5128c? The check is named 'Blacksmith' but runs on Azure westus macos-15). Fix: harden the ci.yml cache-step or reconfigure the required check (owner CI infra) · Merge: admin (owner), as with #2649·#2651. PR #2650 has its conflicts resolved and is MERGEABLE; only this broken check BLOCKs.