Changelog

All notable changes to Vortex are documented here. The format is based on Keep a Changelog, and the project follows the version pins recorded in VERSION (VORTEX_VERSION, TOOLCHAIN_REV, GEM5_REV).

[3.0] — 2026-06-08

The 3.0 release introduces a fixed-function graphics stack (rasterizer, texture units, and output mergers), tensor core structured sparsity (2:4), warpgroup-level matrix multiplication (WGMMA), global-to-local data transfer acceleration (DXA), a new hardware kernel scheduler (KMU) and Command Processor (CP) architecture, a new asynchronous runtime API (vortex2.h), asynchronous barriers with arrive/wait/event semantics, compressed instruction set (RVC) support, hardware atomics, an MMU/SV32 virtual memory stack, a Mesa/lavapipe Vulkan backend (vortexpipe), HIP via chipStar, gem5 integration, a SimX v3 TLM architecture with fixed-size handshake channels, productized Synopsys and Yosys ASIC synthesis flows, and a refreshed toolchain (LLVM 20, POCL 7.0).

Build and configuration infrastructure was reworked: TOML-driven HW configuration (VX_config.toml + VX_types.toml) decoupling SimX/runtime from the RTL source tree, a VX_CFG_ macro namespace that resolves toolchain preprocessor collisions, retirement of the global toolchain_env.sh to enable parallel multi-version Vortex worktrees on the same shell, consolidation of kernel/ and runtime/ under a shared sw/ root, a single-source VERSION file driving CI toolchain pinning, Perfetto trace export (ci/perfetto.py), and new top-level AGENTS.md + CONTRIBUTING.md for AI-agent and contributor workflows.

Added

  • TCU tfr arithmetic backend. New in-house, fully-synthesizable fused dot-product running integer and floating-point through one shared 4-cycle pipeline; gated by VX_CFG_TCU_TYPE_TFR. Adds FP8 (e4m3), BF8 (e5m2), and TF32 on top of the v2.x set (fp32 / fp16 / bf16 / i32 / i8 / u8 / i4 / u4). Each is gated by its own VX_CFG_TCU_{FP8,BF16,TF32}_ENABLE; format dispatch is unified across all four FEDP backends.
  • Tensor-core structured sparsity (2:4). VX_tcu_sp_mux + VX_tcu_meta datapath plus host compress_2to4_matrix / prune_2to4_matrix helpers; gated by VX_CFG_TCU_SPARSE_ENABLE.
  • Warpgroup-level MMA (WGMMA). Per-warp NRA=4 / variable-NRC fragment layout, S/R source modes, smem descriptor path; gated by VX_CFG_TCU_WGMMA_ENABLE.
  • Data-transfer Acceleration (DXA). Async global→local DMA engine for tile staging (hw/rtl/dxa/ + sim/simx/dxa/).
  • Hardware Kernel Management Unit (KMU). New scheduler block (hw/rtl/VX_kmu.sv + sim/simx/kmu/) that owns CTA dispatch from the CP launch path.
  • Launch-level CTA clustering. vx_launch_info_t::cluster_dim[3] (sw/runtime/include/vortex2.h:202) guarantees every K-CTA group is co-resident on one core; KMU iterates intra-cluster offsets first (sim/simx/kmu/kmu.cpp:63) and the CTA dispatcher reserves K contiguous LMEM slots so DXA Path A multicast can target issuer + r*stride.
  • Command Processor (CP) v3. New hw/rtl/cp/ block + host-resident command ring (CMD_LAUNCH, CMD_MEM_*, CMD_DCR_*, CMD_CACHE_FLUSH, CMD_EVENT_*); integrated end-to-end across xrt, opae, simx, rtlsim behind VORTEX_USE_CP.
  • Asynchronous vortex2.h runtime API. Queues, events, modules, kernels, UVA raw-pointer kernel args, per-queue worker thread; legacy vortex.h retained as a thin wrapper.
  • C++ software CP model (sim/common/cmd_processor.cpp) shared by simx and rtlsim.
  • Graphics stack (RASTER / TEX / OM). Fixed-function 3D pipeline: hw/rtl/{raster,tex,om}/ + VX_graphics.sv + matching SimX models; --graphics regression group.
  • Public host-side graphics API. sw/runtime/include/graphics.h exposes vortex::graphics::Binning() (triangle setup + tile binning producing the on-wire rast_prim_t stream the RASTER unit reads) plus self-contained vertex_t / primitive_t input types and DCR address helpers for external Vulkan/HIP/OpenGL drivers.
  • Canonical on-wire graphics ABI in sw/kernel/include/vx_graphics.h. Templated POD vortex::graphics::fixed_t<F> (Q15.16 / Q?.24 with full arithmetic, all members public + trivially copyable) plus the on-wire structs (vec3e_t, rast_prim_t, rast_attribs_t, rast_tile_header_t, etc.) and 8888 pixel helpers.
  • Vortex SDK install layout. make install produces $VORTEX_PATH/{kernel,runtime}/{include,lib<XLEN>} plus lib/pkgconfig/vortex-{runtime,kernel}.pc (auto-generated from sw/runtime/vortex-runtime.pc.in and sw/runtime/vortex-kernel.pc.in at configure time). Default prefix is <build>/install. Why: gives downstream tools (mesa-vortex, pocl-vortex, chipStar) a single $VORTEX_PATH env var + pkg-config integration shape.
  • Vulkan support via a new Mesa Gallium driver vortexpipe selected through the lavapipe ICD; tests/vulkan/ suite (compute, draw3d, depth, textured, raytrace); Mesa shipped via the prebuilt toolchain; rv64 path enabled.
  • HIP support on rv32 + rv64 via chipStar. chipStar's hipcc now accepts --offload-pointer-width={32,64} and emits SPIR-V with the matching OpMemoryModel Physical{32,64}; a single libCHIP.so ships both widths' rtdevlib modules and selects at runtime via CL_DEVICE_ADDRESS_BITS. POCL on rv32 Vortex accepts the resulting Physical32 SPIR-V cleanly. See docs/designs/hip_on_vortex_chipstar.md.
  • Hardware atomics. RISC-V A-extension (LR/SC reservation table + cache-resident AMO* RMW), gated by VX_CFG_EXT_A_ENABLE. AMOs complete at the LLC while non-LLC banks invalidate the line on passthrough, so atomics are correct across the full L1/L2/L3 cache hierarchy.
  • In-house IEEE-754 FPU (VX_fpu_std), now F32 and F64. Fully RV-compliant scalar FPU built from Vortex-owned blocks, covering both single (F) and double (D) precision natively — VX_fma_unit (separate F32/F64 fused multiply-add cores, NVIDIA-style), merged-format VX_fdivsqrt_unit (radix-2 non-restoring FDIV + FSQRT; one carry-save datapath sized for the widest format — 17-cycle F32-only, 32-cycle when D is enabled), merged-format VX_fcvt_unit (I2F/F2I/F2F incl. FCVT.S.D subnormal/overflow), and merged VX_fncp_unit (sign-inject/min-max/compare/class/move with F32 NaN-box checking) — selected via VX_CFG_FPU_TYPE_STD. Validated against the full rv64u[fd]-p riscv-tests ISA suite on the RTL FPU. Why: removes the FPNEW dependency entirely — FPNEW is no longer required for D support on any flow (ASIC/Yosys/Synopsys and FPGA), only optionally selectable; the native units deliver higher fmax, lower latency, and smaller area, and the in-tree source unblocks block-level tuning that vendoring made impractical.
  • RISC-V Zicond (conditional ops). CZERO.EQZ / CZERO.NEZ integrated end-to-end (decode in VX_decode.sv, ALU in VX_alu_int.sv); gated by VX_CFG_EXT_ZICOND_ENABLE. Adds an ISA-level branchless-select primitive used by LLVM 20's codegen.
  • Pack-load intrinsics (vx_packlb_f / vx_packlh_f). Single-instruction strided loads that fold 4×byte (PACKLB) or 2×halfword (PACKLH) loads into one front-end issue, expanded by VX_uop_packld into N back-to-back LSU uops with eff_rs1 = rs1 + rs2 × uop_idx. Used heavily by sw/kernel/include/vx_tensor.h for TCU tile-row packing.
  • Wallace-tree + folded-radix multipliers (VX_wallace_mul, VX_fold_mul). New hw/rtl/libs/ multiplier blocks. Why: used by VX_fma_unit (mantissa multiply) and the TFR TCU backend (per-format integer / fp multiply); shared structural multiplier lets both blocks pick the area/latency trade-off via a single point of change.
  • Kogge-Stone parallel-prefix adder (VX_ks_adder). Logarithmic-depth carry-propagate adder under hw/rtl/libs/. Why: drives the long carry chains in VX_fma_unit (exponent/mantissa alignment + final add) and the TFR TCU's FEDP final accumulator without the ripple-carry timing penalty Verilator's default + emits.
  • Stream split/join primitives (VX_stream_dispatch, VX_stream_fork, VX_stream_join). New hw/rtl/libs/ modules consumed by VX_dcr_arb, VX_dxa_dispatch, and VX_gbar_arb. Why: replaces ad-hoc 1→N and N→1 ready/valid plumbing duplicated across the DCR, DXA, and global-barrier paths with one tested, parameterizable primitive; cuts ~3 copies of the same handshake state machine.
  • Inference-based integrated clock gating (VX_clockgate). Synthesizable ICG cell at hw/rtl/libs/VX_clockgate.sv; instantiated for per-core gating in VX_socket.sv:407. Why: gives the synthesis tools a single recognizable ICG inference pattern so ASIC flows (Synopsys/Yosys) generate proper latch-based gates instead of glitching AND-based gating; per-core gating is the first power-domain leverage point.
  • Compressed instruction set (RVC). New VX_decompressor block in the fetch stage (hw/rtl/core/VX_decompressor.sv + sim/simx/decompressor.cpp); gated by VX_CFG_EXT_C_ENABLE. v2.x shipped the test binaries but had no decompressor.
  • Asynchronous barriers with arrive / wait / expect_tx semantics. VX_bar_unit + vortex::barrier host API; expect_tx is the hook DXA multicast uses to declare expected bytes.
  • MMU / virtual memory (SV32, rv32-only). Host-shadow page table + DeviceMemIO refactor + --vm regression group.
  • GEM5 integration. VortexGPGPU SimObject + x86/aarch64 host runtimes; ci/regression.sh --gem5 + VORTEX_GEM5_ARM=1.
  • SimX v3 TLM architecture. Transaction-level memory packets (MemReq/MemRsp with shared_ptr<mem_block_t> payloads) and reusable TLM cache / switch / coalescer modules across the L1/L2/L3 hierarchy, tcache/ocache/rcache, DXA, and CP DMA paths.
  • ASIC synthesis flows. hw/syn/{synopsys,yosys}/ productized: shared hw/syn/common.mk, bundled NanGate_15nm_OCL.db standard cells, standardized OPT_LEVEL; legacy hw/syn/modelsim flow retired.
  • Synopsys multi-PDK support. hw/syn/synopsys/Makefile now exposes three target PDKs via LIB_TGT selection: ASAP7 (7nm), SAED14 (14nm SLVT), and NanGate 15nm OCL (default). Per-PDK SRAM mappings, pin polarity, and address packing handled in hw/syn/synopsys/project.tcl (SRAM_PINS + family dispatch). Why: lets users compare PPA on a real foundry-style 7nm/14nm flow without rewriting the synthesis scripts per technology node.
  • Yosys + OpenSTA end-to-end ASIC pipeline. hw/syn/yosys/run_synth.sh runs synthesis → tech-map → area (stat -liberty) → SRAM-cost estimation (sram_cost.py) → OpenSTA timing + power (run_sta.tcl, gated by RUN_STA=1). Makefile targets: synth, techmap, timing. Why: gives the open-source flow the same area/timing/power deliverables as the Synopsys flow — no commercial-tool license needed for first-pass PPA exploration.
  • SAIF switching-activity workflow for power analysis. ci/blackbox.sh --saif [--saif_file=...] captures gate-level activity from rtlsim runs; both the Synopsys (project.tcl:489-922, read_saif -auto_map_names + report_saif) and Xilinx (hw/syn/xilinx/dut/common.mk:31) power flows consume the resulting SAIF for vector-driven power estimation. Why: replaces vectorless power estimation with activity-annotated numbers tied to a real workload, dropping the order-of-magnitude error band that purely static analysis carries.
  • OpenSTA tool packaged with the toolchain. New sta() function in ci/toolchain_install.sh.in + ci/toolchain_prebuilt.sh.in; installed under $TOOLDIR/sta/ and consumed by both the Yosys flow (run_sta.tcl) and hw/syn/common.mk as $(STA). Why: makes OpenSTA a first-class peer of Verilator/sv2v/Yosys in the prebuilt toolchain so the ASIC flow's timing/power story works out-of-the-box from a fresh toolchain_install.sh --all.
  • Preemption groundwork. Synchronous RISC-V trap path + native riscv-tests support.
  • Toolchain refresh. LLVM 20, POCL 7.0, chipStar.
  • Kernel-entry calling convention (no callee-saved spills). vortex.kernel entries (__kernel macro, vx_spawn2.h) skip s0–s11/fs0–fs11 spills via an empty LLVM-20 callee-saved set (ra still saved); the KMU/__vx_cta_entry trampoline runs each once then vx_tmc zero. Why: generic-ABI per-CTA spills thrash the 16 KB L1 (sgemm2 +27% cycles).
  • Versioned toolchain pipeline for CI. VERSION is the single source of truth (VORTEX_VERSION, TOOLCHAIN_REV, GEM5_REV); CI cache keys + installer scripts both honour it; bumping a pin rolls the CI cache.
  • Perfetto trace integration. ci/perfetto.py renders RTL and SimX traces into Chrome Trace JSON (auto-detects flavour); see docs/perfetto_analysis.md.
  • AI-agent integration via AGENTS.md. Canonical entry point for AI agents and human contributors — foundation rules, documentation map, build/test/design invariants.
  • Test groups. New --vulkan, --gem5, --hip, --amo, --tensor_sp, --tensor_wg, --dtm, --mpi, --rvc, --vm, --graphics runners in ci/regression.sh; matrix expanded to rv32 + rv64.
  • CONTRIBUTING.md and this changelog at the repo root.
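
The 2:4 structured-sparsity entry above means that at most two of every four consecutive values are non-zero, so a row can be stored as the kept values plus per-group position metadata. The sketch below illustrates that idea only. The function names, the tie-breaking rule, and the index-pair metadata encoding are assumptions made for illustration; they are not the signatures of the host compress_2to4_matrix / prune_2to4_matrix helpers or the VX_tcu_meta hardware encoding.

```python
from typing import List, Tuple


def prune_2to4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = list(row)
    for base in range(0, len(row), 4):
        group = row[base:base + 4]
        # Two largest magnitudes; the stable sort resolves ties to the lower index.
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        for i in range(4):
            if i not in keep:
                out[base + i] = 0.0
    return out


def compress_2to4(row: List[float]) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Pack a 2:4-sparse row into two values per group plus their positions."""
    values: List[float] = []
    meta: List[Tuple[int, int]] = []
    for base in range(0, len(row), 4):
        group = row[base:base + 4]
        nz = [i for i in range(4) if group[i] != 0.0]
        if len(nz) > 2:
            raise ValueError("group is not 2:4 sparse")
        # Pad with unused (zero) positions so every group stores exactly two slots.
        pad = [i for i in range(4) if i not in nz]
        slots = sorted(nz + pad[:2 - len(nz)])
        values.extend(group[i] for i in slots)
        meta.append((slots[0], slots[1]))
    return values, meta


def expand_2to4(values: List[float], meta: List[Tuple[int, int]]) -> List[float]:
    """Inverse of compress_2to4: rebuild the dense row from values + metadata."""
    row: List[float] = []
    for g, (i0, i1) in enumerate(meta):
        group = [0.0] * 4
        group[i0] = values[2 * g]
        group[i1] = values[2 * g + 1]
        row.extend(group)
    return row
```

Compression halves the value storage for the sparse operand; the metadata is what lets a consumer pick the matching elements of the dense operand.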
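
The pack-load entry above gives the per-uop address rule eff_rs1 = rs1 + rs2 × uop_idx, with four uops for PACKLB and two for PACKLH. A minimal sketch of that address expansion follows. The function name and the wrap-around at XLEN bits are assumptions for illustration, not taken from VX_uop_packld.

```python
# Uop counts stated in the changelog entry: 4 x byte, 2 x halfword.
PACKLB_UOPS = 4
PACKLH_UOPS = 2


def packld_addresses(rs1: int, rs2: int, num_uops: int, xlen: int = 32) -> list:
    """Effective address of each LSU uop: rs1 + rs2 * uop_idx.

    Wrapping the result to xlen bits is an assumption of this sketch.
    """
    mask = (1 << xlen) - 1
    return [(rs1 + rs2 * uop_idx) & mask for uop_idx in range(num_uops)]
```

For example, packld_addresses(0x1000, 64, PACKLB_UOPS) walks four byte loads that are 64 bytes apart, which is the access pattern of reading one column element from four consecutive tile rows.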

Changed

  • TCU register-file bank-conflict-free mapping. TCU micro-op generation in TcuUopGen (and the SimX sim/simx/tcu/tcu_unit.cpp:1332 bank-conflict-free formulas) permutes A / B / C operand offsets so every uop's three RF reads land in different GPR banks; separate formula classes cover sparse, dense NT∈{4,16,64}, and dense NT∈{8,32}. Why: drops issue-stage stall cycles to zero on the TCU MMA loop — v2.x used a naive (step % sub_blocks) * block_size offset that incurred bank collisions on every other uop.
  • Profiling counters are host-driven via DCR (device-side dump removed). Kernel startup (vx_start.S) no longer dumps perf counters at exit; the host reads them on demand via vx_mpm_query → vx_dcr_read, gated by VORTEX_PROFILING. Why: drops perf-dump code from every kernel binary.
  • Source-tree consolidation. kernel/ + runtime/ → sw/{kernel,runtime,common}; new sw/common/ holds code shared by device and host (rvfloats, softfloat_ext, mem_alloc, plus the h/w-internal gfx_render.{h,cpp} host hardware model used only by simx). Why: removes include-path duplication and mirrors the sim/{simx,common,...} layout. Note: ABI types and downstream-visible config (tensor_cfg.h, the graphics on-wire types) live in sw/kernel/include/ so they ship through the install tree; sw/common/ is vortex-internal and never installed.
  • Downstream tools consume vortex via $VORTEX_PATH + pkg-config. mesa-vortex, pocl-vortex and chipStar no longer reach into the vortex source tree ($VORTEX_HOME) or build tree ($VORTEX_BUILD_DIR) for headers or libraries. They link against the install tree produced by make install through vortex-runtime.pc / vortex-kernel.pc. Mesa's vortex-runtime option is now vortex-path; pocl's -DVORTEX_PREFIX= / -DVORTEX_BUILD_DIR= are replaced by pkg_check_modules(VORTEX REQUIRED vortex-runtime) + -DVORTEX_PATH_{32,64} for per-XLEN device-side bitcode builds.
  • gfxutil.{h,cpp} removed. Binning() moved into graphics.h + graphics.cpp (now part of libvortex.so); the toVX{Format,Compare,StencilOp,BlendFunc} helpers (CGLTrace-specific test-input translators) inlined into tests/regression/gfx_draw3d/main.cpp, the only consumer; ResolveFilePath (test-asset filesystem resolver) inlined per-test, matching the resolve_path pattern raycast already used. Why: gfxutil.h was leaking cocogfx (CGLTrace, ePixelFormat) into what should have been vortex's public surface; eliminating it lets graphics.h be self-contained as the install tree requires.
  • riscv-tests .bin blobs removed from the source tree. Pre-built tests/riscv/isa/*.bin (c56562fe) and tests/riscv/benchmarks_{32,64}/*.bin (cd1656cf) collapsed into a single on-demand build under tests/riscv/common.mk that clones upstream riscv-tests at a pinned commit and builds per-XLEN behind a stamp file. Why: drops binary blobs from version control and pins behaviour to a single upstream commit instead of stale checked-in artifacts.
  • Configuration moved to TOML. VX_config.toml + VX_types.toml replace hw/rtl/VX_config.vh; ci/gen_config.py emits per-target headers (build/hw/VX_config.vh, build/sw/VX_config.h) and -D overrides from one source, with expr: / [[enum]] / [[builtin]] / [[param]] semantics. Why: gives the config typed scalars / cross-key expressions / typed enums the flat `define block could not express, and decouples SimX/runtime from the RTL source tree (no more -I$(ROOT_DIR)/hw).
  • Configuration namespace. All HW config macros now carry the VX_CFG_ prefix; HW/SW layering split: VX_config.h is HW/sim-private. Why: resolves preprocessor collisions with LLVM/Clang and Verilator/SV `define macros, and stops HW config from leaking into kernel-side toolchain invocations.
  • No more global toolchain_env.sh. Each build/ carries its own resolved tool paths. Why: enables parallel multi-version Vortex worktrees on the same shell (the old source ci/toolchain_env.sh hijacked $PATH).
  • Build system. Tool-path env vars unified on the _PATH suffix; shared hw/syn/common.mk; standardized OPT_LEVEL across synthesis backends; LLVM default target fixed; dead hw/config call dropped. Why: normalizes the build/synthesis surface so cross-backend changes touch one place.
  • Verilator pinned to 5.028 (was tracking the latest 5.046 release in the prebuilt). Why: 5.046 removed the --xml-only flag that sim/opaesim/Makefile needs for --scope, tightened DEFOVERRIDE into a hard error breaking DEBUG=3 + -DGPR_RESET debug builds, and rejected cvfpu's C_PC / C_EXP_* macros under DPI_DISABLE + FPU_FPNEW (--config2). 5.028 keeps all three working (commit f00bb142).
  • SimObject channels — explicit fixed-size handshaking. Unbounded std::queue + blocking push/pop replaced by fixed-capacity channels + non-blocking try_send / try_pop (RTL ready/valid analog); [[nodiscard]] forces producers to handle backpressure at the call site. Why: makes SimX model true RTL backpressure (1:1 with ready/valid) so buffering bugs surface in C++ instead of only at RTL bring-up.
  • Scoreboard issue arbiter — GTO (Greedy-Then-Oldest). VX_scoreboard now drives a dedicated VX_gto_arbiter with a suppress mask for warps whose target functional unit is full (they keep aging but are skipped for selection until the FU drains). v2.x used VX_stream_arb (round-robin). Why: GTO matches mainstream GPU warp-issue policy — finish the currently-running warp before switching — yielding better ILP/cache locality than naive RR.
  • Scoreboard — XREGS dependency tracking + per-FU lock for multi-uop macros. Scoreboard gains a second plane (inuse_xregs) tracking FPU special registers (fflags, frm); every instruction declares rd_xregs / wr_xregs masks in decode (VX_decode.sv:377-380). New fu_lock / fu_unlock instruction flags let a uop sequencer hold a functional unit across a multi-uop macro — set by VX_tcu_uops.sv:422 for WGMMA (fu_lock=first_uop, fu_unlock=last_uop) — and the scoreboard refuses to issue any other warp's uop to the locked FU until it retires. Why: eliminates v2.x's coarse FPU-boundary stall for fflags/frm WAW/RAW hazards, and keeps multi-uop macros atomic on one FU instance — without fu_lock, interleaving warps would corrupt shared FU state (WGMMA's per-block A registers + accumulator).
  • Warp scheduler — per-warp ibuffer-capacity gate. VX_scheduler tracks per-warp ibuf_full and computes schedule_warps = ready_warps & ~ibuf_full, with an all_ibuf_full ? ready_warps : preferred_warps fallback to keep pipelines absorbing transient stalls. v2.x scheduled solely on active_warps & ~stalled_warps. Why: prevents the scheduler from issuing a warp whose decoded uops will only block on a full ibuffer downstream — wasting fetch/decode bandwidth and pushing back-pressure into the front end.
  • __syncthreads() (BAR opcode) now drains LSU before suspending warps so SMEM writes commit before any post-barrier reads. Gated on lsu_sched_drained in VX_wctl_unit.sv and lsu_drained() in sim/simx/sfu_unit.cpp.
  • BUG FIX: SimX cache-flush parity with RTL VX_dcr_flush. ProcessorImpl::flush_caches() (sim/simx/processor.cpp) now fans out to icache + dcache + {tcache, rcache, ocache} L1 surfaces in parallel (was dcache-only), matching the RTL shared-req/AND-of-done topology in VX_core.sv + VX_graphics.sv.
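
The SimObject channels entry above replaces unbounded queues and blocking push/pop with fixed-capacity channels and non-blocking try_send / try_pop. The sketch below shows that contract in Python rather than the C++ used by SimX. The class name, constructor, and the (ok, item) return shape of try_pop are assumptions for illustration; only the try_send / try_pop names come from the entry.

```python
from collections import deque


class Channel:
    """Fixed-capacity FIFO with non-blocking, ready/valid-style operations."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = deque()

    def try_send(self, item) -> bool:
        """Enqueue if there is room; False means the consumer is not ready."""
        if len(self._items) >= self._capacity:
            return False
        self._items.append(item)
        return True

    def try_pop(self):
        """Return (True, item) if data is available, else (False, None)."""
        if not self._items:
            return False, None
        return True, self._items.popleft()
```

A producer that gets False from try_send has to hold the item and retry on a later cycle, which mirrors an RTL sender keeping valid asserted until ready is seen. In C++ the same obligation is what the [[nodiscard]] attribute enforces at the call site.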
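
The scoreboard entry above describes a Greedy-Then-Oldest issue arbiter with a suppress mask: suppressed warps keep aging but are skipped for selection. A behavioural sketch of that policy follows. It is not derived from VX_gto_arbiter; the class, the age counters (cycles a ready warp has waited since its last grant), and the lowest-index tie-break are all assumptions for illustration.

```python
class GtoArbiter:
    """Greedy-Then-Oldest selection over per-warp ready/suppress bitmasks."""

    def __init__(self, num_warps: int):
        self.age = [0] * num_warps  # cycles waited while ready (assumed metric)
        self.last = None            # warp granted most recently

    def select(self, ready_mask: int, suppress_mask: int = 0):
        eligible = ready_mask & ~suppress_mask
        n = len(self.age)
        if self.last is not None and (eligible >> self.last) & 1:
            # Greedy: stay on the warp that issued last while it can still issue.
            winner = self.last
        else:
            candidates = [w for w in range(n) if (eligible >> w) & 1]
            # Oldest: highest age wins; ties go to the lowest warp index.
            winner = max(candidates, key=lambda w: (self.age[w], -w)) if candidates else None
        for w in range(n):
            # Every ready warp that did not win ages, including suppressed ones.
            if (ready_mask >> w) & 1 and w != winner:
                self.age[w] += 1
        if winner is not None:
            self.age[winner] = 0
            self.last = winner
        return winner
```

Because suppressed warps continue to age, a warp blocked on a full functional unit becomes the oldest candidate and is picked promptly once the unit drains, instead of restarting at the back of the order.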

Known limitations

  • --vm is SimX-only. Host-shadow PT is modelled in SimX C++ memory; rtlsim's DeviceMemIO is not yet wired through the same shim.

[2.3] — 2026-05-11

Last v2.x maintenance release. See git log v2.2..v2.3.

Earlier releases

Tags v0.2.0 through v2.3 predate this changelog. Use git log and the GitHub releases page for history.