Changelog

All notable changes to Vortex are documented here. The format is based on Keep a Changelog, and the project follows the version pins recorded in VERSION (VORTEX_VERSION, TOOLCHAIN_REV, GEM5_REV).

[3.0] — 2026-06-08

The 3.0 release introduces a fixed-function graphics stack (rasterizer, texture units, and output mergers), tensor core structured sparsity (2:4), warpgroup-level matrix multiplication (WGMMA), global-to-local data transfer acceleration (DXA), a new hardware kernel scheduler (KMU) and Command Processor (CP) architecture, a new asynchronous runtime API (vortex2.h), asynchronous barriers with arrive/wait/event semantics, compressed instruction set (RVC) support, hardware atomics, an MMU/SV32 virtual memory stack, a Mesa/lavapipe Vulkan backend (vortexpipe), HIP via chipStar, gem5 integration, a SimX v3 TLM architecture with fixed-size handshake channels, productized Synopsys and Yosys ASIC synthesis flows, and a refreshed toolchain (LLVM 20, POCL 7.0).

Build and configuration infrastructure was reworked: TOML-driven HW configuration (VX_config.toml + VX_types.toml) decoupling SimX/runtime from the RTL source tree, a VX_CFG_ macro namespace that resolves toolchain preprocessor collisions, retirement of the global toolchain_env.sh to enable parallel multi-version Vortex worktrees on the same shell, consolidation of kernel/ and runtime/ under a shared sw/ root, a single-source VERSION file driving CI toolchain pinning, Perfetto trace export (ci/perfetto.py), and new top-level AGENTS.md + CONTRIBUTING.md for AI-agent and contributor workflows.

Added

TCU tfr arithmetic backend. New in-house, fully-synthesizable fused dot-product running integer and floating-point through one shared 4-cycle pipeline; gated by VX_CFG_TCU_TYPE_TFR. Adds FP8 (e4m3), BF8 (e5m2), and TF32 on top of the v2.x set (fp32 / fp16 / bf16 / i32 / i8 / u8 / i4 / u4). Each is gated by its own VX_CFG_TCU_{FP8,BF16,TF32}_ENABLE; format dispatch is unified across all four FEDP backends.
Tensor-core structured sparsity (2:4). VX_tcu_sp_mux + VX_tcu_meta datapath plus host compress_2to4_matrix / prune_2to4_matrix helpers; gated by VX_CFG_TCU_SPARSE_ENABLE.
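For readers new to the 2:4 scheme, the sketch below shows one common reading of it: in every group of four values, the two largest magnitudes are kept and the other two are zeroed, and compression stores only the kept values plus their positions. The names prune_2to4, compress_2to4, top2_of4, and Compressed2to4 are hypothetical, and both the magnitude-based selection rule and the index layout are assumptions; the real compress_2to4_matrix / prune_2to4_matrix helpers and the VX_tcu_meta encoding may differ.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration; not the Vortex compress_2to4_matrix /
// prune_2to4_matrix helpers, whose signatures are not shown in this changelog.

// Positions (ascending) of the two largest-magnitude values in a group of four.
inline std::array<std::size_t, 2> top2_of4(const float* g) {
  std::size_t a = 0;
  for (std::size_t i = 1; i < 4; ++i)
    if (std::fabs(g[i]) > std::fabs(g[a])) a = i;
  std::size_t b = (a == 0) ? 1 : 0;
  for (std::size_t i = 0; i < 4; ++i)
    if (i != a && std::fabs(g[i]) > std::fabs(g[b])) b = i;
  if (a < b) return {a, b};
  return {b, a};
}

// Zero the two smallest-magnitude values in every group of four, in place.
inline void prune_2to4(std::vector<float>& row) {
  assert(row.size() % 4 == 0);
  for (std::size_t base = 0; base < row.size(); base += 4) {
    const auto keep = top2_of4(&row[base]);
    for (std::size_t i = 0; i < 4; ++i)
      if (i != keep[0] && i != keep[1]) row[base + i] = 0.0f;
  }
}

// Assumed compressed form: half the values plus one position per kept value.
struct Compressed2to4 {
  std::vector<float> values;          // two kept values per group
  std::vector<std::uint8_t> indices;  // their positions (0..3) in the group
};

inline Compressed2to4 compress_2to4(const std::vector<float>& row) {
  assert(row.size() % 4 == 0);
  Compressed2to4 out;
  for (std::size_t base = 0; base < row.size(); base += 4) {
    for (std::size_t pos : top2_of4(&row[base])) {
      out.values.push_back(row[base + pos]);
      out.indices.push_back(static_cast<std::uint8_t>(pos));
    }
  }
  return out;
}
```

Under these assumptions a row of N values compresses to N/2 values and N/2 two-bit positions, which is the storage saving the 2:4 format is built around.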
Warpgroup-level MMA (WGMMA). Per-warp NRA=4 / variable-NRC fragment layout, S/R source modes, smem descriptor path; gated by VX_CFG_TCU_WGMMA_ENABLE.
Data-transfer Acceleration (DXA). Async global→local DMA engine for tile staging (hw/rtl/dxa/ + sim/simx/dxa/).
Hardware Kernel Management Unit (KMU). New scheduler block (hw/rtl/VX_kmu.sv + sim/simx/kmu/) that owns CTA dispatch from the CP launch path.
Launch-level CTA clustering. vx_launch_info_t::cluster_dim[3] (sw/runtime/include/vortex2.h:202) guarantees every K-CTA group is co-resident on one core; KMU iterates intra-cluster offsets first (sim/simx/kmu/kmu.cpp:63) and the CTA dispatcher reserves K contiguous LMEM slots so DXA Path A multicast can target issuer + r*stride.
Command Processor (CP) v3. New hw/rtl/cp/ block + host-resident command ring (CMD_LAUNCH, CMD_MEM_*, CMD_DCR_*, CMD_CACHE_FLUSH, CMD_EVENT_*); integrated end-to-end across xrt, opae, simx, rtlsim behind VORTEX_USE_CP.
Asynchronous vortex2.h runtime API. Queues, events, modules, kernels, UVA raw-pointer kernel args, per-queue worker thread; legacy vortex.h retained as a thin wrapper.
C++ software CP model (sim/common/cmd_processor.cpp) shared by simx and rtlsim.
Graphics stack (RASTER / TEX / OM). Fixed-function 3D pipeline: hw/rtl/{raster,tex,om}/ + VX_graphics.sv + matching SimX models; --graphics regression group.
Public host-side graphics API. sw/runtime/include/graphics.h exposes vortex::graphics::Binning() (triangle setup + tile binning producing the on-wire rast_prim_t stream the RASTER unit reads) plus self-contained vertex_t / primitive_t input types and DCR address helpers for external Vulkan/HIP/OpenGL drivers.
Canonical on-wire graphics ABI in sw/kernel/include/vx_graphics.h. Templated POD vortex::graphics::fixed_t<F> (Q15.16 / Q?.24 with full arithmetic, all members public + trivially copyable) plus the on-wire structs (vec3e_t, rast_prim_t, rast_attribs_t, rast_tile_header_t, etc.) and 8888 pixel helpers.
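As a rough illustration of a templated fixed-point POD with public members, the sketch below stores a raw 32-bit integer with F fractional bits. FixedSketch is a hypothetical name, and the conversion and rounding choices here are assumptions; this is not the definition of vortex::graphics::fixed_t<F> in vx_graphics.h.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch of a fixed-point POD with F fractional bits.
// Not the vortex::graphics::fixed_t<F> definition.
template <unsigned F>
struct FixedSketch {
  std::int32_t raw;  // public member keeps the type an aggregate

  // Conversion truncates toward zero (an assumption; real code may round).
  static FixedSketch from_float(float v) {
    return FixedSketch{static_cast<std::int32_t>(v * (1 << F))};
  }
  float to_float() const { return static_cast<float>(raw) / (1 << F); }
};

template <unsigned F>
FixedSketch<F> operator+(FixedSketch<F> a, FixedSketch<F> b) {
  return {a.raw + b.raw};
}

// Multiply in 64 bits, then shift back down by F to restore the scale.
// Right-shifting a negative value is arithmetic on mainstream compilers.
template <unsigned F>
FixedSketch<F> operator*(FixedSketch<F> a, FixedSketch<F> b) {
  const std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
  return {static_cast<std::int32_t>(wide >> F)};
}

// A type like this can be copied byte-for-byte into an on-wire struct.
static_assert(std::is_trivially_copyable<FixedSketch<16>>::value,
              "fixed-point sketch must stay trivially copyable");
```

With F = 16 the value 1.5 is stored as raw 98304, and multiplying by 2.0 gives raw 196608, which converts back to 3.0.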
Vortex SDK install layout. make install produces $VORTEX_PATH/{kernel,runtime}/{include,lib<XLEN>} plus lib/pkgconfig/vortex-{runtime,kernel}.pc (auto-generated from sw/runtime/vortex-runtime.pc.in and sw/runtime/vortex-kernel.pc.in at configure time). Default prefix is <build>/install. Why: gives downstream tools (mesa-vortex, pocl-vortex, chipStar) a single $VORTEX_PATH env var + pkg-config integration shape.
Vulkan support via a new Mesa Gallium driver vortexpipe selected through the lavapipe ICD; tests/vulkan/ suite (compute, draw3d, depth, textured, raytrace); Mesa shipped via the prebuilt toolchain; rv64 path enabled.
HIP support on rv32 + rv64 via chipStar. chipStar's hipcc now accepts --offload-pointer-width={32,64} and emits SPIR-V with the matching OpMemoryModel Physical{32,64}; a single libCHIP.so ships both widths' rtdevlib modules and selects at runtime via CL_DEVICE_ADDRESS_BITS. POCL on rv32 Vortex accepts the resulting Physical32 SPIR-V cleanly. See docs/designs/hip_on_vortex_chipstar.md.
Hardware atomics. RISC-V A-extension (LR/SC reservation table + cache-resident AMO* RMW), gated by VX_CFG_EXT_A_ENABLE. AMOs complete at the LLC while non-LLC banks invalidate the line on passthrough, so atomics are correct across the full L1/L2/L3 cache hierarchy.
In-house IEEE-754 FPU (VX_fpu_std), now F32 and F64. Fully RV-compliant scalar FPU built from Vortex-owned blocks, covering both single (F) and double (D) precision natively — VX_fma_unit (separate F32/F64 fused multiply-add cores, NVIDIA-style), merged-format VX_fdivsqrt_unit (radix-2 non-restoring FDIV + FSQRT; one carry-save datapath sized for the widest format — 17-cycle F32-only, 32-cycle when D is enabled), merged-format VX_fcvt_unit (I2F/F2I/F2F incl. FCVT.S.D subnormal/overflow), and merged VX_fncp_unit (sign-inject/min-max/compare/class/move with F32 NaN-box checking) — selected via VX_CFG_FPU_TYPE_STD. Validated against the full rv64u[fd]-p riscv-tests ISA suite on the RTL FPU. Why: removes the FPNEW dependency entirely — FPNEW is no longer required for D support on any flow (ASIC/Yosys/Synopsys and FPGA), only optionally selectable; the native units deliver higher fmax, lower latency, and smaller area, and the in-tree source unblocks block-level tuning that vendoring made impractical.
RISC-V Zicond (conditional ops). CZERO.EQZ / CZERO.NEZ integrated end-to-end (decode in VX_decode.sv, ALU in VX_alu_int.sv); gated by VX_CFG_EXT_ZICOND_ENABLE. Adds an ISA-level branchless-select primitive used by LLVM 20's codegen.
Pack-load intrinsics (vx_packlb_f / vx_packlh_f). Single-instruction strided loads that fold 4×byte (PACKLB) or 2×halfword (PACKLH) loads into one front-end issue, expanded by VX_uop_packld into N back-to-back LSU uops with eff_rs1 = rs1 + rs2 × uop_idx. Used heavily by sw/kernel/include/vx_tensor.h for TCU tile-row packing.
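The address expansion in the pack-load entry can be sketched as a short loop. packld_addresses is a hypothetical helper, and 32-bit addresses are assumed; only the formula eff_rs1 = rs1 + rs2 × uop_idx and the uop counts (4 for PACKLB, 2 for PACKLH) come from the entry. How the loaded elements are packed into the destination register is not shown here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: effective address of each LSU uop for one pack-load.
// n_uops is 4 for PACKLB (bytes) and 2 for PACKLH (halfwords).
// Unsigned wraparound means a negative stride in rs2 also works as expected.
inline std::vector<std::uint32_t> packld_addresses(std::uint32_t rs1,
                                                   std::uint32_t rs2,
                                                   unsigned n_uops) {
  std::vector<std::uint32_t> addrs;
  for (unsigned uop_idx = 0; uop_idx < n_uops; ++uop_idx)
    addrs.push_back(rs1 + rs2 * uop_idx);  // eff_rs1 = rs1 + rs2 * uop_idx
  return addrs;
}
```

For example, a PACKLB with rs1 = 0x1000 and a stride of 64 in rs2 would touch 0x1000, 0x1040, 0x1080, and 0x10C0.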
Wallace-tree + folded-radix multipliers (VX_wallace_mul, VX_fold_mul). New hw/rtl/libs/ multiplier blocks. Why: used by VX_fma_unit (mantissa multiply) and the TFR TCU backend (per-format integer / fp multiply); shared structural multiplier lets both blocks pick the area/latency trade-off via a single point of change.
Kogge-Stone parallel-prefix adder (VX_ks_adder). Logarithmic-depth carry-propagate adder under hw/rtl/libs/. Why: drives the long carry chains in VX_fma_unit (exponent/mantissa alignment + final add) and the TFR TCU's FEDP final accumulator without the ripple-carry timing penalty that the default + operator incurs.
Stream split/join primitives (VX_stream_dispatch, VX_stream_fork, VX_stream_join). New hw/rtl/libs/ modules consumed by VX_dcr_arb, VX_dxa_dispatch, and VX_gbar_arb. Why: replaces ad-hoc 1→N and N→1 ready/valid plumbing duplicated across the DCR, DXA, and global-barrier paths with one tested, parameterizable primitive; cuts ~3 copies of the same handshake state machine.
Inference-based integrated clock gating (VX_clockgate). Synthesizable ICG cell at hw/rtl/libs/VX_clockgate.sv; instantiated for per-core gating in VX_socket.sv:407. Why: gives the synthesis tools a single recognizable ICG inference pattern so ASIC flows (Synopsys/Yosys) generate proper latch-based gates instead of glitching AND-based gating; per-core gating is the first power-domain leverage point.
Compressed instruction set (RVC). New VX_decompressor block in the fetch stage (hw/rtl/core/VX_decompressor.sv + sim/simx/decompressor.cpp); gated by VX_CFG_EXT_C_ENABLE. v2.x shipped the test binaries but had no decompressor.
Asynchronous barriers with arrive / wait / expect_tx semantics. VX_bar_unit + vortex::barrier host API; expect_tx is the hook DXA multicast uses to declare expected bytes.
MMU / virtual memory (SV32, rv32-only). Host-shadow page table + DeviceMemIO refactor + --vm regression group.
GEM5 integration. VortexGPGPU SimObject + x86/aarch64 host runtimes; ci/regression.sh --gem5 + VORTEX_GEM5_ARM=1.
SimX v3 TLM architecture. Transaction-level memory packets (MemReq/MemRsp with shared_ptr<mem_block_t> payloads) and reusable TLM cache / switch / coalescer modules across the L1/L2/L3 hierarchy, tcache/ocache/rcache, DXA, and CP DMA paths.
ASIC synthesis flows. hw/syn/{synopsys,yosys}/ productized: shared hw/syn/common.mk, bundled NanGate_15nm_OCL.db standard cells, standardized OPT_LEVEL; legacy hw/syn/modelsim flow retired.
Synopsys multi-PDK support. hw/syn/synopsys/Makefile now exposes three target PDKs via LIB_TGT selection: ASAP7 (7nm), SAED14 (14nm SLVT), and NanGate 15nm OCL (default). Per-PDK SRAM mappings, pin polarity, and address packing handled in hw/syn/synopsys/project.tcl (SRAM_PINS + family dispatch). Why: lets users compare PPA on a real foundry-style 7nm/14nm flow without rewriting the synthesis scripts per technology node.
Yosys + OpenSTA end-to-end ASIC pipeline. hw/syn/yosys/run_synth.sh runs synthesis → tech-map → area (stat -liberty) → SRAM-cost estimation (sram_cost.py) → OpenSTA timing + power (run_sta.tcl, gated by RUN_STA=1). Makefile targets: synth, techmap, timing. Why: gives the open-source flow the same area/timing/power deliverables as the Synopsys flow — no commercial-tool license needed for first-pass PPA exploration.
SAIF switching-activity workflow for power analysis. ci/blackbox.sh --saif [--saif_file=...] captures gate-level activity from rtlsim runs; both the Synopsys (project.tcl:489-922 — read_saif -auto_map_names + report_saif) and Xilinx (hw/syn/xilinx/dut/common.mk:31) power flows consume the resulting SAIF for vector-driven power estimation. Why: replaces vectorless power estimation with activity-annotated numbers tied to a real workload, dropping the order-of-magnitude error band that purely static analysis carries.
OpenSTA tool packaged with the toolchain. New sta() function in ci/toolchain_install.sh.in + ci/toolchain_prebuilt.sh.in; installed under $TOOLDIR/sta/ and consumed by both the Yosys flow (run_sta.tcl) and hw/syn/common.mk as $(STA). Why: makes OpenSTA a first-class peer of Verilator/sv2v/Yosys in the prebuilt toolchain so the ASIC flow's timing/power story works out-of-the-box from a fresh toolchain_install.sh --all.
Preemption groundwork. Synchronous RISC-V trap path + native riscv-tests support.
Toolchain refresh. LLVM 20, POCL 7.0, chipStar.
Kernel-entry calling convention (no callee-saved spills). vortex.kernel entries (__kernel macro, vx_spawn2.h) skip s0–s11/fs0–fs11 spills via an empty LLVM-20 callee-saved set (ra still saved); the KMU/__vx_cta_entry trampoline runs each entry once and then issues vx_tmc zero. Why: generic-ABI per-CTA spills thrash the 16 KB L1 (sgemm2 +27% cycles).
Versioned toolchain pipeline for CI. VERSION is the single source of truth (VORTEX_VERSION, TOOLCHAIN_REV, GEM5_REV); CI cache keys + installer scripts both honour it; bumping a pin rolls the CI cache.
Perfetto trace integration. ci/perfetto.py renders RTL and SimX traces into Chrome Trace JSON (auto-detects flavour); see docs/perfetto_analysis.md.
AI-agent integration via AGENTS.md. Canonical entry point for AI agents and human contributors — foundation rules, documentation map, build/test/design invariants.
Test groups. New --vulkan, --gem5, --hip, --amo, --tensor_sp, --tensor_wg, --dtm, --mpi, --rvc, --vm, --graphics runners in ci/regression.sh; matrix expanded to rv32 + rv64.
CONTRIBUTING.md and this changelog at the repo root.

Changed

TCU register-file bank-conflict-free mapping. TCU micro-op generation in TcuUopGen (and the SimX sim/simx/tcu/tcu_unit.cpp:1332 bank-conflict-free formulas) permutes A / B / C operand offsets so every uop's three RF reads land in different GPR banks; separate formula classes cover sparse, dense NT∈{4,16,64}, and dense NT∈{8,32}. Why: drops issue-stage stall cycles to zero on the TCU MMA loop — v2.x used a naive (step % sub_blocks) * block_size offset that incurred bank collisions on every other uop.
Profiling counters are host-driven via DCR (device-side dump removed). Kernel startup (vx_start.S) no longer dumps perf counters at exit; the host reads them on demand via vx_mpm_query → vx_dcr_read, gated by VORTEX_PROFILING. Why: drops perf-dump code from every kernel binary.
Source-tree consolidation. kernel/ + runtime/ → sw/{kernel,runtime,common}; new sw/common/ holds code shared by device and host (rvfloats, softfloat_ext, mem_alloc, plus the h/w-internal gfx_render.{h,cpp} host hardware model used only by simx). Why: removes include-path duplication and mirrors the sim/{simx,common,...} layout. Note: ABI types and downstream-visible config (tensor_cfg.h, the graphics on-wire types) live in sw/kernel/include/ so they ship through the install tree; sw/common/ is vortex-internal and never installed.
Downstream tools consume vortex via $VORTEX_PATH + pkg-config. mesa-vortex, pocl-vortex and chipStar no longer reach into the vortex source tree ($VORTEX_HOME) or build tree ($VORTEX_BUILD_DIR) for headers or libraries. They link against the install tree produced by make install through vortex-runtime.pc / vortex-kernel.pc. Mesa's vortex-runtime option is now vortex-path; pocl's -DVORTEX_PREFIX= / -DVORTEX_BUILD_DIR= are replaced by pkg_check_modules(VORTEX REQUIRED vortex-runtime) + -DVORTEX_PATH_{32,64} for per-XLEN device-side bitcode builds.
gfxutil.{h,cpp} removed. Binning() moved into graphics.h + graphics.cpp (now part of libvortex.so); the toVX{Format,Compare,StencilOp,BlendFunc} helpers (CGLTrace-specific test-input translators) inlined into tests/regression/gfx_draw3d/main.cpp, the only consumer; ResolveFilePath (test-asset filesystem resolver) inlined per-test, matching the resolve_path pattern raycast already used. Why: gfxutil.h was leaking cocogfx (CGLTrace, ePixelFormat) into what should have been vortex's public surface; eliminating it lets graphics.h be self-contained as the install tree requires.
riscv-tests .bin blobs removed from the source tree. Pre-built tests/riscv/isa/*.bin (c56562fe) and tests/riscv/benchmarks_{32,64}/*.bin (cd1656cf) collapsed into a single on-demand build under tests/riscv/common.mk that clones upstream riscv-tests at a pinned commit and builds per-XLEN behind a stamp file. Why: drops binary blobs from version control and pins behaviour to a single upstream commit instead of stale checked-in artifacts.
Configuration moved to TOML. VX_config.toml + VX_types.toml replace hw/rtl/VX_config.vh; ci/gen_config.py emits per-target headers (build/hw/VX_config.vh, build/sw/VX_config.h) and -D overrides from one source, with expr: / [[enum]] / [[builtin]] / [[param]] semantics. Why: gives the config typed scalars / cross-key expressions / typed enums the flat `define block could not express, and decouples SimX/runtime from the RTL source tree (no more -I$(ROOT_DIR)/hw).
Configuration namespace. All HW config macros now carry the VX_CFG_ prefix; HW/SW layering split: VX_config.h is HW/sim-private. Why: resolves preprocessor collisions with LLVM/Clang and Verilator/SV `defines, and stops HW config from leaking into kernel-side toolchain invocations.
No more global toolchain_env.sh. Each build/ carries its own resolved tool paths. Why: enables parallel multi-version Vortex worktrees on the same shell (the old source ci/toolchain_env.sh hijacked $PATH).
Build system. Tool-path env vars unified on the _PATH suffix; shared hw/syn/common.mk; standardized OPT_LEVEL across synthesis backends; LLVM default target fixed; dead hw/config call dropped. Why: normalizes the build/synthesis surface so cross-backend changes touch one place.
Verilator pinned to 5.028 (was tracking the latest 5.046 release in the prebuilt). Why: 5.046 removed the --xml-only flag that sim/opaesim/Makefile needs for --scope, tightened DEFOVERRIDE into a hard error breaking DEBUG=3 + -DGPR_RESET debug builds, and rejected cvfpu's C_PC / C_EXP_* macros under DPI_DISABLE + FPU_FPNEW (--config2). 5.028 keeps all three working (commit f00bb142).
SimObject channels — explicit fixed-size handshaking. Unbounded std::queue + blocking push/pop replaced by fixed-capacity channels + non-blocking try_send / try_pop (RTL ready/valid analog); [[nodiscard]] forces producers to handle backpressure at the call site. Why: makes SimX model true RTL backpressure (1:1 with ready/valid) so buffering bugs surface in C++ instead of only at RTL bring-up.
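The channel change above can be pictured with a small bounded FIFO whose operations never block. This is a minimal sketch under assumptions: ChannelSketch is a hypothetical class, only the try_send / try_pop names and the [[nodiscard]] idea come from the entry, and the SimX implementation (cycle timing, storage, return types) may look quite different.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>

// Hypothetical sketch of a fixed-capacity, non-blocking channel.
// Not the SimX class; shown only to illustrate the ready/valid analogy.
template <typename T>
class ChannelSketch {
public:
  explicit ChannelSketch(std::size_t capacity) : capacity_(capacity) {}

  // Returns false when the channel is full (the "ready low" case).
  // [[nodiscard]] makes the compiler warn if a producer ignores the result.
  [[nodiscard]] bool try_send(T value) {
    if (queue_.size() >= capacity_) return false;
    queue_.push_back(std::move(value));
    return true;
  }

  // Returns std::nullopt when the channel is empty (the "valid low" case).
  [[nodiscard]] std::optional<T> try_pop() {
    if (queue_.empty()) return std::nullopt;
    T value = std::move(queue_.front());
    queue_.pop_front();
    return value;
  }

  bool full() const { return queue_.size() >= capacity_; }
  bool empty() const { return queue_.empty(); }

private:
  std::size_t capacity_;
  std::deque<T> queue_;
};
```

A producer that gets false back has to hold its item and retry on a later cycle, which is the same obligation a ready/valid sender has in RTL.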
Scoreboard issue arbiter — GTO (Greedy-Then-Oldest). VX_scoreboard now drives a dedicated VX_gto_arbiter with a suppress mask for warps whose target functional unit is full (they keep aging but are skipped for selection until the FU drains). v2.x used VX_stream_arb (round-robin). Why: GTO matches mainstream GPU warp-issue policy — finish the currently-running warp before switching — yielding better ILP/cache locality than naive RR.
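A behavioural sketch of greedy-then-oldest selection with a suppress mask follows. GtoArbiterSketch is a hypothetical model, and the age counters, the tie-break toward the lowest warp index, and the 32-warp mask width are assumptions; VX_gto_arbiter may track age and ties differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical behavioural model of a GTO arbiter; not VX_gto_arbiter.
class GtoArbiterSketch {
public:
  explicit GtoArbiterSketch(unsigned num_warps)
      : age_(num_warps, 0), last_(-1) {}

  // requests / suppress hold one bit per warp (at most 32 warps assumed).
  // Returns the granted warp index, or -1 if no warp can issue this cycle.
  int grant(std::uint32_t requests, std::uint32_t suppress) {
    const std::uint32_t eligible = requests & ~suppress;
    int pick = -1;
    if (last_ >= 0 && ((eligible >> last_) & 1u)) {
      pick = last_;  // greedy: stay on the warp granted last time
    } else {
      // Otherwise choose the eligible warp that has waited longest.
      for (std::size_t w = 0; w < age_.size(); ++w) {
        if (!((eligible >> w) & 1u)) continue;
        if (pick < 0 || age_[w] > age_[static_cast<std::size_t>(pick)])
          pick = static_cast<int>(w);
      }
    }
    // Requesting warps that lose keep aging, including suppressed ones.
    for (std::size_t w = 0; w < age_.size(); ++w) {
      if (static_cast<int>(w) == pick) age_[w] = 0;
      else if ((requests >> w) & 1u) ++age_[w];
    }
    if (pick >= 0) last_ = pick;
    return pick;
  }

private:
  std::vector<std::uint32_t> age_;
  int last_;
};
```

In this model a suppressed warp accumulates age while its functional unit is full, so it becomes the oldest candidate as soon as the suppress bit clears and the current warp stops requesting.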
Scoreboard — XREGS dependency tracking + per-FU lock for multi-uop macros. Scoreboard gains a second plane (inuse_xregs) tracking FPU special registers (fflags, frm); every instruction declares rd_xregs / wr_xregs masks in decode (VX_decode.sv:377-380). New fu_lock / fu_unlock instruction flags let a uop sequencer hold a functional unit across a multi-uop macro — set by VX_tcu_uops.sv:422 for WGMMA (fu_lock=first_uop, fu_unlock=last_uop) — and the scoreboard refuses to issue any other warp's uop to the locked FU until it retires. Why: eliminates v2.x's coarse FPU-boundary stall for fflags/frm WAW/RAW hazards, and keeps multi-uop macros atomic on one FU instance — without fu_lock, interleaving warps would corrupt shared FU state (WGMMA's per-block A registers + accumulator).
Warp scheduler — per-warp ibuffer-capacity gate. VX_scheduler tracks per-warp ibuf_full and computes schedule_warps = ready_warps & ~ibuf_full, with an all_ibuf_full ? ready_warps : preferred_warps fallback to keep pipelines absorbing transient stalls. v2.x scheduled solely on active_warps & ~stalled_warps. Why: prevents the scheduler from issuing a warp whose decoded uops will only block on a full ibuffer downstream — wasting fetch/decode bandwidth and pushing back-pressure into the front end.
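The mask arithmetic in the scheduler entry can be written out directly. In this sketch schedule_warps is a hypothetical free function, preferred_warps is assumed to be ready_warps & ~ibuf_full, and all_ibuf_full is assumed to mean that no ready warp has ibuffer space; the exact definitions in VX_scheduler are not given in this changelog.

```cpp
#include <cstdint>

// Hypothetical sketch of the per-warp ibuffer-capacity gate.
// Each bit of the masks corresponds to one warp.
inline std::uint32_t schedule_warps(std::uint32_t ready_warps,
                                    std::uint32_t ibuf_full) {
  // Prefer ready warps whose instruction buffer can still accept uops.
  const std::uint32_t preferred_warps = ready_warps & ~ibuf_full;
  // Assumption: "all full" means no ready warp has ibuffer space left.
  const bool all_ibuf_full = (preferred_warps == 0);
  // Fall back to every ready warp rather than scheduling nothing.
  return all_ibuf_full ? ready_warps : preferred_warps;
}
```

With ready_warps = 0b1011 and ibuf_full = 0b0010 the result is 0b1001; when every ready warp is full, the ready mask passes through unchanged.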
__syncthreads() (BAR opcode) now drains LSU before suspending warps so SMEM writes commit before any post-barrier reads. Gated on lsu_sched_drained in VX_wctl_unit.sv and lsu_drained() in sim/simx/sfu_unit.cpp.
BUG FIX: SimX cache-flush parity with RTL VX_dcr_flush — ProcessorImpl::flush_caches() (sim/simx/processor.cpp) now fans out to icache + dcache + {tcache, rcache, ocache} L1 surfaces in parallel (was dcache-only), matching the RTL shared-req/AND-of-done topology in VX_core.sv + VX_graphics.sv.

Known limitations

--vm is SimX-only. Host-shadow PT is modelled in SimX C++ memory; rtlsim's DeviceMemIO is not yet wired through the same shim.

[2.3] — 2026-05-11

Last v2.x maintenance release. See git log v2.2..v2.3.

Earlier releases

Tags v0.2.0 through v2.3 predate this changelog. Use git log and the GitHub releases page for history.
