All notable changes to Vortex are documented here. The format is based on
Keep a Changelog, and the project
follows the version pins recorded in VERSION (VORTEX_VERSION,
TOOLCHAIN_REV, GEM5_REV).
The 3.0 release introduces a fixed-function graphics stack (rasterizer, texture units, and output mergers), tensor core structured sparsity (2:4), warpgroup-level matrix multiplication (WGMMA), global-to-local data transfer acceleration (DXA), a new hardware kernel scheduler (KMU) and Command Processor (CP) architecture, a new asynchronous runtime API (vortex2.h), asynchronous barriers with arrive/wait/event semantics, compressed instruction set (RVC) support, hardware atomics, an MMU/SV32 virtual memory stack, a Mesa/lavapipe Vulkan backend (vortexpipe), HIP via chipStar, gem5 integration, a SimX v3 TLM architecture with fixed-size handshake channels, productized Synopsys and Yosys ASIC synthesis flows, and a refreshed toolchain (LLVM 20, POCL 7.0).

Build and configuration infrastructure was reworked: TOML-driven HW configuration (VX_config.toml + VX_types.toml) decoupling SimX/runtime from the RTL source tree, a VX_CFG_ macro namespace that resolves toolchain preprocessor collisions, retirement of the global toolchain_env.sh to enable parallel multi-version Vortex worktrees on the same shell, consolidation of kernel/ and runtime/ under a shared sw/ root, a single-source VERSION file driving CI toolchain pinning, Perfetto trace export (ci/perfetto.py), and new top-level AGENTS.md + CONTRIBUTING.md for AI-agent and contributor workflows.
- TCU tfr arithmetic backend. New in-house, fully-synthesizable fused dot-product running integer and floating-point through one shared 4-cycle pipeline; gated by VX_CFG_TCU_TYPE_TFR. Adds FP8 (e4m3), BF8 (e5m2), and TF32 on top of the v2.x set (fp32 / fp16 / bf16 / i32 / i8 / u8 / i4 / u4). Each is gated by its own VX_CFG_TCU_{FP8,BF16,TF32}_ENABLE; format dispatch is unified across all four FEDP backends.
- Tensor-core structured sparsity (2:4). VX_tcu_sp_mux + VX_tcu_metadatapath plus host compress_2to4_matrix / prune_2to4_matrix helpers; gated by VX_CFG_TCU_SPARSE_ENABLE.
- Warpgroup-level MMA (WGMMA). Per-warp NRA=4 / variable-NRC fragment layout, S/R source modes, smem descriptor path; gated by VX_CFG_TCU_WGMMA_ENABLE.
- Data-transfer Acceleration (DXA). Async global→local DMA engine for tile staging (hw/rtl/dxa/ + sim/simx/dxa/).
- Hardware Kernel Management Unit (KMU). New scheduler block (hw/rtl/VX_kmu.sv + sim/simx/kmu/) that owns CTA dispatch from the CP launch path.
- Launch-level CTA clustering. vx_launch_info_t::cluster_dim[3] (sw/runtime/include/vortex2.h:202) guarantees every K-CTA group is co-resident on one core; KMU iterates intra-cluster offsets first (sim/simx/kmu/kmu.cpp:63) and the CTA dispatcher reserves K contiguous LMEM slots so DXA Path A multicast can target issuer + r*stride.
- Command Processor (CP) v3. New hw/rtl/cp/ block + host-resident command ring (CMD_LAUNCH, CMD_MEM_*, CMD_DCR_*, CMD_CACHE_FLUSH, CMD_EVENT_*); integrated end-to-end across xrt, opae, simx, rtlsim behind VORTEX_USE_CP.
- Asynchronous vortex2.h runtime API. Queues, events, modules, kernels, UVA raw-pointer kernel args, per-queue worker thread; legacy vortex.h retained as a thin wrapper.
- C++ software CP model (sim/common/cmd_processor.cpp) shared by simx and rtlsim.
- Graphics stack (RASTER / TEX / OM). Fixed-function 3D pipeline: hw/rtl/{raster,tex,om}/ + VX_graphics.sv + matching SimX models; --graphics regression group.
- Public host-side graphics API. sw/runtime/include/graphics.h exposes vortex::graphics::Binning() (triangle setup + tile binning producing the on-wire rast_prim_t stream the RASTER unit reads) plus self-contained vertex_t / primitive_t input types and DCR address helpers for external Vulkan/HIP/OpenGL drivers.
- Canonical on-wire graphics ABI in sw/kernel/include/vx_graphics.h. Templated POD vortex::graphics::fixed_t<F> (Q15.16 / Q?.24 with full arithmetic, all members public + trivially copyable) plus the on-wire structs (vec3e_t, rast_prim_t, rast_attribs_t, rast_tile_header_t, etc.) and 8888 pixel helpers.
- Vortex SDK install layout. make install produces $VORTEX_PATH/{kernel,runtime}/{include,lib<XLEN>} plus lib/pkgconfig/vortex-{runtime,kernel}.pc (auto-generated from sw/runtime/vortex-runtime.pc.in and sw/runtime/vortex-kernel.pc.in at configure time). Default prefix is <build>/install. Why: gives downstream tools (mesa-vortex, pocl-vortex, chipStar) a single $VORTEX_PATH env var + pkg-config integration shape.
- Vulkan support via a new Mesa Gallium driver vortexpipe selected through the lavapipe ICD; tests/vulkan/ suite (compute, draw3d, depth, textured, raytrace); Mesa shipped via the prebuilt toolchain; rv64 path enabled.
- HIP support on rv32 + rv64 via chipStar. chipStar's hipcc now accepts --offload-pointer-width={32,64} and emits SPIR-V with the matching OpMemoryModel Physical{32,64}; a single libCHIP.so ships both widths' rtdevlib modules and selects at runtime via CL_DEVICE_ADDRESS_BITS. POCL on rv32 Vortex accepts the resulting Physical32 SPIR-V cleanly. See docs/designs/hip_on_vortex_chipstar.md.
- Hardware atomics. RISC-V A-extension (LR/SC reservation table + cache-resident AMO* RMW), gated by VX_CFG_EXT_A_ENABLE. AMOs complete at the LLC while non-LLC banks invalidate the line on passthrough, so atomics are correct across the full L1/L2/L3 cache hierarchy.
- In-house IEEE-754 FPU (VX_fpu_std), now F32 and F64. Fully RV-compliant scalar FPU built from Vortex-owned blocks, covering both single (F) and double (D) precision natively: VX_fma_unit (separate F32/F64 fused multiply-add cores, NVIDIA-style), merged-format VX_fdivsqrt_unit (radix-2 non-restoring FDIV + FSQRT; one carry-save datapath sized for the widest format: 17-cycle F32-only, 32-cycle when D is enabled), merged-format VX_fcvt_unit (I2F/F2I/F2F incl. FCVT.S.D subnormal/overflow), and merged VX_fncp_unit (sign-inject/min-max/compare/class/move with F32 NaN-box checking); selected via VX_CFG_FPU_TYPE_STD. Validated against the full rv64u[fd]-p riscv-tests ISA suite on the RTL FPU. Why: removes the FPNEW dependency entirely; FPNEW is no longer required for D support on any flow (ASIC/Yosys/Synopsys and FPGA), only optionally selectable; the native units deliver higher fmax, lower latency, and smaller area, and the in-tree source unblocks block-level tuning that vendoring made impractical.
- RISC-V Zicond (conditional ops). CZERO.EQZ / CZERO.NEZ integrated end-to-end (decode in VX_decode.sv, ALU in VX_alu_int.sv); gated by VX_CFG_EXT_ZICOND_ENABLE. Adds an ISA-level branchless-select primitive used by LLVM 20's codegen.
- Pack-load intrinsics (vx_packlb_f / vx_packlh_f). Single-instruction strided loads that fold 4×byte (PACKLB) or 2×halfword (PACKLH) loads into one front-end issue, expanded by VX_uop_packld into N back-to-back LSU uops with eff_rs1 = rs1 + rs2 × uop_idx. Used heavily by sw/kernel/include/vx_tensor.h for TCU tile-row packing.
- Wallace-tree + folded-radix multipliers (VX_wallace_mul, VX_fold_mul). New hw/rtl/libs/ multiplier blocks. Why: used by VX_fma_unit (mantissa multiply) and the TFR TCU backend (per-format integer / fp multiply); shared structural multiplier lets both blocks pick the area/latency trade-off via a single point of change.
- Kogge-Stone parallel-prefix adder (VX_ks_adder). Logarithmic-depth carry-propagate adder under hw/rtl/libs/. Why: drives the long carry chains in VX_fma_unit (exponent/mantissa alignment + final add) and the TFR TCU's FEDP final accumulator without the ripple-carry timing penalty Verilator's default "+" emits.
- Stream split/join primitives (VX_stream_dispatch, VX_stream_fork, VX_stream_join). New hw/rtl/libs/ modules consumed by VX_dcr_arb, VX_dxa_dispatch, and VX_gbar_arb. Why: replaces ad-hoc 1→N and N→1 ready/valid plumbing duplicated across the DCR, DXA, and global-barrier paths with one tested, parameterizable primitive; cuts ~3 copies of the same handshake state machine.
- Inference-based integrated clock gating (VX_clockgate). Synthesizable ICG cell at hw/rtl/libs/VX_clockgate.sv; instantiated for per-core gating in VX_socket.sv:407. Why: gives the synthesis tools a single recognizable ICG inference pattern so ASIC flows (Synopsys/Yosys) generate proper latch-based gates instead of glitching AND-based gating; per-core gating is the first power-domain leverage point.
- Compressed instruction set (RVC). New VX_decompressor block in the fetch stage (hw/rtl/core/VX_decompressor.sv + sim/simx/decompressor.cpp); gated by VX_CFG_EXT_C_ENABLE. v2.x shipped the test binaries but had no decompressor.
- Asynchronous barriers with arrive / wait / expect_tx semantics. VX_bar_unit + vortex::barrier host API; expect_tx is the hook DXA multicast uses to declare expected bytes.
- MMU / virtual memory (SV32, rv32-only). Host-shadow page table + DeviceMemIO refactor + --vm regression group.
- GEM5 integration. VortexGPGPU SimObject + x86/aarch64 host runtimes; ci/regression.sh --gem5 + VORTEX_GEM5_ARM=1.
- SimX v3 TLM architecture. Transaction-level memory packets (MemReq / MemRsp with shared_ptr<mem_block_t> payloads) and reusable TLM cache / switch / coalescer modules across the L1/L2/L3 hierarchy, tcache/ocache/rcache, DXA, and CP DMA paths.
- ASIC synthesis flows. hw/syn/{synopsys,yosys}/ productized: shared hw/syn/common.mk, bundled NanGate_15nm_OCL.db standard cells, standardized OPT_LEVEL; legacy hw/syn/modelsim flow retired.
- Synopsys multi-PDK support. hw/syn/synopsys/Makefile now exposes three target PDKs via LIB_TGT selection: ASAP7 (7nm), SAED14 (14nm SLVT), and NanGate 15nm OCL (default). Per-PDK SRAM mappings, pin polarity, and address packing handled in hw/syn/synopsys/project.tcl (SRAM_PINS + family dispatch). Why: lets users compare PPA on a real foundry-style 7nm/14nm flow without rewriting the synthesis scripts per technology node.
- Yosys + OpenSTA end-to-end ASIC pipeline. hw/syn/yosys/run_synth.sh runs synthesis → tech-map → area (stat -liberty) → SRAM-cost estimation (sram_cost.py) → OpenSTA timing + power (run_sta.tcl, gated by RUN_STA=1). Makefile targets: synth, techmap, timing. Why: gives the open-source flow the same area/timing/power deliverables as the Synopsys flow, so no commercial-tool license is needed for first-pass PPA exploration.
- SAIF switching-activity workflow for power analysis. ci/blackbox.sh --saif [--saif_file=...] captures gate-level activity from rtlsim runs; both the Synopsys (project.tcl:489-922: read_saif -auto_map_names + report_saif) and Xilinx (hw/syn/xilinx/dut/common.mk:31) power flows consume the resulting SAIF for vector-driven power estimation. Why: replaces vectorless power estimation with activity-annotated numbers tied to a real workload, dropping the order-of-magnitude error band that purely static analysis carries.
- OpenSTA tool packaged with the toolchain. New sta() function in ci/toolchain_install.sh.in + ci/toolchain_prebuilt.sh.in; installed under $TOOLDIR/sta/ and consumed by both the Yosys flow (run_sta.tcl) and hw/syn/common.mk as $(STA). Why: makes OpenSTA a first-class peer of Verilator/sv2v/Yosys in the prebuilt toolchain so the ASIC flow's timing/power story works out-of-the-box from a fresh toolchain_install.sh --all.
- Preemption groundwork. Synchronous RISC-V trap path + native riscv-tests support.
- Toolchain refresh. LLVM 20, POCL 7.0, chipStar.
- Kernel-entry calling convention (no callee-saved spills). vortex.kernel entries (__kernel macro, vx_spawn2.h) skip s0–s11 / fs0–fs11 spills via an empty LLVM-20 callee-saved set (ra still saved); the KMU / __vx_cta_entry trampoline runs each once then vx_tmc zero. Why: generic-ABI per-CTA spills thrash the 16 KB L1 (sgemm2 +27% cycles).
- Versioned toolchain pipeline for CI. VERSION is the single source of truth (VORTEX_VERSION, TOOLCHAIN_REV, GEM5_REV); CI cache keys + installer scripts both honour it; bumping a pin rolls the CI cache.
- Perfetto trace integration. ci/perfetto.py renders RTL and SimX traces into Chrome Trace JSON (auto-detects flavour); see docs/perfetto_analysis.md.
- AI-agent integration via AGENTS.md. Canonical entry point for AI agents and human contributors — foundation rules, documentation map, build/test/design invariants.
- Test groups. New --vulkan, --gem5, --hip, --amo, --tensor_sp, --tensor_wg, --dtm, --mpi, --rvc, --vm, --graphics runners in ci/regression.sh; matrix expanded to rv32 + rv64.
- CONTRIBUTING.md and this changelog at the repo root.
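The structured-sparsity entry above refers to the 2:4 pattern, in which every group of four values keeps at most two non-zeros. The following is a minimal sketch of pruning and compressing one row under that pattern. The names Sparse24Row, prune_2to4_row, and compress_2to4_row and the 4-bit position encoding are assumptions made for illustration; they are not the Vortex host helpers or the TCU metadata format.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container for one compressed row (not a Vortex type).
struct Sparse24Row {
  std::vector<float>   values;  // two kept values per group of four
  std::vector<uint8_t> meta;    // per group: pos0 | (pos1 << 2) (assumed encoding)
};

// Positions (ascending) of the two largest-magnitude entries in a group of four.
static std::pair<int, int> top2_positions(const float* g) {
  int a = 0, b = 1;  // a: largest so far, b: second largest so far
  if (std::fabs(g[b]) > std::fabs(g[a])) std::swap(a, b);
  for (int i = 2; i < 4; ++i) {
    if (std::fabs(g[i]) > std::fabs(g[a])) { b = a; a = i; }
    else if (std::fabs(g[i]) > std::fabs(g[b])) { b = i; }
  }
  if (a > b) std::swap(a, b);
  return std::make_pair(a, b);
}

// Prune: zero the two smallest-magnitude entries of every group of four.
std::vector<float> prune_2to4_row(std::vector<float> row) {
  assert(row.size() % 4 == 0);
  for (std::size_t g = 0; g < row.size(); g += 4) {
    const std::pair<int, int> keep = top2_positions(&row[g]);
    for (int i = 0; i < 4; ++i) {
      if (i != keep.first && i != keep.second) row[g + i] = 0.0f;
    }
  }
  return row;
}

// Compress: store the two survivors plus their positions for each group.
Sparse24Row compress_2to4_row(const std::vector<float>& row) {
  assert(row.size() % 4 == 0);
  Sparse24Row out;
  for (std::size_t g = 0; g < row.size(); g += 4) {
    const std::pair<int, int> keep = top2_positions(&row[g]);
    out.values.push_back(row[g + keep.first]);
    out.values.push_back(row[g + keep.second]);
    out.meta.push_back(static_cast<uint8_t>(keep.first | (keep.second << 2)));
  }
  return out;
}
```

The compressed row carries half the values of the dense row, and the position codes let a consumer pick the matching operands from the dense side of the multiplication.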
- TCU register-file bank-conflict-free mapping. TCU micro-op generation in TcuUopGen (and the SimX sim/simx/tcu/tcu_unit.cpp:1332 bank-conflict-free formulas) permutes A / B / C operand offsets so every uop's three RF reads land in different GPR banks; separate formula classes cover sparse, dense NT∈{4,16,64}, and dense NT∈{8,32}. Why: drops issue-stage stall cycles to zero on the TCU MMA loop; v2.x used a naive (step % sub_blocks) * block_size offset that incurred bank collisions on every other uop.
- Profiling counters are host-driven via DCR (device-side dump removed). Kernel startup (vx_start.S) no longer dumps perf counters at exit; the host reads them on demand via vx_mpm_query → vx_dcr_read, gated by VORTEX_PROFILING. Why: drops perf-dump code from every kernel binary.
- Source-tree consolidation. kernel/ + runtime/ → sw/{kernel,runtime,common}; new sw/common/ holds code shared by device and host (rvfloats, softfloat_ext, mem_alloc, plus the h/w-internal gfx_render.{h,cpp} host hardware model used only by simx). Why: removes include-path duplication and mirrors the sim/{simx,common,...} layout. Note: ABI types and downstream-visible config (tensor_cfg.h, the graphics on-wire types) live in sw/kernel/include/ so they ship through the install tree; sw/common/ is vortex-internal and never installed.
- Downstream tools consume vortex via $VORTEX_PATH + pkg-config. mesa-vortex, pocl-vortex and chipStar no longer reach into the vortex source tree ($VORTEX_HOME) or build tree ($VORTEX_BUILD_DIR) for headers or libraries. They link against the install tree produced by make install through vortex-runtime.pc / vortex-kernel.pc. Mesa's vortex-runtime option is now vortex-path; pocl's -DVORTEX_PREFIX= / -DVORTEX_BUILD_DIR= are replaced by pkg_check_modules(VORTEX REQUIRED vortex-runtime) + -DVORTEX_PATH_{32,64} for per-XLEN device-side bitcode builds.
- gfxutil.{h,cpp} removed. Binning() moved into graphics.h + graphics.cpp (now part of libvortex.so); the toVX{Format,Compare,StencilOp,BlendFunc} helpers (CGLTrace-specific test-input translators) inlined into tests/regression/gfx_draw3d/main.cpp, the only consumer; ResolveFilePath (test-asset filesystem resolver) inlined per-test, matching the resolve_path pattern raycast already used. Why: gfxutil.h was leaking cocogfx (CGLTrace, ePixelFormat) into what should have been vortex's public surface; eliminating it lets graphics.h be self-contained as the install tree requires.
- riscv-tests .bin blobs removed from the source tree. Pre-built tests/riscv/isa/*.bin (c56562fe) and tests/riscv/benchmarks_{32,64}/*.bin (cd1656cf) collapsed into a single on-demand build under tests/riscv/common.mk that clones upstream riscv-tests at a pinned commit and builds per-XLEN behind a stamp file. Why: drops binary blobs from version control and pins behaviour to a single upstream commit instead of stale checked-in artifacts.
- Configuration moved to TOML. VX_config.toml + VX_types.toml replace hw/rtl/VX_config.vh; ci/gen_config.py emits per-target headers (build/hw/VX_config.vh, build/sw/VX_config.h) and -D overrides from one source, with expr: / [[enum]] / [[builtin]] / [[param]] semantics. Why: gives the config typed scalars / cross-key expressions / typed enums the flat `define block could not express, and decouples SimX/runtime from the RTL source tree (no more -I$(ROOT_DIR)/hw).
- Configuration namespace. All HW config macros now carry the VX_CFG_ prefix; HW/SW layering split: VX_config.h is HW/sim-private. Why: resolves preprocessor collisions with LLVM/Clang and Verilator/SV `defines, and stops HW config from leaking into kernel-side toolchain invocations.
- No more global toolchain_env.sh. Each build/ carries its own resolved tool paths. Why: enables parallel multi-version Vortex worktrees on the same shell (the old source ci/toolchain_env.sh hijacked $PATH).
- Build system. Tool-path env vars unified on the _PATH suffix; shared hw/syn/common.mk; standardized OPT_LEVEL across synthesis backends; LLVM default target fixed; dead hw/config call dropped. Why: normalizes the build/synthesis surface so cross-backend changes touch one place.
- Verilator pinned to 5.028 (was tracking the latest 5.046 release in the prebuilt). Why: 5.046 removed the --xml-only flag that sim/opaesim/Makefile needs for --scope, tightened DEFOVERRIDE into a hard error breaking DEBUG=3 + -DGPR_RESET debug builds, and rejected cvfpu's C_PC / C_EXP_* macros under DPI_DISABLE + FPU_FPNEW (--config2). 5.028 keeps all three working (commit f00bb142).
- SimObject channels: explicit fixed-size handshaking. Unbounded std::queue + blocking push / pop replaced by fixed-capacity channels + non-blocking try_send / try_pop (RTL ready/valid analog); [[nodiscard]] forces producers to handle backpressure at the call site. Why: makes SimX model true RTL backpressure (1:1 with ready/valid) so buffering bugs surface in C++ instead of only at RTL bring-up.
- Scoreboard issue arbiter: GTO (Greedy-Then-Oldest). VX_scoreboard now drives a dedicated VX_gto_arbiter with a suppress mask for warps whose target functional unit is full (they keep aging but are skipped for selection until the FU drains). v2.x used VX_stream_arb (round-robin). Why: GTO matches mainstream GPU warp-issue policy (finish the currently-running warp before switching), yielding better ILP/cache locality than naive RR.
- Scoreboard: XREGS dependency tracking + per-FU lock for multi-uop macros. Scoreboard gains a second plane (inuse_xregs) tracking FPU special registers (fflags, frm); every instruction declares rd_xregs / wr_xregs masks in decode (VX_decode.sv:377-380). New fu_lock / fu_unlock instruction flags let a uop sequencer hold a functional unit across a multi-uop macro (set by VX_tcu_uops.sv:422 for WGMMA: fu_lock=first_uop, fu_unlock=last_uop), and the scoreboard refuses to issue any other warp's uop to the locked FU until it retires. Why: eliminates v2.x's coarse FPU-boundary stall for fflags / frm WAW/RAW hazards, and keeps multi-uop macros atomic on one FU instance; without fu_lock, interleaving warps would corrupt shared FU state (WGMMA's per-block A registers + accumulator).
- Warp scheduler: per-warp ibuffer-capacity gate. VX_scheduler tracks per-warp ibuf_full and computes schedule_warps = ready_warps & ~ibuf_full, with an all_ibuf_full ? ready_warps : preferred_warps fallback to keep pipelines absorbing transient stalls. v2.x scheduled solely on active_warps & ~stalled_warps. Why: prevents the scheduler from issuing a warp whose decoded uops will only block on a full ibuffer downstream, which wastes fetch/decode bandwidth and pushes back-pressure into the front end.
- __syncthreads() (BAR opcode) now drains LSU before suspending warps so SMEM writes commit before any post-barrier reads. Gated on lsu_sched_drained in VX_wctl_unit.sv and lsu_drained() in sim/simx/sfu_unit.cpp.
- BUG FIX: SimX cache-flush parity with RTL VX_dcr_flush: ProcessorImpl::flush_caches() (sim/simx/processor.cpp) now fans out to icache + dcache + {tcache, rcache, ocache} L1 surfaces in parallel (was dcache-only), matching the RTL shared-req/AND-of-done topology in VX_core.sv + VX_graphics.sv.
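The SimObject channel entry above replaces blocking queues with fixed-capacity channels whose non-blocking calls report backpressure to the caller. A minimal sketch of that idea follows, assuming a single-threaded simulation loop. The class name BoundedChannel and its exact signatures are hypothetical and are not the SimX implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical fixed-capacity channel (illustration only, not the SimX class).
template <typename T>
class BoundedChannel {
public:
  explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {}

  // Non-blocking send. Returns false when the channel is full, which plays
  // the role of a deasserted "ready"; the producer must hold the value and
  // retry on a later cycle.
  [[nodiscard]] bool try_send(T value) {
    if (queue_.size() >= capacity_) return false;
    queue_.push_back(std::move(value));
    return true;
  }

  // Non-blocking receive. Returns false when the channel is empty, which
  // plays the role of a deasserted "valid"; 'out' is left untouched.
  [[nodiscard]] bool try_pop(T& out) {
    if (queue_.empty()) return false;
    out = std::move(queue_.front());
    queue_.pop_front();
    return true;
  }

  std::size_t size() const { return queue_.size(); }
  std::size_t capacity() const { return capacity_; }

private:
  std::size_t   capacity_;
  std::deque<T> queue_;
};
```

Because both calls are marked [[nodiscard]], a producer that ignores a failed try_send triggers a compiler warning, which is the property the changelog entry attributes to the new channels.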
--vm is SimX-only. Host-shadow PT is modelled in SimX C++ memory; rtlsim's DeviceMemIO is not yet wired through the same shim.
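For readers unfamiliar with the SV32 layout that the host-shadow page table follows, the sketch below shows how a 32-bit virtual address splits into its two page-table indices and page offset, and how a leaf PTE maps to a physical address. The field positions come from the RISC-V privileged specification; the type and function names are hypothetical and do not reflect the SimX code.

```cpp
#include <cassert>
#include <cstdint>

// Sv32 virtual-address fields: VPN[1] (10 bits), VPN[0] (10 bits), offset (12 bits).
struct Sv32Addr {
  uint32_t vpn1;    // index into the root page table
  uint32_t vpn0;    // index into the second-level page table
  uint32_t offset;  // byte offset within the 4 KiB page
};

// Split a virtual address into its Sv32 fields.
constexpr Sv32Addr split_sv32(uint32_t va) {
  return Sv32Addr{(va >> 22) & 0x3FFu, (va >> 12) & 0x3FFu, va & 0xFFFu};
}

// Sv32 PTE layout: PPN in bits [31:10], flags in bits [9:0].
// V is bit 0; R, W, X are bits 1, 2, 3.
constexpr bool pte_valid(uint32_t pte) { return (pte & 0x1u) != 0; }

// A valid PTE with any of R/W/X set is a leaf; otherwise it points to the next level.
constexpr bool pte_is_leaf(uint32_t pte) { return (pte & 0xEu) != 0; }

// Physical address for a 4 KiB leaf mapping: the 22-bit PPN concatenated with
// the 12-bit page offset, giving up to 34 bits.
constexpr uint64_t leaf_phys_addr(uint32_t pte, uint32_t offset) {
  return (static_cast<uint64_t>(pte >> 10) << 12) | offset;
}
```

A two-level walk reads the root entry at index vpn1, follows it to the second-level table if it is valid and not a leaf, then reads the entry at index vpn0.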
Last v2.x maintenance release. See git log v2.2..v2.3.
Tags v0.2.0 through v2.3 predate this changelog. Use git log
and the GitHub releases page for history.