🚀 Status: Successfully taped out as part of Silicon Sprint 2026 at the American University in Cairo (AUC). Physical silicon fabrication expected in November 2026.
📊 Try the Software Model: Google Colab Demo
- Overview
- Key Features
- Architecture
- Physical Design
- Application: Chest X-Ray Pneumonia Detection
- Repository Structure
- Verification
- Getting Started
- Design Metrics & Signoff
- Future Work
- Team
- References
nanoNPU is a highly area-optimized, taped-out Neural Processing Unit designed for low-power medical edge inference. It implements a fused-datapath streaming architecture built around an 8×8 systolic array executing INT8 quantized matrix multiplications, enabling dense neural network inference in a tiny silicon footprint on the SkyWater 130nm open-source PDK.
Although the primary validation application is chest X-ray pneumonia classification, the ISA-driven architecture is not tied to that model: any INT8-quantized convolutional or fully-connected neural network that fits the on-chip memories can be mapped onto the nanoNPU by issuing the appropriate instruction sequence over UART/APB.
| Feature | Detail |
|---|---|
| Compute Core | 8×8 Systolic Array (64 MAC units) |
| Data Precision | INT8 activations & weights, INT32 accumulators |
| Memory | 128×32 dual-port SRAM (data) + 32×32 instruction memory |
| Host Interface | UART → APB bus bridge |
| Post-Processing | Bias addition, requantization (INT32→INT8), ReLU; POOL opcode defined, pooling unit not yet implemented |
| ISA | Custom 32-bit fixed-width, 13 instructions (team-designed) |
| Clock | 20 MHz target (50 ns period) |
| Technology | SkyWater SKY130 130nm CMOS |
| Die Area | 880 µm × 1031.66 µm |
| Core Utilization | 20% |
| Tapeout Program | Silicon Sprint 2026, AUC |
The nanoNPU is organized around a linear streaming datapath. The host communicates with the chip over a UART link that is bridged to an internal APB bus. An APB decoder fans out to the NPU core, instruction memory, and data SRAM:
Host PC
│
▼ (115200 baud UART)
UART–APB Bridge
│
▼
APB Splitter / NPU APB Decoder
│
├──► Instruction Memory (IMEM) 32 × 32-bit
├──► Data SRAM (DMEM) 128 × 32-bit (dual-port)
└──► NPU Core (npu_top)
│
└──► Control Unit (CU)
│
├──► ACT Ping-Pong Buffer ──────────┐
├──► WGT Ping-Pong Buffer ──────────┤
│ ▼
│ Systolic Array 8×8
│ │
│ acc_buffer (INT32)
│ │
├──► Bias Buffer ──────► Bias Adder ─┤
│ │
├──► Scale Register ──► Req Unit ────┤ (INT32→INT8)
│ │
│ ReLU Unit
│ │
└──► Store Engine ◄──────────────────┘
│
▼
Data SRAM
The nanoNPU uses a fused streaming datapath to keep on-chip buffer requirements minimal. Intermediate results never return to main SRAM between pipeline stages — they flow directly through dedicated small buffers:
SRAM ──[LOAD_ACT]──► ACT Ping-Pong Buffer ─┐
├──► Systolic Array (8×8 MACs)
SRAM ──[LOAD_WGT]──► WGT Ping-Pong Buffer ─┘
│
▼
acc_buffer (INT32, 8 rows)
│
SRAM ──[LOAD_BIAS]──► bias_buffer ──► Bias Adder
│
pbias_buffer (INT32+bias)
│
SRAM ──[LOAD_SCL]──► scale_reg ───► Req Unit (×M0 >> n, INT8)
│
preq_buffer (INT8)
│
[ReLU] max(0, x)
│
relu_buffer (INT8)
│
─[STORE]──► SRAM
Ping-Pong Buffers enable the CU to load the next tile from SRAM while the systolic array consumes the current tile, hiding SRAM latency behind computation.
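The stage order in the diagram above can be mirrored in a few lines of Python. This is a minimal sketch, not the repository's npu_modeling.py: the function names, the one-bias-per-output-column layout, and the use of Python's arithmetic right shift for the `>> n` step are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]
INT8_MIN, INT8_MAX = -128, 127


def matmul_int8(act: Matrix, wgt: Matrix) -> Matrix:
    """INT8 x INT8 products accumulated without overflow (acc_buffer stage)."""
    depth, cols = len(wgt), len(wgt[0])
    return [[sum(row[k] * wgt[k][j] for k in range(depth)) for j in range(cols)]
            for row in act]


def add_bias(acc: Matrix, bias: List[int]) -> Matrix:
    """Add one bias value per output column (assumed layout, pbias_buffer stage)."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in acc]


def requantize(x: Matrix, m0: int, n: int) -> Matrix:
    """(x * M0) >> n, then saturate to the INT8 range (preq_buffer stage)."""
    return [[max(INT8_MIN, min(INT8_MAX, (v * m0) >> n)) for v in row] for row in x]


def relu(x: Matrix) -> Matrix:
    """max(0, x) per element (relu_buffer stage)."""
    return [[max(0, v) for v in row] for row in x]


def run_tile(act: Matrix, wgt: Matrix, bias: List[int], m0: int, n: int,
             use_relu: bool = True) -> Matrix:
    """Chain the stages in datapath order without writing intermediates back."""
    y = requantize(add_bias(matmul_int8(act, wgt), bias), m0, n)
    return relu(y) if use_relu else y


if __name__ == "__main__":
    act = [[1, 2], [3, 4]]
    identity = [[1, 0], [0, 1]]
    print(run_tile(act, identity, bias=[10, -100], m0=1, n=1))
```

The sketch uses a 2×2 tile for readability; the same functions work unchanged on 8×8 tiles.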
✍️ The nanoNPU ISA — including all opcodes, instruction formats, and encoding — was fully designed from scratch by the team.
All NPU operations are encoded in a 32-bit fixed-width instruction format. Two instruction layouts exist:
LOAD / STORE format:
[31:26] [25:22] [21:16] [15:8] [7:0]
OP CODE buf_sel EXT_ADDR TILE_ADDR B TILE_ADDR A
6 bits 4 bits 6 bits 8 bits 8 bits
CONV / BIAS / REQ / ReLU / POOL format:
[31:26] [25:7] [6] [5:1] [0]
OP CODE RESERVED w_transpose n_scale BIAS_bypass
6 bits 19 bits 1 bit 5 bits 1 bit
The full instruction set:
| Instruction | OP Code | Operation |
|---|---|---|
| LOAD_ACT | 000000 | SRAM[tile] → Activation Ping-Pong Buffer |
| LOAD_WGT | 000001 | SRAM[tile] → Weight Ping-Pong Buffer |
| LOAD_BIAS | 000010 | SRAM[tile] → Bias Buffer |
| LOAD_SCL | 000011 | SRAM[tile] → Scale Register (M0, n) |
| CONV | 000100 | Act × Wgt → AccBuffer (8×8 MAC) |
| ADD_BIAS | 000101 | AccBuffer + BiasBuffer → PBBuffer |
| REQ | 000110 | PBBuffer × M0 >> n → ReqBuffer (INT8) |
| ReLU | 000111 | ReqBuffer max(0,x) → ReLUBuffer |
| POOL | 001000 | ReLUBuffer 2×2 MaxPool → PoolBuffer |
| STORE | 001001 | LastActiveBuffer → SRAM[tile] |
| LOAD_ACT_WGT | 010000 | SRAM → Act Ping-Pong + Wgt Ping-Pong (simultaneous) |
| NOP | 111110 | No operation, one cycle |
| HALT | 111111 | Stop; assert npu_done |
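To make the two instruction layouts concrete, the following sketch packs fields into 32-bit words. It is illustrative only and not the team's host script: the helper names are hypothetical, and the field positions are assumed to be non-overlapping, with buf_sel at [25:22] in the LOAD/STORE format and w_transpose at [6], n_scale at [5:1], BIAS_bypass at [0] in the compute format. Opcodes are taken from the table above.

```python
OPCODES = {
    "LOAD_ACT": 0b000000, "LOAD_WGT": 0b000001, "LOAD_BIAS": 0b000010,
    "LOAD_SCL": 0b000011, "CONV": 0b000100, "ADD_BIAS": 0b000101,
    "REQ": 0b000110, "RELU": 0b000111, "POOL": 0b001000, "STORE": 0b001001,
    "LOAD_ACT_WGT": 0b010000, "NOP": 0b111110, "HALT": 0b111111,
}


def _pack(op, fields):
    """Place the opcode in [31:26], then OR in each (name, value, width, shift)."""
    word = OPCODES[op] << 26
    for name, value, width, shift in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def encode_mem(op, buf_sel=0, ext_addr=0, tile_b=0, tile_a=0):
    """LOAD / STORE layout (assumed bit positions, see lead-in)."""
    return _pack(op, [("buf_sel", buf_sel, 4, 22), ("ext_addr", ext_addr, 6, 16),
                      ("tile_b", tile_b, 8, 8), ("tile_a", tile_a, 8, 0)])


def encode_compute(op, w_transpose=0, n_scale=0, bias_bypass=0):
    """CONV / BIAS / REQ / ReLU / POOL layout; also used here for NOP and HALT."""
    return _pack(op, [("w_transpose", w_transpose, 1, 6),
                      ("n_scale", n_scale, 5, 1),
                      ("bias_bypass", bias_bypass, 1, 0)])


if __name__ == "__main__":
    program = [
        encode_mem("LOAD_ACT", tile_a=0x00),
        encode_mem("LOAD_WGT", tile_a=0x10),
        encode_compute("CONV"),
        encode_compute("REQ", n_scale=2),
        encode_mem("STORE", tile_a=0x40),
        encode_compute("HALT"),
    ]
    for word in program:
        print(f"{word:08X}")
```

The tile addresses in the example program are arbitrary placeholders, not the memory map used on the chip.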
The Control Unit (CU.sv) implements a hierarchical FSM: a main pipeline FSM plus three sub-FSMs that sequence the CONV, REQ, and ReLU operations.
┌─────────────────────────────────────────────────────┐
│ start = 0 (STALL) │
▼ │
rst_n ──► IDLE ──[start]──► FETCH ──► STALL ──► DECODE ──► EXECUTE ──► NEXT
▲ │
│ unit_done = 1 │
└───────────────────────────────┘
│
[opcode=HALT]
▼
HALTED
(npu_done = 1)
| State | Action |
|---|---|
| IDLE | Wait for start pulse from host |
| FETCH | Assert inst_rd_en, issue PC to IMEM |
| STALL | Latch instruction word into instr_r (1-cycle SRAM latency) |
| DECODE | Decode opcode/fields; generate exec_pulse on exit |
| EXECUTE | Drive the active functional unit; hold until unit_done |
| NEXT | Increment PC, swap ping-pong buffers if needed → back to FETCH |
| HALTED | Assert npu_done; stay until reset |
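The state table above can be exercised with a small behavioural model. This is a sketch under stated assumptions rather than a translation of CU.sv: HALT is assumed to be detected in DECODE, every non-HALT instruction is assumed to hold EXECUTE for a fixed `exec_cycles` (a hypothetical parameter), and ping-pong swapping is not modelled.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    FETCH = auto()
    STALL = auto()
    DECODE = auto()
    EXECUTE = auto()
    NEXT = auto()
    HALTED = auto()


NOP_OPCODE = 0b111110
HALT_OPCODE = 0b111111


def step(state, start, unit_done, opcode):
    """Return the next main-FSM state for one clock cycle."""
    if state is State.IDLE:
        return State.FETCH if start else State.IDLE
    if state is State.FETCH:
        return State.STALL
    if state is State.STALL:
        return State.DECODE
    if state is State.DECODE:
        # Assumption: HALT leaves the pipeline straight from DECODE.
        return State.HALTED if opcode == HALT_OPCODE else State.EXECUTE
    if state is State.EXECUTE:
        return State.NEXT if unit_done else State.EXECUTE
    if state is State.NEXT:
        return State.FETCH
    return State.HALTED


def run(program, exec_cycles=1):
    """Trace the states visited while running a HALT-terminated opcode list."""
    state, pc, waited = State.IDLE, 0, 0
    trace = [state]
    while state is not State.HALTED:
        unit_done = state is State.EXECUTE and waited + 1 >= exec_cycles
        nxt = step(state, True, unit_done, program[pc])
        if state is State.EXECUTE:
            waited = 0 if unit_done else waited + 1
        if state is State.NEXT:
            pc += 1  # PC increments on the way back to FETCH
        state = nxt
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(" -> ".join(s.name for s in run([NOP_OPCODE, HALT_OPCODE])))
```

A program without a HALT would run off the end of the list, which mirrors the hardware's reliance on HALT to terminate execution.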
[opcode=CONV & EXECUTE]
│
▼
CP_IDLE ──► CP_START ──► CP_LOAD_W ──────────────────► CP_FEED_A ──────────────► CP_WAIT
│ count cols 0…N-1 │ count rows 0…N-1 │
│ sa_valid_in = 1 │ sa_valid_in = 1 │
│ (load weights) │ (stream activations) │
└──[cnt == N-1]───────────────┘ │
[sa_done]─┘
│
CP_IDLE
[opcode=REQ & EXECUTE]
│
▼
RQ_IDLE ──► RQ_PULSE ──────────────► RQ_WAIT ──[req_done]──► RQ_IDLE
│ req_start pulses │
│ for SA_SIZE rows │
└──[cnt == SA_SIZE-1]────┘
[opcode=RELU & EXECUTE]
│
▼
RP_IDLE ──► RP_PULSE ──────────────► RP_WAIT ──[relu_done]──► RP_IDLE
│ relu_start pulses │
│ for SA_SIZE rows │
└──[cnt == SA_SIZE-1]────┘
The nanoNPU is wrapped inside npu_project_macro.sv, which fits the OpenFrame multi-project chip port convention. All host communication happens through 5 active GPIO pins on the bottom pads. All other GPIOs are tied to safe high-Z inputs.
| GPIO Pin | Direction | Signal | Description |
|---|---|---|---|
| gpio_bot[0] | IN | uart_rx | Host → NPU UART receive |
| gpio_bot[1] | OUT | uart_tx | NPU → Host UART transmit |
| gpio_bot[2] | OUT | locked | APB bus lock status |
| gpio_bot[3] | OUT | npu_done | NPU reached HALT instruction |
| gpio_bot[4] | OUT | done_processing | All instructions processed |
| gpio_bot[14:5] | - | unused | High-Z (safe input, no pull) |
| gpio_rt[8:0] | - | unused | High-Z (safe input, no pull) |
| gpio_top[13:0] | - | unused | High-Z (safe input, no pull) |
| Pin | oeb | dm[2:0] | Mode |
|---|---|---|---|
| BOT[0] uart_rx | 1 | 3'b001 | Input, no pull |
| BOT[1] uart_tx | 0 | 3'b110 | Strong push-pull output |
| BOT[2] locked | 0 | 3'b110 | Strong push-pull output |
| BOT[3] npu_done | 0 | 3'b110 | Strong push-pull output |
| BOT[4] done_processing | 0 | 3'b110 | Strong push-pull output |
| All others | 1 | 3'b001 | High-Z input, no pull |
| Signal | Source | Notes |
|---|---|---|
| clk | proj_clk_out from green macro | Gated system clock via ICG cell |
| reset_n | proj_reset_n_out from green macro | Held LOW when scan slot is disabled |
| por_n | Power-on reset | Unused by NPU directly |
The pin_order.cfg file constrains LibreLane's I/O placer to distribute the GPIO bundles around the four die edges, matching the OpenFrame multi-project chip physical contract:
| Die Edge | GPIO Group | Count |
|---|---|---|
| West | clk, reset_n, por_n | 3 |
| South | gpio_bot[14:0] + drive modes | 15 signal + 45 DM |
| East | gpio_rt[8:0] + drive modes | 9 signal + 27 DM |
| North | gpio_top[13:0] + drive modes | 14 signal + 42 DM |
The full RTL-to-GDSII flow was executed using LibreLane — the open-source RTL-to-GDSII orchestration framework — targeting the SkyWater SKY130 HD standard cell library. The flow was carried out as part of the Silicon Sprint 2026 workshop at AUC.
The LibreLane Classic flow was divided into two phases:
| Phase | Steps | Purpose |
|---|---|---|
| Signoff Prep | Fill insertion → RCX → Post-PnR STA → IR Drop | Electrical & timing verification |
| Physical Signoff | GDSII → DRC → LVS → XOR | Geometric & connectivity verification |
| Parameter | Value | Notes |
|---|---|---|
| Clock Period | 50 ns | 20 MHz |
| Die Area | 880 × 1031.66 µm | Fixed by multi-project chip contract |
| Core Utilization | 20% | Area-optimized |
| Synthesis Strategy | AREA 2 | Minimize cell count |
| Default Corner | max_ss_100C_1v60 | SS, 100°C, 1.6 V |
| Max Metal Layer | met4 | Routing constraint |
| Antenna Repair Iterations | 15 | Aggressive antenna mitigation |
| Post-GRT Design Repair | Enabled | Slew/cap fix after global routing |
After detailed routing, OpenRCX extracted RC parasitics from the physical geometry into three SPEF files (max/nom/min corners). Post-PnR STA then analysed all 9 PVT corners:
max_ss_100C_1v60 nom_ss_100C_1v60 min_ss_100C_1v60
max_tt_025C_1v80 nom_tt_025C_1v80 min_tt_025C_1v80
max_ff_n40C_1v95 nom_ff_n40C_1v95 min_ff_n40C_1v95
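The nine corner names are the cross product of three interconnect (RC) corners and three process/temperature/voltage points. A short sketch that regenerates the list, useful when scripting over per-corner report directories (the variable names are illustrative):

```python
from itertools import product

RC_CORNERS = ("max", "nom", "min")  # one SPEF file per RC corner
PVT_POINTS = ("ss_100C_1v60", "tt_025C_1v80", "ff_n40C_1v95")

# Same ordering as the listing above: one row per PVT point, max/nom/min across.
CORNERS = [f"{rc}_{pvt}" for pvt, rc in product(PVT_POINTS, RC_CORNERS)]

if __name__ == "__main__":
    for name in CORNERS:
        print(name)
```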
The initial post-route STA revealed Max Slew and Max Cap violations in the max_ss_100C_1v60 corner caused by overloaded driver outputs driving long nets. These were resolved using a Side Load Isolation ECO — sky130_fd_sc_hd__buf_4 cells were inserted after overloaded drivers via the INSERT_ECO_BUFFERS flow in config.json, absorbing the excessive capacitive load without disturbing the rest of the routed design. The ECO run started from the post-detailed-routing checkpoint, re-routed only the affected nets, then re-ran the full signoff prep to verify the fix.
Post-route signoff was performed at the worst-case slow corner (max_ss_100C_1v60). Full STA reports, DRC/LVS sign-off logs, and SPEF parasitic files are available in Final/.
| Check | Result | Detail |
|---|---|---|
| Setup Timing | ✅ Clean | Zero violations |
| Hold Timing | ✅ Clean | Zero violations |
| IR Drop (VPWR) | ✅ 0.01% | Well within < 2% signoff budget |
| IR Drop (VGND) | ✅ 0.01% | Well within < 2% signoff budget |
| DRC | ✅ 0 violations | SkyWater 130nm rule deck — Magic & KLayout |
| LVS | ✅ Circuits match uniquely | Physical layout ≡ synthesis netlist |
| XOR GDS | ✅ 0 differences | Magic vs. KLayout GDS agree exactly |
The primary validation use case for nanoNPU is a binary CNN classifier distinguishing normal chest X-rays from pneumonia cases, derived from the publicly available Guangzhou Women and Children's Medical Center dataset.
| Property | Detail |
|---|---|
| Source | Guangzhou Women and Children's Medical Center |
| Images | 5,863 JPEG chest X-rays (anterior-posterior) |
| Classes | NORMAL / PNEUMONIA |
| Splits | Train / Validation / Test |
| Patient Age | 1–5 years |
| License | CC BY 4.0 |
| Citation | Cell 2018 — Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning |
A bit-accurate Python model of the inference pipeline (including INT8 quantization) is provided under Python Modeling/ and can be run directly in the browser through the Google Colab demo.
The notebook (Chest_X_Ray_Images_CNN.ipynb) covers:
- Data loading & preprocessing
- CNN architecture definition
- INT8 quantization-aware training
- Weight export in a format compatible with the NPU's SRAM layout
├── Backend/ # Physical design configurations & flow scripts
│ └── openlane/ # Winning run configuration
│ ├── RTL/ # Flattened SystemVerilog for synthesis
│ ├── config.json # LibreLane parameters (clock, area, antenna rules)
│ ├── pnr.sdc # Place-and-route timing constraints
│ ├── signoff.sdc # Final signoff timing constraints
│ └── fixed_dont_change/ # Fixed DEF template (multi-project contract)
│
├── Final/ # Post-route tapeout deliverables
│ ├── final/
│ │ ├── gds/ # npu_project_macro.gds ← manufacturing-ready
│ │ ├── lef/ # Macro abstract view
│ │ ├── spef/ # Multi-corner parasitics (max/min/nom)
│ │ ├── lib/ # Timing libraries (9 PVT corners)
│ │ ├── sdf/ # SDF for back-annotated simulation
│ │ └── render/ # GDS layout render PNG
│ └── max_ss_100C_1v60/ # Worst-case corner STA & power reports
│
├── FPGA/ # XDC constraints for FPGA prototype
│
├── Python Modeling/
│ ├── npu_modeling.py # Bit-accurate NPU software model
│ └── Uart APB/uart_apb.py # UART–APB host-side script
│
├── RTL/ # RTL source organized by functional unit
│ ├── npu_system_top.sv # System top (UART + APB + NPU)
│ ├── npu_top.sv # NPU core top
│ ├── Npu_apb_decoder.sv # APB address decoder
│ ├── Systolic Array/ # PE.sv, SA_NxN.sv, SA_NxN_top.sv, …
│ ├── Control Unit/ # CU.SV, SA_CU.sv
│ ├── Clock_Gating_Cell/ # ICG cell for low-power clock gating
│ ├── Buffers/ # Ping-pong, bias, acc, relu, preq buffers
│ ├── ReLU/ # relu_unit.sv, ReLU.sv
│ ├── Bias_Adding_Unit/ # bias_adder.sv
│ ├── Store_Engine/ # store_engine.sv
│ ├── SRAM/ # RAM models (64×32, 128×32, 256×32)
│ ├── Req/ # Requantization unit
│ ├── MUX/ # mux2x1.sv, mux4x1.sv
│ └── UART APB/ # UART + APB master/splitter
│
└── Testbench/
├── tb_npu_system.sv # Full-system testbench
├── tb_npu_system_4x4.sv # 4×4 configuration testbench
├── tb_npu_top.sv # NPU core testbench
├── do_npu.do # ModelSim/QuestaSim do-file
└── Components Testing/ # Unit-level testbenches (PE, SA, ReLU, REQ, …)
The full verification suite is located in Testbench/ and was run using QuestaSim / ModelSim. It consists of a system-level randomized testbench (tb_npu_system.sv) that drives the complete hardware stack end-to-end, from UART bytes in, through the APB bus, to SRAM readback, and compares every output byte against a built-in bit-accurate golden model.
The testbench (tb_npu_system.sv) instantiates the full npu_system_top DUT and drives it entirely through bit-banged UART transactions, exactly as a real host would:
- APB write tasks: serialise address + data into UART frames using the 0xDEADA5 write magic header
- APB read tasks: issue reads using the 0xDEAD5A read magic header and capture the 4-byte response
- Memory loaders: pack 8-bit tile data into 32-bit SRAM words and write them row by row
- Instruction programmer: encodes and writes all ISA instructions (LOAD_ACT, LOAD_WGT, LOAD_BIAS, LOAD_SCL, CONV, ADD_BIAS, REQ, ReLU, NOP, STORE, HALT) into instruction memory via APB
- Golden model: a SystemVerilog function computes the expected INT8 output for every output cell:
  - Matrix multiply (INT8 × INT8 → INT32 accumulate)
  - Bias addition (INT32 + INT32)
  - Requantization: (acc × M0) >> n with INT8 saturation clamp (−128 … +127)
  - Optional ReLU: max(0, x)
- Checker: reads back every output row from SRAM and compares against the golden model, printing PASS/FAIL per row word
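The memory-loader step packs four INT8 values into each 32-bit SRAM word, so an 8-element row occupies two words. The sketch below shows one way to do that packing on the host side; the byte order (element 0 in bits [7:0]) and the function names are assumptions, since the testbench's actual lane order is not stated here.

```python
def pack_row(row):
    """Pack signed 8-bit values into 32-bit words, four per word.

    Assumed lane order: element 0 occupies bits [7:0] of the first word.
    """
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    words = []
    for base in range(0, len(row), 4):
        word = 0
        for lane, value in enumerate(row[base:base + 4]):
            if not -128 <= value <= 127:
                raise ValueError(f"{value} is outside the INT8 range")
            word |= (value & 0xFF) << (8 * lane)  # two's-complement byte
        words.append(word)
    return words


def unpack_word(word):
    """Inverse of pack_row for a single 32-bit word."""
    values = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        values.append(byte - 256 if byte >= 128 else byte)
    return values


if __name__ == "__main__":
    row = [1, 2, 3, 4, -1, -2, -3, -4]
    print([f"{w:08X}" for w in pack_row(row)])
```

With this layout an 8×8 INT8 tile occupies 16 words, matching the 8 rows × 2 words read back per test case.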
| TC | Name | Activations | Weights | Bias | ReLU | Shift |
|---|---|---|---|---|---|---|
| TC1 | Full Pipeline — Mixed Random | Rand −20 … +20 | Rand −20 … +20 | Rand −500 … +500 | ✅ Yes | 2 |
| TC2 | No ReLU — Mixed Random | Rand −50 … +50 | Rand −50 … +50 | Rand −1000 … +1000 | ❌ No | 4 |
| TC3 | Positive Data Only | Rand 1 … +30 | Rand 1 … +30 | Rand 0 … +1000 | ❌ No | 3 |
| TC4 | Negative Clamping — Extreme Bias | Rand −10 … +10 | Rand −10 … +10 | Rand −8000 … +8000 | ✅ Yes | 0 |
Each test case resets the chip, loads randomised data into SRAM via UART, programs the full instruction sequence, starts the NPU, polls for npu_done, then reads back all 16 output words (8 rows × 2 words) and checks them against the golden model. All data is re-randomised each run using $urandom_range, so every simulation run is a unique test vector.
All 4 test cases passed in QuestaSim with 64/64 output word comparisons correct across every run:
============================================================
STARTING RANDOMIZED SYSTEM TESTS
============================================================
[TC1] Full Pipeline (Mixed Random Data + RELU)
[TC2] No ReLU Pipeline (Mixed Random Data)
[TC3] Positive Data Only (No ReLU)
[TC4] Negative Clamping Test (Mixed Data + RELU)
============================================================
FINAL RESULTS: 64 PASSED, 0 FAILED
*** SUCCESS: ALL RANDOMIZED TESTS PASSED ***
============================================================
- RTL Simulation: ModelSim / QuestaSim / Icarus Verilog / Verilator
- Physical Design: LibreLane with SkyWater 130nm PDK (installed via Nix)
- Python Modeling: Python 3.9+, TensorFlow/PyTorch, NumPy, pyserial
# Run the full randomized system testbench (TC1–TC4) in QuestaSim
vsim -voptargs=+acc work.tb_npu_system
run -all
# Or using the provided do-file
vsim -do do_npu.do
# Individual unit tests
vsim -sv work.PE_tb
vsim -sv work.SA_NxN_top_tb
vsim -sv work.ReLU_TB
See the Verification section for a full description of each test case.
First, enter the Nix shell:
nix-shell --pure ~/librelane/shell.nix
Full flow (synthesis through signoff):
cd Backend/openlane
librelane config.json --run-tag npu_run
Signoff prep only (fill → RCX → STA → IR drop):
librelane config.json \
--run-tag npu_run \
--from OpenROAD.FillInsertion \
--to OpenROAD.IRDropReport \
--with-initial-state runs/npu_run/<step>-checker-wirelength/state_out.json
ECO re-run (insert ECO buffers, re-route, re-signoff):
librelane config.json \
--run-tag npu_run_eco \
--from Odb.InsertECOBuffers \
--to OpenROAD.IRDropReport \
--with-initial-state runs/npu_run/<step>-openroad-detailedrouting/state_out.json
Physical signoff (GDSII → DRC → LVS):
librelane config.json \
--run-tag npu_run_eco \
--from Magic.StreamOut \
--with-initial-state runs/npu_run_eco/<step>-openroad-irdropreport/state_out.json
The config.json already contains all optimized parameters from the successful tapeout run, including INSERT_ECO_BUFFERS entries and antenna repair settings, so it can be used as a drop-in configuration.
# Load weights and run inference on connected hardware
python "Python Modeling/Uart APB/uart_apb.py" \
--port /dev/ttyUSB0 \
--baud 115200 \
--model weights_int8.npy
All signoff artifacts are under Final/. Key results at worst-case corner (max_ss_100C_1v60):
| Metric | Value |
|---|---|
| Technology | SkyWater SKY130 130nm |
| Die Area | 880 × 1031.66 µm |
| Clock Frequency | 20 MHz |
| Setup WNS (worst corner) | +0.0952 ns |
| Hold WNS (worst corner) | +9.0258 ns |
| IR Drop VPWR / VGND | 0.05% / 0.05% |
| DRC | ✅ 0 violations |
| LVS | ✅ Circuits match uniquely |
| XOR GDS | ✅ 0 differences |
Full reports: Final/drc.magic.rpt, Final/lvs.netgen.rpt, Final/sta_summary.rpt.
The nanoNPU is a working taped-out baseline. The following roadmap outlines planned and suggested improvements:
The POOL opcode is defined in the ISA but the hardware pooling unit is not yet implemented in the datapath. Planned additions:
- Max Pooling — 2×2 sliding window, stride 2 (already partially handled in the ISA encoding)
- Average Pooling — 2×2 sum-and-shift, required for MobileNet-style global average pool layers before the final classifier
The 6-bit opcode field has room for 51 additional instructions beyond the 13 opcodes already assigned. Suggested additions:
Memory & Compute:
| Proposed Instruction | OP Code | Operation |
|---|---|---|
| LOAD_ACT_WGT_BIAS | 010001 | Load activations, weights, and bias in one pass |
| CONV_BIAS_REQ | 010010 | Fused CONV + ADD_BIAS + REQ in a single instruction |
| DEPTHWISE_CONV | 010011 | Depthwise separable convolution (MobileNet layer) |
| TRANSPOSE | 010110 | In-place tile transpose for weight reuse |
| ZERO_PAD | 010111 | Zero-pad activation tile edges for convolution |
| LOOP | 011000 | Repeat next N instructions K times (software loop) |
Activation Functions:
| Proposed Instruction | OP Code | Operation |
|---|---|---|
| LEAKY_RELU | 010100 | max(αx, x), with α stored in scale reg |
| CLIP | 010101 | Clamp to [min, max] range (ReLU6 etc.) |
| SIGMOID | 011001 | 1 / (1 + e^−x), INT8 LUT approximation |
| SOFTMAX | 011010 | e^xᵢ / Σe^xⱼ over output tile, for the final classification layer |
| SWISH | 011011 | x · sigmoid(x), used in EfficientNet / MobileNetV3 |
| GELU | 011100 | x · Φ(x), used in BERT, GPT, ViT transformer blocks |
| HARD_SWISH | 011101 | x · ReLU6(x+3) / 6, a hardware-friendly Swish approximation |
| ELU | 011110 | x if x > 0 else α(e^x − 1), with α stored in scale reg |
💡 Non-linear activations (Sigmoid, Softmax, Swish, GELU) are expensive to compute exactly in fixed-point hardware. The practical implementation strategy is a 256-entry INT8 lookup table (LUT) stored in a dedicated SRAM tile — the input byte is used as the address and the output byte is the pre-computed activation value. This keeps the hardware simple (one SRAM read per element) while supporting any smooth activation function at the cost of a small SRAM tile.
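As an illustration of the lookup-table strategy, the sketch below precomputes a 256-entry sigmoid table indexed by the raw input byte. The input and output scales (1/16 and 1/128) and the function names are assumptions chosen for the example; a real deployment would take them from the quantized model.

```python
import math


def build_sigmoid_lut(in_scale=1 / 16, out_scale=1 / 128):
    """Return 256 INT8 outputs, addressed by the input byte 0..255."""
    lut = []
    for code in range(256):
        q = code - 256 if code >= 128 else code  # reinterpret address as signed INT8
        y = 1.0 / (1.0 + math.exp(-q * in_scale))
        lut.append(max(-128, min(127, round(y / out_scale))))
    return lut


def apply_lut(lut, values):
    """One table read per element: the low 8 bits of the value are the address."""
    return [lut[v & 0xFF] for v in values]


if __name__ == "__main__":
    table = build_sigmoid_lut()
    print(apply_lut(table, [-128, -16, 0, 16, 127]))
```

The same table builder works for any element-wise function; Softmax would additionally need a sum and a division across the tile, which a per-element table does not cover.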
A clock gating cell (Clk_Gating_Cell) is already present in the RTL (RTL/Clock_Gating_Cell) but not yet integrated into the datapath. Planned techniques:
- Fine-grained clock gating — gate each functional unit (SA, bias adder, req unit, ReLU) independently when idle using the existing ICG cell
- Operand isolation — insert isolation cells on datapath inputs when a unit is clock-gated to prevent spurious switching power
- Multi-voltage islands — run the SRAM and I/O ring at a lower supply voltage than the compute core
- Power gating — use header/footer cells to fully cut power to idle units between inference runs
Currently, NPU programs are hand-assembled by encoding binary instruction words manually in the Python host script. A proper toolchain would include:
- Assembler: text LOAD_ACT 0x00 0x10 → 32-bit binary, with label support and a symbol table
- Linker: resolve tile address references and the SRAM memory map automatically
- Compiler backend: accept a quantized ONNX model and emit a .npu binary by tiling the weight matrices, generating the LOAD/CONV/BIAS/REQ/RELU/STORE instruction sequence automatically, and packing weights into the SRAM binary image
Scaling the systolic array from 8×8 to 16×16 increases peak compute from 64 to 256 INT8 MACs/cycle (4×), or from 128 to 512 INT8 operations/cycle when each multiply and add is counted separately, enabling faster inference on larger CNN layers. Required changes:
- Widen ping-pong buffers from 8 rows to 16 rows
- Increase SRAM depth to accommodate 16-row tiles
- Widen the n_scale field in the CONV instruction format (currently 5 bits, supports up to shift-31)
- Re-run floorplanning and P&R; die area will grow significantly at 20% utilization on SKY130
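The scaling claim above is simple arithmetic on the PE count. A quick sketch, assuming one MAC per PE per cycle, full array utilization, and an unchanged 20 MHz clock (none of which is guaranteed for a larger array):

```python
CLOCK_HZ = 20_000_000  # 20 MHz target clock


def peak_macs_per_cycle(n):
    """An n x n systolic array has n*n PEs, each doing one MAC per cycle."""
    return n * n


def peak_ops_per_cycle(n):
    """Count the multiply and the add of each MAC as separate operations."""
    return 2 * peak_macs_per_cycle(n)


def peak_macs_per_second(n, clock_hz=CLOCK_HZ):
    return peak_macs_per_cycle(n) * clock_hz


if __name__ == "__main__":
    for n in (8, 16):
        print(n, peak_macs_per_cycle(n), peak_ops_per_cycle(n),
              peak_macs_per_second(n) / 1e9, "GMAC/s")
```

These are upper bounds; sustained throughput also depends on how often the array waits for loads over the SRAM port.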
The current nanoNPU is fully INT8 — all activations, weights, and accumulations are fixed-point integers, which requires the model to be quantized before deployment. A natural evolution is a floating-point NPU that can run any standard FP32 or FP16 inference model directly without any quantization step:
- FP16 (IEEE 754 half-precision): the practical target for edge silicon; FP16 MACs are ~4× the area of INT8 MACs but remove the INT8 quantization step and its accuracy loss, leaving only FP16 rounding error. FP16 is widely used for embedded GPU and NPU inference (e.g. ARM Mali, Apple Neural Engine)
- BF16 (Brain Float 16): same 8-bit exponent as FP32, 7-bit mantissa; wider dynamic range than FP16 in the same 16 bits of storage, at the cost of precision. Used in Google TPUs and many modern AI accelerators
- FP32 (full precision) — highest accuracy, largest area; suitable for training accelerators or high-accuracy medical inference where quantization error is unacceptable
- Mixed-precision — FP16 activations with FP32 accumulators (the standard in modern GPU tensor cores), giving full numerical stability without full FP32 memory bandwidth
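The range-versus-precision trade-off between FP16 and BF16 in the list above can be seen by rounding a few values through each format. This is a sketch using only the standard library: FP16 goes through `struct`'s half-precision format, while BF16 is emulated by truncating a float32 to its top 16 bits (truncation rather than round-to-nearest is a simplification).

```python
import struct


def to_fp16(x):
    """Round x to IEEE 754 half precision; values beyond its range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x):
    """Emulate BF16 by keeping the sign, 8 exponent and 7 mantissa bits of float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (1.001, 100000.0):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

FP16 keeps the small increment in 1.001 but overflows at 100000, while BF16 represents 100000 approximately and loses the increment.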
Key hardware changes required:
- Replace the INT8 PE multiplier with an IEEE 754 FP16/BF16 multiply-add unit
- Replace the INT32 accumulator with an FP32 accumulator to limit the rounding error that builds up over long dot products
- Remove the requantization unit (REQ instruction becomes unnecessary)
- Widen the SRAM datapath from 8-bit to 16-bit per element (doubles memory bandwidth requirement)
- Add a FP-to-INT8 conversion unit at the output for systems that mix FP inference with INT8 I/O
💡 A possible intermediate step is posit arithmetic (Type III unum), an alternative floating-point format that aims for better accuracy per bit than IEEE 754 at 16 bits and is an active topic in neuromorphic and edge-AI ASIC research.
- Batch normalization folding — fold BN parameters into the bias and scale registers at compile time, eliminating the need for a separate BN layer in hardware
- Sparse weight skipping — add a zero-detector in the systolic array PE to skip MAC operations when the weight is zero, reducing dynamic power on pruned models
- On-chip DMA — replace the UART-driven APB loader with a DMA engine that bursts weights from an external SPI flash, enabling standalone inference without a host PC
- RISC-V integration — replace the custom ISA CU with a small RISC-V core (e.g. PicoRV32) running firmware, using the systolic array as a memory-mapped accelerator
- Multi-chip tiling — add an inter-chip link to chain multiple nanoNPU dies together for larger model inference across chips
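The batch normalization folding item above relies on a standard identity: applying BN after a linear layer equals a linear layer with rescaled weights and an adjusted bias. The sketch below checks that identity in floating point, before quantization; the function names are illustrative, and mapping the folded scale onto the M0 and n requantization parameters is not shown.

```python
import math


def dense(weights, bias, x):
    """y[c] = sum_k weights[c][k] * x[k] + bias[c]; one weight row per output channel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel inference-time batch normalization."""
    return [g * (v - mu) / math.sqrt(s2 + eps) + be
            for v, g, be, mu, s2 in zip(y, gamma, beta, mean, var)]


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights', bias') such that dense(weights', bias', x) == BN(dense(...))."""
    folded_w, folded_b = [], []
    for row, b, g, be, mu, s2 in zip(weights, bias, gamma, beta, mean, var):
        s = g / math.sqrt(s2 + eps)          # per-channel scale
        folded_w.append([s * w for w in row])
        folded_b.append(s * (b - mu) + be)   # shifted and rescaled bias
    return folded_w, folded_b


if __name__ == "__main__":
    W, b = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
    bn = dict(gamma=[2.0, 0.5], beta=[0.3, 0.0], mean=[1.0, -1.0], var=[4.0, 0.25])
    x = [3.0, -2.0]
    print(batchnorm(dense(W, b, x), **bn))
    print(dense(*fold_batchnorm(W, b, **bn), x))
```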
| Name | GitHub |
|---|---|
| Ammar Wahidi | @Ammar-Wahidi |
| Omar Mohamed Eid | @OmarEid66 |
| Mohamed Ahmed | @mhmd-ahmdezz |
| Name | Role | GitHub |
|---|---|---|
| Amr Wahidi | CNN model, training & inference | @amr10w |
| Ammar Wahidi | INT8 quantization | @Ammar-Wahidi |
- Intel FPGA-NPU: High-performance NPU reference architecture on FPGA. https://github.com/intel/fpga-npu
- Superscalar Out-of-Order NPU on FPGA: Yuqiang Ge, Kapinesh Govindaraju, Sona Susan Jacob (ECE5760, Cornell University, Spring 2024). https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/s2024/yg585_kg534_sj778/
- UART-APB: Dr. Mohamed Shalan (American University in Cairo). https://github.com/shalan
- Kermany, D. et al. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), 1122–1131. https://doi.org/10.1016/j.cell.2018.02.010 Dataset: https://data.mendeley.com/datasets/rscbjbr9sj/2 (CC BY 4.0)
- SkyWater SKY130 PDK
- LibreLane — RTL-to-GDSII orchestration framework
- OpenROAD — Placement, CTS, routing, STA
- Magic VLSI — GDSII stream-out, DRC, SPICE extraction
- Netgen LVS — Layout vs. Schematic verification
- KLayout — GDSII stream-out, DRC, XOR verification
- OpenRCX — Parasitic RC extraction
Made at the American University in Cairo · Silicon Sprint 2026
Apache 2.0 License — see LICENSE

