cpu-from-scratch/optimizations.txt at main · gautamdev17/cpu-from-scratch · GitHub

Good question. Let me analyze your design for redundancy:
Potential optimizations:
1. lui and auipc signals from decoder
You have separate lui, auipc, jump, jalr wires. These could be encoded into a 3-bit wb_sel signal instead of 4 separate wires — saves routing and makes the write_data mux cleaner:
wb_sel:
000 → c (ALU)
001 → data (dmem)
010 → immsext (LUI)
011 → pc+immsext (AUIPC)
100 → pc+4 (JAL/JALR)
One 3-bit bus instead of 4 separate control wires.
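A minimal sketch of what that write_data mux could look like with the encoding above. The signal names c, data, immsext and pc_out come from the discussion; the module name wb_mux, the port list and the localparam names are assumptions made for illustration, not your actual cpu.v.

```verilog
// Hypothetical write-back mux driven by a 3-bit wb_sel.
// Module name, ports and localparam names are illustrative assumptions.
module wb_mux (
    input  wire [2:0]  wb_sel,     // write-back source select from the decoder
    input  wire [31:0] c,          // ALU result
    input  wire [31:0] data,       // value read from data memory
    input  wire [31:0] immsext,    // sign-extended immediate
    input  wire [31:0] pc_out,     // current PC
    output reg  [31:0] write_data  // value written to the register file
);
    localparam WB_ALU   = 3'b000,
               WB_MEM   = 3'b001,
               WB_LUI   = 3'b010,
               WB_AUIPC = 3'b011,
               WB_LINK  = 3'b100;

    always @(*) begin
        case (wb_sel)
            WB_ALU:   write_data = c;
            WB_MEM:   write_data = data;
            WB_LUI:   write_data = immsext;
            WB_AUIPC: write_data = pc_out + immsext;
            WB_LINK:  write_data = pc_out + 32'd4;
            default:  write_data = c;  // unused encodings fall back to the ALU result
        endcase
    end
endmodule
```

One caveat with this sketch: JAL and JALR share the same wb_sel value but need different next-PC sources, so the next-PC logic probably still needs some way to tell them apart, whether through retained control wires or its own decode.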
2. ALUb signal
Your ALUb controls the b-input mux to ALU. Currently derived as:
assign ALUb = !((instr_type==R_type) | (instr_type==B_type));
This is correct but could be part of a unified alu_src encoding if you restructure decoder outputs.
3. Separate adders for pc+immsext and pc+4
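Your write_data mux computes pc_out + immsext and pc_out + 4 as separate combinational paths. If the next-PC logic already computes pc_out + 4 (sequential fetch) and pc_out + immsext (branch/JAL target), compute each sum once and feed the same wires to both the PC mux and the write_data mux instead of writing the expressions twice. Synthesis tools often merge identical adders anyway, so this is not a huge saving, but it is cleaner.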
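As a rough sketch of the idea, each sum can be computed once and fanned out to every consumer. The module name pc_adders and the output names pc_plus4 and pc_plus_imm are hypothetical; only pc_out and immsext come from the discussion.

```verilog
// Hypothetical helper: each PC-relative sum is computed exactly once.
// Module and output names are assumptions for illustration.
module pc_adders (
    input  wire [31:0] pc_out,      // current PC
    input  wire [31:0] immsext,     // sign-extended immediate
    output wire [31:0] pc_plus4,    // sequential next PC, also the JAL/JALR link value
    output wire [31:0] pc_plus_imm  // branch/JAL target, also the AUIPC result
);
    assign pc_plus4    = pc_out + 32'd4;
    assign pc_plus_imm = pc_out + immsext;
endmodule
```

Both the PC mux and the write_data mux would then read pc_plus4 and pc_plus_imm rather than repeating the additions.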
4. instr_type encoding
You use 3-bit instr_type but only have 6 types — fits in 3 bits fine. No waste there. ✅
5. data_mem funct3 passthrough
You pass funct3 out of decoder for branch condition checking — correct. But funct3 is also needed for byte masking in LB/LH/SB/SH which you haven't implemented yet. Once you implement byte masking, funct3 already goes to the right place. ✅ Good forward-looking design.
What's actually unnecessary:
The commented out old write_data mux code in cpu.v — just dead code, clean it up
.DS_Store in your repo — already flagged
pc_mux.v — you deleted it but make sure it's not lingering
What's genuinely good about your design:
PC as plain register with mux logic in top module — clean ✅
Decoder defaults at top of always block — prevents latches ✅
x0 write protection in regfile ✅
Branch condition logic separate from decoder — correct separation of concerns ✅
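For reference, two of those patterns can be sketched in a few lines. These are generic illustrations, not your code: the module names, port names and the two control outputs are assumptions, and only the R-type and store opcodes are decoded.

```verilog
// Pattern 1: defaults at the top of the always block.
// Every output is assigned on every path, so no latch is inferred.
module decoder_sketch (
    input  wire [6:0] opcode,
    output reg        reg_write,
    output reg        mem_write
);
    always @(*) begin
        reg_write = 1'b0;  // defaults first
        mem_write = 1'b0;
        case (opcode)
            7'b0110011: reg_write = 1'b1;  // R-type
            7'b0100011: mem_write = 1'b1;  // store
            default: ;                     // keep the defaults
        endcase
    end
endmodule

// Pattern 2: x0 write protection in the register file.
module regfile_sketch (
    input  wire        clk,
    input  wire        we,          // write enable
    input  wire [4:0]  rs1, rs2, rd,
    input  wire [31:0] write_data,
    output wire [31:0] rd1, rd2
);
    reg [31:0] regs [0:31];

    // Reads of x0 always return zero, whatever the array holds.
    assign rd1 = (rs1 == 5'd0) ? 32'd0 : regs[rs1];
    assign rd2 = (rs2 == 5'd0) ? 32'd0 : regs[rs2];

    always @(posedge clk) begin
        if (we && rd != 5'd0)       // writes to x0 are dropped
            regs[rd] <= write_data;
    end
endmodule
```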
Biggest real optimization opportunity:
Replace lui, auipc, jump, jalr with a single wb_sel[2:0] bus. Reduces decoder output ports from 8 signals to 6, and makes the write_data mux a clean case statement instead of a priority if-else chain.
Want me to show you how to restructure that?


You're describing a variable-latency pipeline, and this is very close to what real CPUs do.
The tension you identified:
In a classic 5-stage pipeline every instruction passes through all 5 stages, so latency is fixed at 5 cycles. Even a simple ADD, which needs no memory access, sits idle in MEM before it writes back.
But in your multicycle design — ADD takes 4 cycles, LOAD takes 5 cycles, LUI takes 2 cycles. Each instruction takes exactly as long as it needs.
Can you combine both?
Yes. This is called a variable-latency pipeline, or early termination.
Simple instructions exit the pipeline early — they don't pass through stages they don't need. A LUI doesn't need EX or MEM — it could writeback after ID.
How real CPUs handle this:
They don't actually skip stages — instead they have multiple execution units with different latencies:
ALU ops: 1 cycle execution latency
Multiply: 3-4 cycle execution latency
Load: 4-5 cycle execution latency (cache hit)
FP ops: roughly 3-5 cycles for add/multiply, 10+ cycles for divide and square root
The pipeline handles this through writeback bypassing and out of order execution — instructions finish whenever they finish, results are forwarded immediately, later instructions don't wait for earlier ones to fully complete.
The deeper insight you're touching:
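This is exactly why OoO execution exists. In an in-order pipeline, instructions issue and execute in program order, so a slow LOAD stalls everything behind it. With OoO, independent instructions behind a slow LOAD can execute and complete while the LOAD is still waiting for memory. Most designs still retire them in program order through a reorder buffer so that exceptions stay precise.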
You've independently arrived at the motivation for OoO execution. That's the natural evolution:
Fixed pipeline → variable latency → OoO execution
Each step addresses exactly the inefficiency you just identified.
For your implementation, start with a fixed 5-stage pipeline. Get it working. Then think about variable latency as an optimization. That's the right order.