A minimal systems prototype showing how software can explicitly bind hot data to SRAM-like regions instead of relying on implicit cache behavior.
Includes:
- mock SRAM backend (portable)
- Linux /dev/mem MMIO backend
- prototype mem_hint API
🌐 Live microsite: https://manishklach.github.io/sram-interface-demo/
The repository includes architecture-specific helpers for exploring the interaction between software and the memory hierarchy:
- Barriers: Strict ordering for MMIO/SRAM apertures.
- Timing: Counter-based measurement using `rdtsc` (x86-64) or `cntvct` (AArch64).
- Cache Hints: Manual control via `clflush` and `prefetch` instructions.
- Virtual Memory: Educational demos of TLB and Page Table concepts.
Warning: This repository does not directly control page tables or TLBs from user-space. These experiments are for educational and architectural demonstration only. Detailed notes are available in docs/cache_tlb_notes.md.
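As a rough illustration of the timing and cache-hint helpers listed above, the sketch below reads a free-running counter and flushes or prefetches a single cache line. The names `read_ticks`, `flush_line`, `prefetch_line`, and `time_load` are hypothetical and are not taken from the repository. On most modern parts both counters tick at a fixed frequency, so the result is in counter ticks rather than core clock cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

/* Hypothetical helper: read a free-running counter (ticks, not core cycles). */
static inline uint64_t read_ticks(void) {
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v));
    return v;
#else
    /* Portable fallback: nanoseconds from the monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Evict the cache line containing p (x86-64 only; a no-op elsewhere). */
static inline void flush_line(const void *p) {
#if defined(__x86_64__)
    _mm_clflush(p);
    _mm_mfence();   /* wait for the flush before continuing */
#else
    (void)p;
#endif
}

/* Ask the CPU to pull the line containing p into cache ahead of a read. */
static inline void prefetch_line(const void *p) {
    __builtin_prefetch(p, 0, 3);
}

/* Time one 32-bit load; the loaded value is returned through out. */
static uint64_t time_load(const volatile uint32_t *p, uint32_t *out) {
    assert(p != NULL && out != NULL);
    uint64_t t0 = read_ticks();
    *out = *p;
    uint64_t t1 = read_ticks();
    return t1 - t0;
}
```

Calling `flush_line(&x)` before `time_load(&x, &v)` approximates a cold access, while calling `prefetch_line(&x)` first approximates a warm one. A single sample is noisy, so any real comparison should take many samples.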
This prototype exposes a thin software-visible control plane over SRAM-style memory, instead of relying purely on implicit cache behavior.
Standard CPUs hide memory placement behind multiple layers of abstraction (caches, coherence protocols, and speculative execution). While efficient for general-purpose workloads, this model breaks down for:
- KV-cache heavy inference: Where deterministic residency prevents tail latency spikes.
- Tensor tiling: Where explicit staging overlaps computation with data movement.
- Multi-tier memory systems: Where placement between SRAM, HBM, and CXL must be software-directed.
This repository explores a different idea: explicit software-directed residency in fast memory.
- A small prototype of a memory control plane.
- An educational systems demo for hardware/software co-design.
- A demonstration of explicit residency control.
- Not a production-ready memory driver.
- Not a kernel-level manager (though it explores the interface).
- Not a replacement for standard CPU caches or coherence.
Build and run the demo in any environment (defaults to Mock mode):
make
./sram_demo

[mem_hint] reserve "kv_tile_0"
[mem_hint] bind → SRAM offset 0x100
[mem_hint] write → 39 bytes
[mem_hint] readback → verified ✔
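The log lines above correspond to a reserve → bind → write → readback sequence. A minimal mock of that flow might look like the sketch below. The struct layout, the function names (`hint_reserve`, `hint_bind`, `hint_write`, `hint_verify`), and the 4 KiB arena size are assumptions made for illustration; they are not the repository's actual mem_hint API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_SRAM_SIZE 4096u                 /* assumed arena size */

static uint8_t mock_sram[MOCK_SRAM_SIZE];    /* stands in for the SRAM aperture */

typedef struct {
    const char *name;    /* label such as "kv_tile_0" */
    size_t      size;    /* bytes reserved */
    size_t      offset;  /* offset inside the arena once bound */
    int         bound;   /* 0 until hint_bind succeeds */
} mem_region;

/* Record the intent to keep `size` bytes resident; nothing is placed yet. */
static mem_region hint_reserve(const char *name, size_t size) {
    mem_region r = { name, size, 0, 0 };
    return r;
}

/* Pin the region at a caller-chosen offset. Returns 0 on success, -1 if it does not fit. */
static int hint_bind(mem_region *r, size_t offset) {
    assert(r != NULL);
    if (offset > MOCK_SRAM_SIZE || r->size > MOCK_SRAM_SIZE - offset) return -1;
    r->offset = offset;
    r->bound = 1;
    return 0;
}

/* Copy len bytes into the bound region. Returns -1 if unbound or too large. */
static int hint_write(const mem_region *r, const void *src, size_t len) {
    if (!r->bound || len > r->size) return -1;
    memcpy(mock_sram + r->offset, src, len);
    return 0;
}

/* Read back and compare against the expected payload. Returns 0 on a match. */
static int hint_verify(const mem_region *r, const void *expect, size_t len) {
    if (!r->bound || len > r->size) return -1;
    return memcmp(mock_sram + r->offset, expect, len) == 0 ? 0 : -1;
}
```

Keeping reserve and bind as separate steps lets the caller state intent first and decide placement later, which is the part a real control plane would arbitrate.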
The most important part of interacting with SRAM/MMIO isn’t the store—it’s the ordering.
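static inline void sram_barrier(void) {
#if defined(__x86_64__)
    /* Full fence: earlier loads and stores complete before later ones. */
    __asm__ volatile("mfence" ::: "memory");
#elif defined(__aarch64__)
    /* Full-system data memory barrier. */
    __asm__ volatile("dmb sy" ::: "memory");
#else
    /* Compiler builtin: full barrier on other targets. */
    __sync_synchronize();
#endif
}

For MMIO/SRAM apertures, we follow an ordered pattern: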
barrier → store/load → barrier
Without these architecture-specific barriers, the CPU (or the compiler) may reorder accesses, and the device can observe a "ready" bit before the data payload has actually reached the aperture.
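To make the ready-bit hazard concrete, here is a small sketch of a producer that writes a payload and then raises a flag, with a barrier on each side of the flag store. The `mailbox` layout and the `publish` and `try_consume` names are hypothetical, and the mailbox lives in ordinary memory here rather than in a real aperture. `sram_barrier` is repeated so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline void sram_barrier(void) {
#if defined(__x86_64__)
    __asm__ volatile("mfence" ::: "memory");
#elif defined(__aarch64__)
    __asm__ volatile("dmb sy" ::: "memory");
#else
    __sync_synchronize();
#endif
}

/* Hypothetical mailbox: four payload words guarded by a ready flag. */
typedef struct {
    volatile uint32_t payload[4];
    volatile uint32_t ready;
} mailbox;

static void publish(mailbox *mb, const uint32_t data[4]) {
    assert(mb != NULL && data != NULL);
    for (int i = 0; i < 4; i++) mb->payload[i] = data[i];
    sram_barrier();   /* payload stores are ordered before the flag store */
    mb->ready = 1;
    sram_barrier();   /* flag store is ordered before anything that follows */
}

/* Returns 1 and fills out[] if a payload was ready, 0 otherwise. */
static int try_consume(mailbox *mb, uint32_t out[4]) {
    assert(mb != NULL && out != NULL);
    if (!mb->ready) return 0;
    sram_barrier();   /* payload loads must not be hoisted above the flag load */
    for (int i = 0; i < 4; i++) out[i] = mb->payload[i];
    mb->ready = 0;
    sram_barrier();
    return 1;
}
```

Dropping the first barrier in `publish` is the failure mode described above: the flag store could become visible before the payload stores.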
The repository includes a latency benchmark to measure the overhead of the memory-control-plane logic in mock mode. This is useful for validating the control path but does not reflect real SRAM hardware latency.
make latency_bench
./latency_bench
./latency_bench 5000000

[bench] backend: mock SRAM
[bench] iterations: 1000000
[bench] write32 avg: 4.52 ns/op
[bench] read32 avg: 3.10 ns/op
[bench] verify: OK ✔
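For reference, a per-operation average like the ones above can be derived by timing a loop with a monotonic clock and dividing by the iteration count. The sketch below does this for 32-bit writes to a mock buffer. `now_ns`, `bench_write32`, and the buffer size are assumptions for illustration, not the repository's benchmark code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define MOCK_WORDS 1024u                        /* assumed mock buffer size */

/* volatile keeps the compiler from deleting the timed stores */
static volatile uint32_t mock_words[MOCK_WORDS];

/* Current monotonic time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost in ns of one 32-bit write, measured over `iters` iterations. */
static double bench_write32(uint64_t iters) {
    assert(iters > 0);
    double t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++)
        mock_words[i % MOCK_WORDS] = (uint32_t)i;
    double t1 = now_ns();
    return (t1 - t0) / (double)iters;
}
```

The figure includes loop and indexing overhead, so it bounds the control-path cost from above rather than isolating the store itself.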
| Backend | Platform | Purpose |
|---|---|---|
| Mock SRAM | Any OS | Development / testing |
| /dev/mem | Linux | Hardware MMIO prototype |
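One way to keep the two backends interchangeable is to hide each behind a base pointer and a length, so the rest of the code only sees 32-bit reads and writes at an offset. The following is a sketch under that assumption. `sram_backend`, `backend_open_mock`, `backend_open_devmem`, `backend_write32`, and `backend_read32` are hypothetical names, and the physical base address passed to the `/dev/mem` variant has to come from your platform's memory map. Mapping `/dev/mem` normally requires root and a kernel configuration that permits it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical backend handle: a mapped window of `len` bytes. */
typedef struct {
    volatile uint8_t *base;
    size_t            len;
} sram_backend;

static uint32_t mock_arena[1024];   /* 4 KiB, word-aligned mock SRAM */

static sram_backend backend_open_mock(void) {
    sram_backend b = { (volatile uint8_t *)mock_arena, sizeof mock_arena };
    return b;
}

/* Linux only. phys_base must be page-aligned. Returns base == NULL on failure. */
static sram_backend backend_open_devmem(off_t phys_base, size_t len) {
    sram_backend b = { NULL, 0 };
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) return b;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_base);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED) return b;
    b.base = p;
    b.len = len;
    return b;
}

static void backend_write32(sram_backend *b, size_t off, uint32_t v) {
    assert(b->base != NULL && off % 4 == 0 && off + 4 <= b->len);
    *(volatile uint32_t *)(b->base + off) = v;
}

static uint32_t backend_read32(sram_backend *b, size_t off) {
    assert(b->base != NULL && off % 4 == 0 && off + 4 <= b->len);
    return *(volatile uint32_t *)(b->base + off);
}
```

Because both constructors return the same handle type, code written against the mock can be pointed at hardware by swapping a single call. On a real aperture the accesses would also need the ordering barriers used elsewhere in this README.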
- KV-cache tile residency: Binding active attention blocks to on-chip SRAM.
- Tensor tile staging: Manual orchestration of matrix multiplication tiles.
- MoE expert placement: Promoting active experts to fast residency.
- FPGA scratchpad memory: Software management of non-coherent BRAM.
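As one illustration of the tensor tile staging case, the sketch below sums a large array by copying it tile by tile into a small scratch buffer that stands in for SRAM, alternating between two slots so the next tile is staged while the current one is consumed. It is a sequential mock: on real hardware the staging copy would be issued asynchronously (for example by a DMA engine) so that it overlaps with compute. `TILE_WORDS`, `stage_tile`, and `sum_staged` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_WORDS 256u                     /* assumed tile size */

static uint32_t scratch[2][TILE_WORDS];     /* two slots of mock SRAM */

/* Copy up to one tile from src into the given slot; returns words staged. */
static size_t stage_tile(int slot, const uint32_t *src, size_t remaining) {
    size_t n = remaining < TILE_WORDS ? remaining : TILE_WORDS;
    if (n > 0) memcpy(scratch[slot], src, n * sizeof(uint32_t));
    return n;
}

/* Sum `count` words, reading only from the scratch slots. */
static uint64_t sum_staged(const uint32_t *src, size_t count) {
    assert(src != NULL || count == 0);
    uint64_t total = 0;
    int cur = 0;
    size_t n_cur = stage_tile(cur, src, count);   /* prime the first slot */
    size_t staged = n_cur;                        /* words staged so far */
    while (n_cur > 0) {
        int nxt = cur ^ 1;
        /* Stage the next tile into the idle slot. */
        size_t n_nxt = stage_tile(nxt, src + staged, count - staged);
        staged += n_nxt;
        /* Consume the current tile from fast memory. */
        for (size_t i = 0; i < n_cur; i++) total += scratch[cur][i];
        cur = nxt;
        n_cur = n_nxt;
    }
    return total;
}
```

The compute loop never touches `src` directly, which is the property that makes residency explicit: every operand it reads was placed in the scratch slots by an earlier staging step.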
- Mock backend implementation
- MMIO/devmem backend implementation
- Assembly-backed ordering layer
- UIO/VFIO secure mapping backend
- `/dev/mem_hint` conceptual kernel interface
- Compiler/runtime intent integration