A high-performance C++ order matching engine built for sub-microsecond latency. Implements continuous price-time priority matching with lock-free order ingestion, thread-local memory pools, and automated EC2 Spot benchmarking with P99.9 tail latency profiling.
Benchmarked with 1,000,000 orders (alternating buy/sell, random prices, real matching with 773K fills):
| Metric | Value |
|---|---|
| P50 latency | 0.21 us |
| P95 latency | 1.5 us |
| P99 latency | 2.6 us |
| P99.9 latency | 5.7 us |
| Pipeline throughput | 2.4M orders/sec |
Measured with pool allocation + matching in the timed loop. Numbers from local Apple Silicon; EC2 c6i.large (Intel Xeon Ice Lake) numbers via scripts/run_benchmark.sh.
Producer Thread Matching Thread
+-----------------+ SPSC +--------------------+
| ThreadLocalPool | -> Queue -> | OrderBook |
| (slab alloc) | | (price-time match) |
+-----------------+ +--------------------+
|
Fill Reports
The system mirrors a real exchange pipeline:
- Order ingestion — Producer threads allocate orders from per-thread slab pools (zero contention, stable pointers) and push them into SPSC lock-free queues.
- Matching engine — A single dedicated thread pops orders from the queue and feeds them to the
OrderBook. Matching happens inline onadd_order(): incoming orders cross against resting price levels until filled. No batch step, no locks in the hot path. - Fill reporting — Each match generates a
Fillrecord with buyer/seller IDs, execution price, and quantity.
Lock-free queues eliminate mutex contention at the ingestion boundary. The matching engine itself is single-threaded by design — this avoids synchronization overhead in the critical path and is the standard architecture used by production exchanges (CME Globex, NASDAQ ITCH).
- Continuous matching: Orders match inline in
add_order(). An aggressive buy sweeps through ask levels lowest-first; an aggressive sell sweeps through bid levels highest-first. - Price-time priority:
std::map<double, std::deque<Order>>— orders at the same price level are filled FIFO. - Partial fills: Incoming order quantity is decremented against each resting order. Unfilled remainder rests on the book.
- Order cancellation:
cancel_order(id)with O(1) lookup viaunordered_mapindex, then removal from the price-level deque.
- Wait-free ring buffer for single-producer, single-consumer use.
- Acquire-release memory ordering: producer acquires consumer's
head, releases owntail; consumer acquires producer'stail, releases ownhead. alignas(64)on head/tail atomics to prevent false sharing across cache lines.- Usable capacity is N-1 (one sentinel slot for full/empty disambiguation).
- Slab-based allocator:
mallocs fixed-size slabs of objects. When a slab is exhausted, a new one is allocated. Existing pointers remain valid (no vector-style invalidation). get_thread_local_pool<T>()returns athread_localinstance — each thread gets its own pool with zero contention.- O(1) bump-pointer allocation within a slab.
-O3 -march=native -flto -fno-exceptions -fno-rtti -mavx2
-flto: Link-time optimization enables cross-translation-unit inlining of the matching hot path.-fno-exceptions -fno-rtti: Eliminates exception handling and RTTI overhead. Tests override this (GTest requires exceptions).-mavx2: Enables AVX2 instruction set on x86_64 targets.
Prerequisites: C++20 compiler, CMake 3.20+, Google Test
cd orderbook-engine
./scripts/build_and_run_benchmark.shOr manually:
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./latency_benchmark
./tests/orderbook_tests51 unit tests across 8 test suites:
| Suite | Tests | Coverage |
|---|---|---|
| OrderTest | 3 | Struct creation, type validation |
| OrderBookTest | 11 | Crossing, partial fills, price-time priority, multi-level sweeps, cancellation |
| OrderQueueTest | 5 | Push/pop, capacity, circular wrap |
| MemoryPoolTest | 5 | Slab allocation, stress (2000 allocs) |
| IntegrationTest | 4 | Pool + queue + matching end-to-end |
| PerformanceTest | 5 | Throughput and latency thresholds |
| EdgeCaseTest | 13 | Extreme values, zero qty, cancel semantics |
| ConcurrencyTest | 5 | SPSC producer-consumer, multi-queue fan-in, pipeline |
./scripts/run_tests.shAutomated profiling on c6i.large (Intel Xeon Ice Lake) Spot instances:
# One-time setup: creates VPC, security group, key pair, IAM role.
./scripts/aws_cloud_init.sh
# Launch benchmark. Instance self-terminates; results upload to S3.
./scripts/run_benchmark.shThe cloud-init script:
- Disables turbo boost and sets the CPU governor to
performancefor stable measurements - Pins the benchmark to a single core via
tasksetto avoid scheduler jitter - Runs
perf statfor hardware counters (cache misses, IPC, branch mispredicts) - Captures system info (CPU model, AVX2 flags, kernel version)
- Uploads results to S3 with timestamps
Spot instances run at ~70% discount vs. on-demand. The script reports the exact savings percentage.
orderbook-engine/
├── include/
│ ├── order.hpp # Order and Fill structs
│ ├── orderbook.hpp # Matching engine interface
│ ├── order_queue.hpp # SPSC lock-free ring buffer
│ └── memory_pool.hpp # Thread-local slab allocator
├── src/
│ ├── main.cpp # Demo: pool -> queue -> match pipeline
│ ├── order.cpp # Order translation unit
│ └── orderbook.cpp # Matching engine implementation
├── benchmarks/
│ └── latency_test.cpp # 1M-order benchmark with percentiles
├── tests/ # 51 Google Test cases (8 suites)
├── scripts/
│ ├── aws_cloud_init.sh # One-time AWS infrastructure setup
│ ├── run_benchmark.sh # EC2 Spot instance launcher
│ ├── cloud_init.sh # EC2 instance bootstrap + benchmark
│ ├── build_and_run_benchmark.sh # Local build + benchmark
│ └── run_tests.sh # Unit test runner
└── CMakeLists.txt # Build configuration