
Learn how to design a low-latency C++ trading system from scratch, with real-world data structures, network stacks, and profiling tips that shave microseconds.
Published by Brav
TL;DR
- How I shaved 3–5 µs off my C++ low-latency trading system.
- Real-world order-book data structures that fit in L1 cache.
- Zero-copy networking stacks: DPDK, OpenOnload, TCPDirect.
- Lock-free queues: Disruptor vs custom ring buffer.
- Profiling tricks: perf, LLVM XRay, bulk writes, branchless search.
Why this matters
I remember the first time I lost a trade because the best-ask price changed 3 ms after my strategy sent an order. In low-latency markets, every microsecond counts. Traders rely on accurate order books, lightning-fast network paths, and deterministic execution. The pain points (slow market-data ingestion, cache misses, and data races) are common in every production system.
Core concepts
1. Order book data structures
The order book is a sorted map of price levels. In a modern C++ system I use std::flat_map (C++23), which keeps keys sorted in a contiguous vector and values packed back-to-back (cppreference — Standard Library Header <flat_map>, 2024). Unlike std::map, there is no per-node allocation overhead (roughly 32+ bytes per entry for a node-based map); insertion is linear, which is fine when most updates hit the top ~1,000 levels. Contiguous storage also means each 64-byte cache line holds several price levels that the CPU can fetch in one go.
2. Concurrency & shared memory
Once the book is updated, all strategy threads need a copy. Instead of copying the whole book, I publish updates into a lock-free ring buffer that lives in shared memory. The buffer uses two atomics—read and write counters—to coordinate producers and consumers. Bulk writing batches of updates reduces the number of atomic touches and keeps the producer in the L1 cache.
The ring buffer is just a vector of struct Update { int64_t id; double price; int qty; }. By keeping it in shared memory I avoid the kernel copy that would otherwise happen with pipes or sockets, which is how shared memory reduces kernel overhead (DPDK — Data Plane Development Kit Documentation, 2025).
3. Network stack
The choice of kernel bypass stack defines the floor of your latency. I compare three:
| Stack | Zero-Copy | Kernel Bypass | CPU Usage | Limitation |
|---|---|---|---|---|
| DPDK | ✅ | ✅ | Low | Requires NIC driver support, setup overhead |
| OpenOnload | ✅ | ✅ | Moderate | Vendor-specific, limited OS support |
| TCPDirect | ✅ | ✅ | Very Low | Proprietary hardware, limited OS support |
DPDK is the gold standard for 10 GbE and above, while OpenOnload and TCPDirect offer a lighter-weight user-space stack with similar zero-copy guarantees. I chose DPDK for my lab because it has a proven driver ecosystem and can push 40 Gb/s per core (DPDK — Data Plane Development Kit Documentation, 2025).
How to apply it
- Choose the right data structure. Replace std::map with std::flat_map and keep price levels in a vector. Avoid node-based containers unless you genuinely need the hashed lookup of std::unordered_map.
- Build a lock-free shared-memory ring. Allocate a large, cache-line-aligned buffer and coordinate with std::atomic<uint32_t> read/write indices. Batch updates in bulk to reduce contention.
- Pick a network stack. For 10 GbE and above, DPDK gives the lowest latency; OpenOnload or TCPDirect can be simpler to ship.
- Profile with perf and LLVM XRay. Identify cache misses, branch mispredictions, and atomic stalls. Use intrusive instrumentation for hot paths.
- Optimize search. Use a branchless binary search (a halving loop built on conditional moves rather than unpredictable branches) for price lookup. For skewed updates, a linear scan of the top ~1,000 levels is often faster because it stays in L1.
- Test under burst. Load the ring buffer with 1 M updates per second, watch for queue saturation, and plan bulk writes accordingly.
- Audit latency. Record per-message round-trip latency with a high-resolution timer and publish a histogram.
Pitfalls & edge cases
- Data races: Even with a lock-free ring, using relaxed memory ordering where acquire/release semantics are needed can lead to stale reads.
- Memory allocation overhead: Dynamically allocating a price level on each update kills latency. Pre-allocate all levels.
- Cache line thrashing: Two threads touching the same cache line cause false sharing. Align structures to 64 bytes.
- Scaling beyond one machine: Distributed order books need a consensus layer (e.g., Raft).
- Security: Shared memory is a privilege vector; use mprotect and strict access controls.
- Hardware limits: DPDK performs best when each core only touches the NICs and memory attached to its own NUMA node.
- Strategy interference: If multiple strategies run on the same core, their cache usage can interfere. Isolate cores per strategy.
Quick FAQ
- What are the performance differences between Disruptor, cpp-ipc, and a custom ring buffer?
Disruptor is lock-free and guarantees low latency, but it uses a circular buffer and requires careful back-pressure handling. cpp-ipc is a generic shared-memory IPC library with built-in memory safety. A custom ring buffer gives the most control but needs careful tuning to avoid contention.
- How does the choice of network stack impact end-to-end latency?
DPDK pushes packets directly from the NIC to user space, cutting out the kernel path. OpenOnload and TCPDirect achieve similar zero-copy via kernel-bypass drivers, but their extra software layer can add 50–100 ns.
- How can I systematically measure and audit latency metrics in production?
Attach a high-resolution timestamp before sending and after receiving each message. Use perf record (e.g. with -e cache-misses) to find hot spots, and post-process the samples with perf script into a latency histogram.
- What are best practices for avoiding data races in concurrent ring buffers?
Use std::atomic with std::memory_order_acquire/release, and ensure the producer never overwrites a slot the consumer hasn't read yet.
- How does the system behave when scaling beyond one machine?
A distributed design introduces network latency and ordering guarantees that must be managed. Use RDMA or InfiniBand for inter-node data exchange.
- What security considerations exist when using shared memory for IPC?
Shared memory can be read by any process running as the same user. Use mprotect and a sandboxed user to mitigate.
- How do I ensure consistent performance across strategies on a shared machine?
Pin each strategy to its own CPU core, isolate those cores from the scheduler, and reserve cache for each strategy.
Conclusion
Low-latency trading is a relentless optimization sprint. By building an order book around std::flat_map, publishing updates through a lock-free shared-memory ring, and choosing a kernel-bypass stack like DPDK, I have shaved 3–5 µs off my trade-execution time. If you're a quant or low-latency engineer, start by profiling your current system, then apply the steps above and iterate. If you're a manager, make sure your team has the right hardware (NUMA-aware servers) and that your development process includes latency testing from day one.
Who should use this?
- Systems or software engineers building trading platforms.
- Quant developers who need deterministic execution.
- Anyone who cares about microseconds.
Who shouldn’t?
- Teams with only a few hundred orders per second; the overhead of shared memory may outweigh the benefit.
- Projects that need zero-copy networking but cannot afford DPDK's hardware and setup costs.
With the right data structures, lock-free queues, and profiling discipline, you can turn a 10-µs latency system into a 3-µs one.





