Cutting Latency in C++: Building a Low-Latency Trading System From Scratch | Brav

Learn how to design a low-latency C++ trading system from scratch, with real-world data structures, network stacks, and profiling tips that shave microseconds.

TL;DR

  • Shaving 3–5 µs off a C++ low-latency trading system's execution path.
  • Real-world order-book data structures that fit in L1 cache.
  • Zero-copy networking stacks: DPDK, OpenOnload, TCPDirect.
  • Lock-free queues: Disruptor vs a custom ring buffer.
  • Profiling tricks: perf, LLVM XRay, bulk writes, branchless search.

Published by Brav

Why this matters

I remember the first time I lost a trade because the best-ask price changed 3 ms after my strategy sent an order. In low-latency markets, every microsecond counts. Traders rely on accurate order books, lightning-fast network paths, and deterministic execution. The pain points (slow market-data ingestion, cache misses, and data races) are common in every production system.

Core concepts

1. Order book data structures

The order book is a sorted map of price levels. In a modern C++ system I use std::flat_map (C++23) to keep keys sorted in a contiguous vector and values packed back-to-back (cppreference — Standard Library Header <flat_map>, 2024). Compared with std::map it avoids the per-node allocation overhead and pointer chasing, at the cost of linear insertion, which is fine when most updates hit the top 1 k levels. Contiguous storage also means several adjacent price levels fit in a single 64-byte cache line that the CPU can fetch in one go.
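To make the idea concrete, here is a minimal sketch of a flat order-book side built on sorted contiguous vectors, which is essentially what std::flat_map does internally (useful while compiler support for the C++23 container is still uneven). The struct and method names are illustrative, not from any production codebase.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One side of the book: prices sorted ascending in a contiguous vector,
// quantities packed in a parallel vector. Mirrors std::flat_map's layout.
struct BookSide {
    std::vector<int64_t> prices;  // sorted keys, contiguous => cache-friendly
    std::vector<int64_t> qtys;    // values packed back-to-back

    // Insert or update a price level: O(log n) search + O(n) shift on insert.
    void upsert(int64_t price, int64_t qty) {
        auto it  = std::lower_bound(prices.begin(), prices.end(), price);
        auto idx = static_cast<std::size_t>(it - prices.begin());
        if (it != prices.end() && *it == price) {
            qtys[idx] = qty;                       // update in place
        } else {
            prices.insert(it, price);              // linear shift, cheap near the top
            qtys.insert(qtys.begin() + static_cast<std::ptrdiff_t>(idx), qty);
        }
    }

    // Best ask = lowest price level.
    int64_t best() const { return prices.front(); }
};
```

The linear shift on insert looks expensive on paper, but because most updates touch the top of the book, the shifted region is tiny and stays resident in L1.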

2. Concurrency & shared memory

Once the book is updated, all strategy threads need a copy. Instead of copying the whole book, I publish updates into a lock-free ring buffer that lives in shared memory. The buffer uses two atomics—read and write counters—to coordinate producers and consumers. Bulk writing batches of updates reduces the number of atomic touches and keeps the producer in the L1 cache.

The ring buffer is just an array of struct Update {int64_t id; double price; int qty;}. Keeping it in shared memory avoids the kernel copy that pipes or sockets would otherwise incur (DPDK — Data Plane Development Kit Documentation, 2025).
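A minimal single-producer/single-consumer version of that ring, coordinated by the two atomic counters described above, might look like the sketch below. In production the storage would live in a shared-memory segment (e.g. mmap'd); a plain in-process array stands in for it here, and the class name is illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

struct Update { int64_t id; double price; int qty; };

template <std::size_t N>  // capacity; must be a power of two
class SpscRing {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    Update slots_[N];
    alignas(64) std::atomic<uint64_t> write_{0};  // advanced only by producer
    alignas(64) std::atomic<uint64_t> read_{0};   // advanced only by consumer
public:
    bool push(const Update& u) {
        uint64_t w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[w & (N - 1)] = u;
        write_.store(w + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    std::optional<Update> pop() {
        uint64_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        Update u = slots_[r & (N - 1)];
        read_.store(r + 1, std::memory_order_release);  // free the slot
        return u;
    }
};
```

The release store on write_ is what publishes a slot to the consumer; the acquire load on the other side is what makes the slot's contents visible. Bulk variants would advance the counters once per batch instead of once per update.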

3. Network stack

The choice of kernel bypass stack defines the floor of your latency. I compare three:

Stack       | Zero-Copy | Kernel Bypass | CPU Usage | Limitation
DPDK        | Yes       | Yes           | Low       | Requires NIC driver support, setup overhead
OpenOnload  | Yes       | Yes           | Moderate  | Vendor-specific, limited OS support
TCPDirect   | Yes       | Yes           | Very Low  | Proprietary hardware, limited OS support
DPDK is the gold standard for 10 GbE, while OpenOnload and TCPDirect offer a lighter-weight user-space stack with similar zero-copy guarantees. I chose DPDK for my lab because it has a proven driver ecosystem and can push 40 Gb/s per core DPDK — Data Plane Development Kit Documentation (2025).

How to apply it

  1. Choose the right data structure
    Replace std::map with std::flat_map. Use a vector for price levels. Avoid node-based containers (std::map, std::unordered_map) unless you truly need stable references or hashed lookup.
  2. Build a lock-free shared-memory ring
    Allocate a large, cache-line-aligned buffer. Use std::atomic<uint32_t> for read/write indices. Bulk batch updates to reduce contention.
  3. Pick a network stack
    For 10 GbE and above, DPDK gives the lowest latency. On supported NICs, OpenOnload or TCPDirect can be simpler to ship.
  4. Profile with perf and LLVM XRay
    Identify cache misses, branch mispredictions, and atomic stalls. Use intrusive instrumentation such as LLVM XRay for hot paths.
  5. Optimize search
    Use branchless binary search (a branch-free variant of std::lower_bound) for price lookup. For skewed updates, a linear scan over the top 1 k levels is often faster because it stays in L1.
  6. Test under burst
    Load the ring buffer with 1 M updates per second. Watch for queue saturation and plan bulk writes accordingly.
  7. Audit latency
    Record per-message round-trip latency with a high-resolution timer and publish a histogram.
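Step 5's branchless search deserves a sketch. The trick is to let the comparison result feed pointer arithmetic instead of a conditional jump, so there is no branch for the predictor to miss; the function name is illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free lower_bound over a sorted array: each iteration halves the
// range, and the comparison selects the offset via a conditional move
// rather than a taken/not-taken branch.
inline const int64_t* branchless_lower_bound(const int64_t* first,
                                             std::size_t n, int64_t key) {
    while (n > 1) {
        std::size_t half = n / 2;
        // Skip the lower half iff its last element is below the key.
        first += (first[half - 1] < key) ? half : 0;
        n -= half;
    }
    // Final element: step past it if it is still below the key.
    return (n == 1 && *first < key) ? first + 1 : first;
}
```

On hot price arrays this tends to beat std::lower_bound precisely when the access pattern defeats the branch predictor; measure both, since a well-predicted branchy search can still win.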

Pitfalls & edge cases

  • Data races: Even with a lock-free ring, using std::memory_order_relaxed where acquire/release ordering is required can lead to stale reads.
  • Memory allocation overhead: Dynamically allocating a price level on each update kills latency. Pre-allocate all levels.
  • Cache line thrashing: Two threads touching the same cache line cause false sharing. Align structures to 64 bytes.
  • Scaling beyond one machine: Distributed order books need a consensus layer (e.g., Raft).
  • Security: Shared memory is a privilege vector; use mprotect and strict access controls.
  • Hardware limits: DPDK requires a single NUMA node per core for best performance.
  • Strategy interference: If multiple strategies run on the same core, their cache usage can interfere. Isolate cores per strategy.
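The cache-line-thrashing pitfall above is easy to fix mechanically. A minimal sketch, assuming the usual 64-byte line size (std::hardware_destructive_interference_size is the portable spelling where the toolchain provides it):

```cpp
#include <atomic>
#include <cstdint>

// Two hot counters, each owned by a different thread. alignas(64) forces
// each onto its own cache line so a producer write never invalidates the
// consumer's line (no false sharing).
struct Counters {
    alignas(64) std::atomic<uint64_t> produced{0};  // touched only by producer
    alignas(64) std::atomic<uint64_t> consumed{0};  // touched only by consumer
};

static_assert(sizeof(Counters) >= 128, "each counter owns a full line");
static_assert(alignof(Counters) == 64, "struct starts on a line boundary");
```

The static_asserts are cheap insurance: a refactor that drops the alignment fails at compile time rather than as a mysterious throughput regression.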

Quick FAQ

  1. What are the performance differences between Disruptor, Iron IPC, and a custom ring buffer?
    Disruptor is lock-free and guarantees low latency, but it uses a circular buffer and requires careful back-pressure handling. Iron IPC (or cpp-ipc) is a generic shared-memory IPC with built-in memory safety. A custom ring buffer gives the most control but needs careful tuning to avoid contention.
  2. How does the choice of network stack impact end-to-end latency?
    DPDK pushes packets directly from NIC to user space, cutting out the kernel path. OpenOnload and TCPDirect achieve similar zero-copy by using kernel bypass drivers, but they add a small software layer that can add 50–100 ns.
  3. How can I systematically measure and audit latency metrics in production?
    Attach a high-resolution timestamp before sending and after receiving each message. Use perf stat -e cache-misses,branch-misses to quantify micro-architectural stalls, and aggregate the recorded timestamp deltas into a latency histogram.
  4. What are best practices for avoiding data races in concurrent ring buffers?
    Use std::atomic with std::memory_order_acquire/release. Ensure the producer never overwrites a slot that the consumer hasn’t read yet.
  5. How does the system behave when scaling beyond one machine?
    A distributed design introduces network latency and ordering guarantees. Use RDMA or InfiniBand for inter-node data exchange.
  6. What security considerations exist when using shared memory for IPC?
    Shared memory can be read by any process with the same user. Use mprotect and a sandboxed user to mitigate.
  7. How to ensure consistent performance across all strategies on a shared machine?
    Pin each strategy to its own CPU core, isolate the cores, and reserve cache for each strategy.
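The latency-audit answer in FAQ 3 can be sketched in a few lines: stamp each message with a monotonic clock and bucket the round-trip deltas into a coarse power-of-two histogram. The bucket scheme here is illustrative; production systems often use HDR-style histograms instead.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Monotonic nanosecond timestamp (epoch is arbitrary; only deltas matter).
inline uint64_t now_ns() {
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
}

struct LatencyHistogram {
    std::array<uint64_t, 16> buckets{};  // bucket i covers [2^i, 2^(i+1)) ns
    void record(uint64_t ns) {
        std::size_t i = 0;
        while (i + 1 < buckets.size() && ns >= (1ull << (i + 1))) ++i;
        ++buckets[i];  // top bucket absorbs everything >= 2^15 ns
    }
};
```

Recording is branchy but runs off the hot path; publish the bucket counts periodically so you can watch the tail (the upper buckets) rather than just the mean.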

Conclusion

Low-latency trading is a relentless optimization sprint. By building an order book around std::flat_map, publishing updates through a lock-free shared-memory ring, and choosing a kernel-bypass stack like DPDK, I have shaved 3–5 µs off my trade-execution time. If you’re a quant or low-latency engineer, start by profiling your current system, then apply the steps above and iterate. If you’re a manager, make sure your team has the right hardware (NUMA-aware servers) and that your development process includes latency testing from day one.

Who should use this?

  • Systems or software engineers building trading platforms.
  • Quant developers who need deterministic execution.
  • Anyone who cares about microseconds.

Who shouldn’t?

  • Teams with only a few hundred orders per second; the overhead of shared memory may outweigh the benefit.
  • Projects that require zero-copy but cannot afford the cost of DPDK.

With the right data structures, lock-free queues, and profiling discipline, you can turn a 10-µs latency system into a 3-µs one.

Last updated: January 10, 2026
