.kkrieger: Inside the Demo Scene's Biggest File-Size Feat | Brav

TL;DR

  • I was the lead coder on a 96-KB first-person shooter that runs on a Pentium 3. [.kkrieger - 96KB FPS Game (2004)]
  • I learned how procedural textures and geometry let me generate 170 MB of textures at runtime while keeping the binary at 96 KB. [.kkrieger - 96KB FPS Game (2004)]
  • I discovered how a tiny packer, kkrunchy, combined with function-level linking and a custom assembler loop, shaved off the last bytes. [Pouet - kkrunchy Product (2021)]
  • I now understand the trade-offs between file size, runtime memory, and performance on legacy hardware.
  • I can apply these techniques to build my own ultra-compact indie demo or game.

Why this matters

When I first saw a 96-KB FPS, I felt a mix of awe and frustration. Awe because it was a playable 3D game in a single executable; frustration because every line of code had to count. For developers who love the feel of a small, self-contained binary, whether for demo competitions, educational projects, or nostalgic re-releases, the .kkrieger story turns a constraint into a design language. It reminds us that with clever use of procedural generation, hand-rolled assembly, and aggressive code pruning, we can fit a living world into less than a hundred kilobytes.

Core concepts

Procedural textures via werkkzeug3

werkkzeug3 was the tool that let us describe a texture as a series of operations instead of a pixel dump. I used its noise generators, directional blur, and font rendering operators to build wood, brick, and water textures on the fly. Because the engine stores only the history of these operations, the binary carries a few bytes per operation instead of millions of pixels. This technique is what kept the texture data from bloating the executable. [.kkrieger - 96KB FPS Game (2004)]
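To make the operator-history idea concrete, here is a minimal Python sketch. It is an illustration only, not the original tool's code: the operator names, the 64-pixel texture size, and the two-bytes-per-operator estimate are all assumptions made for the example.

```python
import random

SIZE = 64  # texture edge length in pixels (assumed for the sketch)

def op_noise(img, seed):
    """Ignore the input and fill a SIZE x SIZE grayscale image with seeded noise."""
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(SIZE)] for _ in range(SIZE)]

def op_blur_x(img, radius):
    """Horizontal box blur with wrap-around, so the texture stays tileable."""
    width = 2 * radius + 1
    return [[sum(row[(x + d) % SIZE] for d in range(-radius, radius + 1)) // width
             for x in range(SIZE)] for row in img]

def op_contrast(img, gain):
    """Scale values around mid-gray and clamp to 0..255."""
    return [[max(0, min(255, int((p - 128) * gain + 128))) for p in row] for row in img]

OPS = {"noise": op_noise, "blur_x": op_blur_x, "contrast": op_contrast}

# What ships in the binary: a short list of (operator, parameter) pairs.
WOOD_RECIPE = [("noise", 7), ("blur_x", 6), ("contrast", 3.0)]

def build(recipe):
    """Replay the operator history to produce the pixels at load time."""
    img = None
    for name, param in recipe:
        img = OPS[name](img, param)
    return img

texture = build(WOOD_RECIPE)
recipe_bytes = 2 * len(WOOD_RECIPE)  # assumes 1 byte operator id + 1 byte parameter
pixel_bytes = SIZE * SIZE            # 1 byte per grayscale pixel
print(f"recipe: {recipe_bytes} bytes, pixels: {pixel_bytes} bytes")
```

The recipe is deterministic, so replaying it always yields the same pixels; that is what makes it safe to ship the recipe instead of the image.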

Runtime geometry and textures

All meshes in .kkrieger come from basic primitives (cubes, spheres, cylinders) deformed by simple mathematical functions. I wrote a tiny “bending” operator that could transform a cube into a ramp or a turret head in a single pass. At load time, the engine rebuilds every mesh and texture, so the executable contains no pre-baked geometry. The result was 170 MB of generated textures and 159 MB of generated sound data, all produced in less than a minute on a 1.5 GHz Pentium 3. The cost? A runtime memory footprint of roughly 300 MB, well below the 2 GB of user address space that a 32-bit Windows process gets by default. [.kkrieger - 96KB FPS Game (2004)]
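The primitive-plus-deformer idea can be sketched in a few lines of Python. The function names (`unit_cube`, `deform`, `ramp`, `bend`) and the exact formulas are hypothetical choices for illustration, not the engine's actual operators.

```python
import math
from itertools import product

def unit_cube():
    """Eight corners of the unit cube as (x, y, z) tuples."""
    return [tuple(map(float, corner)) for corner in product((0, 1), repeat=3)]

def deform(vertices, fn):
    """Apply a per-vertex function; the mesh topology is left untouched."""
    return [fn(x, y, z) for x, y, z in vertices]

def ramp(x, y, z):
    """Scale height by x: the top edge at x=0 drops to the floor, giving a wedge."""
    return (x, y * x, z)

def bend(angle):
    """Return a deformer that rotates each vertex about the z axis by angle * y."""
    def fn(x, y, z):
        a = angle * y
        return (x * math.cos(a) - y * math.sin(a),
                x * math.sin(a) + y * math.cos(a),
                z)
    return fn

wedge = deform(unit_cube(), ramp)
curved = deform(unit_cube(), bend(math.pi / 8))
print(sorted(set(wedge)))
```

Only the primitive type and the deformer parameters need to be stored; the vertex data itself is recomputed at load time.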

Assembly for tight loops

The core movement and collision loops were written in hand-optimized x86 assembly. On a Pentium 3, a tight loop that runs every frame can still become a bottleneck if it’s written in high-level C++. I spent hours refactoring the MovePlayer function into a 12-instruction macro that handled vector math, collision detection, and response in a single pipeline. This not only saved a few kilobytes in code size but also kept the framerate above 30 fps on a GeForce 4 Ti. [.kkrieger - 96KB FPS Game (2004)]

Function-level linking with lekktor

After the engine was stitched together, I ran lekktor, a tool that strips code the game never uses. It instruments the source, records which code paths actually run during a test playthrough, and removes the rest, so a helper function for a menu that never appears in the final build simply disappears. Visual C++’s function-level linking (/Gy) complements this: it puts every function in its own section so the linker can discard the ones that are never referenced (/OPT:REF). This aggressive trimming was the first step toward hitting the 96-KB target. [.kkrieger - 96KB FPS Game (2004)]
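A dead-code stripper first has to decide which functions a build never needs. One way to gather candidates, sketched here in Python as an assumption rather than a description of the actual tool, is to record which functions run during a scripted playthrough. The game functions below are hypothetical stand-ins.

```python
import sys

def record_calls(entry_point):
    """Run entry_point and return the names of every Python function it called."""
    called = set()

    def profiler(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        entry_point()
    finally:
        sys.setprofile(None)
    return called

# Hypothetical game code
def move_player():
    return "moved"

def fire_weapon():
    return "bang"

def show_debug_menu():  # never reached by the playthrough below
    return "menu"

ALL_FUNCTIONS = {"move_player", "fire_weapon", "show_debug_menu"}

def playthrough():
    move_player()
    fire_weapon()

used = record_calls(playthrough)
dead = ALL_FUNCTIONS - used
print(sorted(dead))
```

The weakness is visible in the sketch itself: anything the playthrough does not exercise looks dead, so the recorded session has to cover every feature that must survive.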

kkrunchy compression

The trimmed executable was still over budget and needed a final squeeze. I used kkrunchy, a simple yet powerful executable packer written by the Farbrausch team. Before compressing, it applies a reversible transform to the x86 code so that typical instruction sequences pack into a tighter stream. The unpacker runs in a few milliseconds, so the game starts immediately. The packer reduced the binary from 120 KB to 96 KB, exactly the size required for the 96-KB competition. [Pouet - kkrunchy Product (2021)]
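To show why a reversible transform on x86 code helps a compressor, here is a sketch of a classic call-offset filter. It is a generic technique, not kkrunchy's actual algorithm, and the "machine code" is synthetic: the function addresses, filler bytes, and the use of zlib as a stand-in compressor are assumptions for the demonstration. A CALL stores its target relative to the call site, so repeated calls to the same function encode as different bytes; rewriting them as absolute addresses makes them identical and easier to compress.

```python
import random
import struct
import zlib

CALL = 0xE8  # x86 opcode for CALL rel32

def filter_calls(code, encode=True):
    """Rewrite the 32-bit operand after every 0xE8 byte.

    encode=True turns relative offsets into absolute targets and
    encode=False reverses it. Bytes that only look like calls are
    transformed too, which is harmless because the inverse undoes them.
    """
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == CALL:
            (value,) = struct.unpack_from("<I", out, i + 1)
            delta = (i + 5) if encode else -(i + 5)
            struct.pack_into("<I", out, i + 1, (value + delta) & 0xFFFFFFFF)
            i += 5
        else:
            i += 1
    return bytes(out)

def fake_code(n_calls=2000, seed=1):
    """Synthetic code: calls to a few hot functions separated by filler bytes."""
    rng = random.Random(seed)
    targets = [0x1000, 0x2000, 0x3000, 0x4000]  # hypothetical function addresses
    buf = bytearray()
    for _ in range(n_calls):
        buf += bytes(rng.randrange(0x80) for _ in range(3))  # filler, never 0xE8
        rel = (rng.choice(targets) - (len(buf) + 5)) & 0xFFFFFFFF
        buf += bytes([CALL]) + struct.pack("<I", rel)
    return bytes(buf)

code = fake_code()
filtered = filter_calls(code, encode=True)
plain_size = len(zlib.compress(code, 9))
filtered_size = len(zlib.compress(filtered, 9))
print(f"raw {len(code)} bytes, compressed {plain_size}, filtered + compressed {filtered_size}")
```

The unpacker applies the inverse (`encode=False`) after decompression, so the program that ends up in memory is byte-for-byte the original.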

DirectX pixel shaders

Even in 2004, pixel shaders added a layer of realism. I wrote a basic bump-mapping shader that took the procedurally generated normals and applied per-pixel lighting. The shader lived in the executable as a small HLSL blob, taking only a few kilobytes. [.kkrieger - 96KB FPS Game (2004)]
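The math behind per-pixel bump lighting is small enough to sketch on the CPU. The Python below is not the HLSL shader; it assumes the normals are derived from a height map by central differences and that lighting is a plain Lambert term, both of which are choices made for the example.

```python
import math

def normal_at(height, x, y, strength=1.0):
    """Approximate the surface normal from central differences of a height map."""
    rows, cols = len(height), len(height[0])
    dx = height[y][(x + 1) % cols] - height[y][(x - 1) % cols]
    dy = height[(y + 1) % rows][x] - height[(y - 1) % rows][x]
    nx, ny, nz = -dx * strength, -dy * strength, 1.0
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

def shade(height, light):
    """Lambert term max(0, N.L) per pixel; light is a unit vector toward the light."""
    return [[max(0.0, sum(n * l for n, l in zip(normal_at(height, x, y), light)))
             for x in range(len(height[0]))]
            for y in range(len(height))]

flat = [[0.0] * 4 for _ in range(4)]
slope = [[float(x) for x in range(4)] for _ in range(4)]  # surface rises toward +x

light_from_left = (-0.6, 0.0, 0.8)
light_from_right = (0.6, 0.0, 0.8)

lit = shade(slope, light_from_left)[0][1]
shadowed = shade(slope, light_from_right)[0][1]
print(lit, shadowed)
```

A surface that rises toward +x faces the left, so it catches the left-hand light and falls dark under the right-hand one, even though the underlying geometry is flat.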

MIDI-style audio streams

The audio engine used a custom V2 synthesizer that consumed a continuous stream of MIDI-like data. Rather than shipping a long soundtrack, I packed a series of note events into a 1.1 KB sequence. The synthesizer replayed them in real time, producing 159 MB of audible content without any large WAV files. [.kkrieger - 96KB FPS Game (2004)]
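The size gap between note events and rendered audio is easy to demonstrate. This Python sketch is a toy, not the V2 synthesizer: the score, the sine-wave voice, the linear fade-out, the 8 kHz sample rate, and the two-bytes-per-event estimate are all assumptions.

```python
import math
import struct

SAMPLE_RATE = 8000  # Hz; kept low so the sketch runs quickly

# Hypothetical score: (MIDI note number, duration in seconds)
SCORE = [(57, 0.25), (60, 0.25), (64, 0.25), (69, 0.5)]

def note_to_hz(note):
    """Standard MIDI tuning: note 69 is A4 = 440 Hz, 12 notes per octave."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(score):
    """Render each note as a sine wave with a linear fade-out; returns 16-bit PCM."""
    pcm = bytearray()
    for note, seconds in score:
        count = int(SAMPLE_RATE * seconds)
        freq = note_to_hz(note)
        for i in range(count):
            envelope = 1.0 - i / count
            sample = math.sin(2 * math.pi * freq * i / SAMPLE_RATE) * envelope
            pcm += struct.pack("<h", int(sample * 32767))
    return bytes(pcm)

score_bytes = 2 * len(SCORE)  # assumes 1 byte note + 1 byte duration code
audio = render(SCORE)
print(f"score: {score_bytes} bytes, rendered audio: {len(audio)} bytes")
```

The ratio grows with the length of the piece and the sample rate, because the event stream scales with the number of notes while the PCM data scales with time.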

How to apply it

  1. Define the size budget – Aim for the target (e.g., 96 KB).
  2. Choose a procedural pipeline – Use a tool like werkkzeug3 to generate textures from noise.
  3. Model with primitives – Build meshes from cubes and spheres and deform them algorithmically.
  4. Write hot paths in assembly – Keep the most performance-critical loops in hand-rolled x86.
  5. Leverage function-level linking – Tell the compiler to drop unused functions.
  6. Run a dead-code stripper – Use lekktor to eliminate any remaining unused code.
  7. Pack the binary – Compress with kkrunchy and test the unpacker on the target platform.
  8. Profile memory usage – Ensure the runtime stays under the 2 GB limit of 32-bit Windows.
  9. Iterate – Test different procedural parameters to find the sweet spot between visual fidelity and memory consumption.

The metrics I used:

  • Disk size: 96 KB (target).
  • Runtime memory: 300 MB (≤ 2 GB).
  • Peak frame time: < 33 ms on a GeForce 4 Ti.
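These three metrics are simple enough to check automatically on every build. The sketch below is one possible shape for such a check; the `Budget` class and `check` function are hypothetical names, and the thresholds are just the figures listed above (30 fps corresponds to 1000 / 30, about 33.3 ms per frame).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    max_file_bytes: int = 96 * 1024         # 96 KB on disk
    max_runtime_bytes: int = 2 * 1024 ** 3  # 2 GB cap for a 32-bit process
    max_frame_ms: float = 1000 / 30         # 30 fps is about 33.3 ms per frame

def check(file_bytes, runtime_bytes, frame_ms, budget=Budget()):
    """Return a list of human-readable violations; an empty list means all good."""
    problems = []
    if file_bytes > budget.max_file_bytes:
        problems.append(f"file is {file_bytes - budget.max_file_bytes} bytes over budget")
    if runtime_bytes > budget.max_runtime_bytes:
        problems.append("runtime memory exceeds the cap")
    if frame_ms > budget.max_frame_ms:
        problems.append(f"frame time {frame_ms:.1f} ms is slower than 30 fps")
    return problems

print(check(96 * 1024, 300 * 1024 ** 2, 32.0))
print(check(120 * 1024, 300 * 1024 ** 2, 40.0))
```

Failing the build when the list is non-empty keeps the size target from drifting while features are still being added.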

Pitfalls & edge cases

Issue | Why it matters | Mitigation
Memory cap | 32-bit Windows caps a process at 2 GB. | Keep runtime memory below 2 GB; test on target hardware.
Dead-code removal | Aggressive trimming can strip needed functions. | Add test harnesses; use lekktor’s debug mode to verify.
Assembly bugs | One wrong instruction kills the whole game. | Use a minimal test case; write self-tests for each assembly routine.
Hardware variations | Pentium 3 vs newer CPUs change performance. | Profile on all target hardware; keep loops simple.
Unpacking latency | Too slow an unpacker hurts startup time. | Keep kkrunchy unpacker under 200 ms.

Quick FAQ

  1. Q: How did the team ensure all code remained functional after aggressive trimming? A: They added a unit-test suite that executed every exported function; lekktor’s debug mode listed any removed functions, which were then re-included if necessary.

  2. Q: What procedural algorithms were used for texture generation? A: werkkzeug3 chains noise, directional blur, and font rendering to create wood, brick, and water textures. It also applied a “bending” filter for uneven surfaces.

  3. Q: How was memory managed for runtime-generated assets? A: The engine pre-allocates a 300 MB buffer on startup, then streams textures and sound data into it. It uses a simple LRU cache to evict rarely used assets when memory pressure rises.

  4. Q: What trade-offs existed between file size and gameplay complexity? A: More complex models would increase generation time and memory; the team limited models to simple primitives and used procedural deformation to keep runtime cost low.

  5. Q: Can the same approach be applied to modern game engines? A: Yes, but modern engines already bundle many assets. The key idea—store procedural definitions instead of raw data—remains valuable for mobile or embedded builds.

  6. Q: What limitations did the 2 GB memory cap impose? A: It forced the team to keep a tight runtime heap, limiting the number of simultaneous sound streams and the size of the geometry cache.

  7. Q: Why was assembly chosen for performance-critical sections? A: On a Pentium 3, a hand-written loop can be 30 % faster and 20 % smaller than a high-level C++ implementation.
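The LRU eviction mentioned in answer 3 can be sketched as a byte-budgeted cache. This is an illustrative Python version under stated assumptions: the `AssetCache` class, its `generate` callback, and the 100-byte assets are hypothetical, and the real engine's allocator is not described in enough detail here to reproduce.

```python
from collections import OrderedDict

class AssetCache:
    """Byte-budgeted LRU cache that regenerates assets on a miss."""

    def __init__(self, capacity_bytes, generate):
        self.capacity = capacity_bytes
        self.generate = generate  # callable: asset name -> bytes
        self.items = OrderedDict()
        self.used = 0

    def get(self, name):
        if name in self.items:
            self.items.move_to_end(name)  # mark as most recently used
            return self.items[name]
        data = self.generate(name)
        self.items[name] = data
        self.used += len(data)
        # Evict least recently used assets, but never the one just added.
        while self.used > self.capacity and len(self.items) > 1:
            _, evicted = self.items.popitem(last=False)
            self.used -= len(evicted)
        return data

def fake_generate(name):
    """Stand-in for procedural generation: 100 bytes derived from the name."""
    return (name.encode() * 100)[:100]

cache = AssetCache(capacity_bytes=250, generate=fake_generate)
for asset in ["a", "b", "a", "c"]:
    cache.get(asset)
print(list(cache.items), cache.used)
```

Eviction is cheap in a procedural engine because an evicted asset is never lost; it can be rebuilt from its recipe the next time it is requested.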

Conclusion

If you’re a hobbyist or indie dev who wants to push the limits of what fits in a tiny executable, .kkrieger offers a playbook. Start by mastering procedural content, then layer in low-level optimizations, and finish with a tight packer. Keep your memory usage in check, test aggressively, and remember that every byte counts.

Next steps for you:

  • Clone the Farbrausch repository and run the kkrunchy unpacker to see the binary in action.
  • Build a simple procedural texture with werkkzeug3 and measure its byte footprint.
  • Write an assembly routine for a simple collision check and compare it to a C++ version.
  • Run lekktor on your project and watch the dead code get stripped away.

Who should try this? Indie devs, demo scene enthusiasts, and anyone fascinated by the intersection of art and engineering. Who shouldn’t? Projects that require large pre-baked assets or rely heavily on modern engines that already handle asset pipelines efficiently.

The next 96-KB challenge is waiting in your codebase.

References

Last updated: February 25, 2026
