Heisenbug Hunting in Async .NET Systems
You know that feeling when a bug just... vanishes the moment you try to look at it? You fire up the debugger, step through carefully, and everything works perfectly. No exception. No race condition. No problem. Until you run it again in production.
That's a Heisenbug—a bug that changes its behavior (or disappears entirely) when you try to observe it. The name comes from Heisenberg's uncertainty principle, and if you've ever built async message-driven systems in .NET, you know exactly what I'm talking about.
I've shipped async systems built on Rebus, NServiceBus, and Wolverine that worked beautifully in staging and blew up spectacularly under production load. The problem isn't the frameworks—it's that async distributed systems fail in fundamentally different ways than synchronous code, and our debugging instincts from the sync world don't translate.
This article is about a practical methodology for hunting these bugs down, based on Preethi Viswanathan's whitepaper "A Heisenbug Hunting Toolkit". The framework is built around six phases using open-source tools, and I'm going to show you how it applies specifically to .NET async systems using Marten, Wolverine, and NBomber.
What you'll need: .NET 8+ with Marten 7.x, Wolverine 3.x+, NBomber 6.x (via NBomber.Http.CSharp), and WireMock.Net. The chaos engineering sections use Rancher Desktop and LitmusChaos.
Why Async Changes Everything
When you're debugging synchronous code, you can step through a debugger line by line and trust what you see. The order of operations is predictable. A web API request comes in, you process it, you return a response. If something breaks, you can reproduce it locally, throw a breakpoint in, and watch the failure happen.
Async message-driven systems break that mental model completely.
You've got multiple message handlers running concurrently. Messages get retried when handlers fail. They might land in error queues. Competing consumers pull from the same queue. Eventual consistency means different parts of your system see different states at different times. A timing window that only opens when five specific messages arrive within 200 milliseconds of each other isn't something you can step through in a debugger.
Let me give you a concrete example. Here's a Wolverine message handler for reserving inventory that looks perfectly reasonable:
using Marten;
using Wolverine;

namespace TicketingSystem;

public record ReserveInventory(string ItemId, int Quantity, string OrderId);

public class InventoryHandler
{
    public async Task Handle(ReserveInventory command, IDocumentSession session)
    {
        var item = await session.LoadAsync<InventoryItem>(command.ItemId);
        if (item == null)
            throw new InvalidOperationException($"Item {command.ItemId} not found");

        if (item.Available >= command.Quantity)
        {
            // Race window: another handler can load this same item between
            // our load and the commit, and both will see the same Available.
            item.Available -= command.Quantity;
            session.Store(item);
        }
        else
        {
            throw new InvalidOperationException("Insufficient inventory");
        }
    }
}

public class InventoryItem
{
    public string Id { get; set; } = string.Empty;
    public string Name { get; set; } = string.Empty;
    public int Available { get; set; }
}
This code will pass your unit tests. It'll work great when you manually test it. It might even work fine in load testing if your load test doesn't create the right kind of concurrency.
But when you get a flash sale and 50 concurrent messages arrive for the same inventory item? You'll oversell. The race condition is right there: load the item, check availability, decrement, save. Between the load and the save, other handlers are doing the exact same thing with stale data.
The debugger won't show you this. Stepping through the code with a breakpoint changes the timing enough that the race condition disappears. That's the Heisenbug.
And here's what makes these bugs so insidious: you can run 10,000 concurrent requests in a standard load test and never see this race condition. A generic load test spreads requests across many endpoints and resources. The timing window where two handlers are both between load and save on the same document might be 5–10 milliseconds. Unless your load test is deliberately targeting that exact contention point with enough concurrency, the window never opens. Your tests pass. Your dashboard is green. And you ship a race condition to production.
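You can watch the lost update happen without any infrastructure at all. Here's a minimal, self-contained sketch of the same load/check/decrement/save shape, with plain in-memory state standing in for Marten (everything here is illustrative, not the real handler):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Shared mutable state standing in for the Marten document (illustrative only).
var available = new[] { 10 };

// Same shape as the handler above: load, await (the timing window), check, save.
async Task<bool> Reserve()
{
    var snapshot = available[0];     // load
    await Task.Delay(10);            // real handlers await I/O here
    if (snapshot < 1) return false;  // check against a now-stale value
    available[0] = snapshot - 1;     // save, silently overwriting concurrent writes
    return true;
}

// 50 "concurrent messages" for the same item: a flash sale in miniature.
var results = await Task.WhenAll(Enumerable.Range(0, 50).Select(_ => Reserve()));
var succeeded = results.Count(r => r);
Console.WriteLine($"succeeded={succeeded}, remaining={available[0]}");
// Far more than 10 reservations succeed even though only 10 items existed.
```

The `await` between the load and the save is doing the damage: every task reads the same snapshot before any of them writes, so the checks all pass against stale data.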
The Six-Phase Framework
Viswanathan's whitepaper proposes a systematic approach to finding and fixing these bugs before they reach production. The core idea is to move from reactive debugging ("it broke in production, now what?") to proactive chaos testing ("let's break it in controlled ways before shipping").
Here's the framework:
- Predict (MiroFish) — Identify high-risk service boundaries via swarm simulation
- Stress (NBomber) — Generate high-concurrency load to manufacture contention
- Fuzz (Bogus) — Stochastic edge-case data—extreme values, nulls, boundary conditions
- Isolate (WireMock) — Inject controlled latency and probabilistic failures into dependencies
- Contain (Rancher Desktop) — Local Kubernetes with CPU throttling and infrastructure-native faults
- Break (LitmusChaos) — Chaos injection—pod kills, network lag—to verify fixes hold under pressure
Each phase addresses a different failure mode. Predict helps you find where to look. Stress manufactures the concurrency needed to reproduce race conditions. Fuzz explores edge cases. Isolate lets you control timing. Contain adds infrastructure realism. Break validates that your fix actually works under chaos.
I'm going to focus on Phases 2, 4, and 6—Stress, Isolate, and Break—because those are the phases where .NET-specific tooling matters most and where I've gotten the most value in my own systems. Phases 1 (Predict), 3 (Fuzz), and 5 (Contain) are covered in the whitepaper.
Phase 2: Stress Testing with NBomber
You can't fix a race condition you can't reproduce. NBomber is a .NET load testing framework that lets you generate realistic concurrency patterns. Here's how you'd stress-test the inventory reservation endpoint:
using NBomber.CSharp;
using NBomber.Http.CSharp;

var httpFactory = HttpClientFactory.Create(
    name: "http_factory",
    initClient: () => new HttpClient { BaseAddress = new Uri("http://localhost:5000") }
);

var itemId = "concert-ticket-front-row";

var scenario = Scenario.Create("inventory_stress", async context =>
{
    var payload = new
    {
        ItemId = itemId,
        Quantity = 1,
        OrderId = Guid.NewGuid().ToString()
    };

    var request = Http.CreateRequest("POST", "/reserve")
        .WithJsonBody(payload);

    var response = await Http.Send(httpFactory, request);
    return response.IsError ? Response.Fail() : Response.Ok();
})
.WithLoadSimulations(
    Simulation.Inject(rate: 100, interval: TimeSpan.FromSeconds(1), during: TimeSpan.FromMinutes(2))
);

NBomberRunner
    .RegisterScenarios(scenario)
    .Run();
This simulates 100 concurrent reservation attempts per second for two minutes, all targeting the same inventory item. If you've got a race condition, this will find it. You'll see inventory go negative, or more reservations succeed than you have inventory for.
The key is that you're manufacturing the exact contention pattern that happens in production during a flash sale, but in a controlled environment where you can observe and measure the failure.
Phase 4: Isolate with Controlled Latency
Sometimes the race condition only appears when there's latency in your dependencies. Maybe your Marten document store is on a slower connection in production. Maybe there's network jitter. You can inject that latency deliberately to widen the timing window.
For HTTP-based services, use WireMock to virtualize dependencies and control timing:
using WireMock.RequestBuilders;
using WireMock.ResponseBuilders;
using WireMock.Server;

var server = WireMockServer.Start();

server
    .Given(Request.Create().WithPath("/inventory/*").UsingGet())
    .RespondWith(Response.Create()
        .WithStatusCode(200)
        .WithDelay(TimeSpan.FromMilliseconds(800))
        .WithBodyAsJson(new { id = "item1", available = 10 }));
Adding latency widens the timing window. A race condition that happens 0.3% of the time at normal speed might jump to 1.2% with 800ms of latency. That makes it much easier to observe and fix.
Note: WireMock virtualizes HTTP dependencies. If you need to inject latency into database calls or other non-HTTP connections, look at Toxiproxy or network-level throttling tools.
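Another option for HTTP dependencies your own code calls out to is fault injection inside the process. Here's a hedged sketch of a `DelegatingHandler` (the name `ChaosHandler` and its defaults are mine, not from any library) that adds latency and probabilistic failures to any `HttpClient` pipeline:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Usage: wrap the real handler when building the client under test.
var client = new HttpClient(new ChaosHandler(
    delay: TimeSpan.FromMilliseconds(800),
    failureRate: 0.05,
    inner: new HttpClientHandler()));

// Injects latency and probabilistic failures into an HttpClient pipeline.
public class ChaosHandler : DelegatingHandler
{
    private readonly TimeSpan _delay;
    private readonly double _failureRate;
    private readonly Random _random = new();

    public ChaosHandler(TimeSpan delay, double failureRate, HttpMessageHandler inner)
        : base(inner)
    {
        _delay = delay;
        _failureRate = failureRate;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(_delay, cancellationToken);     // widen the timing window
        if (_random.NextDouble() < _failureRate)         // probabilistic fault
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
        return await base.SendAsync(request, cancellationToken);
    }
}
```

Because the handler sits inside the pipeline, the fault fires before the request ever leaves the process, which makes it easy to drop into integration tests without standing up a proxy.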
The Fix
Once you've reproduced the bug reliably through stress testing, you need to fix it. The fix isn't always what you'd instinctively reach for—distributed locks feel safe, but they create new failure modes under load.
For this race condition, the cleanest fix is optimistic concurrency with Marten's IVersioned interface:
using Marten;
using Marten.Metadata;
using Wolverine;
using Wolverine.Marten;

public class InventoryItem : IVersioned
{
    public string Id { get; set; } = string.Empty;
    public string Name { get; set; } = string.Empty;
    public int Available { get; set; }
    public Guid Version { get; set; } // Marten tracks the document version here
}

public class InventoryHandler
{
    public static IEnumerable<string> Validate(ReserveInventory command, InventoryItem item)
    {
        if (item.Available < command.Quantity)
            yield return "Insufficient inventory";
    }

    public static IMartenOp Handle(
        ReserveInventory command,
        [Entity(Required = true)] InventoryItem item)
    {
        item.Available -= command.Quantity;
        return MartenOps.Store(item);
    }
}
When InventoryItem implements IVersioned, Marten detects version conflicts automatically. If two handlers try to save conflicting updates, Marten throws a ConcurrencyException and Wolverine retries with fresh state. The race resolves itself through intelligent retries.
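If you'd rather make the retry behavior explicit than rely on defaults, Wolverine's error-handling policies let you spell out how version conflicts are treated. A sketch of the bootstrapping, assuming the standard generic-host setup (policy method names may vary slightly across Wolverine versions, so treat this as a starting point):

```csharp
using Marten.Exceptions;
using Microsoft.Extensions.Hosting;
using Wolverine;

var host = Host.CreateDefaultBuilder(args)
    .UseWolverine(opts =>
    {
        // Retry the message with fresh state when Marten detects a version
        // conflict; give up and dead-letter it if the contention persists.
        opts.Policies.OnException<ConcurrencyException>()
            .RetryTimes(3);
    })
    .Build();
```

Keep the retry count small: under genuine high contention, endless retries just convert a race condition into a throughput problem.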
For high-contention scenarios (flash sales, limited inventory), there's an even better solution: partitioned sequential messaging. Messages for the same ItemId route to the same queue and process sequentially. Different items process in parallel. The race condition becomes structurally impossible—not because you're detecting conflicts, but because the architecture prevents concurrent execution.
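The mechanics behind that pattern are simple enough to sketch without any framework: hash the key to one of N single-consumer queues, so work for the same key is serialized while different keys still run in parallel. This is an illustrative toy built on `System.Threading.Channels`, not Wolverine's implementation:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// N partitions, each drained by exactly one consumer loop.
const int partitionCount = 4;
var partitions = new Channel<Func<Task>>[partitionCount];
for (var i = 0; i < partitionCount; i++)
{
    partitions[i] = Channel.CreateUnbounded<Func<Task>>();
    var reader = partitions[i].Reader;
    // One consumer per partition: work for a given key never runs concurrently.
    _ = Task.Run(async () =>
    {
        await foreach (var work in reader.ReadAllAsync())
            await work();
    });
}

// Work for the same key always lands on the same single-consumer partition.
ValueTask Enqueue(string key, Func<Task> work) =>
    partitions[(key.GetHashCode() & 0x7fffffff) % partitionCount].Writer.WriteAsync(work);

// Demo: 100 read-modify-write operations on one key, with the same unsafe
// load/await/save shape as before; sequencing makes lost updates impossible.
var count = 0;
for (var i = 0; i < 100; i++)
    await Enqueue("item-1", async () =>
    {
        var snapshot = count;
        await Task.Delay(1);
        count = snapshot + 1;
    });

// Wait for the partition to drain, then observe: no lost updates.
var done = new TaskCompletionSource();
await Enqueue("item-1", () => { done.SetResult(); return Task.CompletedTask; });
await done.Task;
Console.WriteLine($"count={count}");
```

Note that the unsafe read-modify-write body is unchanged from the race-condition example; only the execution topology changed, and that alone eliminates the bug for any single key.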
💡 Want the full deep-dive? I wrote a companion post that covers Wolverine's concurrency patterns in depth—including when to use optimistic concurrency vs. partitioning, why distributed locks fail under load, and how to think about designing concurrency into your architecture: Wolverine's Answer to the Distributed Lock
The key insight from that article: stop thinking about preventing concurrent access. Start thinking about designing the system so contention can't happen structurally. That mental shift changes how you build async distributed systems.
Phase 6: Validation Under Chaos
You've applied the fix. The NBomber stress test now passes cleanly. Are you done?
Not yet. The fix needs to hold under real-world chaos: pods restarting, network partitions, CPU throttling. That's where LitmusChaos (or similar chaos engineering tools) comes in.
For .NET systems, you might use Rancher Desktop to run your Wolverine app in a local Kubernetes cluster and inject faults:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: inventory-chaos
spec:
  engineState: active
  appinfo:
    appns: default              # namespace of the target app (adjust to yours)
    applabel: app=inventory     # label selector for the pods to target
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: CHAOS_INTERVAL   # delete a pod every 10 seconds
              value: "10"
            - name: FORCE
              value: "false"
This deletes random pods every 10 seconds while your NBomber test runs. If your fix depends on in-memory state, this will expose it. If you've got optimistic concurrency or partitioned messaging configured, your system should maintain 100% consistency even with continuous pod churn.
The TicketRush Case Study
The whitepaper walks through a case study—a fictional ticketing platform called TicketRush—that hits exactly the scenario we've been discussing: overselling inventory during flash sales. TicketRush isn't a .NET system, but the race conditions and fix strategies are identical to what we just walked through with Marten and Wolverine.
Here's what makes their approach brilliant: they didn't just run a load test and hope for the best. They intentionally widened the timing window by adding 800ms of latency to database calls and throttling CPU through Rancher Desktop. That single move took a race condition that failed 0.3% of the time and pushed it to 1.2% failure rate. Suddenly the bug wasn't a phantom—it was reproducible.
Then they fixed it with a distributed lock (the whitepaper's choice; for the .NET version of this scenario I'd still reach for optimistic concurrency or partitioning, as above) and validated the fix under sustained chaos. Pod deletions every 10 seconds. A 30-minute test run. 100% consistency maintained. That's not luck. That's evidence.
Metrics That Matter
You can run all the chaos tests you want, but if you're not measuring whether they're actually working, you're just building elaborate tooling.
There are five metrics that actually predict whether your async system will hold up in production. Most teams don't measure any of them:
Mean Time to Reproduce (MTTR_reproduce): How long does it take to reproduce an intermittent bug? The goal is minutes, not weeks.
Heisenbug Escape Rate: How many intermittent failures reach production each quarter? Track this to see if your chaos testing is catching them earlier.
Chaos Test Coverage: What percentage of your service boundaries have active chaos experiments running? If you're not testing a boundary under chaos, you're trusting luck.
Concurrent Load Test Coverage: What percentage of your message handlers and endpoints are tested under realistic concurrency? If the answer is "none," you're going to have a bad time in production.
Fix Verification Rate: When you fix a race condition, do you validate the fix under chaos, or just run your existing tests and ship it? The difference matters.
These aren't vanity metrics. They're leading indicators of whether your async system will survive production load.
The Mindset Shift
Here's the thing I've learned the hard way: the biggest challenge with async message-driven systems isn't technical. It's mental.
If you approach async distributed systems with synchronous debugging intuitions, you're going to build systems that look great in development and fail mysteriously in production. You'll add logs and try to trace execution order, but the logs won't help because the timing changes when you add logging. You'll run load tests, but they won't find the race conditions because you're not generating the right concurrency patterns.
The chaos-first philosophy isn't just a testing strategy—it's an operating principle. Async systems will fail in ways you didn't anticipate. You don't build confidence by hoping your tests are good enough. You build it by deliberately breaking things in a controlled environment before production does it for you. When you're running distributed systems at scale, this isn't optional.
That's what this framework gives you: a systematic way to manufacture the chaos that reveals the bugs, fix them with evidence rather than guesswork, and validate that your fixes actually work under pressure.
When you adopt tools like Wolverine or NServiceBus, you're not just adopting a messaging framework. You're adopting a completely different failure model. The sooner you adjust your testing and debugging approach to match that reality, the fewer 3 AM pages you'll get.
The whitepaper is a practical guide to making that shift. I highly recommend reading the full paper at https://zenodo.org/records/19390360 — especially if you're responsible for async .NET systems in production.
Because the next Heisenbug is already in your code. The question is whether you'll find it in your chaos test environment or in production.