Developing Error Handling Strategies for Asynchronous Messaging

I’m furiously working on what I hope is the last sprint toward a big new Jasper 2.0 release. Part of that work has been a big overhaul of the error handling strategies with an eye toward solving the real world problems I’ve personally experienced over the years doing asynchronous messaging in enterprise applications.

Whether you’re purposely using micro-services, having to integrate with 3rd party systems, or just working with the team down the hall’s services, it’s almost inevitable that an enterprise system will have to communicate with something else, or at the very least need to do some kind of background processing within the same logical system. For all those reasons, it’s likely that you’ll have to pull some kind of asynchronous messaging tooling into your system.

It’s also an imperfect world, and despite your best efforts your software systems will occasionally encounter exceptions at runtime. What you really need to do is to plan around potential failures in your application, especially around integration points. Fortunately, your asynchronous messaging toolkit should have a robust set of error handling capabilities baked in. This is maybe the single most important reason to use asynchronous messaging toolkits like MassTransit, NServiceBus, or the Jasper project I’m involved with rather than trying to roll your own one-off message handling code or depending strictly on communication via web services.

In no particular order, I think you need to have at least these goals in mind:

  • Craft your exception handling in such a way that it will very seldom require manual intervention to recover work in the system.
  • Build in resiliency to transient errors like networking hiccups or database timeouts that are common when systems get overtaxed.
  • Limit your temporal coupling to external systems, whether they live within your organization or come from 3rd parties. The point here is that you want your system to be at least somewhat functional even if external dependencies are unavailable. My off-the-cuff recommendation is to try to isolate calls to an external dependency within a small, atomic command handler so that you have a “retry loop” directly around the interaction with that dependency.
  • Prevent inconsistent state within your greater enterprise systems. I think of this as the “no leaky pipe” rule where you try to avoid any messages getting lost along the way, but it also applies to ordering operations sometimes. To illustrate this further, consider the canonical example of recording a matching debit and credit transaction between two banking accounts. If you process one operation, you have to do the other as well to avoid system inconsistencies. Asynchronous messaging makes that just a teeny bit harder maybe by introducing eventual consistency into the mix rather than trying to depend on two phase commits between systems, but who’s kidding who here, we’re probably all trying to avoid 2PC transactions like the plague.
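To make the “small, atomic command handler” idea a little more concrete, here is a minimal sketch. Everything in it is hypothetical (the SubmitPayment message, the SubmitPaymentHandler class, and the delegate standing in for the external call), and it is not tied to Jasper’s actual handler conventions. The only point is that the handler does exactly one thing, so retrying the message retries only the call to the external system:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical command message, purely for illustration
public record SubmitPayment(Guid PaymentId, decimal Amount);

// Hypothetical handler: its only job is the call to the external system.
// No database writes or other side effects happen here, so the messaging
// tool's retry policies can safely re-run it after a failure.
public class SubmitPaymentHandler
{
    // Stand-in for the real client of the external dependency
    private readonly Func<Guid, decimal, Task> _callExternalSystem;

    public SubmitPaymentHandler(Func<Guid, decimal, Task> callExternalSystem)
    {
        _callExternalSystem = callExternalSystem;
    }

    public Task Handle(SubmitPayment command)
    {
        // Any exception thrown here bubbles up to the messaging tool,
        // which decides whether to retry, requeue, or dead letter
        return _callExternalSystem(command.PaymentId, command.Amount);
    }
}
```

The design choice is to keep the local bookkeeping (saving the payment, publishing follow-up events) in other handlers, so that a failed external call never forces you to redo or undo unrelated work.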

Quick Introduction to Jasper

I’m using the Jasper framework for all the error handling samples here. Just to explain the syntax in the code samples, Jasper is configured at bootstrapping time with the JasperOptions type as shown in this sample below:

using var host = await Host.CreateDefaultBuilder()
    .UseJasper(opts =>
    {
        opts.Handlers.OnException<TimeoutException>()
        // Just retry the message again on the
        // first failure
        .RetryOnce()

        // On the 2nd failure, put the message back into the
        // incoming queue to be retried later
        .Then.Requeue()

        // On the 3rd failure, schedule the message to be retried
        // again after a configurable cool-off period
        .Then.ScheduleRetry(15.Seconds())

        // On the 4th failure, move the message to the dead letter queue
        .Then.MoveToErrorQueue()

        // Or instead you could just discard the message and stop
        // all processing too!
        .Then.Discard().AndPauseProcessing(5.Minutes());
    }).StartAsync();

The exception handling policies are “fall through”, meaning that you probably want to put more specific rules before more generic rules. The rules can also be configured either globally for all message types, or for specific message types. In most of the code snippets the variable opts will refer to the JasperOptions for the application.
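As a small illustration of that ordering advice, here is a configuration fragment that reuses only the fluent calls from the sample above. I’m assuming that opts is the JasperOptions for the application and that OnException<Exception>() is acceptable as a catch-all rule, so treat it as a sketch rather than a verified recipe:

```csharp
// Sketch only: assumes `opts` is the JasperOptions passed to UseJasper()

// More specific rule first: a SqlException gets one inline retry,
// then the message goes back on the incoming queue
opts.Handlers.OnException<SqlException>()
    .RetryOnce()
    .Then.Requeue();

// More generic rule last (assumed catch-all): any other exception
// sends the message to the dead letter queue
opts.Handlers.OnException<Exception>()
    .MoveToErrorQueue();
```

If the two rules were registered in the opposite order, the generic rule would presumably match a SqlException first and the retry policy would never apply.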

More in the brand new docs on error handling in Jasper.

Transient or concurrency errors that hopefully go away?

Let’s assume that you’ve done enough testing to remove most of the purely functional errors in your system. Once you’ve reached that point, the most common kind of error in my experience with system development is transient errors like:

  • Network timeouts
  • Database connectivity errors, which could be related to network issues or connection exhaustion
  • Concurrent access errors
  • Resource locking issues

For these types of errors, I think I’d recommend some sort of exponential backoff strategy that attempts to retry the message inline, but with an increasingly longer pause in between attempts like so:

// Retry the message again, but wait for the specified time
// The message will be dead lettered if it exhausts the delay
// attempts
opts.Handlers
    .OnException<SqlException>()
    .RetryWithCooldown(50.Milliseconds(), 100.Milliseconds(), 250.Milliseconds());

What you’re doing here is retrying the message a certain number of times, but with a pause to slow down processing in the system to allow for more time for a distressed resource to stabilize before trying again. I’d also recommend this approach for certain types of concurrency exceptions where only one process at a time is allowed to work with a resource (a database row? a file? an event store stream?). This is especially helpful with optimistic concurrency strategies where you might just need to start processing over against the newly changed system state.
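If it helps to see the mechanics, here is a rough, self-contained sketch of what an inline retry with cooldown amounts to. The CooldownRetry helper is hypothetical and is not how Jasper implements RetryWithCooldown(); it only shows the shape of the loop: try, wait a little longer after each failure, and give up once the delays are used up.

```csharp
using System;
using System.Threading.Tasks;

public static class CooldownRetry
{
    // Hypothetical helper, not part of Jasper. Runs `action` and, on a
    // matching exception, waits for the next delay before trying again.
    // Returns the number of attempts it took to succeed. Once the delays
    // are exhausted, the last exception propagates to the caller, which
    // is the point where a messaging tool would dead letter the message.
    public static async Task<int> ExecuteAsync<TException>(
        Func<Task> action, params TimeSpan[] delays)
        where TException : Exception
    {
        var attempts = 0;
        while (true)
        {
            attempts++;
            try
            {
                await action();
                return attempts;
            }
            catch (TException) when (attempts <= delays.Length)
            {
                // delays[0] after the 1st failure, delays[1] after the 2nd, etc.
                await Task.Delay(delays[attempts - 1]);
            }
        }
    }
}
```

With three delays, the action is attempted at most four times: the initial attempt plus one retry after each pause.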

I’m leaving it out for the sake of brevity, but Jasper will also let you put a message back at the end of the incoming queue or even schedule the next attempt out of process for a later time.

You shall not pass! (because a subsystem is down)

A few years ago I helped design a series of connected applications in a large banking ecosystem that ultimately transferred money from incoming payments in a flat file to a commercial, off the shelf (COTS) system. The COTS system exposed a web service endpoint we could use for our necessary integrations. Fortunately, we designed the system so that inputs to this service happened in a message handler fed by a messaging queue, so we could retry just the final call to the COTS web service in case of its common transient failures.

Great! Except that what also happened was that this COTS system could very easily be put into an invalid state where it could not handle any incoming transactions. In our then naive “retry all errors up to 3 times then move into a dead letter queue” strategy, literally hundreds of transactions would get retried those three times, spam the hell out of the error logs and production monitoring systems, and all end up in the dead letter queue where a support person would have to manually move them back to the real queue later after the COTS system was fixed.

This is obviously not a good situation. For future projects, Jasper will let you pause all incoming messages from a receiving endpoint (like a message queue) if a particular type of error is encountered like this:

using var host = await Host.CreateDefaultBuilder()
    .UseJasper(opts =>
    {
        // The failing message is requeued for later processing, then
        // the specific listener is paused for 10 minutes
        opts.Handlers.OnException<SystemIsCompletelyUnusableException>()
            .Requeue().AndPauseProcessing(10.Minutes());

    }).StartAsync();

Using that capability above, if you have all incoming requests to use an external web service coming through a single queue and receiving endpoint, you will be able to pause all processing of that queue when you detect an error that implies that the external system is completely unusable, and then automatically try to restart listening later. All without user intervention.

Jasper would also enable you to chain additional actions to take after encountering that exception to send other messages or maybe raise some kind of alert through email or text that the listening has been paused. At the very worst, you could also use some kind of log monitoring tool to raise alerts when it sees the log message from Jasper about a listening endpoint being paused.

Dealing with a distressed resource

All of the other error handling strategies I’ve discussed so far have revolved around a single message. But what if you’re seeing a high percentage of exceptions across all messages for a single endpoint, which may imply that some kind of resource like a database is overloaded?

To that end, we could use a circuit breaker approach to temporarily pause message handling when a high number of exceptions are happening across incoming messages. This might help alleviate the load on the distressed subsystem and allow it to catch up before processing additional messages. That usage in Jasper is shown below:

using var host = await Host.CreateDefaultBuilder()
    .UseJasper(opts =>
    {
        opts.Handlers.OnException<InvalidOperationException>()
            .Discard();

        opts.ListenToRabbitQueue("incoming")
            .CircuitBreaker(cb =>
            {
                // Minimum number of messages encountered within the tracking period
                // before the circuit breaker will be evaluated
                cb.MinimumThreshold = 10;

                // The time to pause the message processing before trying to restart
                cb.PauseTime = 1.Minutes();

                // The tracking period for the evaluation. Statistics are
                // tracked over this window of time
                cb.TrackingPeriod = 5.Minutes();

                // If the failure percentage is higher than this number, trip
                // the circuit and stop processing
                cb.FailurePercentageThreshold = 10;

                // Optional allow list
                cb.Include<SqlException>(e => e.Message.Contains("Failure"));
                cb.Include<SocketException>();

                // Optional ignore list
                cb.Exclude<InvalidOperationException>();
            });
    }).StartAsync();
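
To put some numbers behind the MinimumThreshold and FailurePercentageThreshold settings, here is a small sketch of the kind of decision a circuit breaker makes at the end of a tracking period. The CircuitBreakerMath helper is hypothetical and is not Jasper’s implementation. In particular, the exact boundary behavior (strictly “fewer than” the minimum, strictly “higher than” the percentage) is my reading of the comments above rather than something I’ve verified:

```csharp
using System;

public static class CircuitBreakerMath
{
    // Hypothetical helper, not Jasper's implementation: decides whether
    // the circuit should trip, given the message counts observed within
    // a single tracking period.
    public static bool ShouldTrip(
        int totalMessages,
        int failedMessages,
        int minimumThreshold,
        int failurePercentageThreshold)
    {
        // Too few messages to draw any conclusion yet
        if (totalMessages < minimumThreshold) return false;

        var failurePercentage = failedMessages * 100.0 / totalMessages;

        // Trip only when the failure rate is higher than the threshold
        return failurePercentage > failurePercentageThreshold;
    }
}
```

With the settings from the sample (a minimum of 10 messages and a 10 percent threshold), 9 failures out of 9 messages would not trip the circuit because the sample size is too small, while 3 failures out of 20 messages (15 percent) would.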

Nope, that message is bad, no soup for you!

Hey, sometimes you’re going to get an exception that implies that the incoming message is invalid and can never be processed. Maybe it applies to a domain object that no longer exists, maybe it’s a security violation. The point being the message can never be processed, so there’s no use in clogging up your system with useless retry attempts. Instead, you want that message shoved out of the way immediately. Jasper gives you two options:

using var host = await Host.CreateDefaultBuilder()
    .UseJasper(opts =>
    {
        // Bad message, get this thing out of here!
        opts.Handlers.OnException<InvalidMessageYouWillNeverBeAbleToProcessException>()
            .Discard();
        
        // Or keep it around in case someone cares about the details later
        opts.Handlers.OnException<InvalidMessageYouWillNeverBeAbleToProcessException>()
            .MoveToErrorQueue();

    }).StartAsync();

Related Topics

Now we come to the point of the post when I’m getting tired and wanting to get this finished, so it’s time to just mention some related concepts for later research.

For the sake of consistency within your distributed system, I think you almost have to be aware of the outbox pattern — and conveniently enough, Jasper has a robust implementation of that pattern. MassTransit also recently added a “real outbox.” I know that NServiceBus has an improved outbox planned, but I don’t have a link handy for that.

Again for consistency within your distributed system, I’d recommend you familiarize yourself with the concept of compensating actions, especially if you’re trying to use eventual consistency in your system and the secondary actions fail.

And lastly, I’m not the world’s foremost expert here, but you really want some kind of system monitoring that detects and alerts folks to distressed subsystems, circuit breakers tripping off, or dead letter queues growing quickly.
