Message retries are one of the most powerful patterns in message-based systems. They are probably one of the more compelling reasons to adopt a message-based architecture, because they make a system significantly more durable and accurate. Retries are one of my favorite features in NServiceBus, mostly because NServiceBus has that pattern covered and I can focus my time on the functionality the stakeholders are asking for.
Retries in Concept
In concept, message retries are quite simple. While a message is being processed, an exception is encountered and the message is scheduled to be retried in the future. This saves you in scenarios like a third-party API you consume going offline during an upgrade, or the database server being rebooted during a patch.
While the concept of message retries is simple, implementing this pattern is far from trivial. The naive implementation of automated retries is to simply drop the message back onto the queue, or to not finalize the consumption of a message when there is an unhandled exception. The first thing you will run into is the reality that if the database is down or the web service is offline, it is rarely for less than a second. Rapid-fire retries against a dependency that is still down will not make the transaction go through, and they might even bring your system to its knees. At this point you will realize you need to implement a back-off strategy for your retries.
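To make the back-off idea concrete, here is a minimal sketch of one common strategy, exponential back-off with a cap. The function name, parameters and default values are my own assumptions for illustration; they are not taken from NServiceBus or any other framework.

```python
def backoff_delay(attempt, base_seconds=1.0, factor=2.0, max_seconds=60.0):
    """Return how long to wait (in seconds) before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    # Each retry waits `factor` times longer than the previous one...
    delay = base_seconds * factor ** (attempt - 1)
    # ...but never longer than the cap, so the waits do not grow without bound.
    return min(delay, max_seconds)


delays = [backoff_delay(n) for n in range(1, 8)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

Even this small piece raises design questions (what base, what cap, whether to add jitter), which is part of why the full pattern is more work than it first appears.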
It Gets Complicated
A back-off strategy of a few seconds is not too challenging, but in reality the outage will probably be more than a few seconds. At this point you will need to figure out a way to defer processing messages until later, maybe 5 to 10 minutes. Maybe your transport supports this, but if it doesn't you will need some kind of persistence to store the messages for later and a scheduler to rehydrate the messages and process them in the future.
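As a rough illustration of the "store it and rehydrate it later" piece, the sketch below keeps deferred messages in an in-memory heap ordered by due time. Everything here is hypothetical (the class, its methods, and the use of plain numbers as timestamps), and a real implementation would need durable storage so deferred messages survive a restart.

```python
import heapq
import itertools


class DeferredMessageStore:
    """In-memory stand-in for the persistence a real system would need."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker for equal due times

    def defer(self, message, due_at):
        """Store `message` until the clock reaches `due_at` (seconds)."""
        heapq.heappush(self._heap, (due_at, next(self._sequence), message))

    def pop_due(self, now):
        """Remove and return every message whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, message = heapq.heappop(self._heap)
            due.append(message)
        return due


store = DeferredMessageStore()
store.defer({"id": "order-1"}, due_at=300)  # retry in 5 minutes
store.defer({"id": "order-2"}, due_at=600)  # retry in 10 minutes

ready_early = store.pop_due(now=60)   # nothing is due yet
ready_later = store.pop_due(now=300)  # order-1 is due, order-2 is not
```

A scheduler would call something like `pop_due` on a timer and put the returned messages back on the queue for processing.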
Once you solve those problems, you will quickly discover that some messages are just not recoverable without some kind of external input. Maybe your trading partner sent you some crazy data you never expected or something happened in an order you did not anticipate. These messages become poison pills in your system and you will want two things to happen in that case. First you want the poison message to get put off to the side so that the other messages can continue processing. Second you want to be alerted so you can investigate and replay that message once the issue is cleared.
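Those two requirements can be sketched in a few lines. This is a simplified, hypothetical processing loop (the function name, the `max_attempts` parameter and the alert format are all made up for illustration); it only shows the shape of the idea: a message that keeps failing is parked in an error queue and an alert is raised, while the other messages keep flowing.

```python
from collections import deque


def process_all(messages, handler, max_attempts=3):
    """Process each message, setting aside any that keep failing."""
    queue = deque((message, 1) for message in messages)
    processed, error_queue, alerts = [], [], []

    while queue:
        message, attempt = queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception as error:
            if attempt < max_attempts:
                # Put it at the back so other messages are not blocked.
                queue.append((message, attempt + 1))
            else:
                # Out of attempts: park the poison message and tell someone.
                error_queue.append(message)
                alerts.append(f"{message!r} failed: {error}")

    return processed, error_queue, alerts


def handler(message):
    if message == "bad":
        raise ValueError("unexpected data")


processed, error_queue, alerts = process_all(["a", "bad", "b"], handler)
```

Here `"a"` and `"b"` are processed normally, while `"bad"` ends up in `error_queue` with one entry in `alerts` after its third failed attempt.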
At some point you will also realize that not all situations should be handled in the same way. Some messages do not make sense to retry or the standard retry logic doesn't apply. You will need the ability to customize your retry logic for these specific scenarios.
The Reality Sets In
Consider that if you build all of this increasingly complex retry logic yourself, you will be the one that gets woken up to fix it when things go bump in the middle of the night. And they will. Your stakeholders will have no patience for the amount of time it will take to get this right since it does not directly fulfill their functional requirements. You will end up working on what is evolving into a framework over the weekend, during your lunch and in the middle of the night after you have done your 'real' work. To compound this issue, you will then need to document how the rest of your team can use this framework and answer their questions when they run into problems.
Hopefully you are getting the hint that just because you are capable of building this functionality, it doesn't mean you should. As a messaging framework, NServiceBus has battle-tested implementations of these patterns, written by people who spend their day focused on these problems. This is backed up with first-rate documentation, training and support.
NServiceBus To The Rescue
NServiceBus has two types of retries, Immediate Retries and Delayed Retries, which are part of a feature known as Recoverability.
With Immediate Retries, when message processing results in an exception being thrown, the message is immediately made available for consumption again. This type of retry is good for resolving things like database deadlocks and brief network issues. The default number of immediate retries is 5. That number is configurable, and immediate retries can even be disabled altogether.
I usually reduce the number of immediate retries to 3 or fewer. I have found that if there is an outage in a service or database you depend on and your system is under load, there can be cascading effects. The additional load of processing messages 5 times instead of once can cause your scale-out policy to kick in, and now you are slamming the dependent service with 5 times the requests. Neither of these things is going to improve your situation, and both are likely to extend the time it takes to fully recover.
After the immediate retries are exhausted, further exceptions will result in a Delayed Retry. This means the message will be scheduled to be available for consumption in the future, with an increasing delay for subsequent failures. The default behavior for Delayed Retries is a delay of 10, 20 and 30 seconds, and it is fully configurable.
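The increasing delay is easy to model. The sketch below is not NServiceBus code; it is a small illustration, with hypothetical function and parameter names, of a delay that grows by a fixed increment on each delayed retry, which reproduces the 10, 20 and 30 second defaults described above.

```python
def delayed_retry_delays(number_of_retries=3, time_increase_seconds=10):
    """Return the delay in seconds before each delayed retry, in order."""
    return [time_increase_seconds * n for n in range(1, number_of_retries + 1)]


default_delays = delayed_retry_delays()
print(default_delays)  # [10, 20, 30]

# A more patient, hypothetical configuration: five retries, one minute apart.
patient_delays = delayed_retry_delays(number_of_retries=5, time_increase_seconds=60)
print(patient_delays)  # [60, 120, 180, 240, 300]
```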
Delayed Retries are intended to address situations where a third-party API is temporarily offline or a database is not available. These are both common situations in distributed systems and having this level of automated retries will result in a significantly more durable system.
Once the configured immediate and delayed retries are exhausted, the message is sent to the error queue. This works much like a dead-letter queue: the failing message is set aside so that it cannot block the messages behind it. Once a message is in the error queue, ServiceControl can consume it and make it easy to replay that message using ServicePulse or ServiceInsight.
A detail the NServiceBus documentation does not highlight is that the exception stack trace and other diagnostic details are stored in the message headers of failed messages. These headers are extremely valuable for tracing issues and replaying messages.
The diagnostic headers include details like the machine and named endpoint that sent the message, as well as the endpoint and machine that attempted to process it. I have seen more than one scenario where these headers helped diagnose issues with a specific server. That kind of problem can drive you crazy in a scaled-out scenario, since the failure happens on one server and not the others and shows up as an intermittent failure.
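To give a feel for what that looks like, here is a hypothetical sketch of enriching a failed message's headers before it is parked. The header names and the helper function are invented for this example and are not the actual NServiceBus header keys; the point is simply which details are worth capturing at the moment of failure.

```python
import socket
import traceback
from datetime import datetime, timezone


def add_failure_headers(headers, error, endpoint_name):
    """Return a copy of `headers` with diagnostic details about a failure."""
    enriched = dict(headers)  # leave the original headers untouched
    enriched.update({
        "ExceptionType": type(error).__name__,
        "ExceptionMessage": str(error),
        "StackTrace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        "ProcessingEndpoint": endpoint_name,
        "ProcessingMachine": socket.gethostname(),
        "TimeOfFailure": datetime.now(timezone.utc).isoformat(),
    })
    return enriched


# Headers stamped by the (hypothetical) sender of the message.
original = {"OriginatingEndpoint": "Sales", "OriginatingMachine": "web-01"}

try:
    raise TimeoutError("database did not respond")
except TimeoutError as error:
    failed = add_failure_headers(original, error, endpoint_name="Billing")
```

With the sending and processing machines recorded side by side, a failure that only happens on one server becomes much easier to spot.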
Consider that if you build your own messaging framework, message headers are not something you will realize you need until after you have struggled to trace problems with messages.
Retries will make your system more durable in most cases, but do not make sense for messages that have no possibility of being processed. NServiceBus will by default send any message that fails with a MessageDeserializationException to the error queue without retrying it.
You can also define your own exceptions that will result in bypassing retries. This can be helpful in scenarios where you receive messages with invalid data or the data in the message is obsolete.
Use caution with this approach to handling invalid data. Consider that there is probably an expectation that the message is processed and you have created a situation where that is impossible. Use this capability for handling invalid data only after you have considered other options like validating data before it is sent in a message.
Custom Recoverability Policy
Almost everything about the recovery policy can be customized. Some of the aspects of retries that can be customized include the following:
- Number of immediate and delayed retries for all messages
- Number of immediate and delayed retries for specific exceptions
- Duration of a delayed retry for all messages or specific messages
- Define unrecoverable exceptions that are sent to the error queue without retries
- Discard messages that no longer make sense to process based on exception type
Consider the scenario where you are integrated with an API that returns 503 (Service Unavailable) on a regular basis and you know it usually comes back in 5 minutes. This is a good case for a custom recoverability policy.
You can create a custom policy that will bypass the immediate retries and move right to a delayed retry scheduled for 5 minutes in the future. This policy reduces the load on your system and might actually make the API more reliable since your endpoint is not slamming it with requests every time the API stumbles.
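The decision logic of such a policy might look roughly like the sketch below. It is a language-neutral illustration, not the NServiceBus API: the exception classes, the `Action` type, the function signature and the retry limits are all assumptions made for this example. It also folds in an unrecoverable exception to show how the two customizations sit side by side.

```python
from dataclasses import dataclass


class ServiceUnavailableError(Exception):
    """Hypothetical exception raised when the API answers 503."""


class InvalidDataError(Exception):
    """Hypothetical exception for messages that can never succeed."""


@dataclass(frozen=True)
class Action:
    kind: str  # "immediate", "delayed" or "error_queue"
    delay_seconds: int = 0


def custom_policy(error, immediate_done, delayed_done,
                  max_immediate=3, max_delayed=3):
    """Decide what to do with a failed message."""
    if isinstance(error, InvalidDataError):
        return Action("error_queue")  # retrying cannot help
    if isinstance(error, ServiceUnavailableError):
        if delayed_done < max_delayed:
            # Skip immediate retries and come back in 5 minutes.
            return Action("delayed", delay_seconds=300)
        return Action("error_queue")
    # Everything else: immediate retries first, then growing delays.
    if immediate_done < max_immediate:
        return Action("immediate")
    if delayed_done < max_delayed:
        return Action("delayed", delay_seconds=10 * (delayed_done + 1))
    return Action("error_queue")


first_503 = custom_policy(ServiceUnavailableError(), immediate_done=0, delayed_done=0)
bad_data = custom_policy(InvalidDataError(), immediate_done=0, delayed_done=0)
```

In this sketch `first_503` is a delayed retry 300 seconds out, with no immediate retries spent on an API that is known to be down, and `bad_data` goes straight to the error queue.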
Retries are a key component in making your distributed system resilient, assuming you have a solid framework that implements the pattern. NServiceBus has a battle-tested implementation of retries that works out of the box and can be easily customized to accommodate your specific requirements.
While NServiceBus has many other features, Recoverability alone justifies the cost.