Recently I had an opportunity to build a proof of concept that gives our experiences some degree of resilience against transient errors. Our upstream APIs are built on Microsoft Web API and are pretty reliable. The downstream legacy mainframe systems we rely on, however, are extremely unstable, throwing transient errors at all hours of the day, any day of the week. Some go into maintenance mode every night for up to two hours!
Surely we are not the only ones plagued by legacy systems. Someone must have solved this already; all I need to do is google it, copy and paste, done. To my surprise, everywhere I looked, the solutions on offer were too naive, bordering on incorrect, to be used in production.
“Insanity is doing the same thing over and over again and expecting different results”
The critical criterion that warrants a retry strategy is that you know what, why and when your system fails, and you know that there is a statistically high enough chance that it will get back up at some point not too far in the future.
If you are unsure what type of failure to anticipate and whether or not it is recoverable, trying again is simply beating a dead horse at best. At worst, the code base ends up littered with useless retry code that only makes the next developer’s life miserable. In this case, the best option is to FAIL FAST and not retry at all.
If your core infrastructure lives in the cloud, you probably don’t need retry. Most reputable cloud providers have excellent SLAs, so the effort of implementing retry is high relative to the benefit it returns, which makes it economically unattractive.
If you have a mission-critical operation, or a web of interdependent microservices that are mission-critical, even a ridiculously high SLA is not enough, and retry alone will not solve your problem, as this Netflix Engineering article explains well. It requires a carefully designed, full-fledged resilience engineering effort that covers retry, chaos, circuit breakers, eventual consistency, monitoring, human intervention, etc., which is out of scope for this discussion.
Obviously we can’t retry forever. We need to know roughly when the failed system is expected to come back up, so that we can determine the retry intervals and how long to keep retrying.
A popular approach is exponential backoff with jitter, as implemented in Polly. Exponential backoff means the gaps between retries grow larger and larger as we progress through the retries. Normally the gap is a power of some number, e.g. 2^n, where n is the retry number. So for six retries, we wait 2^1 = 2 seconds, then 2^2 = 4 seconds, then 2^3 = 8 seconds, then 2^4 = 16 seconds, then 2^5 = 32 seconds, then 2^6 = 64 seconds. The trouble with a fixed set of retry intervals is that when a service far down the dependency tree fails, it causes all the upstream services to retry, and to retry in exactly the same rhythm. Depending on how large the dependency tree is, this could amount to a DoS attack on our own network. With jitter, we intentionally introduce randomness into the retry rhythm, so that the retry calls generated by upstream services are spread smoothly along the time axis. Below are the characteristics of a jitter algorithm recommended by the Polly community.
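Separately from the Polly-recommended algorithm, the arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not Polly's implementation: it uses the simple "full jitter" variant (a uniformly random delay between zero and the exponential delay), and the function names, the `cap` parameter and the `rng` parameter are assumptions made for the example.

```python
import random

def exponential_delays(retries, base=2.0):
    """Fixed exponential schedule: base^1, base^2, ..., base^retries seconds."""
    return [base ** n for n in range(1, retries + 1)]

def full_jitter_delays(retries, base=2.0, cap=64.0, rng=None):
    """'Full jitter' variant: each delay is drawn uniformly from
    [0, min(cap, base^n)], so callers that fail together do not retry together."""
    rng = rng if rng is not None else random.Random()
    return [rng.uniform(0, min(cap, base ** n)) for n in range(1, retries + 1)]

if __name__ == "__main__":
    print(exponential_delays(6))   # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
    print(full_jitter_delays(6))   # six random delays, different on every run
```

Two upstream services calling `full_jitter_delays` will almost never produce the same schedule, which is the whole point: the retries stop arriving in lockstep.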
Broadly speaking, in the event of a transient error, we have two options: retry without notifying the user, or get the user’s consent before retrying. If your app is customer facing, getting consent before retrying is, in my opinion, the most appropriate. Not only does it give us a chance to explain what went wrong, it is also an opportunity to prepare users and tell them what they can do if the retry does not work. In that case, manual intervention from a staff member may be required.
In the systems I work on, when our API encounters a downstream transient error, our React app displays a retry consent prompt. When the user consents, it calls the AgreeToRetry endpoint with some payload (state). In the example below I use POST, but it could be PUT, GET or PATCH. In the service layer, we call a Kafka producer to produce a retry-event. Redis could be a good alternative, as it provides a simple, fast, in-memory pub/sub system. The key difference is that Redis pub/sub does not guarantee message delivery, because messages are not kept anywhere, whereas Kafka persists messages, so they are not lost while a consumer is offline.
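Independently of the controller code in this post, the shape of the consent handler can be sketched as follows. It is a language-neutral illustration in Python under stated assumptions: `agree_to_retry`, the event fields and the `publish` callable (a stand-in for a real Kafka producer) are all hypothetical names. What matters is that the handler only publishes an event and returns straight away.

```python
import json
import time
import uuid

def agree_to_retry(state, publish):
    """Handle the user's consent: publish a retry event and return at once.

    `publish(topic, value)` is a hypothetical stand-in for a Kafka producer.
    """
    event = {
        "id": str(uuid.uuid4()),      # hypothetical correlation id
        "requested_at": time.time(),  # when the user agreed to retry
        "state": state,               # whatever is needed to redo the operation
    }
    publish("retry-event", json.dumps(event).encode("utf-8"))
    # 202 Accepted: the retry itself happens later, outside the request.
    return {"status": 202, "retry_id": event["id"]}

if __name__ == "__main__":
    sent = []  # in-memory stand-in for the topic
    print(agree_to_retry({"branch": "main"}, lambda t, v: sent.append((t, v))))
```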
To retry, we need the state at the point of failure so that we can reconstitute the operation. There are many ways to capture that state, e.g. persist it to a DB, keep it in memory, etc. In our case, we capture the retry state by putting it into the Kafka message value. It then gets picked up by the consumer leg for retry.
By default, Kafka accepts messages of up to 1 MB. Yes, we can increase the limit, but don’t go overboard, as Kafka isn’t meant to handle large messages. If you need to handle extremely large state, consider persisting it somewhere else and pushing to the message broker only a reference that points to its location.
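The "send a reference instead of the state" idea can be sketched like this. It is a minimal illustration with assumed names: `MAX_MESSAGE_BYTES` is an illustrative budget rather than Kafka's exact setting, and the `store` dictionary stands in for whatever database or blob store would hold large state.

```python
import json
import uuid

# Assumed budget for the example, kept under Kafka's default limit of about 1 MB.
MAX_MESSAGE_BYTES = 1_000_000

def build_message_value(state, store):
    """Inline small state; park large state in `store` and send a reference."""
    inline = json.dumps({"kind": "inline", "state": state}).encode("utf-8")
    if len(inline) <= MAX_MESSAGE_BYTES:
        return inline
    key = str(uuid.uuid4())
    store[key] = state  # stand-in for a database or blob store write
    return json.dumps({"kind": "reference", "key": key}).encode("utf-8")

def read_state(value, store):
    """Consumer side: resolve either form back into the original state."""
    message = json.loads(value)
    if message["kind"] == "inline":
        return message["state"]
    return store[message["key"]]
```

The consumer does not need to know in advance which form it will receive; the `kind` field tells it whether to read the state from the message or fetch it from the store.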
Almost all the code I have seen on the Internet simply runs the retry logic in the ASP.NET Core app during the request. The examples in Microsoft’s documentation on resilient HTTP requests even suggest baking the retry policy directly into HttpClient in Startup. Please don’t do that, unless you want to keep your user waiting while your code is busy retrying with exponential backoff, which might take a long time! When we run retry inside an HTTP request, the request will not return until the retry has finished. This reminds me of the scars left years ago by socket exhaustion caused by incorrect use of HttpClient, which Microsoft finally addressed with HttpClientFactory.
Retry should be carried out in an out-of-process, long-running background service, so that requests can return quickly. In the background service, we start a Kafka consumer in a separate thread.
In the Kafka consumer, we call the retry API endpoint with the state we pull out of the message.
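The consumer loop can be sketched as below. Note the simplifications, all of which are assumptions for illustration: a `queue.Queue` stands in for the Kafka topic, the worker is a thread in the same process rather than the separate console app this post builds, and `retry_operation` stands in for the call to the retry API endpoint.

```python
import queue
import threading
import time

def run_consumer(events, retry_operation, delays, sleep, stop):
    """Pull retry events and retry each one until it succeeds or delays run out."""
    while not stop.is_set():
        try:
            state = events.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing to do yet; check the stop flag again
        for delay in delays:
            sleep(delay)  # wait before each attempt
            try:
                retry_operation(state)  # stand-in for the retry API call
                break
            except Exception:
                continue  # assumed transient; move on to the next interval
        else:
            print("giving up on", state)  # hand over to manual intervention
        events.task_done()

def start_worker(events, retry_operation, delays, sleep=time.sleep):
    """Start the consumer loop on a background thread."""
    stop = threading.Event()
    thread = threading.Thread(
        target=run_consumer,
        args=(events, retry_operation, delays, sleep, stop),
        daemon=True,
    )
    thread.start()
    return thread, stop
```

Whoever puts an event on the queue returns immediately; all the waiting happens on the worker, which is the property we want from the real background service.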
Put it together
Here is the sequence diagram that explains how all the ingredients work together.
First of all, if you haven’t installed Kafka, go install it. Then fire up your terminal, start Zookeeper and Kafka:
Start a new ASP.NET Core solution. Spin up a new controller.
This controller is served by a service which fetches the GitHub branches of the ASP.NET Core documentation, then writes them to a file on disk. In this write-to-file operation, we purposefully inject chaos so that we can test our retry pattern.
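Chaos injection of this kind can be as simple as failing at random before the real work happens. The sketch below is an assumption about the general idea, not the code used in this post; `ChaosError`, `write_with_chaos` and `failure_rate` are hypothetical names.

```python
import random

class ChaosError(Exception):
    """Simulated transient failure."""

def write_with_chaos(write, data, failure_rate=0.5, rng=None):
    """Call `write(data)`, but first fail with probability `failure_rate`."""
    rng = rng if rng is not None else random.Random()
    if rng.random() < failure_rate:
        raise ChaosError("injected failure before write")
    write(data)
```

Setting `failure_rate` to 0.0 turns the chaos off and 1.0 makes every call fail, which is handy for testing the success path and the retry path separately.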
In the AgreeToRetry method, we call on a Kafka producer to produce a retry event.
In the consumer leg, we first create a console app and set up dependency injection on the Host.
Then spin up a background service called Worker, which will run our Kafka consumer in a separate thread.
Finally, the Kafka consumer. It is wrapped inside a long-running background service, and this is where we carry out our retry logic.
It is a long post and a ton of code. If you don’t remember or don’t quite get everything I explained, that is totally fine, but there is one thing that I would like you to take away: do not run retry logic in ASP.NET Core! Push it to an out-of-process, long-running background service and run it there, so that the request served by ASP.NET Core can return quickly.
If you would like to test it out or tweak it, here is the full source code. It is not production ready, but it has all the arms and legs you need should you want to scale it for production.