Thursday, July 14, 2011

EasyNetQ: How Should a Messaging Client Handle Errors?

EasyNetQ is my simple .NET API for RabbitMQ.

I’ve started thinking about the best patterns for implementing error handling in EasyNetQ. One of the aims of EasyNetQ is to remove as many infrastructure concerns from the application developer as possible. This means that the API should correctly handle any exceptions that bubble up from the application layer.

One of the core requirements is that we shouldn’t lose messages when the application throws. The question then becomes: when the application throws, where should the message it was consuming go? There seem to be three choices:

  1. Put the failed message back on the queue it was consumed from.
  2. Put the failed message on an error queue.
  3. A combination of 1 and 2.

Option 1 has the benefit that it’s the out-of-the-box behaviour of AMQP. In the case of EasyNetQ, I would simply catch any exceptions, log them, and just send a noAck command back to RabbitMQ. Rabbit would put the message at the back of the queue and then resend it when it got to the front.
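As a rough illustration of Option 1, here is an in-memory simulation in Python. It is not EasyNetQ or RabbitMQ client code: the `deque` stands in for the broker's queue, `consume_with_requeue` and `max_deliveries` are hypothetical names, and putting the failed message at the back of the queue simply mirrors the behaviour described above.

```python
import logging
from collections import deque

log = logging.getLogger("consumer")


def consume_with_requeue(queue, handler, max_deliveries):
    """Deliver messages to `handler`, requeueing any message whose handler throws.

    `max_deliveries` exists only to stop this simulation; a real broker
    would keep redelivering a failing message indefinitely.
    """
    processed = []
    deliveries = 0
    while queue and deliveries < max_deliveries:
        message = queue.popleft()
        deliveries += 1
        try:
            handler(message)
            processed.append(message)  # success: acknowledged and removed
        except Exception:
            # Catch, log, and hand the message back to the queue.
            log.exception("handler failed for %r", message)
            queue.append(message)  # assumption: requeued at the back
    return processed


def handler(message):
    if message == "bad":
        raise ValueError("cannot process " + message)


queue = deque(["ok-1", "bad", "ok-2"])
processed = consume_with_requeue(queue, handler, max_deliveries=10)
```

In this run the two good messages are processed once each, while the failing message is delivered again and again until the simulation's delivery limit is reached, and it is still sitting on the queue at the end.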

Another advantage of this technique is that it gives competing consumers the opportunity to process the message. If you have more than one consumer on a queue, Rabbit will send the messages to them in turn, so this is out-of-the-box.

The drawback of this method is that there’s the possibility of the queue filling up with failed messages. The consumer would just be cycling around throwing exceptions, and any messages that it might be able to consume would be slowed down by having to wait their turn amongst a long queue of failed messages.

Another problem is that it’s difficult to manually inspect the messages and selectively delete or retry them.

Option 2 is harder to implement. When an error occurs I would wrap the failed message in a special error message wrapper. This can include details about the type and location of the exception and other information such as stack traces. I would then publish the error message to an error exchange. Each consumer queue should have a matching error exchange. This gives the opportunity to bind generic error queues to all error exchanges, but also to have special case error consumers for particular queues.
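To make the error message wrapper more concrete, here is one possible shape for it, sketched in Python. Everything here is an assumption for illustration: the `ErrorEnvelope` fields, the `wrap_failure` helper, and the `<queue>.error` naming convention in `error_exchange_for` are hypothetical and are not EasyNetQ's actual types or names.

```python
import json
import traceback
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ErrorEnvelope:
    """Hypothetical wrapper carrying a failed message plus details of the failure."""
    queue: str
    routing_key: str
    body: str
    exception_type: str
    exception_message: str
    stack_trace: str
    failed_at: str


def error_exchange_for(queue):
    # Assumed naming convention: one error exchange per consumer queue.
    return queue + ".error"


def wrap_failure(queue, routing_key, body, exc):
    """Build the envelope from the original message and the caught exception."""
    return ErrorEnvelope(
        queue=queue,
        routing_key=routing_key,
        body=body,
        exception_type=type(exc).__name__,
        exception_message=str(exc),
        stack_trace="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        failed_at=datetime.now(timezone.utc).isoformat(),
    )


try:
    raise ValueError("order total is negative")
except ValueError as exc:
    envelope = wrap_failure("orders", "orders.created", '{"total": -1}', exc)

# The serialised envelope is what would be published to the error exchange.
payload = json.dumps(asdict(envelope))
target_exchange = error_exchange_for(envelope.queue)
```

Keeping the original body and routing key inside the envelope is what makes a later retry possible: the error consumer has everything it needs to republish the message unchanged.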

I would need to write an error queue consumer to store the messages in a database. I would then need to provide the user with some way to inspect the messages alongside the error that caused them to arrive in the error queue so that they could make an ignore/retry decision.

I could also implement some kind of wait-and-retry function on the error queue, but that would also add additional complexity.

It has the advantage that the original queue remains clear of failing messages. Failed messages and the error condition that caused the failure can be inspected together, and failed messages can be manually ignored or retried.

With the failed messages sitting in a database, it would also be simple to create a mechanism where those messages could be replayed on a developer machine to aid in debugging.

A combination of 1 and 2. I’m moving towards thinking that a combination of 1 & 2 might be the best strategy. When a message fails initially, we simply noAck it and it goes back to the queue. AMQP provides a Redelivered flag, so when the message is consumed a second time we can be aware that it’s a retry. Unfortunately there doesn’t seem to be a retry count in AMQP, so the best we can do is allow for a single retry. This has the benefit that it gives a competing consumer a chance to process the message.

No retry count is a problem. One option some people use is to roll their own ‘nack’ mechanism. In this case, when an error occurs in the consumer, rather than sending a ‘nack’ to Rabbit and relying on the built-in behaviour, the client ‘acks’ the message to remove it from the queue, and then re-publishes it via the default exchange back to the originating queue. Doing this gives the client access to the message and allows a ‘retry count’ header to be set.
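The ack-and-republish approach can be sketched as follows. This is an in-memory Python simulation, not real client code: messages are plain dictionaries, the two `deque` objects stand in for the originating queue and an error queue, and the `x-retry-count` header name, the `handle_failure` function, and the `max_retries` threshold are all assumptions made for the example.

```python
from collections import deque

RETRY_HEADER = "x-retry-count"  # hypothetical header name


def handle_failure(message, work_queue, error_queue, max_retries=3):
    """Called after the failed delivery has been acked (removed from the queue).

    Either republishes a copy with an incremented retry count, or gives up
    and routes the message to the error queue.
    """
    retries = message["headers"].get(RETRY_HEADER, 0)
    if retries >= max_retries:
        error_queue.append(message)
        return "error"
    republished = {
        "body": message["body"],
        # Copy the headers so the original delivery is left untouched.
        "headers": {**message["headers"], RETRY_HEADER: retries + 1},
    }
    work_queue.append(republished)  # republish to the originating queue
    return "retry"


work_queue = deque()
error_queue = deque()
outcomes = []

message = {"body": "payload", "headers": {}}
while True:
    outcome = handle_failure(message, work_queue, error_queue, max_retries=2)
    outcomes.append(outcome)
    if outcome == "error":
        break
    message = work_queue.popleft()  # the republished copy fails again
```

Because the client builds the republished message itself, it controls the headers, which is exactly what the built-in redelivery does not allow.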

After the single retry we fall back to Option 2. The message is passed to the error queue on the second failure.
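Put together, the combined strategy might look like the sketch below. Again this is a Python simulation under stated assumptions rather than EasyNetQ code: each queue entry is a `(body, redelivered)` pair standing in for a delivery and its Redelivered flag, and `run_consumer` is a hypothetical name.

```python
from collections import deque


def run_consumer(bodies, handler):
    """Retry each failing message once via requeue, then send it to the error queue."""
    queue = deque((body, False) for body in bodies)  # (body, redelivered)
    processed = []
    error_queue = []
    while queue:
        body, redelivered = queue.popleft()
        try:
            handler(body)
            processed.append(body)
        except Exception as exc:
            if redelivered:
                # Second failure: fall back to Option 2.
                error_queue.append((body, type(exc).__name__))
            else:
                # First failure: requeue, marked as redelivered.
                queue.append((body, True))
    return processed, error_queue


attempts = {}


def handler(body):
    attempts[body] = attempts.get(body, 0) + 1
    if body == "poison":
        raise ValueError("always fails")
    if body == "flaky" and attempts[body] == 1:
        raise RuntimeError("fails only on the first attempt")


processed, error_queue = run_consumer(["ok", "flaky", "poison"], handler)
```

Here the transient failure is absorbed by the single retry, while the message that fails twice ends up in the error queue after exactly two attempts.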

I would be very interested in hearing how other people have implemented error handling with AMQP/RabbitMQ.

Updated based on feedback on the 15th July


Travis Smith said...

You could keep track of the retry count in the headers of the message. Instead of noack'ing it you could update the retry count and redeliver it to the queue (putting it at the end of the line instead of the start). Once it hits a threshold, dead-letter it once RabbitMQ supports that or toss it to an error queue.

RobC said...

Have you ironed out any details of Option 2? For example, did you use one exchange to deliver all error messages, or one exchange per consumer? What routing key (if any) did you use for the error messages?

Mike Hadlow said...

Hi Rob,

Yes, I've got an implementation of Option 2 now working. See the code here:

I publish the error message to a message-specific exchange that binds to a general error queue. I use the original message's routing key, since that's what I create the binding with.

So far I've only implemented a minimal error queue processing mechanism; a simple console app that can dump the error messages to disk and then resubmit them. You can see the code for that here:

cocowalla said...

What about comms issues - let's say that a producer has created a message and wants to send it to a queue, but there are networking problems.

How is this handled so we don't lose that message? Is there some kind of retry mechanism, or 'store and forward' functionality?

hazzik said...

Hi, is it possible to replace DefaultConsumerErrorStrategy with another implementation? I want to implement Option 1 for my application.

Mike Hadlow said...

Hi hazzik,

Yes, you need to write your own version of RabbitHutch.CreateBus(). Pass your new IConsumerErrorStrategy into the QueueingConsumerFactory constructor. Check out the RabbitHutch.CreateBus() code; it should be pretty obvious.