We are encountering problems with the RabbitMQ implementation in ServiceStack. Looking at how it’s implemented, we believe messages are being Nak’d and then shoveled into the dead letter queue. The problem is we can’t figure out why.
The consistent markers are long-lived messages and a large number of messages backed up in the queue. Because of the implementation, these messages are published to the dlq rather than being moved there via an x-death header. When the issue happens, all messages are rapidly failed and moved to the dlq.
Our consumers are hosted on IIS and have no idle timeouts; we set this up to fix a race condition, which we thought was potentially the issue. Should AppHost.OnAfterInit start the mq server?
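For context, this is roughly how we wire it up today, simplified down, with a hypothetical MyRequest DTO and service standing in for our real messages (the registration lives in Configure and the server is started from AfterInitCallbacks):

```csharp
using Funq;
using ServiceStack;
using ServiceStack.Messaging;
using ServiceStack.RabbitMq;

// Hypothetical request DTO and service, standing in for our real messages.
public class MyRequest
{
    public string Name { get; set; }
}

public class MyServices : Service
{
    public object Any(MyRequest request) => $"Hello, {request.Name}";
}

public class AppHost : AppHostBase
{
    public AppHost() : base("MQ Host", typeof(MyServices).Assembly) { }

    public override void Configure(Container container)
    {
        container.Register<IMessageService>(c => new RabbitMqServer("localhost"));

        var mqServer = container.Resolve<IMessageService>();
        mqServer.RegisterHandler<MyRequest>(ExecuteMessage);

        // Start the MQ server only after the AppHost has finished initializing,
        // so handlers aren't invoked against a partially configured host.
        AfterInitCallbacks.Add(appHost => mqServer.Start());
    }
}
```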
At this point we have no idea why this fails. We can run 100k long-lived messages and it works fine, and then the next time 70k of the messages will succeed and 30k will fail into the dlq. It’s as though the consumer pulls the message and then immediately naks it. I know that under the covers ServiceStack uses a basic.get instead of a basic.consume. We have another, non-ServiceStack implementation used in a completely different application that never encounters these problems.
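In case it helps frame that difference: basic.get polls the broker one message at a time, while basic.consume registers a long-lived subscription the broker pushes into. A stripped-down sketch of both patterns with the .NET client (assuming the RabbitMQ.Client 6.x synchronous API and queues named with ServiceStack’s mq:TypeName.inq convention, with MyRequest as a placeholder):

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Polling style (basic.get): explicitly ask for one message at a time.
BasicGetResult result = channel.BasicGet("mq:MyRequest.inq", autoAck: false);
if (result != null)
{
    // ... process result.Body here ...
    channel.BasicAck(result.DeliveryTag, multiple: false);
}

// Subscription style (basic.consume): the broker pushes messages to a
// long-lived consumer as they arrive.
var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    // ... process ea.Body here ...
    channel.BasicAck(ea.DeliveryTag, multiple: false);
};
channel.BasicConsume(queue: "mq:MyRequest.inq", autoAck: false, consumer: consumer);

Console.ReadLine(); // keep the subscription alive for this sketch
```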
Do you have any suggestions or any thoughts on what is happening here? How should we proceed to fix the issue?
For reference, we have an updated RHL cluster for our RMQ server with an updated Erlang version and RabbitMQ Server version.
No, they don’t contain any information on why they failed. They never hit code execution, so the assumption was that an x-death header would be presented with the message. Unfortunately the nak reason is abstracted away and the message simply rolls back into the queue with a “retryattempt” appended to it. This results in the move to the dlq, but the underlying reason is lost.
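For completeness, this is roughly how we’ve been peeking at the dlq’d messages to check which headers actually arrive (again assuming RabbitMQ.Client 6.x and ServiceStack’s mq:TypeName.dlq naming, with MyRequest as a placeholder):

```csharp
using System;
using System.Text;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Peek at one dead-lettered message and dump its headers without consuming it.
var result = channel.BasicGet("mq:MyRequest.dlq", autoAck: false);
if (result?.BasicProperties?.Headers != null)
{
    foreach (var header in result.BasicProperties.Headers)
    {
        // String header values arrive as byte[] in the .NET client, so decode them.
        var value = header.Value is byte[] bytes ? Encoding.UTF8.GetString(bytes) : header.Value;
        Console.WriteLine($"{header.Key}: {value}");
    }
}

// Requeue so the peek doesn't remove the message from the dlq.
if (result != null)
    channel.BasicNack(result.DeliveryTag, multiple: false, requeue: true);
```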
Do you have any monitoring logs for the servers covering other resources (memory, network, CPU usage, etc.) from when the issue occurs? The “consumer pulls the message and then immediately naks it” behaviour combined with “long lived messages” sounds like a possible memory constraint on the consumer, which then throws errors when trying to process the subsequent messages as soon as they start.
I’ve experienced similar problems in large, long-running queue processing (Python-based); it was caused by a particular message requiring far more resources than expected, which constrained the other consumers on the same virtual machine.
Something you have probably checked, but good to rule out early if possible.
The only time ServiceStack MQ explicitly Nak’s a message is when it fails, but in that case the Error property of the Message should be populated, which is also written to the “Error” Message Header property in RabbitMQ, and an error should be logged.
If there’s no error logged and the “Error” header property isn’t being populated, then I can’t see how the message is being Nak’d by ServiceStack; something else must be doing it, or it’s being done implicitly, e.g. by the RabbitMQ Client when the BasicGet fails.
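One way to confirm whether it’s ServiceStack failing the message is to register the handler with the overload that takes an exception callback, so the underlying exception gets logged somewhere under your control. A minimal sketch, using a hypothetical MyRequest DTO and service; treat it as a diagnostic aid rather than a drop-in fix:

```csharp
using Funq;
using ServiceStack;
using ServiceStack.Logging;
using ServiceStack.Messaging;
using ServiceStack.RabbitMq;

public class MyRequest
{
    public string Name { get; set; }
}

public class MyServices : Service
{
    public object Any(MyRequest request) => $"Hello, {request.Name}";
}

public class AppHost : AppHostBase
{
    public AppHost() : base("MQ Host", typeof(MyServices).Assembly) { }

    public override void Configure(Container container)
    {
        var log = LogManager.GetLogger(typeof(AppHost));

        container.Register<IMessageService>(c => new RabbitMqServer("localhost"));
        var mqServer = container.Resolve<IMessageService>();

        // Exception callback overload: if ServiceStack is the one failing the
        // message, the underlying exception should surface here.
        mqServer.RegisterHandler<MyRequest>(
            ExecuteMessage,
            (handler, message, ex) =>
                log.Error($"MyRequest {message.Id} failed (RetryAttempts={message.RetryAttempts})", ex));

        AfterInitCallbacks.Add(appHost => mqServer.Start());
    }
}
```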
If there’s no error logging, the only thing I can recommend is debugging, i.e. from when ServiceStack pulls the message at:
If VS doesn’t support debugging it, you shouldn’t have an issue with JetBrains Rider, which offers a trial version.
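Before stepping through code, it’s also worth making sure ServiceStack’s internal logging is going somewhere visible, e.g. (a minimal sketch; swap the console logger for whichever ILogFactory adapter your host already uses):

```csharp
using ServiceStack.Logging;

// Set before the AppHost is created/initialized so MQ worker errors are captured too.
LogManager.LogFactory = new ConsoleLogFactory(debugEnabled: true);
```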