RabbitMq Client stops consuming messages

mzoellner · August 23, 2016, 12:59pm

We are using RabbitMq Server as a MessageService.
For a big bulk email invitation task we are registering a Message Handler with the RabbitMqServer and triggering it via the /oneway route of the SendEmailsDto.

rabbitMqServer.RegisterHandler<SendEmailsDto>(appHost.ServiceController.ExecuteMessage);

The email sending is a service that first sends out emails to a couple of people in an organization, then looks for child organizations and publishes a SendEmailsDto to the message queue for each one, so that the people of the child organization are also invited. This recursive process stops when there are no more child organizations.

public void Post(SendEmailsDto request)
{
    SendEmailToPeople(request.OrganizationId);

    var childOrganizations = GetChildOrganizations(request.OrganizationId);

    foreach (var childOrganization in childOrganizations)
    {
        _messageQueueClient.Publish(new SendEmailsDto { OrganizationId = childOrganization.Id });
    }
}

The issue we face is that not all messages that are pushed to the message queue are consumed. After a random number of messages, no more messages are taken from the in queue and everything basically just stops, leaving many unprocessed DTOs in the inq.

Are there any known issues with deadlocks when consuming messages from RabbitMQ?
Looking into the SS implementation of RabbitMQ, there should be an ACK or NAK sent to RabbitMQ to signal whether a message was processed successfully or not. This is probably not sent in our case, but it is not clear to us why.

mythz · August 23, 2016, 1:27pm

I won’t be able to tell what the issue is from here; there aren’t any outstanding issues with the RabbitMQ MQ Server that we know of.

There is some introspection available into the running MQ Server with the GetStats() and GetStatsDescription() methods which you could return in a Service to have some insight on the running MQ Server:

public IMessageService MqService { get; set; }

public object Any(MqStats request)
{
    return MqService.GetStatsDescription();
}

You can also try restarting any stopped MQ Workers with MqService.StartWorkerThreads(), but that only works if the RabbitMqWorker is identified as being in a stopped state, not if it has an active subscription and RabbitMQ just isn’t sending any more messages. In that case you could stop and then start the worker threads.
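A minimal sketch of such a stop/start Service follows. It assumes the registered IMessageService is a RabbitMqServer and that StopWorkerThreads() and StartWorkerThreads() are available on that concrete class; RestartMqWorkers and MqAdminService are hypothetical names made up for illustration, not part of ServiceStack:

```csharp
using System;
using ServiceStack;
using ServiceStack.Messaging;
using ServiceStack.RabbitMq;

// Hypothetical request DTO, invented for this sketch
public class RestartMqWorkers : IReturn<string> {}

// Hypothetical admin Service, invented for this sketch
public class MqAdminService : Service
{
    public IMessageService MqService { get; set; }

    public object Any(RestartMqWorkers request)
    {
        // Assumption: the stop/start methods live on the concrete RabbitMqServer,
        // so the injected IMessageService is cast before calling them
        var rabbitMqServer = MqService as RabbitMqServer;
        if (rabbitMqServer == null)
            throw new NotSupportedException("The registered IMessageService is not a RabbitMqServer");

        // Stop first so workers blocked on an active subscription are released
        rabbitMqServer.StopWorkerThreads();
        rabbitMqServer.StartWorkerThreads();

        // Return the stats so the caller can see the worker state after the restart
        return rabbitMqServer.GetStatsDescription();
    }
}
```

An endpoint like this can interrupt messages that are still being processed, so it is probably best restricted to administrators.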

If none of that helps, one other thing you could try is to change the workers to use Polling instead of a long-lived subscription which you can configure when initializing your RabbitMqServer with:

container.Register<IMessageService>(c => new RabbitMqServer {
    UsePolling = true
});

mzoellner · August 23, 2016, 3:34pm

Hi Demis,

thanks for your quick response.

I built in the stats endpoint, but it doesn’t provide any more insights into the issue.
However, I also built an endpoint that lets me restart the worker threads (using StopWorkerThreads and StartWorkerThreads) and that leads to a very interesting observation:

I deployed the changes to our production system which was once again refusing to work on messages. Right after the deployment nothing changed. It looked like the Threads that were holding onto the messages were not even stopped although I was deploying the IIS application and we do an app pool stop and app pool start during deployment. When I hit the endpoint to restart the rabbitmq threads, I got 4 messages in the dead letter queue with the following exceptions:

ThreadInterruptedException:Thread was interrupted from a waiting state. (3)
ThreadAbortException:Thread was being aborted. (1)

We were using three threads to work on that queue (using the noOfThreads setting). It seemed like those three threads were somehow in a waiting state and even the deployment wouldn’t change that. Only the call to StopWorkerThreads stopped them. How could that be? Is it because they are background threads?

This is the weirdest thing I have ever seen, because this only happens in our Production environment and on none of our Test environments, which are in every respect (infrastructure, code, data) a 1:1 copy of production. It is a nightmare to test, because it is not reproducible in the Test environments and the side effect on prod is sending out a bunch of emails.

The next thing I want to try is Polling, after that I am out of ideas.

Maybe you got another idea.

mzoellner · August 23, 2016, 4:14pm

Hi again,

I did some more research into the ThreadAbortExceptions and found out that they have to do with a database call that we make in order to get the people that we have to send an email to. The thread was basically hanging somewhere in that database call and then got interrupted when I tried to restart it.

So, I guess we have to look into that direction now, although it doesn’t explain why it only happens on Prod…

Thanks for your help anyway. I will post to this thread as soon as we have a solution.

mythz · August 23, 2016, 4:47pm

I don’t see how a background thread can survive after its AppDomain is recycled; I’d expect it to throw a ThreadAbortException, which terminates the thread.

When threads aren’t processing a message they’re blocked waiting for the next message from RabbitMQ. Stopping the worker threads terminates the TCP connection to RabbitMQ, which will free them from their blocking state.

Something else you can try is terminating the TCP connection on the RabbitMQ side by killing the connection from the connections tab of RabbitMQ’s Web UI: http://localhost:15672/#/connections

mzoellner · August 24, 2016, 2:30pm

Alright,

we found the real issue and it was in our code. We had an unfortunate data set that caused an endless loop of database calls. The thread abort/interrupt we saw in the exceptions was basically breaking that loop. It looked like the threads were surviving, but that was not the case; they just kept on spinning on the same message after the AppDomain restarted.
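For anyone with a similar recursive setup: one data shape that can make a walk over child organizations loop forever is a cycle in the hierarchy (this is only an illustrative assumption, not a description of our actual data). A minimal sketch of a guard that visits each organization at most once, with OrganizationWalker and getChildIds as hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, invented for this sketch
public static class OrganizationWalker
{
    // Returns every organization reachable from rootId exactly once,
    // even if the parent/child data contains a cycle
    public static List<int> CollectOrganizations(
        int rootId, Func<int, IEnumerable<int>> getChildIds)
    {
        var visited = new HashSet<int>();
        var order = new List<int>();
        var pending = new Queue<int>();
        pending.Enqueue(rootId);

        while (pending.Count > 0)
        {
            var id = pending.Dequeue();

            // HashSet.Add returns false if the id was already seen, which breaks cycles
            if (!visited.Add(id))
                continue;

            order.Add(id);
            foreach (var childId in getChildIds(id))
                pending.Enqueue(childId);
        }

        return order;
    }
}
```

With children 1 → 2, 3 and 3 → 1, the walk yields 1, 2, 3 and then stops instead of revisiting 1. When each organization is handled by its own queued message, as in our case, an in-memory set like this does not carry over between messages, so the visited ids would have to travel with the message or be stored somewhere shared.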

Sorry for bothering you with that and thanks a lot for your help. The restart endpoint was the hint that finally brought us to the right path.

Regards
Michael