Redis MQ not reconnecting after a connection timeout

The issue we are having is that when there is an internal network issue in Azure and the connection between the managed Azure App Service and the managed Redis server is lost, the MQ connections don’t come back up again. We are still able to use Redis as a cache, but a restart of the App Service is needed for the MQ to kick in again.

We have been running our applications on .NET Framework 4.7.2 on Windows Server for 3 years and this issue happens maybe 3–4 times a month. We recently migrated to .NET 5.0.5 (Docker container) and the issue happens there as well.

Is there any logic in ServiceStack to reconnect the MQ if it’s disconnected? Would it be better for us to migrate from Redis MQ over to Azure Service Bus?

Are you seeing any retries or anything else that might be related in the logs? If so, could you post them here (minus any sensitive information)? The RedisMqServer can take an optional retryCount; would you also be able to post the code showing how you are connecting to the managed Redis instance from your application in Azure?
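
For reference, a typical registration looks something like the sketch below, where the connection string and the Hello DTO are placeholders to substitute with your own:

// In AppHost.Configure(); the connection string is a placeholder for your managed Azure Redis instance
container.Register<IRedisClientsManager>(c =>
    new RedisManagerPool("my-cache.redis.cache.windows.net:6380?ssl=true&password={password}"));

// retryCount = how many times a failed message is retried before it's moved to the DLQ
var mqServer = new RedisMqServer(container.Resolve<IRedisClientsManager>(), retryCount: 2);
mqServer.RegisterHandler<Hello>(ExecuteMessage);
container.Register<IMessageService>(c => mqServer);

// start the MQ Server after the AppHost has initialized
AfterInitCallbacks.Add(host => mqServer.Start());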

Are you able to manually reproduce the issue?

Message: System.Net.Sockets.SocketException (110): Connection timed out

 File "/src/src/Taktikal.Onboarding.Logic/Logging/SentryLogger.cs", line 106, col 13, in void SentryLogger.InternalLog(object message, SentryLevel level, Exception ex)
  File "/src/src/Taktikal.Onboarding.PublicWeb/AppHost.cs", line 381, col 21, in void AppHost.RegisterMq(Container container)+(Exception ex) => { }
  File "C:\BuildAgent\work\3481147c480f4a2f\src\ServiceStack.Server\Messaging\Redis\RedisMqServer.cs", line 189, col 9, in void RedisMqServer.OnError(Exception ex)
  File "C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisPubSubServer.cs", line 386, col 21, in void RedisPubSubServer.NotifyAllSubscribers(string commandType)
  File "C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisPubSubServer.cs", line 186, col 13, in void RedisPubSubServer.SendHeartbeat(object state)
  ?, in void TimerQueueTimer.CallCallback(bool isThreadPool)
  ?, in void TimerQueueTimer.Fire(bool isThreadPool)
  ?, in void TimerQueue.FireNextTimers()

And here is another error at the same time: Exception in Redis MQ Server: Exceeded timeout of 00:00:10

   ServiceStack.Redis.RedisException: Exceeded timeout of 00:00:10
  File "C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs", line 728, col 25, in T RedisNativeClient.SendReceive<T>(byte[][] cmdWithBinaryArgs, Func<T> fn, Action<Func<T>> completePipelineFn, bool sendWithoutRead)
  File "C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisPubSubServer.cs", line 386, col 21, in void RedisPubSubServer.NotifyAllSubscribers(string commandType)

What happens after this is that I don’t see any errors when the message producers post to the MQ, and I see no log entries indicating that the MQ isn’t able to start and process the messages.

There is a built-in AutoRetry which automatically restarts the MQ Server when the heartbeat has elapsed, provided the MQ Server isn’t disposed and is in a known graceful-restart state. I’m not sure what’s preventing the restart in your use-case, but the best way for us to be able to resolve the issue is if we’re able to repro it ourselves. Otherwise the debug logs (i.e. debugEnabled:true) captured before it fails may show how it got into that invalid state.
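
If you want to try that, debug logging can be enabled by assigning a log factory before the AppHost is initialized, e.g. a sketch using the built-in console logger (swap in whichever logging provider you’re using):

// in Program.cs / the AppHost constructor, before Init()
LogManager.LogFactory = new ConsoleLogFactory(debugEnabled: true);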

If you’re not able to reliably reproduce it, it’d likely be too noisy to leave the debug logging enabled until it happens again, so can you print out the GetStatsDescription() before you restart the server, e.g:

var mqServer = HostContext.Resolve<IMessageService>();
mqServer.GetStatsDescription().Print();

Instead of restarting the App Service, can you let me know if manually restarting the MQ Server works, i.e:

var mqServer = HostContext.Resolve<IMessageService>();
mqServer.Stop();
mqServer.Start();

I’m not able to reproduce this locally. I can see that ServiceStack is trying to reconnect when I turn off the Redis server, and as soon as it comes back up again it reconnects.

I have added two service endpoints, one to get the MQ status and one to restart it. I will use them the next time this happens to get better details.
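
For anyone wanting to do the same, a minimal sketch of what such endpoints could look like (the DTO names and routes here are illustrative only):

[Route("/mq/stats")]
public class GetMqStats : IReturn<string> {}

[Route("/mq/restart")]
public class RestartMq : IReturnVoid {}

public class MqDebugServices : Service
{
    // returns the same output as mqServer.GetStatsDescription()
    public object Any(GetMqStats request) =>
        HostContext.Resolve<IMessageService>().GetStatsDescription();

    // stops and restarts the MQ Server on this instance
    public void Any(RestartMq request)
    {
        var mqServer = HostContext.Resolve<IMessageService>();
        mqServer.Stop();
        mqServer.Start();
    }
}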

Hello, I also have the exact same issue when running on Azure, and it has also been happening for a year or two.

I have 3 ServiceStack applications running on ASP.NET 4.8, all of which exchange messages via a Redis message queue. When we have the internal network issue in Azure, the applications are still able to push messages to the queue… so nothing is lost.
But ALL 3 applications fail to pop the messages.

All the handlers created by this type of code stop doing their job:

var clientsManager = ServiceStackHost.Instance.Resolve<IRedisClientsManager>();
var mqHost = new RedisMqServer(clientsManager, retryCount: 2);
mqHost.RegisterHandler<CallTask>(ServiceStackHost.Instance.ServiceController.ExecuteMessage);

I also cannot reproduce it at will; it happens a few times a month.

But your answer is giving me an idea…
I have Hangfire jobs running, including one that runs every minute where I can check the SIZE of the message queues. They start to grow more than usual when the issue is happening (see the sketch at the end of this post). In that case I can add your code to the job (but of course trigger it only once 🙂):

var mqServer = HostContext.Resolve<IMessageService>();
mqServer.GetStatsDescription();

and display it in the logs I am sending to Slack.
I can also add the code:

var mqServer = HostContext.Resolve<IMessageService>();
mqServer.Stop();
mqServer.Start();

and hopefully that will resolve the issue automatically.
So the next time it happens, I will post the results here; hopefully I’ll have the info from GetStatsDescription and will also know whether the stop and start works.
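
A rough sketch of that queue-size check as a Hangfire job method (assuming the registered IRedisClientsManager and the CallTask DTO from the snippet above; the threshold is arbitrary):

public void CheckMqQueueSizes()
{
    var redisManager = ServiceStackHost.Instance.Resolve<IRedisClientsManager>();
    using (var redis = redisManager.GetClient())
    {
        // Redis MQ in-queues are plain Redis lists, e.g. "mq:CallTask.inq"
        var inQLength = redis.GetListCount(QueueNames<CallTask>.In);
        if (inQLength > 100)
        {
            // queues are backing up: capture the MQ Server stats for the Slack logs
            var mqServer = HostContext.Resolve<IMessageService>();
            LogManager.GetLogger("MqMonitor").Warn(mqServer.GetStatsDescription());
        }
    }
}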

I’ve also had this issue for years with my side project; I’m not sure why I didn’t post about it. I will follow this thread with interest.

I got this issue again on 01.04.2021 at 01:44.

Since creating this topic I have added three new endpoints for debugging.

Here is the output of mqServer.GetStatsDescription():

#MQ SERVER STATS:
 Listening On: 
mq:CreatePdfForSigning.priorityq,
mq:CreatePdfForSigning.inq,
mq:CreateSigningProcessPdf.priorityq,
mq:CreateSigningProcessPdf.inq,
mq:CreateCustomerMessageEvent.priorityq,
mq:CreateCustomerMessageEvent.inq,
mq:CreateCustomerEmailEvent.priorityq,
mq:CreateCustomerEmailEvent.inq,
mq:TriggerMultiSignMessageEvent.priorityq,
mq:TriggerMultiSignMessageEvent.inq,
mq:CreateMultiSignMessageEvent.priorityq,
mq:CreateMultiSignMessageEvent.inq,
mq:CreateMultiSignWebhookEvent.priorityq,
mq:CreateMultiSignWebhookEvent.inq,
mq:CreateMultiSignSlackEvent.priorityq,
mq:CreateMultiSignSlackEvent.inq,
mq:CreateMultiSignEmailEvent.priorityq,
mq:CreateMultiSignEmailEvent.inq,
mq:CreateMultiDistansAndTimeEvent.priorityq,
mq:CreateMultiDistansAndTimeEvent.inq,
mq:CreateMultiSignSignatureRequestEvent.priorityq,
mq:CreateMultiSignSignatureRequestEvent.inq,
mq:CreateCustomerSlackEvent.priorityq,
mq:CreateCustomerSlackEvent.inq,
mq:SendEmailEvent.priorityq,
mq:SendEmailEvent.inq,
mq:SendWebhookRequest.priorityq,
mq:SendWebhookRequest.inq,
mq:CreateCustomerDistansAndTimeEvent.priorityq,
mq:CreateCustomerDistansAndTimeEvent.inq,
mq:SequentialEvent.priorityq,
mq:SequentialEvent.inq,
mq:DeliveryReportRequest.priorityq,
mq:DeliveryReportRequest.inq,
mq:CreateUserContactLazy.priorityq,
mq:CreateUserContactLazy.inq,
mq:SlackEvent.priorityq,
mq:SlackEvent.inq,
mq:DeleteAttachmentReference.priorityq,
mq:DeleteAttachmentReference.inq 

===============
Current Status: Stopping
Times Started: 1
Num of Errors: 0
Num of Continuous Errors: 0
Last ErrorMsg:
===============
STATS for CreatePdfForSigning: 
TotalNormalMessagesReceived: 0 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 0 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 
---------------
STATS for CreatePdfForSigning: 
TotalNormalMessagesReceived: 52 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 52 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 04/30/2021 23:43:08 
---------------
STATS for CreateSigningProcessPdf: 
TotalNormalMessagesReceived: 0 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 0 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 
--------------- 
STATS for CreateSigningProcessPdf: 
TotalNormalMessagesReceived: 0 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 0 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 
--------------- 
STATS for CreateCustomerMessageEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
--------------- 
STATS for CreateCustomerMessageEvent: 
TotalNormalMessagesReceived: 266 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 266 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 05/01/2021 01:02:22 
 --------------- 
 STATS for CreateCustomerEmailEvent: 
TotalNormalMessagesReceived: 0 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 0 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 
--------------- 
STATS for CreateCustomerEmailEvent: 
TotalNormalMessagesReceived: 47 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 47 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 04/30/2021 23:45:20 
--------------- 
STATS for TriggerMultiSignMessageEvent: 
TotalNormalMessagesReceived: 0 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 0 
TotalRetries: 0 
TotalFailed: 0 
LastMessageProcessed: 
--------------- 
STATS for TriggerMultiSignMessageEvent: 
TotalNormalMessagesReceived: 72 
TotalPriorityMessagesReceived: 0 
TotalProcessed: 20 
TotalRetries: 27 
TotalFailed: 52 
LastMessageProcessed: 04/30/2021 23:38:09 
 --------------- 
 STATS for CreateMultiSignMessageEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiSignMessageEvent:
TotalNormalMessagesReceived: 131
TotalPriorityMessagesReceived: 0
TotalProcessed: 131
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 22:38:10 
---------------
STATS for CreateMultiSignWebhookEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiSignWebhookEvent:
TotalNormalMessagesReceived: 126
TotalPriorityMessagesReceived: 0
TotalProcessed: 126
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:38:09 
---------------
STATS for CreateMultiSignSlackEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiSignSlackEvent:
TotalNormalMessagesReceived: 122
TotalPriorityMessagesReceived: 0
TotalProcessed: 122
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:38:10 
---------------
STATS for CreateMultiSignEmailEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiSignEmailEvent:
TotalNormalMessagesReceived: 85
TotalPriorityMessagesReceived: 0
TotalProcessed: 85
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 22:38:10 
---------------
STATS for CreateMultiDistansAndTimeEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiDistansAndTimeEvent:
TotalNormalMessagesReceived: 98
TotalPriorityMessagesReceived: 0
TotalProcessed: 98
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:31:22 
---------------
STATS for CreateMultiSignSignatureRequestEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateMultiSignSignatureRequestEvent:
TotalNormalMessagesReceived: 32
TotalPriorityMessagesReceived: 0
TotalProcessed: 32
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:34:09 
---------------
STATS for CreateCustomerSlackEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateCustomerSlackEvent:
TotalNormalMessagesReceived: 239
TotalPriorityMessagesReceived: 0
TotalProcessed: 239
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 05/01/2021 01:02:22 
---------------
STATS for SendEmailEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for SendEmailEvent:
TotalNormalMessagesReceived: 189
TotalPriorityMessagesReceived: 0
TotalProcessed: 189
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:45:21 
---------------
STATS for SendWebhookRequest:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for SendWebhookRequest:
TotalNormalMessagesReceived: 36
TotalPriorityMessagesReceived: 0
TotalProcessed: 35
TotalRetries: 0
TotalFailed: 1
LastMessageProcessed: 04/30/2021 22:16:21 
---------------
STATS for CreateCustomerDistansAndTimeEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateCustomerDistansAndTimeEvent:
TotalNormalMessagesReceived: 49
TotalPriorityMessagesReceived: 0
TotalProcessed: 49
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:45:21 
---------------
STATS for SequentialEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for SequentialEvent:
TotalNormalMessagesReceived: 49
TotalPriorityMessagesReceived: 0
TotalProcessed: 49
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 17:07:37 
---------------
STATS for DeliveryReportRequest:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for DeliveryReportRequest:
TotalNormalMessagesReceived: 316
TotalPriorityMessagesReceived: 0
TotalProcessed: 316
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 22:04:38 
---------------
STATS for CreateUserContactLazy:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for CreateUserContactLazy:
TotalNormalMessagesReceived: 127
TotalPriorityMessagesReceived: 0
TotalProcessed: 127
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed: 04/30/2021 23:30:36 
---------------
STATS for SlackEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for SlackEvent:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for DeleteAttachmentReference:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------
STATS for DeleteAttachmentReference:
TotalNormalMessagesReceived: 0
TotalPriorityMessagesReceived: 0
TotalProcessed: 0
TotalRetries: 0
TotalFailed: 0
LastMessageProcessed:
---------------

I have another endpoint to restart the MQ as suggested by @mythz, but that did not have any effect. I have scaled my Azure App Service to two instances, so I’m not even sure how to trigger the restart on both of them. After I did a hard restart of the server it started working again, and around 30 CreateCustomerMessageEvent events, which happen during the night, showed up in Slack.

Thanks for the output. It appears the Redis MQ Server is stuck in the Stopping state, which could be the result of subscription.UnSubscribeFromAllChannels() blocking indefinitely, but I want to make sure.

I’ve added a CaptureEvents property on RedisMqServer to be able to capture the lifecycle events on the underlying Redis PubSubServer, which should help identify where it’s getting stuck. You can enable it with:

var mqServer = new RedisMqServer(redisFactory, retryCount:2) {
    CaptureEvents = true
};
container.Register<IMessageService>(c => mqServer);

Then when you get the GetStatsDescription() output, can you also print out the captured events, e.g:

var mqServer = (RedisMqServer)HostContext.TryResolve<IMessageService>();
var output = mqServer.GetStatsDescription()
           + "\n\nEVENTS:\n" + mqServer.GetCapturedEvents();

Can you then post the output again the next time it happens?

Also, can you see whether killing the client connection on the Redis Server with CLIENT KILL unblocks it and resolves the issue, either by sending the command to your Redis server with redis-cli:

$ redis-cli CLIENT KILL TYPE pubsub

Or using the Redis Client:

redis.KillClients(ofType:RedisClientType.PubSub);
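
For context, a sketch of where that call could go (assuming KillClients is exposed on the resolved client as in the line above):

// resolve a client from the registered clients manager and kill the pubsub connections
var redisManager = HostContext.Resolve<IRedisClientsManager>();
using (var redis = (RedisClient)redisManager.GetClient())
{
    redis.KillClients(ofType: RedisClientType.PubSub);
}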

You’ll need to download the latest v5.11.1 on MyGet for the new captured-events APIs.

I have migrated to Azure Service Bus, but I see other users have had this issue, so maybe they can try it. Thanks for your help.