Redis MQ messages issue

tiptopweb · November 6, 2023, 11:34pm

Hello
Using ServiceStack 6.11
Redis Message Queues on AWS
It works very well, with a lot of messages (thousands a day) going through multiple queues.
I keep the messages small and the queues stay small.
But from time to time (maybe once a week), it suddenly starts dropping messages into the .DLQ queues
and the reading threads get stuck.
Not many of them
When I restart the application, it works again after a little while (there are not many messages in the .DLQ, and I think other messages are being dropped silently).
I am getting something similar to this in all the error queues (I redacted the bearer token):

mq:CallNop.dlq :
{"id":"e56dfd67c22247d499f55ec5b3ee1fb9","createdDate":"\/Date(1699306321239)\/","priority":0,"retryAttempts":2,"options":1,"error":{"errorCode":"NullReferenceException","message":"Object reference not set to an instance of an object.","errors":[]},"body":{"__type":"Tiptopweb.MojoPortal.Shared.QueueModel.CallNop, Tiptopweb.MojoPortal.Shared.QueueModel","bearerToken":"xxxxxxxxxxxxxxxxxxxxxxxx","listParams":[{"task":"GetNopLastOrders","paramId":0,"paramStr":""}]}}

I am protecting the access via ApiKeyAuthProvider
I have a special user with the "bearerToken":"xxxxxxxxxxxxxxxxxxxxxxxx" (redacted)

Would you have an idea what it could be?

I cannot reproduce or debug it, as it does not happen often and it is running in production.
I am not sure how to handle this.

The Redis server is at 25% memory, with all patches applied. We have a replica (2 nodes). It is also not happening at the busiest time.
I have a feeling the Redis server itself is getting stuck. When I republish, it takes a bit of time for the queues to work again (if the reading threads were corrupted, I would expect it to start again as soon as I republish).

Are the Redis queues reliable in your experience?
I use them because I also use Redis to cache content, so it is convenient.

Do you think I should try moving to a different queue system such as Amazon SQS? (I am hosting on EC2 machines.)

    public class CallNop : IReturn<CallNopResponse>, IHasBearerToken
    {
        public string BearerToken { get; set; }
        public List<CallNopParam> ListParams { get; set; }
    }

    public class CallNopParam
    {
        public string Task { get; set; }
        public int ParamId { get; set; }
        public string ParamStr { get; set; }
    }

    public class CallNopResponse
    {
        public ResponseStatus ResponseStatus { get; set; }
    }

    public partial class MojoPortalBackendService : Service
    {
        [Authenticate(ApiKeyAuthProvider.Name)]
        [RequiredRole(Constants.AdminQueue)]
        public object Any(CallNop listMessages)
        {
            ...
        }
    }

layoric · November 7, 2023, 12:18am

These kinds of problems are hard to pin down since there are multiple infrastructure parts involved, but generally we don’t see these issues with Redis. Some other things to check for would be:

Is the Redis instance still available for other operations outside the MQ functionality from the same box?
Are you running in a cluster? If so, is it just one instance having the problem, or do multiple instances start having the same issue at the same time?
Are you seeing any Timeout issues in your logs with regard to the Redis instance?
How are you resolving the Redis instance? Eg, is DNS involved and how?
Are any other network related issues occurring on the box at the same time? eg if your app connects to a DB or external service, does that still work?

It is possible it is threading related as the Redis MQ can create quite a few background threads. Eg, from the comments:

/// <summary>
/// Creates a Redis MQ Server that processes each message on its own background thread.
/// i.e. if you register 3 handlers it will create 7 background threads:
///   - 1 listening to the Redis MQ Subscription, getting notified of each new message
///   - 3x1 Normal InQ for each message handler
///   - 3x1 PriorityQ for each message handler (Turn off with DisablePriorityQueues)
/// 
/// When RedisMqServer Starts it creates a background thread subscribed to the Redis MQ Topic that
/// listens for new incoming messages. It also starts 2 background threads for each message type:
///  - 1 for processing the services Priority Queue and 1 processing the services normal Inbox Queue.
/// 
/// Priority Queue's can be enabled on a message-per-message basis by specifying types in the 
/// OnlyEnablePriorityQueuesForTypes property. The DisableAllPriorityQueues property disables all Queues.
/// 
/// The Start/Stop methods are idempotent i.e. It's safe to call them repeatedly on multiple threads 
/// and the Redis MQ Server will only have Started or Stopped once.
/// </summary>

If you are using Request/Response filters, ensure they are thread-safe as well, as this might cause issues. You could try calling ForceRestartWorkerThreads via an HTTP service call when you see the issue, or interrogate how many workers there are to see if it is the number you expect.

Hope that helps!

tiptopweb · November 7, 2023, 12:32am

Thank you

Redis is accessed from within the AWS EC2 local network.
It does not seem to be a network issue, as the DB on the same local network is accessible.

I have a feeling the Redis server itself is getting stuck. When I republish the app with the threads, it takes a bit of time for the queues to work again (if the reading threads were corrupted, I would expect it to start again as soon as I republish).

Next time it happens, I will check whether I can access the Redis cache at the same time or whether it is also off. Good idea.

tiptopweb · November 7, 2023, 12:43am

But what would this error in the DLQ message mean?

"error":{"errorCode":"NullReferenceException","message":"Object reference not set to an instance of an object.","errors":[]}

I have 2 different EC2 machines: an app on one EC2 writes to the queue, and an app on the other EC2 reads from the queue (both apps are still responsive).

The DTO seems to be well formed.
Would the exception be in the reading code, from inside ServiceStack?
My reading threads have a global try/catch and are not throwing any exceptions.
Is there any way to track this exception down further?

"body":{"__type":"Tiptopweb.MojoPortal.Shared.QueueModel.CallNop, Tiptopweb.MojoPortal.Shared.QueueModel","bearerToken":"xxxxxxxxxxxxxxxxxxxxxxxx","listParams":[{"task":"GetNopLastOrders","paramId":0,"paramStr":""}]}}

layoric · November 7, 2023, 1:18am

It would be the reader, yes. As for where and in which code, I believe you would need to enable debug logging to see the stack trace. If you haven’t already, you can hook into ErrorHandler and log somewhere to get more info, which might shed some more light on the root cause.
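
As a rough sketch of that kind of hook, assuming the MQ server is built in your AppHost: the `MqDiagnostics` class, the `CreateServer` method and the log messages are made up for illustration, and the retry count of 2 is only a guess based on the `retryAttempts` value in the DLQ message. `ErrorHandler` and the `RegisterHandler` overload that takes an exception callback are the parts to check against your ServiceStack version.

```csharp
using System;
using ServiceStack;
using ServiceStack.Logging;
using ServiceStack.Messaging;
using ServiceStack.Messaging.Redis;
using ServiceStack.Redis;
using Tiptopweb.MojoPortal.Shared.QueueModel; // namespace of CallNop, per the DLQ "__type"

// Hypothetical helper, not part of ServiceStack or of the original app.
public static class MqDiagnostics
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(MqDiagnostics));

    public static RedisMqServer CreateServer(IRedisClientsManager redisManager, ServiceStackHost appHost)
    {
        // retryCount: 2 is an assumption, not a known setting of the app.
        var mqServer = new RedisMqServer(redisManager, retryCount: 2);

        // Errors raised by the MQ server itself (outside a message handler).
        mqServer.ErrorHandler = ex => Log.Error("Redis MQ Server error", ex);

        // Per-message callback: logs the full exception, including the stack
        // trace that the DLQ entry does not carry.
        mqServer.RegisterHandler<CallNop>(
            appHost.ExecuteMessage,
            (handler, msg, ex) => Log.Error("CallNop " + msg.Id + " failed: " + ex, ex));

        return mqServer;
    }
}
```

Logging `ex.ToString()` rather than only `ex.Message` is the point here, since the DLQ entry only shows the NullReferenceException message.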

Since both are responsive you could try using Redis Admin to interrogate the state of Redis, but also setup your own endpoints to look at/interact with RedisMqServer directly to see if there is a threading issue.

Do you mean the service processing the messages or a registered ErrorHandler on the RedisMqServer?

Run with Debug logging if you haven’t already to see if you can get a stacktrace.

Regarding using an alternate MQ, AWS SQS is pretty clunky but does work well enough. I’ve used it in fairly large throughput environments, but it does take some management. If you are running Redis on your local EC2 network, remember this doesn’t mean you won’t get network or IO related issues. A good example is overloading the host’s ulimit with incorrect usage of the HttpClient, ‘local’ network or otherwise, which will still cause issues. I’m not saying that is what is happening here, but something to keep in mind.
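
To illustrate the kind of HttpClient misuse meant here, a minimal sketch follows. The `NopApiClient` class and its method names are hypothetical and not taken from the app in question; only `HttpClient` itself is a real API.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical wrapper, for illustration only.
public static class NopApiClient
{
    // One shared instance for the lifetime of the process, so connections are pooled and reused.
    private static readonly HttpClient SharedClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    // Anti-pattern: a new HttpClient per call. Disposing it closes its connections,
    // and the closed sockets linger in TIME_WAIT; under sustained load this can
    // exhaust the available ports or file handles on the host.
    public static async Task<string> GetLeakyAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }

    // Preferred: reuse the shared instance for every request.
    public static Task<string> GetAsync(string url)
    {
        return SharedClient.GetStringAsync(url);
    }
}
```

One trade-off of a long-lived shared client is that it may not pick up DNS changes; on .NET Core and later, `IHttpClientFactory` or `SocketsHttpHandler.PooledConnectionLifetime` are the usual ways to address that.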

tiptopweb · November 7, 2023, 1:55am

Thank you
I think you have a point.
I will look at putting an ErrorHandler on RedisMqServer.
But I have just realised that a while ago I was ‘cleaning’ some old code to do with HttpClient.

tiptopweb · November 7, 2023, 3:13am

Yes, this is it, I have been an idiot, it is socket exhaustion.
Thank you for your help.

Regarding replacing Redis MQ with Amazon SQS:
I am also thinking that if I use Amazon SES to send emails, I can have the bounce and complaint notifications delivered to Amazon SQS, which would make them easy to process.