Enabling Redis Server Events causes high CPU usage

VAST · October 11, 2021, 9:25am

Hello,

When I follow RedisServer Events and add the following code to Startup I experience high CPU usage:

var redisHost = AppSettings.GetString("RedisHost");
if (redisHost != null)
{
    container.Register<IRedisClientsManager>(
        new RedisManagerPool(redisHost));

    container.Register<IServerEvents>(c => 
        new RedisServerEvents(c.Resolve<IRedisClientsManager>()));
    
    container.Resolve<IServerEvents>().Start();
}

This appears to happen every so often and both the 2 x IIS VMs (Windows 2019 Core) and Redis v5.0.3 VM (Linux CentOS 8) have 100% CPU for a while. I am using ServiceStack v5.12.1. The clients then hang when sending http client requests.

If I remove the code above to use InMemory SSE everything appears to be fine.

The Redis is standard config with the bindings changed so the Windows VM can connect to it. I also changed a couple of things due to warnings after installing Redis:

Disable transparent huge pages which helps with performance
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Change a limit to stop Redis warning of the tcp backlog setting of 511 cannot be enforced
/proc/sys/net/core/somaxconn
Change the 128 to 1024
/etc/sysctl.conf
vm.overcommit_memory=1
net.core.somaxconn=65535

Do you have any ideas what could be causing this? Is there anything that runs on a schedule when using Redis SSE that could trigger something? Is CentOS a sensible OS to run Redis on or would you recommend another version of Linux?

Thank you for your help.

VAST · October 11, 2021, 3:29pm

Not sure if this has anything to do with it but I just found that my Redis server shows 109,000 keys when I run redis-cli keys "*".

They all begin with urn:iauthsession:

I have 450 accounts and not all of them will be connected at the same time.

Thanks.

layoric · October 11, 2021, 11:24pm

Hi @VAST,

Have you used MONITOR to have a look at the commands or test what commands might be causing the CPU spike in Redis?

Just to confirm, is in your app server using 100% CPU or Redis itself? If so, have you confirmed it is the dotnet process of the apphost that is maxing the CPU.

Have you checked the network load between appserver and Redis? Or memory usage over short time spans of your app servers if you are pulling in large values?

Any debug logs from the Start call of the RedisServerEvents (and RedisPubSubServer)? All the above you’ve shared is standard startup code that shouldn’t cause any significant load.

Check the above if you haven’t already, as well as try profiling the code if possible locally (with a local or remote redis instance) to see if you can get more details.

VAST · October 12, 2021, 7:24am

Hi @layoric,

Thank you.

To confirm, it is Redis on the VM that runs at 100% CPU and also the Application Pool runs at 100% CPU on a separate VM. Yes on the IIS VM it is w3wp.exe running as the Application User.

I will double check but I am sure the memory usage and network usage wasn’t stressed at all.

In the meantime, is it normal to have 1000’s of keys that aren’t in use anymore? When I run GET they were created a while ago e.g. 10 days and also modified 10 days ago so they are no longer needed. Should old keys not get removed?

Thanks again.

layoric · October 12, 2021, 7:41am

Depends what they are. Having old keys hanging around on Redis does not incur any performance penalty if they don’t have an expiration. I’ve had instances need to clean up >500k keys on regular 15-30 min (not an efficient way to use Redis) and still perform pretty. Keys with an expiry set will not automatically get cleaned up. Expiry is usually set on creation but can also be set after the fact.

For the Redis VM, use the MONITOR command to confirm what commands are running (and confirm it is the Redis process taking up 100% CPU). Redis is single threaded so if the Redis VM have more than 1 core, it is unlikely to be related to Redis.

Another tip, is on the Redis docs you will see a “Time complexity” express in Big O notation that can help when looking at optimizing how you are using Redis.

As for your AppHost VMs, check if you have any RedisPubSub or IServerEvents events that might be running on Start. Should should be very obvious by profiling your application even running locally connecting to a local or remote Redis instance. If something is stuck in an infinite loop on startup, this should be visible from MONITOR in Redis and/or profiling locally.

If it is environment specific, MONITOR would be the best first step to isolate the root cause.

layoric · October 12, 2021, 8:17am

If you can produce a minimal reproduction of the problem (removing any sensitive info), and put it up on GitHub, be happy to have a look at it.

VAST · October 12, 2021, 8:20am

Hi @layoric,

The old keys are all sessions and the key begins with urn:iauthsession:

I have removed Redis as it was slowing things down but I will add it back in and use MONITOR to see what is happening. But since removing the Redis code both the Redis and IIS CPU is pretty much idle.

I’m not sure how the Big O notation helps me, all I am using is the code in my first post, I haven’t added any other Redis code other than ((RedisServerEvents)ServerEvents).UnRegisterExpiredSubscriptions(); when someone logs in but I am going to remove that now as it doesn’t seem to do anything.

As mentioned above, all I have added in regards to Redis is:

var redisHost = AppSettings.GetString("RedisHost");
if (redisHost != null)
{
    container.Register<IRedisClientsManager>(
        new RedisManagerPool(redisHost));

    container.Register<IServerEvents>(c => 
        new RedisServerEvents(c.Resolve<IRedisClientsManager>()));
    
    container.Resolve<IServerEvents>().Start();
}

I do however use service.ServerEvents.NotifyChannel within OnSubscribeAsync and OnUnsubscribeAsync to notify specific users when a subscription is subscribed or unsubscribed. As I only want certain clients to receive these I have also set NotifyChannelOfSubscriptions = false

Other than that my Startup doesn’t reference Redis or ServerEvents.

I will reintroduce the Redis server shortly and run MONITOR to see if that highlights anything.

Thanks again.

layoric · October 13, 2021, 11:51pm

Having another look at your previous post, with GetAllSubscriptionInfos, using MONITOR will show the related Redis commands to what your application is doing. You can then relate those commands to the Redis documentation with Time Complexity which should help you identify which commands are causing your high CPU usage on Redis.

I can only guess what it might be but since GetAllSubscruptionInfos calls ScanAllKeys, I’m pretty sure that will be your problem if you are still running that on every OnUnsubscribeAsync event.

Scanning all channels and returning all results every time someone connects or reconnects could lead to high load causing further problems. For example, it could result in client timeouts which might result in clients trying to reconnect leading to more load putting your system into a loop it can’t break without shedding lots of users.

Running UnRegisterExpiredSubscriptions when someone logs could also lead to additional scans.

These are all guesses that you will need to check with observations using logging and tools like MONITOR command for Redis. Once you’ve found the root cause, you’ll have a baseline to work with to iterate and test alternate solutions as you go. The more you can measure/observe, the easier that will be.

VAST · October 14, 2021, 1:45pm

Hi @layoric,

I have been running my application with Redis again for a couple of days and currently Redis is using 10% but I fear this is going to start climbing over time as it started off at 1% for a while.

When I run MONITOR at the moment I am seeing a constant flow of heartbeats from all the different machines (around 300 machines connected at the moment):

1634218103.277274 [0 *IP*] "PUBLISH" "sse:topic" "notify.subscription.gx1WFpa2YzMpXZb7XCRs cmd.onHeartbeat {\"isAuthenticated\":\"true\"......

I assume this is all ok?

Then I see the occasional:

1634218404.228631 [0 *IP*] "MGET"

which is followed by a lot of sse:id values, I assume this is where GetAllSubscruptionInfos is getting called.

When I check the IIS requests all I see is a list of:

/event-stream?channels=...

I assume this is also normal?

So do you think I should go back to looking into Redis Sorted Set to keep a list of what accounts are currently “online”?

I basically just need a client to be able to logon and download a list of other clients and display which are currently “online”. After that I need the client to be able to show them going on and offline. One of the checks I do on the unsubscribe is to prevent machines showing offline due to a late unsubscribe event where it has already reconnected / restarted.

Once again, your help is appreciated.

layoric · October 15, 2021, 12:46am

Again, best to test yourself and measure. MONITOR does have a performance impact and depends how often the MGET (which will get more costly as number of sessions increase) is getting called. If this gets called more often as number of clients/sessions increase then you are looking at exponential increase in load as your number of connected clients increases.

You can use grep and/or log MONITOR to a file so you can better look through but there should be use of SCAN as well (See code). Logging on your app server to get a rough idea of time it takes per GetAllSubscriptionInfos for N number of users, since it combines multiple commands together to return the result of that method, it will be harder to see from Redis MONITOR. I just don’t think it is a method you want to be running as a part of the unsubscribe/subscribe event hooks due to the growth in use as stated above. Eg, 1 method call runs many many Redis commands.

Correct, this is expected and will be very light load, but again, best to measure/load test to be sure based on your infrastructure/setup.

Possibly, you’re in the best position to know this but have a think about the rules/behavior of your system in relation to who gets notified of what. Rather than iterating all users to only inform some of a status change, have a think if there a way you could store data so that it is easy to work out if there is a pending status change that needs to be sent, another for “last known” perhaps.

Depending on your requirements, how often these status change events are sent out could be intentionally lagged (say 5 seconds) and you can store the events/changes as they happen but only resolve who needs to get updated on an interval. This turns the job of resolving who needs to get a status change event from one that depends on a user generated event (going online/offline), to one that you control, making your load more predictable at the cost of a fixed delay. Might not suit your requirements (or just improving storage/fetching of status info might be enough), which is why you are best placed to make those architectural choices and what impacts they have on your functionality vs infrastructure vs requirements.

VAST · October 19, 2021, 12:50pm

Hi @layoric,

Thank you for your reply.

It looks like the CPU gets higher and higher the more stale sessions are left on Redis. At first it is nice and quick but then when there are 100,000 stale keys it starts to go slow.

I have changed my SSE client to use JWT auth and set the JWT server AuthFeature to

UseTokenCookie = true,
UseRefreshTokenCookie = true

When the client shuts down I can see the following being called at the server:

https://localhost:xxxxx/event-unregister?id=xxxxxxxxxxxx

I can also see the UnSubscribeAsync being called at the server which is all great.

But, when I list the keys at Redis it is leaving:

"sse:id:xxxxxxxxxxxxx"

I left it for an hour and it is still there. I thought that once the heartbeat timed out this would be removed?

Thank you.

layoric · October 19, 2021, 11:48pm

Are you overriding the Local.NotifyLeaveAsync event? The default implementation of this is performing the RemoveSubscriptionFromRedis via the HandleOnLeaveAsync. If you are not, could you confirm that is being fired?

These sse:id:* are not unused keys since they are subscriptions that get notified when published, eg they need to go through Redis PubSub. Appologies, I should have been clearer before about “old keys hanging around”. Having additional unused keys in Redis shouldn’t impact performance.

The sse:id:* keys are different than the original urn:iauthsession: you originally mentioned, and I assumed in the MONITOR commands you were just seeing normal active subscriptions rather than ones that were hanging around.

Having subscription keys expire by default would mean clients would need to re-register their subscriptions periodically which is why these need to be managed. By default, they are managed for you by the default HandleOnLeaveAsync private method in RedisServerEvents but the recent change to expose the ability to override Local means this is currently up to the implementation of the override to then manage it. If you are overriding any of the default Local MemoryServerEvents in RedisServerEvents, have a look at their default implementation since there might be functionality you still need.

If you can provide a minimal reproduction of your use of RedisServerEvents, this would likely make it a lot clearer to us trying to help possible issues. Trying to infer from various snippets, as I’m sure you can understand, is much more difficult than just running something locally to test out.

VAST · October 20, 2021, 8:48am

Hi @layoric,

In ref to my previous post, the JWT tokens have nothing to do with what I am experiencing (sse keys being left behind), it happens with both JWT and CredentialsAuthProvider.

I am not overriding the Local.NotifyLeaveAsync event. I have tried to override it to confirm it is being fired. I changed my code to this, is it correct?

var redisHost = AppSettings.GetString("RedisHost");

if (redisHost != null)
{
     container.Register<IRedisClientsManager>(new RedisManagerPool(redisHost));
     container.Register<IServerEvents>(c => new RedisServerEvents(c.Resolve<IRedisClientsManager>()));

     ((RedisServerEvents)container.Resolve<IServerEvents>()).Local.NotifyLeaveAsync = async (sub) =>
     {
          logger.Info("NotifyLeaveAsync Should Trigger Here...");
     };

     container.Resolve<IServerEvents>().Start();
}

Doing this then starting a subscription and closing again did not trigger the NotifyLeaveAsync event. Do you think this is why it is leaving the sse:id:* after the subscription has unsubscribed?

I don’t need the keys to expire by default, I just want them to be removed when the subscription unsubscribes.

Yes I fully understand that it would be easier to have a project that can reproduce the problem. I will see what I can do.

Thanks again.

VAST · October 20, 2021, 11:36am

Hello @layoric,

I have created a project for you to replicate the issue.

You can download it from here: RedisSSETest

I just used the web template and added a C# client to the project. There is very little to it, no authentication etc.

The redis server in the appsettings.json file is 192.168.77.54:6379 so you will need to change that.

To reproduce the issue, start the project then start an instance of RedisSSETest-Client. After the RedisSSETest-Client has started you can see that running redis-cli keys "*" will show something silimar to:

“sse:seq:anonUser”

“sse:session:3GotzS9k5q6QNU1m3vaw”

“sse:userid:-1”

“sse:channel:channel1”

“sse:ids”

“sse:id:O5x7vgWnMyUVNxnyzxo0”

The client will automatically disconnect in 10 seconds. After 10 seconds running redis-cli keys "*" will show the same 6 entries as above.

I hope that helps.

Thanks again.

layoric · October 21, 2021, 12:11am

Hi @VAST,

Thanks for the example project. I expanded on it a bit here to show the ServerEventsFeature plugin is cleaning up expired connections periodically so that `“sse:ids:*” keys don’t build up over time. Project is on GitHub here, if you want to expand on it compared to your application that isn’t cleaning up those sessions, that should help find the root cause of the problem you are experiencing.

Your client side might be generating a lot of sessions accidentally as well if you only have 450 users. Once the server is shut down/stopped, those keys will persist but on startup + in general operation, expired sessions should be getting cleaned up as this project is showing. Hope that helps.

VAST · October 21, 2021, 3:49pm

Hi @layoric,

Are you saying that the project I uploaded did clear up expired sessions for you? As I left it for 1 - 2 hours after the client had disconnected and it didn’t for me. I notice you added an IdleTimeout, what is the default please?

Thank you.

layoric · October 21, 2021, 9:53pm

No it didn’t, but through the normal operation of SSE requests, these sessions are periodically checked and cleaned up. Please run the modified project I linked to show that these sessions are automatically cleaned up given the setup you provided in your example project.

Redis doesn’t automatically remove these keys, they will persist indefinitely, but “expiredKeys” will be are detected using the specified IdleTimeout periodically during requests which triggers clients that dropped or that used the /event-unregister. They will indeed hang around for a short time before being cleaned up, but this will be within a small number of events from the IdleTimeout.

Also, when the server restarts, these expired sessions are cleared from Redis but if this number is growing as clients reconnect to your system, something else is going on that isn’t represented in the example project.

I recommend running the modified client in that and looking at the order of events and output to see those sessions getting cleaned up. If you get a significantly different output (like ending up with 50+ client sessions still in Redis after the client app has completed), then that would be a problem and will need to replicate what you are getting to find the root cause.

VAST · October 22, 2021, 3:57pm

Ok @layoric, I will take a look and get back to you.

Thanks again.

VAST · November 8, 2021, 12:38pm

Hi @layoric,

Sorry for the delay, I have been away.

I have managed to reproduce the issue with the project you uploaded. I have uploaded it here: Redis SSE Modified (Sorry I forgot to put the Redis IP back so you will need to change it back to localhost.)

I basically commented out the second SSE client in your code:

//Console.WriteLine("Generate persisted client and send 30 messages");
//var sseClient2 = new ServerEventsClient("https://localhost:44328/", new string[] { "channel1" });
//sseClient2.OnMessage = message =>
//{
//    Console.WriteLine($"Message received: {message.Data}");
//UpdateRedisSessionCount(redisManager, "Redis sessions");
//};
//sseClient2.Start();

After that, the output was:

Initial Redis sessions: 0
Generating client sessions…
After generated sessions: 50
Before messages Redis sessions: 50
Final Redis session count: 50

Then when I run redis-cli keys "*" on the Redis server there are 153 keys.

Thank you once again.

layoric · November 9, 2021, 1:43am

Hi @VAST,

By commenting out the second client connecting and listening, there are no connected clients. The first 50 clients are disposed and call unregister. Without any additional clients connected to publish messages to, the original subscriptions in redis won’t automatically be cleaned up via the maintenance method being periodically called, DoAsyncTasks process in MemoryServerEvents of the RedisServerEvents Local.

It might be worth you sharing how your clients are being used? In this test setup there are no other clients handling/receiving any messages which doesn’t represent how a SSE server would be used operationally. It is possible your clients are unregistering + registering faster/more often then your configured IdleTimeout which would cause subscription numbers to rise over time, and cause growing work for your GetAllSubscriptionInfos which would eventually become predominantly populated by disconnected subscriptions.