Enabling Redis Server Events causes high CPU usage

Hello,

When I follow the Redis Server Events docs and add the following code to my Startup, I experience high CPU usage:

var redisHost = AppSettings.GetString("RedisHost");
if (redisHost != null)
{
    container.Register<IRedisClientsManager>(
        new RedisManagerPool(redisHost));

    container.Register<IServerEvents>(c => 
        new RedisServerEvents(c.Resolve<IRedisClientsManager>()));
    
    container.Resolve<IServerEvents>().Start();
}

This appears to happen every so often, and both the 2 x IIS VMs (Windows Server 2019 Core) and the Redis v5.0.3 VM (CentOS 8) sit at 100% CPU for a while. I am using ServiceStack v5.12.1. The clients then hang when sending HTTP requests.

If I remove the code above and fall back to the in-memory SSE provider, everything appears to be fine.

Redis is running a standard config with the bind address changed so the Windows VMs can connect to it. I also changed a couple of things due to warnings after installing Redis:

  • Disable transparent huge pages, which helps with performance:
    echo never > /sys/kernel/mm/transparent_hugepage/enabled

  • Raise the kernel limit so Redis stops warning that the TCP backlog setting of 511 cannot be enforced
    /proc/sys/net/core/somaxconn
    Change the 128 to 1024

  • /etc/sysctl.conf
    vm.overcommit_memory=1
    net.core.somaxconn=65535

Do you have any ideas what could be causing this? Is there anything that runs on a schedule when using Redis SSE that could trigger it? And is CentOS a sensible OS to run Redis on, or would you recommend another Linux distribution?

Thank you for your help.

Not sure if this has anything to do with it, but I just found that my Redis server shows 109,000 keys when I run redis-cli KEYS "*".

They all begin with urn:iauthsession:

I have 450 accounts and not all of them will be connected at the same time.

Thanks.

Hi @VAST,

Have you used MONITOR to have a look at the commands being run, to test which commands might be causing the CPU spike in Redis?

Just to confirm, is it your app server using 100% CPU or Redis itself? If it is the app server, have you confirmed it is the dotnet process of the AppHost that is maxing out the CPU?

Have you checked the network load between the app server and Redis? Or the memory usage of your app servers over short time spans, in case you are pulling in large values?

Are there any debug logs from the Start call of RedisServerEvents (and RedisPubSubServer)? Everything you've shared above is standard startup code that shouldn't cause any significant load.
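
If logging isn't already wired up, a minimal way to capture those debug messages is to plug in a debug-enabled log factory before the AppHost is initialized. A rough sketch only (swap ConsoleLogFactory for whatever logging provider you actually use under IIS):

// Sketch only: surface ServiceStack debug logging so the Start calls of
// RedisServerEvents / RedisPubSubServer show up in your logs.
using ServiceStack.Logging;

LogManager.LogFactory = new ConsoleLogFactory(debugEnabled: true);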

Check the above if you haven't already, and also try profiling the code locally if possible (against a local or remote Redis instance) to see if you can get more details.

Hi @layoric,

Thank you.

To confirm, it is Redis on its VM that runs at 100% CPU, and the Application Pool also runs at 100% CPU on a separate VM. Yes, on the IIS VM it is w3wp.exe running as the Application User.

I will double check, but I am sure the memory and network usage weren't stressed at all.

In the meantime, is it normal to have thousands of keys that aren't in use anymore? When I GET them, they show they were created a while ago, e.g. 10 days, and were also last modified 10 days ago, so they are no longer needed. Should old keys not get removed?

Thanks again.

Depends what they are. Having old keys hanging around in Redis does not incur any real performance penalty beyond the memory they use. I've had instances that needed to clean up >500k keys on a regular 15-30 minute schedule (not an efficient way to use Redis) and they still performed pretty well. Keys without an expiry set will not automatically get cleaned up; keys with an expiry are removed by Redis once they expire. Expiry is usually set on creation but can also be set after the fact.
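
For example, with the ServiceStack.Redis client you can add a TTL to a key after the fact. A quick sketch (the key and the 14-day TTL are just illustrative values):

// Sketch: give an existing session key a TTL so Redis removes it itself once it expires.
using (var redis = redisManager.GetClient())
{
    var key = "urn:iauthsession:" + sessionId; // illustrative key
    redis.ExpireEntryIn(key, TimeSpan.FromDays(14));
}

For new sessions, it is also worth checking whether setting a session expiry in your AuthFeature/session configuration gives them a TTL at creation time, so they clean themselves up.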

For the Redis VM, use the MONITOR command to confirm what commands are running (and confirm it is the Redis process taking up 100% CPU). Redis is single threaded, so if the Redis VM has more than one core, a single Redis process on its own is unlikely to max out the whole VM.

Another tip: the Redis docs list each command's "Time complexity" in Big O notation, which can help when you're looking at optimizing how you use Redis.
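
For example, KEYS is O(N) and blocks the server while it runs, whereas SCAN walks the keyspace incrementally. A sketch using the ServiceStack.Redis client (method names worth double-checking against your version):

using (var redis = redisManager.GetClient())
{
    // SearchKeys issues KEYS - avoid on a keyspace with 100k+ entries:
    // var keys = redis.SearchKeys("urn:iauthsession:*");

    // ScanAllKeys issues SCAN in pages and is much friendlier to a busy server:
    var keys = redis.ScanAllKeys("urn:iauthsession:*").ToList();
}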

As for your AppHost VMs, check if you have any RedisPubSub or IServerEvents handlers that might be running on Start. This should be very obvious from profiling your application, even running locally against a local or remote Redis instance. If something is stuck in an infinite loop on startup, it should be visible from MONITOR in Redis and/or from profiling locally.

If it is environment specific, MONITOR would be the best first step to isolate the root cause.

If you can produce a minimal reproduction of the problem (removing any sensitive info) and put it up on GitHub, I'd be happy to have a look at it.

Hi @layoric,

The old keys are all sessions and the key begins with urn:iauthsession:

I have removed Redis as it was slowing things down, but I will add it back in and use MONITOR to see what is happening. Since removing the Redis code, both the Redis and IIS CPU have been pretty much idle.

I'm not sure how the Big O notation helps me; all I am using is the code in my first post. I haven't added any other Redis code apart from ((RedisServerEvents)ServerEvents).UnRegisterExpiredSubscriptions(); when someone logs in, but I am going to remove that now as it doesn't seem to do anything.

As mentioned above, all I have added in regards to Redis is:

var redisHost = AppSettings.GetString("RedisHost");
if (redisHost != null)
{
    container.Register<IRedisClientsManager>(
        new RedisManagerPool(redisHost));

    container.Register<IServerEvents>(c => 
        new RedisServerEvents(c.Resolve<IRedisClientsManager>()));
    
    container.Resolve<IServerEvents>().Start();
}

I do however use service.ServerEvents.NotifyChannel within OnSubscribeAsync and OnUnsubscribeAsync to notify specific users when a subscription is subscribed or unsubscribed. As I only want certain clients to receive these notifications, I have also set NotifyChannelOfSubscriptions = false.
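
Roughly, the wiring looks like this (a simplified sketch; the "status" channel and message shape are placeholders rather than my actual code):

Plugins.Add(new ServerEventsFeature
{
    NotifyChannelOfSubscriptions = false, // only send my own notifications
    OnSubscribeAsync = sub =>
    {
        HostContext.TryResolve<IServerEvents>()
            ?.NotifyChannel("status", new { sub.UserId, Online = true });
        return Task.CompletedTask;
    },
    OnUnsubscribeAsync = sub =>
    {
        HostContext.TryResolve<IServerEvents>()
            ?.NotifyChannel("status", new { sub.UserId, Online = false });
        return Task.CompletedTask;
    },
});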

Other than that my Startup doesn’t reference Redis or ServerEvents.

I will reintroduce the Redis server shortly and run MONITOR to see if that highlights anything.

Thanks again.

Having another look at your previous post regarding GetAllSubscriptionInfos: using MONITOR will show the Redis commands behind what your application is doing. You can then relate those commands to the Time Complexity notes in the Redis documentation, which should help you identify which commands are causing the high CPU usage on Redis.

I can only guess what it might be, but since GetAllSubscriptionInfos calls ScanAllKeys, I'm pretty sure that will be your problem if you are still running it on every OnUnsubscribeAsync event.

Scanning all channels and returning all results every time someone connects or reconnects can generate high load, which causes further problems. For example, it could result in client timeouts, which in turn make clients try to reconnect, creating more load and putting your system into a loop it can't break out of without shedding lots of users.
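
If all you need at that point is the subscriptions for the user who just disconnected (or for a single channel), a narrower lookup avoids scanning everything. A sketch, assuming your ServiceStack version exposes these IServerEvents methods (check the interface for your version):

// Sketch only: query just what you need instead of GetAllSubscriptionInfos().
// "sub" is the IEventSubscription from the unsubscribe hook; "status" is a placeholder channel.
var serverEvents = HostContext.TryResolve<IServerEvents>();

// Subscriptions for the single user that just unsubscribed:
var userSubs = serverEvents.GetSubscriptionInfosByUserId(sub.UserId);

// Or details for one specific channel rather than every channel:
var channelSubs = serverEvents.GetSubscriptionsDetails("status");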

Running UnRegisterExpiredSubscriptions when someone logs in could also lead to additional scans.

These are all guesses that you will need to check against observations, using logging and tools like Redis's MONITOR command. Once you've found the root cause, you'll have a baseline to work from as you iterate and test alternative solutions. The more you can measure/observe, the easier that will be.

Hi @layoric,

I have been running my application with Redis again for a couple of days and Redis is currently using 10% CPU, but I fear this is going to keep climbing over time, as it started off at 1% for a while.

When I run MONITOR, I am seeing a constant flow of heartbeats from all the different machines (around 300 connected at the moment):

1634218103.277274 [0 *IP*] "PUBLISH" "sse:topic" "notify.subscription.gx1WFpa2YzMpXZb7XCRs cmd.onHeartbeat {\"isAuthenticated\":\"true\"......

I assume this is all ok?

Then I see the occasional:

1634218404.228631 [0 *IP*] "MGET"

which is followed by a long list of sse:id values; I assume this is where GetAllSubscriptionInfos is getting called.

When I check the IIS requests all I see is a list of:

/event-stream?channels=...

I assume this is also normal?

So do you think I should go back to looking into using a Redis Sorted Set to keep a list of which accounts are currently "online"?

I basically just need a client to be able to log on, download a list of the other clients, and display which of them are currently "online". After that the client needs to show them going on and offline. One of the checks I do on unsubscribe is to prevent machines showing as offline due to a late unsubscribe event arriving after the machine has already reconnected / restarted.

Once again, your help is appreciated.

Again, it's best to test and measure yourself. MONITOR does have a performance impact, and a lot depends on how often the MGET (which will get more costly as the number of sessions increases) is being called. If it gets called more often as the number of clients/sessions grows, then you are looking at a superlinear (roughly quadratic) increase in load as your number of connected clients increases.

You can use grep and/or log MONITOR to a file so you can look through it more easily; you should see use of SCAN as well (see the RedisServerEvents code). Also add logging on your app server to get a rough idea of the time each GetAllSubscriptionInfos call takes for N users, since it combines multiple Redis commands to produce its result and is therefore harder to see as a whole from Redis MONITOR. I just don't think it is a method you want running as part of the subscribe/unsubscribe event hooks, due to the growth in cost described above, i.e. one method call runs many, many Redis commands.
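
Something as simple as this gives you that rough per-call timing (a sketch; serverEvents and log are whatever you already have in scope):

// Sketch: time each GetAllSubscriptionInfos call to see how it grows with user count.
var sw = System.Diagnostics.Stopwatch.StartNew();
var infos = serverEvents.GetAllSubscriptionInfos();
sw.Stop();
log.InfoFormat("GetAllSubscriptionInfos returned {0} subscriptions in {1}ms", infos.Count, sw.ElapsedMilliseconds);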

Correct, this is expected and will be a very light load, but again, best to measure/load test to be sure based on your infrastructure/setup.

Possibly; you're in the best position to know, but have a think about the rules/behaviour of your system in terms of who gets notified of what. Rather than iterating over all users just to inform some of them of a status change, think about whether you could store the data so that it is easy to work out whether there is a pending status change that needs to be sent, perhaps with another structure for the "last known" status.
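
As one illustration only (the key name and time window are made up), a sorted set scored by last-seen time makes "who is online right now" a cheap question to answer:

// Sketch: presence tracked in a sorted set scored by last-seen Unix time.
using (var redis = redisManager.GetClient())
{
    var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();

    // On subscribe / heartbeat: upsert the user's last-seen time.
    redis.AddItemToSortedSet("online:users", userId, now);

    // Periodically (or before reads): drop anyone not seen in the last 90 seconds.
    redis.RemoveRangeFromSortedSetByScore("online:users", 0, now - 90);

    // "Currently online" is then a single inexpensive read.
    var onlineUserIds = redis.GetAllItemsFromSortedSet("online:users");
}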

Depending on your requirements, how often these status change events are sent out could be intentionally lagged (say 5 seconds): you store the events/changes as they happen, but only resolve who needs to be updated on an interval. This turns the job of resolving who needs to receive a status change event from one driven by user-generated events (going online/offline) into one that you control, making your load more predictable at the cost of a fixed delay. It might not suit your requirements (or simply improving how you store/fetch the status info might be enough), which is why you are best placed to make those architectural choices and weigh their impact on functionality vs infrastructure vs requirements.