Application appears to be deadlocked after Sentinel failover

Hi!
I am experiencing what appears to be a deadlock in my application during a certain type of Sentinel failover, when the master and sentinel instances on the same node are shut down simultaneously. I have written a simple prototype application that can reliably reproduce the problem.
The prototype works in two modes; let’s call them redis-interaction mode and redis-kill mode. redis-interaction mode emulates relatively intensive interaction with Redis (get/set operations and redisClient.Info are continuously executed in separate threads). redis-kill mode searches for the current master in a Sentinel setup and shuts it down (both the master and its sentinel), then restarts it after a delay. That part uses redis-cli and plink to send the restart commands, so ServiceStack libraries are not involved. While redis-kill is running, redis-interaction eventually hangs on a lock (after several master restarts).

My Sentinel setup runs on three Debian machines, one for the master and two for slaves. I can also reproduce the problem on a Windows Sentinel setup using https://github.com/ServiceStack/redis-config.

I can send you all additional information if needed (including a dump file and a prototype).

Redis version: 4.0.8
ServiceStack version: 5.0.2 and 5.1.0

It appears that the RedisPubSubServer.KillBgThreadIfExists thread (the uppermost thread in the screenshot) tries to abort the RedisSentinel.GetSentinelInfo thread (second from the left in the screenshot). KillBgThreadIfExists is already inside a lock (validSentinelWorker) block. At the same time, the GetSentinelInfo thread is trying to acquire that same lock from inside a finally block (in the RedisPubSubServer.RunLoop method). I think that may be why this.bgThread.Abort inside the KillBgThreadIfExists thread hangs.
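
To illustrate the suspected pattern, here is a minimal sketch. It is not ServiceStack code: the names gate, Worker and AbortDeadlockSketch are hypothetical, and it assumes .NET Framework, where Thread.Abort is supported. One thread holds a lock and aborts another thread that is waiting for the same lock inside a finally block. Since an abort is not delivered while the target is inside a finally block, the Abort call may never return.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of the suspected deadlock; not taken from ServiceStack.
// Assumes .NET Framework: on .NET Core / .NET 5+ Thread.Abort throws
// PlatformNotSupportedException instead of aborting the thread.
static class AbortDeadlockSketch
{
    static readonly object gate = new object();   // stands in for the shared lock
    static readonly ManualResetEventSlim inFinally = new ManualResetEventSlim();

    static void Worker()
    {
        try
        {
            // Normal background work would go here.
        }
        finally
        {
            inFinally.Set();
            // The abort is deferred while this thread is inside a finally block,
            // and the thread cannot leave the block until it gets the lock.
            lock (gate)
            {
                // Cleanup that needs the shared lock.
            }
        }
    }

    static void Main()
    {
        var worker = new Thread(Worker) { IsBackground = true };

        var killer = new Thread(() =>
        {
            lock (gate)               // the killer already holds the lock...
            {
                worker.Start();
                inFinally.Wait();     // wait until the worker is inside its finally block
                worker.Abort();       // ...and may block here waiting for the worker
            }
        }) { IsBackground = true };

        killer.Start();

        // Watchdog: report whether the Abort call came back within two seconds.
        bool returned = killer.Join(TimeSpan.FromSeconds(2));
        Console.WriteLine(returned
            ? "Abort returned"
            : "Abort still blocked after 2s (suspected deadlock)");
    }
}
```

Both threads are background threads, so the process can still exit if they stay blocked.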

I also posted a comment on a relevant issue on GitHub.

Can you post the prototype that repros the issue on GitHub, please?

Hi, mythz!

I posted it on GitHub as an attachment along with instructions, if that’s OK. I tried to keep the amount of code in the prototype small to make a minimal working example. Here’s how it works:

  1. redis-interaction mode gets a Redis client and sends an info request once every second.
  2. redis-kill mode searches for the active master, kills it along with the corresponding sentinel, waits a specified amount of time, then starts the master and sentinel again. It works only on a Unix setup and uses plink to restart instances, so you may need to adjust that. By default, redis-kill mode runs for 5 minutes and waits for a minute before restarting the sentinel.
  3. It is possible to reproduce the issue by hand (without redis-kill mode) on a Windows Sentinel setup, although I find this process rather painful, because sometimes the issue manifests itself only after several restarts.

Unfortunately, you will have to change the source code and recompile the app for it to work, because the sentinel addresses are currently hard-coded. redis-kill mode will also work only after compilation.

Here are the steps to prepare and run the prototype:

  1. Open the solution in an IDE
  2. Restore NuGet packages
  3. Specify the list of sentinel addresses in Program.cs
  4. Specify a Unix password in Program.cs if you are going to run redis-kill mode
  5. Compile the application
  6. Run the application; you should see the results of the info command every second
  7. Run redis_buster.bat, located in the app root directory
  8. After some time, the application should hang and stop printing messages. A screenshot of what the application looks like when it hangs is in the attachment

I don’t see redis_buster.bat anywhere, can you please include a copy/link to it somewhere, thanks.

Also please keep your License Key confidential (i.e. remove it from any public sources). Note: I’ve deleted your GitHub comment which contained the attachment with your License Key.

Thanks for the heads-up about the license info, I will re-upload the prototype without this information.

As for the bat file, it really contains only one line of code, so you can easily recreate it:

start /b %~dp0\bin\Debug\RedisPrototype.exe 300 60 redis-buster

It should be created one level above the bin directory.

OK, thanks for the prototype, this issue should hopefully be resolved with this commit.

This change is available from v5.1.1 that’s now available on MyGet.

Can you try with the latest v5.1.1 and let me know if it resolves the issue, thanks.

Hi, mythz!

Thanks for the quick fix, the issue appears to be resolved. However, the application now periodically throws a new exception, which I had not seen before 5.1.1. Maybe it’s not a big deal, as it does not affect the way the prototype works. The exception details are below. Also, can you please provide an ETA for the 5.1.1 release? Will you also update the issue on GitHub?

    ERROR: [09:40:30.779] Unable to Connect: sPort: 51820, Error: Cannot block a call on this socket while an earlier asynchronous call is in progress.
   at System.Net.Sockets.Socket.ValidateBlockingMode()
   at System.Net.Sockets.Socket.Send(IList`1 buffers, SocketFlags socketFlags, SocketError& errorCode)
   at System.Net.Sockets.Socket.Send(IList`1 buffers, SocketFlags socketFlags)
   at System.Net.Sockets.Socket.Send(IList`1 buffers)
   at ServiceStack.Redis.RedisNativeClient.FlushSendBuffer()
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead)
ERROR: Error when trying to Quit(), Exception: [09:40:30.779] Unable to Connect: sPort: 51820, Error: Cannot block a call on this socket while an earlier asynchronous call is in progress.
   at System.Net.Sockets.Socket.ValidateBlockingMode()
   at System.Net.Sockets.Socket.Send(IList`1 buffers, SocketFlags socketFlags, SocketError& errorCode)
   at System.Net.Sockets.Socket.Send(IList`1 buffers, SocketFlags socketFlags)
   at System.Net.Sockets.Socket.Send(IList`1 buffers)
   at ServiceStack.Redis.RedisNativeClient.FlushSendBuffer()
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead)
ERROR: [09:40:30.802] Unable to Connect: sPort: 51820, Error: Object reference not set to an instance of an object.
   at ServiceStack.Redis.RedisNativeClient.Connect()
   at ServiceStack.Redis.RedisNativeClient.Reconnect()
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded()
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead)

v5.1.1 is already available on MyGet. There’s no ETA for NuGet releases.

How can I repro this issue? Your app isn’t sharing Redis client instances across multiple threads, right? The Exception suggests the same RedisClient instance is being used in concurrent requests. Each Thread should resolve its own RedisClient instance and dispose of it after use, e.g.:

using (var redis = redisManager.GetClient())
{
    // Use redis client...
}
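
As a fuller illustration of that pattern, here is a minimal sketch in which several threads share one client manager but each resolves and disposes its own client. The sentinel addresses, the master name "mymaster", the key names and the thread count are placeholders, not values from the prototype.

```csharp
using System;
using System.Threading;
using ServiceStack.Redis;

// Sketch only: hosts, master name, keys and thread count are assumed placeholders.
class PerThreadClientSketch
{
    static void Main()
    {
        var sentinelHosts = new[] { "sentinel1:26379", "sentinel2:26379", "sentinel3:26379" };
        var sentinel = new RedisSentinel(sentinelHosts, "mymaster");

        // The manager is shared between threads; individual clients are not.
        IRedisClientsManager redisManager = sentinel.Start();

        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            int id = i; // copy the loop variable so each lambda captures its own value
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < 10; n++)
                {
                    try
                    {
                        // Resolve a client for this unit of work and dispose it afterwards.
                        using (var redis = redisManager.GetClient())
                        {
                            redis.SetValue("sketch:" + id, n.ToString());
                            Console.WriteLine("thread {0}: {1}", id, redis.GetValue("sketch:" + id));
                        }
                    }
                    catch (Exception ex)
                    {
                        // Errors are expected while a failover is in progress.
                        Console.WriteLine("thread {0}: {1}", id, ex.Message);
                    }
                    Thread.Sleep(1000);
                }
            });
            threads[i].Start();
        }

        foreach (var t in threads) t.Join();
    }
}
```

The point is that no client instance is stored in a field or passed between threads; only the manager is.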

Hi, mythz!

Sorry for the delay. I was not able to reliably reproduce this error. If it occurs again, I will post it as a separate issue. Thanks for the quick fix for the deadlock!
