Master promotion not recognized by ServiceStack.Redis

tbednarz · June 12, 2019, 9:43am

I run a replica set with three nodes and a quorum of two. I am testing failovers and they do not work with ServiceStack.
The machines I run are:

dbinfra1 (master) IP 172.16.63.51
dbinfra2 (slave) IP 172.16.63.52
dbinfra3 (slave) IP 172.16.63.53
Then I shutdown server dbinfra1. The sentinel logs show, that the master was switched correctly:

18940:X 12 Jun 2019 09:56:32.693 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
18940:X 12 Jun 2019 09:56:32.693 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=18940, just started
18940:X 12 Jun 2019 09:56:32.693 # Configuration loaded
18940:X 12 Jun 2019 09:56:32.693 * supervised by systemd, will signal readiness
18940:X 12 Jun 2019 09:56:32.699 * Running mode=sentinel, port=26387.
18940:X 12 Jun 2019 09:56:32.705 # Sentinel ID is bf4bde3c8a398a84e61b6a2414a98b46b9dd94b1
18940:X 12 Jun 2019 09:56:32.705 # +monitor master redis_6387_config 172.16.63.51 6387 quorum 2
18940:X 12 Jun 2019 09:56:32.707 * +slave slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 09:56:32.715 * +slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 09:56:34.571 * +sentinel sentinel d69802fb258357b3c56b29d0f79f35c6f78eb85c 172.16.63.51 26387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 09:56:34.790 * +sentinel sentinel 1aa056b49df89874c3b01319e48fa0c5726eee8d 172.16.63.53 26387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.468 # +sdown sentinel d69802fb258357b3c56b29d0f79f35c6f78eb85c 172.16.63.51 26387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.731 # +sdown master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.815 # +odown master redis_6387_config 172.16.63.51 6387 #quorum 2/2
18940:X 12 Jun 2019 10:14:30.816 # +new-epoch 1
18940:X 12 Jun 2019 10:14:30.816 # +try-failover master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.825 # +vote-for-leader bf4bde3c8a398a84e61b6a2414a98b46b9dd94b1 1
18940:X 12 Jun 2019 10:14:30.840 # 1aa056b49df89874c3b01319e48fa0c5726eee8d voted for bf4bde3c8a398a84e61b6a2414a98b46b9dd94b1 1
18940:X 12 Jun 2019 10:14:30.909 # +elected-leader master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.909 # +failover-state-select-slave master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.992 # +selected-slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:30.992 * +failover-state-send-slaveof-noone slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:31.064 * +failover-state-wait-promotion slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:31.845 # +promoted-slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:31.845 # +failover-state-reconf-slaves master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:31.911 * +slave-reconf-sent slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:32.944 * +slave-reconf-inprog slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:32.944 * +slave-reconf-done slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:33.034 # -odown master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:33.034 # +failover-end master redis_6387_config 172.16.63.51 6387
18940:X 12 Jun 2019 10:14:33.034 # +switch-master redis_6387_config 172.16.63.51 6387 172.16.63.52 6387

As you can see at 10:14:30.468 it realizes that the master is down and started election and promotion of a new master. This process ended 2.5 seconds later by finished promoting 172.16.63.52 to the new Master.

Unfortunately servicestack.redis does not recognise this for whatever reasons. It tries the old address 172.16.63.51 FOREVER and never tries another server of the array. Here are a few log extracts which illustrate the problem:

2019-06-12 10:14:30.606 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 10] n/a n/a (n/a) #18 Could not connect to redis Instance at dbinfra1.tbhome.int:26385
2019-06-12 10:14:30.607 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) #4 Could not connect to redis Instance at dbinfra1.tbhome.int:26387
2019-06-12 10:14:30.611 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) #4 SocketException in SendReceive, retrying...
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (111): Connection refused 172.16.63.51:26387
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port)
   at ServiceStack.Redis.RedisNativeClient.Connect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 100
   at ServiceStack.Redis.RedisNativeClient.Reconnect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 321
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 302
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 574
2019-06-12 10:14:30.613 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 10] n/a n/a (n/a) #18 SocketException in SendReceive, retrying...
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (111): Connection refused 172.16.63.51:26385
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port)
   at ServiceStack.Redis.RedisNativeClient.Connect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 100
   at ServiceStack.Redis.RedisNativeClient.Reconnect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 321
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 302
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 574
2019-06-12 10:14:30.613 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 7] n/a n/a (n/a) #12 Could not connect to redis Instance at dbinfra1.tbhome.int:26386
2019-06-12 10:14:30.655 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 7] n/a n/a (n/a) #12 SocketException in SendReceive, retrying...
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (111): Connection refused 172.16.63.51:26386
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port)
   at ServiceStack.Redis.RedisNativeClient.Connect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 100
   at ServiceStack.Redis.RedisNativeClient.Reconnect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 321
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 302
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 574

Do not be confused about the different ports, I run six Redis databases on the same machine, each with a different IP Port and each with its own senitel wich runs at port + 20000.

My server (which is a redis client) runs as a docker container. Even if I restart that container, it does NOT switch and simply seems to try the first server supplied as an array of replicaset members. Why does it not try the other replicaset members?

Here is some of my code, that configures the replicaset in my C# code:

// In my docker file I pass some of the parameters for the server program. A sample for one Redis database looks like this
//"--rediscacheservers", "dbinfra1.tbhome.int:dbinfra2.tbhome.int:dbinfra3.tbhome.int", \
//"--rediscachedbreplicasetname", "redis_6385_cache", \
//"--rediscacheport", "6385", \

// When bootstrapping my server I do

if (redisCfgServer.Contains(':')) // we have a replica set
{
	var replSetName = AppSettings.Get<string>("RedisCfgDbReplicaSetName");
	var redisConfigServerArray = redisCfgServer.Split(':', StringSplitOptions.RemoveEmptyEntries);
	for (var i = 0; i < redisConfigServerArray.Length; i++)
	{
		var server = redisConfigServerArray[i];
		server = $"{server}:2{redisCfgDbPort}"; //sentinel ports are by default Redis port + 20000
		redisConfigServerArray[i] = server;
	}

	var sentinelHosts = redisConfigServerArray;
	try
	{
		var sentinel = new RedisSentinel(sentinelHosts, masterName: replSetName);
		sentinel.HostFilter = host => $$"{host}?Db={BbConsumerConstants.RedisProdDb}&RetryTimeout=5000" +
									 $$"&password={HttpUtility.UrlEncode(pwdRedisConfig)}";
		IRedisClientsManager redisManager = sentinel.Start();
		container.Register(c => new SysConfigRepository(redisManager));
	}
	catch (Exception e)
	{
	      // some logging	
                  throw;
	}
}
else // standalone setup
...

Any idea what is going wrong here??

tbednarz · June 12, 2019, 10:54am

I was just looking into the Source of RedisSentinel and found a hardcoded value of 10 Minutes to refresh hosts. Does that mean that it does not detect missing servers or promoted servers during a period of 10 Minutes?

Digging again a bit deeper into the problem, I have implemented the following event handlers:

	var sentinel = new RedisSentinel(sentinelHosts, masterName: replSetName)
	{
		OnFailover = manager => 
		{
			Logger.Debug("Redis Config Server Managers were Failed Over to new hosts");
		},
		OnWorkerError = ex =>
		{
			Logger.Debug($"Redis Config Server worker error: {ex.GetAllExceptions()}");
		},
		OnSentinelMessageReceived = (channel, msg) =>
		{
			Logger.Debug($"Redis Config Server: Received message '{msg}' on channel '{channel}' from Sentinel.");
		}, 
	};

I was expecting to get one of these fired if the master server goes down. But my log does not contain any entries except the socket errors shown above.

The constructor of RedisSentinel receives the following data:

Sentinel hosts: ["dbinfra1.tbhome.int:26387","dbinfra2.tbhome.int:26387","dbinfra3.tbhome.int:26387"], Redis Cfg Server ReplicaSet name: redis_6387_config

The server is up and running and listening:

[14:28:07 root@dbinfra3 /var/log/redis]$ systemctl status sentinel_26387_config.service 
\u25cf sentinel_26387_config.service - Redis Sentinel
   Loaded: loaded (/etc/systemd/system/sentinel_26387_config.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-12 09:56:32 CEST; 4h 32min ago
  Process: 15809 ExecStop=/usr/libexec/redis-shutdown sentinel_26387_config (code=exited, status=0/SUCCESS)
 Main PID: 18757 (redis-sentinel)
   CGroup: /system.slice/sentinel_26387_config.service
           \u2514\u250018757 /usr/bin/redis-sentinel *:26387 [sentinel]

No idea what is going wrong…

mythz · June 12, 2019, 12:30pm

That’s not hard coded, that’s the default value, which you can change.

This means that it refreshes the list of available Redis sentinels every 10 minutes, it doesn’t mean it does not detect missing redis instances. The RedisSentinel subscribes to a Redis Sentinel instance which sends it events after it detects when redis instances are subjectively/objectively down or after it’s performed a failover after which the RedisSentinel will failover to use the latest master/slave instances as known to the connected Redis Sentinel instance.

You can add a callback for when RedisSentinel receives these events by adding a callback OnSentinelMessageReceived. When using Redis Sentinel it only fails over when it’s been instructed to by the Redis Sentinel instance, you can also have it failover to the latest master/slaves as known by the connected sentinel by calling sentinel.ResetClients().

You can also bypass waiting for the Redis Sentinels to reach consensus for detecting when a master is down and manually force a failover yourself by calling sentinel.ForceMasterFailover() which calls the SENTINEL failover command.

tbednarz · June 12, 2019, 3:01pm

Hi Demis,

In the meantime I figured out, that the SysOps in the DataCenter did not open the communication between my database servers and the docker host which runs my containers on the firewall of the docker host. After they corrected that, I can see my log entries I added to the event handlers. I got two events fired:

OnSentinelMessageReceived. It says Received message 'slave 172.16.63.51:6387 172.16.63.51 6387 @ redis_6387_config 172.16.63.52 6387' on channel '+reboot' from Sentinel which I do not know what I should learn from that. I got three of them within 20ms
OnFailover(). I got two of them within 10 ms

I’m not sure if I understand you right:

Are you saying that the Linux boxes running the redis servers and sentinels perform the switch of the master and I then need to catch an event (e.g. OnFailover) on the client? In the handler of that event I have to kind of ‘reconfigure’ the client?

When I look at the ForceMasterFailover() method, it requires the master group name. So if I call that in the event handler of OnFailover, will it then change the master and try to connect to the new master server?

mythz · June 12, 2019, 3:30pm

The Sentinel Instances are normally who monitors the redis master/slaves and performs any fails failovers after they reach consensus that the master instance is unavailable, during which time they’ll send events to the RedisSentinel client who will failover the Redis Client Managers to use the latest active master/slaves.

The ForceMasterFailover() is if you want to not wait for the sentinels to reach consensus and perform an immediate SENTINEL failover telling the sentinel instance to force a failover, the ForceMasterFailover() API doesn’t need the <master name> because it already knows it, which is configured in the RedisSentinel constructor which defaults to mymaster by default.

tbednarz · June 13, 2019, 9:35am

Hi Demis

Sorry I am still struggling with this. I understand how these things work on LINUX. I just learned that the settings in the sentinel.conf files seem to be critical and it depends on these values, if I receive the events in my service or not.

What I do NOT understand is the following:
(please remember I have dbinfra1 (172.16.63.51), dbinfra2 (172.16.63.52) and dbinfra3 (172.16.63.53) CENTOS VMs which run the Redis ReplicaSet. dbinfra1 is the MASTER. **I stopped the master and redis decided to promote dbinfra3 as new master) This is the starting point where all the mess begins!

So lets show the details step-by-step:

I SSH to the dbinfra3 server and start the redis-cli and issue the command info replication. It returns:

127.0.0.1:6387> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=172.16.63.52,port=6387,state=online,offset=721238,lag=0
master_replid:87d100c193086907184dc0d35740919e12144d81
master_replid2:2528c29c5a58530615842f706b21158efef67ebe
master_repl_offset:721238
second_repl_offset:81834
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:721238
127.0.0.1:6387>

This shows that its ROLE is MASTER and it has ONE SLAVE: 172.16.63.52 (dbinfra2)

Now in my C# code I construct the RedisSentinel as follows:

try
{
	var sentinel = new RedisSentinel(sentinelHosts, masterName: replSetName)
	{
		OnFailover = manager => 
		{
			Logger.Debug($$"ONFAILOVER event received from {replSetName} ...");
		},
		OnWorkerError = ex =>
		{
			Logger.Debug($$"ONWORKERERROR event received from {replSetName}. Error: {ex.GetAllExceptions()}");
		},
		OnSentinelMessageReceived = (channel, msg) =>
		{
			Logger.Debug($$"ONSENTINELMESSAGERECEIVED event received from {replSetName}. Message: '{msg}' " +
						 $$"Channel '{channel}'...");
		}, 
	};
	sentinel.HostFilter = host => $$"{host}?Db={BbConsumerConstants.RedisProdDb}&RetryTimeout=2000" +
								  $$"&password={HttpUtility.UrlEncode(pwdRedisConfig)}";
	sentinel.RefreshSentinelHostsAfter = TimeSpan.FromSeconds(60);
	IRedisClientsManager redisManager = sentinel.Start();
	container.Register(c => new SysConfigRepository(redisManager));
}
catch (Exception e)
{
	Logger.Error($$"Failed to create, start and register the Redis config database Sentinel server. Error: {e.GetAllExceptions()}");
	throw;
}

The data of sentinelHosts is "dbinfra3.tbhome.int:26387","dbinfra2.tbhome.int:26387","dbinfra1.tbhome.int:26387" in exactly this order. The value of replSetName is redis_6387_config.

The BAD THING starts with sentinel.Start():
It seems to get an exception but I cannot catch it! I only see the following in the log:

2019-06-13 10:38:54.257 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 1]   #3 Could not connect to redis Instance at dbinfra1.tbhome.int:26387
2019-06-13 10:38:56.607 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 1]   #3 SocketException in SendReceive, retrying...
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (113): No route to host 172.16.63.51:26387
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)

WHY DOES IT STILL TRY DBIFRA1 (172.16.63.51)??? That is the server I have STOPPED and as you can see above from the redis shell, the server DBINFRA3 with IP 172.16.63.53 is the new master.

So where does your library get the information from to connect to 172.16.63.51???

I tried to track this down a bit in your source:

I found GetActiveSentinelHosts in the Start() method.

There you loop through the machine names I passed to the constructor. For each machine you create a SentinelWorker. What happens if you use the server which is down?

Then you call GetSentinelHosts. It would be interesting to see what it returns, only the two currently active hosts or all members defined for the ReplicaSet?

Finally you seem to start this worker thread (GetValidSentinelWorker) which loops through the machines. Don’t know if this is ever reached in my case…

I can only state (see my log output above) that it only tries the server which is stopped and it seems to NEVER try the other two machines.

Any help would be greatly appreciated!

tbednarz · June 13, 2019, 12:17pm

=========== TESTING UPDATE ================
I at least figured out, what values are required in the sentinel.conf files to get the event handlers fired:

redis_sentinel_down_after:  5000 
redis_sentinel_failover_timeout: 10000

The first parameter is the time until the sentinel marks the state with +sdown. The default here is 180000 ms which is way too long!
The second parameter is the time we allow to perform the failover.

In my application I get the messages BUT NO FAILOVER is performed:

2019-06-13 13:49:08.199 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONFAILOVER event received from redis_6385_cache ...
2019-06-13 13:49:08.203 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6385_cache. Message: 'sentinel da2032ff95dbcd9197645c9dd49ee642285fd16d 172.16.63.53 26385 @ redis_6385_cache 172.16.63.52 6385' Channel '+sdown'...
2019-06-13 13:49:08.244 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONFAILOVER event received from redis_6386_admin ...
2019-06-13 13:49:08.247 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6386_admin. Message: 'sentinel f324537123b87edc38152ef0d37979704f906953 172.16.63.53 26386 @ redis_6386_admin 172.16.63.52 6386' Channel '+sdown'...
2019-06-13 13:49:08.320 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONFAILOVER event received from redis_6386_admin ...
2019-06-13 13:49:08.322 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6386_admin. Message: 'slave 172.16.63.53:6386 172.16.63.53 6386 @ redis_6386_admin 172.16.63.52 6386' Channel '+sdown'...
2019-06-13 13:49:08.381 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONFAILOVER event received from redis_6385_cache ...
2019-06-13 13:49:08.383 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6385_cache. Message: 'slave 172.16.63.53:6385 172.16.63.53 6385 @ redis_6385_cache 172.16.63.52 6385' Channel '+sdown'...
2019-06-13 13:49:40.106 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) #8 Could not connect to redis Instance at 172.16.63.53:6387
2019-06-13 13:49:40.114 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) #8 SocketException in SendReceive, retrying...
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (113): No route to host 172.16.63.53:6387
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port)
   at ServiceStack.Redis.RedisNativeClient.Connect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 95
   at ServiceStack.Redis.RedisNativeClient.Reconnect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 321
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 302
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 574
2019-06-13 13:49:40.196 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) #8 Exceeded timeout of 00:00:02
2019-06-13 13:49:40.197 +02:00 [ERR] [ServiceStack.Redis.RedisNativeClient] [ThreadId: 6] n/a n/a (n/a) Error when trying to Quit()
ServiceStack.Redis.RedisException: Exceeded timeout of 00:00:02 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: No route to host 172.16.63.53:6387
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port)
   at ServiceStack.Redis.RedisNativeClient.Connect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 95
   at ServiceStack.Redis.RedisNativeClient.Reconnect() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 321
   at ServiceStack.Redis.RedisNativeClient.TryConnectIfNeeded() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 302
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 574

So this means my server are not HA!

When I restart the former primary (172.16.63.53 in the above test case), I see the following in my application log:

2019-06-13 13:59:43.342 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6386_admin. Message: 'slave 172.16.63.53:6386 172.16.63.53 6386 @ redis_6386_admin 172.16.63.52 6386' Channel '+reboot'...
2019-06-13 13:59:43.373 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONFAILOVER event received from redis_6387_config ...
2019-06-13 13:59:43.374 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: 'sentinel f5231d532d66789170cdc388d78cd90528fb2a9b 172.16.63.53 26387 @ redis_6387_config 172.16.63.52 6387' Channel '-sdown'...
2019-06-13 13:59:43.380 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6385_cache. Message: 'slave 172.16.63.53:6385 172.16.63.53 6385 @ redis_6385_cache 172.16.63.52 6385' Channel '+reboot'...
2019-06-13 13:59:43.432 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONFAILOVER event received from redis_6386_admin ...
2019-06-13 13:59:43.465 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONFAILOVER event received from redis_6385_cache ...
2019-06-13 13:59:43.471 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6385_cache. Message: 'slave 172.16.63.53:6385 172.16.63.53 6385 @ redis_6385_cache 172.16.63.52 6385' Channel '-sdown'...
2019-06-13 13:59:43.478 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6386_admin. Message: 'slave 172.16.63.53:6386 172.16.63.53 6386 @ redis_6386_admin 172.16.63.52 6386' Channel '-sdown'...
2019-06-13 13:59:43.487 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONFAILOVER event received from redis_6385_cache ...
2019-06-13 13:59:43.489 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 10] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6385_cache. Message: 'sentinel da2032ff95dbcd9197645c9dd49ee642285fd16d 172.16.63.53 26385 @ redis_6385_cache 172.16.63.52 6385' Channel '-sdown'...
2019-06-13 13:59:43.490 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONFAILOVER event received from redis_6386_admin ...
2019-06-13 13:59:43.492 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 7] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6386_admin. Message: 'sentinel f324537123b87edc38152ef0d37979704f906953 172.16.63.53 26386 @ redis_6386_admin 172.16.63.52 6386' Channel '-sdown'...
2019-06-13 13:59:44.353 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: 'slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.52 6387 new reported role is master' Channel '-role-change'...
2019-06-13 13:59:44.439 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONFAILOVER event received from redis_6387_config ...
2019-06-13 13:59:44.440 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: 'slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.52 6387' Channel '-sdown'...
2019-06-13 13:59:54.370 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: 'slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.52 6387 new reported role is slave' Channel '+role-change'...

From that moment on, the application is working again. But for this, I do NOT need a ReplicaSet!

So my conclusion so far is: RedisSentinel gets the events that a server is down and even that a new Primary was promoted. But the connection is not switched and therefore the applications throws exception until the stopped server is started again.

How can I fix this?

mythz · June 13, 2019, 8:11pm

Why is your Retry Timeout only 2 seconds? (The default is now 10 seconds) Are you using the latest v5.5 of ServiceStack.Redis? Please include all custom configuration you’re using.

You can see in the implementation that if it receives an event that calls ResetClients() which will failover the connection using the latest info from the Redis Sentinel. Are all your redis clients being resolved from the Redis Sentinel RedisManager? i.e. you’re not newing up any RedisClient instances yourself?

Can you log RedisStats.ToDictionary().Dump() after the events are triggered so you can display the events that have happened.

In our testing it does failover as expected after processing the sentinel events but it can take some time for the whole process of the Redis Sentinels determining the master is down, performing the failover, sending the events and having RedisSentinel fail over to the new clients. The redis-config project provides an easy environment you can quickly launch the popular configuration of 3x sentinels, 1x master, 2x slaves so you can see this in action. Here’s a simple RedisSentinelFailoverTests.cs you can run that makes a master/slave request every second where after killing the master you’ll eventually see the clients failing over to new the newly promoted slave.

I’ve added additional logging in this commit which will log which instances the client manager fails over to, now available from the latest v5.5.1 on MyGet.

tbednarz · June 14, 2019, 4:46am

Hi Demis,
Thank you VERY much for your support! I will look into all of that today…

Yes, all my servers use the latest Version

I was playing around with all those settings, mainly on the LINUX side. I will increase it. I like to avoid, that users get errors just because the backend is performing stuff, that the end user cannot understand…

For the rest of you answer, I will give you feedback as soon as I have some results.

tbednarz · June 14, 2019, 10:16am

Ok, I have downloaded the 5.5.1 release from your MyGet repo and added some more logging to my code. Since I do not like to paste long log files here, I put them in a public GitHub repo. I will reference the files there in the following text.

Short explanation on my test-setup:

I develop everything on CentOS or Fedora, NO WINDOWS AT ALL on the server side. I use JetBrains Rider and .NETCore 2.1 (at the moment)
All database servers are running on CentOS / RHEL headless servers (development / production)
My dev environment consists of a number of VMs all in the same network. In Prod we are spread over two locations, but to avoid latency issues, we allow ReplicaSet failovers only in ONE datacenter. The other location just gets replicated data to implement geo-redundancy of the data.
My Redis RS currently consist of 4 CentOS boxes, one of them has set replica_priority 0 which means it will NEVER be promoted to MASTER (see point #3), the other machines have replica_priority 100.
I run six database instances, each on a different port on every machine. Each database instance has also one sentinel assigned. In a bash this looks as follows:

6497 /usr/bin/redis-server *:6380                                         
6531 /usr/bin/redis-server *:6381                                          
6564 /usr/bin/redis-server *:6382                                         
6597 /usr/bin/redis-server *:6385                                         
6629 /usr/bin/redis-server *:6386                                         
6662 /usr/bin/redis-server *:6387                                          
6701 /usr/bin/redis-sentinel *:26380 [sentinel]                                 
6733 /usr/bin/redis-sentinel *:26381 [sentinel]                                  
6765 /usr/bin/redis-sentinel *:26382 [sentinel]                                 
6797 /usr/bin/redis-sentinel *:26385 [sentinel]                                 
6829 /usr/bin/redis-sentinel *:26386 [sentinel]                                 
6861 /usr/bin/redis-sentinel *:26387 [sentinel]

The program I use to perform my tests uses three of these databases with the ports 6385/26385, 6386/26386, 6387/26387

OK, I guess you can understand the log files better now.

Biggest problem I have in all of my tests: I only get the events OnFailover and OnSentinelMessageReceived RANDOMLY! The reason for that is most likely the cause why the client does not perform the failover. I currently do NOT YET understand, how all these configuration settings for timeouts depend on each other. I can only say, that behaviour changes a lot depending on these settings!

So on GitHub you find three log files:
My application log It is an extract only, removed sensitive data
Redis Log extract for the test for database instance on Port 6387
Sentinel Log extract for the test for sentinel instance on Port 26387

When starting the test, the machine with IP 172.16.63.51 was the master. I shutdown this machine. The logs I pushed are all from the same SLAVE machine. The logs show, that the failover worked on the server side on Linux, but my program did not receive events nor did it react on the failover. Here the details:

At 10:10:17.127 the Redis instance reports that it has lost connection to the master
At 10:10:22.178 the Sentinel noticed that the master is down.
At 10:10:17.289 just a bit more than 100ms later it started with the voting process to elect a new master.
This process takes a very long time… almost one minute!
At 10:11:06.354 Sentinel reports that the master is switched
At 10:11:05.330 the Redis server starts a replication and a few ms later it connects to the new MASTER.

Now lets see what happens in my application:
At 10:08:11.278 I start the docker container with my application. After RedisSentinel.Start() I dump the stats for the first time. (I have a total of three connections to three different redis databases)
At10:10:22.297 the application receives the first OnSentinelMessageReceived event.
At 10:10:54.572 I receive the first exception because a connection to the stopped redis server is not possible anymore.
At 10:11:54.914 I get the last OnSentinelMessageReceived event. I have the impression, the handler misses a number of events…. Also I NEVER got an OnFailover event…
At 10:16:34.890 I stopped the container with my application.

I have also added some C# code extract from my App.Configure() implementation where you can see how I init all the Redis stuff and register it with the IOC container.

If you have any idea how I can optimize all those many timeout settings or if you see any reason, why we do not get all events, please let me know. If you need to add more debugging code to your lib and want me to execute it, just let me know. I do whatever is required to get this issue resolved!

Many thanks again for you help!

mythz · June 14, 2019, 7:56pm

That definitely sounds like the cause of the issue, but it’s the first time I’ve heard of a Redis Pub/Sub subscription missing or not receiving events which has been a solid communication channel. It’s using the same managed Redis Pub/Sub implementation that Redis MQ and Redis Server Events use.

The cause of the missing sentinel events would be the next area of focus, so I would look at seeing if the Redis Sentinel instance is physically sending them to the RedisSentinel client using a packet sniffer like WireShark. It’s possible there could be some network configuration or firewall interfering with the connection or maybe the Sentinel instances are just not sending them for some reason, e.g. misconfiguration or since you’re running in Docker be aware of the Docker/NAT issues with Sentinels.

I’m unclear of your Redis configuration though, you can only have 1 master but seem to be running 6x redis instances? Is that because of your GEO redundancy so you’re only running 1x Master, 2x Slaves, 3x Sentinels per geo location? or does the 2nd DC only have 3x replicas? Are the sentinel configuration isolated so it’s only monitoring the redis instances in each DC?

Are you trying to implement GEO redundancy yourself or are you using a solution like Redis Labs Active-Active GEO Distribution or Azure Redis Cache geo distribution?

tbednarz · June 15, 2019, 6:47am

Hi Demis,

My infrastructure:

Only my servers (that is the Redis clients) are running on Docker. The Redis servers do not. So there is no NAT issue. Also this time I think it is NOT a firewall issue, firewalls are binary: they block or they do not, but I have never seen firewalls which block randomly… So since I do get some events from various channels, it must be something else.

My configuration:

There is a lot of recommendations in the internet (and books I have as well) NOT to run multiple databases in ONE Redis instance even if it is possible. Instead you should run one Redis instance per database and assign different ports to them. Thats what I am doing. Theoretically you can have up to 16 databases per Redis instance, they are numbered 0 - 15 and you can configure thatn in redis.conf files. This blog post explains some aspects why it is NOT recommended. So as you can see in my previous shell dump, I am running six Redis instances (one for every database I have) on every server. Each instance has also a Sentinel instance which is monitoring the database. This setup is spread over four CentOS VMs which form together the ReplicaSet. Since we replicate data to another location to have only data geo-redundant (not the applications) we use for both Redis and MongoDB a ReplicaSet configuration that makes sure, that machines on the redundant site NEVER can become MASTER. In fact, they just keep an off-site database copy.

Redis PUB/SUB

I am not familiar with Redis PUB/SUB so I can’t say anything about it. I also don’t know, WHAT exactly a channel is and HOW you subscribe to them. But it looks like ALL MESSAGES in the Sentinel Log should be received by your handler. So if you compare the sentinel log with my application log, you can see what I get und what I miss. Please note, I am just looking at messages sent by the Redis instance running on Port 6387

-- SENTINEL LOG
6767:X 14 Jun 2019 10:10:22.282 # +new-epoch 8
6767:X 14 Jun 2019 10:10:22.289 # +vote-for-leader 75d6da2fe2b67695af9e015c8fc39766a02e7a41 8
6767:X 14 Jun 2019 10:10:23.252 # +odown master redis_6387_config 172.16.63.51 6387 #quorum 2/2
6767:X 14 Jun 2019 10:10:23.252 # Next failover delay: I will not start a failover before Fri Jun 14 10:10:42 2019
6767:X 14 Jun 2019 10:10:42.528 # +new-epoch 9
6767:X 14 Jun 2019 10:10:42.533 # +vote-for-leader 75d6da2fe2b67695af9e015c8fc39766a02e7a41 9
6767:X 14 Jun 2019 10:10:42.555 # Next failover delay: I will not start a failover before Fri Jun 14 10:11:02 2019
6767:X 14 Jun 2019 10:11:03.010 # +new-epoch 10
6767:X 14 Jun 2019 10:11:03.010 # +try-failover master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.016 # +vote-for-leader 547b4d726635832d950a4af6dae166889c71c0f1 10
6767:X 14 Jun 2019 10:11:03.026 # 75d6da2fe2b67695af9e015c8fc39766a02e7a41 voted for 547b4d726635832d950a4af6dae166889c71c0f1 10
6767:X 14 Jun 2019 10:11:03.028 # cb7a3251ac116a0381b70622a9433ee72bc86d34 voted for 547b4d726635832d950a4af6dae166889c71c0f1 10
6767:X 14 Jun 2019 10:11:03.072 # +elected-leader master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.072 # +failover-state-select-slave master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.155 # +selected-slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.156 * +failover-state-send-slaveof-noone slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.247 * +failover-state-wait-promotion slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.443 # +promoted-slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.443 # +failover-state-reconf-slaves master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:03.510 * +slave-reconf-sent slave 172.16.63.57:6387 172.16.63.57 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:04.147 # -odown master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:04.203 * +slave-reconf-inprog slave 172.16.63.57:6387 172.16.63.57 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:05.268 * +slave-reconf-done slave 172.16.63.57:6387 172.16.63.57 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:05.329 * +slave-reconf-sent slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:06.301 * +slave-reconf-inprog slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:06.301 * +slave-reconf-done slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:06.354 # +failover-end master redis_6387_config 172.16.63.51 6387
6767:X 14 Jun 2019 10:11:06.354 # +switch-master redis_6387_config 172.16.63.51 6387 172.16.63.52 6387
6767:X 14 Jun 2019 10:11:06.354 * +slave slave 172.16.63.57:6387 172.16.63.57 6387 @ redis_6387_config 172.16.63.52 6387
6767:X 14 Jun 2019 10:11:06.354 * +slave slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.52 6387
6767:X 14 Jun 2019 10:11:06.354 * +slave slave 172.16.63.51:6387 172.16.63.51 6387 @ redis_6387_config 172.16.63.52 6387
6767:X 14 Jun 2019 10:11:11.408 # +sdown slave 172.16.63.51:6387 172.16.63.51 6387 @ redis_6387_config 172.16.63.52 6387


-- APP LOG
2019-06-14 10:10:22.297 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: '8' Channel '+new-epoch'...
2019-06-14 10:10:22.300 +02:00 [DBG] [BizBusLicenseServer.AppHost] [ThreadId: 6] n/a n/a (n/a) ONSENTINELMESSAGERECEIVED event received from redis_6387_config. Message: '75d6da2fe2b67695af9e015c8fc39766a02e7a41 8' Channel '+vote-for-leader'...

THAT IS THE ONLY MESSAGE I GET FROM PORT 6385, THERE ARE 9 MESSAGES I GET FROM THE INSTANCE WITH PORT 6385, EVERYTHING ELSE IS MISSING!

Just speculation here: If you subscribe to messages sent by a specific machine, e.g. the MASTER and this machine goes down - how you get your subscribed messages? From which machine, port ?

redis-cli

The redis-cli tool is VERY POWERFUL. If you have a chance, connect to a redis database instance with this tool and type monitor. You will see everything that is entering or leaving that Redis instance. It shows tons of PINGs which seem to be kind of a heart-beat sent to all RS-members etc.

tbednarz · June 15, 2019, 7:48am

Followup regarding PUB/SUB

As already mentioned I am not familiar with it, I am running RabbitMQ for messaging. However I found interesting information, starting with this stackoverflow posting (see the first answer) and the link there. The interesting bit is:

When you execute PUBLISH it will publish the message to the given channel immediately and return the number of clients which received it. In what way does a failure scenario bring this to relevance? Consider the case of failover. The client will have to reconnect as the TCP connection goes away. If talking to a proxy it will do a direct reconnect and get a different server. If doing Sentinel discovery it will take a bit longer and still connect to a different server. Where the issue lies is in timing. If the publisher connects to the new server before any subscribers do, the message is gone – lost. But can that happen? To understand that we look to the SUBSCRIBE commands.
When your code issues a subscribe command the server registers in it’s in-memory data structures that that specific connection should get messages on the channel(s) subscribed to. This information is not propagated to slaves. Thus when a failure happens and the client reconnects it must then re-issue the subscribe command. Otherwise the new server does not know that particular connection needs to receive certain messages. Now we have the clear potential for a subscriber to reconnect to the new master after the publisher. The conditions for loss of messages now exist. So how to minimize or prevent this? There are a few options.

Maybe this explains why I do not get messages and why the entire systems behaves so different depending on the timing configuration parameters I set in the configuration files…

mythz · June 15, 2019, 5:34pm

That’s not true, It’s not about blocking randomly, they can allow outgoing initiated TCP connections but not incoming ones, drop different TCP handshake packets, etc. A misconfigured firewall can cause issues like lost packets. Since the issue is the Redis Clients not receiving events that should be the area of focus.

So you’re running 6 different master databases, does that mean you have 6 different master groups? How many sentinels are listening to each master group?

Does that mean that each of the 6 masters have 3x replicas each?

Right, but you’re saying the client doesn’t receive them which is what’s needed to investigate, the log doesn’t tell you want messages were delivered to clients, that’s what a packet sniffer like Wire Shark will tell you, if the events aren’t physically delivered to the client, there’s no way the client can receive them, that’s what you need to check.

The subscription is to the Redis Sentinel Instance, if the Redis Sentinel Instance goes down, it connects to a different Sentinel Instance in the master group.

I know about redis-cli and the MONITOR command, it logs the redis commands processed by the Redis Server, but it’s not proof and does not tell you which commands were sent to which connected clients, that’s what a network packet sniffer will let you see, you want to investigate the missing events so we need to verify that the events are being physically delivered to the redis client or not. If the events are not delivered there’s something wrong with your Redis setup, if the the events are being delivered then we need to look at the client Pub/Sub connection with the sentinel to see why the events were not fired.

During the re-connection it’s possible to lose some sentinel events, but this is just pure speculation without verifying what’s actually happening. For it to miss the failover messages the redis client must not be connected to any sentinels and the master has to failover during that window.

tbednarz · June 15, 2019, 7:04pm

Yes.
One per database instance, that means six per host

yes

I run several ServiceStack based servers as docker containers. They use usually three Redis databases as you can see in my App.Configure() code. One is uses as SessionCache, one as ServiceStack User Database and one as my servers configuration database.

I can see if I can install wireshark on a minimal CentOS install. It would require to run a capture on a command line and analize this on a machine with a UI. Not so easy to setup…

tbednarz · June 15, 2019, 7:58pm

Ok, I figured out how that works with wireshark on cli only, including how to set filters. These are required since captures can grow VERY quickly. I will do another test on Monday.

tbednarz · June 17, 2019, 12:18pm

Hi Demis,
I have a lot of capture data but its not so easy to read it… its all binary. When following a TCPSTREAM in Wireshark I see a bit more, but I do not see something like +switch-master which would show the notification that the failover has finished.

I did a capture on my docker host (all traffic). If I look at the traffic received from the slave with IP 172.16.63.52 I can see things like:

*1
$4
INFO
$1634
# Server
redis_version:5.0.5
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:619d60bfb0a92c36
redis_mode:sentinel
os:Linux 3.10.0-957.21.2.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:4.8.5
process_id:3658
run_id:0f271a3ac5a5abe6e8cff319367ee14fdab6d09b
tcp_port:26387
uptime_in_seconds:12272
uptime_in_days:0
hz:17
configured_hz:10
lru_clock:484017
executable:/usr/bin/redis-sentinel
config_file:/etc/sentinel_26387_config.conf

# Clients
connected_clients:4
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# CPU
used_cpu_sys:22.110637
used_cpu_user:28.294331
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

# Stats
total_connections_received:4
total_commands_processed:53507
instantaneous_ops_per_sec:5
total_net_input_bytes:3223872
total_net_output_bytes:320354
instantaneous_input_kbps:0.26
instantaneous_output_kbps:0.03
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=redis_6387_config,status=ok,address=172.16.63.51:6387,slaves=3,sentinels=7

*3
$8
SENTINEL
$9
sentinels
$17
redis_6387_config
*6
*28
$4
name
$40
f849e1d51d636ec445868d5d1377ddb033b70983
$2
ip
$12
172.16.63.51
$4
port
$5
26387
$5
runid
$40
f849e1d51d636ec445868d5d1377ddb033b70983
$5
flags
$8
sentinel
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$1
0
$18
last-ok-ping-reply
$3
240
$15
last-ping-reply
$3
240
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$2
55
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*28
$4
name
$40
75d6da2fe2b67695af9e015c8fc39766a02e7a41
$2
ip
$12
172.16.63.57
$4
port
$5
26387
$5
runid
$40
75d6da2fe2b67695af9e015c8fc39766a02e7a41
$5
flags
$8
sentinel
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$1
0
$18
last-ok-ping-reply
$3
240
$15
last-ping-reply
$3
240
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$3
111
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*30
$4
name
$40
1e3cc53dc55e1ff596123ab79f9c650fcae0a14c
$2
ip
$12
172.16.63.51
$4
port
$1
0
$5
runid
$40
1e3cc53dc55e1ff596123ab79f9c650fcae0a14c
$5
flags
$28
s_down,sentinel,disconnected
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$8
12271620
$18
last-ok-ping-reply
$8
12271620
$15
last-ping-reply
$8
12271620
$11
s-down-time
$8
12256616
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$8
12271620
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*30
$4
name
$40
f5231d532d66789170cdc388d78cd90528fb2a9b
$2
ip
$12
172.16.63.53
$4
port
$1
0
$5
runid
$40
f5231d532d66789170cdc388d78cd90528fb2a9b
$5
flags
$28
s_down,sentinel,disconnected
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$8
12271620
$18
last-ok-ping-reply
$8
12271620
$15
last-ping-reply
$8
12271620
$11
s-down-time
$8
12256616
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$8
12271620
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*30
$4
name
$40
910f66968adeb8ad7817931b8534af2c0897b13d
$2
ip
$12
172.16.63.51
$4
port
$1
0
$5
runid
$40
910f66968adeb8ad7817931b8534af2c0897b13d
$5
flags
$28
s_down,sentinel,disconnected
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$8
12271620
$18
last-ok-ping-reply
$8
12271620
$15
last-ping-reply
$8
12271620
$11
s-down-time
$8
12256615
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$8
12271620
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*28
$4
name
$40
547b4d726635832d950a4af6dae166889c71c0f1
$2
ip
$12
172.16.63.53
$4
port
$5
26387
$5
runid
$40
547b4d726635832d950a4af6dae166889c71c0f1
$5
flags
$8
sentinel
$21
link-pending-commands
$1
0
$13
link-refcount
$1
1
$14
last-ping-sent
$1
0
$18
last-ok-ping-reply
$3
240
$15
last-ping-reply
$3
240
$23
down-after-milliseconds
$5
15000
$18
last-hello-message
$3
877
$12
voted-leader
$1
?
$18
voted-leader-epoch
$1
0
*1
$4
QUIT
+OK

For the same period of time I see in the log of the sentinel with port 26387:

3658:X 17 Jun 2019 08:27:13.597 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
3658:X 17 Jun 2019 08:27:13.602 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=3658, just started
3658:X 17 Jun 2019 08:27:13.602 # Configuration loaded
3658:X 17 Jun 2019 08:27:13.602 * supervised by systemd, will signal readiness
3658:X 17 Jun 2019 08:27:13.619 * Running mode=sentinel, port=26387.
3658:X 17 Jun 2019 08:27:13.619 # Sentinel ID is cb7a3251ac116a0381b70622a9433ee72bc86d34
3658:X 17 Jun 2019 08:27:13.619 # +monitor master redis_6387_config 172.16.63.52 6387 quorum 2
3658:X 17 Jun 2019 08:27:13.673 # +new-epoch 12
3658:X 17 Jun 2019 08:27:13.673 # +config-update-from sentinel f849e1d51d636ec445868d5d1377ddb033b70983 172.16.63.51 26387 @ redis_6387_config 172.16.63.52 6387
3658:X 17 Jun 2019 08:27:13.673 # +switch-master redis_6387_config 172.16.63.52 6387 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:13.673 * +slave slave 172.16.63.53:6387 172.16.63.53 6387 @ redis_6387_config 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:13.673 * +slave slave 172.16.63.57:6387 172.16.63.57 6387 @ redis_6387_config 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:13.673 * +slave slave 172.16.63.52:6387 172.16.63.52 6387 @ redis_6387_config 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:28.600 # +sdown sentinel 1e3cc53dc55e1ff596123ab79f9c650fcae0a14c 172.16.63.51 0 @ redis_6387_config 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:28.600 # +sdown sentinel f5231d532d66789170cdc388d78cd90528fb2a9b 172.16.63.53 0 @ redis_6387_config 172.16.63.51 6387
3658:X 17 Jun 2019 08:27:28.600 # +sdown sentinel 910f66968adeb8ad7817931b8534af2c0897b13d 172.16.63.51 0 @ redis_6387_config 172.16.63.51 6387

The entry 3658:X 17 Jun 2019 08:27:13.673 # +switch-master redis_6387_config 172.16.63.52 6387 172.16.63.51 6387 which says that it switched the master from 172.16.63.51 to 172.16.63.52 does not show up anywhere in my capture.

If you want to look into the capture yourself, you can send my a private email (to the address in my profile) and I can send you back a download link to a ZIP file containing logs and capture. I cannot publish this information here.

mythz · June 17, 2019, 3:56pm

Yeah that’s the best way to view a TCP’s session text protocol.

Which would indicate that the RedisSentinel client never received the event, so possibly an invalid Sentinel configuration.

tbednarz · June 18, 2019, 2:42pm

Ok, we cannot find any misconfiguration and on the server side it works fine, just the .NET client is not able to perform the failover.

So we decided to get rid of the Redis ReplicaSet within the next few months and replace the replicated redis databases with MongoDB. MongoDB is a lot smarter and the failover works on several platforms without any problems. The sentinels also make changes to the config files which is very bad, since we use RedHat Ansible to manage our entire infrastructure. It is very hard to detect changes in a configuration file, which were performed by something else than Ansible. That’s another reason why we believe Redis Replica Sets are not suitable for us, MongoDB is a much better choice.

For the time being we have an organizational work-around if we need to reboot replica set members (which only happens if we have to apply kernel updates or very special security updates).

So I will not continue to research this any further. Thanks for your help anyway.

mythz · June 18, 2019, 6:20pm

The client performs the failover fine when it’s been instructed to, you can even manually force the sentinel to failover but it can’t automatically perform a failover for an event it never received. Given it’s a core function of Sentinel instances to monitor redis instances, perform necessary failovers and propagate those changes to subscribed clients, the fact it’s not sending those events would suggest to me an invalid configuration issue.

IMO you’ve made the configuration complicated for yourself by shooting for max efficiency of having a separate Redis instance per database, so instead of the typical 1x Master and 2x slaves you’re running 6x Masters, which presumably means you’re also running 6x master groups that I’m still not clear how the Sentinels were configured to monitor, whether you’ve got dedicated sentinel instance monitoring different master groups or all sentinel instances monitoring all master groups. I’m also not sure if you’re sentinel instances are monitoring redis instances across GEO boundaries, it sounds like they are, but they’re typically used for monitoring redis instances in the same network.

Something tells me if you went with a managed cloud redis configuration (or Redis Labs geo replication) using their Multi AZ deployment / geo redundancy that you wouldn’t have this issue of missing failover events.