We have been experiencing intermittent exceptions using Azure Redis as our ICacheClient.
Frequently, we receive the following exception from our service for methods that cache responses in the ICacheClient:
Internal Server Error: 500 - Internal Server Error: Exceeded timeout of 00:00:03
Needless to say, the responses are not cached in Redis when we get this exception.
The problem goes away a short time later, after which the service caches and returns the correct responses. The trace logs don’t give us much to go on.
Does anyone have any pointers on what the problem could be related to?
We are using the C1 1GB basic tier in Azure, and our site does not have significant traffic at present.
We are registering redis as our ICacheClient as follows:
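Roughly, it’s the standard PooledRedisClientManager registration in our AppHost (this is a simplified sketch; the connection string and app name below are placeholders):

```csharp
using Funq;
using ServiceStack;
using ServiceStack.Caching;
using ServiceStack.Redis;

public class AppHost : AppHostBase
{
    public AppHost() : base("MyApp", typeof(AppHost).Assembly) {}

    public override void Configure(Container container)
    {
        // Placeholder connection string for the Azure Redis cache (SSL port 6380)
        var redisConnectionString =
            "ourcache.redis.cache.windows.net:6380?ssl=true&password={access-key}";

        // One pooled client manager shared across the app
        container.Register<IRedisClientsManager>(c =>
            new PooledRedisClientManager(redisConnectionString));

        // Expose Redis as the ICacheClient used for cached responses
        container.Register<ICacheClient>(c =>
            c.Resolve<IRedisClientsManager>().GetCacheClient());
    }
}
```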
Yes, when it encounters an error it will automatically retry the command for a default of 3 seconds before throwing a Timeout exception. There are continuous underlying errors that are causing it to retry and ultimately throw the Timeout. You can try disabling the retry timeout to see if it sheds light on what the underlying error is:
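e.g. something along these lines during startup, before any Redis clients are created (a sketch):

```csharp
// Disable the automatic command retries so the underlying Exception surfaces
// immediately instead of being retried until the RetryTimeout (default 3000ms)
// is exceeded. Set this once on startup.
RedisConfig.DefaultRetryTimeout = 0;
```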
Can you show me how to configure this RedisConfig.DefaultRetryTimeout on our registration of PooledRedisClientManager, as part of the connection string AND in code?
@jezzsantos Did the timeout change resolve your issue? We are having the exact same issue using Redis on Azure. Our website is also hosted on Azure (we haven’t ruled that out as a possible cause either).
We also have a separate Redis setup on redislabs.com and see the same issues when pointing at it, which suggests this may not be an Azure Redis-specific problem. Every so often our connection count reaches 80% of the 300-connection maximum and triggers alarms. Our app currently has no more than about 10 concurrent users, so 200+ connections points to some kind of retry issue. It ultimately ends in a SocketException (“A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond”). This is after setting the retry timeout to 0.
Other articles on StackOverflow suggest that Redis persistence (RDB) may be causing the timeout when it flushes to disk, so we’ve disabled it, but we are still seeing the same connection issues.
Just wanted to know if there has been a solution or progress made on your end.
@Nadeem
No, the timeout change made no difference.
In fact, what is weird is that we put ?RetryTimeout=10000 on the end of our connection string and saw ‘Exceeded timeout of 00:00:10’, which confirmed the setting was working. But every now and again we still see ‘Exceeded timeout of 00:00:03’.
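For reference, the connection string looked something like this (host and key are placeholders; appending RetryTimeout was the only change):

```csharp
// Placeholder values — RetryTimeout is in milliseconds
var redisConnectionString =
    "ourcache.redis.cache.windows.net:6380?ssl=true&password={access-key}&RetryTimeout=10000";
```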
It has us totally puzzled, and it’s killing our site very frequently.
Thanks @mythz and @jezzsantos. I’ll put in a call to Azure support to see what “patch” they deployed to rectify the problem. Do you happen to know any details about which settings they tweaked?
I ask because we have the same issue on our redislabs.com instance, which is hosted on AWS, and I presume they are running Linux rather than Windows. I can ask them to make similar adjustments to see if it might be a configuration issue.
No, I don’t know any details about the Azure patch other than that it’s installed on their server Redis instances. They say they’re already aware of the problem, so it should be easy to request.
I just wanted to follow up here with some additional information. We had Microsoft support patch our Redis as well and have not seen any issues since. I inquired as to what the patch addresses:
From Msft Support:
Root Cause:
We found an issue wherein long-standing idle connections were not being cleanly closed. ServiceStack would continue to see these connections as still connected and could try to reuse them to send additional requests, which would result in connection errors. Eventually this would recover. The patch has fixed this issue.
The patch was to close the connection and notify the client.
I don’t know if there is something we can do on our end to mitigate this (i.e. self-close long-idle connections, use a keep-alive, etc.) in case other hosts are not able to apply a patch, but I wanted to pass along the root cause and solution for posterity.
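For anyone who can’t get their server patched: if I’m reading the client options right, PooledRedisClientManager exposes an IdleTimeOutSecs setting, and something like the sketch below might help, though I haven’t verified it against this particular issue (connection string and value are just examples):

```csharp
// Untested sketch: lower the client-side idle threshold so the pool retires
// connections before the server silently drops them (Azure Redis closes
// connections that have been idle for ~10 minutes).
var manager = new PooledRedisClientManager(
    "ourcache.redis.cache.windows.net:6380?ssl=true&password={access-key}")
{
    IdleTimeOutSecs = 240  // treat connections idle for more than 4 minutes as stale
};
```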