SSE Client Heartbeat Hang

We have a fairly frequent, recurring issue with the SSE client: occasionally it gets stuck after “Prep for Heartbeat” and never recovers. We’ve been trying to track down the root cause but have been unsuccessful so far. We’re on a private network with ~20 clients connected for the majority of each day: 6 of them are browsers (using ss-utils), the rest are Windows apps using the built-in SS framework.


Exhibit A

Monday, Dec 19th. We did see a short network blip (most clients disconnected, then reconnected) ~1 min before losing this client. At the time this client lost its connection, 1 other client also lost its connection - that client’s SSE did a Stop() & Start() and continued working just fine. This client’s SSE never tried to restart itself.

Here’s the relevant part of our client-side log:

14:41:24.4974 DEBUG [ServerEventsClient] [SSE-CLIENT] Prep for Heartbeat…
14:41:24.5124 DEBUG [ServerEventsClient] [SSE-CLIENT] LastPulseAt: 14:41:24.5124411
14:41:24.5124 DEBUG [ServerEventsClient] [SSE-CLIENT] OnHeartbeatReceived: (ServerEventHeartbeat) #5 on #user35 (vecclient:v003s00-r027, allvecclients)
14:41:24.5124 DEBUG [ServerEventsClient] [SSE-CLIENT] Heartbeat sent to: http://vecadminserver/api/event-heartbeat?id=aBg9k004325kdzXCQxw3
14:41:44.5246 DEBUG [ServerEventsClient] [SSE-CLIENT] Prep for Heartbeat…
14:41:44.5246 DEBUG [ServerEventsClient] [SSE-CLIENT] Heartbeat sent to: http://vecadminserver/api/event-heartbeat?id=aBg9k004325kdzXCQxw3
14:41:44.5246 DEBUG [ServerEventsClient] [SSE-CLIENT] LastPulseAt: 14:41:44.5336282
14:41:44.5446 DEBUG [ServerEventsClient] [SSE-CLIENT] OnHeartbeatReceived: (ServerEventHeartbeat) #6 on #user35 (vecclient:v003s00-r027, allvecclients)
14:42:04.5379 DEBUG [ServerEventsClient] [SSE-CLIENT] Prep for Heartbeat…
14:42:04.5539 DEBUG [ServerEventsClient] [SSE-CLIENT] Heartbeat sent to: http://vecadminserver/api/event-heartbeat?id=aBg9k004325kdzXCQxw3
14:42:04.5539 DEBUG [ServerEventsClient] [SSE-CLIENT] LastPulseAt: 14:42:04.5539860
14:42:04.5670 DEBUG [ServerEventsClient] [SSE-CLIENT] OnHeartbeatReceived: (ServerEventHeartbeat) #7 on #user35 (vecclient:v003s00-r027, allvecclients)
14:42:24.5702 DEBUG [ServerEventsClient] [SSE-CLIENT] Prep for Heartbeat…
14:42:24.5702 DEBUG [ServerEventsClient] [SSE-CLIENT] LastPulseAt: 14:42:24.5792432
14:42:24.5832 DEBUG [ServerEventsClient] [SSE-CLIENT] OnHeartbeatReceived: (ServerEventHeartbeat) #8 on #user35 (vecclient:v003s00-r027, allvecclients)
14:42:24.5702 DEBUG [ServerEventsClient] [SSE-CLIENT] Heartbeat sent to: http://vecadminserver/api/event-heartbeat?id=aBg9k004325kdzXCQxw3
14:42:44.6011 DEBUG [ServerEventsClient] [SSE-CLIENT] Prep for Heartbeat…
14:42:46.1943 DEBUG [ServerEventsClient] [SSE-CLIENT] OnMessageReceived: 9 on #user35 (vecclient:v003s00-r027, allvecclients)
14:44:49.4449 DEBUG [ServerEventsClient] Stop()

Heartbeats are regular until 14:42:44, where the client got stuck somehow. The OnMessageReceived that follows was a command sent from the server, which the client received and processed just fine. However, without an active heartbeat the server considered the client disconnected, and we had to manually restart the application.

Interestingly, shortly after the last “Prep for Heartbeat”, the client app did a separate HTTP POST (a status update it does every 3-15 seconds) that took 12 seconds to get a response - something that normally takes only 0.5-1 second. The server logs, however, show that the request was processed in under a second.

On the server side of things, here’s the request that triggered the message that was received on the client at 14:42:46:

14:42:41.1919 DEBUG [AppHost] [GlobalFilters] Request: SessionID=r4QoSAew0PMc8rJ1Rpj1 Id=10351230 Url=/api/vecclient/dimensions/stop
14:42:41.1919 DEBUG [MemoryServerEvents] [SSE-SERVER] Sending vecclient:v003s00-r027@cmd.DimensionStopDto msg to aBg9k004325kdzXCQxw3 on (vecclient:v003s00-r027, allvecclients)
14:42:41.1919 DEBUG [AppHost] [GlobalFilters] Response: SessionID=r4QoSAew0PMc8rJ1Rpj1 Id=10351230 Url=/api/vecclient/dimensions/stop

Our server-side logs show plenty of normal activity during the surrounding time period; requests were still coming in and being processed in a timely fashion.


I’m not sure what other data I can provide that would be useful, but let me know if there’s something. The network does seem to be a trigger for this locked-up state (the network sometimes gets flooded with traffic, or the Wi-Fi temporarily drops - these are separate issues we’re investigating, but SSE should recover on its own regardless).

Before we can attempt to resolve network/unreliability issues like this, we really need a repro so we can observe and identify what’s causing it, which may depend on context we don’t have, i.e. something specific to your environment.

The “Prep for Heartbeat…” log happens in the Heartbeat method, which doesn’t do much other than send a new heartbeat using HttpWebRequest. I’ve added more debug logging in this commit, which should tell us whether the HTTP heartbeat request was actually sent. This change is available on MyGet.

The only place I see it hanging in that method is when sending the HTTP request, and something that can cause a .NET HttpWebRequest to hang is having too many concurrent connections to the same host, either from having too many in-flight requests or from not properly disposing existing connections. So if you’re sending other .NET WebRequests yourself, make sure the Request/Response streams are properly disposed.
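
For example, a shape like this keeps the underlying connection returning to the ServicePoint pool (the helper name, URL and payload below are placeholders, not anything from your app):

using System.IO;
using System.Net;

static class HttpHelper
{
    // Disposing the WebResponse (and its stream) releases the underlying
    // connection back to the ServicePoint pool. An undisposed response
    // keeps a connection occupied and can block later requests, including
    // the SSE heartbeat.
    public static string PostJson(string url, string json)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "POST";
        req.ContentType = "application/json";

        using (var reqStream = req.GetRequestStream())
        using (var writer = new StreamWriter(reqStream))
        {
            writer.Write(json);
        }

        using (var res = (HttpWebResponse)req.GetResponse())
        using (var resStream = res.GetResponseStream())
        using (var reader = new StreamReader(resStream))
        {
            return reader.ReadToEnd();
        }
    }
}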

One thing that would help identify the issue is to find out which method the thread is blocked on, e.g. by attaching a debugger to the hung process and inspecting its thread stacks.
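
In the meantime, if you need clients to self-recover while we narrow this down, a crude client-side watchdog that forces a Stop()/Start() when heartbeats go quiet is one option. A rough sketch - NotePulse() is a placeholder you’d call from wherever your app observes heartbeats (e.g. the same place that log line comes from), and the thresholds are arbitrary:

using System;
using System.Threading;
using ServiceStack;

// Stop-gap only, not a fix: restarts the SSE client if no heartbeat has
// been observed within the stale window.
class SseWatchdog : IDisposable
{
    readonly ServerEventsClient client;
    readonly TimeSpan staleAfter;
    readonly Timer timer;
    long lastPulseTicks = DateTime.UtcNow.Ticks;

    public SseWatchdog(ServerEventsClient client, TimeSpan staleAfter)
    {
        this.client = client;
        this.staleAfter = staleAfter;
        // Check twice per stale window
        timer = new Timer(Check, null, staleAfter, TimeSpan.FromTicks(staleAfter.Ticks / 2));
    }

    // Placeholder hook: call this whenever a heartbeat (or any event) arrives.
    public void NotePulse() =>
        Interlocked.Exchange(ref lastPulseTicks, DateTime.UtcNow.Ticks);

    void Check(object state)
    {
        var last = new DateTime(Interlocked.Read(ref lastPulseTicks), DateTimeKind.Utc);
        if (DateTime.UtcNow - last > staleAfter)
        {
            // Same recovery the healthy client performed on its own: Stop() + Start()
            client.Stop();
            client.Start();
            NotePulse(); // reset so we don't immediately restart again
        }
    }

    public void Dispose() => timer.Dispose();
}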

I’m glad you added the logging - that should help narrow it down. I’ve looked through that code pretty thoroughly and had hoped we could add logging to get a better idea of where it’s getting stuck.

We did run into the max-concurrent-requests limit at one point (a HEAD request whose response we didn’t close). We fixed that bug and increased ServicePointManager.DefaultConnectionLimit - which I’d actually recommend adding as a note on the Server Events wiki, since the .NET default is only 2 and the SSE client uses 2 connections by itself (1 connection is always open, and an additional connection is made for the heartbeats). We also discovered that our status updates could delay heartbeats, or vice versa.
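
For anyone else who hits this, the startup change is a one-liner (the value below is only illustrative, not what we actually use):

using System.Net;

static class AppStartup
{
    public static void ConfigureHttp()
    {
        // The .NET Framework default of 2 concurrent connections per host is
        // easily exhausted once the SSE client's always-open /event-stream
        // connection and a heartbeat request are in flight alongside the
        // app's own requests. Size the limit to your own traffic.
        ServicePointManager.DefaultConnectionLimit = 100;
    }
}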

Considering the low repro rate, and that we only see this happen in prod, I’m not sure we’ll be able to debug it, but we’ll see what we can do on that front.