Issues With Persistent Socket Connections
-
June 3, 2012, 6:39 PM
Background:
I am working on an "internet of things" project that communicates with hundreds (eventually thousands) of devices, each of which makes a persistent socket connection to the server on port 8080. These connections are supposed to be truly persistent so that the server can control the devices 24x7. My application runs under IIS and has two parts: one that allows clients to communicate over ASP.NET to monitor the devices, and one that accepts socket connections from remote devices and creates a thread that handles communication to and from each device, updates the databases, etc. The devices communicate using TCP sockets on port 8080 and are designed to reconnect automatically if the connection to the server drops. The initial connection is always established by the device in order to pass through the firewall. Once the connection is established, the server can send data to the device at any time. In addition to the regular communication between the device and the server, the server sends a heartbeat message every 30 seconds. The device acknowledges this heartbeat message with a response. My server is Windows 7 64-bit with 32 GB of RAM.
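As an illustration of the heartbeat scheme described above, here is a minimal sketch (in Python for brevity, not the poster's actual .NET code) of server-side bookkeeping that flags devices whose acknowledgement is overdue. The class name `HeartbeatTracker`, the device IDs, and the "two missed heartbeats" threshold are all assumptions made for the example; only the 30-second interval comes from the post.

```python
import time


class HeartbeatTracker:
    """Tracks the last heartbeat acknowledgement seen from each device."""

    def __init__(self, interval=30.0, missed_allowed=2, clock=time.monotonic):
        self._interval = interval              # seconds between heartbeats (from the post)
        self._missed_allowed = missed_allowed  # hypothetical tolerance before flagging
        self._clock = clock                    # injectable so the logic can be tested
        self._last_ack = {}

    def record_ack(self, device_id):
        """Call whenever a device answers a heartbeat."""
        self._last_ack[device_id] = self._clock()

    def stale_devices(self):
        """Return the IDs whose last ack is older than the allowed window."""
        limit = self._interval * self._missed_allowed
        now = self._clock()
        return sorted(d for d, t in self._last_ack.items() if now - t > limit)
```

Logging the time at which a device first appears in `stale_devices()` next to the time of the eventual socket error would help show whether the heartbeats stop before the connection is reset or at the same moment.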
The Problem:
In practice, I am finding that not all devices stay connected persistently as expected. Some stay connected all the time and some of them go through hundreds of connection drops and reconnects. Since the devices connect through a variety of ISPs, routers, etc., I am having a hard time finding a pattern that explains why the socket connection drops. On some sites the connection drops roughly every 5 minutes; other sites may go a whole day without a dropped connection.
Question:
What are all the possible causes of these dropped connections?
All Replies
-
June 3, 2012, 8:42 PM
I'll assume the connections existed before they were dropped. TCP has two parameters that can cause a connection to be dropped.
TCP parameters
1) Idle time - When a channel doesn't have any traffic for a period of time, a connection can close.
2) Max Retries - TCP has a parameter usually set between 3 and 5 retries. When the max retry count is reached, a connection may close. On a multiple-hop network, when one router stops working another router should be able to take over forwarding the data. This doesn't always happen. Sometimes after a server goes down it takes a few seconds for another router/server to make the connection and continue sending the data. The new connection time may be too long, and you can exceed the idle time and lose the connection.
Non-TCP reasons you can lose a connection
1) Some servers have been putting limits on the max size of data, or slow down the data rate when a max size is reached.
2) Server going offline.
3) Too many errors. Servers may be programmed to drop connections when too many errors occur.
4) Server/Router busy. When a server/router is too busy to handle traffic, retry attempts go up, which can cause the idle time or max retry limits to be reached.
5) Intermittent hardware failures.
jdweng
-
June 4, 2012, 12:58 AM
Joel, thanks for your reply. Here are some more details on my setup based on your response.
1. All connections work successfully before dropping, and then reconnect.
2. The server (the product that I am responsible for and developed) is up all the time, and while some devices drop, others stay on. There are always several hundred devices that are connected and communicating successfully.
3. We send a heartbeat message every 30 seconds, so connections are never idle for more than 30 seconds.
4. The data rate from our devices is very low: several KB/minute.
My ISP is the cable company. I don't know if they have a max limit on the number of incoming connections, but this does not seem likely, as devices connect and reconnect.
Jason Porter
-
June 4, 2012, 1:44 AM
If you have the linger state turned on in TCP, the default idle time before a connection closes is 30 seconds. I would try changing the heartbeat to 15 seconds and see if the failures stop happening.
jdweng
-
June 5, 2012, 2:49 AM
This does not explain why hundreds of sites stay connected persistently while others drop the connection at random intervals. Because of this I assume that the connections are being dropped at the clients. Is there a way, using Wireshark or other tools, to determine whether the connection was dropped by the server or the client?
Jason Porter
-
June 6, 2012, 2:05 PM
The issue is likely with NAT/firewalls. I suspect you'll know this, but... The routers we use to connect to the internet these days mostly have NAT (Network Address Translation) components included. That allows the ISP to give out one IP address while multiple PCs are used on the LAN. The NAT converts the packets of multiple connections from multiple IP addresses on the LAN into multiple connections from one IP address on the Internet side. Each connection that a NAT is handling takes up some memory in the NAT. Therefore most NATs will discard a connection after some idle time.
I guess that's what is closing the connections, but I would have thought that a 30-second heartbeat would have been enough to keep the connection being seen as active and thus not ready to be discarded.
Is the data you are sending/receiving on port 8080 in HTTP protocol or not?
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 6, 2012, 2:16 PM
Joel, where on earth do you get this stuff!!! :-(
There is NO idle time in the TCP/IP protocol. It is specifically designed to be able to sit without sending any traffic and not closing. If idle times are used they are added by protocols above TCP/IP. Where do you get this 30 seconds timeout?
Linger is used ONLY at the graceful close of a connection when there is send data pending. There is no programmatic close of the connection here so that linger condition does not occur. Remember that the case here is the connection somehow being closed by the network and not by the two programs.
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 6, 2012, 2:35 PM
Alan: read the LingerState method. The TCP specification has a requirement for idle shutdown of the connection, and I tried enabling the linger state in TCP and the connection closed in exactly 30 seconds.
jdweng
-
June 6, 2012, 3:54 PM
Please quote me where the TCP specification(s) say to do idle shutdown. I have double-checked RFC 793 and RFC 1122 today and they don't say to close on idle.
On LingerState:
"The LingerState property changes the way Close method behaves."
"Specifies whether a Socket will remain connected after a call to the Close or Close methods "
"The LingerState property changes the way Close method behaves."
Nothing about closing on idle. All of the above is about what happens after the program calls Close.
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 6, 2012, 4:41 PM
Did you look at RFC 5482? I was using the timeout on a military radio network, so I knew it existed. On a multi-hop network you need it so that if one router/server goes down in the middle of traffic, it is still possible to complete the transfer with another router/server.
http://tools.ietf.org/html/rfc5482
I first found the reference in RFC 2616, which is also an HTTP timeout requirement.
Did you look at RFC 2616?
http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html
This document is strange: an HTTP requirement is implying that the requirement is TCP. Now I have to find the TCP requirement.
8.1.4 Practical Considerations
Servers will usually have some time-out value beyond which they will no longer maintain an inactive connection. Proxy servers might make this a higher value since it is likely that the client will be making more connections through the same server. The use of persistent connections places no requirements on the length (or existence) of this time-out for either the client or the server.
jdweng
-
June 6, 2012, 9:28 PM
Neither of those have anything to do with TCP idle timeout. The former is not implemented in Windows so couldn't be having an effect here. It also is for changing the TCP non throughput timeout, and not an idle timeout. An idle timeout by definition only applies when no data is being sent at all.
The second case, RFC2616, is exactly what I was saying earlier: it is a timeout in an upper-layer protocol i.e. HTTP.
I think you may actually be talking about the Retransmission failure case. If TCP has data to send and the network breaks such that the data doesn't get to the other end and no ACKs are received from the peer then TCP will retry three times (by default) and then will abort the connection and signal an error to the application. That's not an idle timeout.
Jason, what behaviour do you get on connection close (at both ends)? Do you get EOF, i.e. a zero-byte return, or do you get an error, and if so which code? 10061?
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 6, 2012, 9:49 PM
This is the definition of LingerState. When I set the linger state the connection closed after 30 seconds. LingerState set to true gives a nonzero time-out, which defaults to 30 seconds.
| LingerState.Enabled | LingerState.LingerTime | Behavior |
| --- | --- | --- |
| false (disabled), the default value | The time-out is not applicable (default). | Attempts to send pending data until the default IP protocol time-out expires. |
| true (enabled) | A nonzero time-out | Attempts to send pending data until the specified time-out expires, and if the attempt fails, then Winsock resets the connection. |
| true (enabled) | A zero time-out. | Discards any pending data and Winsock resets the connection. |
jdweng
-
June 8, 2012, 7:58 AM
Alan, when a connection drops I get the generic Error=System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) during read.
BTW, the selection of the port was totally random. I have a device-server protocol that consists of very short binary packets (200 bytes max) over the socket, so it is not HTTP. But I think you may be on to something regarding the NAT/firewall limit. It certainly feels like that. I have a cable ISP with a Netgear WNDR37 router. Out of desperation I was going to replace the Netgear with a Zywall USG 20 just to see if the Netgear router has a limit that is causing this issue. I have not looked at the packets with Wireshark but have spent many hours with the debugger on the server. The observation from the server is that there is some upper limit (around 600-700) of devices that connect successfully and stay connected persistently, and then I have 100 or so devices that keep dropping and reconnecting.
Jason Porter
-
June 8, 2012, 9:36 AM
Just a short one just now.
What's the value of property SocketErrorCode and NativeErrorCode on the SocketException?
As for the HTTP question, I was just wondering whether some NATs at the device locations assume that traffic on port 8080 is web traffic and thus close the socket when no HTTP request is seen.
What's the whole set-up? One 'server' at your location, connected behind an ADSL router, and lots of devices on the internet also behind ADSL routers? And which end connects to which?
So it could be the NAT at your end or at the remote ends closing the connection. I originally presumed the latter, to explain the wide difference in behaviour that you see.
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 8, 2012, 10:46 AM
Alan, the error is: System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) during read.
As far as my setup on the server side goes, it is very simple. I have a cable modem connected to a Netgear router with just one PC (the Windows 7 x64 server). That's why I think the culprit could be the Netgear router.
Jason Porter
-
June 8, 2012, 11:08 AM
Any error starting with 0x8 is a permission error or an invalid pointer. The router could be the problem if it is corrupting bits in the Ethernet stream and corrupting the data. Are you sure that a virus isn't trying to hack your network and discover passwords? The 0x8 indicates that somebody is trying to make a connection with the wrong credentials or attempting to access information that they aren't allowed to. Sometimes it is an invalid pointer. It could be an FTP application trying to read/write a file in a directory it doesn't have permission to write to.
Port 8080 is a common port number and you just may be conflicting with another process using port 8080. I would use a less common port number over 10,000 where you wouldn't get any conflicts with other applications. A network connection consists of a source IP address, a destination IP address, and a port number. You can only have one connection at a time with all 3 parameters being the same. So if another application is using port 8080 your application won't be able to make a connection and it will fail occasionally (when the connection is already being used).
jdweng
-
June 8, 2012, 3:18 PM
This is the standard .NET error code when you are trying to read and the connection is closed; it has nothing to do with permissions. My development environment is the .NET Framework and C#. This has nothing to do with viruses, and I have tried less common ports and got the same behavior.
Jason Porter
-
June 8, 2012, 3:22 PM
Jason: Yes, when you have a connection that is closed you have an invalid NULL pointer. That is why it is listed as a 0x8 error message.
jdweng
-
June 8, 2012, 8:15 PM
Joel, do you just make this stuff up?!? It would be better if you would just quote facts. :-(
.NET Frameworks include a HResult property so that the exception can be represented in COM. See http://msdn.microsoft.com/en-us/library/system.exception.hresult.aspx "COM methods report errors by returning HRESULTs; .NET methods report them by throwing exceptions. The runtime handles the transition between the two. Each exception class in the .NET Framework maps to an HRESULT."
If we look at every SocketException, every Win32Exception and nearly every ExternalException, we see that the HResult is 0x80004005. And if we look at the disassembly of all the constructors of those three classes, we see that value is set only by the three general constructors on ExternalException:
IL_0008: ldc.i4 0x80004005
IL_000d: call instance void System.Exception::SetErrorCode(int32)
So that, and that alone, is where the 0x80004005 value comes from. So hopefully that shows the folly in trying to make some amazing assumption that it is due to an "invalid NULL pointer". :-(
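To make the HRESULT point concrete, here is a small sketch (in Python, as a neutral illustration rather than .NET code) that splits an HRESULT into its standard fields: the top bit is the failure flag, bits 16-26 are the facility, and the low 16 bits are the code. The function name `decode_hresult` is made up for this example; the three named constants are the standard Windows values.

```python
# Standard HRESULT values, listed here for comparison.
KNOWN = {
    0x80004005: "E_FAIL",          # unspecified failure
    0x80004003: "E_POINTER",       # invalid pointer
    0x80070005: "E_ACCESSDENIED",  # general access denied
}


def decode_hresult(value):
    """Split a 32-bit HRESULT into (failure flag, facility, code, name)."""
    value &= 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),       # leading hex digit 8 just means "failed"
        "facility": (value >> 16) & 0x7FF,  # 0 = FACILITY_NULL, 7 = FACILITY_WIN32
        "code": value & 0xFFFF,
        "name": KNOWN.get(value, "unknown"),
    }


print(decode_hresult(0x80004005))
```

Decoded this way, 0x80004005 is the generic E_FAIL value. The invalid-pointer and access-denied HRESULTs are different numbers, so the leading 0x8 alone says nothing beyond "the call failed".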
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 8, 2012, 8:16 PM
We didn't have the value of the two properties I mentioned above quoted there, but we see the exception message "An existing connection was forcibly closed by the remote host". And if we look at http://msdn.microsoft.com/en-us/library/windows/desktop/ms740668(v=vs.85).aspx we see that's the message for WSAECONNRESET with id 10054. (So NativeErrorCode on SocketException would have been 10054 and SocketErrorCode would have been ConnectionReset see http://msdn.microsoft.com/en-us/library/system.net.sockets.socketerror.aspx)
So we know that's the error that occurs when we get a RST packet from the remote end. I bet if you run a netmon http://blogs.technet.com/b/netmon/ (or Wireshark etc.) trace then you'll see a RST coming into the server from the network. (It would be interesting to see if the remote client sent that, or some peculiar device in the path sent it...)
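The difference between a graceful close (zero-byte read) and a reset can be reproduced on a loopback connection. The following is a minimal Python sketch, not the poster's server code: the helper name `observe_close` is hypothetical, and it forces a RST by enabling linger with a zero timeout before closing, which is the "zero time-out" row of the LingerState table. Python's `ConnectionResetError` corresponds to the WSAECONNRESET (10054) condition described above.

```python
import socket
import struct
import sys


def observe_close(abortive):
    """Close the client side of a loopback connection and report what the
    peer's recv() sees: 'eof' for a graceful close, 'reset' for a RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    try:
        if abortive:
            # Linger on, timeout 0: close() discards pending data and sends RST.
            # The linger struct is two shorts on Windows, two ints elsewhere.
            fmt = "hh" if sys.platform == "win32" else "ii"
            client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                              struct.pack(fmt, 1, 0))
        client.close()
        server_side.settimeout(5)
        try:
            data = server_side.recv(1024)
        except ConnectionResetError:
            return "reset"
        return "eof" if data == b"" else "data"
    finally:
        server_side.close()
        listener.close()
```

In a packet trace the same distinction shows up as a FIN versus a RST segment, and the source address of the RST indicates which hop generated it.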
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 8, 2012, 8:20 PM
Come on Joel don't quote selectively. :-( The top of that page says: "[...] the amount of time to remain connected after calling the Socket.Close method if data remains to be sent.". Note "after calling the Socket.Close method".
There is no idle timeout on TCP (especially on Windows). There are error timeouts but no idle timeouts.
I have two programs connected sending no data, and the connection stays open for hours:
Connection opened at 07/06/2012 17:37:21 GMT.
Been running for 00:01:00.0021193, now 07/06/2012 17:38:21 GMT.
Been running for 00:02:00.0027384, now 07/06/2012 17:39:21 GMT.
Been running for 00:03:00.0028575, now 07/06/2012 17:40:21 GMT.
Been running for 00:04:00.0029765, now 07/06/2012 17:41:21 GMT.
Been running for 00:05:00.0030956, now 07/06/2012 17:42:21 GMT.
Been running for 00:06:00.0032146, now 07/06/2012 17:43:21 GMT.
Been running for 00:07:00.0033337, now 07/06/2012 17:44:21 GMT.
Been running for 00:08:00.0034527, now 07/06/2012 17:45:21 GMT.
Been running for 00:09:00.0035718, now 07/06/2012 17:46:21 GMT.
Been running for 00:10:00.0036908, now 07/06/2012 17:47:21 GMT.
Been running for 00:11:00.0038099, now 07/06/2012 17:48:21 GMT.
...
...
...
Been running for 04:47:00.0776722, now 07/06/2012 22:24:21 GMT.
Been running for 04:48:00.0777912, now 07/06/2012 22:25:21 GMT.
Been running for 04:49:00.0779103, now 07/06/2012 22:26:21 GMT.
Ctrl+C
http://www.alanjmcf.me.uk/ Please follow-up in the newsgroup. If I help, please vote and/or mark the question answered. Available for contract programming.
-
June 10, 2012, 3:43 AM
Alan,
I have some new information which may shed some light on the subject. I looked at the Netgear log, and it looks like my router is incorrectly treating some of the packets from our devices as DoS attacks. Here's part of the log file:
[DoS Attack: ACK Scan] from source: xx.xx.0.32, port 4097, Saturday, June 09,2012 11:22:56
[DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4100, Saturday, June 09,2012 11:03:48
[DoS Attack: RST Scan] from source: xx.xx.xx.1xx., port 4108, Saturday, June 09,2012 10:46:43
[DoS Attack: RST Scan] from source: xx.xx.1xx.146, port 4097, Saturday, June 09,2012 10:43:56
[DoS Attack: ICMP Scan] from source: xx.xx.xx.196, Saturday, June 09,2012 10:41:03
[DoS Attack: ACK Scan] from source: xx.xx.xx.196, port 49300, Saturday, June 09,2012 10:41:02
[DoS Attack: ACK Scan] from source: xx.xx.xx.196, port 49299, Saturday, June 09,2012 10:41:02
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4109, Saturday, June 09,2012 10:03:12
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4107, Saturday, June 09,2012 10:02:52
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4105, Saturday, June 09,2012 10:02:28
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4103, Saturday, June 09,2012 10:02:08
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4100, Saturday, June 09,2012 10:01:32
[DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4099, Saturday, June 09,2012 09:41:48
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4107, Saturday, June 09,2012 06:51:50
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4105, Saturday, June 09,2012 06:51:26
[DoS Attack: RST Scan] from source: xx.xx.3xx.47, port 61698, Saturday, June 09,2012 05:44:36
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4100, Saturday, June 09,2012 05:26:31
[DoS Attack: RST Scan] from source: xx.xx.3xx.47, port 61599, Saturday, June 09,2012 04:54:31
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61586, Saturday, June 09,2012 04:50:21
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61571, Saturday, June 09,2012 04:47:56
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61563, Saturday, June 09,2012 04:45:46
[DoS Attack: RST Scan] from source: xx.xx.3xx.47, port 61532, Saturday, June 09,2012 04:42:26
[DoS Attack: RST Scan] from source: xx.xx.3xx.47, port 61531, Saturday, June 09,2012 04:42:15
[DoS Attack: RST Scan] from source: xx.xx.3xx.47, port 61530, Saturday, June 09,2012 04:42:02
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61524, Saturday, June 09,2012 04:38:55
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61495, Saturday, June 09,2012 04:12:33
[DoS Attack: ACK Scan] from source: xx.xx.3xx.47, port 61483, Saturday, June 09,2012 04:11:06
[DoS Attack: RST Scan] from source: xx.xx.80.69, port 4118, Saturday, June 09,2012 03:51:17
[DoS Attack: RST Scan] from source: xx.xx.2xx.73, port 4117, Saturday, June 09,2012 03:43:31
[DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4121, Saturday, June 09,2012 03:39:18
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4105, Saturday, June 09,2012 03:37:17
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4104, Saturday, June 09,2012 03:37:09
[DoS Attack: ACK Scan] from source: xx.xx.101.41, port 50364, Saturday, June 09,2012 03:11:51
[DoS Attack: ACK Scan] from source: xx.xx.xx.141, port 4097, Saturday, June 09,2012 02:52:26
[DoS Attack: ACK Scan] from source: xx.xx.xx.141, port 4097, Saturday, June 09,2012 02:51:53
[DoS Attack: ACK Scan] from source: xx.xx.xx.141, port 4097, Saturday, June 09,2012 02:51:42
[DoS Attack: ACK Scan] from source: xx.xx.0.32, port 4121, Saturday, June 09,2012 02:38:35
[DoS Attack: RST Scan] from source: xx.xx.9xx.5, port 20733, Saturday, June 09,2012 00:47:34
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4101, Saturday, June 09,2012 00:43:14
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4100, Saturday, June 09,2012 00:43:06
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4099, Saturday, June 09,2012 00:42:54
[DoS Attack: RST Scan] from source: xx.xx.xx.124, port 4113, Saturday, June 09,2012 00:17:24
[DoS Attack: SYN Flood] from source: xx.xx.101.34, port 45240, Saturday, June 09,2012 00:13:08
[DoS Attack: SYN Flood] from source: xx.xx.131.39, port 3xx.07, Saturday, June 09,2012 00:13:03
[DoS Attack: SYN Flood] from source: xx.xx.2xx.102, port 4106, Saturday, June 09,2012 00:12:57
[DoS Attack: SYN Flood] from source: xx.xx.2xx.102, port 4105, Saturday, June 09,2012 00:12:52
[DoS Attack: SYN Flood] from source: xx.xx.xx.243, port 4105, Saturday, June 09,2012 00:12:48
[DoS Attack: RST Scan] from source: xx.xx.xx.186, port 4120, Saturday, June 09,2012 00:12:43
[DoS Attack: SYN Flood] from source: xx.xx.205.109, port 46999, Saturday, June 09,2012 00:12:12
[DoS Attack: SYN Flood] from source: xx.xx.119.116, port 4115, Saturday, June 09,2012 00:07:57
[DoS Attack: SYN Flood] from source: xx.xx.8xx.180, port 4111, Saturday, June 09,2012 00:07:52
[DoS Attack: SYN Flood] from source: xx.xx.219.193, port 4103, Saturday, June 09,2012 00:07:43
[DoS Attack: SYN Flood] from source: xx.xx.xx.193, port 4116, Saturday, June 09,2012 00:07:37
[DoS Attack: SYN Flood] from source: xx.xx.2xx.140, port 19276, Saturday, June 09,2012 00:07:32
[DoS Attack: SYN Flood] from source: xx.xx.xx.143, port 4103, Saturday, June 09,2012 00:07:22
[DoS Attack: SYN Flood] from source: xx.xx.1xx.176, port 4104, Saturday, June 09,2012 00:07:17
[DoS Attack: SYN Flood] from source: xx.xx.41.107, port 4106, Saturday, June 09,2012 00:07:12
[DoS Attack: SYN Flood] from source: xx.xx.41.107, port 4105, Saturday, June 09,2012 00:07:07
(I changed parts of the IP addresses to xx.)
I verified that these addresses were legitimate by matching them against my server log, where the actual connections were recorded. I must clarify that in our server I check the MAC address of the device and some other information to make sure the connection is from a legitimate device and not a DoS.
At this stage I don't know whether the Netgear is actually dropping the connections once it thinks they are a DoS, or just logging them. It also appears that our devices try to reconnect to the server immediately once they detect that the connection has been dropped, so this may cause the router to think that it is a DoS. I am trying to contact the people who wrote the code for the devices to see if this can be throttled so that the devices wait a few seconds before attempting to reconnect.
Having said all of this, I still don't understand why certain connections are dropped while others happily stay connected, which was my original question. But thanks to your hints and probes I think I am getting closer to getting to the bottom of this.
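Matching router log entries against the server log is easier with a tally per event type and per source. Below is a minimal Python sketch that assumes the line format shown in the excerpt above; the function name `tally_router_log` is made up for the example, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
# [DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4100, Saturday, ...
LOG_LINE = re.compile(
    r"\[DoS Attack: (?P<kind>[^\]]+)\] from source: (?P<source>[^,]+),"
)


def tally_router_log(lines):
    """Count log entries by event kind and by source address."""
    by_kind, by_source = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip anything that is not a DoS log line
        by_kind[match["kind"]] += 1
        by_source[match["source"].strip()] += 1
    return by_kind, by_source


sample = [
    "[DoS Attack: ACK Scan] from source: xx.xx.0.32, port 4097, Saturday, June 09,2012 11:22:56",
    "[DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4100, Saturday, June 09,2012 11:03:48",
    "[DoS Attack: ICMP Scan] from source: xx.xx.xx.196, Saturday, June 09,2012 10:41:03",
    "[DoS Attack: RST Scan] from source: xx.xx.1xx.62, port 4099, Saturday, June 09,2012 09:41:48",
]
by_kind, by_source = tally_router_log(sample)
```

Sources that show up repeatedly in the tally can then be compared with the devices that the server log shows dropping and reconnecting.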
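One common way to throttle reconnects is exponential backoff with a cap. The sketch below is in Python and is only illustrative: the device firmware is not shown in this thread, so the function name `reconnect_delays` and every number in it (2-second base, doubling, 60-second cap, optional jitter) are assumptions, not values from the devices.

```python
import itertools
import random


def reconnect_delays(base=2.0, factor=2.0, cap=60.0, jitter=0.0, rng=None):
    """Yield successive wait times (seconds) before each reconnect attempt.

    jitter is a fraction of the delay (0.0 to 1.0) used to spread devices
    out so that they do not all reconnect at the same instant.
    """
    rng = rng or random.Random()
    delay = base
    while True:
        spread = delay * jitter
        if spread:
            yield delay + rng.uniform(-spread, spread)
        else:
            yield delay
        delay = min(cap, delay * factor)


# First six delays with the default (hypothetical) settings and no jitter.
first_six = list(itertools.islice(reconnect_delays(), 6))
```

A device would sleep for the next value after each failed attempt and start a fresh generator once a connection succeeds.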
Jason Porter
-
June 10, 2012, 9:55 AM
I think the DoS attack isn't real, but a symptom of the problem (see the webpage below). Since the problem that you are seeing occurs across a number of different connections, I don't think it is hardware related (unless all the messages that are dropped were routed through one bad router). I'm leaning towards heavy traffic causing the problem. Maybe monitor the load on the computers to see if the dropped connections are related to the load.
http://forum1.netgear.com/showthread.php?t=6212
jdweng

