I/O Completion Ports and Socket Send
I am in the process of developing a very high performance socket application. I have been using the new Socket.ReceiveAsync API (along with pooled SocketAsyncEventArgs) to achieve quite acceptable message receiving performance, on the order of ~150k 128-byte messages per second per connected socket (on some run-of-the-mill hardware). I have been running both client and server on the same machine.
I had initially developed a native Win32 C++ app to act as a test server (producer) that sent a configured number of messages to the .NET receiving client as fast as it could. My first cut at this C++ app used a standard blocking call to WSASend (basically the same as Socket.Send), which internally sends the message to the consumer (client) and waits for that consumer to receive the message before returning. It appeared as if my client-side receive performance was being mostly bounded by the rate at which the server was able to send it messages.
I then moved my server C++ app over to .NET and began using the Socket.Send API and found it to have similar performance to the C++ WinSock version. In the hope of improving server send performance, I moved to using the Socket.SendAsync API and pooled SocketAsyncEventArgs. Internally, the XXXAsync methods (and Begin/EndXXX for that matter) use the CLR I/O completion port, which the CLR I/O Thread Pool monitors and calls your provided completion callbacks/events from. My hope was to improve send performance by calling SendAsync quickly enough that multiple send operations would be outstanding on the socket, so that it could send a batch of them at a time without transitioning out of kernel mode. However, I saw my performance take a huge hit: I was only sending about 1/3 the number of messages per second that I was when using standard synchronous sends. I saw even worse performance when using Begin/EndSend.
These findings prompted me to investigate further into what was going on with this poor IOCP performance for send operations, and I went back to developing my server using C++ and straight WinSock. This time I used overlapped sockets, which I bound to the Win32 Thread Pool's IOCP so that it would call my provided callback when each send operation was received by the client. I saw the same poor performance that I saw when using the .NET Socket.SendAsync API. Oddly, when running the client receiver side-by-side with the server/sender, the client would have received all the messages at the exact time the server issued its last overlapped/async WSASend call. I would've expected the server to make its last async WSASend call (not necessarily completing it on the IOCP) before the client had received all of the messages. The observed behavior was what would be expected of synchronous WSASend calls, not async.
Next I blocked my receiving client long enough so that the server would have time to send all of its messages to it before the client received the first one. In this case, the server would issue all of its async WSASend calls extremely quickly. When the client "woke up" it would process all the messages in record time - around 225k messages per second. This is what I would've expected for async WSASends and IOCPs.
Having a further look, I commented out my code in the server that bound my socket (and callback) to the Win32 Thread Pool's IOCP, also removed my sleep/blocking in the client receiver, and ran my test again. I saw similar performance using async WSASend calls as I did with sync WSASend calls.
So it would appear as if there is some "chatter" that occurs when using IOCP on Win32 (and in .NET). Is this due to some kernel/user mode transitions to call the callback from an IOCP thread? Is there any way around this? I'm really looking for an answer that includes what's going on internally within Windows/WinSock. I appreciate the help on this one.
Below is the C++ WinSock code for the server:
    // SocketTestServer.cpp : Defines the entry point for the console application.
    // Usage: SocketTestServer <port> <numOfMessages> <messageSizeInBytes>
    //

    #include "stdafx.h"   // project precompiled header
    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <windows.h>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>

    #pragma comment(lib, "Ws2_32.lib")

    using namespace std;

    int numOfMessagesToSend = 0;
    volatile LONG numOfMessagesSent = 0;

    // Called on a Win32 thread pool thread for each completion posted to the IOCP.
    VOID CALLBACK IocpCompletionCallback (DWORD dwErrorCode, DWORD dwNumberOfBytesTransferred, LPOVERLAPPED lpOverlapped)
    {
        if (dwErrorCode != 0)
        {
            cout << "Error occurred during IOCP completion: " << dwErrorCode << endl;
            return;
        }

        // DWORD is unsigned, so the original "<= 0" test could only ever mean "== 0".
        if (dwNumberOfBytesTransferred == 0)
        {
            cout << "Server initiated disconnect. bytes transferred == 0." << endl;
            return;
        }

        //delete lpOverlapped;
        //InterlockedIncrement (&numOfMessagesSent);
    }

    int main(int argc, char* argv[])
    {
        if (argc < 4)
        {
            cout << "Usage: SocketTestServer <port> <numOfMessages> <messageSizeInBytes>" << endl;
            return 1;
        }

        WSADATA wsaData;
        int iResult = WSAStartup (MAKEWORD(2,2), &wsaData);
        if (iResult != 0)
        {
            cout << "Server WSAStartup failed: " << iResult << endl;
            return 1;
        }

        numOfMessagesToSend = atoi (argv[2]);

        addrinfo *pResultAddrInfo = NULL, hintAddrInfo;
        ZeroMemory (&hintAddrInfo, sizeof(hintAddrInfo));
        hintAddrInfo.ai_family = AF_INET;
        hintAddrInfo.ai_socktype = SOCK_STREAM;
        hintAddrInfo.ai_protocol = IPPROTO_TCP;
        hintAddrInfo.ai_flags = AI_PASSIVE;

        iResult = getaddrinfo (NULL, argv[1], &hintAddrInfo, &pResultAddrInfo);
        if (iResult != 0)
        {
            cout << "Server getaddrinfo failed: " << iResult << endl;
            return 1;
        }

        SOCKET listenSocket = WSASocket (
            pResultAddrInfo->ai_family,
            pResultAddrInfo->ai_socktype,
            pResultAddrInfo->ai_protocol,
            NULL, 0, WSA_FLAG_OVERLAPPED);
        if (listenSocket == INVALID_SOCKET)
        {
            cout << "Server WSASocket failed: " << WSAGetLastError () << endl;
            freeaddrinfo (pResultAddrInfo);
            return 1;
        }

        iResult = bind (listenSocket, pResultAddrInfo->ai_addr, (int)pResultAddrInfo->ai_addrlen);
        if (iResult == SOCKET_ERROR)
        {
            cout << "Server bind failed: " << WSAGetLastError () << endl;
            closesocket (listenSocket);
            freeaddrinfo (pResultAddrInfo);
            return 1;
        }

        if (listen (listenSocket, SOMAXCONN) == SOCKET_ERROR)
        {
            cout << "Server listen failed: " << WSAGetLastError () << endl;
            closesocket (listenSocket);
            freeaddrinfo (pResultAddrInfo);
            return 1;
        }

        // The address info is no longer needed; it must not be freed again below.
        freeaddrinfo (pResultAddrInfo);
        cout << "Server listening for connections on port: " << argv[1] << endl;

        SOCKET clientSocket = WSAAccept (listenSocket, NULL, NULL, NULL, 0);
        if (clientSocket == INVALID_SOCKET)
        {
            cout << "Server WSAAccept failed: " << WSAGetLastError () << endl;
            closesocket (listenSocket);
            return 1;
        }

        // The first 4 bytes hold the message size as ASCII digits plus a terminator,
        // so the size must be at least 4 and at most 3 digits long.
        int iDataSize = atoi (argv[3]);
        if (iDataSize < 4 || iDataSize > 999)
        {
            cout << "Message size must be between 4 and 999 bytes." << endl;
            closesocket (clientSocket);
            closesocket (listenSocket);
            return 1;
        }

        WSABUF sendBuffer;
        char* pData = new char[iDataSize];
        memset (pData, 'D', iDataSize);
        memset (pData, 0, 4);
        _itoa_s (iDataSize, pData, 4, 10);
        sendBuffer.buf = pData;
        sendBuffer.len = iDataSize;

        DWORD dwNumOfBytesSent = 0;
        int messagesSent = 0;

        cout << "Client connection accepted. Enter 's' to send " << argv[2] << " messages: ";
        char cInput;
        cin >> cInput;

        // Associate the socket with the Win32 thread pool's IOCP.
        BOOL bSuccess = BindIoCompletionCallback ((HANDLE)clientSocket, IocpCompletionCallback, 0);
        if (bSuccess == FALSE)
        {
            cout << "BindIoCompletionCallback (associate socket) failed: " << GetLastError () << endl;
            delete[] pData;
            closesocket (clientSocket);
            closesocket (listenSocket);
            WSACleanup ();
            return 1;
        }

        // One WSAOVERLAPPED per outstanding send.
        LPWSAOVERLAPPED pOverlappedArray = new WSAOVERLAPPED[numOfMessagesToSend];
        ZeroMemory (pOverlappedArray, sizeof(WSAOVERLAPPED) * numOfMessagesToSend);

        if (cInput == 's')
        {
            cout << "Sending messages..." << endl;
            do
            {
                // Note: a repeated round reuses the same WSAOVERLAPPED entries, so it
                // assumes all sends from the previous round have already completed.
                messagesSent = 0;
                while (messagesSent < numOfMessagesToSend)
                {
                    LPWSAOVERLAPPED pOverlapped = &pOverlappedArray[messagesSent];
                    iResult = WSASend (clientSocket, &sendBuffer, 1, &dwNumOfBytesSent, 0, pOverlapped, NULL);
                    if (iResult == SOCKET_ERROR)
                    {
                        int err = WSAGetLastError ();
                        if (err != WSA_IO_PENDING)
                        {
                            cout << "Server WSASend failed: " << err << endl;
                            break;
                        }
                    }
                    messagesSent++;
                }

                cout << "Messages posted to IOCP" << endl;
                cout << "Enter 'q' to disconnect and shutdown or 's' to send messages again: ";
                cin >> cInput;
            } while (cInput == 's');
        }

        cout << "Shutting down server..." << endl;

        iResult = shutdown (clientSocket, SD_BOTH);
        if (iResult == SOCKET_ERROR)
        {
            cout << "Server shutdown failed: " << WSAGetLastError () << endl;
        }

        closesocket (clientSocket);
        closesocket (listenSocket);

        // Release the buffers last: outstanding sends reference them while the socket is open.
        delete[] pOverlappedArray;
        delete[] pData;   // allocated with new[], so delete[] is required

        WSACleanup ();
        return 0;
    }
FYI - I've also cross-posted this to the .NET System.Net and Win32 forums.
Thanks,
Brandon
Answer
- As you've discovered, there is some extra overhead to doing async I/O vs. synchronous calls. There are certainly more transitions in and out of the kernel (one kernel call to initiate the I/O, another to receive the completion), but the bigger issue is usually the extra heap allocations needed for async I/O. Synchronous calls happen entirely on the stack, while async calls need the Overlapped/NativeOverlapped to be heap-allocated, along with any of your variables that need to be preserved across the call. This is not specific to managed code (you need to preserve state across native calls as well), but in managed code there's the additional issue of pinning: at least the NativeOverlapped and any buffers you pass to the kernel need to be pinned across the call, which can have a huge negative impact on GC performance. Synchronous calls need the buffer to be pinned as well, but they get to use a much more efficient pinning mechanism due to the fact that these are stack variables.
An additional issue is the cost of returning to managed code from each of these; synchronous calls return to managed code by, well, returning. Async calls come back as a completion on an IOCP in the CLR VM, which then needs to call into managed code via a much more expensive code path than the usual "return from a p/invoke" path.
Async programming is usually only worth it when it's simply not practical to do synchronous calls. If you need thousands of outstanding I/O requests, right now you have little choice but to go async. But for smaller-scale stuff, synchronous calls are usually the better choice (not to mention the fact that they're *much* easier).
- Marked as answer by Stephen Toub - MSFT (Moderator), Tuesday, 16 June, 2009 20:29
- Unmarked as answer by Stephen Toub - MSFT (Moderator), Tuesday, 16 June, 2009 23:13
- Marked as answer by Stephen Toub - MSFT (Moderator), Saturday, 20 June, 2009 1:15
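The per-operation heap allocation mentioned in the answer is commonly reduced by pooling the per-send state, in the same spirit as the pooled SocketAsyncEventArgs from the question. The following is a minimal, portable sketch of that idea only. `SendContext` and `SendContextPool` are hypothetical names; in a real WinSock server the context would presumably wrap a WSAOVERLAPPED and a WSABUF, which is an assumption here and not shown.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical per-send state. A real server would likely embed the
// overlapped structure and buffer descriptor here (assumption).
struct SendContext {
    char payload[128] = {};
    std::size_t length = 0;
};

// Fixed-size pool: every context is allocated once up front, so issuing a
// send does not hit the heap. The mutex is there because contexts would
// normally be released from completion threads.
class SendContextPool {
public:
    explicit SendContextPool(std::size_t capacity) {
        storage_.reserve(capacity);
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i) {
            storage_.push_back(std::make_unique<SendContext>());
            free_.push_back(storage_.back().get());
        }
    }

    // Returns nullptr when every context is in flight; the caller decides
    // whether to wait, drop, or fall back to a synchronous send.
    SendContext* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        SendContext* ctx = free_.back();
        free_.pop_back();
        return ctx;
    }

    // Called once the send has completed and the context can be reused.
    void release(SendContext* ctx) {
        if (ctx == nullptr) return;
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(ctx);
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<SendContext>> storage_;  // owns the contexts
    std::vector<SendContext*> free_;                     // contexts not in use
};
```

A pool like this bounds the number of outstanding sends as a side effect, which may or may not be desirable depending on how much back-pressure the sender should feel.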
All replies
Eric,
Thanks for your help yet again. That is enlightening information regarding the managed components of async I/O with IOCP, especially in regards to the transitions and pinning mechanisms.
Surprisingly, the managed socket async I/O with IOCP seems to perform very well when it comes to receive operations initiated with ReceiveAsync. My receive processing using this mechanism seems to be able to keep up with any socket server I've been able to develop that sends messages to it as fast as it can, which includes both native C++ and managed C# implementations.
I did do some more research on this since my original post. It seems that documentation pertaining to details about WinSock and IOCPs is a bit sparse, and there is also a lot of misinformation out there. What I'm really interested in is what's going on inside of WinSock and IOCP during an async/overlapped WSASend operation that makes it so slow in comparison to the async WSARecv and the synchronous calls to WSASend. I've run some additional tests and have been capturing performance information on the amount of time it takes to make all of the async calls to WSASend (I send 1M messages per test) in various scenarios. Here's what I've found:
Active Client Receive (no blocking/sleeping) without Server Send IOCP
In this scenario, my .NET client uses ReceiveAsync and IOCP (as in all the scenarios) to receive messages as fast as it can from my native C++ server. My server is issuing async/overlapped calls to WSASend as fast as it can. I have not bound the socket to the Win32 Thread Pool's IOCP, so no completions are being queued and I'm not getting any information on when the sends actually complete (obviously not a real-world scenario in this regard). Here is what I believe is happening in WinSock from the server's perspective:
Synchronous WSASend Completion -
The buffer specified in the WSASend call is copied to the socket send buffer if it is not already full.
The data is sent from the socket send buffer (in kernel mode) to the client's WinSock driver.
The client WinSock driver reads all the network packets into the receive buffer and sends an ack back to the server.
The server WinSock driver sees the ack and returns from the WSASend call indicating a synchronous completion.
Async WSASend Completion -
The buffer specified in the WSASend call is copied to the socket send buffer if it is not already full.
The data is sent from the socket send buffer (in kernel mode) to the client's WinSock driver.
The client's receive buffer is full and no ack is sent.
The server WinSock driver detects this and returns from the WSASend call indicating that the operation will be completed asynchronously.
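The two cases above can be illustrated with a toy model. This is only a sketch of the behavior as described in these steps, not WinSock's actual implementation: `SendBufferModel`, `post`, and `drain` are hypothetical names, and the model deliberately ignores TCP windows, acks, and buffer locking. A send completes synchronously while the modeled send buffer has room, and is queued as pending once it is full.

```cpp
#include <cstddef>
#include <deque>

enum class Completion { Synchronous, Pending };

// Toy model of a socket send buffer (hypothetical, for illustration only).
class SendBufferModel {
public:
    explicit SendBufferModel(std::size_t capacity) : capacity_(capacity) {}

    // Models one overlapped send of `bytes` bytes.
    Completion post(std::size_t bytes) {
        // Keep FIFO order: once anything is queued, later sends queue behind it.
        if (pending_.empty() && used_ + bytes <= capacity_) {
            used_ += bytes;
            return Completion::Synchronous;
        }
        pending_.push_back(bytes);
        return Completion::Pending;
    }

    // Models the receiver consuming `bytes` bytes, which frees buffer space.
    // Returns how many previously pending sends completed as a result.
    std::size_t drain(std::size_t bytes) {
        used_ -= (bytes < used_ ? bytes : used_);
        std::size_t completed = 0;
        while (!pending_.empty() && used_ + pending_.front() <= capacity_) {
            used_ += pending_.front();
            pending_.pop_front();
            ++completed;
        }
        return completed;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::size_t used_ = 0;
    std::deque<std::size_t> pending_;  // sizes of sends waiting for room
};
```

In this model an actively draining receiver keeps nearly every `post` synchronous, while a sleeping receiver lets the buffer fill so that nearly every `post` ends up pending, which matches the shape of the percentages reported in the scenarios.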
This is by far the best performing scenario and is what I would've expected in general when using async sockets and IOCPs. I see the following send performance stats:
~150k calls to WSASend per second
Over 99% of the calls to WSASend indicate that they've completed synchronously
I see about exactly the same 150k messages received per second on the client
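Figures like these (calls per second and the share of synchronous completions) can be gathered with a small counter around the send loop. The sketch below is an assumption about how such bookkeeping might look, not the code used for the tests; `SendStats` is a hypothetical name. With WinSock, a WSASend return value of 0 indicates immediate completion, while SOCKET_ERROR with WSA_IO_PENDING indicates the operation will complete later, so the caller would pass `true` or `false` to `record` accordingly.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical bookkeeping for a send loop.
struct SendStats {
    std::uint64_t synchronous = 0;  // calls that completed immediately
    std::uint64_t pending = 0;      // calls that will complete later

    void record(bool completedSynchronously) {
        if (completedSynchronously) {
            ++synchronous;
        } else {
            ++pending;
        }
    }

    std::uint64_t total() const { return synchronous + pending; }

    // Fraction of calls that completed synchronously, in [0, 1].
    double syncFraction() const {
        return total() == 0 ? 0.0
                            : static_cast<double>(synchronous) / static_cast<double>(total());
    }

    // Send calls issued per second over the measured interval.
    double callsPerSecond(std::chrono::duration<double> elapsed) const {
        return elapsed.count() > 0.0
                   ? static_cast<double>(total()) / elapsed.count()
                   : 0.0;
    }
};
```

The elapsed time would typically come from `std::chrono::steady_clock::now()` taken before and after the loop that issues the sends.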
Inactive Client Receive (sleeping/blocking) without Server Send IOCP
This is the same scenario as the previous one, except I sleep for a period of time in the client with no calls to ReceiveAsync. This should cause the client's WinSock receive buffer to fill and stop the client from sending acks back to the server. In turn, the server's WinSock send buffer will fill up and cause the calls to WSASend to end up queuing the sends.
I see the following send performance stats:
~127k calls to WSASend per second
Over 99% of the calls to WSASend indicate that they will complete asynchronously
After the client wakes up and starts receiving messages, I see ~130k messages/sec received
Active Client Receive with Server Send IOCP
In this scenario, the client is actively receiving messages, but I've actually bound the server socket to the Win32 Thread Pool's IOCP. This causes my completion callback to be called from a Win32 pool thread each time a send completion is posted to the IOCP when the client WinSock driver acks that it has been received into the client's WinSock receive buffer. Note that this completion routine is called regardless of whether or not the operation completed synchronously or asynchronously.
I see awful send performance in this scenario when an IOCP is used during active receive processing, as the stats below indicate:
~40k calls to WSASend per second
Over 99% of the calls to WSASend indicate they've completed synchronously
I see about exactly the same 40k messages per second received on the client
Inactive Client with Server Send IOCP
This scenario is the same as the above, except my client is sleeping and not receiving the messages until all have been sent by the server. This has the same effect as the similar scenario without server send IOCP already described, but a bit to my surprise, I saw much better send performance than with server send IOCP and an active client, as the stats show:
~125k calls to WSASend per second
Over 99% of the calls to WSASend indicate they will complete asynchronously
I see about 100k messages per second received on the client after it wakes up
This one is a bit strange to me. Apparently, when the send operations don't complete synchronously (and the IOCP callback isn't being called while WSASend is being called), we see better performance.
So the question du jour is: what is going on when send completions are being posted to an IOCP that causes calls to WSASend to run so slowly, and is there any reasonable way around this? I'd really like to use async sockets and IOCP, as it should provide the cleanest solution to allow me to send a message from one thread without blocking it and to see whether the operations succeeded on another thread, while leveraging the existing thread pools.
Thanks,
Brandon

