Performance issues with WFP callout on FWPM_LAYER_STREAM_V4

  • Question

  • I have a driver that uses WFP to register callouts on a number of layers. I noticed a significant performance penalty, which I was able to attribute to the STREAM_V4 callout. 

    I have a test program that sends a large volume of data over loopback. Without the STREAM callout the program achieves 35MB/s. With the callout enabled but empty, i.e., no code inside it, throughput falls to 18MB/s. This is almost a 50% performance hit for registering an empty callout!

    I am registering the corresponding filter on sublayer FWPM_SUBLAYER_INSPECTION and its action is: FWP_ACTION_CALLOUT_INSPECTION. I have a single filter condition: localport >= 0 (i.e., all TCPv4 traffic).

    What am I doing wrong? Is my way of registering the callout causing the system to make a copy of all data? Is there something I can do to avoid this overhead?

    Thanks,
    --aydan
    Wednesday, January 27, 2010 5:20 PM

Answers

  • Unfortunately the reduced performance is an expected hit and is being worked on. It is caused by shimming the stack and processing each packet to provide the stream functionality. Note, though, that if you add any other filters at STREAM, the further performance hit is mostly due to what the callout is doing.

    You may gain a little more perf by modifying your filter to have 0 filter conditions, meaning that all TCP traffic will be matched.

    Hope this helps.



    Dusty Harper [MSFT]
    Microsoft Corporation
    ------------------------------------------------------------
    This posting is provided "AS IS", with NO warranties and confers NO rights
    ------------------------------------------------------------
    Wednesday, January 27, 2010 5:57 PM
    Moderator

All replies

  • Dusty,

    Thanks for the fast reply.

    Best,
    --aydan
    Wednesday, January 27, 2010 8:04 PM
  • Hi,

    I have a test that attempts to push through as many database requests as it can over TCP/IP V4 through a server running WFP.  The database update statement has a "where 1=2" clause, so SQL Server should never hit the disk.  The idea is to measure database throughput over a dedicated network.

    On the server, I installed and loaded the filters that would invoke my callout driver at the STREAM_V4 layer and the FLOW_ESTABLISHED_V4 layer.  However, I stopped my callout driver.  Then I ran my test and compared it against a baseline test where the filters were not installed, and I saw a 7% penalty in database throughput.  That's a pretty steep penalty considering the engine is not invoking any callout drivers and is just permitting traffic.  I was hoping for a 1% to 2% hit at the most.

    Any way this can be addressed in the Windows 8 Server timeframe?

    Server is running Windows 2008 R2.  Plenty of CPU cycles and memory.  Only bottleneck was the network.

    Thanks,

    Chris

    Sunday, November 20, 2011 2:46 PM
  • Hi Dusty,

    I have a WFP driver on a Windows Server 2008 R2 machine that inspects all TCP data at the stream layer, and I get about a 40% performance hit with an empty stream callout, i.e., one that immediately returns FWP_ACTION_CONTINUE for every intercepted packet.

    I would like to know whether the performance issue you mentioned (shimming the stack) has already been addressed, and whether there is a workaround.

    When I moved the callout to the TRANSPORT layer, the performance hit improved to a 32% degradation. Do you know if performance will be better at a lower layer (IP)?

    Thanks,

    Snir


    • Edited by Snir Medwel Wednesday, May 14, 2014 11:46 AM
    Wednesday, May 14, 2014 11:04 AM