Can NDIS6 NdisSendNetBufferLists achieve wire speed?

  • Question

  • I am working on a high-performance NDIS network firewall.

    Not long ago I ran a packet-send test. Here is my hardware platform:
    CPU: Intel E1230 v2
    Memory: Dual-Channel DDR3 1600 4G * 2
    NIC: Intel E1G42ET

    On Linux, this hardware platform can easily reach wire speed (tested with pktgen).

    The Tx rate is about 1,488,078 pps (99.99% of wire speed).

    But when I ran the same test from an NDIS filter, I was frustrated.

    I ran the test on Windows 8 Pro x64.

    I disabled all the other NDIS filters, such as the built-in QoS service.

    The result is about 400,000 pps (27% of wire speed).

    So I wonder: can NDIS6 NdisSendNetBufferLists achieve wire speed? And if so, how?

    Sunday, August 25, 2013 5:12 AM

Answers

  • To answer your first question: yes, NDIS can achieve wire speed in your configuration. In fact, it is capable of scaling much higher than 2Gbps.  New servers ship with dual- or quad-port 10Gbps NICs that use the same NDIS6 APIs.  There are even 40Gbps NICs available now.  Furthermore, the TX side isn't usually the bottleneck; the RX path is much more difficult.

    The answer to your second question, "how?", is complex.  Here's the general checklist you should use to approach any sort of performance problem.

    First, make sure you have a good measurement system.  You're talking about packets per second, which is the right metric for this sort of problem (pps is more important than bps).  Make sure your test uses packets that are representative of your customers' workload (e.g., steady small packets for VoIP traffic, large packets for file servers, and variable-sized, bursty packets for HTTP servers).  If you're using TCP packets, then we need to worry about the RX side too.  What size packets are you using?  UDP (no ACKs) or TCP (with ACKs)?  Are all the packets the same size, or do they vary?  What distribution (bathtub, bell curve)?

    Next, think about your architecture.  In particular, think about how data flows around the system, and where the bottlenecks will be.  You're obviously not going to send an infinite number of packets per second, so what force is holding you back from that limit?  Is it really the NIC -- will the NIC really have a long queue of TX packets at all times?  Or maybe you're only using a small number of NBLs, so the limiting factor is your driver waiting for another NBL to be returned.  Are there any parts of the system that are synchronous or blocking?  (Note that NDIS itself doesn't ever block; on the datapath, NDIS is a very thin layer between you and the NIC.)
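
    As a minimal sketch of that last point, assuming a filter that originates its own sends: allocate a generously sized NBL pool at FilterAttach time so the send path can keep many NBLs in flight instead of stalling on the return of a single one.  The function name and pool tag below are illustrative, not from the original post.

    ```c
    #include <ndis.h>

    // Sketch: called from FilterAttach. Returns a pool from which the send path
    // can draw many NBLs, so it is never stalled waiting for one NBL to come back.
    // 'FilterHandle' is the NDIS_HANDLE that NDIS passed to FilterAttach.
    static NDIS_HANDLE AllocateSendNblPool(NDIS_HANDLE FilterHandle)
    {
        NET_BUFFER_LIST_POOL_PARAMETERS PoolParams;

        NdisZeroMemory(&PoolParams, sizeof(PoolParams));
        PoolParams.Header.Type        = NDIS_OBJECT_TYPE_DEFAULT;
        PoolParams.Header.Revision    = NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1;
        PoolParams.Header.Size        = NDIS_SIZEOF_NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1;
        PoolParams.ProtocolId         = NDIS_PROTOCOL_ID_DEFAULT;
        PoolParams.fAllocateNetBuffer = TRUE;    // each NBL comes with one NET_BUFFER
        PoolParams.PoolTag            = 'lbNF';  // illustrative tag

        // Returns NULL on failure; the caller should fail FilterAttach in that case.
        return NdisAllocateNetBufferListPool(FilterHandle, &PoolParams);
    }
    ```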

    Now that you have thought about what the theoretical bottlenecks are, you should go measure.  There is a bottleneck somewhere; you must find it.  Is any CPU core pegged at 100%?  Maybe the link itself is congested -- is the NIC reporting that the link is at 100% usage?  Is the NIC reporting a high (or really, any nonzero) number of TX or RX errors?  Does your driver ever hit any of its internal corner cases (e.g., couldn't allocate a packet)?  Does your I/O path block on a serialized handle?  (This is a common mistake with user-to-kernel ioctls; you must use async I/O and overlapped handles in usermode.)  I've seen the PCI bus bandwidth turn up as the actual bottleneck, although that only tends to be an issue at higher throughputs, around 7Gbps.
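
    For the usermode point above, here is a hedged sketch of what "async I/O with overlapped handles" looks like.  The device name and IOCTL code are hypothetical placeholders for whatever your driver actually exposes; the idea is to keep several requests in flight rather than issuing one blocking ioctl at a time.

    ```c
    #include <windows.h>

    // Hypothetical IOCTL and device name -- substitute your driver's own.
    #define IOCTL_FW_SEND CTL_CODE(FILE_DEVICE_NETWORK, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
    #define PENDING_REQS  16

    void PumpRequests(void)
    {
        HANDLE     h = CreateFileW(L"\\\\.\\MyFirewallDevice",
                                   GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                   OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        OVERLAPPED ov[PENDING_REQS] = {0};
        BYTE       req[PENDING_REQS][64] = {0};   // per-request input buffers
        DWORD      bytes;

        for (int i = 0; i < PENDING_REQS; i++) {
            ov[i].hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
            // Returns FALSE with ERROR_IO_PENDING; the request stays in flight.
            DeviceIoControl(h, IOCTL_FW_SEND, req[i], sizeof(req[i]),
                            NULL, 0, NULL, &ov[i]);
        }

        // Reap one completion and immediately reissue it, keeping the pipe full.
        GetOverlappedResult(h, &ov[0], &bytes, TRUE);
        DeviceIoControl(h, IOCTL_FW_SEND, req[0], sizeof(req[0]), NULL, 0, NULL, &ov[0]);
    }
    ```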

    The last step depends on where you found the bottleneck.  E.g., if you find that one CPU core is pegged while others are idle, then you should think about how to better distribute the workload.  (That said, 2Gbps should be fine on a single core, unless you need to do heavy application-level processing per I/O.)  Use WPT (aka xperf) to find where the CPU is spending most of its time, and see if there are obvious optimization opportunities.  Are you doing anything one packet at a time?  Batching is essential for scalability.  Are you doing any full buffer copies of packet payloads?  That will make memcpy jump to the top of the WPT charts; don't let it get there.  Are you spending a lot of time spinning trying to acquire spinlocks?  You should try to avoid spinlocks in the datapath.  If a lock is necessary, use a lock that is designed to scale better, like NDIS RW locks.
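
    To illustrate the locking point, assuming a per-filter rule table that the datapath only reads: an NDIS RW lock lets every send/receive path take the read side concurrently, while rare rule updates take the write side.  The variable and function names here are made up for the sketch.

    ```c
    #include <ndis.h>

    // Illustrative global; in a real filter this would live in the filter context.
    PNDIS_RW_LOCK_EX g_RuleLock;

    // FilterAttach:  g_RuleLock = NdisAllocateRWLock(FilterHandle);
    // FilterDetach:  NdisFreeRWLock(g_RuleLock);

    VOID LookupRulesOnDatapath(VOID)
    {
        LOCK_STATE_EX LockState;

        // Shared acquisition: many cores can classify packets at the same time.
        NdisAcquireRWLockRead(g_RuleLock, &LockState, 0);
        // ... classify the NBL chain against the rule table (read-only) ...
        NdisReleaseRWLock(g_RuleLock, &LockState);
    }

    VOID UpdateRulesOnControlPath(VOID)
    {
        LOCK_STATE_EX LockState;

        // Exclusive acquisition: only taken for rare configuration changes.
        NdisAcquireRWLockWrite(g_RuleLock, &LockState, 0);
        // ... modify the rule table ...
        NdisReleaseRWLock(g_RuleLock, &LockState);
    }
    ```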

    Good luck.  Perf is rarely simple; often it takes lots of work to find the real problem.

    • Marked as answer by MengXP Wednesday, August 28, 2013 10:27 AM
    Wednesday, August 28, 2013 7:12 AM

All replies

  • Does anybody know?
    Tuesday, August 27, 2013 1:03 PM
  • Thanks for your very detailed answer!

    I have found that the bottleneck is the NIC driver.

    I uninstalled the latest Intel PROSet driver and switched to the default Windows 8 NIC driver, e1i63x64.sys (2011).

    Tx performance reached 1,200,000 pps (80% of wire speed), a 3x improvement.

    My experiment is:

    A system thread in a while(1) loop calls NdisAllocateNetBufferAndNetBufferList and NdisFSendNetBufferLists with a packet size of 42 bytes, and counts the number of NdisFSendNetBufferLists calls.

    In FilterSendNetBufferListsComplete, it counts the number of NBLs that completed successfully.

    So the bottleneck must not be my driver or NDIS; it is in the NIC driver.
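
    A rough sketch of the kind of test loop described above, assuming the NBL pool and counters live in a per-filter context; the context type, field names, and stop flag are illustrative, not from the original post.

    ```c
    #include <ndis.h>

    // Sketch: system thread that originates 42-byte sends and counts submissions;
    // completions are counted and freed in the filter's send-complete handler.
    // MY_FILTER_CONTEXT, StopTest, Submitted, Completed, and NblPool are assumed names.
    VOID SendTestThread(PVOID Context)
    {
        PMY_FILTER_CONTEXT Filter = (PMY_FILTER_CONTEXT)Context;
        PUCHAR Frame = (PUCHAR)NdisAllocateMemoryWithTagPriority(
                           Filter->FilterHandle, 42, 'tsTF', NormalPoolPriority);
        PMDL   Mdl   = NdisAllocateMdl(Filter->FilterHandle, Frame, 42);
        // ... error checks omitted; fill in the Ethernet header in Frame here ...

        while (!Filter->StopTest) {
            PNET_BUFFER_LIST Nbl = NdisAllocateNetBufferAndNetBufferList(
                Filter->NblPool, 0, 0, Mdl, 0, 42);
            if (Nbl == NULL) {
                continue;   // pool exhausted; a real test would count these too
            }
            Nbl->SourceHandle = Filter->FilterHandle;   // mark as filter-originated

            InterlockedIncrement64(&Filter->Submitted);
            NdisFSendNetBufferLists(Filter->FilterHandle, Nbl,
                                    NDIS_DEFAULT_PORT_NUMBER, 0);
        }
        // Cleanup of Mdl/Frame after all sends complete is omitted for brevity.
    }

    // Send-complete handler: every NBL in this test was originated here, so just
    // count it and return it to the pool.
    VOID MyFilterSendNetBufferListsComplete(NDIS_HANDLE FilterModuleContext,
                                            PNET_BUFFER_LIST NetBufferLists,
                                            ULONG SendCompleteFlags)
    {
        PMY_FILTER_CONTEXT Filter = (PMY_FILTER_CONTEXT)FilterModuleContext;
        PNET_BUFFER_LIST Nbl = NetBufferLists;

        UNREFERENCED_PARAMETER(SendCompleteFlags);

        while (Nbl != NULL) {
            PNET_BUFFER_LIST Next = NET_BUFFER_LIST_NEXT_NBL(Nbl);
            NET_BUFFER_LIST_NEXT_NBL(Nbl) = NULL;

            InterlockedIncrement64(&Filter->Completed);
            NdisFreeNetBufferList(Nbl);
            Nbl = Next;
        }
    }
    ```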

    Wednesday, August 28, 2013 10:27 AM