Prokash Sinha's Notes on
High-Performance NDIS Miniport/NIC Design
 

About Prokash's Performance Notes

I've invited Prokash Sinha ("pro") to contribute a few notes about driver performance for NDIS.com. These are "getting started" notes intended to help NDIS driver developers think systematically about these topics.

Prokash is a Senior I/O Developer at Hewlett-Packard. He is currently working on network I/O design for the next generation of fabrics to be used on enterprise-class servers and is deeply involved in the development of state-of-the-art NDIS miniports. In his line of work performance is a big issue.

So the context of Pro's notes is the design of a hypothetical high-performance NDIS miniport and NIC.

Prokash will add more notes on these topics as time permits. In addition, Prokash's blog covers a lot of ground including insights into performance measurement and NDIS debugging. A recommended read...

Thanks, pro!

Thomas F. Divine
Editor-In-Chief, NDIS.com

The contents of this page are Copyright © 2010 Prokash Sinha and are published here with permission of the author.

Index

- Perf-1 - Getting Started - August 18, 2010
- Perf-2 - First Order Performance Limitations - September 1, 2010
- Perf-3 - Hardware and Algorithm Choices and Performance - September 18, 2010
- Perf-4 - Basic Approaches for Receive Side Performance - October 23, 2010
- ??? - More to come...

Perf-1 - Getting Started

First, consider just one driver, the NDIS miniport, with the rest of the system out of the box. How do I know whether there is one bottleneck or several?

We have to run some tests and collect metrics first, so our test parameters have to be in place. For example, say we are working on a 10 Gbps MAC. Some simple test parameters would be:

- Payload: Normal, Jumbo packets.
- Applications: Constant streaming, bursty.

Then, even before we start, we have to make sure we have the right systems. For example, PCIe is a must, the CPU(s) should run at 2 GHz or more, and the memory clock has to be around 2 GHz. The reasoning is this: in any pipeline, the rate of flow is limited by the slowest segment of the pipeline.

Objective: Can we get line rate, or close to it?

NDISTest, NetPerf, and some other test suites offer different types of stress tests that we should plan to run.

If in most cases we achieve the desired line rate or close to it, we have nothing to worry about. In this case the line rate itself is the bottleneck.

Principle: In any finite system there are one or more bottlenecks (the pipe with the smallest diameter).

So we can conclude that we have reached the inherent bottleneck, and the driver is not at fault when it comes to performance.

Note that due to packet formats and header overhead it is good to have say 8.5 to 9.0 Gbps as the objective line rate (not 10 Gbps raw rate). In most cases, we will not achieve the line rate or close to it. More like 6 to 7 Gbps. Then the question would be:

  1. Are the tests feeding data at a rate of 10 Gbps or more? If not, there is no way we can achieve our objective.
  2. Can we be sure that the memory bandwidth, CPU clock, and PCI clock are adequate? You know how to convert these to bit rates, so in the above example, with PCIe etc., we should be fine.
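
The header-overhead adjustment to the line-rate objective mentioned above can be estimated with a little arithmetic. The sketch below assumes standard Ethernet framing and IPv4/TCP headers without options; the function and macro names are illustrative and do not come from any driver. It gives a theoretical ceiling only, so a more conservative working target such as the 8.5 to 9.0 Gbps suggested above sits below it.

```c
#include <assert.h>

/* Bytes each frame occupies on the wire beyond the MTU (standard Ethernet):
 * preamble + SFD (8), MAC header (14), FCS (4), inter-frame gap (12). */
#define ETH_WIRE_OVERHEAD 38u

/* Assumption: IPv4 (20) + TCP (20) headers, no options. */
#define IP_TCP_HEADERS 40u

/* Best-case TCP payload rate in Gbps for a given raw line rate and MTU. */
static double tcp_goodput_gbps(double raw_gbps, unsigned mtu)
{
    assert(mtu > IP_TCP_HEADERS);
    double payload = (double)(mtu - IP_TCP_HEADERS);    /* useful bytes per frame */
    double on_wire = (double)(mtu + ETH_WIRE_OVERHEAD); /* wire bytes per frame */
    return raw_gbps * payload / on_wire;
}
```

Under these assumptions a 10 Gbps link tops out at roughly 9.49 Gbps of TCP payload with a 1500-byte MTU and roughly 9.91 Gbps with a 9000-byte MTU.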

Once we know that the test cases are feeding data at a rate above 10 Gbps, the next step is to check whether the CPU is running above 90%, and to look at memory usage. If, for example, memory usage (paged, non-paged) is very high, then we know that to support near line rate we either need to add memory or look at the driver for better memory management. If the CPU is above 90%, we face the same choice as for memory: a faster CPU or a more efficient driver.

So assume that you want to support the same amount of memory you have on your system and the same CPU clock, otherwise customers might not buy the product!

So we now need to analyze the NDIS miniport code to see if there is any bottleneck: memory or CPU usage wise. How do we solve this?

We first need to run some code coverage tools to measure which parts of the code are executed most. In this experiment we look first at the send rate (we can do the same for the receive side too).

I use the BullseyeCoverage tools, which give a function-level breakdown of coverage, including the frequencies. After analyzing the output we will have a scoped-out area to keep on the radar, or we can broadcast to the world that we have found the hotspot of our driver code...

Once that is found, alternative designs follow. At this point, ideas such as (1) using queues, (2) keeping a preallocated pool of memory for bursty traffic, and so on come to mind to tackle the issue.

There are also other statistics that help determine where to concentrate. For example, the packet discard rate and H/W flow-control counters can tell you that something is going wrong....

More later .... ( I will explain more on what kinds of other things we look at to decide what is going on ).

In the next part I will explain why I think a 2 GHz CPU and 1 GHz to 2 GHz of memory speed are needed to achieve near line rate LAN throughput. We will also consider alternative CPU and memory speeds and how they might become bottlenecks in the system at hand.

Finally, after maybe Perf-4 (the 4th or 5th round of notes), I will explain how to analyze alternative designs of the code segments (C/C++ routines) to back up our analysis and alternative solutions...

Best, -pro
August 18, 2010

 

Perf-2 - First Order Performance Limitations

In this note I will discuss some rough estimates of CPU, memory, and NIC processing power and how they affect hardware and miniport driver performance.

The idea here is mostly to apply empirical observations to assess the hardware requirements for a hypothetical 10 Gbps NIC. This should help in thinking about the pipeline paradigm I emphasized in part one.

Using DMA, we get a fair amount of concurrent processing at the hardware level, in the sense that the NIC and the host work simultaneously in the time domain.

A 10 Gbps NIC moves approximately 1 GBps. With dual or quad ports, that amounts to 2 GBps to 4 GBps. One port is one interface in a simple configuration (e.g., no teaming, no multihoming).

So in a single-port NIC, if the hardware provides 256 MB of RAM, it will have at most one-fourth of a second of processing time before a buffer overrun.

Actually it is more like one-tenth of a second, in other words about 100 milliseconds of processing time before a buffer overrun. The reason is that the firmware has some inherent delays due to several sync points where the firmware and the NIC need to sync up. As an example, if you issue successive commands to the firmware within a short interval, the hardware may not respond in time or may lock up. I've seen this in a number of different NICs.

Now if the NIC buffer is divided into transmit and receive buffer evenly, we are talking about roughly 50 milliseconds of processing in each path before buffer overrun.
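
The buffer-overrun budget above is simple arithmetic, and it can be handy to keep it as a helper while trying out different buffer sizes and port counts. The sketch below is only an illustration: the function name is made up, and the `derate` factor of 0.4 is an assumption back-derived from the two figures in the text (250 ms ideal versus about 100 ms in practice), not a measured value.

```c
#include <assert.h>

/* Time budget in seconds before one direction of the NIC buffer overruns.
 *   buf_bytes          total on-board buffer memory
 *   rate_bytes_per_sec fill rate of one port
 *   directions         2 if the buffer is split evenly between send and receive
 *   derate             fraction of the ideal time left after firmware delays
 *                      (assumed value, tune it from measurements) */
static double overrun_budget_sec(double buf_bytes, double rate_bytes_per_sec,
                                 unsigned directions, double derate)
{
    assert(rate_bytes_per_sec > 0.0 && directions > 0);
    double per_direction = buf_bytes / (double)directions;
    return per_direction / rate_bytes_per_sec * derate;
}
```

With 256 MB of buffer, a 1 GBps fill rate, an even split, and a derate of 0.4, this gives roughly 51 ms per direction, in line with the estimate of about 50 milliseconds.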

On the receive path, the NIC has to:

- Receive the data
- Do any error checking
- DMA it out to the host buffer
- Interrupt the host OS to signal that data has arrived for processing.

The OS will schedule the interrupt service routine of your miniport driver. The routine will do minimal processing, and eventually the driver has to indicate to the NIC that the data has been consumed. If data is not consumed in time, the NIC can experience a buffer overrun, and it will tell the sender, using link layer flow control, "I cannot eat any more, I'm full." Note that if this happens frequently, we will never achieve line rate, that is, 10 Gbps in this case. As you can see, it becomes very tight when we have a dual-port or quad-port NIC.

Though it is not directly related to miniport driver performance, it is clear that a good PHY and MAC layer is not all there is to designing a good NIC for server performance. This is one of the reasons you will encounter NICs with multiple specialized processors, VLIW instructions, and optimized, tunable firmware, all of which need to be coupled with a server-class host machine.

A CPU running at about 2 GHz provides roughly 2,000 million cycles per second. If we simplify and assume that on average it takes 4 cycles to execute an instruction on the host, that amounts to about 500 million assembler instructions per second.

Note that due to microarchitecture advances, multiple pipelines, and good cache organization and policies, instructions can retire in less than one cycle on average. But that is far more difficult for us driver writers to estimate, so using a conservative approximation is the key.
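
To see what this conservative estimate means per packet, the instruction rate can be divided by the frame rate at line rate. The sketch below does that under stated assumptions: standard Ethernet framing (38 bytes of per-frame wire overhead), a single CPU doing all the work, and the fixed cycles-per-instruction simplification from the text. The function names are illustrative.

```c
#include <assert.h>

/* Frames per second at a given line rate; each frame occupies
 * mtu + 38 bytes on the wire (MAC header, FCS, preamble, inter-frame gap). */
static double frames_per_sec(double line_bps, unsigned mtu)
{
    return line_bps / (8.0 * (double)(mtu + 38u));
}

/* Host instructions available per frame, assuming a fixed average
 * number of cycles per instruction. */
static double instr_per_frame(double cpu_hz, double cycles_per_instr,
                              double line_bps, unsigned mtu)
{
    assert(cycles_per_instr > 0.0);
    double instr_per_sec = cpu_hz / cycles_per_instr;
    return instr_per_sec / frames_per_sec(line_bps, mtu);
}
```

For a 2 GHz CPU at 4 cycles per instruction and a 10 Gbps link, this works out to about 615 instructions per frame with a 1500-byte MTU and about 3,615 with a 9000-byte MTU, for everything the host does with that frame. That gap is one reason per-packet costs dominate with standard frames.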

There are also several sources of unintentional delay, such as:

- Memory bandwidth
- Other processor workloads from tasks unrelated to our packet send and receive
- Protocol processing, which is part of the CPU's workload

Then there are intentional delays, such as:

- Forced delays in the driver, since the firmware and hardware need a little time to sync up
- Higher priority tasks that need to be served
- Other tasks that are not friendly in their memory or CPU usage and that run at higher priority

These are the reasons why a lot of NIC performance figures are still measured using a DOS-type setup, where many of the overheads are eliminated.

So unless your team is quite large, with performance analysts and source code at hand, a detailed analysis of path length for send and receive is difficult, to say the least.

My suggestion is to craft code in your driver that can observe the rate of flow you are getting on both the send and receive sides. These measurements will be of real value when you need to analyze your code against throughput requirements.

For more detailed architectural issues, I will take the liberty of referring you to a good manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual. The link for it is:

http://download.intel.com/design/processor/manuals/248966.pdf

As an example, when you indicate packets up for NDIS to hand off to the protocol edge, you measure the traffic rate, so you know what rate you are achieving, and you treat NDIS and protocol processing as a black box. Similarly, as soon as NDIS dispatches your send routine, you measure the rate of flow.
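
A minimal sketch of such a flow meter follows. It is portable C rather than NDIS code: the `rate_meter` type and its functions are hypothetical, and the caller is assumed to supply a monotonic timestamp in nanoseconds (in a Windows driver that could be derived from something like KeQueryPerformanceCounter). One instance would sit on the send path and one on the receive path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-direction meter. */
typedef struct rate_meter {
    uint64_t bytes;     /* bytes seen since the window started */
    uint64_t packets;   /* packets seen since the window started */
    uint64_t start_ns;  /* timestamp at the start of the window */
} rate_meter;

/* Start a new measurement window. */
static void meter_reset(rate_meter *m, uint64_t now_ns)
{
    assert(m);
    m->bytes = 0;
    m->packets = 0;
    m->start_ns = now_ns;
}

/* Call once per packet at the point being measured, for example just
 * before indicating a packet up or on entry to the send routine. */
static void meter_count(rate_meter *m, uint32_t len)
{
    m->bytes += len;
    m->packets++;
}

/* Average bits per second over the current window. */
static double meter_bps(const rate_meter *m, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - m->start_ns;
    if (elapsed == 0)
        return 0.0;
    return (double)m->bytes * 8.0 * 1e9 / (double)elapsed;
}
```

On a multiprocessor system the counters would need per-CPU instances or interlocked updates; that is left out here to keep the idea visible.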

As a concrete example, we are facing an MTU-related problem in receive throughput. Using the same miniport driver we achieve 9.6 Gbps with a 9000-byte MTU (Jumbo) but only 6.2 Gbps with a 1500-byte MTU (Standard).

As it turned out, for the 1500-byte MTU the PCIe bus analyzer shows an initial burst and then a constant utilization of 30% at most. For the 9000-byte MTU, the PCIe bus shows a gradual increase in traffic beyond 50%. So the bus was underutilized in the 1500-byte MTU tests. We have 64-bit interrupt counters in the statistics, and they show that with the 1500-byte MTU the NIC generates about one-fourth of the receive interrupts we see with 9000-byte Jumbo frames. Interrupt moderation is enabled in both cases.

With the 1500-byte MTU the CPU runs at a paltry 20% or less. With 9000-byte Jumbo frames the CPU was at 30 to 40%. Clearly the CPU is underutilized. In a server-class system, we expect other applications such as a database or web server to be running, and they can certainly consume another 50 to 70% of the CPU. So a 2 GHz processor seems like a good choice and isn't the problem.

This is probably not a memory bandwidth problem, because near line rate can be reached using Jumbo frames.

We have yet to test another port of the NIC along with the existing one to see what rate we get on both lines.

Our primary suspicion is that we are dropping too many packets at the link layer, so the TCP sliding window takes effect due to missing ACKs, and finally the client app, Netperf in burst mode, is blocked at the TCP level. This hypothesis is partly supported by the PCI analyzer showing that not much traffic is going out the door from the client.

Another thing we found empirically is that turning off hyper-threading gives a little boost in traffic rate, which is consistent with the assertion (below) that multithreading interferes with caching.

Memory bandwidth is formally expressed in bytes/second, but engineers often quote it as a clock rate in Hertz (Hz). Historically, memory has almost always been about an order of magnitude slower than the processor. So for a GHz processor, a 100 MHz to 400 MHz memory clock seems to be fine. But under our pipeline theory, how can memory that slow do fine without becoming the bottleneck, the pipe segment with the smallest diameter?

Here comes the caching!

Several layers of caching, with different configurations for data cache and instruction cache, are already in use. The curious reader should read the book Memory Systems: Cache, DRAM, Disk. It is a unique name for a book, so there is no need to know the ISBN or the authors' names!

But wait! How does the cache work?

Two basic principles:

  1. In the time domain, there is temporal locality of data, meaning that programs tend to use a section of data over and over again before switching to a different section. We seem to be reading the programmer's mind here, but fortunately this is the way we think and program, so it is a valid assumption.

  2. In the space domain, there is spatial locality: a program usually does not branch willy-nilly. Instructions get executed in sequence most of the time, so fetching and caching the neighboring memory works most of the time.

Multithreading and multiple CPUs spoil the party a bit. Too much thread switching spoils cache performance. So it is good to have a fairly high memory clock too, and to design for efficient use of multiple CPUs.

If you want an example of how to get around these threading and cache thrashing issues in a miniport, read about Receive Side Scaling (RSS) on modern NICs and the details of Microsoft's implementation. The idea is to tie each traffic flow (usually session based) to a NIC hardware queue, and to tie that queue to a specific CPU based on a hash result. Dual-core, hyper-threaded processors are very common these days, even in laptops, so when it comes to server-grade NIC performance we should not even think about single-processor host systems.
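
The flow-to-CPU idea can be sketched in a few lines. The code below is an illustration, not the RSS algorithm itself: RSS hardware uses a Toeplitz hash with a secret key, whereas this sketch substitutes a simple FNV-1a hash as a stand-in, and the table size, type names, and function names are all assumptions. What it does show is the shape of the mechanism: hash the connection tuple, use the low bits of the hash to index an indirection table, and let the table entry name the CPU.

```c
#include <assert.h>
#include <stdint.h>

#define INDIRECTION_ENTRIES 128u  /* illustrative size, must be a power of two */

typedef struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} flow_key;

/* Stand-in hash step (FNV-1a) over the low `nbytes` bytes of v.
 * Real RSS hardware uses a Toeplitz hash instead. */
static uint32_t fnv1a_step(uint32_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

static uint32_t flow_hash(const flow_key *k)
{
    uint32_t h = 2166136261u;
    h = fnv1a_step(h, k->src_ip, 4);
    h = fnv1a_step(h, k->dst_ip, 4);
    h = fnv1a_step(h, k->src_port, 2);
    h = fnv1a_step(h, k->dst_port, 2);
    return h;
}

/* Spread table slots round-robin across the CPUs that service receive queues. */
static void table_init(uint8_t table[INDIRECTION_ENTRIES], unsigned num_cpus)
{
    assert(num_cpus > 0 && num_cpus <= 256);
    for (unsigned i = 0; i < INDIRECTION_ENTRIES; i++)
        table[i] = (uint8_t)(i % num_cpus);
}

/* Low bits of the hash pick a table slot; the slot names the CPU. */
static unsigned flow_to_cpu(const uint8_t table[INDIRECTION_ENTRIES],
                            const flow_key *k)
{
    return table[flow_hash(k) & (INDIRECTION_ENTRIES - 1u)];
}
```

Because every packet of a connection carries the same tuple, all of its packets land on the same CPU, which keeps that connection's state warm in one cache.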

Following is a link to an Intel® article that gives an idea of how threads can become a bottleneck:

http://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications/

The link for memory bandwidth:

http://en.wikipedia.org/wiki/Memory_bandwidth

In the next part I will cover recent trends in high-performance NICs from several points of view, such as:

- Data structures
- DMA
- Scatter gather
- Bounce buffers
- Interrupt moderation
- MSI interrupts and hardware queue support to distribute interrupts to different CPUs
- Interrupt migration
- Protocol processing offloads
- Etc.

I will also consider the benefits and impact of using these and what to look for when performance is to be considered.

Best, -pro
September 1, 2010

Perf-3 – Hardware and Algorithm Choices and Performance

Here we continue discussing design of a theoretical high-performance NIC and companion miniport driver. The focus is on choosing key hardware and algorithm approaches that will support the desired performance goals mentioned before.

Hardware Implementation Choices

We all know what a device driver does! It drives the device. In our discussion for miniport NIC performance, we will mostly concentrate on hardware and hardware-specific data structures.

Depending on the device design and its purpose, drivers can drive devices in one or more of the following ways:

- Poll the device from the CPUs once every, say, 100 milliseconds for hardware events.
- Have the device interrupt when an event of interest needs to be brought to the CPUs' attention.

NICs have traditionally been interrupt driven. So NIC drivers need an interrupt service routine to handle events, mainly those associated with send and receive. NICs also interrupt the CPU for other events, but those are not in the send and receive path and hence are not performance sensitive.

When we talk about NIC performance, we really mean two aspects of it:

- Packet Delivery Latency
- Bandwidth

In my previous note I talked about bandwidth estimation and system requirement to achieve near line rate bandwidth.

Packet delivery latency is the time it takes to deliver a packet (of some standard fixed size) to the destination. For performance measurement, the destination is usually a partner machine hooked directly to the system under test. So in our discussion we do not try to measure and analyze the performance of a machine in North America hooked up to a machine in the Amazon rainforest over the Internet.

Just having interrupt support in the device and interrupt handlers in the driver is not enough to support a high-performance NIC. If the device interrupts very frequently, the OS does not have much time to process the data, so backpressure builds, and in extreme cases data loss can occur.

How exactly can a NIC achieve good throughput?

There are a few alternatives, but the important ones are:

- DMA
- Shared Memory
- Packet Aggregation

There are two top-level choices for DMA on Windows platforms:

- System DMA
- Bus-Master DMA

At this point using the system DMA controller is not a viable choice. It creates a choke point in system performance due to sharing, and adds extra processing for bus allocation among the devices that share the controller.

So the DMA choice for a high-performance NIC must be bus-master DMA, which comes in two flavors: scatter gather, or plain DMA using a physically contiguous buffer.

Shared Memory is another alternative, and some NICs can have both options to suit particular traffic demands.

And packet aggregation at the NIC level for ingress traffic is yet another option when protocol offloading is supported.

We will discuss DMA and shared memory related data structures in this note.

The Right Data and the Right Data Structures

The right data structures are essential to sophisticated programs.

Hardware implementation and algorithm design need support from data structures for correctness and performance. For our discussion, data structures can be broadly classified into two categories:

- Algorithmic
- Hardware-Specific

Both DMA and shared memory dictate some protocol and data structures that are specific to the NIC. We call these hardware data structures in our discussion of NIC performance.

For example, suppose I allocate a pool of DMA-able buffers in my driver and let the NIC know their addresses and lengths. The design decisions I have to face are:

- How many buffers?
- What are their sizes?
- How do I let the NIC know which buffers are available?
- How does the NIC tell me that a buffer has been filled with received packet(s) for processing?
- How do I tell the NIC that some packets are ready to be sent?

The same questions apply to shared memory. Hardware usually dictates how much memory can be used as shared memory. The idea is that as soon as some part of the shared memory is changed by either the device driver or the NIC, the change is visible to the other side. We know that any shared memory access can cause race conditions, and that shared resources are normally protected with some locking mechanism. But we are trying to avoid bottlenecks, and taking a lock, accessing the shared resource, and releasing the lock can be very expensive from a performance point of view. Here the producer/consumer paradigm is one of the best design alternatives to consider. Since on the send path the driver is the producer and the NIC is the consumer, and on the receive path the NIC is the producer and the driver is the consumer, shared memory is usually divided into two parts.

For a high-bandwidth NIC, a combination of shared memory and DMA buffers with scatter gather seems to work fine. We know that buffer copying between protocol layers is one of the problems in network data processing. So on the send side, for example, DMAing several packets out using the scatter gather approach helps performance, because the fragments do not have to be copied into one contiguous buffer first. NDIS 6.0 has better support for SGDMA (scatter gather DMA).

When considering hardware and algorithms for performance you must keep other constraints in mind. For example, the NIC must send packets in the order intended by the originating protocol.

We have settled on the producer/consumer model and DMA with shared memory as our first choice for handling high data bandwidth. Now we will work through a simple example of how to design this. The example should illustrate the basic design with performance in mind. We hope to enhance it as we go along with more notes about NIC performance.

Let's assume that the device supports 2 pages (of PAGE_SIZE each) of shared memory. We will define them, allocate them, and make them shared between the device and the host. First we will have transmit and receive rings of buffer descriptors. A buffer descriptor usually describes a packet of data, and it contains at a minimum three fields. A basic buffer descriptor (as a first attempt, in pseudo code) would be defined as follows.

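typedef struct basic_buf_desc
{
    PHYSICAL_ADDRESS phy_addr_of_buff;   /* address the NIC uses to DMA the buffer */
    BUFFER_LENGTH    buf_length;         /* length of the buffer */
    BUFFER_FLAGS     buf_flags;          /* descriptor flags (device-specific) */
#ifdef DEBUG_NIC_BUFF_DESC
    VIRTUAL_ADDRESS  virt_addr_of_buff;  /* host virtual address, debug builds only */
#endif
} BUFF_DESC, *PBUFF_DESC;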

We will then allocate two PAGE_SIZE regions of buffer descriptors, one for the transmit ring and another for the receive ring, and make them shareable between the device and the host. One way to do this is to use common-buffer DMA under Windows. If you come from Linux, think of its coherent (consistent) DMA mappings.

We need two more data items to make each region work as a ring under the producer/consumer paradigm: a pair of indices, a producer index and a consumer index.

At driver initialization time, we should do all the above steps to make sure the descriptors are allocated and DMA mapped for shared access, and that the respective indices are set up so that the driver and the device have a consistent view.
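
The initialization steps above can be sketched as follows. This is a host-side illustration only and makes several assumptions: PAGE_SIZE is taken to be 4 KiB, the descriptor uses a fixed-width stand-in layout for the fields of BUFF_DESC, and the shared page is passed in already allocated (the platform-specific allocation and DMA mapping are not shown). All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_BYTES 4096u  /* assumption: PAGE_SIZE is 4 KiB */

/* Fixed-width stand-in for BUFF_DESC (assumed layout, 16 bytes). */
typedef struct buf_desc {
    uint64_t phys_addr;
    uint32_t length;
    uint32_t flags;
} buf_desc;

/* Number of descriptors that fit in one page: 256 with this layout. */
#define RING_ENTRIES (RING_BYTES / sizeof(buf_desc))

typedef struct desc_ring {
    buf_desc *desc;          /* base of the shared descriptor page */
    volatile uint32_t prod;  /* next entry the producer will fill */
    volatile uint32_t cons;  /* next entry the consumer will process */
} desc_ring;

/* Give both sides the same starting view: an empty ring, zeroed entries. */
static void ring_init(desc_ring *r, void *shared_page)
{
    assert(r && shared_page);
    memset(shared_page, 0, RING_BYTES);
    r->desc = (buf_desc *)shared_page;
    r->prod = 0;
    r->cons = 0;
}
```

A driver would call `ring_init` once for the transmit page and once for the receive page, and then tell the device where each ring lives. On real hardware the indices usually live in device registers or in the shared region itself rather than in a host-only structure like this one.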

How does packet processing work with these data structures and the producer/consumer model for the transmit and receive rings of buffer descriptors?

In our send routine, we will have a packet descriptor to send. We first map the packet buffer to a DMA-able buffer, take the PHYSICAL_ADDRESS of the buffer, and populate the transmit descriptor pointed to by the producer index of the transmit descriptor ring. We also populate the other fields of the descriptor: length, flags, etc. Then we increment the producer index.

Then, depending on several design choices and on the NIC's processing stage, we indicate to the NIC that packets have arrived on the transmit side. The NIC might still be processing some older packets to actually send them over the wire, or it might be waiting for new transmit packets. In either case the NIC has an updated producer index, so it can consume all the packets from the consumer index up to the producer index of the transmit ring.
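
A sketch of the producer side of the send path follows. It assumes the buffer has already been mapped for DMA, uses free-running 32-bit indices, and models the "tell the NIC" step as a write to a `doorbell` field that stands in for a device register; all of these names are hypothetical. The C11 fence only orders CPU stores; a real driver would use the barrier its platform requires before ringing a device doorbell.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TX_ENTRIES 256u  /* illustrative ring size */

typedef struct tx_desc {
    uint64_t phys_addr;
    uint32_t length;
    uint32_t flags;
} tx_desc;

typedef struct tx_ring {
    tx_desc  desc[TX_ENTRIES];
    uint32_t prod;      /* free-running, advanced by the driver */
    uint32_t cons;      /* free-running, advanced by the NIC */
    uint32_t doorbell;  /* stand-in for the device's producer-index register */
} tx_ring;

/* Describe one DMA-mapped buffer and publish it to the NIC.
 * The caller must already have checked that the ring has a free entry. */
static void tx_post(tx_ring *r, uint64_t phys, uint32_t len, uint32_t flags)
{
    assert(r->prod - r->cons < TX_ENTRIES);

    tx_desc *d = &r->desc[r->prod % TX_ENTRIES];
    d->phys_addr = phys;
    d->length = len;
    d->flags = flags;

    /* The descriptor contents must be visible before the new index is. */
    atomic_thread_fence(memory_order_release);

    r->prod++;
    r->doorbell = r->prod;  /* the NIC may now consume up to this index */
}
```

The ordering is the point of the example: fill the descriptor first, then advance the producer index, so the consumer never sees an index that covers a half-written entry.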

A few questions to ask at this point are:

- Is it possible that the transmit descriptor ring has no available entries to describe a new packet buffer?
- How do we detect that there are no more free entries in the transmit ring?
- What do we do if there are indeed no more free descriptors in the transmit ring?
- Is there a way to calculate or estimate how frequently the ring fills up, and is there any way to avoid it?
- Is it possible that the transmit ring is empty most of the time?
- Etc.

The driver may see, at times, that the ring is full of unconsumed packet buffer descriptors. If that happens frequently, we know that the NIC is not keeping up with the outbound bandwidth from the protocol layer, and that is good news for us: it shows that the driver is not the bottleneck in the send path. But infrequent packet bursts still need to be absorbed, using a staging area or queue that holds packets before they are placed on the transmit ring.

On the other hand, if we see that the transmit ring is empty most of the time, we have a little to worry about. It is possible that there is not enough traffic from the protocol, or that our send dispatch routine is not being fed at a reasonable pace. Hence we have room for improvement!

For detection, follow any good book or article that explains the producer/consumer paradigm. Drivers that run out of available transmit descriptors should have a staging area, usually a queue, to hold packet buffers temporarily. If you provide a queue, the send path should take packets from the queue first and describe them on the ring before considering the current packet buffer.
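
One way to picture full and empty detection together with the staging queue is the small model below. It is a sketch under assumptions: the ring holds plain packet ids instead of descriptors, both the ring and the staging FIFO use free-running indices (so "full" is simply producer minus consumer equal to the ring size), and the sizes are tiny on purpose. The names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES  4u  /* tiny on purpose */
#define STAGE_ENTRIES 8u

typedef struct tx_path {
    uint32_t ring[RING_ENTRIES];    /* packet ids standing in for descriptors */
    uint32_t prod, cons;            /* free-running ring indices */
    uint32_t stage[STAGE_ENTRIES];  /* FIFO of packets waiting for a ring slot */
    uint32_t s_head, s_tail;        /* free-running FIFO indices */
} tx_path;

static uint32_t ring_used(const tx_path *t)  { return t->prod - t->cons; }
static bool ring_full(const tx_path *t)      { return ring_used(t) == RING_ENTRIES; }
static bool ring_empty(const tx_path *t)     { return ring_used(t) == 0; }
static bool stage_empty(const tx_path *t)    { return t->s_head == t->s_tail; }

static void ring_put(tx_path *t, uint32_t pkt)
{
    assert(!ring_full(t));
    t->ring[t->prod % RING_ENTRIES] = pkt;
    t->prod++;
}

/* Move staged packets onto the ring, oldest first, while slots are free. */
static void drain_stage(tx_path *t)
{
    while (!stage_empty(t) && !ring_full(t)) {
        ring_put(t, t->stage[t->s_head % STAGE_ENTRIES]);
        t->s_head++;
    }
}

/* Send entry point. Returns false only when the staging FIFO is full too. */
static bool tx_send(tx_path *t, uint32_t pkt)
{
    drain_stage(t);                       /* older packets always go first */
    if (stage_empty(t) && !ring_full(t)) {
        ring_put(t, pkt);
        return true;
    }
    if (t->s_tail - t->s_head == STAGE_ENTRIES)
        return false;                     /* push back on the caller */
    t->stage[t->s_tail % STAGE_ENTRIES] = pkt;
    t->s_tail++;
    return true;
}
```

Draining the queue before looking at the current packet is what preserves the send order the protocol intended. Counting how often `ring_full` and `ring_empty` are true over time gives a cheap estimate of how frequently each condition occurs.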

On the NIC side, the firmware processes the transmit ring from the consumer index to the producer index. It takes one consumable descriptor at a time, DMAs the buffer in using the descriptor, and hands it off to the PHY for transmission. Once the buffer is transmitted, it moves to the next consumable descriptor. Depending on the design and policies, the NIC might interrupt the host after processing one or more descriptors. The NIC also updates the consumer index to let the driver (i.e., the host) know how far the descriptor ring has been processed, so that the host can reuse those descriptors after releasing the associated resources back to the protocol. Remember that the packet buffers are really protocol resources; we simply indicate to the protocol that those buffers have been sent over the wire and the driver no longer needs them.
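
The host side of this hand-back can be sketched as a reclaim loop. The sketch assumes free-running indices and adds a third, driver-private index (`clean`) that trails the NIC's consumer index; the callback type is hypothetical and stands in for whatever the driver uses to return a packet to its owner (in an NDIS 6 miniport that role is played by NdisMSendNetBufferListsComplete). `count_complete` is just an example callback.

```c
#include <assert.h>
#include <stdint.h>

#define TX_ENTRIES 8u  /* illustrative ring size */

typedef void (*send_complete_fn)(void *packet);  /* hypothetical callback */

typedef struct tx_ring {
    void    *packet[TX_ENTRIES];  /* protocol-owned packet behind each slot */
    uint32_t prod;                /* advanced by the driver when posting */
    uint32_t cons;                /* advanced by the NIC after transmitting */
    uint32_t clean;               /* driver's reclaim position, trails cons */
} tx_ring;

/* Example callback: only counts completions. */
static unsigned completed_count;
static void count_complete(void *packet)
{
    (void)packet;
    completed_count++;
}

/* Return every transmitted packet to its owner; yields the slots freed. */
static uint32_t tx_reclaim(tx_ring *r, send_complete_fn complete)
{
    uint32_t cons = r->cons;  /* snapshot: the NIC may keep advancing */
    uint32_t freed = 0;

    assert(cons - r->clean <= TX_ENTRIES);
    while (r->clean != cons) {
        uint32_t slot = r->clean % TX_ENTRIES;
        complete(r->packet[slot]);  /* buffers go back to the protocol */
        r->packet[slot] = 0;
        r->clean++;
        freed++;
    }
    return freed;
}
```

With this arrangement the number of slots free for posting is `TX_ENTRIES - (prod - clean)`, so a descriptor is never reused before its packet has been handed back.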

In this discussion, I intentionally left out a few things related to ring buffer descriptors and their processing mechanics. For example:

- Completion queue descriptors
- Posting or signaling updates of producer and consumer indices
- DMA burst size configuration using the PCI interface
- DMA coherency
- Cache flushing
- Etc.

These are essential for high performance, but they would clutter the discussion at hand. We also skip the receive side for now, since it requires another article to cover systematically. Finally, we skip the Windows- and NDIS-specific DMA setup and the device programming steps that start the actual transfer.

The idea is to capture one or more programming patterns specific to achieving high throughput on a modern NIC.

As an example, when I said that descriptors define a DMA-able buffer, I meant that map registers and the other things necessary for bus-master DMA are already set up. At the end of all the notes related to packet DMA, scatter-gather DMA, ring buffers, and bounce buffers, I might put together some skeleton code to emphasize the design decisions we consider here.

Summary

In summary:

  1. We know that without a bus-mastering DMA NIC we cannot achieve high throughput on the transmit side.

  2. We also know that the producer/consumer paradigm achieves good parallelism without using locks. Even though lock-free implementations are possible when multiple producers and multiple consumers access a ring concurrently, here we are thinking about a single producer and a single consumer. I will explain multiprocessor synchronization with fairly lightweight locks for multiple producers on the send path, and the alternative approach of multiple transmit queues, similar in concept to multiple queues on the receive path.

  3. We also considered the ring data structure to support the producer/consumer paradigm.

  4. We also mentioned the staging area for holding packets that are waiting to be placed on the transmit ring. This makes sure that ring fullness does not cause problems during temporary bursts of egress traffic.

At this point I will mention that I've seen drivers that check the ring indices after incrementing them and set them to zero as soon as they go beyond the total number of descriptors in the ring. Producer and consumer indices should instead have a natural ring pattern embedded in their types, so that an increment beyond the upper boundary wraps around the ring without an explicit check and reset to zero. But that is a design-specific decision. Briefly, if you are thinking about using a bit-field structure to slice off a size that wraps around naturally, make sure you look at the generated assembler instructions for verification. For a reference, please look at my blog: http://prokash.squarespace.com/
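
The three ways of advancing a ring index discussed above can be put side by side. This is a sketch with illustrative names, assuming a 256-entry ring; the masked and bit-field variants only work when the ring size is a power of two, and, as noted, the code a compiler generates for the bit-field version is worth inspecting before relying on it.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS   8
#define RING_ENTRIES 256u  /* must equal 2^INDEX_BITS */

/* Explicit check: compare after the increment and reset to zero. */
static uint32_t next_checked(uint32_t i)
{
    i++;
    if (i >= RING_ENTRIES)
        i = 0;
    return i;
}

/* Mask: valid only when RING_ENTRIES is a power of two. */
static uint32_t next_masked(uint32_t i)
{
    assert((RING_ENTRIES & (RING_ENTRIES - 1u)) == 0);
    return (i + 1u) & (RING_ENTRIES - 1u);
}

/* Bit-field: an unsigned field of INDEX_BITS bits is stored modulo
 * 2^INDEX_BITS, so the wrap is part of the type. */
typedef struct ring_index {
    unsigned int v : INDEX_BITS;
} ring_index;

static uint32_t next_bitfield(ring_index *idx)
{
    idx->v++;  /* incrementing 255 stores 0 */
    return idx->v;
}
```

All three agree on every input in range; the difference is only in where the wrap-around rule lives: in a branch, in an expression, or in the type.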

Similarly, DMA burst size and DMA buffer alignment are important things to consider. Usually the NIC design dictates the alignment requirement for DMA-able buffers, and the NIC configuration space may let a driver treat the DMA burst size as a tunable parameter. Basically, the larger the burst size, the longer the device occupies the bus on average. So my suggestion is to use a PCI analyzer to figure out an optimal range of burst sizes.

In the next installment, I would like to cover the receive path. Then I hope to cover scatter-gather DMA. After that I will follow the path I laid out in the previous article.

Best, -pro
September 18, 2010

 

Perf-4 – Basic Approaches for Receive Side Performance

We continue discussing receive side design of a theoretical high-performance NIC and companion miniport driver.

As mentioned before, in our example the receive side will have a ring of receive descriptors. The descriptors live in memory that is mapped and shared between host and device.

Receive Side Processing in General

  1. NIC device receives bits on its port(s)
  2. NIC processes the bits to form packets
  3. NIC finds one or more available receive descriptors
  4. NIC finds the buffer addresses to DMA the packets into
  5. NIC updates the descriptor fields
  6. NIC interrupts the host for processing
  7. Host driver gets the interrupt notification in its ISR
  8. Host driver services the interrupt to process the packets
  9. Host driver indicates the packets up to the protocol layer

The first six steps in receive processing are implemented in the NIC hardware and the last three in the host miniport driver.

Of course there are various levels of complexity and sophistication that can be provided by both the NIC hardware and the host operating system.

For example, a modern NIC usually contains firmware of multiple flavors: NIC boot code (yes, multiple boot code regions), general packet processing engines, and so on. Current generations of NICs can have a couple of megabytes of downloadable firmware on top of another couple of kilobytes of burned-in (or flashed) code, mostly related to the different flavors of boot processing for the NIC engines. And from a purely hardware point of view, a modern NIC may include multiple processors, multiple hardware queues, packet aggregation, and other advanced technology specific to virtualization.

Similarly the host OS may offer facilities that support more sophisticated receive side performance enhancements. For example, some Windows platforms support Receive Side Scaling (RSS), TCP offload and other features.

However, this note will discuss only the basic approaches for receive performance that can be implemented on most current Windows platforms. We leave the more advanced topics for later.

It is assumed that receive processing implementation will employ the basic producer/consumer paradigm and the ring buffer mechanism described in earlier notes.

But how do we use this approach for best receive performance? Let's talk about how to do this in the NIC and then in the miniport driver.

Receive Processing in the NIC

In receive side processing the NIC is the producer in the producer/consumer paradigm.

One metric to remember is that the host hardware and its OS have a limited capability to process interrupts. For high performance networks the rate that the NIC receives packets may be higher than the rate that the host can process interrupts (and have reserves needed to service other hardware). This means, quite simply, that a high-performance NIC cannot generate an interrupt for each packet received by the hardware.

Another metric to keep in mind is that throughput on the host is limited more by the packet rate than the size of packets being received. Packet aggregation is a key method of reducing the packet rate as seen on the host.
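
To put rough numbers on these two metrics, here is a minimal sketch that computes the highest frame rate an Ethernet link can deliver for a given frame size. The function name is hypothetical. The 20 bytes of per-frame wire overhead are the standard Ethernet preamble and start-of-frame delimiter (8 bytes) plus the minimum inter-frame gap (12 bytes), and the frame size passed in is assumed to include the 14-byte header and 4-byte FCS.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes on the wire that accompany every Ethernet frame but are not part
 * of it: 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte
 * minimum inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/* Hypothetical helper: upper bound on frames per second for a link.
 * frame_bytes is the full frame (header + payload + FCS). */
uint64_t max_frames_per_second(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes >= 64u);   /* 64 bytes is the minimum Ethernet frame */
    uint64_t bits_per_frame = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u;
    return link_bps / bits_per_frame;
}
```

On a 10 Gbps link this gives roughly 14.9 million frames per second for minimum-size 64-byte frames, about 813 thousand for 1518-byte frames, and about 138 thousand for 9018-byte jumbo frames. The byte rate is nearly the same in all three cases. What changes by two orders of magnitude is the number of packets (and potential interrupts) the host has to handle.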

These two metrics point to these areas where the choice of NIC algorithms can potentially improve performance:

bulletInterrupt Moderation - If interrupt moderation is supported and active, the device does not interrupt as soon as it has received some bits and parsed them into a packet. Instead it might wait a brief interval and attempt to interrupt when multiple packets are available. The goal here is to reduce the number of interrupts required to transfer packets to the host.
 
bulletPacket Aggregation (Protocol Specific) - If packet aggregation is implemented by the NIC, it can aggregate (partially reassemble) multiple received packets into a single larger packet. This is the inverse of the segmentation of large packets that the TCP segmentation feature usually performs on the send side.

Both of these algorithms must be implemented carefully. If the NIC has received at least one packet it cannot wait too long before interrupting the host or the packet may appear to be lost. Likewise the NIC cannot delay too long in anticipation of a possible next fragment to aggregate before finally making an indication to the host.
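
The trade-off between batching and waiting too long can be sketched as a small decision routine. This is only an illustration under assumed names: the structure, its fields, and both functions are hypothetical, and a real NIC would implement the equivalent in hardware or firmware. An interrupt is raised when enough packets are pending or when the oldest pending packet has waited for the configured bound, whichever comes first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical moderation state kept by the NIC. */
struct moderation {
    uint32_t max_packets;   /* interrupt once this many packets are pending */
    uint64_t max_delay_us;  /* ...or once the oldest pending packet is this old */
    uint32_t pending;       /* packets received since the last interrupt */
    uint64_t first_rx_us;   /* arrival time of the oldest pending packet */
};

/* Record a packet arrival at time now_us. */
void moderation_on_packet(struct moderation *m, uint64_t now_us)
{
    if (m->pending == 0)
        m->first_rx_us = now_us;   /* this packet starts a new batch */
    m->pending++;
}

/* Called on every arrival and on a periodic timer tick. Returns true, and
 * resets the batch, when an interrupt should be raised now. */
bool moderation_should_interrupt(struct moderation *m, uint64_t now_us)
{
    assert(m->max_packets > 0);
    if (m->pending == 0)
        return false;              /* nothing to report to the host */
    if (m->pending >= m->max_packets ||
        now_us - m->first_rx_us >= m->max_delay_us) {
        m->pending = 0;
        return true;
    }
    return false;
}
```

The delay bound is what keeps a lone packet from appearing lost: even if no further packets arrive, the timer tick eventually forces the interrupt.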

Receive Processing in the Miniport Driver

In receive side processing the miniport driver is the consumer of packets produced by the NIC. But it is also a producer of packets that are transferred up the host network stack to the top-level transport drivers using OS-specific resources and APIs.

OS Resource Allocation

Since a network-aware application and the protocol layer do not, in general, know ahead of time how much data they will receive from the NIC, it is the duty of the driver to allocate receive buffers, make them DMA-able, fill in the receive descriptors, and post them to tell the NIC that the host has memory where incoming packets can be placed. This is how the NIC finds available receive descriptors, uses them as packets arrive, and finally interrupts.
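
As a rough illustration of the posting step, the sketch below fills a receive descriptor and hands it to the NIC. The descriptor layout, the ring structure, and the function name are all assumptions made for this example. Real hardware defines its own descriptor format, and a real driver would obtain the DMA addresses from the OS rather than take them as plain integers.

```c
#include <assert.h>
#include <stdint.h>

#define RX_RING_SIZE 8u      /* power of two, so indices wrap with a mask */
#define RX_BUF_SIZE  2048u   /* fixed receive buffer size for this sketch */

/* Hypothetical descriptor layout. */
struct rx_desc {
    uint64_t buf_addr;   /* DMA (bus) address of the receive buffer */
    uint16_t buf_len;    /* number of bytes the NIC may write */
    uint8_t  nic_owned;  /* 1 = NIC may fill it, 0 = host owns it */
};

struct rx_ring {
    struct rx_desc desc[RX_RING_SIZE];
    uint32_t post_idx;   /* next slot the driver will hand to the NIC */
    uint32_t posted;     /* descriptors currently owned by the NIC */
};

/* Post one receive buffer. Returns 0 on success, -1 if the ring is full. */
int rx_ring_post(struct rx_ring *r, uint64_t dma_addr)
{
    if (r->posted == RX_RING_SIZE)
        return -1;
    struct rx_desc *d = &r->desc[r->post_idx & (RX_RING_SIZE - 1u)];
    assert(!d->nic_owned);   /* the host must own a slot before reusing it */
    d->buf_addr = dma_addr;
    d->buf_len = RX_BUF_SIZE;
    /* On real hardware a write barrier belongs here, so the NIC never
     * sees the ownership flag before the address and length. */
    d->nic_owned = 1;
    r->post_idx++;
    r->posted++;
    return 0;
}
```

When the NIC fills a buffer it clears the ownership flag (not shown), and the driver decrements `posted` once it has consumed that descriptor and is ready to post a fresh buffer in its place.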

Make sure you use packet pools so that you do not have to ask the protocol to release a buffer back to you immediately upon indicating it to the protocol layer. Such a request can force the protocol to copy the packet into its own memory before returning ownership of the receive buffer to your driver. This is expensive.

If you have fixed-size buffers, which is usually the case for miniport buffer allocations, you should use a lookaside list. Also, cache coherency is supported on most Intel platforms, so think about using cached memory whenever possible. Since received packets are DMA-ed in, the platform's cache coherency takes care of any stale cache lines. Remember that your NIC should be able to handle soft real-time media processing such as streaming video and audio. For this it is essential to use cached memory.
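
The idea behind a lookaside list is simply a cache of freed fixed-size blocks, so that most allocations never reach the general-purpose allocator. The sketch below is a user-mode stand-in written for illustration only. It is not the NDIS or kernel API, all names are hypothetical, and unlike a real kernel lookaside list it does no multiprocessor synchronization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lookaside list: freed blocks are kept on a
 * singly linked free list and reused before falling back to malloc. */
struct lookaside {
    size_t block_size;   /* every block has this size */
    size_t depth;        /* blocks currently cached */
    size_t max_depth;    /* cap on cached blocks */
    void  *free_head;    /* first cached block, or NULL */
};

void lookaside_init(struct lookaside *l, size_t block_size, size_t max_depth)
{
    /* A cached block stores the link to the next one in its first bytes,
     * so a block must be at least pointer-sized. */
    l->block_size = block_size < sizeof(void *) ? sizeof(void *) : block_size;
    l->depth = 0;
    l->max_depth = max_depth;
    l->free_head = NULL;
}

void *lookaside_alloc(struct lookaside *l)
{
    if (l->free_head != NULL) {          /* fast path: reuse a cached block */
        void *block = l->free_head;
        l->free_head = *(void **)block;
        l->depth--;
        return block;
    }
    return malloc(l->block_size);        /* slow path: general allocator */
}

void lookaside_free(struct lookaside *l, void *block)
{
    assert(block != NULL);
    if (l->depth < l->max_depth) {       /* cache the block for reuse */
        *(void **)block = l->free_head;
        l->free_head = block;
        l->depth++;
    } else {
        free(block);                     /* cache is full, really free it */
    }
}
```

Because every block has the same size, any cached block satisfies any request, which is why the technique fits receive buffers so well.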

Interrupt Service Routine

The driver, in its interrupt service routine (ISR), will process the descriptor(s) available to be processed. It goes from the consumer index of the receive ring up to the producer index of the same ring. As it processes the descriptors, it has the opportunity to hand the packets off to the driver’s DPC and eventually indicate them to higher-level filter and protocol drivers.

Since the miniport has a MiniportInterrupt routine that NDIS calls for interrupt processing, every execution path that starts from this routine, including the DPC routine, should have the necessary synchronization to access any resources shared by multiple concurrent paths of driver execution. In our case, processing receive descriptors and incrementing the consumer index (the host driver is the consumer in the receive path) should be guarded with an appropriate lock, because these routines can execute concurrently on multiple processors. Once the processing logic has been tested and found correct, one should think about optimizing this path. For the basic idea of analyzing concurrent execution, I recommend looking at my blog: http://prokash.squarespace.com/.

Deferred Procedure Call (DPC)

Irrespective of packet aggregation, the deferred procedure call (DPC) should try to process as many packets as possible within the DPC processing time limit. Note that if the driver’s DPC processing takes too long, it will make the system unresponsive for other applications. At worst, you might see a bug check! Once again, use cached memory allocation for receive buffers.
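
One common way to respect the DPC time limit is to give each DPC invocation a packet budget. The sketch below walks the receive ring from the consumer index toward the producer index and stops when either the ring is drained or the budget is spent. All names are hypothetical, the budget value is something to tune by measurement, and locking is left out to keep the loop visible.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 256u   /* power of two, so a slot is index & (RING_SIZE - 1) */

/* Free-running indices: they only ever increase and wrap naturally at
 * 2^32, so producer - consumer is always the number of pending packets. */
struct rx_state {
    uint32_t producer;   /* advanced by the NIC as packets arrive */
    uint32_t consumer;   /* advanced by the driver as packets are indicated */
};

uint32_t rx_pending(const struct rx_state *s)
{
    return s->producer - s->consumer;
}

/* Hypothetical DPC body. Returns the number of packets handled. If
 * rx_pending() is still non-zero afterward, the caller would schedule
 * another DPC instead of looping further here. */
uint32_t rx_dpc(struct rx_state *s, uint32_t budget)
{
    assert(budget > 0);
    uint32_t handled = 0;
    while (handled < budget && s->consumer != s->producer) {
        uint32_t slot = s->consumer & (RING_SIZE - 1u);
        (void)slot;      /* a real driver would indicate the packet in this slot */
        s->consumer++;
        handled++;
    }
    return handled;
}
```

With 100 packets pending and a budget of 64, the first invocation handles 64 packets and leaves 36 for the next one, so no single DPC runs unbounded.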

If interrupt moderation, packet aggregation, or both are supported in the NIC, be very careful to send packets to the protocol layer as soon as possible, or at least spend some time analyzing the path. Interrupt moderation and packet aggregation may delay ACKs long enough that the sender retransmits. The TCP layer on the sending side usually keeps unacknowledged packets in a retransmit queue with timers attached. If a timer expires, it simply retransmits the corresponding packet.

Packet Aggregation in the Miniport

If interrupt moderation or packet aggregation is not supported in the NIC, it is possible to aggregate multiple packets at the miniport level before indicating up to the protocol layer.

This is an area where one should measure the time between receive interrupts arriving from the NIC. Usually, if the time between interrupt arrivals is less than 15 to 25 microseconds, some amount of packet aggregation is possible, but I recommend observing the performance with and without packet aggregation in the driver’s receive path. Moreover, packet reassembly or packet aggregation requires knowledge of TCP processing. For example, aggregation based on TCP flow, the time spent reassembling before indicating to the protocol, and packet types are some of the important aspects to consider.
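
The aspects listed above (flow, time spent aggregating, and what kind of packet arrived) can be sketched as a single "may this segment join the current aggregate?" check. Everything here is an assumption made for illustration: the structures, the function names, and especially the two limits, which would have to come from measurement. A real implementation would also inspect TCP flags and options, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AGG_MAX_BYTES  16384u   /* arbitrary cap on merged payload */
#define AGG_MAX_AGE_US 20u      /* arbitrary cap on how long we hold packets */

struct flow_key { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

struct aggregate {
    struct flow_key flow;   /* TCP flow being merged */
    uint32_t next_seq;      /* sequence number expected next */
    uint32_t bytes;         /* payload bytes merged so far */
    uint64_t start_us;      /* when the first segment arrived */
    bool active;            /* false = no aggregate in progress */
};

static bool same_flow(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Try to merge a TCP segment. Returns false when the caller should first
 * indicate the current aggregate (agg_flush) and then retry. */
bool agg_try_append(struct aggregate *a, const struct flow_key *k,
                    uint32_t seq, uint32_t len, uint64_t now_us)
{
    assert(len > 0);
    if (!a->active) {                     /* start a new aggregate */
        a->flow = *k;
        a->next_seq = seq + len;
        a->bytes = len;
        a->start_us = now_us;
        a->active = true;
        return true;
    }
    if (!same_flow(&a->flow, k) ||        /* different connection */
        seq != a->next_seq ||             /* gap or out-of-order segment */
        a->bytes + len > AGG_MAX_BYTES || /* aggregate would grow too large */
        now_us - a->start_us > AGG_MAX_AGE_US)  /* held too long already */
        return false;
    a->next_seq += len;
    a->bytes += len;
    return true;
}

/* Called after the aggregate has been indicated to the protocol layer. */
void agg_flush(struct aggregate *a)
{
    a->active = false;
}
```

The age check is the software counterpart of the warning about delayed ACKs: an aggregate that is never allowed to grow old cannot hold a packet long enough to trigger a retransmit.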

General

By all means, one should try to use reader-writer locks to guard receive descriptor processing. Note that acquiring and releasing a spin lock takes a hit of about 120 to 160 cycles, but if read contention is high, a reader-writer lock has a clear advantage over a plain spin lock.

Also, this type of lock behaves much like a spin lock, except that it gives better concurrency when the lock is not acquired for write. At DPC level, try to use NdisDprAcquireSpinLock and NdisDprReleaseSpinLock if you cannot use reader-writer locks.
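
Whichever lock is chosen, its cost is paid on every acquire, so it helps to take it once per batch rather than once per packet and to hold it only while the shared indices are touched. The sketch below shows that pattern with a minimal C11 spin lock standing in for the NDIS lock. The names are hypothetical. Note also that processing claimed batches on several processors at once can reorder packets within a flow, which a real driver would have to account for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct rx_queue {
    atomic_flag lock;    /* minimal spin lock, stand-in for an NDIS lock */
    uint32_t producer;   /* free-running index advanced by the NIC */
    uint32_t consumer;   /* free-running index advanced by the driver */
};

void rx_queue_init(struct rx_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->producer = 0;
    q->consumer = 0;
}

static void q_lock(struct rx_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void q_unlock(struct rx_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Claim up to max descriptors under the lock. Returns how many were
 * claimed and stores the first claimed index in *first. The caller then
 * processes indices [*first, *first + n) without holding the lock. */
uint32_t rx_claim(struct rx_queue *q, uint32_t max, uint32_t *first)
{
    assert(first != NULL);
    q_lock(q);
    uint32_t avail = q->producer - q->consumer;
    uint32_t n = avail < max ? avail : max;
    *first = q->consumer;
    q->consumer += n;    /* only the index update sits in the critical section */
    q_unlock(q);
    return n;
}
```

The critical section here is a handful of instructions regardless of batch size. The expensive per-packet work happens after the lock has been released.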

Finally, an independent vendor might not have ready access to the source code of the operating system’s network stack, so detailed measurement and analysis of different use-case scenarios are vital to a better understanding of both send and receive processing.

As an example, unless you have access to the source code and/or deep experience in the networking area, you would not know what could happen when you turn on the jumbo frame feature of your NIC and expect higher throughput over Internet sessions that span a couple of switches and routers, the backbone, and then the destination. The reason is that IP fragmentation and TCP segmentation may hurt performance far more than expected. Under IPv4, fragmentation occurs, as a measure to avoid transmit failure, when the smallest MTU on the path the packet traverses is less than the IP packet size at the sender. Reassembly by the IP layer on the receive side, before indicating up to the protocol layer (TCP, UDP, ICMP, etc.), can be quite expensive. This is one of the reasons jumbo frames are most useful in cluster-type setups.
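
To see how quickly the fragment count grows, here is a small sketch that computes how many IPv4 fragments a packet turns into when it meets a smaller MTU on the path. The function name is hypothetical, and the calculation assumes a 20-byte IPv4 header with no options. It relies on the IPv4 rule that every fragment except the last carries a multiple of 8 payload bytes.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HEADER_BYTES 20u   /* IPv4 header without options */

/* Number of IPv4 fragments needed to carry a packet of packet_bytes
 * (header included) over a link with the given MTU. */
uint32_t ipv4_fragment_count(uint32_t packet_bytes, uint32_t mtu)
{
    assert(mtu >= IPV4_HEADER_BYTES + 8u);   /* room for at least 8 payload bytes */
    assert(packet_bytes >= IPV4_HEADER_BYTES);
    if (packet_bytes <= mtu)
        return 1;                            /* fits, no fragmentation */
    uint32_t payload = packet_bytes - IPV4_HEADER_BYTES;
    /* Each fragment repeats the header; its payload is rounded down to a
     * multiple of 8 bytes because offsets are expressed in 8-byte units. */
    uint32_t per_fragment = ((mtu - IPV4_HEADER_BYTES) / 8u) * 8u;
    return (payload + per_fragment - 1u) / per_fragment;
}
```

A 9000-byte jumbo IP packet crossing a 1500-byte MTU link becomes 7 fragments, each of which the receiving IP layer must match up and reassemble before TCP or UDP sees anything. Losing any one of the 7 costs the whole packet.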

For now I will defer any discussion of this, since this and other measurements such as TLB misses, cache performance, and branch misprediction are beyond the scope of these notes at this stage.

Summary

In summary:

  1. It is important to allocate and manage the receive buffers that the NIC DMAs into, using cached memory.
     
  2. It is not a good idea to indicate packets to NDIS and ask the protocol to release them immediately. This incurs fairly heavy overhead in the protocol processing layer due to buffer copying. It is better to have a pool of buffers, mainly from the non-paged system pool. We want to avoid probing and locking buffers while processing receive packets.
     
  3. It is also a good idea to estimate buffer requirements for occasional traffic bursts. Measuring how often, on average, your NIC does not get ownership of a receive buffer back from the protocol layers in time is vital to a good estimate of the buffers to reserve for burst traffic.
     
  4. We know that we need to use locks to guard the processing of receive descriptors, so reducing the amount of code under lock (the critical section) is very important. Using the most appropriate locks (e.g. reader-writer locks) also helps performance.
     
  5. If the NIC does not support interrupt moderation and/or packet reassembly, the driver writer should be cautious about implementing them purely in software. Doing so may cause senders to retransmit packets on timeouts.
     
  6. Receive side scaling should be implemented if the NIC supports multiple hardware queues and MSI-X type interrupt execution. More on this in the next installment.
     
  7. Consider implementing adaptive tunable properties (parameters) such as interrupt moderation, buffer pool allocation to handle occasional bursts, small buffer aggregation, the number of packets to indicate per DPC invocation, etc. They all depend on several system resource utilization metrics, and hence are dynamic.
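
As one illustration of the adaptive tuning mentioned in the last point, the sketch below nudges the interrupt moderation delay up when the measured packet rate is high and down when it is low. The function name, the thresholds, the step, and the bounds are all made-up values for the example. Real values have to come from measurement on the target system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning constants. */
#define LOW_RATE_PPS   50000u    /* below this, favor latency */
#define HIGH_RATE_PPS  500000u   /* above this, favor fewer interrupts */
#define DELAY_STEP_US  10u       /* adjustment per measurement interval */
#define DELAY_MAX_US   100u      /* never hold packets longer than this */

/* Given the current moderation delay and the packet rate measured over
 * the last interval, return the delay to use for the next interval. */
uint32_t adapt_delay_us(uint32_t delay_us, uint64_t pkts_per_sec)
{
    assert(delay_us <= DELAY_MAX_US);
    if (pkts_per_sec > HIGH_RATE_PPS) {
        /* Heavy traffic: batch more packets per interrupt. */
        delay_us += DELAY_STEP_US;
        if (delay_us > DELAY_MAX_US)
            delay_us = DELAY_MAX_US;
    } else if (pkts_per_sec < LOW_RATE_PPS) {
        /* Light traffic: interrupt sooner to keep latency low. */
        delay_us = delay_us > DELAY_STEP_US ? delay_us - DELAY_STEP_US : 0u;
    }
    return delay_us;   /* rates between the two marks leave it unchanged */
}
```

Moving the delay in small steps, with a dead band between the two thresholds, keeps the setting from oscillating when the traffic hovers around a single cut-off.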

In the next installment, I would like to cover the scatter-gather DMA and list out the areas to look for send and receive path performance optimization.

Best, -pro
October 23, 2010

 

 


Topic Status

October 23, 2010 Perf-4 posting.
September 18, 2010 Perf-3 posting.
September 1, 2010 Perf-2 posting.
August 18, 2010 Perf-1 initial posting.

Copyright © 1996-2012 Printing Communications Assoc., Inc. (PCAUSA)
Last modified: January 01, 2012