MmCached and MmNonCached attributes

  • Question

  • Hi,

    Are there any Microsoft recommendations on when to use MmCached/MmNonCached (for instance with MmAllocateContiguousMemorySpecifyCache)?

    I have been playing around with this parameter in a few projects and my results would indicate the following:

    1) Scenario A: getting the best PCI Express performance from low-end or Atom chipsets

    Here, there seem to be advantages in assigning the non-cached attribute to memory in circular buffers. After looking at the Intel architecture manuals this seems logical: writes to system memory are typically snooped by the processor/chipset. These cache-coherency checks obviously cost time, resulting in a delay before the chipset re-advertises PCI Express credits. I have measured roughly 5% higher throughput when the target system memory is non-cached.

    2) Scenario B: copying DMA target memory to an IRP buffer

    If this is the primary issue, it would seem better to assign the cached attribute to the DMA target buffer. Again, this seems logical: if the DMA target buffer in system memory is non-cached, functions like RtlCopyMemory have to read from slower non-cached memory. I notice a significant throughput fall-off when RtlCopyMemory reads from non-cached buffers.

    Would anybody care to confirm or dispute these observations?

    Thanks,

    Charles

    Tuesday, February 19, 2013 9:47 PM

Answers

  • MmNonCached should be used if you are allocating memory that will be shared with a controller; otherwise, MmCached should be used. Non-cached memory will be slower than cached memory.

     -Brian


    Azius Developer Training www.azius.com Windows device driver, internals, security, & forensics training and consulting.

    Wednesday, February 20, 2013 6:05 AM
    Moderator

All replies

  • Hi Brian,

    Thanks for your reply.

    I am aware of the general statement "MmNonCached should be used if you are allocating memory that will be shared with a controller", but unless "memory that will be shared with a controller" means something different from my typical application, this is not what I have been measuring. In my systems an external data source (PCI, PCI Express or USB) first streams to a DMA target buffer in memory (such as a circular buffer), and the contents of this DMA target buffer are later copied to an IRP request buffer using RtlCopyMemory().

    Does this DMA target buffer fall into your understanding of "memory that will be shared with a controller"?

    Intel have done a lot of work in recent years on keeping caches coherent in multi-processor systems. PCI Express data streamed to system memory is also usually validated against cached destination addresses (not so if the no-snoop attribute is set, but that's usually not the case). I have tested this with constrained random data (i.e. randomly generated but known, reproducible content), and the data sent to the system through PCI Express or USB is definitely correct when read back later from the DMA target buffer, even if that buffer is cached. Just for clarity, my "DMA target buffer" is typically either lookaside buffers (WdfMemoryCreateFromLookaside etc.) in the non-paged region, or a large buffer requested from user space using VirtualAlloc and locked down by keeping the request that owns the buffer pending.
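    As an aside, the streaming-then-copy data flow described above can be sketched in plain user-space C. This is a hedged illustration only, not driver code: memcpy stands in for RtlCopyMemory, a plain array stands in for the contiguous DMA target buffer, and all names and sizes are invented for the example.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE 16u   /* illustrative; real circular buffers are much larger */

    /* The circular DMA target buffer: the device writes at head,
       the driver drains from tail. This sketch assumes the consumer
       keeps up, so empty is simply head == tail. */
    typedef struct {
        uint8_t  data[RING_SIZE];
        unsigned head;   /* next write position (device side) */
        unsigned tail;   /* next read position (driver side)  */
    } Ring;

    /* "Device side": stream one byte into the ring. */
    static void ring_put(Ring *r, uint8_t b)
    {
        r->data[r->head] = b;
        r->head = (r->head + 1) % RING_SIZE;
    }

    /* "Driver side": copy up to n available bytes into the IRP request
       buffer; returns the number of bytes actually copied. */
    static unsigned ring_copy_out(Ring *r, uint8_t *irp_buf, unsigned n)
    {
        unsigned copied = 0;
        while (copied < n && r->tail != r->head) {
            /* byte-wise memcpy keeps wrap-around handling trivial here */
            memcpy(irp_buf + copied, &r->data[r->tail], 1);
            r->tail = (r->tail + 1) % RING_SIZE;
            copied++;
        }
        return copied;
    }

    int main(void)
    {
        Ring r = {{0}, 0, 0};
        uint8_t irp_buf[8];
        for (uint8_t b = 0; b < 5; b++)
            ring_put(&r, b);                      /* device streams 5 bytes in */
        unsigned got = ring_copy_out(&r, irp_buf, sizeof irp_buf);
        printf("copied %u bytes\n", got);         /* prints: copied 5 bytes */
        return 0;
    }
    ```

    Whether MmCached or MmNonCached helps in this pattern then comes down to which side dominates: the device's snooped writes into the ring, or the driver's reads out of it.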

    Just yesterday I had a customer whose DSP application writes data blocks to system memory over PCI. He was complaining that his RtlCopyMemory loop was consuming about 10% of CPU resources; I guess most of this was the CPU blocked waiting on the results of non-cached reads. The DMA target buffer areas were set up during driver initialisation using

      for (i = 0; i < number_of_dma_blocks; i++)
      {
          // note: the original cast of the result to unsigned long would
          // truncate the pointer on 64-bit systems; the return value is a PVOID
          addr = MmAllocateContiguousMemorySpecifyCache(dma_block_size,
                                                        nullPA,                   // LowestAcceptableAddress
                                                        highestAcceptableAddress,
                                                        nullPA,                   // BoundaryAddressMultiple
                                                        MmNonCached);
          // ... (each block is later freed with MmFreeContiguousMemorySpecifyCache)
      }

    Just by changing the memory attribute to MmCached, we got the cpu usage down to 1% - 2%. All of his tests with test-data still ran successfully.

    I'm just trying to get the WDK documentation and my observations in sync.

    Charles

    Wednesday, February 20, 2013 1:03 PM
  • Do you have to use common buffer DMA? It would be a lot faster if you used direct I/O and didn't do the extra copy.

    As for when to use cached vs. non-cached memory, the concern is cache coherency: the controller writes to the buffer, but the data doesn't get flushed through the CPU's caches by the time the host tries to access it, so the host reads stale/incorrect data. The "MmNonCached should be used if you are allocating memory that will be shared with a controller" mantra is there to ensure that your driver will work on all CPU architectures supported by Windows - which includes CPUs that offer limited cache coherency (ARM, MIPS, i860). Intel and AMD x86/x64 chips provide full cache coherency, so using MmCached will work fine, as you've noticed. If you are going to write your driver this way, then you should have a switch that determines the CPU architecture and uses MmCached for x86/x64, and MmNonCached for everything else - don't assume your driver will never run on another CPU architecture.
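    Since the target architecture is fixed at build time, that switch can be a compile-time one. A minimal sketch in plain C - the enum mirrors the WDK's MEMORY_CACHING_TYPE values, but ChooseDmaCaching is an invented name for illustration, not a WDK routine:

    ```c
    #include <stdio.h>

    /* Stand-in for the WDK's MEMORY_CACHING_TYPE (MmNonCached = 0, MmCached = 1). */
    typedef enum { MmNonCached = 0, MmCached = 1 } MEMORY_CACHING_TYPE;

    /* Pick the caching attribute per the advice above: x86/x64 provide full
       cache coherency, so cached memory is safe and faster there; on any
       other architecture fall back to non-cached to stay correct. */
    static MEMORY_CACHING_TYPE ChooseDmaCaching(void)
    {
    #if defined(_M_IX86) || defined(_M_AMD64) || defined(__i386__) || defined(__x86_64__)
        return MmCached;       /* fully coherent DMA: cached is safe and fast */
    #else
        return MmNonCached;    /* conservative default for weakly coherent CPUs */
    #endif
    }

    int main(void)
    {
        printf("%s\n", ChooseDmaCaching() == MmCached ? "MmCached" : "MmNonCached");
        return 0;
    }
    ```

    The chosen value would then be passed as the last argument of MmAllocateContiguousMemorySpecifyCache in place of a hard-coded MmCached or MmNonCached.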

     -Brian


    Azius Developer Training www.azius.com Windows device driver, internals, security, & forensics training and consulting.

    Thursday, February 21, 2013 12:32 AM
    Moderator
  • Hi Brian,

    The systems I work on are all embedded, either with an embedded operating system or with the commercial version encapsulated (i.e. not modified by the end customer). The reason for the data model is that the external PCI/PCIe/USB hardware typically runs even when the application doesn't, so the buffering in the lookasides or circular buffer really just provides a certain history depth when the application is restarted; i.e. there is always a lag of milliseconds, or tens thereof, between the data arriving from the peripheral and the time when it is actively requested by the application. I seldom use common buffer DMA in the strict WDF sense; my buffer requirements are usually too large.

    But thanks for the clarification on the MmCached/MmNonCached and Intel/ARM etc.

    Charles

    Thursday, February 21, 2013 8:25 AM