Large read-only memory chunk locked in RAM

  • Question

  • I need to allocate a large memory chunk, say 2GB, in RAM that is both read-only and locked. What is the best way to get it so that (sequential/scan) memory access is fast?

    This is useful, for instance, for performing high-performance streaming analytics on data without incurring page faults. Apologies if this has already been asked; my search could not find an answer in the forums.

    I am facing two alternatives with VirtualAlloc but none is fully satisfying.

    (1) Use VirtualProtect with PAGE_READONLY on the chunk allocated by VirtualAlloc. Since the page size is typically 4KB, a 2GB chunk means hundreds of thousands of pages: the working set grows significantly, and managing the associated per-page metadata can degrade access performance.

    (2) Use large-page VirtualAlloc with MEM_LARGE_PAGES and a page size of 2MB. This is good, as the number of pages is almost three orders of magnitude smaller than in solution (1), and they are not part of the working set. However, only PAGE_READWRITE is allowed, whereas I want read-only: the official documentation (https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support) says that "The memory is always read/write and nonpageable (always resident in physical memory)".

    I am not mentioning CreateFileMapping + MapViewOfFile because it gave worse performance than VirtualAlloc in my tests, though perhaps I used it improperly (I followed https://docs.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-using-large-pages). It seems to me that the pages are not locked in RAM that way.

    Thank you for any help

    -Rob





    Thursday, April 30, 2020 3:24 AM

All replies

  • The best solution is likely to let Windows manage the memory. If there is enough main memory, then nothing will need to be paged out.


    Sam Hobbs
    SimpleSamples.Info

    Thursday, April 30, 2020 4:21 AM
  • Another option for obtaining nonpageable memory is to use Address Windowing Extensions (AWE).

    But like other options, these functions do not create read-only allocations.

    Thursday, April 30, 2020 10:53 AM
  • I think I will run some VTune measurements to get a better view of the situation. I was wondering if I had considered all the options.

    -R

    P.S. I did not know AWE. Thanks again :)


    • Edited by RobHicSunt Thursday, April 30, 2020 12:10 PM
    Thursday, April 30, 2020 12:09 PM
  • Hello RobHicSunt,

    If this issue is solved, you can share your solution here. It will be helpful for others who are searching for this. Thanks.

    Best regards,

    Rita


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Friday, May 1, 2020 7:56 AM
  • Hello Rita,

    Not found yet, but I am trying to find one by trial and error.

    Best,

    -Rob



    Friday, May 1, 2020 8:00 AM
  • What you are doing is called "premature optimization," and it is one of the high crimes of programming.  As of this point, you don't have any clue whether your code needs any special handling.

    Remember, the virtual memory manager in this operating system is absolutely critical to the performance of the system.  It has been critically examined and hyper-optimized continuously over the 30-year lifespan of the kernel.  It is almost always the case that naive attempts to outsmart the kernel simply result in REDUCED overall performance.  The kernel already has code to recognize sequential access and optimize for that case, by prefetching upcoming pages.  It will leave pages in memory if you are repeatedly accessing them, and trim out pages that aren't accessed.

    BY FAR the best plan is to write your code to do your analytics, then run it and figure out where the bottlenecks are.


    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Friday, May 1, 2020 6:35 PM
    Noted. I could not agree more: the OS is highly optimized for fair resource usage and overall performance...

    ... however, there are situations in which one has a high-performance, special-purpose application in mind, and typically there is a preliminary comparison with the performance obtained under the virtual memory manager. With enough expertise in cache-efficient algorithms, one can in some cases predict access patterns better than the OS can by design. Fair usage is not a goal in this case.

    Data analytics on GBs of read-only data is such a case. Modern analytics is in-memory (and fast PCIe SSDs also help). What would you suggest as an allocation policy in that case? Solution (1) is quite good with suitable prefetching, but it pollutes the TLB with the many pages.

    Best,

    -Rob



    Monday, May 4, 2020 6:36 AM
  • virtual memory manager

    Virtual memory is not relevant when there is sufficient main memory.

    The suggestion has been made to first determine what the performance is like without any of your custom optimizations. You will get much better help if you can be specific about an actual performance problem instead of a theoretical one. Otherwise you give the impression you are closed-minded to the possibility that you might not need any custom optimizations.



    Sam Hobbs
    SimpleSamples.Info

    Monday, May 4, 2020 7:14 AM
  • It makes sense to me; indeed, I am using VTune with PMU hardware events to measure the impact. I forgot to say that this is not app development, it's for research and education :)

    Still, it would be nice if Windows could allow for read-only large-page support in the future. Thanks.

    Monday, May 4, 2020 8:32 AM
  • Suppose you get 2GB read-only. Now what? Would you like to put some data there?

    -- pa

      
    Wednesday, May 6, 2020 9:35 PM
  • I would use the processor's counters to see if this gives any benefit in terms of cache performance (after filling the chunk with the input and calling VirtualProtect to make it read-only). Maybe there is no difference, or maybe there is. Measuring will tell.



    Thursday, May 7, 2020 8:42 AM
  • I performed some experiments on a commodity PC with 16GB RAM and an i7 processor, on about 1.5GB of real data. I computed average running times for sequential scans made with (1) memcpy, (2) non-temporal copy, and (3) non-temporal copy that also exploits software prefetching. Each operation copies 512 bytes at a time. Each average time is computed over 15 runs (5 runs x 3 independent round-robin batches). Although the difference is not astonishing, some comments are in order.

    - For memcpy, RO 4KB PAGES is slightly better than (RW) 4KB PAGES, and both are slightly better (within about 1.5%) than LARGE PAGES. Maybe hardware prefetching helps here.

    - Non-temporal copy does not dirty the L2 cache, and adding software prefetching helps, also giving quite stable and predictable performance. However, LARGE PAGES is slightly better for plain non-temporal copy (maybe due to far fewer TLB entries being needed).

    My understanding is that careful hardware or software prefetching can level things out. Non-temporal copy helps when a persistent data structure kept in L2 should not be flooded out by the streaming data.




    Average running times:

                           memcpy     non-temporal   non-t. prefetch
    LARGE PAGES            0.110272   0.144127       0.132939
    4KB PAGES              0.109565   0.149712       0.133316
    READ-ONLY 4KB PAGES    0.108648   0.149763       0.134725






    • Edited by RobHicSunt Saturday, June 13, 2020 10:46 AM
    Saturday, June 13, 2020 10:30 AM