none
SQL Server 2016 memory pressure response flushing plan cache

    Question

  • I posted this in the connect feedback yesterday, but it might be useful here as well. There is a recent similar post, but my thoughts are in a different direction.

    The specific situation details are as follows. SQL Server 2016 EE RTM+cu2 build 2164. On Windows Server 2012 R2 Standard Edition. Hardware is HP ProLiant DL380 Gen9 2-socket 18-cores per socket Xeon E5-2699 v3, 2.3GHz, HT disabled (36 physical core and 36 logical processors total), no VM. System memory is 512GB, SQL Server Max memory is 494GB (506,000MB), 20GB page file. LPIM is set. Storage is All-Flash, local PCI-E SSDs. Environment does use availability groups with 2 local nodes and 1 DR node.

    Typical operating conditions are SQL Server at or near target memory (489 of 494GB), OS has 9GB available physical memory, 25GB available page file (implying 4GB of page file in use?). There is very low disk IO to the data files even during full transaction load periods. The plan cache is 14GB, well under the 75% of first 4GB, 10% of 4-64GB and 5% of memory over 64GB limit, will works out to just over 30GB. The plan cache split is about 3.5GB Adhoc, 7.6GB prepared (probably from Entity-Framework parameterized SQL) and 3.2GB procedures. Most of the Adhoc SQL plans (90%) are single use, more than half of prepared plan (57%) are single use. Only a tiny fraction of procedure plans are single use.

    Virtual address space reserved is 2.4TB after SQL Server has been running 60-70 days, but growth appears to be in spurts, not steady. VAS committed is never observed at more than 496GB. There are no home growth CLRs, but there is infrequent light use of spatial geography STDistance function. Most EF calls specific a network packet size of 4096 or 8000 bytes, but there might be infrequent older .Net Framework clients that do not specify network packet size, hence defaulting to 8192 bytes. Does anything else make direct OS VAS allocations instead of using the buffer pool?

    Whatever is the source of the VAS reserved (over permanent allocated to the buffer pool) it is probably short lived, hence the VAS committed has not been observed to be above 496GB.

    On Microsoft web site, it is stated that when operations need to make memory allocation via the OS VAS while SQL Server is at the target memory limit, it will allow this, temporary exceeding the specified memory limit, then gradually release memory elsewhere.

    The problem experienced is intermittent occasions when the plan cache is forced down for unknown reasons, getting as lows as 100MB. This results in a majority of queries requiring a compile, greatly increasing system overall CPU (from 20-30% to near 100%?) as many procedures involve complex SQL. An even worse occurrence is compiles being block. Perhaps whatever the memory pressure that caused this results in SQL Server taking a lock on the entire plan cache to force out plans?

    It is not certain, but suspected that SQL Server did not flush out data pages from the buffer cache. The 14GB of plan cache flushed by the memory pressure event had severe negative consequences. It is thought the flushing out 14GB of data pages probably would only resulted in a minor increase in disk IO, and even that would be far below what the Flash storage could have supported. Also, just flushing the single use Adhoc and prepared plans would have reduced plan cache by 50% (7GB) with probably minor impact.

    The question is: what is the SQL Server internal strategy for responding to memory pressure. Does it automatically flush plan cache? Which would not be a good choice. Does it also consider whether to flush data pages? In this particular case with disk IO normally being low, and storage on all-flash, it would be my preferred course of action. 


    jchang



    • Edited by jchang6 Monday, November 06, 2017 7:24 PM
    Monday, November 06, 2017 4:57 PM

Answers

  • update: being investigated now, 

    there were several procedures that use the old XML API,  sp_xml_preparedocument  followed by OPENXML,

    but only a few actually do sp_xml_removedocument afterwards. This might be the cause of the bloated Compiled Plan - Prepared in the plan cache. So the followed up call is being deployed, hopefully this week. Long term is to use the new xml data type and its associated API.


    jchang

    • Marked as answer by jchang6 Sunday, July 08, 2018 2:41 PM
    Tuesday, December 26, 2017 7:26 PM
  • It would seem that closing the xml handles via sp_xml_removedocument in various stored procedures has worked, though the issue reoccurred recently. I suspect there might be a script that uses sp_xml_prepare_document, but is not in a procedure.

    just came across this DMV/DMF: sys.dm_exec_xml_handles , which "Returns information about active handles that have been opened by sp_xml_preparedocument" . there is also sys.dm_exec_cursors for API server cursors.

    So I will ask the client to monitor, for this, I made a system procedure (link: http://www.qdpma.com/SQL/sp_vas.html )

    • Proposed as answer by jchang61 Saturday, June 30, 2018 5:07 PM
    • Marked as answer by jchang6 Sunday, July 08, 2018 2:41 PM
    Saturday, June 30, 2018 5:06 PM

All replies

  • The first thing you need to do is update to Service Pack 1 and retest.

    There have been many changes since the version you are running.  It is very likely if there was a bug in that version it is already fixed.

    https://support.microsoft.com/en-us/help/3177312

    Monday, November 06, 2017 5:22 PM

  • The question is: what is the SQL Server internal strategy for responding to memory pressure. Does it automatically flush plan cache? Which would not be a good choice. Does it also consider whether to flush data pages? In this particular case with disk IO normally being low, and storage on all-flash, it would be my preferred course of action. 

    I would second what Tom is saying please apply SQL Server 2016 Sp1 and CU 4 as well. My question is are you using query store, if yes disable it the unexpected flush may be because of that. 

    Now to your question as to how SQL Server reacts to memory pressure is Wiindows OS never directly asks SQL Server to trim down its memory consuption when OS is seeing memory pressure. Instead it raises flags(MEMPHYSICAL_HIGH and MEMPHYSICAL_LOW). These flags are monitored by SQLOS dedicated threads called as resource monitors ,that in turn sets a internal HighMemoryresource flag that tells the various caches they can allocate additional memory, or Low Memory resource flag that tells the various caches to try and reduce their memory allocations.. This effect is applied for 5 seconds and then paused for 1 minute. This is to avoid any hungry/faulty application to consume all the memory making SQL Server to scale its usage continuously. If low memory conditions still exist after 1 minute pause, the effect is applied again. So if there is physical memory pressure even after SQL Server scaling its usage for 5 seconds windows will still page out SQL Server’s working set. 

    Now since you have LPIM enabled SQL Server will trim down max to Min server memory set but please note it will take its own CPU cycles to do it. In this scenario if OS faces grave memory pressure and LPIM being there the OS process would be paged to disk and physical memory would be granted to process requesting excessive memory.

    YES if memory pressure continues the plan cache can be flushed as a fact to cater to memory requests other caches can also be flushed. When flushing plan cache it uses some algorithm to make sure the old and unused plans are flushed first, similar technique is used when flushing data pages.

    Let me know if you need further info. I will keep looking at this thread and would like to hear from you how things go after applying SP1 and Sp1 CU4



    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Tuesday, November 07, 2017 8:04 AM
  • waiting for the client to apply SP1 + cu5, it may still take a few weeks. this issue was first encountered Jul 31, and reoccurred Oct 31. System was good on Aug 31 and Sep 30. Hopefully SP1+cuX will be applied before Nov 30.

    Query store is not used. My complaint is that SQL Server flushed plan cache from a normal of 12-16GB down to 100MB. I saw no indication of buffer cache flushing. Dumping 15GB of data pages from the buffer cache probably would not have been even noticeable, but instead SQL Server dumped plan cache, including the critical procedures portion.. Even dumping of Adhoc and prepared might not have been a big deal. The prepared plans was half of the plan cache, all Entity Framework code. If the EF portion of the application sucked, that is expected.

    I would like an explanation of why plan cache was flushed before buffer cache data pages. 


    jchang

    Tuesday, November 07, 2017 9:31 PM

  • I would like an explanation of why plan cache was flushed before buffer cache data pages. 


    jchang

    Unfortunately I do not have detailed explanation but it seems like memory manager is "somehow" seeing memory pressure which ofcourse is not correct and flushing the cache out.

    Can you please see This Similar thread. See what Shreyas Verma from MS has to say. It would be good if you can post the query output he asked for


    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Wednesday, November 08, 2017 7:11 AM
  • I believe that there is memory pressure, because SQL Server is operating near or at the target memory, and there are operations that make direct OS memory allocations, i.e. a request for address space outside of the SQL Server managed buffer pool of standardized 8KB pages. Per MS documentation, SQL Server will allow this, briefly exceeding the Target memory limit, then gradually pushing out stuff from its buffer pool to get back to the Target memory. All of this is the correct behavior. However, in this case, the more correct approach should have been to flush out cached data pages as disk IO is very low, and not push out of the plan cache beyond the single use plans.

    I did see the other post, there were differences. They are on SP1 and see the memory flush problem, but were initially using LPIM, where as here LPIM was used. (there was a situation in which SQL Server was restarted, and on restart, it could not get the large pages.) I will comment further later (the local SQL User group meeting is starting soon, actually, just the pizza is arriving)

    Wednesday, November 08, 2017 10:59 PM
  • You are assuming the plan cache flush was caused by memory pressure.  That may not be true.  There are other reasons for the plan cache to remove plans.

    SQL Server always uses all the RAM is it is allowed to use. That does not mean there is memory pressure.

    I suggest you read:

    https://technet.microsoft.com/en-us/library/ee343986(v=sql.100).aspx

    Thursday, November 09, 2017 11:53 AM
  • Ok, the system was running fine with 14GB plan cache, 3.5GB adhoc, 7GB prepared most of which was single use, a the critical 3.2GB procedure cache. Flushing the single use plans probably would not have had much effect, but flushing nearly the entire cache down to 100MB had horrible effect.

    Does SQL Server 2016 Enterprise Edition have a special hari-kari feature?

    and oh yeah, attempts to compile the evicted plans not incurs the compile cost, but might also encounter a block on getting a write to the plan cache. 


    jchang

    Thursday, November 09, 2017 5:01 PM
  • You need to wait until you get SP1 installed you pursue this further.  There have been many changes since the version you are running and MS is horrible about documenting changes publicly.  It is possible and likely you are chasing something that is already fixed.

    If you have 512GBs of RAM, SQL Server has had many patches to fix problems with RAM > 256GBs.  It is likely you are being affected by something in the "high memory" area and it is already fixed.

    Thursday, November 09, 2017 6:19 PM
  • Yes, I have trying to get the client to apply SP1 + cu5 (or latest), but they are very slow on this.

    in the similar post cited by shashank

    https://social.msdn.microsoft.com/Forums/sqlserver/en-US/6b9118f3-b6c1-4f39-ae80-d149392656f9/sql-server-2016-memory-pressure-leads-to-the-plan-cache-clearing?forum=sqldatabaseengine

    the poster was on 2016 SP1 + cu3 and saw a similar issue. There it seems using LPIM may have resolved the issue.

    Here, LPIM was always in use, one difference being the plan cache flush seems to occur once every 2-3 months, but most critically it occurs at the time of month-end postings.

    many people have asserted that there were very many bugs in 2016RTM that have been fixed in SP1, I will press for having SP1 + cu5, then resume this thread


    jchang

    Thursday, November 09, 2017 8:24 PM
  • There are 2 nodes in the availability group, identical HW, identical config (512GB, 524288MB mem), SQL Server max mem (494GB = 506000MB). LPIM in effect.

    The cache flush problem is more pronounced on one than the other. System 1 shows 8573MB available memory, but the system_low_memory_signal_state is True. System 2 typically shows 9,818MB free, sys low mem ss False.

    Given that there is almost no disk IO for data reads, I recommended reducing SQL Server Max 490,000MB and the plan cache immediately recovered (from a situation where the plan cache tries to grow to several GB, then gets flushed to below 1GB intermittently).

    So apparently several GB of OS free memory is not a reliable indicator, use dm_os_sys_memory value of system_low_memory_signal_state instead. Roughly, SQL Server Max memory at 96% of system mem is too high, but 93.5% seems to be ok. It is suspected that something is running on the OS periodically consuming large memory (possibly even a file transfer that get cached by the OS, more so on one than the other. I think this might be it because system_cache is just a hair higher than avail phy mem.)

    In this particular case, flushing the plan cache in the sys mem low state was the wrong thing to do because there is very little IO for data pages, and storage is on all-flash. SQL Server + Windows should have a way of knowing this, hence feel free to ditch data buffer pages, not the plan cache, unless it is the single use parameterized SQL (from Entity Frameworks). Mostly certainly the OS should know not to use too much free memory for file caching, and never trigger the sys low mem state.

    Friday, December 01, 2017 5:50 PM
  • this is my best guess.

    SQL Server backups locally. This does not use the OS/system cache because SQL Server backups use unbuffered IO.

    there is a robocopy job that later copies the backups to a network location. This copy would go to the system cache because the default is not unbuffered IO, and the Windows Server 2016 OS is not smart enough to know that for a multi-GB file transfer, using buffered IO is stupid, more so when SQL Server is running and using 494GB on a system with 512GB memory. So Windows fills the remaining memory with system cache for this file copy that will not be touched afterwards (the only "benefit of caching"), then triggers the system low memory signal state.

    Then to compound this, SQL Server sees the low memory signal, and decides to flush the plan cache, not just the very many single use parameterized plan (generated unnecessarily from Entity Frameworks), but also the critical frequently used procedure plans. This causes a significant jump in CPU. In some cases, there also appears to a lock on the plan cache (could this be taken for a large dump of the plan cache) which then causes any new calls that require a compile to be blocked?

    What is sad here is that even if SQL Server decided to cooperate with the OS in releasing memory, it could have dumped data buffers instead. Much of the data pages were infrequently used. And the storage is on flash. It could 100,000's of IOPS if needed. But no, there may have been some dump of data buffers, PLE drops from 90,000 to 8,500 sec? Even then, IOPS was noise compare to what the storage could have handled. Hey dude, what's up?

    The plan is to try the Robocopy with the -j flag, which is supposed to be unbuffered IO, hopefully this will stop using the system cache. Also, I see that there is a call: 

    %windir%\system32\rundll32.exe advapi32.dll, ProcessIdleTasks

    that is suppose to flush the system cache. Will try both to see if the OS free memory is truly free and not system cache (of unnecessary stuff). 


    • Edited by jchang61 Sunday, December 03, 2017 5:20 PM
    Sunday, December 03, 2017 5:17 PM
  • update: being investigated now, 

    there were several procedures that use the old XML API,  sp_xml_preparedocument  followed by OPENXML,

    but only a few actually do sp_xml_removedocument afterwards. This might be the cause of the bloated Compiled Plan - Prepared in the plan cache. So the followed up call is being deployed, hopefully this week. Long term is to use the new xml data type and its associated API.


    jchang

    • Marked as answer by jchang6 Sunday, July 08, 2018 2:41 PM
    Tuesday, December 26, 2017 7:26 PM
  • we have quite similar behaviour. I've posted today an update to our case. What I noticed today for this specific occurrence is that plan cache was quite huge ~4GB while it stored only 1400 objects and did not store any addhoc & prepared plans.

    After SQL server restart - which so far is for us the only solution, the cache plan dropped to 2,1GB while the number of objects in the case increased to 13000 and included over 11000 adhoc and prepared plans.

    We posted our case Help-with-SQL-memory-pressure-investigation at www(dot)sqlservercentral(dot)com but can not paste a link here until my account is verified

    Wednesday, December 27, 2017 1:16 PM
  • It would seem that closing the xml handles via sp_xml_removedocument in various stored procedures has worked, though the issue reoccurred recently. I suspect there might be a script that uses sp_xml_prepare_document, but is not in a procedure.

    just came across this DMV/DMF: sys.dm_exec_xml_handles , which "Returns information about active handles that have been opened by sp_xml_preparedocument" . there is also sys.dm_exec_cursors for API server cursors.

    So I will ask the client to monitor, for this, I made a system procedure (link: http://www.qdpma.com/SQL/sp_vas.html )

    • Proposed as answer by jchang61 Saturday, June 30, 2018 5:07 PM
    • Marked as answer by jchang6 Sunday, July 08, 2018 2:41 PM
    Saturday, June 30, 2018 5:06 PM