SQL Server 2016 memory pressure leads to the plan cache being cleared

  • Question

  • Two weeks ago we migrated to SQL Server 2016 SP1 CU3 (previously our app was running on SQL Server 2014 SP2). The application hasn't been upgraded or changed, nor has the workload changed.

    Since the migration, every few minutes to roughly every hour, the proc cache is being flushed (not entirely, but the majority of plans go away). If I run:

    SELECT COUNT(*) FROM sys.dm_exec_cached_plans;

    ...just after the clearing happens, the number of plans drops to somewhere between 100 and 300 and then gradually increases back to roughly 2,000, then the clearing usually happens again, and so on. It is worth mentioning that the buffer pool seems to stay intact and only the caches are affected.
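
    To see which part of the cache is affected, the count can be broken down by object type, e.g. with a sketch like this (using only documented columns of sys.dm_exec_cached_plans):

    -- Plan count and total size per cache object type, to see whether the
    -- drop hits all object types or mainly ad-hoc plans
    SELECT objtype,
           COUNT(*) AS plan_count,
           SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 AS size_mb
    FROM sys.dm_exec_cached_plans
    GROUP BY objtype
    ORDER BY size_mb DESC;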

    The server runs on VMware; it has 128 GB of RAM (SQL Server max server memory is set to 102 GB, min server memory is set to 72 GB). Based on the output from SentryOne, I can see that the buffer pool consumes ~61 GB.

    SQL Server memory usage

    My SQL Server version is as follows:

    Microsoft SQL Server 2016 (SP1-CU3) (KB4019916) - 13.0.4435.0 (X64)

    It's a Standard Edition.

    Finally, I came across an article by Jonathan Kehayias and decided to check the ring buffers and boom!, it turns out that I have notifications from the resource monitor saying 'low physical memory'. Occurrences of this notification line up perfectly with the proc cache clearing. Now the question is how to interpret these results and how to find the responsible process. As you can see in the query result:

    SELECT
        EventTime,
        record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') AS [Type],
        record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') AS [IndicatorsProcess],
        record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') AS [IndicatorsSystem],
        record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [Avail Phys Mem, Kb],
        record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS, Kb]
    FROM (
        SELECT
            DATEADD(ss, (-1 * ((cpu_ticks / CONVERT(float, (cpu_ticks / ms_ticks))) - [timestamp]) / 1000), GETDATE()) AS EventTime,
            CONVERT(xml, record) AS record
        FROM sys.dm_os_ring_buffers
        CROSS JOIN sys.dm_os_sys_info
        WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
    ) AS tab
    ORDER BY EventTime DESC;


    Ring buffer query results

    we can observe the 'low physical memory' flag although the value of Available Physical Memory stays at the same level. Moreover, if I am not mistaken, the above results indicate internal memory pressure (IndicatorsProcess = 2), which is weird to me as Sentry shows all the time that SQL Server doesn't fully utilize the allocated memory. This is the memory usage captured by Sentry for a sample taken at 8 AM:

    SQL Server memory usage

    All lines are pretty flat. What is also weird to me is that this group of events:

    • RESOURCE_MEMPHYSICAL_LOW

    • RESOURCE_MEM_STEADY

    • RESOURCE_MEMPHYSICAL_HIGH

    happens at the same time. So the pressure lasts milliseconds or less (perhaps this is also why Sentry doesn't capture anything, as it collects data far less frequently).

    I tried to find the reason behind this internal pressure and checked the top 10 memory clerks (in terms of memory consumption) to see if there are any heavy consumers there:

    Memory clerks query results

    but to be fair I don't see anything suspicious there.
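
    For reference, a memory-clerk check can be sketched like this (not the exact query I used; pages_kb is the documented per-clerk allocation column from SQL Server 2012 onwards):

    -- Top 10 memory clerks by allocated memory
    SELECT TOP (10)
           [type],
           name,
           SUM(pages_kb) / 1024 AS memory_mb
    FROM sys.dm_os_memory_clerks
    GROUP BY [type], name
    ORDER BY memory_mb DESC;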

    Another thing worth mentioning is that I haven't tried LPIM yet, as it requires a SQL Server restart, but even if it is the solution I would really like to understand why this issue happens. Moreover, please correct me if I am wrong, but as the buffer pool seems not to be affected by the trimming, I don't really think LPIM is the solution here.
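
    For the record, which memory model is in use can be confirmed with the following (the sql_memory_model_desc column was added in SQL Server 2016 SP1; it reports CONVENTIONAL without LPIM and LOCK_PAGES once LPIM is in effect):

    -- Shows whether SQL Server uses conventional memory, locked pages or large pages
    SELECT sql_memory_model_desc
    FROM sys.dm_os_sys_info;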

    Now I am completely lost and I don't really know what else I should check in order to find the root cause of the issue. I would really appreciate it if someone could help me solve this puzzle.


    • Edited by Grzegorz Opara Wednesday, August 9, 2017 12:28 PM I removed 'internal' from the title as during the investigation it turned out it might be external memory pressure as well.
    Thursday, August 3, 2017 3:19 PM


All replies

  • I would absolutely recommend that you open a case with Microsoft about this issue.

    I have spent some time analysing this problem, and so have some other MVPs, but we have never nailed it. My suspicion is that there are multiple possible root causes, possibly acting in combination with some less-than-good algorithm.

    Question: do you have Query Store enabled? I know that there was a memory leak related to Query Store that might have triggered this problem. I'm quite sure that this issue was fixed in CU3, but maybe there is more to it. Or there is some other component you are using which also has a leak.

    One more thing you can look at is the DMV sys.dm_os_memory_cache_clock_hands. Most likely you will see that the hands are moving with an increasing rounds_count, be it the internal or the external hand. And you will also see the values for removed_all_rounds_count increasing.
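
    As a sketch, something along these lines should show the movement (the names in the filter are the usual plan-related caches; adjust as needed):

    -- Clock hand movement for the plan-related caches; the external hand moving
    -- suggests external pressure, the internal hand internal pressure
    SELECT name,
           clock_hand,
           clock_status,
           rounds_count,
           removed_all_rounds_count,
           removed_last_round_count
    FROM sys.dm_os_memory_cache_clock_hands
    WHERE name IN (N'Object Plans', N'SQL Plans', N'Bound Trees')
    ORDER BY name, clock_hand;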

    And there is nothing else running on the machine? And the VM host is not over-committed on memory (which could cause it to activate the balloon driver)?

    Thursday, August 3, 2017 4:43 PM
  • I would join Erland here. I know you got some good response from Brent Ozar via this link, so did you run the sp_Blitz script with the given parameters? I am interested in the sp_Blitz result.

    If you are really facing a bug from MS or some special behaviour, LPIM is not going to help much, but you MUST try it. How about Query Store, is that enabled?



    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Thursday, August 3, 2017 5:18 PM
    Moderator
  • Actually, I would say that there is a fair chance that LPIM resolves/masks the problem. I observed this issue myself on a VM, but only when LPIM was disabled. I should immediately add that my setup was below most standards, a 2 GB VM running Win7.

    One thing I observed in your post, but I did not comment on is the fairly high setting of "min server memory". I'm not saying that it is part of the equation, but it could be worth lowering it to try the effect.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by Grzegorz Opara Wednesday, August 9, 2017 12:25 PM
    Thursday, August 3, 2017 6:22 PM
  • Actually, I would say that there is a fair chance that LPIM resolves/masks the problem. I observed this issue myself on a VM, but only when LPIM was disabled. I should immediately add that my setup was below most standards, a 2 GB VM running Win7.

    One thing I observed in your post, but I did not comment on is the fairly high setting of "min server memory". I'm not saying that it is part of the equation, but it could be worth lowering it to try the effect.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Ok Erland, just thinking out loud here. I also believe LPIM should do the trick, but let us assume this is a bug and it happens because the memory manager has some issue and wrongly reports low memory, forcing the cache to be cleared. Would LPIM help, when it ONLY prevents paging? We must also note that even with LPIM, memory can still be forced to be trimmed down to min server memory, after which Windows will page.

    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    • Marked as answer by Grzegorz Opara Wednesday, August 9, 2017 12:25 PM
    Friday, August 4, 2017 5:17 AM
    Moderator
  • Guys, 

    First of all, I would like to thank you so much for your efforts! I really appreciate it! I went through your comments and tried to answer all your questions and provide some new information.

    1. Query Store is disabled across the board.

    2. I set min server memory back to 0. It's not a problem; this box is purely dedicated to SQL Server.

    3. As I am always suspicious about VM stuff, I ran perfmon to capture VM-related metrics when the issue happened. This is what I collected; IMHO it looks fine.

    I checked in vCenter and there is no memory overcommitment on the host. Perhaps I am wrong here, but I think that if it were a VM / hypervisor related issue it would rather result in external memory pressure and trimming of the buffer pool as well.

    4. Although I said this is a SQL Server dedicated VM, I double-checked some obvious consumers outside SQL Server, like antivirus, etc. No findings.

    5. I also noticed that the issue happens more frequently when SQL Server runs a heavier workload, but I guess that's not a surprise; I am just putting it here for the record.

    6. I scheduled the LPIM implementation for Tuesday night, so on Wednesday I can give you an update on whether it solved the issue or not.

    7. As per @Erland's suggestion, I started collecting data from sys.dm_os_memory_cache_clock_hands to see how it changes over time. I will share the results as soon as I collect some information.

    8. @Shashank was asking about sp_Blitz. You can find the results in the original post I created on dba.stackexchange.com; I cannot paste them here due to the text length limitation. The only difference now is that sp_Blitz has a record about min server memory set to 72 GB, which is no longer the case.

    9. Last but not least is the question about raising this with Microsoft. I work for a small company; we don't have Premier Support. Is it possible to open a case without the Premier Support option?
    Friday, August 4, 2017 6:43 AM

  • 9. Last but not least is the question about raising this with Microsoft. I work for a small company; we don't have Premier Support. Is it possible to open a case without the Premier Support option?

    Thanks for the reply, and yes, you can open a case with Microsoft even if you don't have Premier Support; the only thing is it will be a bit costly. If they find out it is due to a bug in the product which is still not fixed, you will probably be refunded. If they have a fix, they will provide you the fix and a refund.

    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Friday, August 4, 2017 7:59 AM
    Moderator
  • 2. I set min server memory back to 0. It's not a problem; this box is purely dedicated to SQL Server.

    I see in your Stack Overflow post that when you had min server memory set to 16 GB, the trimmings were worse. So what happened now when you set it to 0?

    5. I also noticed that the issue happens more frequently when SQL Server runs a heavier workload, but I guess that's not a surprise; I am just putting it here for the record.

    It is not entirely uninteresting, since it indicates that the perceived memory pressure is internal, which also agrees with the findings from the ring-buffer query. This means that there is less reason to chase ghosts outside SQL Server.

    6. I scheduled the LPIM implementation for Tuesday night, so on Wednesday I can give you an update on whether it solved the issue or not.

    That will be interesting. As I said, in the situation I observed, it helped. But in that case, the memory pressure was perceived as external. (Which still was weird, because the commit charge in the OS was well below 50%.)

    8. @Shashank was asking about sp_Blitz. You can find the results in the original post <https://dba.stackexchange.com/questions/182365/sql-server-2016-sp1-cu3-clears-the-plan-cache> I created on dba.stackexchange.com

    The only thing that stood out to me is the point about 100 plans for a single query. This is possibly worth investigating. If you have many of these, it may be the plan cache itself that is exploding. But at the same time you mentioned 2,000 cache entries, which is not an alarming number.

    Last but not least is the question about raising this with Microsoft. I work for a small company; we don't have Premier Support. Is it possible to open a case without the Premier Support option?

    As Shanky says, you can open a case, but it can certainly cost you an arm and a leg. If Microsoft agrees that there is a bug in a Microsoft product you will be refunded. But I will definitely not promise that this is the case. As I said, I've seen this reported by forum people and also by MVPs, but it has never been ironed out. I suspect there is a mix of things. That is, there may be something inappropriate inside SQL Server which causes this wobbling, but it may be triggered by something in your workload. And the support staff may stop if they find something like that, and never dig deeper into the code.

    And just because it worked fine on SQL 2014, does not mean that it has to be a bug in SQL 2016. You might have been on the brink in SQL 2014, but smaller adjustments in internal algorithms took you across that brink now.

    Also, I guess your main objective is to bring your server back to sanity, so if LPIM fixes it, you may not be interested in spending a lot of time tracking down what might be an SQL Server bug.

    It occurs to me that there is one more thing to try, if your VM host and your time permit: increase the memory of the server considerably, and of course adapt max server memory accordingly. If this does not change matters, I'd say something is absolutely fishy.

    Friday, August 4, 2017 9:33 AM
  • I would just add that with VMware I always recommend setting the VM to have a reasonable minimum CPU/memory reservation, to prevent the host from stealing all your resources and also to prevent VMware from moving your VM to a host which does not have enough resources on auto failover.

    Friday, August 4, 2017 2:01 PM
    Moderator
  • I see in your Stack Overflow post that when you had min server memory set to 16 GB, the trimmings were worse. So what happened now when you set it to 0?

    Originally, when I posted this case on dba.stackexchange.com, I included this observation, but to be fair the change was made when the server was quite busy; therefore I was eventually not sure whether the observed increase in cache purging was a result of the min memory change or not. This is why, when I posted here on MSDN, I decided to skip this information. It seems that it was a good decision, because when I set min server memory to 0, literally nothing changed.

    It is not entirely uninteresting, since it indicates that the perceived memory pressure is internal, which also agrees with the findings from the ring-buffer query. This means that there is less reason to chase ghosts outside SQL Server.

    Absolutely. There are even more indicators saying the problem is internal. Yesterday the maintenance process was running (DBCC CHECKDB, index rebuild / reorganize) on the server. Both activities triggered cache trimming and some new entries appeared in the ring buffer indicating internal memory pressure. At that time application activity was minimal. There is only one thing which doesn't fit this picture (or I am just misinterpreting the results). As you suggested, I started collecting data from the sys.dm_os_memory_cache_clock_hands DMV (limited to the object plans, procedure plans and bound trees caches).

    As you can see, we can find some numbers for the HAND_EXTERNAL clock hand. Does that mean SQL Server is facing external memory pressure? If that's the case, it is completely not in line with what we have found so far.

    The only thing that stood out to me is the thing about 100 plans for a single query. This is possibly worth investigating. If you have many of these, it may be the plan cache itself that is exploding. But at the same time you mentioned 2000 cache entries which is not any alarming number.

    I will try to figure out which queries have multiple plans, but as I monitor the overall number of plans in the cache, it doesn't look like the trimming is the result of some particular query blowing up the plan cache.
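
    One way to look for that is to group the plan cache by query hash, roughly like this (query_hash and query_plan_hash have been available in sys.dm_exec_query_stats since SQL Server 2008):

    -- Query hashes with the most cached entries; a large count for a single hash
    -- can indicate one (often non-parameterized) query bloating the cache
    SELECT TOP (20)
           query_hash,
           COUNT(DISTINCT query_plan_hash) AS distinct_plans,
           COUNT(*) AS cached_entries
    FROM sys.dm_exec_query_stats
    GROUP BY query_hash
    ORDER BY cached_entries DESC;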

    And just because it worked fine on SQL 2014, does not mean that it has to be a bug in SQL 2016. You might have been on the brink in SQL 2014, but smaller adjustments in internal algorithms took you across that brink now.

    This is a very interesting point. I was wondering whether there is any way to prove that, although it's plainly visible that the cache trimming happens more frequently when the server is under stress, it is at the same time not related to flaws in the application code disclosed during the migration to 2016.

    Apparently the maintenance process gave some answers here. It caused the same trimming, therefore I believe we can say the issue is not related to the application code itself.

    Also, I guess your main objective is to bring your server back to sanity, so if LPIM fixes it, you may not be interested in spending a lot of time tracking down what might be an SQL Server bug.

    Yes, this is true. The main goal is to fix the issue; if LPIM fixes it, that's fine. It would be really great to understand the root cause, but again, that is not the main objective here. It's a pity I cannot open a case with Microsoft for free, provide all the details, and then let them decide whether it's worth investigating the issue further or not. At the end of the day, I guess they should be interested in finding bugs in the software they produce. I think the evidence collected so far indicates it really might be a bug. Well, perhaps I am a bit naive anyway :)


    Let's see on Wednesday whether LPIM brings any relief. If not, I would consider taking a step towards opening a case with Microsoft, but considering a refund is not guaranteed here, it is not going to be an easy decision.


    Sunday, August 6, 2017 8:24 AM
  • I would just add that with VMware I always recommend setting the VM to have a reasonable minimum CPU/memory reservation, to prevent the host from stealing all your resources and also to prevent VMware from moving your VM to a host which does not have enough resources on auto failover.

    Well, that's a fair point, Tom. As far as I know, our VM host can handle all the VMs even if they all use their maximum assigned memory (no memory overcommitment).
    Sunday, August 6, 2017 8:35 AM
  • First of all, the maintenance jobs can cause some plans to fall out of the cache, simply because the plans are invalidated due to updated statistics, so it may be a false lead.

    Yes, it seems that we need to shift our attention. This appears to be external memory pressure. I have a machine where I can easily reproduce a scenario like this. It is a 2 GB VM running Win7 and SQL 2014. On this machine the plan cache goes poof quite quickly (there are never more than a handful of plans in it, as I am not running any real workload on it), and the external hand goes like crazy. When I run Jonathan Kehayias's query, the result is the same as yours. That is, indicator_process is 2 for RESOURCE_MEMPHYSICAL_HIGH.

    On my VM I am quite sure that the issue is external memory pressure, even if there is no trace of it in Task Manager. But if I increase the memory to 3 GB I don't see the issue, nor for that matter if I activate LPIM.

    I had some discussions about this scenario at Microsoft, but we agreed on that since the OS is quite old, it was not worth trying to nail it down.

    Which brings us to the question which OS you have?

    What is interesting in your case is that it seems that the wobbling stops once the load on the server goes down. Many who have reported this have said that once the behaviour started to set in, the only way to stop it was to restart SQL Server, and this is also my observation in the cases I have looked at (of which none have been production boxes).

    As for the difference between SQL 2014 and SQL 2016, it may be as trivial as different execution plans, and there may be plans that now require a higher memory grant.
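
    If you want to check that theory, a sketch against sys.dm_exec_query_memory_grants shows the grants for currently executing requests (it is only a snapshot, so it needs to be sampled while the workload is running):

    -- Current memory grants; noticeably higher requested/granted values after
    -- the upgrade would support the execution-plan explanation
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           ideal_memory_kb,
           query_cost
    FROM sys.dm_exec_query_memory_grants
    ORDER BY granted_memory_kb DESC;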

    In my previous post I suggested that you should attempt to increase the memory for the VM and adjust "max server memory" for SQL Server accordingly. Now that it seems to be an external issue, I would rather suggest that you try lowering "max server memory" to, say, 90 GB. That may give the OS more room to breathe. As I understand from your posts, you are not really using the full buffer cache anyway. (But then you could ask why the OS would care about a max you don't use.)

    Sunday, August 6, 2017 10:30 AM
  • First of all, the maintenance jobs can cause some plans to fall out of the cache, simply because the plans are invalidated due to updated statistics, so it may be a false lead.

    I should have been more precise here. Although you are absolutely right that index rebuilds can make plans in the cache outdated due to updated statistics, in this particular case I am more concerned about DBCC. On our server we use Ola Hallengren's scripts for the maintenance. Once a week, on Saturday, we first run DBCC on the system databases at 10:05 PM, then we trigger DBCC for the application databases at 10:15 PM, and finally we execute the index rebuild / reorganize at 4:15 AM on Sunday.

    Now let's combine this with the ring-buffer output:


    It is very intriguing why DBCC executed for a small subset of databases (system databases only) triggers the RESOURCE_MEMPHYSICAL_LOW flag.

    I had some discussions about this scenario at Microsoft, but we agreed on that since the OS is quite old, it was not worth trying to nail it down.

    Which brings us to the question which OS you have?

    SQL Server runs on Windows Server 2012 R2 (Microsoft Windows [Version 6.3.9600])

    What is interesting in your case is that it seems that the wobbling stops once the load on the server goes down. Many who have reported this have said that once the behaviour started to set in, the only way to stop it was to restart SQL Server, and this is also my observation in the cases I have looked at (of which none have been production boxes).

    Yes, that's a good conclusion. There is definitely a strong correlation between the workload and the amount of cache trimming. It doesn't completely disappear when the server is relatively idle, but it's definitely lower than when the server is under stress.

    For example, yesterday the last trimming happened during the index maintenance mentioned earlier (at ~4:15 AM), and after that it did not happen even once for the entire day! But since people started working this morning I have observed it happening several times.

    In my previous post I suggested that you should attempt to increase the memory for the VM and adjust "max server memory" for SQL Server accordingly. Now that it seems to be an external issue, I would rather suggest that you try lowering "max server memory" to, say, 90 GB. That may give the OS more room to breathe. As I understand from your posts, you are not really using the full buffer cache anyway. (But then you could ask why the OS would care about a max you don't use.)

    This is something worth checking. I will schedule this change to be implemented later this week. Let's see what happens tomorrow when we turn on LPIM, and then we can adjust max server memory.

    Monday, August 7, 2017 9:07 AM
  • This is something worth checking. I will schedule this change to be implemented later this week. Let's see what happens tomorrow when we turn on LPIM, and then we can adjust max server memory.

    Please report back the effects of setting LPIM!

    Monday, August 7, 2017 9:57 PM
  • This is something worth checking. I will schedule this change to be implemented later this week. Let's see what happens tomorrow when we turn on LPIM, and then we can adjust max server memory.

    Please report back the effects of setting LPIM!

    LPIM was implemented successfully. So far the results are promising. Since the restart (~10 hours ago) I haven't observed even a single occurrence of the cache trimming. I don't want to speak too soon, anyway. SQL Server now consumes roughly ~50 GB of memory and it is slowly going up. I will be able to draw some final conclusions when it reaches its maximum (96 GB). That should happen this evening or tomorrow. I will keep you posted!
    Tuesday, August 8, 2017 7:52 AM
  • This is something worth checking. I will schedule this change to be implemented later this week. Let's see what happens tomorrow when we turn on LPIM, and then we can adjust max server memory.

    Please report back the effects of setting LPIM!

    LPIM was implemented successfully. So far the results are promising. Since the restart (~10 hours ago) I haven't observed even a single occurrence of the cache trimming. I don't want to speak too soon, anyway. SQL Server now consumes roughly ~50 GB of memory and it is slowly going up. I will be able to draw some final conclusions when it reaches its maximum (96 GB). That should happen this evening or tomorrow. I will keep you posted!

    That is great that LPIM is showing signs of the issue not repeating itself, so one can assume that, since LPIM does the job, what was forcing the cache trimming was low memory for the OS. The OS faced low memory and in turn forced SQL Server to trim its caches, resulting in the cache clearing. You may also think of reducing max server memory and giving a bit more to the OS.

    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Wednesday, August 9, 2017 6:15 AM
    Moderator
  • I would ask what MAXDOP you are running at, and whether SQL can see any NUMA nodes in your VM environment, and if so, how many.

    This relates to the basically undocumented behaviour SQL has in relation to NUMA nodes; it could run within one node and complain about just that node being short of memory.

    But that should occur a lot more while your memory is small, one would think, all other things being equal, and LPIM might not help it at all.

    Josh

    Wednesday, August 9, 2017 6:30 AM
  • "... so one can assume that since LPIM does the job what was forcing cache trimming was low memory to OS..."

    I wouldn't count on it. :-) When I teach various SQL Server classes, I run demos involving plan caching. Suddenly, the plans just weren't there anymore. I figured out that a reboot solved the problem. Of course I have heard about this issue. The problem happened every time. Finally I got around (remembered) to turn on LPIM. And since then, the problem never happened. This has been super-consistent for me. And I'm not low on OS memory, at all. Granted I'm on a client OS, and on a laptop. But the point is that there probably is more to the story than what meets the eye. 


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Wednesday, August 9, 2017 8:30 AM
    Moderator
  • I would ask what MAXDOP you are running at, and whether SQL can see any NUMA nodes in your VM environment, and if so, how many.

    This relates to the basically undocumented behaviour SQL has in relation to NUMA nodes; it could run within one node and complain about just that node being short of memory.

    But that should occur a lot more while your memory is small, one would think, all other things being equal, and LPIM might not help it at all.

    Josh

    @Josh, to give you some insight into the CPU and memory configuration, please find the results of the following query:

    SELECT cpu_count AS [Logical CPU Count], scheduler_count, hyperthread_ratio AS [Hyperthread Ratio],
    cpu_count/hyperthread_ratio AS [Physical CPU Count], 
    physical_memory_kb/1024 AS [Physical Memory (MB)], 
    committed_target_kb/1024 AS [Committed Target Memory (MB)],
    max_workers_count AS [Max Workers Count], affinity_type_desc AS [Affinity Type], 
    softnuma_configuration_desc AS [Soft NUMA Configuration],
    sql_memory_model_desc -- New in SQL Server 2016 SP1
    FROM sys.dm_os_sys_info WITH (NOLOCK) OPTION (RECOMPILE);

    I can see a single NUMA node:

    Node configuration: node 0: CPU mask: 0x00000000000000ff:0 Active CPU mask: 0x00000000000000ff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

    And the MAXDOP is set to 8.



    Wednesday, August 9, 2017 8:35 AM
  • This is something worth checking. I will schedule this change to be implemented later this week. Let's see what happens tomorrow when we turn on LPIM, and then we can adjust max server memory.

    Please report back the effects of setting LPIM!

    LPIM was implemented successfully. So far the results are promising. Since the restart (~10 hours ago) I haven't observed even a single occurrence of the cache trimming. I don't want to speak too soon, anyway. SQL Server now consumes roughly ~50 GB of memory and it is slowly going up. I will be able to draw some final conclusions when it reaches its maximum (96 GB). That should happen this evening or tomorrow. I will keep you posted!

    That is great that LPIM is showing signs of the issue not repeating itself, so one can assume that, since LPIM does the job, what was forcing the cache trimming was low memory for the OS. The OS faced low memory and in turn forced SQL Server to trim its caches, resulting in the cache clearing. You may also think of reducing max server memory and giving a bit more to the OS.

    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    It is almost 36 hours since LPIM was enabled. There are no more signs of the memory pressure, not even a single entry in the ring buffer. The plan cache is growing along with the buffer pool, but it still hasn't reached max server memory. I am pretty sure the issue is gone, or I should rather say it has been masked by LPIM. Considering this, and the fact that most probably it was external pressure (although it was definitely related to the workload as well), I am now keen to put my attention back on the VM configuration.

    In terms of max server memory, I am not a big fan of reducing it further. This really is a SQL Server dedicated box. Even when I was facing the issue and SQL Server was consuming no more than 96 GB, which is 3/4 of the entire server memory (I would say a very conservative approach), there was always a lot of memory available for the OS and other processes.

    Wednesday, August 9, 2017 8:36 AM
  • "... so one can assume that since LPIM does the job what was forcing cache trimming was low memory to OS..."

    I wouldn't count on it. :-) When I teach various SQL Server classes, I run demos involving plan caching. Suddenly, the plans just weren't there anymore. I figured out that a reboot solved the problem. Of course I have heard about this issue. The problem happened every time. Finally I got around (remembered) to turn on LPIM. And since then, the problem never happened. This has been super-consistent for me. And I'm not low on OS memory, at all. Granted I'm on a client OS, and on a laptop. But the point is that there probably is more to the story than what meets the eye. 


    Tibor Karaszi, SQL Server MVP (Web Blog)

    @Tibor, I agree. In this case, what is really weird is why the OS 'thinks' there is low memory when in fact there is a lot of memory available. Another odd thing is the correlation with the SQL Server workload; if it is entirely external pressure, it should be completely unrelated to what's running inside SQL Server.
    Wednesday, August 9, 2017 8:44 AM
  • "... so one can assume that since LPIM does the job what was forcing cache trimming was low memory to OS..."

    I wouldn't count on it. :-) When I teach various SQL Server classes, I run demos involving plan caching. Suddenly, the plans just weren't there anymore. I figured out that a reboot solved the problem. Of course I have heard about this issue. The problem happened every time. Finally I got around (remembered) to turn on LPIM. And since then, the problem never happened. This has been super-consistent for me. And I'm not low on OS memory, at all. Granted I'm on a client OS, and on a laptop. But the point is that there probably is more to the story than what meets the eye. 


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Thanks for adding another point, Tibor. I was trying to base my assumption on what LPIM actually does (prevents paging), which mostly matters when the OS faces a memory crunch, or, as I said in my very first reply, when *somehow* the memory manager thinks there is a low memory condition and trims the caches, which is prevented by LPIM in this case. I would definitely say this is NOT something ordinary and it really should be a matter of research for the MS team. AFAIK this is not expected behaviour.

    Cheers,

    Shashank

    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it

    My TechNet Wiki Articles

    MVP

    Wednesday, August 9, 2017 9:20 AM
    Moderator
  • In terms of max server memory, I am not a big fan of reducing it further. This really is a SQL Server dedicated box. Even when I was facing the issue and SQL Server was consuming no more than 96 GB, which is 3/4 of the entire server memory (I would say a very conservative approach), there was always a lot of memory available for the OS and other processes.

    Obviously, if LPIM was sufficient to save the show, there is no reason to fiddle around further.

    As for how the increased workload in SQL Server could cause the OS to think that there is memory pressure, I can only speculate. Possibly the workload includes some special component that causes SQL Server to allocate memory outside the buffer pool (although most memory is taken from there since SQL 2012, I believe).

    But since you don't have the incentive to open a case, we will not get the answers to these questions. Not this time either.

    Wednesday, August 9, 2017 11:04 AM
  • Well, I am very happy the issue is solved, but on the flip side I regret that we couldn't investigate it any further. As there is no guarantee it's a bug, I am not able to convince my managers to spend money and more time on opening a case with Microsoft. Moreover, I was not able to reproduce / see this issue on any other SQL Server 2016 we have on board (perhaps because those servers run some small workloads, or maybe there are other reasons). Considering LPIM masked / solved the issue, I don't think there are any options left to collect more data about the issue itself. We would probably need to turn off LPIM in order to observe the issue again (in case of any additional questions from Microsoft support). It's a production environment, so that wouldn't be the best idea, I guess :)

    I think this topic can be closed now. I wish to thank all of you guys, especially @Erland, for all the effort and time spent helping me solve this issue. I really appreciate it! I hope to see you again, @Erland, at SQLDay next year!
    Wednesday, August 9, 2017 12:25 PM
  • We experienced the same problem after the upgrade from 2008 to 2016. It is a shared system and the main patient record system for 8 hospitals with many thousands of simultaneous users. For many years we have monitored the system by running DBCC FLUSHPROCINDB at 23:00 and producing a report based on sys.dm_exec_query_stats the next evening. It has been common for almost all query plans produced after the flush to survive until the next flush, hence it is possible to make a report of accumulated resource usage tied to individual queries.
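
    (For anyone unfamiliar with it, the nightly flush is per database; a sketch of the call, with a placeholder database name:)

    -- Flush only the plans belonging to one database; the name below is a placeholder
    DECLARE @dbid int = DB_ID(N'PatientRecords');
    DBCC FLUSHPROCINDB(@dbid);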

    A few weeks after the upgrade we observed a gradually increasing amount of deletion of query plans; after a month this happened at least every hour. Only the most frequently used plans survived. By "accident" we discovered that the system was "repaired" by performing a failover of the cluster running the database, but after a few weeks it gradually went back to the faulty behaviour. That is a strong indication of a software bug causing a memory leak. We have reported the problem to Microsoft [REG:117101016468804]; they ran pssd and reported:

    “Regarding the SQL Server plan cache, we detected a drop on the plan cache number of objects. This could be explained by memory pressure and with that SQL Server had to reclaim part of the plan cache for other use, but looking to other performance counters related to memory pressure, it didn’t seem to fit the memory pressure scenario.”

    The only fix suggested was to try a downgrade:

    "As this was an upgrade from a 2008 version, you can try to revert to the Legacy CE by enabling trace flag 9481 or setting database compatibility level below 120 to see if performance improves. As temporary relief, you may use this approach.  But we do not recommend you disable new CE permanently unless advised by Microsoft support."

    We did not do that.

    It is mentioned somewhere above that there is a memory leak problem with Query Store. Could that be related to sys.dm_exec_query_stats? We read that daily, and I guess Query Store also uses it.

    Monday, December 4, 2017 2:10 PM
  • @Jo, we didn't use Query Store at all, so at least in my case it wasn't the root cause of the issue. I didn't raise a ticket with MS as we would need to pay for it with no guarantee that MS would confirm it's a bug and we would get the money back. Another difficulty was that we could observe some symptoms of both internal and external memory pressure, which was very confusing. Anyway, Lock Pages in Memory did the trick for me, so I suggest trying it (if you haven't done so already).
    Monday, December 4, 2017 3:22 PM
  • I can't say that I am entirely impressed by the response from Microsoft support.

    But I'm curious: did you upgrade from SQL 2008 to SQL 2016 on the same hardware and OS? What OS and how much memory do you have?

    If you have the same OS and memory as you had with SQL 2008, maybe it is not surprising that you see memory pressure. After all, newer software tends to grab more memory, simply because they assume that more memory is available.

    I assume that you have applied the latest CU for SQL 2016 SP1?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, December 4, 2017 10:33 PM
  • We used the same hardware as we had on SQL 2008. We did not upgrade, but reinstalled the servers with Windows Server 2016 as a clean install. The server has 512 GB of memory. We have installed SQL Server 2016 with SP1 and CU4. Max memory is set to 500000 MB (488 GB), leaving 24 GB for the OS.

    We have been running with LPIM enabled, both on 2008 and now on 2016. The observation that LPIM masks the problem could be just what we have seen: setting LPIM requires a restart, and we have twice observed that a restart temporarily helps, but the problem gradually reappears during the following weeks.


    Thursday, December 7, 2017 11:33 AM
  • So there is not really a lack of memory. But maybe the OS has a wee bit too little headroom. To grab a number out of thin air, I would suggest that you leave 50 GB to the OS.

    An interesting DMV to observe is sys.dm_os_memory_cache_clock_hands. You may see rounds_count and removed_all_rounds_count incrementing, and this may occur before you actually notice any problems. The interesting part is that there are two hands, and which of the hands is moving indicates where the memory pressure is perceived.

    Thursday, December 7, 2017 12:35 PM
  • Restarting or failing over "fixes" the issue because it clears the RAM and starts over.

    I would suggest installing the latest CU and see if that has any effect.  Although the release notes do not list anything which looks related, MS is horrible at publicly documenting changes and it is possible it is fixed in a later CU.

    If that does not help, I would suggest reducing max server memory and seeing if that reduces the number of flushes the OS triggers. Ultimately, you want to set max server memory so that everything coexists without the OS triggering a low memory message.
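
    A sketch of that change (the 90 GB value is only an example; pick whatever leaves the OS comfortable headroom):

    -- Lower max server memory to leave more room for the OS
    EXEC sys.sp_configure N'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure N'max server memory (MB)', 92160;  -- 90 GB, example value
    RECONFIGURE;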

    Thursday, December 7, 2017 12:40 PM
    Moderator
  • I have a new idea. The cache usage for operations such as sort and hash is very high because the queries are granted a significant amount of memory.

    But how does the engine determine that memory? Resource Governor allows you to control these allocations.

    I tested with the two cardinality estimators; one request is granted 100 GB with the old one and 2 GB with the new one.

    I will test decreasing the memory when production allows it.


    • Proposed as answer by Jo Piene Tuesday, August 7, 2018 9:08 AM
    • Unproposed as answer by Jo Piene Tuesday, August 7, 2018 9:09 AM
    Tuesday, June 5, 2018 2:03 PM
  • We increased the memory left for the OS from 34 GB to 50 GB. That solved the problem.

    The server has 512 GB of memory. We have installed SQL Server 2016 with SP1 and CU4.

    Tuesday, August 7, 2018 9:12 AM