locked
Cold Cache after Failover? RRS feed

  • Question

  • I'm puzzled about failover and cache on SQL Server. I've run a pair of servers for a few weeks without restarting either. I need to do maintenance on the current master, so I failed it over. The mirror server instance was using almost all of the RAM on the server, presumably for cache. But when it started processing queries, its cache hit ratio went down substantially and queries run very slowly.

    Why is the mirror using so much memory if it isn't caching data pages? Is there a way to improve cache content before fail over? If mirroring is meant to provide high availability, but cache population of the is so poor that fail over ends up affecting application performance severely, isn't this a bug?

    Tuesday, May 11, 2010 12:20 AM

All replies

  • Hi,

    Based on your description, it seems the mirror server is suffering insufficient memory. You need to check the memory size of the mirror and SQL Server memory usage by using the DBCC MEMORYSTATUS command.

    Generally, if you are experiencing undesirable performance when you fail over, you could consider a number of capacities on the principal and mirror servers, such as CPU, memory, storage, and network capacity, and SQL Server configuration, such as memory settings. In the event of a failover, best practices recommendations are to use:
    1. Identical partner servers (in terms of CPU, memory, storage, network capacity).
    2. The same service pack and patch levels for the operating system and SQL Server on both partners. When performing rolling upgrades, the service pack and patch levels can be temporarily different on the principal and the mirror. However, for steady state operations, they should be identical.
    3. The same edition of SQL Server on both partners.
    4. An identical directory structure for the SQL Server install and database files on both partners.
    5. The same SQL Server configuration (trace flags, startup options, memory settings, etc.) for the principal and mirror instances.
     
    If there are any more problems, please let me know.
    Thanks.


    ***Xiao Min Tan***Microsoft Online Community***
    Wednesday, May 12, 2010 4:55 AM
  • Hello, Xiao Min.

     

    If I look at the DBCC MEMORYSTATUS output for the two servers, it's nearly identical. I guess the SQLBUFFERPOOL part is the most germain to my complaint. On the master, it looks like this:

     

    MEMORYCLERK_SQLBUFFERPOOL (node 0)    KB
    ---------------------------------------- -----------
    VM Reserved               134365184
    VM Committed               106653184
    Locked Pages Allocated          0
    SM Reserved               0
    SM Committed               0
    SinglePage Allocator           0
    MultiPage Allocator           464

    and on the mirror, it looks like this:

    MEMORYCLERK_SQLBUFFERPOOL (node 0)    KB
    ---------------------------------------- -----------
    VM Reserved               134365184
    VM Committed               106653184
    Locked Pages Allocated          0
    SM Reserved               0
    SM Committed               0
    SinglePage Allocator           0
    MultiPage Allocator           464

    They're the same!

    To answer the rest of your questions:

    1) Yes, the servers are identical hardware in CPU, memory, storage, and network connection.

    2) The servers are running the same patches and SP for both SQL Server and the OS. Your point about upgrades is what brings me ot my question, by the way. If I fail over in order to demote the master to a mirror so I can patch it, response time is slow until the cache warms up. Note that the cache isn't growing; it's populated, but populated with garbage and has to spend time evicting pages that it doesn't need after doing physical I/O.

    3) Both machines are running standard edition.

    4) The setup of the directories is entirely identical. I'm curious, though: How would this possibly effect what SQL Server decides to cache?

    5) We don't use any trace flags or special startup options. The memory limits are configured identically across the two machines.

    I'm not sure if I'm ready to let you know about any other problems, as this one isn't yet solved!

     

     

    Wednesday, May 12, 2010 3:50 PM
  • Hi,

    I’d like to clarify that the data cache on the mirror server will be 'hot', because the mirror server is continuously replaying data modification transactions that it receives from the. In other words, the data cache will be populated with data and index pages based on the same kinds of changes made on the principal. To make the mirror cache even more like the principal's cache, database mirroring also passes SELECT hints to the mirror so that the cache used for querying data also is reproduced on the mirror server. This will aid in making the mirror more like the principal and will reduce the remaining redo time in the case of a failover. Obviously, any additional activity on the mirror server, including queries against database snapshots, will affect the state of the cache and could increase the duration of time to finish the redo phase in the event of a failover.

    Did you use an automatic failover, manual failover or forced service? Since your purpose is to upgrade the SQL Server, I suppose you use a manual failover. When a manual failover causes current users to be disconnected and rolls back any unfinished transactions from the old principal database. It will recover the mirror database by finishing all completed transactions in the redo queue, and rolling back (in the undo phase) unfinished transaction. The recovered new principal database can become available for use as soon as it finishes the redo phase.  I hope all of these may help you understand the high memory usage and catch poputed in the event of a failover.

    If there are any more problems, please let me know.
    Thanks.


    ***Xiao Min Tan***Microsoft Online Community***
    Thursday, May 13, 2010 3:35 AM
  • Hello Xiao Min.

    Indeed, I expected the cache on the mirror server to be hot for the the reasons that you describe. But in practice we found that it certainly wasn't.

    I stopped my application, did a manual failover, waited for the send queue to deplete, waited for the redo queue to finish, and then brought my application back up against the new master server. Before the failover, the SQL Server cache hit counter was greater than 99.9%. After the failover, the cache hit counter was less than 98% and grew as SQL Server replenished its cache.

    At startup, our application isn't accessing data that's much different than what it uses at steady state. There are some queries with large result sets that the application does at startup, but the size of the pages touched to run these queries is very small compared to the cache memory available (and in use) by SQL Server at the time of the failover.

    I still can't figure out why the cache on the Mirror server isn't properly populated and ready for the failover event.

    .B ekiM

    Thursday, May 13, 2010 5:38 PM
  • Seems like SQL is alolcating virtual mem (as opposed to physical mem) - could it possibly be that on the mirror some of the buffer cache has been paged out?
    Bodo Michael Danitz - MCT, MCITP - free consultant - performance specialist - www.sql-server.de
    Friday, May 14, 2010 10:08 AM
  • Hello, Bodo.

     

    SQL Server on the mirror is not paging. The memory allocated is physical, and I can observe this with Process Explorer.

    Friday, May 14, 2010 2:13 PM
  • Hello Mike,

    Have you somehow solved this issue?

    Tuesday, June 12, 2012 8:44 AM
  • No. It seems that the issue is better in 2008 SP2, but it's still true that after a fail over the newly promoted server ends up having very low cache page hits.
    Tuesday, June 12, 2012 12:26 PM
  • What version of SQL are you on? I just read through this thread and I am puzzled a little... Normally the cache on a mirror server in DBM only is hot for the update part of your workload. I have (before SQL2012 and AlwaysOn) never heard of select hints, at least not of any that would have worked in my environments, so from my experience what you see is a quite normal behavior if your selects and updates don't match very well...

    Xiao-Min, if you know more than me, please point me to the KB articles... I would love to learn something new in that space.

    Thursday, June 14, 2012 7:31 PM
  • Right, I do not think we have any prefetching of pages for reads on the secondary, just for updates. There can be some readahead that kicks in which might read in adjoining pages, but nothing intelligent for queries failing over.

    The other thing to check on is if the mechanism you are using to failover is triggering a restart of the database. That would discard any cached pages. Errorlog would show the "Starting up database 'X'." message if we restarted.


    Peter Byrne, Microsoft SQL Server Group

    Tuesday, June 19, 2012 6:56 PM