none
Possibly found a bug in SQL Server 2012 Availability Groups

    Question

  • Around three weeks ago we've migrated from our SQL Server 2005 with mirroring setup to the brand spanking new (and way cooler) SQL Server 2012 Availability groups. Everything worked out wonderfully untill I discovered my LOG backups were not beeing created a few days ago.

    I contacted Ola Hallengren (which some of the other users here are likely to know as the man who supplies us with the SQL Server Maintenance Solution; http://ola.hallengren.com/) in order to shed some light on the issue.

    First let me pain you the picture, we've got two servers running SQL Server 2012 Enterprise with an Availability Group between them. The backup perferences are set up the the default; 'On secondary replicas' (http://technet.microsoft.com/en-us/library/hh245119.aspx).

    This means i'd expect the LOG backups to be done on which ever of the two servers is running as a Secondary at the time of running. Now it took me longer then it should to realize but we didn't have any LOG backups beeing created at all since the migration.

    Trying to pinpoint why we've come to the conclusion that sys.fn_hadr_backup_is_preferred_replica (http://technet.microsoft.com/en-us/library/hh213235.aspx) is returning a 0 on both the servers. I've since then found the following two references;

    http://social.msdn.microsoft.com/Forums/sqlserver/en-US/7cb08c5c-30b7-40d1-97e7-3302675a9d0e/sysfnhadrbackupispreferredreplica-returns-0-on-both-nodes-of-cluster

    http://connect.microsoft.com/SQLServer/feedback/details/765317/sys-fn-hadr-backup-is-preferred-replica-always-return-zero-for-availabilty-database

    Which happens to be pretty similar to our situation, here is how our migration took place and the "current" sitation;

    - Installed two new servers with Windows 2012 and SQL Server 2012 in a testing lab.

    - Both servers turned out stable and great so we're renamed to their future final names (this bit is important)

    - Server were moved to the datacentre, the cluster was created & the SQL Server Availability Group was created

    - Databses were migrated.

    This is where we're at today, now I can tell you aswell that after reading the two references I checked our @@SERVERNAME property which holds the OLD server names (SQL-1 and SQL-2 for simplicity sake). I've since then renamed both servers to their actual name. I've yet to restart the services though, which I simply haven't had the balls to do during work hours. I'm expecting this to solve the problem from what i've read in the two references but it might be something that is reproducable in a Lab.

    Surely beeing unable to make transaction backups on either server is wrong. I'm also supprised the @@SERVERNAME is used and no the values found in; sp_helpserver

    I have no Idea if these forums are the right place, i've also mailed 'premier@microsoft.com' to ask how to proceed but I thought also posting on the forums might be the right way to go.

    Mind you, i'm EXPECTING a restart of both services to fix the issue but I cannot at this time confirm this (hopefully after the weekend)

    Thursday, August 15, 2013 9:58 AM

All replies

  • Hello,

    I've attempted to reproduce this, but when I do I can get a value of 1 returned on the secondary replica. I've failed the AG over to another node and tested again retuning a value of 1. I've also read a bunch of other posts on here speaking to the same issue. I'm thinking it might have to do with the actual availability group and how adding databases in after the fact could cause the setting to not take?

    Could you check the automated_backup_preference_desc from sys.availability_groups ? What does it report? 

    I'm assuming you have the same jobs setup on each node with Ola's scripts?

    I'll try to dig further into this, but afraid I'll be out for a bit.


    Sean Gallardy | Blog | Twitter

    Friday, August 16, 2013 2:18 AM
  • I have too tried to Repro this issue for a while now, as it is a big concern for us. But so far all efforts to see the Problem have failed. So at least with the current patch Level of SQL I can't seem to get it done, no matter how I do the Setups, AGs, DBs, etc. If anyone has a clear step-by-step Repro I would love to see it.
    Friday, August 16, 2013 6:20 PM
  • I have been doing some testing.

    One way to reproduce it is to change the local servername in sysservers using sp_dropserver and sp_addserver. After a restart @@SERVERNAME will display the new name.

    As I understand it, there is a problem if the @@SERVERNAME doesn't match the server names in the availability group catalog views that are used in sys.fn_hadr_backup_is_preferred_replica.

    Ola Hallengren
    http://ola.hallengren.com

    Friday, August 16, 2013 8:05 PM
  • There also seems to be some issues if you have created the availability groups using scripts with lower case server names, and you have a case sensitive instance.

    http://connect.microsoft.com/SQLServer/feedback/details/765317/sys-fn-hadr-backup-is-preferred-replica-always-return-zero-for-availabilty-database

    Ola Hallengren
    http://ola.hallengren.com



    Friday, August 16, 2013 8:27 PM
  • Firstly, restarting the service which only changed the @@SERVERNAME for us (after doing the remove/add server thing Ola posted. Which i had done a while ago but didn't have the heart to reboot the servers outside the weekend) fixed this issue.

    As Ola stated it seems to very much have to do with the @@SERVERNAME not matching the machine name, I haven't got lab to reproduce but here is the exact steps I took before finding the issue;

    - Install windows 2012 calling the server: WRONGNAME1 and WRONGNAME2
    - Install SQL 2012 on both servers setup AG
    - Did some testing with stability e.d.
    - Removed the AG & Clustering (did not remove sql)
    - Renamed the servers to RIGHTNAME1 and RIGHTNAME2
    - Installed AG with a dummy DB
    - Removed the dummy DB from the AG
    - Added the migrated databases in recovery
    - Restored Transaction logs to minimise down time
    - Added the DB's to the AG
    - Ran sp_Blitz from Brent Ozar to find out @@SERVERNAME isn't showing the right names
    - Changed the @@SERVERNAME using the stored procedures posting only to find out the service then needs a restart

    This is pretty much where I posted here, not having restarted the services and pretty sure that the issue revolves around the @@SERVERNAME.

    Another question asked, yes, Ola´s scripts are all setup the same for each server.

    I hope these steps allow someone to reproduce this issue.



    Sunday, August 18, 2013 6:23 AM
  • Hi

    Thanks for your question.  

    I am trying to involve someone more familiar with this topic for a further look at this issue. Sometime delay might be expected from the job transferring. Your patience is greatly appreciated.  

    Thank you for your understanding and support

    Thanks,
    Candy Zhou

    Friday, August 23, 2013 12:46 AM
  • Not a problem, as said the issue is resolved for us after rebooting the servers this issue for us is resolved.

    Should any more information be required let me know and I'll do my best to facilitate.

    Friday, August 23, 2013 6:58 AM
  • All,

    Just as an FYI:

    I took at a look at the code for "sys.fn_hadr_backup_is_preferred_replica" and the reason it fails is because of the following lines:

    select @local_server_name = @@servername;

    and then later

    if (@primary_server_name = @local_server_name)
            begin
                return 1;
            end

            return 0;

     As you can see, it's going to check @@ServerName rather that use SERVERPROPERTY which should return the correct server name after an actual rename has occurred.

    I wouldn't call it a "bug" because it's working as intended but I would say there is a better way of doing it.

    -Sean


    Sean Gallardy | Blog | Twitter

    Friday, August 23, 2013 12:40 PM
  • Agreed the code does what it should do, however bad/wrongful behavior what would that be called then if not a bug?

    Friday, August 23, 2013 12:50 PM
  • The behavior, IMHO, is expected. Remember @@SERVERNAME does not reflect changes in network name.

    Although the @@SERVERNAME function and the SERVERNAME property of SERVERPROPERTY function may return strings with similar formats, the information can be different. The SERVERNAME property automatically reports changes in the network name of the computer. In contrast, @@SERVERNAME does not report such changes. @@SERVERNAME reports changes made to the local server name using the sp_addserver or sp_dropserver stored procedure.

    http://technet.microsoft.com/en-us/library/ms187944.aspx

    Bug - Meh?

    Needs refactoring or should have used SERVERPROPERTY - Yes

    You could 'argue' that when the network name change occurred drop and add server + service restart should have happened immediately. Had that happened, this would work as intended. So, depending on point of view, it's the "other's" fault for not following good practices.

    -Sean


    Sean Gallardy | Blog | Twitter

    Friday, August 23, 2013 1:12 PM
  • Ow I agree that the behavior is as expected when it comes to the @@SERVERNAME.

    I wouldn't agrue the "bug" would be within the @@SERVERNAME, i'd expect the availability function to have used the SERVERPROPERTY and not the @@SERVERNAME.

    Eitherway it's something that can happen in a server scenario (as I clearly proved by not naming my servers right the first time around hehe) and it causes "sys.fn_hadr_backup_is_preferred_replica" to return the wrong value in my opinion.

    The code you linked is indeed causing this behavior but the behavior itself I wouldn't call "Expected" or "Intended" but in honesty that's just symantics and what not. Just like I don't really mind/care if this is a bug or not (because the code does what it's designed to do).

    I'd say the final behavior (sys.fn_hadr_backup_is_preferred_replica returning a 0 because of the @@SERVERNAME discrepancy) is not that which should be desired (then again this could all be by design due to some issues I can in no way oversee).

    So i'll give you this; Bug, no. Design flaw, possibly but I have no way of knowing if there are scenario's where this is indeed desired behavior.

    Friday, August 23, 2013 1:19 PM
  • So I think that sys.fn_hadr_backup_is_preferred_replica has issues in two cases:

    * The @@SERVERNAME is not correct.

    * The availability group configuration has been made using lower case on a case sensitive instance.

    The best would of course be if there could be made improvements so that these cases cannot happen.

    Another thing could be to do an initial check in sys.fn_hadr_backup_is_preferred_replica that it can determine the preferred backup replica safely.

    Something like this:

    IF @@SERVERNAME NOT IN (SELECT replica_server_name FROM sys.availability_replicas) RETURN NULL

    Ola Hallengren
    http://ola.hallengren.com

    Friday, August 23, 2013 4:55 PM