Wednesday, January 16, 2013 9:13 PM
This issue started last week. On Wednesday, January 9, 2013, at 2 PM, my hourly backups started failing and I was unable to determine the cause. I assumed it was an authentication issue, so I waited until the evening and decided to fail over my cluster, thinking that would fix it. However, before I could fail over, my backups started working again at about 8 PM (weird). I decided to fail over anyway since I had installed some Windows patches, and my backups continued to work, so I thought it might have been a fluke.
Then today, Wednesday, January 16, 2013, at 2 PM again, my backups started to fail. So clearly something is wrong.
I am running a 2-node SQL cluster (SQL 2008 R2 SP1) and the SQL Engine uses an AD service account. I have a test cluster that is configured exactly the same, except that it uses a different AD account. At first I thought it might be my backup device, so I checked its logs, ran an update, and rebooted it, but that did not fix it. I then tried backing up a database from the test cluster to my backup device, and that worked. We have another backup device that is identical to the one we use for SQL (it is where we replicate the backups to, so it is set up the same way). I tried backing up both SQL clusters to the second backup device, and both worked.
Has anyone experienced this before? It is a very odd issue and has me stumped.
Wednesday, January 16, 2013 9:39 PM
What do you see in the SQL Server error log? It should show why your backups failed. Did you check it?
Hope it Helps!!
Wednesday, January 16, 2013 9:41 PM
Sorry, forgot to post that:
BackupDiskFile::CreateMedia: Backup device '\\<device-name>\ssis.bak' failed to create. Operating system error 1326(Logon failure: unknown user name or bad password.).
However, this error message does not really help. The backups work except during this bizarre time period, and they also work on other devices.
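For anyone collecting these failures from the error log, here is a minimal sketch of pulling the operating system error code and text out of a message like the one above. The function name `parse_os_error` and the regular expression are my own assumptions based on this single message; other SQL Server messages may be formatted differently.

```python
import re
from typing import Optional, Tuple

# Matches the "Operating system error <code>(<text>)" fragment seen in the
# message above. The pattern is an assumption drawn from that one example.
OS_ERROR_RE = re.compile(r"Operating system error (\d+)\((.*)\)")


def parse_os_error(line: str) -> Optional[Tuple[int, str]]:
    """Return (code, text) if the line carries an OS error, else None."""
    match = OS_ERROR_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()
```

Running this over every failed backup entry and comparing the codes would show whether the failures are always 1326 or whether other errors appear during the same window.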
Wednesday, January 16, 2013 9:52 PM
Since you are backing up to a network share: I have seen issues like this when the clocks on the servers are significantly out of sync with the AD domain controller.
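As a rough illustration of the clock check, the sketch below compares two timestamps against the 5-minute tolerance that Active Directory uses for Kerberos by default. The function name `skew_exceeds_tolerance` is hypothetical, your domain policy may set a different tolerance, and how you collect the two timestamps (for example, reading the clock on each machine at the same moment) is left to you.

```python
from datetime import datetime, timedelta

# Default Kerberos clock tolerance in Active Directory; the domain policy
# in your environment may override this value.
DEFAULT_MAX_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(
    server_time: datetime,
    dc_time: datetime,
    max_skew: timedelta = DEFAULT_MAX_SKEW,
) -> bool:
    """Return True if the two clocks differ by more than max_skew."""
    return abs(server_time - dc_time) > max_skew
```

If this returns False for the SQL nodes, the backup device, and the DC during a failure window, clock drift is unlikely to be the cause.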
Wednesday, January 16, 2013 10:00 PM Answerer
What is interesting is that you say it is always around that time. I would talk to your Windows admins and see if anything is hammering the domain controllers around then, or if your network becomes unresponsive.
In addition to what others have said about Kerberos and time issues, try creating the backup on a drive that isn't a network share and see if the issue persists. If it does not, that could indicate an issue with the server hosting the network share. I would ask your networking team to set up a network capture for that server around that time so you can capture the authentication request, then look in the Windows security event log to see if the logon failed for that account. Your Windows team should be able to help with that as well.
Wednesday, January 16, 2013 10:07 PM
I thought of that because I got burned by it a few years ago, but the time seems correct. Both the SQL server and the backup device are getting their time from the DC.
Wednesday, January 16, 2013 10:08 PM
Let me ask our AD team. However, both backup devices, the primary and the secondary, use the same DC for authentication, so I am not sure that is the issue.
Thursday, January 17, 2013 9:13 AM Moderator
This is definitely a Windows/AD issue. Windows returns the error code 1326 to SQL Server (see the SQL Server error). You can even ask Windows what this means:
NET HELPMSG 1326
"The user name or password is incorrect."
Perhaps you are hitting different DCs at different times? For instance...
Thursday, January 17, 2013 2:22 PM
Do you know how I can track what domain controller the account is trying to authenticate to during the backup?
Thursday, January 17, 2013 2:26 PM Answerer
Unfortunately, all I know is to check %logonserver%, but that's not to say it won't go to a different one. The network trace should show the authentication attempts. Also, don't forget to check the Windows security event log, as that will actually log the failed login and give you at least some information to go on if you can't get the traces right away.
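A small sketch of the %logonserver% check, in case you want to script it. The helper name `logon_server` is my own. Keep in mind that the variable only reflects the DC that authenticated the logon session the script runs in, so it would have to run under the SQL service account to say anything about that account, and even then a later authentication to the share could go to a different DC.

```python
import os
from typing import Mapping, Optional


def logon_server(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the LOGONSERVER value without its leading backslashes.

    Returns None when the variable is not set (for example, off Windows).
    """
    env = os.environ if env is None else env
    value = env.get("LOGONSERVER")
    if not value:
        return None
    return value.lstrip("\\")


if __name__ == "__main__":
    print(logon_server())
```

Logging this value each time the backup job runs would at least show whether the reported logon server differs between working and failing periods.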
Thursday, January 17, 2013 8:14 PM
Technically, given the error you are receiving, this is a Windows AD issue, not a SQL Server issue.
I would check the Windows security event log on the target server and see what the error message says.
I have also seen this when AD replication is not working correctly.
Monday, January 21, 2013 4:02 PM
The target server is a QNAP NAS using the SMB protocol, so I will have to check whether it has that kind of logging. The issue cleared up after about 3 hours. This issue is just baffling to me. I will update after I have checked the logs.