locked
Mirroring failing with error 1479 for all databases at once RRS feed

  • Question

  • I'm running SQL Server 2008 Enterprise on Windows 2008 Standard with a Hyper-V virtualised mirror on Windows 2008 R2 Datacentre and a witness running SQL Express on Windows 2008 Standard. All are at 10.0.5512 and 64bit. I am using High Safety with automatic failover (synchronous) mirroring on 37 databases. All servers have 1GB network links and are 1ms away from each other.

    Since early January, approximately once to twice a day I see all the databases suddenly throw error 1479:

    Error: 1479, Severity: 16, State: 1.

    Database mirroring is inactive for database 'xxx'. This is an information message only. No user action is required.

    The mirroring connection to "TCP://MIRROR.DOMAIN.com:5022" has timed out for database "xxx" after 10 seconds without a response.  Check the service and network connections.

    Within a few seconds, usually less than 10 I see another entry:

    Starting up database 'xxx'.

    Then a bunch of recovery messages, usually 0 transactions rolled back. Then a CHECKDB which finishes without errors then:

    Database mirroring is active with database 'xxx' as the principal copy. This is an informational message only. No user action is required.

    This happens for all databases that are mirrored and usually takes less than 20 seconds for the entire process.  There are no other error codes shown, except some 18456 where users have tried to access during the time the databases are unavailable.

    I have a lock detector program running against the principal that shows queries stuck waiting on DBMIRROR_DBM_EVENT waits, and sometimes on DBMIRROR_WORKER_QUEUE.

    I have tried moving the mirror to different switches and am fairly sure the network is fine as no other machines are interrupted in the same way, but I can't rule it out.

    Looking at performance monitor on the mirror, during these incidents I see a spike (20 or 30 times higher than usual) in Processor Queue Length, % Disk Read Time, Page Faults/sec, and big increases (5-10 times higher than usual) in Current Disk Queue Length and Processor time. Available memory does not seem to increase or decrease at all.

    Can anyone suggest what might be happening? Or what I should do to diagnose in more detail? Or what changes in those counters could mean?

    Thanks!


    • Edited by Peter Curd Friday, February 15, 2013 6:11 PM Adding clarification over which machine had the performance monitor values mentioned
    Friday, February 15, 2013 4:28 PM

Answers

  • Long time since my last message but I'm pleased to say this issue was resolved!

    After working with Microsoft support they identified that the CPU of the mirror was entering a low power state and all CPU cycles were stopping for minutes at a time which caused the principal to time out transactions, which caused the witness to move the mirroring to the mirror.

    The fix was:

    1.       Make sure that the power plan for the mirrored server is set to high performance
    2.       Make sure the host machine for the  mirrored VM is also set on high performance
    3.       Make sure the host machine for the mirrored VM at the BIOS level has no power saving options enabled.

    Thanks all for your help!

    • Marked as answer by Peter Curd Thursday, October 10, 2013 11:25 AM
    Thursday, October 10, 2013 11:25 AM

All replies

  • What do you see on the mirror side during this incident? I know of two Scenarios that can lead to similar Patterns, but both only happen in high load situations. (At least with me so far...)

    1) The more common issue along this line is when the data disks on the mirror can't Keep up with writing the new data pages down. This is normally detectable by looking at the waits on the mirror. They should Show high numbers of PAGEIOLATCH waits and most likely Long current disk Queues on the data drives.

    2) The other one I have seen is around receive side scaling on the Network Card. (On either principal or mirror.) This is a Little harder to detect though. The only real hint I have seen so far is that at the beginning of the Problem period you will see single CPU cores spiking to 100% with most of the time being Kernel-mode time.

    Friday, February 15, 2013 6:56 PM
  • Thanks for the reply!

    The mirror sees spikes in the % Disk Read Time, Log Compressed Bytes Rcvd/sec, Log Bytes Received/sec, Log Send Queue KB, Redo Queue KB, Redo Bytes/sec, Send/Receive Ack Time, and Log Harden Time (ms) counters. For example, Log Compressed Bytes Rcvd/sec goes from the usual of around 200 to 8,445,088. Log Harden Time (ms) from the usual of around 0.1 to 77,911.

    CPU rises to around 50% just before. Memory usage doesn't change.

    I have put a monitor on the collected PAGEIOLATCH waits and see 647,000 on the primary since last reboot, 888,743 on the mirror (they reboot within minutes of eachother). I don't see a significant increase in wait_time_ms or waiting_tasks_count during the time of these incidents on either the primary or the mirror.

    Does the fact that the mirror is virtualised using a VHD for the storage make it more likely that IO is the problem? The PAGEIOLATCH figures don't seem to suggest it but the spikes in Log Harden Time do? Any recommendations for other areas I can investigate or things I can try would be appreciated.

    Thanks,

    Monday, February 18, 2013 4:09 PM
  • You may want to look at the perfmon counters during the issue time on both the principal and mirror to check for any Memory, IO, Network related issues contributing towards the mirroring failover.

    http://technet.microsoft.com/en-us/library/cc966540.aspx

    As an FYI, you can also increase the timeout value from default 10 seconds to higher value to avoid failover due to this issue

    http://www.mssqltips.com/sqlservertip/1603/adjusting-the-automatic-failover-time-for-sql-server-database-mirroring/

    Tuesday, February 19, 2013 12:27 AM
  • I agree with V. , you Need to look at this issue live when it is Happening. And at that time I would look specifically for the Waits in Progress (SQL Wait Statistics) on both sides. And please make sure that you have TraceFlag 1234 enabled on both machines, otherwise the waits in Progress don't get populated correctly. (You can Switch that on interactively with "DBCC TRACEON (1234,-1)")

    Monday, February 25, 2013 4:54 AM
  • I only know about the issue when I get L16 emails, when my monitoring software (Red Gate SQL Monitor) detects a SQL Mirroring Latency increase, or when an application or user reports a time out exception. By the time I've even heard the email ding the mirroring is already working again.

    It takes a matter of seconds to happen - just long enough for real time processes to fail or unlucky users to get a timeout.

    Looking at the perfmon counters I have been able to capture (by leaving them on all the time) I see log harden time increase after, compliations and scans drop before, current disk queue length goes to 3 (which seems to be the maximum) during, and disk time (mostly on the C Drive) rise after.

    Memory seems to be static, and processor queue increases but cpu usage does not. All sounds like disk IO to me but I don't know how to tell what the problems are?

    Monday, February 25, 2013 12:48 PM
  • OK, let's try it like that then: Enable TF 1234 (as I said above) and run a perf Counter collection (5 second interval) with the SQL Wait Statistics in it on both nodes. That should let us see the Spikes in the waits hopefully...
    Monday, February 25, 2013 9:11 PM
  • Right,

    Live shows:

    Log write waits started per second from 0-1 to 140

    Network IO waits usually goes 0-2,000 every 30 seconds or so - during the event it stays solid at 2,000 for 20 seconds. Waits started drops to 0 during this time too.

    Mirror shows:

    Log Write Waits cumulative time goes to 1ms per second, waits in progress to 1 from 0

    Wait for the worker goes to 1 from 0

    Non-Page latch waits goes to 554 from 0

    Network IO waits go from 4-6 started per second to 0

    Page latch waits started per second go from 8-30 to 0

    Thursday, February 28, 2013 3:00 PM
  • Long time since my last message but I'm pleased to say this issue was resolved!

    After working with Microsoft support they identified that the CPU of the mirror was entering a low power state and all CPU cycles were stopping for minutes at a time which caused the principal to time out transactions, which caused the witness to move the mirroring to the mirror.

    The fix was:

    1.       Make sure that the power plan for the mirrored server is set to high performance
    2.       Make sure the host machine for the  mirrored VM is also set on high performance
    3.       Make sure the host machine for the mirrored VM at the BIOS level has no power saving options enabled.

    Thanks all for your help!

    • Marked as answer by Peter Curd Thursday, October 10, 2013 11:25 AM
    Thursday, October 10, 2013 11:25 AM
  • to whom it may help:

    ALTER DATABASE [databasename]
        SET PARTNER TIMEOUT 30;
    GO


    Doria

    Wednesday, June 29, 2016 7:35 PM