locked
Frequent Database Failover RRS feed

  • Question

  • Hi MSDN ppl,

    I seek your expertise yet one other time.

     

    Scenario:

    We have 7 databases mirrored on two servers which are mirroring partners. 3 of the 7 databases are live on server1 and mirrored on server2; and the remaining 4 databases are live on server2 and mirrored on server1. The data is exposed through .NET Widows Application.

     

    The configurations of the servers are as follows.

     

    System: Microsoft Windows Server 2003 R2

    Standard x64 Edition

    Service Pack 2

     

    Computer: Intel(R) Xeon (R) CPU

    5130 @ 2.00 GHz

    2.00 GHz, 32.0 GB of RAM

     

    SQL Version: Microsoft SQL Server 2005 - 9.00.3175.00 (X64)   Jun 14 2007 11:45:39  

    Copyright (c) 1988-2005 Microsoft Corporation Enterprise Evaluation Edition (64-bit)

    on Windows NT 5.2 (Build 3790: Service Pack 2)  

     

    Problem:

    The databases for no apparent reason keep randomly failing over to one server quite frequently. At least twice a day. There is no pattern associated for me to make out as to why this is happening.

     

    My Questions:

    1. Is it a good practice to divide the databases on each server, the way it is now? Or should all the databases be kept on one server and mirrored on other all the time?

     

    2. From the above mentioned scenario, do you find the reason for database to 'failing over' so frequently? Could the Win Application which is used to expose the data be responsible for the failovers?

     

    3. What steps can be taken to check for the reason which is causing the databases to failover? Alternatively and most importantly, how can I this problem of 'Databases Failing over randomly' be solved?

     

    Thank you,

    Little_Birdie

     

     

    Monday, April 28, 2008 4:46 PM

All replies

  • You will probably get more answers in the database mirroring forum, but since I hang out there also, I will give it a shot.

     

    Do you have a witness server?  You must since that is the only way you can get automatic fail-over, but you did not mention it. Are either of the database servers having any hardware issues where they are going down? Are the principal and the mirror in the same location, or are they far away (many miles) from each other.

     

    Have you looked in the SQL Error log and in the Windows Application Event Log to see if there are any relevant errors?

     

    If the host DB servers are stable, unexpected fail-overs are usually caused by connectivity problems. For example, if the mirror and the witness are having problems talking to the principal server, you could get an unexpected fail-over.

     

    Monday, April 28, 2008 7:57 PM
  • Glenn,

    Thanks for the input.

     

    "You will probably get more answers in the database mirroring forum, but since I hang out there also, I will give it a shot."

    --Pls use your discretion in moving the question to 'database mirroring' forum if you consider the qn more appropriate for that forum.

     

    "Do you have a witness server?  You must since that is the only way you can get automatic fail-over, but you did not mention it. Are either of the database servers having any hardware issues where they are going down? Are the principal and the mirror in the same location, or are they far away (many miles) from each other."

    --Yes, we have a witness server and that makes automatic failovers possible. There is no sign of any hardware failure on the database servers at the time when mirroring failure has been noticed. The principal and mirror are in the same location, in fact, they reside on the same rack, just inches away from each other.

     

    "Have you looked in the SQL Error log and in the Windows Application Event Log to see if there are any relevant errors?"

    --I have checked the windows log. Nothing significant recorded there which can give a possible explaination of the failure. The following are the few messages that were recorded.

     

    Database mirroring is inactive for database 'TC_IKN'. This is an informational message only. No user action is required.

     

    2008-04-28 19:24:22.88 spid33s     The mirroring connection to "TCP://LV-SQL4.com:5022" has timed out for database "TC_CTI" after 10 seconds without a response.  Check the service and network connections.

    2008-04-28 19:24:26.59 spid14s     Error: 1479, Severity: 16, State: 2.

     

    "If the host DB servers are stable, unexpected fail-overs are usually caused by connectivity problems. For example, if the mirror and the witness are having problems talking to the principal server, you could get an unexpected fail-over."

    --Is there a way to determine that this is indeed a connectivity problem, a simple test which would prove it? And if it is, how can this be resolved?

     

    Little_Birdie

     

     

     

     

     

     

     

     

    Tuesday, April 29, 2008 4:24 PM
  • There is a Microsoft KB947462 that seems pretty relevant to this issue.

    http://support.microsoft.com/kb/947462

     

    I believe that you are running Build 3175 (which is CU2), you should think about upgrading to CU7 (which is the most current CU).

     

    Wednesday, April 30, 2008 7:41 PM
  • Hi Little_Birdie,

    By default the timeout period is 10 seconds and hence for every 10 seconds you principal will ping the mirror server and if its failed to connect you will receive the error mentioned above. May be you can increase the timeout period to 60 seconds and see how it goes.

    Code Snippet

    ALTER DATABASE dbname SET PARTNER TIMEOUT 60


    Moved to Database mirroring forum so that you get better response..

    - Deepak
    Thursday, May 1, 2008 1:26 AM
  •  

    Please paste the error log information from witness server.

     

    Regards

    Sufian

    Thursday, May 1, 2008 12:58 PM
  •  Deepak,

     

    I already set it to 15 secs. My thought is, why is it taking more than 10 - 15 secs to connect in the first place, unless there is some other problem that needs to be delt with. On a different note, does increasing the connection time have any cons ?

     

    Thank you,

    Little_Birdie. 

     

    Thursday, May 1, 2008 3:29 PM
  • If you set the Partner timeout value to a higher value, it will very likely help with your frequent unexpected fail-over problem. The downside is that it will take longer for SQL Server to notice and react if you actually have a failure that should generate a real fail-over.

     

    I agree that you should be looking at network issues (especially since these servers are all in the same rack), to find the root cause of the problem.

     

    Thursday, May 1, 2008 3:37 PM
  • Glenn,

    Thanks again for a wonderful suggestion. The symptoms stated in the article are close to what we are experiencing.

     

    Few qns in this regard.

    1. How can I cross check if we are running the 'CU2'  build?

     

    2. Does applying Service pack 2 automatically upgrade to CU7? The reason I am asking this is, we already have service pack 2 installed but we still experiencing the 'unexplained failover'  problem.

     

    Awaiting your reply..

    Little_Birdie.

     

    Thursday, May 1, 2008 3:49 PM
  • Sufian,

    These are following errors recorded by the Witness Server.

     

    2008-04-27 07:34:27.25 spid15s     The mirroring connection to "TCP://LV-SQL5.com:5022" has timed out for database "TC_CTI" after 10 seconds without a response.  Check the service and network connections.

    2008-04-27 07:34:51.14 Server      Error: 1474, Severity: 16, State: 1.

    2008-04-27 07:34:51.14 Server      Database mirroring connection error 4 '64(The specified network name is no longer available.)' for 'TCP://LV-SQL4.com:5022'.

    2008-04-27 07:35:12.89 spid21s     Error: 1474, Severity: 16, State: 1.

    2008-04-27 07:35:12.89 spid21s     Database mirroring connection error 2 'Connection attempt failed with error: '10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)'.' for 'TCP://LV-SQL4.com:5022'.

    2008-04-27 08:43:03.21 spid15s     Error: 1479, Severity: 16, State: 1.

    2008-04-27 08:43:03.21 spid15s     The mirroring connection to "TCP://LV-SQL5.com:5022" has timed out for database "TC_CTI" after 10 seconds without a response.  Check the service and network connections.

    2008-04-27 13:34:08.78 spid15s     Error: 1479, Severity: 16, State: 1.

     

    Thank you,

    Little_Birdie.

     

     

     

    Thursday, May 1, 2008 3:52 PM
  • This Microsoft KB explains all the post SP2 builds. Build 3175 (CU2)  is pretty old (April 2007), but it was post SP2.

     

    http://support.microsoft.com/default.aspx/kb/937137/LN/

     

    You must have SP2 installed, then just apply CU7 (which includes all the earlier CU builds).
    Thursday, May 1, 2008 3:54 PM
  • Glenn,

    Thanks for the additional info. Will do what you suggested and will get back.

     

    Little_Birdie.

     

    Thursday, May 1, 2008 6:11 PM
  • Hi,

     

    The error log state that witness server don't see the principal server for a certain period of time and failsover.

    This is purely a network issue.

     

    You can ping the principal server from the witness server and see if there are  RTO' overnetwork.If so need to get in touch with the network team.

    You can aslo see wether mirroing port is not getting blocked or scaned by any ntvirus. (disable AntiVirus)

    You need to check the network speed on the switch also. (speed on switch between principal server and Witness server should be same (better to be AUTO)).

     

    You can also alter the timeout period on the principal server.

     

    In my case the problem is with the NIC Card and Switch speed.

     

    Regards

    Sufian

     

     

     

    Friday, May 2, 2008 11:44 AM
  • Thank you Sufian. I'll check out the things you suggested, and will get back to you with findings.

     

    - Little_Birdie.

     

     

    Monday, May 5, 2008 10:05 PM