locked
MSSQL 2008 standard, Principal failed without going to Mirror. Extra diagnostics possible ? RRS feed

  • Question

  • I have had only one occurrence of the following problem and would like to enable further diagnostics within MSSQL 2008 to diagnose it.... 

    Environment:

    2 * (Dell Poweredge R610's running Server 2003 R2. 6 disks, 1 pair of 3 RAID sets), each running MSSQL 2008 Standard. Running as principal and mirror. A third PC (same spec as above) running SQLExpress as the Witness or Arbitrator.

    Symptom:

    When applications posted a query, no connection could be established. Management SQL console hung when run. However, the currently "active" database  appeared to be running normally, backing up and mirroring as expected and not reporting any errors. Tthe "mirror" database did not take over under this situation. Logs show a transaction being committed but failing to return !

    When a transaction is committed on MSSQL but fails to return, we don't know if MSSQL committed the transaction ok or failed.  This in itself isn't a problem - it's allowed for - but it does show how sudden the failure was, actually during the commit of a transaction.  In brief, the database failed suddenly, and then started timing out when the client attempted completely innocuous things.

     

    14/11/2010 07:42:17.515 ERROR [Multicast clock receiver] Transaction (null) - Exception committing transaction

    System.Data.SqlClient.SqlException: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

    ...

       at System.Data.SqlClient.SqlTransaction.Commit()

       at Rescale.Mssql.Transaction.Dispose() in d:\buildAgent1\work\d465656922109049\Rescale\RescaleModelManager\Mssql\Transaction.cs:line 33

    14/11/2010 07:42:17.531 WARN  [Multicast clock receiver] ModelManager (null) - Error committing database transaction

    Rescale.Mssql.UnknownCommitSuccessException: The success of a commit statement is unknown ---> Rescale.Mssql.UnknownCommitSuccessException: The success of a commit statement is unknown ---> System.Data.SqlClient.SqlException: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

    A forced database failover got our system working again (shutting down the currently "active" database PC), when the previous mirroring database took over.

    Logs from the two machines indicate both databases have lost communication with the Witness PC, but shouldn't the main database carry on under that situation?

    SVR1:
    2010-11-14 07:56:37.75 spid26s     Error: 1479, Severity: 16, State: 1.
    2010-11-14 07:56:37.75 spid26s     The mirroring connection to "TCP://MPH-SD1-WIT.Automation.local:5022" has timed out for database "RESCALE" after 10 seconds without a response.  Check the service and network connections.
    2010-11-14 07:56:58.44 Server      Error: 1474, Severity: 16, State: 1.
    2010-11-14 07:56:58.44 Server      Database mirroring connection error 4 '64(failed to retrieve text for this error. Reason: 1815)' for 'TCP://MPH-SD1-WIT.Automation.local:5022'.
    2010-11-14 07:57:00.95 Backup      Log was backed up. Database: Rescale, creation date(time): 2010/08/25(08:02:27), first LSN: 3558702:1092:1, last LSN: 3558702:1098:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'nul'}). This is an informational message only. No user action is required.
    2010-11-14 07:57:22.38 spid20s     Error: 1474, Severity: 16, State: 1.
    2010-11-14 07:57:22.38 spid20s     Database mirroring connection error 2 'Connection attempt failed with error: '10060(failed to retrieve text for this error. Reason: 1815)'.' for 'TCP://MPH-SD1-WIT.Automation.local:5022'.

     

    SVR2:
    2010-11-14 07:56:39.69 spid27s     Error: 1479, Severity: 16, State: 1.
    2010-11-14 07:56:39.69 spid27s     The mirroring connection to "TCP://MPH-SD1-WIT.Automation.local:5022" has timed out for database "RESCALE" after 10 seconds without a response.  Check the service and network connections.
    2010-11-14 07:57:13.86 Server      Error: 1474, Severity: 16, State: 1.
    2010-11-14 07:57:13.86 Server      Database mirroring connection error 4 '64(failed to retrieve text for this error. Reason: 1815)' for 'TCP://MPH-SD1-WIT.Automation.local:5022'.
    2010-11-14 07:57:36.69 spid31s     Error: 1474, Severity: 16, State: 1.
    2010-11-14 07:57:36.69 spid31s     Database mirroring connection error 2 'Connection attempt failed with error: '10060(failed to retrieve text for this error. Reason: 1815)'.' for 'TCP://MPH-SD1-WIT.Automation.local:5022'.
    2010-09-27 20:41:00.50 Backup      Error: 3041, Severity: 16, State: 1.
    2010-09-27 20:41:00.50 Backup      BACKUP failed to complete the command BACKUP LOG Rescale. Check the backup application log for detailed messages
    2010-09-12 10:17:30.10 Logon       Error: 18456, Severity: 14, State: 38.
    2010-09-12 10:17:30.10 Logon       Login failed for user 'sa'. Reason: Failed to open the explicitly specified database. [CLIENT: 10.202.7.55]

     If anyone is aware of extra diagnostic logging that could pinpoint the cause of this failure, it would be a great help.

     

    Thanks very much,

    snell123

    Wednesday, December 8, 2010 11:14 AM

Answers

  • First, I would suggest you to apply service pack 2 for SQL 2008 as there have been few fixes in Mirroring area.
    Coming to the problem faced by you.. there are multiple problem I see..

    1.
    System.Data.SqlClient.SqlException: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding
    This error happened at "07:42:17.515"

    2.
    Database mirroring connection error 4 '64
    Error 64 => The specified network name is no longer available.
    This error happened at "07:56:37.75"

    3.'Connection attempt failed with error: '10060
    Error 10060 => A connection attempt failed because the connected party did not properly respond a period of time, or established connection failed because connected host has failed to respond.

    There is a time difference in both so i don't think there are co-related. 
    Issue # 1 can be easily explained in below image.

    There might be blocking in the database causing your query to wait more than 30 sec (default value). Did you capture any information when issue happened?

    I would also suggest you to disable TCPChimney since there could be intermittent issue due to this.  
    http://support.microsoft.com/kb/942861 (Error message when an application connects to SQL Server on a server that is running Windows Server 2003: "General Network error," "Communication link failure," or "A transport-level error")


    Balmukund Lakhani | Please mark solved if I've answered your question, vote for it as helpful to help other user's find a solution quicker
    --------------------------------------------------------------------------------
    This posting is provided "AS IS" with no warranties, and confers no rights.
    --------------------------------------------------------------------------------
    My Blog: http://blogs.msdn.com/blakhani
    Team Blog: http://blogs.msdn.com/sqlserverfaq
    • Marked as answer by Alex Feng (SQL) Wednesday, December 15, 2010 12:03 PM
    Wednesday, December 8, 2010 1:31 PM

All replies

  • Please post Select @@version from all servers involved.


    Balmukund Lakhani | Please mark solved if I've answered your question, vote for it as helpful to help other user's find a solution quicker
    --------------------------------------------------------------------------------
    This posting is provided "AS IS" with no warranties, and confers no rights.
    --------------------------------------------------------------------------------
    My Blog: http://blogs.msdn.com/blakhani
    Team Blog: http://blogs.msdn.com/sqlserverfaq
    Wednesday, December 8, 2010 11:23 AM
  • 10.0.2531.0 (X64) for the two primary database engines:

    MPH-HD-SVR-1 = Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 (X64)   Mar 29 2009 10:11:52   Copyright (c) 1988-2008 Microsoft Corporation  Standard Edition (64-bit) on Windows NT 5.2 <X64> (Build 3790: Service Pack 2)

    and

    MPH-WIT   = Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 (X64)   Mar 29 2009 10:11:52   Copyright (c) 1988-2008 Microsoft Corporation  Express Edition with Advanced Services (64-bit) on Windows NT 5.2 <X64> (Build 3790: Service Pack 2) 

    on the Witness (Arbitrator)

    Wednesday, December 8, 2010 12:46 PM
  • First, I would suggest you to apply service pack 2 for SQL 2008 as there have been few fixes in Mirroring area.
    Coming to the problem faced by you.. there are multiple problem I see..

    1.
    System.Data.SqlClient.SqlException: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding
    This error happened at "07:42:17.515"

    2.
    Database mirroring connection error 4 '64
    Error 64 => The specified network name is no longer available.
    This error happened at "07:56:37.75"

    3.'Connection attempt failed with error: '10060
    Error 10060 => A connection attempt failed because the connected party did not properly respond a period of time, or established connection failed because connected host has failed to respond.

    There is a time difference in both so i don't think there are co-related. 
    Issue # 1 can be easily explained in below image.

    There might be blocking in the database causing your query to wait more than 30 sec (default value). Did you capture any information when issue happened?

    I would also suggest you to disable TCPChimney since there could be intermittent issue due to this.  
    http://support.microsoft.com/kb/942861 (Error message when an application connects to SQL Server on a server that is running Windows Server 2003: "General Network error," "Communication link failure," or "A transport-level error")


    Balmukund Lakhani | Please mark solved if I've answered your question, vote for it as helpful to help other user's find a solution quicker
    --------------------------------------------------------------------------------
    This posting is provided "AS IS" with no warranties, and confers no rights.
    --------------------------------------------------------------------------------
    My Blog: http://blogs.msdn.com/blakhani
    Team Blog: http://blogs.msdn.com/sqlserverfaq
    • Marked as answer by Alex Feng (SQL) Wednesday, December 15, 2010 12:03 PM
    Wednesday, December 8, 2010 1:31 PM