locked
SQL Server 2014 Mirroring: Principal fails and no switching to mirror RRS feed

  • Question

  • 1 Environment:

    Principal: Microsoft SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0 (X64)

    Mirror:  Microsoft SQL Server 2014 (SP2-GDR) (KB3194714) - 12.0.5203.0 (X64)

    Witness: Microsoft SQL Server 2014 (SP2-GDR) (KB4019093) - 12.0.5207.0 (X64)

    We use a domain controller and service accounts are on the domain.

    We also have log shipping to another database:

    Log shipping database: Microsoft SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0 (X64)

    During all the issue all servers were up and running.

    2 Problem:

    At 09:45 the principal server was up but not answering any queries. The role didn’t switch between the principal and the mirror.

    At 10:00 a log backup was performed for log shipping.

    At 10:01:50 we have the first error in the logs:

    Login failed for user 'X'. Reason: Failed to send an environment change notification to a log shipping partner node while revalidating the login.

    And

    The client was unable to reuse a session with SPID X, which had been reset for connection pooling. The failure ID is 51. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    This error repeated every 2 3 minutes.

    At 10:25 We restarted the server hosting the principal. It switched to the mirror that responded correctly.

    When the restart was over, we manually switched back the role to principal and it worked with no issues.

    When I tried to restore the log backup done at 10AM during the issue, I have this error:

    Could not redo log record (XXXX:XXXX:X), for transaction ID (X:-XXXXXX), on page (X:XXXXX), allocation unit XXXX, database 'XDB' (database ID X). Page: LSN = (XXXX: XXXX:X), allocation unit = 72057594061914112, type = 1. Log: OpCode = 2, context 2, PrevPageLSN: (XXXX: XXXX:X). Restore from a backup of the database, or repair the database.

    3 Questions:

    Do you have an idea why the principal instance didn’t switch to the mirror when the principal started not answering?

    Tuesday, January 29, 2019 10:16 AM

All replies

  • Clients not connecting to the principal is not a cause for failover. 

    The mirror servers use a "heartbeat" to detect if the other server is working or not to cause a failover.  

    See:

    https://docs.microsoft.com/en-us/sql/database-engine/database-mirroring/possible-failures-during-database-mirroring?view=sql-server-2017

    As for "The client was unable to reuse a session with SPID", that is caused by .Net timeout errors.  That is not an issue on the SQL Server.

    Tuesday, January 29, 2019 1:38 PM
  • Thank you very much for your answer.

    So you think the sql server was working correctly on the principal but something else caused the connections not to answer ?

    Sorry to ask for clarification, I prefer to be sure.

    Thansk in advance,

    Emmanuel OREN

    Tuesday, January 29, 2019 2:12 PM
  • Hi Emmanuel OREN,

     

    >>Login failed for user 'X'. Reason: Failed to send an environment change notification to a log shipping partner node while revalidating the login.

     

    Please refer to the similar thread: https://dba.stackexchange.com/questions/81077/sql-server-2008-r2-failed-to-send-an-environment-change-notification-to-a-log

     

    It seems that the error may be related to the failover partner attribute in the connection string.

     

    Best Regards,

    Emily


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Wednesday, January 30, 2019 8:23 AM
  • Thank you for your help, we are investigating on that side but except a network glitch I don't know how it could happen.

    The thing that bothers me is that for this exact period, I cannot restore the log backup on the replicate with the message above. It seems an SQL issue and it seems a pretty big coincidence that there is another unrelated issue at the same time.

    I performed a complete DBCC CheckDB and there seem to be no corruption.

    Thanks in advance,
    Emmanuel

    • Proposed as answer by Bob_FT Friday, February 1, 2019 4:51 PM
    Thursday, January 31, 2019 1:14 PM
  • Is your network good?
    Friday, February 1, 2019 4:45 PM
  • Hello, 

    The same issue happened again. The main became unreachable and no swap happened.

    A network glitch is much less likely and it becomes a very serious issue for us.

    Here are the available logs :

    03/16/2019 08:07:19,spid838,Unknown,The client was unable to reuse a session with SPID 838 which had been reset for connection pooling. The failure ID is 51. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    03/16/2019 08:07:19,spid838,Unknown,Error: 18056 Severity: 20 State: 51.

    03/16/2019 08:07:19,Logon,Unknown,Login failed for user 'dotnet'. Reason: Failed to send an environment change notification to a log shipping partner node while revalidating the login. [CLIENT: 192.168.106.73]

    03/16/2019 08:07:19,Logon,Unknown,Error: 18456 Severity: 14 State: 51.

    03/16/2019 08:07:19,spid173,Unknown,The client was unable to reuse a session with SPID 173 which had been reset for connection pooling. The failure ID is 51. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    03/16/2019 08:07:19,spid173,Unknown,Error: 18056 Severity: 20 State: 51.

    03/16/2019 08:07:19,Logon,Unknown,Login failed for user 'dotnet'. Reason: Failed to send an environment change notification to a log shipping partner node while revalidating the login. [CLIENT: 192.168.107.72]

    03/16/2019 08:07:19,Logon,Unknown,Error: 18456 Severity: 14 State: 51.

    03/16/2019 06:00:48,Backup,Unknown,Log was backed up. Database: TransDB creation date(time): 2018/03/25(19:31:16) first LSN: 1673078:1639744:1 last LSN: 1673110:7099560:1 number of dump devices: 1 device information: (FILE=1 TYPE=DISK: {'h:\TransDBSharedLogs\TransDB_20190316050000.trn'}). This is an informational message only. No user action is required.

    03/16/2019 04:01:00,Backup,Unknown,Log was backed up. Database: TransDB creation date(time): 2018/03/25(19:31:16) first LSN: 1673064:787904:1 last LSN: 1673078:1639744:1 number of dump devices: 1 device information: (FILE=1 TYPE=DISK: {'h:\TransDBSharedLogs\TransDB_20190316030000.trn'}). This is an informational message only. No user action is required.

    The last logs repeats until 8:13:56

    03/16/2019 08:13:56,spid938,Unknown,The client was unable to reuse a session with SPID 938 which had been reset for connection pooling. The failure ID is 51. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    03/16/2019 08:13:56,spid938,Unknown,Error: 18056 Severity: 20 State: 51.

    03/16/2019 08:13:56,Logon,Unknown,Login failed for user 'dotnet'. Reason: Failed to send an environment change notification to a log shipping partner node while revalidating the login. [CLIENT: 192.168.106.73]

    03/16/2019 08:13:56,Logon,Unknown,Error: 18456 Severity: 14 State: 51.

    And then nothing until 8:39:53 and then the stack was dumped (possible to send) :

    03/16/2019 09:18:08,Server,Unknown,SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.

    03/16/2019 09:15:05,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 2400 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 09:10:04,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 2100 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 09:05:04,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 1800 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 09:00:03,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 1500 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 08:55:03,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 1200 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 08:50:02,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 900 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 08:45:02,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 600 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 08:40:01,Server,Unknown,New queries assigned to process on Node 1 have not been picked up by a worker thread in the last 300 seconds. Blocking or long-running queries can contribute to this condition and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.

    03/16/2019 08:40:01,Server,Unknown,External dump process return code 0x20000001.External dump process returned no errors.

    03/16/2019 08:39:53,Server,Unknown,Stack Signature for the dump is 0x00000000000000E9

    03/16/2019 08:39:53,Server,Unknown,* Short Stack Dump

    03/16/2019 08:39:53,Server,Unknown,* -------------------------------------------------------------------------------

    03/16/2019 08:39:53,Server,Unknown,* *******************************************************************************

    03/16/2019 08:39:53,Server,Unknown,*

    03/16/2019 08:39:53,Server,Unknown,* Deadlocked Schedulers

    03/16/2019 08:39:53,Server,Unknown,*

    03/16/2019 08:39:53,Server,Unknown,* 03/16/19 08:39:53 spid 2188

    03/16/2019 08:39:53,Server,Unknown,* BEGIN STACK DUMP:

    03/16/2019 08:39:53,Server,Unknown,*

    03/16/2019 08:39:53,Server,Unknown,* *******************************************************************************

    03/16/2019 08:39:53,Server,Unknown,***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\LOG\SQLDump0020.txt

    03/16/2019 08:39:53,Server,Unknown,**Dump thread - spid = 0 EC = 0x0000000000000000

    03/16/2019 08:39:53,Server,Unknown,Using 'dbghelp.dll' version '4.0.5'*

    At 09:18 we restarted FREQXSQL8 witch was the main at the time and it swapped to FREQXSQL9. When rebooted we restarted FREQXSQL9 and it switched back to FREQXSQL8 and is working since then with no issues.

    Log backup strangeness : Additionally I saw something strange so maybe worth noticing, the log backup job that was suppose to start at 8:00 Started and never finished correctly, when I connected and checked around 9:30 and the job still appeared running in the job logs, I tried to stop the job, had an error because it was not running. I waited for 10:00 to see how it would react to the next log backup and it worked perfectly but all logs concerning the 8:00 log backup disappeared.

    SQL Server Mirroring Monitor logs :

    On the one that was main, no logs for the time of the issue.

    SQL Server mirroring monitor for main

    The mirror didn't notice an issue.

    SQL Server mirroring monitor for mirror

    Thank you very much for you time.

    Tuesday, March 19, 2019 7:47 AM
  • Hello Emily,

    So it happened again and it becomes problematic to have a mirrored database that never switches when we have an issue.

    I would really appriciate any suggestions,

    Thanks in advance,

    Emmanuel

    Tuesday, March 19, 2019 7:49 AM