Network timeout to Mirror causes Primary to deny connections

  • Wednesday, November 01, 2006 2:29 PM
     
     

    We've now had two instances where, when there is a network connection timeout from the primary to the mirror, the primary db server goes to 100% CPU utilization and refuses all connections.

    The first time we had to reboot the primary; the second time mirroring picked up again 10 minutes later.

    Two dbs are being mirrored: one is 15 GB, the other 4 GB.  Both boxes are running SQL Server 2005 64-bit and Windows Server 2003 64-bit.

    This happened at 6am, when there typically shouldn't be a lot of traffic. The error messages from the SQL log are below.

    We are going to try and move db communications to a separate network and network card - but this looks like either a bug in mirroring or a configuration problem on our end - though it works just fine other times.

    Any thoughts/suggestions would be greatly appreciated.

    Thanks!

    Mark

    SQL Error Log:

    11/01/2006 06:12:12,Logon,Unknown,The server was unable to load the SSL provider library needed to log in; the connection has been closed. SSL is used to encrypt either the login sequence or all communications<c/> depending on how the administrator has configured the server. See Books Online for information on this error message:  0x2746. [CLIENT: 10.16.7.7]
    11/01/2006 06:12:12,Logon,Unknown,Error: 17194<c/> Severity: 16<c/> State: 1.
    11/01/2006 06:12:12,Logon,Unknown,The server was unable to load the SSL provider library needed to log in; the connection has been closed. SSL is used to encrypt either the login sequence or all communications<c/> depending on how the administrator has configured the server. See Books Online for information on this error message:  0x2746. [CLIENT: 10.16.7.7]
    11/01/2006 06:12:12,Logon,Unknown,Error: 17194<c/> Severity: 16<c/> State: 1.
    11/01/2006 06:12:12,Logon,Unknown,The server was unable to load the SSL provider library needed to log in; the connection has been closed. SSL is used to encrypt either the login sequence or all communications<c/> depending on how the administrator has configured the server. See Books Online for information on this error message:  0x2746. [CLIENT: 10.16.7.2]
    11/01/2006 06:12:12,Logon,Unknown,Error: 17194<c/> Severity: 16<c/> State: 1.
    11/01/2006 06:12:11,spid22s,Unknown,Database mirroring connection error 4 '10054(An existing connection was forcibly closed by the remote host.)' for 'TCP://PYTHAGORAS.test.com:7024'.
    11/01/2006 06:12:11,spid22s,Unknown,Error: 1474<c/> Severity: 16<c/> State: 1.
    11/01/2006 06:04:53,spid26s,Unknown,Database mirroring is inactive for database 'NewScribe'. This is an informational message only. No user action is required.
    11/01/2006 06:04:53,spid26s,Unknown,The mirroring connection to "TCP://PYTHAGORAS.test.com:7024" has timed out for database "NewScribe" after 10 seconds without a response.  Check the service and network connections.
    11/01/2006 06:04:53,spid26s,Unknown,Error: 1479<c/> Severity: 16<c/> State: 1.
    11/01/2006 06:04:53,spid24s,Unknown,Database mirroring is inactive for database 'HL7Transfer'. This is an informational message only. No user action is required.
    11/01/2006 06:04:53,spid24s,Unknown,The mirroring connection to "TCP://PYTHAGORAS.test.com:7024" has timed out for database "HL7Transfer" after 10 seconds without a response.  Check the service and network connections.
    11/01/2006 06:04:53,spid24s,Unknown,Error: 1479<c/> Severity: 16<c/> State: 1.

All Replies

  • Friday, November 03, 2006 11:04 AM
     
     
    It would be useful to determine which comes first, the high CPU utilisation or the mirror timeout. The timeout could be occurring because the security context can't be loaded under the excessive CPU utilisation (the thread can't get CPU time quickly enough).
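
    One way to check the ordering is to sort the exported log lines by time, since the log above is listed newest first. Below is a minimal Python sketch, not a definitive tool. It assumes the comma-separated export format shown in the original post, and that the `<c/>` token is the export's stand-in for a literal comma. The function names `parse_line` and `timeline` are made up for illustration, and only three of the posted lines are included.

```python
from datetime import datetime

# Three lines copied from the error log in the original post (newest first).
LOG = """\
11/01/2006 06:12:12,Logon,Unknown,Error: 17194<c/> Severity: 16<c/> State: 1.
11/01/2006 06:12:11,spid22s,Unknown,Error: 1474<c/> Severity: 16<c/> State: 1.
11/01/2006 06:04:53,spid26s,Unknown,Error: 1479<c/> Severity: 16<c/> State: 1.
"""


def parse_line(line):
    """Split one exported log line into (timestamp, source, message)."""
    stamp, source, _unknown, message = line.split(",", 3)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    # Assumption: "<c/>" is how the export escapes a comma inside the message.
    return when, source, message.replace("<c/>", ",")


def timeline(text):
    """Return the parsed entries sorted oldest first."""
    entries = [parse_line(l.strip()) for l in text.splitlines() if l.strip()]
    return sorted(entries, key=lambda entry: entry[0])


events = timeline(LOG)
start = events[0][0]
for when, source, message in events:
    # Print each event with its offset in seconds from the earliest entry.
    offset = (when - start).total_seconds()
    print(f"+{offset:>5.0f}s  {source:<8} {message}")
```

    Lining the offsets up against a CPU counter trace from the same window should show whether the CPU spike starts before or after the 1479 timeout.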
  • Friday, November 03, 2006 11:51 AM
     
     

    Before going deeper into details, I would try to identify what is going on at 6 am.

    Backup?

    An incorrectly configured backup might simply lock the library files (or certificates), preventing them from being loaded and putting the server into a 'bouncing' mode.

  • Friday, November 03, 2006 1:49 PM
     
     

    Thanks for the thoughts.  The sequence of events at 6am is as follows:

    1. The transaction log backup runs each hour.  At 6am this backup completed at 6:00:47 - there was very little to back up.

    2. At 6:04:53 the system reports that the connection to the mirror has timed out after 10 seconds without a response.

    3. From there you see the other error messages, and then the CPU goes to 100% utilization and the server denies other connections.

    4. The security context error is just a side effect of the CPU being at 100% - that's our web server trying to get to the db server.

    We would expect that at step 2 the mirroring would be paused or broken - but it is not.  To us that appears to be a bug.  It's as if it continues to wait and never breaks or pauses.  By the way - the mirror event log reports the same timeout error.

    There is nothing else going on on the db server at 6am.  We believe there was probably some network traffic between the db server and the mirror that caused the timeout - we plan to move them to their own network so that part wouldn't happen.  What concerns us though is the mirroring behavior - again - if the network was unavailable it should have just paused the mirroring session and/or broken it - right?

    Another thought: this box has 4 GB of memory in it and uses just about all of it.  It keeps roughly 50 to 100k available.  I've seen other posts about a bug in SQL 2005, to be fixed in SP2, where strange things happen with low physical memory - is this possibly a related bug?

    Anyway - we'd appreciate any expert opinions as mirroring is great - but having a production server lockup is not.

    Thanks!

    Mark

  • Friday, November 03, 2006 4:14 PM
     
     

    Mark,

    I was not talking about SQL backup. I was talking about something like a Backup.exe agent backing up files at the operating system level.

    Even so, I do not understand the logic behind backing up a log file every hour in high availability mode. Can you please give details: How do you back up the logs? Is it a job? What backup scheme are you using, simple or full? Is the same job running every hour, or are they different? In other words, WHAT IS IT that makes the backup at 6 am so special and different from all the other backups? And another wild thought: it may be worth turning off the 6 am backup to check whether the issue is related.

    I would consider the following sequence of events:

    File locked on principal -> principal cannot establish connectivity to the mirror (CPU load goes to 100%) -> witness sees the principal, can ping it, and does not tell the mirror to become the principal -> your mirror is broken.

    That is what I meant by "bouncing" mode.

    P.S. An antivirus run or a WUS session can do it as well.

  • Friday, November 03, 2006 5:31 PM
     
     

    Hi Glen -

    Thanks for the comments.  Unfortunately, we don't have backup.exe or any other backup system running - just sql's normal transaction log backup.  I should add that the sql backup job runs hourly just backing up transaction logs and at midnight it does a full db backup.  So the file wouldn't ever have been locked except by sql server itself.

    We know that our network does have a lot of traffic on it, but it didn't look like there was enough to cause a timeout. Also, it could just as easily happen at noon as at 6am - which would be really bad.   The only other time it happened was at 6:10pm - again not a high traffic time for us - and again when backup would not be running.  

    We're changing the network configuration - but we can never guarantee there won't be blips of 10 seconds on that either.  So what is scary to us is that mirroring would do what it did.  Also, just fyi, this is not configured with a witness - nor automatic failover - we have to manually make the mirror the principal.

    The memory "bug" that was mentioned in another thread also has us wondering if this is somehow related.  It could have been that low memory caused the beginning of the lockup, which caused the network timeout, which caused the lock-up etc.  We're just not sure.

    So if anyone working on the SP2 issues knows whether this is a potential similar problem that would be great.  In the meantime, we'll adjust the memory usage and put it on a different network.

    Thanks!

    Mark

  • Thursday, November 09, 2006 2:49 PM
     
     

    I think you already alluded to this...but it may help to isolate your mirror network traffic on a mirroring-only network. That way it does not have to compete with your application traffic. This can be done cheaply by simply using a crossover cable to connect two NICs in your database servers. If you introduce a witness, you can then add a switch allowing all three computers to use this separate network.

    This ought to be something you can simulate. It appears consistent that the primary hangs whenever it cannot find the mirror. So, take the mirror offline while you are on site and running diagnostic software on the primary. That way you can see what is going on. If it doesn't happen when you force the failure, then you know it has to be something specific to those times of day.

    Have you tried increasing the time for mirror connection failure? It looks like you have it set to 10 Seconds. Would it hurt you to increase this? I am guessing not, since you are not using immediate failover, and do not have a witness. So, give the servers a little more time to establish that the Mirror is really unavailable. Set it to 15+ seconds, maybe.
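
    As a rough illustration of the timeout change, the sketch below builds the `ALTER DATABASE ... SET PARTNER TIMEOUT` statement for each mirrored database. The helper name `partner_timeout_sql` is hypothetical, the database names are taken from the error log above, and 15 seconds is only the value suggested here, not a tested recommendation. As far as I know the statement has to be run on the principal; the sketch only prints the T-SQL and does not connect to anything.

```python
def partner_timeout_sql(database, seconds):
    """Build the T-SQL that sets the mirroring partner timeout for one database."""
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive whole number of seconds")
    # Bracket-quote the name, doubling any closing bracket inside it.
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"ALTER DATABASE {quoted} SET PARTNER TIMEOUT {seconds};"


# Database names as they appear in the posted error log.
for db in ("NewScribe", "HL7Transfer"):
    print(partner_timeout_sql(db, 15))
```

    Each mirrored database has its own session, so the statement is issued once per database.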

    One thing you may consider is your mirror authentication method.

    Finally, why the transaction dumps every hour? You have a hot failover. If you are backing up to send off site, would a daily full database backup just before sending off site not be sufficient? I'm not saying your hourly tran dumps are not useful...they could be used with the previous offsite full database backup in case you needed to revert to a point in time due to user data errors. But that is the scenario they protect against, not disaster recovery.

    Just my 2 cents.

    Ben

  • Tuesday, November 14, 2006 4:41 PM
     
     

    We experience the same problem - but without mirroring:

    SQL Server 2005 64 Bit
    Windows Server 2003 64 Bit
    Failover Cluster
    16 GB RAM, 4 CPU

    1. Transaction Log Backups at 17:30:00 (small, hourly)
    2. CPU grows to 100%, Clients receive timeouts at 17:32:52
    3. Unable to load SSL provider library messages in SQL Server log

    It seems that this problem is not related to mirroring but a general problem of SQL Server 2005 64 Bit.

    Have you found any more information about this?

    Thanks
    -Benedikt

  • Thursday, March 22, 2007 9:21 AM
     
     
    Same here, no mirroring, just a 64 bit failover cluster.

    SQL Server 2005 Standard 64-bit
    Windows 2003 Server 64 bit
    Failover cluster
    64 GB RAM, 2 CPUs with 2 cores each.

    Don't think it's related, but there are transaction log backups every 5 minutes (mostly a few hundred megabytes, up to a few gigabytes depending on activity).

    This happens when we launch the application server that uses the database. 200 processes on around 95 computers start connecting at once, CPU goes to 100% and I start getting those SSL errors. Even with the timeout set to 180 seconds in the connection string, many of the application servers still time out.

    Maybe this is just a load related issue?

  • Tuesday, May 22, 2007 3:08 PM
     
     

    I just saw this issue myself on one of our production servers...

     

    SQL Server 2005 64-bit SP1
    Windows Server 2003 64-bit R2 SP1
    4 x 2.8 GHz CPUs
    6 GB RAM

     

    I'm doing 15-minute transaction log backups on 10 dbs, soon to be about 15-17 dbs. I was able to access the server via Terminal Services, and it wasn't pegged at 100% CPU utilization. I tried stopping the SQL Server service because I didn't want to reboot, but it wouldn't stop. SQL Server was the only service that wasn't responding; I was able to stop SQL Server Agent. I finally gave up and rebooted, and the server seems fine... for now.

     

  • Friday, August 21, 2009 9:09 AM
     
     
    Does anyone have more information on this matter? We're having the same problem even after upgrading to SQL Server 2005 SP3.

    We are running database mirroring between 2 servers with Windows 2003 x64 and SQL Server 2005 64-bit.

    Thanks!

    Lars
  • Sunday, August 23, 2009 6:41 PM
     
     
    Hi Lars,

    Could you check whether TCP Chimney, TCPA, and RSS are disabled on each server?


    The last time I had this issue, I followed KB article 942861: Error message when an application connects to SQL Server on a server that is running Windows Server 2003: "General Network error," "Communication link failure," or "A transport-level error"
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;942861

    Cheers,
    Michel Degremont. http://blogs.technet.com/mdegre
  • Thursday, March 17, 2011 8:02 PM
     
     
    I know it's too late for this answer, but I solved this problem by checking the status of the SQL Server service account, which had lost the Log On as a Service permission. So the next time you see this behavior with SQL Server mirroring, check your accounts.