Help diagnosing a SQL 2008 R2 freeze
-
Tuesday, March 06, 2012 9:06 PM
Earlier today I had a SQL 2008 R2 SP1 instance running on a vmWare VM freeze up on me. It was not totally frozen and for a while I could do some diagnosis before it completely froze. It should be noted that the VM itself never froze. Only SQL behaved abnormally. Here is what I observed.
- Numerous sessions were waiting on locks and blocked by other sessions.
- Tracing the "lock chains" (for lack of a better term) to their end always ended at a session waiting on a write to a transaction log.
- Review of error log showed that 11 hours before several database transaction logs were full and could not be expanded due to lack of disk space.
- A a few minutes later a maintenance plan ran that backed up all transaction logs and shrunk them.
- There were no more errors in the error log after the log backups completed.
- The drive containing the logs was dynamically expanded (we've done this many times before on a running VM) 8 hours after the errors appeared in he error log and while the server seemed to be behaving normally.
- There was a snapshot on the VM that had to be removed before expanding the drive.
- 3 more hours transpired before things started freezing.
- Trying to restart SQL Server or the Agent with services.msc resulted in an error message that the service could not be controlled.
We rebooted the server and it has worked correctly since. What could have caused this?
Chuck
All Replies
-
Wednesday, March 07, 2012 12:56 AM
Since your databases are presumably in full recovery mode do you have regular log backups setup?, the maintenance plan you mention is it scheduled and if so how often? Also when you saw the problem do you have any large processes running.
Thanks
/Neil Moorthy - Senior SQL Server DBA/Developer (MCITP (2005/2008), MCAD, ITILv3, OCA 11g) Please click the Mark as Answer button if a post solves your problem
-
Wednesday, March 07, 2012 3:55 AMModerator
Earlier today I had a SQL 2008 R2 SP1 instance running on a vmWare VM freeze up on me. It was not totally frozen and for a while I could do some diagnosis before it completely froze. It should be noted that the VM itself never froze. Only SQL behaved abnormally. Here is what I observed.
- Numerous sessions were waiting on locks and blocked by other sessions.
- Tracing the "lock chains" (for lack of a better term) to their end always ended at a session waiting on a write to a transaction log.
- Review of error log showed that 11 hours before several database transaction logs were full and could not be expanded due to lack of disk space.
- A a few minutes later a maintenance plan ran that backed up all transaction logs and shrunk them.
- There were no more errors in the error log after the log backups completed.
- The drive containing the logs was dynamically expanded (we've done this many times before on a running VM) 8 hours after the errors appeared in he error log and while the server seemed to be behaving normally.
- There was a snapshot on the VM that had to be removed before expanding the drive.
- 3 more hours transpired before things started freezing.
- Trying to restart SQL Server or the Agent with services.msc resulted in an error message that the service could not be controlled.
We rebooted the server and it has worked correctly since. What could have caused this?
Chuck
Being able to diagnose the root cause after rebooting the VM is going to be next to impossible. I would suspect that it was a combination of problems that just resulted in instability overall, but it was likely rooted in the snapshot of the VM, which is not actually supported by Microsoft as a side note.
You have a number of problemattic things in what you listed above, starting with having a maintenance plan that shrinks log files:
http://www.karaszi.com/SQLServer/info_dont_shrink.asp
http://www.straightpathsql.com/blog/2009/1/6/dont-touch-that-shrink-button.html
http://www.straightpathsql.com/blog/2009/1/31/shrinking-is-a-popular-topic.html
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Transaction-Log.aspx
http://technet.microsoft.com/en-us/magazine/2009.02.logging.aspx
You need to plan your log backups appropriately to mantain the size of the log file. Having free space in the log is not a bad thing, but shrinking the log file routinely is, since it has to be zero initialized every time it grows, a very expensive operation. Set your logs to auto grow by a fixed size, not a percent, then resize them so that they never actually have to grow under normal workloads, and make sure you minimize the VLF count for performance when you resize the log:
http://www.sqlskills.com/BLOGS/KIMBERLY/post/Transaction-Log-VLFs-too-many-or-too-few.aspx
Auto grow should be on, but it should not be necessary for the log. If you are going to snapshot the VM, you need to keep in mind the I/O impacts the snapshot might have on the underlying subsystem and you need to make sure that your snapshot quieces activity across all your VMDK/VHD's or attempts to rollback could result in database corruption just like incorrectly taken SAN snapshots of large physical environments will. Dynamically expanding volumes isn't a problem typically.
It could be that your problem was entirely separate from any of this, and something else going on within the host hypervisor itself, for example ballooning the VM, or worse host level paging of the VM that was causing it to appear frozen. It is really impossible to tell without having access to all the information from that point in time.
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem!- Proposed As Answer by Peja TaoModerator Thursday, March 08, 2012 2:51 AM
-
Wednesday, March 07, 2012 1:04 PM
Since your databases are presumably in full recovery mode do you have regular log backups setup?, the maintenance plan you mention is it scheduled and if so how often? Also when you saw the problem do you have any large processes running.
Thanks
/Neil Moorthy - Senior SQL Server DBA/Developer (MCITP (2005/2008), MCAD, ITILv3, OCA 11g) Please click the Mark as Answer button if a post solves your problem
Yes. The maintenance plan that I mentioned runs every 30 minutes. After it completed I saw no more errors in the error log.
There were no large processes running when I saw the problem. I first noticed the problem when I myself tried to save a stored procedure in the master database that I was working on. The "alter procedure" hung.
Other than backing up the logs (which happened automatically) is there anything else I would need to do to correct the initial event (full transaction logs)?
Chuck
-
Wednesday, March 07, 2012 1:14 PM
I think if there was host level paging I would have noticed performance problems with Windows but I didn't see anything like that. The only time the Windows became unresponsive was 3 hours before SQL started freezing up, while the snapshot was being removed. That lasted maybe a minute and then performance returned to normal.
FYI we use Veeam to backup the server and that was the purpose of the snapshot. Veeam takes a snapshot, backs it up, then deletes it. This happens every day on dozens of VMs without incident.
Chuck
-
Wednesday, March 07, 2012 4:32 PM
I was mistaken about the maintenance plan shrinking the logs. It only backs them up. It does not shrink them.
I did notice a few messages in the error log regarding the master.mdf file
2012-03-06 00:36:24.01 spid107 Error: 17053, Severity: 16, State: 1. 2012-03-06 00:36:24.01 spid107 D:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf:MSSQL_DBCC50: Operating system error 112(There is not enough space on the disk.) encountered. 2012-03-06 00:36:24.07 spid107 DBCC CHECKDB (master) WITH no_infomsgs executed by BIODOMA05\biosqlsvc found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 2 seconds. Internal database snapshot has split point LSN = 0000011d:000001a3:0001 and first LSN = 0000011d:000001a2:0001.
This was about 12 hours before sessions began hanging. Only one of the hung sessions (my own) was in the master db.
There was also a message about msdb's log not being able to grow. Shouldn't this error have cleared though once space was freed up? MSDB and master both use simple recovery model.
Chuck
- Edited by chuckh1958 Wednesday, March 07, 2012 4:34 PM

