dbcc checkdb errors come and go

  • Question

  • Really odd issue on our SQL Server 2008 R2 SP3 (10.50.6529.0) instance running on Windows Server 2008 R2 SP1, which runs as a VM on our Windows Server 2012 Hyper-V host.  This is a small SQL system, nothing fancy (no clusters, simple volumes, etc.) with about 15 users and 5 databases of around 5 GB each.

    Before our log backup, we perform a basic DBCC CHECKDB (no options) to ensure the databases are good.  Generally this works, but occasionally DBCC CHECKDB fails with consistency errors.  We immediately take steps to investigate, but the CHECKDB results seem to vary with each run; here's an example:

    Msg 8909, Level 16, State 1, Line 1
    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID -396284812572229632 (type Unknown), page ID (1:496621) contains an incorrect page ID in its page header. The PageId in the page header = (28801:487589056).
    CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 832722019, index ID 1, partition ID 72057604743364608, alloc unit ID 72057604748541952 (type In-row data): Page (1:496621) could not be processed.  See other errors for details.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 832722019, index ID 1, partition ID 72057604743364608, alloc unit ID 72057604748541952 (type In-row data). Page (1:496621) was not seen in the scan although its parent (1:487329) and previous (1:496620) refer to it. Check any previous errors.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 832722019, index ID 1, partition ID 72057604743364608, alloc unit ID 72057604748541952 (type In-row data). Page (1:496622) is missing a reference from previous page (1:496621). Possible chain linkage problem.
    CHECKDB found 0 allocation errors and 3 consistency errors in table 'OPERATION' (object ID 832722019).
    CHECKDB found 0 allocation errors and 4 consistency errors in database 'VMFG'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (VMFG).

    Msg 8909, Level 16, State 1, Line 1
    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID -396284812572229632 (type Unknown), page ID (1:67073) contains an incorrect page ID in its page header. The PageId in the page header = (28801:487589056).
    CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 1396200024, index ID 1, partition ID 72057604739563520, alloc unit ID 72057604744740864 (type In-row data): Page (1:67073) could not be processed.  See other errors for details.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 1396200024, index ID 1, partition ID 72057604739563520, alloc unit ID 72057604744740864 (type In-row data). Page (1:67073) was not seen in the scan although its parent (1:480322) and previous (1:67072) refer to it. Check any previous errors.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 1396200024, index ID 1, partition ID 72057604739563520, alloc unit ID 72057604744740864 (type In-row data). Page (1:67074) is missing a reference from previous page (1:67073). Possible chain linkage problem.
    CHECKDB found 0 allocation errors and 3 consistency errors in table 'INVENTORY_TRANS' (object ID 1396200024).
    CHECKDB found 0 allocation errors and 4 consistency errors in database 'VMFG'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (VMFG).

    --------

    Same number of errors but different tables; it CHANGES each run.  Another odd thing: when these errors show up, ALL databases on this instance report errors, even the static year-end databases that no one is accessing:

    Msg 8909, Level 16, State 1, Line 1
    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID -396284812572229632 (type Unknown), page ID (1:221998) contains an incorrect page ID in its page header. The PageId in the page header = (28801:487589056).
    CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 1396200024, index ID 1, partition ID 72057602564816896, alloc unit ID 72057602569863168 (type In-row data): Page (1:221998) could not be processed.  See other errors for details.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 1396200024, index ID 1, partition ID 72057602564816896, alloc unit ID 72057602569863168 (type In-row data). Page (1:221998) was not seen in the scan although its parent (1:141065) and previous (1:221997) refer to it. Check any previous errors.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 1396200024, index ID 1, partition ID 72057602564816896, alloc unit ID 72057602569863168 (type In-row data). Page (1:221999) is missing a reference from previous page (1:221998). Possible chain linkage problem.
    CHECKDB found 0 allocation errors and 3 consistency errors in table 'INVENTORY_TRANS' (object ID 1396200024).
    CHECKDB found 0 allocation errors and 4 consistency errors in database 'YE2016'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (YE2016).

    -------------

    Even the system databases such as msdb show errors when this is occurring.  We try to follow up on some of the errors to verify the tables, but they seem OK when we check; the fact that the errors keep changing each run makes it hard to track down.  We actually can't find any corruption at all, and the databases restore properly.
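
    For reference, the kind of spot check we run against a table named in the output looks roughly like this (the dbo schema is an assumption on my part):

    -- Spot-check just the table named in the errors, rather than the whole database
    USE VMFG;
    GO
    DBCC CHECKTABLE ('dbo.OPERATION') WITH NO_INFOMSGS;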

    Then at some point, after running CHECKDB a few times (for testing), it just "fixes" itself.  All databases magically check clean with no errors - I can't figure out why; we really don't do anything other than keep checking.  Our event logs are full of SQL dumps (I assume each CHECKDB failure generates one?), and it's just overwhelming how much data this issue generates.

    I have tried the following:

    1) Verified the "disks" on the SQL Server.  Because it is a VM, the "drives" are all VHDX files on the Hyper-V box (on our SAS RAID6 and SSD RAID1 arrays); not sure how useful this is, but they came up clean.

    2) Looked for updated drivers on the SQL Server, but as it is a VM, all drivers are the latest from Microsoft.

    3) Did RAM tests on the VM (not sure how useful that is?).

    4) Did full disk/RAID/memory checks on the Hyper-V server.  This server runs several other VMs, including our domain/file and Exchange servers, and they have not experienced any issues.  The only thing unique about the SQL Server is that it is the only VM that uses the SSD RAID1 array on this box (for the databases) - I will change that this weekend to see if that is the cause.  I did do SQLIO tests on the SSD array before I put it into production and it worked flawlessly.

    5) In my Hyper-V backup program I "cloned" the entire SQL Server VM while it was in this state, then booted that up and ran CHECKDBs, and it came up clean for all databases.  Ran checks for a couple of days and it was still clean.

    6) Verified we don't have any extraneous programs on the SQL Server; we don't run antivirus on it - it's basically just Windows 2008 R2 + SQL Server and is fully patched.

    Any other ideas what to check?  I'm at a loss here; it seems like it's a bug or a memory/corruption error, but why are all databases affected?  Thanks.


    -- Al

    Friday, September 29, 2017 5:08 PM

All replies

  • 5) In my Hyper-V backup program I "cloned" the entire SQL Server VM while it was in this state, then booted that up and ran CHECKDBs, and it came up clean for all databases.  Ran checks for a couple of days and it was still clean.

    By "clone", do you mean export/import? Did you run it on the same Hyper-V host? That suggests to me there might be a problem with the Hyper-V image and the solution is to use the cloned version, which seems to have repaired the problem during the process. I'm no Hyper-V expert but that's what I would try.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Friday, September 29, 2017 6:05 PM
  • The clone is a restore of a "point in time" backup of the SQL Server VM using Windows VSS/images, so it should be identical to the VM.  That's why it is so weird that it doesn't show the issue (although it may yet and I just haven't seen it; the real SQL Server ran for many months before the issue showed up again).

    At this point the issue has gone away on the actual SQL Server as well; again, I did nothing...???  Not sure what to do now.


    -- Al

    Friday, September 29, 2017 6:54 PM
  • Hi AMBW,

    The above message shows that a clustered index page (index ID = 1) is corrupted, so you may need to use the DBCC CHECKDB (YE2016, REPAIR_ALLOW_DATA_LOSS) command to repair that database. However, please note that once you take this approach, it can cause data loss.
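
    If you do decide to go that route, take a full backup first; repair also requires single-user mode. A minimal sketch:

    -- REPAIR_ALLOW_DATA_LOSS can discard data, so back up the database before attempting it
    ALTER DATABASE YE2016 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB (YE2016, REPAIR_ALLOW_DATA_LOSS);
    ALTER DATABASE YE2016 SET MULTI_USER;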


    Saturday, September 30, 2017 9:43 AM
  • The above message shows that a clustered index page (index ID = 1) is corrupted, so you may need to use the DBCC CHECKDB (YE2016, REPAIR_ALLOW_DATA_LOSS) command to repair that database. However, please note that once you take this approach, it can cause data loss.

    If you work for Microsoft you should know better than to suggest that people run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS. If there really is permanent corruption, it is oftentimes better to restore from a clean backup.
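
    Just for illustration (the backup path here is a placeholder, not something from this thread), a restore from the last known-good full backup would look something like this:

    -- Replace the corrupt database with a known-good backup; the file path is a placeholder
    RESTORE DATABASE YE2016
    FROM DISK = N'D:\Backups\YE2016_full.bak'
    WITH REPLACE, RECOVERY;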

    In Al's case, it appears that there are some transient problems with the I/O subsystem. As he does not seem to get these errors when he is actually using the database, the problems only seem to be occurring with the hidden snapshot that DBCC works from.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, September 30, 2017 10:16 AM
  • Agreed, I would never use the repair options if I could avoid it.

    The fact that the "errors" come and go is perplexing (currently they are ALL gone).  Your comment regarding corruption in the hidden snapshot is interesting and may actually be a good theory.  Do you know where this snapshot is "stored" (i.e., in memory or on disk) and how I could go about finding the issue?  I gather a CHECKDB with the TABLOCK option would not use this type of snapshot, so perhaps I should switch to that type of check (these checks are done early in the morning, before the backup, when no one is supposed to be accessing the databases).


    -- Al

    Sunday, October 1, 2017 8:46 PM
  • The hidden snapshot is on disk, and it is a sparse file. That is, as long as a page is unchanged, access goes to the page in the original database file. Only if a page is changed is the original page copied to the sparse file.

    When you create a regular database snapshot, this file is visible in the file system. But these hidden snapshots are created as an alternate stream. (Or they were until recently; I think I saw something about a change.)
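
    For comparison, an explicit (visible) snapshot is created like this; the logical file name and path below are just placeholders, not your actual layout:

    -- A regular database snapshot creates a visible sparse file on disk
    CREATE DATABASE VMFG_snap
    ON ( NAME = VMFG_data,                     -- logical name of the source data file (assumed)
         FILENAME = 'D:\Data\VMFG_snap.ss' )   -- placeholder path for the sparse file
    AS SNAPSHOT OF VMFG;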

    I can only speculate, but it could be that something is wonky in your environment so that this does not work properly. When googling a little bit in order to find a good reference for you, I found an old blog post from Paul Randal. I don't think those particular issues apply to you, but it gives an idea of how things could go wrong.

    Yeah, you could try the TABLOCK version for a while and see if that keeps the errors away entirely.
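
    As a sketch, that would be something like the following per database (TABLOCK makes CHECKDB take locks instead of using the internal snapshot, so it is best run when the database is quiet):

    -- Uses table locks rather than the internal database snapshot; run off-hours
    DBCC CHECKDB (VMFG) WITH TABLOCK, NO_INFOMSGS;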


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, October 1, 2017 9:57 PM
  • Was this snapshot produced so that SQL Server was aware of the snapshot? I.e., did you see "I/O frozen" messages in the SQL Server errorlog file? That is the correct way to produce a snapshot for SQL Server: making sure that the host communicates that it is about to take the snapshot (the option might be called something like "application aware" or similar), using the SQL Server VSS Writer service.
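
    A quick way to look for those messages is to search the current errorlog, for example (sp_readerrorlog is an undocumented but commonly used wrapper around xp_readerrorlog):

    -- Search the current errorlog (log 0, type 1 = SQL Server errorlog) for "frozen" messages
    EXEC master.dbo.sp_readerrorlog 0, 1, 'frozen';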

    Tibor Karaszi, SQL Server MVP (Web Blog)

    Monday, October 2, 2017 1:22 PM
  • I assume the snapshots are created in the same folder as the databases?  Nothing to do with tempdb?

    Since I'm using Hyper-V, I guess I can assume the issue lies in the storage controller driver on the Hyper-V server where the virtual hard disk resides?

    I'll also check the version of the Microsoft Integration Services components used by the VM, but I'm fairly sure they are up to date.

    Thanks.


    -- Al

    Monday, October 2, 2017 3:34 PM
  • "I assume the snapshots are created in the same folder as the databases?  "

    Correct. Notice that we have two types of snapshots going on here.

    You have the snapshot created by the VM. Did you verify that you have "I/O frozen" messages? If not, then you have likely found the reason for the corruption.

    Then you have the *SQL Server* snapshot, produced by *SQL Server* when doing DBCC CHECKDB. This is done entirely by SQL Server, utilizing sparse files in the file system (one for each data file, in the same folder as the data file - as of SQL Server 2014 these are visible while CHECKDB is executing; in earlier versions it was a hidden alternate stream file).


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Monday, October 2, 2017 6:36 PM
  • All I could think of was to update the RAID driver on the Hyper-V server for the disks that the databases reside on.  Everything on the VM uses Microsoft virtual drivers that are up to date.  I would think that this issue has to be related to SQL Server or the file system, so I'm not sure this will help.  So far so good, but it has run clean before, and I have no idea what triggers it.  Not sure if there is any additional logging I could turn on to get more info if it happens again?  Thanks.
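
    One thing I may start watching (just an idea on my part) is msdb.dbo.suspect_pages, where the engine records pages it has flagged as suspect:

    -- Pages SQL Server has flagged as suspect (823/824 errors, bad checksums, torn pages, etc.)
    SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
    FROM msdb.dbo.suspect_pages
    ORDER BY last_update_date DESC;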

    -- Al

    Wednesday, October 4, 2017 4:12 PM