Corruption she (the DBCC) wrote….

I recently received an email from a friend asking for my help with a problem he was facing. He runs a small business in the healthcare industry, so his IT needs are, and should be, minimal, if you want. And that's expected.

Of course, I asked what the problem was about and he said: "We had our systems down for quite a while, and starting two days ago, we cannot access our database anymore." Long story short, it looked like database corruption of some sort. I paid him a visit and had a look through the logs. Wow… loads of mini-dumps all over the SQL error log, similar to this one:

The affected database seemed OK: it was online and, from what I was able to tell, readable. I could read the metadata and a few user tables (spot checks). But looking in the Windows logs, I found an entry that usually sends cold chills up any DBA's spine and makes them sweat instantly: error 824.
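For anyone chasing the same symptom: pages that have hit 824 (and similar) errors are also recorded in msdb, so a quick look there shows which databases and pages are involved. This is simply how I would check it, not a query taken from the original troubleshooting:

-- Pages SQL Server has flagged as suspect (823/824/checksum errors), most recent first.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;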

I started asking for a bit of history and found out that, lately, the computer had been running slower and slower, and two days earlier it had frozen, so they decided to give CHKDSK a go. Now, to have the full picture: the computer was not server hardware, but rather an all-in-one desktop computer playing in the big league…

Let me take a break here and explain why I wouldn't recommend having a desktop computer act as both desktop and server. The main reason: there's a human using it day in, day out, for all kinds of stuff: Excel spreadsheets, browsing the Internet, Word documents, Solitaire, you name it, it's there. Plus the antivirus. Plus the occasional freeze, when humans do what they overhear on buses and in the street: power it down and then power it up. Or reset it, if you're lucky enough to own a computer that still has a reset button.

For really small offices, such as in this case, a reasonably spec'd desktop could do the job, provided the human factor is out of the equation and the hardware is fairly reliable.

Back to the case. I started wondering how a CHKDSK could cause such havoc, because it shouldn't. And here's the gotcha: the drive (yes, the only drive in the system) had been removed, set up as a slave in another computer, and CHKDSK was able to finish its job there. Hurray, job done. Well, sort of. Because after the drive was installed back in, the second SQL Server started up, bang! The problems started.

When I learned these last details, everything started to make sense. It didn't really matter anymore why the computer was running slower and slower (eventually I figured that out too, I guess; read on); the only thing that mattered now was that the database *was* corrupted. And as it turned out later, it was corrupted big time.

Ran a DBCC CHECKDB and got these beauties back, after only 8 minutes:

DBCC results for ‘DBNAME’.
Warning: NO_INDEX option of checkdb being used. Checks on non-system indexes will be skipped.
Msg 8921, Level 16, State 1, Line 10
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Msg 8998, Level 16, State 2, Line 10
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 5 pages from (1:695568) to (1:703655). See other errors for cause.
Msg 8998, Level 16, State 2, Line 10
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 5 pages from (1:703656) to (1:711743). See other errors for cause.
Msg 8939, Level 16, State 5, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_headerVersion == HEADER_7_0) failed. Values are 104 and 1.
Msg 8939, Level 16, State 7, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_freeData >= PageHeaderOverhead () && m_freeData <= (UINT)PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 0 and 8192.
Msg 8939, Level 16, State 11, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_ghostRecCnt <= m_slotCnt) failed. Values are 29811 and 0.
Msg 8939, Level 16, State 5, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 362258295026614272 (type Unknown), page (1:703656). Test (m_headerVersion == HEADER_7_0) failed. Values are 104 and 1.
Msg 8939, Level 16, State 7, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 362258295026614272 (type Unknown), page (1:703656). Test (m_freeData >= PageHeaderOverhead () && m_freeData <= (UINT)PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 0 and 8192.
CHECKDB found 7 allocation errors and 3 consistency errors not associated with any single object.
CHECKDB found 7 allocation errors and 3 consistency errors in database ‘DBNAME’.
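For reference, output of this shape comes from a run along these lines. I'm reconstructing the command from the output above (the NO_INDEX warning implies the NOINDEX option), so treat it as a sketch rather than the exact statement used:

-- Consistency check that skips checks on non-system indexes, hence the NO_INDEX warning above.
DBCC CHECKDB (N'DBNAME', NOINDEX) WITH ALL_ERRORMSGS;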

I had come across corruption, I think, twice before in my career, and I was lucky enough to have it fixed in almost no time. In one case some indexes got corrupted, and in the other it was some third-tier table that we were able to rebuild completely from scratch. But this one, for me, was a first. "Master, I have a bad feeling about this," I said to myself. I did a quick search on the errors, and almost all, if not all, of the results pointed to a non-repairable case, where the only salvation would be a reliable backup.

Yes, you guessed it: there was none. As in never, not since the database was created.

So, I created a backup of the database and of the tail-log (well, in the reverse order) and it was time to get my hands dirty.
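For the curious, backups of a damaged database in that order can be taken roughly like this; the paths below are placeholders and the exact options used here aren't quoted in the post:

-- Tail-log backup first, without truncating the log, even though the database is damaged.
BACKUP LOG [DBNAME]
    TO DISK = N'D:\Rescue\DBNAME_taillog.trn'
    WITH NO_TRUNCATE, CONTINUE_AFTER_ERROR;

-- Then a full backup of the damaged database, so there is a copy to experiment on.
BACKUP DATABASE [DBNAME]
    TO DISK = N'D:\Rescue\DBNAME_damaged.bak'
    WITH CONTINUE_AFTER_ERROR;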

I restored the backup on a virtual machine and started working. Two things I noticed: the VM was running slower and slower, and the drive where I had the SQL binaries was filling up quite quickly. Then I remembered those mini-dumps. Checking their default location, big surprise: that's what was causing the drive to fill up. Thousands and thousands of files. Because there's no way of stopping those, I added a new drive to the VM, changed the settings for the mini-dumps to point to that drive, and let it fill up (SQL will not stop anyway when that drive fills up; it will just roll over the files). That would explain the symptom my friend noticed in the first place; I guess the corruption had been lurking around for some time.

To stop these dumps from being generated, I switched the corrupted DB to read-only. But there's a catch to switching it to read-only, at least that's what I noticed: it seemed to limit my ability to read the records that were still salvageable from the corrupted tables. As soon as I switched it back to R/W, I was able to read whatever records were still accessible. I played around with this a bit and there seems to be a direct correlation. Once that was cleared out of the way, I moved on to recovering the data, or what was still recoverable.
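Flipping the database between the two states is a one-liner each way; a minimal sketch, using the same database name as above:

-- Switch the damaged database to read-only, which stopped the mini-dumps from piling up.
ALTER DATABASE [DBNAME] SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- ...and back to read/write when the read-only state got in the way of reading the salvageable rows.
ALTER DATABASE [DBNAME] SET READ_WRITE WITH ROLLBACK IMMEDIATE;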

I started by scripting out the database. Once that was done, I put together a script to export everything I could still read. This way, I managed to isolate 9 tables that were seriously impacted. The big problem was that 2 of those were at the core of the whole application, and 2 tables held a lot of scanned documents, which are really hard to rescan. So yeah, no pressure.

The impacted tables were still readable up to a certain point, so I took advantage of that and exported whatever records were still readable; that was pretty much everything I could salvage. Now the reconstruction work could begin.

I used the script I had created when I scripted out the database and created an empty one. Once all the salvageable data was exported, I switched the damaged database to R/O. Exporting the unaffected tables was a breeze: I created a dedicated schema (call it [export]) and then I simply used SELECT * … INTO [export].[table] FROM [table], and that was it. For the affected ones, I created another schema (let's call it [staging]) and then created the required empty tables (see the sketch after the notes below). A couple of notes here:

  • I created the tables with no constraints whatsoever
  • I created the tables with generic column datatypes (VARCHAR(MAX)) to avoid any incompatibilities
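Here's the sketch, with made-up table and column names standing in for the real schema (where each schema lives, in the damaged copy or in the new empty database, depends on the setup):

-- Unaffected tables: a dedicated schema, then SELECT ... INTO copies structure and data in one go.
CREATE SCHEMA [export];
GO
SELECT * INTO [export].[SomeCleanTable] FROM [dbo].[SomeCleanTable];
GO

-- Affected tables: a staging schema with constraint-free, generic-typed copies.
CREATE SCHEMA [staging];
GO
CREATE TABLE [staging].[SomeDamagedTable]
(
    [Col1] VARCHAR(MAX) NULL,
    [Col2] VARCHAR(MAX) NULL,
    [Col3] VARCHAR(MAX) NULL
);
GO

-- Pull in whatever rows are still readable from the damaged table.
INSERT INTO [staging].[SomeDamagedTable] ([Col1], [Col2], [Col3])
SELECT [Col1], [Col2], [Col3]
FROM [dbo].[SomeDamagedTable];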

Once the staging tables were populated, I could start reconstructing whatever lost data it was possible to reconstruct. And here, it all depends on the data model and on what relations / foreign keys are in place; a lot of factors are at play. In some cases, that's not even possible.

The last step will be to import the salvaged data into the final tables. Here the data model comes into play: the tables have to be loaded in the right order so that all the FKs check out, and so on and so forth (one way to keep the constraints honest is sketched below). It's going to be a tedious process, but it has to be done. Or at least attempted :).
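A common way to do this kind of load (my assumption, not necessarily how it will end up being done here) is to disable the foreign keys on a table, load it, and then re-validate everything:

-- Disable all foreign key constraints on a target table before loading (placeholder name).
ALTER TABLE [dbo].[SomeTargetTable] NOCHECK CONSTRAINT ALL;

-- ... load the salvaged rows here ...

-- Re-enable the constraints; WITH CHECK forces SQL Server to validate the rows already loaded.
ALTER TABLE [dbo].[SomeTargetTable] WITH CHECK CHECK CONSTRAINT ALL;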

The final chapter of this is yet to be written: I'm at the stage where the staging tables are all populated and I have to try to reconstruct whatever data is possible. And that is a lot of manual work: studying the database diagrams, all the relationships between the tables, and so on.
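The catalog views help with that part. A small, generic query like the one below lists which table references which, which is a decent starting point for working out the load order:

-- Every foreign key, with the referencing (child) table and the referenced (parent) table.
SELECT
    fk.name AS foreign_key_name,
    OBJECT_SCHEMA_NAME(fk.parent_object_id) + '.' + OBJECT_NAME(fk.parent_object_id) AS referencing_table,
    OBJECT_SCHEMA_NAME(fk.referenced_object_id) + '.' + OBJECT_NAME(fk.referenced_object_id) AS referenced_table
FROM sys.foreign_keys AS fk
ORDER BY referenced_table, referencing_table;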

I also gave my friend a list of recommendations, which he was more than happy to follow, and we're now in the process of fixing things: buying new hardware for the server role, changing how things are done. It's good to see that people understand the risks, but it's sad to see it happen only after disaster has struck. The bad part of this story is that only 4%-5% of the impacted data was salvageable, and there's no money in the world that can make up for that. The green area in the chart below shows how many records I was able to recover from each affected table, out of how many there were. The biggest issues are with the tables in rows #1, #3, #4 and #8. All of them were important, but these were really all that mattered.

(Chart: RecoverPercentages, recovered vs. total records per affected table)

The added bonus to the whole incident? There's another instance on that computer which, fortunately, was not affected by this incident. YET. It runs on SQL Express edition and has some backups in place, but… they're on the same hard drive. If something goes wrong with the whole computer, both databases are doomed, no matter that the second one has some sort of backup.

Lessons learned?

No matter how small your database is, disaster *will* strike. It’s not IF, it’s WHEN. Just make sure you’re prepared:

  • Have proper backups in place (a minimal sketch follows this list)
  • Have a proper backup testing process in place (because having backups is only half of the protection)
  • Have the backup files stored off the box, on another computer at least, if not in the cloud or on a NAS
  • Check your logs, both SQL and Windows
  • Run periodic DBCC CHECKDB checks (how frequently is another topic)
  • Make sure you're looking for early signs of hardware failure (have hard drives with SMART capabilities or similar technologies)
  • Make sure the computer you're using as a server is as isolated as possible in terms of access and workload: no activities should take place on it other than SQL Server (or whatever other server processes you may have, but definitely no user activities)
  • Invest in some sort of data redundancy technology (software RAID 1 at least)
  • Monitor your hardware (HDD space, resource usage, etc.)
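As a minimal starting point for the backup and CHECKDB items above (paths and names are placeholders; scheduling, retention and restore testing are a topic of their own):

-- Full backup written to a share on another machine, with page checksums verified on the way out.
BACKUP DATABASE [DBNAME]
    TO DISK = N'\\SomeOtherBox\Backups\DBNAME_full.bak'
    WITH CHECKSUM, INIT;

-- Periodic consistency check; NO_INFOMSGS and ALL_ERRORMSGS keep the output focused.
DBCC CHECKDB (N'DBNAME') WITH NO_INFOMSGS, ALL_ERRORMSGS;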

All the best.
