
Corruption she (the DBCC) wrote….

I recently received an email from a friend asking for my help with a problem he was facing. He runs a small business in the healthcare industry and his IT needs are, and should be, minimalist, if you want. And that’s expected.

Of course, I asked what the problem was about and he said: “We had our systems down for quite a while and, starting 2 days ago, we cannot access our database anymore.”. Long story short, it looked like database corruption of some sort. I paid him a visit and had a look through the logs. Wow… Loads of mini-dumps all over the SQL Server error log, similar to this one:

The affected database seemed OK: it was online and, from what I was able to tell, readable. I could read the metadata and a few user tables (spot checks). But looking in the Windows logs, I found an entry that usually sends cold chills up any DBA’s spine and makes them sweat instantly: error 824.

I started asking for a bit of history and found out that lately the computer had been running slower and slower, and 2 days before they had had an incident where the computer froze, so they decided to give CHKDSK a go. Now, to have the full picture: the computer was not server hardware, rather an all-in-one desktop computer playing in the big league…

Let me take a break here and explain why I wouldn’t recommend having a desktop computer act as both desktop and server. Main reason: there’s a human using it day in, day out, for all kinds of stuff: Excel spreadsheets, browsing the Internet, Word documents, Solitaire, you name it, it’s there. Plus the antivirus. Plus the occasional freeze, when humans do what they overhear in buses and on the street: power it down and then power it up. Or reset it, if you’re lucky enough to own a computer that still has a reset button.

For really small offices, such as in this case, a reasonably spec’d desktop could do the job, provided the human factor is out of the equation and the desktop is fairly reliable.

Back to the case. I started wondering how a CHKDSK could cause such havoc, because it shouldn’t. And here’s the gotcha: the drive (yes, the only drive in the system) had been removed, set up as a slave in another computer, and CHKDSK was able to finish its job. Hurray, job done. Well, sort of, kind of. Because after the drive was installed back in, the second SQL Server started up, bang! The problems started.

When I learned these last details, everything started to make sense. It didn’t really matter anymore why the computer was running slower and slower (eventually I figured that out too, I guess – read on); the only thing that mattered now was that the database *was* corrupted. And as it turned out later, it was corrupted big time.

I ran a DBCC CHECKDB and got these beauties back after only 8 minutes:
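For reference, the check was run along these lines (a sketch; NOINDEX is what triggers the “NO_INDEX option” warning in the output, and DBNAME stands in for the real database name):

```sql
-- Consistency check, skipping checks on non-clustered indexes (faster, less thorough).
DBCC CHECKDB ('DBNAME', NOINDEX);
```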

DBCC results for ‘DBNAME’.
Warning: NO_INDEX option of checkdb being used. Checks on non-system indexes will be skipped.
Msg 8921, Level 16, State 1, Line 10
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Msg 8998, Level 16, State 2, Line 10
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 5 pages from (1:695568) to (1:703655). See other errors for cause.
Msg 8998, Level 16, State 2, Line 10
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 5 pages from (1:703656) to (1:711743). See other errors for cause.
Msg 8939, Level 16, State 5, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_headerVersion == HEADER_7_0) failed. Values are 104 and 1.
Msg 8939, Level 16, State 7, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_freeData >= PageHeaderOverhead () && m_freeData <= (UINT)PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 0 and 8192.
Msg 8939, Level 16, State 11, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_ghostRecCnt <= m_slotCnt) failed. Values are 29811 and 0.
Msg 8939, Level 16, State 5, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_headerVersion == HEADER_7_0) failed. Values are 104 and 1.
Msg 8939, Level 16, State 7, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_freeData >= PageHeaderOverhead () && m_freeData <= (UINT)PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 0 and 8192.
Msg 8939, Level 16, State 11, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 77687093572141056 (type Unknown), page (1:695568). Test (m_ghostRecCnt <= m_slotCnt) failed. Values are 29811 and 0.
Msg 8939, Level 16, State 5, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 362258295026614272 (type Unknown), page (1:703656). Test (m_headerVersion == HEADER_7_0) failed. Values are 104 and 1.
Msg 8939, Level 16, State 7, Line 10
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 362258295026614272 (type Unknown), page (1:703656). Test (m_freeData >= PageHeaderOverhead () && m_freeData <= (UINT)PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 0 and 8192.
CHECKDB found 7 allocation errors and 3 consistency errors not associated with any single object.
CHECKDB found 7 allocation errors and 3 consistency errors in database ‘DBNAME’.

I had come across corruption, I think, twice in my career, and I was lucky enough to have it fixed in almost no time. Once, some indexes got corrupted; in the other case it was some third-tier table that we were able to rebuild completely from scratch. But this one, for me, was a first. “Master, I have a bad feeling about this”, I said to myself. I did a quick search on the errors and almost all, if not all, of the results pointed to a non-repairable case, where the only salvation would be a reliable backup.

Yes, you guessed it, there was none. As in never, since the database was created.

So I created a backup of the database and of the tail-log (well, in the reverse order) and it was time to get my hands dirty.
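Something along these lines (a sketch, with illustrative names and paths): CONTINUE_AFTER_ERROR lets the backup proceed despite damaged pages, and NO_TRUNCATE captures the tail of the log even on a damaged database.

```sql
-- Tail-log first, then a full backup of the damaged database, warts and all.
BACKUP LOG [DBNAME]
    TO DISK = N'D:\Rescue\DBNAME_taillog.trn'
    WITH NO_TRUNCATE, CONTINUE_AFTER_ERROR;

BACKUP DATABASE [DBNAME]
    TO DISK = N'D:\Rescue\DBNAME_corrupt.bak'
    WITH CONTINUE_AFTER_ERROR;
```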

I restored the backup on a virtual machine and started working. Two things I noticed: the VM was running slower and slower, and the drive holding the SQL Server binaries was filling up quite quickly. Then I remembered those mini-dumps. Checking their default location, big surprise: that’s what was causing the drive to fill up. Thousands and thousands of files. Because there’s no way of stopping those, I added a new drive to the VM, changed the mini-dump settings to point to that drive, and let it fill up (SQL Server will not stop anyway when the drive is full, it will just roll over the files). That would explain the symptom my friend noticed in the first place; I guess the corruption had been lurking around for some time.

To stop the dumps from being generated, I switched the corrupted database to read-only. But there’s a catch to switching it to read-only, at least that’s what I noticed: doing so seemed to limit my ability to read the records that were still salvageable from the corrupted tables. As soon as I switched it back to read/write, I was able to read whatever records were still accessible. I played around with this a bit and there seems to be a direct correlation. Once that was out of the way, I moved on to recovering the data, or what was still recoverable.
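For reference, the switch I kept flipping while testing this behavior is a one-liner each way (database name is a placeholder):

```sql
-- Kill in-flight activity and make the database read-only.
ALTER DATABASE [DBNAME] SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- Back to read/write, to read the salvageable rows again.
ALTER DATABASE [DBNAME] SET READ_WRITE WITH ROLLBACK IMMEDIATE;
```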

I started by scripting out the database. Once that was done, I put together a script to export everything I could still read. This way, I managed to isolate 9 tables that were seriously impacted. The big problem was that 2 of those were at the core of the whole application, and 2 tables held a lot of scanned documents, which are really hard to rescan. So yeah, no pressure.

The impacted tables were still readable up to a certain point, so I took advantage of that and exported whatever records were still readable; that was pretty much everything I could salvage. Now the reconstruction work could begin.

Using the script I created when I scripted out the database, I created an empty one. Once all the salvageable data was exported, I switched the damaged database to R/O. Exporting the unaffected tables was a breeze: I created a dedicated schema (call it [export]) and then simply used SELECT * INTO [export].[table] FROM [table], and that was it. For the affected ones, I created another schema (let’s call it [staging]) and then created the required empty tables. A couple of notes here:

  • I created the tables with no constraints whatsoever
  • I created the tables with generic column datatypes (VARCHAR(MAX)) to avoid any incompatibilities
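Put together, the export step looked roughly like this (schema names are the ones above; table and column names are made up for illustration):

```sql
-- Schemas for the rescued data.
CREATE SCHEMA [export];
GO
CREATE SCHEMA [staging];
GO

-- Unaffected tables: one SELECT ... INTO per table.
SELECT * INTO [export].[Patients] FROM [dbo].[Patients];

-- Affected tables: pre-created with no constraints and generic datatypes,
-- then populated with whatever rows are still readable.
CREATE TABLE [staging].[Documents]
(
    DocumentID   VARCHAR(MAX) NULL,
    PatientID    VARCHAR(MAX) NULL,
    ScannedImage VARCHAR(MAX) NULL
);
```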

Once the staging tables were populated, I could start reconstructing whatever lost data was recoverable. And here everything depends on the data model, on what relations / foreign keys are in place; a lot of factors are at play. In some cases, reconstruction is not even possible.

The last step will be to import the salvaged data into the final tables. Here the data model comes into play: the tables must be loaded in the right order so all the FKs check out, and so on and so forth. It’s going to be a tedious process, but it has to be done. Or at least attempted :).

The final chapter of this is yet to be written: I’m at the stage where the staging tables are all populated and I have to try to reconstruct whatever data is possible. And that is a lot of manual work: studying the database diagrams, all the relationships between the tables, etc.

I also gave my friend a list of recommendations, which he was more than happy to follow, and we’re now in the process of fixing things: buying new hardware for the server role, changing how things are done. It’s good to see that people understand the risks, but it’s sad to see it only after disaster has struck. The bad part of this story is that only 4%-5% of the impacted data was salvageable, and there’s no money in the world that can make up for that. The green area below shows how many records I was able to recover from the affected tables, out of how many there were. The biggest issues are with the tables in rows #1, #3, #4 and #8. All of them were important, but these were really all that mattered.

RecoverPercentages

Added bonus to the whole incident? There’s another instance on that computer which, fortunately, was not affected by this incident. YET. It runs SQL Server Express Edition and has some backups in place, but… they’re on the same hard drive. If something goes wrong with the whole computer, both databases are doomed, no matter that the second one has some sort of backup.

Lessons learned?

No matter how small your database is, disaster *will* strike. It’s not IF, it’s WHEN. Just make sure you’re prepared:

  • Have proper backups in place
  • Have a proper backup testing process in place (because having backups is only half of the protection)
  • Store the backup files off that box, on another computer at least, if not on a NAS or in the cloud
  • Check your logs, both SQL Server and Windows
  • Run periodic DBCC CHECKDBs (how frequently is another topic)
  • Look for early signs of hardware failure (have hard drives with SMART capabilities or similar technologies)
  • Make sure the computer you’re using as a server is as isolated as possible, in terms of both access and workload
  • No other activities should take place on this server other than SQL Server (or whatever other server processes you may have, but definitely no user activities)
  • Invest in some sort of data redundancy technology (software RAID 1 at least)
  • Monitor your hardware (HDD space, resource usage, etc.)
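For the first two bullets, a minimal starting point might look like this (paths and names are illustrative; a real setup would schedule this via SQL Server Agent and also do regular test restores, not just verification):

```sql
-- Back up with page checksums, so damaged pages are caught at backup time,
-- and write to a share on another machine.
BACKUP DATABASE [DBNAME]
    TO DISK = N'\\OtherBox\Backups\DBNAME_full.bak'
    WITH CHECKSUM, INIT;

-- Part of the "testing" half: verify the backup is readable and the checksums hold.
RESTORE VERIFYONLY
    FROM DISK = N'\\OtherBox\Backups\DBNAME_full.bak'
    WITH CHECKSUM;
```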

All the best.

DATETIME and its surprises

Well, it’s about time I wrote another blog post (with a promise that I’ll do it more than once a year).

I’m going to start a series of “bad habits to kick” posts, as it is important to me that we all have common ground on SQL programming.

1. Consider the following scenario:

SET LANGUAGE British;
SELECT CAST('02/12/2007' AS DATETIME);

SET LANGUAGE us_english;
SELECT CAST('02/12/2007' AS DATETIME);

SET LANGUAGE British;
SELECT CAST('2007-02-12' AS DATETIME);

SET LANGUAGE us_english;
SELECT CAST('2007-02-12' AS DATETIME);

SET LANGUAGE British;
SELECT CAST('20070212' AS DATETIME);

SET LANGUAGE us_english;
SELECT CAST('20070212' AS DATETIME);

And let’s see the results:

Changed language setting to British.

———————–

2007-12-02 00:00:00.000

(1 row(s) affected)

Changed language setting to us_english.

———————–

2007-02-12 00:00:00.000

(1 row(s) affected)

Changed language setting to British.

———————–

2007-12-02 00:00:00.000

(1 row(s) affected)

Changed language setting to us_english.

———————–

2007-02-12 00:00:00.000

(1 row(s) affected)

Changed language setting to British.

———————–

2007-02-12 00:00:00.000

(1 row(s) affected)

Changed language setting to us_english.

———————–

2007-02-12 00:00:00.000

(1 row(s) affected)

So the first two formats give different results depending on the language setting, while the last format returns the same result in both! What’s going on?

It has to do with the way Microsoft implemented the date and time data types, starting with SQL Server 2008.

We have two similar data types, DATETIME and DATETIME2. What people fail to realize is that the two data types behave completely differently. (The documentation recommends: “Use the time, date, datetime2 and datetimeoffset data types for new work. These types align with the SQL Standard. They are more portable. time, datetime2 and datetimeoffset provide more seconds precision. datetimeoffset provides time zone support for globally deployed applications” – see here.)

DATETIME preserves the old, non-standard behavior (its string interpretation depends on SET LANGUAGE / SET DATEFORMAT) in order to be backwards compatible, while DATETIME2 is pretty much standard-compliant.
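You can see the difference by re-running the ambiguous literal against both types; unlike DATETIME, DATETIME2 treats ‘yyyy-mm-dd’ as ISO regardless of the language setting:

```sql
SET LANGUAGE British;
SELECT CAST('2007-02-12' AS DATETIME);  -- 2007-12-02: dd-mm interpretation
SELECT CAST('2007-02-12' AS DATETIME2); -- 2007-02-12: always ISO
```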

2. How can we address this issue?

The answer lies in the code itself (repeated here for clarity):

SET LANGUAGE British;
SELECT CAST('20070212' AS DATETIME);

SET LANGUAGE us_english;
SELECT CAST('20070212' AS DATETIME);

These will return the same date and time, and the reason lies (again) in Books Online, under CAST and CONVERT:

For date and time styles, the relevant rows are:

Without century (yy)   With century (yyyy)   Standard   Input/Output
12                     112                   ISO        yymmdd / yyyymmdd

In other words, even if DATETIME is not ISO-standard, the way we input data for CAST can be aligned with the ISO standard (the undelimited yyyymmdd format), thus getting the correct result each time.
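The same idea works with CONVERT and an explicit style, which removes any dependence on SET LANGUAGE altogether:

```sql
SET LANGUAGE British;
-- Style 112 (yyyymmdd) pins the interpretation of the input string explicitly.
SELECT CONVERT(DATETIME, '20070212', 112); -- 2007-02-12 under any language
```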

I hope this helps.

Create certificate from backup fails on SQL Server 2008

This is so going to bite yer ass if you don’t pay attention…

As usual, a very quick post: I tried to create (OK, restore) a certificate from the production server onto our DR server, so that the automated restores would work. When I ran the script (yes, CREATE CERTIFICATE … FROM FILE […]), I got this:

Msg 15208, Level 16, State 1, Line 1
The certificate, asymmetric key, or private key file does not exist or has invalid format.

Hm… apart from the usual stuff you’ll find on the web, I’ll add another cause: make sure the machine name is the same as the SQL Server (default instance) name. In my case, the computer name was DR20, while the SQL Server default instance name was DR20DB. Once the SQL Server default instance name was changed to DR20, everything went fine.
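For completeness, the statement in question has this general shape (certificate name, file paths, and the password are placeholders):

```sql
-- Recreate the certificate on the DR server from the files backed up on production.
CREATE CERTIFICATE BackupCert
    FROM FILE = N'D:\Certs\BackupCert.cer'
    WITH PRIVATE KEY (
        FILE = N'D:\Certs\BackupCert.pvk',
        DECRYPTION BY PASSWORD = N'<StrongPasswordHere>'
    );
```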

I hope this helps some poor soul out there.

Best regards,
io

Say What? – “Management Studio: must be set in single thread apartment (STA) mode before OLE calls can be made”

Yes, I have another couple of minutes to document this weird message.

I was trying to access the database properties for one of our databases. Guess what: I couldn’t, as there was no owner! Well, the easiest thing to do is to alter the authorization on the database; but, as you may have guessed already, I didn’t pay any attention to the error message and quickly jumped to the magic world of… creating a diagram (actually the mouse pointer was in the wrong location and I just clicked, so it was not my fault, right?). Well, this time I got to take a print screen:

OK, now let’s copy that stupid message without reading it, select some random text from it, and do a blind search in hopes of sheer luck. Click the Copy sign… and there you go, more weird stuff:

Now this one finally got my attention.

It turns out that it used to be a bug (check here), but it was supposed to be fixed. Well, apparently not all the cases were covered.

My wild guess is that:

If you have a database that does not have an owner (say, someone left the company and the user account was removed – I need to check this) and you want to create a diagram, you will see the first pop-up error message. Trying to copy the message itself (via the Copy feature – yeap, those little sheets of paper in the lower-left corner) will yield the second error.

Of course, the solution is the most difficult one:

ALTER AUTHORIZATION ON DATABASE::MyDB TO sa;

I hope it helps.

Best regards,
io

Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. Error source: Microsoft OLE DB Provider for SQL Server – revisited

This blog post has been waiting for me to have some time for a couple of months now. It bit me back then, and so it did yesterday.

“Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. Error source: Microsoft OLE DB Provider for SQL Server” – now, I must emphasize that I’m not going to reproduce the plethora of info that is already out there; instead, I might be among the select few who indeed had issues with AD domains. If you’re curious about that, read on; if you’re looking for some lines of code to put in a connection string, you’d better go have a coffee or something.

Background: the company I work for migrated all their SQL Server 2005 servers (with DAS – Direct Attached Storage) to SQL Server 2008 clusters (on a SAN). This obviously means a lot (I mean A LOT) of fun, such as deprecated code and so on. Besides this, one of the required steps is to swap the IP addresses between the SQL Server 2005 machine and the SQL Server 2008 resource on the cluster (of course, everybody knows that SQL Server 2008 is just a resource on a 2008 cluster, right? – I learned this the hard way).

And now for the fun:

“ONLY jobs that need to run to C:\Program Files (x86)\Microsoft SQL Server\80\Tools\Binn\DTSRun are failing.
The ones that are running C:\Program Files (x86)\Microsoft SQL Server\90\DTS\Binn\dtexec are running fine.
Secondly, the ones running under 80\Tools are the ones using the server IP (10.0.0.31). when you ping the IP address, it still returns SQL03DB as the server.”

Rough translation of the above: all the jobs running DTSX packages are running fine, while all the jobs running DTS packages (SQL Server 2000 – yes, I know, that old) are failing on SQL Server 2008. Besides that, when I run ping -a 10.0.0.31, I still get the name of the SQL Server 2005 machine instead of the SQL Server 2008 cluster resource.

This is something that happened early May, and I completely forgot about it. And it happened again yesterday.

Sorry for all that blur stuff, but this is a prod machine. Anyway, as you can see, the IP address of my SQL Server cluster resource is 10.0.0.10.

Now, if I ping -a 10.0.0.10, I get the following:

So instead of getting something in the line of “28DBCLU”, I still get the old SQL Server machine, which used to be “01db”.

Again, oddly enough, this manifested itself only for the DTS packages, which are running DTSRun, found in the \80 folder.

Solution? A very simple one, just ask the sys admins to make the necessary changes in the DNS (yes, that DNS, related to AD, DHCP, all that jazz). Also, it would be very helpful not to have hard-coded values in your hosts file.

And for the grand finale: I posted a question on Twitter, under #sqlhelp, asking whether I should have my system databases on the SAN (which means automatic failover in case something goes wrong with the active node), or whether they should stay on the default installation path (yes, that Program Files thing). While there was a little variation in the answers, my personal choice is to have them fail over automatically. Just imagine what would happen if you have a failover and the msdb and master databases are not in sync.

Best regards,
io

My mail to SQL Sentry that will win me (yeah, right) a free seat to SQL Skills training!

The amazing guys @SQL Sentry (www.sqlsentry.net)  are giving away a chance to win a free seat to SQL Skills Internals and Performance training (check out the details: http://www.sqlsentry.net/sqlskillsimmersion.asp).

Of course, I decided to give it a try, so I wrote to the guys; you can check it below:

“Hello there.

Now, last night at around 2:30 AM I got out of bed, as the little ones (we have two kids) had been kicking me in the head for far too long. Since I’m a hopeless DBA, the very first thing to do was to open the laptop and check the emails. I almost fell off my chair when I saw Paul’s email regarding your offer. And I said to myself: OK, I’m still asleep and dreaming, let’s read that again in the morning. Which I did. And there it was, the offer.

I so MUST HAVE to be there! Let’s see why:

I live in Romania, which, by all standards, had a strict communist regime until 1989. While it was strict with regard to computers (we didn’t even have a word for computer), there was a plethora of underground illegal stuff, such as cable TV. This way we got introduced to good movies with hot chicks. Movies that, immediately after the Romanian revolution (December 1989 the revolution, early ’90s for the rest of the story), turned into even better-looking chicks. Of course, the entire gang fell in love with such angels. Even back then we knew that we had no chance with them, but at least we could try to find something similar in Romania.

By all standards, none of us were quite hot shots, so we decided that the best way to pick up gorgeous girls is to have money – a truth that happens to be the same worldwide. It also happened that, unfortunately, all of us were raised to be honest, not to steal, all that jazz, so we realized we would have to build careers that would pay enough money to stand out from the crowd. There you go, we got into IT.

This was a big mistake for several reasons: as everywhere else in the world, the true hot shots never go to work, ever; as they say, if you work, you don’t have time to make money. But the worst was yet to happen: we all fell in love with this new… thing, which was computers. Not to mention that back then the Internet changed it all (yeap, I’m that old; I keep telling people that Internet and internet are not the same). So before any of us could realize what was happening, we pretty much forgot about the movies and the hot chicks and got our hands dirty with all sorts of things, such as networking (myself included) and programming (myself excluded).

Fast forward a few years, to around the beginning of the millennium: most of us were working in the IT field (which, by the way, is still growing exponentially in Romania). Now, there is an old saying: he who can do, will do; he who cannot, will teach. Of course, most of the gang members got to work, while I got to teach (as a technical trainer in various capacities).

Quick jump into 2008: I got a job offer as a database administrator in SQL Server. Does it have icons? Can I work with a mouse? Yes. OK, I’m in. It was back then when I fell in love with the damn thing. Then I got to read great blogs from great people, such as Brent Ozar and Paul Randal, just to name a few. Back then, I would have said that thanks to them, I kept my love for SQL Server growing.

Another quick jump into 2011: I got another job as a SQL Server DBA, but this time the mouse was the least of my worries. Just to give you an example: guys, do any of you know anything about clustering? None of us did. OK, you have three days to pick it up, we’re migrating all of our databases to SQL Server 2008 clusters. We did pick it up.

This year (2012) I got to TechEd Orlando, only to make a fool out of myself in front of Brent Ozar (if you check his presentation, he makes fun of me for asking for a card to show his card to my wife).

To conclude: while in 2008 I would have said “thanks to these guys…”, this year I’m saying “because of these guys…”. Because of these guys I still love the damn thing (SQL Server, that is). Because of these guys I’m still wasting nights, trying to pick up things on the fly. You cannot imagine the sheer joy and happiness of my wife when I have to work on replacing and consolidating the database servers almost every weekend.

So, if it’s all because of “these guys“, why not let them handle me? At least in part? They made me forget my initial goal of being a hot shot and turned me into some sort of weird kookoo, trying to do his best. Worst part yet: because of them I got back into training, blogging, and trying to help other people.

I absolutely need this course (and not only this one): it will massively shorten the path to great knowledge and less work. I’m planning to do that ASAP; right now the kids still call me daddy, but with the crazy hours I have to put in, I’m not sure that will still be the case in the near future.”

Move cluster group from one node to another

First off, apologies to the 3 readers I have (I don’t read my blog posts, I just write them), since all of them are far better than I am when it comes to clustering.

To cut it short: how can you move ANY resource from one node to another? Namely, the disk witness in the quorum cannot be moved using the Failover Cluster Manager GUI:

So what to do, what to do? Then, after a 1-second Google search, I found a couple of resources. Not to keep you too long (if you need more than 5 minutes to read this, I’m in deep trouble):

cluster.exe <<servername>> group – this one (of course, after you replace <<servername>> with your actual cluster name) will list all of the available groups:

Group                    Node    Status
—————————————-
Cluster Group            Est21DB Online
Available Storage        Est21DB Online
EST20_21Dtc              Est21DB Online
SQL Server (MSSQLSERVER) EST20DB Online

OK, so now I have my SQL Server on one node, and the rest of the resources on the other node.

In a stroke of brilliance, I decided to move the “Cluster Group” group, since that is going to move everything, right?

cluster.exe est20_21 group “cluster group” /moveto:est20db – and then let’s run again to check that everything is moved to whatever node I choose (in my case, est20db):

Group                    Node    Status
—————————————-
Cluster Group            Est20DB Online
Available Storage        Est21DB Online
EST20_21Dtc              Est21DB Online
SQL Server (MSSQLSERVER) EST20DB Online

So I was obviously wrong – moving the “Cluster Group” did not move anything but the “Cluster Group” itself. Aaargh, the shared-nothing principle (thank you, Allan Hirt, for your great book).

So, if I really want to move the quorum drive, I need to move the “Available Storage” group, like this:

cluster.exe est20_21 group “available storage” /moveto:est20db

Moral of the story: you cannot move all the resources at once – I personally find “Cluster Group” a bit misleading as a name, but again, I’m less than a novice when it comes to clustering.

I hope this helps someone out there.

Best regards,
io