I recently had the pleasure of giving a talk at SQLSaturday #71 in St. Louis on the topic of data corruption. One part of this involved discussing some not-so-well-known root causes of data corruption (e.g., “It’s your fault”), but the other part involved showing how to use Extended Events to detect and address these root causes (e.g., “It’s not always your fault”). I’ve posted my slides from that talk here.
This blog article discusses what follows in terms of extended events functionality; therefore, you’ll want to refer back to those slides as needed throughout this content if you need more details about any particular event or state.
Root Causes of SQL Server Data Corruption
In my experience, many of the people who attend my talks on data corruption are there because they’ve been told that their application developers did a bad job and therefore “caused” data corruption in their databases. This leads to some interesting questions for me about why these same folks have not also blamed their network guys or storage guys. In fact, I had one person give me feedback after my talk at one of these conferences saying something along these lines: “I can’t wait until you get to the root cause section so I can tell the DBA in our shop that it’s his fault.”
Unfortunately, if this is how your organization handles things then you’re only going to perpetuate the problem by blaming the DBA for everything. I’ve worked in environments where the DBA was blamed for all sorts of application and network issues, and it’s about as good a situation as one could ever hope to be in (i.e., not very).
A more effective approach is to look at root causes that you can address without assigning blame.
The following lists some potential root causes of data corruption based on my own experience:
SQL Server bug (see e.g., BUG#14049, BUG#12085, BUG#44531) Poor application design using constructs which are known to cause SQL Server bugs (e.g., outer joins, cross-database queries, unquoted identifiers) ing issue (see e.g., KB #95530, KB #331671, BUG#3415) Poor application design using improperly configured (or nonexistent) collation settings which cause data corruption when leveraging constructs such as case-insensitive sorts and joins.
Application bug in a client-side library that is being used by one or more applications to connect to SQL Server. Hardware problem related to either the server itself or storage attached to the server (e.g., failing disk drive, memory errors in RAM, misconfigured RAID controller). Collation mismatch between the database and the client application environment, otherwise known as an “application collation” issue.
Root Cause Example: Buggy Code
Let’s take a closer look at example #2 above: Poor application design using constructs that are known to cause SQL Server bugs (e.g., outer joins, cross-database queries, unquoted identifiers). Suppose you have a requirement at your company that any code changes needed for an application must be verified as correct by the developer who originally wrote the code and those changes may not be implemented until they complete this verification process. If you run into a bug in this application though, you’ve now got two problems because instead of just fixing the problem you now must fix it and then verify that fix before moving on to other items on your already overburdened development and test schedule.
Another approach would be to write tests around all of these of known problematic SQL statements along with a cross-database query and a case-insensitive sort to catch application bugs. Then, you could run these tests as part of your automated testing process (e.g., use SQL Server Agent jobs) to automatically identify the root because whether it’s an application bug or not.
Root Cause Example: Unsupported Collations
Let’s take another look at example #4 above: Poor application design using improperly configured (or nonexistent) collation settings which cause data corruption when leveraging constructs such as case-insensitive sorts and joins. The example I like to give regarding this issue is the following scenario: Suppose you have an application deployed in both Europe and Asia that uses smartphones with local language characters installed on them for users to interact with the application.
If an application was designed in the U.S. then it will assume the database collation is the same as the client computer collation (which, in fact, is not true for any language other than U.S. English) and thus every single input character will be converted to uppercase before being stored in the database because that’s how U.S. English locale works (i.e., all characters are uppercased).
So, when you try to do a case-sensitive sort on this data in the database it’s actually sorting the uppercase versions of these characters which is an entirely different thing than what your application assumed would happen. We’ve seen this scenario time and again despite documentation that clearly states:
When comparing two expressions where one or both expressions are NULL, use COLLATIONPROPERTYEX to determine the collation differences between them because UNICODE comparisons are relied upon. As described above, if either expression contains non-Unicode data (such as data that has been converted to uppercase) then comparisons will not be equivalent.