Blog post #5 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel. Read about it here.
Saying that you should use the correct data type seems like something that should be very straight forward. Unfortunately it’s very easy for things to get confusing. Let’s take a simple example from AdventureWorks. If I run this query:
SELECT a.ModifiedDate FROM Person.Address AS a WHERE a.AddressID = 42;
The output looks like this:
2009-01-20 00:00:00.000
Normal right? You see the year, the month and the day followed by the time in hours, minutes, and seconds as a decimal. Ah, but there is an issue. This query is supposed to be for the reporting system, and the business only cares about the date that the values in the Person.Address table have been modified, so they don’t want to see the time. Also, the person in charge is kind of picky. They really don’t like seeing the date formatted that way. They prefer to see “January 20, 2009.”
Far too often then, the easy answer, just change the column to varchar. You can trim the time and output in exactly the format needed by the business. Problem solved and it was easy…
Well, until someone inputs “Janry 20, 2009” slightly mangling the spelling and suddenly your report looks all messed up. Or, they ask you to start filtering just the last two weeks, regardless of when the report was run and you now can’t easily do date math on the column. Even after you get over that problem with a little formatting using CAST (along with ISDATE to try to catch all those other typos that are in the system now)  you notice that the performance is really slow so you go to put an index on the column and now you have an index key that is 50 bytes wide instead of the 3 bytes that the DATE data type would have been, making the index less efficient (not to mention, sorting the data is going to put February ahead of January, more formatting).
The list goes on and on. You’re going to hit issue after issue and all you tried to do was a little formatting, make the data look pretty by using the VARCHAR data type in place of DATETIME or DATE. This kind of thing happens all the time, especially to people just getting started. The rules should be very simple, use the correct data type for the data you’re collecting. However, it’s easy to get distracted. It’s easy to get pushed. The business says “put it in this format” and the only way you can do that is by changing the data type so…
Don’t do it. Take the time to understand the implications of mangling the data types. It’s not just date and time that cause issues either. Lots and lots of stuff can be easily stuffed into a VARCHAR field that more properly should be an INT or a DECIMAL or some other data type. Be able to communicate with the business so that they understand that you’re not simply being difficult, you’re trying to protect the functionality and integrity of the data. Show how formatting can be done using client tools. Explain all the downstream issues that this decision will have. Then, use the correct data type for the information being stored.
Excellent point and well made. It’s easy to see how someone might go down the varchar route, but I think always best to store the best description of the actual data in the database and process it, into whatever the client wants to see, in the application.
[…] Grant Fritchey implores you to use the correct data type: […]
doesn’t it also depend on where data source for your reporting? in Data warehouse experience, you don’t store data to suit reporting but instead you want to store as much of its original form as you can. In this case, you can always reference date dimensions for displaying date in the report in the correct format
Matt, absolute agreement.
Sifiso, yes & no. I agree that data needs to be as original as possible, but if the original is incorrect, I don’t see how perpetuating a mistake is helpful.
I have seen this sort of thing done quite a lot, especially by users who want to create data for a report. They forget that formatting can be done in two places, at the database level or at the report level, and don’t realise that the report level is much better at this than SQL, and will give them more flexibility later on. At least the YYYYMMDD format means that dates are sortable, but reformatting it for display, or to do maths with, is a pain.
Alex,
Not to mention that you have to build a custom enforcement mechanism or you can arrive at inconsistent data.
” Be able to communicate with the business so that they understand that you’re not simply being difficult, you’re trying to protect the functionality and integrity of the data. ”
I once worked with a very literal group of developers. They did not have a Char/Varchar column in the database because the Business Stake Holder said the data elements were “text”.
Ray
[face … palm]
[…] Use The Correct Data Type Silent Killers Lurking in Your Schema by (yes, a good friend) Mickey Steuwe is absolutely going to cover data types, among other things. […]