I read all the time how SELECT * hurts performance. I even see where people have said that you just have to supply a column list instead of SELECT * to get a performance improvement. Let’s test it, because I think this is bunkum.
The Test
I have here two queries:
SELECT * FROM Warehouse.StockItemTransactions AS sit; --and SELECT sit.StockItemTransactionID, sit.StockItemID, sit.TransactionTypeID, sit.CustomerID, sit.InvoiceID, sit.SupplierID, sit.PurchaseOrderID, sit.TransactionOccurredWhen, sit.Quantity, sit.LastEditedBy, sit.LastEditedWhen FROM Warehouse.StockItemTransactions AS sit;
I’m basically going to run this a few hundred times each from PowerShell. I’ll capture the executions using Extended Events and we’ll aggregate the results.
The Results
I ran the test multiple times because, funny enough, I kept seeing some disparity in the results. One test would show a clear bias for one method, another test would show the opposite. However, averaging the averages we see that things broke down as follows:
* | 167.247ms |
Column List | 165.500ms |
That’s after about 2000 separate executions of each query. There’s a 2ms bias towards the Column List query as opposed to the *. That’s an improvement, if you want to call it that, of 1%. It’s hardly worth the bother, assuming that with more testing this continued to hold true. In multiple tests, the SELECT * ran faster. I just feel honor bound to put up the full results. They show an improvement, but not one I’d get excited about. Oh, and the reads, the execution plan, everything else… identical.
SELECT * Conclusion
Don’t get me wrong, there are lots of reasons to not use SELECT *. Yes, performance is one of the reasons to not use SELECT *. However, when most people suggest that maybe using SELECT * is a bad idea for performance reasons, what they’re saying is you ought to only move the columns you need and the data you are actually using, not everything. I’m not aware of anyone with experience and knowledge suggesting that using the complete column list instead of SELECT * is faster. As we can see in the tests above, it isn’t (or is by so small a margin, who cares).
I love talking performance tuning. In fact, I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.
Few questions:
Is this taking into account the time it takes to transfer the query to an application?(transmitting the bytes to a different computer)
How many columns did the source table have? If there are 50+ columns, it probably will take longer to send the query back to the application (just because its sending more).
Have you tried this same test with a stored procedures?
Hey Jordan,
The amount of data doesn’t matter because if you list all the columns (5 or 50) or you use *, you’re returning the same amount of data. No, * versus fewer columns, of course fewer columns wins, no question. I’m not saying you should use *. I’m just saying if you’re going to return all the columns anyway, * doesn’t hurt additionally (yeah, 1%, I still think that’s a bit of an anomaly).
I haven’t tried it with stored procedures, but I see no reason why it would be event a tiny bit different. I can test it. Why do you think it could change with procedures?
Yeah, it’s not clear to me if you’re recording the query time or the network time here. (or both).
And if it’s the same number of columns in both cases. (I don’t have that db loaded on my particular machine).
I’d also be very interested to see the difference if you had a WHERE clause and the impact of columns included. I suspect if you only return the columns in a non-clustered index (that is based on the parameters of the where clause) vs * in the same case (so you have go to back to the clustered index to get the remaining columns) you’d see a big difference.
Greg,
It’s query time. The numbers are based on extended events capturing query time. The columns are exactly the same, otherwise it wouldn’t be a good test.
Yes, anything that changes the number of columns will change behavior. However, even with a filter in place, ALL columns by listing them out or using *, will result in the same performance, clustered, nonclustered, or heap. It won’t matter. Again, for the test to be fair, it has to be all columns vs. *.
I’m not suggesting that moving more columns has the same performance as moving less. I’m saying that listing all columns or just using * result in the same performance.
Good post, and a good example of one of those myths that is “right for the wrong reasons”. Certainly you could have a situation where SELECT * avoids an index that could have covered a narrower query, if you didn’t need all the columns…my bigger beef with it is how easy it is for things to break when people shove it into a stored procedure. All it takes is the next chap to come along and add a column, and boom, your code breaks. Even just from a time perspective, when writing stored procedures, you should not be taking typing shortcuts.
That said I use it all the time while running adhoc queries!
Did you try this with a flush cache before each select ?
Depends on if one of those fields is full XML field, or a 7000byte varchar.
I won’t use select * in production. if i need 3 fields, i query those three fields.
Add columns with geography or other obscure data type? I’ve definitely seen higher cpu just by selecting geography columns when not needed. Guess I’m testing tomorrow 🙂
SELECT * FROM Table vs. SELECT is trivially different in terms of compilation time – not worth bothering.
However, there is a potentially very large difference in SELECT * vs. SELECT cols when you have a subset of the columns selected since it can significantly change the number of bytes processed and it can impact plan choice when there are covering indexes on a subset of the columns that can help you.
More broadly, it is likely not good to use SELECT * for any long-lived apps where you might change/add columns over the lifetime of an app. This can cause your app to break when you deploy things.
Conor
Thanks Connor. I suspect that ~1% that I saw was the difference in compile time. That makes sense considering all the other measures.
Total agreement on all your other comments too. Appreciate the feedback. Personally, I stay away from SELECT * in the vast majority of all cases. It was being told, by an individual online, that listing all the columns was inherently faster than SELECT * that lead to this experiment.
Appreciate you stopping by and commenting.
Nic,
Total agreement. Again, limiting the columns selected is the right, preferred mechanism. I just knew that once I tested it, and it held up, that wasn’t the case that SELECT * vs. all columns resulted in superior performance (that 1% just isn’t worth the trouble).
Micheal,
I did, but it skewed the results. Not because it showed either one winning, but that loading stuff from disk over and over caused wild disparity between runs, too much conflict on a small system.
Alex,
Exactly. I’m not suggesting otherwise. Limiting the column list is a very good thing.
We’re talking not limiting the column list vs. SELECT *, different story than a limited column list.
Hey Jared,
Without a doubt that would cause additional headaches, but it will cause them for both scenarios. Remember, we’re not limiting the selected columns. It’s all columns, listed out, vs. SELECT *. Any pain from a special data type is going to be there in both. That’s part of my point.
Good post Grant. One scenario were I will advise people specify all column list instead of * is to ensure the code doesn’t break after someone adds or removes a column. For ex,
Insert into tblB select * from tblA
The above query will break if column is added to tblA but not to tblB
But yes, this is not for performance reasons and agreeing with you on no/little performance difference while specifying all column list against select *
Thanks Grant! Just the premise had me thinking about it wrong. All columns is all columns! Duh! (Jared smacks forehead)
I don’t prefer to use ‘SELECT *’ as it would cause performance delay when you are executing for a larger set. For example, you want only 4-5 fields where the table has 10+ fields having a huge data which you don’t need, then what’s the need to get all these extra columns/data?
Kunal,
Yes, limiting the column list is a good thing and does help performance. That’s not what we’re talking about here. The claim made was listing the columns, all the columns, performed faster than SELECT *. That’s not true.
In older versions of SQL Server, I suspect SELECT * incurred more of a performance hit. Hence the advice to *not* use SELECT * in the past.
I advocate listing the columns, and share that advice when I deliver SSIS, ETL, and Biml training. Why? “It communicates intention,” I tell my students.
:{>
Don’t understand – why do so many responders harp on the OBVIOUS advantage of retrieving fewer columns? This article is about two different ways of retrieving ALL columns, and says so many times. Do you even bother reading the article before posting your objections?
But on the subject of the article, I wonder if the slight speed advantage of listing columns could be that internally, the parser translates the * into a column list, and only then does the exact same work as if you supply the list yourself? Generating the column list could take a little time, although an average of 2mn seems rather a long time to me, for such a simple operation.
Hello Pete,
I’m assuming that the difference seen is what Connor refers to above.
Hello, i guess the question should be:
if you need all the columns from a table, does it matter to use select * from than naming the columns? here you are saying not much, but the select * as a habit just to return all columns where you actually need just very few is a bad practice, that’s the really bad thing about using select * then
thanks
Bassam,
Exactly. If you don’t need the columns, don’t return them. That should absolutely be the standard.
I think it should be more of “don’t get in the habit of being lazy and writing select *” when it comes to building something that will be permanently installed in production. The example given of the select into a table, which will break when a column is added is a perfect reason.
Another example I’ve seen far too many times is select * from Table A in a view with fields from other tables listed behind it. When you add a new field to Table A, your view will still return the same data, but now it will be aligned to different columns.
Break the habit, and someone won’t have to check a database for all references of a table when adding a new field to it.
Hi Grant!
I think the title of your post is very polemic. Because if someone doesn’t read the full article, can keep a bad idea about a well stablished performance standard (just bring the columns you need).
I’ve had problems with some irresponsible developers, who made ORM that by default brings every columns available, and they just leave it like that, even with lots of joins. On the dev environment, it’s OK, but when concurrency comes at play, it just crashes the system.
There are cases, like some ODS database organizations, where the problem of applications moving unecessary data around have a huge impact, and I believe this shoud be on the mind of every developer.
Thanks for the articles, you bring a lot of informations and insights to us!
Grant,
From the title of the article, it gives a very different impression of what you’re discussing as it implies that there’s no performance hit to using SELECT * at all.
If you identify all columns in your select statement, then the only overhead you’re saving is the lookup SQL Server had to do to resolve the * to the column names.
But, having worked with tables with lots of columns (not my design), there is a definite performance hit in retrieving data from unneeded columns.
Aaron
For the most part your correct. But, what about the X lock on the system schema tables when you do select *. The system has to X lock the schema tables to get the field list. It does this because it needs to know what the field list is at that time.
The title of this post implies there’s no performance hit to “select *” when you don’t need all the columns. NO! You are not saying that, but it’s not immediately clear from the title. So, thank you for saying in your conclusion “Don’t get me wrong, there are lots of reasons to not use SELECT *. Yes, performance is one of the reasons to not use SELECT *.” I agree. It might be interesting to delve into some of those reasons in a future contrasting post.
Wise Old Man pointed out a pitfall to using it in a view. Another pitfall to using it in a view is that it’s a common beginner mistake to expect that newly added columns will automatically “ripple through” the view. (sp_refreshview??? What’s that?) A new developer unfamiliar with recent change history might have considerable difficulty figuring out why the result set from a “select *” view is missing one or more columns. If the original author had taken the time to list all original columns, it would be obvious from looking at the view definition that one or more newly added columns were missing. Yes, if you want to then add the new columns to the view it is a tiny bit more work to alter the view definition versus simply running sp_refreshview, but it aligns with the best-practice of only returning needed columns (What if 20 other un-needed columns were also added?).
Finally, there’s really no excuse not to enumerate columns from a table or view because the SSMS object explorer makes it so easy. eg. For a vertical list, right click the table or view and “Script it as”->”Select to” etc. For a horizontal list, simply drag the un-expanded “Columns” folder into a query pane.
Thanks for the thought provoking post. I’m not crazy about the title, but it got me to read it!
I don’t know of anyone who has ever claimed that “SELECT *” performs any worse than “SELECT {list of *every* column in the table}”. Of course, the two are essentially equivalent, and there is no reason to even suspect there would be any appreciable difference in performance.
Anyone (that I know of) who was pointing out a performance difference was specifically saying that “SELECT *” is definitely worse than “SELECT {*only* the columns you really need}” when your application needs fewer columns than whatever happens to be in the table – – **especially** in cases where some of those fields may be char(), varchar(), or TEXT having large amounts of data that your app doesn’t need in that specific query.
Further if the SQL statement includes joins to other tables, then “SELECT *” returns *all* field data from *every* table involved, making things even worse.
Separate from the issue of performance is the issue of code-stability. If your views, stored procedures, hard-coded app-generated SQL statements include something like “INSERT INTO {Table_B} SELECT * from {Table_A}”, this code will break if you ever change the schema for *either* one of those tables (without remembering to make analogous changes in the other table). Your code will be more stable if you explicitly name the source and destination fields.
Guys, if you do a little searching on the web, you will find a performance checklist making the rounds that explicitly states that listing the columns is better than SELECT *.
As I said in the conclusion, of course moving all the columns is going to be a performance hit. However, moving all the columns is going to be a performance hit. It will be that if you use SELECT * or you move all the columns.
As you stated, SELECT * is barely worth mentioning for performance reasons, assuming you actually need all the columns. I think it’s just a maintenance nightmare, personally, and suggest not using it except for personal, non-persisted use; i.e troubleshooting, testing, etc.
[…] than you actually need come to mind immediately (although if you need every column, SELECT * is no worse for performance than specifying them by […]