The things you read on the internet, for example, “don’t use a sub-query because that hurts performance.”
Truly?
Where do people get these things?
Let’s Test It
I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.
Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.
Let’s start with a simple test, just to validate the concept of how a sub-query performs within SQL Server:
SELECT sd.OrderQty, pr.Name FROM (SELECT * FROM Sales.SalesOrderDetail AS sod ) AS sd JOIN (SELECT * FROM Production.Product AS p ) AS pr ON pr.ProductID = sd.ProductID WHERE sd.SalesOrderID = 52777; SELECT sod.OrderQty, p.Name FROM Sales.SalesOrderDetail AS sod JOIN Production.Product AS p ON p.ProductID = sod.ProductID WHERE sod.SalesOrderID = 52777;
If there is something inherently wrong with a sub-query, then there is something twice as wrong with two sub-queries. Here are the resulting execution plans:
Huh, look sort of, I don’t know, almost identical. Let’s compare the plans using the new SSMS plan comparison utility:
Well, darn. Displayed in pink are the common sets of operations between the two plans. In other words, for these plans, everything except the properties of the SELECT operator are exactly the same. Let’s take a look at those properties:
OK. Now we have some interesting differences, and especially, some interesting similarities. Let’s start with the similarities. First of all, we have exactly the same QueryPlanHash value in both plans. In addition, we also have identical estimated rows and costs. In short, the optimizer created two identical execution plans. Now, this is where things get a little bit interesting. See, the optimizer actually worked a little harder to create the first plan than the second. It took an extra tic on the CPU and just a little more CompileMemory and CompileTime. Interesting.
What about execution times? With a few runs on average, the execution times were identical at about 149mc with 11 reads. However, running a query once or twice isn’t testing. Let’s get a few thousand runs of both queries. The average results from the Extended Events sql_batch_completed event were 75.9 microseconds for both queries.
However, what about that extra little bit of compile time in the query that used sub-queries? Let’s add in a statement to free the procedure cache on each run and retry the queries. There is a measurable difference now:
Query | duration |
Sub-query | AVG: 5790.20864172835 |
Query | AVG: 4539.49289857972 |
More work is done by the optimizer on the sub-query to compile the same execution plan. We’re adding work to the optimizer, requiring it to unpack the, admittedly, silly query written above. When we refer only to the compile time and not the execution time, there is a performance hit. Once the query is compiled, the performance is identical. Whether or not you get a performance hit from a sub-query then, in part, depends on the degree to which you’re experiencing compiles or recompiles. Without the recompile, there is no performance hit. At least in this example.
Let’s Test It Again, Harder
I firmly believe in the old adage; if you ain’t cheatin’, you ain’t fightin’. It’s time to put the boot in.
Let’s go with much more interesting queries that are more likely to be written than the silly example above. Let’s assume some versioned data like in this article on Simple-Talk. We could express a query to bring back a single version of one of the documents in one of three ways from the article. We’re just going to mess with two of them. One that uses a sub-query, and one that does not:
--no sub-query SELECT TOP 1 d.DocumentName, d.DocumentID, v.VersionDescription, v.VersionID, ROW_NUMBER() OVER (ORDER BY v.VersionID DESC) AS RowNum FROM dbo.Document d JOIN dbo.Version v ON d.DocumentID = v.DocumentID WHERE d.DocumentID = 9729; --sub-query SELECT d.[DocumentName],        d.[DocumentId],        v.[VersionDescription],        v.[VersionId] FROM   dbo.[Document] d        CROSS APPLY (SELECT TOP (1)                            v2.VersionId,                            v2.VersionDescription                     FROM  dbo.[Version] v2                     WHERE v2.DocumentId = d.DocumentId                     ORDER BY v2.DocumentId,                            v2.VersionId DESC                    ) v WHERE  d.[DocumentId] = 9729;
As per usual, we can run these once and compare results, but that’s not really meaningful. We’ll run them thousands of times. Also, to be sure we’re comparing apples to apples, we’ll force a recompile on every run, just like in the first set of tests. The results this time:
Query | duration |
Sub-query | AVG: 1852.14114114114 |
Query | AVG: 2022.62162162162 |
You’ll note that, even with the compile on each execution, the query using a sub-query actually out-performed the query that was not using a sub-query. The results are even more dramatic when we take away the compile time:
Query | duration |
Sub-query | AVG: 50.8368368368368 |
Query | AVG: 63.3103103103103 |
We can also look to the execution plans to get an understanding of how these queries are being resolved:
The plan on top is the sub-query plan, and the plan on the bottom is the plan for just the plain query. You can see that the regular query is doing a lot more work to arrive at an identical set of data. The differences are visible in the average execution time, about a 20% improvement.
You could argue that we’re comparing two completely different queries, but that’s not true. Both queries return exactly the same result set. It just so happens that the query using the sub-query performs better overall in this instance. In short, there’s no reason to be scared of using a sub-query.
Sub-Query Conclusion
Is it possible for you to write horrid code inside of a sub-query that seriously negatively impacts performance? Yes. Absolutely. I’m not arguing that you can’t screw up your system with poor coding practices. You absolutely can. The query optimization process within SQL Server deals well with common coding practices. Therefore, the queries you write can be fairly sophisticated before, by nature of that sophistication, you begin to get serious performance degradation.
You need to have a method of validation for some of what you read on the internet. People should provide both the queries they are testing with and the numbers that their tests showed. If you’re just seeing completely unsupported, wildly egregious statements, they’re probably not true.
In conclusion, it’s safe to use sub-queries. Just be careful with them.
If you’re finding any of this useful and you’d like to dig down a little more, you can, because I’ll be putting on an all day seminar on execution plans and query tuning. The event takes place before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.
But some subqueries are demonstrably bad and people do word association tricks without understanding the subquery antipatterns.
A subquery using NOT IN *tends* to be poor performing vs, say, a functionally-equivalent LEFT JOIN…WHERE col IS NULL
Another example is correlated subqueries. In some cases these can be rewritten as LEFT JOINs, in other cases the query optimizer figures it out for you.
I think your point is…nothing is inherently bad about subqueries, but your post title leads me to think there is nothing EVER wrong with subqueries. And we all know that there are cases where *it depends* is a valid answer.
Have to agree with Dave. You cheated by using an APPLY function, which I’ve found most people tend not to do. The subqueries I usually label as “potentially troublesome” are the correlated sub-queries, especially the ones put in the “SELECT” and not in the “FROM”. I do find that using an APPLY operator greatly affects the performance and I don’t tend to have a problem with derived table sub-queries or those used in an APPLY because they’re often quite acceptable and usually better than what was there before.
However, I get your point. Subqueries by themselves are not necessarily bad. How we abuse them could be. 🙂
[…] Grant Fritchey busts a myth: […]
Hi Grant,
When I’m looking at your query structure, it seem as if you’re writing a derived table than a sub-query.
A sub-query typically look like
SELECT (select top 1 xx from xx), column FROM table
OR
SELECT column FROM table WHERE column = (SELECT xx FROM xx)
And yes, the first subquery example is terrible from performance point of view.
I was curious, as from my general understanding, there should not be any difference in execution time between sub-query and join, provided the meaning of both queries is the same and provided the optimizer does what it is supposed to do and produces identical query plans.
in “let’s try it harder” the necessary pre-condition of identical meaning is not fulfilled. One query contains a ranking function (Row_number()) ; the other does not. So you cannot compare them.
Hi Simon,
I double (and triple) checked the definition of what a sub-query is. Yes, correlated sub-queries are one type, but the derived table is also a type of a sub-query. Microsoft has a definition and examples of what a sub-query is right in the MSDN documentation.
It’s worth noting that correlated sub-queries are frequently more problematic than other types of sub-queries. However, even they are not automatically problematic. They too are subject to the ability of the optimizer to logically deal with them. I’ve seen situations where they perform just fine. In fact, Gail Shaw has a blog post or two where she addresses the idea that they automatically lead to a cursor type of situation (which isn’t true).
As with most objects in T-SQL, you can write them horribly, or you can write them well. It’s down to your code and your structure, not simply a single method within the code or structure.
Herbert,
They return identical data sets, so they can be compared. If you go and read the article, depending on how the indexes are structured and the amount of data in play, either approach can perform better. I used a single example to illustrate the point here. More examples can be made to make the point in the other direction. The main take-away is that a sub-query is not inherently a problem.
The first example, silly but illustrative, shows that there actually is a performance difference as more time is spent compiling the plan. As with so many things in SQL Server and T-SQL, it depends. This is why I’ve been writing all these blog posts against the goofy, single-statement, performance check-lists I’m seeing spring up all over the place. The information in them is both wrong and horribly incomplete.
Dave & Peter,
I’m positive that I said, twice, in the post, that there is nothing inherently positive, just as there is nothing inherently negative around sub-queries. They are another construct that can be used or abused.
Yep. I just double-checked. Second full paragraph said, “…there’s nothing magically good…” and “… “write a sub-query that performs horribly…”. Twice in the conclusion I also say that you can screw these things up, and you can. Also, in the tests, it pretty clearly shows that there is a performance cost during optimization. I truly don’t know what else I could have said that would let you know that “it depends” is in absolute operation here. However, you both seem to agree with me, and disagree with the goofy original premise, a sub-query, in and of itself, is not problematic. That’s the primary point.
Reiterating lest anyone else think I’m unclear. Are there situations where a sub-query, of any type, can lead to poor performance? Yes. 100%. Yes. Are there situations where a sub-query, of any type, performs perfectly fine, possibly even better than some other construct within SQL Server? Yes.
So, in conclusion then, again, there is nothing inherently problematic about a sub-query, but rather, how it is used. In short, it depends.
I have used sub-queries similar to the examples above and I have found NO issues with using them on projects that have tens of thousands of rows throughout several tables, with at least 50 columns and some with up to 600 columns (not my design!). As a *human* I don’t see much of a difference in performance between using a sub-query and an [X] JOIN. I prefer the sub-queries because they’re far more readable — and manageable — to me than JOINS.
As with any language, even simple commonly used functions can become problematic if poorly coded — or poorly planned.
This was a great article. Thanks!
Thanks. This rather simple point seems to have ruffled feathers because, yes, there are exceptions. I still prefer JOINs, but I use a lot of sub-queries, including within JOINs.