In the previous post, I showed how you can get full listings of your execution plan costs. Knowing the range of estimated cost values you're dealing with can help you determine what the Cost Threshold for Parallelism on your system should be. However, we don't want to just take the average and use that. You need to understand the data you're looking at. Let's explore this just a little using R.
Mean, Median, Range and Standard Deviation
I’ve used the queries in the previous blog post to generate a full listing of costs for my plans. With that, I can start to query the information. Here’s how I could use R to begin to explore the data:
library("RODBC", lib.loc="~/R/win-library/3.2")

# pull the full list of estimated plan costs gathered in the previous post
query <- "SELECT * FROM dbo.QueryCost;"
dbhandle <- odbcDriverConnect('driver={SQL Server};server=WIN-3SRG45GBF97\\dojo;database=AdventureWorks2014;trusted_connection=true')
data <- sqlQuery(dbhandle, query)

# basic descriptive statistics for the estimated costs
mean(data$EstimatedCost)
median(data$EstimatedCost)
maxcost <- max(data$EstimatedCost)
mincost <- min(data$EstimatedCost)
costrange <- maxcost - mincost
costrange
sd(data$EstimatedCost)
The mean function is going to get me my average value, which, in this case, is 0.8755985. If I just accept that average as a starting point for determining my Cost Threshold for Parallelism, I guess I can leave it at the default value of 5 and feel quite comfortable. This is further supported by the median value of 0.0544886 from my data. However, let's check out the costrange value. Knowing the mean, or even the median (literally, the middle number of the set), doesn't give you any indication of just how spread out the data is. My costrange, the max minus the min, comes out to 165.567. In other words, there is a pretty wide variation in costs, and suddenly I'm less convinced that I know what my Cost Threshold should be.
The next value that matters is the standard deviation. This gives you an idea of how spread out your data is. I'm not going to get into explaining the math behind it. My standard deviation value is 8.301819. With this, I know that a pretty healthy chunk of all my values fall below an estimated cost of roughly 9, since a single standard deviation of 8.301819 on top of my average value of 0.8755985 comes out to about 9.18.
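If you want those cut-off points straight from R rather than doing the arithmetic by hand, here's a minimal sketch using the same data frame loaded above; the cut-offs are just illustrations, not recommendations:

# candidate cut-off points based on the statistics above
avgcost <- mean(data$EstimatedCost)
sdcost <- sd(data$EstimatedCost)
avgcost + sdcost        # one standard deviation above the mean
avgcost + (2 * sdcost)  # two standard deviations above the mean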
With this knowledge, I can start to make informed choices. I’m not relying simply on an average. I can begin to think through the process using statistics.
Just to help out with the thought process, let’s plot the values too.
Histogram
My first thought for visualizing any kind of data is a histogram, and this is really easy to do using R:
hist(data$EstimatedCost)
The output looks like this:
Clearly, this doesn't give me enough to work with. Most of my data, nearly 1500 distinct values, sits at one end of the distribution, and all the rest is scattered elsewhere. I can't use this to make any kind of decision about my Cost Threshold.
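If you want to squeeze a little more out of the histogram, a quick sketch is to use more bins, or to look at the costs on a log scale so the low end doesn't swallow everything:

# more bins, and a log10 view of the same data (shifted by 1 to avoid taking the log of zero)
hist(data$EstimatedCost, breaks = 100)
hist(log10(data$EstimatedCost + 1), breaks = 50)

Even so, with data this skewed the histogram may not tell you much more.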
Scatter Plot
The histogram isn’t telling me enough, so let’s try throwing the data into a Scatter plot. Again, this is silly easy in R:
plot(data$EstimatedCost)
The output is a lot more useful:
Now I can begin to visually see what the standard deviation value was telling me. The vast majority of my costs are well below two standard deviations, or approximately 16. However, let’s clean up the data just a little bit and make this as clear as we can.
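To make that comparison explicit, here's a small sketch that overlays reference lines for the mean and for two standard deviations on the same scatter plot:

# scatter plot with horizontal reference lines: mean (blue) and two standard deviations (red)
plot(data$EstimatedCost)
abline(h = mean(data$EstimatedCost), col = 'blue')
abline(h = 2 * sd(data$EstimatedCost), col = 'red')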
Density Plus Values
Instead of just plotting the values, let’s get the density, or more precisely, a kernel density estimation, basically a smooth graph of the distribution of the data, and plot that:
plot(density(data$EstimatedCost))
rug(data$EstimatedCost, col = 'red')
I went ahead and added the values down below so that you can see how the distribution goes as well as showing the smooth curve:
That one pretty much tells the tale. The vast majority of the values are clumped up at one end, along with a scattering of cost estimates above the value of 5, but not by huge margins.
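If it helps to see where the default setting sits, a quick sketch drops a vertical line at a cost of 5 onto the same density plot:

# density plot with the default Cost Threshold for Parallelism (5) marked as a dashed line
plot(density(data$EstimatedCost))
rug(data$EstimatedCost, col = 'red')
abline(v = 5, col = 'blue', lty = 2)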
Conclusion
With the standard deviation in hand, and a quick rule of thumb that says roughly 95% of all values in a normally distributed data set fall within two standard deviations of the mean, I can determine that a value of 16 for my Cost Threshold for Parallelism is going to cover most cases. It will ensure that only a small percentage of queries go parallel on my system, and that those which do go parallel are actually costly queries, not just ones that fall outside the default value of 5.
I've made a couple of assumptions that are not completely held up by the data. Using two, or even three, standard deviations to cover just enough of the data isn't actually supported in this case because I don't have a normal distribution; in fact, the distribution here is heavily skewed toward one end of the chart. There's also no data on the frequency of these calls. You may want to add that into your plans for setting your Cost Threshold.
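Because the distribution is skewed, percentiles are another quick, rough way to look at the data that doesn't assume normality; this is just a sketch:

# percentile view of the estimated costs; more robust than the standard deviation for skewed data
quantile(data$EstimatedCost, probs = c(0.50, 0.90, 0.95, 0.99))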
However, using the math that I can quickly take advantage of, and the ability to plot out the data, I can, with a much higher degree of confidence, make choices on how I want my Cost Threshold for Parallelism to be set on my servers. From there, I measure the effects of my choices and adjust as necessary.
Nice. Wonder if this can be done in PowerShell? I'm a little reluctant to add yet another programming language. Might be like those instrument flight rule (IFR) pilots who only put in the minimum time per year flying on instruments, most likely unsafe when they really need to use the skill.
Following for when you decide to do this and publish it. 🙂
Yeah, it might be. I’m trying to explore and understand R, so I went there for my math & graphs. However, it’s not the only way to skin this cat, I’m sure.
Skinning cats is bad 😮 But great post – thank you Grant! By the way – there’s more than one way to cook an egg!! A much less gross way to say the same thing 🙂
Thanks! I’ll try to avoid grossing people out too much (too often).
OK, fine. But I don't know why I should raise my Cost Threshold at all. Nearly all of my statements have higher costs when I run them with the query hint "OPTION (MAXDOP 1)". Besides that, they are slower with that hint as well. I assume that setting this hint has the same effect on my query as raising the Cost Threshold.
Yeah, they probably will have higher costs because setting that hint will completely change the plan that gets generated. You can't compare the two plans' costs at that point.
The primary difference is that some plans will benefit from parallelism. By setting MAXDOP to 1 (through hints or through server settings, and I'd suggest server settings if you're really going to do that, since it's easier to undo later), you're not adjusting the Cost Threshold to allow some queries to run in parallel and others not to based on their estimated costs; you're just forcing serial plans on everything. The reason you would raise your Cost Threshold is precisely the situation you find yourself in: the plans are performing poorly with parallel execution. Raise the Cost Threshold so that those plans get serial plans naturally instead of through hints. Higher cost plans, the ones most likely to benefit, can then still get a parallel plan without you having to muddy your code with hints. You adjust one setting, in one place, that can be easily changed later, rather than having to go to X number of queries and add a hint. It's a much more efficient mechanism for controlling your servers.
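If you do go the server-setting route, it's a single change. Here's a minimal sketch reusing the RODBC connection from the post above, with 16 purely as an example value rather than a recommendation:

# adjust the server-wide setting; 'cost threshold for parallelism' is an advanced option
sqlQuery(dbhandle, "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
sqlQuery(dbhandle, "EXEC sp_configure 'cost threshold for parallelism', 16; RECONFIGURE;")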
Thanks for the explanation.
Not a problem. And it gave me an idea for another blog post. Appreciate the questions. Thank you.
Hello Grant, very good article.
I have a problem with an installation. The standard deviation is very high (4432.966) because there is a plan with an estimated cost of 2385560 (!!!!!!). The curious thing is that its CPUCost = 18.2881 and Estimate Rows = 182881000.
Are very high values in estimated cost normal?
The instance is SQL Server 2008 R2 SP3.
Thank you very much.
Right, so in your case the distribution isn't in any way normal. You'd be best served by tossing that value out and seeing how the rest behave. From there you can make a determination on a good choice for Cost Threshold. Regardless, you sure would want that query to go parallel.
And no, that's not normal; that's an extremely high estimated cost. It sounds like you're moving lots of data with either no filtering at all, or your filters aren't using the indexes you have (if you have them). Nothing but guesses as to the cause.
Hello Grant, thanks for your answer.
A question.
Is it possible for a plan with Parallel=0 to use more than one CPU when executed?
The number of plans with parallel=1 is much smaller than the number with parallel=0: 138 for parallel=1 and 296,000 for parallel=0.
This is the result of collecting the plans executed during 12 hours of work per day over 5 months. The information is collected outside of working hours and sent to my computer.
CXPACKET waits are very high, and the "cost threshold for parallelism" setting means that only plans with an estimated cost greater than that value have parallel=1.
Thanks for all.
Hello again Manuel,
No. If a plan didn’t go parallel, it’s just using a single thread. Parallel plans use multiple threads. You are correct, the cost threshold for parallelism ensures that only plans with an estimated cost greater than that value will go parallel.
Don't worry about CXPACKET in isolation. You have to look at your other wait statistics as well. CXPACKET just means that you have parallelism occurring. It doesn't mean it's a problem. You have to see if you also have issues with CPU waits that would indicate excessive CPU use. Here's more info: https://www.sqlskills.com/help/waits/cxpacket/
Hello Grant, a last question, please.
Is it normal that more than 9000 query plans with an estimated cost greater than the "cost threshold for parallelism" have parallel = 0?
66% of those query plans are prepared or ad hoc (it was not my decision…)
Thank you.
It’s hard to define what “normal” looks like. Normal for your system is abnormal for mine. However, the question is can something exceed the cost threshold and still not go parallel. The answer is yes.
When a query exceeds the cost threshold it doesn’t automatically go parallel. Instead, the optimizer considers parallel plans. Will a parallel plan result in a lower cost than the serial plan? If so, yeah, you’ll get parallel execution. If not, you won’t see parallel execution even though the cost threshold was exceeded.
Now, I’m not looking at your system, so I can’t say what “normal” looks like there. Based on the information I have, your plans are exceeding the threshold but going parallel won’t make them run faster in the optimizer’s opinion, so you still have a single threaded execution plan.
NOTE: If you have a lot of ad hoc queries, you could be seeing stuff that's all over the map for queries that are essentially very similar or even largely the same (there have to be some differences or you wouldn't have more than one plan). This is where you also have to concern yourself with how many times each of those queries is being executed so that you have some understanding of what is causing load on your system.
Great post, Grant! Thanks for sharing! \m/
[…] relationship to your execution plans, how to determine your execution plan cost, and even how to decide what value to set your Cost Threshold to. What I haven’t explicitly addressed in extremely clear terms […]
Hi all,
A standard deviation function should not be used to calculate the 'Cost Threshold for Parallelism' setting value, because it will give an imprecise result. Instead, you need to use a weighted standard deviation function, with the execution count as the weight. Here is a standard deviation explanation video: https://www.youtube.com/watch?v=duKaOx4_TmU
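For anyone who wants to try that, here's a rough sketch of a weighted standard deviation in R. It assumes a hypothetical ExecutionCount column, which is not part of the dbo.QueryCost table used earlier in the post:

# weighted mean and weighted standard deviation, using execution counts as weights
# NOTE: ExecutionCount is a hypothetical column, not in dbo.QueryCost as queried above
w <- data$ExecutionCount
x <- data$EstimatedCost
wmean <- sum(w * x) / sum(w)
wsd <- sqrt(sum(w * (x - wmean)^2) / sum(w))
wmean
wsd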
Heh… or set it to 20 and call it a day because no setting will be perfect. 😀