What is CPU(Consumed) Wait Category?
-
Saturday, May 26, 2012 12:05 PM
We have a new server with 2 E7 10 core CPU (totally 40 logical cpus including HT) in 2 NUMA nodes. The server is equipped with 128GB RAM, and we are running SQL 2008 R2 SP1 Enterprise x64. The CPU affinity mask is the default value; and MAXDOP is set to 8.
There isn't too much workload on the server at present, but people complian that the queries they run are slower than should be. I have used Data Collection to collect performance data for a couple of days, and what's causing my attention is that nearly 50% of the total server waits (through the Server Activity History report) is caused by CPU all the time. But the top left hand side "%CPU" is always under 10% as a whole, which means the CPU is not under a pressure at all. If I click into the Server Waits subreport, CPU is on the top of the list in the bottom table, order by Wait Time. And if I expand the CPU node, the subcategory CPU(Consumed) is almost 100% contributing to the total CPU wait. The CPU(Signal Wait) and SOS_SCHEDULER_YEILD are almost Nil.
I also used the following query to check cpu pressure, and the %singal cpu wait is 12.89.
---- Total waits are wait_time_ms
Select signal_wait_time_ms=sum(signal_wait_time_ms)
,'%signal (cpu) waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))
,resource_wait_time_ms=sum(wait_time_ms - signal_wait_time_ms)
,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))
From sys.dm_os_wait_statsSo my question is: what is the CPU(Consumed) wait on earth? Is it normal or indicating some configuration problems?
Thank you.
All Replies
-
Saturday, May 26, 2012 12:52 PMModerator
Whatever reports you are looking at are probably grouping CXPACKET waits into the CPU rollup that is being displayed, making it skew incorrectly. Have you tried running a trace to see what those reports are collecting and how?
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem!- Edited by Jonathan KehayiasMVP, Moderator Saturday, May 26, 2012 12:57 PM
- Marked As Answer by WII Sunday, May 27, 2012 12:09 AM
-
Saturday, May 26, 2012 1:14 PM
Hi Jonathan, thanks very much for your quick reply. And yes, you are right, I just checked wait dmv by the following query, and it returned me that CXPACKET is the top one! It's got triple times larger wait time than the second one - LAZYWRITER_SLEEP. So does it mean MAXDOP should be reduced to 4 to have a look? Now the value is 8. Thanks.
select *
from sys.dm_os_wait_stats
order by wait_time_ms desc; -
Saturday, May 26, 2012 2:01 PMModerator
Hi Jonathan, thanks very much for your quick reply. And yes, you are right, I just checked wait dmv by the following query, and it returned me that CXPACKET is the top one! It's got triple times larger wait time than the second one - LAZYWRITER_SLEEP. So does it mean MAXDOP should be reduced to 4 to have a look? Now the value is 8. Thanks.
select *
from sys.dm_os_wait_stats
order by wait_time_ms desc;No, that means you have queries using Parallelism and nothing else. CXPACKET is never a problem it is just a sign that you have parallelism in use, you'd need to look further at what else is happening to find the problem.
http://www.sqlsoldier.com/wp/sqlserver/thebarkingdoganalogy
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem! -
Saturday, May 26, 2012 3:13 PMCXPACKET is never a problem? Can you explain? I've seen plenty of cases where queries were going parrallel with the threads all CXPACKET waiting and causing the query to take orders of magnitude longer than they ran with a MAXDOP 1 or if the cost threshold was switched from 5 to 10
Chuck Pedretti | Magenic – North Region | magenic.com
- Edited by Chuck Pedretti Saturday, May 26, 2012 3:23 PM
-
Saturday, May 26, 2012 3:35 PM
Indeed, CXPacket is not the problem, it's a symptom, that's all.
CXPacket waits show that queries are running in parallel and some threads are having to wait for others. What you need to do is find out why some threads are lagging behind. Could be parallel skew (where some threads process more rows than others, often due to poor estimates). Could be some threads are waiting for something else. See what threads of that parallel query are not waiting on CXPacket, see what they really are waiting for. Then try to resolve that wait.
For example, if a query parallels over 8 threads, 7 of those threads are suspended with CXPacket and one has a latch wait, the root problem is that the one thread is stuck waiting for a latch, has not managed to do the work the other threads have done and now those other threads are required to sit and wait for that latch request to be granted and for that one thread to finish it's work.
Gail Shaw
-
Saturday, May 26, 2012 3:44 PM
Hi Jonathan, thanks very much for your quick reply. And yes, you are right, I just checked wait dmv by the following query, and it returned me that CXPACKET is the top one! It's got triple times larger wait time than the second one - LAZYWRITER_SLEEP. So does it mean MAXDOP should be reduced to 4 to have a look? Now the value is 8. Thanks.
Well, it's easy enough to try a new maxdop value and see.
How many active SPIDs do you normally see?
You probably want to do some more conventional performance analysis, what are the top waits, and what queries are taking the most reads or CPU. Get some real numbers instead of (in addition to!) subjective reports.
Actually, if your second wait is lazywriter_sleep ... well, I've used a query that filters out all the system waits, so actually I can't hardly say, but at least it's nothing worse.
Josh
select top 10 * from sys.dm_os_wait_stats where wait_type not in -- remove system waits ( 'KSOURCE_WAKEUP', 'SLEEP_BPOOL_FLUSH', 'BROKER_TASK_STOP', 'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT', 'FT_IFTS_SCHEDULER_IDLE_WAIT', 'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'BROKER_EVENTHANDLER', 'LAZYWRITER_SLEEP', 'BAD_PAGE_PROCESS', 'BROKER_TRANSMITTER', 'CHECKPOINT_QUEUE', 'DBMIRROR_EVENTS_QUEUE', 'LAZYWRITER_SLEEP', 'ONDEMAND_TASK_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'LOGMGR_QUEUE', 'SLEEP_TASK', 'SQLTRACE_BUFFER_FLUSH', 'CLR_MANUAL_EVENT', 'BROKER_RECEIVE_WAITFOR', 'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS', 'BROKER_TO_FLUSH' ) order by wait_time_ms desc
-
Saturday, May 26, 2012 3:47 PM
OK, I understand what you are saying. I just think of what constitutes a problem differently - if I see CXPACKET waits hanging around thats a problem. In the same way that if I saw water running down the wall - thats a problem. Might be a busted pipe, someone left the tub on or the rain is coming in.
Chuck Pedretti | Magenic – North Region | magenic.com
- Edited by Chuck Pedretti Saturday, May 26, 2012 3:49 PM
-
Saturday, May 26, 2012 4:09 PM
It's a symptom that there may be a problem. Water running down a wall is fine if it's an outside wall :-) Similarly CXPacket is fine if the waits are short.
The thing is, there's so much bad info out of the form "CXPacket waits are bad, lower maxdop!", which is not the case at all.
Gail Shaw
-
Saturday, May 26, 2012 4:11 PM
Hi Jonathan, thanks very much for your quick reply. And yes, you are right, I just checked wait dmv by the following query, and it returned me that CXPACKET is the top one! It's got triple times larger wait time than the second one - LAZYWRITER_SLEEP. So does it mean MAXDOP should be reduced to 4 to have a look? Now the value is 8.
I wouldn't jump and reduce maxdop. Check the link Jonathan provided. Check chapter 3 of this http://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/ Look for the root problem.
Gail Shaw
-
Saturday, May 26, 2012 4:39 PM
OK, I understand what you are saying. I just think of what constitutes a problem differently - if I see CXPACKET waits hanging around thats a problem. In the same way that if I saw water running down the wall - thats a problem. Might be a busted pipe, someone left the tub on or the rain is coming in.
Don't get carried away by a (wrong) metaphor!
If you see a lot of sawdust coming out of something, just means work is getting done not that there's anything wrong. Cxpackets are just a cost of doing business, if you have a lot of SPIDS and a lot of parallelism you're going to have a lot of Cxpackets. Just so the average cost is modest, or even if it's not, it may be the server is performing optimally. Could be some much smaller number of something else entirely, which is your actual problem.
Just on the matter of maxdop, I've seen pathological problems when reducing maxdop to 1, a query that takes one second with maxdop=2 took ten seconds with maxdop=1. Big problem with execution plan, apparently. Plus, if you've PAID for all those cores, you might as well use them, even if that means allocating them and letting them wait now and then. SQL Server optimization tends to be for resources, not time. Sometimes you can coerce SQL Server to "waste" some cores and get much faster throughput. If I set my query to maxdop=8 and it takes 0.3 seconds, maybe I'd prefer that, depending on other load factors, even if it makes my Cxpacket numbers look much worse.
Of course I do recall you seem to have a reported problem of slow execution and CPU contention, so some of what I'm reciting here may not apply. But I think you need some more application-centric diagnostics, just the Cxpacket stat simply doesn't tell you enough.
Josh
-
Saturday, May 26, 2012 8:42 PMModerator
That is so incredibly wrong to think because of how CXPACKET waits accumulate. Gail already explained it above, but if you have 8 threads in a parallel task execution and 1 is waiting on something, you have 7 parallel threads accumulating CXPACKET wait time, plus the original task control thread, which is why CXPACKET is the top wait type on most SMP or NUMA systems. All it means is that you have parallelism being used during query execution, it is inevitable that it will be a top wait type if this is the case. If you suppress parallelism by lowering DOP, you don't fix the underlying wait you just serialize all processing. Only true OLTP workloads (and sadly SharePoint which isn't a true OLTP) typically benefit from having parallelism severely restricted or disable, and my definition of OLTP in this sense is not what most databases today are, it is 30K+ transactions/sec with average transaction times of 200-300ms or less.OK, I understand what you are saying. I just think of what constitutes a problem differently - if I see CXPACKET waits hanging around thats a problem. In the same way that if I saw water running down the wall - thats a problem. Might be a busted pipe, someone left the tub on or the rain is coming in.
Chuck Pedretti | Magenic – North Region | magenic.com
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem! -
Saturday, May 26, 2012 9:47 PMModerator
>>Well, it's easy enough to try a new maxdop value and see.
Which in itself lies one of the single largest issues with the horrid information out there and MAXDOP setting. It's easy to reset the max memory on the instance and flush the cache but do you do that just because something is using memory? I hope not.
Changing MAXDOP simply to see if it helps one thing is borderline nuts. Best way I could put it. Either that or it is the only transaction hitting the instance and you have only that to worry about. Jonathan and Gail completely answered the questions and concerns, I only felt a need to point out, if someone reads that line, please don't take that as an excuse to run out and just change it because it's easy.
Ted Krueger Blog on lessthandot.com @onpnt on twitter
Troubleshooting SQL Server: A Guide for the Accidental DBA
Please click the Mark as Answer button if a post solves your problem! -
Saturday, May 26, 2012 10:35 PM
FWIW what I am talking about is CXPACKET duration on an individual statement being run not any sort of accumulation of counts or total elapsed time accross all processes.That is so incredibly wrong to think because of how CXPACKET waits accumulate. Gail already explained it above, but if you have 8 threads in a parallel task execution and 1 is waiting on something, you have 7 parallel threads accumulating CXPACKET wait time, plus the original task control thread, which is why CXPACKET is the top wait type on most SMP or NUMA systems. All it means is that you have parallelism being used during query execution, it is inevitable that it will be a top wait type if this is the case. If you suppress parallelism by lowering DOP, you don't fix the underlying wait you just serialize all processing. Only true OLTP workloads (and sadly SharePoint which isn't a true OLTP) typically benefit from having parallelism severely restricted or disable, and my definition of OLTP in this sense is not what most databases today are, it is 30K+ transactions/sec with average transaction times of 200-300ms or less.
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem!Chuck Pedretti | Magenic – North Region | magenic.com
-
Saturday, May 26, 2012 11:16 PM
>>Well, it's easy enough to try a new maxdop value and see.
Which in itself lies one of the single largest issues with the horrid information out there and MAXDOP setting. It's easy to reset the max memory on the instance and flush the cache but do you do that just because something is using memory? I hope not.
Changing MAXDOP simply to see if it helps one thing is borderline nuts.
I do it all day long - on dev environments.
I would hope you would too, rather than argue opinions all day.
Of course one wants to be far more careful doing *anything* in a production environment, but even there, in the environments I've seen over the last five years or so, changing global maxdop settings in realtime, flushing the cache, etc have not been killers. That is, after I've evaluated a theory in dev, and the time comes to deploy it to production, I do so and - typically small impact, vanishing after maybe thirty seconds and converging (we hope) on the new expected behaviors.
Better yet, you could do it selectively, locate just those queries that use a ton of resources in parallel and change their maxdop settings INDIVIDUALLY - of course, taking even that minor change through a standard process of configuration control, testing, and deployment.
Josh
-
Saturday, May 26, 2012 11:56 PM
Changing MAXDOP simply to see if it helps one thing is borderline nuts.
Pretty sure that most of us are talking about trying out the query hint OPTION (MAXDOP n) not fooling with the database level setting
Chuck Pedretti | Magenic – North Region | magenic.com
- Edited by Chuck Pedretti Saturday, May 26, 2012 11:57 PM
-
Sunday, May 27, 2012 12:09 AM
OK guys, I think I get what you mean - CXPACKET is just a sympton, not the cause. Lower DOP might help on the surface, but doesn't fix the root cause. And it may harm the overall server performance. I totally agree now.
To answer JRStern's question, the second wait type is OLEDB. Which I think is becuase of massive distributed queries through Linked Server. This might be the real cause of excessive CXPACKET waits.
Thank you very much for all of your inputs. I learnt a lot from this thread.
-
Sunday, May 27, 2012 12:30 AMModerator
OK guys, I think I get what you mean - CXPACKET is just a sympton, not the cause. Lower DOP might help on the surface, but doesn't fix the root cause. And it may harm the overall server performance. I totally agree now.
To answer JRStern's question, the second wait type is OLEDB. Which I think is becuase of massive distributed queries through Linked Server. This might be the real cause of excessive CXPACKET waits.
Thank you very much for all of your inputs. I learnt a lot from this thread.
How do you use the distributed queries exactly? Have you made certain that you've provided the appropriate permissions for the account(s) making the linked server authentication so that they can leverage remote statistics correctly? (This requires db_ddladmin at a minimum on the remote database)
http://www.benjaminnevarez.com/2011/05/optimizer-statistics-on-linked-servers/
Incorrect costing based on bad estimates for the distributed queries could easily cause parallelism to be used where it shouldn't have, again the symptom being CXPACKET waits, but the root cause being incorrect permissions potentially in the remote database. If you can find a query where using a lower degree of parallelism actually improves the performance look at what is happening in the plan and you'll likely find a issue that lead to a query being costed incorrectly during optimization, or that lead to a skew that was problemattic during execution. Fixing that problem will fix the overall problem and you don't have to monkey with query hints or server level settings to fix it.
Jonathan Kehayias | Principal Consultant, SQLSkills.com
SQL Server MVP | Microsoft Certified Master: SQL Server 2008
Author of Troubleshooting SQL Server: A Guide for Accidental DBAs
Feel free to contact me through My Blog or Twitter. Become a SQLskills Insider!
Please click the Mark as Answer button if a post solves your problem! -
Sunday, May 27, 2012 12:44 AM
To answer JRStern's question, the second wait type is OLEDB. Which I think is becuase of massive distributed queries through Linked Server. This might be the real cause of excessive CXPACKET waits.
Very plausible!
One more on the maxdop - the big mistake is typically leaving it at zero on big new servers with 32++ cores, but once you have it set down to your NUMA number or some factor thereof, as you already do, any further reduction should be on the order of fine-tuning, not likely to be the cause of any big performance factors.
Josh
-
Monday, May 28, 2012 3:01 AM
Hi Jonathan,
Though we don't have the permission issue that you mentioned (but still great to know it), you point is still invaluable! According to what you said, I found that even those problematic queries are run locally on the remote server, they are slow as well. If I run the actual execution plan, there is a huge difference between Estimated and Actual Row Counts. I tried to update statistics, rewrite query... finally I figured it out that it's a parameter sniffing issue. With the query hint "RECOMPILE", the query exec plan is optimal now. I will wait and see how it goes.
Thanks again.
-
Monday, May 28, 2012 3:02 AM
To answer JRStern's question, the second wait type is OLEDB. Which I think is becuase of massive distributed queries through Linked Server. This might be the real cause of excessive CXPACKET waits.
Very plausible!
One more on the maxdop - the big mistake is typically leaving it at zero on big new servers with 32++ cores, but once you have it set down to your NUMA number or some factor thereof, as you already do, any further reduction should be on the order of fine-tuning, not likely to be the cause of any big performance factors.
Josh
Thanks Josh. I agree with you 100%. -
Tuesday, August 07, 2012 3:12 PMI knew it did not feel right just automatically trying to bring down CXPACKET waits just because it is at the top. I've read other folks think it is a problem if CXPACKET is over 5% of the total waits, period. That is the wrong way to look at it. You highlight a key point GilaMonster, "CXPACKET is fine if the waits are short". For instance, a server I am looking at has 1,558,554 wait seconds and 578,672,134 wait counts for CXPACKET, that's .002 second per wait, so to me that doesn't seem to be a problem. I believe that applies to any of the waits. I think attention should be focused on the average wait in seconds (dividing total seconds waiting by total wait counts) to zero in on which waits need attention.

