Detecting SMT/Hyperthreads vs physical cores

  • Question

  • At runtime I need to configure a concurrent work queue for the user's machine, and in order to do that I need to query the core situation. Win32 GetNativeSystemInfo is available but it reports all logical processors. Win32 GetLogicalProcessorInformation is not available to WinRT/Metro and I didn't see __cpuid in there either, so I can't distinguish between physical cores and hyperthreads. Is there a way to do this under WinRT?
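
    For reference, the desktop-only approach being asked about looks roughly like the sketch below (not callable from a Metro/WinRT app, which is exactly the problem). It counts the RelationProcessorCore records returned by GetLogicalProcessorInformation; CountPhysicalCores is just an illustrative name.

        #include <windows.h>
        #include <vector>

        // Count physical cores by counting RelationProcessorCore records.
        // Desktop-only: GetLogicalProcessorInformation is not in the WinRT API surface.
        DWORD CountPhysicalCores()
        {
            DWORD bytes = 0;
            GetLogicalProcessorInformation(nullptr, &bytes);   // ask for the required buffer size
            std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
                bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
            if (!GetLogicalProcessorInformation(info.data(), &bytes)) { return 0; }

            DWORD cores = 0;
            for (const auto& entry : info)
            {
                if (entry.Relationship == RelationProcessorCore) { ++cores; }
            }
            return cores;   // compare against the logical count from GetNativeSystemInfo
        }
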
    • Edited by Scott Bruno Thursday, April 19, 2012 9:35 PM
    Thursday, April 19, 2012 9:34 PM

Answers

All replies

  • Scott,

    Thanks for the question.  I will look into this for you. 

    Best Wishes - Eric

    Thursday, April 19, 2012 9:52 PM
    Moderator
  • Thank you, Eric. We're building a multicore-friendly DirectX 11 game engine so this is a critical point for the technical design.
    Thursday, April 19, 2012 10:02 PM
  • Scott,

    Are you mostly concerned about dual-core hyper-threaded systems showing up as quad-core?

    Best Wishes - Eric

    Saturday, April 21, 2012 12:41 AM
    Moderator
  • Hello Eric,

    My concern is tomorrow as much as today. The code is designed to be strongly parallel and to scale well with the CPU, without code changes. It will put as many cores as you give it to good use.

    It's not really locking myself into oversubscription that I'm worried about, because that won't happen. Performance is monitored per thread, and the active pool size is self-regulating. If a thread is struggling, I'll spin it for a while and try it again later; if it keeps struggling, I'll kick it out of the pool for the rest of the session.
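
    Purely as an illustration of that self-regulation idea (WorkerStats, Regulate, and the thresholds are hypothetical placeholders, not the engine's actual code), the per-window decision might look something like:

        // Hypothetical sketch of the per-thread self-regulation described above.
        struct WorkerStats
        {
            double averageTaskMs = 0.0;  // rolling average cost of tasks on this worker
            int    slowStreak    = 0;    // consecutive windows spent above the expected cost
        };

        enum class WorkerAction { Keep, Spin, Evict };

        WorkerAction Regulate(WorkerStats& stats, double expectedTaskMs)
        {
            const double kSpinFactor     = 1.5;  // "struggling" = 50% slower than expected
            const int    kEvictThreshold = 4;    // struggled this many windows in a row

            if (stats.averageTaskMs > expectedTaskMs * kSpinFactor)
            {
                if (++stats.slowStreak >= kEvictThreshold) { return WorkerAction::Evict; }
                return WorkerAction::Spin;       // park it for a while, try again later
            }
            stats.slowStreak = 0;
            return WorkerAction::Keep;
        }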

    So yes, technically I could just grab a thread for every reported logical processor and let the system find its own level over time. But that's hardly ideal because it means the game session could very well begin its life in a sub-optimal configuration.

    The ideal solution is future-proof, at least for the foreseeable future, which is exactly what the information returned from a function like GetLogicalProcessorInformation would provide in this case. If I could obtain the count of physical cores I could begin with a configuration that's bound to be close to ideal, and it would Just Work -- today, and five years from now. Conversely, relying on something like a table of known family/stepping/feature combinations is not future-proof, because in the future it would have to guess.

    Another concern is that there are parts of the code that would likely benefit from Hyperthreading, so in an environment where it's available it might be advantageous to switch in an existing "extra" thread for that portion of the update.

    May I ask why the existing function is forbidden? I can understand wanting to avoid adding anything to WinRT that would be meaningless on some devices. But if XBLive wants us to be Metro, and the store wants us to be Metro, then they have to enable the "heavy duty" PC games as well as the "click the monkey!" tablet games. The ability to reliably assess the system so that we can scale with it is definitely on the list of enablers.

    Saturday, April 21, 2012 3:27 AM
  • Thank You Scott!  I am looking into this.

    Best Wishes - Eric

    Tuesday, April 24, 2012 5:42 PM
    Moderator
  • Thanks again, Eric.
    Tuesday, April 24, 2012 6:12 PM
  • For some reason I can no longer access the account I opened this question with, but I'm still alive and hoping.
    Thursday, April 26, 2012 11:51 PM
  • Thanks Scott,

    I have searched a lot of different ways and so far am unable to find a way to do this from a Metro style app.  I will continue to look.

    Best Wishes - Eric

    Friday, April 27, 2012 12:47 AM
    Moderator
  • I do appreciate your help.

    I've looked too and I just don't think anything was added to WinRT to make up for the fact that GetLogical(...) was made off-limits. That's going to make it difficult to do any sort of general work queue, game or otherwise, because you simply don't know how many threads to run. You can't treat a HT like a real core in that case because you'll destroy the cache and do more harm than good. The DirectX SDK Core Detection sample states the case nicely:

    "...More significantly, SMT or HT Technology threads share the L1 instruction and data caches. If their memory access patterns are incompatible, they can end up fighting over the cache and causing many cache misses. In the worst case, the total performance for the CPU core can actually decrease when a second thread is run.

    ...On Windows, the situation is more complicated. The number of threads and their configuration will vary from computer to computer, and determining the configuration is complicated. The function GetLogicalProcessorInformation gives information about the relationship between different hardware threads, and this function is available on Windows Vista, Windows 7, and Windows XP SP3.

    ...The safest assumption is to have no more than one CPU-intensive thread per CPU core."

    I don't mean to preach; I know that you know this already. I'm just trying to demonstrate to TPTB that this seems like an actual oversight in the API rather than the whimsical musings of some random programmer. With all the work that was done to weave parallel programming into the fabric of modern Windows development, we would appear to have everything we need -- except the means to determine how many threads we should be using.



    • Edited by ScottBruno Friday, April 27, 2012 6:16 AM
    Friday, April 27, 2012 6:11 AM
  • Scott,

    Thank you for the additional details.  I appreciate the information.  It all helps.  Stay on the thread and ping me every couple of weeks.  I will update this if I find something.

    Best Wishes - Eric

    Friday, April 27, 2012 7:35 AM
    Moderator
  • I have added this as a request on UserVoice. I cannot imagine that I am the only person who is hamstrung by this, nor can I imagine how to support Metro without this information. If you are a developer who needs to know how many threads you should create for your WinRT app, please add your vote here:

    http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/2836624-add-the-means-to-detect-physical-cpu-cores-vs-hyp.

    And thanks again to Eric for looking into this.

    Wednesday, May 9, 2012 12:04 AM
  • Thanks Scott,

    I added my votes to it.

    Best Wishes - Eric

    Wednesday, May 9, 2012 6:02 AM
    Moderator
  • Scott,

    There is no way to do this from a Metro style app. 

    Best Wishes - Eric

    P.S.

    Here's a great conference on C++ apps; it doesn't answer your question, but it does have a lot of information.

    http://channel9.msdn.com/Events/Windows-Camp/Developing-Windows-8-Metro-style-apps-in-Cpp

    Friday, May 25, 2012 3:47 AM
    Moderator
  • I was in attendance at the conference and I did ask your question there just in case someone knew of a way to do this.  They confirmed that it is not possible.

    Best Wishes - Eric


    Friday, May 25, 2012 5:03 AM
    Moderator
  • Never let it be said that Eric Hanson doesn't take developer support seriously!

    Ah, well. Abandoning WinRT as a viable platform just means less work for me. The game still runs fine under Win8 as regular native Windows/DirectX software, so the only thing we lose is the app store. Full Steam ahead!

    Thanks again for your help.

    --

    Scott

    Tuesday, May 29, 2012 11:44 PM
  • Hello Scott,

    The recommended approach for data parallel optimization on Win8 is to use the WinRT ThreadPool::RunAsync API as opposed to trying to build your own parallel work queue implementation from scratch.  ThreadPool is a core part of the OS and is aggressively optimized (we use it ourselves in some of the most important and performance-sensitive parts of the system).  It is absolutely possible (in fact much easier than with the legacy Win32 threading APIs) to achieve awesome data parallel performance this way.

    Couple of important things to bear in mind about getting good performance from ThreadPool:

    • Use it as-designed as a work pool primitive; don't try to layer some other work management system on top of it.  You shouldn't be trying to manually schedule work across cores: that's the job of the OS.
    • Be sensible about how many buckets you split your workload into.  You typically want to be looking at double-digit threadpool tasks per frame.  Too few, and the OS doesn't have enough opportunities for meaningful parallelism.  Too many, and you'll find yourself drowning in work submission and coordination overheads.  (A minimal sketch of this pattern follows below.)
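
    A minimal C++/CX sketch of that pattern, not production code: UpdateChunk, UpdateFrameInParallel, and the chunk count are hypothetical, and it waits with PPL tasks purely for brevity.

        #include <ppltasks.h>
        #include <vector>

        using namespace Windows::Foundation;
        using namespace Windows::System::Threading;
        using namespace concurrency;

        void UpdateChunk(int chunkIndex);   // hypothetical: one slice of the frame's work

        // Split one frame's update into a few dozen chunks, hand them to the OS
        // thread pool, and wait for them all before moving on.
        void UpdateFrameInParallel()
        {
            const int kChunkCount = 24;     // "double digit" tasks per frame
            std::vector<task<void>> chunks;
            chunks.reserve(kChunkCount);

            for (int i = 0; i < kChunkCount; ++i)
            {
                IAsyncAction^ op = ThreadPool::RunAsync(
                    ref new WorkItemHandler([i](IAsyncAction^)
                    {
                        UpdateChunk(i);     // the OS decides where this runs
                    }));
                chunks.push_back(create_task(op));
            }

            when_all(chunks.begin(), chunks.end()).wait();   // don't call on the UI thread
        }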

    Note that there are good reasons for the OS wanting to control work scheduling as opposed to letting every app roll their own.  Raw perf is of course important (and ThreadPool can deliver this), but in these days of laptops and tablet devices, power usage and scalability across different hardware are also crucial parts of a good user experience.

    A problem we see a lot with legacy games that manually divide their work over all available cores and set explicit thread affinities is this: what happens when that content is run on the latest and greatest hardware that has 8 cores and better performance than the developer ever expected?  Ideally this newer machine should be able to run the game on a single core without even breaking a sweat, but because the developer told the OS not to do that, instead every core must be woken up every frame.  They might only run for a couple of milliseconds and then sleep the remaining 14, but this prevents any core from ever going into full power collapse, which destroys battery life.

    Using ThreadPool, you just submit a bunch of work and let the OS decide where to run it.  If all cores are necessary to achieve your performance goals, that's how it runs.  If just a couple of cores are fast enough to keep up, the others can be powered down.  If the user decides to dock a video chat app alongside your game (remember all Metro apps can be docked next to others, so you can't count on always being in exclusive fullscreen mode!) the OS might need to run your game logic on N-1 cores to free some cycles for that other app.

    A static processor query API cannot help to solve this sort of dynamic scheduling problem. This is something only the OS knows enough to do right, which is why Metro ThreadPool was designed the way it is.

    Hope that explanation makes more sense of what you are seeing, and will help you understand how best to move your threaded code over to this new platform.

    Thursday, May 31, 2012 3:20 AM
  • I appreciate the response.

    In the end it’s just kind of silly to code two paths for the same OS. I can’t find a compelling reason to do that.

     

    • Proposed as answer by ScottBruno Friday, June 1, 2012 2:52 AM
    Friday, June 1, 2012 2:52 AM
  • No need for two different paths:  ThreadPool has been part of the Windows kernel ever since Vista.
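
    For reference, the desktop side of that same OS pool is exposed through the Win32 thread pool API that shipped with Vista. A rough sketch (DoChunk and RunChunksOnOsPool are hypothetical names, and error handling is omitted):

        #include <windows.h>
        #include <vector>

        void DoChunk(int chunkIndex);   // hypothetical: one slice of work

        static VOID CALLBACK ChunkCallback(PTP_CALLBACK_INSTANCE, PVOID context, PTP_WORK)
        {
            DoChunk(*static_cast<int*>(context));
        }

        void RunChunksOnOsPool(int chunkCount)
        {
            // One work object per chunk keeps the sketch short; a real engine would
            // likely reuse a single work object with an atomic chunk counter.
            std::vector<int> indices(chunkCount);
            std::vector<PTP_WORK> works(chunkCount);
            for (int i = 0; i < chunkCount; ++i)
            {
                indices[i] = i;
                works[i] = CreateThreadpoolWork(ChunkCallback, &indices[i], nullptr);
                SubmitThreadpoolWork(works[i]);
            }
            for (PTP_WORK w : works)
            {
                WaitForThreadpoolWorkCallbacks(w, FALSE);
                CloseThreadpoolWork(w);
            }
        }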


    XNA Framework Developer

    Friday, June 1, 2012 3:37 AM
    I think everyone here is fine letting whatever the Microsoft thread pool du jour is do the work for us, but we still need to know the split between physical and logical cores to decide how much work to submit.  You can get at the logical cores, so why not implement the call to get the physical core count as well (GetLogicalProcessorInformation...)?  Especially when you have SIMD-intensive code, there are only so many physical SIMD units to run on.  On a slightly different note, there doesn't appear to be a way to get at the physical memory of the device.  That's another important (and missing) metric in the API.

    Wednesday, August 22, 2012 9:15 PM
    I know this is old, but maybe someone will find it useful...

    Compute something with the number of logical processors you get from GetNativeSystemInfo, and then do the same with half as many threads. Hyperthreading won't give you more than about a 20% speed-up, but real cores will double it.
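
    A rough sketch of that calibration idea: BusyWork, the workload size, and the 1.5 cutoff are arbitrary placeholders, and std::thread is used only to keep the sketch short; inside a Metro app the measured work would have to go through the allowed threading APIs instead.

        #include <chrono>
        #include <thread>
        #include <vector>

        // Arbitrary CPU-bound placeholder workload.
        static void BusyWork()
        {
            volatile double x = 1.0;
            for (int i = 0; i < 20000000; ++i) { x = x * 1.0000001 + 0.5; }
        }

        // Wall-clock seconds to run BusyWork once on each of threadCount threads.
        static double TimeWithThreads(unsigned threadCount)
        {
            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for (unsigned i = 0; i < threadCount; ++i) { threads.emplace_back(BusyWork); }
            for (auto& t : threads) { t.join(); }
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
        }

        // If doubling the thread count (and hence the total work) makes the run take
        // substantially longer, the "extra" processors are probably SMT siblings.
        bool LooksLikeSmt()
        {
            unsigned logical = std::thread::hardware_concurrency();
            if (logical < 2) { return false; }
            double full = TimeWithThreads(logical);      // all logical processors
            double half = TimeWithThreads(logical / 2);  // half as many threads
            return full > half * 1.5;                    // 1.5 is an arbitrary cutoff
        }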




    • Edited by dubik1 Wednesday, November 28, 2012 8:13 AM
    Wednesday, November 28, 2012 8:07 AM
  • I think Shawn gave a good answer to this question: You cannot know for sure the current workload requirements of the processors. There are just too many variables to account for (like other apps and background services running, low power requirements, etc).

    That's why he recommends using the ThreadPool and letting the OS do a great job at scheduling all the work. Even if you can do a better job on your own PC, it is unlikely that it will perform as well on the variety of PCs out there.



    Pierre Henri K
    Developer of PressPlay Video, an advanced video player for Windows 8 (with experimental support for MKV and FLV videos)

    Friday, November 30, 2012 8:11 AM
    I disagree that you cannot know the workload that your app has.  There's not much you can do about the other processes on the system.  The bigger issue is hyperthreading.  If we didn't have it, then there would be no difference between logical and physical core count.  With it, there are twice as many logical cores as physical cores.  We don't see a speedup on SIMD tasks from allocating tasks to the logical core count, only the physical core count.

    I do know that I have several SIMD-intensive tasks, and I know that I have the same number of SIMD units as physical cores.  When firing off tasks to a pool, there's no way to flag a task to say that it is a SIMD task and should be assigned by the pool to a different physical core.  This is a limitation of the GCD model and of Microsoft's ThreadPool.  If we assign 2 tasks on a 4-logical-core HT machine (2 physical cores x 2 HT), they may both end up on the same physical core.  I'm trying to avoid assigning 4 tasks that each compete for the same physical SIMD unit and cache, regardless of other processes that get scheduled in there.


    Wednesday, January 16, 2013 6:56 PM