Test hardware?

    Question

  • Will test hardware be available for purchase any time before GA of Windows 8/RT? I have at least one project that could get pretty CPU-heavy, if its WP7 sibling is any indication. Actual hardware would be preferable, because at some point, to evaluate low-level optimizations, I will have to deal with the realities of the lowest common denominator, that being an ARM CPU.

    The point is this: said project was relatively successful on WP7, considering that it's a niche app on a niche platform. The last thing I want is to be caught off guard by bad reviews around launch because everyone went for ARM tablets, which I never had a chance to actually test on, optimize for, or adapt features and update rates for.

    The W8 version has been rewritten in C++/CX to gain some headroom already, so using WP7 as a testbed for the new code is out.

    Monday, June 4, 2012 12:59 PM

Answers

  • Tom,

    Here is the information available on ARM:

    http://blogs.msdn.com/b/b8/archive/2012/02/09/building-windows-for-the-arm-processor-architecture.aspx

    We also have a one-day conference on C++ Metro style apps recorded here (note that it is not ARM specific):

    http://channel9.msdn.com/Events/Windows-Camp/Developing-Windows-8-Metro-style-apps-in-Cpp?sort=rating&direction=asc#tab_sortBy_rating

    I would recommend the Herb Sutter talk (and the others as well):

    http://channel9.msdn.com/Events/Windows-Camp/Developing-Windows-8-Metro-style-apps-in-Cpp/Cpp-for-the-Windows-Runtime

    Best Wishes - Eric

    Monday, June 4, 2012 5:01 PM
    Moderator
  • Tom,

    Here is a post from Shawn Hargreaves (MSFT) with additional performance information. The information posted above is what is currently available on ARM.

    The recommended approach for data-parallel optimization on Win8 is to use the WinRT ThreadPool::RunAsync API, as opposed to trying to build your own parallel work queue implementation from scratch. ThreadPool is a core part of the OS and is aggressively optimized (we use it ourselves in some of the most important and performance-sensitive parts of the system). It is absolutely possible (in fact much easier than with the legacy Win32 threading APIs) to achieve awesome data-parallel performance this way.

    A couple of important things to bear in mind about getting good performance from ThreadPool:

    • Use it as designed, as a work pool primitive; don't try to layer some other work management system on top of it. You shouldn't be trying to manually schedule work across cores: that's the job of the OS.
    • Be sensible about how many buckets you split your workload into. You typically want to be looking at a double-digit number of threadpool tasks per frame. Too few, and the OS doesn't have enough opportunities for meaningful parallelism. Too many, and you'll find yourself drowning in work submission and coordination overhead.
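    The guidance above can be sketched in portable standard C++. This is a minimal illustration of the "split a frame into a fixed, double-digit number of tasks and let the scheduler place them" idea; std::async is used here only as a stand-in for Windows::System::Threading::ThreadPool::RunAsync (on WinRT the lambda body would be handed to RunAsync instead), and the workload, chunk count, and function names are illustrative assumptions:

    ```cpp
    // Sketch: partition one frame's workload into a hardware-independent
    // number of tasks and hand them to the scheduler -- no thread
    // affinities, no manual core counting.
    #include <algorithm>
    #include <future>
    #include <numeric>
    #include <vector>

    // Process one contiguous chunk of the frame's data (here: a simple sum,
    // standing in for signal processing or similar per-frame work).
    static long long process_chunk(const std::vector<int>& data,
                                   size_t begin, size_t end) {
        return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
    }

    // Split the data into roughly num_tasks chunks. On WinRT, each chunk
    // would be submitted via ThreadPool::RunAsync; std::async is the
    // portable stand-in here.
    long long process_frame(const std::vector<int>& data,
                            size_t num_tasks = 16) {
        num_tasks = std::max<size_t>(1, std::min(num_tasks, data.size()));
        const size_t chunk = (data.size() + num_tasks - 1) / num_tasks;

        std::vector<std::future<long long>> tasks;
        for (size_t begin = 0; begin < data.size(); begin += chunk) {
            const size_t end = std::min(begin + chunk, data.size());
            tasks.push_back(std::async(std::launch::async, process_chunk,
                                       std::cref(data), begin, end));
        }

        long long total = 0;
        for (auto& t : tasks) total += t.get();  // join before the frame ends
        return total;
    }
    ```

    The key design point is that the task count is chosen for the workload, not for the machine: the same code scales from a dual-core ARM tablet to an eight-core desktop because placement is left entirely to the scheduler.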

    Note that there are good reasons for the OS to control work scheduling rather than letting every app roll its own. Raw performance is of course important (and ThreadPool can deliver it), but in these days of laptops and tablet devices, power usage and scalability across different hardware are also crucial parts of a good user experience.

    A problem we see a lot with legacy games that manually divide their work over all available cores and set explicit thread affinities is what happens when that content runs on the latest and greatest hardware, with eight cores and better performance than the developer ever expected. Ideally this newer machine should be able to run the game on a single core without breaking a sweat, but because the developer told the OS not to do that, every core must instead be woken up every frame. Each might run for only a couple of milliseconds and then sleep for the remaining fourteen, but this prevents any core from ever going into full power collapse, which destroys battery life.

    Using ThreadPool, you just submit a bunch of work and let the OS decide where to run it. If all cores are necessary to achieve your performance goals, that's how it runs. If just a couple of cores are fast enough to keep up, the others can be powered down. If the user decides to dock a video chat app alongside your game (remember, all Metro apps can be docked next to others, so you can't count on always being in exclusive fullscreen mode!), the OS might need to run your game logic on N-1 cores to free some cycles for that other app.

    A static processor-query API cannot solve this sort of dynamic scheduling problem. Only the OS knows enough to do it right, which is why the Metro ThreadPool was designed the way it is.

    Hope that explanation makes sense of what you are seeing, and helps you understand how best to move your threaded code over to this new platform.

    Shawn Hargreaves - MSFT             Thursday, May 31, 2012 3:27 AM


    Monday, June 4, 2012 6:34 PM
    Moderator

All replies

  • Thanks, but I don't see how that helps me with any practical performance evaluation or results.
    Monday, June 4, 2012 5:06 PM
  • Thanks again, but I still don't know how my app is going to perform. Signal processing and graph rendering are split, so a dual core would be kept busy if necessary. Your post still doesn't tell me how my application will perform given a specific set of runtime parameters. I don't know how fast the CPU is in practice, or how I can optimize my code at a lower level (cutting memory bandwidth, manual register allocation, etc.).

    Is it planned to make devices available at some point before GA, or will there be none?

    Wednesday, June 6, 2012 7:00 PM
  • The information above is what is publicly available with regard to ARM.

    Wednesday, June 6, 2012 7:37 PM
    Moderator