VS 2010 Concurrency Runtime vs Others

  • Question

  • How does the VS2010 Concurrency Runtime compare with Intel TBB, OpenMP, and OpenCL? I want to know which one I should choose and why.
    Tuesday, November 2, 2010 4:18 AM

Answers

  • r00ky,

    The answer is (of course) "it depends". :-)

    I should note that I consider OpenCL a GPGPU technology, and I am not going to address that here.  PPL, TBB, OpenMP, STL, and Win32 thread pools and threading are what I discuss.

    When I present courses on PPL and TBB together with Intel, I cover this topic in some depth with discussion.  Anything I provide here would only be "a rule-of-thumb" and YMMV in practice.

    First I ask:

    • are you a C or a C++ programmer?
    • are you writing highly concurrent and/or parallel code today?

    The first question directs my advice.  C++ programmers are generally comfortable with templates.  C programmers, in my experience, are less comfortable with templates.  Both PPL and TBB heavily rely on templated code. 

    I am a C++ programmer, so I would prefer reaching for a well-tested and efficient template instead of rolling my own algorithm.  When writing C code, I am comfortable going directly to Win32 (or in my case, kernel routines) to manage my threads.  When writing user-mode C++ code, I prefer the productivity I get from using templates.  I love lambdas found in the new C++0x features implemented in VC++ 2010.  I know that some programmers I have spoken with seem to be uncomfortable with lambdas.  How do you feel about them?
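
A minimal sketch of the lambda-plus-template style described above, offered as an illustration only. The helper name `add_vectors` is hypothetical. On VC++ the loop is handed to PPL's `Concurrency::parallel_for` from `<ppl.h>`; on other compilers the sketch assumes a plain serial loop as a stand-in so that it still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>  // Parallel Patterns Library, shipped with VC++ 2010 and later
#endif

// Hypothetical helper: element-wise sum of two equal-length vectors.
std::vector<int> add_vectors(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> out(a.size());

    // The loop body is a lambda. Each iteration writes only its own slot,
    // so iterations are independent and need no locking.
    auto body = [&](std::size_t i) { out[i] = a[i] + b[i]; };

#ifdef _MSC_VER
    // PPL picks how to split the index range across the scheduler's threads.
    Concurrency::parallel_for(std::size_t(0), out.size(), body);
#else
    // Assumption: serial fallback where PPL is not available.
    for (std::size_t i = 0; i < out.size(); ++i) body(i);
#endif
    return out;
}
```

Calling `add_vectors({1, 2, 3}, {10, 20, 30})` yields `{11, 22, 33}` on either path; only the scheduling differs.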

    The second question above reflects my personal policy that boils down to "if it ain't broke, don't fix it..."  If you are successful using your tools today, a new technology has to provide enough benefit to justify the time you invest in mastering it.  If you use Win32 threading APIs today and you like them, keep doing what works.

    I should mention that the STL's thread library is being introduced in the C++0x standard.  It makes it easier to create and manage threads, but it does not provide any thread pooling or any way for me to really influence the scheduling of work.  Of course, it's not in VC++ 2010, so I won't go into its strengths and weaknesses today.
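
For readers with a compiler that already ships the C++0x/C++11 `<thread>` header, here is a small sketch of what "create and manage threads, but no pooling" looks like in practice. The helper name `threaded_sum` and the chunking scheme are assumptions made for illustration. Every call creates and joins its own threads; nothing is reused between calls, and the library offers no say in how the work is scheduled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `workers` explicitly created threads.
long long threaded_sum(const std::vector<int>& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);  // one result slot per thread
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + workers - 1) / workers;  // ceiling division

    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            // Clamp the range so trailing threads may receive an empty slice.
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();  // the caller owns each thread's lifetime

    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

On some toolchains this needs a threading flag at build time (for example `-pthread` with GCC).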

    Except for the simplest application, I usually find that I need a thread pool of some sort.  TBB, PPL, and the Win32 thread pool are the tools I reach for today.  PPL, or more correctly the Concurrency Runtime's scheduler, is simply an easy-to-use thread pool.  In another thread in this forum I demonstrated how to change your Win32 CreateThread call to use the scheduler's thread pool by changing only a few characters.

    The Win32 APIs are very rich, give you a lot of control, and there is significant power using IO Completion Ports for workloads that can take advantage of those APIs.

    OpenMP promises some level of cross-platform support.  We support OMP 2.x in VC2010, and there are other compilers supporting the evolution of that standard.  For highly structured parallel workloads that do flat, uniform work (aka simple math problems) within the parallel loop, OMP seems to work well.  OMP has a strength in that it's integrated into the compiler, allowing the potential of better optimization of its loops.  Further, OMP implementations appear to have very low overhead, which works well with small workloads.  OMP does not compose well with itself or with other threading libraries.  Intel's OMP implementation that is layered on top of VC2010's Concurrency Runtime improves its ability to compose, but you are still unable to implement a simple parallel sort using recursion with OMP.  I know that a number of my coworkers would recommend against using OMP as a solution, but I temper that reaction by advising you to fully understand both the weaknesses of OMP and its strengths.  My guess is that any real application will need more than OMP to express both parallel and asynchronous workloads.  Since I cannot create all this within OMP, I will be blending it with another technology -- one that likely will not compose with my OMP loops.
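
As a rough illustration of the "flat, uniform" loop shape that suits OMP, here is a sketch of a dot product. The helper name `dot` is hypothetical. The pragma takes effect only when OpenMP is enabled at build time (for example `/openmp` in VC++ or `-fopenmp` in GCC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: dot product of two equal-length vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    double sum = 0.0;

    // Every iteration does the same small amount of work, so a static split
    // across threads balances well. OpenMP 2.x requires a signed integer
    // loop index, hence `int` rather than `size_t`.
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The `reduction` clause gives each thread a private copy of `sum` and combines the copies at the end, which avoids a data race on the accumulator. With floating-point data the combination order can differ between runs, so results may vary in the last bits.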

    So that leaves Intel's TBB and PPL/Agents offered by us in VC2010.  TBB and PPL are cross-platform.  PPL gains its cross-platform capabilities through Intel's compatibility work.  If you stick with PPL,  you can recompile and run your application on VC++ or with TBB on other platforms. 

    TBB is now at V3 and is richer and more mature than PPL -- which is both its strength and weakness.  It has grown organically, and the TBB architect, when I have spoken with him, laments some of the features added years ago that seemed like a good idea at the time.  Today, our PPL effort tries to learn from the mistakes and successes of TBB and strike a balance between simplicity and control.  Of course, we haven't had the time to implement as much as the TBB team has created.

    The Intel and Microsoft runtimes use different schedulers.  Intel recommends against doing any blocking work in their tasks; their scheduler does not cooperatively manage blocking work.  When blocked, an Intel task blocks the hardware thread and your application no longer scales as well as it could.  Microsoft's scheduler, in contrast, supports cooperative blocking: when a task blocks (using Concurrency:: synchronization primitives or UMS threads), another task is scheduled on that hardware thread.  This may mean creating more threads than there are processors.  This is the fundamental difference between the two approaches.

    The VS2010 debug and performance tools know about PPL tasks and threads.  There is integration with the Parallel Tasks and Parallel Stacks windows when looking at PPL tasks in the debugger, and the concurrency viewer provides visual cues for PPL loops and blocking operations.

    Lastly, the only runtime that provides primitives to support dataflow models is found in VC++ 2010, with the Asynchronous Agents Library built on our schedulers.  We hope that the mix of message passing and the actor model supported by the agents library, together with the structured parallelism supported by PPL, provides enough flexibility to cover most workloads.  (If you find that there are patterns you would like to see supported better or more completely in our next release, please provide suggestions here, through Connect, or send me email directly.)

    So, I cannot give you one clear recommendation.  Your mileage *will* vary (as opposed to YMMV :-) ) depending on your workload.  I hope this provides you enough information to make the best decision for your application.  I will mark this as an answer; please feel free to unmark it and ask clarifying questions.

    Thank you for your interest,

    Dana Groff, ConcRT PM


    Dana Groff, Senior Program Manager Parallel Computing, Concurrency Runtime
    • Marked as answer by Dana Groff Wednesday, November 3, 2010 6:14 PM
    Wednesday, November 3, 2010 6:14 PM

All replies

  • Hi Dana Groff, Thank you for the quick reply. I am a C and C++ programmer...and C# too :-) My job is to evaluate various multicore technologies. I included OpenCL because the OpenCL runtime (at least the ATI Stream SDK) also targets x86 multicore. Nevertheless, you have answered my question.
    Friday, November 12, 2010 9:39 AM
  • Hi Dana Groff,

    I have one more general question. How does the Concurrency Runtime (ConcRT) compare with the .NET Task Parallel Library? I know ConcRT targets unmanaged code while TPL is for the managed environment. Is PPL for C++ equivalent to TPL for .NET?

     

    Thank you,

    r00ky

    Friday, November 12, 2010 2:17 PM
  •  PPL + ConcRT = TPL ? just a guess ...
    Tuesday, November 23, 2010 1:21 AM