Task Groups and Logical Processors

  • Question

  • Hi,

    I have been having a first look at the Concurrency Runtime via the ConcRT samples.

    Based on the tests I made (see output and code below), I'm concerned that the tasks in the task group do not remain on the logical processor on which they were initially started.

    I'm guessing that manually setting their thread affinity at startup would cure this, though I would ask whether this would in any way conflict with the task group runtime.

    Would it not be more performant to constrain an individual task to run on the same logical core for the duration of its lifetime, or at least constrain it to cores that share an on-die cache where possible? Without such a mechanism I cannot see much benefit in using the task group runtime over managing threads manually.
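
    By "manually setting their thread affinity" I mean something along these lines, done per worker thread. This is a minimal sketch on a thread I create myself; the choice of logical processor 1 is an arbitrary example:

    [code]
    // Minimal sketch: restrict a worker thread I own to one logical core
    // for its whole lifetime. Core 1 is an arbitrary example.
    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI Worker(LPVOID)
    {
        // Everything this thread does now stays on the chosen core.
        printf_s("worker on logical processor # %d\n", GetCurrentProcessorNumber());
        return 0;
    }

    int main()
    {
        HANDLE h = CreateThread(NULL, 0, Worker, NULL, CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(h, 1 << 1);   // allow only logical processor 1
        SetThreadIdealProcessor(h, 1);      // hint the OS scheduler as well
        ResumeThread(h);
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
        return 0;
    }
    [/code]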

    This test was run on a Core2Quad (which has two separate L2 caches, each shared between two cores) under XP x64 Pro, using the default full install of Visual C++ 2010 Express Beta. It was compiled for the Win32 target, as unfortunately I have not yet been able to compile for x64.

    Will the Visual C++ 2010 Express Beta work with the current 7.0 Windows SDK instead of the 7.0A SDK that is supplied with it? Although testing ConcRT would require an x64 version of 7.0A, correct? Is an x64 version of the 7.0A SDK available anywhere to download?

    Other bug feedback: There were a few errors in the code:
    1. The string arg in the instantiation of the philosophers needed typecasting to sys:string.
    2. The "done" member in the RT agent code had an additional enum argument that needed to be removed.
    3. The agent.exe program does not end once all tasks are complete; the final "return 0" is reached but the program hangs until Ctrl+C. No idea why.


    Here is the test I made of processor use.
    The code is modified from the ConcRT sample and gives the following output:

    [output]
    C:\VS10\PRJ\ConcRTSamplePack\Debug>event
    Cooperative Event

            Setting the event
            Task 8 has received the event on logical processor # 1
            Task 8 ran on logical processors # 1, 3, 1
            Task 5 has received the event on logical processor # 0
            Task 5 ran on logical processors # 0, 1, 0
            Task 3 has received the event on logical processor # 3
            Task 3 ran on logical processors # 2, 2, 3
            Task 2 has received the event on logical processor # 0
            Task 2 ran on logical processors # 3, 0, 0
            Task 1 has received the event on logical processor # 3
            Task 1 ran on logical processors # 0, 3, 3
            Task 4 has received the event on logical processor # 0
            Task 4 ran on logical processors # 1, 1, 0
            Task 6 has received the event on logical processor # 1
            Task 6 ran on logical processors # 3, 3, 1
            Task 7 has received the event on logical processor # 0
            Task 7 ran on logical processors # 3, 2, 0

    WaitEnded tg completed
    Windows Event

            Setting the event
            Task 2 has received the event on logical processor # 1
            Task 2 ran on logical processors # 0, 2, 1
            Task 1 has received the event on logical processor # 2
            Task 1 ran on logical processors # 3, 3, 2
            Task 3 has received the event on logical processor # 0
            Task 3 ran on logical processors # 2, 0, 0
            Task 4 has received the event on logical processor # 1
            Task 4 ran on logical processors # 2, 2, 1
            Task 5 has received the event on logical processor # 3
            Task 5 ran on logical processors # 1, 3, 3
            Task 6 has received the event on logical processor # 2
            Task 6 ran on logical processors # 3, 2, 2
            Task 8 has received the event on logical processor # 3
            Task 8 ran on logical processors # 3, 3, 3
            Task 7 has received the event on logical processor # 2
            Task 7 ran on logical processors # 2, 2, 2

    WaitEnded tg completed
    Events Done
    ^C
    C:\VS10\PRJ\ConcRTSamplePack\Debug>
    [/output]


    [code]
    // event.cpp : Defines the entry point for the console application.
    //
    // compile with: /EHsc
    #include <windows.h>
    #include <stdio.h>      // printf_s
    #include <concrt.h>
    #include <concrtrm.h>
    #include <ppl.h>

    using namespace Concurrency;
    using namespace std;

    class WindowsEvent
    {
        HANDLE m_event;
    public:
        WindowsEvent()
            :m_event(CreateEvent(NULL,TRUE,FALSE,TEXT("WindowsEvent")))
        {
        }

        ~WindowsEvent()
        {
            CloseHandle(m_event);
        }

        void set()
        {
            SetEvent(m_event);
        }

        void wait(unsigned int count = INFINITE)   // unsigned: INFINITE is 0xFFFFFFFF
        {
            WaitForSingleObject(m_event,count);
        }
    };

    template<class EventClass>
    void DemoEvent()
    {
        EventClass e;
        volatile long taskCtr = 0;

        //create a task group and schedule multiple copies of the task
        task_group tg;
        for(int i = 1;i <= 8; ++i)
            tg.run([&e,&taskCtr]{

                //increment our task counter
                long taskId = InterlockedIncrement(&taskCtr);

                DWORD pn[3];
                pn[0]=GetCurrentProcessorNumber();
    //          printf_s("\tTask %d before sleep on logical processor # %d\n", taskId, pn[0]);

                //Simulate some work
                Sleep(100);

                pn[1]=GetCurrentProcessorNumber();
    //          printf_s("\tTask %d waiting for the event on logical processor # %d\n", taskId, pn[1]);

                e.wait();

                pn[2]=GetCurrentProcessorNumber();
                printf_s("\tTask %d has received the event on logical processor # %d\n", taskId, pn[2]);
                printf_s("\tTask %d ran on logical processors # %d, %d, %d\n", taskId, pn[0],pn[1],pn[2]);

        });

        //pause noticeably before setting the event
        Sleep(1500);

        printf_s("\n\tSetting the event\n");

        //set the event
        e.set();

        //wait for the tasks
        tg.wait();

    //    e.~EventClass();  //tried to kill it manually here JIC but prog
                           // still hangs after all done...ctrl+c is my friend?

        printf_s("\nWaitEnded tg completed\n");
    }

    int main ()
    {
        //Create a scheduler that uses two and only two threads.
        CurrentScheduler::Create(SchedulerPolicy(2, MinConcurrency, 2, MaxConcurrency, 2));

        //When the cooperative event is used, all tasks will be started
        printf_s("Cooperative Event\n");
        DemoEvent<event>();

        //When a Windows Event is used, unless this is being run on Win7 x64
        //ConcRT isn't aware of the blocking so only the first 2 tasks will be started.
        printf_s("Windows Event\n");
        DemoEvent<WindowsEvent>();

        printf_s("Events Done\n");
        return 0;
    }

    [/code]

    [edit]: I have also tried varying the SchedulerPolicy to use 4 min / 4 max, 4 min / 8 max, and 8 min / 8 max threads.
    This made no difference to the outcome of task vs. processor allocation.

    • Edited by ElCroc Tuesday, November 10, 2009 9:18 AM added info at end
    Tuesday, November 10, 2009 8:48 AM

All replies

  • I have also tried changing the min/max threads in the SchedulerPolicy to 4/4, 4/8, and 8/8, with no difference in the task vs. processor allocation.
    e.g. CurrentScheduler::Create(SchedulerPolicy(2, MinConcurrency, 4, MaxConcurrency, 8)); etc...
    Tuesday, November 10, 2009 9:20 AM
  • First off, thank you for reporting errors in the sample pack and the hang you are seeing. We are looking into them.

    ConcRT does some things to help with locality, but not exactly what you are expecting. The runtime does not affinitize individual ConcRT threads to a single processor each; instead, it affinitizes them to a processor package or NUMA node. Since a process does not have exclusive access to cores (the operating system and other processes share cores with your process), you have to weigh the benefit of running a task on the same core after it was switched out and back in against the opportunity cost of not executing that task on another core that happens to be idle. In addition, node affinitization helps with power management scenarios. On XP we don't get topology information from the OS, but on Vista and higher, on a machine with 2 processor packages and 2 cores per package, a scheduler with MinConcurrency 2 and MaxConcurrency 2 will only use one package for executing tasks.
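
    As an aside, you can see what topology information the OS reports through the plain Win32 NUMA APIs. A minimal sketch, nothing ConcRT-specific (on a non-NUMA machine such as a single-package Core2Quad this will typically report a single node):

    [code]
    // Minimal sketch: dump the NUMA node -> processor mask mapping that
    // the OS reports.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        ULONG highestNode = 0;
        if (!GetNumaHighestNodeNumber(&highestNode))
            return 1;

        for (ULONG node = 0; node <= highestNode; ++node)
        {
            ULONGLONG mask = 0;
            if (GetNumaNodeProcessorMask((UCHAR)node, &mask))
                printf_s("node %lu: processor mask 0x%llx\n", node, mask);
        }
        return 0;
    }
    [/code]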

    Explicitly affinitizing the threads your tasks are running on will not be effective for long: the runtime will reaffinitize them when it reschedules threads.

    • Proposed as answer by Genevieve M Friday, November 13, 2009 2:28 AM
    • Marked as answer by rickmolloy Friday, November 27, 2009 6:07 PM
    Friday, November 13, 2009 2:28 AM
  • Many thanks for your reply. 

    I can now see that the Concurrency Runtime will be very useful in more 'generic' scenarios, especially when there are several physical processor packages / nodes in the target system and the worker threads are not allocated to a single long-lived job. ConcRT will scale nicely as commonly available hardware catches up with it.

    For now, when more detailed control is desirable in a known or discoverable system topology, for applications such as game development (where the number of threads may be a well-known quantity), it would seem that doing things the old way by managing the threads directly might yield better results, though as you point out, other processes might interfere with this rosy picture. I realise that the Core2Quad with its split L2 cache is a 'special case', but one that is currently quite common, so I believe that finer-grained control of the thread-to-core relationship will still pay off on recent generations of hardware until larger-scale parallel systems become the mainstream target.

    Re: your statement that "on a machine with 2 processor packages and 2 cores per package, a scheduler with MinConcurrency 2 and MaxConcurrency 2 will only use one package for executing tasks" ... I hopefully assume that the programmer's choice of NUMA affinity is not also overridden by the runtime when it reschedules?

    Sunday, November 22, 2009 1:26 PM
  • When you say "I hopefully assume that the programmer's choice of 'Numa affinity' is not also overridden by the runtime when it reschedules"... I'm not so sure I understand specifically what you mean by the "programmer's choice".

    The runtime's resource manager picks the specific set of cores that a given scheduler will utilize. As Genevieve mentioned, it does this with an understanding of the NUMA topology of the system, even when multiple schedulers are present. Even if a single scheduler spans multiple physical processor packages / NUMA nodes, there is a bias towards keeping threads managed by the scheduler on the node they last ran on. At present, there is no way for a user of the runtime to associate given threads/tasks with specific NUMA affinities.

    If by "programmer's choice" here you are still referring to explicit affinitization, I would strongly caution against this. The runtime will reaffinitize threads should it decide to move them away from where it thought they were executing. I'll also mention that while explicit affinitization may give the desired effect for a short while (until the next affinitization within the runtime), it will not have the expected effect on a UMS scheduler (Windows 7 x64).

    Monday, November 23, 2009 11:21 PM
  • Again, thank you for your reply.

    To explain myself more clearly: by "programmer's choice" I mean the use of SetProcessAffinityMask and/or SetThreadAffinityMask to restrict certain parts of a program, or a set of co-operating processes, to specific packages / nodes, combined with the SetThreadIdealProcessor API to indicate a further preference to the OS scheduler, so as to influence the distribution of work in a repeatable way in pursuit of an optimal workload distribution for specific tasks. By "programmer's choice" of NUMA affinity I mean the use of those same functions to influence the specific core or cores to run upon, as mapped to NUMA nodes and discoverable via the GetNumaProcessorNode and/or GetNumaNodeProcessorMask APIs.
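
    Concretely, the pattern I have in mind is roughly the following. This is only a rough sketch; the choice of node 0 is arbitrary, and in real code the node and thread would come from topology discovery and the application's own thread management:

    [code]
    // Rough sketch of the "programmer's choice": discover the processors of
    // one NUMA node, then restrict and hint the calling thread accordingly.
    // Node 0 is an arbitrary example.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        ULONGLONG nodeMask = 0;
        if (!GetNumaNodeProcessorMask(0, &nodeMask) || nodeMask == 0)
            return 1;

        // Restrict this thread to the cores of node 0...
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

        // ...and hint a preferred core within the node (lowest set bit).
        DWORD ideal = 0;
        for (ULONGLONG m = nodeMask; (m & 1) == 0; m >>= 1)
            ++ideal;
        SetThreadIdealProcessor(GetCurrentThread(), ideal);

        printf_s("now on logical processor # %d\n", GetCurrentProcessorNumber());
        return 0;
    }
    [/code]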

    If I understand the responses correctly, any and all uses of these API functions to influence this affinitization will always be overridden by the runtime scheduler? In addition, the UmsThreadAffinity value of the UMS_THREAD_INFO_CLASS parameter to the new SetUmsThreadInformation API will also not be used/honoured? In effect, that renders the bulk of the new NUMA APIs pretty pointless when used in conjunction with ConcRT? Now I'm confused. Am I really understanding your answers correctly?

    re: "while explicit affinitization may give the desired effect for a short while (until the next affinitization within the runtime), it will not have the expected effect on a UMS scheduler (Windows 7 x64)"
    ... can you please explain what other differences are lurking in the UMS scheduler that devs need to be aware of in the context of this discussion?

    edit: Besides the control-freak part of me that likes to decide what is running where, for most devs the key part of this discussion will be the repeatability of performance and of test conditions / results, without which the opportunities for comparative testing and tuning are significantly reduced.

    • Edited by ElCroc Tuesday, November 24, 2009 9:08 AM noted
    Tuesday, November 24, 2009 9:02 AM
  • You are always free to use the affinity / NUMA APIs on threads that you own/create in order to set their affinity / ideal processor / etc. The runtime will not override affinity on such external threads unless you specifically give control of them to the resource manager. Threads that are owned by the runtime and created internally by the resource manager/scheduler to run tasks/agents/etc. are a completely different story: the runtime controls where those threads run and the mapping of tasks to those threads. Explicitly using affinity / priority / NUMA APIs on those threads is a recipe for conflict, as the runtime will move threads around and will reaffinitize them when it deems appropriate.
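
    For example, something like the following is fine, because the affinitized thread is one you created yourself rather than one of the runtime's internal workers. A minimal sketch; pinning to logical processor 0 is an arbitrary choice:

    [code]
    // Minimal sketch: an externally created, explicitly affinitized thread
    // coexisting with ConcRT tasks. The runtime leaves this thread alone
    // because it does not own it. Pinning to core 0 is arbitrary.
    #include <windows.h>
    #include <stdio.h>
    #include <ppl.h>

    static DWORD WINAPI MyOwnWorker(LPVOID)
    {
        printf_s("external thread on logical processor # %d\n",
                 GetCurrentProcessorNumber());
        return 0;
    }

    int main()
    {
        // A thread I own: the runtime will not touch its affinity.
        HANDLE h = CreateThread(NULL, 0, MyOwnWorker, NULL, CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(h, 1);   // pin to logical processor 0
        ResumeThread(h);

        // Runtime-owned threads run this task; where it runs is up to ConcRT.
        Concurrency::task_group tg;
        tg.run_and_wait([] {
            printf_s("ConcRT task on logical processor # %d\n",
                     GetCurrentProcessorNumber());
        });

        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
        return 0;
    }
    [/code]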

    Regarding UMS: there are always two threads involved in running a UMS thread: the primary, which runs it, and the user thread itself. Usually the primary performs a user-mode context switch to the user-mode portion of a UMS thread, which means that the user-mode portion usually runs with the priority/affinity of the primary thread. Explicitly trying to set the affinity or priority of a UMS thread will, at present, not set the affinity/priority of the primary thread, meaning that you will not get the result you expect by calling these APIs. The primary threads are under the control of the runtime, and you have no way to access them from client code running atop the user thread. Since code written to the runtime can run atop both thread schedulers and UMS schedulers (unless you explicitly ask for threads only), this is another reason why trying to control the runtime's threads isn't a good idea.

    That said, better programmatic control of affinity and better NUMA awareness in the runtime API set are things we are actively investigating.

    Tuesday, November 24, 2009 7:24 PM
  • Many thanks for your clarifications.
    Mixing external self-made threads with other worker threads under the ConcRT runtime will make for some interesting decisions early in the design process. An ability to have the runtime designate specific threads as pre-affinitized (even if only for testing purposes) might be a cleaner solution, if it's possible; at this point I'm not sure whether that would even make sense in the runtime itself.
    Glad I asked about this, and I'm very appreciative of your clear answers. Off to think about possible advantages of designs combining the thread types now.
    Tuesday, November 24, 2009 8:44 PM