How to get the index of virtual processor or the context in PPL?


  • Thursday, January 26, 2012 06:53
     
     

    Hello,

     

    We are using PPL in our project to parallelize some image processing.

    After starting parallel threads, how can we find the index of the virtual processor or of the context from within the currently executing thread?

    We need a 0-based index, not the ID. For example, if the scheduler policy allows 4 threads to run concurrently, then the index can only be 0, 1, 2, or 3 (no less, no more).

    Is it safe to use Context::GetVirtualProcessorId() for that purpose?

    The MSDN documentation says:

    "This value may be stale the moment it is returned and cannot be relied upon. Typically, this method is used for debugging or tracing purposes only."


    Context::GetId() does not seem useful either, since it returns the ID, not the index.

     

    It would be nice to have a clarification on this issue.

     

    Thank You,

    Armen Anoyan.




All replies

  • Monday, January 30, 2012 21:18
     
     Proposed answer

    There is no single "index" quantity from the runtime that you can safely use as you describe. Unless you are changing the policy of the scheduler which is created, virtual processor objects can come and go due to dynamic resource management. As well, contexts can come and go as blocking APIs are called.

    Even if you change the policy of the scheduler being created to utilize a fixed concurrency level (MinConcurrency = MaxConcurrency = N), there are several things which might not follow your expectations:

    • The runtime makes no guarantee about the values returned for virtual processor or context IDs, except that they are unique between different objects. Creating or destroying schedulers, blocking and unblocking, and various other operations can affect them.
    • Someone in your call chain might call Context::Oversubscribe(true). Such an explicit request to oversubscribe can add a virtual processor beyond the N you configured. Can you guarantee that no method you call will do this (are you in control of the entire call chain)?

    Certainly if you are in control of the scheduler policy and everything that happens for the duration of the time you need this index, you can derive one from something returned from the runtime.

    Before going that direction, I might, however, ask, "Why do you need such an index?" There are a number of other constructs which might work just as well:

    • Can you use a combinable? You can do many reduction style operations with this.
    • Can your algorithm make use of a parallel_for loop? If you require a different partitioning strategy than normally provided, will a fixed partitioning work (via, for instance, the sample pack implementation)?
  • Tuesday, February 14, 2012 09:13
     
      Has code

    Let me describe our issue in more detail.
    We use a parallel_for construct in our project to speed up processing.
    Inside parallel_for we call a function exported from our image processing library (DLL).
    That function uses internal buffers for image processing: 
    an array of N buffers, where N is the maximal number of concurrent threads.
    From within parallel_for we need to provide a thread index so the information in internal buffers will not be overwritten by concurrent threads.
    Something like this:

    parallel_for(0, nSize, [&](int i)
    {
    	...
    	int nThreadIndex = Context::CurrentContext()->GetVirtualProcessorId() - 1;
    	ImageProc(nThreadIndex, ...);
    	...
    });

    Currently we are using the GetVirtualProcessorId function to get the index, but I'm not sure whether it is safe to use.

    In our case we can guarantee that no method will call Context::Oversubscribe inside ImageProc or in our parallel_for loop. Moreover, the DLL that exports the ImageProc function does not use PPL at all.
    It is simply capable of parallel image processing, for which we only need to provide "nThreadIndex".

    > Certainly if you are in control of the scheduler policy and everything that
    > happens for the duration of the time you need this
    > index, you can derive one from something returned from the runtime.


    Could you please give a sample showing how to derive a thread index that will identify the thread?
    In this context, "identify the thread" means that the ImageProc function will never be called
    with the same nThreadIndex at the same time, and that nThreadIndex will never exceed N-1, where N is the maximal number of concurrent threads.


    Thank you for explanation,
    Armen Anoyan.



    • Edited by Armen Anoyan Friday, February 17, 2012 05:26
  • Tuesday, March 13, 2012 06:50
    Owner
     
     

    Hi Armen,

    Unfortunately, the API wasn't designed to give the guarantees you are looking for. If your question is "is it safe?", the answer is no.

    Your code above is taking advantage of some implementation details. I can elaborate further but do note that this may change in the future in subtle ways.

    concurrency::Context::CurrentContext()->GetVirtualProcessorId()

    returns -1 if called on a non-ConcRT thread, e.g. the main thread. It returns a positive, 1-based number for virtual processors if called on a ConcRT thread. There are times when the main thread participates in the parallel_for loop; this is due to what we call inlining. So on a 4-core machine, you may have 5 threads participating (main thread + 4 ConcRT threads).

    So for your buffers, you would have to do a +1 for main thread, and let it be for other threads. There are ways to prevent the main thread from participating as well (for example if you put the entire execution inside CurrentScheduler::ScheduleTask).
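
    As a rough illustration of the mapping just described, and nothing more than that: the sketch below assumes the ID has been stored in an int (as in the loop posted earlier) and that the values really are -1 on the main thread and 1..N on ConcRT threads, which is an implementation detail rather than a guarantee. The names buffer_slot and checked_buffer_slot are made up for this example. Note that N virtual processors then need N + 1 buffers.

```cpp
#include <cassert>

// Hypothetical helper, not part of ConcRT. vprocId is the value obtained from
// Context::CurrentContext()->GetVirtualProcessorId(), stored in an int;
// vprocCount is the fixed number of virtual processors (N).
// Assumed, undocumented behavior: -1 on a non-ConcRT thread, 1..N otherwise.
// Returns a buffer slot in [0, N], or -1 if the ID is outside that range.
int buffer_slot(int vprocId, int vprocCount) {
    if (vprocId == -1) {
        return 0;  // main thread: "-1 + 1"
    }
    if (vprocId >= 1 && vprocId <= vprocCount) {
        return vprocId;  // ConcRT threads keep their 1-based ID
    }
    return -1;  // the assumption no longer holds
}

// Same mapping, but fails loudly in debug builds when the assumption breaks.
int checked_buffer_slot(int vprocId, int vprocCount) {
    const int slot = buffer_slot(vprocId, vprocCount);
    assert(slot >= 0 && slot <= vprocCount);
    return slot;
}
```

    The assert is there because any of the caveats listed in this reply can push the ID outside the expected range.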

    Caveats and assumptions for the explanation above:

    - Behavior can change in the future.
    - No other explicit thread creation.
    - No ConcRT-aware blocking calls (e.g. mutex, .wait(), ConcRT events), which cause extra virtual processors to be created.
    - Not being called from something that nests this, i.e. the vproc count must not already be raised by blocking functions called before this algorithm.

    Please do sprinkle your code with plenty of asserts so that you catch deviations quickly.

    As you can see, the caveats are plentiful to get the desired behavior. So we would not recommend using this.

    PS: I do believe that OpenMP provides this ability, though, and is worth a look. See http://msdn.microsoft.com/en-us/library/k1h4zbed.aspx for omp_get_num_threads and omp_get_thread_num.


    Rahul V. Patil

  • Tuesday, March 13, 2012 21:32
     
     

    Hi Armen,

    From your description, "we need to provide a thread index so the information in internal buffers will not be overwritten by concurrent threads", are you looking for per-thread storage?

    If that is the case, the combinable class may help you have per-thread buffers: http://msdn.microsoft.com/en-us/library/dd492850.aspx. Please be aware that the runtime owns the threads and, as Rahul and Bill made clear, you may not assume their life span or count.

    Regards,

    Ameen


    mameen

  • Wednesday, March 14, 2012 07:28
     
      Has code

    How big is nSize?  If it's not too big (say < 1,000 or even 10,000), I would try simplifying my life and just create 1 buffer per loop iteration --- i.e. think Task-local storage, not thread-local storage.  In other words:

    allocate buffer array[nSize];

    parallel_for(0, nSize, [&](int i) { ImageProc(i, ...); });

    Another alternative is to create the tasks yourself and then you can specify the index any way you want:

    const int N = (int)concurrency::GetProcessorCount();  // one task per core

    // allocate buffer array[N];

    concurrency::task_group tasks;
    for (int i = 0; i < N; ++i)
        tasks.run([i]() { ImageProc(i, ...); });  // task i owns buffer i

    tasks.wait();  // returns once all N tasks have finished
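
    Creating the tasks yourself also means dividing the nSize iterations among them. Here is a minimal sketch of one way to do that split, using contiguous blocks in the spirit of a static schedule. Range and chunk_range are hypothetical names introduced only for this example, and the code is plain C++ with no PPL dependency.

```cpp
#include <cassert>
#include <cstdint>

// Half-open iteration range [begin, end).
struct Range {
    std::int64_t begin;
    std::int64_t end;
};

// Hypothetical helper: returns the block of iterations that task k
// (0 <= k < nTasks) should process out of nSize iterations in total.
// Block sizes differ by at most one; the first (nSize % nTasks) blocks
// receive the extra iteration.
Range chunk_range(std::int64_t nSize, int nTasks, int k) {
    assert(nSize >= 0 && nTasks > 0 && k >= 0 && k < nTasks);
    const std::int64_t base = nSize / nTasks;
    const std::int64_t extra = nSize % nTasks;
    const std::int64_t begin = k * base + (k < extra ? k : extra);
    const std::int64_t end = begin + base + (k < extra ? 1 : 0);
    return {begin, end};
}

// Inside task k, the loop body would then look roughly like:
//   const Range r = chunk_range(nSize, N, k);
//   for (std::int64_t i = r.begin; i < r.end; ++i) { /* ImageProc(k, ...) */ }
```

    Because task k is the only one that ever passes index k, the buffers cannot collide, no matter which thread the runtime picks to run the task. The price is that the load balancing parallel_for would otherwise provide is lost.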

    Cheers,

      - joe

  • Friday, March 16, 2012 09:09
     
     
    Hi Rahul,

    Thank you for the detailed explanation.
    In the example above I simplified our actual code, which is why it may not be very clear.

    "So for your buffers, you would have to do a +1 for main thread, and let it be for other threads. There are ways to prevent the main thread from participating as well (for example if you put the entire execution inside CurrentScheduler::ScheduleTask)."

    Previously, with the default scheduler, we used +1 for the main thread (as you described).
    Now we are using CurrentScheduler::ScheduleTask with a scheduler limited to GetProcessorCount() threads,
    and we use GetVirtualProcessorId() - 1 to get the index of the thread.
    For the moment it seems to work, but I understand that it is not safe, which is why I started this thread.

    "PS: I do believe that openmp does provide with this ability though and is worth a look. See http://msdn.microsoft.com/en-us/library/k1h4zbed.aspx  omp_get_num_threads and omp_get_thread_num"

    This is exactly what we had in the previous version of our code.
    We used OpenMP, and omp_get_thread_num() does this job perfectly!
    But we switched to PPL because it is impossible to link the OpenMP library statically in Visual Studio. We cannot ship a separate library (the OpenMP DLL) with our application; everything must be statically linked. So what we need is a function like omp_get_thread_num() in PPL. That's what I'm looking for!


    Hi Ameen,

    "If that is the case, the combinable class may help you have per-thread buffers: http://msdn.microsoft.com/en-us/library/dd492850.aspx. Please be aware that the runtime owns the threads and, as Rahul and Bill made clear, you may not assume their life span or count."

    We have already considered using the combinable class, but I don't think it is a good idea to supply buffers to an independent library (ImageProc is exported from another library) for its internal use. The ImageProc function is quite complicated and uses several internal buffers per thread, and only the library knows the size of those buffers at any given time.


    Hi Joe,

    "How big is nSize?  If it's not too big..."

    It is huge (can be more than 140,000,000).

    "Another alternative is to create the tasks yourself and then you can specify the index any way you want:"

    This is possible, but in that case we would need to restructure our code completely.
    Not only would we need to create the tasks, we would also have to redistribute the work for the nSize iterations among the threads.
    All of this is handled by parallel_for in PPL.


    I think that the ability to enumerate the threads is a useful feature, which OpenMP provides via the omp_get_thread_num() function.
    I understand that the Concurrency Runtime is more flexible and more complicated than OpenMP, and it may use the CPU better than OpenMP does.
    But what about simple cases?
    I mean cases where an application doesn't need advanced scheduling, and predictability at all times matters more than loading the CPU as much as possible.
    For example, simply having a fixed set of N threads (N being the number of cores) and
    getting a fixed, unique index for each thread, which always handles its own interval of the iterations (like static scheduling in OpenMP).

    P.S.
    We also thought about deriving an index from the Context ID somehow.
    One method is to keep a shared structure (like std::set in C++) and
    take a lock when adding a context ID to the structure on its first call.
    The other is to use InterlockedIncrement to associate each ContextID with a sequential ThreadIndex.
    But none of these methods is very good. I'm looking for a better solution.
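
    For concreteness, the first of these ideas might look roughly like the sketch below. It is an illustration under assumptions, not code from our project: IdIndexMap and index_for are invented names, a std::map stands in for the std::set mentioned above so that each ID can carry its index, and the ID is assumed to be whatever unsigned value Context::GetId() yields.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical helper: assigns indices 0, 1, 2, ... to IDs in the order in
// which they are first seen, up to a fixed capacity (the number of buffers).
class IdIndexMap {
public:
    explicit IdIndexMap(int capacity) : capacity_(capacity) {}

    // Returns the index already assigned to id, or assigns the next free one.
    // Returns -1 once `capacity` distinct IDs have been seen.
    int index_for(unsigned int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto it = indices_.find(id);
        if (it != indices_.end()) {
            return it->second;
        }
        const int next = static_cast<int>(indices_.size());
        if (next >= capacity_) {
            return -1;  // more distinct IDs than buffers
        }
        indices_.emplace(id, next);
        return next;
    }

private:
    std::mutex mutex_;
    std::map<unsigned int, int> indices_;
    int capacity_;
};
```

    The -1 case is the weak point: indices of contexts that have gone away are never reclaimed, so a newly created context can find the map full.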


    Thank you all for the answers,
    Armen Anoyan.



  • Saturday, March 17, 2012 07:21
     
      Has code

    Hi Armen --- I'm struck by how easy this should be, and how difficult it's becoming.  All you need is a unique "index" for each thread:

    parallel_for(0, nSize, [&](int i)
    {
    	...
    	int nThreadIndex = Context::CurrentContext()->GetVirtualProcessorId() - 1;
    	ImageProc(nThreadIndex, ...);
    	...
    });

    We could just call down into Windows and get the thread id, and then map that to a unique index 0, 1, 2, ... .  But what if the thread id changes?  There's no guarantee that thread X won't disappear and new thread Y will appear in the thread pool to run the next task.  And if this happens, you won't have enough buffers and the new index will be out of range.

    And then it dawned on me that part of the problem is that your code is designed in terms of threads (which fits well with OpenMP and the omp_get_thread_num() API function), but the PPL thinks in terms of tasks (and hides threads as an execution detail).  So there's a disconnect here between the design of the code and the design of PPL.  Neither is wrong, but there's a mismatch here --- and someone has to give :-)

    You could make a Windows API call and get the virtual processor ID the task is running on.  Or how about this:  declare a thread-local variable initialized to -1, and a shared counter initialized to 0.  Then have each task check its thread-local variable, and if it's -1, grab a unique index from the counter (and increment the counter).  Like this:

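    __declspec(thread) static int tls_i = -1;  // per-thread index; -1 means "not assigned yet"
    volatile LONG index = 0;                   // shared counter; InterlockedIncrement expects a LONG
    parallel_for(...
      {
        if (tls_i == -1)
          tls_i = InterlockedIncrement(&index) - 1;  // returns the incremented value, so -1 makes it 0-based
        .
        .
        .
      });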

    This code makes me nervous :-)  But again, I think this suffers from the flaw that it's hard to predict how many buffers to create beforehand, because PPL could inject more threads into the pool to handle the # of tasks.
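
    One way to sidestep the unknown thread count, offered as a rough sketch rather than a tested PPL solution: stop tying the index to a thread at all, and instead lease an index from a fixed pool of N for the duration of each ImageProc call. The code below is plain standard C++ (std::thread stands in for the PPL workers so that it compiles on its own), and IndexPool and pool_is_exclusive are names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper, not a PPL type: hands out indices 0..n-1 so that no
// index is ever held by two callers at the same time.
class IndexPool {
public:
    explicit IndexPool(int n) {
        for (int i = n - 1; i >= 0; --i) free_.push_back(i);
    }

    // Blocks until an index is free, then removes it from the pool.
    int acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !free_.empty(); });
        const int index = free_.back();
        free_.pop_back();
        return index;
    }

    // Returns an index to the pool and wakes one waiting caller.
    void release(int index) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(index);
        }
        available_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::vector<int> free_;
};

// Self-check: `threads` workers share `slots` indices. Returns true if every
// index stayed in range and no index was held by two workers at once.
bool pool_is_exclusive(int slots, int threads, int iterations) {
    IndexPool pool(slots);
    std::vector<std::atomic<int>> holders(slots);
    for (auto& h : holders) h.store(0);
    std::atomic<bool> ok{true};

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int k = 0; k < iterations; ++k) {
                const int index = pool.acquire();
                if (index < 0 || index >= slots) {
                    ok = false;
                } else {
                    if (holders[index].fetch_add(1) != 0) ok = false;
                    // ImageProc(index, ...) would be called here.
                    holders[index].fetch_sub(1);
                }
                pool.release(index);
            }
        });
    }
    for (auto& w : workers) w.join();
    return ok.load();
}
```

    With this scheme the number of buffers is fixed up front, and if the runtime injects extra threads they wait for a free index instead of running past the end of the buffer array. The costs are one lock per iteration and a wait that the scheduler does not know about, so whether it is acceptable for 140,000,000 iterations would have to be measured.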

    Interesting problem!  I'd like to see an elegant solution to this...