Process Affinity on a System with 128 Processors

  • Question

  • I want to set the affinity group and mask for a process on a machine with 128 processors.  The process is an older application which is not NUMA aware and for which I do not have source code.  Once the application is up and running, a utility application which I wrote detects the process and changes its affinity so that the system will run more efficiently.  The utility uses SetProcessAffinityMask() to do this, and it works great on systems with 64 or fewer processors (i.e. only one group).  But to make this work on a system with multiple groups, I need a way to specify the group number.  There seems to be no corresponding SetProcessGroupAffinity() function which lets me specify the processor group.

    I've tried enumerating all of the threads associated with the process and then calling SetThreadGroupAffinity() to set the group number, but that does not work.  SetThreadGroupAffinity() returns success, but the threads continue to run on the processors assigned by the system.

    Is there a sample anywhere that demonstrates how to do this?

    Thanks,

    Bill

     

    • Moved by Jesse Jiang Tuesday, January 11, 2011 7:00 AM (From:Visual C++ General)
    Friday, January 7, 2011 1:05 AM

Answers

  • In theory you could create a small driver that uses KeSetSystemGroupAffinityThread to change the affinity.  I say in theory because this is a new call with limited documentation, so it may not work.  Of course, once you do it there is the question of whether the application will still work.  I assume you have read http://download.microsoft.com/download/a/d/f/adf1347d-08dc-41a4-9084-623b1194d4b2/MoreThan64proc.docx with its warnings about multiple groups and applications that were not written to take advantage of them.

     


    Don Burn (MVP, Windows DKD) Windows Filesystem and Driver Consulting Website: http://www.windrvr.com Blog: http://msmvps.com/blogs/WinDrvr
    Tuesday, January 11, 2011 3:17 PM

All replies

  • Well, first of all, why do you need to do this? The process group that it runs on is created by Windows taking into account things like physical locations etc. So unless you actually want to run more than 64 threads at once and have them scheduled on different logical processors then I think the process running on one group would actually be more efficient for memory access and other purposes.

    For the issue with SetThreadGroupAffinity, have you actually tested a sample application which makes use of this function? It is possible that it just doesn't like the fact that the call is coming from a completely different process. So try writing a sample that makes this call on its own threads and see how it works.


    Any samples given are not meant to have error checking or show best practices. They are meant to just illustrate a point. I may also give inefficient code or introduce some problems to discourage copy/paste coding. This is because the major point of my posts is to aid in the learning process.
    Visit my (not very good) blog at
    http://ccprogramming.wordpress.com/
    Friday, January 7, 2011 1:27 AM
  • > Well, first of all, why do you need to do this?

    Because it makes 5.5 GBps throughput go to 10 GBps throughput.  Almost double the throughput. 

     

    > So unless you actually want to run more than 64 threads at once ...

    The issue is not how many threads are running, but on what processors those threads are running.  Systems with many processors (e.g. 128) divide the processors into NUMA nodes and then assign each NUMA node to a group.  A group can have at most 64 processors.  Each NUMA node generally has memory attached to it.  When a processor in a NUMA node accesses memory that is attached to that node, there is no performance hit.  However, when a processor on one NUMA node accesses memory on another NUMA node, there is a performance penalty.  Keeping the application threads and the hardware driver's ISR, DPC and kernel-level threads within the same NUMA node (and the same memory) provides a significant increase in performance.  When application threads run on a different NUMA node than the hardware and driver, the penalties can make performance worse than on systems with fewer processors (e.g. 16 processors).
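    To make the group arithmetic concrete, here is a minimal sketch in portable C. It assumes every group holds the same number of processors, which a real system does not guarantee, since Windows builds groups from the NUMA layout. The names group_slot and flat_to_group are made up for illustration and are not Windows APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a (group, mask) pair.
 * This is NOT the Windows GROUP_AFFINITY type. */
typedef struct {
    uint16_t group;  /* processor group number */
    uint64_t mask;   /* one bit per processor within the group */
} group_slot;

/* Map a flat processor index to a group number and a single-bit mask.
 * ASSUMPTION: every group holds exactly `group_size` processors (1..64).
 * Real systems may have groups of unequal size. */
group_slot flat_to_group(unsigned flat_index, unsigned group_size)
{
    group_slot s;

    assert(group_size >= 1 && group_size <= 64);
    s.group = (uint16_t)(flat_index / group_size);
    s.mask  = (uint64_t)1 << (flat_index % group_size);
    return s;
}
```

    Under that assumption, processor 64 of a 128-processor machine is bit 0 of group 1, which is why a single 64-bit mask cannot describe it without a group number.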

    Another issue here is that PCI-E slots are generally physically connected to NUMA nodes.  So when you seat an adapter into a slot, the OS will tell you what group, NUMA node and processor affinity mask has been assigned to it.  The idea is to distribute the workload throughout the system.  So the only way to choose the group, node and affinity for your hardware is to pick the right slot.  However, if you have many adapters (e.g. 16), you really don't have much choice but to distribute them so that each NUMA node has an equal number of adapters.  We're talking about enterprise-level systems here, and load balancing is very important.  So having application threads running on the same group, node and processor affinity mask as the hardware is also very important.

     

    > For the issue with SetThreadGroupAffinity, have you actually tested a sample application which makes use of this function?

    I will give this a try.

     

    > It is possible that it just doesn't like the fact that the call is coming from a completely different process

    Yes, the function was probably intended for the parent thread or the thread itself to modify its own behavior.  However, functions like SetProcessAffinityMask() allow management services like TaskManager to modify the affinity of an external process so that the system will run more efficiently.  There should be a way for TaskManager to change the group number along with the processor affinity mask.

     

    Friday, January 7, 2011 3:25 PM
  •  

    Hi Bill Alexander,

     

    I think your issue should be raised on the Windows WDK and Driver Development forum. I believe they will know more about this issue than we do, so I will move this thread to that forum.

     

    Thanks for your understanding,

    Jesse


    Jesse Jiang [MSFT]

    Tuesday, January 11, 2011 7:00 AM
  • No, this is an application issue, not a driver/kernel-level issue.  The issue is how an independent application sets the group affinity of another process which is already running.  I can set the target process's affinity mask simply by calling SetProcessAffinityMask() -- that works great for systems with 64 or fewer processors.  In fact the performance boost is phenomenal.  However, when you have a system with more than 64 processors, you have to start dealing with group affinity.  Currently, the only API that I have found that lets you set the group number is SetThreadGroupAffinity().  So what I had to do was enumerate the threads running inside the target process and then set each of their group affinities. 

    I noticed that even though using SetThreadGroupAffinity() does give me better performance, it gives me less performance than SetProcessAffinityMask().  I think what is happening is that the process remains rooted in its original affinity settings, and the threads, now with a different affinity, have to coordinate with the process's structures.  Thus, a performance penalty is incurred for going across NUMA nodes.  I guess the Windows scheduler is not smart enough to see that if all of a process's threads are set to a certain affinity, it should change the affinity settings for the entire process.  And like I said, there is *NO* SetProcessGroupAffinity() API which I can use, so I am stuck.

     

    Bill

    Tuesday, January 11, 2011 3:20 PM
  • You did not understand what I meant.  There are kernel APIs that allow a thread's group affinity to be set.  Whether this will do a better job than the user-space SetThreadGroupAffinity remains to be seen, but a small driver could be created that tracks the threads of the process you are interested in and distributes them among multiple groups.

     


    Don Burn (MVP, Windows DKD) Windows Filesystem and Driver Consulting Website: http://www.windrvr.com Blog: http://msmvps.com/blogs/WinDrvr
    Tuesday, January 11, 2011 3:25 PM
  • I think he most likely would write a driver to replicate SetThreadGroupAffinity and end up in exactly the same place. 
    Mark Roddy Windows Driver and OS consultant www.hollistech.com
    Tuesday, January 11, 2011 4:00 PM
    Moderator
  • It is possible that he will get the same result, but KeSetSystemGroupAffinityThread does not seem to be constrained the same way as SetThreadGroupAffinity, which is based on NtSetInformationThread with an undocumented ThreadInformationClass.

     


    Don Burn (MVP, Windows DKD) Windows Filesystem and Driver Consulting Website: http://www.windrvr.com Blog: http://msmvps.com/blogs/WinDrvr
    Tuesday, January 11, 2011 4:19 PM
  • Donald:

     

    Sorry, I was responding to Jesse's response.  If you notice the time stamp on your and my responses, you will see they are two minutes off.  A classic mutual exclusion example if I have ever seen one ;)

    But I think that using the DDI KeSetSystemGroupAffinityThread() would probably give me the same results as the API SetThreadGroupAffinity().  I'll look into it, though.

     

    Thanks,

     

    Bill

    Tuesday, January 11, 2011 6:34 PM
  • I took advantage of still having access to the Windows source for a few more months, until the WDK MVP program ends, to check the paths for this call.  The only major difference between KeSetSystemGroupAffinityThread and SetThreadGroupAffinity appears to be a check of whether the job object supports, and is associated with, the specified group.

    I have no knowledge at this time of how the job object gets its group data or how to change it, but it may be something for Bill to look at.


    Don Burn (MVP, Windows DKD) Windows Filesystem and Driver Consulting Website: http://www.windrvr.com Blog: http://msmvps.com/blogs/WinDrvr
    Wednesday, January 12, 2011 5:38 PM
  • Pavel -

     

    I am a device driver writer trying to do an app -- does that entitle me to bend the rules a little?  Especially since the real goal is to align application affinity with hardware/driver affinity ;)

     

    Bill

    Thursday, January 20, 2011 1:55 AM
  • I found an interesting API which is essentially what I want.

     

    The goal here was to do the same thing as SetProcessAffinityMask() with groups.  As it turns out, there is such a function, but it is undocumented.

     

    I discovered that NtSetInformationProcess() will set both the group and affinity mask for an already running process.  Its usage is something like this:

     

    GROUP_AFFINITY group_affinity = {0};

    group_affinity.Group = 1;  // Second group of 64 processors

    group_affinity.Mask = 0x0000FFFF00000000;  // Processors 32-47 within that group

    // 0x15 is the ProcessAffinityMask information class
    NtSetInformationProcess(hProcess, 0x15, &group_affinity, sizeof(group_affinity));

     

    It works very well and is the same API that TaskManager uses to change the affinity group and mask of processes.
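    For anyone building the mask by hand, the following is a small portable sketch of how a value like 0x0000FFFF00000000 can be derived from a first processor number and a count. The type affinity_request and the helpers cpu_range_mask() and make_request() are hypothetical names for illustration only. The struct merely mirrors the two fields used above and is not the real GROUP_AFFINITY from the Windows headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields used in the snippet above.
 * The real GROUP_AFFINITY structure comes from the Windows headers. */
typedef struct {
    uint64_t mask;   /* processors within the group, one bit each */
    uint16_t group;  /* processor group number */
} affinity_request;

/* Build a mask with `count` consecutive bits set, starting at bit `first`.
 * Both values are relative to the group, so the range must fit in 64 bits. */
uint64_t cpu_range_mask(unsigned first, unsigned count)
{
    uint64_t bits;

    assert(count >= 1 && first < 64 && count <= 64 - first);
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    bits = (count == 64) ? UINT64_MAX : (((uint64_t)1 << count) - 1);
    return bits << first;
}

/* Pair a group number with a mask covering processors first..first+count-1. */
affinity_request make_request(uint16_t group, unsigned first, unsigned count)
{
    affinity_request r;

    r.mask  = cpu_range_mask(first, count);
    r.group = group;
    return r;
}
```

    With these helpers, make_request(1, 32, 16) describes the same target as the values above: group 1, processors 32 through 47.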

     

    Of course, since it is undocumented, it is subject to change, so it is best not to employ this in retail software.  I am using it only to run performance tests in our lab, so no big problem here.

     

    Thanks all for the help.

     

    Bill

    Saturday, January 22, 2011 3:34 PM
  • Hello

    Four years later, it looks like the situation has not improved. NtSetInformationProcess() still works on Windows 7, but I still cannot find any official API function to move an entire process to another group.

    SetProcessAffinityMask() may only change the affinity mask within the current group, assuming the process is already in a single group, and in the right one.

    And if we move all individual threads manually, the affinity of the process still contains the original group/mask for some unknown reason, so the process affinity ends up spanning multiple groups, and the SetProcessAffinityMask() call above no longer works.

    thanks

    Brice

    Wednesday, September 9, 2015 2:35 PM
  • The answer is you don't. If the app is not aware of groups and uses the CPU number as an index into its data structures, you could have two CPUs with the same number, in different groups, executing threads in the app at the same time, causing corruption.
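    The collision described here can be sketched in a few lines of portable C. This is only a model: it assumes two groups of 64 processors each, and naive_slot() and group_aware_slot() are hypothetical names. On Windows, GetCurrentProcessorNumber() returns a group-relative number, while GetCurrentProcessorNumberEx() also reports the group.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE  64  /* ASSUMPTION: every group holds 64 processors */
#define GROUP_COUNT 2   /* ASSUMPTION: a 128-processor machine */

/* Slot chosen by a group-unaware application: it only sees the
 * group-relative processor number and ignores the group entirely. */
size_t naive_slot(unsigned group, unsigned number_in_group)
{
    (void)group;  /* the application never looks at the group */
    assert(number_in_group < GROUP_SIZE);
    return number_in_group;
}

/* Slot that stays unique when threads run in more than one group. */
size_t group_aware_slot(unsigned group, unsigned number_in_group)
{
    assert(group < GROUP_COUNT && number_in_group < GROUP_SIZE);
    return (size_t)group * GROUP_SIZE + number_in_group;
}
```

    Processor 5 in group 0 and processor 5 in group 1 get the same naive slot, so two threads running at the same time could write to the same per-CPU entry. The group-aware index keeps them apart (slots 5 and 69).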

    d -- This posting is provided "AS IS" with no warranties, and confers no rights.

    Wednesday, September 9, 2015 4:46 PM
  • Sounds like a far-fetched argument for not providing a proper API. Because some application might misbehave, we should not have access to a proper API for applications that are processor group aware? Following that logic, I can easily argue that the current SetProcessAffinityMask() should not exist either (what if the application does index its data structure per CPU in a way that's not robust to affinity mask changes? What if? What if?).

    There definitely should be an official SetProcessAffinityMask()-like call that takes a GROUP_AFFINITY structure so that an application can change both the processor group and the affinity mask of a whole process.

    As pointed out by previous posters, changing the processor group and affinity mask of each thread one at a time does not work reliably and leaves hidden knowledge of the old group and affinity mask deep in the process.

    In our case, we want to move processes to the CPU that is directly connected to the GPU assigned by the calc farm scheduler (there are 2 CPU sockets and 8 GPUs). On a machine with more than 64 logical cores, the only "reliable" way I've found for now is to use the undocumented NtSetInformationProcess().

    So far I'm not impressed by how Microsoft has extended the API to deal with processor groups and affinity mask, let alone the helpless replies from Microsoft. SetProcessAffinityMask() is a simple enough concept and API call, the current processor group should just be an extra argument.




    • Edited by GPSnoopy Thursday, February 15, 2018 12:16 PM
    Thursday, February 15, 2018 11:32 AM
  • Apparently, even NtSetInformationProcess() is still not enough.

    When initialising a 3rd party library (Intel MKL 2018), it still manages to create a thread in the old processor group. The funny thing is that the thread has the affinity mask of the new group.

    I wonder where it gets knowledge of the old processor group, as GetProcessGroupAffinity() returns only the new processor group after calling NtSetInformationProcess(). I even tried setting the processor group on all the threads to be sure.

    Edit:

    Apparently, this could be specific to MKL, although it does manage to allocate threads on the right processor group if I do not change the group.

    • Edited by GPSnoopy Thursday, February 15, 2018 5:09 PM
    Thursday, February 15, 2018 2:44 PM