Thursday, May 19, 2011 8:45 AM
We are developing a computationally intensive computational geometry application (a domain-specific CAD). The data structures being manipulated (triangular meshes) are really not very cache friendly, and it is difficult to change their layout mid-run. Running 4 tasks on a two-core hyper-threaded CPU thrashes the cache, and the resulting code runs only as fast as the serial code if I am lucky. Running the same code on just two threads using a custom thread pool speeds up execution considerably.
It would be great if the Concurrency Runtime offered a way to control the number of virtual processors spawned based on the L2/L3 cache size (something like a minimum amount of cache per virtual processor). In our case, around 1.5 MB of L3 cache per thread is optimal.
Maybe just an API telling the Concurrency Runtime to ignore hyper-threading and spawn a single virtual-processor thread per physical core would be sufficient. That is how I will patch my custom thread pool.
Saturday, May 21, 2011 5:27 PM
Thank you for your suggestion. Please also submit it through connect.microsoft.com.
Currently, the ConcRT runtime provides affinity control at the granularity of a node (basically an L3 cache domain) and then lets the OS make the final scheduling placement of each thread on a CPU. If you have a single hyper-threaded node and reduce the scheduler's concurrency to half the currently available threads, you should see the speed-ups you are looking for. This workaround *does not work* if you have multiple nodes; in that case it would reduce your throughput, because you would likely get a subset of the nodes instead of spreading the load across all of them.
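To make the workaround concrete: reducing the default scheduler's concurrency is done by setting a SchedulerPolicy with the MaxConcurrency key before the default scheduler is first used. A minimal sketch (MSVC/Windows only, since ConcRT ships with the Microsoft C++ runtime); the halving again assumes a 2-way hyper-threaded single node:

```cpp
#include <concrt.h>   // Concurrency::Scheduler, SchedulerPolicy (MSVC only)
#include <thread>

int main() {
    using namespace Concurrency;

    // hardware_concurrency() reports logical processors; halve it,
    // assuming 2-way hyper-threading on a single node.
    unsigned logical = std::thread::hardware_concurrency();
    unsigned cores = logical > 1 ? logical / 2 : 1;

    // The leading argument is the number of key/value pairs that follow.
    SchedulerPolicy policy(2,
                           MinConcurrency, static_cast<unsigned>(1),
                           MaxConcurrency, cores);

    // Must be called before the default scheduler is created,
    // i.e. before any PPL/ConcRT work is started.
    Scheduler::SetDefaultSchedulerPolicy(policy);

    // Subsequent parallel_for / task_group work now runs on at most
    // `cores` virtual processors.
}
```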
I hope that helps you in the short term. I welcome your suggestion and would love to know more about your ConcRT use either through this forum or you can send me email directly at dana dot groff at microsoft dot com.
Dana Groff, Senior Program Manager Parallel Computing, Concurrency Runtime