I compared the performance of parallelizing small number of large tasks and parallelizing large number of small tasks. I found that parallelizing small number of large tasks has superior performance, but parallelizing large number of small tasks could perform even worse than simple serial method. So we should not blindly rewrite all loops into parallel_for loops. Better to divvy up into small number of big tasks, rather than just letting parallel_for take care of everything.
The following is some sample codes I tested:
void CallTask(int ttask)
{
for(int j=0; j<ttask; j++)
{
// Some very simple task, e.g.:
int test = 1;
test = test * 4 / 3;
}
}
void main()
{
/***
// The following set of parameters make small number of large tasks
// Parallel computing is good for this setting!
const int ntask = 10; // Number of tasks
const int ttask = 1000000000; // Length of each task
/***/
// The following set of parameters make large number of small tasks
// Parallel computing isn't really good for this setting!
const int ntask = 100000000; // Number of tasks
const int ttask = 1; // Length of each task
/***/
time_t tstart, tend;
cout << "Do in serial : \n" ;
time (&tstart);
for(int i=0; i<ntask; i++)
CallTask(ttask);
time (&tend);
double t1 = difftime(tend, tstart);
cout << "\n Time : " << t1;
cout << "\n\n Do in parallel : \n" ;
time (&tstart);
parallel_for(0, ntask, [&](int i) {
CallTask(ttask);
}) ;
time (&tend);
double t2 = difftime(tend, tstart);
cout << "\n Time : " << t2 << " = " << t2 / t1 * 100 << " % of t1" << "\n\n" ;
}