Parallelization of long running processes and performance optimization

    Question

  • I would like to parallelize an application that processes multiple video clips frame by frame. The order of the frames within each clip is important (obviously).
    I decided to go with TPL Dataflow since I believe this is a good example of dataflow (movie frames being data).

    So I have one process that loads frames from the database (let's say in batches of 500, with frames from different movies interleaved)

        Example sequence:    
        |mid:1 fr:1|mid:1 fr:2|mid:2 fr:1|mid:3 fr:1|mid:1 fr:3|mid:2 fr:2|mid:2 fr:3|mid:1 fr:4|

    and posts them to a BufferBlock. To this BufferBlock I have linked ActionBlocks, each with a filter, so that there is one ActionBlock per MovieID and I get a form of data partitioning. Each ActionBlock is sequential, but ideally the ActionBlocks for different movies can run in parallel.
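
    A minimal sketch of the network described above, for illustration only. The Frame type, the FramePipeline class, and the processFrameAsync delegate are hypothetical names that do not come from the original post, and the sketch assumes the System.Threading.Tasks.Dataflow package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical frame record; the real type is not shown in the post.
public sealed class Frame
{
    public int MovieId { get; set; }
    public int FrameNumber { get; set; }
}

public static class FramePipeline
{
    // Builds one sequential ActionBlock per movie ID and links each one to
    // the shared buffer with a filter, so frames of a single movie stay in
    // order while different movies can be processed concurrently.
    public static async Task RunAsync(
        IEnumerable<Frame> batch,
        IEnumerable<int> movieIds,
        Func<Frame, Task> processFrameAsync)
    {
        var buffer = new BufferBlock<Frame>();
        var workers = new List<Task>();
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

        foreach (int movieId in movieIds)
        {
            int id = movieId; // per-iteration copy captured by the filter lambda
            var block = new ActionBlock<Frame>(
                processFrameAsync,
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
            buffer.LinkTo(block, linkOptions, frame => frame.MovieId == id);
            workers.Add(block.Completion);
        }

        // Discard frames that match no filter; otherwise they would sit at
        // the head of the buffer and stall every frame behind them.
        buffer.LinkTo(DataflowBlock.NullTarget<Frame>());

        foreach (Frame frame in batch)
            buffer.Post(frame);

        buffer.Complete();
        await Task.WhenAll(workers);
    }
}
```

    MaxDegreeOfParallelism = 1 is what keeps each movie's frames in sequence. Concurrency comes only from having several blocks active at once, so the number of frames in flight is capped by the number of distinct movies in the batch.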

    I do have the network described above working, and it does run in parallel, but by my calculations only eight to ten ActionBlocks execute simultaneously. I timed each ActionBlock's running time and it's around 100-200 ms.

    Each ActionBlock reads and writes data from/to SQL. CPU usage stays under 50%, averaging 25%. I converted the DB access methods to async, which increased throughput by around 20%, but I still think I can do better.

    What steps can I take to at least double concurrency?

    I read a post about TPL Dataflow arguing against using TaskScheduler.Default (the thread pool) in action blocks, but I can't remember where I saw that thread. It stated that the thread pool is very pessimistic about spawning new threads and offered an alternative, but I can't remember what it was (not ConcurrentScheduler).

    Thursday, November 15, 2012 3:13 PM

Answers

  • Hi Dimitri-

    If you think it's an issue with the ThreadPool not injecting fast enough for your particular scenario, you could try forcing it to spawn threads more quickly by using ThreadPool.SetMinThreads and see if that improves throughput.
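
    As a rough sketch of that suggestion, the helper below raises the worker-thread minimum once at startup. ThreadPoolTuning and RaiseMinWorkerThreads are hypothetical names, and the value passed in is an assumption that should be tuned by measurement rather than a recommended setting.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning
{
    // Raises the worker-thread minimum so the pool creates threads on demand
    // up to that count instead of injecting them gradually.
    // Returns false if the runtime rejects the requested values.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        int workers, ioThreads;
        ThreadPool.GetMinThreads(out workers, out ioThreads);

        // Never lower the existing minimum; leave the I/O completion
        // thread minimum unchanged.
        int newWorkers = Math.Max(workers, desiredWorkers);
        return ThreadPool.SetMinThreads(newWorkers, ioThreads);
    }
}
```

    A call such as ThreadPoolTuning.RaiseMinWorkerThreads(64) before the dataflow network starts would be one way to try this; comparing throughput before and after shows whether thread injection was really the limit.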

    (Note that doing "await Task.Run(() => DbAccessCode())" isn't really going to help with ThreadPool utilization, as you're just changing which thread gets blocked doing the I/O rather than reducing the number of blocked threads.)
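
    For contrast, here is a sketch of database access that is asynchronous all the way down, so no thread pool thread is blocked while the query runs. It assumes .NET 4.5, where SqlConnection.OpenAsync and SqlCommand.ExecuteNonQueryAsync are available. FrameStore, InsertFrameResultAsync, and the column names are hypothetical; only the table name TableB comes from the thread.

```csharp
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class FrameStore
{
    // Hypothetical insert; the column names are assumptions for illustration.
    public static async Task<int> InsertFrameResultAsync(
        string connectionString, int movieId, int frameNumber)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO TableB (MovieId, FrameNumber) VALUES (@movieId, @frameNumber)",
            connection))
        {
            command.Parameters.AddWithValue("@movieId", movieId);
            command.Parameters.AddWithValue("@frameNumber", frameNumber);

            // The method yields at each await and resumes when the I/O
            // completes, instead of parking a thread in a blocking call.
            await connection.OpenAsync().ConfigureAwait(false);
            return await command.ExecuteNonQueryAsync().ConfigureAwait(false);
        }
    }
}
```

    An ActionBlock built from an async delegate that awaits a method like this gives its thread back to the pool during each database round trip, which is the reduction in blocked threads that wrapping synchronous code in Task.Run cannot provide.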

    Tuesday, November 27, 2012 4:23 AM
    Owner

All replies

  • Hi Dimitri-

    To make it go faster, it'd first be important to understand what the bottlenecks are.

    What kind of hardware are you running on? e.g. you say CPU usage stays under 50%... is this a dual-core, quad-core, etc.?  And do you have hyperthreading enabled?  If, for example, you were on a dual-core hyperthreaded processor, you'd expect CPU utilization to not be significantly more than 50%, as the additional hardware threads provided by hyperthreading aren't true cores.

    Have you done any calculations as to what optimal performance should be? e.g. have you timed the various pieces of a sequential implementation, such as how long it takes to read/write data from/to SQL for each frame, how long it takes to compute each frame, how long the whole processing of a movie takes, etc.?

    Have you done any profiling of the implementation?  You could use the Concurrency Visualizer in Visual Studio, for example, to see what each thread is doing over the course of the processing, whether it's doing compute and what it's computing, whether it's blocked on something and what it's blocked on, etc.  In Visual Studio 2012, you can enable TPL Dataflow events to show up on the Concurrency Visualizer timeline as well.

    It sounds like you might actually be I/O-bound with all of your database access.

    Thursday, November 15, 2012 6:03 PM
    Owner
  • Hey Stephen,

    Thanks for the reply. Actually, I have benchmarked per-frame processing and the overall process pretty extensively. There are three major DB operations inside the frame-processing routine: one selects from TableA (40-50 ms avg), another inserts into TableB (~100 ms), and the third updates TableC and TableD (150 ms). So each frame takes about 300-400 ms on average to process. The processor on my dev machine is a quad-core with HT, but the production machine will have two quad-core processors with HT. I did try using async/await on the DB tasks and it improved performance by 20-25%. It does seem to me that I'm I/O-bound. Any thoughts on squeezing more juice out of this?

    Thanks

    P.S. I just ran the profiler and it shows that 91% of the time is spent on Synchronization. After digging deeper to see what exactly is synchronizing, I found that every part of the code that I converted to asynchronous is synchronizing (every occurrence of await TaskEx.Run(()=> { BDAccessCode()});). Now this might be the expected behaviour, since the async code has to return a value to the main thread, but in another article I read that synchronization is a sign of bad program design. Any input on this? (The other values in the profiler are: 2% Execution, 5% Sleep, 3% Preemption.)

    EDIT: I implemented an extra level of data partitioning: frames for movies with odd IDs are processed on ServerA, and frames for movies with even IDs are processed on ServerB. Both instances of the application hit the same database. If my problem were DB I/O, I would see little or no improvement (under 20%) in the total number of frames processed. But I see it doubling. This leads me to conclude that the ThreadPool is not spawning enough threads to process more frames in parallel (both servers are quad-cores, and the profiler shows about 25-30 threads per application).

    Friday, November 16, 2012 3:27 PM