Best Practices on Processing Large Amounts of Data

  • Question

  • I have about 22 million records that I take from a database and create objects for. I am doing some manipulation on them and was curious to know the fastest way to process these records. Is using a thread faster than using foreach?
    Give yourself a round of applause!!
    Tuesday, August 4, 2009 1:36 PM

Answers

  • When I get the data from the database I create objects out of it. I have reduced the server calls down to one call, which made it a tad faster, but then I create XML documents from those objects, take the objects 500 at a time, and create a zip file that consists of 500 XML documents, which are representations of the objects from the database.

    I was thinking of going with threading, but that will not eliminate the loop. I have no UI; this is a batch process or console app, so keeping the front end responsive is not really a concern. I am just trying to make it go as fast as possible.


    Give yourself a round of applause!!





        using System;
        using System.Threading;

        class Program
        {
            // Create a new Mutex. The creating thread does not own the
            // Mutex.
            private static Mutex mut = new Mutex();
            private const int numIterations = 1;
            private const int numThreads = 3;

            static void Main()
            {
                MutexDemo();  // threads continue to execute after MutexDemo returns
                Console.ReadKey();
                //
                JoinDemo();  // main thread waits for all threads to complete
                Console.ReadKey();
            }

            private static void MutexDemo()
            {
                // Create the threads that will use the protected resource.
                for (int i = 0; i < numThreads; i++)
                {
                    Thread myThread = new Thread(new ThreadStart(MyThreadProc));
                    myThread.Name = String.Format("Thread{0}", i + 1);
                    myThread.Start();
                }

                // MutexDemo returns here while the threads keep running; the
                // application stays alive until all foreground threads have exited.
            }
            private static void MyThreadProc()
            {
                for (int i = 0; i < numIterations; i++)
                {
                    UseResource();
                }
            }

            // This method represents a resource that must be synchronized
            // so that only one thread at a time can enter.
            private static void UseResource()
            {
                // Wait until it is safe to enter.
                mut.WaitOne();
                try
                {
                    Console.WriteLine("{0} has entered the protected area",
                        Thread.CurrentThread.Name);

                    // Place code to access non-reentrant resources here.

                    // Simulate some work.
                    Thread.Sleep(500);

                    Console.WriteLine("{0} is leaving the protected area\r\n",
                        Thread.CurrentThread.Name);
                }
                finally
                {
                    // Release the Mutex even if the protected code throws.
                    mut.ReleaseMutex();
                }
            }

            private static void JoinDemo()
            {
                AutoResetEvent autoEvent = new AutoResetEvent(false);

                Thread regularThread =
                    new Thread(new ThreadStart(ThreadMethod));
                regularThread.Start();
                ThreadPool.QueueUserWorkItem(new WaitCallback(WorkMethod),
                    autoEvent);

                // Wait for foreground thread to end.
                regularThread.Join();

                // Wait for background thread to end.
                autoEvent.WaitOne();
            }

            private static void ThreadMethod()
            {
                Console.WriteLine("ThreadOne, executing ThreadMethod, " +
                    "is {0}from the thread pool.",
                    Thread.CurrentThread.IsThreadPoolThread ? "" : "not ");
            }

            private static void WorkMethod(object stateInfo)
            {
                Console.WriteLine("ThreadTwo, executing WorkMethod, " +
                    "is {0}from the thread pool.",
                    Thread.CurrentThread.IsThreadPoolThread ? "" : "not ");

                // Signal that this thread is finished.
                ((AutoResetEvent)stateInfo).Set();
            }

        }


    Happy Coding

    Rudedog  =9^D


    Mark the best replies as answers. "Fooling computers since 1971."
    • Marked as answer by Bin-ze Zhao Friday, August 7, 2009 10:37 AM
    Tuesday, August 4, 2009 7:44 PM

All replies

  • "Is using a thread faster than using ForEach?"

    Yes and No. 
    A thread will allow your UI (user interface) to remain responsive.
    Whether or not you use a thread or a BackgroundWorker, you will most likely still need some form of looping statement.
    Mark the best replies as answers. "Fooling computers since 1971."
    Tuesday, August 4, 2009 2:07 PM
  • Doing the manipulation in a stored procedure may be much much faster--saves you having to round trip to the database and back.

    Tuesday, August 4, 2009 2:12 PM
  • Last I checked, working in RAM is faster than working on a disk drive. So if the data is held locally in ADO, manipulation will tend to be faster. However, depending on the type of changes you are attempting to make, there are ways to do it on the database side or on the client side.

    In general, I would encourage doing the work on the client side (with ADO) first.

    Tuesday, August 4, 2009 2:29 PM
  • What makes you think stored procedures don't load the data into RAM?

    And to answer the original question, if you have a multicore CPU and the data doesn't need to be processed sequentially (one record doesn't rely on the last), using as many threads as you have cores should speed up the process.
    Tuesday, August 4, 2009 2:46 PM
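
    A minimal sketch of the "one worker per core" suggestion above, assuming the records are independent of each other. It uses `Parallel.For` from the Task Parallel Library, which needs .NET 4 or later and so was not yet available when this thread was written. `ParallelRecords`, `Transform` and the `int` record type are hypothetical stand-ins for the real record type and manipulation.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelRecords
{
    // Hypothetical stand-in for the real per-record manipulation.
    public static int Transform(int record)
    {
        return record * 2;
    }

    // Processes independent records with at most one worker per core.
    public static int[] ProcessAll(int[] records)
    {
        var results = new int[records.Length];
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // Each iteration writes only to its own slot, so no locking is needed.
        Parallel.For(0, records.Length, options, i =>
        {
            results[i] = Transform(records[i]);
        });

        return results;
    }
}
```

    Whether this actually beats a plain loop depends on how expensive the per-record work is, so it is worth timing both versions on real data.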
  • Hi,

    I would also recommend that a stored procedure or query is used.

    Actually though it all depends on what you want to do with the records. What sort of manipulation are you talking about?

    If you can avoid the network hit of downloading the records from the database into local memory, in other words do as much of the processing as possible on the server, then you're going to save time and resources.

    If you need to do the processing locally then I'd recommend using some techniques of functional programming to process the data. It will make processing the data quicker as you can run things in parallel a lot easier.

    Map -> Reduce.

    www.dsmyth.net | www.dsmyth.net/wiki
    Tuesday, August 4, 2009 2:59 PM
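
    As a rough illustration of the Map -> Reduce idea mentioned above, here is a small sketch using PLINQ (`AsParallel`, available from .NET 4). The class name `MapReduceSketch` and the squaring in `Map` are assumptions made up for the example; the real map step would be whatever manipulation the records need.

```csharp
using System.Linq;

public static class MapReduceSketch
{
    // Map: hypothetical per-record transformation with no shared state.
    public static long Map(int record)
    {
        return (long)record * record;
    }

    // Reduce: combine the mapped values into a single result.
    public static long SumOfSquares(int[] records)
    {
        return records
            .AsParallel()            // let PLINQ partition the input across cores
            .Select(r => Map(r))     // map step, runs in parallel
            .Sum();                  // reduce step
    }
}
```

    Because `Map` touches no shared state, the runtime is free to run it on several cores without any locking in user code.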
  • When I get the data from the database I create objects out of it. I have reduced the server calls down to one call, which made it a tad faster, but then I create XML documents from those objects, take the objects 500 at a time, and create a zip file that consists of 500 XML documents, which are representations of the objects from the database.

    I was thinking of going with threading, but that will not eliminate the loop. I have no UI; this is a batch process or console app, so keeping the front end responsive is not really a concern. I am just trying to make it go as fast as possible.


    Give yourself a round of applause!!
    Tuesday, August 4, 2009 3:35 PM
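
    The batching described above (500 objects per zip file, one XML document per object) might look roughly like the sketch below. It is not the original poster's code: `BatchZipper`, the `Record` element and the `record-N.xml` entry names are invented for illustration, and `ZipArchive` requires .NET 4.5 or later.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class BatchZipper
{
    // Splits a sequence into consecutive batches of at most batchSize items.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)
        {
            yield return batch; // final, possibly smaller, batch
        }
    }

    // Writes one XML entry per record id into a single zip archive.
    public static void WriteZip(Stream output, IList<int> recordIds)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(output, ZipArchiveMode.Create, true))
        {
            foreach (int id in recordIds)
            {
                ZipArchiveEntry entry = zip.CreateEntry("record-" + id + ".xml");
                using (Stream entryStream = entry.Open())
                {
                    // Hypothetical XML shape; the real documents would carry the object's fields.
                    new XElement("Record", new XAttribute("Id", id)).Save(entryStream);
                }
            }
        }
    }
}
```

    Each batch is independent of the others, so batches are a natural unit of work to hand to separate threads if the zip step turns out to be CPU-bound.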
  • No UI.  Follow Scottie's advice.

    Mark the best replies as answers. "Fooling computers since 1971."
    Tuesday, August 4, 2009 4:19 PM
  • Using as many threads as you have cores can help; don't use more. But if the bottleneck here is writing the XML files to the hard drive (which it is unless you do a lot of processing first), it's not going to help. The only help there is to get a faster hard drive; an SSD would do wonders.

    Also, what are you doing this for? If you tell us the reason you're serializing sets of 500 xml files to your harddrive we might be able to figure out a better way to do things.
    Tuesday, August 4, 2009 4:47 PM
  • If you are fond of threads like me, they can surely help you.
    E.g., if you have a dual-core processor, your application can take advantage of both processors to process the 22 million records by using multiple threads. I don't know whether you are experienced with parallel programming, but in this scenario it would be an ideal solution.
    22 million records = 22 threads, each processing 1 million records.
    That means approximately 11 threads per processor if you have 2 processors.
    You will see a significant gain compared to using 1 thread, 1 process, and 1 processor.
    A foreach statement will take a ____ of life to process 22 million records. If your application interacts with the user, this will fail, because the user can't wait for the application to process 22 million records.

    is that helpful?
    TK
    • Marked as answer by Tryin2Bgood Tuesday, August 4, 2009 6:38 PM
    • Unmarked as answer by Tryin2Bgood Tuesday, August 4, 2009 6:38 PM
    Tuesday, August 4, 2009 6:35 PM
  • If you have two processors use 2 threads, any more is just additional complexity and overhead for no reason.
    Tuesday, August 4, 2009 7:04 PM
  • ScottyDoesKnow, I must disagree with you. Threads are made for performance, not overhead. In any application there are many background threads that are not causing overhead.
    If the recommendation were number of threads = number of processors, then there would be no use for multithreading.
    Threads are made to break complex data down into smaller chunks, with each thread assigned a chunk on which to perform the required processing.
    Of course it will be complex, but you will gain significant performance.
    TK
    Tuesday, August 4, 2009 7:16 PM
  • Using 11 threads on 1 processor is less efficient than 1 thread on 1 processor. Therefore 22 threads on 2 processors is less efficient than 2 threads on 2 processors.

    Having two threads means the threads can be run in parallel if there are two processors, adding another thread doesn't let 3 run in parallel because there are only 2 processors. If you have 11 threads on one processor all it does is keep switching between the threads. Switching between threads adds overhead. Multithreading is used to keep a UI active (mainly for long blocking calls, processing or HW) or for performance if you have multiple processors.

    Tuesday, August 4, 2009 7:30 PM
  • But it still gives you significant performance compared to 2 threads, doesn't it?
    Moreover, is switching between 11 threads really a difficult task for a 3.0 GHz processor?
    TK
    • Edited by Talal Khan Tuesday, August 4, 2009 7:39 PM
    Tuesday, August 4, 2009 7:32 PM
  • If you have two processors use 2 threads, any more is just additional complexity and overhead for no reason.

    And 2 threads won't give you much of a performance advantage over a single thread.  It makes it easy to keep the UI active without a performance loss.  Context switches from thread to thread are time consuming.
    Tuesday, August 4, 2009 8:14 PM
  • I agree that context switching is a time-consuming task for a processor with low cycles per second, but it is not a problem for 3 GHz processors. Just as memory was a problem 10 years ago when we had 32 MB of RAM, that is gone because now we have 6 GB of RAM. Moreover, the OP wants threads for performance, as there is no user interaction in this application, so in my view the processors will be dedicated to this application alone. The days are gone when an OS would take milliseconds for a context switch. These days it's a fraction of nanoseconds.
    Above all, these 22 threads will take less time than any other method to process 22 million records.

    TK
    • Edited by Talal Khan Tuesday, August 4, 2009 8:22 PM
    Tuesday, August 4, 2009 8:18 PM
  • I agree that context switching is a time-consuming task for a processor with low cycles per second, but it is not a problem for 3 GHz processors. Just as memory was a problem 10 years ago when we had 32 MB of RAM, that is gone because now we have 6 GB of RAM. Moreover, the OP wants threads for performance, as there is no user interaction in this application, so in my view the processors will be dedicated to this application alone. The days are gone when an OS would take milliseconds for a context switch. These days it's a fraction of nanoseconds.
    Above all, these 22 threads will take less time than any other method to process 22 million records.

    TK

    You know not of what you speak.
    Tuesday, August 4, 2009 8:30 PM
  • Excuse me? Can you be more polite. I was just trying to give my viewpoint. We should not offend others by such comments. Please edit it and reply with appropriate comments.
    Thanks

    TK
    Tuesday, August 4, 2009 8:33 PM
  • "fractions of nanoseconds"   :)
    Mark the best replies as answers. "Fooling computers since 1971."
    Tuesday, August 4, 2009 8:34 PM
  • Is something wrong with it? I was just giving an example of how trivial context switching is nowadays for an OS. If you don't like it, let's make it "fraction of seconds" :). But at least don't insult other members.
    TK
    • Edited by Talal Khan Tuesday, August 4, 2009 8:40 PM
    Tuesday, August 4, 2009 8:36 PM
  • Fraction means divide.
    nanoseconds = 1 billionth of second.
    so if nanosecond is x and fraction denominator is y
    fraction of nanoseconds is x/y.
    Simple math. :)
    TK
    • Proposed as answer by ShellShock Wednesday, August 5, 2009 8:24 AM
    Tuesday, August 4, 2009 8:45 PM
  • Well, it definitely takes more than a nanosecond to switch threads; I would guess it takes milliseconds, but I haven't looked it up. But anyway, multiple threads on a single processor can never be more efficient than one. Let's say it takes 1 second per million records.

    a) 1 thread doing 11 million records = 11 seconds
    b) 11 threads doing 11 million records = 11 seconds + the time to switch between threads

    Two threads can never run in parallel because there's only 1 processor, it is faked by constantly switching between threads. A is both faster and simpler than B. There is no reason to do B.

    Tuesday, August 4, 2009 10:31 PM
  • Well, it definitely takes more than a nanosecond to switch threads; I would guess it takes milliseconds, but I haven't looked it up. But anyway, multiple threads on a single processor can never be more efficient than one. Let's say it takes 1 second per million records.

    a) 1 thread doing 11 million records = 11 seconds
    b) 11 threads doing 11 million records = 11 seconds + the time to switch between threads

    Two threads can never run in parallel because there's only 1 processor, it is faked by constantly switching between threads. A is both faster and simpler than B. There is no reason to do B.


    Yes--listen to the man: Scotty Does Know.

    I would still like to know why the OP feels the need to serialize all these objects to XML, as Scotty already asked. I would also question deserializing the data into objects in the first place, for this number of records. Although there is a single call to the database to get the data, converting it into (presumably) .Net objects will be relatively slow, probably a lot slower than writing a stored procedure to do all the data manipulation. Executing the stored procedure will hit the database performance, but this will not be a problem if it is running as a batch process overnight, when there is little other activity.

    I am a fan of C# and .Net but I have to resist the temptation to write everything in C# code, when sometimes it is better to use the power of the database. I have seen major performance gains by pushing code from the middle tier down to the database. Of course, this is the conflict between whether the database should just be a dumb store (more portable, better separation of concerns) and all code should be in the app, or whether we should use as much of the processing power of the database as possible (less portable, more load on the database server, but better performance).

    Wednesday, August 5, 2009 8:30 AM
  • Hey folks,

    It should be possible to get the SQL to return the data in the XML format needed. If the objects aren't very large, is there any advantage in storing them in 500 separate files?

    Maybe there is. Perhaps the XML files are going to another system that you don't have any control over and the single XML file is required. However, maybe there isn't an advantage and a single XML document can be used to represent all 500 objects. It depends on the size of the XML file and how that XML file is being read in, but maybe it's another option to think about.

    Everyone is talking about using many threads, and I agree completely, but only if it's absolutely necessary.

    The database server should have far more !umph! than the local machine and I'd take advantage of that. Do the processing on the server.



    What do I think about 2 threads versus 11 threads? I'd probably go for 3 or 4 threads: 1 thread to read the data and separate it into chunks, 2 threads (from the thread pool) to process the chunks of data, and 1 last thread taking the results and doing the file I/O. I believe this would reduce the amount of locking required, and it allows the number of processing threads to be changed.

    Threading makes development much more difficult, so keep it simple and develop it in such a way that you can change the number of threads doing the processing. Maybe 4 threads is faster, maybe 10 is faster, maybe 2 is faster. Develop it so you can change the number of processing threads and see for yourself what the optimal number is.

    www.dsmyth.net | www.dsmyth.net/wiki
    Wednesday, August 5, 2009 9:26 AM
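
    The reader / processors / writer layout suggested above could be sketched as follows, with the number of processing threads as a parameter so different values can be timed. This is only one possible arrangement, built on `BlockingCollection` and `Task.Run` (.NET 4.5 or later); `PipelineSketch`, the queue capacity of 1000 and the `record + 1` processing step are hypothetical placeholders.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Runs reader -> N workers -> writer and returns what the writer collected.
    // Result order is not guaranteed when workerCount is greater than 1.
    public static List<int> Run(IEnumerable<int> source, int workerCount)
    {
        // Bounded queues keep the reader from racing far ahead of the workers.
        var input = new BlockingCollection<int>(1000);
        var output = new BlockingCollection<int>(1000);
        var written = new List<int>();

        // Reader: feeds records into the input queue, then signals completion.
        Task reader = Task.Run(() =>
        {
            try
            {
                foreach (int record in source) input.Add(record);
            }
            finally
            {
                input.CompleteAdding();
            }
        });

        // Workers: the count is a parameter so it can be tuned by measurement.
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                foreach (int record in input.GetConsumingEnumerable())
                {
                    output.Add(record + 1); // hypothetical processing step
                }
            });
        }

        // Writer: the only consumer of results, so "written" needs no lock.
        Task writer = Task.Run(() =>
        {
            foreach (int result in output.GetConsumingEnumerable())
            {
                written.Add(result);
            }
        });

        reader.Wait();
        Task.WaitAll(workers);
        output.CompleteAdding(); // no more results will arrive
        writer.Wait();
        return written;
    }
}
```

    Calling `Run` with worker counts of 1, 2, 4 and so on while timing each run is a simple way to find the best setting for a given machine and workload.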
  • Using as many threads as you have cores can help; don't use more. But if the bottleneck here is writing the XML files to the hard drive (which it is unless you do a lot of processing first), it's not going to help.
    If the bottleneck is writing the XML files to the hard drive, isn't it better to use more threads than cores? While the hard drive is writing, there's no need to wait for it to finish before starting another thread.
    Saturday, August 14, 2010 2:17 PM