Handling 5 million records

  • Question

  • All,

    I need a suggestion. Below is my requirement.

    I have millions of transaction records stored in a table, and I need to calculate commission based on the transaction amount along with a few other constraints such as spend type, customer category, etc. The solution should work in both SQL Server and Oracle. On average, we expect 5 million records every day.


    I have been thinking of the following approaches:

    a. Use stored procedures (I would have to maintain two sets, one for SQL Server and another for Oracle)
    b. Write everything in .NET code using a DataSet and DataTable, then do a bulk insert (I am not sure about the memory limitations, or about bulk insert support for Oracle)
    c. Write .NET code using a DataReader, do the calculation row by row, and insert (maybe multi-threading would help speed up the process)
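
    For concreteness, here is a minimal sketch of what options (b) and (c) might look like combined: stream rows with a DataReader, compute the commission in .NET, and bulk-insert the results in batches so memory use stays bounded. The table names (Transactions, Commissions), the columns, and the flat 2% rule are placeholders, not the real schema.

        // Minimal sketch of options (b)/(c): stream rows with a DataReader, compute
        // the commission in .NET, and bulk-insert the results in batches.
        using System;
        using System.Data;
        using System.Data.SqlClient;

        class CommissionJob
        {
            const string ConnStr = "Server=.;Database=Sales;Integrated Security=true";
            const int FlushSize = 50000; // flush periodically instead of buffering all 5M rows

            static decimal CalculateCommission(decimal amount, string spendType, string category)
            {
                return amount * 0.02m; // placeholder for the real business rules
            }

            static void Main()
            {
                DataTable buffer = NewBuffer();
                using (var src = new SqlConnection(ConnStr))
                {
                    src.Open();
                    var cmd = new SqlCommand(
                        "SELECT TransactionId, Amount, SpendType, CustomerCategory FROM Transactions", src);
                    using (SqlDataReader reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            buffer.Rows.Add(reader.GetInt64(0),
                                CalculateCommission(reader.GetDecimal(1), reader.GetString(2), reader.GetString(3)));
                            if (buffer.Rows.Count >= FlushSize)
                            {
                                Flush(buffer);
                                buffer = NewBuffer();
                            }
                        }
                    }
                }
                if (buffer.Rows.Count > 0) Flush(buffer);
            }

            static DataTable NewBuffer()
            {
                var t = new DataTable();
                t.Columns.Add("TransactionId", typeof(long));
                t.Columns.Add("Commission", typeof(decimal));
                return t;
            }

            static void Flush(DataTable rows)
            {
                // SqlBulkCopy is SQL Server only; Oracle would need ODP.NET's
                // OracleBulkCopy or array binding instead.
                using (var dest = new SqlConnection(ConnStr))
                {
                    dest.Open();
                    using (var bulk = new SqlBulkCopy(dest) { DestinationTableName = "Commissions" })
                    {
                        bulk.WriteToServer(rows);
                    }
                }
            }
        }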

    Any suggestions for a better solution, or any light you can throw on these approaches, would be really appreciated.

    Thanks
    Sriram 
    Friday, September 18, 2009 3:03 PM

Answers

  • I agree with you, Reed. But unfortunately, we want to keep these procs database-independent. I would like to explore more options before trying to convince the customer. Please help.

    Sriram,

    My suggestion would be to use the following to convince the customer:

    Creating a stored procedure on a database and testing it will take X developer time.  Since you're targeting two databases, this will be 2*X.  Make a reasonable estimate of this.  However, the advantage here is that all of the processing will be handled on the database itself, and the database is specifically optimized to handle this type of operation.

    Creating a database-independent means of analyzing this data will take Y developer time.  Making this an optimal algorithm will take Z time (which, I guarantee, if you're being realistic, will probably be many, many multiples of the original X, since multithreading is not simple to test thoroughly).  At best, this will be a very complex operation, and it will require a large amount of data to be pulled back and forth across the wire.  Not only will this put a higher load on the database server (since it will need to serve a LOT of data), it will also require quite a bit of processing time on the clients and produce a slower result.

    When you actually write it all out in terms of developer time and resources, you'll see that having 2 implementations (in this case) is a far, far more optimal approach.  It's an easy sell.

    Now, I'm not saying database independence is not a good general goal - and I'd try to keep the logic of the two implementations as similar as possible, so that it's easy to keep both maintained.  However, there are times when having two codebases makes more sense, and this is a perfect example of a case where you may need two separate backends if you want a well-performing solution.
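
    As a sketch of how the .NET side can stay identical while each backend keeps its own procedure, ADO.NET's provider factories let the calling code differ only in the provider name and connection string. The procedure name CalculateCommissions here is hypothetical:

        // One .NET call site for both backends via DbProviderFactories.
        using System.Data;
        using System.Data.Common;

        static class CommissionRunner
        {
            // providerName: "System.Data.SqlClient" for SQL Server, or
            // "Oracle.DataAccess.Client" (ODP.NET) for Oracle.
            public static void Run(string providerName, string connectionString)
            {
                DbProviderFactory factory = DbProviderFactories.GetFactory(providerName);
                using (DbConnection conn = factory.CreateConnection())
                {
                    conn.ConnectionString = connectionString;
                    conn.Open();
                    using (DbCommand cmd = conn.CreateCommand())
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.CommandText = "CalculateCommissions"; // hypothetical proc name
                        cmd.CommandTimeout = 0; // long-running batch job: no timeout
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }

    Each database still gets its own hand-tuned procedure body; only the procedure name and parameters need to match across the two backends.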
    Reed Copsey, Jr. - http://reedcopsey.com
    • Marked as answer by Sriram M Wednesday, September 23, 2009 7:52 AM
    Sunday, September 20, 2009 6:35 PM

All replies

  • I would strongly recommend using approach a - write two stored procs.

    Pulling 5 million records back and forth across the wire in order to do these types of calculations within .NET is not going to be easy on you or your hardware.  Writing two stored procedures is a very simple, clean, and fast solution.
    Reed Copsey, Jr. - http://reedcopsey.com
    Friday, September 18, 2009 3:53 PM
  • I agree with you, Reed. But unfortunately, we want to keep these procs database-independent. I would like to explore more options before trying to convince the customer. Please help.
    Saturday, September 19, 2009 5:44 AM
  • Hi Sriram,

    Have you thought about batching the data? Since the amount of data is huge, batching might be a good option. I am not sure about your requirements, but I am assuming it can't be processed online, given the volume.

    I have done a similar thing where the record count was close to 1 million but the processing time was huge, as the calculation was based on complex business logic. I used stored procedures to process the data.

    I don't recommend using .NET code; I am sure it will be taxing on the server.

    Also, try to use multi-threading. For example, try to process 1 lakh (100,000) records per thread, and make sure the number of threads is kept under control; see the sketch below.
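
    As a rough sketch of that batching idea (the batch size, worker count, and ProcessBatch body are placeholders to tune and fill in; ProcessBatch would call the stored procedure, or run the row-by-row logic, for one ID range):

        // Rough sketch of batched, multi-threaded processing with a fixed worker pool.
        using System;
        using System.Collections.Generic;
        using System.Threading;

        class BatchProcessor
        {
            const int BatchSize = 100000; // roughly 1 lakh records per batch, as suggested above
            const int WorkerCount = 4;    // fixed pool keeps the thread count under control

            static readonly Queue<long> Batches = new Queue<long>();
            static readonly object Gate = new object();

            static void Main()
            {
                long minId = 1, maxId = 5000000; // assume these come from MIN(id)/MAX(id) queries
                for (long start = minId; start <= maxId; start += BatchSize)
                    Batches.Enqueue(start);

                var workers = new List<Thread>();
                for (int i = 0; i < WorkerCount; i++)
                {
                    var t = new Thread(Worker);
                    t.Start();
                    workers.Add(t);
                }
                foreach (Thread t in workers)
                    t.Join(); // wait for every batch to finish

                Console.WriteLine("All batches processed.");
            }

            static void Worker()
            {
                while (true)
                {
                    long start;
                    lock (Gate) // hand out one batch at a time
                    {
                        if (Batches.Count == 0) return;
                        start = Batches.Dequeue();
                    }
                    ProcessBatch(start, start + BatchSize - 1);
                }
            }

            static void ProcessBatch(long fromId, long toId)
            {
                // Placeholder: call the stored procedure (or run the row-by-row logic)
                // for transactions whose IDs fall in [fromId, toId].
                Console.WriteLine("Processing IDs {0} to {1}", fromId, toId);
            }
        }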

    Check out MapReduce for guidelines on processing huge amounts of data.

    Have a look at Hadoop as well, a framework for similar requirements.

    Mark this as the answer if you find it helpful.

    Regards
    Azhar
    • Proposed as answer by Azhar_Amir Sunday, September 20, 2009 12:27 PM
    Sunday, September 20, 2009 12:27 PM
  • Thanks, Reed. I was able to convince the customer to go with a stored procedure. Azhar, thanks to you as well.

    Wednesday, September 23, 2009 7:55 AM
  • One more option worth exploring would be to write .NET C# stored procedures (SQL CLR). That way you get the power of a compiled language while staying close to the database, so you keep most of the advantages of stored procedures; Oracle has something similar in Pro*C. That's the fastest way of doing the processing close to the database.
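
    For reference, a minimal sketch of such a SQL CLR procedure (the table names and the flat 2% rule are placeholders; it would be deployed with CREATE ASSEMBLY and CREATE PROCEDURE ... EXTERNAL NAME). Note this covers SQL Server only; on Oracle the closest analogues would be Pro*C or a Java stored procedure:

        // Sketch of a SQL CLR stored procedure; it runs inside SQL Server,
        // so the data never leaves the database process.
        using System.Data.SqlClient;
        using Microsoft.SqlServer.Server;

        public class Commissions
        {
            [SqlProcedure]
            public static void CalculateCommissions()
            {
                // "context connection=true" uses the caller's in-process connection.
                using (var conn = new SqlConnection("context connection=true"))
                {
                    conn.Open();
                    var cmd = new SqlCommand(
                        @"INSERT INTO Commissions (TransactionId, Commission)
                          SELECT TransactionId, Amount * 0.02  -- placeholder rule
                          FROM Transactions", conn);
                    cmd.ExecuteNonQuery();
                }
            }
        }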
    Wednesday, December 2, 2009 12:31 PM
  • There are probably several solutions, but it depends on why there is a requirement that it work in both Oracle and SQL Server in the first place.
    Another question is: where is the consumer of the data? If the consumer of the calculation is a SQL Server client, why not create a linked server to the Oracle server and use a single stored procedure that works over all of the data? It sounds like this is an ETL issue; all of the data from the Oracle server and the SQL Server should be pulled into one reporting database.
    Wednesday, December 2, 2009 9:46 PM