Large dataset problem: please help

  • Question

  • Hi all,

    I am working on a Windows-based application that uses SQL Server 2000 as its database. A few tables in the application (referred to here as parent tables) are uploaded by a separate application.

    My application fetches data from the parent tables and puts it into separate tables of the application (child tables).
    I am using a DataSet to fetch data from the parent tables and insert/update data in the child tables.
    The problem is that the record count is very high (3-4 million), so the data is very large, it takes an extremely long time to complete the process, and the application server's CPU utilization shoots up to the maximum.

    What would be the best way to achieve this?
    1. Should I use a DataRepeater instead of a DataSet, or
    2. Should I do the processing in chunks? If so, how can I process the data in chunks?

    Or is there any other way I can process the data?

    Thanks

    Wednesday, July 11, 2007 7:51 PM

Answers

  • A DataSet is the wrong choice for this type of application.  You should be using a DataReader.

     

    From MSDN: Best Practices for Using ADO.NET

     

    To determine whether to use the DataSet or the DataReader when you design your application, consider the level of functionality that is needed in the application.

    Use the DataSet in order to do the following with your application:

    • Navigate between multiple discrete tables of results.
    • Manipulate data from multiple sources (for example, a mixture of data from more than one database, from an XML file, and from a spreadsheet).
    • Exchange data between tiers or using an XML Web service. Unlike the DataReader, the DataSet can be passed to a remote client.
    • Reuse the same set of rows to achieve a performance gain by caching them (such as for sorting, searching, or filtering the data).
    • Perform a large amount of processing per row. Extended processing on each row returned using a DataReader ties up the connection serving the DataReader longer than necessary, impacting performance.
    • Manipulate data using XML operations such as Extensible Stylesheet Language Transformations (XSLT transformations) or XPath queries.

    Use the DataReader in your application if you:

    • Do not need to cache the data.
    • Are processing a set of results too large to fit into memory.
    • Need to quickly access data once, in a forward-only and read-only manner.
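    As a rough sketch of what the DataReader route could look like for this copy job (and of the chunked processing asked about above), something along these lines might work; the table, column, and connection-string values are placeholders, not the poster's actual schema:

        ' Stream rows from the parent table with a DataReader and write them
        ' to the child table in batches. Two connections are used because a
        ' connection serving a DataReader stays busy until the reader closes
        ' (SQL Server 2000 has no MARS).
        Imports System.Data
        Imports System.Data.SqlClient

        Module ChunkedCopy
            Sub Main()
                Dim connStr As String = "Server=.;Database=MyDb;Integrated Security=SSPI;"
                Const BatchSize As Integer = 1000

                Using readConn As New SqlConnection(connStr), _
                      writeConn As New SqlConnection(connStr)
                    readConn.Open()
                    writeConn.Open()

                    Dim readCmd As New SqlCommand("SELECT Id, Payload FROM ParentTable", readConn)
                    Dim insertCmd As New SqlCommand( _
                        "INSERT INTO ChildTable (Id, Payload) VALUES (@Id, @Payload)", writeConn)
                    insertCmd.Parameters.Add("@Id", SqlDbType.Int)
                    insertCmd.Parameters.Add("@Payload", SqlDbType.VarChar, 255)

                    Using reader As SqlDataReader = readCmd.ExecuteReader()
                        Dim count As Integer = 0
                        Dim tx As SqlTransaction = writeConn.BeginTransaction()
                        insertCmd.Transaction = tx
                        While reader.Read()
                            insertCmd.Parameters("@Id").Value = reader.GetInt32(0)
                            insertCmd.Parameters("@Payload").Value = reader.GetString(1)
                            insertCmd.ExecuteNonQuery()
                            count += 1
                            ' Commit every BatchSize rows so locks and the
                            ' transaction log stay bounded; this is the chunking.
                            If count Mod BatchSize = 0 Then
                                tx.Commit()
                                tx = writeConn.BeginTransaction()
                                insertCmd.Transaction = tx
                            End If
                        End While
                        tx.Commit()
                    End Using
                End Using
            End Sub
        End Module

    Because only one row is held in memory at a time, memory use stays flat whether there are four thousand rows or four million.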
    Thursday, July 12, 2007 2:00 PM

All replies

  • Hi;

    Why are you using a DataSet for this purpose? From what I understand, you have an ASP.NET web application that does this processing?

     

    I believe that all you are doing is data transformation and that no human input is required. My guess is based on the size of the data you are talking about.

    If my guess is right, you should look at another approach. In SQL Server 2000 you can use DTS (Data Transformation Services) to do this kind of transformation.

     

    Let me know if my guess is correct and if I can help further with this.

    Wednesday, July 11, 2007 8:30 PM
  • Thanks for the reply... However, as mentioned, it is not an ASP.NET web application. It is a kind of Windows service developed in VB.NET that fetches data from the parent tables and populates the child tables.

    No user input is required. The application is deployed as a Windows service, and on a timer it checks whether data is available in the parent tables; if so, it pulls the data from the parent tables and populates the child tables.

     

    Thanks

    P

    Thursday, July 12, 2007 1:03 AM
  • Hi;

    Well, I think my suggestion is still valid regardless of whether it is a .NET app or a Windows service. Look at DTS to do the transformation for you.

    Thursday, July 12, 2007 4:30 AM
  • I don't know how you're implementing this, or planning to, but the fact that there's a timer suggests that you're incrementally building up data, so it would seem to make sense to send across only the deltas: the data that has changed since the last time you looked?
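    For illustration, if the parent tables happened to carry a last-modified timestamp column (pure assumption on my part, since the schema hasn't been posted), the timer callback could remember a high-water mark and fetch only the newer rows:

        Imports System.Data
        Imports System.Data.SqlClient

        Module DeltaFetch
            ' Sketch only: ParentTable, Id, Payload, and ModifiedDate are
            ' hypothetical names. conn must already be open; lastRun is the
            ' high-water mark persisted at the end of the previous timer tick.
            Sub ProcessDeltas(ByVal conn As SqlConnection, ByVal lastRun As DateTime)
                Dim cmd As New SqlCommand( _
                    "SELECT Id, Payload FROM ParentTable " & _
                    "WHERE ModifiedDate > @LastRun ORDER BY ModifiedDate", conn)
                cmd.Parameters.Add("@LastRun", SqlDbType.DateTime).Value = lastRun
                Using reader As SqlDataReader = cmd.ExecuteReader()
                    While reader.Read()
                        ' Insert/update only the child rows affected by this delta.
                    End While
                End Using
            End Sub
        End Module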

     

    In terms of scalability, I think my approach would be to use a connection and a DataReader. However, again, I don't know what you're doing, or why you have chosen a DataSet and a DataRepeater as your shortlist of plausible solutions...

     

    Without requirements or a business context, it's quite difficult to answer the questions you have asked so far.

    Thursday, July 12, 2007 5:03 AM
  • Hi Martin;

    I think .NET technology is not suitable for such a large amount of data (3-4 million records). DTS in SQL Server 2000 is optimized in terms of processing and memory consumption. Let me know what you think.

    Thursday, July 12, 2007 5:09 AM
  • Yes, I would agree with you there. The more you can leave up to the server box to do, the better...

     

    It's kind of like writing granular services and then combining the results: it works, but not very fast!

    Thursday, July 12, 2007 6:52 AM
  • I agree with Ahmed; you should use Data Transformation Services (DTS), or SSIS if you can upgrade to SQL Server 2005.
    There's an article by Diane Lersen on TechNet that can give you an introduction to DTS.

    Arnon
    Friday, July 13, 2007 9:09 PM
  • Hi;

     

    Thanks, Arnon. Just to mention, DTS and SSIS give you the time-based schedules you might need. They also include transformation functionality that is suitable for extracting data, and DTS includes scripting capabilities that allow for any transformation you might think of.

     

    Be careful when you use any technique to transform a large amount of data: you should expect downtime for your service. Any transformation of a large amount of data consumes the memory and processing power of the server, so the more you optimize your transformations, the less memory and processing power will be consumed.

    Friday, July 13, 2007 9:21 PM
  • Congratulations - your post has been selected for an ARCast.TV Rapid Response!



    To listen to Architect MVP Scott Hanselman and Ron Jacobs address your question, click here.


    Hope this helps,

    Ron Jacobs

    Saturday, July 14, 2007 2:26 AM
  • Another solution would be to move all that processing into a stored procedure and use cursors to traverse the data. That way it won't matter whether it is a Windows or a web application. This solution is optimal if all the tables are on the same physical server.
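    As a minimal sketch, the service side then shrinks to a single procedure call; usp_CopyParentToChild below is just a placeholder name for whatever procedure holds the cursor logic:

        Imports System.Data
        Imports System.Data.SqlClient

        Module RunProc
            Sub Main()
                ' All the heavy lifting (cursor or set-based) happens inside
                ' the stored procedure on the server; the service only fires it.
                Using conn As New SqlConnection("Server=.;Database=MyDb;Integrated Security=SSPI;")
                    conn.Open()
                    Dim cmd As New SqlCommand("usp_CopyParentToChild", conn)
                    cmd.CommandType = CommandType.StoredProcedure
                    cmd.CommandTimeout = 0   ' no timeout; the job may run a long time
                    cmd.ExecuteNonQuery()
                End Using
            End Sub
        End Module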

     

    I hope this helps.

    Guys, please give me your feedback on this solution.

    Sunday, July 15, 2007 5:38 PM
  •  

    Hi palsum,

     

    If you want all 3-4 million records and you still want to process them in your managed code, then switch from a DataSet to an entity-based system. DataSets are good if you have a limited amount of data, roughly speaking up to 0.1 million rows; beyond that, create entities and process those instead. DataSets are heavier because they carry lots of objects inside them (DataTable, DataColumn, DataRow, relations, etc.), while in an entity-based system all you have is your entity and a generic list containing the instances.

     

    This solution will demand more coding compared to a DataSet, as the DataSet provides lots of benefits, such as running a Select query over its data.
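    A rough sketch of the entity approach (the class, table, and column names are made up, and generics require .NET 2.0):

        Imports System.Collections.Generic
        Imports System.Data.SqlClient

        ' A lightweight entity holding only the fields you need; far less
        ' per-row overhead than DataTable/DataRow/DataColumn.
        Public Class ParentRecord
            Public Id As Integer
            Public Payload As String
        End Class

        Module EntityLoad
            Function LoadParents(ByVal conn As SqlConnection) As List(Of ParentRecord)
                Dim result As New List(Of ParentRecord)
                Dim cmd As New SqlCommand("SELECT Id, Payload FROM ParentTable", conn)
                Using reader As SqlDataReader = cmd.ExecuteReader()
                    While reader.Read()
                        Dim rec As New ParentRecord
                        rec.Id = reader.GetInt32(0)
                        rec.Payload = reader.GetString(1)
                        result.Add(rec)
                    End While
                End Using
                Return result
            End Function
        End Module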

     

    Regards

    Monday, April 21, 2008 12:54 PM
  • IMHO, the best solution for the above-mentioned problem lies in the ETL space. You could use DTS (if SQL Server 2000 or lower) or SSIS (if SQL Server 2005 or 2008)...

    Also, there are specialized ETL products, such as Informatica, that do this job much better than the native database tooling in SQL Server or Oracle.

    Either way, writing managed .NET code and running it as a Windows service is a bad choice. Managed code is optimized for transactional-type applications, not ETL...

    Monday, April 21, 2008 2:06 PM
  • Hi,

    I have a similar issue with the processor and a DataSet. I have a desktop app that generates invoices. I split the
    print jobs based on parameters such as office, location, etc. However, the minimum number of invoices per print job is
    2,000. I am using Crystal Reports for the invoices and one DataSet with two tables: one for the invoice header and the other for
    the invoice details. Each print job takes about 0.20 seconds to get the data from the database and 5 minutes before printing starts.
    During those 5 minutes my processor is always at 100%, while memory consumption is normal (100 MB of 512 MB).

    If I limit the SELECT to 20 invoices, it's lightning fast. Any ideas?

    Thanks

     

    Monday, April 21, 2008 5:39 PM