Improve DataTable filling performance

  • Question

  • I have a list containing 21 YYYYMM values (years and months) which I loop through in parallel, executing a query and populating a DataTable that is then written to a CSV file, as you can see below.

    Just the "Load" method in DataTable is taking 30 minutes to be executed, thereof I was looking to some suggestions how could I improve it.

    lstFilterValues.AsParallel().ForAll(filterValue =>
    {
        using (SAConnection _conn = DB_Connection.Connect(_psfAcronym))
        {
            var queryWithFilter = mainQuery.Replace("@FILTER", filterValue.ToString());

            // Dispose the reader as well, not just the connection
            using (IDataReader _dataReader = new SACommand(queryWithFilter, _conn).ExecuteReader())
            {
                var _dataTable = new DataTable();
                _dataTable.Load(_dataReader);

                List<T> lstTableRows = new List<T>(TransformToObject.DataTableToList<T>(_dataTable));

                using (var sw = new StreamWriter(filePath))
                using (var csv = new CsvWriter(sw))
                {
                    csv.Configuration.Delimiter = UniversalVariables.csvDelimiter.ToString();
                    csv.Configuration.HasHeaderRecord = true;
                    csv.Configuration.UseNewObjectForNullReferenceMembers = true;

                    csv.WriteHeader<T>();
                    csv.NextRecord();
                    csv.WriteRecords(lstTableRows);

                    sw.Flush();
                }
            }

            SybaseIQ_Connection.Disconnect(_conn);
        }
    });
    The database table contains 21 million rows, so each iteration handles something near 1 million.



    Tuesday, January 28, 2020 10:59 AM

All replies

  • Hello,

    When dealing with large amounts of data, the first place to look for a solution is the database level rather than client-side code: for all major databases, the engine is always faster server side than client side.

    You would run SQL like the following using a connection object and a command object. For instance (taken from the Sybase docs):

    SELECT * FROM SalesOrders; 
    OUTPUT TO 'c:\\test\\sales.csv' 
        FORMAT TEXT
        QUOTE '"'
        WITH COLUMN NAMES;
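
    As a rough sketch of issuing that through the connection and command types from the question (SAConnection/SACommand and the DB_Connection helper). One loud assumption: OUTPUT TO is historically an Interactive SQL (dbisql) statement, so the ADO.NET provider may not accept it; treat this as a starting point rather than a confirmed API.

        // Hedged sketch: export server side instead of pulling rows to the client.
        // Assumes the provider accepts OUTPUT TO; it may be Interactive SQL only.
        using (SAConnection conn = DB_Connection.Connect(_psfAcronym))
        using (var cmd = new SACommand(
            @"SELECT * FROM SalesOrders;
              OUTPUT TO 'c:\\test\\sales.csv'
                  FORMAT TEXT
                  QUOTE '""'
                  WITH COLUMN NAMES;", conn))
        {
            cmd.ExecuteNonQuery();
        }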

    Only when the database features don't provide what you are looking for should you go to a client-side solution. When going client side:

    • Consider working in batches, e.g. split the total rows up evenly and process them (see the sketch after this list).
    • Consider what large organizations do: perform this type of operation (thinking client side) as a scheduled job (using, say, a server or the Windows scheduler) that runs at a specific time or times.
    • Look at using a Windows service with a third-party scheduler, e.g. Quartz Enterprise Scheduler .NET.
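
    One way to approximate the batching idea with the question's own loop is to cap how many of the 21 filters run at once; ProcessOne here is a hypothetical stand-in for the query-plus-CSV body of the loop:

        // Cap the degree of parallelism so only a few ~1M-row queries run at a time
        lstFilterValues.AsParallel()
            .WithDegreeOfParallelism(4)
            .ForAll(filterValue => ProcessOne(filterValue));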


    Please remember to mark the replies as answers if they help and unmark them if they provide no help; this will help others who are looking for solutions to the same or a similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile, but I will not answer coding questions on either.

    NuGet BaseConnectionLibrary for database connections.

    Stack Overflow profile for Karen Payne on Stack Exchange



    Tuesday, January 28, 2020 11:41 AM
    Moderator
  • Why didn't you use a DataReader to load the collection using a custom type and eliminate the DataTable, which is slow due to boxing? A sketch follows the links below.

    http://lauteikkehn.blogspot.com/2012/03/datatable-vs-list.html

    https://dzone.com/articles/reasons-move-datatables
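
    As a minimal sketch of that suggestion, assuming a hypothetical SalesRow type whose column names stand in for whatever the real query returns:

        using System.Collections.Generic;
        using System.Data;

        public class SalesRow
        {
            public int Id { get; set; }
            public string Region { get; set; }
            public decimal Amount { get; set; }
        }

        public static List<SalesRow> ReadRows(IDataReader reader)
        {
            var rows = new List<SalesRow>();

            // Resolve column ordinals once, not per row
            int idOrdinal = reader.GetOrdinal("Id");
            int regionOrdinal = reader.GetOrdinal("Region");
            int amountOrdinal = reader.GetOrdinal("Amount");

            while (reader.Read())
            {
                rows.Add(new SalesRow
                {
                    // Typed getters return values directly, with no boxing
                    Id = reader.GetInt32(idOrdinal),
                    Region = reader.IsDBNull(regionOrdinal) ? null : reader.GetString(regionOrdinal),
                    Amount = reader.GetDecimal(amountOrdinal)
                });
            }

            return rows;
        }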

    Tuesday, January 28, 2020 1:37 PM
  • You're trying to load 1 million rows into a DataTable? That is clearly beyond what the DataTable was designed for; the overhead alone is awful. Stream the data from the database to the file directly and skip everything in between.

    void SaveDataToCsv ( string filePath, ... )
    {
       using (var writer = OpenCsv(filePath))
       {
          //Have no idea what these do
          writer.WriteHeader<T>();
          writer.NextRecord();

          //Stream the data from the database
          using (var conn = DB_Connection.Connect(...))
          using (var cmd = new SACommand(...))
          using (var reader = cmd.ExecuteReader())
          {
             while (reader.Read())
             {
                writer.WriteRecord(reader);
             }
          }
       }
    }

    CsvWriter OpenCsv ( string filePath )
    {
       //Making the assumption the writer will clean up the stream, alternatively it probably has an overload accepting a file...
       var writer = new CsvWriter(new StreamWriter(filePath));

       //Init file stuff

       return writer;
    }
    Note that the assumption here is that each parallel block is writing to a separate file which does not appear to be the case in your code. Trying to have multiple threads write to the same file at the same time isn't going to work. If you need to write all the data to the same file then threading is useless because you cannot guarantee the ordering of the writes. So each thread needs to write to its own file.
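
    A one-line sketch of that per-thread naming, deriving the path from the loop's filter value (outputFolder here is an assumed variable):

        // Each parallel iteration gets its own file, keyed by its filter value
        var filePath = Path.Combine(outputFolder, $"export_{filterValue}.csv");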


    Michael Taylor http://www.michaeltaylorp3.net

    Tuesday, January 28, 2020 2:56 PM
    Moderator
  • Because I need to map the data into a class using reflection in order to keep just the properties I want. To explain better, I can't use LINQ or EF as it's a Sybase IQ database.

    Therefore, when I retrieve everything I need to map it to the class, and I'm doing that using the properties provided by the DataTable as the DataColumns.

    If I could do the same with lists it might be a better solution; I will try it anyway.

    Wednesday, January 29, 2020 4:54 PM
  • Regarding the parallelism, I'm using your tip, as my file path is composed from the filterValue (my fault for not including it in the code).

    Regarding writing it directly, I can't do that as I need to map to the properties I have in my class.

    Let's say I have 100 columns in the DB; I want to generate just the 10 columns which are represented in my classes, so I don't have to map each column individually.
    Wednesday, January 29, 2020 4:57 PM
  • Because I need to map the data into a class using reflection in order to keep just the properties I want. To explain better, I can't use LINQ or EF as it's a Sybase IQ database.

    I don't see how reflection comes into it: define a custom type with public properties, use an ADO.NET DataReader, populate the custom type by reading each column by name into the like-named property, load the instances into a List<T>, and return the list of objects.

    Therefore, when I retrieve everything I need to map it to the class, and I'm doing that using the properties provided by the DataTable as the DataColumns.

    Boxing and unboxing occur when using the DataTable, slowing the processing down:

    https://www.codingblocks.net/programming/boxing-and-unboxing-7-deadly-sins/
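
    To make the boxing point concrete (the Amount column here is hypothetical): value types read out of a DataTable come back as object, while a typed reader getter returns the value directly.

        // DataTable: the indexer returns object, so the decimal is boxed...
        object boxed = dataTable.Rows[0]["Amount"];
        decimal viaTable = (decimal)boxed;        // ...and unboxed here, per cell, per row

        // DataReader: the typed getter returns the decimal without boxing
        decimal viaReader = reader.GetDecimal(reader.GetOrdinal("Amount"));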

    Wednesday, January 29, 2020 5:40 PM