none
How can we update a DataTable column values as fast as calculated columns does? RRS feed

  • Question

  • Hi,

    I have a small test program that creates a DataTable with 1M rows.

    Then I add a calculated column (Expression column), it's super fast like a second.

    Then I just try to add a new column and set values to it, it takes more than 10 seconds.

    DataTable update is slow. How can I make it as fast as calculated column?

    10 seconds does seem acceptable but that is just a test program to show the difference.

    This can become lot worst depending on how many rows, columns, etc.

    And anyway if calculated column is able to do it that fast, hopefully we can get same performance with normal updates.

    It seems calculated columns is able to turn off some notifications  or other things. I'm not sure.

    Any idea? Advice? Maybe I'm missing something when updating the DataTable.

    Below is the code sample I'm using:

    using System;
    using System.Data;

    namespace Test1
    {
        class Program
        {
            static void Main(string[] args)
            {
                try
                {
                    System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
                    watch.Start();
                    Console.WriteLine("DataTable creation started ...");
                    DataTable data = new DataTable();
                    int nbrCols = 50;
                    for (int i = 0; i < nbrCols; i++)
                    {
                        data.Columns.Add(string.Concat("COL_", (i + 1)), typeof(double));
                    }

                    System.Random r = new System.Random();
                    data.BeginLoadData();
                    int nbrows = 1000000;
                    for (int i = 0; i < nbrows; i++)
                    {
                        DataRow newRow = data.NewRow();
                        for (int j = 0; j < nbrCols; j++)
                        {
                            newRow[j] = r.NextDouble();
                        }
                        data.Rows.Add(newRow);
                    }
                    data.EndLoadData();

                    watch.Stop();
                    Console.WriteLine("DataTable creation = " + watch.ElapsedMilliseconds / 1000.0);

                    int colIndexFlow1 = data.Columns.IndexOf("COL_1");
                    int colIndexFlow2 = data.Columns.IndexOf("COL_2");

                    /* Add a calculated columns */
                    watch.Restart();
                    data.Columns.Add("CALC_1", typeof(double), "COL_1 + COL_2");
                    data.AcceptChanges();
                    watch.Stop();
                    Console.WriteLine("Calculated Column creation = " + watch.ElapsedMilliseconds / 1000.0);

                    /* Add a new column */
                    watch.Restart();
                    data.BeginLoadData();
                    data.Columns.Add("NEW_1", typeof(double));
                    int colIdx = data.Columns.IndexOf("NEW_1");
                    foreach (DataRow row in data.Rows)
                    {
                        //double val = row.Field<double>("COL_1") + row.Field<double>("COL_2");
                        //row.SetField<double>(colIdx, val);
                        // Most of the time is spent to actually set the data
                        // Question is how can we make the update happening as fast as the update done when using a calculated column.
                        row.SetField<double>(colIdx, 2.0);
                    }
                    data.AcceptChanges();
                    data.EndLoadData();
                    watch.Stop();
                    Console.WriteLine("New Column creation = " + watch.ElapsedMilliseconds / 1000.0);

                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error=" + ex.Message);
                }
                finally
                {
                    Console.ReadLine();
                    Console.WriteLine("Press any key to exit");
                }
            }
        }
    }

    Thanks in advance

    • Moved by DotNet Wang Friday, June 17, 2016 1:40 AM data related
    Thursday, June 16, 2016 3:11 PM

All replies

  • Hi dtf017,

    Thank you for your post. Based on your description, your question is more related to data development. We will move your thread to ADO.NET DataSet forum for better support.

    Best Regards,
    Li Wang


    We are trying to better understand customer views on social support experience, so your participation in this interview project would be greatly appreciated if you have time. Thanks for helping make community forums a great place.
    Click HERE to participate the survey.

    Friday, June 17, 2016 1:42 AM
  • By setting formula to column, you're setting the formula to the column only (i.e." only one property of the class). The result of calculated column is only evaluated when you're trying to get the content of the field.

    Since so little job is done, it's superfast operation.

    If you want to increase performance of updating datarow in datatable, you can try Parallel.ForEach() in System.Threading.Tasks namespace. If the system has quad-core or above, it may gain some performance...

            static void Main(string[] args)
             {
                 try
                 {
                     System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
                     watch.Start();
                     Console.WriteLine("DataTable creation started ...");
                     DataTable data = new DataTable();
                     int nbrCols = 50;
                     for (int i = 0; i < nbrCols; i++)
                     {
                         data.Columns.Add(string.Concat("COL_", (i + 1)), typeof(double));
                     }
    
                    System.Random r = new System.Random();
                     data.BeginLoadData();
                     int nbrows = 1000000;
                     for (int i = 0; i < nbrows; i++)
                     {
                         DataRow newRow = data.NewRow();
                         for (int j = 0; j < nbrCols; j++)
                         {
                             newRow[j] = r.NextDouble();
                         }
                         data.Rows.Add(newRow);
                     }
                     data.EndLoadData();
    
                    watch.Stop();
                     Console.WriteLine("DataTable creation = " + watch.ElapsedMilliseconds / 1000.0);
    
                    int colIndexFlow1 = data.Columns.IndexOf("COL_1");
                     int colIndexFlow2 = data.Columns.IndexOf("COL_2");
    
                    /* Add a calculated columns */
                     watch.Restart();
                     data.Columns.Add("CALC_1", typeof(double), "COL_1 + COL_2");
                     data.AcceptChanges();
                     watch.Stop();
                     Console.WriteLine("Calculated Column creation = " + watch.ElapsedMilliseconds / 1000.0);
    
                    /* Add a new column */
                     watch.Restart();
                     data.BeginLoadData();
                     data.Columns.Add("NEW_1", typeof(double));
                     int colIdx = data.Columns.IndexOf("NEW_1");
                     //foreach (DataRow row in data.Rows)
                     Parallel.ForEach(data.AsEnumerable(), (row) =>
                     {
                         //double val = row.Field<double>("COL_1") + row.Field<double>("COL_2");
                         //row.SetField<double>(colIdx, val);
                         // Most of the time is spent to actually set the data
                         // Question is how can we make the update happening as fast as the update done when using a calculated column.
                         row.SetField<double>(colIdx, 2.0);
                     }
                     );
                     data.AcceptChanges();
                     data.EndLoadData();
                     watch.Stop();
                     Console.WriteLine("New Column creation = " + watch.ElapsedMilliseconds / 1000.0);
    
                }
                 catch (Exception ex)
                 {
                     Console.WriteLine("Error=" + ex.Message);
                 }
                 finally
                 {
                     Console.ReadLine();
                     Console.WriteLine("Press any key to exit");
                 }
             }


    Friday, June 17, 2016 1:45 AM
  • Hi cheong00, cole wu,

    I did not realize evaluation would only happen when the field is accessed. That makes sense.

    So I'm guessing we can not do much to increase update performance.

    To my knowledge, Parallel is not an option as DataTable is not thread safe when writing data.
    That is really unfortunate.

    Running the code above would throw exception.

    Any other tip/advice on how to optimize updating values for specific column in large DataTable?

    I appreciate any help and comments

    Thanks

    Friday, June 17, 2016 8:40 AM
  • Perheps you should just give it a default value when setting the column:

    (The following just replace the second part of the code)

                    /* Add a new column */
                    watch.Restart();
                    data.BeginLoadData();
    
                    data.Columns.Add(new DataColumn("NEW_1", typeof(double))
                        {
                            DefaultValue = 2.0,
                        });
    
                    //data.Columns.Add("NEW_1", typeof(double));
                    int colIdx = data.Columns.IndexOf("NEW_1");
                    //foreach (DataRow row in data.Rows)
                    //{
                    //    row.SetField<double>(colIdx, 2.0);
                    //}
    
                    Console.WriteLine(data.Rows[0][colIdx]);
    
                    data.AcceptChanges();
                    data.EndLoadData();
                    watch.Stop();
                    Console.WriteLine("New Column creation = " + watch.ElapsedMilliseconds / 1000.0);


    • Edited by cheong00 Friday, June 17, 2016 10:17 AM
    Friday, June 17, 2016 10:16 AM
  • Hi Cheong00,

    Thanks again for the feedback. I agree that is good way to set all values for a column when values are same across all rows.

    My use case is little different and I will have different values to set for each row (like shown in test program in the commented out parts ... I just simplified the update to only calculate time spend to update all rows).

    Thanks

    Monday, June 20, 2016 10:27 AM
  • I don't know how to accelerate it then.

    You could possibly get a better timing using POCO objects (Such as EntityFramework or LINQ to SQL) instead of DataTable to store data, but I'd regard that as heavy lifting that might not pay well. It all depends on what other things you're trying to do.

    Monday, June 20, 2016 12:49 PM