none
Combining DataTables in C# RRS feed

  • Question

  • Hi all,

    Ok I am having major headaches with this one, I have multiple DataTables which have been populated using MDX queries run with similar but slightly different fields.

    Here is a quick example of the columns in each of my Table (I have shortened the list for speed):

    MonthlyTable: Customer, Monthly Sales

    QuarterlyTable: Customer, Quarterly Sales

    YearlyTable: Customer, Yearly Sales

    So the Customer is obviously the distinct Column, however you may have a customer appearing on just 1, 2 or all 3 tables.

    I want my result to simply be:

    ResultTable:Customer, Monthly Sales, Quarterly Sales, Yearly Sales

    I know it would be quite straightforward for me to do a foreach on each row but sometimes the number of rows could be in excess of 200k and it just takes too long.

    So I have already tried the following:

    DataTable montlyTest = MonthlyQuery;
                montlyTest.Columns.Remove("Monthly Sales");
                DataTable quarterlyTest = QuarterlyQuery;
                quarterlyTest.Columns.Remove("Quarterly Sales");
                DataTable yearlyTest = YearlyQuery;
                yearlyTest.Columns.Remove("Yearly Sales");
    
                DataTable combinedTable = new DataTable();
                combinedTable.Merge(montlyTest,true,MissingSchemaAction.Add);
                combinedTable.Merge(quarterlyTest,true,MissingSchemaAction.Add);
                combinedTable.Merge(yearlyTest,true,MissingSchemaAction.Add);
                //combinedTable.Merge(MonthlyQuery, false, MissingSchemaAction.Add);
                //combinedTable.Merge(QuarterlyQuery, false, MissingSchemaAction.Add);
                //combinedTable.Merge(YearlyQuery, false, MissingSchemaAction.Add);
    
                DataTable dataTable = new DataTable();
    
                foreach (DataColumn item in MonthlyQuery.Columns)
                {
                    dataTable.Columns.Add(item.ColumnName, item.DataType);
                }            
                
                dataTable.Columns.Add("Quarterly Sales", typeof(double));
                dataTable.Columns.Add("Yearly Sales", typeof(double));
    
                var results = from combined in combinedTable.AsEnumerable()
                              join monthly in MonthlyQuery.AsEnumerable() on (string)combined["Customer"] equals (string)monthly["Customer"]
                              join quarterly in QuarterlyQuery.AsEnumerable() on (string)combined["Customer"] equals (string)quarterly["Customer"]
                              join yearly in YearlyQuery.AsEnumerable() on (string)combined["Customer"] equals (string)yearly["Customer"]                          
                              select new 
                              {
                                  Customer=combined.Field<string>("Ship-to"),
                                  MonthlySales=monthly.Field<double>("Monthly Sales"),
                                  QuarterlySales=quarterly.Field<double>("Quarterly Sales"),
                                  YearlySales=yearly.Field<double>("Yearly Sales")
                              };

    Basically I am creating a new table for each and removing the columns that I know will be different then merging these to basically create a table with just unique value customers.

    Then I was trying to do a LINQ query to do a join on them but I cannot get this into my new DataTable. 

    Firstly do you think I am going down the right path?

    If so how can I convert from LINQ select on anonymous args to a datatable?

    Many thanks,

    Joe

    Sunday, January 20, 2019 8:47 PM

Answers

  • I know it would be quite straightforward for me to do a foreach on each row but sometimes the number of rows could be in excess of 200k and it just takes too long.

    You will have to loop through the rows anyway. If you use a LINQ query, the LINQ query will internally loop through the rows, so it's not going to be any faster than if you write a loop yourself. Be sure to NOT make three nested loops, this will be horribly slow. Instead, and presuming that the three tables are sorted by customer (if not, add the sorting in the MDX query, this will be much faster than sorting at the client side), then do a MERGE JOIN, which requires three pointers (one on each table) and a single loop that advances each of the three pointers as long as it finds rows that are not on the other tables.

    As an alternative, it should be reasonably simple to merge the three MDX queries into a single query that returns Customer, Monthly Sales, Quarterly Sales, Yearly Sales all in a single table. Then you would not need to loop on the client side through the three tables with 200K rows each.

    • Marked as answer by JoeyStyles Sunday, January 20, 2019 9:54 PM
    Sunday, January 20, 2019 9:18 PM
    Moderator
  • Hi Alberto,

    Wanted to say thank you and I have also come up with another way which is to create a primary key on each table and then when a merge is performed it displays correctly:

                MonthlyQuery.PrimaryKey = new DataColumn[] { MonthlyQuery.Columns["Customer"] };
                QuarterlyQuery.PrimaryKey = new DataColumn[] { QuarterlyQuery.Columns["Customer"] };
                YearlyQuery.PrimaryKey = new DataColumn[] { YearlyQuery.Columns["Customer"] };
    
                DataTable dataTable = MonthlyQuery;
                dataTable.Merge(QuarterlyQuery);
                dataTable.Merge(YearlyQuery);

    It's also very quick on 200k+ rows, just thought I would post it in case anyone else finds it useful.

    Many thanks,

    Joe

    • Marked as answer by JoeyStyles Monday, January 21, 2019 1:48 AM
    Monday, January 21, 2019 1:48 AM

All replies

  • I know it would be quite straightforward for me to do a foreach on each row but sometimes the number of rows could be in excess of 200k and it just takes too long.

    You will have to loop through the rows anyway. If you use a LINQ query, the LINQ query will internally loop through the rows, so it's not going to be any faster than if you write a loop yourself. Be sure to NOT make three nested loops, this will be horribly slow. Instead, and presuming that the three tables are sorted by customer (if not, add the sorting in the MDX query, this will be much faster than sorting at the client side), then do a MERGE JOIN, which requires three pointers (one on each table) and a single loop that advances each of the three pointers as long as it finds rows that are not on the other tables.

    As an alternative, it should be reasonably simple to merge the three MDX queries into a single query that returns Customer, Monthly Sales, Quarterly Sales, Yearly Sales all in a single table. Then you would not need to loop on the client side through the three tables with 200K rows each.

    • Marked as answer by JoeyStyles Sunday, January 20, 2019 9:54 PM
    Sunday, January 20, 2019 9:18 PM
    Moderator
  • Hi Alberto,

    Thanks for the quick answer it's much appreciated. I think the easiest thing to do would be to combine the MDX queries to return what I need. I will post another question on it.

    Thanks for your help.

    Joe

    Sunday, January 20, 2019 10:15 PM
  • Hi Alberto,

    Wanted to say thank you and I have also come up with another way which is to create a primary key on each table and then when a merge is performed it displays correctly:

                MonthlyQuery.PrimaryKey = new DataColumn[] { MonthlyQuery.Columns["Customer"] };
                QuarterlyQuery.PrimaryKey = new DataColumn[] { QuarterlyQuery.Columns["Customer"] };
                YearlyQuery.PrimaryKey = new DataColumn[] { YearlyQuery.Columns["Customer"] };
    
                DataTable dataTable = MonthlyQuery;
                dataTable.Merge(QuarterlyQuery);
                dataTable.Merge(YearlyQuery);

    It's also very quick on 200k+ rows, just thought I would post it in case anyone else finds it useful.

    Many thanks,

    Joe

    • Marked as answer by JoeyStyles Monday, January 21, 2019 1:48 AM
    Monday, January 21, 2019 1:48 AM