LinqToCSV - Writing chunks of files

  • Question

  • I'm using CsvHelper to write out a LINQ query with millions of rows. I would like to split the output into files of, for instance, 1 million rows each. Can I do that, or should I use another writing method?

    Here is my code:

    var _path = UniversalVariables.outputCsvFiles + "entire_output.csv"; 
    
    var pvQuery = from car in Cars 
                  select car;
    
    
    if (!Directory.Exists(UniversalVariables.outputCsvFiles))
    {
        Directory.CreateDirectory(UniversalVariables.outputCsvFiles);
    }
    
    using (var sw = new StreamWriter(_path))
    using (var csv = new CsvWriter(sw))
    {
        csv.Configuration.Delimiter = UniversalVariables.csvDelimiter;
        csv.Configuration.HasHeaderRecord = true;
    
        csv.WriteHeader<Car>();
        csv.NextRecord();
        csv.WriteRecords(pvQuery);
    
        sw.Flush();
    }
    Wednesday, January 8, 2020 9:54 PM

Answers

  • Hi FcabralJ,

    When the amount of data is this large, we can only reduce the time consumed to a certain extent.

    Two hundred million rows is a big number; it will take a lot of time no matter what. I think using Parallel LINQ may be the most effective optimization.
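
    For illustration, a minimal PLINQ sketch (assuming the in-memory Car list from the test earlier in this thread; the output path is hypothetical, and AsOrdered() preserves the original row order):

    // needs: using System.IO; using System.Linq;
    // parallelize the per-row formatting across cores; the file write itself stays sequential
    var lines = Cars.AsParallel()
                    .AsOrdered()
                    .Select(car => string.Join(",", car.Name, car.Price, car.Origin));

    using (var sw = new StreamWriter(@"d:\test\csv\parallel_output.csv"))   // hypothetical path
    {
        foreach (var line in lines)
        {
            sw.WriteLine(line);
        }
    }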

    Best Regards,

    Timon




    • Marked as answer by FcabralJ Monday, January 13, 2020 9:38 PM
    Monday, January 13, 2020 6:39 AM

All replies

  • Perhaps split the file into smaller files (preprocess) first with file names that you can run through sequentially e.g.

    https://stackoverflow.com/questions/7289761/c-sharp-how-to-split-text-file-in-multi-files
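
    A compact streaming sketch of that idea (the file names are hypothetical; File.ReadLines enumerates lazily, so the whole file never has to fit in memory):

    // needs: using System.IO;
    const int linesPerFile = 1_000_000;
    var part = 0;
    StreamWriter writer = null;

    try
    {
        var i = 0;
        foreach (var line in File.ReadLines("entire_output.csv"))   // hypothetical input file
        {
            // start a new piece every linesPerFile lines
            if (i++ % linesPerFile == 0)
            {
                writer?.Dispose();
                writer = new StreamWriter($"split_part{part++}.csv");
            }
            writer.WriteLine(line);
        }
    }
    finally
    {
        writer?.Dispose();
    }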



    Thursday, January 9, 2020 1:16 AM
    Moderator
  • That's exactly what I need, splitting the files at the start, but my data comes from a LINQ query that gets the data from the database. So from that point I would like to chunk the output of the LINQ query into small CSV files.
    Thursday, January 9, 2020 5:32 AM
  • Hi,

    Thank you for posting here.

    A CSV file typically stores tabular data (numbers and text) in plain text.

    We can use StreamWriter directly to write CSV files, and based on my test, it is much faster than using CsvHelper.

    Here is the code.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using CsvHelper;

    namespace CsvSpeedTest   // namespace/class names filled in to make the sample complete
    {
        class Program
        {
            static void Main(string[] args)
            {
                int nums = 1000000;
                string path = "entire_output.csv";
                string path1 = "entire_output1.csv";

                // Stopwatch.StartNew() already starts the timer
                Stopwatch stopwatch = Stopwatch.StartNew();
                Test1(nums, path);
                stopwatch.Stop();
                Console.WriteLine(nums);
                Console.WriteLine("CSVHelper:" + stopwatch.ElapsedMilliseconds);

                Stopwatch stopwatch1 = Stopwatch.StartNew();
                Test11(nums, path1);
                stopwatch1.Stop();
                Console.WriteLine("Normal:" + stopwatch1.ElapsedMilliseconds);
                Console.WriteLine("Press any key to continue...");
                Console.ReadKey();
            }

            // write the rows with CsvHelper
            static void Test1(int nums, string path)
            {
                List<Car> Cars = new List<Car>();

                for (int i = 0; i < nums; i++)
                {
                    Cars.Add(new Car() { Name = "TestCar" + i, Price = 100000, Origin = "test" });
                }
                string dPath = @"d:\test\csv";
                var _path = Path.Combine(dPath, path);

                var pvQuery = from car in Cars
                              select car;
                if (!Directory.Exists(dPath))
                {
                    Directory.CreateDirectory(dPath);
                }
                using (var sw = new StreamWriter(_path))
                using (var csv = new CsvWriter(sw)) // CsvHelper pre-v12 constructor; newer versions also take a CultureInfo
                {
                    csv.Configuration.Delimiter = ",";
                    csv.Configuration.HasHeaderRecord = true;

                    csv.WriteHeader<Car>();
                    csv.NextRecord();
                    csv.WriteRecords(pvQuery);
                }
            }

            // write the same rows with a plain StreamWriter
            static void Test11(int nums, string path)
            {
                List<Car> Cars = new List<Car>();
                for (int i = 0; i < nums; i++)
                {
                    Cars.Add(new Car() { Name = "TestCar" + i, Price = 100000, Origin = "test" });
                }
                string dPath = @"d:\test\csv";
                var _path = Path.Combine(dPath, path);
                if (!Directory.Exists(dPath))
                {
                    Directory.CreateDirectory(dPath);
                }
                using (StreamWriter sw = new StreamWriter(new FileStream(_path, FileMode.Create)))
                {
                    for (int i = 0; i < Cars.Count; i++)
                    {
                        string[] s = new string[3];
                        s[0] = Cars[i].Name;
                        s[1] = Cars[i].Price.ToString();
                        s[2] = Cars[i].Origin;

                        sw.WriteLine(string.Join(",", s));
                    }
                }
            }
        }

        class Car
        {
            public string Name { get; set; }
            public double Price { get; set; }
            public string Origin { get; set; }
        }
    }


    Number of rows   With CsvHelper (ms)   Without CsvHelper (ms)
    10,000           89                    7
    100,000          418                   82
    1,000,000        4,399                 828

    Hope this could be helpful.

    Best Regards,

    Timon





    Thursday, January 9, 2020 6:03 AM
  • Maybe try this approach too:

     

    var q = pvQuery.AsEnumerable();

    do
    {
       var d = q.Take( 1_000_000 );
       q = q.Skip( 1_000_000 );

       using(. . .)
       {
          . . .
          csv.WriteRecords(d);
       }
    } while( q.Any( ) );
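    Note that the stacked Skip calls replay the query from the beginning on every pass, so each chunk costs more than the last. A single-pass sketch that streams the query only once (assuming the Car type and the CsvWriter setup from the question; the output file names are hypothetical):

    const int chunkSize = 1_000_000;
    var part = 0;

    using (var enumerator = pvQuery.GetEnumerator())
    {
        var more = enumerator.MoveNext();
        while (more)
        {
            // one numbered file per chunk (hypothetical naming scheme)
            using (var sw = new StreamWriter($"output_part{part++}.csv"))
            using (var csv = new CsvWriter(sw))
            {
                csv.WriteHeader<Car>();
                csv.NextRecord();

                var written = 0;
                while (more && written < chunkSize)
                {
                    csv.WriteRecord(enumerator.Current);
                    csv.NextRecord();
                    written++;
                    more = enumerator.MoveNext();
                }
            }
        }
    }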

    • Edited by Viorel_MVP Thursday, January 9, 2020 7:11 AM
    Thursday, January 9, 2020 7:11 AM
  • Hi @Timon Yang, thank you for the suggestion; good point about using the StreamWriter, it is in fact faster. But I'm still struggling with two things:

    1 - As you can see below, I'm mapping from LINQ to my object, but it has been taking more than 1 hour (still running) to transform all 200 million rows to a List, which makes it unworkable.

    var pvQuery = (from car in Cars
                   select new InternalCar()
                   {
                       Id = (int)car.Id,
                       Name = (string)car.Name,
                       Year = (string)car.Year
                   }).ToList();

    2 - As I want to output multiple CSV files, I would like to iterate over pvQuery and take a million rows on each pass. As it's a big amount of data, should I use Take and Skip for that? How can I do it?

    @Viorel_ I've tried your suggestion, which is near what I want, but when I do var q = pvQuery.AsEnumerable(); it also keeps reading for a long time.

    Could I just iterate through the LINQ query in chunks of 1M rows while writing each CSV file at the same time?


    • Edited by FcabralJ Thursday, January 9, 2020 2:05 PM
    Thursday, January 9, 2020 11:58 AM
  • Note, this code's intent is to chunk a large file, not per se to speed up the entire process.

    Options

    • Split file then later process each file
    • Split file and process

    In the following code a text file with 1,000 rows is read in, creating 10 text files. The read method accepts the chunk size, which in this case is 100. For you, with millions of lines, the chunk size needs to change to match how many rows you want in each file. Each file in this case is stored in a folder below the executable.

    Since the work is done in a class, a delegate is provided so that each time a new file is created you have an opportunity to a) simply know the new file was created, or b) process the file, as I've dummied in.

    Delegate arguments

    using System;
    
    namespace ReadLargeTextFile
    {
        public class ProgressArgs : EventArgs
        {
            protected int Indexer;
            protected string Text;
            public ProgressArgs(int index, string information)
            {
                Indexer = index;
                Text = information;
            }
            public int LineIndex => Indexer;
            /// <summary>
            /// Text to place to listener
            /// </summary>
            public string FileName => Text;
        }
    }

    Delegate

    namespace ReadLargeTextFile
    {
        public static class DelegatesContainer
        {
            public delegate void ProgressHandler(
                object sender, ProgressArgs args);
        }
    }

    File operations class

    using System;
    using System.IO;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace ReadLargeTextFile
    {
        public class FileOperations
        {
            public event DelegatesContainer.ProgressHandler OnProcessing;
    
            private readonly string _fileName = Path.Combine(
                AppDomain.CurrentDomain.BaseDirectory,"TextFile1.txt");
    
            /// <summary>
            /// Read large file asynchronous
            /// </summary>
            /// <returns></returns>
            public async Task Read(int interval)
            {
                int rowIndex = 1;
                int fileIndex = 1;
                string currentLine;
    
                var sb = new StringBuilder();
                
    
                using (var reader = File.OpenText(_fileName))
                {
    
                    while ((currentLine = await reader.ReadLineAsync()) != null)
                    {
    
                        sb.AppendLine(currentLine);
    
                    // at every `interval` lines, write the buffer out as a new chunk file
                        if (rowIndex % interval == 0)
                        {
    
                            var newFileName = Path.Combine(
                                AppDomain.CurrentDomain.BaseDirectory, 
                                "SplitFiles",
                                $"File{fileIndex}.txt");
    
                            File.WriteAllText(newFileName, sb.ToString());
    
                            fileIndex += 1;
                            sb.Clear();
    
                            OnProcessing?.Invoke(this, new ProgressArgs(rowIndex, newFileName));
                        }
    
                    rowIndex += 1;
                }

                // write out any remaining buffered lines that didn't fill a complete chunk
                if (sb.Length > 0)
                {
                    var newFileName = Path.Combine(
                        AppDomain.CurrentDomain.BaseDirectory,
                        "SplitFiles",
                        $"File{fileIndex}.txt");

                    File.WriteAllText(newFileName, sb.ToString());

                    OnProcessing?.Invoke(this, new ProgressArgs(rowIndex, newFileName));
                }
            }
            }
    
            public void ProcessFile(string fileName)
            {
                var lines = File.ReadAllLines(fileName);
                Console.WriteLine(); // put breakpoint here to see we have lines :-)
            }
        }
    }

    Form code, one button, one ListBox

    using System;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.IO;
    using System.Linq;
    using System.Windows.Forms;
    
    namespace ReadLargeTextFile
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
    
                Shown += Form1_Shown;
            // set up the event which is fired off after a new file is created
                _fileOperations.OnProcessing += _fileOperations_OnProcessing;
            }
    
            private void Form1_Shown(object sender, EventArgs e)
            {
                if (!Directory.Exists(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "SplitFiles")))
                {
                    button1.Enabled = false;
                    MessageBox.Show("Missing folder");
                }
            }
    
            private void _fileOperations_OnProcessing(object sender, ProgressArgs args)
            {
                // show file name
                listBox1.Items.Add($"{Path.GetFileName(args.FileName)} ({args.LineIndex})");
                // this is where you could process the newly created file
                _fileOperations.ProcessFile(args.FileName);
            }
    
            private readonly FileOperations _fileOperations = new FileOperations();
            private async void button1_Click(object sender, EventArgs e)
            {
                listBox1.Items.Clear();
    
                try
                {
                    await _fileOperations.Read(100);
                }
                catch (Exception ex)
                {
                    // suggest something better, but this will do for now
                    MessageBox.Show(ex.Message);
                }
            }
        }
    }
    

    After running, the ListBox shows each created file name with its line index (screenshot not preserved).



    Thursday, January 9, 2020 3:14 PM
    Moderator
  • @Karen Payne, my input is not a file. That's what I explained before: my input is a database that I query with LINQ. Therefore I can't split a file up front; either I loop through the query using Take or Skip, or I generate the whole file and split it afterwards (which I think is not optimal).

    Thursday, January 9, 2020 5:29 PM
  • @Karen Payne, my input is not a file. That's what I explained before: my input is a database that I query with LINQ. Therefore I can't split a file up front; either I loop through the query using Take or Skip, or I generate the whole file and split it afterwards (which I think is not optimal).

    Well, that is pretty much a similar task. Here is enough to get you started: reading a database table in chunks into a list sized to the chunk size.

    • The BaseConnectionLibrary NuGet package is needed (it provides the SqlServerConnection base class used below)
    • You need to handle the empty catch in button1

    namespace ReadLargeTable
    {
        public class OrderDetails 
        {
            public int OrderId { get; set; }
    
            public int ProductId { get; set; }
    
            public decimal UnitPrice { get; set; }
    
            public short Quantity { get; set; }
    
            public double Discount { get; set; }
            public string ItemArray => $"{OrderId},{ProductId}";
    
        }
    }

    Form

    using System;
    using System.Threading;
    using System.Windows.Forms;
    
    namespace ReadLargeTable
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }
            private CancellationTokenSource _cancellationTokenSource = 
                new CancellationTokenSource();
    
        private async void button1_Click(object sender, EventArgs e)
        {
            if (_cancellationTokenSource.IsCancellationRequested)
            {
                _cancellationTokenSource.Dispose();
                _cancellationTokenSource = new CancellationTokenSource();
            }

            var ops = new DataOperations();
            try
            {
                // await so that failures surface in the catch below
                await ops.ReadData(40, _cancellationTokenSource.Token);
            }
            catch (Exception ex)
            {
                // TODO: handle the exception (see the note above about the empty catch)
            }
        }
    
            private void button2_Click(object sender, EventArgs e)
            {
                _cancellationTokenSource.Cancel();
            }
        }
    }

    Read ops - I only populate two fields as my time is limited

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Threading;
    using System.Threading.Tasks;
    using BaseConnectionLibrary.ConnectionClasses;
    
    namespace ReadLargeTable
    {
        public class DataOperations : SqlServerConnection
        {
            public DataOperations()
            {
                DatabaseServer = @".\SQLEXPRESS";
                DefaultCatalog = "NorthWindAzureForInserts";
            }
    
            public async Task<bool> ReadData(int chunkSize, CancellationToken ct)
            {
                var orderList = new List<OrderDetails>();
    
                using (var cn = new SqlConnection() { ConnectionString = ConnectionString })
                {
                    await cn.OpenAsync(ct);

                    using (var cmd = new SqlCommand() { Connection = cn })
                    {
                        cmd.CommandText = "SELECT COUNT(*) FROM dbo.[Order Details];";

                        int recordCount = (int) cmd.ExecuteScalar();

                        // page through the table one chunk at a time; OFFSET starts at 0 so the first chunk is included
                        for (var offset = 0; offset < recordCount; offset += chunkSize)
                        {
                            cmd.CommandText =
                                "SELECT OrderID, ProductID, UnitPrice, Quantity, Discount FROM dbo.[Order Details] " +
                                "ORDER BY ProductID " +
                                $"OFFSET {offset} ROWS FETCH NEXT {chunkSize} ROWS ONLY;";
    
                            try
                            {
    
                                var reader = await cmd.ExecuteReaderAsync(ct);
    
                                if (!reader.HasRows)
                                {
                                    break;
                                }
    
                                while (await reader.ReadAsync(ct))
                                {
                                    orderList.Add(new OrderDetails()
                                    {
                                        OrderId = await reader.GetFieldValueAsync<int>(0),
                                        ProductId = await reader.GetFieldValueAsync<int>(1)
                                    });
                                }
    
                                reader.Close();
    
                                if (ct.IsCancellationRequested)
                                {
                                    ct.ThrowIfCancellationRequested();
                                }
    
                                await Process(orderList);
                                orderList.Clear();
    
                            }
                            catch (OperationCanceledException)
                            {
                                // operation has been cancelled by user
                                throw;
                            }
                            catch (Exception ex)
                            {
                                // handle it
                            }
    
                        }
                    }
                }
    
                return true;
            }
    
            private async Task<bool> Process(List<OrderDetails> orderDetails)
            {
                // process chunk list
                return true;
            }
        }
    }



    Thursday, January 9, 2020 6:38 PM
    Moderator
  • Splitting the file into smaller chunks would be the most efficient approach, and to possibly improve performance you can look into using Parallel LINQ (PLINQ). It can significantly increase the speed of LINQ to Objects queries by using all available cores on the host machine.

    How to: Create and Execute a Simple PLINQ Query

    var pvQuery = from car in Cars.AsParallel()
                  select car;
    I would also use a StringBuilder to buffer the output in memory rather than constantly writing to disk; that could be very beneficial in this situation.
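
    A minimal sketch of that buffering idea (the flushEvery threshold and output path are hypothetical; the row formatting follows the Car type used earlier in this thread):

    // needs: using System.IO; using System.Text;
    var sb = new StringBuilder();
    const int flushEvery = 100_000;   // hypothetical buffer threshold
    var count = 0;

    using (var sw = new StreamWriter(@"d:\test\csv\buffered_output.csv"))   // hypothetical path
    {
        foreach (var car in pvQuery)
        {
            sb.AppendLine(string.Join(",", car.Name, car.Price, car.Origin));

            if (++count % flushEvery == 0)
            {
                sw.Write(sb.ToString());   // one large write instead of many small ones
                sb.Clear();
            }
        }

        sw.Write(sb.ToString());   // write out whatever remains in the buffer
    }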


    william xifaras


    Thursday, January 9, 2020 8:23 PM
  • That's a good partial solution, as now we can parallelize the query, but I couldn't loop through it, taking chunks of the list and then writing them to a file.

    I've tried to do a pvQuery.Count() in order to use Take and Skip, but the Count is taking very long.

    Do you know how I could get the results from this parallel query and write them to a file?

    I can't load everything into memory, as it would exceed the machine's capacity.

    Friday, January 10, 2020 9:00 AM