none
Regex CSV Splitting RRS feed

  • Question

  • Hi,

    I am trying to split lines from a CSV-Files that can include data within and without quotes:

    25;"text1";18.0;303.50;"text2a;text2b;text2c";2017-05-14 23:38:05;"woof";;10;

    As you see, strings/text is "quoted" and can include ; within the field, while other stuff is not.

    I need the following output, but failed so far, since I'm not good at regex...

    25

    text1

    18.0

    303.50

    text2a;text2b;text2c

    2017-05-14 23:38:05

    ....and so on. Notice there are empty fields.

    Thanks, Roger

    Saturday, September 30, 2017 11:35 AM

Answers

  • Hi Roger,

    Because there are ; inside a string you need a parser with a state machine.

    class Program { static void Main(string[] args) { // // To insert a quote in a literal C# string add an extra quote. // var s = @"25;""text1"";18.0;303.50;""text2a;text2b;text2c"";2017-05-14 23:38:05;""woof"";;10"; var result = StateParse(s); foreach (var str in result) Console.WriteLine(str); }

    private static string[] StateParse(string s) { var i = 0; var state=false; var sb = new StringBuilder(s); foreach (var c in s) { if (c == '"') state = !state; // The state machine. if (c == ';' && !state) sb[i] = ','; i++; } return sb.ToString().Split(','); } }

    Below you see the output of this console program.

    Best Regards,

    Rudy Meijer

    • Marked as answer by darkchanter99 Sunday, October 1, 2017 8:12 AM
    Saturday, September 30, 2017 9:10 PM

All replies

  • It's too early for me to go into detail, late night programming :)

    But you don't need to use regex for that, simplest is to use string.Split(delimiter)

    foreach(string line in lines)
    {
    string[] values = line.Split(';');
    // do something with you array of values from that line
    }
    hope that helps, I need coffee!

    Saturday, September 30, 2017 12:04 PM
  • Unfortunately, that doesn't deal with quotes. It would return the "sub-items" as values on it's own, rather than one value all together.
    Saturday, September 30, 2017 3:03 PM
  • Unfortunately, that doesn't deal with quotes. It would return the "sub-items" as values on it's own, rather than one value all together.

    Apologies, I didn't see the embedded ";". I knew I needed coffee.

    I haven't read the entire article but he suggests this regex

    ^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

    and goes into some detail about it. Hope it helps.

    Saturday, September 30, 2017 5:02 PM
  • ...and those are really semi-colons instead of commas, right?

    TEHIII

    Saturday, September 30, 2017 6:41 PM
  • I have only found one reliable way of reading CSV files where the delimiter may be embedded in the fields and that is through the use of OleDB (treating it like a database file).  In most cases, it stops the need to use a custom parser, counting pairs of quotes or Regex.

    Unfortunately, if the delimiter is "other than a comma", the process also must include a schema.ini file that describes the delimiter.  Also, if your file does not contain a header, you must also specify that.

    THE GOOD NEWS is that it is possible to write the schema.ini right before processing the file.

    Here is a program that creates the Schema.ini and process a file called DarkChanter99.csv from the c:\science\data\ directory

    using System;
    using System.Data.OleDb;
    using System.IO;
    
    namespace ConsoleApp1
    {
       class Program
       {
          static void Main(string[] args)
          {
             string strDbDir = "c:/science/data/";
             string strInFile = Path.Combine(strDbDir, "darkchanter99.csv");
             //
             FileInfo fiInFile = new FileInfo(strInFile);
             File.WriteAllText(Path.Combine(strDbDir, "schema.ini"), string.Format("[{0}]\r\nFormat=Delimited(;)\r\nColNameHeader=False\r\n", fiInFile.Name));
             string strConnect = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"{0}\";Extended Properties='text';", strDbDir);
             string strSQL = string.Format("select * from [{0}]", fiInFile.Name);
    
             using (OleDbConnection conn = new OleDbConnection(strConnect))
             {
                conn.Open();
                using (OleDbDataReader rdr = (new OleDbCommand(strSQL, conn)).ExecuteReader())
                {
                   while (rdr.Read())
                   {
                      for (int i = 0; i < rdr.FieldCount; i++)
                      {
                         Console.WriteLine(rdr[i].ToString().Trim());
                      }
                   }
    
                   rdr.Close();
                }
    
                conn.Close();
             }
          }
       }
    }
    


    TEHIII

    Saturday, September 30, 2017 8:26 PM
  • I could request the file to use commas instead of semicolons as the delimitor (it's produced by a java program) but the semicolons inside a "quote" are given from the database value itself...
    Saturday, September 30, 2017 8:40 PM
  • Hi Roger,

    Because there are ; inside a string you need a parser with a state machine.

    class Program { static void Main(string[] args) { // // To insert a quote in a literal C# string add an extra quote. // var s = @"25;""text1"";18.0;303.50;""text2a;text2b;text2c"";2017-05-14 23:38:05;""woof"";;10"; var result = StateParse(s); foreach (var str in result) Console.WriteLine(str); }

    private static string[] StateParse(string s) { var i = 0; var state=false; var sb = new StringBuilder(s); foreach (var c in s) { if (c == '"') state = !state; // The state machine. if (c == ';' && !state) sb[i] = ','; i++; } return sb.ToString().Split(','); } }

    Below you see the output of this console program.

    Best Regards,

    Rudy Meijer

    • Marked as answer by darkchanter99 Sunday, October 1, 2017 8:12 AM
    Saturday, September 30, 2017 9:10 PM
  • OK then please try my example code.
    Be sure to set the correct directory.

    :)


    TEHIII

    Sunday, October 1, 2017 3:07 AM
  • Thanks to all for the great help.

    Regarding the idea of dealing with the CSV like a DB: What I didn't mention is, this isn't a "normal" CSV. The first line contains the field names and the second line contains "high level" type information, such as "number, date, text(nnn) so the importer can produce a table with the most correct types, rather than varchar(50). Setting the field types manually isn't an option, since it's too much stuff to process. Therefore, I'm not sure if I could read the file as a DB-Table...

    The other idea using a "state parser" means, I would need to add some protection, in case the delimitor is used as part of the data, such as "he said "hello" to everyone";..... - in this case they place two double-qutoes "" within the string.

    Didn't have the time today to experiment with that...


    Sunday, October 1, 2017 8:31 PM