Parse CSV with almost identical columns

  • Question

  • Hi, 

    I have a scenario where I have multiple files which contain the same information, but their headers are named differently. Example: PickUpDate. Some files would say DatePickedup, some would say pickDate, some say BusinessDate, but all have the same information. How can I make one single parser to parse all of these different header names instead of creating multiple parsers?

    Friday, November 1, 2019 3:52 AM

Answers

  • I would forgo the `TypeFormats` type and just use `IEnumerable<Format>`. Unless you need additional things besides the formats later it isn't gaining you anything. Beyond that it looks fine.

    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:25 PM
    Wednesday, November 6, 2019 3:28 PM
    Moderator

All replies

  • Hello, 

    Just create a mapping class with an array of names.  Then see if the array contains the column name.
    Something like this:

        public class ColumnMap
        {
            public string DestinationColumn { get; set; }
            public string[] SourceColumn { get; set; }
    
            public ColumnMap(string destinationColumn, string[] sourceColumn)
            {
                DestinationColumn = destinationColumn;
                SourceColumn = sourceColumn;
            }
        }
    
        public void Import(object sender, EventArgs e)
        {
    
            ColumnMap columnMap = new ColumnMap("Pickup", new string[] { "PickUpDate", "pickDate", "BusinessDate" });
            DataTable dtOutput = new DataTable();
            dtOutput.Columns.Add("Pickup", typeof(DateTime));
    
            DataTable dtInput = new DataTable();
            // ... load dtInput from the CSV file first ...
            foreach (DataRow drIn in dtInput.Rows)
            {
                DataRow drOut = dtOutput.NewRow();
                foreach (DataColumn column in drIn.Table.Columns)
                {
                    if(columnMap.SourceColumn.Contains(column.ColumnName))
                    {
                        drOut[columnMap.DestinationColumn] = DateTime.Parse(drIn[column].ToString());
                        break;
                    }
                }
    
                // Process other columns etc.
                dtOutput.Rows.Add(drOut);
            }
            
            // BulkImport to SQL or whatever.
        }

    I hope that helps!

    Thank you,

    Nick

    Friday, November 1, 2019 4:47 AM
  • Hi Me.Saqib,

    Thank you for posting here.

    For your question, you want to get data from columns with different header names but the same data format from multiple csv files.

    You can try the following code to get it.

        static void Main(string[] args)
        {
            String[] headers = { "PickUpDate", "DatePickedup", "pickDate", "BusinessDate" };
    
            string[] files = Directory.GetFiles(@"d:\aaaa");
            List<String> myExtraction = new List<string>();
            foreach (var filePath in files)
            {
                var temp = File.ReadAllLines(filePath);
    
                //Find the index of whichever header variant this file uses
                var delimitedLine = temp[0].Split(',');
                int index = Array.FindIndex(delimitedLine, h => headers.Contains(h));
                if (index < 0)
                    continue;   //This file has no matching column
    
                //Skip the header row and collect the values
                foreach (string line in temp.Skip(1))
                {
                    var data = line.Split(',');
                    myExtraction.Add(data[index]);
                }
            }
    
            foreach (var item in myExtraction)
            {
                Console.WriteLine(item);
            }
    
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    


    Hope my solution could be helpful.

    Best Regards,

    Timon


    MSDN Community Support

    Friday, November 1, 2019 7:24 AM
  • This pattern is known as Adapter. In domain-driven design there is an anti-corruption layer where the Adapter should be implemented.
    Friday, November 1, 2019 8:26 AM
  • There are already NuGet packages that can parse arbitrary CSV files. What exactly are you trying to do that wouldn't work with one of them? The way I see it is that you are saying everything is the same except the column headers. Therefore column 1 is an int (for example), 2 a string, 3 a DateTime, etc. Beyond the header itself the actual reading of the rows is identical. You can use ordinal indexing to access the values in the row without regard for the column headers. Hence the headers don't matter, just ignore them.
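If the columns really did share a fixed order, the ordinal approach could be sketched like this (a toy example, not from the thread; it assumes comma-separated fields with no embedded commas or quotes):

```csharp
using System;
using System.Linq;

public static class OrdinalCsv
{
    // Toy example: pull one column by position, ignoring the header names entirely.
    // Assumes comma-separated fields with no embedded commas or quotes.
    public static string[] GetColumn(string[] lines, int ordinal)
    {
        return lines.Skip(1)                        // skip the header row
                    .Select(line => line.Split(','))
                    .Select(fields => fields[ordinal])
                    .ToArray();
    }
}
```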

    Michael Taylor http://www.michaeltaylorp3.net

    Friday, November 1, 2019 1:51 PM
    Moderator
  • "Just create a mapping class with an array of names. Then see if the array contains the column name." (quoting Nick's reply above)

    I would like to keep these columns in a config file, maybe, so that if other files come in later I don't have to change the code. Will that work in this scenario?
    Friday, November 1, 2019 4:19 PM
  • "You can use ordinal indexing to access the values in the row without regard for the column headers. Hence the headers don't matter, just ignore them." (quoting Michael Taylor's reply above)

    There is no fixed order to the columns. They can come in any order, with any header names.
    Friday, November 1, 2019 4:20 PM
  • "I have a scenario where I have multiple files which contain same information but their headers are named differently"

    So this isn't true then? What do you mean they can come in any order - that col 1 may be Id in file A but Name in file B? Do they all use the same header names just in different orders? If not then how do you know that integral column 3 in file A is the same column as 5 in file B?


    Michael Taylor http://www.michaeltaylorp3.net

    Friday, November 1, 2019 4:25 PM
    Moderator
  • Sorry about the confusion. The files can have data in any order with different header names as well.
    Friday, November 1, 2019 5:24 PM
  • So, again, how do you know that column 1 in file A matches column 2 in file B? Do these files come from different systems and you know that one system gives files in one way while another system gives files another way but they are always consistent?

    Michael Taylor http://www.michaeltaylorp3.net

    Friday, November 1, 2019 5:28 PM
    Moderator
  • Yes these files come from multiple systems. Yes they are consistent.
    Friday, November 1, 2019 7:17 PM
  • Then a simple mapping like Nicklaus mentioned would work, provided you have a way of knowing that file A uses mapping A, file B mapping B, etc. Where you store this is up to you. It could be in a config file, database or code.

    Regardless, your CSV reading code would simply need to use "friendly" column names to find data. The mapping structure would contain the "friendly name to column index" values that are actually needed. A simple extension method that took the mapping and friendly column name and returned the underlying value would be sufficient. To make this easier to work with, creating a wrapper reader (or extending your existing one) to take the mapping would allow your code to mostly remain unchanged. Just figure out which mapping to use for the given file, load up the mapping from wherever you are storing it and pass it to the reader during construction. Then your code can read the data without regard to the mapping anymore. Something like this:

    //Won't compile as-is - the elided pieces still need filling in...
    public class CsvReader
    {
       private readonly string _filename;

       //Using a dictionary here to keep it simple, could use a more elaborate type - make sure case sensitivity is turned off though...
       private readonly IDictionary<string, int> _mappings;

       public CsvReader ( string filename, IDictionary<string, int> columnMappings )
       {
          _filename = filename;
          _mappings = columnMappings;
       }
    
       public IEnumerable<MyRecord> GetRecords ()
       {
           //foreach line in file
              yield return LoadRecord();
       }
    
       private MyRecord LoadRecord ()
       {
           return new MyRecord() 
           {   
              …
              PickUpDate = Get<DateTime>("PickUpDate")
           };
       }
    
       private T Get<T> ( string columnName )
       {
          //Not doing any error handling here
          var index = _mappings[columnName];
          
          //Get value and convert using the index instead of name
          ...
       }
    }

    Usage is straightforward.

    //Unique mappings from config, db, code, etc
    var fileAMappings = ...;
    var fileBMappings = ...;
    
    var reader = new CsvReader("fileA.csv", fileAMappings);
    foreach (var record in reader.GetRecords())
    {
       ...
    };
    
    reader = new CsvReader("fileB.csv", fileBMappings);
    ...
    
    


    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:49 PM
    Friday, November 1, 2019 8:12 PM
    Moderator
  • _mappings[columnName] 

    How am I supposed to use this? Is _mappings an array?

    Also, when I am loading the mappings into fileAMappings or B, is the below what you are thinking too?

    fileA {
       site:0,
       pickupdate:1,
       amount:2,
       unit:3
    },
    fileB {
       pickupdate:0,
       value:1,
       units:2
       siteId:3,
    },
    


    Friday, November 1, 2019 10:10 PM
  • In my simple example it is simply a dictionary so you'd map it as such.

    var fileAMappings = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) 
    {
       { "site", 0 },
       { "pickupdate", 1 },
       { "amount", 2 },
       { "unit", 3 }
    };
    
    

    Of course this is if you wanted to do it in code. If you want to do it from a config file then you'd just need a custom section handler (or use the existing dictionary section handler).
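For the config-file route, one simple sketch on newer runtimes is to keep each file's mapping as a JSON object and deserialize it into the dictionary (the use of System.Text.Json and the JSON shape here are assumptions, not something from the thread):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class MappingLoader
{
    // Sketch: turn {"site":0,"pickupdate":1,...} into the case-insensitive
    // mapping dictionary used elsewhere in this thread.
    public static IDictionary<string, int> Load(string json)
    {
        var raw = JsonSerializer.Deserialize<Dictionary<string, int>>(json);
        return new Dictionary<string, int>(raw, StringComparer.OrdinalIgnoreCase);
    }
}
```

In practice the JSON string would come from something like File.ReadAllText on the config file, with one object per file type.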

    If you want something more elaborate then use Nicklaus's solution of a custom type to handle the mappings. Either approach works and is completely dependent upon your needs. Ultimately all you're doing is passing some type that can take a string and get an (integral) index value. 


    Michael Taylor http://www.michaeltaylorp3.net

    Saturday, November 2, 2019 4:34 AM
    Moderator
  • How would I access the record in this function to retrieve the value for the passed column name?
     private T Get<T> ( string columnName )
    Sunday, November 3, 2019 4:19 AM
  • That is completely dependent upon how you're reading your CSV file. As I mentioned, there are already CSV file readers available on NuGet. Additionally you could write a poor man's version that reads data into a DataTable. Your post was about how to use the same code to access CSV files that differ only in column positions. So the method I posted, given a "row" from your CSV file, would give you the index to find the value.

    As example suppose your CSV reader converts the file to a DataTable with a DataRow for each line. When you ask for the next record from the reader it would use the current DataRow to build up the data. That is where this method I provided comes in. Given the "current" row you would get the index of the column given the column name and fetch it from the current row, whatever that means for your reader.

    private T Get<T> ( string columnName )
    {
       //Get the ordinal
       var index = _mappings[columnName];
    
       //Assume "row" is the current row as exposed by your underlying CSV reader
       var value = row[index];
    
       //Convert to the desired type, for this sample assuming that the CSV reader you're using already has converted the CSV cell to the appropriate type
       return (T)value;
    }


    Michael Taylor http://www.michaeltaylorp3.net

    Sunday, November 3, 2019 8:21 PM
    Moderator
  • My question is still: where is row declared, and where is it set with the row from the file?
    Monday, November 4, 2019 1:42 AM
  • We cannot answer that because we have no idea how you're reading the CSV file. In your original post you said

    "How can I make one single parser to parse all of these different header names instead of creating multiple parsers?"

    So the assumption is you already have CSV reader code written. Post that code and we can try to extrapolate from there what you need to do. If you don't yet have that code then use NuGet to find a CSV parser that you want to use such as CsvHelper, Csv.Net or CsvTools. Then we can help you modify that code to do what you want.


    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 4, 2019 2:46 AM
    Moderator
  • I looked at CsvHelper but it requires the .NET properties to match the header names in the file. Regardless, the way I did it: I have a StreamReader, I'm reading a line at a time and then splitting the line on "," to get the individual values. I added that code inside GetRecords. This is what I have:

            public IEnumerable<MyRecord> GetMyRecords()
            {
                using (var reader = new StreamReader(_fileName))
                {
                    while (!reader.EndOfStream)
                    {
                        var line = reader.ReadLine();
                        var values = line.Split(',');
    
                        yield return LoadRecord();
                    } 
                }
            }
    Monday, November 4, 2019 4:32 AM
  • OK, but your code won't work with any CSV that supports strings with commas in them. Comments can also be an issue. That is why using a standard library is a better approach. Have you looked at CsvTools? Specifically, it allows you to load the CSV into a table structure. It also supports using the loaded CSV data to find a column given its index. So rather than you creating a mapping from column names to indices you could create a mapping from the column name in the CSV to what your code is expecting.

    Given that it would look something like this.

    //Not tested - no error handling here
    public class CsvReader
    {
       ...
    
       public IEnumerable<MyRecord> GetRecords ()
       {
          //Load the CSV
          var table = DataTable.New.ReadLazy(filename);
          foreach (var row in table.Rows)
          {
             var record = ReadRecord(row);
             if (record != null)
                yield return record;
          };
       }
    
       private MyRecord ReadRecord ( DataRow row )
       {
          //Check for a valid row, whatever that means by your rules...
          return new MyRecord() {
             Id = GetValue<int>(row, "Id"),
             Date = GetValue<DateTime>(row, "SomeDate"),
             Number = GetValue<double>(row, "SomeNumber")
          };
       }  
    
   private T GetValue<T> ( DataRow row, string columnName )
   {
      //Option 1: if you're using the original mapping structure of column name to index then use the Values property
      var index = _mappings[columnName];
      var stringValue = row.Values[index];

      //Option 2: alternatively, since the reader supports lookup by name already, use a mapping of the
      //file-specific column name to your normalized column name:
      //   var actualColumnName = _mappings[columnName];
      //   var stringValue = row[actualColumnName];

      //Either way you now have the string version so parse it - note that for performance reasons you should
      //probably consider doing the lookup and converter mapping once and then reusing it for each row
      var converter = TypeDescriptor.GetConverter(typeof(T));
      return (T)converter.ConvertFromString(stringValue);
   }
    }
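The performance note in the comment above (do the lookup and converter mapping once, then reuse it per row) might be sketched like this; the cache class is an assumption, not part of the suggested reader:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

public class ConverterCache
{
    // Sketch: resolve each TypeConverter once and reuse it for every row,
    // instead of calling TypeDescriptor.GetConverter for each cell.
    private readonly Dictionary<Type, TypeConverter> _converters = new Dictionary<Type, TypeConverter>();

    public T Convert<T>(string stringValue)
    {
        if (!_converters.TryGetValue(typeof(T), out var converter))
        {
            converter = TypeDescriptor.GetConverter(typeof(T));
            _converters[typeof(T)] = converter;
        }

        return (T)converter.ConvertFromString(stringValue);
    }
}
```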


    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 4, 2019 5:58 AM
    Moderator
  • In the ReadRecord, I cannot set fixed types and column names because those can be different in different files...

     private MyRecord ReadRecord ( DataRow row )
       {
          //Check for a valid row, whatever that means by your rules...
          return new MyRecord() {
             Id = GetValue<int>(row, "Id"),
             Date = GetValue<DateTime>(row, "SomeDate"),
             Number = GetValue<double>(row, "SomeNumber")
          };
       }  

    Monday, November 4, 2019 6:24 PM
  • Also CsvTools is not supported in .NET Core. I'm trying CsvHelper but it doesn't have the following way to access columns:

     var index = _mappings[columnName];
          var stringValue = row.Values[index];
    

    Monday, November 4, 2019 6:40 PM
  • "I cannot set fixed types and column names because those can be different in different files"

    I know, that is the whole purpose of your post. The GetValue method is using your normalized name that you will use for all files. The mapping structure that you set up (as discussed in earlier posts) will be per-file and specify that column A (in GetValue) maps to column X (in the CSV file). Take a look at the GetValue code I posted again. It contains a call to get the correct (per file) column name from the mapping dictionary that you passed to the reader. Your code (outside this mapping structure) uses the normalized names of the columns. This was the whole point of your thread.

    In the future when targeting .NET Core please make this clear up front as things can be different if you're targeting it.

    CsvHelper has a couple of different approaches you can take. Option 1 is to forego the mapping that has been discussed all along and instead create a per-file record type that contains properties as defined by the CSV file in question. This type will also contain a conversion method that takes the current instance and return your file-agnostic record type that you need (or you could use inheritance). Then modify your reader class to take the record type as a generic parameter. Inside the reader use CsvHelper.GetRecords in combination with the record type to read the data from the CSV. Then call the type's conversion method to get the file-agnostic record equivalent. This approach is more flexible than the mapping solution being presented before. However this also means that if you want to support alternative CSV formats later you'll need to create more record types.

    public class MyRecord
    {
       public int Id { get; set; }
       public string Name { get; set; }
    }
    
    public interface ICsvRecord
    {
       MyRecord ToRecord ();
    }
    
    public class FileARecord : ICsvRecord
    {
       //These names must match the order and column names of the CSV file
       public int Id { get; set; }
       public string AName { get; set; }
    
       //Handle the mapping here...
       public MyRecord ToRecord () {
          return new MyRecord() {
              Id = Id,
              Name = AName
          };
       }
    }
    
    public class FileBRecord : ICsvRecord
    {
       //These names must match the order and column names of the CSV file
       public int IdA { get; set; }
       public string BName { get; set; }
    
       //Handle the mapping here...
       public MyRecord ToRecord () {
          return new MyRecord() {
              Id = IdA,
              Name = BName
          };
       }
    }
    
    public class CsvReader<T> where T : ICsvRecord
    {
       private readonly string _filename;
    
       public CsvReader ( string filename )
       {
          _filename = filename;
       }
    
       public IEnumerable<MyRecord> GetRecords ()
       {
          using (var reader = new StreamReader(_filename))
          using (var csv = new CsvHelper.CsvReader(reader))
          {
             var records = csv.GetRecords<T>();
             foreach (var record in records)
                yield return record.ToRecord();
          }
       }
    }

    Alternatively you can use the original mapping code with CsvHelper. The CsvHelper documentation demonstrates that you can retrieve values by index or column name and it already has a GetField<T> method to do the type conversion. So you can use the mapping structure discussed already to do what you want.


    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 4, 2019 6:57 PM
    Moderator
  • Yes it does. Refer to previous comment about GetField<T>.

    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 4, 2019 6:57 PM
    Moderator
  • I guess I now see where I was disconnected in terms of mapping between my normalized columns and the file columns. To make sure, from the example above, the key has the name of my normalized column and the value has the corresponding index in the file? Correct?

    var fileAMappings = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) 
    {
       { "site", 0 },
       { "pickupdate", 1 },
       { "amount", 2 },
       { "unit", 3 }
    };


    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:49 PM
    Monday, November 4, 2019 8:23 PM
  • Correct

    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 4, 2019 8:26 PM
    Moderator
  • It is also possible that some files do not have all columns available.

    From the example below, some files may not have "SomeNumber" and some will. Do I control that using an if/else statement, or is there a better suggestion?

       private MyRecord ReadRecord ( DataRow row )
       {
          //Check for a valid row, whatever that means by your rules...
          return new MyRecord() {
             Id = GetValue<int>(row, "Id"),
             Date = GetValue<DateTime>(row, "SomeDate"),
             Number = GetValue<double>(row, "SomeNumber")
          };
       }  

    Tuesday, November 5, 2019 12:32 AM
  • This is what I did:

    private T Get<T>(CsvReader csvReader, string columnName)
    {
        if (_columnMappings.TryGetValue(columnName, out int index))
        {
            return csvReader.GetField<T>(index);
        }
        else
        {
            return default;
        }
     }


    • Edited by Me.Saqib Tuesday, November 5, 2019 2:53 AM
    Tuesday, November 5, 2019 1:46 AM
  • This is what I did:

            private T Get<T>(CsvReader csvReader, string columnName)
            {
                if (_columnMappings.TryGetValue(columnName, out int index))
                {
                    string stringValue = csvReader.GetField(index);
    
                    //Either way you now have the string version so parse it - note that for performance reasons you should probably consider doing all this lookup, converter mapping once and then reuse it for each row rather than doing this for each row
                    var converter = TypeDescriptor.GetConverter(typeof(T));
                    return (T)converter.ConvertFromString(stringValue);
                }
                else
                {
                    return default;
                }
            }

    Does that mean that you've solved your problem? Or are you just posting another thing that you've tried? Hard to tell.

    I don't know if this will help, or if it will give you some additional ideas, but I recently wrote a blog post about using Metadata for parsing data. https://geek-goddess-bonnie.blogspot.com/2019/09/parsing-data-with-metadata.html


    ~~Bonnie DeWitt [C# MVP]

    http://geek-goddess-bonnie.blogspot.com

    Tuesday, November 5, 2019 2:00 AM
    Moderator
  • "It is also possible that some files do not have all columns available."

    Depends on how you're going to implement it in the mapping. If your mapping is strings to ints then you need an integral value to represent "no such column", such as -1. Alternatively, leaving the entry out entirely works. Your TryGetValue code should be fine in this regard. The only recommendation I might make is to adjust Get<T> to accept a default value to use if no column value is available. Then you can more easily handle missing values rather than always using the type's default value.


    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:20 PM
    Tuesday, November 5, 2019 3:20 AM
    Moderator
  • You mean something like this?

    private T Get<T>(CsvReader csvReader, string columnName)
    {
        T value = default;
    
        if (_columnMappings.TryGetValue(columnName, out int index))
        {
           value = csvReader.GetField<T>(index);
        }
        return value;
    }

    Tuesday, November 5, 2019 3:41 AM
  • private T Get<T>(CsvReader csvReader, string columnName, T defaultValue = default)
    { 
        if (_columnMappings.TryGetValue(columnName, out int index))
           return csvReader.GetField<T>(index);
    
        return defaultValue;
    }


    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:20 PM
    Tuesday, November 5, 2019 3:46 AM
    Moderator
  • Also, for usage, I want to declare only one variable, "fileMappings", and reload it every time the process receives a file. The file name will have info about the "mapping", which I will use to look up the received mapping in appconfig.json and then load the JSON into the variable. Do you see any issue with that?

    //Unique mappings from config, db, code, etc
    var fileAMappings = ...;
    var fileBMappings = ...;
    
    var reader = new CsvReader("fileA.csv", fileAMappings);
    foreach (var record in reader.GetRecords())
    {
       ...
    };
    
    reader = new CsvReader("fileB.csv", fileBMappings);
    ...

    Tuesday, November 5, 2019 4:02 AM
  • Got you. Using this third defaultValue parameter, I can now supply replacement values for any columns that are missing from a file...
    Tuesday, November 5, 2019 2:24 PM
  • Correct.

    "Also for usage, I want to declare only one variable "fileMappings" and reload it every time the process receives the file."

    Doesn't matter where the mapping comes from provided you pass the correct instance to the reader each time it needs it.


    Michael Taylor http://www.michaeltaylorp3.net

    Tuesday, November 5, 2019 2:33 PM
    Moderator
  • Do you also think I need to do something with the lookups for performance reasons, as suggested earlier?
    Tuesday, November 5, 2019 2:47 PM
  • 80/20 rule. Optimize the 20% of your code that runs 80% of the time. After you have the current solution working for your production loads, profile it. If it is slow compared to everything else then look into the performance optimizations. If it runs fast enough then don't bother.

    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:20 PM
    Tuesday, November 5, 2019 2:59 PM
    Moderator
  • Wanted to show you the keywords.json file and the class structure to read the elements in for any suggestions you may have:

    This is keywords.json

    {
      "Formats": [
        {
          "FileType": "TypeA",
          "FileProperties": {
            "SiteID": 1,
            "DateTime": 3,
            "OrderNumber": 4,
            "Carrier": 13
          }
        },
        {
          "FileType": "TypeB",
          "FileProperties": {
            "SiteID": 1,
            "DateTime": 3,
            "Product": 7,
            "Carrier": 13
          }
        }
      ]
    }


    This is structure to hold json data:

        public class TypeFormats
        {
            public List<Format> Formats { get; set; }
        }
        public class Format
        {
            public string FileType { get; set; }
            public Dictionary<string, int> FileProperties { get; set; } = new Dictionary<string, int>();
        }

    This is to read:

    TypeFormats typeFormats;
    using (StreamReader sr = new StreamReader("keywords.json"))
    {
        string jsonString = sr.ReadToEnd();
        typeFormats = JsonConvert.DeserializeObject<TypeFormats>(jsonString);
    }
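    Once the formats are deserialized, one way to pick the right mapping for an incoming file is to index them by FileType. A minimal sketch, assuming the Format type from the post above and using hard-coded data in place of the JSON (the lookup itself is my assumption, not part of the original code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Format
{
    public string FileType { get; set; }
    public Dictionary<string, int> FileProperties { get; set; } = new Dictionary<string, int>();
}

class Program
{
    static void Main()
    {
        // Stand-in for JsonConvert.DeserializeObject<List<Format>>(jsonString)
        var formats = new List<Format>
        {
            new Format { FileType = "TypeA", FileProperties = new Dictionary<string, int> { ["SiteID"] = 1, ["Carrier"] = 13 } },
            new Format { FileType = "TypeB", FileProperties = new Dictionary<string, int> { ["SiteID"] = 1, ["Product"] = 7 } },
        };

        // Index once, then look up per file (case-insensitive on the type name)
        var byType = formats.ToDictionary(f => f.FileType, f => f.FileProperties,
                                          StringComparer.OrdinalIgnoreCase);

        var mapping = byType["typea"];          // type name derived from the file name
        Console.WriteLine(mapping["Carrier"]);  // column index for "Carrier" in TypeA files
    }
}
```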



    • Edited by Me.Saqib Wednesday, November 6, 2019 4:45 AM
    Wednesday, November 6, 2019 4:45 AM
  • I would forgo the `TypeFormats` type and just use `IEnumerable<Format>`. Unless you need additional things besides the formats later it isn't gaining you anything. Beyond that it looks fine.

    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Me.Saqib Wednesday, November 13, 2019 3:25 PM
    Wednesday, November 6, 2019 3:28 PM
    Moderator
  • Hi Michael,

    One question or challenge I'm facing: I have some CSV files where extra parsing is needed. For example, in one file the datetime is split across two fields, the date in one field and the time in the other. The time field is not properly formatted either: the value is 0 for no time, a 3-digit value such as 145 for 1:45 pm, and a 4-digit value such as 1645 for 16:45. My question is: at what stage should I format the time and combine both fields into a single datetime? I don't want to put custom logic inside the CsvReader, because I also have other CSV files that need some custom work before they can be passed to CsvReader. What would you suggest?

    Monday, November 18, 2019 7:14 PM
  • A couple of options come to mind, depending on the complexity of the changes.

    If some CSV files have separate columns for values that are normally combined (e.g. date/time), then add the separate columns to the core data structure you produce when reading the CSV files. Then adjust your (normally combined) date/time field: if it does not have an explicit value (normally set by the CSV reader), check whether the separate pieces are available and, if so, build the actual date/time from them. A CSV that has the combined value will use it directly, while for files that split the two you can combine them into a single value via a calculated property.

    If the logic is more complex then you might need to support derived CSV readers. For the files that require special processing, create a custom type that derives from your base CSV reader and implement any file-specific logic there. The advantage of this approach is that you can effectively change anything. The disadvantage is that if you run across a new file that requires custom reading, you'll have to add code to your app. However, this should be the exception and not the norm. To keep the changes to a minimum, you can store the "type" of reader to use in the same configuration you're already using to map fields to column indices.
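    The first option could be sketched as a calculated property on the record type. This is only a sketch: the 0/145/1645 time encoding comes from the question above, but the field names (PickupDate, PickupTimeRaw, etc.) are illustrative:

```csharp
using System;

class MyRecord
{
    // Set directly when the CSV has a single combined column
    public DateTime? PickupDateTime { get; set; }

    // Set when the CSV splits date and time into two columns
    public DateTime? PickupDate { get; set; }
    public int? PickupTimeRaw { get; set; } // 0 = no time, 145 = 1:45, 1645 = 16:45

    // Calculated: prefer the combined value, otherwise build it from the parts
    public DateTime? EffectivePickup
    {
        get
        {
            if (PickupDateTime.HasValue)
                return PickupDateTime;
            if (!PickupDate.HasValue)
                return null;

            int raw = PickupTimeRaw ?? 0;
            int hours = raw / 100;    // 1645 -> 16, 145 -> 1, 0 -> 0
            int minutes = raw % 100;  // 1645 -> 45, 145 -> 45, 0 -> 0
            return PickupDate.Value.Date.AddHours(hours).AddMinutes(minutes);
        }
    }
}

class Demo
{
    static void Main()
    {
        var r = new MyRecord { PickupDate = new DateTime(2019, 11, 18), PickupTimeRaw = 1645 };
        Console.WriteLine(r.EffectivePickup?.ToString("yyyy-MM-dd HH:mm")); // 2019-11-18 16:45
    }
}
```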


    Michael Taylor http://www.michaeltaylorp3.net

    Monday, November 18, 2019 7:38 PM
    Moderator
  • I like the first option, but it looks like I will have to pick option 2 because I have some CSVs that are split across multiple files, among other complexities.

    So the way it would work, I would inherit from my current class, which I have named CSVReaderBase, like myCustomCSVWthSplittedDateTime : CSVReaderBase

    Then how would I call the new derived class from my application based on the type of reader? Would I use if/else?

    This is my current call to only CSVReaderBase class

    var reader = new CSVReaderBase(mStream, message.fileProperties, message.isHeader);

    Monday, November 18, 2019 9:40 PM
  • I would recommend that you expand your existing CSV mapping code to allow you to specify an optional reader type as well. If you simplify the logic and assume the custom readers live in the same assembly as your main code, then a full type name is sufficient; otherwise you'll need an assembly-qualified type name.

    At any rate, when you've identified the file to be processed, look at the required reader type. If a non-default one is specified, use Activator.CreateInstance to create an instance of the reader (deriving from your base CSVReaderBase class) and use it instead. Unfortunately, working with constructors is more difficult with reflection, so I'd recommend that the parameters you normally pass to the constructor of your base type be moved to an "Initialize" method instead. Then you can create an instance of the reader type, call Initialize, and pass it the data you were previously passing to the constructor. From that point on the code is the same as before.
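    That reflection step might look roughly like this; CSVReaderBase and the Initialize idea come from the discussion, while the factory method, the derived reader name, and its parameters are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;

class CSVReaderBase
{
    // Constructor parameters moved into Initialize so reflection can use the default ctor
    public virtual void Initialize(string fileName, IDictionary<string, int> mapping)
    {
        Console.WriteLine($"{GetType().Name} reading {fileName}");
    }
}

// File-specific reader with custom logic (e.g. split date/time columns)
class SplitDateTimeCsvReader : CSVReaderBase { }

class Factory
{
    static CSVReaderBase Create(string readerTypeName)
    {
        // Default reader when the config specifies no custom type
        if (string.IsNullOrEmpty(readerTypeName))
            return new CSVReaderBase();

        // A full type name is enough when the reader lives in this assembly
        var type = Type.GetType(readerTypeName, throwOnError: true);
        return (CSVReaderBase)Activator.CreateInstance(type);
    }

    static void Main()
    {
        // Reader type name would come from the same config that holds the column mapping
        var reader = Create("SplitDateTimeCsvReader");
        reader.Initialize("fileA.csv", new Dictionary<string, int> { ["SiteID"] = 1 });
    }
}
```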


    Michael Taylor http://www.michaeltaylorp3.net

    Tuesday, November 19, 2019 5:29 AM
    Moderator