Dirty data in a file

  • Question

  • I have a C# app that calls a government web site's API with an API key. Some records in the file that the application downloads have an LF in the middle and at the end (I believe the file may be in UNIX format, but I am not totally sure). My app compares the file with our database to ensure that none of our customers are entities on the list, because selling our products to them (we have overseas accounts) is a no-no. If it finds a match between the list and our database, my app also sends out an email with information about the entity. I am a noob to C# and cannot think of a way to accommodate the LF where it is not supposed to be. The owner of the site (being the government) is not going to clean up the data.
    Part of my logic is as follows:

               Uri uri = new Uri(url);
                //Create the request object
                WebRequest req = WebRequest.Create(uri);
                WebResponse resp = req.GetResponse();
                Stream stream = resp.GetResponseStream();
                StreamReader sr = new StreamReader(stream);
                string s = sr.ReadToEnd();

                string[] lines = s.Split('\n');

                int iCount = 0;
                foreach (string line in lines)
                {
                    if (iCount++ < 2)
                        continue;
                    if (line == "")
                        continue;

                    string[] Items = line.Split('\t');
                    ScreenItem myScreenItem = new ScreenItem();

                    myScreenItem.source = Items[0];
                    myScreenItem.entity_number = Items[1];
                    myScreenItem.type = Items[2];
                    myScreenItem.programs = Items[3];
                    myScreenItem.name = Items[4].Replace('"', ' ').Trim();
                    myScreenItem.title = Items[5];
                    myScreenItem.addresses = Items[6];
                    myScreenItem.federal_register_notice = Items[7];
                    myScreenItem.start_date = Items[8];
                    myScreenItem.end_date = Items[9];
                    myScreenItem.standard_order = Items[10];
                    myScreenItem.license_requirement = Items[11];
                    myScreenItem.license_policy = Items[12];
                    myScreenItem.call_sign = Items[13];
                    myScreenItem.vessel_type = Items[14];
                    myScreenItem.gross_tonnage = Items[15];
                    myScreenItem.gross_registered_tonnage = Items[16];
                    myScreenItem.vessel_flag = Items[17];
                    myScreenItem.vessel_owner = Items[18];
                    myScreenItem.remarks = Items[19];
                    myScreenItem.source_list_url = Items[20];
                    myScreenItem.alt_names = Items[21];
                    myScreenItem.citizenships = Items[22];
                    myScreenItem.dates_of_birth = Items[23];
                    myScreenItem.nationalities = Items[24];
                    myScreenItem.places_of_birth = Items[25];
                    myScreenItem.source_information_url = Items[26];
                    myScreenItem.ids = Items[27];

                    if (myScreenItem.addresses == "\n")
                    {
                        myScreenItem.addresses = myScreenItem.addresses.Replace("\n", ",");
                    }


                    if (myScreenItem.entity_number == "12113")
                    {
                        myScreenItem.alt_names = myScreenItem.alt_names.Replace("\"", string.Empty);
                    }

                    //myScreenItem.Remarks = Items[29];
                    //myScreenItem.WebLink = Items[30];

                    myScreenList.Add(myScreenItem);
                    listBox2.Items.Add(myScreenItem);
                    listBox2.Refresh();
                }
            }
    My first thought was to use an if statement somewhere, but I'm not sure where it should go, so I put it right after the Items[27] line. The LF comes in an address as part of Items[26]. Thank you.



    • Edited by billmarmc Tuesday, November 6, 2018 2:25 PM
    Tuesday, November 6, 2018 1:55 PM

All replies

  • First of all: I guess you also have some audit requirements. In that case it is, IMHO, absolutely mandatory that you store the original data in the form in which you retrieved it.
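
    A minimal sketch of keeping the raw download, assuming the payload is already in a string as in the question. The class name RawArchive, the file-name pattern, and the target directory are hypothetical. Note that a string that has passed through StreamReader is not guaranteed to be byte-identical to what the server sent; for a strict audit trail you may prefer to copy the response stream to disk directly.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class RawArchive
{
    // Writes the payload unchanged to a timestamped file and returns its path.
    // Hypothetical helper: the name and the file-name pattern are assumptions.
    public static string Save(string payload, string directory, DateTime retrievedUtc)
    {
        Directory.CreateDirectory(directory); // no-op if it already exists

        string stamp = retrievedUtc.ToString("yyyyMMdd'T'HHmmss'Z'", CultureInfo.InvariantCulture);
        string path = Path.Combine(directory, "screening-list-" + stamp + ".tsv");

        // UTF-8 without a byte-order mark; line endings are written as-is.
        File.WriteAllText(path, payload, new UTF8Encoding(false));
        return path;
    }
}
```

    In the question's code this could be called right after sr.ReadToEnd(), for example RawArchive.Save(s, archiveFolder, DateTime.UtcNow), where archiveFolder is whatever location your audit policy prescribes.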

    Also: Read the available specifications of this service again. They are normally well documented.

    Furthermore, using LF instead of CRLF is widespread in the Unix world, so it may be a normal line terminator.

    And last, but not least: your splitting approach looks like you are working with a CSV or DSV file that uses LF as its line terminator.
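
    If LF is both the record terminator and an occasional stray character inside a field, splitting on '\n' cuts those records in two, so a check placed after the split never sees the LF. One possible workaround, sketched here under the assumption that every record has a fixed number of tab-separated fields (28 in the question's code) and that a stray LF never stands in for a tab, is to keep appending physical lines until the expected field count is reached. RecordJoiner and ReadRecords are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class RecordJoiner
{
    // Rebuilds logical records from physical lines by counting fields.
    // Assumption: each record has at least `expectedFields` tab-separated fields.
    public static List<string[]> ReadRecords(string text, int expectedFields)
    {
        var records = new List<string[]>();
        string pending = null; // partial record carried over from earlier lines

        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r'); // tolerate CRLF input as well

            // Blank lines between records (or a stray LF at the end of one) are skipped.
            if (pending == null && line.Length == 0)
                continue;

            // A stray LF inside a field is replaced by a comma, as the question intended.
            pending = pending == null ? line : pending + "," + line;

            string[] fields = pending.Split('\t');
            if (fields.Length >= expectedFields)
            {
                records.Add(fields);
                pending = null;
            }
        }

        if (pending != null)
            throw new FormatException("Input ended in the middle of a record.");

        return records;
    }
}
```

    Header lines with fewer fields than a data record would need to be removed before calling this, and a stray LF followed by more text inside the last field cannot be detected this way, because the record already looks complete.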

    Check the service again; maybe you can retrieve the data in a better format (XML, JSON), because parsing CSV is harder than it looks at first glance.

    And check NuGet for ready-to-use CSV parsers.
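
    As an illustration of what a ready-made parser buys you, here is a sketch that uses TextFieldParser from the Microsoft.VisualBasic assembly (shipped with .NET, so a swap-in for a NuGet package rather than one of them). It assumes the feed is tab-delimited and that fields containing a line break are enclosed in double quotes; if the feed does not quote them, no standard parser can recover the record boundaries. TsvReader is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

public static class TsvReader
{
    // Parses tab-delimited text into rows of fields.
    // Assumption: fields with embedded line breaks are quoted in the feed.
    public static List<string[]> Parse(string text)
    {
        var rows = new List<string[]>();

        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            parser.HasFieldsEnclosedInQuotes = true; // quotes are removed from the values
            parser.TrimWhiteSpace = false;           // keep field content as delivered

            while (!parser.EndOfData)
            {
                // Throws MalformedLineException if a line cannot be parsed.
                rows.Add(parser.ReadFields());
            }
        }

        return rows;
    }
}
```

    With this in place, the string s from the question could be passed to TsvReader.Parse(s) and each returned array used the way Items is used now, after checking that it has the expected 28 entries.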

    Tuesday, November 6, 2018 5:16 PM
  • @Stefan

    ...

    Check the service again; maybe you can retrieve the data in a better format (XML, JSON), because parsing CSV is harder than it looks at first glance.

    ...

    Ridiculous. Saying that CSV parsing is harder than XML is a huge joke. Look at the huge number of classes and class members you had to design in this thread! And all that to produce this little fart :S) as output:

    2.16.840.1.113883.10.20.1.24
    2
    EffectiveTime_IVL_TS
    EffectiveTime_PIVL_TS
    UNK
    UNK
    A
    h
    12
    Done.


    Tuesday, November 6, 2018 10:57 PM