none
How to group records in a CSV by Lattitude and Longitude with .002 Specificity? RRS feed

  • Question

  • Hi,

    I'm writing a C# console application that reads in a CSV file comprised of businesses with lattitude and longitude for each.  I need to create a resulting file that groups each entry in the original CSV to the nearest .002 lattitude / longitude pair. 

    Can anyone share some sample code, or links to blogs that would help me to achieve this?

    Thanks for your help and guidance.

    Tuesday, July 2, 2013 3:12 PM

Answers

  • Does you output file have to be CSV?  CSV doesn't contain formating. You could make your output mdb or xls where you can specifiy the format of each column/field.  You can use oledb to connect to either excel or access files as well has many other standard database formating.

    Do you want your output file to be read/write or just write?  Your input file is CSV, but your output file could be a database.  Or your project may have three files

       1) Input file

       2) Output file

       3) Database

    1.  I'm reading from one CSV and need to write results to another CSV.  Instead of the SQL conn string, how do I handle reading the CSV containing the data and writing the results to another CSV?

    Answer: Use Oleconnection.  Your SQL is a Insert string and instead of Select String use a Insert string.  Below {1} and {2} are CSV strings.

                           

    cmd.CommandText = string.Format("INSERT INTO {0} ({1}) values ({2})",
                            tablename, columnsHeaderText, columnValuesText);
                            cmd.ExecuteNonQuery();

    2.  Can latitude and longitude be defined as ints instead of doubles?

    I tried in my original code to read the latitide and longitude to double but could get it to work.  Later I realized that you can add two additional columns to the datatable that are types double with a formula to convert the string latitude and longitude to numbers.

    3.  How does the code you supplied ensure that the entries in the source CSV are grouped within a .002 lat/long specificity?

    You have write code to add the DataRows to the grid.  My solution meet all your requirements, but it was not the most efficient method.  My System Engineering Professor use to say the following :

    Requirements come from two places. Writen requirements and peoples visions.  It is the job of the System Engineer to take people visions and turn them into written requirements.

    Right now we are going through a brain-storming session.  Part of brainstorming is to write down all ideas no matter how studpid they are.  The stupid solutions usually leads to better solutions.

    Your second posting is giving me a little bit more knowledge on your problem that wasn't included in your 1st posting.  I still don't have all your written requirments because you still have visions that aren't in writting.  We need to turn your visions into requirments.

    There are two issues we must resolve

    1)   The format and contents of the output database.  I define a database as any file with known structure.  I consider a CSV file a database.

    2)    The structure of your grid.  I made the grid an array that contain pointers to rows of the datatable.  You may not want to use a standard array that contains every latitude.  Instead you may want to use a sparse matrix.  One of my college programming project (almost 40 years ago using Pascal) was to perform matrix operations using sparse.  A sparse matrix is an array that doesn't contain null entries.

    Your would have a List<List<cell>>.  I would add two more item to my cell structure which is the following

    1) cell nextLatitude;

    2) cell nextLongitude;

    The List<List<cell>> would have a header row and header column.  The header row would contain used the Latitudes used in the table and would point to to the first cell that contain that latitude.  The first cell can then point to more cells that contain the same latitude.  The last cell with the same latitude will point either to null or back to the header row.  You can do the same with the header columns using the Longitudes.

    Note : The header row and column could be rounded to .002.


    jdweng

    Wednesday, July 3, 2013 6:00 AM
  • I have additional things to say.  Most beginner computer couses sort algorithms  one dimensional sort algorithms are taught and a comparison of the algorithms are discussed.  You problem is a two dimensional sort with and added requirement of 'BIN' the results.  A BIN is to put data into groups in this case a bin size of .002.  There are three different algorithms that can be used to perform a two dimensional sort.

    1) Perform a query of the data using a FOR loop.  In your case the for loop would be

    for i = min longitude, i <= max longitude, i += .002

    Your query will be a range i <= query < i + .002 and may return more than one row of the table.  If you get more than one row from the query you may need to sort on the latitude. 

    The time to perform this algorithm is (max longitude - min longitude)/2 * Number of rows in datatable.  The time can be reduced by 1/2 if you remove each row from the datatable after it is found.  the search will autmatically do the BIN.

    2) Sort Datatable on Latitude and Longitude  After the sort you may need to apply a BIN function.  The time of a bubble sort is approximately (N*N)/2 where N is the number of rows of the datatable.

    3) Create a SPARSE matrix.  This is usually the quickest method because each row of the datatable is only accessed once.  It also requires writting more code.  Nothing comes for free.

    The time of the algorthm is the time of adding each datarow to the sparse matrix.  The adding of data to the sparse matrix still requires doing a sort when data is added to the table.  the worse case is when each longitude (or latitude) vaue is unique.  The average add time is the number of longitude entries (the number of rows of the datatable) N/2.  If you create a hash table of longitudes this time can be greatly reduced.  As you add items to the sparse matrix you can also do the BIN at the same time.


    jdweng

    Wednesday, July 3, 2013 11:43 AM

All replies

  • Can you post some sample lines of data.  Do you need to sort or just trucate the numbers to three decimal palces?

    jdweng

    Tuesday, July 2, 2013 4:06 PM
  •  Hi Joel,

    Here are a few lines from the original CSV:

    "LATITUDE","LONGITUDE","MatchType","CompanyName","Address","City","State","Zip","County","Gender","PhoneNumber","FaxNumber","CompanyContact","CompanyContactTitle","WebSite","TotalEmployees","AnnualSales","Industry","SIC_Code","SIC_Code_Description","GenericCompanyEmailAddress"
    "33.9601","-118.2316","ExactMatch","Belindas Furniture","6001 E Manchester Ave","Los Angeles","CA","90001","Los Angeles","male","3237582143","","Jose Pretado","Owner","","0","0","Miscellaneous Retail (Stores)","5932","Used Merchandise Stores",""
    "33.9547","-118.2441","IsPOBOX","Bill Jones Sr","P.O. BOX 1111","Los Angeles","CA","90001","Los Angeles","male","2136220243","","Bill Jones","Owner","","0","0","Miscellaneous Manufacturing Industries (Industries)","3911","Jewelry, Precious Metal",""

    The first thing the console app needs to do is group these records into .002 grids (not UI grids just in the console).  After that is achieved, then each grid has to be classified based on the SIC code.  I can handle the latter part.  The creation of the grids in the console app is my current challenge.

    Thanks much for your help!

    Tuesday, July 2, 2013 4:24 PM
  • Does the code below help?

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Data;
    using System.Data.OleDb;
    using System.IO;
    namespace ConsoleApplication1
    {
        class Program
        {
            static string filename = @"c:\TEMP\Latitude.csv";
            struct cell
            {
                public double latitude;
                public double longitude;
                public List<DataRow> records;
            }
            static void Main(string[] args)
            {
                DataTable table = new DataTable();
                string tablename = filename.Substring(filename.LastIndexOf('\\') + 1);
                string folder = filename.Substring(0,filename.LastIndexOf('\\') + 1);
                String ConnectionString = String.Format("Provider=Microsoft.Jet.OLEDB.4.0; Data Source={0};" +
                    "Extended Properties=\"Text\";",folder);
                string SQL = string.Format("SELECT * FROM {0}", tablename);
                OleDbDataAdapter adapter = new OleDbDataAdapter(SQL, ConnectionString);
                adapter.Fill(table);
                double maxLatitude = table.AsEnumerable()
                    .Max(row => double.Parse(row["Latitude"].ToString()));
                double minLatitude = table.AsEnumerable()
                    .Min(row => double.Parse(row["Latitude"].ToString()));
                int heightGrids = (int)Math.Ceiling((1000 * (maxLatitude - minLatitude)) / 2);
                double maxLongitude = table.AsEnumerable()
                    .Max(row => double.Parse(row["Longitude"].ToString()));
                double minLongitude = table.AsEnumerable()
                    .Min(row => double.Parse(row["Longitude"].ToString()));
                int widthsGrids = (int)Math.Ceiling((1000 * (maxLatitude - minLatitude)) / 2);
                cell[,] grid = new cell[widthsGrids,heightGrids];
            }
        }
    }


    jdweng

    Tuesday, July 2, 2013 10:59 PM
  • Hi Joel,

    This is a great start.  A few questions regarding your code:

    1.  I'm reading from one CSV and need to write results to another CSV.  Instead of the SQL conn string, how do I handle reading the CSV containing the data and writing the results to another CSV?

    2.  Can latitude and longitude be defined as ints instead of doubles?

    3.  How does the code you supplied ensure that the entries in the source CSV are grouped within a .002 lat/long specificity?

    Thanks,

    Sid

    Wednesday, July 3, 2013 1:24 AM
  • Does you output file have to be CSV?  CSV doesn't contain formating. You could make your output mdb or xls where you can specifiy the format of each column/field.  You can use oledb to connect to either excel or access files as well has many other standard database formating.

    Do you want your output file to be read/write or just write?  Your input file is CSV, but your output file could be a database.  Or your project may have three files

       1) Input file

       2) Output file

       3) Database

    1.  I'm reading from one CSV and need to write results to another CSV.  Instead of the SQL conn string, how do I handle reading the CSV containing the data and writing the results to another CSV?

    Answer: Use Oleconnection.  Your SQL is a Insert string and instead of Select String use a Insert string.  Below {1} and {2} are CSV strings.

                           

    cmd.CommandText = string.Format("INSERT INTO {0} ({1}) values ({2})",
                            tablename, columnsHeaderText, columnValuesText);
                            cmd.ExecuteNonQuery();

    2.  Can latitude and longitude be defined as ints instead of doubles?

    I tried in my original code to read the latitide and longitude to double but could get it to work.  Later I realized that you can add two additional columns to the datatable that are types double with a formula to convert the string latitude and longitude to numbers.

    3.  How does the code you supplied ensure that the entries in the source CSV are grouped within a .002 lat/long specificity?

    You have write code to add the DataRows to the grid.  My solution meet all your requirements, but it was not the most efficient method.  My System Engineering Professor use to say the following :

    Requirements come from two places. Writen requirements and peoples visions.  It is the job of the System Engineer to take people visions and turn them into written requirements.

    Right now we are going through a brain-storming session.  Part of brainstorming is to write down all ideas no matter how studpid they are.  The stupid solutions usually leads to better solutions.

    Your second posting is giving me a little bit more knowledge on your problem that wasn't included in your 1st posting.  I still don't have all your written requirments because you still have visions that aren't in writting.  We need to turn your visions into requirments.

    There are two issues we must resolve

    1)   The format and contents of the output database.  I define a database as any file with known structure.  I consider a CSV file a database.

    2)    The structure of your grid.  I made the grid an array that contain pointers to rows of the datatable.  You may not want to use a standard array that contains every latitude.  Instead you may want to use a sparse matrix.  One of my college programming project (almost 40 years ago using Pascal) was to perform matrix operations using sparse.  A sparse matrix is an array that doesn't contain null entries.

    Your would have a List<List<cell>>.  I would add two more item to my cell structure which is the following

    1) cell nextLatitude;

    2) cell nextLongitude;

    The List<List<cell>> would have a header row and header column.  The header row would contain used the Latitudes used in the table and would point to to the first cell that contain that latitude.  The first cell can then point to more cells that contain the same latitude.  The last cell with the same latitude will point either to null or back to the header row.  You can do the same with the header columns using the Longitudes.

    Note : The header row and column could be rounded to .002.


    jdweng

    Wednesday, July 3, 2013 6:00 AM
  • I have additional things to say.  Most beginner computer couses sort algorithms  one dimensional sort algorithms are taught and a comparison of the algorithms are discussed.  You problem is a two dimensional sort with and added requirement of 'BIN' the results.  A BIN is to put data into groups in this case a bin size of .002.  There are three different algorithms that can be used to perform a two dimensional sort.

    1) Perform a query of the data using a FOR loop.  In your case the for loop would be

    for i = min longitude, i <= max longitude, i += .002

    Your query will be a range i <= query < i + .002 and may return more than one row of the table.  If you get more than one row from the query you may need to sort on the latitude. 

    The time to perform this algorithm is (max longitude - min longitude)/2 * Number of rows in datatable.  The time can be reduced by 1/2 if you remove each row from the datatable after it is found.  the search will autmatically do the BIN.

    2) Sort Datatable on Latitude and Longitude  After the sort you may need to apply a BIN function.  The time of a bubble sort is approximately (N*N)/2 where N is the number of rows of the datatable.

    3) Create a SPARSE matrix.  This is usually the quickest method because each row of the datatable is only accessed once.  It also requires writting more code.  Nothing comes for free.

    The time of the algorthm is the time of adding each datarow to the sparse matrix.  The adding of data to the sparse matrix still requires doing a sort when data is added to the table.  the worse case is when each longitude (or latitude) vaue is unique.  The average add time is the number of longitude entries (the number of rows of the datatable) N/2.  If you create a hash table of longitudes this time can be greatly reduced.  As you add items to the sparse matrix you can also do the BIN at the same time.


    jdweng

    Wednesday, July 3, 2013 11:43 AM