Need Help On Best Querying (does LINQ work with a huge amount of data?)

  • Question

  • Dear Group members,

      I am new to LINQ. Please help me with handling a huge amount of data
    in a standalone C# application.


     I have two files, each containing more than 2 lakh (200,000) lines.
    The format of these files can change every time, i.e. they may come
    with a random number of columns.


     Suppose file1 looks like ...

    999361894 422021100257001
    899119011 422021100344217
    899373022 422021100262179
    892044443 422021100426945
    899292491 422021000154860
    ----
    ----

    and file2 looks like

    92000000,422021100420675
    92000002,422021100420403
    92000003,422021100420614
    92000004,422021100425785
    92000005,422021100422232
    ---
    ----

    The values in each file are separated by a different set of
    delimiters, and the delimiter can change, i.e. some files come with
    a different delimiter and with more fields as well.


      My goal is to import these differently formatted files into my
    application and generate an auto-summarized report like: 99999
    is missing in file 2, 99999 is missing in file 1, and for
    id 999999 the values do not match.


    So far, I have succeeded in importing the various types of files into
    my data grid.


    The remaining major task is to compare the files and generate the
    summary report. Kindly remember that I don't have database support
    for my application; all the data comes in different formats, and it
    will always arrive as text files.


      So I am starting with a small comparison using LINQ,

     like
         DataTable dt1; // holds the first file's data
         DataTable dt2; // holds the second file's data


     Then I try to find the records which are in dt1 but not in dt2,
    with LINQ:



    var test = from k in dt1.AsEnumerable()
               where !(from t in dt2.AsEnumerable()
                       select t.Field<string>(0)).Contains(k.Field<string>(0))
               select k;

    myResult_DataGrid.DataSource = test.AsDataView();



    These steps have been running for several minutes, and the application
    still looks like it is executing the same statement.


     I don't know what is causing the application to hang. It looks like
    a problem with the huge amount of data; after all, it is doing nearly
    2 lakh × 2 lakh (200,000 × 200,000) comparisons.


     Are there any other solutions for doing this type of thing? Is there
    any problem in my LINQ statement?


     Please let me know if anybody has a solution for this type of
    program.


    Thanks and Regards
    ranganadh Kodali
    Monday, March 17, 2008 5:14 AM

Answers

  • Hello,

    If I were you, I would do this:

    Code Snippet

    var q = from r in dt1.AsEnumerable()
            let i = r.Field<int?>(0) // int? in case of DBNull.Value, else int is ok
            where !dt2.AsEnumerable().Any(r2 => r2.Field<int?>(0) == i)
            select i;
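
    Note that this query still rescans dt2 for every row of dt1, so it
    stays roughly O(n × m). A set-based variant should be much faster.
    Here is a minimal sketch, assuming the key column is stored as a
    string (as in the query from the original post); myResult_DataGrid
    is the grid named in the question:

    Code Snippet

    // Build a hash set of the keys in dt2 once; each lookup is then O(1)
    // instead of a full scan of dt2.
    HashSet<string> keys2 = new HashSet<string>(
        dt2.AsEnumerable().Select(r => r.Field<string>(0)));

    IEnumerable<DataRow> missingFromFile2 = dt1.AsEnumerable()
        .Where(r => !keys2.Contains(r.Field<string>(0)));

    // CopyToDataTable throws InvalidOperationException on an empty result,
    // so check for that case in real code.
    myResult_DataGrid.DataSource = missingFromFile2.CopyToDataTable();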

     

     

    Monday, March 17, 2008 10:38 AM
  • The basic issue here is that when you load this data into a DataTable in order to do LINQ over it, you have to load all the data into memory, and if you truly have a large amount of data that's probably going to be an issue.  Unfortunately, if you want to use LINQ for this problem then you don't have a lot of choices.  I would either:

     

    1) Write an algorithm by hand that compares the incoming data incrementally (look at the first line of each file, then discard it and move on to the next line, or something along those lines; see the sketch after this list).

     

    2) Or I would load the data into a local database like SQL Express or something temporarily and use the indexes there and such.  This would likely be slower at execution time than option 1 but might be faster/easier to write.
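
    For option 1, here is a minimal sketch of what the hand-rolled comparison could look like. It assumes both files are already sorted by the key, that the key is the first field on each line, and that the caller knows the delimiter set; none of that is guaranteed by the original question, so treat it as a starting point only:

    Code Snippet

    // Requires: using System; using System.IO;
    // Streaming merge-style comparison of two key-sorted text files.
    // Only one line per file is held in memory at a time.
    static void CompareSorted(string path1, string path2, char[] delims)
    {
        using (StreamReader r1 = new StreamReader(path1))
        using (StreamReader r2 = new StreamReader(path2))
        {
            string k1 = NextKey(r1, delims);
            string k2 = NextKey(r2, delims);
            while (k1 != null && k2 != null)
            {
                int c = string.CompareOrdinal(k1, k2);
                if (c == 0)
                {
                    // Key present in both files: compare the remaining fields here.
                    k1 = NextKey(r1, delims);
                    k2 = NextKey(r2, delims);
                }
                else if (c < 0)
                {
                    Console.WriteLine(k1 + " is missing in file 2");
                    k1 = NextKey(r1, delims);
                }
                else
                {
                    Console.WriteLine(k2 + " is missing in file 1");
                    k2 = NextKey(r2, delims);
                }
            }
            // One file ran out; everything left in the other is unmatched.
            for (; k1 != null; k1 = NextKey(r1, delims))
                Console.WriteLine(k1 + " is missing in file 2");
            for (; k2 != null; k2 = NextKey(r2, delims))
                Console.WriteLine(k2 + " is missing in file 1");
        }
    }

    // Reads the next line and returns its first field, or null at end of file.
    static string NextKey(StreamReader reader, char[] delims)
    {
        string line = reader.ReadLine();
        return line == null ? null : line.Split(delims)[0];
    }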

     

    - Danny

     

    Monday, March 17, 2008 2:48 PM

All replies

  • Depending on the other requirements for this data, a database may be overkill.  I'd do something like this:

     

    Code Snippet

    internal class KeyInfo
    {
        internal bool inFile1 = false;
        internal bool inFile2 = false;
    }

    internal class FileParser : IEnumerable<string>
    {
        // all the file I/O code goes in here
    }

    static Dictionary<string, KeyInfo> CompareFiles(string path1, string path2)
    {
        Dictionary<string, KeyInfo> dict = new Dictionary<string, KeyInfo>();
        IEnumerable<string> file1 = new FileParser(path1);
        IEnumerable<string> file2 = new FileParser(path2);
        foreach (string k in file1)
        {
            if (!dict.ContainsKey(k))
            {
                dict.Add(k, new KeyInfo());
            }
            dict[k].inFile1 = true;
        }
        foreach (string k in file2)
        {
            if (!dict.ContainsKey(k))
            {
                dict.Add(k, new KeyInfo());
            }
            dict[k].inFile2 = true;
        }
        return dict;
    }

     

     

    You can then use the dictionary returned by CompareFiles in whatever context is sensible, such as iterating through the space of available keys and determining whether each key is in one, both, or neither input file.  It's easy to extend KeyInfo to record more information - you might make inFile a bool[], so that you can compare several files.  You might make it an int, so that you can count occurrences of the key values.  And so on.
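
    For example, here is a short (hypothetical) use of the returned dictionary to produce the kind of summary the question asks for; the file paths are placeholders:

    Code Snippet

    // Requires: using System; using System.Collections.Generic;
    Dictionary<string, KeyInfo> dict = CompareFiles("file1.txt", "file2.txt");
    foreach (KeyValuePair<string, KeyInfo> pair in dict)
    {
        if (pair.Value.inFile1 && !pair.Value.inFile2)
            Console.WriteLine(pair.Key + " is missing in file 2");
        else if (pair.Value.inFile2 && !pair.Value.inFile1)
            Console.WriteLine(pair.Key + " is missing in file 1");
    }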

     

    The hardest part of this is implementing IEnumerable<string> in the FileParser class, but you have to do that no matter how you solve the problem. 
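
    For what it's worth, here is one minimal sketch of that implementation. It assumes the key is the first field on each line and that the caller supplies (or defaults) the delimiter set; the snippet above deliberately leaves these details open:

    Code Snippet

    // Requires: using System.Collections.Generic; using System.IO;
    internal class FileParser : IEnumerable<string>
    {
        private readonly string path;
        private readonly char[] delimiters;

        // Delimiters default to space/comma/tab if none are supplied (an assumption).
        internal FileParser(string path, params char[] delimiters)
        {
            this.path = path;
            this.delimiters = delimiters.Length > 0
                ? delimiters
                : new char[] { ' ', ',', '\t' };
        }

        public IEnumerator<string> GetEnumerator()
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (line.Trim().Length == 0)
                        continue; // skip blank lines
                    yield return line.Split(delimiters)[0]; // first field is the key
                }
            }
        }

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }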

    Tuesday, March 18, 2008 8:21 PM