What is the best way to load a huge result set in memory?

  • Question

  • I am trying to load two huge result sets (source and target) coming from different RDBMSs, and the problem I am struggling with is getting those two huge result sets into memory.

    Below are the queries used to pull data from source and target:

    SQL Server - select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn

    Oracle - select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn

    Records in Source : 12377200

    Records in Target : 12266800

    Following are the approaches I have tried, with some statistics:

    1) 2 open data reader approach for reading source and target data:

    Total jobs running in parallel = 3

    Time taken by Job1 = 01:47:25

    Time taken by Job2 = 01:47:25

    Time taken by Job3 = 01:48:32

    There is no index on Id Column.

    Most of the time is spent here:

    var dr = command.ExecuteReader();


    Problems :
    There are also timeout issues, for which I have had to set `CommandTimeout` to 0 (infinite), which is bad.

    2) Chunk-by-chunk reading approach for reading source and target data (a rough sketch of this pattern appears after approach 5):

       Total jobs = 1
       Chunk size : 100000
       Time Taken : 02:02:48
       There is no index on Id Column.

    3) Chunk-by-chunk reading approach for reading source and target data:

       Total jobs = 1
       Chunk size : 100000
       Time Taken : 00:39:40
       Index is present on Id column.

    4) 2 open data reader approach for reading source and target data:

       Total jobs = 1
       Index : Yes
       Time: 00:01:43

    5) 2 open data reader approach for reading source and target data:

       Total jobs running in parallel = 3
       Index : Yes
       Time: 00:25:12
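
    For reference, the chunk-by-chunk pattern in approaches 2 and 3 looks roughly like this (a keyset-paging sketch only; the TOP clause, the int key type, and the parameter names are assumptions, not my exact code):

    // Sketch: read the source in chunks of 100000 rows, remembering the last
    // LinkedColumn value so the next query continues where the previous one stopped.
    // Assumes Id (LinkedColumn) is an int; the same loop runs against the target.
    const int chunkSize = 100000;
    int lastKey = int.MinValue;
    bool moreRows = true;

    using (var connection = new SqlConnection("Source Connection String"))
    {
        connection.Open();
        while (moreRows)
        {
            moreRows = false;
            using (var command = new SqlCommand(
                "select top (@chunk) Id as LinkedColumn, CompareColumn " +
                "from Source where Id > @lastKey order by Id", connection))
            {
                command.Parameters.AddWithValue("@chunk", chunkSize);
                command.Parameters.AddWithValue("@lastKey", lastKey);
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        lastKey = reader.GetInt32(0);
                        // buffer or compare the row here
                        moreRows = true;
                    }
                }
            }
        }
    }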

    I do observe that having an index on LinkedColumn improves performance, but the problem is that we are dealing with third-party RDBMS tables which may or may not have an index. We would also like to keep the database server as free as possible, so the data reader approach does not seem like a good idea: there will be lots of jobs running in parallel, which will put a lot of pressure on the database server, and we don't want that.

    Hence we want to fetch the records from both source and target into our application's memory and do a 1-to-1 record comparison there, keeping the database server free.

    Note: I want to do this in my C# application and don't want to use SSIS or a linked server.

    Source SQL query execution time in SQL Server Management Studio: 00:01:41

    Target SQL query execution time in SQL Server Management Studio: 00:01:40

    What would be the best way to read a huge result set into memory?

    Code :

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Diagnostics;
    using System.Threading.Tasks;

    static void Main(string[] args)
    {
        // Running 3 jobs in parallel:
        //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
        //                             Task<string>.Factory.StartNew(() => Compare()),
        //                             Task<string>.Factory.StartNew(() => Compare()) };

        Compare(); // Run a single job
        Console.ReadKey();
    }

    public static string Compare()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        var srcConnection = new SqlConnection("Source Connection String");
        srcConnection.Open();
        var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

        var tgtConnection = new SqlConnection("Target Connection String");
        tgtConnection.Open();
        var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

        var drA = GetReader(command1);
        var drB = GetReader(command2);

        stopwatch.Stop();
        string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
        Console.WriteLine(a);
        return a;
    }

    private static IDataReader GetReader(SqlCommand command)
    {
        command.CommandTimeout = 0; // infinite timeout, to avoid the timeout errors mentioned above
        return command.ExecuteReader(); // Culprit: this is where most of the time goes
    }
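
    For context, here is a rough sketch of the 1-to-1 comparison I intend to run over the two open readers once I have them (a merge-style walk over the two ordered result sets; the int/string column types and a duplicate-free LinkedColumn are assumptions):

    private static void CompareReaders(IDataReader src, IDataReader tgt)
    {
        // Sketch only: both readers are ordered by LinkedColumn, so the two result
        // sets can be walked in lockstep without holding either of them in memory.
        // Assumes LinkedColumn is an int and unique on both sides.
        bool hasSrc = src.Read();
        bool hasTgt = tgt.Read();

        while (hasSrc && hasTgt)
        {
            int srcId = src.GetInt32(0);
            int tgtId = tgt.GetInt32(0);

            if (srcId == tgtId)
            {
                if (!Equals(src.GetValue(1), tgt.GetValue(1)))
                    Console.WriteLine("Mismatch for LinkedColumn = " + srcId);
                hasSrc = src.Read();
                hasTgt = tgt.Read();
            }
            else if (srcId < tgtId)
            {
                Console.WriteLine("Missing in target: " + srcId);
                hasSrc = src.Read();
            }
            else
            {
                Console.WriteLine("Missing in source: " + tgtId);
                hasTgt = tgt.Read();
            }
        }

        while (hasSrc) { Console.WriteLine("Missing in target: " + src.GetInt32(0)); hasSrc = src.Read(); }
        while (hasTgt) { Console.WriteLine("Missing in source: " + tgt.GetInt32(0)); hasTgt = tgt.Read(); }
    }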

    Friday, February 16, 2018 7:44 AM

All replies

  • Are you not able to load the smaller set first and then retrieve the matching data from the bigger one with a "where" clause?

    Success
    Cor

    Friday, February 16, 2018 8:38 AM
  • Sorry, I didn't get you. Can you please elaborate?
    • Edited by WcfBeginner Friday, February 16, 2018 9:33 AM
    Friday, February 16, 2018 9:32 AM
  • I assume you have to do something with that result set, and that the two are related.

    Therefore you need the data from both, but of course not all of the data at the same time.

    So if you load one set and, for instance, loop through it (or use LINQ) and each time fetch the related data from the server with a SQL where clause, it should be possible to do your job.
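
    A rough sketch of what I mean (the lookup query, the column types, and the in-memory sourceRows collection with LinkedColumn/CompareColumn properties are assumptions, not your actual schema):

    // Sketch: loop through the already loaded source rows and fetch only the
    // related target row per key with a parameterized where clause.
    // Assumes Id/LinkedColumn is an int and CompareColumn is a string.
    using (var tgtConnection = new SqlConnection("Target Connection String"))
    {
        tgtConnection.Open();
        var lookup = new SqlCommand(
            "select CompareColumn from Target where Id = @id", tgtConnection);
        var idParam = lookup.Parameters.Add("@id", SqlDbType.Int);

        foreach (var row in sourceRows) // sourceRows: the smaller set, already in memory
        {
            idParam.Value = row.LinkedColumn;
            object targetValue = lookup.ExecuteScalar(); // null when the row is missing in Target

            if (targetValue == null)
                Console.WriteLine("Missing in target: " + row.LinkedColumn);
            else if (!Equals(targetValue, row.CompareColumn))
                Console.WriteLine("Mismatch for: " + row.LinkedColumn);
        }
    }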


    Success
    Cor


    Friday, February 16, 2018 10:17 AM
  • I have 12377200 records in source and I loop through them. Now, for each record in source, I would have to fire a query against target, so I would end up firing the SQL query 12377200 times, right?
    Friday, February 16, 2018 10:29 AM
  • I have 12377200 records in source and I loop through them. Now, for each record in source, I would have to fire a query against target, so I would end up firing the SQL query 12377200 times, right?

    If you have them both in memory it would be the same loop, just without the query; simply test what that round trip takes, because measuring is knowing. With 2 x 12377200 records in collections it won't be fast either, because for every record the loop will probably also have to run through the second collection.

    There is no best way to load collections of that size into memory. Maybe there is a least-bad way.


    Success
    Cor


    Friday, February 16, 2018 10:34 AM
  • But if I have both source and target in memory, then with multiple jobs it will eat up my memory, considering about 24 million records. Do you think having all records from source and target in memory is a good idea?
    Friday, February 16, 2018 10:36 AM
  • At my workplace we also have applications that process huge amounts of records, which can go into the millions, and we follow a batching approach.

    What you can do is process the records in batches: first create a query that fetches the min and max Id from the source table that you need to process:

    SELECT Min(ID) as RangeFrom, Max(ID) as RangeTo
    FROM Source
    WHERE .......

    then, let's say, you pull 1000 records from the source into memory using a query like:

    Select col1, col2, col3
    FROM  Source 
    WHERE Id between @rangeFrom and  @rangeTo

    and you process them, update them in the Target table, and keep going this way until you have processed all the records.
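
    A rough C# sketch of this batching loop (the connection string, batch size, and the key-range query are assumptions; adapt them to your schema):

    // Sketch: walk the source table in fixed-size Id ranges so only one batch
    // is ever held in memory. Assumes Id is a numeric key.
    const int batchSize = 1000; // assumption; tune to your environment
    using (var connection = new SqlConnection("Source Connection String"))
    {
        connection.Open();

        long rangeFrom, rangeTo;
        using (var rangeCommand = new SqlCommand(
            "SELECT MIN(Id), MAX(Id) FROM Source", connection))
        using (var rangeReader = rangeCommand.ExecuteReader())
        {
            rangeReader.Read();
            rangeFrom = Convert.ToInt64(rangeReader[0]);
            rangeTo = Convert.ToInt64(rangeReader[1]);
        }

        for (long start = rangeFrom; start <= rangeTo; start += batchSize)
        {
            long end = Math.Min(start + batchSize - 1, rangeTo);
            using (var batchCommand = new SqlCommand(
                "SELECT Id, CompareColumn FROM Source WHERE Id BETWEEN @from AND @to", connection))
            {
                batchCommand.Parameters.AddWithValue("@from", start);
                batchCommand.Parameters.AddWithValue("@to", end);
                using (var reader = batchCommand.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // process / compare this row against the target here
                    }
                }
            }
        }
    }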

    Hope it  helps!


    [If a post helps to resolve your issue, please click the "Mark as Answer" button of that post, or click the "Vote as helpful" button. By marking a post as Answered or Helpful, you help others find the answer faster.]




    Sunday, February 18, 2018 9:27 AM