Processing large file connecting to Database

  • Question

  • Hi,

    I have a large file with many millions of rows per day. I need to look up a column from the input file, get related information from the database, and regenerate the file with all the same column values as the original but with that one column transformed. There is no database load involved, just a lookup.

    Could you please suggest performance considerations, such as splitting the file into smaller chunks and processing them in parallel, given that a database is involved? Which C# .NET programming features should I be exploring (multithreading?)

    thank you

    Wednesday, June 24, 2020 6:38 PM

All replies

  • Hi etl2016,

    Thank you for posting here.

    I think your problem can be divided into two parts.

    First, I suggest using a DataReader to retrieve the data from the database.

    A DataReader is a good choice when you're retrieving large amounts of data, because the data is not cached in memory.

    Retrieve data using a DataReader
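    For example, a minimal sketch of that approach (the connection string and the lookup table/column names dbo.Lookup, Key and Value are placeholders, not from the original post):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    class LookupLoader
    {
        // Streams the lookup table once with a DataReader and caches it in a
        // dictionary, so millions of file rows can then be transformed in
        // memory without one query per row.
        static Dictionary<string, string> LoadLookup(string connectionString)
        {
            var map = new Dictionary<string, string>();

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT [Key], [Value] FROM dbo.Lookup", connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                        map[reader.GetString(0)] = reader.GetString(1);
                }
            }
            return map;
        }
    }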

    Second, I recommend looking into Parallel.ForEach to learn how to write a simple Parallel.ForEach loop.
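    A minimal sketch of such a loop (the file names are placeholders, and TransformLine stands in for the real lookup/transform work):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class ParallelTransform
    {
        static void Main()
        {
            var results = new ConcurrentBag<string>();

            // File.ReadLines streams the input lazily; Parallel.ForEach fans the
            // lines out across worker threads. Note: a ConcurrentBag does not
            // preserve the original row order.
            Parallel.ForEach(File.ReadLines("input.txt"), line =>
            {
                results.Add(TransformLine(line));
            });

            File.WriteAllLines("output.txt", results);
        }

        // Stands in for the real lookup/transform logic.
        static string TransformLine(string line) => line.ToUpperInvariant();
    }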

    Hope my advice can help you.

    Best Regards,

    Jack


    MSDN Community Support

    Thursday, June 25, 2020 9:05 AM
  • It depends on the complexity of your "lookup column [..] and [..] related information".

    I would consider loading the file into a local database, doing the "lookup" in SQL against local tables, and exporting the result, because SQL Server is pretty good at this.
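    A rough sketch of that bulk-load idea in C# (the staging and lookup table names, the column layout, and the comma-separated format are all assumptions):

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    class BulkLoadApproach
    {
        static void Run(string connectionString, string inputPath)
        {
            // Stage the file rows locally, then let SQL Server do the join.
            var table = new DataTable();
            table.Columns.Add("Key", typeof(string));
            table.Columns.Add("Rest", typeof(string));

            foreach (var line in File.ReadLines(inputPath))
            {
                var parts = line.Split(new[] { ',' }, 2);
                table.Rows.Add(parts[0], parts.Length > 1 ? parts[1] : "");
            }

            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                // Bulk-load the staged rows in one shot.
                using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Staging" })
                {
                    bulk.WriteToServer(table);
                }

                // One set-based UPDATE replaces millions of per-row lookups;
                // exporting dbo.Staging afterwards (e.g. with bcp) yields the file.
                using (var command = new SqlCommand(
                    @"UPDATE s SET s.[Key] = l.[Value]
                      FROM dbo.Staging s
                      JOIN dbo.Lookup l ON l.[Key] = s.[Key]", connection))
                {
                    command.ExecuteNonQuery();
                }
            }
        }
    }

    For a file too large to hold in memory, SqlBulkCopy.WriteToServer(IDataReader) can stream the rows instead of building a DataTable first.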

    And for doing it in C#:

    - To make it fast, you need to minimize database roundtrips for your lookups (see the sketch after this list).

    - Threading or Parallel.ForEach() can be a problem when the order of the rows in your file needs to be maintained. In that case you need to partition your file into chunks, process those chunks in parallel, and then combine the results in the original order.
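    A minimal sketch combining both points, under assumptions: the rows are comma-separated with the lookup key in the first column, and the lookup table dbo.Lookup with columns Key and Value is a placeholder. Each chunk costs one database round trip, and AsOrdered() preserves the original row order:

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;

    class ChunkedPipeline
    {
        const string ConnectionString = "..."; // placeholder

        static void Main()
        {
            // SQL Server allows roughly 2100 parameters per query, so keep the
            // chunk size below that when using a parameterized IN list.
            const int chunkSize = 1000;

            // Process chunks in parallel; AsOrdered() keeps the file's row order.
            var outputChunks = ReadChunks("input.txt", chunkSize)
                .AsParallel()
                .AsOrdered()
                .WithDegreeOfParallelism(4) // keep low to avoid saturating the database
                .Select(ProcessChunk);

            using (var writer = new StreamWriter("output.txt"))
                foreach (var chunk in outputChunks)
                    foreach (var line in chunk)
                        writer.WriteLine(line);
        }

        static IEnumerable<List<string>> ReadChunks(string path, int size)
        {
            var chunk = new List<string>(size);
            foreach (var line in File.ReadLines(path))
            {
                chunk.Add(line);
                if (chunk.Count == size)
                {
                    yield return chunk;
                    chunk = new List<string>(size);
                }
            }
            if (chunk.Count > 0) yield return chunk;
        }

        static List<string> ProcessChunk(List<string> lines)
        {
            // One database round trip per chunk: resolve all keys at once.
            var map = BatchLookup(lines.Select(l => l.Split(',')[0]));

            return lines.Select(line =>
            {
                var parts = line.Split(',');
                if (map.TryGetValue(parts[0], out var replacement))
                    parts[0] = replacement; // transform the one column
                return string.Join(",", parts);
            }).ToList();
        }

        static Dictionary<string, string> BatchLookup(IEnumerable<string> keys)
        {
            var keyList = keys.Distinct().ToList();
            var map = new Dictionary<string, string>();

            using (var connection = new SqlConnection(ConnectionString))
            {
                connection.Open();

                // One parameterized IN query per chunk instead of one per row.
                var names = Enumerable.Range(0, keyList.Count).Select(i => "@k" + i);
                var sql = "SELECT [Key], [Value] FROM dbo.Lookup WHERE [Key] IN (" +
                          string.Join(",", names) + ")";

                using (var command = new SqlCommand(sql, connection))
                {
                    for (int i = 0; i < keyList.Count; i++)
                        command.Parameters.AddWithValue("@k" + i, keyList[i]);

                    using (var reader = command.ExecuteReader())
                        while (reader.Read())
                            map[reader.GetString(0)] = reader.GetString(1);
                }
            }
            return map;
        }
    }

    A table-valued parameter would scale better than the IN list for very large key sets, but the one-round-trip-per-chunk shape stays the same.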

    But I would not expect a huge performance boost, because there are two basic bottlenecks:

    reading and writing on the same disk, and querying the same database.

    The first means that parallel read/write access will not give a huge boost. The second means that, depending on the setup, you cannot use that many threads; otherwise you would saturate your query capacity (server tasks, latency, network bandwidth).


    Thursday, June 25, 2020 9:14 AM