locked
Quickest way to import blob data from multiple files using C# RRS feed

  • Question

  • Hello Developers,

    I have been referred to the C# corner as what I am looking for looks like cannot be efficiently handled by TSQL. I am doing a piece of work where I have to import millions of files of all sorts of extensions (msg, xlsx, pdf, etc) into SQL Server database. I need to import those files as blob data. The go to SQL OPENROWSET command is taking painfully long as it operates on a row by row basis. What I would like, is to grab multiple files at a time to reduce the processing time. One of the developers in the TSQL corner suggested that a multi-threaded C# method can do, but my C# skill is not up to that level. Can someone help with a handy script please?

    Kind regards,


    Mpumelelo

    Friday, January 10, 2020 1:07 PM

Answers

  • This is an update on the latest following my question. C# code for multi-threaded process proved challenging. I have, however, discovered a relatively quicker method, which, if I have to be honest, the solution has been all along staring at me. Instead of using OPENROWSET, I am now using the “Import Colum” component in SSIS Data Flow. While my entire solution is built on SSIS, I have been pulling BLOB data using OPENROWSET run from SSMS using a stored procedure that I call from SSIS. As said, I am now using SSIS “Import Column” component, and this is proving to be relatively faster and I’m happy with the progress.


    Mpumelelo

    • Marked as answer by Mpumelelo S Tuesday, January 14, 2020 12:29 PM
    • Edited by Mpumelelo S Tuesday, January 14, 2020 12:34 PM
    Tuesday, January 14, 2020 12:29 PM

All replies

  • First off, if this task of importing this many files is a one time thing and perhaps down the road less files or no files why not simply let the process run?

    If this is not a option you can learn about threading on this page which includes parallel techniques.

    In the end you are bound to disk operations so no matter what you do its going to take time and by experimenting you could be running the current process even if its slow.


    Please remember to mark the replies as answers if they help and unmarked them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.

    NuGet BaseConnectionLibrary for database connections.

    StackOverFlow
    profile for Karen Payne on Stack Exchange

    Friday, January 10, 2020 1:35 PM
  • [...] grab multiple files at a time [...] suggested that a multi-threaded C# method

    Be careful with this. If the table is indexed (any index, including the case where the table data stored in a clustered index) and you are running any version of SQL Server earlier than 2019, then using multiple threads becomes actually slower than using a single thread. I don't understand all the details about why it is so, but apparently it is related to the way in which SQL Server applies locks on the index during inserts, which creates contention and slows down the multithreaded operation when compared to using a single thread.

    The fastest way to insert lots of rows into SQL Server is to use a bulk insert, which from C# is produced by means of the SqlBulkCopy class. The way it works is that you create a DataTable with the same columns as the database table. The column for the blob would be of type byte[]. You use a loop to load your files into rows in the datatable. Do this in reasonably-sized batches (e.g., 1000 rows at a time, or whatever size is suitable so that amount of information loaded fits comfortably into memory). To load the blob use System.IO.File.ReadAllBytes. Once the DataTable is loaded, you pass it to SqlBulkCopy for writing into the database. This opens a streamed connection to the server that pumps all the data without generating individual Insert statements.

    As I mentioned, unless you have SQL Server 2019 -or- your table is stored in a Heap and doesn't have any index, then do the preceding from a single thread. At least, do the calls to SqlBulkCopy.Write in a single thread; the code that loads files into the datatable might benefit from multithreading on the condition that each thread loads files from a different disk spindle. Or you can use two threads, one for loading file and another for writing to the database by means of SqlBulkCopy. If you decide to code this, ensure that you apply appropriate locking to synchronize the threads when they simultaneously access the same object in memory.


    Friday, January 10, 2020 1:41 PM
  • Take a look at SqlBulkCopy. It is designed for bulk import into SQL. Honestly though I question whether you should be using C# at all. SSIS would probably be a faster/better implementation as it is designed to get data into/out of SQL from different sources and it is designed to do this in parallel.

    Michael Taylor http://www.michaeltaylorp3.net

    Friday, January 10, 2020 3:10 PM
  • Thank you for the responses. I will look into Alberto's suggestion first, failure to which I will resort to letting the process run as suggested by Karen. This is meant to be a one off load. The thing which is frustrating me is that I have never managed to get the process to run end-to-end during development. As it is, I am not sure how many days it will take to do so. 

    Michael, I am already using SSIS. But there is that point when I have to use the OPENROWSET query which is run in SSMS. Everything else is running superfast in my solution until the process hits the OPENROWSET where things get slowed down as I loop through row by row basis.

    Thank you for the suggestions. 

    Kind regards,


    Mpumelelo

    Friday, January 10, 2020 3:25 PM
  • In regards to not managing to get the process to run end-to-end, can you tell us what happens e.g. memory issues or something else?

    Please remember to mark the replies as answers if they help and unmarked them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.

    NuGet BaseConnectionLibrary for database connections.

    StackOverFlow
    profile for Karen Payne on Stack Exchange

    Friday, January 10, 2020 3:43 PM
  • The memory has been an issue a big time so much that I am now even running my process in batches of 500.

    Mpumelelo

    Friday, January 10, 2020 3:47 PM
  • The memory has been an issue a big time so much that I am now even running my process in batches of 500.

    Mpumelelo

    Perhaps you can show some code, there may be issues that can be resolved using a different code path or alternate strategy for the long process. 

    Please remember to mark the replies as answers if they help and unmarked them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.

    NuGet BaseConnectionLibrary for database connections.

    StackOverFlow
    profile for Karen Payne on Stack Exchange

    Friday, January 10, 2020 4:23 PM
  • This is an update on the latest following my question. C# code for multi-threaded process proved challenging. I have, however, discovered a relatively quicker method, which, if I have to be honest, the solution has been all along staring at me. Instead of using OPENROWSET, I am now using the “Import Colum” component in SSIS Data Flow. While my entire solution is built on SSIS, I have been pulling BLOB data using OPENROWSET run from SSMS using a stored procedure that I call from SSIS. As said, I am now using SSIS “Import Column” component, and this is proving to be relatively faster and I’m happy with the progress.


    Mpumelelo

    • Marked as answer by Mpumelelo S Tuesday, January 14, 2020 12:29 PM
    • Edited by Mpumelelo S Tuesday, January 14, 2020 12:34 PM
    Tuesday, January 14, 2020 12:29 PM