Should I Create Index before or after Inserting Large Volume of Data?

  • Question

  • Before posting my post, I have read a lot of such articles and posts, such as:

    https://www.sqlservercentral.com/forums/topic/index-creation-after-of-before-populate https://nakulvachhrajani.com/2011/11/07/sql-server-performance-best-practice-create-index-before-or-after-a-data-insert/

    However, my case is a bit different, which is why I am asking here.

    I have also posted the question at https://stackoverflow.com/questions/60219106/should-i-create-index-before-or-after-inserting-large-volume-of-data , but have not received an answer.

    I am using SQL Server 2008. In my database, there is a table MyTable, with the following structure:

    MyID (bigint)    MyData1 (bigint)     MyData2 (bigint)

    MyID is a unique ID for each record, but I did not declare it as UNIQUE when creating the table.

    Then I use Visual C++ 2008/ADO to access the table, as shown in the following pseudocode:

    Create MyTable
    
    // Method 1: Create Clustered Index for MyID here
    
    // Part1: Insert data to the table
    for (i = 0; i <= 500000; i++)
    {
       Read CurrentID, CurrentData1, CurrentData2 from File1
       Select MyID from MyTable Where MyID = CurrentID
       if Found nothing then
         Insert(CurrentID, CurrentData1, CurrentData2) to MyTable      
    }
    
    // Method 2: Create Non-Clustered Index for MyID here
    
    // Part2: Lookup data in the table
    for (j = 0; j <= 900000; j++)
    {
       Read CurrentID2 from File2
       Select MyData1 from MyTable Where MyID = CurrentID2
       if Found Then
         Do something
    }

    As you can see, my code consists of two parts. The first part inserts data, but during the insertion it also looks up the table to avoid inserting records with a duplicate MyID. The second part looks up records by MyID frequently.

    To improve the lookup performance, I create an index on MyID. I tried the following methods:

    1. Create Clustered Index for MyID, before the data insertion part.

    2. Create Non-Clustered Index for MyID, after the data insertion part, and before the data lookup part.

    To my surprise, method 2 makes the data-insertion part much slower than method 1, which seems to contradict the recommendation of "insert first, index next".

    My question is:

    1. Should I set MyID as UNIQUE when creating MyTable? If I set it as UNIQUE, then I do not need to look up before inserting, but inserting a record with a duplicate MyID will fail.

    2. Should I create a clustered index or a non-clustered index?

    3. Should I create the index before or after the data-insertion part?

    Sorry for so many questions, but they are related. Also, as there are many combinations of these choices, I would like some hints on which direction to try, since each test consumes a lot of time.

    My test of method 2 has been running for several days and has still not completed, but it has already taken much more time than method 1.

    Friday, February 14, 2020 11:35 PM

All replies

  • Hi,

    In your code you have:

    for (i = 0; i <= 500000; i++)
    {
       Read CurrentID, CurrentData1, CurrentData2 from File1
       Select MyID from MyTable Where MyID = CurrentID
       if Found nothing then
         Insert(CurrentID, CurrentData1, CurrentData2) to MyTable      
    }

    Instead of inserting in a loop and hitting the main table with its index, you should first insert into a temp table, #mytable. Once you have collected all the rows, insert them into the main table in one shot, preferably before creating the index. Then create a clustered index after inserting the data.
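
    A set-based sketch of that approach (the index name is illustrative, and the loading of #mytable itself is elided):

    CREATE TABLE #mytable (MyID bigint, MyData1 bigint, MyData2 bigint)

    -- ... load all rows from File1 into #mytable in one operation ...

    -- Move the rows to the main table in one statement, skipping IDs
    -- that are already present
    INSERT INTO MyTable (MyID, MyData1, MyData2)
    SELECT s.MyID, s.MyData1, s.MyData2
    FROM   #mytable AS s
    WHERE  NOT EXISTS (SELECT 1 FROM MyTable AS t WHERE t.MyID = s.MyID)

    -- Index the main table once, after the load
    CREATE CLUSTERED INDEX IX_MyTable_MyID ON MyTable (MyID)

    If File1 itself can contain duplicate MyID values, you also need a DISTINCT or GROUP BY on the staging table.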

    Saturday, February 15, 2020 3:41 AM
  • Since you are doing a lookup during the insert part, you need to have an index in place at this point, or else every lookup will result in a scan. No wonder it is taking days.

    However, the overall approach is wrong. As Soumen points out, you should load all rows at once into the table - or possibly in batches of quite many rows at a time.

    Furthermore, you should not send rows one by one. This adds a lot of overhead, and it is a pattern you should always avoid, not least in these days of the cloud. You should send the data all at once, or in batches. This can be done in several ways. The most efficient may be to load a temp table through the SqlBulkCopy class. A table-valued parameter can also be a good choice. I have an article about the latter on my web site to get you started: http://www.sommarskog.se/arrays-in-sql-2008.html

    You can then load the target table like this:

    ; WITH numbering AS (
       SELECT *, row_number() OVER(PARTITION BY ID ORDER BY something) AS rowno
       FROM   #temptbl
    )
    INSERT tbl (...)
       SELECT col1, col2, ...
       FROM   numbering
       WHERE  rowno = 1

    As for when to apply indexes, in my tests I have found that it is faster to have a clustered index in place before loading, but non-clustered indexes should be added after the load.
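
    As a sketch, with illustrative index names, that ordering looks like:

    -- Clustered index in place before the load
    CREATE UNIQUE CLUSTERED INDEX IX_MyTable_MyID ON MyTable (MyID)

    -- ... bulk-load the rows here ...

    -- Non-clustered indexes added after the load
    CREATE NONCLUSTERED INDEX IX_MyTable_MyData1 ON MyTable (MyData1)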


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, February 15, 2020 11:16 AM
  • Hi, Erland

    Thank you very much.

    1. Will BULK INSERT have the same performance as SqlBulkCopy or a table-valued parameter?

    https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15

    2. SqlBulkCopy is only available in .NET. Is there a similar function in ADO (C++) as well?

    3. For all these approaches, is there no way to prevent inserting records with duplicate MyID values?



    Saturday, February 15, 2020 11:45 AM
  • 1. Will BULK INSERT have the same performance as SqlBulkCopy or a table-valued parameter?

    BULK INSERT is certainly an option, if the format of the file is such that BULK INSERT can parse it. You may have to use a format file. I have an article about bulk load that you can read if the simple-minded approach does not work. http://www.sommarskog.se/bulkload.html

    Note that BULK INSERT requires that the file is somewhere where SQL Server can read it. But there is also the command-line tool BCP, which I also cover in my article.
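
    A minimal sketch, assuming File1 lives at a path the server can read and is comma-delimited with one record per line (the path and delimiters are placeholders; other layouts may need a format file):

    BULK INSERT #mytable
    FROM 'C:\data\File1.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)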

    2. SqlBulkCopy is only available in .NET. Is there a similar function in ADO (C++) as well?

    Since ADO is very old, I assumed that you meant ADO .NET. No, SqlBulkCopy is a .NET class, and ADO as such does not offer any bulk-load alternatives. However, the underlying OLE DB provider offers a bulk-load interface. I would not recommend it, though, since it is not easy to get started with. (I've programmed OLE DB myself, but I've never used the bulk-load API.) The bulk-copy interface in ODBC is easier to use. (And overall, ODBC is a better choice for C++ programming, since ODBC is still being developed, whereas ADO lacks support for several features in SQL 2005 and later.)

    3. For all these approaches, is there no way to prevent inserting records with duplicate MyID values?

    There are.

    As long as you are reading the file in C++, one option is of course to handle the duplicates in the C++ code, by saving the keys in a hash table and then looking them up there.

    If you use BULK INSERT, you obviously cannot do that, but you can load the data into a staging table. This is often needed anyway to clean things up, for instance when there are things in the file format that BULK INSERT cannot handle.

    You can also use OPENROWSET(BULK), which returns a result set and thus permits you to weed out duplicates in the manner I showed in my previous post. See my bulk-load article for more information about OPENROWSET(BULK).
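
    A sketch of that pattern, with placeholder file and format-file paths:

    INSERT INTO MyTable (MyID, MyData1, MyData2)
    SELECT MyID, MyData1, MyData2
    FROM  (SELECT src.*,
                  row_number() OVER (PARTITION BY MyID ORDER BY MyID) AS rowno
           FROM   OPENROWSET(BULK 'C:\data\File1.dat',
                             FORMATFILE = 'C:\data\File1.fmt') AS src) AS numbered
    WHERE  rowno = 1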

    Finally, there is an option when you create the index. You can add the clause

       WITH (IGNORE_DUP_KEY = ON)

    With this option, duplicate keys will not cause an error; the rows that come second and third are simply dropped on the floor. Note that with this option you don't really have any control over which record from the file is inserted.
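
    For example (the index name is illustrative):

    CREATE UNIQUE CLUSTERED INDEX IX_MyTable_MyID
        ON MyTable (MyID)
        WITH (IGNORE_DUP_KEY = ON)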


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, February 15, 2020 12:47 PM
    Adding on to Erland's answer, one could use managed C++ in order to leverage ADO.NET functionality without resorting to low-level OLE DB interfaces (e.g. IRowsetFastLoad) or ODBC function calls (e.g. bcp_*). However, that would require either taking the leap into the managed world of .NET (also moving from ADO to ADO.NET) or using mixed-mode assemblies to co-mingle managed and unmanaged code, along with the interop considerations.

    The low-level SQL Server OLE DB and ODBC drivers also support table-valued parameters, which could be used as the source for a T-SQL solution. I would expect a T-SQL set-based approach with a TVP source to take seconds (minutes at worst) with appropriate indexes.
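
    A sketch of the T-SQL side of such a TVP solution (type, procedure, and column names here are hypothetical):

    CREATE TYPE MyRowType AS TABLE
        (MyID bigint PRIMARY KEY, MyData1 bigint, MyData2 bigint)
    go
    CREATE PROCEDURE LoadMyTable @rows MyRowType READONLY AS
    INSERT INTO MyTable (MyID, MyData1, MyData2)
    SELECT r.MyID, r.MyData1, r.MyData2
    FROM   @rows AS r
    WHERE  NOT EXISTS (SELECT 1 FROM MyTable AS t WHERE t.MyID = r.MyID)

    The PRIMARY KEY on the table type also enforces uniqueness within a single batch before the rows ever reach MyTable.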

    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, February 15, 2020 1:39 PM
  • hi

    First, you should use bulk insert, so that communication between the front end and the back end is minimised. Second, note that without a unique, auto-incrementing key, the clustered index must keep the data sorted as rows are inserted.

    Sunday, February 16, 2020 7:55 AM
  • Hi tempc, 

    Thank you for your detailed reply.

    ----Should I set MyID as UNIQUE when creating MyTable? If I set it as UNIQUE, then I do not need to look up before inserting, but inserting a record with a duplicate MyID will fail.

    It depends on your requirements. How would you like to define your unique value? If it is like 1, 2, 3, 4, ..., maybe you can try IDENTITY; please check CREATE TABLE (Transact-SQL) IDENTITY (Property).
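
    For example, if MyID does not have to come from the file, IDENTITY can generate 1, 2, 3, ... automatically:

    CREATE TABLE MyTable
        (MyID    bigint IDENTITY(1,1) PRIMARY KEY,
         MyData1 bigint,
         MyData2 bigint)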

    ----Should I create a clustered index or a non-clustered index?

    Please check the difference between them: Clustered and Nonclustered Indexes Described.

    ----Should I create the index before or after the data-insertion part?

    Maybe after: if you create the index first and then insert the data, it might cause a performance issue. It also depends on your requirements.

    By the way, the script you provided is C++, which I am not familiar with, so if I have misunderstood anything, please let me know.

    Best Regards,

    Rachel 


    MSDN Community Support

    Monday, February 17, 2020 4:38 AM