Copy data from one table to another

    Question

  • On the OLE DB Source, I have the following:

    select field1, field2, field3 from table_source

    On the OLE DB Destination, I used the fast load option with table lock and check constraints checked to copy data to table_destination.  Both table_source and table_destination have the same table definition.  After the dtsx package ran, the number of rows copied over to the destination was not the same.  I get different results from different runs.  This only happens when the source table is over 1 million rows and when I'm copying from the transaction database where user activity is heavy.
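
    A quick way to verify the discrepancy after a run, assuming you can query both tables directly, is to compare row counts (a minimal sketch using the table names above):

    -- Row counts immediately after the package completes
    SELECT COUNT(*) AS source_rows      FROM table_source;
    SELECT COUNT(*) AS destination_rows FROM table_destination;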

    Any ideas how I can modify the package so it can copy correct data?  Thanks!

    Monday, June 05, 2006 6:47 PM

Answers

  • Is the source reading the correct number of rows but the destination is not inserting the right number?

    Is the destination table empty? Is it possible that having constraint checking on is causing some rows to fail?

    I suggest you use SQL Profiler on the source and destination servers to see what SQL Server is actually doing.

    Donald

    Monday, June 05, 2006 9:08 PM
  • It looks like these additional rows may be getting added in the destination table.

    I would address this systematically, first by dropping the triggers (and perhaps the constraints) and loading a simple "heap" table from the source. If no duplicates are added when loading a heap, then you have a problem with the triggers.

    From there you could add constraints and triggers back one at a time until you find the cause of the error.

    Donald

    Friday, June 09, 2006 5:16 AM

All replies

  • Is the source reading the correct number of rows but the destination is not inserting the right number?

    Is the destination table empty? Is it possible that having constraint checking on is causing some rows to fail?

    I suggest you use SQL Profiler on the source and destination servers to see what SQL Server is actually doing.

    Donald

    Monday, June 05, 2006 9:08 PM
  • Yes, the source is reading the correct number of rows but the destination is not inserting the right number.  The source table is, let's say, 1 million rows but the destination table is 1 million and 5 rows; I've checked the row counts before and after the insert.  These 5 extra rows are duplicates.  Each execution returns different duplicated rows; sometimes 5, sometimes 10, but I've not found more than 20 duplicated rows yet.

    Yes, I always empty the destination table before the insert; I even check before the insert and return a value to a temp table to make sure the table is empty.

    There's a primary key constraint on both source and destination tables.  I've removed the constraint on the destination table to copy all the data over.  This is how I found the duplicated rows after the insert.  If I keep the constraint, the package will fail because with fast load I cannot redirect the failed rows.
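
    A duplicate check along these lines surfaces the extra rows (a minimal sketch, assuming field1 from the source query above is the unique key column):

    -- Hypothetical duplicate check against the destination after the load
    SELECT field1, COUNT(*) AS copies
    FROM table_destination
    GROUP BY field1
    HAVING COUNT(*) > 1;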

    It's very hard to debug this issue because the result is not consistent.  Sometimes it copies correctly, sometimes not.  Through SQL Profiler, it takes a while to watch it finish copying the data, and I've not found anything that I think may cause the issue yet.

    Could this be an SSIS bug?

     

    Tuesday, June 06, 2006 4:27 AM
  • Is there anything else in the package, or is this just a straight source-destination data load?

    Donald

    Tuesday, June 06, 2006 4:03 PM
  • On the Control Flow, I've only one Data Flow Task.  On the Data Flow, I've an OLE DB Source and an OLE DB Destination.  There's no other task in the package.
    Tuesday, June 06, 2006 5:00 PM
  • BTW, I just upgraded the server to SP1 and ran my package again.  Out of 3 runs during business hours (update/insert/delete activity executes against the source table during the copying), 1 run copied 2 duplicated rows, and the table is about 240,000 rows.  The other 2 runs are ok.

    I have had this package scheduled on the SQL Server Agent to run at night for a while; most of the time (about 95%) it copies the data correctly.  The other 5% of the time it does not, which is usually because we have other problems with the source server, so the job is delayed until morning.

     

    Tuesday, June 06, 2006 5:31 PM
  • Have you tried the SQL export option?  Try using the built-in SQL export wizard to transfer the data.  During the export wizard, there is also an option to save the package.  You can compare your version with theirs to see if there are any differences.

    As another option, you might customize your package so that it first queries the last processed id on the destination and only grabs the new data from the source.  Since your data are coming in sequential order, you may not need to turn on transactions.  If the task fails, you can restart from the last failed id.
    Tuesday, June 06, 2006 11:03 PM
  • Tianyu, I'm sorry but I do not understand your comments.  Can you give more details of how to accomplish that?  Thank you.
    Tuesday, June 06, 2006 11:22 PM
  • For option 1:
    1. Open Microsoft SQL Server Management Studio
    2. Drill down to the database you need to export
    3. Right-click the database and choose Export Data under Tasks
    4. Select the data source and data destination
    5. Select Copy data from one or more tables or views, click Next
    6. Map the tables and click Edit Mappings, customize the options as you need
    7. Click Next, and you will see the "Save SSIS Package" option.
    We found out the built-in SQL export wizard is actually creating SSIS packages and using them to do the data transfer.

    For option 2:
    I guess the reason to enable transactions is to avoid failing halfway through and losing track of the data.
    One way to "work around" the problem is to use the last processed id.
    1. Do a SELECT ISNULL(MAX(IdentityID), 0) FROM DestinationTable and store the result as a user variable.
    2. At the OLE DB Source, your select statement would be something like SELECT IdentityID, Field1, Field2, Field3 FROM SourceTable WHERE IdentityID > ? ORDER BY IdentityID ASC (see the sketch after this list).
    3. At the destination, you also store the IdentityID.
    If it fails at any point, you can always find the failing point and restart the package once the source data is corrected; this way you avoid using transactions.
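
    Put together, the two queries look roughly like this (a sketch only, using the hypothetical table and column names from the steps above):

    -- 1. Last processed id on the destination, stored in a package variable
    SELECT ISNULL(MAX(IdentityID), 0) AS LastProcessedID
    FROM DestinationTable;

    -- 2. OLE DB Source query; the ? parameter is mapped to that variable
    SELECT IdentityID, Field1, Field2, Field3
    FROM SourceTable
    WHERE IdentityID > ?
    ORDER BY IdentityID ASC;
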
    Tuesday, June 06, 2006 11:37 PM
  • Tianyu, thank you for the instructions.  As for option 1, I've already tried it but still got the duplicated rows...same result as with my dtsx; sometimes it copies correctly and sometimes not.  As for option 2, I already have a work-around: 1. remove the constraints on the destination table, allow the package to copy all the data over, then have a script look for the duplicated rows and remove one copy of each duplicate (sketched below); or 2. use the normal data load (no fast load) in the package, keep the constraint on the destination table, and redirect the failed rows to a text file.  I'm working on a data warehouse project and using SSIS to ETL the data.  I've found the issue but am not sure of the root cause...I just need to know whether this is a bug in SQL and whether SSIS is the correct tool for data warehousing or not.
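
    The clean-up script for work-around 1 is essentially a windowed delete (a minimal sketch only, assuming field1 is the duplicated key and field3 is a timestamp whose latest copy should be kept; adjust to the real columns):

    -- Keep one row per key, delete the rest (SQL Server 2005 syntax)
    WITH numbered AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY field1 ORDER BY field3 DESC) AS rn
        FROM table_destination
    )
    DELETE FROM numbered
    WHERE rn > 1;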

     

    Wednesday, June 07, 2006 4:43 PM
  • Keep me up to date; if this is a bug then it is a nasty one. I am working on a data warehouse project too and don't want unclean data going into the DW.

    One more thing I would try is to first load the data into a Raw File Destination, then in another data flow load the DW from that raw file. This way we hold the dataset from the transaction database for the shortest time. Separating the dataset into smaller sets is something else you could try to work around the problem.

    This is an interesting topic. :)

    Wednesday, June 07, 2006 5:32 PM
  • ...correction to the first message:

    <<This only happens when the source table is over 1 million rows..>>

    The issue occurs regardless of whether the source table is over 1 million rows or not.  I have a table of about 200 thousand rows and it copied incorrectly.

    Wednesday, June 07, 2006 10:42 PM
  • >If I keep the constraint, the package will fail because with fast load I cannot redirect the failed rows.

    Here's something to try. You can redirect the failed rows if the MaxInsertCommitSize is not equal to 0. I suggest adding a batch size and setting MaxInsertCommitSize too. Then redirect the rows and let's see what happens.

    What is the primary key constraint you are using? Is it possible at any point that you have nulls in constrained columns?
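
    A quick way to check the latter (a minimal sketch, assuming field1 from the source query is the constrained column):

    -- Count rows whose key column is NULL in the source
    SELECT COUNT(*) AS null_keys
    FROM table_source
    WHERE field1 IS NULL;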

    Donald

    Thursday, June 08, 2006 12:02 AM
  • Donald,

    Here's the sample of my source table:

    Table_source
    -------------------
    Column_1  varchar(15)  not null  (unique key)
    Column_2  binary       null
    Column_3  datetime     not null  (audit trigger in the source for row inserted date, default as getdate() for insert)
    Column_4  datetime     not null  (audit trigger in the source for row updated date, default as getdate() for insert/update)

    Table_source sample data
    Column_1  Column_2            Column_3                 Column_4
    --------  ------------------  -----------------------  -----------------------
    123ABC    0x000000002A5270C3  2006-06-08 11:03:30.883  2006-06-08 11:03:30.883

    Table_destination has the same table definition as Table_source.  I set rows per batch and a max insert commit size not equal to 0, and redirected the failed rows to a text file.  It ended up that a row, e.g. the one with 123ABC, was redirected to the text file and was not copied over to table_destination, with ErrorCode=-1071607685, ErrorColumn=0.  Btw, is there a way to translate what these ErrorCode and ErrorColumn values mean?  All of the redirected rows in the text file have these values.

    If I remove the unique key constraint on Table_destination, this is what it looks like:

    Column_1  Column_2            Column_3                 Column_4
    --------  ------------------  -----------------------  -----------------------
    123ABC    0x000000002B36DA37  2006-06-07 11:03:30.883  2006-06-07 11:03:30.883
    123ABC    0x000000002A5270C3  2006-06-07 11:03:30.883  2006-06-08 11:11:57.153

    Table_destination will end up with two 123ABC rows; Column_2 will have a different binary value, and Column_4 will have a different timestamp (sometimes the two rows will have exactly the same timestamp).

    After the data were copied over, I checked the source table and it had a row like this:

    Column_1  Column_2            Column_3                 Column_4
    --------  ------------------  -----------------------  -----------------------
    123ABC    0x000000002A5270C3  2006-06-07 11:03:30.883  2006-06-08 11:11:57.153

    There's no 123ABC row with a timestamp of 2006-06-07 11:03:30.883.  It looks like that was the row in the source when the package started, and the later timestamp is from after the package ran; some update happened during the data extract.

    Any ideas what went wrong? 

    Thanks for previous suggestions.

    Thursday, June 08, 2006 8:28 PM
  • By the way, I got this error code on the redirected rows before SP1 was applied:

    ErrorCode=-1071607683, ErrorColumn=0

    And below is what I get after SP1 was applied:

    ErrorCode=-1071607685, ErrorColumn=0

    My guess is that this is a primary key constraint violation, but I don't know of a way to translate this number. 

     

    Thursday, June 08, 2006 8:42 PM
  • It looks like these additional rows may be getting added in the destination table.

    I would address this systematically, first by dropping the triggers (and perhaps the constraints) and loading a simple "heap" table from the source. If no duplicates are added when loading a heap, then you have a problem with the triggers.
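
    A minimal way to set that test up, assuming the table definition posted earlier in the thread, is to load into a bare heap copy of the destination with no key, indexes, or triggers, and then look for duplicates:

    -- Hypothetical heap copy of the destination (no constraints, no triggers)
    CREATE TABLE table_destination_heap
    (
        Column_1 varchar(15) NOT NULL,   -- deliberately no unique key
        Column_2 binary(8)   NULL,       -- length assumed from the 8-byte sample values
        Column_3 datetime    NOT NULL,
        Column_4 datetime    NOT NULL
    );

    -- After pointing the OLE DB Destination at the heap and rerunning the package:
    SELECT Column_1, COUNT(*) AS copies
    FROM table_destination_heap
    GROUP BY Column_1
    HAVING COUNT(*) > 1;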

    From there you could add constraints and triggers back one at a time until you find the cause of the error.

    Donald

    Friday, June 09, 2006 5:16 AM
  • Do you use (nolock) with your data source?

    We had the same problem once, and the reason was that there were page splits during reading, and some rows were read twice because of that.

    There could be other data integrity issues with (nolock).
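
    For reference, the difference is just the table hint on the source query (a sketch using the query from the first post; with NOLOCK the engine may use an allocation-order scan, which is one way a row can be returned twice when pages split mid-read):

    -- Source query with the dirty-read hint
    SELECT field1, field2, field3 FROM table_source WITH (NOLOCK);

    -- Same query without the hint (default READ COMMITTED locking)
    SELECT field1, field2, field3 FROM table_source;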

    James

    Friday, June 09, 2006 7:07 AM
  • >>Do you use (nolock) with your data source?

    Yes.  I've tested without using nolock also.  The rows were still copied twice with or without nolock. 

    >>We had the same problem once, and reason was there were page splits during reading

    So what did you do to fix the problem?  I'm interested to know your resolution. 

    Our ETL runs for hours daily against production; I don't think it's good practice to lock our production tables during this time.

    Thank you for your comments.

    Friday, June 09, 2006 8:45 PM
  • I actually saved this method for last, until I couldn't find any other way to troubleshoot, because it'll be a lot of work on our production (source) side.  But it looks like I have no other way now.  Will try this and keep you updated.

    Ash

    Friday, June 09, 2006 8:56 PM
  • I am also vitally interested in this thread, as I am working on a DW project that will be doing ETL from a 100 million record table, and duplicates in my case would be almost disastrous. Please advise on how you resolved the issue of cross-page reads being the cause of duplicate rows.

    thanks

    Wednesday, June 14, 2006 5:03 AM
  • Just to set this in context - I have never heard of cross-page reads causing duplicates before and we don't have a repro of that situation. And the original thread here involves duplication appearing at a destination that appears to have triggers that may result in new records being created. In neither case are we seeing more detail or repros yet.

    In other words, I don't see any fundamental issue that would affect your scenario. It would be interesting to know more about your 100M records - either privately at DonaldDotFarmerAtMicrosoftDotCom, or on another thread - to see if we could help in advance with optimizations.

    Donald

    Friday, June 16, 2006 9:44 PM
  • Thanks for the reassuring reply, and let me know what is the best way to continue the optimization issue, on another thread or....

    Frango

    Friday, June 16, 2006 9:51 PM