Duplicate Key Error - Any Way Around This?

  • Question

  • In one SSIS package I'm working with, I'm required to use a non-unique field in a given OLE DB Source as a primary key for a given OLE DB Destination. Obviously, this causes problems when there are duplicates in that field.

     

    I've attempted to handle this gracefully by having the data flow redirect the failing rows to another OLE DB Destination, but according to the errors I'm getting, the attempt to insert a duplicate key seems to generate an unrecoverable error. Here are the log entries for the error.

     

    SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005.  An OLE DB record is available.  Source: "Microsoft SQL Native Client"  Hresult: 0x80004005  Description: "Cannot insert duplicate key row in object 'dbo.TABLE' with unique index 'PK_KEY'.".


    SSIS Error Code DTS_E_PROCESSINPUTFAILED.  The ProcessInput method on component "Copy to TABLE" (142) failed with error code 0xC0202009. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.  There may be error messages posted before this with more information about the failure.
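
    For reference, the first error is just SQL Server's standard unique-index violation surfacing through the OLE DB provider. A minimal sketch like the following (table, index, and column names are placeholders, not the real ones) reproduces it outside SSIS:

      Code Snippet
      -- minimal repro of the underlying error; all names below are placeholders
      create table dbo.DemoTable (KeyColumn int not null)
      create unique index IX_DemoTable_KeyColumn on dbo.DemoTable (KeyColumn)
      insert into dbo.DemoTable (KeyColumn) values (1)
      insert into dbo.DemoTable (KeyColumn) values (1)
      -- the second insert fails with the same "Cannot insert duplicate key row" message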

     

    Additional information: all OLE DB objects point to SQL Server 2005 database tables, and this data flow is within a transaction.

     

    My question is this: is there any setting I can use that will cause the data flow to behave as desired? Does this error result in an unrecoverable error because SSIS attempts to submit all the records in one batch, and is there any way to get around that? Am I going to need to find a different way to handle this, or break my data flow into two separate transactions? (The data flow in question is the third operation in the transaction, and the first two succeed just fine.)

    Wednesday, September 3, 2008 5:42 PM

Answers

  • Hi Brad,

     

    I recently encountered this while trying to write validations for a data load process. After some umming and ahing I opted for a custom Script Component transformation that performs my validations on each row and redirects it to a different output if validation fails. In this flow I perform the following validations:

    • Duplicate keys
    • NULL values
    • Failed data type conversion

    That way I get all the validation failures per row in one go. I won't go into detail about the others as they are not relevant to your question, but for duplicate keys I generate a flag column that tells me whether a particular row's key is a duplicate or not. I inspect this flag in my script and redirect accordingly. To generate this flag you can either:

    • Do the logic in your SQL query by writing something like this:

      Code Snippet
      -- "source_table", "key_column", and "other_columns" are placeholders for your actual names
      select t.key_column, t.other_columns, is_duplicate = ISNULL(dups.is_duplicate, 0)
      from source_table t
      left join
      (
          select key_column, is_duplicate = 1
          from source_table
          group by key_column
          having count(*) > 1
      ) dups
          on t.key_column = dups.key_column
       
    • Have two separate data sources and queries, one for the data and one for the duplicate keys, and combine them with a Merge Join transformation (left outer join) in the data flow, along the lines of the sketch below
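
      The duplicate-keys source is essentially the subquery from the first option on its own. A minimal sketch, using the same placeholder names:

      Code Snippet
      select key_column, is_duplicate = 1
      from source_table
      group by key_column
      having count(*) > 1

      Note that the Merge Join transformation requires both inputs to be sorted on the join key.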

    Note that with my approach, if a key is a duplicate I fail all of the duplicate rows, rather than letting the first one through and failing the rest.
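
    If you would rather let one row per key through and flag only the extras, a ROW_NUMBER-based variant of the query is one option. This is just a sketch, assuming SQL Server 2005 or later and the same placeholder names; the ORDER BY inside the window is arbitrary here, so use a real column if it matters which row survives:

      Code Snippet
      select key_column,
             other_columns,
             is_duplicate = case when rn = 1 then 0 else 1 end
      from
      (
          select key_column,
                 other_columns,
                 rn = row_number() over (partition by key_column order by key_column)
          from source_table
      ) x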

    Wednesday, September 3, 2008 6:04 PM
    Moderator

All replies

  • I think that you can't have a single-column primary key with duplicates... and that's why it's throwing the error.


    If you can have more than one record with the same primary key... then I'm missing something.
    Wednesday, September 3, 2008 5:55 PM
  • Sorry for my previous post... I totally misunderstood the initial post...
    Wednesday, September 3, 2008 6:15 PM
  •  Brad Vaughan wrote:

     

    My question is this: is there any setting I can use that will cause the data flow to behave as desired? Does this error result in an unrecoverable error because SSIS attempts to submit all the records in one batch, and is there any way to get around that? Am I going to need to find a different way to handle this, or break my data flow into two separate transactions? (The data flow in question is the third operation in the transaction, and the first two succeed just fine.)

     

    You can control the number of rows that the OLE DB Destination commits at one time by setting the MaxInsertCommitSize. 0 means commit all the rows in one batch, 1 would commit one row at a time (very slow), and 1000 would commit 1,000 rows at a time.

     

    Setting that to a value greater than 0 is necessary if you want to redirect the error rows from the OLE DB Destination.

    Wednesday, September 3, 2008 7:03 PM
    Moderator
  • Yes, but even then it doesn't just redirect the error rows; it fails the whole batch, which includes good rows, and that is usually not what you want. As you said, setting the commit size to 1 is very slow. This is why I feel the best way is to handle it explicitly: identify and redirect the bad rows so that you can be sure that what goes to your destination table will succeed.

    Wednesday, September 3, 2008 7:40 PM
    Moderator
  • JWelch:

     

    Looking for MaxInsertCommitSize, I don't see it here. I understand this is an SP1-only feature, and I'm not sure if we have it - but to clarify before I go and ask, does this require SP1 for SQL Server 2005 in its entirety, or just for the SSIS modules?

    Wednesday, September 3, 2008 7:49 PM
  • Adam Tappis:

     

    Thank you for the suggestion. Since I do not think I will be able to implement JWelch's suggestion (we don't seem to be on SP1 yet), I will likely go with yours. It is not perfect, but it will do for the time being.

    Wednesday, September 3, 2008 7:51 PM
  •  Adam Tappis wrote:

    Yes, but even then it doesn't just redirect the error rows; it fails the whole batch, which includes good rows, and that is usually not what you want. As you said, setting the commit size to 1 is very slow. This is why I feel the best way is to handle it explicitly: identify and redirect the bad rows so that you can be sure that what goes to your destination table will succeed.

     

    Actually, in this scenario (when I am worried about duplicate keys) I normally use an OLE DB Destination set with a batch size of 10,000 that redirects error rows to another OLE DB Destination with a batch size of 1. This lets the bulk of the processing occur quickly, while still identifying individual error rows.

     

    The issue with using a JOIN to identify duplicates is that the data needs to already be in the same database (or at least on the same server). In most scenarios I run into, that's not the case.

    Thursday, September 4, 2008 2:46 PM
    Moderator
  • Actually, in this scenario (when I am worried about duplicate keys) I normally use an OLE DB Destination set with a batch size of 10,000 that redirects error rows to another OLE DB Destination with a batch size of 1. This lets the bulk of the processing occur quickly, while still identifying individual error rows.

     

    Can you clarify this please, because I'd be really interested in getting this to work if possible? Anything to simplify the logic, especially as there could be other constraints (unique, check) on the target table that cause the failure. I just couldn't find the right combination of property values to make this work.

     

    The issue with using a JOIN to identify duplicates is that the data needs to already be in the same database (or at least on the same server). In most scenarios I run into, that's not the case.

     

    I always use staging tables so in my case it is.
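
    For what it's worth, once the data is in a staging table on the same server, the duplicate check can also be pushed into the load statement itself. A rough sketch, where all table and column names are placeholders:

      Code Snippet
      -- move only rows whose key is unique in staging and not already in the target
      insert into dbo.TargetTable (key_column, other_columns)
      select s.key_column, s.other_columns
      from dbo.StagingTable s
      where not exists (select 1 from dbo.TargetTable t where t.key_column = s.key_column)
        and s.key_column not in
            (
                select key_column
                from dbo.StagingTable
                group by key_column
                having count(*) > 1
            )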

    Thursday, September 4, 2008 7:06 PM
    Moderator
  • I posted an article and sample package to my blog here: http://agilebi.com/cs/blogs/jwelch/archive/2008/09/05/error-redirection-with-the-ole-db-destination.aspx

     

    The only settings you really have to change are the Max Insert Commit Size and turning error redirection on. You might want to turn table lock off if you are inserting a large number of rows, but I didn't have to for my sample.

    Friday, September 5, 2008 7:49 AM
    Moderator
  • Excellent. Good idea. Thanks John.

    Friday, September 5, 2008 9:28 AM
    Moderator