How to delete rows not in all 3 tables

    Question

  • How do I write a query to delete all the rows in all 3 tables that do not match values in the other tables, so that all 3 tables end up with the same values for X and the same number of rows? The real query has multiple different columns in each table, but they all share column X, which holds account numbers. All 3 tables should end up with the values (1, 2, 5, 6, 8, 9) in them. This needs to work regardless of which table has the most or least rows; in other words, if you have 3 separate queries, I am not sure whether the order in which they are run will affect the end results. I just have a mental block here.

    CREATE TABLE #A (x INT)
    INSERT #A (x) VALUES (1)
    INSERT #A (x) VALUES (2)
    INSERT #A (x) VALUES (3)
    INSERT #A (x) VALUES (4)
    INSERT #A (x) VALUES (5)
    INSERT #A (x) VALUES (6)
    INSERT #A (x) VALUES (7)
    INSERT #A (x) VALUES (8)
    INSERT #A (x) VALUES (9)
    
    CREATE TABLE #B (x INT)
    INSERT #B (x) VALUES (1)
    INSERT #B (x) VALUES (2)
    INSERT #B (x) VALUES (3)
    INSERT #B (x) VALUES (4)
    INSERT #B (x) VALUES (5)
    INSERT #B (x) VALUES (6)
    INSERT #B (x) VALUES (8)
    INSERT #B (x) VALUES (9)
    
    CREATE TABLE #C (x INT)
    INSERT #C (x) VALUES (1)
    INSERT #C (x) VALUES (2)
    INSERT #C (x) VALUES (5)
    INSERT #C (x) VALUES (6)
    INSERT #C (x) VALUES (7)
    INSERT #C (x) VALUES (8)
    INSERT #C (x) VALUES (9)
    
    SELECT x FROM #A
    SELECT x FROM #B
    SELECT x FROM #C
    
    -- Delete all rows in all 3 tables not in the other tables using a join
    -- because the real query has multiple columns in each table
    
    SELECT x FROM #A
    SELECT x FROM #B
    SELECT x FROM #C
    
    DROP TABLE #A
    DROP TABLE #B
    DROP TABLE #C
    

    Thursday, July 04, 2013 2:37 AM

Answers

  • Thanks for your DDL and sample data. The other repliers have pretty good suggestions. I think INTERSECT is the perfect way to get your key set: you can use a temp table to hold this set, then run a MERGE against each target table to delete rows and output the deleted rows to another holding table. It does not depend on the exact structure of your target tables, as long as they share the key column described in your question.

    Here is the query you can test to see whether it works for you:

    CREATE TABLE #A (x INT)
    INSERT #A (x) VALUES (1)
    INSERT #A (x) VALUES (2)
    INSERT #A (x) VALUES (3)
    INSERT #A (x) VALUES (4)
    INSERT #A (x) VALUES (5)
    INSERT #A (x) VALUES (6)
    INSERT #A (x) VALUES (7)
    INSERT #A (x) VALUES (8)
    INSERT #A (x) VALUES (9)
    CREATE TABLE #B (x INT)
    INSERT #B (x) VALUES (1)
    INSERT #B (x) VALUES (2)
    INSERT #B (x) VALUES (3)
    INSERT #B (x) VALUES (4)
    INSERT #B (x) VALUES (5)
    INSERT #B (x) VALUES (6)
    INSERT #B (x) VALUES (8)
    INSERT #B (x) VALUES (9)
    CREATE TABLE #C (x INT)
    INSERT #C (x) VALUES (1)
    INSERT #C (x) VALUES (2)
    INSERT #C (x) VALUES (5)
    INSERT #C (x) VALUES (6)
    INSERT #C (x) VALUES (7)
    INSERT #C (x) VALUES (8)
    INSERT #C (x) VALUES (9)
    SELECT x FROM #A
    SELECT x FROM #B
    SELECT x FROM #C
    -- Temp table to hold the common keys
    CREATE TABLE #Temp (x INT)
    -- Deleted rows are output to this table after each deletion
    CREATE TABLE #Exception (x INT, src VARCHAR(50))

    ;WITH mycte AS
    (SELECT x FROM #A
     INTERSECT
     SELECT x FROM #B
     INTERSECT
     SELECT x FROM #C)
    INSERT INTO #Temp (x)
    SELECT x FROM mycte

    MERGE #A AS target
    USING #Temp AS source ON source.x = target.x
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE
    OUTPUT Deleted.x, '#A' INTO #Exception ;

    MERGE #B AS target
    USING #Temp AS source ON source.x = target.x
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE
    OUTPUT Deleted.x, '#B' INTO #Exception ;

    MERGE #C AS target
    USING #Temp AS source ON source.x = target.x
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE
    OUTPUT Deleted.x, '#C' INTO #Exception ;

    DROP TABLE #Temp
    SELECT x FROM #A
    SELECT x FROM #B
    SELECT x FROM #C
    SELECT * FROM #Exception ;
    DROP TABLE #A
    DROP TABLE #B
    DROP TABLE #C
    DROP TABLE #Exception

    Friday, July 05, 2013 3:15 PM

All replies

  • I don't think using JOIN alone will solve the problem, because a join relates two tables (a left and a right) at a time, even though we can have multiple joins within a SELECT statement.

    Refer to the solution below.

    ------------------------SOLUTION
    DECLARE @TEMP TABLE (TABLENAME NVARCHAR(10),X INT,FLAG NVARCHAR(10))
    INSERT INTO @TEMP
    SELECT 'A' TABLENAME, X,CASE
    WHEN X IN (SELECT X FROM #B) AND X IN (SELECT X FROM #C) THEN 'Y' 
    ELSE 'N'
    END AS FLAG
    FROM #A 
    ------------
    UNION ALL
    ------------
    SELECT 'B' TABLENAME,X,CASE
    WHEN X IN (SELECT X FROM #C) AND X IN (SELECT X FROM #A) THEN 'Y' 
    ELSE 'N'
    END AS FLAG
    FROM #B 
    ------------
    UNION ALL
    ------------
    SELECT 'C' TABLENAME,X,CASE
    WHEN X IN (SELECT X FROM #A) AND X IN (SELECT X FROM #B) THEN 'Y' 
    ELSE 'N'
    END AS FLAG
    FROM #C 
    ------------------ DELETING
    DELETE #A WHERE X IN (SELECT X FROM @TEMP WHERE FLAG='N' AND TABLENAME='A')
    DELETE #B WHERE X IN (SELECT X FROM @TEMP WHERE FLAG='N' AND TABLENAME='B')
    DELETE #C WHERE X IN (SELECT X FROM @TEMP WHERE FLAG='N' AND TABLENAME='C')
    ------------------ DISPLAYING
    SELECT * FROM #A
    SELECT * FROM #B
    SELECT * FROM #C


    Regards, RSingh


    • Edited by RSingh() Thursday, July 04, 2013 3:32 AM txt addition
    Thursday, July 04, 2013 3:28 AM
  • Assuming column x cannot contain NULL, then

    Declare @RowsToKeep Table(x int);
    ;With cte As
    (Select x From #A Intersect Select x From #B)
    Insert @RowsToKeep(x)
    Select x From cte Intersect Select x From #C;
    
    Delete From #A Where Not Exists(Select x From @RowsToKeep r Where #A.x = r.x); 
    Delete From #B Where Not Exists(Select x From @RowsToKeep r Where #B.x = r.x); 
    Delete From #C Where Not Exists(Select x From @RowsToKeep r Where #C.x = r.x); 
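    The NULL hedge matters because the two halves of this approach disagree about NULL: the set operators treat two NULLs as equal, but the predicate #A.x = r.x is UNKNOWN for NULL, so NOT EXISTS would delete a NULL row even if all three tables contained one. A quick sketch of the asymmetry:

    -- Returns a NULL row if both tables contain a NULL:
    Select x From #A Intersect Select x From #B;
    -- But equality never holds for NULL, so this keeps nothing for it:
    Select * From #A Where Exists(Select x From @RowsToKeep r Where #A.x = r.x);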

    Tom

    Thursday, July 04, 2013 3:31 AM
  • This is the case where the INTERSECT keyword works best:

    SELECT x
    FROM #A
    
    INTERSECT
    
    SELECT x
    FROM #B
    
    INTERSECT
    
    SELECT x
    FROM #C


    For every expert, there is an equal and opposite expert. - Becker's Law



    • Proposed as answer by Uwe Ricken (MVP) Thursday, July 04, 2013 5:46 AM
    Thursday, July 04, 2013 3:33 AM
  • Try this query:

    DELETE #A, #B, #C
    FROM   #A
    JOIN   #B ON (#B.X <> #A.X)
    JOIN   #C ON (#C.X <> #B.X)
    WHERE  #A.X = Your Value;

    The above query deletes all mismatched items in #B and #C.

    DELETE #A, #B, #C
    FROM   #A
    JOIN   #B ON (#B.X = #A.X)
    JOIN   #C ON (#C.X = #B.X)
    WHERE  #A.X = Your Value;

    The above query deletes all matched items in #A, #B and #C.
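    T-SQL allows only one target table per DELETE, so a per-table sketch of the matched-items version (untested; <your value> is a placeholder) might be:

    DELETE a
    FROM   #A a
    JOIN   #B ON #B.X = a.X
    JOIN   #C ON #C.X = #B.X
    WHERE  a.X = <your value>;
    -- repeat with #B and then #C as the DELETE target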

    Thursday, July 04, 2013 3:42 AM
  • Have a look at this:

    DELETE
    FROM #A
    WHERE NOT EXISTS
    (
        SELECT x FROM (
            SELECT x FROM #B
            INTERSECT
            SELECT x FROM #C) BC
        WHERE BC.x = #A.x
    )
    
    DELETE
    FROM #B
    WHERE NOT EXISTS
    (
        SELECT x FROM (
            SELECT x FROM #A
            INTERSECT
            SELECT x FROM #C) AC
        WHERE AC.x = #B.x
    )
    
    DELETE
    FROM #C
    WHERE NOT EXISTS
    (
        SELECT x FROM (
            SELECT x FROM #A
            INTERSECT
            SELECT x FROM #B) AB
        WHERE AB.x = #C.x
    )

    Thursday, July 04, 2013 9:46 AM
  • Thank you all for the replies. I am going to try them at work tomorrow and hopefully I can apply them to the bigger solution.
    Friday, July 05, 2013 1:26 AM
  • ; WITH all_matches AS (
      SELECT x
      FROM   #a
    
      INTERSECT
    
      SELECT x
      FROM   #b
    
      INTERSECT
    
      SELECT x
      FROM   #c
    )
    DELETE a
    FROM   #a As a
    WHERE  NOT EXISTS (
             SELECT *
             FROM   all_matches
             WHERE  x = a.x
           )
    Repeat for #b and #c
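    For completeness: a CTE scopes only to the one statement that follows it, so each repeat needs its own WITH. A sketch for #b (and the same again with #c as the target):

    ; WITH all_matches AS (
      SELECT x
      FROM   #a
    
      INTERSECT
    
      SELECT x
      FROM   #b
    
      INTERSECT
    
      SELECT x
      FROM   #c
    )
    DELETE b
    FROM   #b As b
    WHERE  NOT EXISTS (
             SELECT *
             FROM   all_matches
             WHERE  x = b.x
           )

    Deleting the non-matching rows from one table never changes the three-way intersection, so the repeats can safely run in any order.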

    George

    Friday, July 05, 2013 12:21 PM
  • Thank you, George (gvee), for your input as well. The only one I couldn't get to work is the reply from Nissan K George; though it looked promising, I don't think you can delete from multiple tables in SQL Server. I think this may be a MySQL idea, and it may well work there or in another DBMS.

    Maybe someone can chime in here as to which of these would have the best performance, pitfalls to watch for, and tips on how to use them, since each table structure is actually different even though they share the common x column. Which would work best if the data were arranged differently in each of the tables? I always wonder what would happen if there were more data in one of the tables and less in the others (is there an implicit parent/child relationship in any of these queries, as written, so that some data may be missed?). I have tested several variations, and all of them except the one outlined above seem to work and give the same results. I am just hoping I didn't miss some real-world possibilities.

    Ultimately, I want to use the output of these deletions to insert into an exception table.

    I DO thank you all for the input here, as it looks potentially extremely useful and contains a treasure trove of varying tips for accomplishing the same or a similar purpose.
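    For that last step, a plain DELETE can carry an OUTPUT clause, so something like this untested sketch is what I have in mind (using Tom's @RowsToKeep and an exception table such as CREATE TABLE #Exception (x INT, src VARCHAR(50))):

    DELETE #A
    OUTPUT Deleted.x, '#A' INTO #Exception (x, src)
    WHERE  NOT EXISTS (SELECT 1 FROM @RowsToKeep r WHERE #A.x = r.x);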

    Friday, July 05, 2013 2:13 PM
  • >> How do I write a query to delete all the rows in all 3 tables that do not match values in the other tables so all 3 tables have the same values for X in them and the same number of rows. <<

    Do you know that the goal of a database is to remove redundancy, not increase it? In a correctly designed schema, you will have one fact, in one place, one time, one way. This problem should not exist. 

    CREATE TABLE Alpha (x INTEGER NOT NULL PRIMARY KEY);
    INSERT INTO  Alpha(x) 
    VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9);

    CREATE TABLE Beta (x INTEGER NOT NULL PRIMARY KEY);
    INSERT INTO  Beta(x) 
    VALUES (1), (2), (3), (4), (5), (6), (8), (9);

    CREATE TABLE Gamma (x INTEGER NOT NULL PRIMARY KEY);
    INSERT INTO  Gamma(x)
    VALUES (1), (2), (5), (6), (7), (8), (9);

    CREATE VIEW Common_X(x)
    AS SELECT x
       FROM (SELECT * FROM Alpha
             INTERSECT
             SELECT * FROM Beta
             INTERSECT
             SELECT * FROM Gamma) AS ABC(x);
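
    The view then drives the deletes; a sketch (NOT IN is safe here only because x is declared NOT NULL):

    DELETE FROM Alpha WHERE x NOT IN (SELECT x FROM Common_X);
    DELETE FROM Beta  WHERE x NOT IN (SELECT x FROM Common_X);
    DELETE FROM Gamma WHERE x NOT IN (SELECT x FROM Common_X);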


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Friday, July 05, 2013 4:44 PM
  • As long as x is unique, as for a PK, you can simply use an inner join instead of INTERSECT.
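
    A minimal sketch of the join form (assuming x is unique and non-null in each table):

    SELECT a.x
    FROM   #A a
    JOIN   #B b ON b.x = a.x
    JOIN   #C c ON c.x = a.x;

    With duplicates, the join would multiply rows where INTERSECT would not.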

    Josh

    Friday, July 05, 2013 5:02 PM
  • I appreciate your suggestion against redundancy, and I agree with it. These test table columns are just the supposedly unique key values from much larger and different tables; they are the result of an import from various flat files (TXT, CSV, XLS, etc.), and they are supposed to come with the correct and matching set of active accounts. But they often do not. These files, and the tables that result from the import, are also unique, or supposed to be. But we cannot use any rows in any of these tables unless there is a match in the others. The good rows go to a sort of staging table, and the bad or non-matched rows go to an exception table.

    What I am doing is processing all the raw tables into the destination or staging table, removing the non-matched rows, and putting them in an exception table, which can be reported on and, if and when the issues with them are solved, put back into the staging tables.

    I don't have too much flexibility with the structure of the destination or staging tables but I can insert and delete from them.

    As for Josh's comment below about x being unique: it is supposed to be, but is not always, and hence I have to process the duplicates as well rather than bomb out with a duplicate key violation. I am using an ID-type column for the primary key, but it is not necessarily the same on each table, so I cannot relate the tables that way.

    Friday, July 05, 2013 5:23 PM
  • I appreciate your suggestion against redundancy, and I agree with it. These test table columns are just the supposedly unique key values from much larger and different tables; they are the result of an import from various flat files (TXT, CSV, XLS, etc.), and they are supposed to come with the correct and matching set of active accounts. But they often do not. ... I am using an ID-type column for the primary key, but it is not necessarily the same on each table, so I cannot relate the tables that way.

    Can you get an ETL tool? It is built for this kind of scrubbing. I have written a lot of staging (non-)tables, but SQL was not meant for this kind of work. Just the data type conversions can be a mess.

    What is an "ID type"? Surely you are not using an exposed IDENTITY table property (it is not a column by definition) in SQL for anything!


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Friday, July 05, 2013 8:36 PM
  • CREATE TABLE t
    (
    [ID] [int] NOT NULL IDENTITY(1, 1),
    [Col1] [VARCHAR](10)
    )
    
    I didn't design this; it was what I was left with. Many of the tables are like this, so I cannot join any of them on this column. We have SSIS; I haven't used it for this type of thing, and I obviously need more experience with it before I can. Are you saying that it is bad to use any type of identity column like the above, for example one named after the table, i.e. t_ID NOT NULL IDENTITY(1,1)? If so, what is the best thing to use if there is no natural key?
    Saturday, July 06, 2013 12:21 AM
  • >> I didn't design this. It was what I was left with. Many of the tables are like this, so I cannot join any of them on this column. <<

    I always hated the term "legacy code" and thought it ought to be "family curse" instead.


    Why would anyone think that an unpredictable count of physical access attempts (not even successes!) on one particular piece of hardware to one particular table in software from one vendor is part of RDBMS and correct data modeling? 

    Let's define what characteristics you want in an identifier. 

    1) An identifier has to be unique. Otherwise, it is not an identifier. Ideally, it should be an industry standard (validation and verification are discussed shortly). IDENTITY by itself only generates exact numeric values, but doing math on them makes no sense. This comes from UNIX files on 16-bit hardware (read: PDP-11), which were based on magnetic tape file systems. Each physical record gets a record number at the start, then the fields are written in sequence within the record, just as the records are in physical sequence within the file. Since the records are strings of varying length, searches are done by reading the record number and moving the read/write heads as needed. If a record has been dropped, its number is set negative but it is left in place for searching. The algorithms for this kind of searching are based on square roots and Fibonacci numbers, which are fun to look at, like learning how to take the n-th root of a number by hand :) I digress.

    2) An identifier should be created with the entity, or before the entity exists, but not afterward. This is straight OO theory, not RDBMS. Without an identifier, you simply cannot put an entity into the database. As an example of an identifier coming into existence before the entity, think about a book that has not been published yet. It can be assigned an ISBN (International Standard Book Number), a title, an author, a price, and everything else while the publisher is waiting to get the manuscript. 

    But a future release book does not behave like a real book. You cannot put it in a box and ship it. You cannot get a review of the book either -- at least not an honest review ;-). It is not the same kind of thing as a real published book. 

    3) It should be verifiable within itself. That means that when I see a particular kind of identifier, I ought to know if it is syntactically correct. For example, I know that ISBN10: 0-486-60028-9 has the correct number of digits and that the check digit is correct for a proper old style International Standard Book Number. Later on I can find out that it identifies the Dover Books edition of AN INVESTIGATION OF THE LAWS OF THOUGHT by George Boole. 

    4) An identifier should have repeatable verification against the reality that you are trying to capture in your data model. It is not tied to the hardware of one machine. 

    Exactly what verification means can be a bit fuzzy. At one extreme, prison inmates are moved by taking their fingerprints at control points and courts want DNA evidence for convictions. At the other end of the spectrum, retail stores will accept your check on the assumption that you look like your driver's license photograph. :crying:

    IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? The value is created by looking at the internal state of one particular piece of hardware at the time a PHYSICAL record containing a row is inserted into storage. Its purpose is to locate the row without any regard to what the data means. 

    Think about using a pointer or a track/sector number (Oracle ROWID); same thing, but a different mechanism. But SQL does not have a pointer data type or the mechanisms to handle pointer operators, garbage collection and housekeeping, so 25+ years ago the original Sybase SQL Server exposed an integer that could map back to the contiguous UNIX storage model used under the covers. It made cursors easier for the programmers and the early SQL engines built on file systems on simple 16-bit machines. It was pretty fast on that hardware, too. But smell the coffee; this is the 21st century, the hardware is 32 or 64 bit, and it is sophisticated compared to the stuff we had in 1970. Multi-word operations are built in now.

    IDENTITY is not an attribute in the data model and can never be an attribute in the data model because it does not exist in the reality from which you derive your data model. Sure, I lose all the advantages of an abstract data model, SQL set oriented programming, carry extra data, destroy the portability of code and still have to maintain all the data integrity among keys and constraints. But this is easier to write than a real normalized RDBMS schema. 

    A programmer with only a few years in a procedural or OO language is all too often expected to produce a correct and usable database. What is he going to do? The smart ones will get some help and beg for training, knowing they are over their heads and can ruin the company. Most of them simply start programming SQL as if it were their native programming language. They grab at GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset) or OID (OO mindset) since they do not know anything else. 

    They write code with cursors to mimic record-at-a-time file handling -- it is easy do a global replace of READ() with FETCH. They write to temp tables to mimic scratch files in a series of procedural steps. They use dynamic SQL and let the user figure out how the system should work on the fly -- they never had a software engineering course and do not know what coupling and cohesion are. 

    Experienced database designers look for industry standard codes for their keys first. Try to tell the IRS you do not have anybody's SSN at tax time, but "Cindy Lou Who" was the 42nd employee put into the Personnel table and you have the IDENTITY value to prove it. Try to sell a car without a VIN; using IDENTITY for this would be like identifying a vehicle by its current local parking space number (think about it: that is exactly what an IDENTITY value is).

    Finally, it is not possible to have a table with an IDENTITY PRIMARY KEY in anything higher than 2NF. No facts about the IDENTITY itself are stored, so everything in the table has a transitive dependency on the real candidate key.

    We know that we have to use industry standard codes, either de facto or de jure. We also appreciate the fact that we can exchange data with the rest of the world. We appreciate the fact that someone else will maintain and define these codes. A trusted external source is a good thing to have. 

    If you do not have an industry standard or natural key and have to design your own codes, it is hard work to do it right. I know that newbie programmers want to start coding first and think later. It does not work that way, and all the "wish magic" in the world will not change that fact. I have several chapters in my books on how to design encoding schemes, but that is another topic.

    The first practical consideration is that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other SQL products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else wants their application. Otherwise, you will port code; you will share data with some other database; data does not exist in isolation. 

    But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a table property which belongs to the PHYSICAL table implementation, not the LOGICAL data inside the table. 

    Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition. 
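
    A quick sketch of both experiments (table and column names are illustrative):

    -- Only one identity column is allowed per table, so this DDL fails:
    CREATE TABLE TwoIds (a INTEGER IDENTITY(1,1), b INTEGER IDENTITY(1,1));

    -- A table that is nothing but the IDENTITY property:
    CREATE TABLE OneId (a INTEGER IDENTITY(1,1));
    INSERT INTO OneId DEFAULT VALUES; -- rows can only be generated, not chosen
    UPDATE OneId SET a = 42;          -- fails: an identity column cannot be updated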

    Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like

    INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); 
    INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); 
    INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

    to put a few rows into the table, and notice that the IDENTITY numbered them sequentially in the PHYSICAL order in which they were presented. If you delete a row, the gap in the sequence is not filled in, and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in sequential disk files in the 1950's, as I said before. A utility program would "pack" or "compress" the records that were flagged as deleted or unused, to move the empty space to the PHYSICAL end of the PHYSICAL file. Files do not have references to other files, so the only worry is that a program will use a hardwired record number.

    But now use a statement with a query expression in it, like this:

    INSERT INTO Foobar (a, b, c)
    SELECT x, y, z
     FROM Floob;

    Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. 

    There are (n!) ways to number (n) rows, so which one did you pick? Why? The answer has been to use whatever the PHYSICAL order of the result set happened to be. That non-relational phrase "PHYSICAL order" again!

    But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result back in a different PHYSICAL order. Indexes and statistics are not part of the logical model. 

    Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical. 

    How do you verify that an entity has the right key when you use an IDENTITY? If I use a VIN for a vehicle, I can go to the parking lot and read it off the dashboard. If I use a UPC for a candy bar, I can read the bar code on the wrapper. But why is little 'Cindy Lou Who' employee 42? If I turn her upside down, will I find that she has 42 tattooed somewhere? :w00t:  If I call an external trusted source, will they know that she is employee 42? No, they will want her driver's license, tax id number or something.

    In the Relational Model, you do not invent a key in the storage. You discover a key (and the other attributes) in the real world and model it. If you create your own encoding for a key, then you have to maintain it, provide audit trails and do all the work that an industry standard organization would do for you. 

    If I lean on a mouse button, I can insert the same data with a new IDENTITY over and over. 'Cindy Lou Who' is now employed twice, and none of my reports is right! Now I have to write some procedural code, like a trigger, or add a UNIQUE constraint on her tax identification number to prevent this, thus making the IDENTITY redundant. But we were assuming that we use only IDENTITY as a key, so we are screwed.
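
    That redundant guard is a one-liner (hypothetical table and column names):

    ALTER TABLE Personnel
      ADD CONSTRAINT UQ_Personnel_tax_id UNIQUE (tax_id);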

    Newbies often design tables without bothering to look for a relational key, so they are so surprised when they do a data warehouse and nobody else has any idea what they are doing in their subsystem. 

    'Cindy Lou Who' now has two rows in Personnel. When we sign her up for the Dental Plan, we get the row with 42. When we sign her up for the Bowling Team, we get the row with 43. We find our error, and delete the row with 42 because we have a row that was created later and we assume it is more current. 

    Another common way to get this is to have two procedures, one for inserting a new employee to the Dental Plan and one for inserting a new employee to the Bowling Team. Both procedures create a row in Personnel since they use only IDENTITY as a key. 

    Well, now we have an orphan row in Personnel. In fact, in such systems, you will find a lot of orphans. I worked for a company that used GUIDs as OID substitutes, and our software fell apart after about a year of actual use by a client.

    Another cute way to destroy data integrity:

    BEGIN TRANSACTION;
    DELETE FROM Foobar
    WHERE foobar_id = <constant>;
    COMMIT;

    BEGIN TRANSACTION;
    INSERT INTO Foobar
    VALUES (<row expr identical to what was deleted>);
    COMMIT;

    Logically this should do nothing in an RDBMS, but since IDENTITY has gaps, it trashes the data. When one query uses the IDENTITY and another uses the real key, you are like a man with two watches: you are never sure what time it is.

    Finally, an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. 

    This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. 

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

    (1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

    (2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

    (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. an applicant for a job, or a retiree).

    These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part in [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

    References
    Codd, E. (1979). Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4), pp. 397-434.


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Saturday, July 06, 2013 3:54 PM