locked
How is a CHECK CONSTRAINT using the IN operator actually being processed for an [int] column? RRS feed

  • Question

  • I am participating in a discussion on SQL Server's handling of IN expressions when used for CHECK CONSTRAINTs for an [int] column.

    The issue boils down to whether SQL SERVER is "clever" enough to optimize non-normalized IN to be executed in an efficient manner.

    Question 1:

    Let's assume we have a contigous range of int values like:

    [MyColumn] IN(1, 2, 3, 4, 5, 6)

    Is the expression for the CHECK CONSTRAINT analyzed, optimized, compiled etc. so SQL SERVER actually executes something like:

    [MyColumn] BETWEEN 1 and 6

    or must the T-SQL writer do that optimization, i.e. to realize when the set of integer values denote a contigous range and then use BETWEEN in favor of IN?

    Question 2:

    Does the answer change if the definition of the CHECK CONSTRAINT has multiple contigous ranges like:

    [MyColumn] IN(1, 2, 3, 4, 5, 6, 203, 204, 205, 206)

    QUESTION 3:

    So generally, to achieve the best performance, should I normalize such enumerations of integers to a number of BETWEENs, one for each contigous range, or SQL SERVER is clever enough to execute IN with some list of integers in the most efficient way?

    Monday, May 11, 2020 2:01 PM

Answers

  • I'm honestly not sure how SQL Server currently handles a list of values for the IN () predicate. However, the elements do not have to be integer constants. They can be integer expressions and you can play some really nice games where you're essentially turning a list of OR-ed expressions into a single expression. If you write a long list, then other SQL products to special optimizations. For example, DB2 will build a binary search tree of the values rather than do a simple linear search when the list gets long enough. Then when it gets even longer, it will build a hash table. This leads to the weird programming trick where you deliberately put in extra values to make the list longer to trigger the optimization. In Oracle, it used to be a good idea to arrange the list from most likely to least likely. Their optimizer sucked and would do a simple left to right parse on the list.  But again like everyone's been saying, you need to test it for yourself on a sufficiently large data set. I would guess that it won't make that much difference.

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Marked as answer by Henrik Dahl Thursday, May 14, 2020 11:14 AM
    Tuesday, May 12, 2020 2:05 AM

All replies

  • There is no public information on how the internals of SQL Server would handle that situation.  However, it is highly unlikely it would detect a contiguous range and optimize it.  There really is no benefit to the query processes in doing that.

    Monday, May 11, 2020 3:32 PM
  • The IN operator is a 'shortcut' for OR and SQL Server will optimize the IN to a set of OR statements.  The constraint would then become:

    MyColumn = 1 OR MyColumn = 2 OR MyColumn = 3 OR MyColumn = 4 ...

    BETWEEN is a 'shortcut' for >= and <=...

    For a check constraint - either method will perform the same as this is only performed when the value in the column is inserted or updated.


    Jeff Williams

    Monday, May 11, 2020 7:42 PM
  • What do you think of the following then?

    In case of INSERT or UPDATE. If we have a CHECK CONSTRAINT like "[MyColumn] IN(1, 2, 3, 4, 5, 6)" it would be quicker to have "[MyColumn] BETWEEN 1 AND 6" because it compares only two values, not 6. Basically BETWEEN gives an order of execution which is O(1) because always two values are being compared whereas IN is O(n) because n values are being compared.

    In case of SELECT if we have a where clause like "[MyColumn] IN(1, 2, 3, 4, 5, 6)" and we assume there is an index on [MyColumn] it is much faster to do "[MyColumn] BETWEEN 1 and 6" because in this case B+ trees are being used, as indexes are implemented using B+ trees. Therefore BETWEEN simply addresses a subtree of the B+ tree whereas IN makes n individual queries.

    Do you think, that it's correctly analyzed?

    Monday, May 11, 2020 9:09 PM
  • I don't think you are going to find either method to be faster - and if one is faster it is going to be in the sub-millisecond range.

    But...if you really want to know which one is faster then put together a test with at least one million rows being inserted/updated using one constraint and time the process.  Then do the test again with the other constraint.

    As for your assertion that BETWEEN is faster than IN (or a set of OR conditions) - that isn't really how it works.  SQL Server will perform an index scan starting at the beginning of the range and validate each row returned against the condition until it reaches the end of the range.  It will do the same for the IN operation and since they are equivalent - the performance should be the same.


    Jeff Williams

    Monday, May 11, 2020 10:01 PM
  • I think you are asking the wrong question. No matter the IN list is normalised to a BETWEEN range or not, the time that is actually spent on the validation is very small compared to the operation of updating a row.

    It would have been different if you had asked about a condition in a WHERE clause. Here it could matter, and I would guess BETWEEN would be better, as an index would be used more efficiently.

    It would also be different if you have a really long list of values, more than 64. In this case, SQL Server transforms the IN list to an internal temp table, so that could make a noticable difference to BETWEEN.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, May 11, 2020 10:10 PM
  • Just to be sure, so if we assume we have e.g. 1,000,000 rows in the table and each row has a [MyColumn] column with numbers from 1 to 1,000, 000 and there is an index on that column and we do SELECT * from TheTable WHERE [MyColumn] IN (1000, 10000, 100000) then "SQL Server will perform an index scan starting at the beginning of the range and validate each row returned against the condition until it reaches the end of the range." as you wrote, i.e. start index scanning at 1000 and finish at 100000 and not do three queries, one for 1000, one for 10000 and one for 100000 and do a union as I assumed?
    Monday, May 11, 2020 10:14 PM
  • As Jeff Williams noted you can test it.  Here is a (not very sophisticated) test that inserts 1 million rows and then updates them.

    Set NoCount On;
    Create Table #In(i int primary key identity, j int Constraint InCk Check (j in (1,2,3,4,5,6)));
    Create Table #Between(i int primary key identity, j int Constraint BetweenCk Check (j Between 1 And 6));
    Declare @NumberLoops int;
    Set @NumberLoops = 1000000;
    Declare @Counter int;
    Set @Counter = 0;
    Declare @StartTime datetime2, @EndTime datetime2;
    Set @StartTime = CURRENT_TIMESTAMP;
    While @Counter < @NumberLoops
    Begin
      Insert #In(j) Values(1);
      Set @Counter = @Counter + 1;
    End;
    UpDate #In Set j = i % 6 + 1;
    Set @EndTime = CURRENT_TIMESTAMP;
    Select DateDiff(ms, @StartTime, @EndTime) / Cast(@NumberLoops As float) As AvgInTimeInms;
    
    Set @Counter = 0;
    Set @StartTime = CURRENT_TIMESTAMP;
    While @Counter < @NumberLoops
    Begin
      Insert #Between(j) Values(1);
      Set @Counter = @Counter + 1;
    End;
    Update #Between Set j = i % 6 + 1
    Set @EndTime = CURRENT_TIMESTAMP;
    Select DateDiff(ms, @StartTime, @EndTime) / Cast(@NumberLoops As float) As AvgBetweenTimeInms;
    
    
    go
    Drop Table #In;
    go
    Drop Table #Between;

    When I ran it on my PC (your mileage may vary), the results varied from run to run, but the Between version averaged about .001 to .002 milliseconds or 1 to 2 nanoseconds faster per row. But each row had 2 operations (1 insert and 1 update), so that means the difference is about 0.5 to 1 nanoseconds per operation.

    So the Between version appears to be very slightly faster.  But not enough (IMO) to make any difference in which version I would pick.  I would choose the Between version, but not because it is faster (the difference is so small), but rather because the Between version is easier to read and understand.  That's more important for anyone who must maintain your code at some future date than such a small difference in performance.

    Tom

    Monday, May 11, 2020 10:36 PM
  • I'm honestly not sure how SQL Server currently handles a list of values for the IN () predicate. However, the elements do not have to be integer constants. They can be integer expressions and you can play some really nice games where you're essentially turning a list of OR-ed expressions into a single expression. If you write a long list, then other SQL products to special optimizations. For example, DB2 will build a binary search tree of the values rather than do a simple linear search when the list gets long enough. Then when it gets even longer, it will build a hash table. This leads to the weird programming trick where you deliberately put in extra values to make the list longer to trigger the optimization. In Oracle, it used to be a good idea to arrange the list from most likely to least likely. Their optimizer sucked and would do a simple left to right parse on the list.  But again like everyone's been saying, you need to test it for yourself on a sufficiently large data set. I would guess that it won't make that much difference.

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Marked as answer by Henrik Dahl Thursday, May 14, 2020 11:14 AM
    Tuesday, May 12, 2020 2:05 AM
  • Hi Henrik Dahl,

    Like others said, BETWEEN would have a weak edge than IN, it is easy to understand and can take a range from an index (assuming somecolumn is indexed).

    There are some explanations:

    Is there a performance difference between BETWEEN and IN?

    AGE [1, 2, 3] vs. AGE BETWEEN 1 AND 3.

    Best Regards,

    Lily


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Tuesday, May 12, 2020 6:21 AM
  • Can you refer to some source for "It would also be different if you have a really long list of values, more than 64. In this case, SQL Server transforms the IN list to an internal temp table"?
    Tuesday, May 12, 2020 1:45 PM
  • Just to be sure, so if we assume we have e.g. 1,000,000 rows in the table and each row has a [MyColumn] column with numbers from 1 to 1,000, 000 and there is an index on that column and we do SELECT * from TheTable WHERE [MyColumn] IN (1000, 10000, 100000) then "SQL Server will perform an index scan starting at the beginning of the range and validate each row returned against the condition until it reaches the end of the range." as you wrote, i.e. start index scanning at 1000 and finish at 100000 and not do three queries, one for 1000, one for 10000 and one for 100000 and do a union as I assumed?

    Note that now we are talking WHERE clauses and not CHECK constraints.
    I don't think Jeff is correct here. I ran this on a database that I have:

    SET STATISTICS IO ON
    SELECT * FROM Orders WHERE OrderID IN (11000, 11001, 11002)
    SELECT * FROM Orders WHERE OrderID BETWEEN 11000 AND 11002
    SET STATISTICS IO OFF

    The output was:

    Table 'Orders'. Scan count 3, logical reads 9, physical reads 0...
    Table 'Orders'. Scan count 1, logical reads 3, physical reads 0...

    So for the IN (which is just a shortcur for OR), it is making three point lookup, whereas for BETWEEN it reads the range. But this could be different if you have more values in the list.

    Can you refer to some source for "It would also be different if you have a really long list of values, more than 64. In this case, SQL Server transforms the IN list to an internal temp table"?

    Just try it yourself. Create a long list, and look at the query plan. This is how I made the observation.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Proposed as answer by Lily Lii Thursday, May 14, 2020 8:30 AM
    Tuesday, May 12, 2020 9:56 PM
  • Hi Henrik Dahl,

    If you have too many values within the IN() clause, SQL Server seems to automatically create temporary tables for performance improvement.

    The limitation is 65536, however, IN() will stop way before that. (Is there any limit for IN results?)

    The alternative way is storing the items in the IN list in a table, and use a SELECT subquery within an IN() clause.

    Best Regards,

    Lily


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Wednesday, May 13, 2020 8:39 AM
  • Dear Joe,

    Thank you very much for your reply. Yes, to know how it's actually served is obviously to be preferred and the more demanding design of DB2 is obviously also the be preferred. I just think you should be able to hint for the binary search variant and not have a need for stuffing in additional values. In the concrete case it's a matter of mapping an enum by an ORM and checking for valid values, which does not really leave room for fake values. So the issue is, if the ORM SQL database generator should produce INs or BETWEENs. The ORM is Entity Framework Core.

    I have marked your response as an answer due to the higher order computer science related perspectives of yours.

    Please allow to to mention, that I have a few books of yours, books which have always brought me cognitive pleasure when reading.

    Best regards,

    Henrik Dahl

    Thursday, May 14, 2020 11:26 AM