locked
The mechanics of the WHERE clause RRS feed

  • Question

  • I’ve been out in industry for over 10 years now doing DB work most of that time. Lately I’ve been doing some work that is starting to challenge my long held assumptions about the internal workings of SQL Server, specifically what exactly is happening when I do joins and where clauses.

    I was explaining to someone that when doing a join putting in a where clause limits the amount of records that are available to do the join on thereby making the query faster. My thought was if a table has 100MM records but only 50MM of those meet the WHERE criteria then the JOIN clause wouldn’t try matching the 50MM records that don’t meet the WHERE clause.

    I did some work that afternoon and the empirical results didn’t match the expected outcome of a query I had written in terms of execution time. I don’t think my understanding of what goes on during a JOIN is correct. Can someone educate me?

    Monday, May 6, 2013 4:29 PM

Answers

  • There is a definition of the order of the process of JOINs and the WHERE clause.  Then there is what SQL actually does.  These may be different.

    But the definition is the FROM clause including any JOINs is done first.  In general, if you have multiple JOINs, they are done left to right in the order of the ON clauses.  So if you have

    FROM A LEFT JOIN B INNER JOIN C ON B.ID = C.ID ON A.ID + B.ID

    then first B is joined to C and then that result is joined to A.  Of course, in most cases people write the ON clause immediately after the JOIN so you will see something like

    FROM A LEFT JOIN B ON A.ID = B.ID LEFT JOIN C ON A.ID = C.ID

    And, the rules of which join is done first can become complex if you are mixing both OUTER JOINs and CROSS JOINs.

    But the WHERE is done after all of the JOINs have been processed.

    But that's only the definition of what is theoretically done.  When the query is actuality done, SQL can (and does) process the query in any manner it thinks is most efficient as long as the result is the same as if it had followed the above definition.  In particular, if your query has only INNER JOINs and CROSS JOINs and a WHERE clause, then the order things are done will make no difference in the result.  So in that case, SQL will do the joins and where in any order it thinks is fastest.  So that could mean doing the joins first, or doing the where first, or doing part of the where first, followed by some of the joins followed by the rest of the where clause followed by the rest of the joins or whatever SQL thinks will be fastest given the available indexes and statistics.

    Tom



    • Edited by Tom Cooper Monday, May 6, 2013 6:07 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:18 PM
    Monday, May 6, 2013 5:46 PM
  • The important concept here is that in SQL you specify the result you want, not the algorithm to create that result.  It's helpful to think about the FROM clause executing first, followed by the WHERE clause in order to reason about the results and the correctness of your query.  But that's not how SQL Server actually runs the query.

    The Query Optimizer looks as the query, the join criteria and the where clauses and considers several different algorithms (query plans) to get the specified result.  It assigns a "cost" to each potential plan and then executes whichever plan has the lowest cost. 

    So here it might start with a where clause, or it might start by joining tables and apply the where clause criteria later.

    David


    David http://blogs.msdn.com/b/dbrowne/

    • Proposed as answer by Naomi N Monday, May 6, 2013 5:41 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:19 PM
    Monday, May 6, 2013 5:40 PM
  • Here is how a SELECT works in SQL ... at least in theory.  Real products will optimize things, but the code has to produce the same results. 

     a) Effectively materialize the CTEs in the optional WITH clause. CTE's come into existence in the order they are declared so only backward references are alllowed. A CTE can be recursive. Think of them as VIEWs that exist only in the scope of the query. In practice, if they are used once then they are implemented as an in-line macro.

     b) Start in the FROM clause and build a working table from all of the joins, unions, intersections, and whatever other table constructors are there.  The <table expression> AS <correlation name> option allows you give a name to this working table which you then have to use for the rest of the containing query.  Ther are UNIONB, INTERSECT and EXCEPT set construtors, LATERAL tables, table-valued funcitosn and all kinds of things happening in here. 

     c) Go to the WHERE clause and remove rows that do not pass criteria; that is, that do not test to TRUE (i.e. reject UNKNOWN and FALSE).  The WHERE clause is applied to the working set in the FROM clause.  

     d) Go to the optional GROUP BY clause, partiton the original table into groups and reduce each grouping to a *single* row, replacing the original working table with the new grouped table. The rows of a grouped table must be only group characteristics: (1) a grouping column (2) a statistic about the group (i.e. aggregate functions) (3) a function or constant(4) an expression made up of only those three items.  The original table no longer exists and you cannot reference anything in it (this was an error in early Sybase products).  

     e) Go to the optional HAVING clause and apply it against the grouped working table; if there was no GROUP BY clause, treat the entire table as one group. 

     f) Go to the SELECT clause and construct the expressions in the list. This means that the scalar subqueries, function calls and expressions in the SELECT are done after all the other clauses are done.  The AS operator can also give names to expressions in the SELECT list.  These new names come into existence all at once, but after the WHERE clause, GROUP BY clause and HAVING clause have been executed; you cannot use them in the SELECT list or the WHERE clause for that reason. 

    If there is a SELECT DISTINCT, then redundant duplicate rows are removed.  For purposes of defining a duplicate row, NULLs are treated as matching (just like in the GROUP BY).  

     g) Nested query expressions follow the usual scoping rules you would expect from a block structured language like C, Pascal, Algol, etc.  Namely, the innermost queries can reference columns and tables in the queries in which they are contained.  

     h) The ORDER BY clause is part of a cursor, not a query. The result set is passed to the cursor, which can only see the names in the SELECT clause list, and the sorting is done there.  The ORDER BY clause cannot have expression in it, or references to other columns because the result set has been converted into a sequential file structure and that is what is being sorted.  

    As you can see, things happen "all at once" in SQL, not "from left to right" as they would in a sequential file/procedural language model. In those languages, these two statements produce different results:
      READ (a, b, c) FROM File_X;
      READ (c, a, b) FROM File_X;

    while these two statements return the same data:

    SELECT a, b, c FROM Table_X;
    SELECT c, a, b FROM Table_X;

    Think about what a confused mess this statement is in the SQL model.

    SELECT f(c2) AS c1, f(c1) AS c2 FROM Foobar;

    That is why such nonsense is illegal syntax.

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Marked as answer by falcon00 Monday, May 6, 2013 6:35 PM
    Monday, May 6, 2013 6:15 PM
  • I would suggest this book to get familiar with the fundamentals of T-SQL.

    Microsoft® SQL Server® 2012 T-SQL Fundamentals
    http://shop.oreilly.com/product/0790145321978.do

    Logically the JOIN should take place first and then the WHERE but the query optimizer can shift or move certain expressions if the semantic of the statement does not change.


    AMB

    Some guidelines for posting questions...

    • Proposed as answer by Naomi N Monday, May 6, 2013 5:40 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:19 PM
    Monday, May 6, 2013 5:27 PM
  • I have a ton of books on T-SQL but none of them go into the real details of how SQL Server does it's thing. If I take all of your comments this is my take away. Let me know if you agree.

    1. SQL Server DOES work like I expect. Using a where clause WILL limit the amount of records it attempts to join on. (Based on comment from Tom Cooper in another thread.)

    2. It does this by optimization not because of anything that I write. It will take a look at the query and then on it's own come up with the most effective way to execute.

    With regard to 1) I would say it CAN work like you expect and a where clause CAN limit the number of rows it attempts to join.  But SQL also might do the join first and only then apply the where clause.  Whatever way SQL thinks is more efficient as long as that way generates the correct result.

    2) is correct.  As David Brown noted, when you write a SQL statement, you should think of it as you are telling SQL the result you want and then SQL generates that result in the manner it believes to be most efficient.

    Tom

    • Marked as answer by falcon00 Monday, May 6, 2013 6:35 PM
    Monday, May 6, 2013 6:30 PM

All replies

  • Hi -

    Can you check the execution plans?  That might provide some clue.


    - will

    Monday, May 6, 2013 4:38 PM
  • Hi,

    Join and Where clause are the different from each other as Where clause is used for condition and Joins are used for join two or more table's data on the basis of relational methodology, but where clause is also used in inner join to match the condition.

    For deeply:

    For Joins have a look on this good link:

    http://blog.sqlauthority.com/2009/04/13/sql-server-introduction-to-joins-basic-of-joins/

    and for use of where clause, have a look on this good link:

    http://www.techonthenet.com/sql/where.php


    Regards Avanish T

    Monday, May 6, 2013 4:41 PM
  • Its not just about the JOINs, its the execution flow of SQL.

    The FROM clause is executed first , then the WHERE clause. So, SQL Server engine would match the rows and then apply the predicate(i.e. WHERE clause).

    Have a look at this thread.

    http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/70efeffe-76b9-4b7e-b4a1-ba53f5d21916


    Thanks
    Sarat

    Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Monday, May 6, 2013 5:06 PM
  • I do look at execution plans but some of the stuff in there is a little cryptic like Parallelism (repartition streams).
    Monday, May 6, 2013 5:13 PM
  • Dear falcon00

    T-SQL is based on "set theory" and "predicate logic". In "where clause" you simply eliminate those elements that you do not need. Imagine it when you query from a table you have a "Set". In each clause you shrink this "Set'. If you want more information about this please refer to this book From Itzik Ben-Gan:

    Microsoft SQL Server 2008: T-SQL Fundamentals

    Regards,

    Saeid


    http://sqldevelop.wordpress.com/

    Monday, May 6, 2013 5:15 PM
  • Hi falcon00

    A brief description on the differences would be...

    WHERE clause - on a Simple table the WHERE clause and a good index would limit the amount of rows returned from a table (data set). The index holds the rows in physical or logical order say A to Z or 1 to (n). So when looking for the values "T" it can in a sense skip down to "T" and stop at "U" and quickly return just these values.

    I used the term data set earlier for a purpose. It helps me to think out some issues. We can see how a table = data set. But note that a JOIN would also result in a data set. Since the WHERE clause works on "data sets". Then we can now imagine that a JOIN would first join every single row possible and then using the WHERE clause limit the amount of rows returned. :(

    Not exactly what we want... if we want to limit the amount of rows being joined then we have to add that to the ON clause.

    ON Clause - You probably know what this does already but lets review for those who don't. The ON Clause is used between two Normalized tables to rejoin vertically partitioned rows. Sometimes its better to add our WHERE conditions here to limit the number of rows Joined and sometimes we can't add it here. An example would serve us well.

    declare @table1 table (col1 int, col2 int)
    declare @table2 table (col1 int, col2 int)
    
    insert into @table1 values (1,10)
    insert into @table2 values (1,10),(1,20),(1,30),(1,40),(2,10)
    
    select * from @table1 t1 inner join @table2 t2 on t1.col1 = t2.col1
    
    select * from @table1 t1 inner join @table2 t2 on t1.col1 = t2.col1
    where t1.col2 = t2.col2
    
    select * from @table1 t1 inner join @table2 t2 on t1.col1 = t2.col1 and t1.col2 = t2.col2
    

    The execution plans are not very revealing with such a small data set but try these examples out on your own tables. :)


    Pérez

    As a courtesy please post table structure(s), sample data, your query and results, and your expected results.

    Monday, May 6, 2013 5:18 PM
  • I would suggest this book to get familiar with the fundamentals of T-SQL.

    Microsoft® SQL Server® 2012 T-SQL Fundamentals
    http://shop.oreilly.com/product/0790145321978.do

    Logically the JOIN should take place first and then the WHERE but the query optimizer can shift or move certain expressions if the semantic of the statement does not change.


    AMB

    Some guidelines for posting questions...

    • Proposed as answer by Naomi N Monday, May 6, 2013 5:40 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:19 PM
    Monday, May 6, 2013 5:27 PM
  • The important concept here is that in SQL you specify the result you want, not the algorithm to create that result.  It's helpful to think about the FROM clause executing first, followed by the WHERE clause in order to reason about the results and the correctness of your query.  But that's not how SQL Server actually runs the query.

    The Query Optimizer looks as the query, the join criteria and the where clauses and considers several different algorithms (query plans) to get the specified result.  It assigns a "cost" to each potential plan and then executes whichever plan has the lowest cost. 

    So here it might start with a where clause, or it might start by joining tables and apply the where clause criteria later.

    David


    David http://blogs.msdn.com/b/dbrowne/

    • Proposed as answer by Naomi N Monday, May 6, 2013 5:41 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:19 PM
    Monday, May 6, 2013 5:40 PM
  • There is a definition of the order of the process of JOINs and the WHERE clause.  Then there is what SQL actually does.  These may be different.

    But the definition is the FROM clause including any JOINs is done first.  In general, if you have multiple JOINs, they are done left to right in the order of the ON clauses.  So if you have

    FROM A LEFT JOIN B INNER JOIN C ON B.ID = C.ID ON A.ID + B.ID

    then first B is joined to C and then that result is joined to A.  Of course, in most cases people write the ON clause immediately after the JOIN so you will see something like

    FROM A LEFT JOIN B ON A.ID = B.ID LEFT JOIN C ON A.ID = C.ID

    And, the rules of which join is done first can become complex if you are mixing both OUTER JOINs and CROSS JOINs.

    But the WHERE is done after all of the JOINs have been processed.

    But that's only the definition of what is theoretically done.  When the query is actuality done, SQL can (and does) process the query in any manner it thinks is most efficient as long as the result is the same as if it had followed the above definition.  In particular, if your query has only INNER JOINs and CROSS JOINs and a WHERE clause, then the order things are done will make no difference in the result.  So in that case, SQL will do the joins and where in any order it thinks is fastest.  So that could mean doing the joins first, or doing the where first, or doing part of the where first, followed by some of the joins followed by the rest of the where clause followed by the rest of the joins or whatever SQL thinks will be fastest given the available indexes and statistics.

    Tom



    • Edited by Tom Cooper Monday, May 6, 2013 6:07 PM
    • Marked as answer by falcon00 Monday, May 6, 2013 6:18 PM
    Monday, May 6, 2013 5:46 PM
  • I have a ton of books on T-SQL but none of them go into the real details of how SQL Server does it's thing. If I take all of your comments this is my take away. Let me know if you agree.

    1. SQL Server DOES work like I expect. Using a where clause WILL limit the amount of records it attempts to join on. (Based on comment from Tom Cooper in another thread.)

    2. It does this by optimization not because of anything that I write. It will take a look at the query and then on it's own come up with the most effective way to execute.

    Monday, May 6, 2013 5:53 PM
  • Here is how a SELECT works in SQL ... at least in theory.  Real products will optimize things, but the code has to produce the same results. 

     a) Effectively materialize the CTEs in the optional WITH clause. CTE's come into existence in the order they are declared so only backward references are alllowed. A CTE can be recursive. Think of them as VIEWs that exist only in the scope of the query. In practice, if they are used once then they are implemented as an in-line macro.

     b) Start in the FROM clause and build a working table from all of the joins, unions, intersections, and whatever other table constructors are there.  The <table expression> AS <correlation name> option allows you give a name to this working table which you then have to use for the rest of the containing query.  Ther are UNIONB, INTERSECT and EXCEPT set construtors, LATERAL tables, table-valued funcitosn and all kinds of things happening in here. 

     c) Go to the WHERE clause and remove rows that do not pass criteria; that is, that do not test to TRUE (i.e. reject UNKNOWN and FALSE).  The WHERE clause is applied to the working set in the FROM clause.  

     d) Go to the optional GROUP BY clause, partiton the original table into groups and reduce each grouping to a *single* row, replacing the original working table with the new grouped table. The rows of a grouped table must be only group characteristics: (1) a grouping column (2) a statistic about the group (i.e. aggregate functions) (3) a function or constant(4) an expression made up of only those three items.  The original table no longer exists and you cannot reference anything in it (this was an error in early Sybase products).  

     e) Go to the optional HAVING clause and apply it against the grouped working table; if there was no GROUP BY clause, treat the entire table as one group. 

     f) Go to the SELECT clause and construct the expressions in the list. This means that the scalar subqueries, function calls and expressions in the SELECT are done after all the other clauses are done.  The AS operator can also give names to expressions in the SELECT list.  These new names come into existence all at once, but after the WHERE clause, GROUP BY clause and HAVING clause have been executed; you cannot use them in the SELECT list or the WHERE clause for that reason. 

    If there is a SELECT DISTINCT, then redundant duplicate rows are removed.  For purposes of defining a duplicate row, NULLs are treated as matching (just like in the GROUP BY).  

     g) Nested query expressions follow the usual scoping rules you would expect from a block structured language like C, Pascal, Algol, etc.  Namely, the innermost queries can reference columns and tables in the queries in which they are contained.  

     h) The ORDER BY clause is part of a cursor, not a query. The result set is passed to the cursor, which can only see the names in the SELECT clause list, and the sorting is done there.  The ORDER BY clause cannot have expression in it, or references to other columns because the result set has been converted into a sequential file structure and that is what is being sorted.  

    As you can see, things happen "all at once" in SQL, not "from left to right" as they would in a sequential file/procedural language model. In those languages, these two statements produce different results:
      READ (a, b, c) FROM File_X;
      READ (c, a, b) FROM File_X;

    while these two statements return the same data:

    SELECT a, b, c FROM Table_X;
    SELECT c, a, b FROM Table_X;

    Think about what a confused mess this statement is in the SQL model.

    SELECT f(c2) AS c1, f(c1) AS c2 FROM Foobar;

    That is why such nonsense is illegal syntax.

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Marked as answer by falcon00 Monday, May 6, 2013 6:35 PM
    Monday, May 6, 2013 6:15 PM
  • I have a ton of books on T-SQL but none of them go into the real details of how SQL Server does it's thing. If I take all of your comments this is my take away. Let me know if you agree.

    1. SQL Server DOES work like I expect. Using a where clause WILL limit the amount of records it attempts to join on. (Based on comment from Tom Cooper in another thread.)

    2. It does this by optimization not because of anything that I write. It will take a look at the query and then on it's own come up with the most effective way to execute.

    With regard to 1) I would say it CAN work like you expect and a where clause CAN limit the number of rows it attempts to join.  But SQL also might do the join first and only then apply the where clause.  Whatever way SQL thinks is more efficient as long as that way generates the correct result.

    2) is correct.  As David Brown noted, when you write a SQL statement, you should think of it as you are telling SQL the result you want and then SQL generates that result in the manner it believes to be most efficient.

    Tom

    • Marked as answer by falcon00 Monday, May 6, 2013 6:35 PM
    Monday, May 6, 2013 6:30 PM
  • Hi I am relatively very inexperienced in SQL. According to my understanding of your comments and from a few others in other blogs, I believe that irrespective of where I place my filter criteria (either in an INNER JOIN as an ON condition or in a WHERE clause), the SQL query optimizer decides which is the best plan(estimated based on its Cost) and executes the WHERE or JOIN condition first in whichever way it thinks is better. Is it the case even for Trivial queries. I came across this "trivial" definition in this article https://www.simple-talk.com/sql/performance/execution-plan-basics/ , where it says for queries which does not demand an execution plan( simple queries) and those which lack deep statistical data, the query optimizer does not work upon. Will the Logical Processing Order of the SELECT statement be applicable only in these situations.

    Thanks.

    Thursday, February 19, 2015 8:03 AM
  • Well the query optimizer always works on every query plan.  (Maybe a better name for it would be the "uery Plan Generator".)  It takes your query and generates a plan that will produce the correct result.  It then attempts to see if the plan can be improved upon.  But it tries to minimize the total cost of producing the plan plus the cost of executing the plan.  If the original plan is expected to be very expensive to run, the optimizer may spend a significant amount of time attempting to improve it. 

    As an example, if the original plan would take 10 minutes to run, it is a good idea to have the optimizer spend a few seconds trying to improve the query.  If it spent 5 seconds creating a new plan that took only 8 minutes to run, that would be a great result.  On the other hand, if the original plan was expected to take 1 second to run it would be very expensive to spend 5 seconds to try to improve the plan since the time spent improving the plan would be more than the maximum possible savings.

    So the optimizer is a very sophisticated piece of software that takes many things into account including how expensive the current plan is, how likely it is the plan can be improved, and the cost of attempting to find that improvement.  So its goal is never "to produce the best possible plan no matter how long it takes", but instead its goal is to produce a plan which minimizes the total cost of producing the plan + running the plan.  And that may well mean that it produces and runs a suboptimal query.  But it does it only when it estimates the total cost of producing and running the more optimal query would be more expensive than just stopping and running the current plan.

    So in general, you don't have to worry about the situation you describe.  In general, your attitude should be the same as the optimizer's attitude - "It's good enough".  Why, for example, would you want to try to rewrite a query that runs in 0.003 seconds to run in 0.0028 seconds?  Surely, in most cases, you have bigger problems you can be working on.  There may be exceptions.  If this query runs thousands of times a second, it might be worth you spending time trying to improve it by rewriting the query to that the query optimizer produces a better plan or by creating a plan guide to tell the optimizer which plan to use.

    Tom

    Thursday, February 19, 2015 3:05 PM
  • Thanks a ton Tom. :)
    Friday, February 20, 2015 7:18 AM