locked
slow spatial predicates (STContains, STIntersects, STWithin, ...) RRS feed

  • Question

  • Hi all,

    I'm finding SQL Server's spatial predicates (STContains, STIntersects, STWithin, ...) to be extremely slow to use.  What am I doing wrong?

    I have single smallish polygon of about 35000 vertices and I want to determine which of 5000 points are contained within the polygon.  All the points are within the bounding box of the polygon so spatial indexes don't play any role here.

    -- example query
    SELECT b.id
    FROM poly a, points b
    WHERE a.geom.STContains(b.geom) = 1;

    But, I'm getting results whose query times are measured in minutes ... not at all appropriate for a live web mapping application I'm trying to build.  For comparison, the same type of point-in-polygon queries return in mere seconds using the free spatial database PostGIS.

    What am I doing wrong here?  Any tips, hints, or strategies I could employ to improve performance?

    Cheers,
    Kevin
    Monday, December 14, 2009 11:26 PM

Answers

  • I may be incorrect, but I seem to recall reading somewhere that in a SQL Server spatial operation, only one of the spatial indexes from one of the tables can be used in a single query (I wish I could remember where I came across that).  If that is correct, then maybe rewriting the query to focus on the points instead of the polygon is the way to go.  For example:

    SELECT count(*)
    FROM poly a, points b
    WHERE b.geom.STIntersects(a.geom) = 1

    The above query yields 1772 points but does it in less than 10 seconds on my rather old test hardware.  I guess this is one of those cases where optimizing the approach may be more beneficial than optimizing the spatial index. This of course is predicated on knowing something about the structure of the data ahead of time.
    Thursday, December 17, 2009 4:31 PM

All replies

  • Tuesday, December 15, 2009 1:37 AM
  • If you want a slightly more concise solution (or, at least, a first attempt at a solution):

    1.) Make sure you have a spatial index on the geom column of your table (you'll need a clustered primary key on the table first). This will create the default index on a geometry column:
    CREATE SPATIAL INDEX [idxGeometryIndex] ON [dbo].[MyTable] 
    ( [GEOM] )USING  GEOMETRY_GRID 
    WITH (
      BOUNDING_BOX =(-180, 0, -80, 85), GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM), 
      CELLS_PER_OBJECT = 16)
    GO
    2.) Make sure your query is using the index. Add an index hint if you have to (might not be necessary - check the execution plan):
    SELECT b.id
    FROM poly a (WITH(index(idxGeometryIndex)), points b
    WHERE a.geom.STContains(b.geom) = 1;

    See if that makes a difference.

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Wednesday, December 16, 2009 8:36 AM
    Answerer
  • You may also be interested in Bob's blog posting if you have any null/empty values in your points database.
    http://www.sqlskills.com/BLOGS/BOBB/post/Be-careful-with-EMPTYNULL-values-and-spatial-indexes.aspx

    Wednesday, December 16, 2009 3:00 PM
  • Thanx all for your quick replies.  Unfortunately, as I've mentioned in my original post, *all* the points are already within the bounding box of the polygon in question, so a spatial index, which determines which points are within the bounding box of the polygon, doesn't play any role here, or at least I'd be surprised if it did.  Also, none of the input geometries are null.

    I'm mainly exercing a point-in-polygon use case here which I'm assuming STIntersects, STContains, and STWithin have implemented shortcut algorithms to handle this special case.

    What I'm finding however, is that depending on the size and complexity of the input polygon and the number of queried points, SQL Server is typcially 10-100 times slower than the same spatial query run in PostGIS.

    Since I need this functionality in SQL Server, I'm wondering if anyone else has come across this and if there are solutions available.

    The only things that come to mind for me are to implement my own point-in-polygon C# function directly in SQL Server, or come up with something in the middle tier layer, and utilitizing JTS and nettopologysuite come up with the same function in .NET, out of the database.  Either of these options would require quite a bit of work on my part. 

    Any other ideas?
    Thanx so much all,
    Kevin

    Wednesday, December 16, 2009 5:17 PM
  • ...as I've mentioned in my original post, *all* the points are already within the bounding box of the polygon in question, so a spatial index, which determines which points are within the bounding box of the polygon, doesn't play any role here...

    What makes you think that? SQL Server uses a multi-level grid indexing system, not an R-tree - it has nothing to do with the bounding box of the polygon. Why not try adding an index first and see if it makes a difference?

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Wednesday, December 16, 2009 5:27 PM
    Answerer
  • SQL Server uses a multi-level grid indexing system, not an R-tree - it has nothing to do with the bounding box of the polygon.
    Hi Tanoshimi,

    It appears that adding an index over the single polygon and the points doesn't help.

    Here's a concrete example.  I posted a sql server 2008 sql script to load a single medium-sized valid polygon (http://pastebin.ca/1717471) and 2000 random points (http://pastebin.ca/1717481) to pastebin.

    Query using STContains:
    SELECT count(*)
    FROM poly a, points b
    WHERE a.geom.STContains(b.geom) = 1

    On my system, SQL Server runs this query in 79 seconds,  PostGIS in 5.6 seconds.

    The query plan does seem to indicate that at least the points index is used (the index hint you suggested to use didn't parse for me).  Do I need to customize the multi-level grids of both indexes?

    Thanx for your help.
    -- Kevin
    Wednesday, December 16, 2009 10:34 PM
  • Ok, I can see what's going on, but I'm not quite sure what you should do about it.

    Let's start by examining the situation - the following picture shows the data created by your two scripts:


    There's a single large multipolygon, and 2,000 points densely clustered near the centre of the geometry.
    Most of them (1,772) lie within the multipolygon, but the remaining 228 of them lie outside.
    Let's start off by checking that these are the same numbers as you were getting in PostGIS? (I'm not sure if ST_Contains in PostGIS uses MBR approximations as MySQL Spatial does).

    So, as you know, a SQL Server spatial index stores a reference to the grid cells intersected by a geometry. When you use that index in a query of the type WHERE GeomColumn.STContains(x) = 1, SQL Server tesselates geometry x to the same grid settings as the index on GeomColumn so that it can make a comparison of the grid cells occupied by each geometry, as an approximation to a comparison between the true geometries themselves. Note that this is difference from the approach used by PostGIS, which uses minimum bounding rectangles (MBRs).

    Now, based on this comparison of grid cells, there are three possible outcomes:

      1.) It might be possible to identify certain rows that definitely don't intersect the geometry in question, and therefore they can be immediately discarded from the resultset. In terms of the 'grid', if none of the cells stored in the index that intersect geometry A are the same as the cells that intersects geometry B, then geometry A and geometry B cannot themselves intersect. This is called the primary filter.
      2.) Based on the index, it might be possible to identify rows that definitely do meet the criteria for inclusion in the resultset without needing the secondary filter. In terms of the grid again, if geometry A partially intersects a cell that is completely covered by geometry B, then geometry A and geometry B must also intersect. This is called the internal filter.
      3.) It might not be possible to say with certainty whether the two geometries intersect or not. For example, if both geometries only partially intersect the same grid cell we cannot tell whether the geometries themselves intersect. In this case, the secondary filter must be called to refine the results.

    So, to get a fast spatial query, you want as much work as possible done at the primary filter stage. In other words, you want the results of the comparison between your grid cells to either be outcome 1.) (discarding), or 2.) (pre-selecting)

    What's happening in your query? To find out, look at the sp_help_spatial_geometry_index stored procedure for the poly index.

    Firstly, look at these results:
    Total_Number_Of_ObjectCells_In_Level0_In_Index    1
    Total_Number_Of_Border_ObjectCells_In_Level0_In_Index 1
    The fact that you've got a Level0 cell in the index, and that it's a border cell, suggests that one of the geometries in the index must touch the edge of the bounding box. Therefore the root cell has had to be added (because a 'touching' cell is still an intersecting cell). Your bounding box extends from (0,0), to (374713, 282688) and your polygon actually touches both these points, so to make sure it's completely contained you should extend your box bounds slightly.


    Total_Number_Of_ObjectCells_In_Level1_In_Index    45
    Using your default settings, you've got 45 level 1 cells in the index, and yet the CELLS_PER_OBJECT limit was set to 16. The fact that this limit was broken must mean that this was necessary in order to fully cover the geometry. In other words, your grid resolution is too high and you can't fully cover your geometry in 16 MEDIUM level 1 cells. So you either need lower grid resolution or more cells per object.


    Number_Of_Rows_Selected_By_Primary_Filter 1
    Number_Of_Times_Secondary_Filter_Is_Called 1
    Internal_Filter_Efficiency    0
    Primary_Filter_Efficiency    100
    These are the results when you use almost any of the points in the table as a query sample. So, no rows are getting discarded by the primary filter (because they partially intersect a cell that is also partially intersected by the multipolygon), but no rows are getting pre-selected either (because the multipolygon doesn't have any interior cells - only intersecting cells). So, pretty much every row is getting run through the secondary filter, which is why it's not really making any difference whether you use the index or not.


    The thing to remember when setting index grid settings is that you need to optimise them having consideration for both sides of the comparison in your query. So, in a query WHERE A.STContains(B), the same grid must be used to compare both A and B, and this is where your particular set of data is pretty unusual. Is this actual data that you intend to use, or is this just an example?
    It's fairly unusual to have an index on a column containing only a single geometry (your multipolygon), and for that geometry to occupy so much of the total extent of the grid of the index. As a result, to get your multipolygon fully tesselated, you have to have low grid resolution and large cells_per_object in the grid.
    However, your other geometry to which you are comparing are points, which in contrast are most closely approximated in a HIGH resolution grid index, and generally only require 1 cell per object.
    So there's an immediate discrepancy there.

    Futhermore, since almost all your points are located very close to, if not actually within the multipolygon, they all partially intersect cells that are also partially intersected by the multipolygon and cannot be discarded by the primary filter, so the index isn't helping you there.
    And finally, because there are so few internal cells for the multipolygon, you're not getting any internal filtering either....

    So, what to do? Well, you can improve the index on the polygon table slightly, by increasing the bounding box and making the grid resolution less, as follows:
    CREATE SPATIAL INDEX [poly_geom_idx] ON [dbo].[poly]
    ( [geom] )USING  GEOMETRY_GRID
    WITH (
    BOUNDING_BOX =(-1, -1, 374714, 282689), GRIDS =(
    LEVEL_1 = LOW,
    LEVEL_2 = LOW,
    LEVEL_3 = LOW,
    LEVEL_4 = LOW),
    CELLS_PER_OBJECT = 512
    )

    However, this doesn't make much difference to the overall execution time of the query You could try adjusting the values further, but I doubt you'd get down to anything like 6 seconds.

    Alternatively, if you only need an approximation, you can cut out the secondary filter step and get the results of the primary filter alone, as follows:
    SELECT count(*)
    FROM poly a
    JOIN points b ON a.geom.Filter(b.geom) = 1

    Note that this only gives an approximate result though (1800, rather than 1772), but it does give you the results in 0.1 seconds...

    Finally, if you were getting such good results in PostGIS, why are you now moving to SQL Server? Why not stick with PostGIS?


    Remember that all these results are because of the specific combination of data you have in your tables - if your point data were to cover a wider area covered by the multipolygon (and beyond), or if you were comparing whether polygons were contained within the multipolygon, you might find the index becomes a lot more helpful, because the primary filtering and internal filtering would start to take effect.


    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Thursday, December 17, 2009 10:48 AM
    Answerer
  • I may be incorrect, but I seem to recall reading somewhere that in a SQL Server spatial operation, only one of the spatial indexes from one of the tables can be used in a single query (I wish I could remember where I came across that).  If that is correct, then maybe rewriting the query to focus on the points instead of the polygon is the way to go.  For example:

    SELECT count(*)
    FROM poly a, points b
    WHERE b.geom.STIntersects(a.geom) = 1

    The above query yields 1772 points but does it in less than 10 seconds on my rather old test hardware.  I guess this is one of those cases where optimizing the approach may be more beneficial than optimizing the spatial index. This of course is predicated on knowing something about the structure of the data ahead of time.
    Thursday, December 17, 2009 4:31 PM
  • Yes, that's true - perhaps I didn't quite make that point clear in my rather drawn out post! In order to make the comparison of which cells the two geometries, they obviously need to be talking about the same grid... in other words, only a single spatial index.

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Thursday, December 17, 2009 5:16 PM
    Answerer
  • Wow.  Thank-you both for your insight and your detailed explaination, Tanoshimi.

    I'm still trying to wrap my head around the whole gridded index thing.  That the Filter() method, for example, yields a close 1800 result was quite surprising to me.  I obviously misread the docs ... I've been working with PostGIS for so long that I assumed the Filter() method was sort of equivalent to the bounding box overlap operator (&&) in PostGIS - clearly this is not the case.  I don't think this will work for my use case, but it's very good to know nonetheless.  So, just so I understand, Filter() returns false if the lowest grid cells of a and b definately don't intersect? But, then you said only one index ever gets used.  I get the same estimate of 1801 on b.geom.Filter(a.geom) and the reversed, a.geom.Filter(b.geom) ... the query plans looks the same as well.

    Tanoshimi, to answer a few of your questions,  yes, this data is canned.  In reality, I have over 4 million points within the bounding box of the polygon, but the density of the sample I posted looks about right.  As for the move to SQL Server, it's a client's request on a project I'm working on since their IT shop is already MS based.

    @Bixb0012, very curious.   I can't believe I didn't try that method/operand permutation.  I get 3 seconds on my test box here, even better than PostGIS!  Very kewl.  I'll definately have to try that in my application.

    Thanx again,
    -- Kevin
    Thursday, December 17, 2009 6:28 PM
  • bixb0012, is the speed increase of your query due to changing the order of a and b around? (which might be causing the optimiser to choose an alternative plan/index), or is it because you're using STIntersects() to test which of the points intersect the multipolygon, rather than STContains() to find those that are contained within it?

    Either way, it appears to have solved Kevin's problem, so I'm marking it as the answer.

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Friday, December 18, 2009 7:56 AM
    Answerer
  • tanoshimi, good observation.

    Since points cannot contain a polygon, I re-arranged the query to look at it from the points perspective instead of the polygon perspective, which of course caused me to use STIntersects.  In looking closer at execution plans and client statistics, the performance gain appears to be coming from switching to STIntersects from STContains.  STIntersects performs the same regardless of which table is used as the argument.

    I am not sure how to interpret the following, but the execution plan diagrams for STContains and STIntersects are visually identical in both structure and costs except for an extra 'Compute Scalar (Cost 0%)' in the STIntesects query.  It seems the performance is possibly tied to this extra step.  Now whether the extra step is possible (from a theoretical persepctive) with STIntersects but not with STContains, or whether the optimizer just made a bad call, is beyond me.  I will have to take a deeper look at the execution plans some other time.
    Friday, December 18, 2009 10:43 PM
  • After revisiting this last week, I have come to consider this issue a bug.  That is, in my mind, the query optimizer should be able to choose an execution plan for STContains that performs as timely as the plan being chosen for STIntersects.  So, I have opened a SQL Server Connect feedback item on this:

    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=525328

    For those reading this thread, and if this affects you, then I encourage you to vote and comment on the above SQL Server Connect feedback item.
    Wednesday, January 27, 2010 4:23 PM
  • how check check the execution plan thanks for ans
    Wednesday, May 5, 2010 9:31 AM
  • If you want a slightly more concise solution (or, at least, a first attempt at a solution):

    1.) Make sure you have a spatial index on the geom column of your table (you'll need a clustered primary key on the table first). This will create the default index on a geometry column:
    CREATE SPATIAL INDEX [idxGeometryIndex] ON [dbo].[MyTable] 
    ( [GEOM] )USING GEOMETRY_GRID 
    WITH (
     BOUNDING_BOX =(-180, 0, -80, 85), GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM), 
     CELLS_PER_OBJECT = 16)
    GO
    
    2.) Make sure your query is using the index. Add an index hint if you have to (might not be necessary - check the execution plan):
    SELECT b.id
    FROM poly a (WITH(index(idxGeometryIndex)), points b
    WHERE a.geom.STContains(b.geom) = 1;
    

    See if that makes a difference.

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290

    Could you give more explanation on it? It failed to solve my problem, I've got the same problem.
    Saturday, October 16, 2010 4:00 AM