Spatial indexing and point-in-poly performance

  • Question

  • Hi there

    I work at a Canadian university where I develop reports outlining alumni demographics and geographic distribution.

    Previously, in SQL SERVER 2005, I built a report that counts alumni within 60-minute drive times from each of 140 major market centroids. To do this I created drive-time polygons in ArcGIS, imported the Canadian national postal code point table, and then, using spatial analysis tools in ArcGIS, created an attribute table consisting of DRIVE TIME POLYGON ID and POSTAL CODE (since polygons can overlap, a postal code can be represented multiple times). I then imported this matchup file into SQL SERVER and joined the postal field to my alumni address records to produce counts of alumni by market.

    The university recently implemented SQL SERVER 2008 and I am trying to eliminate the need to do the point-in-poly in ArcGIS. Using Shape2SQL I have successfully imported my 140-polygon drive-time shapefile and my 825,000-record Canadian postal code point file. Using info from Aitchison's book, I have now successfully tested my polygons for validity, fixed the invalid polygons, and, using the STWithin() method, I can now do point-in-poly queries.
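    For reference, the repair-and-query step looks roughly like this (table and column names here are illustrative, not my actual schema):

    ```sql
    -- Repair any invalid polygons in place (geometry type assumed).
    UPDATE dbo.DriveTimePolygons
    SET    Shape = Shape.MakeValid()
    WHERE  Shape.STIsValid() = 0;

    -- Point-in-poly: one row per (polygon, postal code) pair, so a postal
    -- code covered by overlapping drive times appears more than once.
    SELECT dt.PolygonID, pc.PostalCode
    FROM   dbo.DriveTimePolygons AS dt
    JOIN   dbo.PostalCodes       AS pc
           ON pc.Location.STWithin(dt.Shape) = 1;
    ```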

    The part I am having trouble getting my head around is the spatial index. I have a national polygon file and a national postal code file and I am not sure what to use as my bounding box. 92% of our alumni (390,000) live in the province of Ontario while 74% live within the Greater Toronto Metropolitan Area (GTA). Additionally, 33% of the Canadian postal code points are located in Ontario, and 13% of the same table's observations are found in the GTA.

    My questions are these: will executing the point-in-poly routine in SQL SERVER 2008 using STWithin() and spatial indexing be able to outperform the lookup table I created using ArcGIS and subsequently imported? And am I on the right track in looking at using the province of Ontario or the GTA as the spatial extent for an index bounding box?

    Thanks in advance.

    Rich

    Friday, July 10, 2009 2:08 PM

Answers

  • The part I am having trouble getting my head around is the spatial index.

    You're not the only one ;)

    It's nearly impossible to generalise what the best spatial index settings are - they're specific to your dataset and to the sort of queries you will be issuing against that data. The (perhaps unsatisfying) answer is therefore that you need to experiment with different indexes, take some performance measurements, and see which works best for you.

    The only other advice I can think of is:
    1.) Remember that you can have more than one spatial index on a given column. So you can have two indexes, each with a different bounding box - one for Ontario and one for the GTA. Each can also have different grid settings.
    2.) Upgrade to SP1 if you haven't already - the query plan choice is much better with regard to using spatial indexes.
    3.) Interrogate the system diagnostic procedures, e.g. sp_help_spatial_geometry_index, to see how effective your indexes are being.
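    As a sketch, two such indexes might look like this (the bounding-box coordinates are rough lon/lat extents for Ontario and the GTA, and the table, column and index names are placeholders - substitute the true extents of your own data):

    ```sql
    -- Province-wide index with coarser grids.
    CREATE SPATIAL INDEX sidx_postal_ontario
    ON dbo.PostalCodes (Location)
    USING GEOMETRY_GRID
    WITH (BOUNDING_BOX = (-95.2, 41.6, -74.3, 56.9),
          GRIDS = (MEDIUM, MEDIUM, MEDIUM, MEDIUM));

    -- GTA-only index with finer grids over a much smaller box.
    CREATE SPATIAL INDEX sidx_postal_gta
    ON dbo.PostalCodes (Location)
    USING GEOMETRY_GRID
    WITH (BOUNDING_BOX = (-80.1, 43.4, -78.4, 44.4),
          GRIDS = (HIGH, HIGH, HIGH, HIGH));
    ```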

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    • Marked as answer by Kellaro Friday, July 10, 2009 6:09 PM
    Friday, July 10, 2009 2:38 PM
    Answerer
  • Hi Rich,
     
    Look at the query plan to confirm the spatial index is being used. You should also install SQL Server SP1 and use appropriate queries to ensure your spatial index is being used. http://www.sqlskills.com/BLOGS/BOBB/post/How-to-ensure-your-spatial-index-is-being-used.aspx
     
    The premise with a spatial index is that you're saving yourself an expensive operation (STWithin) by eliminating most of the candidates. To look for best performance, you can try a spatial index with a bounding box of the GTA and one with the bounding box of the province of Ontario and benchmark each one using the spatial index diagnostic stored procedures (sp_help_spatial_geometry_index_xml, doc'd here: http://msdn.microsoft.com/en-us/library/cc627383.aspx). There is additional information in the series beginning here: http://www.sqlskills.com/BLOGS/BOBB/post/Spatial-Index-Diagnostic-Procs-Intro.aspx. I'd start off by defining spatial indexes with densities of all "high" and all "medium", although I suspect you'd get better perf from density all "high".
     
    You would use one of your drive-time polygons as a query sample and look for the best performance. You're looking for the best Primary Filter Efficiency and Internal Filter Efficiency. With point-in-polygon queries, internal filter efficiency is especially important. You're looking to minimize number_of_times_secondary_filter_is_called relative to the number of rows returned. You can run the diagnostic procs multiple times with different (individual) drive-time polygons to get a feel for different queries.
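    A sketch of such a diagnostic run (object names here are placeholders):

    ```sql
    DECLARE @poly geometry, @x xml;

    -- Pick one drive-time polygon as the query sample.
    SELECT @poly = Shape
    FROM   dbo.DriveTimePolygons
    WHERE  PolygonID = 1;

    EXEC sp_help_spatial_geometry_index_xml
         @tabname       = 'dbo.PostalCodes',
         @indexname     = 'sidx_postal_gta',
         @query_sample  = @poly,
         @verboseoutput = 1,
         @xml_output    = @x OUTPUT;

    -- Inspect Primary_Filter_Efficiency, Internal_Filter_Efficiency and
    -- Number_Of_Times_Secondary_Filter_Is_Called in the XML result.
    SELECT @x;
    ```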
     
    Be aware that you can have multiple spatial indexes, but you may have to use index hints to get the correct one for the query. Also, if you're doing a spatial join (i.e. tablea.spatialcol.STWithin(tableb.spatialcol)), you cannot currently use a spatial index on both tables.
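    A hint-forced query might look like this (index and object names are placeholders):

    ```sql
    DECLARE @poly geometry;
    SELECT @poly = Shape FROM dbo.DriveTimePolygons WHERE PolygonID = 1;

    -- Force the GTA index rather than letting the optimizer choose.
    SELECT pc.PostalCode
    FROM   dbo.PostalCodes AS pc WITH (INDEX(sidx_postal_gta))
    WHERE  pc.Location.STWithin(@poly) = 1;
    ```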
     
    To answer your first question last, I really don't know how this will compare with your current process. You'll need to benchmark it to see. Write back to the forum if you have further questions.
     
    Hope this helps,
    Cheers,
    Bob Beauchemin
    SQLskills
    • Marked as answer by Kellaro Friday, July 10, 2009 6:09 PM
    Friday, July 10, 2009 3:23 PM

All replies

  • Thanks to both of you for the insightful responses.  I still have a bunch more to learn, and you have helped by pointing me in relevant directions.

    Rich
    Friday, July 10, 2009 6:10 PM
  • @Kellaro: Heya mate, welcome to the Spatial Forums :) Any chance you can tell me where you grabbed your Canadian Data from?

    Bob said:
    I'd start off by defining spatial indexes with densities of all "high" and all "medium"; although I'd suspect you'd get better perf from density all "high".
    @Bob, i thought having LOW density (4x4?) means it might be a bit more performant (because there's less data) but you will most likely get more false-positives. High (16x16) == very fine-detailed and accurate. that's what i thought? also, Bounding Boxes are for GEOM's, only?
    -Pure Krome-
    Monday, July 13, 2009 1:20 AM
    When we refer to LOW grids as generally quicker and HIGH as more accurate, we're only talking about the performance of the primary filter of the index, not the query as a whole.

    - LOW grid indexes consist of a small number of relatively large grid cells. Thus they're small and fast to query, but the cells don't have a very 'tight' fit around geometries, and they generate more false positives.

    - HIGH grid indexes consist of a greater number of relatively small grid cells. This makes the total size of the index larger, which can take longer to interrogate, but means that the results obtained from the index are more accurate.

    The relative speed of the LOW index is offset by the fact that the secondary filter (e.g. the STWithin method, in this case) then has to do more work to get rid of the false positives. The final 'accuracy' of queries using either method is exactly the same* - by changing the index settings all you're doing is varying how many records get filtered by the primary filter compared to how many get filtered by the secondary filter.

    And yes, bounding boxes are geom only. As the OP refers to both bounding boxes and the STWithin() method, I am assuming that we are talking about the geometry datatype here (although the rules above apply to both).

    *Unless you're using Filter(), which only performs a primary filter.
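    The difference is easy to see side by side (object names are placeholders):

    ```sql
    DECLARE @poly geometry;
    SELECT @poly = Shape FROM dbo.DriveTimePolygons WHERE PolygonID = 1;

    -- Primary filter only: fast, but may count false positives near edges.
    SELECT COUNT(*) FROM dbo.PostalCodes AS pc
    WHERE  pc.Location.Filter(@poly) = 1;

    -- Primary + secondary filter: the exact answer.
    SELECT COUNT(*) FROM dbo.PostalCodes AS pc
    WHERE  pc.Location.STWithin(@poly) = 1;
    ```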
    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Monday, July 13, 2009 8:21 AM
    Answerer
  • Hey PK

    I get an 'enhanced' postal code conversion data file as part of a neighborhood segmentation or cluster system we license from a Canadian firm called Environics Analytics.

    Another company here in Toronto called DMTI Spatial also distributes Canadian geographic files.

    Rich
    Monday, July 13, 2009 2:32 PM
  • In addition to what Tanoshimi already wrote...
     
    You won't get a larger index, because you're indexing points. You're likely to get one index row per data row (two at most, if the point lies on a cell border).
     
    Since the query sample (the polygon) will be tessellated during the query using the same density, you're likely to get more advantage out of the "internal filter" (i.e. covering cells intersecting points) with a higher-density index and sometimes (depending on the size of the polygon) with more cells per object. Have a look at my blog entries about the spatial index analysis procs for more info about the internal filter.
     
    Cheers,
    Bob Beauchemin
    SQLskills

    Monday, July 13, 2009 8:38 PM