SQL Server Express Performance Limitations With OGC Methods on Geometry Instances

  • Question

  • I will front-load my question.  Specifically, I am wondering if any of the feature restrictions in SQL Server Express cause performance limitations/reductions with OGC methods on geometry instances, e.g., STIntersects?  I have spent time reading various documents about the different editions of SQL Server, including Features Supported by the Editions of SQL Server 2014, but nothing is jumping out at me.  The limited information on spatial features in that document implies spatial is the same across all editions.  I am hoping this is wrong.

    The situation....  I have roughly 200,000 tax parcels within 175 taxing districts.  As part of a consistency check between the taxing district stored in tax records and the taxing district identified spatially, I set up a basic point-in-polygon query to identify the taxing district spatially and then count the number of parcels within each taxing district.  Surprisingly, the query took 66 minutes to run.  As noted, this is being run on a test machine with SQL Server Express.

    Some specifics....  I wrote the query a few different ways and compared the execution plans, and the optimizer always chose the same plan, which is good, I guess, since it means it is doing its job.  The execution plans show a 'Clustered Index Seek (Spatial)' being used and only costing 1%.  Coming in at 75% cost is a Filter, which appears to be connected to the STIntersects predicate.  I brute-forced alternate execution plans using hints, but they only turned out worse, which I guess is also good since it means the optimizer did choose a good plan.  I experimented some with changing the spatial index parameters, but the impact of the options I tried was never much.  I ended up going with "Geometry Auto Grid" with 16 cells per object.
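
    For illustration, a minimal sketch of that index setup (the index name and bounding box are placeholders; the table and point column are the ones used in the query posted below):

    -- Sketch of a "Geometry Auto Grid" index with 16 cells per object.
    -- Replace the bounding box with the actual extent of the parcel data.
    CREATE SPATIAL INDEX six_parcelinfo_shape_ct1   -- placeholder name
        ON cdstrl_parcelinfo (shape_ct1)
        USING GEOMETRY_AUTO_GRID
        WITH (BOUNDING_BOX = (0, 0, 1000000, 1000000),  -- placeholder extent
              CELLS_PER_OBJECT = 16);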

    So, why do I think 66 minutes is excessive?  The reason is that I loaded the same data sets into PostgreSQL/PostGIS, used a default spatial index, and the same query ran in 5 minutes.  Same machine, same data, SQL Server Express is 13x slower than PostgreSQL.  That is why I think 66 minutes is excessive.

    Our organization is mostly an Oracle and SQL Server shop.  Since most of my background and experience is with MS databases, I prefer to work with SQL Server.  I really do want to understand what is happening here.  Is there something I can do differently to get more performance out of SQL Server?  Does spatial run slower on Express than on Standard or Enterprise?  Given how little tuning I did in PostgreSQL, I still can't understand the results I am seeing.

    I may or may not be able to strip the data down enough to be able to send it to someone.

    • Edited by bixb0012 Wednesday, November 19, 2014 2:34 AM
    Wednesday, November 19, 2014 2:19 AM

Answers

  • How complex are your polygons? How many vertices? How "square-like" are they? That is, if you draw a bounding box around a polygon, what's the ratio of area inside/outside the polygon?

    My testing on this in 2008 R2 and 2012 (haven't tried 2014 yet) is that the engine can use a helping hand to maximise the number of 100% accurate .Filter hits and minimise the number of .Filter false positives.

    Have a read of this and try running the custom tessellation functions over your polygons (maybe persist the results if they're static) - https://social.msdn.microsoft.com/Forums/sqlserver/en-US/78132bca-2a95-45c1-b93c-64c83827dc4f/sql-2k8r2-any-performance-hints-for-bulk-loading-data-from-spatialgis-db-into-a-data-warehouse?forum=sqlspatial
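
    As a hedged illustration of the "persist them" idea, using the poster's table names (tvf_QuarterPolygon is the function from that linked thread; a signature of (geometry, max vertices) and an output column named shape are assumptions):

    -- Tessellate the static districts once and store the pieces.
    SELECT t.objectid,
           q.shape AS piece
    INTO   cdstrl_taxdistrict_tess   -- hypothetical staging table
    FROM   cdstrl_taxdistrict t
    CROSS APPLY dbo.tvf_QuarterPolygon(t.shape, 5000) q;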

    I've rewritten that logic (and added some extra steps that are feasible in C# vs T-SQL) as a CLR. I'm maybe a week or two from publishing it for free use, along with an explanation of what it's doing, benchmarks, etc.

    Here's an example of the speed results of the theory/algorithm when applied to different DB engines (this testing is about 1.5 years old and isn't meant as a comparison of DB engines, more as an example of how much tessellating complex polygons yourself helps point-in-polygon queries, especially when you have a much larger number of points compared to complex polygons).

    Data sources were:

    Polygons - I used the 2012 TIGER/Line Shapefiles for the polygon data http://www.census.gov/cgi-bin/geo/shapefiles2012/main

    Points - I used the main geonames table for the US http://download.geonames.org/export/dump/US.zip

    The query format is "join the point table to the polygons using STIntersects; return the name/label of the point and the name/label of the polygon, filtered as specified".

    Filter                         | SQL Server              | PostGRES               | Oracle
                                   | Original   | Simple     | Original   | Simple    | Original  | Simple
    -------------------------------|------------|------------|------------|-----------|-----------|---------
    Alaska (34k points returned)   | 40 sec     | 1.2 sec    | 2.8 sec    | 0.4 sec   | 10.8 sec  | 2.2 sec
    Colorado (49k points returned) | 21 sec     | 1.8 sec    | 2.2 sec    | 0.8 sec   | 5 sec     | 2.1 sec
    C% (190k points returned)      | 2:02 min   | 5.8 sec    | 12 sec     | 2.9 sec   | 1:10 min  | 6.5 sec
    All (2.1m points returned)     | 1:14:58 hr | 48 sec     | 5:30 min   | 34 sec    | 50:26 min | 1:20 min


    Jakub @ Adelaide, Australia Blog



    • Edited by jakubk Wednesday, November 19, 2014 11:38 PM
    • Marked as answer by bixb0012 Thursday, November 20, 2014 10:17 PM
    Wednesday, November 19, 2014 11:32 PM

All replies

  • Here is the last SQL statement I used with SQL Server:

    SELECT t.shape,
           t.tax_dist_name,
           s.parcel_count
    FROM   cdstrl_taxdistrict t
           INNER JOIN (SELECT st.objectid,
                              Count(*) AS parcel_count
                       FROM   cdstrl_taxdistrict st,
                              cdstrl_parcelinfo sp
                       WHERE  st.shape.STIntersects(sp.shape_ct1) = 1
                              AND st.tax_dist_name IS NOT NULL
                       GROUP  BY st.objectid) AS s
                   ON s.objectid = t.objectid 

    Functionally, I used the same SQL statement with PostgreSQL (the only difference being the syntax for STIntersects).  The execution plan image is a bit too large/spread out to insert here.
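
    For reference, the PostGIS form of the same statement (a sketch reconstructed from the plan posted further down; PostGIS uses ST_Intersects(a, b) rather than SQL Server's a.STIntersects(b) = 1 method syntax):

    SELECT t.geom,
           t.tax_dist_n,
           s.parcel_count
    FROM   taxdistrict t
           INNER JOIN (SELECT st.gid,
                              Count(*) AS parcel_count
                       FROM   taxdistrict st,
                              parcelinfo sp
                       WHERE  ST_Intersects(st.geom, sp.geom_ct1)
                              AND st.tax_dist_n IS NOT NULL
                       GROUP  BY st.gid) AS s
                   ON s.gid = t.gid;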



    • Edited by bixb0012 Wednesday, November 19, 2014 3:00 AM
    Wednesday, November 19, 2014 2:55 AM
  • I split the execution plan in three (Left, Middle, Right) and inserted the images below.  There is one element of overlap between the images to help orient oneself when transitioning between images.

    Execution Plan Left:

    Execution Plan Middle:


    Wednesday, November 19, 2014 3:20 AM
  • (Can only insert two images per message, final image here)

    Execution Plan Right:

    Wednesday, November 19, 2014 3:25 AM
  • If you want to look for limits of Express vs. Enterprise, download a 30-day copy of SQL Server Evaluation edition and compare your Express results against that. There are no spatial-specific differences, but resource differences (i.e., number of cores) might explain some of it.

    In general, it's not a good idea to run two databases on the same machine at the same time; they'll fight for resources. If you haven't already, shut the other database down (disable the service so it doesn't start at startup time during testing) and reboot before doing any testing, as with any two-databases-on-the-same-machine test.

    Spatial-specific, the query plan "picture" without the counts doesn't help much, except to show that you are using some spatial index. The other numbers (e.g. cost) that you're quoting are estimates. Look at the counts in an actual query plan. Do you have a spatial index on the largest spatial table (the tax parcel table)? And only one spatial index? Are tax parcels really points? If they aren't, it's not really a point-in-polygon query. If the index on tax parcels doesn't work well, you might try one on tax districts instead. 66 minutes DOES sound extremely excessive. You may have to break your spatial query in two (one that just does the STIntersects, one that counts and joins in the other table), though looking at the query plan, it doesn't look like this should be necessary. You may also be able to rephrase it so it doesn't need the "outer" re-join of the tax district table. You could also try using a traditional HHHH index on your point table.
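
    The "break it in two" idea could look something like this (a sketch using your table names and a temp table; untested):

    -- Step 1: only the spatial join, materialized.
    SELECT st.objectid
    INTO   #district_hits
    FROM   cdstrl_taxdistrict st
           INNER JOIN cdstrl_parcelinfo sp
               ON st.shape.STIntersects(sp.shape_ct1) = 1
    WHERE  st.tax_dist_name IS NOT NULL;

    -- Step 2: count and re-join to the district table; no spatial work here.
    SELECT t.shape,
           t.tax_dist_name,
           h.parcel_count
    FROM   cdstrl_taxdistrict t
           INNER JOIN (SELECT objectid,
                              Count(*) AS parcel_count
                       FROM   #district_hits
                       GROUP  BY objectid) h
               ON h.objectid = t.objectid;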

    In addition, I believe PostGIS uses R-Tree indexes, while SQL Server uses a different classification of spatial index (a hierarchical grid index). Each may perform better on some queries and worse on others. Are you using geometry or geography in SQL Server? And in Postgres? Geography is slower than geometry for calculations. Have a look at the Postgres query plan and see if it's doing anything differently.

    Cheers, Bob



    • Edited by Bob Beauchemin Thursday, November 20, 2014 3:18 AM Add SQL Server spatial index classification
    Wednesday, November 19, 2014 9:43 PM
  • There is no difference in Express other than resource restrictions.  Express is not designed for performance, so you cannot use it for performance testing of your queries.  It may work entirely differently in Standard or Enterprise.

    Wednesday, November 19, 2014 10:05 PM
  • I understand Express is not designed for performance.  That being said, if Express is used in certain operational situations, which it is in our organization, then I think looking at the performance of how queries will run on Express is relevant.

    If the only difference with Express is "resource restrictions," I guess what I am trying to understand is what types of resource restrictions play the biggest role in having a simple spatial query "work entirely differently in Standard and Enterprise" when compared to Express.  Out of all of the resources that are restricted, which ones typically affect spatial queries like this when the data sets are fairly small?

    I guess I will download a trial version of Standard and Enterprise to give it a shot on those.  I was hoping someone who works with spatial more often could give feedback on the types of restrictions in Express that tend to drive performance differences with the more robust editions.

    Wednesday, November 19, 2014 10:21 PM
  • The Evaluation version of Enterprise Edition is a good idea, I will have to look into that.  Does Enterprise Edition do more with parallel execution with spatial queries?  The current Express instance is one database with two layers and one user.  Unless Enterprise can access more cores for a single query, I am not sure if number of cores will make a difference, but it doesn't hurt to try.

    The two databases don't run at the same time, they are simply both installed on the same machine so I can spin up whichever one I need when I need it.  For this specific situation, there isn't any competition for resources between the two.

    Actual versus Estimated execution plan, my bad.  I didn't realize until this morning that I had put the wrong one online.  I had both, just mixed them up while splitting them into image parts.

    Yes, I have a spatial index on the largest table (tax parcel).  Originally, the table just had a polygon field, but I added a point field as well so I could take the STCentroid computation out of the equation and build an index.  Both the polygon field and point field have spatial indexes, but I am only referencing the point field of the parcel table to make it a point-in-polygon query.

    The traditional HHHH index, I will look into that.
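
    For reference, it looks like a traditional grid index with all four levels set HIGH would be something along these lines (index name and bounding box are placeholders):

    CREATE SPATIAL INDEX six_parcelinfo_hhhh   -- placeholder name
        ON cdstrl_parcelinfo (shape_ct1)
        USING GEOMETRY_GRID
        WITH (BOUNDING_BOX = (0, 0, 1000000, 1000000),  -- placeholder extent
              GRIDS = (LEVEL_1 = HIGH, LEVEL_2 = HIGH,
                       LEVEL_3 = HIGH, LEVEL_4 = HIGH),
              CELLS_PER_OBJECT = 16);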

    I am sticking with geometry in both SQL Server and PostGIS because of the extra effort/overhead of working in geography.  I have looked at the query plan for PostGIS, it is different, seems simpler for lack of a better term.  I can't get the costs/times/etc... to display alongside the plan diagram so I will insert the diagram first followed by a breakdown of times.

    "Hash Join  (cost=16624.86..16642.01 rows=174 width=252532) (actual time=271119.793..271119.872 rows=174 loops=1)"
    "  Output: t.geom, t.tax_dist_n, s.parcel_count"
    "  Hash Cond: (t.gid = s.gid)"
    "  ->  Seq Scan on public.taxdistrict t  (cost=0.00..14.75 rows=175 width=252528) (actual time=0.197..0.234 rows=175 loops=1)"
    "        Output: t.gid, t.tax_dist_n, t.shape_star, t.shape_stle, t.geom"
    "  ->  Hash  (cost=16622.69..16622.69 rows=174 width=12) (actual time=271119.581..271119.581 rows=174 loops=1)"
    "        Output: s.parcel_count, s.gid"
    "        Buckets: 1024  Batches: 1  Memory Usage: 8kB"
    "        ->  Subquery Scan on s  (cost=16619.20..16622.69 rows=174 width=12) (actual time=271119.514..271119.561 rows=174 loops=1)"
    "              Output: s.parcel_count, s.gid"
    "              ->  HashAggregate  (cost=16619.20..16620.95 rows=174 width=4) (actual time=271119.512..271119.546 rows=174 loops=1)"
    "                    Output: st.gid, count(*)"
    "                    ->  Nested Loop  (cost=0.28..14494.70 rows=424901 width=4) (actual time=30.251..270728.829 rows=187179 loops=1)"
    "                          Output: st.gid"
    "                          ->  Seq Scan on public.taxdistrict st  (cost=0.00..14.75 rows=174 width=252511) (actual time=0.004..0.401 rows=174 loops=1)"
    "                                Output: st.gid, st.tax_dist_n, st.shape_star, st.shape_stle, st.geom"
    "                                Filter: (st.tax_dist_n IS NOT NULL)"
    "                                Rows Removed by Filter: 1"
    "                          ->  Index Scan using parcelinfo_ct1_gist on public.parcelinfo sp  (cost=0.28..83.16 rows=6 width=32) (actual time=15.903..1554.503 rows=1076 loops=174)"
    "                                Output: sp.gid, sp.prcl_nbr, sp.acreage, sp.tax_dist_n, sp.ownidn, sp.owname, sp.owadr2, sp.owadr1, sp.owadr3, sp.owadr4, sp.taxidn, sp.txname, sp.txadr1, sp.txadr2, sp.txadr3, sp.txadr4, sp.township, sp.range, sp.section_,  (...)"
    "                                Index Cond: (sp.geom_ct1 && st.geom)"
    "                                Filter: _st_intersects(sp.geom_ct1, st.geom)"
    "                                Rows Removed by Filter: 279"
    "Total runtime: 271125.764 ms"

    Wednesday, November 19, 2014 10:48 PM
  • Express is limited to 1 CPU and 1GB of RAM.  That has a huge impact on Express performance overall.  There is nothing specific to spatial data in Express that I know of.

    Wednesday, November 19, 2014 11:13 PM
  • Yes, spatial queries that use spatial indexes can be parallelized in Enterprise Edition.

    You can also try it without the spatial index in SQL Server: either delete the spatial indexes or use a hint to prevent their use.

    Look at the counts in the SQL Server query plan to see how many rows are actually being fetched after the spatial index filters out rows (the clustered index seek in the middle of the plan). If you want, you can send me an actual plan in email (bobb-at-sqlskills) and I'll have a look; however, it might not be immediate.

    You can also use the spatial index diagnostic procs to see how well your spatial index is working.
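
    For example, the documented diagnostic proc for geometry indexes looks like this (the index name is a placeholder, and @query_sample should be a geometry covering a typical query window):

    DECLARE @sample_region geometry =
        geometry::STGeomFromText(
            'POLYGON((0 0, 0 1000, 1000 1000, 1000 0, 0 0))', 0);  -- placeholder window

    EXEC sp_help_spatial_geometry_index
         @tabname        = 'cdstrl_parcelinfo',
         @indexname      = 'six_parcelinfo_shape_ct1',  -- placeholder index name
         @verbose_output = 1,
         @query_sample   = @sample_region;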

    I'm not going to comment on the PostGIS plan; this may be a place where an R-Tree index works better. But looking at your SQL Server plan, 66 minutes is excessive.

    Cheers, Bob

    Wednesday, November 19, 2014 11:40 PM
  • I will definitely take a look and let you know.  A quick response for now: at a small map scale, the taxing districts seem fairly simple in terms of shape:

    At a larger map scale, the complexity within some of the taxing district bounding boxes becomes apparent.

    I understand someone could argue that some of the fine line work or perimeters could, and maybe should, be simplified; but this is about understanding why SQL Server and PostgreSQL are producing such different results with the same data set.

    Thursday, November 20, 2014 12:29 AM
  • Not to be argumentative, but is this query representative of the queries that you run all the time? The query counts the number of parcels in each tax district, for all the tax districts. That means the engine must seek to every row in the spatially indexed table (with the SQL Server spatial index on, say, parcel), just to find out what we already know from the spatial index: the tax district. In other words, if you have 200,000 parcel rows, you're doing 200,000 seeks on the base (parcel) table, even though, since it's point-in-polygon, the index has already found the polygon for every point. That's just the way this particular spatial query plan works. Even so, I'm still not convinced that this should take 66 minutes.

    There's an explanation of how to read a spatial query plan in a chapter I wrote in the "SQL Server 2012 Internals" book. When you use a spatial index, you end up doing a seek into the base table for each candidate row, even if SQL Server can derive that single candidate row via the spatial index. And because, in *this particular query*, each parcel is a candidate row and all rows are "intersected", that's a lot of seeks. If you are looking for a single specific parcel/tax district, or a small subset of them, it's much faster. That's why I suggested trying no spatial index, or a spatial index on tax district, for *this particular* query. As Jakub noted, point-in-polygon queries that push a *large number of points through a spatial index on the point column* are not where SQL Server's spatial index shines vs. R-Tree (I know Oracle uses R-Tree). SQL Server's spatial index is good for queries that return a small number of candidate rows. There is a question of polygon complexity as well.

    There is, however, a way to optimize point-in-polygon spatial queries by storing the Lat and Long separately, as documented in this whitepaper: http://social.technet.microsoft.com/wiki/contents/articles/9694.tuning-spatial-point-data-queries-in-sql-server-2012.aspx  The whitepaper also explains how the SQL Server spatial index works (primary filter/secondary filter, etc.), as well as providing specific optimizations for these types of queries. However....

    A different way of looking at the same problem is to store the tax district number in each row of the parcel info table, computed once and persisted. You'd only need to recalculate it with a spatial query when a parcel is added or updated, or when the tax districts change. Unless you're doing a "what-if" tax re-districting exercise, how often does that happen?

    If the tax district is stored in the parcel row, your query with the "group by tax district" becomes a simple, non-spatial join, just like you don't spatially compute which state an address is in, or which state a city is in, with a spatial query. That info is computed/entered and stored once. A sketch of the idea follows.
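
    Something like this, with a hypothetical tax_dist_id column added to the parcel table:

    -- Compute and store the district once (re-run only when parcels/districts change).
    UPDATE sp
    SET    sp.tax_dist_id = st.objectid
    FROM   cdstrl_parcelinfo sp
           INNER JOIN cdstrl_taxdistrict st
               ON st.shape.STIntersects(sp.shape_ct1) = 1;

    -- The consistency check is then a plain non-spatial aggregate.
    SELECT t.tax_dist_name,
           Count(*) AS parcel_count
    FROM   cdstrl_parcelinfo sp
           INNER JOIN cdstrl_taxdistrict t
               ON t.objectid = sp.tax_dist_id
    GROUP  BY t.tax_dist_name;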

    Hope this helps, Cheers, Bob

    Thursday, November 20, 2014 3:11 AM
  • Bob, thanks for all the good information, I will definitely look it over.  As you point out, this query answers a question that should already have an answer, so I wouldn't say it would be run very often.  At times, the query might be run to validate that spatially-stored information lines up with non-spatially stored information.

    Beyond this specific dataset: yes, this query is representative of a type of query we use more regularly, summarizing events (plant/animal occurrences, illegal dumping, search & rescue callouts, etc.) by geographic zones that wouldn't be collected with the event.  For example, law enforcement doesn't record the watershed of an illegal dumping, but hydrologists or fisheries biologists might be interested in summarizing such events by watersheds across the landscape.  I chose tax parcels and taxing districts for this example because I thought it would be an easier context to discuss.

    Some of what you say above does make me think about alternative ways of storing the information; however, we have other middleware involved in our workflows, so I would have to think through the consequences of changing the data structure.  That said, it is definitely worth thinking about.

    Thursday, November 20, 2014 3:02 PM
  • Tessellating the polygons (tax districts) is the answer!

    Since my use of SQL Server Express was brought up as possibly contributing to the slow runtime, the first thing I did was download an evaluation version of Enterprise Edition.  The runtime on Enterprise Edition dropped from 66 minutes to 57.5 minutes.  A 13% reduction isn't anything to scoff at, but the total runtime was still 11x longer than in PostgreSQL.  Although Enterprise Edition had 4 cores available to it, it never really spun up more than 1 when executing the query, so it doesn't seem to have parallelized the query much, if at all.

    You asked about polygon complexity.  Overall, a majority are fairly simple, but there are some complex ones, including one really complex polygon.  Using the complexity index discussed in the referenced thread, the tax districts had an average complexity of 4.6 and a median of 2.7.  One polygon had a complexity index of 120, which was skewing the average, as well as increasing the runtime, I suspect.  Below is a complexity index breakdown:

    Index | NUM_TAX_DIST
    ------|-------------
    1     | 6
    <2    | 49
    <3    | 44
    <4    | 23
    <5    | 11
    <6    | 9
    <7    | 9
    <8    | 4
    <9    | 1
    <10   | 4
    >=10  | 14

    Before trying tessellation, I tweaked the spatial indexes in several different ways, but the runtimes never changed by more than a minute or two.  I reset the spatial indexes to "geometry auto grid @ 32" and tried out your tessellation functions using the default of 5,000 vertices.  Total runtime: 2.3 minutes, a 96% reduction and twice as fast as PostgreSQL!  Now that is more like what I was expecting before I started.

    I tried different thresholds, 3,000 and 10,000 vertices, but the runtimes were slightly slower, 3.5 and 3.3 minutes respectively.  A threshold of 5,000 definitely seems to be a sweet spot for the dataset I am using.  As the thread you referenced discussed, SQL Server spatial functions like STIntersects appear to be sensitive to the number of vertices of polygons.

    After reading your comment, it reminded me of some discussions with Esri staff about ArcGIS doing the same thing in certain circumstances, but I hadn't gone as far as thinking to apply it here.  So, thanks for the suggestion and the code from the other post.  Once I realized the SRID was hard-coded to 0 in tvf_QuarterPolygon, I was able to update the code to set it to the same SRID as the input shape, and then everything came together nicely.  A sketch of the pattern I ended up with is below.
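
    Roughly the shape of the final query (a sketch; tvf_QuarterPolygon comes from the thread Jakub linked, its output column name and the parcel key are assumptions, and COUNT(DISTINCT ...) guards against a point on a shared edge of two pieces being counted twice):

    SELECT st.objectid,
           Count(DISTINCT sp.objectid) AS parcel_count  -- de-dup edge hits; key column assumed
    FROM   cdstrl_taxdistrict st
           CROSS APPLY dbo.tvf_QuarterPolygon(st.shape, 5000) q
           INNER JOIN cdstrl_parcelinfo sp
               ON q.shape.STIntersects(sp.shape_ct1) = 1
    WHERE  st.tax_dist_name IS NOT NULL
    GROUP  BY st.objectid;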


    • Edited by bixb0012 Thursday, November 20, 2014 10:16 PM
    Thursday, November 20, 2014 10:14 PM
  • Glad it helped! I usually saw a 10-100x improvement in query speed, and it scaled the right way: the queries that took tens of minutes or hours were the ones that tended towards the 100x improvement.

    Did you try 1,500? That seemed like the best 'generic' vertex count across the whole bunch of polygons I tried. Although I admit, none were 'inverse' polygons like some of yours appear to be.

    Are you persisting the tessellated polygons in a table, or are you running directly against your base polygon table and using the cross apply? The TVF adds quite a bit of overhead once you're tessellating more than a few polygons at once. One mod you could do is add a hard limit to cte.level; I found 8 levels of recursion was the point where you'd hit diminishing returns.

    As mentioned above, I've written a more robust version in C# as a CLR that should be up here http://simplifyspatial.hejsolutions.com in a week or two. Besides returning the tessellated polygons faster, it also does a few better checks to decide on complexity, has a hard recursion limit so it doesn't keep going forever (e.g. spiral-shaped polygons), deals with multipolygons, etc.


    Jakub @ Adelaide, Australia Blog


    • Edited by jakubk Friday, November 21, 2014 1:02 AM
    Friday, November 21, 2014 1:02 AM
  • I just tried 1,500 vertices; it ran slightly slower than 5,000 but not as slow as 3,000 or 10,000.

    I did go back and try the same approach in SQL Server Express, since that is the edition where this all began.  The runtime was 3.3 minutes, which is 30% longer than with Enterprise Edition, but it is still significantly faster than Enterprise Edition without splitting polygons.

    I am not persisting, just running directly against the base polygon table and using cross apply, which is why the reduction in runtimes is so impressive.  The overhead of tessellating definitely pushed SQL Server Express to its restricted resources, but it wasn't much for Enterprise Edition.

    I did have one polygon with a complexity index of 120, and it needed more than 8 rounds to get down to around 5,000 vertices, but 8 would have handled everything else.

    I look forward to seeing and testing the final product, especially considering how well the rough-out code works.

    Friday, November 21, 2014 2:25 AM