SQL 2K8R2: Any performance hints for bulk loading data from spatial/GIS db into a data warehouse?

  • Question

  • I need to denormalise a number of GIS tables to materialise spatial hierarchies for use in a data warehouse, but the performance is killing me.

    For example:

    I have a table with 500,000 points stored as a geometry data type. These are the x,y coordinates of water meters.

    I have another table with ~1 million rows of geometry shapes (polygons). These represent the land usage of all the land parcels in the state, i.e. vacant land, residential, commercial, industrial, utilities, etc.

    I've tried using

    select m.id, l.Description
    FROM dbo.Meters m
    LEFT JOIN dbo.Land l
    ON m.Shape.STIntersects(l.Shape) = 1

    but I stopped it after it had returned only a few hundred rows in a few minutes.

    The GIS db has the spatial columns indexed and the correct bounding box (i.e. the state) defined.

    This is just the first attribute of many that I'm going to need to associate with the points I'm pulling out, and performance-wise I'm not seeing how this will be possible.

    Anybody done this before and have any suggestions on more efficient approaches? I have very little experience with spatial data.


    Adelaide, Australia

    Friday, March 22, 2013 6:34 AM

All replies

  • Hi Adelaide,

    Please refer to the following document to tune Spatial Point Data queries.

    Tuning Spatial Point Data Queries in SQL Server 2012
    http://social.technet.microsoft.com/wiki/contents/articles/9694.tuning-spatial-point-data-queries-in-sql-server-2012.aspx


    Allen Li
    TechNet Community Support

    Monday, March 25, 2013 2:28 AM
  • I'm using 2008 R2, not 2012, and my understanding is that there was an overhaul of how spatial queries work in 2012.

    I tried some of the tips anyway, but it didn't make any difference. Any 2008 R2-specific tips for merging a large set of points with a large set of distinct complex polygons?


    Adelaide, Australia

    Monday, March 25, 2013 10:55 PM
  • use index http://msdn.microsoft.com/en-us/library/bb964712.aspx

    from OP: The GIS db has the spatial columns indexed and the correct bounding box (i.e. the state) defined.


    Adelaide, Australia

    Monday, April 1, 2013 10:59 PM
  • I get the following if I try to supply the spatial index as query hints on both tables:

    Msg 8622, Level 16, State 1, Line 1

    Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.



    Adelaide, Australia

    • Moved by Allen Li - MSFT Wednesday, April 3, 2013 5:38 AM SQL Server Spatial
    • Merged by Allen Li - MSFT Wednesday, April 3, 2013 5:43 AM Same question
    Monday, April 1, 2013 11:03 PM
  • I think you may get a better response in the dedicated Spatial forum. Do you want me to move your thread there?

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Monday, April 1, 2013 11:07 PM
  • I thought so too, but I already tried there without much luck.

    The spatial forum doesn't seem to be used much; there are only 2 topics there that have been posted in over the last 5 days.

    This forum is much more widely visited, so I'm hoping the more eyes that see it, the higher the chance of useful replies.


    Adelaide, Australia

    Tuesday, April 2, 2013 12:30 AM
  • Hi Adelaide,

    Thank you for your question. 

    I am trying to involve someone more familiar with this topic to take a further look at this issue. Some delay might be expected during the handover. Your patience is greatly appreciated. 

    Thank you for your understanding and support.


    Allen Li
    TechNet Community Support

    Wednesday, April 3, 2013 6:10 AM
  • Hi,

    Make sure you have the latest service pack and cumulative update for SQL Server 2008 R2.

    Also take a look at these articles; they might help you:

    - video.ch9.ms/teched/2012/eu/DBI405.pptx

    - http://gis.stackexchange.com/questions/12030/optimize-nearest-neighbor-query-on-70-million-extremely-high-density-spatial-poi

    - http://stackoverflow.com/questions/3350965/how-can-i-speed-up-this-sql-server-spatial-query


    Regards,
    Christian HL
    Microsoft Online Community Support


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.

    Wednesday, April 3, 2013 6:54 AM
  • Nothing there applies to my problem. I'm already doing a simple intersects operation.

    The best I've been able to get this down to is 4-5 minutes by jacking up the index to HIGH/8192. I don't think it'll get any faster.
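
    For reference, a spatial index tuned that way is declared roughly along these lines in 2008 R2 (a sketch only: the table, column, index name and bounding-box extent are placeholders, not the actual GIS db objects):

    -- Illustrative only: geometry index with an explicit bounding box (the state extent)
    -- and the aggressive grid settings mentioned above (HIGH grids, 8192 cells per object).
    CREATE SPATIAL INDEX six_Meters_Shape
        ON dbo.Meters (Shape)
        USING GEOMETRY_GRID
        WITH (
              BOUNDING_BOX = (xmin = 129.0, ymin = -38.1, xmax = 141.0, ymax = -26.0),  -- placeholder extent
              GRIDS = (LEVEL_1 = HIGH, LEVEL_2 = HIGH, LEVEL_3 = HIGH, LEVEL_4 = HIGH),
              CELLS_PER_OBJECT = 8192
             );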

    But I have another, simpler query that's even slower, and it doesn't appear to be improving at all.

    I have 20 complex polygons that represent water restriction areas across the state. It takes over an hour to find which meter is in which water restriction area. The water restriction polygons are distinct (no overlaps) and sparse (gaps between polygons).

    I tried running Reduce() across the data, which brought the vertex count down from 30-40k for the most complex polygons (which also happen to contain the most meters) to a few thousand, but it still took close to 56 min.
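
    That pre-simplification pass was roughly along these lines (a sketch only; the table name and tolerance value are placeholders, not what actually ran):

    -- Illustrative only: materialise simplified copies of the restriction polygons
    -- before joining them to the meters (the Reduce() tolerance is in the data's units).
    SELECT  r.id,
            r.Shape.Reduce(15) AS Shape
    INTO    dbo.WaterRestrictions_Reduced
    FROM    dbo.WaterRestrictions r;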

    I'll upload some source data later today

    edit: link to sample data http://sdrv.ms/10B94Kg

    Here's a small (~60k points, or 15%) sample of the point data and the polygon that's giving me the most trouble. The indexes are as they appear on the full table and are big enough to cover the full set of data, so shrinking them to suit the sample data isn't a viable solution.


    Adelaide, Australia


    • Edited by jakubk Friday, April 5, 2013 5:42 AM
    Thursday, April 4, 2013 1:27 AM
  • I've downloaded your data and will take a look at your issue.

    - Ed


    Ed Katibah Spatial Ed Blog

    Monday, April 15, 2013 11:36 PM
  • OK, I have your data loaded and have done a quick review (i.e. checking index bounds to make sure no data falls outside, etc.).  Everything looks fine.

    To make sure that we are on the same page, can you provide a narrative, in plain English, of what you are trying to do with your data? For instance, "update all water meter records (points) with the underlying land usage (polygons)".

    Another question: in the land use data, do the polygons overlap (i.e. can you have more than one land use category for a given water meter)?

    - Ed


    Ed Katibah Spatial Ed Blog

    Tuesday, April 16, 2013 12:10 AM
  • Hi Ed,

    Thanks for taking a look.

    Plain English:

    I am trying to create a denormalised table with a number of attributes for each point: the meter is inside water restriction zone R1, it's inside pressure zone ABC3, its land use is marked as Residential|units, it's inside the suburb Walkerville, it's in DemandArea B5, etc. It's a nightly batch operation, and no spatial data is being retained in the new denormalised table.

    The source is ArcGIS, which holds the different layers as polygons in separate tables. So to get the above I need to use spatial functions to join my meter table to 5 layer tables, roughly along the lines sketched below.

    Polygons of the same layer do not overlap (so no, a meter will not exist in 2 'land use' polygons).
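
    In outline, the nightly denormalisation looks something like this (a sketch only; the five layer tables and their attribute columns are placeholders for the real ArcGIS layer tables):

    -- Illustrative only: one denormalised row per meter, one spatial join per layer.
    SELECT  m.id            AS MeterID,
            wr.ZoneCode     AS WaterRestrictionZone,
            pz.ZoneCode     AS PressureZone,
            lu.Description  AS LandUse,
            sb.Name         AS Suburb,
            da.Code         AS DemandArea
    INTO    dbo.MeterAttributes
    FROM    dbo.Meters m
    LEFT JOIN dbo.WaterRestrictions wr ON m.Shape.STIntersects(wr.Shape) = 1
    LEFT JOIN dbo.PressureZones     pz ON m.Shape.STIntersects(pz.Shape) = 1
    LEFT JOIN dbo.LandUse           lu ON m.Shape.STIntersects(lu.Shape) = 1
    LEFT JOIN dbo.Suburbs           sb ON m.Shape.STIntersects(sb.Shape) = 1
    LEFT JOIN dbo.DemandAreas       da ON m.Shape.STIntersects(da.Shape) = 1;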


    Adelaide, Australia

    Tuesday, April 16, 2013 1:46 AM
  • OK, this is along the lines I anticipated. Here are my preliminary results on the data you provided. Let's see how well they jibe with yours:

    Using your query...

    SELECT m.id AS Point_ID, l.id AS Polygon_ID
      FROM dbo.Points m
        LEFT JOIN dbo.Polygon l
          ON m.Shape.STIntersects(l.Shape)=1

    ...I was able to achieve a rate of slightly more than 1,094 rows per second (58 seconds to process all 63,483 rows in the Points table). This was on a modest 4-CPU machine. The query plan generated was parallelized, indicating that the process will scale up with more CPUs (have any 24-CPU boxes hanging around?).

    I also created a query which used the spatial index on the shape column in the Points table, but the performance was over 4 times slower. Why? Because this query has to touch every row in the Points table anyway (a table scan by fiat), so going through the extra layer of an index actually slows the query down.

    I suspect that the only way to get more performance is to scale up using more CPUs.  Is this performance more in line with what you expected? 

    - Ed


    Ed Katibah Spatial Ed Blog


    • Edited by Spatial Ed Tuesday, April 16, 2013 2:10 AM spelling
    • Marked as answer by jakubk Tuesday, April 16, 2013 11:53 PM
    Tuesday, April 16, 2013 2:06 AM
  • Hi Ed,

    Thanks so much for the info; it shows that I've done all I can and the limit is elsewhere.

    Out of curiosity, how did you force the query to use the index? It doesn't let me use index hints (I get an error about not being able to produce a plan). I have to do an inner join between point and polygon and then right join another copy of the point table to get my list of NULLs.

    And what version of SQL Server did you test this on?

    Mine is

    Microsoft SQL Server 2008 R2 (SP1) - 10.50.2811.0 (X64)   Apr  6 2012 01:59:29   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) (Hypervisor)

    I don't think we'll be able to do the scale of processing we need in a timely fashion, so I think we're going to seriously look into pushing the processing into ArcGIS and having it deliver a flat file with the point and all its attributes.

    One last thing: your machine could get the query to run in ~1 min. Are you able to test the following for me?

    I wrote my own T-SQL 'tessellation' function while researching this performance issue, to do what I thought the engine was doing already (it was just a POC to see if it actually sped things up, which would mean it wasn't happening inside the engine). I was able to get a query that takes over 7 min on this same machine to finish in 16 sec.

    Can you deploy them and use them in the original query to see if it reduces your query time as well?

    SELECT p2.id AS Point_ID, l.id AS Polygon_ID
      FROM dbo.Points m --with(index = s2918_idx)
        INNER JOIN (
          -- tessellate each polygon into simpler sub-polygons (threshold: 5000 vertices)
          SELECT x.shape, l.id
            FROM dbo.Polygon l
            CROSS APPLY dbo.tvf_SimplifyPolygon(l.shape, 5000) x
        ) l
          ON m.Shape.STIntersects(l.Shape) = 1
        -- right join back to the full point set to keep meters that fall in no polygon
        RIGHT JOIN dbo.Points p2
          ON m.ID = p2.ID
    --option(recompile)

    The functions (newer, faster versions):

    http://simplifyspatial.hejsolutions.com/2017/08/simplify-spatial-release-v10_30.html

    thanks


    Adelaide, Australia


    • Edited by jakubk Monday, September 4, 2017 1:51 AM
    Tuesday, April 16, 2013 3:09 AM
  • I'm using SQL Server 2012; however, for this exercise there is no difference between this version and SQL Server 2008 R2.

    Here are two queries which use the spatial index (to the detriment of query performance):

    SELECT m.id, l.id
      FROM dbo.Points m, dbo.Polygon l
        WHERE m.shape.STIntersects(l.Shape)=1
     --Results: 60,590 rows processed in 04:07 min:sec 
     --SQL Plan: spatial index used, no parallel plan

      SELECT m.id, l.id 
        FROM dbo.Points m
        CROSS APPLY dbo.Polygon l
           WHERE m.shape.STIntersects(l.Shape)=1
     --Results: 60,590 rows processed in 04:07 min:sec 
     --SQL Plan: spatial index used, no parallel plan

    The query optimizer produces identical plans for both queries, btw.

    Regarding your tessellation... this looks really interesting.  It's getting late in California so I'll have to try it tomorrow.

    - Ed


    Ed Katibah Spatial Ed Blog

    Tuesday, April 16, 2013 3:38 AM
  • OK, so the indexes can only be forced when doing an inner join.

    And to answer your earlier question about whether the ~1 min performance was what I was hoping for/expecting: I was actually hoping it would be sub-10 sec, or at least better than the functions I hacked together, for this small dataset.

    The engine can burn through tens of millions of complex relational data processes in my DW load in 6-7 min, and while I wasn't expecting spatial querying to be that fast, I didn't think there would be such a difference. I understand the complexity of the processing required when doing spatial queries, but I thought there'd be some shortcuts and optimisations built into the engine similar to what's been done over the years to the relational query engine. You don't need to explicitly hold its hand that much anymore to get it to do what you want as fast as you know it can.

    Also interesting that 2012 is comparable to 2008 R2. I thought the spatial aspect of 2012 received an overhaul, or was that more about adding further features and optimising complex queries, with the simple ones (STIntersects) already as fast as they get for now?


    Adelaide, Australia


    • Edited by jakubk Tuesday, April 16, 2013 5:00 AM
    Tuesday, April 16, 2013 4:56 AM
  • After running your functions, I stand by my assertion that, for your purposes, there is no readily apparent difference between the spatial implementations in SQL Server 2008 R2 and SQL Server 2012. What you have created is quite interesting and warrants more internal discussion on performance optimization of spatial queries, especially with complex polygons.

    Using the data and queries that you provided, here are the results.  First, here is the test polygon (31,011 points):

      Note to readers: I can't show the original polygon here since this blog service only allows 2 images per post. ;-(

    The "sweet spot", at least for this particular polygon, proved to be 1,500 points per tessellated (sub) polygon.

    

    Here is how the optimal tessellated polygon appears:

    Interestingly, this tessellation took 5 seconds (out of the 15 seconds for the overall query which incorporates the function which creates this tessellation) and returned 65 sub-polygons. 

    - Ed


    Ed Katibah Spatial Ed Blog

    Tuesday, April 16, 2013 7:23 PM
  • Thanks for checking it out, and for the interest in the functions. I hope it leads to something that can be implemented in the engine itself.

    Reducing the threshold count to 1500 took my query time down to 11 sec too (my 'tessellation' time is about 1-2 sec, so it looks like that part at least isn't CPU bound).

    I think there's further optimisation available here, because using the vertex count wasn't my original plan for testing complexity. I found there wasn't a direct relationship between the count and the time. It was a good approximation, but it needed manual testing to find a good threshold for each specific polygon.

    If you take the original polygon and run Reduce() across it to the point where its vertex count drops to ~5k, it'll only reduce the query time by 5-10%. So the shape of the polygon is the main driver of how long it takes to run. I noticed that the more square-like the polygon looks, the faster it'll run - guessing it's related to how much work is done by the .Filter() step.

    Some of my other water restriction zone polygons have 15-20k points, but they look like squares drawn with a shaky hand, and the join to my point table runs in 20-30 sec.

    So the 'best' test for candidates to 'quarter' would be "how close to a simple shape (square, circle, triangle) is the polygon?", but I couldn't figure out how to do it. I thought of using the ratio of STArea() vs the area of the envelope, but foresaw worse issues with that than counting points; spirals would probably exit too early, and triangles might recurse forever (or until the MAX RECURSION threshold). A rough sketch of that ratio idea is below.
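
    Sketched as a standalone calculation, that ratio idea looks roughly like this (illustrative only; the sample polygon is made up):

    -- Illustrative only: how much of its own envelope a polygon fills.
    -- A value near 1 means the shape nearly fills its bounding box (a 'simple' shape);
    -- long spirals and thin slivers score much lower.
    DECLARE @poly geometry = geometry::STGeomFromText(
        'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 0);

    SELECT @poly.STArea() / @poly.STEnvelope().STArea() AS AreaToEnvelopeRatio;  -- 1.0 for a square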

    Are you aware of any SQL spatial functions or general spatial patterns/paradigms that can be used to approximate the "real" simplicity of a shape?


    Adelaide, Australia

    Wednesday, April 17, 2013 12:33 AM
  • >Are you aware of any SQL spatial functions or general spatial patterns/paradigms that can be used to approximate the "real" simplicity of a shape?

    No, but I've been curious about this for some time. I thought about it for a little bit and came up with a first pass: a ratio based on the length of the polygon edges (of all rings) divided by the polygon area. Let's call this the "polygon ratio". Next, let's compute the ratio of the length of the polygon's envelope divided by the envelope's area and call this the "base ratio". Finally, we complete the calculation by dividing the "polygon ratio" by the "base ratio" to yield the Complexity Index:

    Complexity Index = Polygon Ratio / Base Ratio

    The basic idea here is that the closer the index is to 1, the simpler the polygon shape. For a plain rectangle the two ratios are identical, so the index is exactly 1; the more convoluted the boundary, the larger the polygon ratio grows relative to its envelope's, and the higher the index.

    Here is the function that I created:

    /*
    select ROUND(dbo.fn_PolygonComplexityIndexGeometry(geom), 1) AS [Index], Name_1 AS Name, geom.STNumPoints() AS NumPoints
      from states
        order by dbo.fn_PolygonComplexityIndexGeometry(geom)
    */
    
    IF OBJECT_ID(N'dbo.fn_PolygonComplexityIndexGeometry', N'FN') IS NOT NULL
      BEGIN
        PRINT N'Dropping function: fn_PolygonComplexityIndexGeometry...'
        DROP FUNCTION fn_PolygonComplexityIndexGeometry;
      END;
    GO
    
    --=========================================================================================
    -- Create function
    --=========================================================================================
    PRINT N'Creating function: fn_PolygonComplexityIndexGeometry...';
    GO
    
    CREATE FUNCTION dbo.fn_PolygonComplexityIndexGeometry(@poly geometry)
                                                     
    RETURNS float
    AS
    BEGIN
      -- Polygon Complexity Measure
      DECLARE @base_ratio FLOAT = @poly.STEnvelope().STBoundary().STLength()/@poly.STEnvelope().STArea()
      DECLARE @poly_ratio FLOAT = @poly.STBoundary().STLength()/@poly.STArea()
      DECLARE @index      FLOAT = @poly_ratio/@base_ratio
      -- Return result
      RETURN @index
    END

    For your polygon example, the Complexity Index was 6.9. Here are the results of running this index over the US States polygons:

    Index  State                 Number of Points
    -----  --------------------  ----------------
    1.1    Kansas                514
    1.1    Colorado              327
    1.1    Wyoming               420
    1.2    New Mexico            420
    1.2    North Dakota          2721
    1.2    Utah                  552
    1.3    South Dakota          1079
    1.3    Arizona               1638
    1.3    Iowa                  1834
    1.3    Nebraska              1200
    1.4    Pennsylvania          1535
    1.4    Montana               3452
    1.4    Nevada                1025
    1.5    Indiana               1969
    1.6    Wisconsin             2552
    1.6    Illinois              3059
    1.6    Arkansas              1668
    1.6    Alabama               2473
    1.7    Vermont               1925
    1.8    Missouri              2165
    1.8    Tennessee             1469
    1.9    Oklahoma              1505
    1.9    Michigan              1632
    1.9    New Hampshire         1954
    1.9    Oregon                4845
    2      Connecticut           723
    2      Minnesota             5262
    2.1    Mississippi           3109
    2.1    Idaho                 4373
    2.2    Ohio                  1972
    2.2    District of Columbia  91
    2.2    Kentucky              2292
    2.3    Georgia               5366
    2.9    California            10221
    2.9    New Jersey            2864
    2.9    South Carolina        4839
    3      Delaware              695
    3      West Virginia         4439
    3.4    Texas                 12507
    3.4    New York              3664
    3.7    Washington            9187
    4.2    Rhode Island          731
    5.1    Massachusetts         2309
    5.2    Maine                 8501
    5.8    North Carolina        11322
    6.3    Virginia              9827
    6.4    Louisiana             14282
    11.8   Hawaii                26772
    11.9   Maryland              8927
    12.3   Florida               18381
    26.2   Alaska                178137

    I'm sure that you (or others) can improve on this index but it might be a reasonable first approximation.  What do you think?

    -Ed


    Ed Katibah Spatial Ed Blog







    • Edited by Spatial Ed Wednesday, April 17, 2013 7:17 PM Spelling
    Wednesday, April 17, 2013 5:58 PM
  • That's really cool.

    I plugged your function into the 'simplify' one and used a @ComplexityIndex threshold instead of STNumPoints().

    The performance was pretty similar when compared to that 1500 sweet spot you found, maybe 10% slower than the NumPoints version. I tried it against my real data as well: 500k points + 20-30 polygons. Again, both using 1500 and anything around "3" as the complexity index threshold produced similar results. The sub-polygons returned were mostly similar. The complexity-based approach left some large sub-polygons unchanged (not sure if this made a difference to query time, might check this later) while splitting some of the peninsula bits of the polygon much more. It might've gone below the level-4 index 'cell' size, which might explain the slightly longer query times, but that's just a wild guess.

    Your algorithm is definitely better for low-vertex-count complex polygons.

    A .Reduce(100) version of my polygon will only be quartered once by a 1500 vertex-count threshold, but it made no difference to your algorithm; both shapes were split into 51 sub-polygons. I tried combining both approaches but it didn't really make much difference, so I do suspect it was the weird wiggly peninsulas in my polygon recursing too deeply.

    The DW project here is ramping up, so I won't have as much time to play around with this at work anymore, but I might take a look at home at the individual sub-polygons created by both and see where one is better than the other.


    Adelaide, Australia

    Thursday, April 18, 2013 7:39 AM
  • Well, not as cool as I had hoped. I think this approach still has promise, but your vertex filter is still the champ - by far. I too noticed the "peninsula splitting" effect, especially with the complexity index approach. I suspect that the complexity filter needs to be refined with a vertex count, as this seems to be a dimension that the spatial STIntersects() method is quite sensitive to.
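
    One way such a combined split test might look, sketched as a standalone expression (illustrative only; the thresholds are arbitrary, the sample polygon is made up, and it assumes the fn_PolygonComplexityIndexGeometry function from earlier in this thread is deployed):

    -- Illustrative only: flag a polygon for splitting if it is either too complex
    -- or too vertex-heavy, combining the two filters discussed above.
    DECLARE @poly geometry = geometry::STGeomFromText(
        'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 0);

    SELECT CASE WHEN dbo.fn_PolygonComplexityIndexGeometry(@poly) > 3.0
                  OR @poly.STNumPoints() > 1500
                THEN 1 ELSE 0 END AS ShouldSplit;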

    In any event, I also modified your SimplifyPolygon function to handle generic SRID values and to accommodate the fn_PolygonComplexityIndexGeometry function. I processed 1,892,290 point locations (from the Geonames database) against 51 polygons from the States table referenced in an earlier response on this thread. First, using a "conventional" spatial join query, I observed the following performance: 

    select m.geonameid, l.name
      from dbo.Geonames m, dbo.States_temp l with(index(shape_mmmm_16_sidx))
      where m.shape.STIntersects(l.Shape)=1
     --Results: 1,123,548 rows processed in 02:32:10 hr:min:sec 

    I neglected to perform the RIGHT JOIN on the geonameid field, which would have slowed performance down even further.

    Here are the results (and code pattern) using your tessellation approach with both vertex and complexity filters:

    SELECT P2.geonameid AS GeoNames_ID, l.Name AS State_Name
      FROM dbo.Geonames m --with(index = s2918_idx)
        INNER JOIN (    
                      SELECT x.shape, l.name
                        FROM dbo.States_temp l 
                          CROSS APPLY dbo.tvf_SimplifyPolygon(l.shape,5.5,l.shape.STSrid) x
                          --CROSS APPLY dbo.tvf_SimplifyPolygon(l.shape,1500,l.shape.STSrid) x
                   ) l
          ON m.Shape.STIntersects(l.Shape)=1
        RIGHT JOIN dbo.Geonames p2
          on m.geonameid = p2.geonameid
    --Results: [with polygon complexity filter = 3]   1,892,290 rows in 55:13 min sec
    --         [with polygon complexity filter = 4]   1,892,290 rows in 37:03 min sec
    --         [with polygon complexity filter = 5]   1,892,290 rows in 29:06 min sec
    --         [with polygon complexity filter = 5.5] 1,892,290 rows in 28:45 min sec
    --         [with polygon vertice filter = 1500]   1,892,290 rows in 11:25 min sec

    As you can see, the vertex filter is still king. And what's impressive about your tessellation approach is that it delivers roughly a 13x performance improvement (11:25 vs. 02:32:10) over the simpler, more "conventional" spatial query pattern.

    Thanks for the post and your innovative approach to the problem.  We can take this up again when you have some spare cycles.

    Thanks,

    - Ed



    Ed Katibah Spatial Ed Blog

    Thursday, April 18, 2013 5:24 PM