Slow query to find geography instances close to each other

Answered Slow query to find geography instances close to each other

  • Thursday, July 12, 2012 2:15 PM
     
     

    Hi All,


    I'm running 2008 R2, and I have a table containing 150,000 locations spread pretty much all over the inhabitable parts of the globe. They are stored as geography multipoint instances, and most of them contain 1 or 2 points, with some containing maybe 10 up to 40. I'm trying to run a query that would return every pair of locations within 25 kilometers to each other. Something like

    Select p1.Id, p2.Id, p1.Location.STDistance(p2.Location) from Places p1 join Places p2 on p1.id>p2.ID where p1.Location.STDistance(p2.Location)<25000. This should return somewhere around 30,000 results. The issue is that is that performance of this query is horrific (as in taking hours to complete).

    This is what I tried (and failed) to make it faster

    1. Created a spatial index on the Location and experimented with it's parameters. The index is indeed used but seems to be very inefficient.

    2. Also tried to create a 25 Km buffer around the location (buf25K), indexed it and re-wrote the query predicate like  p1.Location.Query(p2.Buf25K), this was equally slow too and the only index used is still the one on Location.

    This is what I tried and works great, except it doesn't scale with the number of places (exponential), but for 150K places runs in seconds:

    I created a CLR routine, loaded all Places in memory, and pretty much computed distances between all possible combinations retaining the ones that matched the criteria. In the CLR routine I exploded the Location columns in sets of points and computed distances between them using some approximate (but good enough) algorithm like Haversine. As I said, this worked in a few seconds, but as it is comparing every possible combination of places it will grow exponentially with the size of data (I need to make this work on a larger data set, tens of millions).

    I was looking for ideas to improve this and my preference is to make use of some sort of spatial indexing.

    much ppreciated

All Replies

  • Thursday, July 12, 2012 5:05 PM
     
     
    I would think the query isn’t going to be exactly subsecond, because you are doing the equivalent of a nearest neighbor query 150,000 times.  You could also try a variation of this method: http://blogs.msdn.com/isaac/archive/2008/10/23/nearest-neighbors.aspx.
     
    In addition, although I see you’re on SQL Server 2008 R2, SQL Server 2012 has a specific optimization for the nearest neighbor query. You might try experimenting with 2012 on a test system and see how much it helps.
     
    Hope this helps,
    Bob
  • Thursday, July 12, 2012 6:14 PM
     
     

    Thanks Bob,

    I did actually try something similar, but it doesn't seem to help too much. Haven't tried it in Denali though

  • Friday, July 13, 2012 9:55 AM
    Answerer
     
     Answered

    It's interesting to see that you already tried and found option 2.) made no difference (precomputing a buffer around the MultiPoints and then using STIntersects on the buffered geometry), because that's what I was going to suggest :(

    Did you STBuffer() or BufferWithTolerance() ?- can you post the results of sp_help_spatial_geography_index for a typical query sample against the Buf25k column?


    twitter: @alastaira blog: http://alastaira.wordpress.com/

  • Friday, July 13, 2012 2:29 PM
     
     

    Thanks,

    I am going to do that sometime next week when hopefully I will be able to get a set of more reliable data. The data I'm currently working on is scrambled (as in really scrambled, not just translated north or east, so I suspect this might be impacting my performance tests (for instance I just realized that a multipoint that  fits in a 3Km buffer might end up spanning half the globe in the scrambled data) . Later next week I might be able to get my hands on a better sample, I will retry and post the statistics then.