Thursday, July 12, 2012 2:15 PM
I'm running 2008 R2, and I have a table containing 150,000 locations spread pretty much all over the inhabitable parts of the globe. They are stored as geography multipoint instances, and most of them contain 1 or 2 points, with some containing maybe 10 up to 40. I'm trying to run a query that would return every pair of locations within 25 kilometers to each other. Something like
Select p1.Id, p2.Id, p1.Location.STDistance(p2.Location) from Places p1 join Places p2 on p1.id>p2.ID where p1.Location.STDistance(p2.Location)<25000. This should return somewhere around 30,000 results. The issue is that is that performance of this query is horrific (as in taking hours to complete).
This is what I tried (and failed) to make it faster
1. Created a spatial index on the Location and experimented with it's parameters. The index is indeed used but seems to be very inefficient.
2. Also tried to create a 25 Km buffer around the location (buf25K), indexed it and re-wrote the query predicate like p1.Location.Query(p2.Buf25K), this was equally slow too and the only index used is still the one on Location.
This is what I tried and works great, except it doesn't scale with the number of places (exponential), but for 150K places runs in seconds:
I created a CLR routine, loaded all Places in memory, and pretty much computed distances between all possible combinations retaining the ones that matched the criteria. In the CLR routine I exploded the Location columns in sets of points and computed distances between them using some approximate (but good enough) algorithm like Haversine. As I said, this worked in a few seconds, but as it is comparing every possible combination of places it will grow exponentially with the size of data (I need to make this work on a larger data set, tens of millions).
I was looking for ideas to improve this and my preference is to make use of some sort of spatial indexing.
Thursday, July 12, 2012 5:05 PMI would think the query isn’t going to be exactly subsecond, because you are doing the equivalent of a nearest neighbor query 150,000 times. You could also try a variation of this method: http://blogs.msdn.com/isaac/archive/2008/10/23/nearest-neighbors.aspx.In addition, although I see you’re on SQL Server 2008 R2, SQL Server 2012 has a specific optimization for the nearest neighbor query. You might try experimenting with 2012 on a test system and see how much it helps.Hope this helps,Bob
Thursday, July 12, 2012 6:14 PM
I did actually try something similar, but it doesn't seem to help too much. Haven't tried it in Denali though
Friday, July 13, 2012 9:55 AMAnswerer
It's interesting to see that you already tried and found option 2.) made no difference (precomputing a buffer around the MultiPoints and then using STIntersects on the buffered geometry), because that's what I was going to suggest :(
Did you STBuffer() or BufferWithTolerance() ?- can you post the results of sp_help_spatial_geography_index for a typical query sample against the Buf25k column?
twitter: @alastaira blog: http://alastaira.wordpress.com/
- Marked As Answer by amber zhangModerator Thursday, July 19, 2012 2:58 AM
Friday, July 13, 2012 2:29 PM
I am going to do that sometime next week when hopefully I will be able to get a set of more reliable data. The data I'm currently working on is scrambled (as in really scrambled, not just translated north or east, so I suspect this might be impacting my performance tests (for instance I just realized that a multipoint that fits in a 3Km buffer might end up spanning half the globe in the scrambled data) . Later next week I might be able to get my hands on a better sample, I will retry and post the statistics then.