locked
Widely varying response time on spatial query RRS feed

  • Question

  • Hi,

    I have a spatial query looking like this...

       select @account_cluster_id = b.account_cluster_id
             ,@cluster_geog = b.cluster_geog
         from util.AccountCluster b with (index(SPATIAL_AccountCluster_BufferedCentroid), readpast)
              join
              util.EventGeography a with (readpast)
                 on a.Account = b.Account
                    and (
                            a.event_geog.STIntersects(b.buffered_centroid_geog) = 1
    --or a.event_geog.STIntersects(b.cluster_geog) = 1 ) where a.ID = @id and a.invalid_geog is null -- (i.e. not invalid) ;

    DDL is as follows...

    CREATE TABLE [util].[AccountCluster]( [account_cluster_id] [int] NOT NULL, [Account] [int] NOT NULL, [buffered_centroid_geog] [geography] NOT NULL,
    [cluster_geog] geography not null, CONSTRAINT [PK_AccountCluster] PRIMARY KEY CLUSTERED ([account_cluster_id] ASC));


    ... and ....

    CREATE TABLE [util].[EventGeography](
    	[ID] [int] NOT NULL,
    	[event_geog] [geography] NULL,
    	[account_cluster_id] [int] NULL,
     CONSTRAINT [PK_LiftGeography] PRIMARY KEY CLUSTERED ([ID] ASC)
    );
    CREATE SPATIAL INDEX [SPATIAL_AccountCluster_BufferedCentroid] ON [util].[AccountCluster]
    (
    	[buffered_centroid_geog]
    ) USING  GEOGRAPHY_AUTO_GRID 
    WITH (
    CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

    It is part of a procedure that loops through events and check to see if the event is in an account cluster or not. If not, create a new cluster equal to the centroid buffer (which happens to be 100m radius) else update the cluster. It takes about 8 minutes to loop through ~50 events. SOme are really fast (sub-second). Others up to 1 minutes. The events that take 1 minute in the procedure loop take sub-second time when I run the same query in isolation.

    At the moment, there is only one row in the AccountCluster table. There are approx 1 million in the EventGeography but the @ID restricts this to less that 100 rows at a time.

    I have a second OR predicate in the where clause but that is commented out at the moment. The Events are all in the vicinity of the cluster. The cluster is updated to incorporate the new point each loop. All these operations are sub-second.

    Could somebody throw some light on this please.

    Regards

    Liam

    Thursday, November 6, 2014 5:30 PM

Answers

  • Just realize that if you do a spatial join (based on STIntersects or other predicate that can use a spatial index) against two huge tables, only the spatial index on one table will be used. The other table will be tessellated on-the-fly to match. It's the first step (top-rightmost) in the query plan. This takes time and memory.

    Cheers, Bob

    Wednesday, November 12, 2014 9:33 PM

All replies

  • Hi Liam,

    For starters why are you using a spatial index on the AccountCluster table that has one row and not on the EventGeography table that has 1 million rows? In addition, have you looked at the query plan? Finally, have you thought of restructuring your process so that it isn't doing loops? SQL isn't very fast at loops. Perhaps, using temp tables to do the inserts or updates in batches after using one or more spatial queries to choose the appropriate keys to insert or update?

    Hope this helps, Bob

    • Proposed as answer by Bob Beauchemin Thursday, December 4, 2014 8:42 PM
    Thursday, November 6, 2014 6:57 PM
  • Hi Bob,

    I appreciate your reply. I would like to isolate this problem and put together a set of scripts but I haven't had a chance to get near it yet.

    Briefly, yes, I've looked at the query plan and I thought it was ok (famous last words!!). At the moment there is one row in AccountCluster but I anticipate that there will be 100k in normal use. The AccountCluster geography is a polygon and the EventGeography is a point (of which there are 1 million points). Curiously, I have another approach working which I imagined would be slower (rebuilds the cluster per account for all events each time a new event needs to be added to the cluster) but is in fact 20 to 40 times faster! That's set-based. It takes every event point in the account, buffers by 100m, does a cross product via a TVF and finds overalapping buffers. By definition all points in overlapping clusters are in the same general cluster.

    I'd normally avoid cursor loops like the plague but in this case it might make sense. What I mean is that is doesn't make sense to recalculate the clusters every time a new event arrives.

    In any case with my faster approach, I would still have to do some post-processing to aggregate overlapping cluster subsets per account and that will have to be cursor based because I have no way to partition the overlapping cluster subsets from the non-overlapping clusters in each account. Not sure if this makes sense but let me get back to you, next week, hopefully when I can get some more time at this and show you exactly what is happening.

    Regards

    Liam


    • Edited by lcaffrey Wednesday, November 12, 2014 2:39 PM clarification
    Wednesday, November 12, 2014 2:34 PM
  • Just realize that if you do a spatial join (based on STIntersects or other predicate that can use a spatial index) against two huge tables, only the spatial index on one table will be used. The other table will be tessellated on-the-fly to match. It's the first step (top-rightmost) in the query plan. This takes time and memory.

    Cheers, Bob

    Wednesday, November 12, 2014 9:33 PM