Odpovědět data reader slow when performing spatial query

  • 30 เมษายน 2555 14:40
     
      มีโค้ด

    Good afternoon,

    I have an odd issue with my data : I'm trying to perform a simple bounding box query, but fetching geometry data is longer than further processing the data.

    Here is my code :

    using (var conn = new SqlConnection("myconnectionstring"))
     using (var cmd = new SqlCommand("SELECT geom FROM " + table, conn))
      {
        cmd.CommandType = System.Data.CommandType.Text;
        using (var rdr = cmd.ExecuteReader())
        {
           DateTime start = DateTime.Now; // start timer
           while (rdr.Read())
           { SqlGeometry geom = (SqlGeometry)rdr[0]; }
    
           Debug.WriteLine("loaded in: " + (DateTime.Now - start).TotalMilliseconds); // around 50ms with 36000 records
        }
      }

    The fetch part (the while block) runs for 50ms.

    Now, when I add a WHERE clause that includes everything and returns the same records (like WHERE geom.STIntersects(@TheWholeWorld) = 1), the same block takes 500ms (ten times slower).

    using (var conn = new SqlConnection("myconnectionstring"))
     using (var cmd = new SqlCommand("SELECT geom FROM " + table + " WHERE geom.STIntersects(@worldBBox) = 1", conn))
      {
        cmd.CommandType = System.Data.CommandType.Text;
        using (var rdr = cmd.ExecuteReader())
        {
           DateTime start = DateTime.Now; // start timer
           while (rdr.Read())
           { SqlGeometry geom = (SqlGeometry)rdr[0]; }
    
           Debug.WriteLine("loaded in: " + (DateTime.Now - start).TotalMilliseconds); // now 500ms (!) with the same 36000 records
        }
      }

    Do you know why this is happening ?

    The aim is generating image tiles on the fly the .net way, and with this issue, on tiles with a lot of geometries, 75% of processing time is passed fetching data.




    • แก้ไขโดย aixchile 30 เมษายน 2555 14:42
    •  

ตอบทั้งหมด

  • 30 เมษายน 2555 19:16
    ผู้ตอบ
     
     

    Hi, Can you post the execution plans for both queries? (i.e. with and without the WHERE clause on the geom column)


    twitter: @alastaira blog: http://alastaira.wordpress.com/

  • 30 เมษายน 2555 20:24
     
     

    Hi, Can you post the execution plans for both queries? (i.e. with and without the WHERE clause on the geom column)


    twitter: @alastaira blog: http://alastaira.wordpress.com/

    Hi Alastair. Sure I can (which format ? sqlplan, xml, screenshot).

    But my problem is not the query execution time (ie: ExecuteReader), but the read (SqlDataReader.GetValue()) operation. The execution plan has something to do with this ?


    Here are the two files (in sqlplan.txt format)

    • แก้ไขโดย aixchile 30 เมษายน 2555 20:39
    •  
  • 1 พฤษภาคม 2555 9:19
    ผู้ตอบ
     
     คำตอบ

    (Firstly - a disclaimer - I'm no expert in ADO.Net!)

    SqlDataReader streams results from a query as it is executed by the DB (http://msdn.microsoft.com/en-us/library/ms254931.aspx) - it doesn't wait to get an entire resultset before proceeding. So, in the code above, you're setting the start variable when the query has been issued to SQL Server and commenced execution, not after waiting for the query to finish executing. The rdr.Read() loop will iterate until all the results have been processed, and the end variable is set after that.

    Therefore,  the elapsed time that you're calculating above is the sum of the execution time of the query + the time taken to loop over the results. Seeing as there's no reason to believe that it should take longer to loop over the same set of 36,000 records in either query, it follows that we should look to see whether the query execution varies between the two examples. Sure enough, you can see that the plans are very different.

    You might think that, because the "WithWHEREClause" query has a condition that can make use of a spatial index, it'll be faster, but this isn't always the case. Specifically, using a spatial index typically becomes less efficient as the size of the query window (i.e. the area in which you're looking for results) gets larger. In your case, the query window is the whole world - POLYGON ((-180 90, -180 -90, 180 -90, 180 90, -180 90)) - which is pretty much the least specific query window it's possible to be!

    To demonstrate, you might be interested to see the following diagram taken from the "Indexing" chapter of "Pro Spatial with SQL Server 2012", which shows how the execution time for a simple query that selects points that intersect a polygon varies with the number of records in the base table and the size of the query window relative to the overall extent of the data.

    Notice how execution time increases with both number of rows in table and also query window size relative to the bounding box of the index?

    So, given your described use case of generating image tiles on the fly, are you really going to be supplying an area of interest covering the whole world, or did you just put that in as a dummy value to test your code? If you are more likely to be generating 256px x 256px image tiles equivalent to, say, zoom level 16 of Bing Maps, then I'd suggest you test your code again using a more appropriate bounding box, such as that covering a 1km x 1km square, maybe.

    If you really want to retrieve every record from a table of spatial data, best to not put in any WHERE condition at all (even with a spatial index) - you'll always get better performance from a straightforward table scan.


    twitter: @alastaira blog: http://alastaira.wordpress.com/

    • ทำเครื่องหมายเป็นคำตอบโดย aixchile 3 พฤษภาคม 2555 6:29
    •  
  • 1 พฤษภาคม 2555 16:26
     
     

    Thank you Alastair for your quick, clear and detailed reply.

    You answered my question about why it is slow to fetch the same results : I thought that when the ExecuteReader method was executed, all data was loaded. You showed me that it was not the case and it explains the differences of timing between my two queries.

    About the tile engine I'm building it is true that it is not suitable for analysis to display the whole world at one on features the size of a city.

    But when I see sites as giscloud.com that renders such tiles so fast, it is now clear that they don't use the spatial query strategy on the fly, but that they have a Btree/Rtree index and all the geometries in memory, and then perform spatial queries in memory.

    I have tested this kind of strategy, simplified :

    1. I load all geometries in memory with their Id (identity column) and store them in a dictionnary (key : ID, value : geom)
    2. For a given tile I use a "standard" spatial query, but return only the Ids
    3. Then by code I retrieve the corresponding geometries

    The step 2 is 6x faster when returning Ids rather than geometries.

    Would you consider another approach for doing this ?

    Thank you

    Xavier

  • 1 พฤษภาคม 2555 16:45
    ผู้ตอบ
     
     

    I've examined the giscloud.com website in the past because they do some clever HTML5 optimisations in their rendering engine to plot large amounts of vector data in a much quicker and more interactive way than you'd typically get with Bing Maps/Google Maps. However, I'm not at all familiar with what they do in terms of data storage/retrieval - are you suggesting all the geometries are held in memory on the server or on the client?

    You should definitely look at creating a tile cache so that the only tiles that get rendered are either those that have never been requested before, or those that have expired (how dynamic is your data?) - all others can be retrieved almost instantly from the cache with no need to hit the database or rendering toolchain. I thought I'd written a post about that but, looking at my own blog now, I can't find it! I'll post back here if I find it...


    twitter: @alastaira blog: http://alastaira.wordpress.com/

  • 1 พฤษภาคม 2555 19:26
     
     

    I'm suggesting all the geometries are held im memory on the server. It cannot be something else, as performance for generating a tile with a thousand (even millions) features is really, really fast. They have generated a home made GeoJSON where features which area is smaller than a pixel are returned like ... pixels. Others implementations like geometry clipped to tile bounds and use of HTML5 canvas helps getting and rendering data so fast.

    Actually, with this discussion and a few tests on giscloud, I think they load dataset in memory because hitting the database for the firsts tile queries could not be so fast. I've also read on few posts that their are not caching rendered tiles.

    In my case, I would setup a tile cache for static data (I plan testing NoSQL dbs for this to see the performance). As for rendering 'dynamic' data (choroplet maps, tiles the user can style (WMS mode)), I consider caching features in a lightweight format (GeoJSON, or Key/Value dictionaries ).

    By the way, I will buy your book "Pro Spatial with SQL 2012" as soon as it will be available. And when I will find a good solution for my problem, I'll post back.

    Thank you

    Xavier


    • แก้ไขโดย aixchile 1 พฤษภาคม 2555 19:27
    •