locked
Shocking performance for parsing geometries in SQL RRS feed

  • Question

  •  

    I've been investigating whether its best to interact with SQL using WKT, WKB or the base binary udt.

     

    What I've found is that WKT and WKB perform better than the base UDT until about 40 or 50 points. After that the performance of WKT and WKB drops like a stone. Whereas the base udt performance is only reduces slighty.

     

    For 100 points the difference is double 10ms compared to 5ms, for 200 points its 40ms compared to 6 ms, 400 points its 170ms compared to 6.5ms

     

    It seems there is an issue with the parsing somewhere.

     

    So jumping to conclusions I thought it was a SQL issue. However what it seems is that the issue isn't specifc to SQL. Doing the same parsing on the client has the same performance almost like the performance is inversely proportional to the suqare of the number of points.

     

    All this was done with byte order 1 which may be part of the problem due to having to flip all the bytes round.

     

    Monday, January 7, 2008 8:16 PM

Answers

  • OK hands up. My test code was just a random set of points in a line.

     

    bw.Write((Byte)1);

    bw.Write((UInt32)2);//reverse (GetBytes((UInt32)2))); //linestring

    bw.Write((UInt32)size);//reverse(GetBytes((UInt32)size)));

    double posX = 0;

    double posY = 0;

    for (int i = 0; i < size; i++)

    {

    posX = Math.Round(1000 * pointValue.NextDouble(), 4);

    posY = Math.Round(1000 * pointValue.NextDouble(), 4);

    bw.Write(posX);//reverse (GetBytes(posX)));

    bw.Write(posY);//reverse (GetBytes(posY)));

    }

    SqlBytes complexWKB = new SqlBytes(ms);

    SqlGeometry complex = SqlGeometry.STGeomFromWKB(complexWKB , 0);

     

    Changing the posX and posY to be incrementing values the performance wasn't as shocking.

     

    Not sure if it highlights an issue in the validation code with very complex (but unlikely structures).

    Tuesday, January 8, 2008 1:48 AM
  • Ok, for some reason I was fixated on using the udt binary explicitly rather than using the SqlGeometry instance directly through SqlClient.  This is actually a somewhat more fair comparison as it comparing apples to apples with a binary format on the client to an instance on the server, which can be done like so:

     

    Code Block

    using (SqlCommand cmd = new SqlCommand("p_udtbinary", con))

    {

    SqlGeometry geom = SqlGeometry.Parse("LINESTRING (0 0, 1 1, 2 2, 3 3, 4 4, 5 5, 6 6, 7 7, 8 8, 9 9)");

    BinaryWriter writer = new BinaryWriter(ms);

    geom.Write(writer);

    SqlBytes udtBinary = new SqlBytes(ms);

    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@GeoKey", SqlDbType.UniqueIdentifier));

    cmd.Parameters.Add(new SqlParameter("@GeoData", SqlDbType.VarBinary, -1));

    sw.Reset();

    sw.Start();

    for (int loop = 0; loop < loopCount; loop++)

    {

    cmd.Parameters[0].Value = Guid.NewGuid();

    cmd.Parameters[1].Value = udtBinary;

    cmd.ExecuteNonQuery();

    }

    sw.Stop();

    Console.WriteLine("Udt Binary: " + sw.ElapsedMilliseconds);

    }

     

     

     

    With a convert(geometry, @GeoData) step in the stored procedure.  Doing so yields the expected results as being faster than parsing WKB. 

     

    Regardless, I do see that there is definitely a performance overhead when using SqlClient to pass the geom instance directly that is especially significant for smaller instances (before the parsing/validation costs begin to dominate).  I'm following up on this.
    Wednesday, January 9, 2008 8:55 PM

All replies

  • I assume your geometry instance is a LineString or a Polygon?

     

    The big difference in parsing Wkt/Wkb/Gml is that we need to check the instance for validity when it is input, which is not that cheap of an operation on LineStrings or Polygons (MultiPoints are always valid, which is why you should not see as big of a hit for them). 

     

    Since the raw udt binary format is internal, we assume that it was produced by us so the validity bit can be trusted to be correct and the IsValid check can be skipped.  By writing the raw binary format yourself, you skip the validity check which obviously saves time, but you now run the risk of inputting invalid data which can lead to wrong results. 

     

     

     

    Monday, January 7, 2008 8:54 PM
  • Decomposing the WKB was not too hard, but decomposing the structure of the udt i.e. the serialisation of the GeometryData is not something I'll be spending any time on.

     

    I suppose the question comes as to why you need to validated the geometry. Is it just so that STIsValid is pre calculated.

     

    Shame this propety calculation can't be delayed or done in a background thread aka iFTS

     

    Have three states, true false and not calculated. If not calculated then when ever it is queried it is then calculated.

     

     

    Monday, January 7, 2008 9:01 PM
  • I think I misunderstood your scenario - what did you mean by "interact with SQL using WKT, WKB or the base binary udt"? I assumed you meant inputting data, but if you are not doing so using the internal binary format then I am not actually sure what you are asking about. 

     

    Unfortunately, delaying calculation of validity is not a good option because it is required to known whether the object is valid before performing most operations on it.  If you called a function on a geometry with this not-calculated state, we would need to calculate it.  However, since functions cannot have side-effects, we could not persist this value back to the instance but would need to call it for every function (until you ran the hypothetical 'UpdateValidity' mutating method). 

     

    We could accept the geometry but not let you call methods on it at all until you ran 'UpdateValidity', however I think this doesn't have many benefits over checking validity and time of input and is much more complicated to program against.

    Monday, January 7, 2008 9:16 PM
  • I think you did understand me. You talked about not doing the validation on the binary dat you generate. One could pass a binary value and then use the convert(sys.geometry, @value) which would achieve the no validation.

     

    However trying to build the binary format in my view is too complex.

     

    I guess the questions as to whether lines/polygons > 50 points is common. Freehand shapes can easily reach > 100 points, and I would have thought most geographical shapes (boundaries) would be in excess of 100 points.

     

    How expensive is the validation comapred to the calculation of the area, or one of the relationships.

     

    If the performance was an issue one would have to resort to using service broker I guess.

    Monday, January 7, 2008 9:30 PM
  • I also found that large types resulted in out of memory exceptions when you reached a few thousand points. This was with 2gb. I had to up the system to 3.5Gb to get it to parse. It appeared that a 3500 point line required 1.3Gb to parse it. seemed quite excessive to me

     

    Monday, January 7, 2008 10:15 PM

  • I am curious how complex of polygons people have been using for test purposes?  Has anyone used any of the hi-resolution boundary files for things such as postal code boundaries, states, waterways, etc?  I would expect many of them to have many thousand points.  It would be useful to have more information such as shown in the above post by Simon.

    Would it be fair to ask for some generic guidelines similar to the following:  (note, I am just making up these values for illustration)
    < 100 points - no problem
    < 1,000 points - should work, may have perfomance issues, monitor memory usage
    < 10,000 points - might work, expect problems
    > 10,000 points - don't bother trying, this will create problems.
    Monday, January 7, 2008 11:28 PM
  • I'm not sure why Simon is running into problems with Polygons as small as 3500 points.  We have many instances of geometries that run into hundreds of thousands of points that are usable, even on x86 machines which severely limit the Virtual Address Space available for the clr-based operations (such as parsing) used by Spatial (see this post for more info).

     

    The Jump Project has some good examples of large Polygons that are useful for testing: http://www.jump-project.org/project.php?PID=JTS&SID=DEMO

     

    On my slightly dated P4 x86 machine with 2 GB of ram, I can parse and validate Polygon D (163,122 points) in 5 seconds and Polygon E (598,093 points) in 20 seconds. 

     

    One thing you should definitely do is use STGeomFromText rather than Parse for large instances as it takes in SqlChars while Parse uses a string.  SqlChars supports streaming while string does not, meaning it will require a contiguous chunk of memory for the entire WKT string.  For Polygon E, this is 27 MB which due to memory fragmentation and the restricted VAS may not exist on many x86 servers after they have been in use.

    Tuesday, January 8, 2008 12:31 AM
  • OK hands up. My test code was just a random set of points in a line.

     

    bw.Write((Byte)1);

    bw.Write((UInt32)2);//reverse (GetBytes((UInt32)2))); //linestring

    bw.Write((UInt32)size);//reverse(GetBytes((UInt32)size)));

    double posX = 0;

    double posY = 0;

    for (int i = 0; i < size; i++)

    {

    posX = Math.Round(1000 * pointValue.NextDouble(), 4);

    posY = Math.Round(1000 * pointValue.NextDouble(), 4);

    bw.Write(posX);//reverse (GetBytes(posX)));

    bw.Write(posY);//reverse (GetBytes(posY)));

    }

    SqlBytes complexWKB = new SqlBytes(ms);

    SqlGeometry complex = SqlGeometry.STGeomFromWKB(complexWKB , 0);

     

    Changing the posX and posY to be incrementing values the performance wasn't as shocking.

     

    Not sure if it highlights an issue in the validation code with very complex (but unlikely structures).

    Tuesday, January 8, 2008 1:48 AM
  • Ah, thanks Simon, this makes a lot more sense.  This is to be expected due to the way the algorithm works as a large number of self-intersections will make the validation process much more expensive as every intersection point needs to be calculated as well.

     

    Tuesday, January 8, 2008 5:27 AM
  • How are you testing this? Is this by passing the instance as a parameter through SqlClient, or something else?

     

    I am not able to reproduce your results - using the udt binary data itself should always be faster than parsing Wkb.

     

    Wednesday, January 9, 2008 12:42 AM
  • using (SqlCommand cmd = new SqlCommand("SpatialLab.dbo.up_Geo_InsertFromText", con))

    {

    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@GeoKey", SqlDbType.UniqueIdentifier));

    cmd.Parameters.Add(new SqlParameter("@GeoData", SqlDbType.VarChar, 800000));

    cmd.CommandTimeout = 6000;

    sw.Reset();

    sw.Start();

    for (int loop = 0; loop < loopCount; loop++)

    {

    cmd.Parameters[0].Value = Guid.NewGuid();

    cmd.Parameters[1].Value = simpleText;

    cmd.ExecuteNonQuery();

    //SqlGeometry g = SqlGeometry.Parse(simpleText);

    }

    sw.Stop();

    times[1] = sw.ElapsedMilliseconds ;// loopCount;

    }

    using (SqlCommand cmd = new SqlCommand("SpatialLab.dbo.up_Geo_InsertFromWKB", con))

    {

    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@GeoKey", SqlDbType.UniqueIdentifier));

    cmd.Parameters.Add(new SqlParameter("@GeoData", SqlDbType.VarBinary, 800000));

    cmd.CommandTimeout = 6000;

    sw.Reset();

    sw.Start();

    for (int loop = 0; loop < loopCount; loop++)

    {

    cmd.Parameters[0].Value = Guid.NewGuid();

    cmd.Parameters[1].Value = simpleWKB;

    cmd.ExecuteNonQuery();

    // SqlGeometry g = SqlGeometry.STGeomFromWKB(simpleWKB, 0);

    }

    sw.Stop();

    times[2] = sw.ElapsedMilliseconds ;// loopCount;

    }

    using (SqlCommand cmd = new SqlCommand("SpatialLab.dbo.up_Geo_Insert", con))

    {

    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@GeoKey", SqlDbType.UniqueIdentifier));

    cmd.Parameters.Add(new SqlParameter("@GeoData", SqlDbType.Udt));

    cmd.CommandTimeout = 6000;

    cmd.Parameters[1].UdtTypeName = "Geometry";

    sw.Reset();

    sw.Start();

    for (int loop = 0; loop < loopCount; loop++)

    {

    cmd.Parameters[0].Value = Guid.NewGuid();

    cmd.Parameters[1].Value = simple;

    cmd.ExecuteNonQuery();

    }

    sw.Stop();

    times[3] = sw.ElapsedMilliseconds ;// loopCount;

    }

     

     

    alter procedure up_Geo_InsertFromText

    @GeoKey uniqueidentifier

    ,@GeoData varchar(max)

    as

    insert into Geo (GeoKey,GeoData)

    values (@GeoKey,Geometry::STGeomFromText(@GeoData,0))

    go

    alter procedure up_Geo_InsertFromWKB

    @GeoKey uniqueidentifier

    ,@GeoData varbinary(max)

    as

    insert into Geo (GeoKey,GeoData)

    values (@GeoKey,Geometry::STGeomFromWKB(@GeoData,0))

    go

    create procedure up_Geo_Insert

    @GeoKey uniqueidentifier

    ,@GeoData Geometry

    as

    insert into Geo (GeoKey,GeoData)

    values (@GeoKey,@GeoData)

    go

    Wednesday, January 9, 2008 12:47 AM
  • This code generates the shapes

    private void SaveTODB_Click(object sender, RoutedEventArgs e)

    {

    Stopwatch sw = new Stopwatch();

    SqlGeometry simple = SqlGeometry.Parse ("POINT (10.123 10.123)");

    string simpleText = simple.ToString();

    SqlBytes simpleWKB = simple.STAsBinary();

    //Make the geometry valid if it isn't

    Random pointValue = new Random();

    List<long[]> times = new List<long[]>();

    for (int size = 2; size < 1000; size += (int)(Math.Pow(10, Math.Ceiling(Math.Log10(size + 1))) / 10))

    {

    long[] timeBits = new long[4];

    timeBits[0] = size;

    MemoryStream ms = new MemoryStream();

    BinaryWriter bw = new BinaryWriter(ms);

    //we need to write some geometry information before the points.

    //the byte order, the geometry type and the number of points.

    bw.Write((Byte)1);

    bw.Write((UInt32)2);//reverse (GetBytes((UInt32)2))); //linestring

    bw.Write((UInt32)size);//reverse(GetBytes((UInt32)size)));

    double posX = 500;

    double posY = 500;

     

    for (int i = 0; i < size; i++)

    {

    posX +=Math.Round(10 * pointValue.NextDouble(), 4) ;//*(pointValue.Next(1, 3) == 1 ? 1 : -1);

    posY += Math.Round(10 * pointValue.NextDouble(), 4);//* (pointValue.Next(1, 3) == 1 ? 1 : -1);

    bw.Write(posX);//reverse (GetBytes(posX)));

    bw.Write(posY);//reverse (GetBytes(posY)));

    }

    SqlBytes complexWKB = new SqlBytes(ms);

    SqlGeometry complex = SqlGeometry.STGeomFromWKB(complexWKB , 0);

    Geoms.Add(complex);

    if (!complex.STIsValid()) complex = complex.MakeValid();

    string complexText = complex.ToString();

    // SqlBytes complexWKB = complex.STAsBinary();

    PerftestDBInteraction( timeBits, complex, complexText, complexWKB);

    bw.Close();

    ms.Close();

    times.Add(timeBits);

    }

    StylusPointCollection[] lines = new StylusPointCollection[3];

    lines[0] = new StylusPointCollection();

    lines [1] = new StylusPointCollection ();

    lines [2] = new StylusPointCollection ();

    for (int i = 0; i < times.Count; i++)

    {

    long[] timeBits = timesIdea;

    lines[0].Add(new StylusPoint(timeBits[0], timeBits[1]));

    lines[1].Add(new StylusPoint(timeBits[0], timeBits[2]));

    lines[2].Add(new StylusPoint(timeBits[0], timeBits[3]));

    }

    Stroke l1 = new Stroke(lines[0]);

    Stroke l2 = new Stroke(lines[1]);

    Stroke l3 = new Stroke(lines[2]);

    l1.DrawingAttributes.Color = Colors.Red;

    l2.DrawingAttributes.Color = Colors.Blue;

    l3.DrawingAttributes.Color = Colors.Green;

    DrawingPad.Strokes.Add(l1);

    DrawingPad.Strokes.Add(l2);

    DrawingPad.Strokes.Add(l3);

    Geoms.Add(ConvertStrokeToGeometry(l1));

    Geoms.Add(ConvertStrokeToGeometry(l2));

    Geoms.Add(ConvertStrokeToGeometry(l3));

    }

    Wednesday, January 9, 2008 12:54 AM
  • Ok, for some reason I was fixated on using the udt binary explicitly rather than using the SqlGeometry instance directly through SqlClient.  This is actually a somewhat more fair comparison as it comparing apples to apples with a binary format on the client to an instance on the server, which can be done like so:

     

    Code Block

    using (SqlCommand cmd = new SqlCommand("p_udtbinary", con))

    {

    SqlGeometry geom = SqlGeometry.Parse("LINESTRING (0 0, 1 1, 2 2, 3 3, 4 4, 5 5, 6 6, 7 7, 8 8, 9 9)");

    BinaryWriter writer = new BinaryWriter(ms);

    geom.Write(writer);

    SqlBytes udtBinary = new SqlBytes(ms);

    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@GeoKey", SqlDbType.UniqueIdentifier));

    cmd.Parameters.Add(new SqlParameter("@GeoData", SqlDbType.VarBinary, -1));

    sw.Reset();

    sw.Start();

    for (int loop = 0; loop < loopCount; loop++)

    {

    cmd.Parameters[0].Value = Guid.NewGuid();

    cmd.Parameters[1].Value = udtBinary;

    cmd.ExecuteNonQuery();

    }

    sw.Stop();

    Console.WriteLine("Udt Binary: " + sw.ElapsedMilliseconds);

    }

     

     

     

    With a convert(geometry, @GeoData) step in the stored procedure.  Doing so yields the expected results as being faster than parsing WKB. 

     

    Regardless, I do see that there is definitely a performance overhead when using SqlClient to pass the geom instance directly that is especially significant for smaller instances (before the parsing/validation costs begin to dominate).  I'm following up on this.
    Wednesday, January 9, 2008 8:55 PM
  •  

    Whats interesting is that I deemed, correctly or incorrectly, from profiler that in passing a udt the convert was being done on the server anyway.

     

    Which if I am correct suggest something else is going on with udt data. On looking at the script you've posts would suggest the binary serialisation to SqlBytes.

     

    Code Block

     

    BinaryWriter writer = new BinaryWriter(ms);

    geom.Write(writer);

    SqlBytes udtBinary = new SqlBytes(ms);

     

     

    Wednesday, January 9, 2008 9:47 PM
  • Yep, that's correct.  Both SqlClient and the snippet I posted are doing essentially the same thing: serializing the instance to bytes, sending the bytes over the wire, and instantiating it on the server using convert. 

     

    The only functional difference is if you want this to be handled implicitly by SqlClient, or do it yourself explicitly.  Currently doing it yourself has a performance advantage over both Wkt and Wkb, while letting SqlClient do it has some additional overhead.  This only works for Format.UserDefined Udts however, as the Format.Native serialization process is not public.
    Wednesday, January 9, 2008 10:17 PM
  • Most people are likely to have either the WKB, the WKT or the geometry floating around their app. I suppose one needs to include the cost of getting the WKB, WKT or base byte from the udt in the loop for it to be truely fair.

    Wednesday, January 9, 2008 10:21 PM
  •  

    I definitely agree that it depends on your input data.  If you already have Wkt or Wkb on the client, then it likely makes sense to use that and convert on the server.

     

    I followed up on this with the SqlClient team and it does appear to be a performance issue with how SqlClient serializes udts.  For now, the best way to get optimal performance when you have instances of SqlGeometry/SqlGeography on the client is to serialize them yourself and send them as bytes through SqlClient as I showed in the previous code snippet.

    Friday, January 11, 2008 7:51 AM