locked
What is the best way to Union some polygons together? RRS feed

  • Question

  • Hi folks,

    I'm playing around with the USA Tigerlines for streets on Staten Island in NYC. File is called All Lines, New York County, New York (tl_2008_36061_edges) .

    I grabbed the first random street name i saw in the result set -> Edgecombe Ave.

    Here's what it looks like in SSMS:-



    I was hoping to try and group all those geom polys (which i also need to convert to geography types), into one field - ie. make one field represent those segments which sorta means one field == the street. So i thought i could union the results by a grouped Fullname. But what happens If i happen to have a totally different polygon (in this case, street / street segments) which exist far far away to the segment above? In reality, I would need to have two seperate rows (ie. two seperate streets).

    So i was thinking, what happens if i do some sql that adds a wee bit of buffer to each segment and then checks if those two segments overlap AND have the same full name. If so, then union. rinse repeat?

    That hints at using cusors and i'm cringing at the performance hit that might incur, even though this would be a one off thing.

    Thoughts anyone?

    EDIT: I forgot to mention, David Lean has a good tutorial on joining/unioning polys.

    -Pure Krome-
    Wednesday, April 29, 2009 7:33 AM

All replies

  • What's the problem if a row representing part of the street is far away to the others? It just means that when you UNION them together, you'll end with a multipolygon rather than a single polygon...

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Wednesday, April 29, 2009 3:34 PM
    Answerer
  • the problem is that it shouldn't be a multi-poly ... but two separate polygons / rows.

    so lets imagine that the screenshot above is actually two different streets, which just happen to have the same FullName. the first street is 30 km's to the east of the 2nd street.

    If i union the results above, i'll get one row in the database, with the same name.
    What i need is TWO streets above. they just happen to have the same name, but they are two rows/results in the database.

    this is because they are, in this fake reality, two totally separate streets and need to be representated as such.

    does that make sense?

    -Pure Krome-
    Wednesday, April 29, 2009 10:05 PM
  • Without addressing the "one street vs. two separate streets problem" (how do you decide if a street is one street or two), to union the streets together for one row, use the SQL Spatial Tools GeographyUnionAggregate with the GROUP BY you're considering. This library of functions also has a "VacuousGeoemetryToGeography" or visa-versa function if your looking to convert, make sure they do X,Y to Long,Lat correctly (I haven't used this function extensively). It's at http://www.codeplex.com/sqlspatialtools.

    Write back if you have problems using these. I just answered a similar question in this thread: finding near point ( near cross of two streets) and how to simplify my db

    Hope this helps,
    Bob Beauchemin
    Wednesday, April 29, 2009 11:48 PM
  • Hi Bob :)

    great post! I just downloaded the dll, registered it in sql and tried it. Works like a charm :)

    the dll (in the downloads tab) is a wee bit out of date compared to the source code becuase i had to add in a second arg to the GeographyUnionAggregate method. But it works!!!

    Ok, back to the tough part of the question: Without addressing the "one street vs. two separate streets problem" (how do you decide if a street is one street or two)

    I was going to try and handle this via STDistance or STBuffer. If i was going to use STDistance, i was hoping to add an extra arg to the GeographyUnionAggregate method which is a distance in metres. If two polygon segments distance are equal or less than, that value, they are to be union'd. otherwise, they will not be union'd. I don't really like this solution because the question is which part of street segment 'A' do we do a distance check against street segment 'B' ?

    with the second suggestion, STBuffer(..), i was thinking that we get two segments, buffer them by a value the user selects (eg. 5 metres), and then do an STIntersect(..) check on these two polys. if they intersect, it's assumed they are part of the same street. Union the original two segments, not the two buffered segments.

    Now, with this source code .. maybe the second suggestion could be patched into the code?

    eg.

    SELECT FullName, dbo.GeographyUnionAggregate(geog, 10) -- 10 == 10 metres to buffer check
    FROM tl_2008_36061_edges WHERE FullName = 'Edgecombe Ave' GROUP BY FullName

    It's not perfect, but it could be a pretty good starting point.

    Now, I've looked at the code @ codeplex and .. well .. my blond single brain cell just exploded :( damn it :(

    Any suggestions how I could do this would be nice. Isaac???? :)

    EDIT: Here is Staten Island, with the 'streets' aggregate union'd but without my STBuffer suggestion trick.
    Very pretty :) (NOTE: my mouse pointer didn't show in the image, but i was pointing to the big long blue line on the west side of the island.)

    -Pure Krome-
    Thursday, April 30, 2009 12:41 AM
  • Patching it into the code could eliminate some of the rows from the aggregate answer, but it wouldn't "leave the rows that were eliminated" in the table as separate rows/groups. And it's gonna be slow. You'd need to do two SQL queries; one with an aggregate that eliminates some rows (which could be patched into the code), and one that detects the rows that you've left out of the aggregate answer (left out of the group) and adds them back in.

    To determine if this problem even exists (ie whether or not you have widely separated streets that get aggregated in as "the same street"), you could write a program that iterates through the multilinestrings and determines how many of these "outliers" you have. By issuing the appropriate INSERTs (insert the row for streets no associated with streets of the same name)/UPDATEs (to fix the multilinestring) as part of that program you could fix the condition, but it would be slow. You could start by seeing how many multilinestrings you have in the table at all.

    Hope this helps,
    Bob Beauchemin

    Thursday, April 30, 2009 2:01 AM
  • Hi Bob :)

    i'm terribly sorry, but none of that made any sense to me :( Sorry. I'll try and explain what I think you said (or more to the point, what you said but I'm trying to understand :)  )

    >>eliminate some of the rows from the aggregate answer, but it wouldn't "leave the rows that were eliminated"
    elimate some rows? do u mean split the same named streets into seperate polys?

    >>you could write a program that iterates through the multilinestrings and determines how many of these "outliers" you have
    Is this similar to my suggestion about using STBuffer but for each road segment and union'ing the result?

    >>You could start by seeing how many multilinestrings you have in the table at all.
    Using the UnionAggregate result above (that picture of Staten Island), if there are two streets that have the same name but do NOT connect, then it would be a MultiLineString. If it was a single street then it would be a LineString? If that's the case, then YEAH! that would be a great start and filter point :) Is it possible to split multilinestring into seperate polygons?
    -Pure Krome-
    Thursday, April 30, 2009 3:03 AM

  • OK. The STUnionAggregate (as with any aggregate, like SUM or COUNT) can only put out 1 row for each group. In your case, you're (I'm guessing) grouping by StreetName. There's no way, inside the aggregate, to split a group (ie, four streets have the same name = 1 group) and return more than one row. It returns one row per group (1 row per street name).

    I'm guessing that what you have is four rows for "A Street", for example. Three of the rows (rows 1,2,3) DO combine together to make 1 street, but the last row (row 4) does not combine, it's a separate street. You want to return 2 rows (row 1 is the union of the 3 rows (row 1,2,3) that do combine, row 2 is the row (row 4) that does not combine). There's no way an aggregate can return more than one row for a single group. So the best you could do inside the aggregate, is to eliminate row 4 from the aggregate. Does this make sense? I can't return 2 rows for 1 group (Group by StreetName).


    >If it was a single street then it would be a LineString?

    I don't have your data, but an easy way to find out is to do queries against your new table, that SELECT YourColumn.STGeometryType(). This would tell you if they are linestrings or multilinestrings. If they are all multlinestrings, then go to YourColumn.STNumGeometries(). You could, conceivably, write a table-valued function to split them apart (into multiple rows). The tricky part would be figuring out which linestrings stay together and which ones don't. Or you could write a program to postprocess them and break them up. It would be interesting to see, if you do have multilinestrings, if the individual linestrings are "in order' in some way, so you could do it without comparing each Geometry (in STGeometryN(n)) against the whole set. 

    But at least looking at STGeometryType and STNumGeometries will give you some idea of what GeographyUnionAggregate did. Write back (or send mail at bobb-at-companyname-com, if you want me to look at the data) with the results...or if this doesn't make sense.

    Cheers,
    Bob Beauchemin
    SQLskills

     
    Thursday, April 30, 2009 4:48 AM
  • ...I was going to try and handle this via STDistance or STBuffer. If i was going to use STDistance, i was hoping to add an extra arg to the GeographyUnionAggregate method which is a distance in metres. If two polygon segments distance are equal or less than, that value, they are to be union'd. otherwise, they will not be union'd. I don't really like this solution because the question is which part of street segment 'A' do we do a distance check against street segment 'B' ?...
    STDistance() will always return the distance between the closest points of any two instances (polygons, linestrings, or anything else), so you don't need to worry about which parts of Street A and Street B to compare - if you do StreetA.STDistance.(StreetB) it will work it out for you. Also, where did polygons come from? It looks like you're talking about linestrings..

    If I understand correctly, your desired result is to have each row in your database table corresponding to a single-element geometry ( i.e. a single linestring, or polygon) rather than it corresponding to a single feature (i.e. a street). This is not how I'd do it, but if you really want to do it this way, I'd let the UnionAggregate do its job until you get one row per street then, as Bob says, loop through any multilinestrings and break them up into separate row. You can split any multilinestrings up into linestrings using one of the suggestions on this thread: http://social.msdn.microsoft.com/Forums/en-US/sqlspatial/thread/c2096879-ebf8-4245-8f59-cbe853d00069 (the example there is splitting multipolys into polys, but the principle is the same).

    Nice pic, BTW.

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Thursday, April 30, 2009 5:48 AM
    Answerer
  • OK. The STUnionAggregate (as with any aggregate, like SUM or COUNT) can only put out 1 row for each group. In your case, you're (I'm guessing) grouping by StreetName. There's no way, inside the aggregate, to split a group (ie, four streets have the same name = 1 group) and return more than one row. It returns one row per group (1 row per street name).
    Yep - now that makes perfect sence :)

    I'm guessing that what you have is four rows for "A Street", for example. Three of ... snip snip ... Does this make sense? I can't return 2 rows for 1 group (Group by StreetName).
    Yep - this also makes sence.
    >If it was a single street then it would be a LineString?
    I don't have your data, but an easy way to find out is to do queries against your new table, that SELECT YourColumn.STGeometryType().
    This would tell you if ... snip snip ... the whole set.

    But at least looking at STGeometryType and STNumGeometries will give you some idea of what GeographyUnionAggregate did.
    Before I also continue with my comments, the data is USA TIGER Line data. i posted a the link the first post of this thread, on the second line. Just click on that link (up there) to grab it. We all know the TIGER data is all free.

    Ok. so first of all, back to my original question about MultiLine and Line :- if i've AggregateUnion'd a two polygons together and they do not touch each other, i will get a Multi? Otheriwse, i'll get a Linestring, representing the valid outside border of the two poly's union'd together?

    Assuming that is the case (EDIT: this is the case i believe, based on my test data, below), then i'll run that sql syntax in ssms and lets see what we got. brb

    This is what i ran...

    DECLARE @tempTable TABLE (fullname varchar(500), geog geography)

    INSERT INTO @tempTable
    SELECT FULLNAME, dbo.GeographyUnionAggregate(geog, 0)
    FROM tl_2008_36061_edges
    where FULLNAME != ''
    and ROADFLG = 'Y'
    group by FULLNAME

    select fullname, geog, geog.STGeometryType() as type from @tempTable
    where geog.STGeometryType() = 'MultiLineString'

    (please excuse my bad casing and unclean code .. it was all just quick sql programming in ssms).


    Interesting... out of 944 unique street rows (shown in that image above), 204 are MultiLineStrings. That's around 21.5% of the results.
    Cause i like pics, here's what it looks like:-


    So from here, i'm going to query just one or two results to see how they are multilinestrings ....
    First up.. lets look at some results...

    ok.. easy scenario first -> bad data. Are 11 Ave and 11th Ave the same thing? I'm assuming so...

    The really bad red circles around the cyan coloured road is the second result -> 11th Ave. the darker blue is the first result -> 11 Ave (notice the missing 'th'). They are the same street in real life, but the bad naming data indicates otherwise. Nothing can be done here but a manual aggregate union. Still, this should result in a LineString result :)

    Next sample. lets try 4 Ave. There's no 4th avenue.. so what do we get?


    Interesting :) this is the type of scenario i was wondering about. Ok. So what are these two streets?



    that's not what i expected. looks like more bad data. The top image shows the correct 4th Ave. Bottom picture shows the other part of the multiline. .. but i can't find a 4 Ave anywhere in that view port. I'm guessing it might 4x Ave (eg. 42nd ave or something) ??

    This might be the cost assosiated to having free data (you get what you pay for ) and just live with it and clean it up on a per incident basis.

    your desired result is to have each row in your database table corresponding to a single-element geometry ( i.e. a single linestring, or polygon) rather than it corresponding to a single feature (i.e. a street). This is not how I'd do it, but if you really want to do it this way
    Hmm. not really Tanoshimi (if i understand you right). I actually want each row in the db table to correspond o a single feature (ie a street). Currently, the data contains segments that, when put together, form a street. By grouping them on their FullName field (eg. 4th Ave, Wall Street, etc) you get a single street. It's that result i wish to put into the db. For example, when i get all the rows where FullName = '4 Ave', i get 8 results back. Those 8 results (when AggregateUnion'd) form that 2nd last image, in this post. Did i understand you right?
    -Pure Krome-
    Thursday, April 30, 2009 7:18 AM
  • OK, so it looks like lots of streets with the same name (modulo street name spelling errors) are not continuous. This happens in most cities (including my own street in my own city). But I'd guess that you want to keep those as one row with a multilinestring.

    It really depends on what kind of queries you want to do against the data...do you want to query against street segments or non-continuous streets? I'd guess the latter, but don't really know your query use cases for the data. Does the post office, for example, consider 11th street one non-continuous street or many segments.

    If you do really want to eliminate outliers and keep them as their own segment (non-aggregated), the strategy would be:
    1. Get a common table expression with outlier rows
    2. Do the spatial union query with all of the rows that aren't outliers
    3. Union (SQL Union, not spatial union) the outlier rows back into the set from step 2

    Using your dataset (I didn't see the link to it ;-), here's a query that will give you outliers:

    select t1.id, t1.FULLNAME,
           MIN(t1.geog.STDistance(t2.geog)) as min_distance,
           COUNT(t2.id) as count_of_same_named_roads
    from tl_2008_36061_edges t1
    join tl_2008_36061_edges t2
     on t1.FULLNAME = t2.FULLNAME and t1.ID != t2.ID
    where t1.FULLNAME != ''
    group by t1.id, t1.fullname
    having MIN(t1.geog.STDistance(t2.geog)) > 0  -- replace the zero with any distance you want

    I got 111 outliers with that query. Looking at the dataset a bit, there are ~1650 row with no street name at all. You don't want to aggregate those, unless you care for streets with no names, and if you do, you can use a variation of the outlier query. In step 2, the "rows that aren't outliers" are the ones where the id not in (select id from your outlier query union select id from tl_2008_36061_edges  where fullname = ''). If you send me the aggregate query you were doing (so I know what other columns to keep for the aggregated rows), I could write it as a single query or a workflow. The reason I thought of the outlier query as a CTE is so you won't have to execute it twice (it took about 3 minutes on my laptop).

    You can visualize the outliers (or any segment to make sure it really has a certain distance with some other segement) with the following query to check your work:

    select t1.id as t1_id , t1.FULLNAME, t2.id as t2_id, t1.geog.STDistance(t2.geog), t2.geog
    from tl_2008_36061_edges t1
    join tl_2008_36061_edges t2
     on t1.FULLNAME = t2.FULLNAME
     and t1.ID != t2.ID   -- comment this line out if you want to see your sample segment too
    where t1.id = 9210   -- where 9210 is the number of my sample segment

    But again, this all depends on whether you really want aggregated roads with little or no detail from the indiidual rows, or segments (like you originally have without aggregation). If you don't want aggregated roads because your use cases don't call for them, you won't care at all ;-)

    Hope this helped,
    Bob Beauchemin

    Thursday, April 30, 2009 9:14 AM
  • Actually, after thinking about it some more, even this algorithm to catch outliers isn't perfect. If there are 2 different "4th Streets" and each one has three segments, all 6 will be union'd together (producing a multilinestring) even though there are two different segments. The algorithm needs refining a bit...but this should get you started.

    Cheers,
    Bob Beauchemin
    Thursday, April 30, 2009 6:21 PM
  • Hi Bob.

       i'm going to have a play when i get to work, this morning. One of things i did to filter out non streets was to make sure my intinial query did where FULLNAME != '' and ROADFLG = 'Y'. This way, it dropped all the other places, like the docks on hudson river, rail, etc.

    But you're right about 4th Street. That's just bad data IMO and i think i need to live with all those samples (which is fine by me).

    >>do you want to query against street segments or non-continuous streets
    the non-continous streets. what i'll want to do is have the linestring or multilinestring... which is a shape/poly. then i'll do be trying to find which streets intersect my crime stats. so, i'n my images above, if there is at least one crime that physically intersects 4th ave, (STIntersects(..)) then that row/street/poly is a result i'll use.

    Depending on the accuracy of
    (a) the crime points we get
    (b) the tiger lines which represent a street

    i might have to add a wee bit of Buffer() to the polys and save that as a seperate field, for each row. index and then query against that.

    We'll also have other things (we've been told) like road incidents, maintenence problems (electricity, water, phone, etc), traffic issues, etc...

    and they want the entire street 'marked' as a result ... even if it's a long street. this way we can see the bigger picture of the area, a wee bit more.
    -Pure Krome-
    Thursday, April 30, 2009 10:21 PM
  • Hi folks,

    Just thought I might add a bit more info to this thread.

    I'm still trying to decide how i want to group various names to a single physical street .. (eg. a street that goes through NYC and then into QUEENS or BRONX .. eg, same name, different county id and/or different state id).

    But until i can get some data to play with (including Bob's Outlier examples, above), i need to import the data :)

    Well, good news is this
    a) FWTools is a command line prog.
    b) Shape2Sql is also has command line args.

    So i've ended up making a winforms app that
    a) Looks for all .shp files (recursively, starting from a root directory).
    b) FWTool convert from NAS83 to WGS84 and place in a new guid directory
    c) For each new guid directory, import that shapefile into an existing table (first file to import optionally can delete then create the table), using Shape2Sql.

    so this is a nice way to automate all the street files.

    Sucks how it took me around 10+ minutes to queue up all 'edge' folders for the state of NY ...  i think i mght just download the entire tiger ftp directory .. that will be a lot quicker :) then manually copy out all the edge folders into a new target directory. From there, i'll get my app to look at the new directory and off she goes :)

    So from here i'm just importing a few counties in the state of NY (which touch each other) to see if i can find any streets that cross different counties. This (in effect) gives them a different street name.

    eg.
    a) Fake Street, New York City (which is really manhatten), New York, USA
    b) Fake Street, Bronx, New York, USA

    because ideally, I'd love to have a 1 to many table of names per MultiLineString 'street'. This could get messy with ouliers .. BUT .. if i can base it on distance AND different county id's ... i might be in luck :)

    thoughts?

    EDIT: screenie..


    -Pure Krome-
    Saturday, May 2, 2009 1:22 AM
  • thoughts?
    Be interesting to know how you get on.... I suspect what you'll find is that, whatever method you employ, there will always be cases that will need manual intervention to get the data to a sufficient level of accuracy for your application (but, without knowing exactly what your application is, it's hard to say).

    I think it is a good academic exercise but I have say that, if it were me, I would have probably given up and bought some street map data from a vendor by now! One of the things you're paying for in commercial data is the fact that some other mug has already been through and done the sort of fixes to the data that you're having to do now. But, TIGER, is free, and if there is a way of making it fit for use then of course it would be great if we could do that...

    Beginning Spatial with SQL Server http://www.apress.com/book/view/1430218290
    Saturday, May 2, 2009 7:02 AM
    Answerer
  • >>I would have probably given up and bought some street map data from a vendor by now!

    If only i could pursade the powers that be, to do that. Downside is, it's not just usa that i would have to purchase data for. so the costs go up and up.

    Secondly, i'm not sure if my idea, once i've got my data sorted out, is a viable solution for the problem at hand - so in essence this is also a bit of R&D on the side (while the main website works fine, but not as good as i want it to be). image if i bought the data and found out, what i was trying to do .. wasn't better than our current solution? egads :(
    -Pure Krome-
    Sunday, May 3, 2009 9:51 AM