none
Using Geographic data in ADLA

    Question

  • I have weather data in my data lake with coordinates for weather events.  In SQL Server I have geographic information for State, Counties, Townships, Ranges, and properties.  What I need to do is cross-reference the coordinates in the weather data for the shapes for each of the geographic types.

    If this were SQL Server, I'd simply use STIntersects() to do my matches.  Since this is a data lake, I'm trying to find the best way to handle 60+ years of weather data.

    I found an example online that shows how to pull in the c# assemblies needed to write USQL with Geographic methods. The problem is, you can't use these extended data types with EXTRACT or OUTPUT statements. So I can't simply copy my geographic data to ADLS.  Since there are functions for turning the geographies into text and back again, I attempted to create text files with textual representations of my geographies.

    These geographies are either POLYGON or MULTIPOLYGON shapes. Here's the script I used to try and copy this data to ADLS.

    @state =
        SELECT *
        FROM EXTERNAL HailResarch.CPISSTGVMSQL_ODS EXECUTE @"
            SELECT 
    	        StateCode, StateName, PostalCode, Shape.ToString() AS ShapeText
            FROM cpis_ods.ods.[State]
            WHERE
    	        StateName NOT IN ('Unknown','Foreign Country')";
    OUTPUT @state
    TO "/sandbox/hail_research/StateShapes.csv"
    USING Outputters.Csv(outputHeader: true, quoting: true, rowDelimiter: "\n");

    Unfortunately, this doesn't work.  I get an error "Value in the SQL column ShapeText at row 0 is too large." Turns out Strings are limited to 128 kilobytes.

    Is there an official way to perform this kind of cross-reference in USQL? Any advice will be appreciated!



    • Edited by Shannon Lowder Wednesday, July 18, 2018 9:50 PM cleaning out html formatting
    Wednesday, July 18, 2018 8:27 PM

All replies

  • Dear Shannon

    Unfortunately you will have to do a bit more processing.

    I would suggest that you read the geometry shapes into byte[] (that can be up to 4MB) and then converts it into a geometry shape.

    Also note that you will have to wrap the UDTs (geometry/geography) into wrapper types that provide a serialization, if you need to flow the data between stages in your script (since it needs to be temporarily serialized).

    Best regards

    Michael


    Michael Rys

    Thursday, July 19, 2018 11:50 PM
    Moderator
  • Unfortunately, some of my multipolygon shapes are over 4MB. Alaska, Florida, North Carolina, Louisiana, Maryland, and California are over this limit. Is there any way around the upper limit for binaries in ADLA?

    In the meantime, I'm going to see if I can use the County level data.  Hopefully, they're all below the 4MB limit.


    Monday, July 23, 2018 12:27 PM
  • Unfortunately, the 4MB limit is currently the max. We are looking into increasing the rowsize to 16MB, so that would give you some additional size. Alternatively, you may want to look at reducing the complexity of the shapes. I think the spatial library has some reduction methods if I remember correctly.

    Michael Rys

    Tuesday, July 24, 2018 9:21 PM
    Moderator
  • I'll look into reducing the complexity.  I did finally get a version of the code to work.  What I found is once the script got to the STIntersects check, the performance just came to a halt.  Maybe I should geocode the rows BEFORE that data gets to the lake?

    What's your experience been with this kind of workload in ADLA?

    Wednesday, July 25, 2018 9:03 PM