U-SQL + Python basic questions

    Question

  • I am just getting started learning how to integrate Python with U-SQL. I am working through this example:

    REFERENCE ASSEMBLY [ExtPython];
    
    DECLARE @myScript = @"
    def get_mentions(tweet):    
          return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
    
    def usqlml_main(df):    
          del df['time']    
          del df['author']    
          df['mentions'] = df.tweet.apply(get_mentions)    
          del df['tweet']    
          return df
    ";
    
    @t  =
        SELECT * FROM
            (VALUES
                ("D1","T1","A1","@foo Hello World @bar"),
                ("D2","T2","A2","@baz Hello World @beer")
            ) AS D( date, time, author, tweet );
    @m  =
        REDUCE @t ON date
        PRODUCE date string, mentions string
        USING new Extension.Python.Reducer(pyScript:@myScript);
    
    OUTPUT @m
        TO "/tweetmentions.csv"
        USING Outputters.Csv();

    Some questions:

    • How does usqlml_main take in a dataframe? Is D(date, time, author, tweet) constructing a pandas dataFrame?
    • Inside of usqlml_main, what is the 'apply' function in df.tweet.apply(get_mentions)?
    • What does REDUCE do in this case? Is this always needed when integrating U-SQL with Python?

    Thank you!

    Friday, June 9, 2017 5:01 PM

All replies

    • Q: How does usqlml_main take in a dataframe? Is D(date, time, author, tweet) constructing a pandas dataFrame?
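    • A: D(date, time, author, tweet) only names the columns of the U-SQL rowset @t; it does not build a DataFrame by itself. When @t is passed to Extension.Python.Reducer, the extension places the rows into a pandas DataFrame and hands it to usqlml_main as df.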
    • Q: Inside of usqlml_main, what is the 'apply' function in df.tweet.apply(get_mentions)?
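    • A: this is the pandas apply() method. Strictly speaking, df.tweet is a single column (a pandas Series), so the call is Series.apply: it runs get_mentions once for each tweet value and returns the results as a new column. The DataFrame variant of apply() is documented here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html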
    • Q: What does REDUCE do in this case? Is this always needed when integrating U-SQL with Python?
    • A: REDUCE is needed. Unfortunately, its name is somewhat misleading in this context. In the U-SQL/Python context, REDUCE distributes a large set of rows into smaller partitions based on the key value in a specific column. The example above partitions on the date column. REDUCE does not imply that the code has to return a "reduced" set of rows; the reducer could even return more rows than it received. Again, it's more about data partitioning.
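
To see what the answers above mean in practice, here is a rough sketch that runs the same Python functions locally with plain pandas, outside of U-SQL. It is only an approximation: the DataFrame `t` is a hand-built stand-in for the rowset @t, and the `groupby("date")` loop is my assumption for imitating how REDUCE ... ON date partitions the rows. It is not how the extension is actually implemented.

```python
import pandas as pd


def get_mentions(tweet):
    # Keep every word that starts with '@', drop the '@', join with ';'.
    return ';'.join(w[1:] for w in tweet.split() if w[0] == '@')


def usqlml_main(df):
    # Same body as the script in the question.
    del df['time']
    del df['author']
    # df.tweet is a pandas Series; apply() calls get_mentions per value.
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df


# Hand-built stand-in for the rowset @t (same values as the U-SQL script).
rows = [
    ("D1", "T1", "A1", "@foo Hello World @bar"),
    ("D2", "T2", "A2", "@baz Hello World @beer"),
]
t = pd.DataFrame(rows, columns=["date", "time", "author", "tweet"])

# Assumption: imitate REDUCE @t ON date by calling usqlml_main once per
# group of rows that share a date, then stitching the results together.
parts = [usqlml_main(group.copy()) for _, group in t.groupby("date")]
m = pd.concat(parts, ignore_index=True)

print(m)
```

The result has only the date and mentions columns, matching the PRODUCE clause, with "foo;bar" for D1 and "baz;beer" for D2. Note that the row count is unchanged here, which illustrates that REDUCE is about partitioning rather than shrinking the data.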
    Saturday, June 10, 2017 6:04 PM
    Moderator