Azure Data Lake with Python


  • Hi, I tried to use a Python script in U-SQL in Azure Data Lake via the Python extension, and I think the version of pandas is quite old.

    Here is my Python extension code:

    @output_calculated =
    REDUCE @inputsample ON StrokeDataID
    PRODUCE
    //input dataframe columns that also exist in the output dataframe (could be deleted from the dataframe in the job if need be)
    JobDataID double, PileDataID double, StrokeDataID double,
    StrokeHeight int, HighPressureHammer int, ActualEnergy double, //for some reason this is turned into a double even though the source table says int
    //Extra dataframe output columns - the actual calculations:
    ecalc_2 double,
    eta_senken double
    USING new Extension.Python.Reducer("", pyVersion : "3.5.1");

    In the Python script ("), the entry function usqlml_main(df) is used, and its input parameter, "df", is the dataframe used for the calculations. Calling the mean() function on the dataframe, df.mean(), works, but calling the std() function, df.std(), produces an error message in the execution step.

    I think the embedded pandas package in Azure Data Lake supports .mean() on a dataframe, but not .std(). As a temporary workaround, I am importing a custom package that contains the latest version of pandas, but of course that is not a great solution.

    Could you look into this and notify me when you update the pandas package? Thanks.
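A minimal sketch of an alternative workaround that avoids calling .std() altogether, computing the sample standard deviation from mean() and sum() instead. It assumes that series arithmetic, count() and sum() work in the embedded pandas (the post only confirms mean()); the helper sample_std and the columns pandas_version and ActualEnergy_std are hypothetical names introduced here for illustration.

```python
import math

import pandas as pd


def sample_std(series):
    """Sample standard deviation (ddof=1) built from mean() and sum() only."""
    n = series.count()                  # number of non-null values
    if n < 2:
        return float("nan")             # undefined for fewer than two values
    deviations = series - series.mean()
    return math.sqrt((deviations ** 2).sum() / (n - 1))


def usqlml_main(df):
    # Hypothetical diagnostic columns, not part of the original job's schema.
    df["pandas_version"] = pd.__version__   # shows which pandas the job really uses
    df["ActualEnergy_std"] = sample_std(df["ActualEnergy"])
    return df
```

If something like this were used in the job above, the two extra columns would presumably also have to be declared in the U-SQL output column list. The version column gives a direct way to confirm whether the embedded pandas really is an old release.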

    Best regards,

    Keunsoo Park

    Tuesday, October 23, 2018 9:51 AM

All replies

  • Hi Keunsoo,

    What was the error you received when you tried the .std() function?

    Wednesday, October 24, 2018 1:09 AM
  • I found a Stack Overflow question that seems very similar.

    The excerpt of interest is:
    The problem is solved. The reason for the error was most likely that the folder contained in the [...] was named "UsqlPythonDeployPackage" in my first approach. However, it should be named "3.5.1".

    Does this help at all?

    Tuesday, February 12, 2019 9:51 PM