Problems with R-script in U-SQL

    Question

  • Hi,

    I am having a problem executing an R script in U-SQL.

    The scenario: I have a .dat file and an R script. The R script translates the .dat file into a data frame. In RStudio it runs fine.

    Because the .dat files are on Azure, we want to see whether it is possible to convert them to a readable format and store them somewhere else on Azure. So I searched and found some good info about running R scripts in U-SQL. The problem seems to be that these scripts all take a formatted input, run some R magic on it, and write the result to a formatted output.

    (https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-lake-analytics/data-lake-analytics-u-sql-r-extensions.md)

    It seems now that my R-script works just fine, but outputting the data does not work as expected. This is the script I am running now:

    DECLARE @INPUT_DAT string = @"/dat2json/data/validationData.dat.201805271617";
    DECLARE @OUTPUT string = @"/dat2json/data/validationdata.out";
    
    DECLARE @vartype string = "double";
    
    DECLARE @var1 string = "Plastic";
    DECLARE @var2 string = "Aluminum";
    
    REFERENCE ASSEMBLY [ExtR];
    
    DECLARE @myRScript string = @"
    datavector <- as.vector(readBin(@INPUT_DAT, @vartype, size = 4, n = 99000))
    Size <- length(datavector)
    numberOfPixels <- Size / 84
    MaterialBase <- factor(rep(c(@var1, @var2), each = (Size / 2)))
    ThicknessBase <- factor(rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7), rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels))
    ThicknessIterated <- factor(rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6), rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels))
    Pixel <- rep(1:numberOfPixels, times = 84)
    dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel, Value = datavector)
    ";
    
    @RScriptOutput = REDUCE @myRScript ON MaterialBase USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
    OUTPUT @ScriptOutput
    TO @OUTPUT
    USING Outputters.Tsv();

    When I run it in Visual Studio, I get the following error:

    E_CSC_USER_ROWSETVARIABLENOTFOUND: Rowset variable @myRScript was not found.
    Description:
    Rowset variables must be assigned to before they can be referenced.
    Resolution:

    Assign a rowset to the rowset variable or remove the reference.

    Does anyone have a solution for this?

    Thursday, July 19, 2018 12:59 PM

All replies

  • Were you able to execute the first sample script on this page successfully?
    Wednesday, July 25, 2018 9:48 PM
    Moderator
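  • @myRScript is just a string variable that you have declared; it is not a rowset. REDUCE needs a rowset as its input, so extract one from the input file first and reduce over that: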


    DECLARE @myRScript string = @"
    datavector <- as.vector(readBin(@INPUT_DAT, @vartype, size = 4, n = 99000))
    Size <- length(datavector)
    …..
    dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel, Value = datavector)
    ";

    @Rs1 = EXTRACT <specify the schema in the input file> FROM @INPUT_DAT USING Extractors.Text();

    @RScriptOutput = REDUCE @Rs1 ON MaterialBase PRODUCE <specify the output schema> USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
    OUTPUT @RScriptOutput
    TO @OUTPUT
    USING Outputters.Tsv();
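
    For reference, here is a minimal end-to-end sketch of this pattern, assuming the input has already been converted to a TSV file with a MaterialBase and a Value column. The paths, the column schema, and the body of the R script are hypothetical and only illustrate the shape of the job. As far as the linked documentation describes it, the reducer hands each group to R as a data frame named inputFromUSQL and reads the result back from outputToUSQL, and READONLY columns are passed through by U-SQL rather than returned from R.

    ```usql
    REFERENCE ASSEMBLY [ExtR];

    // Hypothetical paths, for illustration only.
    DECLARE @IN string = @"/hypothetical/input.tsv";
    DECLARE @OUT string = @"/hypothetical/output.tsv";

    // The R script is only a string; it is passed to the reducer, not reduced itself.
    DECLARE @myRScript string = @"
    # inputFromUSQL holds the rows of one MaterialBase group
    values <- inputFromUSQL$Value
    # leave out the READONLY column; column names must match the PRODUCE clause
    outputToUSQL <- data.frame(Pixel = as.numeric(seq_along(values)), Value = values)
    outputToUSQL
    ";

    // The rowset that REDUCE operates on (assumed schema).
    @Rs1 =
        EXTRACT MaterialBase string,
                Value double
        FROM @IN
        USING Extractors.Tsv();

    @RScriptOutput =
        REDUCE @Rs1 ON MaterialBase
        PRODUCE MaterialBase string, Pixel double, Value double
        READONLY MaterialBase
        USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");

    OUTPUT @RScriptOutput
    TO @OUT
    USING Outputters.Tsv();
    ```

    The built-in text extractors read delimited text, so the binary .dat file would not go through this EXTRACT as-is.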

    Thursday, November 15, 2018 9:24 PM