U-SQL and R: ADLA job fails depending on type/position of partition column

    Question

  • Hi,

    I'm running R code in my U-SQL script. My data is stored in Data Lake Store in roughly the following format (hundreds of thousands of rows):

    • Id (string)
    • Value (string)
    • Timestamp (string)

    Most of the time everything works fine. Occasionally, however, the job fails with the following error when I filter the data by Id (selecting only rows that have a certain Id):

    Unhandled exception from user code: "Error in `[.data.frame`(input.dataframe, input.dataframe$Id !=  :   Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'character' " The details includes more information including any inner exceptions and the stack trace where the exception was raised.

    ==== Caught exception RDotNet.EvaluationException

       at RDotNet.REngine.Parse(String statement, StringBuilder incompleteStatement)
       at RDotNet.REngine.<Defer>d__0.MoveNext()
       at System.Linq.Enumerable.LastOrDefault[TSource](IEnumerable`1 source)
       at RDotNet.REngine.Evaluate(String statement)
       at Extension.R.RDriver.RunRCode(REngine rEngine, String rText, RTextTypes rTextType, Boolean isReturnTypeDataFrame) in C:\Users\shravan\Source\Repos\VSTS\USqlExtensions\lang\R\ExtR\RDriver.cs:line 125
       at Extension.R.RDriver.PrepareEnvironmentRunRCode() in C:\Users\shravan\Source\Repos\VSTS\USqlExtensions\lang\R\ExtR\RDriver.cs:line 148
       at Extension.R.UsqlHelperFunctions.<CreateAndProcessDataFrame>d__1.MoveNext() in C:\Users\shravan\Source\Repos\VSTS\USqlExtensions\lang\R\ExtR\UsqlHelperFunctions.cs:line 40
       at ScopeEngine.SqlIpReducer<SV1_Extract_out0,SV5_Process_out0,ScopeEngine::KeyComparePolicy<SV1_Extract_out0,29> >.GetNextRow(SqlIpReducer<SV1_Extract_out0\,SV5_Process_out0\,ScopeEngine::KeyComparePolicy<SV1_Extract_out0\,29> >* , SV5_Process_out0* output) in d:\data\yarnnm\local\usercache\c0d4ee42-3fbf-4a04-ab90-ea0499a3c218\appcache\application_1517273684052_502535\container_e188_1517273684052_502535_01_000001\wd\sqlmanaged.h:line 2802
       at std._Func_class<void>.()(_Func_class<void>* )
       at RunAndHandleClrExceptions(function<void __cdecl(void)>* code)

    Before running the R script, I create a single partition that includes all rows:

    @RInput = SELECT
                Id,
                Value,
                0 AS Par,
                Convert.ToString(DateTimeOffset.Parse(Timestamp).UtcDateTime) AS Timestamp
        FROM @fileOutput;

    When I encounter the error above, I can change

    0 AS Par to Convert.ToDouble(0) AS Par

    and everything works for that particular data set. Sometimes I have to make the change the other way around, or move the Par column to a different position, and only then does the job work.

    I have checked the input data and it is fine: the job runs properly once I happen to find the "correct" partition definition. I have tried to find a solution but have so far been unsuccessful.

    Does anyone have an idea what is causing this and/or how to solve it?

    Thursday, February 8, 2018 11:03 AM

Answers

  • This usually happens when the data in a partition is larger than the memory allocated to it. You do not see an out-of-memory (OOM) error because of the engine we use to transfer data into the R server. An easy fix is to increase the partition count, which reduces the amount of data per partition and therefore the memory pressure.
    • Marked as answer by okmijn Monday, February 12, 2018 6:13 AM
    Friday, February 9, 2018 9:19 PM
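
    As an illustration of increasing the partition count, here is a minimal sketch based on the SELECT from the question. Instead of the constant 0, it derives Par from a hash of Id so that rows are spread over several buckets. The bucket count of 10 and the use of GetHashCode() are assumptions made for this example, not something the thread specifies. The sketch also assumes that Id is never null and that the hash of a given Id is the same on every vertex of the job.

    ```usql
    // Sketch only: spread rows over 10 buckets instead of one constant partition.
    // The bucket count (10) is an assumed value; tune it to the data volume.
    @RInput =
        SELECT Id,
               Value,
               // Cast to long before Math.Abs so int.MinValue cannot overflow.
               // Assumes Id is never null.
               (int) (Math.Abs((long) Id.GetHashCode()) % 10) AS Par,
               Convert.ToString(DateTimeOffset.Parse(Timestamp).UtcDateTime) AS Timestamp
        FROM @fileOutput;
    ```

    Because Par depends only on Id, all rows with the same Id fall into the same bucket, so R code that filters by Id still sees every row for that Id within its partition. The statement that consumes Par (not shown in the question) would stay as it is.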

All replies

  • Hi,

    I think you are correct. After increasing the partition count, the code runs smoothly. I will have to rethink my R code now, but this should take care of the issue.

    Thank you

    Monday, February 12, 2018 6:13 AM
  • Hi,

    Thanks for the suggestion. I get the same error as in the original question, but I am not able to partition the data any further.
    The smallest partition that the problem I am solving allows is about 40 MB.

    Even if I use a pseudo-partition (a single partition), I get the same error: Unhandled exception from user code: "Error in unlist(lapply(list(...), .num_to_date), use.names = FALSE) : Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'character'"

    Do you have any suggestions other than partitioning the data further? My understanding is that the R extensions for ADLA can handle up to 500 MB.

    Thanks in advance.

    • Edited by nikpod Monday, September 24, 2018 6:35 AM
    Monday, September 24, 2018 6:32 AM