Script hangs in RUNNING mode when 100+ files are processed.

    Question

  • My SA job is currently laying down 100-200 MB files every hour, one JSON doc per line, and I am reading them with a custom JSON extractor. It works great with about 20-50 files. When I bump the file set to include more days for testing, around 200+ input streams (200 files), it just hangs on RUNNING. I left it running overnight and it was still RUNNING the next morning with no progress.

    I'm not sure how to troubleshoot this one since I never get an error back; I have to stop the job manually.

    Is there a limit to the number of files you can process in one EXTRACT? Also, is it better to have many smaller files or fewer larger files? My extractor uses AtomicFileProcessing = false and seems to work fine with 20-30 files (the general shape of the script is sketched after this post).


    Kyle Clubb

    Sunday, August 14, 2016 12:50 AM
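
    For reference, a rough sketch of the kind of script described above. The assembly name, namespace, extractor class, path pattern, and column names are illustrative only, not taken from the actual job:

        // Hypothetical assembly holding the custom extractor, registered in the ADLA database.
        REFERENCE ASSEMBLY MyJsonExtractors;

        @lines =
            EXTRACT jsonLine string,
                    date DateTime      // virtual column bound from the file-set pattern
            FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.json"
            USING new MyUdo.LineJsonExtractor();   // custom extractor, AtomicFileProcessing = false

        @counts =
            SELECT date, COUNT(*) AS docs
            FROM @lines
            GROUP BY date;

        OUTPUT @counts
        TO "/output/doc_counts.csv"
        USING Outputters.Csv();

    Widening the {date...} range is what grows the file set from tens of files to 200+ input streams in a single EXTRACT.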

Answers

  • Hi Kyle

    This is weird. Normally, a job whose vertex runs too long will kill that vertex after about 5 hours. Can you please contact me by email and send me your job link so I can have the team investigate? My email is mrys at Microsoft.

    If you are processing files that each contain a single JSON document, you need to use AtomicFileProcessing = true to avoid failures when the JSON document gets too large and may get split for processing (where that setting goes is sketched after this reply). File sets currently have a scale limit of about 1,000 to 3,000 files (depending on the complexity of the job), but you are below that, and hitting it would show up as a compilation timeout rather than a hang. By the way, this limit will increase significantly in an upcoming refresh.

    In general it is better to have larger files, e.g. 100 to 500 MB for files that cannot be parallelized, and larger still (into the TB range) if they can be processed in parallel chunks, since that reduces some of the per-file overhead.


    Michael Rys

    Monday, August 15, 2016 12:16 AM
    Moderator
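
    A minimal sketch of where the AtomicFileProcessing setting lives, reusing the hypothetical MyUdo.LineJsonExtractor names from the sketch under the question; the attribute goes on the custom extractor class in the code-behind, not in the script:

        using System.Collections.Generic;
        using System.IO;
        using Microsoft.Analytics.Interfaces;

        namespace MyUdo
        {
            // AtomicFileProcessing = true hands each input file to a single vertex
            // in one piece, so a JSON document is never split across extents.
            [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
            public class LineJsonExtractor : IExtractor
            {
                public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
                {
                    // Safe to read the whole stream here because the file is not split.
                    using (var reader = new StreamReader(input.BaseStream))
                    {
                        string line;
                        while ((line = reader.ReadLine()) != null)
                        {
                            // One JSON document per line; emit it as a string column
                            // and parse it downstream in the script.
                            output.Set<string>("jsonLine", line);
                            yield return output.AsReadOnly();
                        }
                    }
                }
            }
        }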

All replies

  • Thanks. I have emailed you my source and job info.

    Kyle Clubb

    Monday, August 15, 2016 1:23 AM