Get file name being read in C# Mapper?

    Question

  • Presumably there is some way to know the name of the file being read in a C# mapper using Hadoop Streaming? (and even better, also some identifier of the chunk)

    I would appreciate any pointers in the right direction on how to get this detail in my job.

    Cheers, James


    James Beresford @ www.bimonkey.com & @BI_Monkey
    SSIS / MSBI Consultant in Sydney, Australia
    SSIS ETL Execution Control and Management Framework @ SSIS ETL Framework on Codeplex

    Friday, November 23, 2012 5:17 AM

Answers

All replies

  • Have you tried "Isotope.Sdk.ClusterService"? It includes the classes that allow you to retrieve this information.

    - Thanks, Sumit

    Friday, November 23, 2012 2:26 PM
  • @BI Monkey,

    When running Hadoop streaming jobs, a fair amount of the configuration data is available in environment variables; see the very bottom of this page (http://hadoop.apache.org/docs/mapreduce/current/streaming.html).

    You then need to find the Hadoop configuration variable.  From this page http://hadoop.apache.org/docs/mapreduce/current/mapred_tutorial.html#Configured+Parameters, we can see mapreduce.map.input.file is likely the environment variable to query.

    Configured Parameters

    The following properties are localized in the job configuration for each task's execution:

    Name                        Type     Description
    mapreduce.job.id            String   The job id
    mapreduce.job.jar           String   job.jar location in job directory
    mapreduce.job.local.dir     String   The job specific shared scratch space
    mapreduce.task.id           String   The task id
    mapreduce.task.attempt.id   String   The task attempt id
    mapreduce.task.ismap        boolean  Is this a map task
    mapreduce.task.partition    int      The id of the task within the job
    mapreduce.map.input.file    String   The filename that the map is reading from
    mapreduce.map.input.start   long     The offset of the start of the map input split
    mapreduce.map.input.length  long     The number of bytes in the map input split
    mapreduce.task.output.dir   String   The task's temporary output directory

    Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. To get the values in a streaming job's mapper/reducer, use the parameter names with the underscores.
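
    The renaming rule in the note above can be sketched in C# roughly as follows. This is only a minimal sketch, not part of the Hadoop or HDInsight API: StreamingConfig, ToEnvName and Get are hypothetical helper names, and which properties are actually exported (and under which prefix) may depend on the Hadoop version on the cluster.

```csharp
using System;

// Hypothetical helper (not part of Hadoop or HDInsight) that mirrors the
// dots-to-underscores renaming described in the note.
public static class StreamingConfig
{
    // Example: "mapreduce.map.input.file" becomes "mapreduce_map_input_file".
    public static string ToEnvName(string propertyName)
    {
        return propertyName.Replace('.', '_');
    }

    // Returns the value of the job property as seen by a streaming task,
    // or null when the variable is not set in this process.
    public static string Get(string propertyName)
    {
        return Environment.GetEnvironmentVariable(ToEnvName(propertyName));
    }
}
```

    A mapper could then call StreamingConfig.Get("mapreduce.map.input.file") and treat a null result as "not exported under this name".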


    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Monday, November 26, 2012 5:34 PM
    Owner
  • Cheers Matt...  got me in the right direction. For anyone wanting to replicate the experience, the right code is: 

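    // Name of the file the current map task is reading from
    string FileName = System.Environment.GetEnvironmentVariable("map_input_file");
    // Offset of the start of the current input split, used here to identify the chunk
    string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");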

    Note that there is no need to include the "mapreduce" prefix in the variable name.
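
    For anyone who wants to see those lookups in context, here is a minimal sketch of a complete streaming mapper. The class name FileNameMapper, the helper FirstSet and the tab-separated output format are illustrative assumptions. The fallback to the "mapreduce_"-prefixed names is also an assumption, included because the prefix appears in the documentation quoted earlier in the thread but was not needed on this cluster.

```csharp
using System;

// Illustrative mapper: tags every input line with the file name and the
// split offset read from the streaming environment variables.
public static class FileNameMapper
{
    // Returns the value of the first variable in 'names' that is set and
    // non-empty, or 'fallback' if none of them is.
    public static string FirstSet(string fallback, params string[] names)
    {
        foreach (string name in names)
        {
            string value = Environment.GetEnvironmentVariable(name);
            if (!string.IsNullOrEmpty(value))
            {
                return value;
            }
        }
        return fallback;
    }

    public static void Main()
    {
        // Try the unprefixed name first, then the prefixed one (assumption).
        string fileName = FirstSet("unknown", "map_input_file", "mapreduce_map_input_file");
        string start = FirstSet("0", "map_input_start", "mapreduce_map_input_start");

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            // Key: file name plus split offset. Value: the original line.
            Console.WriteLine("{0}:{1}\t{2}", fileName, start, line);
        }
    }
}
```

    Run outside Hadoop, neither variable is set, so the mapper falls back to "unknown" and "0"; that makes it easy to test locally by piping a text file into the executable.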

    Blogged in a little more depth here: http://www.bimonkey.com/2012/11/reference-environment-variables-in-c-mappers-for-hdinsight/

    Cheers, James




    • Marked as answer by BI Monkey Tuesday, November 27, 2012 9:52 AM
    • Edited by BI Monkey Tuesday, November 27, 2012 9:52 AM
    Tuesday, November 27, 2012 12:40 AM