Questions about the Join method and running jobs


  • Hi,

    1) About Join

    The Join method can be applied to two file sets, like this:

    IQueryable<LineRecord> strtable1 = context.FromDsc<LineRecord>("Join_geoip");
    IQueryable<LineRecord> strtable2 = context.FromDsc<LineRecord>("Join_log");

    IQueryable<LineRecord> joined = strtable2.Join(strtable1, l1 => l1.Line.Split(' ').First(), l2 => l2.Line.Split(' ').First(), (l1, l2) => l2);

    The question is: can the Join method be applied to more than two file sets at the same time? Say I want to join three file sets, A, B and C. I do not want to join two of them first, such as A and B, and then join the output with C. I would rather join A, B and C in a single join operation. Can I achieve this?

    2) On the HPC platform, if a job is running, later jobs must wait. Is it possible for a second job to start running before the first job has completed?

    Looking forward to your help.

    Tuesday, June 07, 2011 1:31 PM

All replies

  • 1) The Join() operator does not support joining more than two file sets, just like its corresponding SQL JOIN.

    If you want to join more than two file sets, you can do this by applying joins to pairs of file sets.

    2) The HPC Scheduler controls this. Dryad jobs will run if there are sufficient nodes available. You can specify the number of nodes a query requires using the JobMinNodes and JobMaxNodes properties on the HpcLinqConfiguration. Dryad jobs are scheduled on a per-node basis, so only one vertex can run on a node at any one time.
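
    For point 1, chaining pairwise joins could look roughly like the sketch below. It is an illustration only: it uses plain in-memory LINQ to Objects on strings rather than the HPC LINQ provider, and the class name, method name, and " | " output format are made up for the example. With IQueryable<LineRecord> the key selectors would have to be written inline as lambdas, as in the original post, instead of being passed as a delegate variable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChainedJoinSketch
{
    // Joins three sequences of space-separated lines on their first token
    // by applying Join() twice: (A join B) join C.
    public static List<string> JoinThree(
        IEnumerable<string> a, IEnumerable<string> b, IEnumerable<string> c)
    {
        // Key selector shared by all three inputs: the first token of the line.
        Func<string, string> key = line => line.Split(' ').First();

        return a
            // First pairwise join: keep the key and both matching lines.
            .Join(b, key, key, (la, lb) => new { Key = key(la), A = la, B = lb })
            // Second pairwise join: match the intermediate result against C.
            .Join(c, ab => ab.Key, key, (ab, lc) => ab.A + " | " + ab.B + " | " + lc)
            .ToList();
    }
}
```

    Calling ChainedJoinSketch.JoinThree(a, b, c) returns one combined line for every key that appears in all three inputs. The intermediate anonymous type carries the key forward so the second join does not need to split the line again.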


    Tuesday, June 07, 2011 4:02 PM
  • Here is more detail about the questions above.

    a) I wrote the following code to write records to DSC.

    IQueryable<LineRecord> strtable1 = context.FromDsc<LineRecord>(fileset1);
    IQueryable<LineRecord> strtable2 = context.FromDsc<LineRecord>(fileset2);
     strtable1.Join(strtable2, l1 => l1.Line.Split('\t').Last(), l2 => l2.Line.Split(' ')[0], (l1, l2) => l1.Line.Split('\t').First() + '\t' + l2.Line.Split(' ')[1] + '\n').ToDsc(fileset3).SubmitAndWait(context);

    After the job completed, the records in fileset3 contained some incorrect characters. One of the records looks like this:

    w??http://www.bjut.edu.cn/ontologies/2009/10/esesgrid.owl#压力计3     http://www.bjut.edu.cn/ontologies/2009/10/esesgrid.owl#传感器

    At the beginning of the record, the characters "w??" do not come from fileset1 or fileset2.

    But when I wrote the program like this:

    IQueryable<LineRecord> strtable1 = context.FromDsc<LineRecord>(fileset1);
    IQueryable<LineRecord> strtable2 = context.FromDsc<LineRecord>(fileset2);
     IQueryable<string> datas = strtable1.Join(strtable2, l1 => l1.Line.Split('\t').Last(), l2 => l2.Line.Split(' ')[0], (l1, l2) => l1.Line.Split('\t').First() + '\t' + l2.Line.Split(' ')[1] + '\n');

    DscFileSet fileSet = context.DscService.CreateFileSet(fileset3);
    DscFile file = fileSet.AddNewFile(100);
    File.WriteAllLines(file.WritePath, datas.ToArray<string>());


    After the program completed, the records in fileset3 look like this:

    http://www.bjut.edu.cn/ontologies/2009/10/esesgrid.owl#压力计3     http://www.bjut.edu.cn/ontologies/2009/10/esesgrid.owl#传感器

    This is what I want, and the "w??" is gone.

    So why does "w??" occur? Is there something wrong with my usage of the ToDsc method?

    b) In the HPC cluster manager, I can see that two jobs are submitted at the same time, but the second job only started to run after the first job completed. Is there a way to start the second job before the first job has completed? Can I achieve this by writing some code or by configuring the HPC cluster?

    Many thanks.

    Tuesday, June 07, 2011 4:55 PM
  • a) LineRecord is a fairly simple utility API. My guess would be that your files contain UTF header codes that are not being handled correctly. You need to preprocess your files to ensure that they don't contain characters not handled by LineRecord. Another example of this is that LineRecord only recognizes CR LF as a record delimiter, so the last line in each file being added to the file set needs to end with CR LF.

    b) If your first job requires all the available nodes on the cluster, then the second job will have to wait for it to finish. You could use HpcLinqConfiguration.JobMaxNodes to ask for fewer nodes, and this will allow your second job to start. This is not recommended, as the data locality for both jobs is likely to be poor. In other words, each job is likely to attempt to access data on nodes that the other job is running on, which will result in network copies of data. The recommendation for SP2 is to run one job at a time on all the nodes hosting data within DSC for best performance.
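
    The preprocessing suggested in (a) might be sketched as follows. This is only one possible approach and rests on assumptions: the input files are taken to be UTF-8, the "UTF header codes" are taken to be a leading byte order mark, and the class and method names are hypothetical. It uses only standard .NET calls and does not touch the DSC API.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class LineFilePreprocessor
{
    // Returns the text with a leading UTF-8 byte order mark removed, every
    // line ending normalized to CR LF, and a final CR LF guaranteed.
    // Assumption: the input bytes are UTF-8 encoded.
    public static string Normalize(byte[] raw)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        bool hasBom = raw.Length >= bom.Length && raw.Take(bom.Length).SequenceEqual(bom);
        int offset = hasBom ? bom.Length : 0;

        string text = new UTF8Encoding(false).GetString(raw, offset, raw.Length - offset);

        // Collapse CR LF and lone CR to LF first, then expand every LF to CR LF.
        text = text.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n");

        // Make sure the last record is terminated.
        if (text.Length > 0 && !text.EndsWith("\r\n", StringComparison.Ordinal))
        {
            text += "\r\n";
        }
        return text;
    }

    // Reads inputPath, normalizes it, and writes the result without a BOM.
    public static void PreprocessFile(string inputPath, string outputPath)
    {
        string text = Normalize(File.ReadAllBytes(inputPath));
        File.WriteAllBytes(outputPath, new UTF8Encoding(false).GetBytes(text));
    }
}
```

    Each source file would be passed through PreprocessFile before being added to the file set. Writing the output as raw bytes from a BOM-less encoder avoids reintroducing the header on the way out.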

    FYI: Asking question "2" (above) as question "b" will not get you a different answer :).

    • Proposed as answer by Ade Miller Tuesday, June 07, 2011 5:57 PM
    Tuesday, June 07, 2011 5:57 PM
  • Hi Ade,

    I have a Dryad cluster in which each compute node has two dual-core CPUs. I want to change the setting that allows only one job to run on a compute node at a time, so that I can make full use of the computing resources. Is there a parameter I can adjust for this? Thanks!



    • Proposed as answer by Ade Miller Thursday, June 09, 2011 4:28 PM
    Thursday, June 09, 2011 4:14 PM