HPC job failed

  • Question

  • Hi all,

    I created a cluster job in the job management console of my HPC cluster. The job runs an executable MPI application.

    These are the steps I followed:

    In the job management console, click to add a new job.

    Click Next, and on the task page, in the command field, enter: mpiexec.exe  myapp.exe

    In the working directory field, enter: \\headnode\myapp   (the location of my exe file).

    Then submit.

    But the job failed.

    Please help me.

    Monday, October 17, 2011 10:53 AM

All replies

  • Hello,

    There are several reasons an MPI job can fail, for example a wrong net mask, insufficient resources, or the MPI service being down. Before we try to work out the root cause, could you please post the full error message here? You can find the failed job's ID and run the command: task view [jobid].1, or you can browse the job management UI to find the details of the failed job.

    Thanks,

    James

    • Proposed as answer by Ade Miller Tuesday, November 1, 2011 4:28 PM
    Monday, October 17, 2011 10:43 PM
  • I agree with James. Some other general troubleshooting tips:

    Use the debugger (http://msdn.microsoft.com/en-us/library/ee945373.aspx)
    Turn on Auditing for Failures in Local Group Policy Manager
    Make sure that each node can access the share (\\headnode\myapp) and run the executable in it.
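
    The share check in the last tip can be sketched as a small Python script, assuming Python is available on the nodes. The helper name check_share is hypothetical, and the path and file name are the ones from the question. It only tests whether the account running it can see the directory and read the file, so it is a first sanity check, not a full diagnosis. Run it on each node under the same account the job uses.

```python
import os


def check_share(work_dir, exe_name):
    """Return a list of problems found; an empty list means the checks passed.

    check_share is a hypothetical helper written for this thread. It only
    checks visibility and read access for the current user on this machine.
    """
    problems = []
    # The working directory must be reachable from this node.
    if not os.path.isdir(work_dir):
        problems.append(f"cannot reach directory: {work_dir}")
        return problems
    # The executable must exist in that directory and be readable.
    exe_path = os.path.join(work_dir, exe_name)
    if not os.path.isfile(exe_path):
        problems.append(f"executable not found: {exe_path}")
    elif not os.access(exe_path, os.R_OK):
        problems.append(f"executable not readable: {exe_path}")
    return problems


if __name__ == "__main__":
    # Path and file name taken from the original question.
    found = check_share(r"\\headnode\myapp", "myapp.exe")
    for line in found or ["share and executable look reachable"]:
        print(line)
```

    If it reports a problem on any node, fix the share permissions or the path before resubmitting the job.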


    --Patrick Gallucci
    Wednesday, October 19, 2011 7:43 PM