...

If your application supports hybrid parallelization, i.e. its MPI processes can be divided into OpenMP threads, you can launch the job as shown in the GROMACS example below:

Tip

OpenMP applications require the OMP_NUM_THREADS variable to be defined. PBS automatically assigns it the value of the ncpus variable defined in the header of the PBS script.

The value of the select variable in the PBS script header corresponds to the number of MPI processes; however, PBS does not export a matching variable into the environment. To avoid typing the value twice, the example below defines the variable MPI_NUM_PROCESSES, which corresponds to the value of select.

Code Block
#!/bin/bash
 
#PBS -q cpu
#PBS -l select=8:ncpus=4
#PBS -l place=scatter
 
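# Number of MPI processes: one line per chunk in the node file
MPI_NUM_PROCESSES=$(wc -l < "${PBS_NODEFILE}")
 
cd "${PBS_O_WORKDIR}"
 
# One rank per node, OMP_NUM_THREADS cores per rank, ranks bound by depth
mpiexec -n ${MPI_NUM_PROCESSES} --ppn 1 -d ${OMP_NUM_THREADS} --cpu-bind depth gmx mdrun -v -deffnm md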
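
The relationship between the header values and the launch line can be checked before a job is submitted. The following is a minimal sketch rather than part of the Supek setup: the function names count_mpi_processes and total_cores are hypothetical, and it assumes, as the example above does, that the node file contains one line per chunk requested with select. A temporary stand-in file replaces ${PBS_NODEFILE} so that the sketch also runs outside a PBS job.

```shell
#!/bin/bash
# Hypothetical helpers (not provided by PBS) for checking a hybrid job layout.

# Count the lines of a node file; the example above uses this count
# as the number of MPI processes.
count_mpi_processes() {
    local nodefile="$1"
    # Reading via stdin keeps wc from printing the file name
    wc -l < "${nodefile}" | tr -d '[:space:]'
}

# Cores used by a hybrid job: MPI processes times OpenMP threads per process
total_cores() {
    local mpi_processes="$1" omp_threads="$2"
    echo $(( mpi_processes * omp_threads ))
}

# Demo with a stand-in node file of 8 lines, mimicking select=8:ncpus=4
demo_nodefile="$(mktemp)"
for i in 1 2 3 4 5 6 7 8; do echo "node${i}"; done > "${demo_nodefile}"
procs="$(count_mpi_processes "${demo_nodefile}")"
echo "MPI processes: ${procs}, threads per process: 4, cores in total: $(total_cores "${procs}" 4)"
rm -f "${demo_nodefile}"
```

For the header used above (select=8:ncpus=4) the demo reports 8 MPI processes and 32 cores in total.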

...

The mpiexec command sets the following environment variables for each MPI rank:

Environment variable  Description
PALS_RANKID           Global rank of the MPI process
PALS_NODEID           Index of the local node (if the job runs on multiple nodes)
PALS_SPOOL_DIR        Temporary directory
PALS_LOCAL_RANKID     Local rank of the MPI process on its node (if the job runs on multiple nodes)
PALS_APID             Unique identifier of the launched application
PALS_DEPTH            Number of processor cores per rank
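
To see these values in practice, each rank can print them at start-up. The wrapper below is a sketch: the function name describe_rank is hypothetical, and the fallback text "unset" is only there so that the script also runs outside mpiexec, where the PALS_* variables are not defined.

```shell
#!/bin/bash
# Hypothetical helper: prints the PALS_* variables seen by the current process.
# Outside of mpiexec the variables are not defined, so "unset" is printed instead.
describe_rank() {
    echo "rank=${PALS_RANKID:-unset} local_rank=${PALS_LOCAL_RANKID:-unset} node=${PALS_NODEID:-unset} depth=${PALS_DEPTH:-unset}"
}

describe_rank
```

Saved as an executable script (the name rank_info.sh is an assumption), it could be launched in place of the application, for example with mpiexec -n ${MPI_NUM_PROCESSES} --ppn 1 -d ${OMP_NUM_THREADS} ./rank_info.sh, so that every rank reports its own values.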


Note

Scientific applications on Supek and cray-pals

Scientific applications that are available on Supek through modulefiles already load the cray-pals module, so it does not need to be loaded again.


Monitoring and management of job execution

Job monitoring

The PBS command qstat is used to display the status of jobs. The basic command syntax is:

Code Block
qstat <options> <job_ID>

Running qstat without additional options prints all jobs of all users:

Code Block
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
2663.x3000c0s25b* mpi+omp_s        kmrkalj           00:36:09 R cpu
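
Because the default listing has a fixed column layout, it can be summarized with standard text tools. The following sketch counts jobs per state. The function name count_jobs_by_state is hypothetical, and it assumes the six-column layout shown above, with two header lines and the state in the fifth column. It reads from standard input, so the sample line above stands in for a live qstat call.

```shell
#!/bin/bash
# Hypothetical helper: tallies jobs per state from the default qstat listing.
count_jobs_by_state() {
    # Skip the two header lines, then count the values of the fifth column (S)
    awk 'NR > 2 && NF >= 6 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Demo on the sample listing from this page instead of a live qstat call
count_jobs_by_state <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
2663.x3000c0s25b* mpi+omp_s        kmrkalj           00:36:09 R cpu
EOF
```

On the cluster the same function would be fed directly, as in qstat | count_jobs_by_state.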

Some of the more commonly used options are:

Option  Description
-E      Groups jobs by server and displays them sorted by ascending ID. When qstat is given a list of jobs, the jobs are grouped by server and each group is displayed by ascending ID. This option also improves qstat performance.
-t      Displays status information for jobs, job arrays, and subjobs.
-p      Replaces the Time Use column with the percentage of work completed. For a job array, this is the percentage of subjobs completed. For a regular job, it is the percentage of the allocated CPU time used.
-x      Displays status information for finished and moved jobs in addition to queued and running jobs.
-Q      Displays queue status in the default format.
-q      Displays queue status in an alternate format.
-f      Displays full job status information in long format.

Examples of use:

Detailed job description:

Code Block
qstat -fxw 2648

The tracejob command extracts and displays log messages for a PBS job in chronological order.

Code Block
tracejob <job_ID>

Example:

Code Block
$ tracejob 2670
 
Job: 2670.x3000c0s25b0n0.hsn.hpc.srce.hr
 
03/30/2023 11:23:24  L    Considering job to run
03/30/2023 11:23:24  S    Job Queued at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr, owner =
                          mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr, job name = mapping, queue = cpu
03/30/2023 11:23:24  S    Job Run at request of Scheduler@x3000c0s25b0n0.hsn.hpc.srce.hr on exec_vnode
                          (x8000c0s0b0n0:ncpus=40:mem=104857600kb)
03/30/2023 11:23:24  L    Job run
03/30/2023 11:23:24  S    enqueuing into cpu, state Q hop 1
03/30/2023 11:23:56  S    Holds u set at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr
03/30/2023 11:24:22  S    Holds u released at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr

Job management

A job can still be managed after it has been submitted.

While a job is in the queue, it can be put on hold to temporarily prevent it from running:

Code Block
qhold <job_ID>

To release the hold and return the job to the queue:

Code Block
qrls <job_ID>

To stop a running job or remove a queued job entirely:

Code Block
qdel <job_ID>

A forced delete should be used for stuck jobs:

Code Block
qdel -W force -x <job_ID>


Postponement of execution

PBS can make the execution of a job depend on other jobs, which is useful when:

  • a job depends on the output or state of previously executed jobs
  • the application requires its components to be executed sequentially
  • output written by one job could interfere with the execution of another

The option that enables this at job submission is:

Code Block
qsub -W depend=<type>:<job_ID>[:<job_ID>] ...

Where <type> can be:

  • after* - starting the current job relative to other jobs
    • after - the current job runs after the specified jobs have started
    • afterok - the current job runs after the specified jobs have completed successfully
    • afternotok - the current job runs after the specified jobs have finished with an error
    • afterany - the current job runs after the specified jobs have finished, regardless of outcome
  • before* - starting other jobs relative to the current job
    • before - the specified jobs run after the current job has started
    • beforeok - the specified jobs run after the current job has completed successfully
    • beforenotok - the specified jobs run after the current job has finished with an error
    • beforeany - the specified jobs run after the current job has finished, regardless of outcome
  • on:<number> - the current job runs only after the given number of before* dependencies, declared by subsequently submitted jobs, have been satisfied

Note

A job submitted with the -W depend=... option will be rejected if the specified job IDs do not exist (or if they are not queued).
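
The depend value is a colon-separated string, so it can be assembled from a list of job IDs. The following is a sketch only: build_depend is a hypothetical helper, not a PBS command, and it merely formats the argument that would be passed to qsub -W.

```shell
#!/bin/bash
# Hypothetical helper: joins a dependency type and one or more job IDs
# into the depend=<type>:<job_ID>[:<job_ID>] form described above.
build_depend() {
    local type="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "build_depend: at least one job ID is required" >&2
        return 1
    fi
    local spec="depend=${type}"
    local id
    for id in "$@"; do
        spec="${spec}:${id}"
    done
    echo "${spec}"
}

# Prints: depend=afterok:1000:1001
build_depend afterok 1000 1001
```

The result could then be used as qsub -W "$(build_depend afterok 1000 1001)" job2, where job2 is a placeholder script name.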

Examples:

If we want job1 to start after the successful completion of job0:

Code Block
[korisnik@x3000c0s25b0n0] $ qsub job0
1000.x3000c0s25b0n0.hsn.hpc.srce.hr
 
[korisnik@x3000c0s25b0n0] $ qsub -W depend=afterok:1000 job1
1001.x3000c0s25b0n0.hsn.hpc.srce.hr
 
[korisnik@x3000c0s25b0n0] $ qstat 1000 1001
Job id                 Name             User              Time Use S Queue
---------------------  ---------------- ----------------  -------- - -----
1000.x3000c0s25b0n0    job0             korisnik          00:00:00 R cpu
1001.x3000c0s25b0n0    job1             korisnik                 0 H cpu
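
In a script, the job ID does not have to be copied by hand, because qsub prints it on standard output. The sketch below captures it and passes it to the dependent submission. submit_after_ok is a hypothetical function name, and job0 and job1 are the script names from the example above.

```shell
#!/bin/bash
# Hypothetical helper: submits a first job script, then a second one that may
# start only after the first has completed successfully (afterok).
submit_after_ok() {
    local first_script="$1" second_script="$2"
    local first_id
    # qsub prints the ID of the newly created job on standard output
    first_id="$(qsub "${first_script}")" || return 1
    qsub -W "depend=afterok:${first_id}" "${second_script}"
}

# Usage on the cluster (prints the ID of the dependent job):
#   submit_after_ok job0 job1
```

The function stops without submitting the second job if the first qsub call fails, so a dependent job is never queued against a missing ID.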


If we want job0 to start only after the successful completion of job1:

Code Block
[korisnik@x3000c0s25b0n0] $ qsub -W depend=on:1 job0
1002.x3000c0s25b0n0.hsn.hpc.srce.hr
 
[korisnik@x3000c0s25b0n0] $ qsub -W depend=beforeok:1002 job1
1003.x3000c0s25b0n0.hsn.hpc.srce.hr
 
[korisnik@x3000c0s25b0n0] $ qstat 1002 1003
Job id                 Name             User              Time Use S Queue
---------------------  ---------------- ----------------  -------- - -----
1002.x3000c0s25b0n0    job0             korisnik                 0 H cpu
1003.x3000c0s25b0n0    job1             korisnik          00:00:00 R cpu