...
If your application supports hybrid parallelization, i.e. its MPI processes can be split into OpenMP threads, you can launch the job as shown in the GROMACS example below:
Tip |
---|
OpenMP applications require a variable to be defined. Value of the variable |
Code Block | ||
---|---|---|
| ||
#!/bin/bash
#PBS -q cpu
#PBS -l select=8:ncpus=4
#PBS -l place=scatter
MPI_NUM_PROCESSES=$(cat ${PBS_NODEFILE} | wc -l)
cd ${PBS_O_WORKDIR}
mpiexec -n ${MPI_NUM_PROCESSES} --ppn 1 -d ${OMP_NUM_THREADS} --cpu-bind depth gmx mdrun -v -deffnm md |
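To make the layout requested by the script above concrete, here is a minimal sketch of the arithmetic. It assumes that PBS writes one line per chunk into ${PBS_NODEFILE} and sets OMP_NUM_THREADS to the chunk's ncpus value, which is the usual PBS behaviour when mpiprocs and ompthreads are not specified. The helper name hybrid_layout is hypothetical and used only for illustration.

```shell
#!/bin/bash
# hybrid_layout is a hypothetical helper, not part of PBS or cray-pals.
# Arguments: <number of chunks from select=> <ncpus per chunk>
hybrid_layout() {
    local chunks=$1 ncpus=$2
    # Assumption: one MPI rank per chunk (--ppn 1 with place=scatter)
    # and one OpenMP thread per allocated core (-d ${OMP_NUM_THREADS}).
    echo "ranks=${chunks} threads_per_rank=${ncpus} total_cores=$((chunks * ncpus))"
}

# Values taken from "#PBS -l select=8:ncpus=4"
hybrid_layout 8 4
```

Under these assumptions, select=8:ncpus=4 gives 8 MPI ranks with 4 OpenMP threads each, i.e. 32 cores in total.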
...
The mpiexec command sets the following environment variables on each MPI rank:
Environment variable | Description |
---|---|
PALS_RANKID | Global rank of the MPI process |
PALS_NODEID | Index of the local node (if the job runs on multiple nodes) |
PALS_SPOOL_DIR | Temporary directory |
PALS_LOCAL_RANKID | Local rank of the MPI process on its node (if the job runs on multiple nodes) |
PALS_APID | The unique identifier of the application you executed |
PALS_DEPTH | Number of processor cores per rank |
Note |
---|
Scientific applications on Supek and cray-pals: scientific applications that are available on Supek through the modulefiles tool already load this module, so there is no need to load it again. |
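As an illustration of the variables in the table above, the following sketch prints the placement of the rank it runs in. Only the PALS_* names come from the table; the function name report_rank and the `?` placeholder for unset variables are assumptions made for this example.

```shell
#!/bin/bash
# report_rank is a hypothetical helper; only the PALS_* variable names
# are taken from the table of variables set by mpiexec.
report_rank() {
    # ${VAR:-?} prints "?" when the script is started outside mpiexec
    echo "rank ${PALS_RANKID:-?} local_rank ${PALS_LOCAL_RANKID:-?} node ${PALS_NODEID:-?} depth ${PALS_DEPTH:-?}"
}

report_rank
```

Saved under a name of your choice (for example report_rank.sh) and made executable, it could be launched inside a job with something like `mpiexec -n ${MPI_NUM_PROCESSES} --ppn 1 ./report_rank.sh`, in which case each rank would print its own line.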
Monitoring and management of job execution
Job monitoring
The PBS command qstat is used to display the status of jobs. The basic command syntax is:
Code Block |
---|
qstat <options> <job_ID> |
Running the qstat command without additional options lists all jobs of all users:
Code Block |
---|
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2663.x3000c0s25b* mpi+omp_s kmrkalj 00:36:09 R cpu |
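The default listing is column-oriented, so it can be summarised with standard text tools. The sketch below counts jobs in a given state. It assumes the six-column layout shown above (two header lines, state in the fifth whitespace-separated field) and job names without spaces. The function name count_state is hypothetical.

```shell
#!/bin/bash
# count_state is a hypothetical helper that reads default `qstat` output
# on stdin and counts the jobs whose state column (S) equals $1.
count_state() {
    # NR > 2 skips the header line and the dashed separator line
    awk -v s="$1" 'NR > 2 && $5 == s { n++ } END { print n + 0 }'
}

# Demonstration on the listing shown above; prints 1
count_state R <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
2663.x3000c0s25b* mpi+omp_s        kmrkalj           00:36:09 R cpu
EOF
```

On the cluster the same function would be fed from the live command, e.g. `qstat | count_state Q`.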
Some commonly used options are:
-E | Groups jobs by server and displays them sorted by ascending ID. When qstat is given a list of jobs, the jobs are grouped by server and each group is displayed in ascending ID order. This option also improves the performance of qstat. |
-t | Displays status information for jobs, job arrays, and subjobs. |
-p | Replaces the Time Use column with the percentage of work completed. For a job array, this is the percentage of subjobs completed. For a regular job, this is the percentage of the allocated CPU time used. |
-x | Displays status information for completed and moved jobs in addition to pending and running jobs. |
-Q | Shows queue status in standard format. |
-q | Displays queue status in an alternate format. |
-f | Displays full job status in long format. |
Examples of use:
Detailed job description:
Code Block |
---|
qstat -fxw 2648 |
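In this command, -f requests the full listing, -x includes finished and moved jobs, and -w keeps each attribute on a single unwrapped line, which makes the output easier to process. As a sketch, a single attribute can be pulled out of that listing as follows. It assumes the usual `name = value` line format of the full listing, and the function name job_attr is hypothetical.

```shell
#!/bin/bash
# job_attr is a hypothetical helper that reads `qstat -f` output on stdin
# and prints the value of the attribute named in $1 (first match only).
job_attr() {
    awk -v key="$1" '
    {
        line = $0
        sub(/^[ \t]+/, "", line)            # drop the leading indentation
        prefix = key " = "
        if (index(line, prefix) == 1) {     # line starts with "<key> = "
            print substr(line, length(prefix) + 1)
            exit
        }
    }'
}

# Intended use on the cluster (not executed here):
#   qstat -fxw 2648 | job_attr job_state
```

Matching on the full `name = ` prefix rather than splitting on `=` keeps values that themselves contain an equals sign intact.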
The tracejob command extracts and displays log messages for a PBS job in chronological order.
Code Block |
---|
tracejob <job_ID> |
Example:
Code Block |
---|
$ tracejob 2670
Job: 2670.x3000c0s25b0n0.hsn.hpc.srce.hr
03/30/2023 11:23:24 L Considering job to run
03/30/2023 11:23:24 S Job Queued at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr, owner =
mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr, job name = mapping, queue = cpu
03/30/2023 11:23:24 S Job Run at request of Scheduler@x3000c0s25b0n0.hsn.hpc.srce.hr on exec_vnode
(x8000c0s0b0n0:ncpus=40:mem=104857600kb)
03/30/2023 11:23:24 L Job run
03/30/2023 11:23:24 S enqueuing into cpu, state Q hop 1
03/30/2023 11:23:56 S Holds u set at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr
03/30/2023 11:24:22 S Holds u released at request of mhrzenja@x3000c0s25b0n0.hsn.hpc.srce.hr |
Job management
Jobs can also be managed after submission.
While a job is in the queue, it can be temporarily put on hold with the command:
Code Block |
---|
qhold <job_ID> |
A held job is released back to the queue with the command:
Code Block |
---|
qrls <job_ID> |
A job is stopped completely, or removed from the queue, with the command:
Code Block |
---|
qdel <job_ID> |
A forced stop should be used for stuck jobs:
Code Block |
---|
qdel -W force -x <job_ID> |
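The management commands above take job IDs, so they combine naturally with the PBS qselect command, which prints the IDs of jobs matching given criteria. The sketch below puts all of the current user's queued jobs on hold. The function name hold_my_queued_jobs is hypothetical, and it assumes that ${USER} holds your cluster username.

```shell
#!/bin/bash
# hold_my_queued_jobs is a hypothetical helper built on qselect and qhold.
hold_my_queued_jobs() {
    # qselect -u <user> -s Q lists the IDs of that user's jobs in state Q
    qselect -u "${USER}" -s Q | while read -r job_id; do
        qhold "${job_id}"
    done
}

# Intended use on the cluster (not executed here):
#   hold_my_queued_jobs
```

The held jobs can later be released one by one with qrls, or in the same loop style by selecting state H instead of Q.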
Postponement of execution
PBS can make the execution of a job depend on other jobs, which is useful in cases such as:
- the execution of a job depends on the output or state of previously executed jobs
- the application requires the sequential execution of various components
- output written by one job may interfere with the execution of another
The option that enables this functionality at job submission is:
Code Block |
---|
qsub -W depend=<type>:<job_ID>[:<job_ID>] ... |
Where <type> can be:
- after* - starts the current job relative to the specified jobs
  - after - the current job runs after the specified jobs have started
  - afterok - the current job runs after the specified jobs have completed successfully
  - afternotok - the current job runs after the specified jobs have terminated with an error
  - afterany - the current job runs after the specified jobs have finished, regardless of outcome
- before* - starts the specified jobs relative to the current job
  - before - the specified jobs run after the current job has started
  - beforeok - the specified jobs run after the current job has completed successfully
  - beforenotok - the specified jobs run after the current job has terminated with an error
  - beforeany - the specified jobs run after the current job has finished, regardless of outcome
- on:<number> - the current job runs only once the given number of before* dependencies, declared by subsequently submitted jobs, have been satisfied
Note |
---|
A job with the -W depend=... option will not be submitted if the specified job IDs do not exist (or if they are not queued). |
Examples:
If we want job1 to start after the successful completion of job0:
Code Block |
---|
[korisnik@x3000c0s25b0n0] $ qsub job0
1000.x3000c0s25b0n0.hsn.hpc.srce.hr
[korisnik@x3000c0s25b0n0] $ qsub -W depend=afterok:1000 job1
1001.x3000c0s25b0n0.hsn.hpc.srce.hr
[korisnik@x3000c0s25b0n0] $ qstat 1000 1001
Job id Name User Time Use S Queue
--------------------- ---------------- ---------------- -------- - -----
1000.x3000c0s25b0n0 job0 korisnik 00:00:00 R cpu
1001.x3000c0s25b0n0 job1 korisnik 0 H cpu |
If we want job0 to start only after the successful completion of job1:
Code Block |
---|
[korisnik@x3000c0s25b0n0] $ qsub -W depend=on:1 job0
1002.x3000c0s25b0n0.hsn.hpc.srce.hr
[korisnik@x3000c0s25b0n0] $ qsub -W depend=beforeok:1002 job1
1003.x3000c0s25b0n0.hsn.hpc.srce.hr
[korisnik@x3000c0s25b0n0] $ qstat 1002 1003
Job id Name User Time Use S Queue
--------------------- ---------------- ---------------- -------- - -----
1002.x3000c0s25b0n0 job0 korisnik 0 H cpu
1003.x3000c0s25b0n0 job1 korisnik 00:00:00 R cpu |
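Because qsub prints the ID of the submitted job, the afterok pattern from the first example can be scripted instead of typing the IDs by hand. The following is a minimal sketch under that assumption. The function name submit_chain is hypothetical, and job0, job1 and job2 stand for your own job scripts.

```shell
#!/bin/bash
# submit_chain is a hypothetical helper: it submits the given job scripts
# in order, making each one depend on the successful completion (afterok)
# of the previous one, and prints the job ID returned for each submission.
submit_chain() {
    local prev="" job_id script
    for script in "$@"; do
        if [ -z "${prev}" ]; then
            # The first job has no dependency
            job_id=$(qsub "${script}") || return 1
        else
            job_id=$(qsub -W depend=afterok:"${prev}" "${script}") || return 1
        fi
        echo "${job_id}"
        prev="${job_id}"
    done
}

# Intended use on the cluster (not executed here):
#   submit_chain job0 job1 job2
```

The chain stops at the first failed submission, so later jobs are never queued against a dependency that does not exist.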