
Introduction

Jobs on the Parachute computer cluster are scheduled and managed by PBS (Portable Batch System). Its primary task is to distribute computing tasks, i.e. batch jobs, among the available computing resources.

This document describes PBS version 22.05.11.

Running jobs

User applications (hereinafter: jobs) started through the PBS system must be described by a startup shell script (sh, bash, zsh...). PBS parameters are listed at the top of the startup script, before the regular commands. These parameters can also be specified on the command line when submitting a job.

Basic job submission:

qsub my_job.pbs

Job submission with parameters:

qsub -q cpu -l walltime=10:00:00 -l select=1:ncpus=4:mem=10GB my_job.pbs

More information about qsub parameters:

qsub --help

After submitting a job, the standard output and standard error of a running job can be viewed with the commands:

qcat jobID
qcat -e jobID
qtail jobID
qtail -e jobID
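
For example, to follow the standard error stream of a running job with ID 14571 (an illustrative ID):

qtail -e 14571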

Job submission

There are several ways jobs can be submitted:

  • interactive submission
  • using a script
  • in an interactive session
  • as a job array

In the case of interactive submission, calling the qsub command without an input script reads the job script from standard input; type the commands and submit them with CTRL+D:

# qsub
[korisnik@padobran:~] $ qsub
Job script will be read from standard input. Submit with CTRL+D.
echo "Hello world"

# print directory contents
[korisnik@padobran:~] $ ls -l
total 5140716
-rw-------  1 korisnik hpc          0 Jun  1 07:44 STDIN.e14571
-rw-------  1 korisnik hpc         12 Jun  1 07:44 STDIN.o14571

# print the contents of the output file
[korisnik@padobran:~] $ cat STDIN.o14571
Hello world

In the case of script submission, the commands to be executed are specified in the input file that is submitted:

# print the file hello.sh
[korisnik@padobran:~] $ cat hello.sh 
#!/bin/bash

#PBS -N hello
echo "Hello world"

# job script submission
[korisnik@padobran:~] $ qsub hello.sh

# print directory content
[korisnik@padobran:~] $ ls -l
total 5140721
-rw-------  1 korisnik hpc          0 Jun  1 07:44 STDIN.e14571
-rw-------  1 korisnik hpc         12 Jun  1 07:44 STDIN.o14571
-rw-------  1 korisnik hpc          0 Jun  1 08:02 hello.e14572
-rw-------  1 korisnik hpc         12 Jun  1 08:02 hello.o14572
-rw-r--r--  1 korisnik hpc         46 Jun  1 07:55 hello.sh

# print file content
[korisnik@padobran:~] $ cat hello.o14572 
Hello world

In the case of an interactive session, using the qsub -I option without an input script opens a shell on the primary worker node, in which we can run commands:

# hostname on access server
[korisnik@padobran:~] $ hostname
padobran.srce.hr

# starting interactive session
[korisnik@padobran:~] $ qsub -I -N hello-interactive
qsub: waiting for job 106.admin to start
qsub: job 106.admin ready

# hostname on the main worker node
[korisnik@node034:~] $ hostname
node034.padobran
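
An interactive session can also request specific resources, in the same way as a batch job (a sketch; adjust the values to your needs):

# interactive session with 4 cores, 8 GB of memory and a 2-hour limit
qsub -I -q cpu -l select=1:ncpus=4:mem=8GB -l walltime=02:00:00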

In the case of a job array, the qsub -J X-Y[:Z] option submits a given number of identical jobs with indices from X to Y with step Z:

# submission of job array 
[korisnik@padobran:~] $ qsub -J 1-10:2 hello.sh 
107[].admin

# print files content
[korisnik@padobran:~] $ ls -l
total 5140744
-rw-------  1 korisnik hpc          0 Jun  1 07:44 STDIN.e14571
-rw-------  1 korisnik hpc         12 Jun  1 07:44 STDIN.o14571
-rw-------  1 korisnik hpc          0 Jun  1 08:02 hello.e14572
-rw-------  1 korisnik hpc          0 Jun  1 08:21 hello.e14575.1
-rw-------  1 korisnik hpc          0 Jun  1 08:21 hello.e14575.3
-rw-------  1 korisnik hpc          0 Jun  1 08:21 hello.e14575.5
-rw-------  1 korisnik hpc          0 Jun  1 08:21 hello.e14575.7
-rw-------  1 korisnik hpc          0 Jun  1 08:21 hello.e14575.9
-rw-------  1 korisnik hpc         12 Jun  1 08:02 hello.o14572
-rw-------  1 korisnik hpc         12 Jun  1 08:21 hello.o14575.1
-rw-------  1 korisnik hpc         12 Jun  1 08:21 hello.o14575.3
-rw-------  1 korisnik hpc         12 Jun  1 08:21 hello.o14575.5
-rw-------  1 korisnik hpc         12 Jun  1 08:21 hello.o14575.7
-rw-------  1 korisnik hpc         12 Jun  1 08:21 hello.o14575.9
-rw-r--r--  1 korisnik hpc         46 Jun  1 07:55 hello.sh

Job array

This method is preferred over multiple individual submissions (e.g. with a for loop) because it:

  • reduces job queue load - all subjobs compete for resources simultaneously with everything else in the queue, instead of one after the other
  • simplifies management - all subjobs can be modified at once through the main job identifier (e.g. 14575[]) or individually (e.g. 14575[3])

The environment variables that PBS defines during the execution of a job array are:

  • PBS_ARRAY_INDEX - index of the subjob within the job array (e.g. 1, 3, 5, 7 and 9 in the example above)
  • PBS_ARRAY_ID - identifier of the main array job (e.g. 14575[].admin)
  • PBS_JOBID - identifier of the subjob within the job array (e.g. 14575[3].admin)
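
As a sketch, these variables can give each subjob its own input and output; the script name process.sh and the input files are hypothetical:

#!/bin/bash

#PBS -N array-example
#PBS -J 1-10:2

cd $PBS_O_WORKDIR

# each subjob processes the input file matching its array index
# (input_1.txt, input_3.txt, ... are hypothetical names)
./process.sh input_${PBS_ARRAY_INDEX}.txt > output_${PBS_ARRAY_INDEX}.txt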

Job description

Jobs are described using PBS directives, while the job description file itself is a standard shell script. The header of the script lists PBS parameters that describe the job in detail, followed by the commands that run the desired application.

Structure of the startup script:

my_job.pbs
#!/bin/bash

#PBS -<parameter1> <value>
#PBS -<parameter2> <value>

<command>


A concrete example of a startup script:

my_job.pbs
#!/bin/bash

#PBS -P test_example
#PBS -e /home/my_directory
#PBS -q cpu
#PBS -l walltime=00:01:00
#PBS -l select=1:ncpus=10

module load mpi/openmpi-x86_64

mpicc --version


Basic PBS parameters

Option    Option argument              Meaning of the option
-N        name                         Sets the job name
-q        destination                  Specifies the job queue and/or server
-l        resource_list                Specifies the resources required to run the job
-M        user_list                    Sets the list of mail recipients
-m        mail_options                 Sets the email notification type
-o        path/to/desired/directory    Sets the name/path where standard output is saved
-e        path/to/desired/directory    Sets the name/path where standard error is saved
-j        oe                           Joins standard output and standard error into the same file
-W        group_list=project_code      Selects the project under which the job will run
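
A minimal sketch combining several of the options above (the job name, queue and log path are illustrative):

#!/bin/bash

#PBS -N my-analysis
#PBS -q cpu
#PBS -l select=1:ncpus=4:mem=10GB
#PBS -j oe
#PBS -o /home/my_directory/my-analysis.log

# standard output and error both end up in my-analysis.log
cd $PBS_O_WORKDIR
echo "Running on $(hostname)"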


Options for sending mail notifications with the -m option:

a    Mail is sent when the batch system terminates the job
b    Mail is sent when the job starts executing
e    Mail is sent when the job finishes
j    Mail is sent for subjobs; must be combined with one or more of the suboptions a, b or e
Email example
#!/bin/bash

#PBS -q cpu
#PBS -l walltime=00:01:00
#PBS -l select=1:ncpus=2
#PBS -M <name>@srce.hr,<name2>@srce.hr
#PBS -m be

echo $PBS_JOBNAME > out
echo $PBS_O_HOST

Two emails are received:

Job start
PBS Job Id: 110.admin
Job Name:   pbs.pbs
Begun execution
Job end
PBS Job Id: 110.admin
Job Name:   pbs.pbs
Execution terminated
Exit_status=0
resources_used.cpupercent=0
resources_used.cput=00:00:00
resources_used.mem=0kb
resources_used.ncpus=2
resources_used.vmem=0kb
resources_used.walltime=00:00:01

Options for requesting resources with the -l option:

-l select=3:ncpus=2              Requests 3 chunks with 2 cores each (6 cores in total)
-l select=1:ncpus=10:mem=20GB    Requests 1 chunk with 10 cores and 20 GB of working memory
-l ngpus=2                       Requests 2 GPUs
-l walltime=00:10:00             Sets the maximum job execution time

PBS environment variables

Name               Description
PBS_JOBID          Job identifier assigned by PBS when the job is submitted; created after the qsub command is executed
PBS_JOBNAME        Job name given by the user; the default is the name of the submitted script
PBS_NODEFILE       File listing the worker nodes, i.e. processor cores, on which the job executes
PBS_O_WORKDIR      The directory from which the job was submitted, i.e. in which the qsub command was called
OMP_NUM_THREADS    OpenMP variable that PBS exports to the environment, equal to the value of the ncpus option from the PBS script header
NCPUS              Number of requested cores; matches the value of the ncpus option from the PBS script header
TMPDIR             Path to the job's temporary directory
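
A quick way to inspect these variables is a short job that simply prints them (a sketch):

#!/bin/bash

#PBS -q cpu
#PBS -l ncpus=2

echo "Job ID:         $PBS_JOBID"
echo "Job name:       $PBS_JOBNAME"
echo "Submitted from: $PBS_O_WORKDIR"
echo "Cores:          $NCPUS"
echo "Temporary dir:  $TMPDIR"

# list of nodes/cores allocated to the job
cat $PBS_NODEFILE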

Specifying the working directory

In PBS, a job's standard output and error files are saved by default in the directory from which the job was submitted, while the program's own input and output files are read/saved in the $HOME directory, because that is where the job starts executing. PBS has no option for executing a job in the current directory, so the directory must be changed manually.

To switch to the directory from which the job was submitted, add the following after the header:

cd $PBS_O_WORKDIR

For jobs with a high storage load (I/O-intensive jobs), it is not recommended to run from $PBS_O_WORKDIR; run them from the $TMPDIR location instead, which uses fast storage. Read more about using fast storage for temporary results below.

Allocating resources to jobs

PBS makes it possible to define the necessary resources in several ways. The basic unit of resource allocation is the so-called chunk, a piece of a node. Chunks are defined with the select option. The number of processor cores per chunk is defined with ncpus, the number of MPI processes with mpiprocs, and the amount of working memory with mem. It is also possible to define walltime (the maximum job execution time) and place (how chunks are distributed across nodes).

If some of the parameters are not defined, the default value will be used:

Parameter    Default value
select       1
ncpus        1
mpiprocs     1
mem          3500 MB
walltime     48:00:00
place        pack
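
Given these defaults, the following two submissions request the same resources (a sketch):

qsub my_job.pbs
qsub -l select=1:ncpus=1:mpiprocs=1:mem=3500MB -l walltime=48:00:00 -l place=pack my_job.pbs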

Memory control using cgroups

In addition to controlling processor usage, cgroups are also configured to control memory consumption. This means that a user's jobs are limited to the requested amount of memory. If a job tries to use more memory than requested in the job description, the system terminates the job and writes the following to the error output file:

Message to user when cgroups kill job due to lack of memory
-bash: line 1: PID Killed                  /var/spool/pbs/mom_priv/jobs/JOB_ID.SC
Cgroup mem limit exceeded: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=_JOB_ID,mems_allowed=0,oom_memcg=/pbs_jobs.service/jobid/JOB_ID,task_memcg=/pbs_jobs.service/jobid/JOB_ID,task=JOB_ID,pid=PID,uid=UID

This message differs slightly for each job, because it contains information such as the UID (the user's numeric identifier), the PID (the numeric identifier of the killed process) and the JOB_ID (the job identifier assigned by PBS).
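
If this happens, the usual remedy is to resubmit the job with a larger memory request in its header, for example:

#PBS -l select=1:ncpus=4:mem=32GB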

Allocation per requested chunk

Examples:

The user requests two chunks, each consisting of 10 processor cores and 10 GB of RAM, without specifying the number of nodes, so the system optimizes the placement. In this case, the user gets 20 processor cores and 20 GB of working memory in total.

Example of requesting resources
#PBS -l select=2:ncpus=10:mem=10GB

The user requests 10 chunks, each consisting of one processor core and 1 GB of RAM, placed on a single node, so the user gets a total of 10 processor cores and 10 GB of RAM.

Example of requesting resources
#PBS -l select=10:ncpus=1:mem=1GB
#PBS -l place=pack

In the above examples, jobs are defined by the number of chunks, cores and memory, but the system can also assign resources to jobs when they are not explicitly requested (default resources):

Example of requesting resources
#PBS -l ncpus=4
#PBS -l mem=14GB

In this case, the user gets 4 processor cores and a total of 14 GB of memory in one chunk. When jobs are described without the select option, it is not possible to "chain" resources (separate them with colons); each resource must be given with its own -l option on a new line.

Memory

If you define jobs using ncpus without the select option, it is preferable to also define the amount of memory, because otherwise only the default 3500 MB of working memory will be available.

Saving temporary results

The $TMPDIR directory can be used instead of the $HOME directory to store temporary results generated at runtime. Using $TMPDIR takes advantage of the fast storage (BeeGFS-fast) reserved for temporary files.

PBS creates a temporary directory for each individual job at the path stored in the $TMPDIR variable (/beegfs-fast/scratch/<jobID>).

The temporary directory is automatically deleted when the job is done!

Usage examples

  1. Example of simple use of the $TMPDIR variable:
    #!/bin/bash
    
    #PBS -q cpu
    #PBS -l walltime=00:00:05
    
    cd $TMPDIR
    pwd > test
    cp test $PBS_O_WORKDIR

  2. An example of copying the input data to $TMPDIR, running the application, and copying the results back to the working directory:
    #!/bin/bash
    
    #PBS -q cpu
    #PBS -l walltime=00:00:05
    
    # Creating directories for input data in a temporary directory
    mkdir -p $TMPDIR/data
    
    # Copy all required inputs to a temporary directory
    cp -r $HOME/data/* $TMPDIR/data
    
    # Run the application and redirect the outputs to the "current" (temporary) directory
    cd $TMPDIR
    <application executable command> 1>output.log 2>error.log
    
    # Copy desired output to working directory
    cp -r $TMPDIR/output $PBS_O_WORKDIR

Parallel jobs

OpenMP parallelization

If your application is parallelized exclusively at the level of OpenMP threads and cannot extend beyond one worker node (that is, it works with shared memory), you can submit the job as shown in the xTB application example below.

OpenMP applications require the OMP_NUM_THREADS variable to be defined.

The PBS system takes care of this for you and sets it to the value of the ncpus option defined in the header of the PBS script.

If you define jobs using ncpus without the select option, it is preferable to also define the amount of memory, because otherwise the available working memory will be 3500 MB (select x mem → 1 x 3500 MB).

#!/bin/bash

#PBS -q cpu
#PBS -l walltime=10:00:00
#PBS -l ncpus=8
#PBS -l mem=28GB

cd ${PBS_O_WORKDIR}

xtb C2H4BrCl.xyz --chrg 0 --uhf 0 --opt vtight

MPI parallelization

If your application uses parallelization exclusively at the MPI process level and can extend beyond a single worker node (that is, it works with distributed memory), you can call the job as shown in the Quantum ESPRESSO application example below. To run applications using MPI (or hybrid MPI+OMP) parallelization, the mpi module must be loaded before calling mpiexec or mpirun.

The value of the select option in the header of the PBS script corresponds to the number of MPI processes.

#!/bin/bash

#PBS -q cpu
#PBS -l walltime=10:00:00
#PBS -l select=16

module load mpi/openmpi-x86_64

cd ${PBS_O_WORKDIR}

mpiexec pw.x -i calcite.in
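
By default each chunk carries one MPI process (mpiprocs=1), so select=16 above yields 16 MPI ranks. To place several ranks in one chunk, mpiprocs can be set explicitly (a sketch; 2 chunks x 8 ranks = 16 MPI processes in total):

#PBS -l select=2:ncpus=8:mpiprocs=8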

MPI + OpenMP (hybrid) parallelization

If your application can be parallelized hybridly, i.e. split its MPI processes into OpenMP threads, you can submit the job as shown in the GROMACS application example below:

OpenMP applications require the OMP_NUM_THREADS variable to be defined. The PBS system automatically sets it to the value of the ncpus option defined in the header of the PBS script.

The value of the select option in the header of the PBS script corresponds to the number of MPI processes.

#!/bin/bash

#PBS -q cpu
#PBS -l walltime=10:00:00
#PBS -l select=8:ncpus=4:mem=14GB

module load mpi/openmpi-x86_64

cd ${PBS_O_WORKDIR}

mpiexec -d ${OMP_NUM_THREADS} --cpu-bind depth gmx mdrun -v -deffnm md

Monitoring and managing job execution

Job monitoring

The PBS command qstat is used to display the status of jobs. The command syntax is:

qstat <options> <job_ID>

Executing the qstat command without additional options prints all current jobs of all users:

Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
111.admin         mpi+omp_s        kmrkalj           00:36:09 R cpu             

Some of the more frequently used options are:

-E    Groups jobs by server and displays them sorted by ascending ID; this also improves qstat performance.
-t    Displays status information for jobs, job arrays and subjobs.
-p    Replaces the Time Use column with the percentage of work completed. For a job array this is the percentage of completed subjobs; for a normal job, the percentage of the allocated CPU time used.
-x    Displays status information for finished and moved jobs, in addition to queued and running ones.
-Q    Displays queue status in the standard format.
-q    Displays queue status in an alternative format.
-f    Displays job status in the full (long) format.

Usage examples:

Detailed job description:

qstat -fxw 2648
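
A few further illustrative invocations (the job IDs are examples):

# status of a job array and all of its subjobs
qstat -t 14575[]

# percentage of completed subjobs instead of the Time Use column
qstat -pt 14575[]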

The tracejob command extracts and displays log messages for a PBS job in chronological order.

tracejob <job_ID>

Example:

$ tracejob 111

Job: 111.admin

03/30/2023 11:23:24  L    Considering job to run
03/30/2023 11:23:24  S    Job Queued at request of mhrzenja@node034, owner =
                          mhrzenja@node034, job name = mapping, queue = cpu
03/30/2023 11:23:24  S    Job Run at request of Scheduler@node034 on exec_vnode
                          (node034:ncpus=40:mem=104857600kb)
03/30/2023 11:23:24  L    Job run
03/30/2023 11:23:24  S    enqueuing into cpu, state Q hop 1
03/30/2023 11:23:56  S    Holds u set at request of mhrzenja@node034
03/30/2023 11:24:22  S    Holds u released at request of mhrzenja@node034

Job management

The job can be managed even after it has started.

While the job is in the queue, it is possible to temporarily stop its execution with the command:

qhold <job_ID>

To release it back to the queue:

qrls <job_ID>

A job is stopped completely, i.e. removed from the queue, with the command:

qdel <job_ID>

A forced stop should be used for stuck jobs:

qdel -W force -x <job_ID>
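
A sketch of a typical management sequence (the job ID is illustrative):

qhold 1000     # hold the queued job (its state becomes H)
qrls 1000      # release it back to the queue
qdel 1000      # remove it entirely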

Delayed execution

PBS provides the ability to run jobs in dependence on others, which is useful in cases such as:

  • the execution of a job depends on the output or state of previously executed jobs
  • the application requires its various components to run sequentially
  • output data from one job could compromise the execution of another

The directive that enables this functionality when submitting a job is:

qsub -W depend=<type>:<job_ID>[:<job_ID>] ...

Where <type> can be:

  • after* - starting the current job relative to the specified ones
    • after - run the current job after the specified jobs start executing
    • afterok - run the current job after the specified jobs complete successfully
    • afternotok - run the current job after the specified jobs complete with an error
    • afterany - run the current job after the specified jobs complete in any way
  • before* - starting the specified jobs relative to the current one
    • before - run the specified jobs after the current job starts executing
    • beforeok - run the specified jobs after the current job completes successfully
    • beforenotok - run the specified jobs after the current job completes with an error
    • beforeany - run the specified jobs after the current job completes in any way
  • on:<number> - run a job that will depend on the given number of subsequently submitted before* jobs

A job with a -W depend=... directive will not be submitted if the specified job IDs do not exist (or are not in a queue).

Usage examples:

If we want posao1 to start after the successful completion of posao0:

[korisnik@padobran] $ qsub posao0
1000.admin

[korisnik@padobran] $ qsub -W depend=afterok:1000 posao1
1001.admin

[korisnik@padobran] $ qstat 1000 1001
Job id                 Name             User              Time Use S Queue
---------------------  ---------------- ----------------  -------- - -----
1000.admin             posao0           korisnik           00:00:00 R cpu             
1001.admin             posao1           korisnik                  0 H cpu


If we want posao0 to start after the successful completion of posao1:

[korisnik@padobran] $ qsub -W depend=on:1 posao0
1002.admin
[korisnik@padobran] $ qsub -W depend=beforeok:1002 posao1
1003.admin

[korisnik@padobran] $ qstat 1002 1003
Job id                 Name             User              Time Use S Queue
---------------------  ---------------- ----------------  -------- - -----
1002.admin             posao0           korisnik                  0 H cpu             
1003.admin             posao1           korisnik           00:00:00 R cpu