Intro

Jobs on the Isabella computer cluster are deployed and managed with the job management system (JMS) SGE, or Sun Grid Engine. This document describes the use of SGE ver. 9.


Running jobs

User applications (hereinafter: jobs) that are run through the SGE system must be described by a startup shell script. Within the startup script, SGE parameters are stated alongside the usual commands. The same parameters can also be stated outside the startup script, when the job is submitted.

A job is submitted with the qsub command:

Code Block
languagebash
qsub <SGE_parameters> <name_of_starting_script>
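
Since the same parameters can be given either in the startup script or at submission, a submission with parameters on the command line might look like the sketch below. The job name, log file name and script name are placeholders chosen for illustration. The sketch is a dry run that only prints the command, so it can be tried without access to the cluster.

```shell
#!/bin/bash

# Parameters that could otherwise be written as "#$" lines in the startup script.
# All values are illustrative placeholders.
sge_params="-N my_job -cwd -j y -o my_job.log"
script=my_job.sge

# Dry run: build and print the command instead of submitting the job.
cmd="qsub $sge_params $script"
echo "$cmd"
```

On the cluster, the printed command would be typed (or executed) as is to submit the job.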


Info

SGE also has a graphical user interface (GUI) that gives access to the whole functionality of the system. The GUI is started with the qmon command. Use of the GUI is not described here because there is no instruction manual within it (Help button).

Describing jobs

The SGE system language is used to describe jobs, and the job description file (startup script) is a standard shell script. The header of each script lists the SGE parameters that describe the job in detail, followed by the normal commands to execute the user application.

Startup script structure:

Code Block
languagebash
titlemy_job.sge
#!/bin/bash

#$ -<parameter1> <value1>
#$ -<parameter2> <value2>

<command1>
<command2>


The job described by this startup script is submitted with the command:

Code Block
qsub my_job.sge


The qsub command returns a job ID that is used to monitor the job's status later:

Code Block
Your job <JobID> ("my_job") has been submitted
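
To use the job ID later in a script, it can be cut out of the message shown above. The following is a minimal sketch that assumes the message has exactly the form shown; extract_job_id is a hypothetical helper name, and the sample message stands in for real qsub output so that the sketch runs without SGE.

```shell
#!/bin/bash

# Print the numeric job ID contained in a qsub submission message.
# Assumed message form: Your job <JobID> ("<name>") has been submitted
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample message; on the cluster it would come from: msg=$(qsub my_job.sge)
msg='Your job 12345 ("my_job") has been submitted'

job_id=$(extract_job_id "$msg")
echo "Submitted job $job_id"
```

The extracted value can then be passed to commands that take a job ID, for example qstat -j "$job_id".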


Basic SGE parameters

Code Block
-N <job_name> : the job name that will be displayed when retrieving job information
-P <project_name> : the name of the project to which the job belongs
-cwd : defines that the directory from which the job is submitted (the current working directory) is the working directory of the job

  • Info

    The default working directory of the job is $HOME.


Code Block
-o <file_name> : the name of the file where the standard output is saved
-e <file_name> : the name of the file where the standard error is saved
-j y|n : allows merging standard output and standard error into the same file (default value is n)

  • Info

    If standard output and error are not explicitly specified, they will be saved to the files:

    1. If job name is not defined:
      <working_directory>/<script_name>.o<job_id>
      <working_directory>/<script_name>.e<job_id>
    2. otherwise:
      <working_directory>/<job_name>.o<job_id>
      <working_directory>/<job_name>.e<job_id>


    The -o and -e parameters can point to a directory:

    #$ -o outputDir/
    #$ -e outputDir/

    In this case, SGE will create the standard output and standard error files in the outputDir directory, named <job_name>.o<JobID> and <job_name>.e<JobID>.

    Important: outputDir must be created manually before submitting the job.
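
Creating the output directories can be automated before submission. The sketch below reads the "#$ -o" and "#$ -e" lines of a startup script and creates every directory they name. make_output_dirs is a hypothetical helper, and the sketch assumes that a directory value ends with "/" and is given relative to the directory from which the job is submitted.

```shell
#!/bin/bash

# Create every directory named by a "#$ -o <dir>/" or "#$ -e <dir>/" line
# of the startup script given as the first argument.
make_output_dirs() {
    awk '$1 == "#$" && ($2 == "-o" || $2 == "-e") && $3 ~ /\/$/ { print $3 }' "$1" |
    while read -r dir; do
        mkdir -p "$dir"
    done
}

# Demonstration on a throwaway copy of a startup script.
demo=$(mktemp -d)
cat > "$demo/my_job.sge" <<'EOF'
#!/bin/bash
#$ -o outputDir/
#$ -e outputDir/
date
EOF

(cd "$demo" && make_output_dirs my_job.sge && ls -d outputDir)
```

On the cluster, the call to make_output_dirs would be followed by qsub my_job.sge in the same directory.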


Code Block
languagetext
-M <emailAddress>[,<emailAddress>]…	: list of email addresses to which job notifications are sent

-m [a][b][e][n]	: defines in which case mail notifications are sent: 
					b - start of job execution, 
					a - job execution error, 
					e - completion of job, 
					n - do not send notifications (default option)

-now y|n : the value of y defines that the job must be performed immediately. For interactive jobs, this is the default value.
		   If SGE cannot find free resources, the job is not queued but ends immediately.

-r y|n : whether the job should be restarted in case of a runtime error (default value is n)

-R y|n : the value of y defines that SGE will reserve nodes when deploying (important for multiprocessor jobs)

-l <resource>=<value>[,<resource>=<value>...] : defines the resources that the job requires. See Resources for details.

-pe <parallel_environment> <range> : parameter is used for parallel jobs.
     The first parameter defines the module that runs the requested form of parallel job.
     The second parameter defines the specific number of processors, or a range in the form <N>,[<Ni>,...]<S>-<E>,[<Si>-<Ei>,], that the parallel job requires. For more details see Parallel jobs.
    
-q <queue_name>[,<queue_name>...] : the job queue in which the job is executed. This option can also be used to request a specific node by requesting its local job queue (e.g. a12.q@sl250s-gen8-08-01.isabella).

-t <start>-<end>:<step> : the parameter defines that the job is a job array (a series of tasks numbered from <start> to <end> in increments of <step>). For details, see Job queue.

-v <variable>[=<value>][,<variable>[=<value>]...] : the option defines that SGE sets the given environment variables when executing the job. This parameter is useful when the application uses special environment variables, because SGE does not set them by default when starting the job.

-V : SGE passes all current environment variables to the job. 

  • Info

    Note: spaces are not allowed when listing parameter values (e.g. for -l or -q).



  • Info

    A detailed list of the parameters, with further information about them, can be obtained with the command man qsub.
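
Several of the parameters above can be combined in one startup script header. The following is a sketch only: the job name, log file name, email address and the DEMO_MODE variable are made up for illustration. Because bash treats "#$" lines as ordinary comments, the script can also be run locally for a quick check, in which case DEMO_MODE is not set.

```shell
#!/bin/bash

#$ -N param_demo
#$ -cwd
#$ -j y
#$ -o param_demo.log
#$ -m be
#$ -M user@example.com
#$ -v DEMO_MODE=test

# Under SGE the variable is set by the -v line above;
# in a local run the fallback text "unset" is used instead.
summary="mode=${DEMO_MODE:-unset}"
echo "$summary"
```

Here -j y merges standard error into param_demo.log, and -m be requests mail at the start and at the completion of the job.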


SGE environment variables

Within the startup script it is possible to use SGE variables. Some of them are:

Code Block
$TMPDIR : the name of the directory where temporary files can be saved (/scratch)
$JOB_ID : SGE job identifier
$SGE_TASK_ID : identifier of the task within a job array (see the -t parameter)
$SGE_O_HOST : address of the computer from which the job was started
$SGE_O_PATH : the original value of the PATH environment variable when starting the job
$SGE_O_WORKDIR : the directory from which the job was started
$SGE_STDOUT_PATH : file where standard output is saved
$SGE_STDERR_PATH : file where standard error is saved
$HOSTNAME : the address of the computer on which the script is executed
$JOB_NAME : job name
$PE_HOSTFILE : the name of the file in which the addresses of the computers allocated to a parallel job are listed
$QUEUE : the name of the queue in which the job is executed 
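
A common use of these variables is to work in the temporary directory and copy the result back to the directory from which the job was started. The sketch below illustrates this under stated assumptions: the job name and the file names are placeholders, and every variable has a fallback value so that the script also runs outside SGE.

```shell
#!/bin/bash

#$ -N env_demo
#$ -cwd

# Fallback values let the script run outside SGE for a quick local check.
job_id=${JOB_ID:-local}
job_name=${JOB_NAME:-env_demo}
scratch=${TMPDIR:-/tmp}
submit_dir=${SGE_O_WORKDIR:-$PWD}
host=${HOSTNAME:-$(uname -n)}

# Work in a private subdirectory of the temporary directory.
work_dir=$(mktemp -d "$scratch/${job_name}.${job_id}.XXXXXX") || exit 1
echo "job $job_name ($job_id) in queue ${QUEUE:-none} on $host" > "$work_dir/info.txt"

# Copy the result back to the submission directory and clean up.
result="$submit_dir/${job_name}.${job_id}.info"
cp "$work_dir/info.txt" "$result"
rm -rf "$work_dir"
echo "result written to $result"
```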


Types of jobs

Serial jobs

The simplest SGE jobs are serial jobs, which require only one processor to run. They usually need no special parameters; it is enough to specify the name of the program.

Examples of use:

  1.  An example script without additional parameters:

    Code Block
    languagebash
    #!/bin/bash 
    
    date


  2.  Example of a simple script with parameters:

    Code Block
    languagebash
    #!/bin/bash 
    
    #$ -N Date_SGE_script 
    #$ -o Date_SGE.out 
    #$ -e Date_SGE.err 
    
    date


  3. Example of running a program from the current directory:

    Code Block
    languagebash
    titlemoj_program.sge
    #!/bin/bash
    
    #$ -N myprog
    #$ -P local
    #$ -o myprog.out
    #$ -e myprog.err
    #$ -cwd
    
    ./myprog


Parallel jobs

To start parallel jobs, it is necessary to specify the desired parallel environment and the number of processor cores required to perform the job.

The syntax is:

Code Block
languagetext
#$ -pe <type_of_parallel_job> <N>,[<Ni>,...]<S>-<E>,[<Si>-<Ei>,]

Examples of use:

  1.  The job requires 14 processor cores to run:

    Code Block
    #$ -pe *mpi 14


  2.  The number of allocated processor cores can be between 2 and 4:

    Code Block
    #$ -pe *mpi 2-4


  3. The number of allocated processor cores can be 5 or 10:

    Code Block
    #$ -pe *mpi 5,10


  4. The number of allocated processor cores can be 1 or between 2 and 4:

    Code Block
    #$ -pe *mpi 1,2-4
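
When a range is requested, the number of cores actually granted is only known at run time. The sketch below reads it from the file named by $PE_HOSTFILE. It assumes that each line of that file has the form "<host> <slots> <queue> <processor range>"; count_slots is a hypothetical helper, and the node names in the sample file are invented so that the script can be tried outside SGE.

```shell
#!/bin/bash

#$ -N pe_demo
#$ -cwd
#$ -pe *mpi 2-4

# Sum the slot counts (second column) of a $PE_HOSTFILE-style file.
count_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Outside SGE there is no $PE_HOSTFILE, so a small sample file is built.
if [ -z "$PE_HOSTFILE" ]; then
    PE_HOSTFILE=$(mktemp)
    printf '%s\n' 'node01 2 a12.q@node01 UNDEFINED' \
                  'node02 1 a12.q@node02 UNDEFINED' > "$PE_HOSTFILE"
fi

slots=$(count_slots "$PE_HOSTFILE")
echo "granted slots: $slots"

# List how many slots were granted on each host.
awk '{ print $1 ": " $2 }' "$PE_HOSTFILE"
```

SGE normally provides the same total in the $NSLOTS variable. The resulting number would typically be passed to the launcher of the parallel program, whose exact command depends on the MPI installation and is not shown here.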


Info

More information about the parallel environments available on Isabella can be found on the page Redovi poslova i paralelne okoline (Job queues and parallel environments).