The Padobran computer cluster has two job queues, which differ in the permitted job execution time. Each queue has a minimum and a maximum execution time. Execution time is the value of the "walltime" parameter and does not depend on the number of requested cores.

A job is terminated when its requested or assigned time expires.


Queue    Number of nodes   Number of cores per node   Working memory per node (GB)   Job execution time (h)   TMPDIR
cpu      25                128                        475                            00:00:00 - 168:00:00     /beegfs-fast/scratch
cpu_30   25                128                        475                            168:00:01 - 720:00:00    /beegfs-fast/scratch


Using queues

The system is configured with a routing queue, RouteQ, that sits above all job queues. It places jobs submitted without a defined queue into the appropriate job queue, based on the requested execution time. Jobs that do not request an execution time in the job description are automatically assigned 48 hours and sent to the cpu queue.
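The routing rule can be illustrated with a short Python sketch. It is not the scheduler's configuration, only a model of the behaviour described above: the names `parse_walltime` and `route_job` are hypothetical, the limits are copied from the table of queues, and the handling of a walltime above 720 hours is an assumption, since this page does not describe that case.

```python
from typing import Optional, Tuple

# Limits in seconds, copied from the queue table; the names are illustrative.
CPU_MAX = 168 * 3600          # upper walltime limit of the cpu queue
CPU_30_MAX = 720 * 3600       # upper walltime limit of the cpu_30 queue
DEFAULT_WALLTIME = 48 * 3600  # assigned when a job requests no walltime


def parse_walltime(text: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def route_job(walltime: Optional[str] = None) -> Tuple[str, int]:
    """Return (queue, walltime in seconds) for a job submitted without a queue."""
    seconds = DEFAULT_WALLTIME if walltime is None else parse_walltime(walltime)
    if seconds <= CPU_MAX:
        return "cpu", seconds
    if seconds <= CPU_30_MAX:
        return "cpu_30", seconds
    # Assumption: the page does not say what happens above 720 hours.
    raise ValueError("walltime exceeds the limits of every queue")


print(route_job())             # ('cpu', 172800)
print(route_job("200:00:00"))  # ('cpu_30', 720000)
```

The number of requested cores does not appear in the sketch, because only the walltime decides which queue a job lands in.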

Users are advised to specify the execution time, both to increase the efficiency of cluster use and to spend as little time as possible waiting in the queue.

If a user submits a job that requests a queue but does not define a runtime, the job will be allocated the maximum time in that queue.

If a job requests a queue together with an execution time outside that queue's limits, the following message is displayed on the screen:

Message when the runtime is incorrectly defined
qsub: Job violates queue and/or server resource limits
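The two rules for jobs that name a queue explicitly can be modelled in the same way. This is a minimal sketch under stated assumptions: `QUEUE_LIMITS` and `effective_walltime` are hypothetical names, the limits come from the table of queues, and the Python exception merely stands in for the qsub rejection message.

```python
from typing import Optional

# (minimum, maximum) walltime in seconds per queue, copied from the queue table.
QUEUE_LIMITS = {
    "cpu": (0, 168 * 3600),
    "cpu_30": (168 * 3600 + 1, 720 * 3600),
}


def effective_walltime(queue: str, walltime_s: Optional[int] = None) -> int:
    """Return the walltime a job would get in the named queue, in seconds."""
    low, high = QUEUE_LIMITS[queue]
    if walltime_s is None:
        # No walltime requested: the job gets the maximum time of the queue.
        return high
    if not low <= walltime_s <= high:
        # Stand-in for the rejection printed by qsub.
        raise ValueError("qsub: Job violates queue and/or server resource limits")
    return walltime_s


print(effective_walltime("cpu"))              # 604800, i.e. 168 hours
print(effective_walltime("cpu_30", 200 * 3600))  # 720000, i.e. 200 hours
```

Under this model, a request of 200 hours in the cpu queue, or of 100 hours in the cpu_30 queue, is rejected rather than moved to the other queue.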


Examples of running jobs with defined walltime


An example of executing jobs without defining the execution time, but with a defined queue



Backfilling

To further encourage users to define the execution time, the Backfilling option is enabled on the system. Jobs with a shorter execution time are allowed to skip ahead of jobs with a longer execution time, provided that running the shorter job does not delay the start of the longer one.


Example

Job 101 requires 128 processor cores and 10 hours of computing time. Based on the project of the job's owner, its priority is 1000, which puts it first on the execution list. There are currently 60 cores available on the cluster, and the remaining 68 cores needed by job 101 will be released in 5 hours. In other words, job 101 will start in 5 hours. A user from another project with a lower priority submits job 105. Job 105 requires 40 cores and 4 hours of computing time, and its priority is 750. Because of the backfilling option, job 105 starts executing. The system knows that job 101 cannot start before 5 hours have passed, so it allows the free cores to be used for any job that needs less time than that.

Had job 105 requested 5 or more hours, it would not start, even though the resources it requests are free.
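The decision in this example reduces to two comparisons, sketched below in Python. This is a deliberately simplified model, not the scheduler's algorithm: the function `can_backfill` and its parameters are hypothetical, and it assumes a single waiting top-priority job whose start time is already known.

```python
def can_backfill(free_cores: int, job_cores: int,
                 job_hours: float, hours_until_top_job: float) -> bool:
    """Return True if a lower-priority job may start on the free cores now.

    The job must fit into the currently free cores and must finish
    before the top-priority job is due to start.
    """
    fits_now = job_cores <= free_cores
    ends_in_time = job_hours < hours_until_top_job
    return fits_now and ends_in_time


# Job 105 from the example: 40 cores for 4 hours, 60 cores free,
# job 101 starts in 5 hours.
print(can_backfill(60, 40, 4, 5))  # True
# The same job asking for 5 hours would delay job 101.
print(can_backfill(60, 40, 5, 5))  # False
```

The comparison on time is strict, matching the statement that a request of 5 or more hours would not be started.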

To avoid overloading the system, at most 5 jobs may be skipped.



