Running Jobs

User storage

Users have a quota of 25 GB and 200,000 files in their /home directory; they are NOT allowed to use more storage.

Users MUST run jobs only in /projects/$project, where $project is the name of the respective project.

Software in Deucalion

The software is available as loadable modules. Modules make different versions and combinations of the installed software available. Users can request additional modules.

Info

Browsing and selecting available software is only possible through the command line, which you get by default with SSH access; please check Connecting with SSH. If you are using HTTP access, you need to start a Shell Access session to run the commands below.

List of available software modules:

module avail

Search for the desired software module:

module spider OpenMPI

Where OpenMPI is the name of software to search for.

Loading the default version of a software module:

module load OpenMPI

Loading a specific version of a software module:

module load OpenMPI/4.1.5-GCC-12.3.0

Display information of the selected software module:

module whatis OpenMPI/4.1.5-GCC-12.3.0

To change the software version of the module:

module switch OpenMPI OpenMPI/<new-version>

List of loaded or currently active software modules:

module list

Remove all loaded software modules:

module purge

Unload and then reload all currently loaded software modules:

module reload

Remove selected modules:

module unload OpenMPI

Help for the selected software:

module help OpenMPI/4.1.2.1

Additional commands can be found in the help section:

module help
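
As a sketch of how these commands fit together, a typical session might look like the following; the module name and version are illustrative and should be replaced by the output of module spider on Deucalion:

module purge
module spider OpenMPI
module load OpenMPI/4.1.5-GCC-12.3.0
module list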

System Environments ARM

Basic Information

Compiler Name FUJITSU Software Compiler Package
Compiler Version V1.0L21 (cp-1.0.21.02a)
MPI Interconnect InfiniBand (HDR100)
Job Scheduler Name Slurm
Job Scheduler Version 23.11.4

System Environments x86

Basic Information

Compilers CPU GCC 12.3.0, Intel oneAPI HPC Toolkit 2023.1.0
Compilers GPU CUDA 11.8: GCC 11.3.0, NVIDIA HPC SDK 22.9
MPI Interconnect CPU InfiniBand (HDR100)
MPI Interconnect GPU InfiniBand (HDR200)
Job Scheduler Name Slurm
Job Scheduler Version 23.11.4

Partitions on Deucalion

Partition Architecture Max Nodes Max Jobs Time Limit
dev-arm aarch64 16 1 4 hours
normal-arm aarch64 128 4 48 hours
large-arm aarch64 512 1 72 hours
dev-x86 x86_64 8 1 4 hours
normal-x86 x86_64 64 4 48 hours
large-x86 x86_64 128 1 72 hours
dev-a100-40 x86_64 1 1 4 hours
normal-a100-40 x86_64 4 1 48 hours
dev-a100-80 x86_64 1 1 4 hours
normal-a100-80 x86_64 4 1 48 hours

Slurm Basics

An HPC cluster is made up of a number of compute nodes, each consisting of one or more processors, memory and, in the case of the GPU nodes, GPUs. These computing resources are allocated to users by the resource manager through the submission of jobs. A job describes the computing resources required to run one or more applications and how to run them. Deucalion uses Slurm as its job scheduler and resource manager.

In the following, you will learn how to submit your job using the Slurm Workload Manager. If you're familiar with Slurm, you probably won't learn much. However, if you aren't acquainted with Slurm, the following will introduce you to the basics. If you would like to play around with Slurm in a sandboxed environment before submitting real jobs, we highly recommend that you try the interactive Slurm Learning tutorial.

The main commands for using Slurm are summarized in the table below.

Command Description
sbatch Submit a batch script
squeue View information about jobs in the scheduling queue
scancel Signal or cancel jobs, job arrays or job steps
sinfo View information about nodes and partitions

Creating a batch script

The most common type of job is the batch job, which is submitted to the scheduler using a batch job script and the sbatch command.

A batch job script is a text file containing information about the job to be run: the amount of computing resources required and the tasks that must be executed.

A batch script is summarized by the following steps:

  • the interpreter to use for the execution of the script: bash, python, ...
  • directives that define the job options: resources, run time, ...
  • setting up the environment: prepare input, environment variables, ...
  • run the application(s)

As an example, let's look at this simple batch job script:

#!/bin/bash
#SBATCH --job-name=exampleJob
#SBATCH --account=exampleAccount
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

source /share/env/module_select.sh #configure the correct path for lmod
module load MyApp/1.2.3

myapp -i input -o output

In the previous example, the first line #!/bin/bash specifies that the script should be interpreted as a bash script.

The lines starting with #SBATCH are directives for the workload manager. These have the general syntax

#SBATCH --option_name=argument

Now that we have introduced this syntax, we can go through the directives one by one. The first directive is

#SBATCH --job-name=exampleJob

which sets the name of the job. It can be used to identify a job in the queue and other listings.

It is then necessary to select the account:

#SBATCH --account=exampleAccount

which sets the account to be billed. To check your available accounts, you can run the command

sacctmgr show Association where User=<username> format=Cluster,Account%30,User

where <username> is your Deucalion username.

The remaining lines specify the resources needed for the job. The first one is the maximum time your job can run. If your job exceeds the time limit, it is terminated regardless of whether it has finished or not.

#SBATCH --time=02:00:00

The time format is hh:mm:ss (or d-hh:mm:ss where d is the number of days). Therefore, in our example, the time limit is 2 hours.
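
For example, a 36-hour limit could be written in either format (illustrative values):

#SBATCH --time=36:00:00
#SBATCH --time=1-12:00:00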

The next four lines of the script describe the computing resources that the job will need to run

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

In this instance we request one task (process) to be run on one node. A task corresponds to a process (or an MPI rank). One CPU core (used, for example, for an OpenMP thread) is requested for that task, and 2 GiB of memory is allocated to the whole job.
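
As a sketch, a hybrid MPI+OpenMP job spanning several cores per task could request its resources along these lines; the node, task and core counts are illustrative and not tuned to Deucalion's node sizes:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G

This would start 8 tasks (MPI ranks) in total, each with 8 cores available for OpenMP threads, and allocate 16 GiB of memory per node.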

Now that the needed resources for the job have been defined, the next step is to set up the environment. For example, copy input data from your home directory to the scratch file system or export environment variables.

source /share/env/module_select.sh #configure the correct path for lmod
module load MyApp/1.2.3

In our example, the first step is to source the module_select.sh script, which configures the correct path for the module command (lmod) according to the partition; this is mandatory on the ARM partitions (the default configuration is already correct for x86 and a100). Second, we load a module so that the MyApp application is available to the batch job. Finally, with everything set up, we can launch our program.

myapp -i input -o output

Submit a batch job

To submit the job script we just created we use the sbatch command. The general syntax can be condensed as

$ sbatch [options] job_script [job_script_arguments ...]

The available options are the same as the ones you use in the batch script: sbatch --nodes=2 on the command line and #SBATCH --nodes=2 in a batch script are equivalent. The command-line value takes precedence if the same option is present both on the command line and as a directive in the script.
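
For example, the following invocation would override the node count and time limit set inside the batch script:

$ sbatch --nodes=2 --time=04:00:00 myjob.sh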

For the moment, let's limit ourselves to the most common way to use sbatch: passing the name of the batch script which contains the submission options.

$ sbatch myjob.sh
Submitted batch job 123456

The sbatch command returns immediately and if the job is successfully submitted, the command prints out the ID number of the job.

More details may be found in the dedicated Batch jobs section below.

Examine the queue

Once you have submitted your batch script it won't necessarily run immediately. It may wait in the queue of pending jobs for some time before its required resources become available. In order to view your jobs in the queue, use the squeue command.

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 123456 small-arm exampleJ john.doe  PD       0:00      1 (Priority)

The output shows the state of your job in the ST column. In our case, the job is pending (PD). The last column indicates the reason why the job isn't running: Priority, meaning your job is queued behind a higher-priority job. Another possible reason is that your job is waiting for resources to become available; in that case, the value in the REASON column will be Resources.

Let's look at the information that will be shown if your job is running:

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 123456 small-arm exampleJ john.doe  R      35:00      1 node-0123

The ST column will now display an R value (for RUNNING). The TIME column shows how long your job has been running. The list of nodes on which your job is executing is given in the last column of the output.

In practice, the list of jobs printed by this command will be much longer, since all jobs, including those belonging to other users, are visible. To see only the jobs that belong to you, use the squeue command with the --me flag.

$ squeue --me
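
You can also filter by job state with the standard --states option, for example to list only your running or pending jobs:

$ squeue --me --states=RUNNING
$ squeue --me --states=PENDING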

The squeue command can also be used to determine when your pending job will start.

$ squeue --me --start
 JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
123456     batch Computat   vananh PD 2021-06-01T16:10:28      1 node0012             (Priority)
123457     batch Computat   vananh PD 2021-06-01T18:21:28      1 (null)               (Priority)

In our example, both jobs listed will start on June 1, at different times. You will also notice that for the first job the scheduler plans to run the job on node0012, while for the second job no node has been chosen yet.

Cancelling a job

Sometimes things just don't go as planned. If your job doesn't run as expected, you may need to cancel your job. This can be achieved using the scancel command which takes the job ID of the job to cancel.

$ scancel <jobid>

The job ID can be obtained from the output of the sbatch command when submitting your job or by using squeue. The scancel command applies to either a pending job waiting in the queue or to an already running job. In the first case, the job will simply be removed from the queue while in the latter, the execution will be stopped.
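
A few common variants (standard scancel options): cancel a job by the name set with --job-name, or cancel all of your own jobs at once. Use the latter with care, as it removes both pending and running jobs.

$ scancel --name=exampleJob
$ scancel -u $USER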

Batch jobs

This page covers advanced topics related to running Slurm batch jobs on the cluster. If you are not already familiar with Slurm, you should read the Slurm Basics section above. You can also refer to the Slurm documentation or manual pages, in particular the page about sbatch.

Common Slurm options

Here is an overview of some of the most commonly used Slurm options.

Basic job specification

Option Description
--time Set a limit on the total run time of the job allocation
--account Charge resources used by this job to specified project
--partition Request a specific partition for the resource allocation
--job-name Specify a name for the job allocation

Specify tasks distribution

Option Description
--nodes Number of nodes to be allocated to the job
--ntasks Set the maximum number of tasks (MPI ranks)
--ntasks-per-node Set the number of tasks per node
--ntasks-per-socket Set the number of tasks on each socket
--ntasks-per-core Set the maximum number of tasks on each core

Request CPU cores

Option Description
--cpus-per-task Set the number of cores per task

Request memory

Option Description
--mem Set the memory per node
--mem-per-cpu Set the memory per allocated CPU core
--mem-per-gpu Set the memory per allocated GPU

Email notifications

Option Description
--mail-type Set when you want to receive emails (list of options)
--mail-user Email to receive notifications
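
As an illustrative sketch, the directives below combine a per-core memory request with e-mail notifications in a job script header; the e-mail address is a placeholder:

#SBATCH --mem-per-cpu=4G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=john.doe@example.com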

Pipelining with dependencies

Job dependencies allow you to defer the start of a job until the specified dependencies have been satisfied. Dependencies can be defined in a batch script with the --dependency directive or be passed as a command-line argument to sbatch.

$ sbatch --dependency=<type:job_id[:job_id]>

The type defines the condition that the job with ID job_id must fulfil before the job which depends on it can start. For example,

$ sbatch job1.sh
Submitted batch job 123456

$ sbatch --dependency=afterany:123456 job2.sh
Submitted batch job 123458

will only start execution of job2.sh when job1.sh has finished. The available types and their description are presented in the table below.

Dependency type Description
after:jobid[:jobid...] Begin after the specified jobs have started
afterany:jobid[:jobid...] Begin after the specified jobs have finished
afternotok:jobid[:jobid...] Begin after the specified jobs have failed
afterok:jobid[:jobid...] Begin after the specified jobs have run to completion

The example below demonstrates a bash script for submission of multiple Slurm batch jobs with dependencies. It also shows an example of a helper function that extracts the job ID from the output of the sbatch command.

#!/bin/bash

# Submit a job with sbatch and print only its numeric job ID.
# Aborts the whole script if submission fails.
submit_job() {
  sub="$(sbatch "$@")"

  if [[ "$sub" =~ Submitted\ batch\ job\ ([0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "Job submission failed: $sub" >&2
    exit 1
  fi
}

# first job - no dependencies
id1=$(submit_job job1.sh)

# Two jobs that depend on the first job
id2=$(submit_job --dependency=afterany:$id1 job2.sh)
id3=$(submit_job --dependency=afterany:$id1 job3.sh)

# One job that depends on both the second and the third jobs
id4=$(submit_job  --dependency=afterany:$id2:$id3 job4.sh)

Interactive Slurm jobs

Interactive jobs allow a user to interact with applications on the compute nodes. With an interactive job, you request time and resources to work on a compute node directly, which is different to a batch job where you submit your job to a queue for later execution.

You can use two commands to create an interactive session: srun and salloc. Both of these commands take options similar to sbatch.

Using salloc

Using salloc, you allocate resources and spawn a shell that is then used to execute parallel tasks launched with srun. For example, you can allocate 2 nodes for 30 minutes with the command

$ salloc --nodes=2 --time=00:30:00
salloc: Granted job allocation 123456
salloc: Waiting for resource configuration

Once the allocation is made, this command will start a shell on the login node. You can start parallel execution on the allocated nodes with srun.

$ srun --ntasks=32 --cpus-per-task=8 ./mpi_openmp_application

After the execution of your application has ended, the allocation can be terminated by exiting the shell (exit).

When using salloc, a shell is spawned on the login node. If you want to obtain a shell on the first allocated compute node you can use srun --pty.

$ srun --cpu_bind=none --nodes=2 --pty bash -i

If you want to use an application with a GUI, you can use the --x11 flag with srun to enable X11 forwarding.
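
As a sketch, assuming X11 forwarding is enabled in your SSH connection (ssh -X or -Y) and that the Slurm installation supports X11, an interactive shell from which GUI applications can be started could be obtained with:

$ srun --nodes=1 --time=00:30:00 --x11 --pty bash -i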

Using srun

For a simple interactive session, you can use srun with no prior allocation. In this scenario, srun will first create a resource allocation in which to run the job. For example, to allocate 1 node for 30 minutes and spawn a shell:

$ srun --time=00:30:00 --nodes=1 --pty bash
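
If you need an interactive session on a GPU node, a request along the following lines may work. The partition name is taken from the partition table above, but the exact GPU request syntax (here --gres=gpu:1) is an assumption and may need adjusting on Deucalion:

$ srun --partition=dev-a100-40 --gres=gpu:1 --time=00:30:00 --pty bash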

Using srun to check running jobs

Currently, SSH access to the compute nodes is not allowed, but the srun command can be used to check in on a running job in the cluster. In this case, you need to give the job ID and possibly also the name of a specific compute node to srun.

This starts a shell, where you can run any command, on the first allocated node in a specific job:

$ srun --interactive --pty --jobid=<jobid> $SHELL

To check processor and memory usage quickly, you can run top directly:

$ srun --interactive --pty --jobid=<jobid> top

The -w nid00XXXX option can be added to select a specific compute node to view:

$ srun --interactive --pty --jobid=<jobid> -w nid002217 top

Enroot

What is Enroot?

  • A tool to turn traditional container/OS images into unprivileged sandboxes.
  • It uses the same underlying technologies as containers but removes much of the isolation they inherently provide while preserving filesystem separation.
  • Enroot can be thought of as an unprivileged chroot that provides facilities to import well-known container image formats (e.g. Docker).

Key concepts

  • Standalone (no daemon)
  • Fully unprivileged and multi-user capable (no setuid binary, cgroup inheritance, per-user configuration/container store…)
  • Easy to use
  • No isolation (no performance overhead, simplifies HPC deployments)
  • Built-in GPU support with libnvidia-container
  • Facilitate collaboration and development workflows (bundles, in-memory containers...)

Usage

Import and start an Ubuntu image from Docker Hub.

Pull the image from Docker Hub:

enroot import docker://ubuntu

Create the container

enroot create ubuntu.sqsh

Start the container

enroot start ubuntu

If you need to run something as root inside the container, you can use the --root option.

enroot start --root ubuntu

List the existing containers

enroot list -f

Remove a container

enroot remove ubuntu
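
Putting the pieces together, a sketch of a complete workflow might look like the following; the mounted project path is a placeholder, and the --mount option is taken from Enroot's documented start options:

enroot import docker://ubuntu
enroot create ubuntu.sqsh
enroot start --mount /projects/myproject:/data ubuntu
enroot remove ubuntu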

Pyxis

Pyxis is a SPANK plugin for the Slurm Workload Manager. It allows unprivileged cluster users to run containerized tasks through the srun command. Pyxis requires Slurm and Enroot to work.

Benefits

  • Execute the user's task in an unprivileged container.
  • Fast Docker image download.
  • Simple command-line interface.
  • Supports multi-node MPI jobs through PMI2 or PMIx (requires Slurm support).
  • Allows users to install packages inside the container.

Usage

srun

Run a command on a node

srun cat /etc/os-release
…
PLATFORM_ID="platform:el8"

Run the same command, but now inside a container

srun --container-image=/apps/public/containers/ubuntu.sqsh --container-name=ubuntu cat /etc/os-release
…
PRETTY_NAME="Ubuntu 22.04.4 LTS"

Mount a file from the host and run the command on it, from inside the container

srun --container-name=ubuntu --container-mounts=/etc/os-release:/etc/os-release cat /etc/os-release
…
PLATFORM_ID="platform:el8"

To see more options

srun --help  | grep container

sbatch

Execute an sbatch script inside a container image. Below is a real application example with GROMACS (the example is not reproducible as-is, since it comes from another cluster):

#!/bin/bash
#SBATCH -p all-a100 -t 30:00 
#SBATCH --container-mounts /var/spool/slurm,/home/hpcnow/examples/stmv:/host_pwd
#SBATCH --container-workdir=/host_pwd
# SBATCH --container-image nvcr.io\#hpc/gromacs:2021.3
# SBATCH --container-image  /home/hpcnow/examples/hpc+gromacs+2021.3.sqsh
#SBATCH --container-name hpc+gromacs+2021.3
export GMX_ENABLE_DIRECT_GPU_COMM=1
/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123
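
For a smaller, self-contained sketch of Pyxis with sbatch, something along the following lines should be adaptable; the partition name comes from the partition table above and the container image path reuses the one shown in the srun example, both of which may need adjusting:

#!/bin/bash
#SBATCH --job-name=pyxis-example
#SBATCH --partition=normal-x86
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --container-image=/apps/public/containers/ubuntu.sqsh
#SBATCH --container-name=ubuntu

# The command below runs inside the container on the allocated node
cat /etc/os-release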

Singularity

  • It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible.
  • You can build a container using SingularityCE on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall. Your container is a single file, and you don’t have to worry about how to install all the software you need on each different operating system.
  • Installed version: 3.11.3

Usage

sbatch Singularity example:

#!/bin/bash
#SBATCH -p all-a100 -t 00:30:00
# REFERENCE: https://sylabs.io/guides/3.11/user-guide/gpu.html
#https://www.tensorflow.org/install/docker
#2.10.1 release w/ GPU support: cuda 11.2 
#To be downloaded from login nodes (internet access)
#singularity pull tensorflow.sif docker://tensorflow/tensorflow:2.10.1-gpu
#wget https://github.com/tensorflow/benchmarks/archive/039fa9550c9efd25e09db2e73f360b67378436fe.zip
#mv benchmarks-039fa9550c9efd25e09db2e73f360b67378436fe/ benchmarks
CONTAINER_URI='tensorflow.sif'
#--nv sets up the container's environment to use an NVIDIA GPU and the basic CUDA libraries needed to run a CUDA-enabled application.
# instance tensorflow started
singularity instance start --nv $CONTAINER_URI tensorflow
singularity exec instance://tensorflow python \
        $(pwd)/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --num_gpus=4 --model resnet50 --batch_size 32
singularity exec instance://tensorflow python \
        $(pwd)/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --num_gpus=4 --model inception3 --batch_size 32
#shutdown instance
singularity instance stop -t 30 tensorflow
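
Outside of batch jobs, basic interactive usage follows the standard SingularityCE workflow: pull an image on a login node (where internet access is available) and then execute commands inside it. A minimal sketch with an illustrative image and command:

singularity pull python.sif docker://python:3.11
singularity exec python.sif python3 --version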