Running Jobs
User storage
Users have a quota of 25 GB and 200000 files in their /home
directory; they are NOT allowed to use more storage.
Users MUST run jobs only in /projects/$project
, where $project
is the name of the respective project.
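For example, a minimal sketch of preparing a run directory inside a hypothetical project named myproject:
$ cd /projects/myproject
$ mkdir -p my_first_run && cd my_first_run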
Software in Deucalion
The software is available as loadable modules. Modules allow different versions and build combinations of the same software to coexist. Users can request additional modules.
Info
Browsing and selecting available software is only possible through the command line, by default with SSH access; please check Connecting with SSH. If you are using HTTP access, you need to start a Shell Access session to run the commands below.
List of available software modules:
module avail
Search for the desired software module:
module spider OpenMPI
Where OpenMPI is the name of software to search for.
Loading the default version of a software module:
module load OpenMPI
Loading a specific version of a software module:
module load OpenMPI/4.1.5-GCC-12.3.0
Display information of the selected software module:
module whatis OpenMPI/4.1.5-GCC-12.3.0
To change the software version of the module:
module switch OpenMPI OpenMPI/<new-version>
List of loaded or currently active software modules:
module list
Remove all loaded software modules:
module purge
Reload (remove and then load again) all currently loaded software modules:
module reload
Remove selected modules:
module unload OpenMPI
Help for the selected software:
module help OpenMPI/4.1.2.1
Additional commands can be found in the help section:
module help
System Environments ARM
Basic Information
Item | Value |
---|---|
Compiler Name | FUJITSU Software Compiler Package |
Compiler Version | V1.0L21 (cp-1.0.21.02a) |
MPI Interconnect | InfiniBand (HDR100) |
Job Scheduler Name | Slurm |
Job Scheduler Version | 23.11.4 |
System Environments x86
Basic Information
Item | Value |
---|---|
Compilers CPU | GCC 12.3.0, Intel oneAPI HPC Toolkit 2023.1.0 |
Compilers GPU | CUDA 11.8: GCC 11.3.0, NVIDIA HPC SDK 22.9 |
MPI Interconnect CPU | InfiniBand (HDR100) |
MPI Interconnect GPU | InfiniBand (HDR200) |
Job Scheduler Name | Slurm |
Job Scheduler Version | 23.11.4 |
Partitions on Deucalion
Partition | Architecture | Max Nodes | Max Jobs | Time Limit |
---|---|---|---|---|
dev-arm | aarch64 | 16 | 1 | 4 hours |
normal-arm | aarch64 | 128 | 4 | 48 hours |
large-arm | aarch64 | 512 | 1 | 72 hours |
dev-x86 | x86_64 | 8 | 1 | 4 hours |
normal-x86 | x86_64 | 64 | 4 | 48 hours |
large-x86 | x86_64 | 128 | 1 | 72 hours |
dev-a100-40 | x86_64 | 1 | 1 | 4 hours |
normal-a100-40 | x86_64 | 4 | 1 | 48 hours |
dev-a100-80 | x86_64 | 1 | 1 | 4 hours |
normal-a100-80 | x86_64 | 4 | 1 | 48 hours |
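To target one of these partitions, pass its name to the --partition option, either on the sbatch command line or as a directive in the batch script. A minimal sketch, assuming an ARM CPU job:
#SBATCH --partition=normal-arm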
Slurm Basics
An HPC cluster is made up of a number of compute nodes, each consisting of one or more processors, memory and, in the case of the GPU nodes, GPUs. These computing resources are allocated to users by the resource manager through the submission of jobs. A job describes the computing resources required to run the application(s) and how to run them. The HPC cluster uses Slurm as its job scheduler and resource manager.
In the following, you will learn how to submit your job using the Slurm Workload Manager. If you're already familiar with Slurm, you probably won't learn much. However, if you aren't acquainted with Slurm, the following will introduce you to the basics. If you would like to play around with Slurm in a sandboxed environment before submitting real jobs, we highly recommend that you try the interactive Slurm Learning tutorial.
The main commands for using Slurm are summarized in the table below.
Command | Description |
---|---|
sbatch | Submit a batch script |
squeue | View information about jobs in the scheduling queue |
scancel | Signal or cancel jobs, job arrays or job steps |
sinfo | View information about nodes and partitions |
Creating a batch script
The most common type of jobs are batch jobs which are submitted to the
scheduler using a batch job script and the sbatch
command.
A batch job script is a text file containing information about the job to be run: the amount of computing resource and the tasks that must be executed.
A batch script is summarized by the following steps:
- the interpreter to use for the execution of the script: bash, python, ...
- directives that define the job options: resources, run time, ...
- setting up the environment: prepare input, environment variables, ...
- run the application(s)
As an example, let's look at this simple batch job script:
#!/bin/bash
#SBATCH --job-name=exampleJob
#SBATCH --account=exampleAccount
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
source /share/env/module_select.sh #configure the correct path for lmod
module load MyApp/1.2.3
myapp -i input -o output
In the previous example, the first line #!/bin/bash
specifies that the script
should be interpreted as a bash script.
The lines starting with #SBATCH
are directives for the workload manager.
These have the general syntax
#SBATCH option_name=argument
Now that we have introduced this syntax, we can go through the directives one by one. The first directive is
#SBATCH --job-name=exampleJob
which sets the name of the job. It can be used to identify a job in the queue and other listings.
It is then necessary to select the account (project) to which the job is charged:
#SBATCH --account=exampleAccount
You can list the accounts available to you with:
sacctmgr show Association where User=<username> format=Cluster,Account%30,User
where <username>
is your Deucalion username.
The remaining lines specify the resources needed for the job. The first one is the maximum time your job can run. If your job exceeds the time limit, it is terminated regardless of whether it has finished or not.
#SBATCH --time=02:00:00
The time format is hh:mm:ss
(or d-hh:mm:ss
where d
is the number of
days). Therefore, in our example, the time limit is 2 hours.
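For example, a hypothetical limit of one day and twelve hours would be written as:
#SBATCH --time=1-12:00:00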
The next four lines of the script describe the computing resources that the job will need to run:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
In this instance we request one task (process) to be run on one node. A task corresponds to a process (or an MPI rank). One CPU core (used, for example, for an OpenMP thread) is requested for that task, and 2 GiB of memory is allocated to the whole job.
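As an illustrative sketch (hypothetical values, not part of the example above), a hybrid MPI+OpenMP job spanning two full nodes could request its resources like this:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16   # 16 MPI ranks per node
#SBATCH --cpus-per-task=8      # 8 CPU cores (e.g. OpenMP threads) per rank
#SBATCH --mem=128G             # memory per node (hypothetical value)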
Now that the needed resources for the job have been defined, the next step is to set up the environment. For example, copy input data from your home directory to the scratch file system or export environment variables.
source /share/env/module_select.sh #configure the correct path for lmod
module load MyApp/1.2.3
After loading the module, the MyApp
application is available
to the batch job. Finally, with everything set up, we can launch our program.
myapp -i input -o output
Submit a batch job
To submit the job script we just created we use the sbatch
command. The
general syntax can be condensed as
$ sbatch [options] job_script [job_script_arguments ...]
The available options are the same as the ones you use in the batch script:
sbatch --nodes=2
in the command line and #SBATCH --nodes=2
in a batch
script are equivalent. The command line value takes precedence if the same
option is present both on the command line and as a directive in a script.
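For example, the following hypothetical invocation overrides the job name set inside the script:
$ sbatch --job-name=anotherName myjob.sh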
For the moment, let's limit ourselves to the most common way to use
sbatch
: passing the name of the batch script which contains the submission
options.
$ sbatch myjob.sh
Submitted batch job 123456
The sbatch
command returns immediately and if the job is successfully
submitted, the command prints out the ID number of the job.
More details may be found on the dedicated [batch jobs][batch-jobs] page.
Examine the queue
Once you have submitted your batch script it won't necessarily run immediately.
It may wait in the queue of pending jobs for some time before its required
resources become available. In order to view your jobs in the queue, use the
squeue
command.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 small-arm exampleJ john.doe PD 0:00 1 (Priority)
The output shows the state of your job in the ST
column. In our case, the job
is pending (PD
). The last column indicates the reason why the job isn't
running: Priority
. This indicates that your job is queued behind a higher
priority job. One other possible reason can be that your job is waiting for
resources to become available. In such a case, the value in the REASON
column
will be Resources
.
Let's look at the information that will be shown if your job is running:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 small-arm exampleJ john.doe R 35:00 1 node-0123
The ST
column will now display a R
value (for RUNNING
). The TIME
column
will represent the time your job has been running. The list of nodes on which
your job is executing is given in the last column of the output.
In practice the list of jobs printed by this command will be much longer since
all jobs, including those belonging to other users, will be visible. In order
to see only the jobs that belong to you use the squeue
command with the
--me
flag.
$ squeue --me
The squeue
command can also be used to determine when your pending job will
start.
$ squeue --me --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
123456 batch Computat vananh PD 2021-06-01T16:10:28 1 node0012 (Priority)
123457 batch Computat vananh PD 2021-06-01T18:21:28 1 (null) (Priority)
In our example, both jobs listed will start on June 1, at different times. You will
also notice that for the first job, the scheduler plans to run the job on
node0012
, while for the second job, no node has been chosen yet.
Cancelling a job
Sometimes things just don't go as planned. If your job doesn't run as expected,
you may need to cancel your job. This can be achieved using the scancel
command which takes the job ID of the job to cancel.
$ scancel <jobid>
The job ID can be obtained from the output of the sbatch
command when
submitting your job or by using squeue
. The scancel
command applies to
either a pending job waiting in the queue or to an already running job. In the
first case, the job will simply be removed from the queue while in the latter,
the execution will be stopped.
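You can also cancel all of your own jobs at once by selecting them by username instead of by job ID:
$ scancel -u $USER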
Batch jobs
This page covers advanced topics related to running Slurm batch jobs on the cluster. If you are not already familiar with Slurm, you should read the Slurm quickstart guide, which covers the basics. You can also refer to the Slurm documentation or manual pages, in particular the page about sbatch.
Common Slurm options
Here is an overview of some of the most commonly used Slurm options.
Basic job specification
Option | Description |
---|---|
--time | Set a limit on the total run time of the job allocation |
--account | Charge resources used by this job to the specified project |
--partition | Request a specific partition for the resource allocation |
--job-name | Specify a name for the job allocation |
Specify task distribution
Option | Description |
---|---|
--nodes | Number of nodes to be allocated to the job |
--ntasks | Set the maximum number of tasks (MPI ranks) |
--ntasks-per-node | Set the number of tasks per node |
--ntasks-per-socket | Set the number of tasks per socket |
--ntasks-per-core | Set the maximum number of tasks per core |
Request CPU cores
Option | Description |
---|---|
--cpus-per-task | Set the number of cores per task |
Request memory
Option | Description |
---|---|
--mem | Set the memory per node |
--mem-per-cpu | Set the memory per allocated CPU core |
--mem-per-gpu | Set the memory per allocated GPU |
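Note that --mem and --mem-per-cpu express the request in different ways; a minimal sketch with hypothetical values using per-core memory:
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4G   # with one core per task, 8 x 4 GiB in total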
Email notifications
Option | Description |
---|---|
--mail-type | Set when you want to receive emails (list of options) |
--mail-user | Email address to receive notifications |
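For example, to be notified when a job starts, ends or fails (the address below is a placeholder):
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=jane.doe@example.com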
Pipelining with dependencies
Job dependencies allow you to defer the start of a job until the specified
dependencies have been satisfied. Dependencies can be defined in a batch script
with the --dependency
directive or be passed as a command-line argument to
sbatch
.
$ sbatch --dependency=<type:job_id[:job_id]>
The type
defines the condition that the job with ID job_id
must fulfil
before the job which depends on it can start. For example,
$ sbatch job1.sh
Submitted batch job 123456
$ sbatch --dependency=afterany:123456 job2.sh
Submitted batch job 123458
will only start execution of job2.sh
when job1.sh
has finished. The available
types and their description are presented in the table below.
Dependency type | Description |
---|---|
after:jobid[:jobid...] | Begin after the specified jobs have started |
afterany:jobid[:jobid...] | Begin after the specified jobs have finished |
afternotok:jobid[:jobid...] | Begin after the specified jobs have failed |
afterok:jobid[:jobid...] | Begin after the specified jobs have run to completion |
The example below demonstrates a bash script for submission of multiple Slurm
batch jobs with dependencies. It also shows an example of a helper function
that extracts the job ID from the output of the sbatch
command.
#!/bin/bash
# Submit a job with sbatch and print only its numeric job ID
submit_job() {
    sub="$(sbatch "$@")"
    if [[ "$sub" =~ Submitted\ batch\ job\ ([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "Job submission failed: $sub" >&2
        exit 1
    fi
}
# first job - no dependencies
id1=$(submit_job job1.sh)
# Two jobs that depend on the first job
id2=$(submit_job --dependency=afterany:$id1 job2.sh)
id3=$(submit_job --dependency=afterany:$id1 job3.sh)
# One job that depends on both the second and the third jobs
id4=$(submit_job --dependency=afterany:$id2:$id3 job4.sh)
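As an alternative to parsing the sbatch output, the --parsable flag makes sbatch print only the job ID (followed by the cluster name on multi-cluster setups), which simplifies the helper above; a brief sketch:
id1=$(sbatch --parsable job1.sh)
id2=$(sbatch --parsable --dependency=afterany:$id1 job2.sh)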
Interactive Slurm jobs
Interactive jobs allow a user to interact with applications on the compute nodes. With an interactive job, you request time and resources to work on a compute node directly, which is different to a batch job where you submit your job to a queue for later execution.
You can use two commands to create an interactive session: srun
and salloc
.
Both of these commands take options similar to sbatch
.
Using salloc
Using salloc
, you allocate resources and spawn a shell that is then used to
execute parallel tasks launched with srun
. For example, you can allocate 2
nodes for 30 minutes with the command
$ salloc --nodes=2 --time=00:30:00
salloc: Granted job allocation 123456
salloc: Waiting for resource configuration
Once the allocation is made, this command will start a shell on the login
node. You can start parallel execution on the allocated nodes with srun
.
$ srun --ntasks=32 --cpus-per-task=8 ./mpi_openmp_application
After the execution of your application has ended, the allocation can be terminated
by exiting the shell (exit
).
When using salloc
, a shell is spawned on the login node. If you want to
obtain a shell on the first allocated compute node you can use srun --pty
.
$ srun --cpu_bind=none --nodes=2 --pty bash -i
If you want to use an application with a GUI, you can use the --x11
flag with
srun
to enable X11 forwarding.
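A minimal sketch, assuming X11 forwarding is enabled in your SSH session:
$ srun --nodes=1 --time=00:30:00 --x11 --pty bash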
Using srun
For a simple interactive session, you can use srun
with no prior allocation. In
this scenario, srun
will first create a resource allocation in which to run
the job. For example, to allocate 1 node for 30 minutes and spawn a shell:
$ srun --time=00:30:00 --nodes=1 --pty bash
Using srun
to check running jobs
Currently, ssh'ing to compute nodes is not allowed, but the srun
command can
be used to check in on a running job in the cluster. In this case, you need to
give the job ID and possibly also the specific name of a compute node to srun
.
This starts a shell, where you can run any command, on the first allocated node in a specific job:
$ srun --interactive --pty --jobid=<jobid> $SHELL
To check processor and memory usage quickly, you can run top
directly:
$ srun --interactive --pty --jobid=<jobid> top
The -w nid00XXXX
option can be added to select a specific compute node to view:
$ srun --interactive --pty --jobid=<jobid> -w nid002217 top
Enroot
What is Enroot?
- A tool to turn traditional container/OS images into unprivileged sandboxes.
- It uses the same underlying technologies as containers but removes much of the isolation they inherently provide while preserving filesystem separation.
- Enroot can be thought of as an unprivileged chroot that provides facilities to import well-known container image formats (e.g. Docker).
Key concepts
- Standalone (no daemon)
- Fully unprivileged and multi-user capable (no setuid binary, cgroup inheritance, per-user configuration/container store…)
- Easy to use
- No isolation (no performance overhead, simplifies HPC deployments)
- Built-in GPU support with libnvidia-container
- Facilitate collaboration and development workflows (bundles, in-memory containers...)
Usage
Import and start an Ubuntu image from Docker Hub. First, pull the image from Docker Hub:
enroot import docker://ubuntu
Create the container
enroot create ubuntu.sqsh
Start the container
enroot start ubuntu
If you need to run something as root inside the container, you can use the --root option.
enroot start --root ubuntu
List the existing containers
enroot list -f
Remove a container
enroot remove ubuntu
Pyxis
Pyxis is a SPANK plugin for the Slurm Workload Manager. It allows unprivileged cluster users to run containerized tasks through the srun command. Pyxis requires Slurm and Enroot to work.
Benefits
- Execute the user's task in an unprivileged container.
- Fast Docker image download.
- Simple command-line interface.
- Supports multi-node MPI jobs through PMI2 or PMIx (requires Slurm support).
- Allows users to install packages inside the container.
Usage
srun
Run a command on a node
srun cat /etc/os-release
…
PLATFORM_ID="platform:el8"
Run the same command, but now inside of a container
srun --container-image=/apps/public/containers/ubuntu.sqsh --container-name=ubuntu cat /etc/os-release
…
PRETTY_NAME="Ubuntu 22.04.4 LTS"
Mount a file from the host and run the command on it, from inside the container
srun --container-name=ubuntu --container-mounts=/etc/os-release:/etc/os-release cat /etc/os-release
…
PLATFORM_ID="platform:el8"
To see more options
srun --help | grep container
sbatch
Execute the sbatch script inside a container image. A real application example with GROMACS (the example is not reproducible as-is, since it comes from another cluster):
#!/bin/bash
#SBATCH -p all-a100 -t 30:00
#SBATCH --container-mounts /var/spool/slurm,/home/hpcnow/examples/stmv:/host_pwd
#SBATCH --container-workdir=/host_pwd
# SBATCH --container-image nvcr.io\#hpc/gromacs:2021.3
# SBATCH --container-image /home/hpcnow/examples/hpc+gromacs+2021.3.sqsh
#SBATCH --container-name hpc+gromacs+2021.3
export GMX_ENABLE_DIRECT_GPU_COMM=1
/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123
Singularity
- It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible.
- You can build a container using SingularityCE on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall. Your container is a single file, and you don’t have to worry about how to install all the software you need on each different operating system.
- Installed version: 3.11.3
Usage
sbatch Singularity example:
#!/bin/bash
#SBATCH -p all-a100 -t 00:30:00
# REFERENCE: https://sylabs.io/guides/3.11/user-guide/gpu.html
#https://www.tensorflow.org/install/docker
#2.10.1 release w/ GPU support: cuda 11.2
#To be downloaded from login nodes (internet access)
#singularity pull tensorflow.sif docker://tensorflow/tensorflow:2.10.1-gpu
#wget https://github.com/tensorflow/benchmarks/archive/039fa9550c9efd25e09db2e73f360b67378436fe.zip
#mv benchmarks-039fa9550c9efd25e09db2e73f360b67378436fe/ benchmarks
CONTAINER_URI='tensorflow.sif'
#--nv option will setup the container's environment to use an NVIDIA GPU and the basic CUDA libraries to run a CUDA enabled application.
# instance tensorflow started
singularity instance start --nv $CONTAINER_URI tensorflow
singularity exec instance://tensorflow python \
$(pwd)/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--num_gpus=4 --model resnet50 --batch_size 32
singularity exec instance://tensorflow python \
$(pwd)/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--num_gpus=4 --model inception3 --batch_size 32
#shutdown instance
singularity instance stop -t 30 tensorflow