Compilation
You will notice major speed improvements if you compile your code inside your project folder (under /projects/) instead of your home folder, because the two file systems differ significantly in disk read/write speeds.
Deucalion has a heterogeneous architecture, with nodes featuring both x86 and ARM microprocessors.
Both environments are covered below: the ARM environment first, then the x86 environment, including GPU-specific examples.
ARM Environment
Available compilers (Native)
The login nodes use x86 microprocessors. To compile a program for the ARM architecture, the fastest way is to allocate a node from the queue via salloc:
salloc -N1 -p dev-arm -t 4:00:00
The salloc output tells you which node you were given, so you can SSH into it via
ssh cna[xxxx]
If you forget which node you allocated, you can look it up with
squeue --me
The name of the job should be "interact".
The native compilers can only be used on the compute nodes.
The dev-arm partition should only be used for compilation tasks.
Non-MPI
Creator | Language | Compile command |
---|---|---|
Fujitsu | Fortran | frt |
Fujitsu | C | fcc |
Fujitsu | C++ | FCC |
GNU | Fortran | gfortran |
GNU | C | gcc |
GNU | C++ | g++ |
MPI
Creator | Language | Compile command |
---|---|---|
Fujitsu | Fortran | mpifrt |
Fujitsu | C | mpifcc |
Fujitsu | C++ | mpiFCC |
GNU | Fortran | mpifort |
GNU | C | mpicc |
GNU | C++ | mpiCC |
To access all Fujitsu compilers:
ml FJSVstclanga
(run on an ARM compute node)
To access all GNU compilers:
ml OpenMPI
(run on an ARM compute node)
Recommended options for the GNU and Fujitsu compilers are listed in the tables below.
Available compilers (Cross)
To avoid having to compile on the compute nodes, Fujitsu provides cross-compilers (they run on the x86 architecture but produce machine code for ARM processors). Because the cross-compilers are x86 binaries, they can only be used on the login nodes.
Non-MPI
Creator | Language | Compile command |
---|---|---|
Fujitsu | Fortran | frtpx |
Fujitsu | C | fccpx |
Fujitsu | C++ | FCCpx |
MPI
Creator | Language | Compile command |
---|---|---|
Fujitsu | Fortran | mpifrtpx |
Fujitsu | C | mpifccpx |
Fujitsu | C++ | mpiFCCpx |
To load all FUJITSU compilers:
ml FJSVstclanga
(run on a login node)
Compile Information
- Fortran compiler
- Creates object programs from Fortran source programs.
- C/C++ compiler
- Creates object programs from C/C++ source programs.
- The C/C++ compiler has two modes with different user interfaces; the mode to use is selected by an option on the compile command.
- The default mode is Trad Mode.
Mode | Description |
---|---|
Trad Mode (default) | This mode uses an enhanced compiler based on compilers for K computer and PRIMEHPC FX100 or earlier system. This mode is suitable for maintaining compatibility with the past Fujitsu compiler. [C] Supported specifications are C89/C99/C11 and OpenMP 3.1/Part of OpenMP 4.5. [C++] Supported specifications are C++03/C++11/C++14/Part of C++17 and OpenMP 3.1/Part of OpenMP 4.5. |
Clang Mode (-Nclang) | This mode uses an enhanced compiler based on Clang/LLVM. This mode is suitable for compiling programs using the latest language specification and open source software. [C] Supported specifications are C89/C99/C11 and OpenMP 4.5/Part of OpenMP 5.0. [C++] Supported specifications are C++03/C++11/C++14/C++17 and OpenMP 4.5/Part of OpenMP 5.0. |
Recommended Options for Fujitsu Compilers
Language/Mode | Focus | Recommended options | Induced options |
---|---|---|---|
Fortran | Performance | -Kfast,openmp[,parallel] | -O3 -Keval,fp_contract,fp_relaxed,fz,ilfunc,mfunc,omitfp,simd_packed_promotion |
Fortran | Precision | -Kfast,openmp[,parallel],fp_precision | -Knoeval,nofp_contract,nofp_relaxed,nofz,noilfunc,nomfunc,parallel_fp_precision |
C/C++ Trad Mode | Performance | -Kfast,openmp[,parallel] | -O3 -Keval,fast_matmul,fp_contract,fp_relaxed,fz,ilfunc,mfunc,omitfp,simd_packed_promotion |
C/C++ Trad Mode | Precision | -Kfast,openmp[,parallel],fp_precision | -Knoeval,nofast_matmul,nofp_contract,nofp_relaxed,nofz,noilfunc,nomfunc,parallel_fp_precision |
C/C++ Clang Mode | Performance | -Nclang -Ofast | -O3 -ffj-fast-matmul -ffast-math -ffp-contract=fast -ffj-fp-relaxed -ffj-ilfunc -fbuiltin -fomit-frame-pointer -finline-functions |
Recommended options for GNU compilers
Language/Mode | Compiler | Recommended options |
---|---|---|
C/C++/Fortran | gcc / g++ / gfortran, mpicc / mpiCC / mpifort | -O2 -ftree-vectorize -march=native -fno-math-errno |
Compile Examples ARM
Fortran
- Sequential program
Fujitsu Cross-compilation: (ln0x)$ frtpx -Kfast sample.f #for cross-compilation using login nodes (FUJITSU)
Fujitsu Native: (cnaxxxx)$ frt -Kfast sample.f #for compilation inside compute node (FUJITSU)
GNU Native: (cnaxxxx)$ gfortran -O2 -ftree-vectorize -march=native -fno-math-errno sample.f #for compilation inside compute node (GNU)
- Thread-parallel program (using automatic parallelization)
Fujitsu Cross-compilation: (ln0x)$ frtpx -Kfast,parallel sample.f #for cross-compilation using login nodes
Fujitsu Native: (cnaxxxx)$ frt -Kfast,parallel sample.f #for compilation inside compute node
GNU Native: (cnaxxxx)$ gfortran -O2 -ftree-parallelize-loops=n -ftree-vectorize -march=native -fno-math-errno sample.f #for compilation inside compute node (GNU) with n threads
- Thread-parallel program (OpenMP)
(ln0x)$ frtpx -Kfast,openmp sample.f
(cnaxxxx)$ frt -Kfast,openmp sample.f
(cnaxxxx)$ gfortran -O2 -fopenmp -ftree-vectorize -march=native -fno-math-errno sample.f
- Thread-parallel program (OpenMP + auto-parallel)
(ln0x)$ frtpx -Kfast,openmp,parallel sample.f
(cnaxxxx)$ frt -Kfast,openmp,parallel sample.f
- MPI program
(ln0x)$ mpifrtpx -Kfast sample.f
(cnaxxxx)$ mpifrt -Kfast sample.f
(cnaxxxx)$ mpifort -O2 -ftree-vectorize -march=native -fno-math-errno sample.f
- Hybrid program (OpenMP + auto-parallel + MPI)
(ln0x)$ mpifrtpx -Kfast,openmp,parallel sample.f
(cnaxxxx)$ mpifrt -Kfast,openmp,parallel sample.f
(cnaxxxx)$ mpifort -O2 -fopenmp -ftree-vectorize -march=native -fno-math-errno sample.f
C
- Sequential program
(ln0x)$ fccpx -Kfast sample.c #for cross-compilation using login nodes (FUJITSU)
(cnaxxxx)$ fcc -Kfast sample.c #for compilation inside compute node (FUJITSU)
(cnaxxxx)$ gcc -O2 -ftree-vectorize -march=native -fno-math-errno sample.c #for compilation inside compute node (GNU)
- Thread-parallel program (using automatic parallelization)
(ln0x)$ fccpx -Kfast,parallel sample.c
(cnaxxxx)$ fcc -Kfast,parallel sample.c
(cnaxxxx)$ gcc -O2 -ftree-parallelize-loops=n -ftree-vectorize -march=native -fno-math-errno sample.c #for compilation inside compute node (GNU) with n threads
- Thread-parallel program (OpenMP)
(ln0x)$ fccpx -Kfast,openmp sample.c
(cnaxxxx)$ fcc -Kfast,openmp sample.c
(cnaxxxx)$ gcc -O2 -fopenmp -ftree-vectorize -march=native -fno-math-errno sample.c
- Thread-parallel program (OpenMP + auto-parallel)
(ln0x)$ fccpx -Kfast,openmp,parallel sample.c
(cnaxxxx)$ fcc -Kfast,openmp,parallel sample.c
- MPI program
(ln0x)$ mpifccpx -Kfast sample.c
(cnaxxxx)$ mpifcc -Kfast sample.c
(cnaxxxx)$ mpicc -O2 -ftree-vectorize -march=native -fno-math-errno sample.c
- Hybrid program (OpenMP + auto-parallel + MPI)
(ln0x)$ mpifccpx -Kfast,openmp,parallel sample.c
(cnaxxxx)$ mpifcc -Kfast,openmp,parallel sample.c
(cnaxxxx)$ mpicc -O2 -fopenmp -ftree-vectorize -march=native -fno-math-errno sample.c
C++
- Sequential program
(ln0x)$ FCCpx -Kfast sample.cpp
(cnaxxxx)$ FCC -Kfast sample.cpp
(cnaxxxx)$ g++ -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
- Thread-parallel program (using automatic parallelization)
(ln0x)$ FCCpx -Kfast,parallel sample.cpp
- Thread-parallel program (OpenMP)
(ln0x)$ FCCpx -Kfast,openmp sample.cpp
- Thread-parallel program (OpenMP + auto-parallel)
(ln0x)$ FCCpx -Kfast,openmp,parallel sample.cpp
- MPI program
(ln0x)$ mpiFCCpx -Kfast sample.cpp
(cnaxxxx)$ mpiFCC -Kfast sample.cpp
(cnaxxxx)$ mpiCC -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
- Hybrid program (OpenMP + auto-parallel + MPI)
(ln0x)$ mpiFCCpx -Kfast,openmp,parallel sample.cpp
(cnaxxxx)$ mpiFCC -Kfast,openmp,parallel sample.cpp
(cnaxxxx)$ mpiCC -O2 -fopenmp -ftree-vectorize -march=native -fno-math-errno sample.cpp
x86 Environment
Basic Information
Item | Value |
---|---|
Compilers CPU | GCC 12.3.0, Intel oneAPI HPC Toolkit 2023.1.0 |
Compilers GPU | CUDA 11.8: GCC 11.3.0, NVIDIA HPC SDK 22.9 |
MPI Interconnect CPU | InfiniBand (HDR100) |
MPI Interconnect GPU | InfiniBand (HDR200) |
Job Scheduler Name | Slurm |
Job Scheduler Version | 23.11.4 |
Available compilers
The native x86 compilers run on both the login nodes and the compute nodes, so you may compile directly on the login nodes.
Non-MPI
Creator | Language | Compile command |
---|---|---|
GNU | Fortran | gfortran |
GNU | C | gcc |
GNU | C++ | g++ |
Intel | Fortran | ifort |
Intel | C | icc / icx |
Intel | C++ | icpc / icpx |
To load GNU compilers:
ml GCCcore/11.3.0
To load INTEL compilers:
ml intel
MPI
Creator | Language | Compile command |
---|---|---|
GNU | Fortran | mpifort |
GNU | C | mpicc |
GNU | C++ | mpic++ |
Intel | Fortran | mpiifort |
Intel | C | mpiicc / mpiicx |
Intel | C++ | mpiicpc / mpiicpx |
To load GNU compilers:
ml GCCcore/11.3.0
To load INTEL compilers:
ml intel
Compile Information
- Fortran compiler
- Creates object programs from Fortran source programs.
- C/C++ compiler
- Creates object programs from C/C++ source programs.
Compiler | Description |
---|---|
GCC | https://gcc.gnu.org/gcc-12 |
Intel oneAPI HPC Toolkit | https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html |
CPU: Recommended Options
Language/Mode | Focus | Recommended options |
---|---|---|
C/C++/Fortran | GCC | -O2 -ftree-vectorize -march=native -fno-math-errno |
C/C++/Fortran | Intel oneAPI | -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise |
Compile Examples x86
Fortran
- Sequential program
Intel oneAPI: (ln0x)$ ifort -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.f
GCC: (ln0x)$ gfortran -O2 -ftree-vectorize -march=native -fno-math-errno sample.f
- Thread-parallel program (OpenMP)
Intel oneAPI: (ln0x)$ ifort -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.f
GCC: (ln0x)$ gfortran -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.f
- MPI program
Intel oneAPI: (ln0x)$ mpiifort -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.f
GCC: (ln0x)$ mpif90 -O2 -ftree-vectorize -march=native -fno-math-errno sample.f
- Hybrid program (OpenMP + MPI)
Intel oneAPI: (ln0x)$ mpiifort -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.f
GCC: (ln0x)$ mpif90 -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.f
C
- Sequential program
Intel oneAPI: (ln0x)$ icx -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.c
GCC: (ln0x)$ gcc -O2 -ftree-vectorize -march=native -fno-math-errno sample.c
- Thread-parallel program (OpenMP)
Intel oneAPI: (ln0x)$ icx -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.c
GCC: (ln0x)$ gcc -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.c
- MPI program
Intel oneAPI: (ln0x)$ mpiicx -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.c
GCC: (ln0x)$ mpicc -O2 -ftree-vectorize -march=native -fno-math-errno sample.c
- Hybrid program (OpenMP + MPI)
Intel oneAPI: (ln0x)$ mpiicx -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.c
GCC: (ln0x)$ mpicc -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.c
C++
- Sequential program
Intel oneAPI: (ln0x)$ icpx -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.cpp
GCC: (ln0x)$ g++ -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
- Thread-parallel program (OpenMP)
Intel oneAPI: (ln0x)$ icpx -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.cpp
GCC: (ln0x)$ g++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
- MPI program
Intel oneAPI: (ln0x)$ mpiicpx -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.cpp
GCC: (ln0x)$ mpic++ -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
- Hybrid program (OpenMP + MPI)
Intel oneAPI: (ln0x)$ mpiicpx -qopenmp -O2 -march=core-avx2 -ftz -fp-speculation=safe -fp-model precise sample.cpp
GCC: (ln0x)$ mpic++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno sample.cpp
GPU: Recommended Options
- NVIDIA recommendations:
- https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
- Resources on:
- https://developer.nvidia.com/hpc-sdk
Compiler | Description |
---|---|
CUDA/GCC | https://docs.nvidia.com/cuda/archive/11.8.0/index.html |
NVIDIA HPC SDK | https://docs.nvidia.com/hpc-sdk/archive/22.9/index.html |
Compile Examples GPU
Fortran
- GPU support for Fortran is only available through OpenACC directives or CUDA Fortran
OpenACC:
(ln0x)$ ml NVHPC/22.9-CUDA-11.8.0
(ln0x)$ nvfortran -acc -gpu=cc80 -Minfo=accel -Mpreprocess -o sample_acc sample_acc.f90
CUDA Fortran:
(ln0x)$ ml NVHPC/22.9-CUDA-11.8.0
(ln0x)$ nvfortran -gpu=cc80 -Minfo=accel -Mpreprocess -o sample sample.cuf
C
- CUDA
nvcc:
(ln0x)$ ml CUDA/11.8.0 GCC/11.3.0
(ln0x)$ nvcc --generate-code arch=compute_80,code=sm_80 -o sample sample.cu
- OpenACC
nvhpc:
(ln0x)$ ml NVHPC/22.9-CUDA-11.8.0
(ln0x)$ nvc -acc -gpu=cc80 -Minfo=accel -Mpreprocess -o sample_acc sample_acc.c
C++
- CUDA
nvcc:
(ln0x)$ ml CUDA/11.8.0 GCC/11.3.0
(ln0x)$ nvcc --generate-code arch=compute_80,code=sm_80 -o sample sample.cu
- OpenACC
nvhpc:
(ln0x)$ ml NVHPC/22.9-CUDA-11.8.0
(ln0x)$ nvc++ -acc -gpu=cc80 -Minfo=accel -Mpreprocess -o sample_acc sample_acc.cpp