MPI and Infiniband

To run multi-node MPI workloads over the low-latency network, some of the components within the container must closely match those of the host environment. In addition to Infiniband drivers, the container’s MPI implementation needs to be ABI-compatible with the version running on the host.

On PEARL the supported MPI implementation is OpenMPI. We recommend using OpenMPI 4.1.0 but you are able to use OpenMPI 3.1.6 if you prefer. The OpenMPI version within your container should be the same as that on the host, or newer.

Note

We have a Singularity definition files available for download that includes everything you need to run an MPI workload on the PEARL worker nodes. It is available as part of an archive which also includes some simple test programs and a SLURM submission script.

To build the image from the definition file, first request an interactive session on one of the worker nodes:

$ srun --pty /bin/bash

Then build the image using the following command:

$ singularity build --fakeroot ompi4.sif ompi4.def

Once complete, you can launch a shell within the container:

$ singularity shell ompi4.sif

From here you can check the following important components:

The Infiniband driver version:

Singularity> ofed_info -s
OFED-internal-5.2-1.0.4:

The OpenMPI version:

Singularity> mpirun --version
mpirun (Open MPI) 4.1.0

Then check that OpenMPI has been built with IB support:

Singularity> ompi_info | grep btl
              MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.0)

To ensure that everything works as expected, you can run the included tests by submitting them as a batch job using the job script provided:

$ sbatch submit.slurm

When the job has completed, view the output file to check the results. The first result is a bandwidth test between the two nodes running on the host OS. The second set of figures is the same test but run within the container. The third set of results is another bandwidth test running inside the container, but this time using mpi4py.

OpenMPI run-time tuning parameters

To ensure optimum performance over the low-latency network, There are a two run-time parameters that you should include when running an MPI workload.

--mca btl_openib_allow_ib 1 --mca btl_openib_if_include mlx5_0:1

In context, the full command used to run the mpi4py bandwidth test would be:

mpirun -q --mca btl_openib_allow_ib 1 --mca btl_openib_if_include mlx5_0:1 singularity exec --nv ompi4.sif python tests/bandwidth.py

Note

The -q directive sets OpenMPI’s reporting to ‘quiet’ mode to supress non-error messages