Open MPI is a free and open-source implementation of the MPI standard.
MPI Sessions support, especially in Fortran, is still lacking in the official Open MPI repo (as of early 2023). In this project, two pull requests were created and merged into the upstream Open MPI repository:
- Making MPI_SESSION_NULL available in Fortran programs

A new API for dynamic resource management was introduced by Huber et al. in Towards Dynamic Resource Management with MPI Sessions and PMIx (2022). This API has been continuously refined and is the basis for the final version of this project.
In the course of this IDP, a Fortran interface was derived for version v2a
of the Open MPI prototype introduced by Dominik Huber (link to repo).
This interface was added to said Open MPI fork and has been merged into the main
branch.
The interface is specified in Fortran 90, as this is the MPI Fortran binding that LibPFASST uses. However, little work would be required to also add Fortran 2008 (mpi_f08) support.
Also, the non-blocking variants of the MPI Sessions API (MPI_Session_dyn_v2a_psetop_nb, MPI_Session_dyn_v2a_query_psetop_nb, MPI_Session_get_pset_data_nb) could not be implemented in Fortran, because the results of these calls need to be converted back into Fortran representations after the operation completes.
It is advised to use the repository and follow the setup instructions described in the quickguide here to set up a multi-container docker cluster on your computer. Alternatively, you can also try to manually build the components: prrte, openpmix, ompi.
For development, it might help to enable debug symbols within Open MPI and the other runtime libraries. To enable them, add the --enable-debug flag when running install_docker.sh.
Based on the information provided here: https://doku.lrz.de/display/PUBLIC/Building+software+in+user+space+with+spack, the fork can be installed via Spack as follows:
unzip spack_packages.zip
module load user_spack
mkdir -p ~/spack/repos
spack repo create ~/spack/repos/mine
mv packages ~/spack/repos/mine/
spack install dyn_ompi
From then on, to activate the environment after logging in, run:
module load user_spack
spack load dyn_ompi
To run an MPI application with the fork, it is currently best to allocate an interactive session (replace NNODES with the number of nodes you want to allocate):
salloc --nodes NNODES ...
Then start the application with mpirun (replace PROCS_PER_NODE with the number of processors available on each node, which is also the granularity of the addition/removal of resources):
mpirun --host $(scontrol show hostname $SLURM_NODELIST | sed 's/$/:PROCS_PER_NODE/g' | tr '\n' ',') <further mpirun arguments> ...
A proof-of-concept, loop-based MPI Fortran application is available for testing the API here. It behaves similarly to the C examples in the same project.
If you followed the quickguide linked above, the test_applications folder should already be available in your docker cluster.
Make sure that you have started the docker cluster and entered the environment using the ./mpiuser-drop-in.sh script.
Then run the following commands:
cd /opt/hpc/build/test_applications
# build fortran example in release mode
scons example=DynMPISessions_v2a_fortran compileMode=release
The resulting binary is available at build/DynMPISessions_v2a_fortran_release.
The following flags are available:
Usage: ./build/DynMPISessions_v2a_fortran_release [-d] [-c <ITER_MAX>] [-l <proc_limit>] [-n <num_delta>] [-f <rc_frequency>] [-m <mode_string>]
Options:
-d Enable debug prints
-c <ITER_MAX> Maximum number of iterations (default: 200)
-l <proc_limit> Maximum (or minimum in s_/b_ mode) number of processors (default: 64)
-n <num_delta> Number of delta values (default: 8)
-f <rc_frequency> Frequency of resource change steps (default: 10)
-m <mode_string> Mode (default: i+)
I recommend using tmpi.py (described in the Tools section) to run the examples with a nice visualization of the processes:
# download tmpi.py
wget https://raw.githubusercontent.com/boi4/tmpi-py/main/tmpi.py
chmod +x tmpi.py
# run fortran example with tmpi.py
MPIRUNARGS="--display map --mca btl_tcp_if_include eth0 --host n01:4,n02:4,n03:4,n04:4 -x LD_LIBRARY_PATH -x DYNMPI_BASE" \
./tmpi.py 16 \
build/DynMPISessions_v2a_fortran_release -d -c 3000 -l 1 -m s_ -n 4 -f 200
For a list of code changes, check out the commit history on GitLab.
To add a new Fortran function to Open MPI, the following steps can be used when in the root of the Open MPI repository. I am documenting this here, as I could not find this information easily online:
1. Create a new file in ompi/ompi/mpi/fortran/mpif-h/ with the name of your function (check existing files to get the idea)
2. Add the new file to ompi/ompi/mpi/fortran/mpif-h/Makefile.am
3. Create a symlink to the new file in ompi/ompi/mpi/fortran/mpif-h/profile/ with "p" at the start of the filename: ln -s ../../../../../ompi/mpi/fortran/mpif-h/${name} p${name}
4. Add the symlink to ompi/ompi/mpi/fortran/mpif-h/profile/Makefile.am
5. Add the Fortran interface of the new function to ompi/ompi/mpi/fortran/use-mpi-ignore-tkr/mpi-ignore-tkr-interfaces.h.in
6. Add the Fortran interface to ompi/ompi/mpi/fortran/use-mpi-tkr/mpi-f90-interfaces.h
7. Add the C prototype of the binding function to ompi/ompi/mpi/fortran/mpif-h/prototypes_mpi.h
8. Add the PMPI variant of the interface to ompi/ompi/mpi/fortran/use-mpi-ignore-tkr/pmpi-ignore-tkr-interfaces.h
The most important file is the one created in the first step. It includes the C implementation of the Fortran call. For this, the Fortran arguments are automatically translated into C arguments, as described here.
Importantly, each argument is passed as a single pointer type (even multi-dimensional arrays). For each string type argument, a hidden additional argument specifying the string's length is passed.
Open MPI offers multiple conversion functions which were used to implement the functions for this project.
After the conversion of the input arguments, the internal Open MPI function is called with the converted arguments. The "OUT" parameters of this call are converted back into Fortran after the internal function returns.
The following new constants are available in the mpi module:
integer MPI_PSETOP_NULL
integer MPI_PSETOP_ADD
integer MPI_PSETOP_SUB
integer MPI_PSETOP_REPLACE
integer MPI_PSETOP_MALLEABLE
integer MPI_PSETOP_GROW
integer MPI_PSETOP_SHRINK
integer MPI_PSETOP_UNION
integer MPI_PSETOP_DIFFERENCE
integer MPI_PSETOP_INTERSECTION
integer MPI_PSETOP_SPLIT
These constants are used for the op and type arguments of MPI_Session_dyn_v2a_psetop and MPI_Session_dyn_v2a_query_psetop and represent the respective pset operation.
Note that MPI_PSETOP_SHRINK, MPI_PSETOP_GROW, MPI_PSETOP_ADD and MPI_PSETOP_SUB modify the available resources of the application.
The other pset operations only recombine existing process sets.
Analogously to existing Fortran subroutines in the MPI standard, the corresponding Fortran routines have the same name and arguments as their C counterparts.
Additionally, an optional integer ierror argument can be added at the end of each call to get the return status of the operation.
The exception is the output_psets argument of MPI_Session_dyn_v2a_psetop and MPI_Session_dyn_v2a_query_psetop: instead of being allocated by the Open MPI runtime, it must be pre-allocated by the user.
The reason for this is the difficulty of allocating Fortran objects in C. This also has an additional benefit: it is easier to broadcast output_psets, as other processes can likewise allocate output_psets before the call.
String arguments that contain a process set name and are of type IN or INOUT must be able to hold at least MPI_MAX_PSET_NAME_LEN characters (this constant is available in the mpi module).
Furthermore, process set names are terminated by filling the rest of the string with blanks (' ', ASCII 0x20).
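To illustrate these conventions, the following sketch shows how a caller might declare the string buffers (the capacity MAX_OUTPUT is an arbitrary caller-side choice, not part of the API):

program pset_name_buffers
  use mpi
  implicit none
  integer, parameter :: MAX_OUTPUT = 8                              ! caller-chosen capacity
  character(len=MPI_MAX_PSET_NAME_LEN) :: pset_name                 ! single pset name argument
  character(len=MPI_MAX_PSET_NAME_LEN) :: output_psets(MAX_OUTPUT)  ! pre-allocated output array
  integer :: noutput

  pset_name = 'mpi://WORLD'   ! assignment blank-pads the remaining characters
  output_psets = ' '          ! blank-fill, matching the termination convention
  noutput = MAX_OUTPUT        ! capacity of output_psets, passed along with the array
end program pset_name_buffers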
PSet Data Routines
interface
subroutine MPI_Session_get_pset_data(session, pset_name, coll_pset_name, keys, nkeys, wait, info_used, ierror)
implicit none
integer, intent(in) :: session
character(len=*), intent(in) :: coll_pset_name
character(len=*), intent(in) :: pset_name
character(len=*), dimension(*), intent(in) :: keys
integer, intent(in) :: nkeys
integer, intent(in) :: wait
integer, intent(out) :: info_used
integer, intent(out) :: ierror
end subroutine MPI_Session_get_pset_data
end interface
From the C documentation:
RETURN:
- MPI_SUCCESS if operation was successful
Description: Looks up a key-value pair in the dictionary associated with the given PSet name.
The PSet name has to exist.
The call is collective over the processes in coll_pset_name, i.e. all processes in coll_pset_name have to call this function. All processes are guaranteed to receive the same values. mpi://SELF may be used for individual lookups.
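A minimal usage sketch is shown below. The key name 'mpi_size' is purely illustrative; the keys actually published for a pset should be taken from the prototype's C examples. The sketch also assumes that the mpi module provides the standard MPI_Session_init/MPI_Session_finalize and MPI_Info routines, and that info_used can be read like a regular MPI_Info handle:

program get_pset_data_example
  use mpi
  implicit none
  integer :: session, info_used, ierror
  logical :: flag
  character(len=MPI_MAX_PSET_NAME_LEN) :: pset_name, coll_pset_name
  character(len=32) :: keys(1)
  character(len=64) :: value

  call MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session, ierror)

  pset_name      = 'mpi://WORLD'   ! pset whose dictionary is queried
  coll_pset_name = 'mpi://SELF'    ! individual (non-collective) lookup
  keys(1)        = 'mpi_size'      ! illustrative key name
  ! wait = 1: assumed to ask the call to block until the key is available
  call MPI_Session_get_pset_data(session, pset_name, coll_pset_name, keys, 1, 1, &
                                 info_used, ierror)

  ! info_used is an ordinary MPI_Info handle and can be read with MPI_Info_get
  call MPI_Info_get(info_used, keys(1), len(value), value, flag, ierror)
  if (flag) print *, trim(keys(1)), ' = ', trim(value)

  call MPI_Info_free(info_used, ierror)
  call MPI_Session_finalize(session, ierror)
end program get_pset_data_example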
interface
subroutine MPI_Session_set_pset_data (session, pset_name, info_used, ierror)
implicit none
integer, intent(in) :: session
character(len=*), intent(in) :: pset_name
integer, intent(in) :: info_used
integer, intent(out) :: ierror
end subroutine MPI_Session_set_pset_data
end interface
From the C documentation:
RETURN:
- MPI_SUCCESS if operation was successful
Description: Publishes a key-value pair in the dictionary associated with the given PSet name.
The PSet name has to exist.
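A minimal usage sketch (the key/value pair and the pset name are purely illustrative; the pset has to exist):

program set_pset_data_example
  use mpi
  implicit none
  integer :: session, info, ierror
  character(len=MPI_MAX_PSET_NAME_LEN) :: pset_name

  call MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session, ierror)

  ! Collect the key-value pairs to publish in an ordinary MPI_Info object
  call MPI_Info_create(info, ierror)
  call MPI_Info_set(info, 'last_checkpoint_iter', '42', ierror)   ! illustrative key/value

  pset_name = 'mpi://WORLD'   ! illustrative; any existing pset name works
  call MPI_Session_set_pset_data(session, pset_name, info, ierror)

  call MPI_Info_free(info, ierror)
  call MPI_Session_finalize(session, ierror)
end program set_pset_data_example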
PSet Operation Routines
interface
subroutine MPI_Session_dyn_v2a_query_psetop (session, coll_pset, input_pset, type, output_psets, noutput, ierror)
integer, intent(in) :: session
character(len=*), intent(in) :: coll_pset
character(len=*), intent(in) :: input_pset
integer, intent(out) :: type
character(len=*), dimension(*), intent(out) :: output_psets
integer, intent(inout) :: noutput
integer, intent(out) :: ierror
end subroutine MPI_Session_dyn_v2a_query_psetop
end interface
output_psets must be pre-allocated by the caller.
noutput needs to be set to the number of entries in output_psets before the call.
If type != MPI_PSETOP_NULL after the call, noutput is set to the number of output psets and their names are located in output_psets.
From the C documentation:
RETURN:
- MPI_SUCCESS if operation was successful
Description: Queries for a pending PSet Operation involving the specified PSet.
This only applies to PSet operations involving changes of resources:
-> MPI_PSETOP_{ADD, SUB, GROW, SHRINK, REPLACE}
If no pending PSet operation is found for the specified PSet, op will be set to MPI_PSETOP_NULL
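A minimal sketch of polling for a pending operation. The pset names are illustrative and the buffer capacity is a caller-side choice:

program query_psetop_example
  use mpi
  implicit none
  integer, parameter :: MAX_OUTPUT = 8
  integer :: session, op_type, noutput, i, ierror
  character(len=MPI_MAX_PSET_NAME_LEN) :: coll_pset, input_pset
  character(len=MPI_MAX_PSET_NAME_LEN) :: output_psets(MAX_OUTPUT)

  call MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session, ierror)

  coll_pset    = 'mpi://WORLD'   ! all processes of this pset call collectively
  input_pset   = 'mpi://WORLD'   ! pset that the pending operation involves
  output_psets = ' '             ! pre-allocated, blank-filled
  noutput      = MAX_OUTPUT      ! capacity of output_psets before the call

  call MPI_Session_dyn_v2a_query_psetop(session, coll_pset, input_pset, &
                                        op_type, output_psets, noutput, ierror)

  if (op_type /= MPI_PSETOP_NULL) then
     do i = 1, noutput           ! noutput now holds the number of result psets
        print *, 'pending op output pset: ', trim(output_psets(i))
     end do
  end if

  call MPI_Session_finalize(session, ierror)
end program query_psetop_example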
interface
subroutine MPI_Session_dyn_v2a_psetop (session, op, input_sets, ninput, output_psets, noutput, info, ierror)
integer, intent(in) :: session
integer, intent(inout) :: op
character(len=*), dimension(*), intent(in) :: input_sets
integer, intent(in) :: ninput
character(len=*), dimension(*), intent(inout) :: output_psets
integer, intent(inout) :: noutput
integer, intent(inout) :: info
integer, intent(out) :: ierror
end subroutine MPI_Session_dyn_v2a_psetop
end interface
output_psets must be pre-allocated by the caller.
noutput needs to be set to the number of entries in output_psets before the call.
On return, noutput is set to the number of output psets and their names are located in output_psets.
From the C documentation:
RETURN:
- MPI_SUCCESS if operation was successful
Description: Requests the specified PSet Operation to be applied on the input PSets.
The info object can be used to specify parameters of the operation.
If successful, the function allocates an array of n_output PSet names.
It is the caller's responsibility to free the PSet names and the output_psets array.
(The last two points do not apply to the Fortran interface, where output_psets is pre-allocated by the caller as described above.)
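The following sketch requests a MPI_PSETOP_UNION operation, which only recombines existing psets and therefore needs no extra parameters in the info object; resource-changing operations such as MPI_PSETOP_ADD or MPI_PSETOP_GROW additionally pass their parameters (e.g. the number of processes) through info, using the keys defined by the prototype (see its C examples). The pset names below are illustrative:

program psetop_example
  use mpi
  implicit none
  integer, parameter :: MAX_OUTPUT = 2
  integer :: session, op, ninput, noutput, info, ierror
  character(len=MPI_MAX_PSET_NAME_LEN) :: input_sets(2)
  character(len=MPI_MAX_PSET_NAME_LEN) :: output_psets(MAX_OUTPUT)

  call MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session, ierror)

  op            = MPI_PSETOP_UNION   ! recombines existing psets, no resource change
  input_sets(1) = 'mpi://WORLD'
  input_sets(2) = 'mpi://SELF'
  ninput        = 2
  output_psets  = ' '                ! pre-allocated by the caller
  noutput       = MAX_OUTPUT         ! capacity before the call
  info          = MPI_INFO_NULL      ! resource-changing ops pass their parameters here

  call MPI_Session_dyn_v2a_psetop(session, op, input_sets, ninput, &
                                  output_psets, noutput, info, ierror)

  if (ierror == MPI_SUCCESS .and. noutput >= 1) then
     print *, 'new pset: ', trim(output_psets(1))
  end if

  call MPI_Session_finalize(session, ierror)
end program psetop_example

Since op is declared intent(inout) in the interface, its value may be adjusted by the runtime on return.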
interface
subroutine MPI_Session_dyn_finalize_psetop(session, pset_name, ierror)
implicit none
integer, intent(in) :: session
character(len=*), intent(in) :: pset_name
integer, intent(out) :: ierror
end subroutine MPI_Session_dyn_finalize_psetop
end interface
From the C documentation:
RETURN:
- MPI_SUCCESS if operation was successful
Description: Indicates finalization of the PSet Operation.
This will make the operation unavailable for MPI_Session_dyn_v2a_query_psetop.
This only applies to PSet operations involving changes of resources:
-> MPI_PSETOP_{ADD, SUB, GROW, SHRINK, REPLACE}
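A minimal sketch (the pset name is illustrative; in practice it would be the pset associated with the completed resource-change operation):

program finalize_psetop_example
  use mpi
  implicit none
  integer :: session, ierror
  character(len=MPI_MAX_PSET_NAME_LEN) :: pset_name

  call MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session, ierror)

  ! After the application has adapted to the new resources, mark the
  ! operation as finished so it no longer shows up in query_psetop.
  pset_name = 'mpi://WORLD'   ! illustrative
  call MPI_Session_dyn_finalize_psetop(session, pset_name, ierror)

  call MPI_Session_finalize(session, ierror)
end program finalize_psetop_example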