This is a guide on how to build, install, and test ANL MPICH with lockahead support. It requires an up-to-date copy of autotools (provided in another attachment to this ticket), which must be in your $PATH throughout. If you encounter build problems with these instructions, please post to the ticket. Note that this code will eventually be available as part of the Argonne MPICH distribution, but for now it's only available by building it yourself.

I did this on a CentOS 7 box; it should work similarly on a CentOS 6, SLES, or Scientific Linux system. The packages you need may not match exactly what I list, since my starting config and yours will probably differ. If the autogen or configure process turns up something you need, you'll have to install it. Basically, we're going to take the autotools provided by Cray and build MPICH 3.3a2 plus the lockahead patch.

You will need all of these packages, so please install them first (package names are for yum install on CentOS/RHEL 7 and may differ on other systems):

yum install git gcc glibc patch perl python gcc-c++ gcc-gfortran rpm-build

Take the attached package of autotools binaries and extract it somewhere on your system. These were built on SLES, but worked for me on RHEL 7. Then put the destination in your PATH environment variable [note that if you log out you'll have to do this again]. My directory is /root/cray_autotools/bin/:

export PATH=/root/cray_autotools/bin/${PATH:+:$PATH}

And verify you're getting the right version:

[root@cent7c01 mpich]# which automake
/root/cray_autotools/bin/automake
[root@cent7c01 mpich]# automake --version
automake (GNU automake) 1.13.4
[...]

Then check out the ANL MPICH code into a directory:

git clone git://github.com/pmodels/mpich

Check out the right version [as of 17/1/27, this is the latest version, and the Cray patch works on it]:

git checkout v3.3a2

Put the patch from Cray (also attached to this message) somewhere on your system, then cd into the MPICH directory and apply it:

patch -p0 < lockahead_ladvise_mpich_patch

Then run autogen:

sh autogen.sh

Then configure - you must change the prefix to where you plan to install MPICH:

./configure --prefix=/shared/paf/mpich-latest_first/ \
--enable-error-checking=all --enable-error-messages=all \
--enable-f77 --enable-fc --enable-cxx \
--enable-romio --with-file-system=ufs+nfs+lustre \
--enable-threads=multiple --enable-thread-cs=global \
--enable-shared \
--enable-debuginfo \
|& tee configure.out

Note that shmem is not enabled. If you need it for an application you want to test, you'll have to add it. It's not required for IOR and has some additional build dependencies.

Then, make:

make | tee make.out    # optionally add "-j n" for a faster parallel build

Finally, install:

make install |& tee install.out

This will install MPICH to the directory you gave as the prefix (/shared/paf/mpich-latest_first/ for me, above). You will either need to install it to a shared location visible to all the nodes that will use it, or copy it to the same place on all the nodes (a sketch for copying follows below). Note that this must either be the same location you specified in the configure command, or you must add the lib directory in your install location to LD_LIBRARY_PATH when running applications. For example:

export LD_LIBRARY_PATH=/shared/paf/mpich-latest_movetest/lib/${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

You can add the bin directory to your path, but running the mpiexec executable directly may be better. This is the equivalent of mpirun from OpenMPI. Example of adding the bin directory:

export PATH=/shared/paf/mpich-latest_first/bin/${PATH:+:$PATH}
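If your install prefix is not on a shared filesystem, here is a minimal sketch for copying the install tree to the same place on every node. The local prefix and hostnames here are just the ones used in the examples in this guide; substitute your own:

# Minimal sketch: assumes a local prefix like /root/mpich_installs/mpich-latest_first
# and passwordless SSH to the other nodes (see the SSH sketch below).
for host in cent7c02 cent7c03 cent7c04; do
    rsync -a /root/mpich_installs/mpich-latest_first/ \
        root@${host}:/root/mpich_installs/mpich-latest_first/
done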
You can test running it on your nodes with a few simple commands. First, just check the version:

[root@cent7c01 mpich]# ./mpiexec --version
HYDRA build details:
    Version:                                 3.3a2
    Release Date:                            unreleased development copy
    CC:                                      gcc
    CXX:                                     g++
    F77:                                     gfortran
    F90:                                     gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/root/mpich_installs/mpich-latest_first' '--enable-error-checking=all' '--enable-error-messages=all' '--enable-f77' '--enable-fc' '--enable-cxx' '--enable-romio' '--with-file-system=ufs+nfs+lustre' '--enable-threads=multiple' '--enable-thread-cs=global' '--enable-shared' '--enable-debuginfo' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/anl/upstream/mpich/src/mpl/include -I/root/anl/upstream/mpich/src/mpl/include -I/root/anl/upstream/mpich/src/openpa/src -I/root/anl/upstream/mpich/src/openpa/src -D_REENTRANT -I/root/anl/upstream/mpich/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select

Then you can test that it works on multiple nodes like this (running it directly from the common location saves you having to add it to $PATH on all nodes). Note that you will need to change the hostnames given after -hosts, and you will also need passwordless SSH set up between hosts, just like for OpenMPI (a sketch follows below):

[root@cent7c01 bin]# ./mpiexec -hosts cent7c01,cent7c02 hostname
cent7c01
cent7c02
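If you don't already have passwordless SSH between the hosts, a minimal sketch (assuming you run as root and use the example hostnames above; substitute your own):

ssh-keygen -t rsa           # accept the defaults and an empty passphrase
ssh-copy-id root@cent7c02   # repeat for each host in your -hosts list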
=============

ANL MPICH is now built and installed. I'll now show building IOR with it and running with lockahead enabled. Lockahead can work with any application that uses MPI-IO collective I/O, but I'm only documenting how to use it with IOR. We can provide info on how to use it with other applications if needed (see also the hints-file sketch at the end of this guide). Do this on a node where you've added the MPICH install directory to $PATH, and also with the Cray-provided autotools added to $PATH.

Get IOR from github:

git clone https://github.com/LLNL/ior.git

Then:

cd ior
./bootstrap
./configure
make

The ior binary will be here: ./src/ior

You can run it like this (this assumes PATH has not been set up with the mpich bin directory):

/shared/paf/mpich-latest_movetest/bin/mpiexec -hosts cent7c01,cent7c02 /shared/testcases/upstream_IOR/ior/src/ior

That should run a trivial test in the current directory, looking like this:

IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Wed Feb 1 15:03:51 2017
Command line used: /shared/testcases/upstream_IOR/ior/src/ior
Machine: Linux cent7c01

Test 0 started: Wed Feb 1 15:03:51 2017
Summary:
        api                = POSIX
        test filename      = testFile
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 2 (1 per node)
        repetitions        = 1
        xfersize           = 262144 bytes
        blocksize          = 1 MiB
        aggregate filesize = 2 MiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     51.58      1024.00    256.00     0.009054   0.031493   0.015868   0.038778   0
read      28.43      1024.00    256.00     0.000636   0.069359   0.046108   0.070360   0
remove    -          -          -          -          -          -          0.001661   0

Max Write: 51.58 MiB/sec (54.08 MB/sec)
Max Read:  28.43 MiB/sec (29.81 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Mean(s)  Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write      51.58     51.58     51.58      0.00    0.03878  0     2      1   1    0   0     1        0         0    1      1048576 262144 2097152 POSIX 0
read       28.43     28.43     28.43      0.00    0.07036  0     2      1   1    0   0     1        0         0    1      1048576 262144 2097152 POSIX 0

-----------------

Now, here's how to actually test lockahead. You'll need a Lustre mount with a Cray client, a server with lockahead support present and enabled, and of course mpich and ior available in the same locations on every node. (You can verify that lockahead is working by running the lockahead_test binary from the Lustre sanity test suite.)

This will run just two processes on two nodes, with two aggregators. Substitute your paths and hostnames (change the path of testfile to be on your Lustre):

IOR_HINT__MPI__romio_lustre_co_ratio=2 /shared/paf/mpich-latest_movetest/bin/mpiexec -n 2 -hosts cent7c01,cent7c02 /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 1024 -b 1m -t 1m -v -w -H -E

This is 1024 MiB of data per rank (-s), written in 1 MiB chunks (-b, -t) using MPIIO collective I/O - 2048 MiB total. Run on a Lustre file system, this should give mediocre performance.
Output should look like this [note these tests are all on very slow VM systems, so bandwidth numbers are very low]:

IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Fri Feb 3 16:15:22 2017
Command line used: /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 1024 -b 1m -t 1m -v -w -H -E
Machine: Linux cent7c01
Start time skew across all tasks: 0.23 sec

Test 0 started: Fri Feb 3 16:15:22 2017
Path: /mnt/centss03
FS: 15.7 GiB   Used FS: 31.3%   Inodes: 1.0 Mi   Used Inodes: 0.0%
Participating tasks: 2
Summary:
        api                = MPIIO (version=3, subversion=1)
        test filename      = /mnt/centss03/testfile
        access             = single-shared-file, collective
        pattern            = strided (1024 segments)
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 2 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 1 MiB
        aggregate filesize = 2 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
hints passed to MPI_File_open() {
        romio_lustre_co_ratio = 2
}
hints returned from opened file {
        direct_read = false
        direct_write = false
        romio_lustre_co_ratio = 2
        romio_lustre_coll_threshold = 0
        romio_lustre_ds_in_coll = enable
        cb_buffer_size = 16777216
        romio_cb_read = automatic
        romio_cb_write = automatic
        cb_nodes = 2
        romio_no_indep_rw = false
        romio_cb_pfr = disable
        romio_cb_fr_types = aar
        romio_cb_fr_alignment = 1
        romio_cb_ds_threshold = 0
        romio_cb_alltoall = automatic
        ind_rd_buffer_size = 4194304
        ind_wr_buffer_size = 524288
        romio_ds_read = automatic
        romio_ds_write = automatic
        cb_config_list = *:1
        romio_filesystem_type = LUSTRE:
        romio_aggregator_list = 0 1
        striping_unit = 1048576
        striping_factor = 1
        romio_lustre_start_iodevice = 0
}
Commencing write performance test: Fri Feb 3 16:15:22 2017
write     182.88     1024.00    1024.00    0.010887   11.19      0.000873   11.20      0
remove    -          -          -          -          -          -          0.001600   0

Max Write: 182.88 MiB/sec (191.77 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Mean(s)   Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write      182.88    182.88    182.88     0.00    11.19843  0     2      1   1    0   0     1        0         0    1024   1048576 1048576 2147483648 MPIIO 0

Here's that same command, with lockahead added.
IOR_HINT__MPI__romio_lustre_cb_lock_ahead_write=1 IOR_HINT__MPI__romio_lustre_co_ratio=2 /shared/paf/mpich-latest_movetest/bin/mpiexec -n 2 -hosts cent7c01,cent7c02 /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 1024 -b 1m -t 1m -v -w -H -E

Output should look like this:

IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Fri Feb 3 16:16:26 2017
Command line used: /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 1024 -b 1m -t 1m -v -w -H -E
Machine: Linux cent7c01
Start time skew across all tasks: 0.23 sec

Test 0 started: Fri Feb 3 16:16:26 2017
Path: /mnt/centss03
FS: 15.7 GiB   Used FS: 31.3%   Inodes: 1.0 Mi   Used Inodes: 0.0%
Participating tasks: 2
Summary:
        api                = MPIIO (version=3, subversion=1)
        test filename      = /mnt/centss03/testfile
        access             = single-shared-file, collective
        pattern            = strided (1024 segments)
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 2 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 1 MiB
        aggregate filesize = 2 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
hints passed to MPI_File_open() {
        romio_lustre_cb_lock_ahead_write = 1
        romio_lustre_co_ratio = 2
}
hints returned from opened file {
        direct_read = false
        direct_write = false
        romio_lustre_co_ratio = 2
        romio_lustre_coll_threshold = 0
        romio_lustre_ds_in_coll = enable
        romio_lustre_cb_lock_ahead_write = 1
        romio_lustre_cb_lock_ahead_num_extents = 500
        cb_buffer_size = 16777216
        romio_cb_read = automatic
        romio_cb_write = automatic
        cb_nodes = 2
        romio_no_indep_rw = false
        romio_cb_pfr = disable
        romio_cb_fr_types = aar
        romio_cb_fr_alignment = 1
        romio_cb_ds_threshold = 0
        romio_cb_alltoall = automatic
        ind_rd_buffer_size = 4194304
        ind_wr_buffer_size = 524288
        romio_ds_read = automatic
        romio_ds_write = automatic
        cb_config_list = *:1
        romio_filesystem_type = LUSTRE:
        romio_aggregator_list = 0 1
        striping_unit = 1048576
        striping_factor = 1
        romio_lustre_start_iodevice = 0
}
Commencing write performance test: Fri Feb 3 16:16:26 2017
write     264.95     1024.00    1024.00    0.012465   7.72       0.000813   7.73       0
remove    -          -          -          -          -          -          0.001924   0

Max Write: 264.95 MiB/sec (277.82 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Mean(s)  Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write      264.95    264.95    264.95     0.00    7.72970  0     2      1   1    0   0     1        0         0    1024   1048576 1048576 2147483648 MPIIO 0

Finished: Fri Feb 3 16:16:34 2017

This should be faster than the previous run. Note these two new lines in the output:

romio_lustre_cb_lock_ahead_write = 1
romio_lustre_cb_lock_ahead_num_extents = 500

Just one more specific test: consider this run, with 4 aggregators writing to a singly striped file on OST 1. Note that there are 4 hosts given in the -hosts list. The filename used here is "testfile", and the command removes it and re-creates it before running IOR (the rm and lfs setstripe commands use a relative path, so run this from /mnt/centss03 so that they operate on the same file IOR writes to). This will write 4 GiB of data in total, from 4 aggregators - 1 GiB per aggregator. (And since there are only 4 processes, all of them are aggregators.)
rm -f testfile; lfs setstripe -i 1 testfile; IOR_HINT__MPI__romio_lustre_cb_lock_ahead_write=1 IOR_HINT__MPI__romio_lustre_co_ratio=4 /shared/paf/mpich-latest_movetest/bin/mpiexec -n 4 -hosts cent7c01,cent7c02,cent7c03,cent7c04 /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 1024 -b 1m -t 1m -v -w -H -E

The primary case of interest for lockahead is more aggregators than stripes of a file, because it is intended to improve performance with multiple aggregators per stripe. If there is 1 or fewer aggregators per stripe, lockahead is expected to hurt performance slightly.

The parameters of interest are the following.

mpiexec:
-n      Number of processes/ranks.
-hosts  Hosts to run on.

IOR:
-b, -t  Block size and transfer size, which are usually changed together. Block sizes larger than 1 MiB show greater benefit from lockahead, up to around 64 MiB, where it plateaus. (Extremely large block sizes - 512 MiB or more, for example - reduce the benefit.)
-s      Number of blocks to write. Total data per rank is -s * -t, so in our case 1024 * 1 MiB, or 1 GiB per rank.
-k      Keep the output file (otherwise it is deleted).
-E      Use the existing file (do not delete and re-create the output file).
-w      Write. Lockahead is not implemented for reading, since it has no benefit there; if you read with lockahead enabled, the library recognizes this and won't use lockahead.
-o [filename]  The file IOR will write to or read from.

Lockahead hints:
IOR_HINT__MPI__romio_lustre_cb_lock_ahead_write=1 turns on lockahead (it defaults to 0/off).
IOR_HINT__MPI__romio_lustre_co_ratio= sets the number of aggregators. The number of aggregators must be less than or equal to the number of nodes; if you specify more aggregators than nodes, you will still get just one aggregator per node.

And one more, which you may not choose to change:
IOR_HINT__MPI__romio_lustre_cb_lock_ahead_num_extents= controls how many extents to request 'ahead' of the I/O. The default is 500, which works well. If it is tuned too low (100 or below), the performance benefits are reduced; the same happens if it is tuned too high (say, 2000 or more).

A few other notes: Scaling up the number of processes without scaling the number of aggregators is not very interesting, since it doesn't change how lockahead is used; the extra processes just send their data to the aggregators. Varying these options should allow lockahead to be fully tested with IOR (two example sketches follow below). You can mimic the test plan Cray provided previously, or do your own exploration of the different options available. We can give additional guidance as required.
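As one concrete example of varying these options: to test the larger block sizes discussed above while keeping the same 1 GiB of data per rank, scale -b and -t up and -s down together. A sketch using the same paths and hostnames as the earlier examples (substitute your own); here 16 blocks * 64 MiB = 1 GiB per rank:

IOR_HINT__MPI__romio_lustre_cb_lock_ahead_write=1 IOR_HINT__MPI__romio_lustre_co_ratio=4 /shared/paf/mpich-latest_movetest/bin/mpiexec -n 4 -hosts cent7c01,cent7c02,cent7c03,cent7c04 /shared/testcases/upstream_IOR/ior/src/ior -o /mnt/centss03/testfile -a MPIIO -c -s 16 -b 64m -t 64m -v -w -H -E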
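For applications other than IOR, one possible shortcut (an assumption about stock ROMIO behavior, not something verified against this patch): ROMIO can read hints from a file named by the ROMIO_HINTS environment variable, one "key value" pair per line, which would let you pass the lockahead hints without modifying the application:

# Hedged sketch: relies on ROMIO's hints-file mechanism; verify your build
# supports it before depending on it.
cat > /tmp/romio_hints <<EOF
romio_lustre_cb_lock_ahead_write 1
romio_lustre_co_ratio 4
EOF
export ROMIO_HINTS=/tmp/romio_hints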