HDF5 performance testing#

In order to improve upon an existing solution, one needs a way of measuring its characteristics and a reference point to compare against. The industry has produced a wide range of performance measurement tools; of these, IOR and MacsIO have been selected for this purpose, however the system is flexible enough to be extended with alternative IO test frameworks.

The IO bandwidth testing framework is made up of several files, installed in the $HOME/.local/bin directory. You can submit it directly with sbatch job_perf into the SLURM queue; all results will be collected in your home directory. The batch scripts have the following components:

job_perf                 # the main driver script, that you'd schedule for parallel execution
h5cluster.cmake          # configuration for ctest to aid the ctest dashboard
job_cleanup              # may be used to get rid of temporary files if any
job_compile_hdf5         # compiles and installs the parallel HDF5 library in .local/job-name/bin
job_compile_ior          # once the previous step is completed, IOR and MacsIO are compiled
job_compile_macsio       # and linked against the previously installed HDF5 libraries
job_download_hdf5        # obtains the software components, and takes care of local caching
job_download_ior
job_download_macsio
job_throughput_ior       # runs the actual testing -- this is the part which you most likely want
job_throughput_macsio    # to tailor to your needs
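
A minimal session, assuming the scripts are installed under $HOME/.local/bin and the defaults are acceptable, could look like this:

# submit the driver into the SLURM queue
sbatch ~/.local/bin/job_perf
# watch the driver and the jobs it enqueues
squeue -u $USER
# the report lands in your home directory as <jobid>-perf.txt
ls ~/*-perf.txt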

Brief review of the environment#

H5CLUSTER is composed of a custom Linux kernel with cgroup support to enforce certain limits. This edition is a hackable cluster environment giving you the freedom of choice. However, freedom comes at a price: in order to get the most performance out of the AWS infrastructure, some choices have been made for you:

SLURM concepts#

CPU: this is a consumable resource; in the default case on an AWS EC2 instance it means a virtual CPU, that is, a single hardware thread. Keep in mind that it is a valid configuration to boot up instances without hardware threads, and/or to configure SLURM to schedule by CORE only. In any event this is the smallest compute unit SLURM understands, and it may mean either a hardware thread or a core.

CORE: in the default case, two hardware threads sharing a single floating point unit. The total number of available cores per node is the physical core count minus one core reserved for system processes and one core for each OrangeFS service running on that node. If you see an error message complaining that the requested core count is not available, reduce the request until it is accepted.
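
To check how many CPUs, sockets, cores and threads SLURM sees on your nodes, and therefore what a CPU request actually consumes, generic SLURM queries such as the following can be used:

# CPUs, sockets, cores per socket and threads per core reported per node
sinfo -N -o "%N %c %X %Y %Z"
# what slurmd detects on the node it runs on
slurmd -C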

cmake and ctest#

While the HDF5 source distribution supports both automake and cmake, the latter has been selected for library configuration and installation since it provides a mechanism to submit results to CDash, the online web based dashboard operated by The HDF Group. All configuration related logic is coded in h5cluster.cmake, and job_compile_hdf5 launches ctest parameterized with the content of this file. The configuration file supports checking out a specific git commit (defaulting to the latest) and controls the CDash submission.
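
A hand-run sketch of the ctest invocation used by the HDF5 compile step later in this document, with the CDash submission switched off; the job name and install directory are illustrative, and the compile step also passes the instance type and job id, which h5cluster.cmake may expect:

ctest --verbose -S ~/.local/bin/h5cluster.cmake \
    -DWITH_TEST=TRUE -DWITH_SUBMIT=FALSE -DWITH_COMMIT= \
    -DSCRATCH_DIR=~/scratch -DREPO_DIR=/home/repos -DJOBNAME=manual-run \
    -DINSTALL_DIR=~/.local/manual-run -DNCPUS_TEST=4 -DNCPUS_COMPILE=30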

Executing the performance report#

This is the SLURM batch script entry point; note the job control parameters, as they may be overwritten. Here is an example that limits processor usage and redefines the repository directory:

sbatch --export H5_NCPUS=10,H5_REPO_DIR=/home/username/repo job_perf

Once this controller job is executed it generates an informative report on how it has been parameterised and how to change some of these parameters. All in all you will find new jobs enqueued; these jobs download or retrieve all required packages from caches or directly from the internet, then schedule them for compilation and installation, and finally run the performance test.

#!/bin/bash
#  _____________________________________________________________________________
#
#  Copyright (c) <2019> <copyright Steven Varga, Toronto, On>
#  Contact: steven@vargaconsulting.ca
#           2019 Toronto, On Canada
#  _____________________________________________________________________________

#SBATCH --job-name=performance-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/%u/%j-perf.txt

[ ! -z "$H5_REPO_DIR" ]      || export H5_REPO_DIR=/mnt/ebs01/repos                    # optional project cache files 
[ ! -z "$H5_GIT_DIR" ]       || export H5_GIT_DIR=/home/repos                          # must be on shared disk visible to all nodes
[ ! -z "$H5_JOBNAME" ]       || export H5_JOBNAME=`printf "tmp-%.3i" $SLURM_JOB_ID`    # change it to your preference
[ ! -z "$H5_INSTALL_DIR" ]   || export H5_INSTALL_DIR=~/.local/$H5_JOBNAME             # install hdf5 to shared drive
[ ! -z "$H5_SCRATCH_DIR" ]   || export H5_SCRATCH_DIR=~/scratch                        # use local scratch directory, much faster -- requires io-nodes
[ ! -z "$H5_NHOSTS" ]        || export H5_NHOSTS=$(sinfo -h -o %D)                     # number of hosts
[ ! -z "$H5_NCPUS" ]         || export H5_NCPUS=$((`sinfo -h -o %c` - 6))                      # number of cpus
[ ! -z "$H5_NCPUS_CTEST" ]   || export H5_NCPUS_CTEST=$((4<$H5_NCPUS?4:$H5_NCPUS))     # maximum of CPUS for CMAKE srun -n $H5_NCPUS_CTEST
[ ! -z "$H5_NCPUS_IO_TEST" ] || export H5_NCPUS_IO_TEST=$((40<$H5_NCPUS?40:$H5_NCPUS)) # maximum 10 CPUS 
[ ! -z "$H5_NCPUS_COMPILE" ] || export H5_NCPUS_COMPILE=$((30<$H5_NCPUS?30:$H5_NCPUS)) # ctest -n H5_CTEST_CPUS-s
## CONTROL CTEST HERE:
[ ! -z "$H5_WITH_CTEST" ]    || export H5_WITH_CTEST=TRUE                              # do ctest source tree
[ ! -z "$H5_WITH_SUBMIT" ]   || export H5_WITH_SUBMIT=TRUE                             # cdash submit if true
[ ! -z "$H5_WITH_COMMIT" ]   || export H5_WITH_COMMIT=                                 # checkout latest version if empty

# try not changing these variables:
export H5_INSTANCE_TYPE=`curl -s http://169.254.169.254/latest/meta-data/instance-type`
export H5_JOBID=$SLURM_JOB_ID
export H5_SCRIPT_DIR=~/.local/bin
export CPPFLAGS="-I$H5_INSTALL_DIR/include"
export LDFLAGS="-L$H5_INSTALL_DIR/lib"
export LD_LIBRARY_PATH=$H5_INSTALL_DIR/lib

export H5_BLOCK_SIZE=1G
export H5_TRANSFER_SIZE=1G

echo '==============================================================================='
echo following variables may be overwritten when launching script: 
echo "             sbatch --export VAR_01=value_01,VAR_02=value_02[,...] job_perf"
echo "hdf5 git directory (H5_GIT_DIR):  $H5_GIT_DIR"
echo "temporary compile placed here (H5_SCRATCH_DIR): $H5_SCRATCH_DIR" 
echo "temporary directory name (H5_JOBNAME): $H5_JOBNAME"
echo "install directory (H5_INSTALL_DIR): $H5_INSTALL_DIR"
echo
echo "H5_JOBID         = $SLURM_JOB_ID    # this is assigned by SLURM, don't change it"
echo "H5_JOBNAME       = $H5_JOBNAME      # overwrite to re-run same job"
echo "H5_INSTALL_DIR   = $H5_INSTALL_DIR  # where this version of HDF5 get's installed, must be a shared"
echo "H5_REPO_DIR      = $H5_REPO_DIR     # shared directory where you keep local copy of git"
echo "H5_GIT_DIR       = $H5_GIT_DIR      # EBS back git repo dirs, so not everything gets pulled from origin"
echo "H5_SCRATCH_DIR   = $H5_SCRATCH_DIR  # each user has a local disk on IO nodes, accelerates compile"
echo "H5_SCRIPT_DIR    = $H5_SCRIPT_DIR   # this is where these scripts are located"
echo "H5_INSTANCE_TYPE = $H5_INSTANCE_TYPE"
echo "H5_NHOSTS        = $H5_NHOSTS"
echo "H5_NCPUS         = $H5_NCPUS"
echo "H5_NCPUS_COMPILE = $H5_NCPUS_COMPILE"
echo "H5_NCPUS_IO_TEST = $H5_NCPUS_IO_TEST"
echo "H5_NCPUS_CTEST   = $H5_NCPUS_CTEST"
echo "H5_WITH_COMMIT   = $H5_WITH_COMMIT  # specific git commit to check out; empty means latest"
echo 
echo "LDFLAGS=$LDFLAGS"
echo "CPPFLAGS=$CPPFLAGS"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo here is an example to repeat this entire compile: 
echo "                             sbatch --export H5_JOBNAME=$H5_JOBNAME job_perf"
echo '==============================================================================='

# only 'master' node has EBS volume attached, where repo-s are stashed away
# therefore it is recommended to leave 1-2 CPU-s free on master node or
# copy the repositories up into H5_GIT_DIR manually then remove `-w master`
job_download_hdf5=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_hdf5)
job_download_macsio=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_macsio)
job_download_ior=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_ior)

job_compile_hdf5=$(sbatch --dependency=afterany:$job_download_hdf5 -n $H5_NCPUS_COMPILE --parsable $H5_SCRIPT_DIR/job_compile_hdf5)
job_compile_ior=$(sbatch --parsable --dependency=aftercorr:$job_download_ior:$job_compile_hdf5 $H5_SCRIPT_DIR/job_compile_ior)
job_compile_macsio=$(sbatch --parsable --dependency=aftercorr:$job_download_macsio:$job_compile_hdf5 $H5_SCRIPT_DIR/job_compile_macsio)

# the actual IO tests must be scheduled on all nodes to take advantage of the available bandwidth
job_ior=$(sbatch --exclusive --parsable -N $H5_NHOSTS --ntasks-per-node $H5_NCPUS_IO_TEST \
            --dependency=aftercorr:$job_compile_ior $H5_SCRIPT_DIR/job_throughput_ior)
job_macsio=$(sbatch --exclusive --parsable -N $H5_NHOSTS --ntasks-per-node $H5_NCPUS_IO_TEST \
            --dependency=aftercorr:$job_compile_macsio $H5_SCRIPT_DIR/job_throughput_macsio)

The job files are well documented and you only need a minimal level of BASH and SLURM batch scripting knowledge to edit them and meaningfully execute a performance test.
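
To inspect the dependency chain job_perf has enqueued, a plain squeue query is usually enough; the %E field shows the outstanding dependencies:

# job id, name, state and remaining dependencies for your jobs
squeue -u $USER -o "%.10i %.25j %.10T %.20E"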

Download step#

While this is the particular example for the HDF5 source directory, all other packages follow suit. If the attached EBS volume contains the source distributions, they get copied to a shared directory, /home/repos, and become accessible from all nodes. This job step is only scheduled on the master node, where the EBS drives are attached. It is important to make certain the repository is on a shared disk/directory reachable from all nodes, to prevent serializing the otherwise parallel job execution.

In case the cache directory (the EBS volume) is missing, the system fetches the source distributions directly from the internet and caches them in /home/repos. The primary objective of this step is to furnish the compile steps with reachable source directories; it uses up a single CPU from the SLURM consumables and must be scheduled on the master node.

GIT_SRC=https://bitbucket.hdfgroup.org/scm/hdffv/hdf5.git
REPO_NAME=hdf5

# CONDITIONAL FETCHING 
# you can't checkout git repository into orangefs directory -- reason unknown
# unless with elevated rights or using `rsync` from an ebs volume
if [ ! -d $H5_GIT_DIR/$REPO_NAME ]; then
    sudo mkdir -p  $H5_GIT_DIR
    sudo chmod a+rwx $H5_GIT_DIR
    if [ ! -d $H5_REPO_DIR/$REPO_NAME ]; then
        echo "fetching/updating source files from $GIT_SRC" on `hostname`
        sudo git clone $GIT_SRC $H5_GIT_DIR/$REPO_NAME
        sudo chown -R $USER:users $H5_GIT_DIR/$REPO_NAME
        sudo chmod -R a+rw $H5_GIT_DIR/$REPO_NAME
    else
        echo "found $REPO_NAME on EBS volume, copying directory:"
        echo "             rsync -avz $H5_REPO_DIR/$REPO_NAME $H5_GIT_DIR/"
        rsync -avzq $H5_REPO_DIR/$REPO_NAME $H5_GIT_DIR/
    fi
else
    echo "git directory found: $H5_GIT_DIR/$REPO_NAME skipping git cloning "
fi
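
As the comments in job_perf suggest, you can also pre-populate the shared git directory manually and then drop the -w master constraint; a sketch assuming the default H5_REPO_DIR and H5_GIT_DIR locations:

# copy the hdf5 repository from the EBS backed cache to the shared git directory
rsync -avzq /mnt/ebs01/repos/hdf5 /home/repos/
# or clone directly; elevated rights may be needed on the shared drive, as in the script above
sudo git clone https://bitbucket.hdfgroup.org/scm/hdffv/hdf5.git /home/repos/hdf5
sudo chown -R $USER:users /home/repos/hdf5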

HDF5 compile step#

This job step takes up at most 30 CPUs on a single host, and may be parameterised to control whether the tests and the CDash submission are executed and, most importantly, which git commit/version to work with. In addition to the general parameters, note that the script checks whether a compile already exists and, if so, skips the lengthy process. To benefit from this caching you need to submit the job with the same H5_JOBNAME that generated the source and compile directories; job_perf derives a fresh job name from the SLURM JOB_ID on every execution, so a pristine environment is guaranteed. An example of reusing a cached compile follows the snippet below.

if [ ! -d $CMAKE_TREE ]; then
    ctest . --verbose -j $H5_NCPUS_COMPILE -S ~/.local/bin/h5cluster.cmake -V \
        -DWITH_TEST=$H5_WITH_CTEST -DWITH_COMMIT=$H5_WITH_COMMIT  -DWITH_SUBMIT=$H5_WITH_SUBMIT\
        -DSCRATCH_DIR=$H5_SCRATCH_DIR -DREPO_DIR=$H5_GIT_DIR -DJOBNAME=$H5_JOBNAME \
        -DINSTALL_DIR=$H5_INSTALL_DIR -DNCPUS_TEST=$H5_NCPUS_CTEST -DNCPUS_COMPILE=$H5_NCPUS_COMPILE \
        -DINSTANCE_TYPE=$H5_INSTANCE_TYPE -DJOBID=$H5_JOBID
    cd $CMAKE_TREE
    make -j $H5_NCPUS_COMPILE install 
fi
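
For instance, to reuse the build produced by a previous run, resubmit with that run's job name (the value below is illustrative; the default pattern is tmp-NNN derived from the SLURM job id):

sbatch --export H5_JOBNAME=tmp-042 ~/.local/bin/job_perf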

Throughput test#

This is the part you would customize for your work.

Baseline#

Below is an example of an MPIIO collective IO baseline test running on a 10 x m5d.24xlarge instance cluster with 600 cores, transferring ~350GiB on roughly 100TB of shared disk space with 22GB/sec write and 28GB/sec read throughput. This is as good as it gets considering the underlying hardware: 2 x 900GB NVMe drives and 25Gbit/sec ethernet throughput. When comparing parallel HDF5 you should recall that all IO calls are channelled through the same MPIIO layer to read/write the OrangeFS data stripes. It is reasonable to have some friction and lose some 15% of the available bandwidth to HDF5 internals; however, if you notice a statistically significant difference between the mean throughput of the MPI-IO and HDF5 tests, you may choose to investigate further.

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Jul 18 17:23:56 2019
Command line        : /usr/local/bin/ior -a MPIIO -b 10MB -t 10MB -s 60 -c -C
Machine             : Linux master
TestID              : 0
StartTime           : Thu Jul 18 17:23:56 2019
Path                : /home/steven/.local/share/mpi-hello-world
FS                  : 94.5 TiB   Used FS: 0.0%   Inodes: 2932031007402.7 Mi   Used Inodes: 0.0%

Options: 
api                 : MPIIO
apiVersion          : (3.1)
test filename       : testFile
access              : single-shared-file
type                : collective
segments            : 60
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 600
clients per node    : 60
repetitions         : 1
xfersize            : 10 MiB
blocksize           : 10 MiB
aggregate filesize  : 351.56 GiB

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     21193      10240      10240      1.82       14.90      4.33       16.99      0   
read      27351      10240      10240      0.265332   12.63      5.08       13.16      0   
remove    -          -          -          -          -          -          0.009275   0   
Max Write: 21192.94 MiB/sec (22222.41 MB/sec)
Max Read:  27351.10 MiB/sec (28679.71 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       21192.94   21192.94   21192.94       0.00    2119.29    2119.29    2119.29       0.00   16.98679     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
read        27351.10   27351.10   27351.10       0.00    2735.11    2735.11    2735.11       0.00   13.16217     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
Finished            : Thu Jul 18 17:24:26 2019

The above test is a single run; it demonstrates functionality but is not sufficient for reasoning. In order to conclude that there is a statistically significant difference between HDF5 and MPIIO you need to run the tests for a minimum of 30 iterations. IOR has a -i N switch to specify the iteration count, and computes the mean and standard deviation. Given the ethernet fabric, not surprisingly the standard deviation is near 10%.
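
A sketch of the baseline re-run with 30 repetitions, using the same parameterization as the report above and the srun pattern used elsewhere in this document, so that the MPIIO and HDF5 means become comparable:

srun --export ALL $H5_INSTALL_DIR/bin/ior -a MPIIO -b 10MB -t 10MB -s 60 -c -C -i 30
srun --export ALL $H5_INSTALL_DIR/bin/ior -a HDF5  -b 10MB -t 10MB -s 60 -c -C -i 30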

It is advisable to transfer data in the GB range to reduce transient effects. In case of independent IO choose the appropriate block and transfer size, whereas with collective IO set the block and transfer size to a multiple of 1MB and manipulate the total data transferred by increasing the number of segments and the number of tasks per node.

Collective IO#

MPIIO is a distributed framework with IO services running on top of some underlying filesystem; the MPI software clients connect to them requesting metadata and block data, with the latter outweighing the former. It is easy to see that the parallel filesystem may be overwhelmed by the number of requests once enough clients are running.

What if there were a mechanism to aggregate these client requests on each local node and have only the aggregators connect to the parallel filesystem? The number of requests would certainly fall, at the cost of some added complexity. Collective calls do exactly that. OpenMPI has options to control the level of process aggregation, change strategies and so on. When you wish to measure throughput in this popular case you have to tell IOR or MacsIO to use collective calls. In addition, be certain to set the segment count to a multiple of the number of available IO servers, which is equivalent to the stripe width.

In the following case STRIPE_WIDTH is grabbed from openmpi-mca-params.conf, which is unique to and optimized for each cluster.

STRIPE_WIDTH=$(cat /usr/local/etc/openmpi-mca-params.conf |grep fs_pvfs2_stripe_width | cut -d'=' -f2)
srun --export ALL $H5_INSTALL_DIR/bin/ior  -a HDF5  -b 1GB -t 1GB -s $STRIPE_WIDTH -C -c -i 100
srun --export ALL $H5_INSTALL_DIR/bin/ior  -a MPIIO -b 1GB -t 1GB -s $STRIPE_WIDTH -C -c -i 100

Independent IO#

This approach works well for small process counts, or for OpenMP multithreaded applications where the number of IO threads is limited to a few per scheduled node. To model this workflow launch IOR with the following settings:

srun --export ALL $H5_INSTALL_DIR/bin/ior  -a HDF5  -b 10GB -t 1GB -i 100
srun --export ALL $H5_INSTALL_DIR/bin/ior  -a MPIIO -b 10GB -t 1GB -i 100