HDF5 performance testing#
To improve upon an existing solution one needs a way of measuring its characteristics and a reference point to compare against. The industry has produced a wide range of performance measurement tools; of these, IOR and MacsIO have been selected for this purpose, however the system is flexible enough to be extended with alternative IO test frameworks.
The IO bandwidth testing framework is made up of several files, installed in the
$HOME/.local/bin directory. You can submit it directly into the SLURM queue with
sbatch job_perf; all results will be collected in your home directory.
The batch scripts have the following components:
```shell
job_perf             # the main driver script, that you'd schedule for parallel execution
h5cluster.cmake      # configuration for ctest to aid the ctest dashboard
job_cleanup          # may be used to get rid of temporary files if any
job_compile_hdf5     # compiles and installs the parallel HDF5 library in .local/job-name/bin
job_compile_ior      # once previous step is completed IOR and MacsIO are compiled then linked
job_compile_macsio   # against the previously installed HDF5 libraries
job_download_hdf5    # obtains the software components, and takes care of local caching
job_download_ior
job_download_macsio
job_throughput_ior   # runs actual testing -- this is the part which you most likely want
job_throughput_macs  # to tailor to your needs
```
Brief review of the environment#
H5CLUSTER is composed of a custom Linux kernel with cgroup support to enforce certain limits. This edition is a hackable cluster environment giving you freedom of choice. However, freedom comes at a price: to get the most performance out of the AWS infrastructure, some choices have been made for you:
- CPU isolation: this means not all reported cores/threads are available for your jobs; some of them are dedicated to system and IO server use.
- OMPIO IO tuning: in
/usr/local/etc/openmpi-mca-params.conf you find the correct settings for your running cluster
- sudo access: there is little point to a research cluster if you can't hack it
CPU: this is a consumable resource and in the default case on an AWS EC2 instance means a virtual CPU, or a single hardware thread. Keep in mind that there are valid configurations that boot up instances without hardware threads, and/or configure SLURM to schedule by CORE only. In any event this is the smallest compute unit SLURM understands and may mean either a hardware thread or a core.
CORE: in the default case, two hardware threads sharing a single floating point unit. The total number of available cores is the number of cores minus one reserved for system processes, minus one core for each OrangeFS service running per node. If you see an error message complaining that the requested core count is not available, reduce the request until it succeeds.
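As a back-of-the-envelope sketch, the per-node CPU budget works out as follows; the counts are illustrative assumptions, not values queried from a live cluster:

```shell
#!/bin/bash
# Sketch of the per-node CPU budget described above; all counts are
# illustrative assumptions.
TOTAL_CPUS=96        # e.g. what `sinfo -h -o %c` reports on an m5d.24xlarge
SYSTEM_RESERVED=1    # reserved for system processes
ORANGEFS=5           # one per running OrangeFS service on the node
USABLE=$(( TOTAL_CPUS - SYSTEM_RESERVED - ORANGEFS ))
echo "request at most $USABLE CPUs per node"
```

With these assumed counts the reservation adds up to 6, which mirrors the `- 6` in the H5_NCPUS default of job_perf.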
cmake and ctest#
While the HDF5 source distribution supports both automake and cmake, the latter has been selected for library configuration and installation since it provides a mechanism to submit results to CDash, the online web based dashboard operated by The HDF Group.
All configuration related logic is coded in
h5cluster.cmake; ctest is parameterized with the content of this file. This configuration file supports obtaining a specific
git commit, defaulting to the latest, and also controls the CDash submit.
Executing the performance report#
This is the SLURM batch script entry point; note all the job control parameters, as they may be overwritten. Here is an example to limit processor usage and redefine the repository directory:
sbatch --export H5_NCPUS=10,H5_REPO_DIR=/home/username/repo job_perf
Once this controller job executes it generates an informative report on how it has been parameterised and how to change some of these parameters. All in all you will find new jobs enqueued: these jobs download or retrieve all required packages from caches or directly from the internet, then schedule them for compile and installation, and finally run the performance test.
```shell
#!/bin/bash
# _____________________________________________________________________________
#
# Copyright (c) <2019> <copyright Steven Varga, Toronto, On>
# Contact: firstname.lastname@example.org
# 2019 Toronto, On Canada
# _____________________________________________________________________________

#SBATCH --job-name=performance-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/%u/%j-perf.txt

[ ! -z "$H5_REPO_DIR" ]    || export H5_REPO_DIR=/mnt/ebs01/repos  # optional project cache files
[ ! -z "$H5_GIT_DIR" ]     || export H5_GIT_DIR=/home/repos        # must be on shared disk visible to all nodes
[ ! -z "$H5_JOBNAME" ]     || export H5_JOBNAME=`printf "tmp-%.3i" $SLURM_JOB_ID` # change it to your preference
[ ! -z "$H5_INSTALL_DIR" ] || export H5_INSTALL_DIR=~/.local/$H5_JOBNAME # install hdf5 to shared drive
[ ! -z "$H5_SCRATCH_DIR" ] || export H5_SCRATCH_DIR=~/scratch      # use local scratch directory, much faster -- requires io-nodes
[ ! -z "$H5_NHOSTS" ]      || export H5_NHOSTS=$(sinfo -h -o %D)   # number of hosts
[ ! -z "$H5_NCPUS" ]       || export H5_NCPUS=$((`sinfo -h -o %c` - 6))            # number of cpus
[ ! -z "$H5_NCPUS_CTEST" ] || export H5_NCPUS_CTEST=$((4<$H5_NCPUS?4:$H5_NCPUS))   # maximum of CPUs for ctest: srun -n $H5_NCPUS_CTEST
[ ! -z "$H5_NCPUS_IO_TEST" ] || export H5_NCPUS_IO_TEST=$((40<$H5_NCPUS?40:$H5_NCPUS)) # maximum 40 CPUs
[ ! -z "$H5_NCPUS_COMPILE" ] || export H5_NCPUS_COMPILE=$((30<$H5_NCPUS?30:$H5_NCPUS)) # ctest -n $H5_NCPUS_CTEST -s

## CONTROL CTEST HERE:
[ ! -z "$H5_WITH_CTEST" ]  || export H5_WITH_CTEST=TRUE   # do ctest source tree
[ ! -z "$H5_WITH_SUBMIT" ] || export H5_WITH_SUBMIT=TRUE  # cdash submit if true
[ ! -z "$H5_WITH_COMMIT" ] || export H5_WITH_COMMIT=      # checkout latest version if empty

# try not changing these variables:
export H5_INSTANCE_TYPE=`curl -s http://169.254.169.254/latest/meta-data/instance-type`
export H5_JOBID=$SLURM_JOB_ID
export H5_SCRIPT_DIR=~/.local/bin
export CPPFLAGS="-I$H5_INSTALL_DIR/include"
export LDFLAGS="-L$H5_INSTALL_DIR/lib"
export LD_LIBRARY_PATH=$H5_INSTALL_DIR/lib
export H5_BLOCK_SIZE=1G
export H5_TRANSFER_SIZE=1G

echo '==============================================================================='
echo following variables may be overwritten when launching script:
echo "    sbatch --export VAR_01=value_01,VAR_02=value_02[,...] job_perf"
echo "hdf5 git directory (H5_GIT_DIR): $H5_GIT_DIR"
echo "temporary compile placed here (H5_SCRATCH_DIR): $H5_SCRATCH_DIR"
echo "temporary directory name (H5_JOBNAME): $H5_JOBNAME"
echo "install directory (H5_INSTALL_DIR): $H5_INSTALL_DIR"
echo
echo "H5_JOBID         = $SLURM_JOB_ID     # this is assigned by SLURM, don't change it"
echo "H5_JOBNAME       = $H5_JOBNAME       # overwrite to re-run same job"
echo "H5_INSTALL_DIR   = $H5_INSTALL_DIR   # where this version of HDF5 gets installed, must be shared"
echo "H5_REPO_DIR      = $H5_REPO_DIR      # shared directory where you keep local copy of git"
echo "H5_GIT_DIR       = $H5_GIT_DIR       # EBS backed git repo dirs, so not everything gets pulled from origin"
echo "H5_SCRATCH_DIR   = $H5_SCRATCH_DIR   # each user has a local disk on IO nodes, accelerates compile"
echo "H5_SCRIPT_DIR    = $H5_SCRIPT_DIR    # this is where these scripts are located"
echo "H5_INSTANCE_TYPE = $H5_INSTANCE_TYPE"
echo "H5_NHOSTS        = $H5_NHOSTS"
echo "H5_NCPUS         = $H5_NCPUS"
echo "H5_NCPUS_COMPILE = $H5_NCPUS_COMPILE"
echo "H5_NCPUS_IO_TEST = $H5_NCPUS_IO_TEST"
echo "H5_NCPUS_CTEST   = $H5_NCPUS_CTEST"
echo H5_COMMIT = $H5_COMMIT
echo
echo "LDFLAGS=$LDFLAGS"
echo "CPPFLAGS=$CPPFLAGS"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo here is an example to repeat this entire compile:
echo "    sbatch --export H5_JOBNAME=$H5_JOBNAME job_perf"
echo '==============================================================================='

# only 'master' node has EBS volume attached, where repo-s are stashed away
# therefore it is recommended to leave 1-2 CPU-s free on master node or
# copy the repositories up into H5_GIT_DIR manually then remove `-w master`
job_download_hdf5=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_hdf5)
job_download_macsio=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_macsio)
job_download_ior=$(sbatch -w master --dependency=singleton --parsable $H5_SCRIPT_DIR/job_download_ior)

job_compile_hdf5=$(sbatch --dependency=afterany:$job_download_hdf5 -n $H5_NCPUS_COMPILE --parsable $H5_SCRIPT_DIR/job_compile_hdf5)
job_compile_ior=$(sbatch --parsable --dependency=aftercorr:$job_download_ior:$job_compile_hdf5 $H5_SCRIPT_DIR/job_compile_ior)
job_compile_macsio=$(sbatch --parsable --dependency=aftercorr:$job_download_macsio:$job_compile_hdf5 $H5_SCRIPT_DIR/job_compile_macsio)

# the actual IO tests must be scheduled on all nodes to take advantage of available bandwidth
job_ior=$(sbatch --exclusive --parsable -N $H5_NHOSTS --ntasks-per-node $H5_NCPUS_IO_TEST \
    --dependency=aftercorr:$job_compile_ior $H5_SCRIPT_DIR/job_throughput_ior)
job_macsio=$(sbatch --exclusive --parsable -N $H5_NHOSTS --ntasks-per-node $H5_NCPUS_IO_TEST \
    --dependency=aftercorr:$job_compile_macsio $H5_SCRIPT_DIR/job_throughput_macsio)
```
While the following is the particular example for the HDF5 source directory, all other packages follow suit. If the attached EBS volume contains the source distributions, they get copied to the shared directory
/home/repos and become accessible from all nodes. These job steps are only scheduled on the
master node, where the EBS drives are attached. It is important to make certain the repository is on a shared disk/directory reachable from all nodes, to prevent serializing the otherwise parallel job execution.
In case the cache directory -- the EBS volumes -- is missing, the system fetches the source distributions directly from the internet and caches them in
/home/repos. The primary objective of this step is to furnish the compile steps with reachable source directories; it uses up a single CPU from the SLURM consumables and must be scheduled to the master node.
```shell
GIT_SRC=https://bitbucket.hdfgroup.org/scm/hdffv/hdf5.git
REPO_NAME=hdf5

# CONDITIONAL FETCHING
# you can't checkout git repository into orangefs directory -- reason unknown
# unless with elevated rights or using `rsync` from an ebs volume
if [ ! -d $H5_GIT_DIR/$REPO_NAME ]; then
    sudo mkdir -p $H5_GIT_DIR
    sudo chmod a+rwx $H5_GIT_DIR
    if [ ! -d $H5_REPO_DIR/$REPO_NAME ]; then
        echo "fetching/updating source files from $GIT_SRC on `hostname`"
        sudo git clone $GIT_SRC $H5_GIT_DIR/$REPO_NAME
        sudo chown -R $USER:users $H5_GIT_DIR/$REPO_NAME
        sudo chmod -R a+rw $H5_GIT_DIR/$REPO_NAME
    else
        echo "found $REPO_NAME on EBS volume, copying directory:"
        echo "    rsync -avz $H5_REPO_DIR/$REPO_NAME $H5_GIT_DIR/"
        rsync -avzq $H5_REPO_DIR/$REPO_NAME $H5_GIT_DIR/
    fi
else
    echo "git directory found: $H5_GIT_DIR/$REPO_NAME skipping git cloning"
fi
```
HDF5 compile step#
This job step takes up a maximum of 30 CPUs on a single host, and may be parameterised: whether the test suite and CDash submit should be executed, and most importantly which commit/git version to work with.
In addition to the general parameters, note the script checks whether a compile already exists, and if so it skips the lengthy process. To benefit from this caching you need to submit the job with the same
H5_JOBNAME that generated the source and compile directories.
job_perf uses the SLURM job ID to create a fresh job name on each execution, so a pristine environment is guaranteed.
```shell
if [ ! -d $CMAKE_TREE ]; then
    ctest . --verbose -j $H5_NCPUS_COMPILE -S ~/.local/bin/h5cluster.cmake -V \
        -DWITH_TEST=$H5_WITH_CTEST -DWITH_COMMIT=$H5_WITH_COMMIT -DWITH_SUBMIT=$H5_WITH_SUBMIT \
        -DSCRATCH_DIR=$H5_SCRATCH_DIR -DREPO_DIR=$H5_GIT_DIR -DJOBNAME=$H5_JOBNAME \
        -DINSTALL_DIR=$H5_INSTALL_DIR -DNCPUS_TEST=$H5_NCPUS_CTEST -DNCPUS_COMPILE=$H5_NCPUS_COMPILE \
        -DINSTANCE_TYPE=$H5_INSTANCE_TYPE -DJOBID=$H5_JOBID
    cd $CMAKE_TREE
    make -j $H5_NCPUS_COMPILE install
fi
```
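The compile cache hinges on H5_JOBNAME. A minimal sketch of how job_perf derives the default name, and how you would resubmit with the same name to reuse the cached tree; the job id 42 is an illustrative stand-in:

```shell
#!/bin/bash
# Default job name derivation as in job_perf; SLURM_JOB_ID is normally set by
# SLURM -- 42 is an illustrative stand-in.
SLURM_JOB_ID=42
H5_JOBNAME=$(printf "tmp-%.3i" $SLURM_JOB_ID)
echo "$H5_JOBNAME"    # -> tmp-042

# resubmitting with the same name reuses the source and compile directories:
#   sbatch --export H5_JOBNAME=tmp-042 job_perf
```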
This is the part you would customize for your work.
Below is an example of an MPIIO collective IO baseline test running on a 10 x m5d.24xlarge instance cluster on 600 cores, transferring ~350GiB on roughly 100TB of shared disk space with 22GB/sec write and 28GB/sec read throughput. This is as good as it gets, considering the underlying hardware: 2 x 900GB NVMe drives per node and 25Gbit/sec ethernet throughput. When comparing parallel HDF5 you should recall that all IO calls are channelled through the same MPIIO layer to read|write the OrangeFS data stripes. It is reasonable to have some friction and lose some 15% of this available bandwidth to HDF5 internals; however if you notice a statistically significant difference between the mean throughput of the MPI-IO and HDF5 tests then you may choose to investigate further.
```
IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Jul 18 17:23:56 2019
Command line        : /usr/local/bin/ior -a MPIIO -b 10MB -t 10MB -s 60 -c -C
Machine             : Linux master
TestID              : 0
StartTime           : Thu Jul 18 17:23:56 2019
Path                : /home/steven/.local/share/mpi-hello-world
FS                  : 94.5 TiB   Used FS: 0.0%   Inodes: 2932031007402.7 Mi   Used Inodes: 0.0%

Options:
api                 : MPIIO
apiVersion          : (3.1)
test filename       : testFile
access              : single-shared-file
type                : collective
segments            : 60
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 600
clients per node    : 60
repetitions         : 1
xfersize            : 10 MiB
blocksize           : 10 MiB
aggregate filesize  : 351.56 GiB

Results:
access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     21193      10240      10240      1.82       14.90      4.33       16.99      0
read      27351      10240      10240      0.265332   12.63      5.08       13.16      0
remove    -          -          -          -          -          -          0.009275   0
Max Write: 21192.94 MiB/sec (22222.41 MB/sec)
Max Read:  27351.10 MiB/sec (28679.71 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       21192.94   21192.94   21192.94       0.00    2119.29    2119.29    2119.29       0.00   16.98679     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
read        27351.10   27351.10   27351.10       0.00    2735.11    2735.11    2735.11       0.00   13.16217     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
Finished            : Thu Jul 18 17:24:26 2019
```
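When collecting many runs it helps to strip the throughput figures out of the report programmatically. A minimal sketch, assuming the report was saved to a file; the path and the two-line here-doc sample are stand-ins for a real IOR log:

```shell
#!/bin/bash
# Extract the MiB/sec figures from a saved IOR report; /tmp/ior-report.txt and
# its two-line sample stand in for a real log.
cat > /tmp/ior-report.txt <<'EOF'
Max Write: 21192.94 MiB/sec (22222.41 MB/sec)
Max Read:  27351.10 MiB/sec (28679.71 MB/sec)
EOF

write_bw=$(awk '/^Max Write:/ {print $3}' /tmp/ior-report.txt)
read_bw=$(awk '/^Max Read:/ {print $3}' /tmp/ior-report.txt)
echo "write: $write_bw MiB/s  read: $read_bw MiB/s"
```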
The above test is a single run: it demonstrates functionality but is not sufficient for reasoning. To conclude that there is a statistically significant difference between HDF5 and MPIIO IO you need to run the tests with a minimum of 30 iterations. IOR has a
-i N switch to specify the iteration count, and computes the mean and standard deviation. Given the ethernet fabric, not surprisingly the standard deviation is near 10%.
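The statistic IOR reports for repeated runs is simply the mean and standard deviation over the per-iteration throughputs; a sketch of the computation, with made-up sample values:

```shell
#!/bin/bash
# Mean and (population) standard deviation over per-iteration throughputs;
# the five sample values are made up for illustration.
samples="21000 19500 22400 20100 21800"
stats=$(echo $samples | tr ' ' '\n' | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END { mean = sum / n; sd = sqrt(sumsq / n - mean * mean)
          printf "mean %.1f MiB/s, stddev %.1f MiB/s", mean, sd }')
echo "$stats"    # -> mean 20960.0 MiB/s, stddev 1063.2 MiB/s
```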
It is advisable to transfer data in the GB range to reduce transient effects. In case of individual IO choose the appropriate block and transfer size, whereas with collective IO set the block and transfer size to a multiple of 1MB and manipulate the total data transferred by increasing the segment count and the number of tasks per node.
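To see how the knobs combine: with collective IO the aggregate file size is block size x segments x tasks. The values below reproduce the baseline run shown earlier (-b 10MB -s 60, 600 tasks):

```shell
#!/bin/bash
# aggregate filesize = blocksize x segments x tasks, matching the baseline run
BLOCK_MIB=10
SEGMENTS=60
TASKS=600
AGG_MIB=$(( BLOCK_MIB * SEGMENTS * TASKS ))
echo "aggregate: $AGG_MIB MiB"    # 360000 MiB, i.e. ~351.56 GiB
```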
MPIIO is a distributed framework with IO services running on top of some underlying filesystem; the MPI software clients connect to them requesting metadata and blocked data, the latter outweighing the former. It is easy to see that the parallel filesystem may be overwhelmed by the number of requests once enough clients are executing.
What if there was a mechanism to aggregate these client requests on each local node and have only the aggregators connect to the parallel filesystem? The number of requests would certainly fall, at the cost of some added complexity. Collective calls do exactly that. OpenMPI has options to control the level of process aggregation, change strategies and so on.
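For reference, aggregation can be steered from the same openmpi-mca-params.conf; the parameter names below are assumptions based on OMPIO's tunables, so verify the exact names on your build with `ompi_info --param io ompio --level 9`:

```
# hypothetical additions to /usr/local/etc/openmpi-mca-params.conf
io = ompio                       # select the OMPIO component
io_ompio_num_aggregators = 8     # assumed tunable: aggregator processes per communicator
```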
When you wish to measure throughput in this popular case you have to tell
MacsIO to use collective calls. In addition, be certain to set the segment count to a multiple of the available IO servers, which is equivalent to the stripe width.
In the following case STRIPE_WIDTH is grabbed from
openmpi-mca-params.conf, which is unique and optimized for each cluster.
```shell
STRIPE_WIDTH=$(cat /usr/local/etc/openmpi-mca-params.conf | grep fs_pvfs2_stripe_width | cut -d'=' -f2)
srun --export ALL $H5_INSTALL_DIR/bin/ior -a HDF5  -b 1GB -t 1GB -s $STRIPE_WIDTH -C -c -i 100
srun --export ALL $H5_INSTALL_DIR/bin/ior -a MPIIO -b 1GB -t 1GB -s $STRIPE_WIDTH -C -c -i 100
```
This approach works well for small process counts, or OpenMP multithreaded applications where IO threads are limited to a few per scheduled node. To model that workflow launch IOR with these settings:
```shell
srun --export ALL $H5_INSTALL_DIR/bin/ior -a HDF5  -b 10GB -t 1GB -i 100
srun --export ALL $H5_INSTALL_DIR/bin/ior -a MPIIO -b 10GB -t 1GB -i 100
```