Installing cluster tool#

As per agreement, you are provided with read access to git@vargaconsulting.ca:/cluster. If you are a new user, please submit your name, organization and GPG public key to info@vargaconsulting.ca. Once permissions are obtained, clone and install the cluster driver Python package:

pip3 install argcomplete
sudo activate-global-python-argcomplete
git clone git@vargaconsulting.ca:/cluster
cd cluster
pip3 install . # or make install

For details, please refer to the argcomplete Python package. To access S3 volumes locally you also need to install s3backer, the software package that provides block-device mapping for Amazon S3.
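A minimal install sketch for a Debian/Ubuntu host; the package name and the upstream repository below are assumptions based on the public s3backer project, adjust for your distribution:

sudo apt-get install s3backer            # packaged version, if recent enough
# or build the latest release from source:
git clone https://github.com/archiecobbs/s3backer.git
cd s3backer && ./autogen.sh && ./configure && make && sudo make install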

Configuring H5CLUSTER#

Managing AWS resources at massive scale from the AWS web console is non-trivial. In addition to booting up, stopping/hibernating and terminating AWS spot instances (servers), the cluster management script ties IAM roles to the running nodes' log-in accounts, attaches EBS drives and Elastic IP addresses, and handles the chores required for a supercomputing-like experience.

Default configuration#

The files .aws/config and .aws/credentials control the default behavior of H5CLUSTER operations. Although most of these options may be overridden when running the cluster controller program, it is suggested to populate these files with sensible defaults:

Possible default content of .aws/config:

[default]
region = us-east-2            # define aws region 
output = text                 # control output json | text

[cluster default]             # keyword 'cluster' and 'default' are reserved
ami: ami-xxxxxxxxxxxxxxx      # image identifier provided by vargaconsulting
ipv4:                         # optional elastic IPV4 address (low cost static IP) 
hostname:  master             # if /etc/hosts is writable, the master node IP address and hostname are updated
instance_type: m5d.xlarge     # head / master node type
availability_zone: us-east-2b # see spot instance pricing to determine this
volumes:                      # comma separated list of predefined volumes, applicable only to head node
services:                     # comma separated services, applicable only to head node 
behavior:                     # stop | hibernate | terminate
nodes:                        # comma separated predefined nodes
udp-ports:                    # comma separated list or ranges of UDP ports
tcp-ports:                    # comma separated list or ranges of TCP ports 

And in .aws/credentials you should specify the credentials listed in your IAM account:

[default]
aws_access_key_id = AKI****************L
aws_secret_access_key = h+A************************************o
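If you have the AWS CLI installed (not required by H5CLUSTER), a quick sanity check that the credentials and default region are picked up:

aws sts get-caller-identity                                    # prints account id and IAM user ARN
aws ec2 describe-availability-zones --region us-east-2 --output text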

Volumes and Services#

The ability to persist data beyond the lifespan of an on-demand distributed computing environment is essential. In addition to permanent storage, instances may be furnished with local disks called ephemeral drives; their upside is increased bandwidth and no charge for data transfer, their downside is that their lifespan is tied to the instance the drive belongs to.

Overview of features#

storage        | throughput             | persistent | federated [8] | USD per GB/month
---------------|------------------------|------------|---------------|------------------
ephemeral nvme | 600MB/s - 6000MB/s [2] | no [3]     | no            | free [1]
ephemeral ssd  | high [2]               | no         | no            | free [1]
ebs cold       | 230MB/s [7]            | yes        | no            | 0.025
ebs hdd        | 500MB/s [7]            | yes        | no            | 0.045
ebs ssd        | 250MB/s                | yes        | no            | 0.100
block          |                        | yes        | yes           | 0.140
pvfs           | 3GB/s per host [6]     | yes/no     | maybe         | 3 cores [4] + underlying storage [5]

Configuration File#

The configuration file has sections to define volumes, nodes and clusters, in increasing order of containment: a cluster contains nodes and volumes, and nodes can have volumes only. When instantiating a given cluster, a dependency graph is first built by topological ordering of the components, then the driver script brings them up in the right order. The dependency graph is defined through the mount-dir and block-cache-dir fields and the volumes and services lists. Below is an example configuration file that defines PVFS home directories where the data and metadata are stored in the s3://my-s3-bigdata bucket, with a local write-back cache on EBS volumes mounted as /mnt/s3-cache. The cluster consists of a head node, 20 io nodes, and 40 compute nodes.

[volume s3://my-s3-bigdata]  # S3 bucket for massive data sets: ~30USD/TiB/month
mount-dir: /mnt/s3         # metadata and data will be dumped here, see [pvfs-???]
block-cache-dir: /mnt/s3-cache  # ties this resource to EBS volume: s3-cache-store
block-cache-size: 10000          # with 1M block the cache is 10GB 

[volume ephemeral://burst-buffer] # 
mount-dir: /mnt/ephemeral  # mounted under this directory
file-system: ext4 -b 4k    # see mkfs.??? for options

[volume s3://spack-us-east-1a]
mount-dir: /mnt/spack      # shell expects spack mounted in this location 
block-cache-dir: /mnt/ephemeral # ties it to ephemeral volume
read-only: true            # s3backer has no distributed locking, only SWMR is available

[volume ebs://s3-cache-store]
mount-dir: /mnt/s3-cache 
uri: vol-0f29xxxxxxxxx3a5c, vol-03c0xxxxxxxxxf55a,
     vol-05bcxxxxxxxxxf450, vol-072bxxxxxxxxx3d4b,
     vol-006fxxxxxxxxxc6b8, vol-0d1axxxxxxxxxd859
chmod: a+rw                # access control
type: standard             # function of performance and price

[service pvfs-data://homedir]
path: /mnt/s3              # ties it to s3://my-s3-bigdata
[service pvfs-meta://homedir]
path: /mnt/s3              # ties it to s3://my-s3-bigdata
[volume pvfs://homedir]
mount-dir: /home           # provides home directory services

[node io]
size: 20                   # servers will be spread out round robin
instance-type: m5d.metal
bid-price: 2.00
volumes: ephemeral://burst-buffer, 
    pvfs://homedir, ebs://s3-cache-store,
    s3://spack-us-east-1a, s3://my-s3-bigdata
services: pvfs-meta://homedir, pvfs-data://homedir
behavior: stop

[node compute]
size: 40                   # servers will be spread out round robin
instance-type: c5.metal
bid-price: 2.00
volumes: 
    pvfs://homedir,        # mount shared home directory, all data is here
    s3://spack-us-east-1a  # provides access  to software suite and `/usr/local` 
behavior: terminate        # compute nodes can be disposed  

[cluster my-cluster]
nodes: io, compute         # 
instance-type: c5.2xlarge
volumes: 
    pvfs://homedir, 
    s3://spack-us-east-1a

behavior: stop

Configuration attributes shared among all volumes:

[volume  unique-identifier]  # must be unique, no `_` or ` \n \t` allowed 
mount-dir: /some/path        # it also creates parent/child relationship for dependency sort
chmod: octal | text          # unix permissions, see: `man chmod` 
chown: user:group            # ownership, must be available on both localhost and remote

Volumes#

S3 block device#

Most options of s3backer are supported; please consult the s3backer website and the s3backer man page for details. Parameters are delegated directly, with the exception of block-cache-dir, which acts as a placeholder for multiple volumes within a bucket.

[volume s3://unique-s3-volume-name]
region: us-east-1                     # region where s3 bucket is stored
mount-dir: /mnt/s3                    # gets mounted here               
block-cache-dir: /mnt/eph             # where local cache files are stored
file-system: ext4                     # filesystem and options for the volume 
# ext4 -E packed_meta_blocks=1 -E lazy_journal_init=1 -E lazy_itable_init=1 -E nodiscard  -b 4096 -O ^has_journal
chmod: a+r                            # octal or standard unix permissions   
compress: 5                           # gzip compression level
block-cache-size: 1000                # actual size is this number x block-size
block-size: 1M                        # trade off between latency and s3 object storage overhead
block-cache-write-delay: 30           # value in ms, reduces traffic at the cost of inconsistency
block-cache-threads: 48               # 
block-cache-max-dirty: 0
block-cache-timeout: 10000
block-cache-num-protected: 20         # keep inodes in block cache  
block-hash-prefix: true               # controls multi disk setup
min-write-delay: 0
md5-cache-size: 1000
md5-cache-time: 10000
read-ahead: 4
read-ahead-trigger: 2
access-type: private                  # private | public_read | public_read_write | authenticated_read
directIO: false                       # true | false, may be used for testing throughput
fuse-dir: /var/local 
volume-count: 6                       # number of volumes in a single bucket, used for pvfs
volume-size: 20G                      # size of each volume
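For orientation only: since the parameters are delegated to s3backer directly, a volume like the one above roughly corresponds to an invocation of the following shape. The exact command line is generated by the driver, and the paths below are illustrative:

s3backer --region=us-east-1 --blockSize=1M --size=20G \
         --blockCacheFile=/mnt/eph/cache --blockCacheSize=1000 \
         --blockCacheThreads=48 --blockCacheWriteDelay=30 --compress=5 \
         unique-s3-volume-name /var/local
sudo mkfs.ext4 -b 4096 /var/local/file   # only when the volume is first created
sudo mount -o loop /var/local/file /mnt/s3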

EBS#

EBS volumes are used to back PVFS meta or data server storage, or as direct data stores. In general an EBS device is assigned to a single host with a unique device name between xvdi and xvdz to prevent name-space collisions with ephemeral devices. The volumes are mounted in the mount-dir/xvd? directory with chown ownership and chmod permissions. Services such as pvfs-meta and pvfs-data are aware of this strategy and will identify the correct volume for their respective servers.

The relationship between device names and EBS URIs is one-to-one, in order: vol-0f2906eb113803a5c gets /dev/xvdi, vol-03c0c9f239125f55a gets /dev/xvdj, ... and vol-0d1aedd211ea0d859 will come up on /dev/xvdn, regardless of which host they are assigned to.

When a set of EBS volume URIs is specified, they are distributed among the nodes of the group in round-robin fashion. For example, the six volumes below would be assigned to a node group of six members one each; when only a single host is in the node group, all six volumes are attached to that host.
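For a node group of three members, the same six volumes would be split two per host. Which host receives which volume is an implementation detail of the driver; the listing below is purely illustrative:

# node-0: vol-0f2906eb113803a5c (/dev/xvdi), vol-072b4d0f214c93d4b (/dev/xvdl)
# node-1: vol-03c0c9f239125f55a (/dev/xvdj), vol-006f83e77920fc6b8 (/dev/xvdm)
# node-2: vol-05bca035815c6f450 (/dev/xvdk), vol-0d1aedd211ea0d859 (/dev/xvdn)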

[volume ebs://unique-name]            # unique name resource is referenced              
mount-dir: /mnt/ebs                   # where device gets mounted
file-system: ext4                     # file system with options, see `man mkfs` 
                                      # list of volume ID-s 
uri: vol-0f2906eb113803a5c, vol-03c0c9f239125f55a,
     vol-05bca035815c6f450, vol-072b4d0f214c93d4b,
     vol-006f83e77920fc6b8, vol-0d1aedd211ea0d859
chmod: a+rw                           # octal or text permissions
chown: root:root                      # set unix user:group for mounted resource
type: standard                        # io1 | io2 | st1 | sc1  

PVFS#

[volume pvfs://unique-pvfs-id]        # `unique-id` must match with `pvfs-meta` and `pvfs-data` ids
mount-dir: /home                      # mount location
protocol: tcp                         # ib | tcp 
[service pvfs-data://unique-pvfs-id]  # data service for orange fs, referred from [node ...] section
path: /mnt/ebs/data                   # where data is stored, also used in dependency graph calculation 
trove-sync: yes                       # turns on/off data sync
trove-method: directio                # methods: alt-io | directio | nullio |
port: 3344
direct-io-thread-num: 30
direct-io-ops-per-queue: 1
[service pvfs-meta://unique-pvfs-id]
path: /mnt/ebs/meta
# turn on/off sync metadata
trove-sync: yes
tcp-port: 3334
ib-port: 3335
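Once the services are running, the PVFS volume can be sanity-checked from any node with the standard OrangeFS client tools (assuming they are on the PATH; this is not part of the driver itself):

pvfs2-ping -m /home        # verifies the meta and data servers behind the mount
df -h /home                # the mount should report the aggregate capacity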

Nodes#

A node section defines a group of computers with similar properties. Visualize a case where you need 20GB/sec throughput with S3-backed storage and 400 computing cores, some equipped with GPGPU devices for number crunching. Let's see the ingredients:

[node io]
services: pvfs-data://id-001, pvfs-meta://id-002
volumes:  pvfs://id-001
size: 10

[node compute]
volumes: pvfs://id-001
size: 40

The above example is meant to give you an idea. Generally one should first fix the required bandwidth for the IO services, which is often a function of the data storage used. Then determine the computing services (accelerated or general), pick the largest instance with the fastest fabric possible, and match them (100Gb/s with the same) to get the maximum aggregated bandwidth. Run as many IO services as you can! The nodes equipped with local NVMe/SSD drives will act as burst buffers for all MPI-IO data transfers, caching them local to your cluster. This type of setup scales to hundreds of GB/sec throughput backed with massive computational power.
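A back-of-the-envelope sizing sketch for the example above; the per-node rate is an illustrative assumption, not a benchmark:

# target aggregate throughput        : 20 GB/s
# assumed sustained rate per io node : ~2 GB/s   (instance, fabric and storage dependent)
# io nodes needed                    : 20 GB/s / 2 GB/s = 10   -> size: 10 in the io group
# compute cores needed               : 400 cores / cores-per-instance -> size of the compute group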

Cluster#

In each [cluster some_cluster_name] section one can override any parameter found in the [cluster default] section. For instance, the following defines a cluster named small with all the details needed to launch it.

[service pvfs-data://id-001]
...
[service pvfs-meta://id-002]
...

[node io]
services: pvfs-data://id-001 [,...]
volumes:  pvfs://id-001 [, ...]
size: 10
...

[node compute]
size: 200
volumes:  pvfs://id-001 [, ...]
...

[cluster small]
availability_zone: us-east-2a # this zone must be within the region!!!
ami: ami-0f*************dc    # you can find a list to choose from under Private images in the AWS EC2 web console
instance_type: m5d.24xlarge   # pick large and cheap ones with local disks
bid-price: 1.70               # maximum price you are willing to pay for the resource
ipv4:                         # optional
nodes: compute, io            # list of predefined nodes
volumes: ebs://some_vol, pvfs://other_vol, vol-04c************53
services: nfs://id-003        # locally running a service
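With these sections in place the cluster is launched and torn down by name, as described in the script usage section below:

cluster start --name small
cluster terminate --name small     # when done, releases the associated AWS resources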

Finding available images#

Because of the constant development of H5CLUSTER there may be several images to choose from. If images are not clearly tagged with an explanation, it is suggested to choose the most recent cluster image with the highest version number. You can find these images in the AMIs section of the AWS web console, on the left side of the screen. Once this menu item is selected, be sure that Private images is picked instead of the default Public images or Owned by me. Once you have this list, copy-paste the AMI ID into the default section of the cluster configuration file:

[cluster default]
ami: ami-***************      # the AMI ID you chose from the list

This approach guarantees that all of your clusters are launched with the same, most recent stable AMI image. If for some reason there is a specific image compiled for you, or you just want to test/experiment with a different image, you can specify the image for a particular cluster instead of relying on the [cluster default] section:

[cluster default]
ami: ami-*****-default-******
...
[cluster specific_cluster]
ami: ami-*****-specific-*****

or just passing the AMI to script directly:

cluster start --name my-experimental-cluster --ami ami-*****-specific-*****
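If you prefer the command line over the web console, the private images shared with your account can also be listed with the AWS CLI (a sketch; assumes the AWS CLI is installed and configured):

aws ec2 describe-images --executable-users self \
    --query 'sort_by(Images,&CreationDate)[].[ImageId,Name,CreationDate]' --output table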

Creating Your Own Image#

Software packages are centrally managed by vargaconsulting and provided on a shared S3 drive as a SPACK volume. The best and most efficient way to customize your cluster is through spack.
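For example, with the spack volume mounted, an additional package can be built into the stack. The package name below is only an illustration, and whether the shared tree is writable for you depends on the volume's read-only setting:

spack install hdf5 +mpi     # build an extra package and its dependencies
spack find hdf5             # verify it is visible on the cluster nodes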

H5CLUSTER script usage#

The driver script is self-documented and feature rich, so you can take control of AWS resources with minimal knowledge. Once properly installed it is equipped with BASH completion, intelligently fetching parameters from both the configuration file and your AWS EC2 account. Once properly configured you only need to reference the named cluster from the .aws/config configuration file: cluster start --name cluster-name

usage: cluster [-h]
            {start,add,configure,stop,terminate,resume,mount,umount,mkfs}
            ...

cluster builder for AWS EC2 with built in bash completion. For authentication please
see/update `~/.ec2config` or `~/.aws/config`. An example configuration  is  provided
with this distribution.
To create arbitrary complexity cluster, start with a master node, then gradually  add
compute and io nodes to a given slurm partition, finally call configure.

optional arguments:
-h, --help            show this help message and exit

commands:
{start,add,configure,stop,terminate,resume,mount,umount,mkfs}
                        additional help
    start               create/start cluster with given properties
    add                 add nodes to existing cluster
    configure           (re)configures a running cluster
    stop                stops/hibernates running instances 
    terminate           terminate running/stopped cluster instances and free ec2 resources
    resume              resuming hibernating/stopped cluster instances
    mount               mount S3 bucket locally
    umount              unmount a device that already has been mounted
    mkfs                formats S3 bucket

Copyright © <2019-2020> Varga Consulting, Toronto, ON      info@vargaconsulting.ca
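A sketch of that workflow using the sub-commands documented below; flag values are illustrative, and it is assumed here that configure accepts the same --name flag:

cluster start --name my-cluster --no-configure                                     # master node only
cluster add   --name my-cluster --partition-name io      --size 10 --no-configure  # io node group
cluster add   --name my-cluster --partition-name compute --size 40 --no-configure  # compute node group
cluster configure --name my-cluster                                                # wire slurm, volumes and services together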

start#

usage: cluster start [-h] [--region REGION] [--zone ZONE] [--ami AMI]
                    [--volumes NAME [, ...]] [--services NAME [, ...]]
                    [--nodes NODES] [--size N] [--instance-type TYPE]
                    [--bid-price PRICE] [--name NAME]
                    [--tcp-ports PORT [, ...]] [--udp-ports PORT [, ...]]
                    [--ipv4 ADDRESS] [--hostname NAME] [--no-configure]
                    [--behavior PROPERTY] [--dry-run]

optional arguments:
-h, --help              show this help message and exit
--region REGION         aws ec2 region, uses default section of .aws/config when not set
--zone ZONE             aws ec2 availability zone, must be within `region`
--ami AMI               EC2 image id
--volumes NAME [, ...]  optional comma separated volumes listed in `.aws/config`
--services NAME [, ...] optional comma separated services listed in `.aws/config`
--nodes NODES           list of nodes
--size N                when nodes are not specified, creates a homogeneous cluster of this size
--instance-type TYPE    cluster instance type, for instance: m5d.metal
--bid-price PRICE       maximum bidding price in USD
--name NAME             unique name/tag of the cluster to track state
--tcp-ports PORT [, ...] list of external IP ports, SSH required
--udp-ports PORT [, ...] optional list of UDP ports
--ipv4 ADDRESS         elastic IP for master node
--hostname NAME        hostname used for ssh connection
--no-configure         skips configuring cluster, use it with heterogeneous nodes or for new AMI
--behavior PROPERTY    how spot requests are placed `terminate | stop | hibernate`
--dry-run              do test run, skips instantiating nodes...
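An example invocation exercising several of these flags without touching the configuration file; all values are illustrative, and the port list is assumed to follow the comma-separated convention of .aws/config:

cluster start --name scratch --instance-type m5d.xlarge --size 4 \
              --bid-price 0.50 --tcp-ports 22,443 --behavior stop --dry-run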

adding nodes#

usage: cluster add [-h] --name NAME [--ami AMI] [--size N]
                    [--instance-type TYPE] [--bid-price PRICE]
                    [--io-servers INT] [--partition-name NAME] [--no-configure]

optional arguments:
    -h, --help            show this help message and exit
    --name NAME           name/tag of the cluster that was provided when
                        starting cluster
    --ami AMI             EC2 image id
    --size N              number of nodes instantiated
    --instance-type TYPE  cluster instance (default: c3.large)
    --bid-price PRICE     maximum bidding price in USD
    --io-servers INT      number of io servers per node
    --partition-name NAME
                        slurm partition name for nodes, same as cluster-name
                        if not specified
    --no-configure        skips configuring cluster, use it with heterogeneous
                        nodes

stop#

usage: cluster stop [-h] --name NAME

optional arguments:
    -h, --help   show this help message and exit
    --name NAME  name/tag of the cluster that was specified/generated with start

terminate#

Terminate releases all AWS resources associated with the cluster, but not shared resources such as S3 devices or individually managed EBS volumes.

usage: cluster terminate [-h] --name NAME

optional arguments:
    -h, --help   show this help message and exit
    --name NAME  name/tag of the cluster that was specified/generated with start

mkfs#

usage: cluster mkfs s3 [-h] [--name NAME] [--region NAME]
                       [--file-system VALUE] [--block-cache-dir PATH]
                       [--block-cache-size VALUE]
                       [--block-hash-prefix BLOCK_HASH_PREFIX]
                       [--block-cache-threads BLOCK_CACHE_THREADS]
                       [--block-size VALUE] [--compress VALUE]
                       [--volume-size VALUE] [--volume-count VALUE]
                       [--access-type VALUE] [--timeout VALUE]

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           bucket name, same as section name [volume s3block:xyz]
                        within .aws/config
  --region NAME         aws ec2 region name
  --file-system VALUE   name and optional params, example: ext4 -O
                        ^has_journal
  --block-cache-dir PATH
                        local write back cache filename
  --block-cache-size VALUE
                        maximum write back cache size, example: 8G
  --block-hash-prefix BLOCK_HASH_PREFIX
                        prepend random prefixes to block object names
  --block-cache-threads BLOCK_CACHE_THREADS
  --block-size VALUE    block transfer size
  --compress VALUE      gzip 0-9
  --volume-size VALUE   maximum volume size, example: 16T
  --volume-count VALUE  number of disk, defaults to 1
  --access-type VALUE   maybe: private | public_read | public_read_write |
                        authenticated_read
  --timeout VALUE       wait this many seconds to catch up with block device
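For example, formatting the bucket from the configuration-file example earlier; the values mirror that example and should be adjusted for your own bucket:

cluster mkfs s3 --name my-s3-bigdata --region us-east-2 \
                --block-size 1M --block-cache-size 8G --compress 5 \
                --volume-size 20G --volume-count 6 --file-system "ext4 -b 4096"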

mount#

usage: cluster mount [-h] --url NAME --mount-point PATH [--cache-size VALUE]
                    [--rw] [--compress COMPRESS]

optional arguments:
-h, --help           show this help message and exit
--url NAME           example: s3://region.my-bucket/disk-00
--mount-point PATH   absolute path to mount point
--cache-size VALUE   local cache size in number of blocks
--rw                 mount filesystem as read write
--compress COMPRESS  compress newly written data
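For example, inspecting the first disk of a bucket locally and unmounting it afterwards; the URL follows the layout shown for --url above, and the bucket name is illustrative:

cluster mount --url s3://us-east-2.my-s3-bigdata/disk-00 --mount-point /mnt/inspect --cache-size 1000
cluster umount --dir /mnt/inspect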

umount#

usage: cluster umount [-h] --dir NAME

optional arguments:
-h, --help  show this help message and exit
--dir NAME  example: /mnt/spack

removing nodes#

Once the cluster is stopped/hibernated you are charged only for the size of the EBS volume where each node resides. To further drop the cost of dormant clusters, set the behavior property to terminate for compute-only node groups, so the cluster hibernating mechanism can terminate them and free the unused resources.


  1. included in the instance rental fee, instance must have disk attached 

  2. function of the instance type, proportional to the number of disks available 

  3. content lifespan is the same as the instance's; terminating the instance will result in data loss 

  4. running services will require this many dedicated cores 

  5. these services require actual storage: ephemeral, s3, ebs 

  6. sharing bandwidth with MPI interconnect 

  7. has dedicated IO fabric, not sharing bandwidth with MPI 

  8. data can be shared among different clusters and the nodes within them; in an HPC case such as pvfs this may be limited to a SWMR (single writer, multiple readers) pattern