Using H5CLUSTER#

This section is for users with a login account on a running H5CLUSTER cluster.

If you are a system administrator and you are interested in how to manage AWS resources, please follow this link.

Starting up a cluster#

In order to start a cluster you must have the necessary AWS IAM permissions to create/terminate instances, VPCs, and EBS volumes. You can find the details here. Conceptually there are two different cluster types: shared and private.

Shared Cluster: The rationale for a shared cluster is to save resources, as in most cases the system is less than fully utilized. Sharing is fun! It not only saves money but also lets you bug your coworker through talk. Speaking of saving resources: be mindful that the driver script has the UNIX spirit in mind. It will not ask for confirmation; it simply executes your request -- whatever it may be. Start with a small, low-cost cluster and scale it up once you see it working.

On the other hand, don't be shy about starting up a cluster. By default there are limits on how many instances may be started, in addition to the per-second billing policy used by AWS EC2. If you have started a cluster by accident with the wrong parameters, terminate the script at any point with SIGINT or Ctrl-C, then issue a cleanup with h5cluster terminate --name yourcluster.
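For example, a typical cleanup after an accidental start looks like this (the cluster name is whatever you passed at start time):

# press Ctrl-C (SIGINT) to stop the driver script, then tear down what was created
h5cluster terminate --name yourcluster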

Private Cluster: A private cluster is for running experiments that may ruin someone else's day had they been sharing the same cluster with you. If things didn't quite work out the way you planned, why not just start a new cluster and erase every trace of previous mishaps?

Benefits of a private cluster:

With Preset Cluster Configuration#

Create a section entry for your cluster in .aws/config, similar to the [cluster default] section, omitting or overriding the relevant fields:

[cluster default]
ami: ami-02xxxxxxxxxxx5e5d
...

# service and volume definitions are omitted for brevity
[service ...]
[volume ...]

[nodes io]
instance_type: m5d.24xlarge
services: pvfs-meta://homedir, pvfs-data://homedir
volumes: s3block://s3-01, s3block://spack, pvfs://homedir
size: 3

[nodes compute]
instance_type: c5.18xlarge
volumes: s3block://spack, pvfs://homedir
size: 10

[cluster mycluster]
instance_type: m5d.2xlarge
nodes: compute, io

Then cluster start --name mycluster will create a cluster with 3 + 10 + 1 nodes (io, compute, and master) on the specified instance types.
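For example, with the preset above, starting and then checking the cluster might look like this (the status subcommand is shown in the Status section below):

# start the cluster defined by the [cluster mycluster] preset
cluster start --name mycluster
# verify that all 14 nodes (3 io + 10 compute + 1 master) came up
cluster status --name mycluster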

Manual Drive#

It would be boring to edit .aws/config each time you need a different setting. Why not just override the relevant argument? Here is how you do it: cluster start --name mycluster --nodes io-nodes --instance-type m5d.metal. The key is to specify --name clustername first, so that bash completion can help you choose the remaining values. If you are not using bash completion, there is no preferred order for the arguments.
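The same invocation spelled out with comments; the node group and instance type are just the values used in the example above:

# --name comes first so bash completion can suggest values for the remaining flags
cluster start --name mycluster --nodes io-nodes --instance-type m5d.metal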

Connecting with SSH#

SSH connections are supported. In order to connect to a running cluster you must have a properly installed SSH client on your workstation, laptop, tablet, phone, etc., and a working, reliable internet connection that allows IP traffic through port 22. During the boot-up process a static IP address is attached to the master node. This master node is your login node; there is an option to obtain the public IP addresses of all nodes and log in to them directly, but this approach is generally awkward and not recommended.

In the most common case there is already a running cluster with an Elastic IP attached, possibly with a domain name associated with this IP address. If you are a member of the AWS IAM cluster group and have a matching login account on the workstation you are using, then all you need to do from a terminal (e.g. GNOME Terminal) is ssh master.hdfgroup.org. If for some reason this IP address is not known to you and you have AWS console access, you can visit the EC2 page and find out the correct public IP.
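A Host entry in your local ~/.ssh/config saves some typing. The alias and user name below are illustrative; the host name is the example used above:

Host h5
    HostName master.hdfgroup.org
    User your-login-name

After that, ssh h5 is all you need.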

Known problems: xauth issues such as /usr/bin/xauth: error in locking authority file /home/user-name/.Xauthority are a nuisance; to get rid of them, disable ForwardX11. One simple way is to edit your .ssh/config file:

Host *
    StrictHostKeyChecking no
    LogLevel QUIET
    ForwardX11 no

Reason: OrangeFS doesn't support file locking. Having a shared file system as your home directory comes with much convenience: all files, including SSH keys, are distributed among all nodes.

Connecting with Web Browser#

SSH port forwarding#

Forward the remote port to localhost with ssh -L 8080:127.0.0.1:8080 master.your-organization.org. Once you have a login prompt, run code-server --auth none, then connect to the forwarded port http://127.0.0.1:8080 on your local host to get Visual Studio Code in your browser. Press ctrl+` (control and backtick) to get terminal access, or just pull up the status bar with a mouse swipe. Follow this link for the Linux version of the keyboard shortcuts, read the documentation here, check out the vim or emacs extensions, or use vi or emacs directly from the terminal.
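The whole round trip, step by step (the host name is the placeholder used above):

# on your workstation: forward local port 8080 to port 8080 on the master node
ssh -L 8080:127.0.0.1:8080 master.your-organization.org
# on the master node, once logged in:
code-server --auth none
# then open http://127.0.0.1:8080 in a browser on your workstation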

Centralized login page#

Connecting to master node without static IP address#

When you are working alone and are authorized to start up a cluster, you may choose not to attach an Elastic IP address to H5CLUSTER. For the best experience, add write permission to the /etc/hosts file of the workstation you start the cluster from, and make sure the hostname has been filled in in the cluster configuration file:

[cluster default]
hostname: master
...

Once the script has completed, the /etc/hosts file gets updated with the master node's public IP address, allowing you to log in with ssh master or, if you prefer elevated privileges, ssh root@master. Please note the name master is an arbitrary choice, i.e. it could just as well be my-cluster.
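The resulting entry looks roughly like this (the IP address is illustrative and changes with every cluster start):

# added to /etc/hosts on your workstation by the start-up script
34.230.54.85    master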

Copying files#

H5CLUSTER is seamlessly integrated with POSIX-class operating systems, therefore secure shell copy (scp) will work out of the box. Depending on your environment you may also mount volumes as a remote SSH share and use a graphical user interface.
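A couple of sketches, assuming scp and sshfs are installed on your machine and the master node resolves as master (host names and file names are illustrative):

# copy a file into your shared home directory on the cluster
scp my-dataset.h5 master:~/
# or mount the remote home directory over SSH and browse it with a GUI file manager
mkdir -p ~/h5cluster-home
sshfs master:/home/$USER ~/h5cluster-home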

TODO: expand...

Your home directory#

All home directories are shared with all connected compute and IO nodes. This has the advantage that packages installed under .local/bin become available on any node participating in task execution. The actual data is striped across the running IO servers; think of RAID level 0 data stripes.

In addition to the shared directory, a symbolic link $HOME/scratch points to a local disk: a low-latency scratch disk private to you. This is where you want to dump temporary data. Similarly, /tmp points to the same disk resource and is usually reserved for system processes. If EBS volumes are attached, they become available under the /mnt/ebs[01-05] mount points on the master node only. This storage survives the lifetime of a cluster and is the preferred place to store data sets or custom configuration files.
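In shell terms, assuming the layout described above (the file name is illustrative):

# temporary, node-local data goes to the low-latency scratch disk
cd $HOME/scratch
# long-lived data sets and configuration belong on an attached EBS volume,
# visible on the master node only under /mnt/ebs01 ... /mnt/ebs05
cp ~/results.h5 /mnt/ebs01/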

What to do once logged in#

This documentation doesn't go into detail on how to use a shared, distributed environment. If you are looking for training on using a Linux workstation, the SLURM workload manager, or parallel programming paradigms with MPI or CUDA, please contact me for professional services. Having said that, here are some useful tips to help you get going:

In addition to the above, you will find a basic MPI example in your home directory: run ln -s .local/share/mpi-example, then cd mpi-example && make. Once compiled and linked, you can schedule the MPI program with srun, as shown below.
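The full sequence, assuming the example lives at the path mentioned above (the executable name is a guess; check the Makefile output):

ln -s .local/share/mpi-example
cd mpi-example && make
# schedule the resulting binary with SLURM; node and task counts are illustrative
srun -N 2 --ntasks-per-node 4 ./mpi-example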

Performance#

# show the state of every node as SLURM sees it
sinfo --Node --long
# IOR parallel I/O benchmark over MPI-IO: 1 MB blocks and transfers, 30 segments, collective I/O
srun -N 10 --ntasks-per-node 10 ior -a MPIIO -b 1MB -t 1MB -s 30 -c

Status#

cluster status --name mycluster

--------------------------------------------------------------------------------
                                 H5CLUSTER                                      
--------------------------------------------------------------------------------

VPC id: vpc-049f883cf83bee9e9

node       instance    group    state   bid            IPv4    eph ebs pvfs s3block
-----------------------------------------------------------    ---------------------
node00    m5d.large       io  running  0.10  35.175.252.153                     
node01    m5d.large       io  running  0.10     3.87.25.180                     
node02    m5d.large       io  running  0.10  34.228.239.217                     
master     c5.large   master  running  0.10    34.230.54.85                     
node03     c5.large  compute  running  0.12     3.81.75.168                     
node04     c5.large  compute  running  0.12     54.81.19.38                     
node05     c5.large  compute  running  0.12     52.91.36.47                     
node06     c5.large  compute  running  0.12   52.90.130.122                     
node07     c5.large  compute  running  0.12   18.232.154.89 

Spack and Environment#

The base OS, Ubuntu 18.04, comes with minimal software installed to keep the image size small. Instead of relying on the Canonical or Debian repositories, a more advanced package manager, Spack, is used. You can read more about Spack on this page.

Currently Spack provides a shared, read-only installation of the major packages and versions.
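A minimal sketch of everyday Spack usage against that shared tree (the package name is illustrative):

# list packages available in the shared, read-only installation
spack find
# make a package and its dependencies visible in your current shell
spack load hdf5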