This section is for users with a login account on a running H5CLUSTER cluster.
If you are a system administrator interested in how to manage AWS resources, please follow this link.
Starting up a cluster#
In order to start a cluster you must have the necessary AWS IAM permissions to create/terminate instances, VPNs, and EBS volumes. You can find the details here. Conceptually there are two different cluster types: shared and private.
Shared Cluster The rationale for a shared cluster is to save resources, as in most cases the system is less than fully utilized. Sharing is fun! It not only saves money but also allows you to bug your coworker through `talk`. Speaking of saving resources: be mindful that the driver script has the UNIX spirit in mind. It will not ask for confirmation; instead it executes your request -- whatever it may be. Start with a small, low-cost cluster and scale it up once you see it working.
On the other hand, don't be shy about starting up a cluster. By default there are limits on how many instances may be started, in addition to the per-second billing policy of AWS EC2. If you have started a cluster by accident with the wrong parameters, terminate the script at any point with `ctrl-c`, then issue a cleanup with:

```
h5cluster terminate --name yourcluster
```
Private Cluster is for running experiments that could ruin someone else's day, had they been sharing the same cluster with you. If things didn't quite work out the way you planned, why not just start a new cluster and erase every trace of previous mishaps?
Benefits of a private cluster:
- all disk IO and CPU resources are reserved for you
- administrator (root) access
With Preset Cluster Configuration#
Create a section entry for your cluster in `.aws/config`, similar to the `[cluster default]` section, omitting or overriding the relevant settings.
```
[cluster default]
ami: ami-02xxxxxxxxxxx5e5d
...
# service and volume definitions are omitted for brevity
[service ...]
[volume ...]

[nodes io]
instance_type: m5d.24xlarge
services: pvfs-meta://homedir, pvfs-data://homedir
volumes: s3block://s3-01, s3block://spack, pvfs://homedir
size: 3

[nodes compute]
instance_type: c5.18xlarge
volumes: s3block://spack, pvfs://homedir
size: 10

[cluster mycluster]
instance_type: m5d.2xlarge
nodes: compute, io
```
`cluster start --name mycluster` will create a cluster with `3 + 10 + 1` nodes (3 io nodes, 10 compute nodes, and the master node) on the specified instance types.
It would be tedious to edit `.aws/config` each time you need a different setting. Why not just override the relevant argument? Here is how you do it:

```
cluster start --name mycluster --nodes io-nodes --instance-type m5d.metal
```

The key is to specify `--name clustername` first, so that bash completion can help you with choosing. If you are not using bash completion, there is no preferred order for the arguments.
Connecting with SSH#
SSH connections are supported. In order to connect to a running cluster you must have a properly installed SSH client on your workstation, laptop, tablet, phone, etc., and a reliable internet connection that allows IP traffic through port 22. During the boot-up process a static IP address is attached to the master node. This master node is your home login node; there is also an option to obtain the public IP addresses of all nodes and log in to them directly, but this approach is generally awkward and not recommended.
In the most common case, there already is a running cluster with an Elastic IP attached, possibly with a domain name associated with this IP address. If you are a member of the AWS IAM cluster group, with a matching login account on the workstation you are using, then all you need to do is:

```
ssh master.hdfgroup.org
```

If for some reason this IP address is not known to you and you have AWS console access, you can visit the EC2 page and find out the correct public IP.
xauth issues such as

```
/usr/bin/xauth: error in locking authority file /home/user-name/.Xauthority
```

are a nuisance; to get rid of them, disable `ForwardX11`. One simple way is to edit your `~/.ssh/config`:

```
Host *
    StrictHostKeyChecking no
    LogLevel QUIET
    ForwardX11 no
```

The reason: OrangeFS doesn't support file locking. Having a shared file system as your home directory comes with much convenience, such as all files, including your ssh keys, being distributed among all nodes.
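Disabling host-key checking for every host is a blunt instrument. A narrower variant scopes the same settings to the cluster only (the hostname below is taken from the earlier example; substitute your own):

```
Host master.hdfgroup.org
    StrictHostKeyChecking no
    LogLevel QUIET
    ForwardX11 no
```

All other hosts then keep your SSH client's default, stricter behavior.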
Connecting with Web Browser#
SSH port forwarding#
Forward the remote port to localhost with

```
ssh -L 8080:127.0.0.1:8080 master.your-organization.org
```

Once you have a login prompt, run

```
code-server --auth none
```

then connect to the forwarded port `http://127.0.0.1:8080` on your local host.
Press ctrl+` (control and backtick) to get terminal access, or just pull up the status bar with a mouse swipe. Follow this link for the Linux version of the keyboard shortcuts, read the documentation here, or check out the vim or emacs extensions.
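If you use this forward often, it can be pinned in your SSH client configuration instead of retyping the `-L` flag each time (the host alias `code-master` and the hostname are illustrative):

```
Host code-master
    HostName master.your-organization.org
    LocalForward 8080 127.0.0.1:8080
```

After that, `ssh code-master` sets up the tunnel automatically.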
Centralized login page#
Connecting to master node without static IP address#
When you are working alone and are authorized to start up a cluster, you may choose not to attach an Elastic IP address to H5CLUSTER. For the best experience, add write permission to the `/etc/hosts` file of the workstation you start the cluster from, and make sure the hostname has been filled in in the cluster configuration file:
```
[cluster default]
hostname: master
...
```
Once the script has completed, the `/etc/hosts` file gets updated with the master node's public IP address, allowing you to log in with `ssh master`, or, if you prefer elevated privileges, `ssh root@master`. Please note that the name `master` is an arbitrary choice.
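After a successful start, the appended `/etc/hosts` entry might look like the following (the address is an illustrative placeholder from the documentation range, not a real cluster IP):

```
# appended by the h5cluster driver script -- hypothetical example
203.0.113.10    master
```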
H5CLUSTER is seamlessly integrated with POSIX-class operating systems, therefore secure shell copy (scp) will work out of the box. Depending on the environment you may mount volumes as a remote SSH share and use a graphical user interface.
Your home directory#
All home directories are shared with all connected computing and IO nodes. This has the advantage that packages installed under `.local/bin` become available on any node participating in task execution. The actual data is striped across the running IO servers; think of RAID level 0 data stripes. In addition to the shared directory, a symbolic link `$HOME/scratch` points to a local disk: a low-latency scratch disk private to you. This is where you want to dump temporary data. Similarly, `/tmp` points to the same disk resource and is usually reserved for system processes.
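A common pattern, sketched below, is to stage intermediate files on the node-local scratch disk and copy only the final result back to the shared home directory (the job and file names are made up for illustration):

```shell
#!/bin/sh
# Stage intermediate data on the node-local scratch disk (fast, private),
# then persist only the final artifact on the shared home directory.
SCRATCH="${HOME}/scratch"                  # on H5CLUSTER: symlink to local disk
WORK="${SCRATCH}/myjob.$$"                 # per-job directory; $$ avoids clashes
mkdir -p "${WORK}"
echo "intermediate data" > "${WORK}/part-0"
# ... compute on ${WORK}/part-0 ...
cp "${WORK}/part-0" "${HOME}/myjob-result" # keep only the final result
rm -rf "${WORK}"                           # scratch is disposable
```

Because scratch is private and local, nothing you leave there is visible to other nodes; anything worth keeping must be copied to `$HOME` or an EBS volume.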
If EBS volumes are attached, they become available under the `/mnt/ebs[01-05]` mount points, on the master node only. This storage persists beyond the lifetime of the cluster and is the preferred place to store data sets or custom configuration files.
What to do once logged in#
This documentation doesn't go into the details of how to use a shared distributed environment. If you are looking for training on using a Linux workstation, the SLURM workload manager, or parallel programming paradigms with MPI or CUDA, please contact me for professional services. Having said that, here are some useful tips to help you get going:
- `sinfo --Node --long` gives a quick peek at what is available to you. Don't forget that some of the cores are reserved for system usage.
- `srun -N 3 --ntasks-per-node 40 some-program` allocates 3 nodes, then runs 40 instances of `some-program` on each node, `3*40 = 120` processes in total. All output and input is tied to your console; this is an interactive job submission.
- `sbatch some_slurm_batch_script` will queue the script and run it once resources become available. For details see the SLURM documentation.
- `scancel -u your-user-name` removes all of your jobs from the queue; super handy when things don't go well.
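As an illustration, a minimal batch script for `sbatch` could look like the following (the job name, resource counts, and program name are placeholders, not part of the cluster setup):

```
#!/bin/bash
#SBATCH --job-name=example      # shows up in squeue output
#SBATCH --nodes=2               # number of nodes to allocate
#SBATCH --ntasks-per-node=4     # tasks (e.g. MPI ranks) per node
#SBATCH --time=00:10:00         # wall-clock limit

srun ./some-program             # launched as 2*4 = 8 tasks
```

Submit it with `sbatch example.sh` and watch its progress with `squeue -u your-user-name`.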
In addition to the above, you will find a basic MPI example in your home directory; please do

```
ln -s .local/share/mpi-example
cd mpi-example && make
```

Once compiled and linked, you can schedule the MPI program with

```
sinfo --Node --long
srun -N 10 --ntasks-per-node 10 ior -a MPIIO -b 1MB -t 1MB -s 30 -c
```
```
cluster status --name mycluster
--------------------------------------------------------------------------------
                                   H5CLUSTER
--------------------------------------------------------------------------------
VPC id: vpc-049f883cf83bee9e9
node    instance    group    state    bid    IPv4             eph  ebs  pvfs  s3block
-------------------------------------------------------------------------------------
node00  m5d.large   io       running  0.10   188.8.131.52
node01  m5d.large   io       running  0.10   184.108.40.206
node02  m5d.large   io       running  0.10   220.127.116.11
master  c5.large    master   running  0.10   18.104.22.168
node03  c5.large    compute  running  0.12   22.214.171.124
node04  c5.large    compute  running  0.12   126.96.36.199
node05  c5.large    compute  running  0.12   188.8.131.52
node06  c5.large    compute  running  0.12   184.108.40.206
node07  c5.large    compute  running  0.12   220.127.116.11
```
Spack and Environment#
The base OS, Ubuntu 18.04, comes with minimal software installed to keep the image size small. Instead of relying on the Canonical or Debian repositories, a more advanced package manager is used. On this page you can read about Spack.
Spack has a shared, read-only installation of the major packages and versions.
`spack load gcc@10.1.0` loads a specific version into the environment:

```
gcc (Spack GCC) 10.1.0
```

`spack unload gcc@10.1.0` removes it from the environment.