Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.
Ganglia is auto-configured by Y-HPC.
NFS
When applied to HPC clusters, NFS allows data placed on the cluster server to be made immediately available to all of the nodes. This is an excellent use of NFS. However, NFS may incur a fairly heavy overhead when applied to large systems, depending on configuration and use. You may wish to consider a parallel, virtual file system such as PVFS2 to logically combine all local node drives into a cluster-wide file system. This is a lower-overhead, faster means of sharing data across computational nodes.
Set up NFS to share /home from the head node to the compute nodes.
- On the head node:
chkconfig nfs on [ENTER]
chkconfig nfslock on [ENTER]
chkconfig portmap on [ENTER]
- Edit /etc/exports and add:
/home {Subnet}/{Subnet Mask}(rw,sync,no_root_squash)
Here {Subnet} is the network address of the head node's subnet. For example, if the head node's IP address is 192.168.10.1 on a class C network, the subnet is 192.168.10.0. The {Subnet Mask} depends on your network class; for a class C network it is 255.255.255.0.
- And now:
service portmap start [ENTER]
service nfs start [ENTER]
- Chroot into the node image for the next 2 steps:
chroot /var/lib/yhpc/images/{Name of Image}/ [ENTER]
- Turn on nfslock and portmap in the node image (before imaging nodes):
chkconfig nfslock on [ENTER]
chkconfig portmap on [ENTER]
- Edit the node image’s fstab, located at /etc/fstab (as you are in a chroot environment), and add:
{IP Address of Headnode}:/home /home nfs defaults 0 0
- Exit the chroot:
exit [ENTER]
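The {Subnet} value in the /etc/exports step can be derived from the head node's IP address and subnet mask by ANDing the two octet by octet. The following is a minimal sketch of that arithmetic; the function names network_address, exports_line, and fstab_line are hypothetical helpers introduced here for illustration and are not part of Y-HPC. The functions only print text, so nothing is written to /etc.

```shell
# network_address IP MASK
# Prints the network address: each octet of IP ANDed with the same octet of MASK.
network_address() {
    ip=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $ip $mask
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# exports_line HEADNODE_IP MASK
# Prints the /etc/exports entry for /home as described in the steps above.
exports_line() {
    printf '/home %s/%s(rw,sync,no_root_squash)\n' \
        "$(network_address "$1" "$2")" "$2"
}

# fstab_line HEADNODE_IP
# Prints the matching /etc/fstab entry for the node image.
fstab_line() {
    printf '%s:/home /home nfs defaults 0 0\n' "$1"
}
```

For a head node at 192.168.10.1 with mask 255.255.255.0, exports_line prints the entry with 192.168.10.0 as the subnet, and fstab_line 192.168.10.1 prints the line to add inside the chroot.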
NIS Server Configuration
The Network Information Service (NIS) enables shared configuration across multiple machines, e.g., users, groups, passwords, and hostnames.
- Install ypbind, ypserv, yp-tools from your distribution's repository:
yum install ypbind ypserv yp-tools [ENTER]
- Edit /etc/sysconfig/network and add the line:
NISDOMAIN=[your NIS domain name]
... to maintain a decent level of security, keep this from being immediately guessable, e.g., do not use "nis", "domain", or [your-company-name].
- Edit the ‘all’ section in /var/yp/Makefile in order to specify what configuration files to share via NIS. A good minimal set is passwd, group, shadow and hosts.
all: passwd group shadow hosts
- Make it all permanent:
chkconfig portmap on [ENTER]
chkconfig ypserv on [ENTER]
chkconfig yppasswdd on [ENTER]
service portmap start [ENTER]
service ypserv start [ENTER]
service yppasswdd start [ENTER]
- Build yp’s database:
cd /var/yp ; make [ENTER]
NIS Client Configuration
This is most easily accomplished by chrooting into the node image before imaging, completing the following steps, and exiting the chroot before imaging the cluster:
- Add the following to /etc/yp.conf:
domain [nis-domain] server [nis server IP]
ypserver [nis server IP]
- Edit /etc/sysconfig/network and add the line (same as on server):
NISDOMAIN=[your nis domain name]
- Set /etc/nsswitch.conf to use NIS for user authentication by adding ‘nis’ to these lines:
passwd: nis files
shadow: nis files
group: nis files
hosts: nis files dns
- Set the NIS client to start automatically:
chkconfig ypbind on [ENTER]
chkconfig portmap on [ENTER]
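Before imaging the nodes, it can be useful to confirm that the node image's nsswitch.conf really lists nis for the four databases named above. The following is a minimal sketch of such a check; nis_enabled and check_nsswitch are hypothetical helper names introduced here, and the sketch assumes a simple file with one "database: sources" entry per line.

```shell
# nis_enabled DATABASE FILE
# Succeeds if the DATABASE entry in an nsswitch-format FILE lists "nis".
nis_enabled() {
    awk -v db="$1:" '
        $1 == db { for (i = 2; i <= NF; i++) if ($i == "nis") found = 1 }
        END { exit !found }
    ' "$2"
}

# check_nsswitch FILE
# Checks the four databases used in this guide and reports any without nis.
check_nsswitch() {
    status=0
    for db in passwd shadow group hosts; do
        if ! nis_enabled "$db" "$1"; then
            echo "nis is not listed for $db in $1" >&2
            status=1
        fi
    done
    return $status
}
```

From the head node, this could be run against the image's copy of the file, for example check_nsswitch /var/lib/yhpc/images/{Name of Image}/etc/nsswitch.conf.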
OpenMPI
The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
- Configure ssh keys for all users who will use openmpi.
- Install openmpi on the Headnode, as follows:
yum install openmpi [ENTER]
- Add /opt/openmpi.ppc/lib and /opt/openmpi.ppc64/lib or equivalent lib dirs for your system to the library path for your users.
- Add /opt/openmpi.ppc/bin/ or the equivalent bin dir for your system to the PATH for all users who will use openmpi.
- Build a newline-separated list of hosts in /opt/openmpi.ppc/etc/openmpi-default-hostfile.
- Run mpi jobs using the mpirun command:
mpirun [flags] [application] [ENTER]
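The path and hostfile steps above can be sketched as two small shell functions. This is an illustration under assumptions, not the Y-HPC procedure: openmpi_env and write_hostfile are hypothetical names, and the /opt/openmpi.ppc prefix is taken from the steps above and may differ on your system.

```shell
# openmpi_env PREFIX
# Prepends PREFIX/bin to PATH and PREFIX/lib to LD_LIBRARY_PATH
# for the current shell session.
openmpi_env() {
    prefix=$1
    PATH="$prefix/bin:$PATH"
    # Only add a separating colon if LD_LIBRARY_PATH is already set.
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# write_hostfile FILE HOST...
# Writes the given host names to FILE, one per line.
write_hostfile() {
    hostfile=$1
    shift
    printf '%s\n' "$@" > "$hostfile"
}
```

A user might call openmpi_env /opt/openmpi.ppc from a shell startup file, and an administrator might run write_hostfile /opt/openmpi.ppc/etc/openmpi-default-hostfile node1 node2 node3. A job could then be started with something like mpirun -np 4 ./my_app, where my_app stands in for your own MPI program.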
TORQUE Overview and Preparation
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Department of Energy, Sandia, PNL, University of Buffalo, TeraGrid, and many other leading edge HPC organizations.
This guide assumes you are using NFS to mount /home. For other configurations, refer to the Cluster Resources Quick Start Tutorial.
You must have a working /etc/hosts file on the headnode before continuing. You must also have a working /etc/hosts file in the compute node image or NIS configured for user account and hosts distribution from the head node to the compute nodes. A sample /etc/hosts file is given here:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
192.168.200.1 node1 ppc64-000D939C8EFA
192.168.200.2 node2 ppc64-000D939BFAA0
192.168.200.3 node3 ppc64-000D939C0E3A
To test your /etc/hosts file:
hostname -a [ENTER]
... must return a valid response on both the headnode and the compute nodes before you continue.
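As a complement to the hostname test, the hosts file in a node image can be inspected directly before the nodes are imaged. The following is a minimal sketch; hosts_has_name is a hypothetical helper introduced here, and it only looks for the name among the host names and aliases on each non-comment line.

```shell
# hosts_has_name NAME FILE
# Succeeds if NAME appears as a host name or alias in a hosts-format FILE.
hosts_has_name() {
    awk -v name="$1" '
        $1 ~ /^#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$2"
}
```

For example, hosts_has_name node1 /var/lib/yhpc/images/{Image Name}/etc/hosts succeeds only if node1 is listed in the image's hosts file.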
TORQUE Server Configuration
Due to the TORQUE license, the RPMs are not included with Y-HPC. They may be downloaded from the public YDL mirrors and installed as follows:
- (RPM download and installation instructions forthcoming)
- Add a list of all nodes by hostname to /var/torque/server_priv/nodes where each node is denoted by an IP address or resolvable DNS name (such as node1, node2, ... node16).
- Start the TORQUE Server services:
service pbs_sched start [ENTER]
service pbs_server restart [ENTER]
- If you desire to include the head node in computation, you must restart the TORQUE Client service on the head node in order to reconnect to the TORQUE Server:
service pbs_mom restart [ENTER]
- Configure TORQUE to start by default at boot:
chkconfig pbs_sched on [ENTER]
chkconfig pbs_server on [ENTER]
If using your headnode as a compute node, also run:
chkconfig pbs_mom on [ENTER]
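When nodes follow a regular naming scheme such as node1 through node16, the nodes file can be generated rather than typed by hand. The following is a minimal sketch assuming that naming scheme; write_nodes_file is a hypothetical helper, and it overwrites the target file, so it should be pointed at a scratch file first.

```shell
# write_nodes_file FILE COUNT
# Writes node1 .. nodeCOUNT to FILE, one name per line (FILE is overwritten).
write_nodes_file() {
    nodes_file=$1
    count=$2
    i=1
    : > "$nodes_file"
    while [ "$i" -le "$count" ]; do
        printf 'node%d\n' "$i" >> "$nodes_file"
        i=$((i + 1))
    done
}
```

For a 16-node cluster, write_nodes_file /var/torque/server_priv/nodes 16 produces the list described above; each name must still resolve through /etc/hosts or NIS.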
TORQUE Client Configuration
The TORQUE Client RPMs are installed in the default node image. TORQUE Client Configuration is most easily accomplished before imaging the nodes through chroot to the node image located at /var/lib/yhpc/images/{Image Name}.
- Add the following line to /var/torque/mom_priv/config:
$usecp {IP address of headnode}:/home /home
- Configure the TORQUE Client to automatically start on boot:
chkconfig pbs_mom on [ENTER]
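If the node image is prepared by a script, the $usecp line should be added only once even when the script is run repeatedly. The following is a minimal sketch of such an idempotent append; add_usecp is a hypothetical helper name, and the config path is passed in so the function can be tried on a scratch file first.

```shell
# add_usecp CONFIG HEADNODE_IP
# Appends the $usecp line for /home to CONFIG unless it is already present.
add_usecp() {
    config=$1
    headnode=$2
    # The backslash keeps "$usecp" literal; it is a TORQUE directive, not a shell variable.
    line="\$usecp ${headnode}:/home /home"
    grep -qxF -e "$line" "$config" 2>/dev/null || printf '%s\n' "$line" >> "$config"
}
```

From the head node, this might be run as add_usecp /var/lib/yhpc/images/{Image Name}/var/torque/mom_priv/config 192.168.10.1, using your own head node address in place of the example IP.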
Moab
Moab Cluster Suite is a professional cluster workload management solution that integrates the scheduling, managing, monitoring and reporting of cluster workloads. Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments. Its task-oriented management and the industry’s most flexible policy engine ensure service levels are delivered and workload is processed faster. This enables organizations to accomplish more work resulting in improved cluster ROI.
Moab Cluster Suite(R) eliminates the complexity of cluster management so users with limited experience can benefit from the power of a cluster. Through Moab, Y-HPC users submit and manage workload via the Web interface. Moab empowers the cluster Admin to schedule, report, and monitor jobs; visually track architecture, OS, status, and CPU utilization; and reserve nodes, bring them on-line or off-line, and power-cycle them through a single desktop Java application. Both user and admin tools function with OSX, Windows, and Linux desktops with little to no training.
Moab is available through Terra Soft Solutions and may be pre-installed and pre-configured with the purchase of a complete cluster from Terra Soft.




