High Performance Computing Resources – Policies – Software

HPC Resources

The UH ITS HPC Cluster is a joint investment between UH and the research community, based on the Condo Compute Model. UH made an initial investment of $1.8 million in the current cluster, which was installed in 2014 with 178 standard compute nodes and 6 large memory nodes. The research community contributes back to this resource by purchasing nodes that are added to the cluster. Upgrades funded by PIs in the UH research community during 2016 and 2017 added 106 dual socket condo nodes, raising the cluster total to 290 nodes and 6168 cores. The researcher-contributed compute nodes are available to the community when not in use by their owners. The goal is to create an efficient and sustainable compute resource for the UH system.

Compute Resources

The UH ITS HPC resource currently consists of a 290 node (6168 core) Cray CS300 compute cluster.

Original equipment:

  • 178 “standard” nodes, each: dual socket Intel Xeon E5-2680 v2 (10 core @ 2.8GHz) and 128 GB of RAM [20 cores per node]
  • 6 “high-memory” nodes, each: four Intel Xeon E5-4640 v2 (10 core @ 2.2GHz) and 1 TB of RAM [40 cores per node]

Additional equipment:

Because PIs purchased condo nodes at different times and for different needs, five distinct types of nodes were added to the cluster:

  • 5 nodes: Intel Xeon E5-2630 v4 (10 core @ 2.20GHz), 128 GB of RAM
  • 33 nodes: Intel Xeon E5-2660 v3 (10 core @ 2.60GHz), 256 GB of RAM
  • 4 nodes: Intel Xeon E5-2660 v3 (10 core @ 2.60GHz), 1 TB of RAM
  • 2 nodes: Intel Xeon E5-2660 v3 (10 core @ 2.60GHz), 128 GB of RAM each, with 2 NVIDIA Tesla K40 GPUs [“community” node]
  • 62 nodes: Intel Xeon E5-2670 v3 (12 core @ 2.30GHz), 128 GB of RAM

The system has a total of 6168 processor cores and over 45 TB of memory. All nodes are connected to each other and to the 600 TB parallel file system by a 40 Gb InfiniBand interconnect.
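The node and core totals can be checked directly from the equipment lists. The following is a minimal sketch of that arithmetic, not an official inventory tool: the table layout and the name `cluster_totals` are illustrative, every condo node is assumed to be dual socket (as stated in the overview), and 1 TB is assumed to mean 1024 GB.

```python
# Node inventory transcribed from the equipment lists:
# (description, node count, sockets per node, cores per socket, RAM in GB per node)
# Assumptions: all condo nodes are dual socket; 1 TB is counted as 1024 GB.
INVENTORY = [
    ("standard, E5-2680 v2", 178, 2, 10, 128),
    ("high-memory, E5-4640 v2", 6, 4, 10, 1024),
    ("condo, E5-2630 v4", 5, 2, 10, 128),
    ("condo, E5-2660 v3, 256 GB", 33, 2, 10, 256),
    ("condo, E5-2660 v3, 1 TB", 4, 2, 10, 1024),
    ("community GPU, E5-2660 v3", 2, 2, 10, 128),
    ("condo, E5-2670 v3", 62, 2, 12, 128),
]


def cluster_totals(inventory):
    """Return (nodes, cores, ram_gb) summed over all node types."""
    nodes = sum(count for _, count, _, _, _ in inventory)
    cores = sum(count * sockets * cps for _, count, sockets, cps, _ in inventory)
    ram_gb = sum(count * ram for _, count, _, _, ram in inventory)
    return nodes, cores, ram_gb


nodes, cores, ram_gb = cluster_totals(INVENTORY)
print(f"{nodes} nodes, {cores} cores, {ram_gb} GB RAM")
```

Under these assumptions the sums reproduce the stated 290 nodes and 6168 cores, and the memory total comes to 50304 GB, which is consistent with "over 45 TB".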

Storage Resources

– Scratch – The compute resource has 600 TB of Lustre file system storage, with 5.2 GB/s write and 7.2 GB/s average read performance.

Network Storage

– An additional 500 TB NetApp storage system is available for more permanent scale-out data storage. Space on it can be purchased as Value Storage at $65/yr per 500 GB.

HPC Policies

HPC Scratch Filesystem Policy

The Lustre scratch filesystem on the UH-HPC cluster is a shared resource.
It is only meant for temporary storage of data.
The scratch filesystem is not backed up.
Users are responsible for backing up their own data.
UH ITS is not responsible for any loss of data.

Below are the details of the purge policy for the Lustre scratch filesystem.

Directory tree subject to purge

  • /lus/scratch/${USER} a.k.a. ~/lus

Types of file system objects subject to purge

  • Regular files
  • Symlinks
  • Block files, Character files, Named pipes and Sockets

Attributes of files to be purged

  • File age (since creation time) > 35 days and (file size > 1 MB or file size == 0 bytes)
  • File age (since creation time) > 120 days and file size >= 1 byte and file size <= 1 MB
  • Frequency of purge: Daily
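The two size and age rules can be read as a single decision on each file. The sketch below is one possible reading, not the actual purge tool: the function name `is_purge_candidate` is hypothetical, 1 MB is assumed to mean 10^6 bytes (the policy does not say), and the caller is assumed to supply the file's age in days since creation.

```python
MB = 1_000_000  # assumption: 1 MB = 10^6 bytes; the policy does not specify


def is_purge_candidate(age_days: float, size_bytes: int) -> bool:
    """Apply the two size/age purge rules to a single file.

    Large files (> 1 MB) and empty files are candidates after 35 days;
    small non-empty files (1 byte to 1 MB) are candidates after 120 days.
    """
    if size_bytes > MB or size_bytes == 0:
        return age_days > 35
    # Remaining case: 1 byte <= size <= 1 MB
    return age_days > 120


# Example: a 2 MB file that is 40 days old, and a 500-byte file of the same age.
print(is_purge_candidate(40, 2 * MB))
print(is_purge_candidate(40, 500))
```

In this reading, the larger file is already eligible for purging at 40 days while the small one is not, so small configuration files or scripts survive longer than bulk data.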

Login Node Policies and Etiquette

The UH ITS HPC Cluster login nodes have two specific purposes: providing ssh shell access to transfer files to and from the cluster, and launching batch and interactive sessions on the compute nodes. Specifically, Globus, sftp, scp, and rsync transfers are allowed, along with launching SLURM jobs (batch and interactive) and modifying text files with a text editor. Everything else should be run on a compute node. The login nodes are a shared resource and are the only access to the cluster for hundreds of users. Running other tasks on the login nodes is therefore not allowed: such tasks will be canceled, and repeat offenders can have their HPC accounts disabled.

HPC Cluster Maintenance

The UH ITS HPC Cluster undergoes regular maintenance to address patching, security and system stability. The first Wednesday of each month that is not a holiday is reserved from 8am-5pm for this maintenance. Although it is rare, jobs running on the cluster during maintenance may have to be stopped and possibly restarted. Users are responsible for being aware of any impact a restart has on their jobs.
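For planning long-running jobs, the maintenance date for a given month can be computed ahead of time. This is a minimal sketch under one reading of the rule: if the first Wednesday is a holiday, maintenance is assumed to move to the next Wednesday in the same month. The function name `maintenance_day` is hypothetical, and the set of holidays must be supplied by the caller because the policy does not list them.

```python
from datetime import date, timedelta


def maintenance_day(year, month, holidays=frozenset()):
    """Return the first Wednesday of the month that is not in `holidays`.

    Assumption: a holiday on the first Wednesday pushes maintenance to the
    following Wednesday of the same month.
    """
    day = date(year, month, 1)
    # Advance to the first Wednesday of the month (Monday is weekday() == 0).
    day += timedelta(days=(2 - day.weekday()) % 7)
    # Skip any Wednesday that falls on a listed holiday.
    while day in holidays:
        day += timedelta(days=7)
    if day.month != month:
        raise ValueError("no non-holiday Wednesday in this month")
    return day


# Example: January 1, 2025 is a Wednesday and a holiday.
print(maintenance_day(2025, 1, holidays={date(2025, 1, 1)}))
```

A job whose expected run time crosses the returned date between 8am and 5pm is the kind that may be stopped and restarted.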

HPC Software