- Summary of Improvements and Changes
- General Resources and Feature constraints
- FAQ & On-boarding material
The ITS-CI team has been hard at work deploying infrastructure to replace the aging proprietary Cray management systems. These new systems will give us the flexibility to sustain the UH HPC into the future by enabling us to deploy new hardware and software as needed. A brief list of the immediate improvements that will be realized when the new systems are deployed follows:
Summary of Improvements and Changes:
- Updated OS: from Cray CentOS 6.x to xCAT CentOS 7.x
- Redundancy of user access resources, such as the login nodes
- Separate data transfer (dtn) and login nodes
- A quota-limited home space for each user (removing the need for ~/apps)
- Tighter login security (Duo two-factor authentication; see https://www.hawaii.edu/its/uhlogin/ )
- Tighter control over what users can and can’t do on common resources (such as the login nodes)
- 100GB connection to the wide area network
- A non-Lustre based scratch storage for user jobs (primary scratch during the planned migration)
- Refresh of the modules list
- Long Term Storage (LTS), a less expensive central storage option for purchase
- Updates and changes to our policies
For reference, “xCAT” is an open source software management system that will soon replace the aging proprietary Cray management infrastructure. The changes are currently planned to take place over the next 4 regularly scheduled monthly maintenance days. The changes to be made each month will be highlighted in the pre-maintenance emails.
The proposed timeline below details the migration of the 187 community nodes from the Cray cluster to the xCAT UH HPC. Condo node owners will be contacted individually to schedule migration of their nodes to the xCAT cluster. As private resources, condo node migrations can be scheduled on any day up to and including July 3. After that day the Cray cluster will no longer be available to users, since Lustre will be offline for upgrade and migration to xCAT.
Community Resource xCAT Migration Timeline
- May 1, 2019: 40 standard nodes and 1 large memory node will be migrated to xCAT
- June 5, 2019: 80 standard nodes, 3 large memory nodes and the community GPU node will be migrated to xCAT
- July 8-12, 2019: All remaining nodes will be moved to the xCAT cluster, Cray user logins will be disabled, and the Lustre file system will be brought down for upgrades
- Q4 of 2019: Lustre is brought back online
The xCAT cluster will be complete at this point with both NAS and Lustre file systems available.
The partitions have been modified on the new cluster compared to the old UH-HPC. Below is a table of the partitions and important settings:
|Partition||Max time||Jobs per user, total (running)||Nodes per job||Default memory per job||Core reserved for OS||Shared||Preemption|
|sandbox||4 hours||N/A||2||512 MB||Yes||Yes||No|
|shared||3 days||N/A||1||512 MB||Yes||Yes||No|
|shared-long||7 days||5(2)||1||512 MB||Yes||Yes||No|
|kill-shared||3 days||N/A||1||512 MB||Yes||Yes||Yes|
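As a rough illustration of how the table translates into a job request, the sketch below writes a batch script for the shared-long partition that asks for the full 7-day limit and raises the memory above the 512 MB default. The file name, job name, memory figure and program name are hypothetical and only serve the example; the #SBATCH options themselves are standard Slurm options.

```shell
#!/bin/bash
# Write a hypothetical batch script targeting the shared-long partition.
cat > long_job.slurm <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=long-demo
# Partition and node count taken from the table (shared-long: 1 node per job).
#SBATCH --partition=shared-long
#SBATCH --nodes=1
# Time format is days-hours:minutes:seconds; 7 days is the partition maximum.
#SBATCH --time=7-00:00:00
# Assumed memory figure; without this line the job gets the 512 MB default.
#SBATCH --mem=8G

# Hypothetical executable.
./my_long_running_program
EOF

# On the cluster, the script would then be submitted with:
#   sbatch long_job.slurm
```

Because the default memory per job is small, it is usually worth stating the memory a job needs explicitly rather than relying on the default.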
Beyond the partition changes, we have also enabled a Slurm feature that, by default, tries to reserve resources (1 CPU core and 2 GB of memory) on each node for the operating system. Although this is the default on every node, a user can override the CPU reservation by passing --core-spec=0 as a parameter to sbatch (batch submission) or srun (interactive).
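A minimal sketch of the override, assuming a single-node job in the shared partition: the snippet writes a small batch script that sets --core-spec=0 and shows, in comments, how it might be submitted or used interactively. The file name, job name, time limit and the partition choices are assumptions made for the example.

```shell
#!/bin/bash
# Write a hypothetical batch script that opts out of the reserved OS core.
cat > corespec_demo.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=corespec-demo
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# Override the default reservation of 1 CPU core for the operating system.
#SBATCH --core-spec=0

echo "Running on $(hostname)"
EOF

# Batch submission on the cluster:
#   sbatch corespec_demo.slurm
# The same option can also be given on the command line, for example for an
# interactive shell (the partition choice here is an assumption):
#   srun --partition=sandbox --core-spec=0 --pty /bin/bash
```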
General Resources and Feature constraints:
The new cluster consolidates public partitions from the old cluster, such as community.q, gpu.q, and lm.q, to create larger pools of resources for a given job submission. This should help maximize utilization. The new cluster is also more heterogeneous in CPU generation, network options, and GPU types. As a result, users will need to use two parameters when defining jobs to tell the scheduler which resources a job requires:
--gres : Specifies a comma-delimited list of generic consumable resources. On the new cluster, this option is used to request access to the GPUs on nodes with GPUs.
--constraint : Nodes have features assigned to them by the administrators. Users can specify which of these features are required by their job using the constraint option. Only nodes having features matching the job constraints will be used to satisfy the request. Multiple constraints may be specified with “&” (AND), “|” (OR), etc.
On the new cluster, this option is used to define which type of network or processor you wish to limit your job to. Users running MPI code will need to set the correct constraint (ib_qdr, eth_25 or eth) to ensure their jobs select nodes with the network they wish to run their code across.
|Nodes||Features|
|node-[0001-0067,0081-0143]||x86, intel, ivy-bridge, ib_qdr|
|lmem-[0001-0005]||x86, intel, ivy-bridge, ib_qdr|
|gpu-[0001-0002]||x86, intel, haswell, nvidia, tesla, kepler, ib_qdr|
|gpu-[0003-0009]||x86, intel, skylake, nvidia, turing, geforce, eth, eth_25|
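The two options can be sketched in batch scripts as follows. The first script pins a two-node MPI job to nodes with the ib_qdr feature; the second requests one GPU on a node that has both the skylake and eth_25 features. The file names, job names, task counts, time limits and program names are hypothetical, and the GRES name "gpu" and the partition choices are assumptions that should be checked against the cluster's actual configuration.

```shell
#!/bin/bash
# Hypothetical MPI job restricted to nodes on the QDR InfiniBand network.
cat > mpi_ib.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-ib-demo
# Assumed partition; sandbox is the one listed with 2 nodes per job.
#SBATCH --partition=sandbox
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
# Only nodes carrying the ib_qdr feature will be selected.
#SBATCH --constraint="ib_qdr"

# Hypothetical MPI executable.
srun ./my_mpi_program
EOF

# Hypothetical GPU job on a Skylake node with 25 Gb Ethernet.
cat > gpu_demo.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-demo
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# Assumes the GPUs are exposed under the GRES name "gpu".
#SBATCH --gres=gpu:1
# "&" means both features must be present on the node.
#SBATCH --constraint="skylake&eth_25"

nvidia-smi
EOF

# On the cluster, the features and GRES of each node can typically be listed with:
#   sinfo -N -o "%N %f %G"
# and the scripts submitted with:
#   sbatch mpi_ib.slurm
#   sbatch gpu_demo.slurm
```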
The new xCAT managed cluster will have new features and will operate with updated policies. The updated policies will apply only to the xCAT cluster and will be enforced beginning May 1, 2019.
The “User Account Management Policy” describes in detail who can use the UH HPC and how the life cycle of cluster user accounts will be managed. [ http://go.hawaii.edu/0SG ]
User quotas and usage of common systems:
The “Resource Management and Usage Policy” details the allowed uses of the disk and compute resources on the new systems. Specifically, it covers login node usage, data transfer node (dtn) usage, OpenVPN usage (experimental), home directories, group directories, scratch filesystem usage and scheduler policies. [ http://go.hawaii.edu/GSY ]
User account security:
The “Access and Security Policy” addresses the use of Duo multi-factor authentication, SSH connection timeouts and password failure lockouts. [ http://go.hawaii.edu/WSG ]
On and off warranty condo nodes:
The “Procurement/purchasing of compute nodes and compute time Policy” describes the life cycle of condo nodes, continued availability of leased nodes and the end of the service unit model for users to purchase time on the cluster. [ http://go.hawaii.edu/GK0 ]
The “Procurement/purchasing of Storage” document describes the paid option for storage on the xCAT cluster called Long Term Storage (LTS). [ http://go.hawaii.edu/YKG ]
FAQ & On-boarding material
Answers to many common questions about the new cluster are collected in the new cluster FAQ. [ http://go.hawaii.edu/jdG ]
The on-boarding slides used as an introduction to the xCAT cluster can be found here: http://go.hawaii.edu/wL