Technical Details

This section offers a comprehensive description of the components and operation of the Advanced Computing Center for Research and Education (ACCRE) compute cluster.

A less technical summary suitable for grant proposals and publications is also available.

ACCRE Users By the Numbers

(As of February 2024)

2,750 Vanderbilt Researchers
9 Vanderbilt Schools and Colleges Served
350 Campus Departments and Centers

ACCRE Cluster Details

750 Compute Nodes
95 Gateway Nodes
80 GPU Compute Nodes with approximately 320 GPU cards
20,100 Total CPU cores
199 TB of Total RAM
4.7 PB Storage Space Available through PanFS
2.0 PB Storage Space Available through AuriStor
1.8 PB Storage Space Available through DORS/GPFS
27 PB Storage Space Available through LStore

Details of the Cluster Design

The cluster's design incorporates substantial input from investigators using it for research and education. Decisions about the number of nodes, memory per node, and disk space for user data storage are driven by demand. The system's schematic diagram and a glossary of terms used in the cluster's description follow:

Bandwidth: The rate of data transfer over the network, typically measured in Megabits/sec (Mbps) or Gigabits/sec (Gbps).
Compute Node: A node dedicated to running user applications. Direct access to these machines by users is restricted, with access granted only through the job scheduler.
Disk Server: A machine utilized for data storage. Normal user access to these machines is limited. More information is available in the ACCRE Storage Systems description.
Gateway or Management Node: Computers designed for interactive login use. Users log in to these machines for tasks such as compiling, editing, debugging programs, and job submission.
Gigabit Ethernet: Standard networking in servers, offering a bandwidth of 1 Gbps and latency of up to 150 microseconds in Linux environments. In contrast, 10 Gigabit Ethernet is a high-performance networking solution with a bandwidth of 10 Gbps.
Latency: The time required to prepare data for transmission over the network.
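To make the relationship between bandwidth and latency concrete, the following minimal sketch estimates transfer time as latency plus size divided by bandwidth. This is a simplified model that ignores protocol overhead and congestion. The function name and the message sizes are illustrative assumptions; only the 1 Gbps, 10 Gbps, and 150 microsecond figures come from the glossary above.

```python
def transfer_time_seconds(size_bytes: float, bandwidth_gbps: float,
                          latency_us: float = 0.0) -> float:
    """Estimate one-way transfer time as latency + size / bandwidth."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth_gbps must be positive")
    size_bits = size_bytes * 8
    return latency_us * 1e-6 + size_bits / (bandwidth_gbps * 1e9)


# A 1 KB message on Gigabit Ethernet: the 150 microsecond latency dominates.
small = transfer_time_seconds(1_000, bandwidth_gbps=1, latency_us=150)

# A 1 GB file: bandwidth dominates, so the 10 Gbps link is roughly 10x faster.
# (The same latency is assumed for both links purely for simplicity.)
large_1g = transfer_time_seconds(1e9, bandwidth_gbps=1, latency_us=150)
large_10g = transfer_time_seconds(1e9, bandwidth_gbps=10, latency_us=150)

print(f"1 KB at 1 Gbps:  {small * 1e6:.0f} microseconds")
print(f"1 GB at 1 Gbps:  {large_1g:.2f} s")
print(f"1 GB at 10 Gbps: {large_10g:.2f} s")
```

Under this model, small messages are limited mainly by latency and large transfers mainly by bandwidth, which is why both figures matter when choosing a network.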

Cluster Connectivity

ACCRE hardware resources span three data center rooms and must also connect to outside resources across the campus and across the globe. The ACCRE network infrastructure must balance the flexibility and limited restrictions needed to facilitate a wide variety of research against the need for security and reliability. It must also balance performance demands for high throughput and low latency against the need for cost-effective solutions. To meet these goals, ACCRE operates multiple networks: one that is externally connected and supports both IPv4 and IPv6; a private one that is internal to the ACCRE cluster and associated resources; a highly isolated, limited-access management network; and a few smaller specialized networks, including ones for low-latency RoCE RDMA.

ACCRE systems operate in a “Science DMZ” outside the normal campus firewall and change management, which would otherwise be a barrier to our users’ research computing. Recently, ACCRE worked with the VUIT Network Design & Engineering team to expand the campus’s perimeter connectivity to high-speed research networks (such as ESnet) to 400 Gigabit, and work is underway to extend that level of external connectivity to the ACCRE cluster. Connectivity to ACCRE nodes ranges from 1 Gigabit to dual 100 Gigabit links. The core of the ACCRE internal network uses a multi-path spine-leaf topology for performance and fault tolerance, with connectivity between points of up to 1,200 Gigabit (1.2 Tbps) in bandwidth.
Node Installation, Maintenance Updates, and Health Monitoring

ACCRE works to make Operating System (OS) installation and configuration highly automated for scalability and standardization. Our OS provisioning framework is built on technology that includes PXE boot, Kickstart, and Ansible. We also manage our own local package repositories, both for performance at scale and so that we can apply our own enterprise change management process to software upgrades rather than leaving software version availability subject to public upstream decisions and timing.

Ansible is also used after initial provisioning to automate updates and configuration changes across the 1000+ ACCRE managed servers. For reliability we use a multi-environment (Development→QA→Production) workflow with versioned releases and enterprise change management. By using these tools and ACCRE workflows it becomes possible to efficiently provision large batches of new servers and keep a fleet of 1000+ servers updated and standardized for reliability and security.
The health of the compute nodes is monitored with the open-source package Nagios, which supports distributed monitoring. Distributed monitoring allows data from individual management nodes to be collected on a single machine for viewing, analysis, and problem notification. We have a 24/7 on-call rotation, and Nagios is integrated with a notification system that emails, texts (SMS), and voice calls the on-call person if high-urgency health problems are detected. To complement Nagios, which specializes in continuous PASS/WARN/FAIL monitoring of anticipated potential problems, we also use an Elasticsearch+Kibana data collection, analysis, and visualization system. ACCRE uses Elasticsearch+Kibana to graph hundreds of different performance metrics to look for trends and patterns, and also for distributed log collection, search, and analysis. This system helps detect and prevent health problems whose symptoms cannot be simplified into a PASS/FAIL status. It also provides powerful troubleshooting tools when problems occur, particularly for quickly extracting and analyzing obscure and specialized data outside of what is commonly monitored proactively.
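As an illustration of the PASS/WARN/FAIL style of check described above, here is a minimal sketch of a Nagios-style health check. Nagios plugins report status through exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line message. The metric (load per core), the thresholds, and the function names are hypothetical and are not taken from ACCRE's actual checks.

```python
import os
from typing import Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_load(load_per_core: float, warn: float = 1.0,
                  crit: float = 2.0) -> Tuple[int, str]:
    """Map a load-per-core value to a Nagios status code and message."""
    if load_per_core < 0:
        return UNKNOWN, "UNKNOWN - invalid load value"
    if load_per_core >= crit:
        return CRITICAL, f"CRITICAL - load per core {load_per_core:.2f}"
    if load_per_core >= warn:
        return WARNING, f"WARNING - load per core {load_per_core:.2f}"
    return OK, f"OK - load per core {load_per_core:.2f}"


def check_local_node() -> Tuple[int, str]:
    """Check the 1-minute load average of the machine running this script."""
    one_minute_load = os.getloadavg()[0]  # available on Unix-like systems only
    cores = os.cpu_count() or 1
    return classify_load(one_minute_load / cores)


# Demonstrate the three main states with sample values (hypothetical thresholds).
for sample in (0.4, 1.3, 2.5):
    code, message = classify_load(sample)
    print(code, message)
```

When used as an actual plugin, the script would print the message from `check_local_node()` and exit with the returned code so that Nagios can act on it.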
Lmod Modules for Research Software

Beyond the OS packages installed locally on ACCRE nodes, we also maintain a large suite of software modules commonly used in research computing that individual users can load dynamically with the open-source Lmod framework and tools. Many of these software modules are available in several different versions and are built multiple ways to optimize performance for the different classes of hardware found at ACCRE. Software available includes GCC and Intel compilers, with support for multithreading and MPI libraries. A comprehensive list of applications spanning various research areas is available. ACCRE will consider user requests for adding new Lmod modules for software that can benefit many users. ACCRE staff can also provide guidance to users installing software themselves in their personal ACCRE storage space or their group’s storage space.
GPU Cluster

The GPU cluster consists of approximately 80 GPU nodes, each equipped with two, four, or eight NVIDIA GPUs. These nodes are prioritized for research groups that contributed to their funding, but guest access is also available.
ACCRE Storage Systems

ACCRE provides and maintains the following storage systems in order to meet a variety of needs.

PanFS: ACCRE has recently transitioned our primary general-purpose cluster file systems (/home, /data, /nobackup, /labs) from GPFS to a new ~5 PB Panasas PanFS installation. This system has to support both high-performance simultaneous access by several thousand compute jobs spread across nearly a thousand servers and interactive use by the many ACCRE users on gateway servers.

DORS: DORS is based on IBM GPFS and provides storage for data that must be accessible to cluster jobs and must also be accessible for heavy usage outside the cluster via NFS.
AuriStor: provides low-cost network file shares with significant storage capacity and strong security. However, at this time it is not well suited for heavy access by batch scheduled non-interactive ACCRE cluster jobs (direct access via non-interactive batch jobs without first staging elsewhere is tricky due to the strong security controls). There are several advantages to AuriStor, including built in encryption and auditing, a single global namespace for file path consistency across all operating systems, VUnetID authentication, ability to delegate group management to designated lab members, and uniform permissions scheme for Windows, MacOS, and Linux (no risk of permission conflicts between POSIX and NFSv4 ACLs).

LStore: provides very low-cost, high-performance, very high-capacity fault-tolerant storage. LStore initially targeted specialized usage but is being developed to handle an increasing number of general-purpose use cases. LStore is also the backing storage for our new (AWS) S3-compatible campus-local object storage system.
Data Storage and Backup

The True incremental Backup System (TiBS) is used to back up the ACCRE cluster home directories nightly using a Spectra T950 tape library. A big advantage of TiBS is that it minimizes the time (and network resources) required for backups, even full backups. After the initial full backup, TiBS only takes incremental backups from the client. To create a new full backup, an incremental backup is taken from the client. Then, on the server side, all incrementals since the last full backup are merged into the previous full backup to create the new full backup. This takes the load off the client machine and network. The integrity of the previous full backup is also verified. (TiBS is available for all current operating systems. Apart from the cluster, ACCRE also offers backup services for data located remotely. This service is available through special arrangement. If you are interested, contact us for more details.)
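The server-side merge described above can be sketched conceptually as follows. This is not TiBS's actual implementation or data format. It is a minimal illustration that assumes each backup is a mapping from file path to a content identifier, with `None` marking a file deleted since the previous backup. All names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical representation: path -> content identifier; None marks a deletion.
Incremental = Dict[str, Optional[str]]


def merge_into_full(previous_full: Dict[str, str],
                    incrementals: List[Incremental]) -> Dict[str, str]:
    """Build a new full backup from the previous full plus later incrementals.

    Incrementals must be ordered oldest first so that newer changes win.
    """
    new_full = dict(previous_full)  # leave the previous full backup untouched
    for incremental in incrementals:
        for path, content_id in incremental.items():
            if content_id is None:
                new_full.pop(path, None)   # file was deleted on the client
            else:
                new_full[path] = content_id  # file was added or changed
    return new_full


previous_full = {"/home/a.txt": "v1", "/home/b.txt": "v1"}
incrementals = [
    {"/home/a.txt": "v2"},                       # a.txt changed
    {"/home/b.txt": None, "/home/c.txt": "v1"},  # b.txt deleted, c.txt added
]
new_full = merge_into_full(previous_full, incrementals)
print(new_full)
```

The client only ever sends the small incremental sets. The work of producing the new full backup happens entirely on the server, which is the property the paragraph above highlights.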
Resource Allocation

A central issue in sharing a resource such as the cluster is making sure that each group receives its fair share if it is regularly submitting jobs to the cluster, that groups do not interfere with the work of other groups, and that research at Vanderbilt University is optimized by not wasting compute cycles. Resource management, job scheduling, and usage tracking are handled by SLURM.
SLURM supplies user functionality to submit jobs and to check system status. It is also responsible for starting and stopping jobs, collecting job output, and returning output to the user. SLURM allows users to specify attributes about the nodes required to run a given job, for example the CPU architecture.
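As a rough illustration of specifying node attributes at submission time, the sketch below assembles an `sbatch` command line using the standard `--nodes`, `--time`, and `--constraint` options. The helper function, the script name, and the feature name `skylake` are hypothetical. Feature names are defined by each site, so consult the ACCRE documentation for the values that actually apply.

```python
import shlex
from typing import List


def build_sbatch_command(script: str, nodes: int, time_limit: str,
                         constraint: str = "") -> List[str]:
    """Assemble an sbatch command line as an argument list."""
    command = ["sbatch", f"--nodes={nodes}", f"--time={time_limit}"]
    if constraint:
        # --constraint restricts the job to nodes advertising this feature.
        command.append(f"--constraint={constraint}")
    command.append(script)
    return command


# "skylake" is a made-up feature name standing in for a CPU architecture.
command = build_sbatch_command("my_job.slurm", nodes=1,
                               time_limit="02:00:00", constraint="skylake")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a gateway node. Here it is only printed, so the example runs anywhere.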
SLURM is a flexible job scheduler designed to guarantee, on average, that each group or user has the use of the particular number of nodes they are entitled to. If there are competing jobs, processing time is allocated by calculating a priority based mainly on the “fairshare” mechanism of SLURM. On the other hand, if no jobs from other groups are in the queue, it is possible for an individual user or group to use a significant portion of the cluster. This maximizes cluster usage while maintaining equitable sharing. You can find more details about submitting jobs through SLURM on our SLURM Documentation page.
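To give a feel for how a fairshare-based priority behaves, the sketch below uses the classic fair-share formula documented for Slurm's multifactor priority plugin, F = 2^(-(U/S)/d), where U is a group's normalized recent usage, S is its normalized share, and d is a dampening factor. Which fairshare algorithm and parameters ACCRE actually configures is not stated here, so treat this purely as an illustration. The group names and numbers are invented.

```python
def fairshare_factor(normalized_usage: float, normalized_shares: float,
                     dampening: float = 1.0) -> float:
    """Classic fair-share factor: 1.0 for no usage, 0.5 when usage equals share."""
    if normalized_shares <= 0 or dampening <= 0:
        raise ValueError("shares and dampening must be positive")
    return 2.0 ** (-(normalized_usage / normalized_shares) / dampening)


# Hypothetical groups, each entitled to 25% of the cluster.
recent_usage = {"idle_group": 0.0, "on_target_group": 0.25, "heavy_group": 0.5}

for group, usage in recent_usage.items():
    factor = fairshare_factor(usage, normalized_shares=0.25)
    print(f"{group}: fairshare factor {factor:.2f}")
```

A group that has used less than its share gets a higher factor, so its pending jobs are prioritized when the queue is contested. When nothing else is queued, priority ordering does not matter and a single group can fill the cluster.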
For specific details about ACCRE resource allocation, please see more here.
Installed Applications

ACCRE offers GCC and Intel compilers, with support for multithreading and MPI libraries. A comprehensive list of applications spanning various research areas is available, and ACCRE staff assist with the installation of additional tools.