Vanderbilt’s legacy in high-performance computing goes back to 1962, when a large mainframe was installed in the round building of Stevenson Center. The building was given glass walls to showcase this state-of-the-art technology. When it was converted into the Vanderbilt Biomolecular Nuclear Magnetic Resonance Center in 2000, the glass walls were kept in place in the hope that the building’s unique composition would one day earn it historic status.
The vision for ACCRE originated with the VUPAC cluster built in 1994 and expanded over the years by Vanderbilt physicists. By the end of its life as a useful production facility, VUPAC was a cluster of roughly 50 workstations loosely coupled together to provide a (very coarse) parallel computing environment.
In the late 1990s, Paul Sheldon in Physics and Astronomy joined forces with Jason Moore in Human Genetics and spearheaded the building of the VAnderbilt MultiProcessor Integrated Research Engine (VAMPIRE), a 55-node Linux cluster first put into service in July 2001 with assistance from ITS. Alan Tackett, a computational theoretical physicist, was brought on board to provide technical expertise and vision. VAMPIRE was intended as a proof of concept, and it proved that many groups from diverse disciplines with differing compute problems could equitably share a cluster. This first shared cluster allowed many issues to be worked out, including selecting software to ensure equitable sharing, correctly installing the operating system, maintaining a large number of nodes, verifying that the hardware worked properly, and monitoring all the nodes.
VAMPIRE was successful: numerous publications resulted from calculations performed on the cluster, and cross-fertilization between the disciplines was even greater than anticipated. Eventually, however, VAMPIRE could no longer provide enough computing capacity for all the researchers involved. While efforts to secure funding for a much larger compute cluster continued, contributions from many researchers, primarily Ron Schrimpf and Peter Cummings in the School of Engineering, funded the next generation of the cluster: a 120-node Linux cluster purchased in 2003, which remained in service until March 2008.
ACCRE is established
With the success of VAMPIRE and the support of a large group of researchers, the proposal for the entity now known as ACCRE was submitted to the Academic Venture Capital Fund (AVCF) of Vanderbilt University. An $8.3 million grant was awarded to transform the cluster into a University-wide resource capable of meeting the needs of any researcher on campus, expanding the scope of the operation from a compute cluster to include data storage and data visualization capabilities as well. Funding was secured in late 2003, and planning began to establish the physical infrastructure and personnel for this task.
AVCF funding allowed the staff to be expanded so that compute cluster services could be offered to all Vanderbilt researchers. New services included a helpdesk system called Request Tracker, evening and weekend on-call assistance for problems affecting the entire cluster, daily office hours, and training workshops. Educational use of the cluster for Vanderbilt classes was also added, along with guest accounts that let new Vanderbilt-affiliated researchers try out the cluster.
AVCF funds were also used to begin expanding the cluster in spring 2004, but most cluster expansion has been the result of researcher funding through grants and faculty start-up packages. The first major expansion of the compute cluster under ACCRE began in August 2004, approximately five years after work on VAMPIRE began, when an NIH compute grant was secured to provide extensive computing capacity for NIH-funded researchers in the School of Medicine. Hardware for that grant, as well as for a second NIH grant secured in 2008, was purchased under the leadership of Dave Piston.
During this same period, ACCRE began operating as a Core Center under the VUMC Office of Research and developed a funding mechanism that allowed hardware contributions through annual payments, in addition to hardware added through grants and start-up packages.
The focus of ACCRE also expanded beyond computing beginning in 2004. To meet needs of Vanderbilt researchers that off-the-shelf storage solutions could not, ACCRE staff began developing a new distributed storage system in-house. Called LStore, the system provides a flexible logistical storage framework for distributed, scalable, and secure access to data for a wide spectrum of users. Vanderbilt research groups began using it as early beta testers in January 2007, and in 2010 it was adopted for a major Physics collaboration.
The cluster filesystem has also been substantially expanded since 2010 to meet researchers’ needs for disk space connected to the cluster; over 800 TB are currently in use by researchers, and the size of the filesystem can be adjusted to meet demand.
Since 2011 Vanderbilt has served as a Tier 2 CMS institution. CMS (Compact Muon Solenoid) is one of the two general-purpose detectors at the Large Hadron Collider near Geneva, Switzerland. Vanderbilt currently hosts 4 petabytes (4,000 terabytes) of storage in support of this project alone, all of it in LStore, along with over 300 terabytes of video data from the Vanderbilt Television News Archive (TVNA) in the Jean and Alexander Heard Library.
Overlapping with the development of LStore, a wide-area storage network called the Research and Education Data Depot Network (REDDnet) was launched in 2006 to support data-intensive collaboration among researchers distributed across Internet2 and other research networks.
The cluster currently comprises over 7,000 cores across more than 600 compute nodes, including a group of GPU nodes that each contain four high-end Nvidia GPU cards. New hardware continues to be added, funded by researcher contributions as needed to meet their computing needs.
Our software stack evolved quickly throughout the 2010s: we adopted SLURM as our scheduler in 2015 and Lmod as our environment module system in 2017. In 2018 we launched the ACCRE Visualization Portal, which allows users to access their files and view job status from a web browser. Our big data cluster, which runs alongside the traditional cluster, has been in operation since 2017.
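For readers unfamiliar with these tools, the following is a minimal sketch of how a SLURM batch script and Lmod fit together. The job name, resource limits, and module name are illustrative assumptions, not ACCRE-specific values:

```shell
#!/bin/bash
# Sketch of a SLURM batch script; #SBATCH lines are directives read by
# the scheduler (values here are illustrative, not ACCRE defaults).
#SBATCH --job-name=example_job
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=1G

# Load software through Lmod where it is available (the guard keeps the
# script runnable on systems without the module command; the GCC module
# name is a placeholder)
if command -v module >/dev/null 2>&1; then
  module load GCC
fi

# The actual computation would go here; this stand-in just reports
# which node the job landed on.
RESULT="Running on $(hostname)"
echo "$RESULT"
```

A script like this would typically be submitted with `sbatch` and monitored with `squeue`, both standard SLURM commands.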