CentOS 7 Upgrade: login.accre.vanderbilt.edu now points to CentOS 7, login6.accre.vanderbilt.edu available temporarily for GPU users

Posted on Monday, August 13, 2018 in Cluster Status Notice.

Update, 9/28/2018: We have now pointed login.accre.vanderbilt.edu to the CentOS 7 environment.


Update, 9/27/2018: Tomorrow afternoon we will be pointing login.accre.vanderbilt.edu at the CentOS 7 environment. Due to DNS caching this change may not take effect until Saturday or later on your local machine. After the switchover, if you are asked to verify that you trust the authenticity of the remote host when you attempt to ssh to login.accre.vanderbilt.edu, please respond “yes”. A small set of users may also get warnings about spoofing. If you see this warning, please remove login.accre.vanderbilt.edu from your ~/.ssh/known_hosts file and try logging in again. If you would like to access the CentOS 6 environment after the switchover, you will need to use login6.accre.vanderbilt.edu (not login-old.accre.vanderbilt.edu as indicated below). At this point, we also encourage all custom gateway owners to schedule a time to upgrade their gateways to CentOS 7. Please open a helpdesk ticket with us to discuss if you have not done so already. Please read previous email messages below for more details about the CentOS 7 upgrade.
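
For anyone who prefers to script the known_hosts cleanup mentioned above, here is a minimal sketch. The function name `drop_host_entries` is hypothetical, and the sketch only handles plain-text host entries; hashed entries (lines beginning with `|1|`) are left untouched. The standard OpenSSH way to do the same thing is `ssh-keygen -R login.accre.vanderbilt.edu`, which also handles hashed entries.

```python
def drop_host_entries(known_hosts_text: str, host: str) -> str:
    """Return known_hosts content without plain-text entries for `host`.

    Assumption: entries are not hashed. Hashed lines are kept as-is.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # Skip an optional marker such as @cert-authority or @revoked.
        if fields and fields[0].startswith("@"):
            fields = fields[1:]
        # The host field is a comma-separated list of names and addresses.
        hosts = fields[0].split(",") if fields else []
        if host in hosts:
            continue  # drop every entry that names this host
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

The function works on text rather than on the file itself, so you can inspect the result (and keep a backup of ~/.ssh/known_hosts) before writing anything back.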


Update, 9/24/2018: A reminder that on Friday at around 3pm, we will be switching over login.accre.vanderbilt.edu to the CentOS 7 environment as described below. You can access this environment now by connecting to login7.accre.vanderbilt.edu.

We ask our GPU users to continue to use CentOS 6, which will be available following the switchover at login-old.accre.vanderbilt.edu. We are working with Mellanox support to overcome a bug in their driver for the network cards in the GPU nodes.

Thanks!


Update, 9/5/2018: Getting Ready for the CentOS 7 Upgrade

This is another reminder to please test your ACCRE workflows in the CentOS 7 environment as soon as possible. We also invite users to go ahead and begin running production jobs (i.e. not just test jobs) in the CentOS 7 environment since a majority of the compute resources are now on that side.
On Friday, Sept 28 at roughly 3pm we will point login.accre.vanderbilt.edu at the CentOS 7 environment, and access to what remains of the CentOS 6 environment will only be available through login-old.accre.vanderbilt.edu. Note that due to DNS caching it may take a day or two for these changes to be picked up on your local machine.
We now have roughly 5,350 cores (including almost 2,000 new Intel Skylake CPU cores!) available in the CentOS 7 environment and 2,764 in the CentOS 6 environment. We are aiming to have >90% of our compute nodes transitioned to the CentOS 7 environment by mid-September. This means that queue times may already be painfully long in the CentOS 6 environment and this will only get worse over the next several weeks!
We have hit a network-related snag on the GPU nodes in the CentOS 7 environment, so please continue to submit and run your GPU jobs in the CentOS 6 environment until further notice.
If you have any questions or concerns, please feel free to reach out to us via our helpdesk.
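
As a quick check on the core counts quoted in this update, the short calculation below shows how the split works out. The numbers are the approximate figures from the notice; the helper name `share` is only illustrative.

```python
def share(part: int, total: int) -> float:
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


centos7_cores = 5350  # "roughly 5,350 cores" per the notice
centos6_cores = 2764
total_cores = centos7_cores + centos6_cores

print(f"CentOS 7: {share(centos7_cores, total_cores):.1f}% of {total_cores} cores")
print(f"CentOS 6: {share(centos6_cores, total_cores):.1f}% of {total_cores} cores")
```

With these figures, roughly two thirds of the cores are already on the CentOS 7 side, which is consistent with the statement that a majority of compute resources have moved.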

Update, 8/13/2018: This is a friendly reminder to please test your workflows in the CentOS 7 environment as soon as possible.

See our website for regular updates on the number of cores on the CentOS 6 side versus CentOS 7 side. We will be deploying a large number of new Intel Xeon Skylake-based processors in the next few weeks in the CentOS 7 environment, at which point the number of compute resources on the CentOS 7 side will exceed those available in the old environment.

Below are some points of confusion that have come up in tickets related to the CentOS 7 transition:

  • We currently have two independent instances of SLURM running, each managing a separate set of compute nodes. If you submit a job from a CentOS 7 gateway (login7.accre.vanderbilt.edu or a custom gateway that has been upgraded to CentOS 7), your job will run on a CentOS 7 compute node. If you submit a job from a CentOS 6 gateway (login.accre.vanderbilt.edu or a custom gateway that has not been upgraded to CentOS 7), your job will run on a CentOS 6 compute node. You should notice that the instance of SLURM managing CentOS 7 resources is significantly more responsive than the one managing CentOS 6 compute nodes.
  • There may be some slight changes in the names of LMod packages between CentOS 6 and 7, so please make use of the “ml spider” subcommand to search for the appropriate names and versions of a particular package you used in the CentOS 6 environment. Please also use this opportunity to research collections in LMod as these will help prevent common problems we see frequently in helpdesk tickets. Note that module collections created in the CentOS 6 environment must be deleted and re-created in the CentOS 7 environment.
  • We currently have separate authentication systems running in parallel – one is managing passwords in the CentOS 6 environment and one is managing passwords in the CentOS 7 environment. We are synchronizing passwords from CentOS 6 to CentOS 7, but not vice versa, and the process for changing your password has changed in CentOS 7. Please see: Password Change Information
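
Since the gateway you submit from determines which environment your job runs in, it can help to confirm which CentOS release a gateway or compute node is actually running. The sketch below is one possible check, not an ACCRE-provided tool. It assumes the node has the usual /etc/redhat-release file, which is present on both CentOS 6 and CentOS 7; the function name `centos_major_version` is hypothetical.

```python
import re
from typing import Optional


def centos_major_version(release_text: str) -> Optional[int]:
    """Extract the major version from /etc/redhat-release-style text."""
    match = re.search(r"release\s+(\d+)", release_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    try:
        with open("/etc/redhat-release") as fh:
            print("CentOS major version:", centos_major_version(fh.read()))
    except FileNotFoundError:
        print("No /etc/redhat-release found; this is probably not a CentOS node.")
```

Running this at the start of a test job is a simple way to verify that the job landed in the environment you expected.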

ACCRE has begun transitioning its cluster operating system from CentOS 6 to 7, along with many new features and improvements (including improved SLURM responsiveness) that are detailed here.

During this transition, two parallel environments will be available – one running CentOS 6 (ssh login.accre.vanderbilt.edu) and one running CentOS 7 (ssh login7.accre.vanderbilt.edu).
As soon as you are able, we urge you to access the new environment to verify that your workflow functions correctly. The biggest impact will be on software that has been built and installed into a private home or group directory. In most cases, this software will need to be re-built to function in CentOS 7. ACCRE staff are available to assist. Until a majority of compute infrastructure has been transitioned to the CentOS 7 environment, please only run short representative jobs that verify that everything works as expected as bursting capacity is somewhat limited in the new environment.
By the end of this week, roughly 25% of our compute infrastructure [1] will have been moved to CentOS 7. We are aiming to have a total of 50% transitioned to CentOS 7 by mid-August, and 90-95% by early to mid-September.
We will reach out to groups that own custom gateways to schedule the upgrades of these machines.
While we have worked hard to ensure the new environment is set up correctly, it’s possible we have overlooked certain features or that we will need to make additional changes during this transition period. Please open a Helpdesk ticket with us if software you rely on is missing.
Depending on changes that are required, it’s also possible we may need to remove jobs or take the cluster offline on short notice (yet another reason to not run production jobs during this trial period).
Note that the setpkgs/pkginfo commands no longer exist in the CentOS 7 environment. Please remove any setpkgs commands from your shell initialization files (e.g. ~/.bashrc). You might also use this as an opportunity to clean up your shell initialization files in general and transition to using LMod collections (if you are not doing so already) as these help avoid common problems we see in tickets.
Please let us know if you have any concerns or questions. We would also appreciate updates when your group has verified that its workflows function as expected.
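
To help with the setpkgs cleanup described above, the following sketch lists lines in your shell initialization files that still call setpkgs or pkginfo. It only reads the files and never modifies them. The function name `find_legacy_lines` and the choice of files to scan (~/.bashrc and ~/.bash_profile) are assumptions; adjust the list to match your own setup.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Commands that no longer exist in the CentOS 7 environment.
LEGACY = re.compile(r"\b(setpkgs|pkginfo)\b")


def find_legacy_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that mention setpkgs or pkginfo.

    Lines that are entirely comments are ignored.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip().startswith("#"):
            continue
        if LEGACY.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    # Assumed file list; read-only scan.
    for name in (".bashrc", ".bash_profile"):
        path = Path.home() / name
        if path.is_file():
            for number, line in find_legacy_lines(path.read_text(errors="replace")):
                print(f"{path}:{number}: {line}")
```

Any line it reports is a candidate for removal or for replacement with the equivalent LMod module or collection.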

[1] Note that we have a very limited number of GPU node resources (a single node for now, but we will move a few more nodes in the next week or two) available for testing in the new environment, currently all under the pascal partition only, so be especially mindful of the size and duration of your jobs requesting GPU resources.
