
ACCRE cluster and visualization portal are back online

Posted on Monday, January 13, 2020 in Cluster Status Notice.

Update, 1/21/2020 10am: The cluster and visualization portal are back online. Further updates will be posted here.

Update, 1/20/2020 4pm: Tomorrow we will start providing limited cluster availability. The current bottleneck is determining which files in /scratch have missing blocks. This process requires intensively scanning all metadata and places a heavy burden on the old hardware. As a result, we will slowly ramp up job scheduling as long as it does not impact the file scanning; only jobs using /scratch affect the scan. After the process completes we will fully release the cluster.

/home and /data are now using the new hardware. The old data will still be available at /gpfs22/home and /gpfs23/data. /home has been fully migrated to the new hardware, and 85% of the /data file sets have been migrated. If you appear to be missing files in /data, the restore for your file set has not completed yet; we expect the /data restore to complete by the end of the week. If any changes were made to files in /home or /data during the downtime, they will need to be manually copied over to the new location by the user.
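For anyone who did modify files during the downtime, the copy-over amounts to "copy a file only if it is missing or older at the destination." The sketch below illustrates that logic with temp-dir stand-ins for the real paths; the directory names are placeholders, not actual cluster paths.

```python
# Hedged sketch: copy files changed during the downtime from the old
# location to the new one, skipping anything the restore already placed
# there with a newer timestamp. The paths are temp-dir stand-ins for
# /gpfs22/home/<user> (old) and /home/<user> (new).
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
old = os.path.join(base, "gpfs22_home")   # stands in for /gpfs22/home/<user>
new = os.path.join(base, "home")          # stands in for /home/<user>
os.makedirs(old)
os.makedirs(new)
with open(os.path.join(old, "notes.txt"), "w") as f:
    f.write("edited during the downtime\n")

# Copy a file only when it is missing or older at the destination;
# copy2 preserves timestamps and permissions.
for name in os.listdir(old):
    src, dst = os.path.join(old, name), os.path.join(new, name)
    if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
        shutil.copy2(src, dst)

print(sorted(os.listdir(new)))   # -> ['notes.txt']
```

On the cluster itself, `rsync -au /gpfs22/home/$USER/subdir/ /home/$USER/subdir/` does the same thing: `-a` preserves attributes and `-u` skips files that are already newer at the destination (add `-n` first for a dry run).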

Additionally, in preparation for migrating to new DORS hardware, the existing mount was renamed /dors1 to make the transition simpler. /dors is now a collection of symbolic links that currently point to the old DORS hardware and can be redirected to the new hardware in the future. Software on DORS should not be impacted by this change. A separate email outlining the next steps for transitioning to the new hardware will be sent to DORS users at a later date.
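The value of this indirection is that the user-facing path stays stable while the backend it resolves to can be swapped. A minimal sketch of the idea, using temp-dir stand-ins rather than the real /dors paths:

```python
# Hedged sketch: a symbolic link gives a stable front path that can be
# repointed from old hardware to new without users changing anything.
# All paths are illustrative temp-dir stand-ins, not real cluster paths.
import os
import tempfile

base = tempfile.mkdtemp()
old_hw = os.path.join(base, "dors1")      # stands in for the old DORS hardware
new_hw = os.path.join(base, "dors_new")   # stands in for the future hardware
front = os.path.join(base, "dors")        # the stable user-facing path
os.makedirs(old_hw)
os.makedirs(new_hw)

os.symlink(old_hw, front)                 # "/dors" -> old hardware
print(os.path.realpath(front))            # resolves to the dors1 stand-in

# Later, repoint the link atomically: create the new link under a
# temporary name, then rename over the old one (rename(2) is atomic
# on POSIX and does not follow the existing symlink).
os.symlink(new_hw, front + ".tmp")
os.replace(front + ".tmp", front)
print(os.path.realpath(front))            # now resolves to the new stand-in
```

The atomic rename matters in practice: at no point does the front path dangle or point at a half-built target.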

Please note that in some cases software environments manually installed into your home directory may have followed existing symbolic links, so that they try to access files directly from /gpfs22/home rather than from /home. Python virtual environments are a good example of this behavior. In this case we recommend that you promptly rebuild your installation environments so that they reference your home directory on the new hardware. This should yield a significant performance improvement, as your software will load from the new GPFS hardware. Eventually we will remove the /gpfs22 hardware completely, and such environments will need to be rebuilt. We will be collecting tips and instructions for migrating some commonly used environments and will post them in a future communication.
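The reason virtual environments are affected is that a venv's scripts hard-code the interpreter path (the resolved one at creation time) in their shebang lines. The sketch below shows the check on a tiny stand-in "venv" built in a temp dir; the environment name and user are hypothetical, and on the cluster the equivalent is simply inspecting the first line of the scripts in `~/<env>/bin/`.

```python
# Hedged sketch: detect an environment whose scripts were created against
# the resolved /gpfs22/home path instead of the stable /home path.
# "myenv" and "alice" are made-up names for illustration.
import os
import tempfile

base = tempfile.mkdtemp()
bindir = os.path.join(base, "myenv", "bin")
os.makedirs(bindir)
# A venv script's first line pins the interpreter path in its shebang:
with open(os.path.join(bindir, "pip"), "w") as f:
    f.write("#!/gpfs22/home/alice/myenv/bin/python\n")

stale = []
for name in os.listdir(bindir):
    with open(os.path.join(bindir, name)) as f:
        first_line = f.readline()
    if "/gpfs22/home" in first_line:   # old-hardware path baked in
        stale.append(name)

print("rebuild needed for:", stale)    # -> rebuild needed for: ['pip']
```

On the cluster, `grep -l /gpfs22/home ~/myenv/bin/*` performs the same check in one line; "rebuilding" means deleting the environment, recreating it with `python -m venv` from the new /home path, and reinstalling its packages.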

Update, 1/20/2020 1pm, for custom gateway users: In anticipation of migrating DORS to new hardware, we need to rename the existing DORS mount to make room for the replacement. Doing this will require a brief one-hour outage at 3pm today.

At the same time, we will be updating the links for /home and /data to use the new hardware. The old data will still be available at /gpfs22/home and /gpfs23/data. /home has been fully migrated to the new hardware and 95% of the /data file sets are migrated. Any changes made to files in /home or /data during the downtime will need to be manually copied over to the new location by the user.

Update, 1/20/2020 10am: We are running through some final system checks and expect to have a firm estimate today on when we will be online. Thank you for your patience.

Update, 1/16/2020 10pm: The new shared storage system is in place, and we have been restoring files from backup for the last 24 hours at approximately 1 GB/s. On Monday we will determine if we can go ahead and provide access to /home. Additional information will be provided regarding the remaining time on the restore and how to access files on the old storage.

Update, 1/14/2020 4pm: During the recovery process for the shared storage (GPFS) yesterday we had another drive failure in the same RAID group. With three drives down it is impossible to recover that disk group. The impact is a loss of about 6% of the files.

Due to these circumstances, we have decided to go ahead and replace the shared storage system with completely new hardware. Our team began preparing the new system last night, and we expect to start migrating user files soon. We will have a tentative estimate for completion once the data transfers are underway. The new system’s network connections are 2.5x faster, and it uses an optimization policy that stages smaller files on SSDs.

Files located in /home will be copied over from the old system to the new one; there was no data loss in /home. Files in /data will be restored from tape. While we do not back up /scratch, we plan to make the old /scratch available so that users have a chance to copy files over to their new /home, /data, or /scratch.

We regret that this happened this way, but are looking forward to having the new hardware in place and providing a much needed boost to the cluster.

Update, 1/13/2020 5pm: The recovery process for the shared storage (GPFS) is still underway. There are two steps to the process, and our plan was to bring the cluster back online after the first step was done. We are still waiting for the first step to conclude and are reevaluating this strategy in favor of fully recovering the storage before connecting it to the cluster. Our hardware vendor is helping to assess the current state in order to mitigate other potential risks and recommend next steps.

Regarding the scheduler, testing has gone well, and you may notice activity on the graphs on the ACCRE website. This simply means that SLURM is functional and running; as the graphs show, it is not using the full cluster.

We will be providing another update before lunch tomorrow and apologize for the disruption.

Update, 1/13/2020 10pm: During a check of the shared storage in preparation for bringing the cluster back up, we detected one RAID set reporting two bad disk drives and another reporting a single bad drive. Those drives have been replaced, and the storage system has started rebuilding the RAID set. We are going to allow that process to finish before bringing the cluster back online. We expect to be able to do that tomorrow morning.

Update, 1/13/2020 4pm: The new SLURM version has been built and installed. The next phase will be testing. So far things look good and we hope to have the cluster restored later this evening.

As part of the effort to make sure the cluster is restored to good health, we are also checking the shared storage system (i.e. GPFS) for any possible residual issues caused by SLURM’s erratic behavior. We will send out another notice once SLURM testing has concluded.

Update, 1/13/2020 2pm: We’ve disabled access to the ACCRE gateways in order to resolve issues with GPFS and SLURM that started over the weekend. Access to the ACCRE Visualization Portal is also down. We apologize for the interruption to your research and hope to bring back access to the cluster soon.

Update, 1/13/2020 9am: This morning the vendor sent us a patched version for SLURM and we are in the process of building, deploying, and then testing it. We will send out an update as soon as testing is completed.

Update, 1/12/2020 9pm: The scheduler software for the cluster, SLURM, continues to have issues, and we continue to troubleshoot the problem along with vendor support. Many of the jobs that had already started prior to SLURM going offline are still running and may complete. However, users may experience difficulty logging into the gateways due to strain on the shared storage system.

Original post:

The cluster’s GPFS file system is extremely overloaded, and at the same time the SLURM scheduler failed overnight and can’t be restarted. We’ve opened a ticket with SchedMD to resolve this issue.

We are currently rebooting nodes that are inaccessible to stabilize GPFS.

We apologize for any inconvenience to your research.