An unexpected outage in the data center where Boqueron is hosted took down 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed. We have since turned the nodes back on, and they should now be working as usual.
To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41 through 60) will be rebooted at a later time. Until then, they will remain closed to new jobs, and Slurm will list them as either “draining” or “drained”. Jobs currently running on those nodes will be allowed to continue uninterrupted.
The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems. Other HPCf services such as our web site, however, were affected. They are also back online, and you should not experience any issues when using them.
We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.
As always, if you have any questions, problems, or comments, you may write to us at firstname.lastname@example.org.