WELCOME TO THE HIGH PERFORMANCE COMPUTING FACILITY
DEVELOPING TECHNOLOGY AND INFRASTRUCTURE FOR THE RESEARCH AND EDUCATION COMMUNITY
An unexpected outage in the data center where Boqueron is hosted took down 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed. We have since powered the nodes back on, and they should now be working as usual.
To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41 to 60) will be rebooted at a later time. Until then, they will remain closed to new jobs, and Slurm will list them as either “draining” or “drained”. The jobs that are currently running there will be allowed to run uninterrupted.
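If you want to check which nodes Slurm currently reports as closed to new jobs, the following is a minimal sketch in Python. It assumes you capture the output of `sinfo -N -h -o "%N %T"` on the login node (one node name and state per line). The function name `closed_nodes` and the node names in the usage note are hypothetical and used only for illustration.

```python
# States that Slurm reports for nodes that are closed to new jobs.
CLOSED_STATES = {"draining", "drained"}


def closed_nodes(sinfo_output):
    """Return {node_name: state} for nodes listed as draining or drained.

    Expects one "<node> <state>" pair per line, such as the output of
    `sinfo -N -h -o "%N %T"`.
    """
    result = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        # Slurm may append a flag character (for example "*") to the state.
        base_state = state.rstrip("*~#$@!%^-+").lower()
        if base_state in CLOSED_STATES:
            result[node] = base_state
    return result
```

For example, passing the text `"node41 draining\nnode42 drained*\nnode01 idle\n"` returns an entry for `node41` and `node42` only. Nodes that appear in more than one partition may be listed several times by `sinfo`; the dictionary keeps a single entry per node.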
The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems. Other HPCf services such as our web site, however, were affected. They are also back online, and you should not experience any issues when using them.
We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.
As always, if you have any questions, problems, or comments, you may write to us at firstname.lastname@example.org.
Early today, Boqueron’s /work file system went offline for a few hours. This caused some unexpected behavior, including failed sign-ins through certain software such as SCP programs. We have fixed the issue, but because jobs run on /work, the jobs that were still running after /work went offline were eventually killed. Everything should be fine now, and you should be able to resubmit your jobs with no problems. No data loss should have occurred as a result of this error.
Please remember that per HPCf Usage Policies, data in /work is not backed up. Always make sure to move important data off of /work and into a more permanent storage location, such as a computer in your laboratory or a personal workstation.
As always, if you have any other issues or questions, please let us know at email@example.com.
The High Performance Computing facility of the University of Puerto Rico is presently developing a technology and service infrastructure for the research and education community of the University. This infrastructure is built from the following components:
- Advanced Research Network
- Core High Performance Computational Resources
- Services in Support of Users of Computational Resources
- Standards and Architecture
- Evaluation of Emerging Technologies
Use of HPCf resources requires that you acknowledge our sponsoring institutions: the University of Puerto Rico; the Puerto Rico INBRE grant P20 GM103475 from the National Institute for General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); and awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) program of the National Science Foundation.