WELCOME TO THE HIGH PERFORMANCE COMPUTING FACILITY
DEVELOPING TECHNOLOGY AND INFRASTRUCTURE FOR THE RESEARCH AND EDUCATION COMMUNITY
The staff at the data center where Boqueron is hosted will be carrying out electrical work next week that will impact Boqueron. To cooperate with this effort and to protect as many jobs as possible from being killed abruptly, we have opened a maintenance window starting at 7:30am Monday January 30th and ending at 12:00pm Tuesday January 31st. Any newly submitted jobs that cannot complete before the maintenance window begins will be held in the queue until the window ends.
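To judge whether a new job is likely to be held, it can help to compare its requested time limit against the time remaining before the window opens. The following is a minimal sketch of that arithmetic, not a description of how the scheduler decides. The function names are hypothetical, the year is an assumption (the announcement does not state it), and the sketch assumes the best case in which the job starts the moment it is submitted.

```python
from datetime import datetime, timedelta

# Window start from the announcement: 7:30am on Monday, January 30th.
# ASSUMPTION: the year is not stated; 2017 is used here only for illustration.
WINDOW_START = datetime(2017, 1, 30, 7, 30)


def fits_before_window(submit_time, time_limit, window_start=WINDOW_START):
    """Return True if a job that starts at submit_time and runs for its
    full time_limit would end on or before window_start."""
    return submit_time + time_limit <= window_start


def max_time_limit(submit_time, window_start=WINDOW_START):
    """Return the longest time limit that still ends before the window,
    or a zero timedelta if the window has already begun."""
    return max(window_start - submit_time, timedelta(0))


if __name__ == "__main__":
    now = datetime(2017, 1, 29, 19, 30)  # hypothetical submission time
    print(fits_before_window(now, timedelta(hours=8)))
    print(max_time_limit(now))
```

In practice a job may wait in the queue before it starts, so the usable time limit is usually shorter than this best-case figure. Requesting a shorter limit (for example with the `--time` option of `sbatch`) gives a job a better chance of running before the window.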
During this window, the Boqueron login node will continue to operate, and the file systems /home and /work should continue to be available, so you should be able to access your files during this time. The worker nodes will be powered down, however.
Jobs that are currently running will be allowed to continue, but any that are still running when the maintenance window begins will unfortunately be killed.
We realize this announcement comes on short notice, but please understand that HPCf staff were notified of this electrical work only yesterday afternoon.
We apologize for any inconvenience you may experience from this maintenance window, and we thank you for your cooperation. As always, if you have any questions or comments, please send them to firstname.lastname@example.org.
An unexpected outage in the data center where Boqueron is hosted took down 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed. We have since turned the nodes back on, and they should now be working as usual.
To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41~60) will be rebooted at a later time, and so they will remain closed off to new jobs until then (they will be listed as either “draining” or “drained” by Slurm). The jobs that are currently running there will be allowed to run uninterrupted.
The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems. Other HPCf services such as our web site, however, were affected. They are also back online, and you should not experience any issues when using them.
We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.
As always, if you have any questions, problems, or comments, you may write to us at email@example.com.
The High Performance Computing facility of the University of Puerto Rico is presently developing a technology and service infrastructure for the research and education community of the University. This infrastructure is built from the following components:
- Advanced Research Network
- Core High Performance Computational Resources
- Services in Support of Users of Computational Resources
- Standards and Architecture
- Evaluation of Emerging Technologies
Use of HPCf resources requires that you acknowledge our sponsoring institutions: the University of Puerto Rico; the Puerto Rico INBRE grant P20 GM103475 from the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); and awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) of the National Science Foundation.