WELCOME TO THE HIGH PERFORMANCE COMPUTING FACILITY
DEVELOPING TECHNOLOGY AND INFRASTRUCTURE FOR THE RESEARCH AND EDUCATION COMMUNITY
Early today, Boqueron’s /work had a hiccup and went offline for a few hours. This caused some odd behavior, including failed sign-ins through certain software such as SCP programs. We’ve fixed the issue, but because jobs run on /work, any jobs still running after /work went offline were eventually killed. Everything should be fine now, and you should be able to resubmit your jobs with no problems. No data loss should have occurred as a result of this error.
Please remember that per HPCf Usage Policies, data in /work is not backed up. Always make sure to move important data off of /work and into a more permanent storage location, such as a computer in your laboratory or a personal workstation.
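For users who script their workflows, here is a minimal Python sketch of that habit: it copies a results directory out of /work and then checks every copied file byte for byte. The function name copy_off_work and the example paths are hypothetical, and the sketch assumes the destination is a directory reachable from the machine where it runs. To move data to a laboratory computer or personal workstation you would normally use an SCP or rsync client instead.

```python
import filecmp
import shutil
from pathlib import Path


def copy_off_work(src: Path, dst: Path) -> list[str]:
    """Copy the tree at src into dst and return files that failed verification.

    Hypothetical helper, not an official HPCf tool.
    """
    # Copy the whole tree; dirs_exist_ok lets an earlier partial copy be refreshed.
    shutil.copytree(src, dst, dirs_exist_ok=True)

    mismatched = []
    for path in src.rglob("*"):
        if path.is_file():
            rel = path.relative_to(src)
            # shallow=False compares file contents, not just size and mtime.
            if not filecmp.cmp(path, dst / rel, shallow=False):
                mismatched.append(str(rel))
    return sorted(mismatched)


# Example call (hypothetical paths):
#   bad = copy_off_work(Path("/work/username/results"), Path("/path/to/permanent/results"))
#   An empty list means every file in the copy matched its original.
```

Only delete the originals from /work once the returned list is empty.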
As always, if you have any other issues or questions, please let us know at firstname.lastname@example.org.
As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today. Unfortunately, this maintenance killed all running and pending jobs.
A problem in the way Slurm (the queue manager) interacted with Bright CM, the cluster manager software we use on Boqueron, was causing Slurm’s configuration to be rewritten spontaneously whenever certain actions were taken in the cluster. Every time this happened, jobs got killed. Today, we contacted Bright support, and they were kind enough to help us through a live screen-sharing session. The changes they made required rewriting the Slurm configuration, just as in those earlier incidents, so jobs got killed earlier today as well.
Is the issue resolved?
Yes. The issue had been bugging us for a few weeks, but it should now be completely resolved.
Can I submit jobs again? Won’t they get killed again?
Yes, you may submit jobs again; and no, they should not get killed again. No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.
But I’m afraid they’ll get killed again!
You shouldn’t be afraid. We recognize that the recent shakiness of the queues has shaken user confidence, but again, the core issue should now be resolved and we expect stable times for our cluster. *knocks on wood*
We do apologize for the inconvenience this has created. For any further questions or comments, feel free to contact us at email@example.com.
The High Performance Computing facility of the University of Puerto Rico is presently developing a technology and service infrastructure for the research and education community of the University. This infrastructure is built from the following components:
- Advanced Research Network
- Core High Performance Computational Resources
- Services in Support of Users of Computational Resources
- Standards and Architecture
- Evaluation of Emerging Technologies
Use of HPCf resources requires that you acknowledge our sponsoring institutions: the University of Puerto Rico; the Puerto Rico INBRE grant P20 GM103475 from the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); and awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) program of the National Science Foundation (NSF).