WELCOME TO THE HIGH PERFORMANCE COMPUTING FACILITY
DEVELOPING TECHNOLOGY AND INFRASTRUCTURE FOR THE RESEARCH AND EDUCATION COMMUNITY
As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today. Unfortunately, this maintenance killed all running and pending jobs.
An issue in the way Slurm (the queue manager) interacted with Bright CM, the cluster manager software we use on Boqueron, was causing Slurm’s configuration to be rewritten spontaneously when certain actions were taken in the cluster. Every time this happened, jobs got killed. Today, we contacted Bright support, and they were kind enough to help us through a live screen-sharing session. The changes they made required the Slurm configuration to be rewritten, just as on those earlier occasions, so jobs were killed earlier today as well.
Is the issue resolved?
Yes. The issue had been bugging us for a few weeks, but it should now be completely resolved.
Can I submit jobs again? Won’t they get killed again?
Yes, you may submit jobs again; and no, they should not get killed again. No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.
But I’m afraid they’ll get killed again!
You shouldn’t be afraid. We recognize that the recent shakiness of the queues may have lowered user confidence, but again, the core issue should now be resolved and we expect stable times for our cluster. *knocks on wood*
We do apologize for the inconvenience this has created. For any further questions or comments, feel free to contact us at firstname.lastname@example.org.
We have completed the maintenance work on Boqueron that had been scheduled for today. Core sharing should now be disabled. Users may now resubmit jobs to Boqueron. Everything should be working fine, but we encourage users to submit at least one test job before resubmitting real workloads. Please report any issues or questions to email@example.com.
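For users who would like a starting point for such a test job, here is a minimal sketch in Python. It builds a tiny Slurm batch script and hands it to `sbatch` if that command is available. The job name, the one-minute time limit, and the output file pattern are illustrative assumptions, not Boqueron requirements, and your account may also need a partition or other options that are not shown here.

```python
import shutil
import subprocess
import tempfile
from typing import Optional


def build_test_script(job_name: str = "smoke-test", minutes: int = 1) -> str:
    """Return the text of a minimal Slurm batch script.

    The job name and time limit are hypothetical defaults chosen for
    illustration; adjust them to your own needs. `minutes` is assumed
    to be below 60.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --ntasks=1",                # a single task is enough for a test
        f"#SBATCH --time=00:{minutes:02d}:00",
        "#SBATCH --output=%x-%j.out",        # %x = job name, %j = job ID
        "",
        'echo "Test job running on $(hostname)"',
        "",
    ])


def submit(script_text: str) -> Optional[str]:
    """Submit the script with sbatch; return sbatch's reply, or None if
    sbatch is not installed on this machine."""
    if shutil.which("sbatch") is None:
        return None
    # Write the script to a temporary file that sbatch can read.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script_text)
    result = subprocess.run(
        ["sbatch", handle.name], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    script = build_test_script()
    print(script)
    reply = submit(script)
    print(reply if reply is not None else "sbatch not found; nothing submitted.")
```

If the job is accepted, `squeue -u $USER` should list it while it is pending or running, and the output file should appear in the directory you submitted from once it finishes.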
The High Performance Computing facility of the University of Puerto Rico is presently developing a technology and service infrastructure for the research and education community of the University. This infrastructure is built from the following components:
- Advanced Research Network
- Core High Performance Computational Resources
- Services in Support of Users of Computational Resources
- Standards and Architecture
- Evaluation of Emerging Technologies
Use of HPCf resources requires that you acknowledge our sponsoring institutions: the University of Puerto Rico; the Puerto Rico INBRE grant P20 GM103475 from the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); and awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) of the National Science Foundation.