Boqueron Unscheduled Maintenance on the Morning of June 1st, 2016
As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today. Unfortunately, this maintenance killed all running and pending jobs.
What happened?
An issue in the way Slurm (the queue manager) interacted with Bright CM (the cluster manager software that we use on Boqueron) was causing Slurm’s configuration to be rewritten spontaneously whenever certain actions were taken in the cluster. Every time this happened, jobs got killed. Today, we contacted Bright support, and they were kind enough to help us out through a live screen-sharing session. The changes they had to make required rewriting the Slurm configuration once more, much like those other times, and so jobs got killed earlier today as well.
Is the issue resolved?
Yes. The issue had been bugging us for a few weeks, but it should now be completely resolved.
Can I submit jobs again? Won’t they get killed again?
Yes, you may submit jobs again, and no, they should not get killed again. No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.
But I’m afraid they’ll get killed again!
You shouldn’t be afraid. We recognize that the recent shakiness of the queues has likely shaken user confidence, but again, the core issue should now be resolved and we expect stable times ahead for our cluster. *knocks on wood*
We do apologize for the inconvenience this has created. For any further questions or comments, feel free to contact us at help(at)hpcf.upr.edu.