News

Boqueron Scheduled Partial Downtime Jan 30 - Jan 31, 2017

Staff at the data center where Boqueron is hosted will be carrying out electrical work next week that will impact Boqueron. To cooperate with these efforts and to protect as many jobs as possible from being killed unexpectedly, we have opened a maintenance window starting at 7:30am on Monday, January 30th and ending at 12:00pm on Tuesday, January 31st. Any newly submitted jobs that cannot complete before the maintenance window begins will be held in the queue until the window ends. [...]

Boqueron Power Outage During the Halloween Weekend

An unexpected outage in the data center where Boqueron is hosted took down 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed. We have since turned the nodes back on, and they should now be working as usual. To make sure the entire cluster has recovered from the outage, the 20 nodes that stayed up during the outage (nodes 41~60) will be rebooted at a later time, so they will remain closed to new jobs until then (they will be listed as either “draini[...]

/work Filesystem Outage Has Been Fixed

Early today, Boqueron's /work filesystem had a hiccup and went offline for a few hours. This caused some odd behavior, including users not being able to sign in through certain software such as SCP programs. We've fixed the issue, but because jobs run on /work, any jobs that were still running after /work went offline were eventually killed. Everything should be fine now, and you should be able to resubmit your jobs with no problems. No data loss should have occurred as a result of this error. Please rem[...]

Boqueron Unscheduled Maintenance During Morning of June 1st, 2016

As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today. Unfortunately, this maintenance killed all running and pending jobs. What happened? An issue in the way Slurm (the queue manager) was interacting with the cluster manager software that we use on Boqueron (Bright CM) was causing Slurm's configuration to be rewritten spontaneously when certain actions were taken in the cluster. Every time this happened, jobs got killed. [...]

Notice: Boqueron Maintenance Completed - May 18, 2016

We have completed the maintenance work on Boqueron that had been scheduled for today.  Core sharing should now be disabled.  Users may now resubmit jobs to Boqueron.  Everything should be working fine, but we encourage users to submit at least one test job first to ensure everything is working correctly.  Please report any issues or questions to help@hpcf.upr.edu.[...]