12 Feb

Boqueron Back Online at Reduced Capacity

This is a quick note to let our users know that the maintenance work scheduled for today, Wednesday, February 12, went as planned and Boqueron is back online at reduced capacity.

As indicated in our post earlier this week, Boqueron will continue to operate at reduced capacity until the electrical work at the data center is completed.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

10 Feb

Boqueron Planned Maintenance This Week

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, February 15 to Sunday, February 16. The electrical work will require a full power down of Boqueron this Wednesday, February 12. After Boqueron is brought back online (which should happen the same day), it will operate at reduced capacity until the electrical work at the data center is completed. “Reduced capacity” means that a significant portion of Boqueron’s nodes will remain offline in order to lighten the load on the data center’s circuits.

Because of this, we have placed some limits on the jobs Boqueron may currently accept. Specifically, any jobs that cannot complete their run by 8:00 am Wednesday will not be accepted until Boqueron is powered down and brought back online. Furthermore, once Boqueron is brought back online, only half of its nodes will be available to run jobs, so the remaining half will not accept any jobs until the electrical maintenance is completed at the data center.
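The cutoff rule above can be sketched as a small check. This is only an illustration, not a description of how the scheduler actually decides: it assumes a job starts the moment it is submitted (the best case) and that what counts is the job's requested time limit. The year 2020 is an assumption chosen so that February 12 falls on a Wednesday, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Power-down cutoff from the announcement: 8:00 am on Wednesday, February 12.
# The year is an assumption; the post does not state it.
SHUTDOWN = datetime(2020, 2, 12, 8, 0)


def fits_before_shutdown(start_time, time_limit, shutdown=SHUTDOWN):
    """Return True if a job starting at start_time ends by the cutoff."""
    return start_time + time_limit <= shutdown


def longest_time_limit(start_time, shutdown=SHUTDOWN):
    """Largest time limit that still fits before the cutoff (zero if none)."""
    return max(shutdown - start_time, timedelta(0))


# A job started Monday, February 10 at 2:00 pm (hypothetical example).
start = datetime(2020, 2, 10, 14, 0)
print(fits_before_shutdown(start, timedelta(hours=24)))  # True
print(fits_before_shutdown(start, timedelta(hours=48)))  # False
print(longest_time_limit(start))                         # 1 day, 18:00:00
```

In practice a job may sit in the queue before it starts, so the time actually left before the cutoff can be shorter than this best-case figure.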

All of Boqueron’s nodes will be powered down on Wednesday morning. Any jobs currently running will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only informed recently that the date for the electrical work would be this weekend, so we had no way of alerting our users much earlier than this. We apologize for any inconvenience this situation may cause for your work.

Half of Boqueron’s nodes are set to be powered back on this Wednesday, February 12. The rest are set to be powered up again on Monday, February 17. The nodes that will remain powered off throughout the weekend are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.
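For users who want to check whether a particular node is on this list, the ranges above can be expanded into individual names. The following is a minimal sketch; the helper name and the assumption that every node name is "node" followed by a three-digit, zero-padded number are inferred from the names in this post.

```python
def expand_range(first, last, prefix="node", width=3):
    """Return node names from first to last inclusive, zero-padded."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


# Ranges of nodes listed in the announcement as staying powered off.
OFFLINE_RANGES = [(11, 20), (31, 40), (51, 60), (71, 80)]

offline_nodes = [
    name
    for first, last in OFFLINE_RANGES
    for name in expand_range(first, last)
]

print(len(offline_nodes))                   # 40
print(offline_nodes[0], offline_nodes[-1])  # node011 node080
print("node035" in offline_nodes)           # True
print("node045" in offline_nodes)           # False
```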

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

20 Nov

Boqueron Decreased Capacity During the Weekend

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, November 23 to Sunday, November 24. The electrical work requires that we power down a significant portion of Boqueron’s nodes in order to lighten the load on the data center’s circuits.

Because of this, we have decided to stop accepting new jobs on half of Boqueron’s nodes effective immediately. These “drained” nodes will be powered down on Friday morning. Any jobs currently running on these nodes will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only informed today that the date for the electrical work would be this weekend, so we had no way of alerting our users earlier. We apologize for any inconvenience this situation may cause for your work.

All the affected nodes will be powered on again on Monday November 25. The nodes affected by this maintenance are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

11 Oct

Status Report After Hurricane Maria

We hope that this message finds you all safe.

As you all probably know, Hurricane Maria has brought an unprecedented amount of devastation to all of Puerto Rico. The HPCf has luckily not suffered much damage, and our staff and equipment are safe. Yesterday, 20 days after Maria’s passage through the island, we finally managed to restore most of our virtual machines, which host a significant number of HPCf services, including our website.

At the time of writing, our cluster Boqueron seems to be working as usual, and users should be able to log in and submit jobs. During the hurricane, there was a power outage that knocked out 20 of Boqueron’s nodes and wiped out any jobs that were running there. Out of those 20 nodes, 19 have been restored. If you encounter any issues with Boqueron, please let us know.

While our machines are working close to normal, the HPCf offices themselves are still not fully operational. We have no electricity there yet, and so HPCf staff may be slow to respond to user tickets and inquiries. We thank you for your understanding during this difficult time.

We hope that you are all safe and that we can all recover quickly from Hurricane Maria’s devastation.

26 Jan

Boqueron Scheduled Partial Downtime Jan 30 – Jan 31, 2017

The staff from the data center where Boqueron is hosted will be carrying out some electrical work next week that will impact Boqueron. To cooperate with these efforts and to protect as many jobs as possible from being killed unexpectedly, we have opened a maintenance window starting at 7:30 am Monday, January 30th and ending at 12:00 pm Tuesday, January 31st. Any newly submitted jobs that cannot complete before the maintenance window begins will be held in the queue until the window ends.

During this window, the Boqueron login node will continue to operate, and the file systems /home and /work should continue to be available, so you should be able to access your files during this time.  The worker nodes will be powered down, however.

Jobs that are currently running will be allowed to continue to run, but if any are still running when the maintenance window begins, they will unfortunately be killed.

We realize this announcement comes on rather short notice, but please understand that HPCf staff was only notified of this electrical work yesterday afternoon.

We apologize for any inconvenience you may experience from this maintenance window, and we thank you for your cooperation.  As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

31 Oct

Boqueron Power Outage During the Halloween Weekend

An unexpected outage in the data center where Boqueron is hosted wiped out 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed.  We have since turned the nodes back on and they should now be working as usual.

To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41~60) will be rebooted at a later time, and so they will remain closed off to new jobs until then (they will be listed as either “draining” or “drained” by Slurm). The jobs that are currently running there will be allowed to run uninterrupted.

The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems.  Other HPCf services such as our web site, however, were affected.  They are also back online, and you should not experience any issues when using them.

We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.

As always, if you have any questions, problems, or comments, you may write to us at help@hpcf.upr.edu.

15 Aug

/work Filesystem Outage Has Been Fixed

Early today, Boqueron’s /work had a hiccup and it went offline for a few hours.  This caused some weird behavior, including not being able to sign in through certain software such as SCP programs.  We’ve fixed the issue, but because jobs run on /work, the running jobs that remained after /work went offline eventually got killed.  Everything should be fine now and you should be able to resubmit your jobs with no problems.  No data loss should have occurred as a result of this error.

Please remember that per HPCf Usage Policies, data in /work is not backed up.  Always make sure to move important data off of /work and into a more permanent storage location, such as a computer in your laboratory or a personal workstation.

As always, if you have any other issues or questions, please let us know at help@hpcf.upr.edu.

01 Jun

Boqueron Unscheduled Maintenance During Morning of June 1st, 2016

As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today.  This maintenance unfortunately had the effect of killing all running and pending jobs in the process.

What happened?

An issue in the way Slurm (the queue manager) was interacting with the cluster manager software that we use on Boqueron (Bright CM) was causing Slurm’s configuration to get rewritten spontaneously when certain actions were taken in the cluster.  Every time this happened, jobs got killed.  Today, we contacted Bright support and they were kind enough to help us out through a live screen-sharing session.  The changes they had to make required the Slurm configuration to get rewritten much like those other times, and so earlier today jobs got killed as well.

Is the issue resolved?

Yes.  The issue had been bugging us for a few weeks, but it should now be completely resolved.

Can I submit jobs again?  Won’t they get killed again?

Yes, you may submit jobs again; and no, they should not get killed again.  No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.

But I’m afraid they’ll get killed again!

You shouldn’t be afraid.  We recognize that the recent shakiness of the queues would cause user confidence to drop, but again, the core issue should now be resolved and we expect stable times for our cluster. *knocks on wood*

We do apologize for the inconvenience this has created.  For any further questions or comments, feel free to contact us at help@hpcf.upr.edu.

10 May

Notice of Boqueron Scheduled Maintenance – Wed May 18, 2016

Boqueron will be undergoing scheduled maintenance on Wednesday May 18th, 2016.  Effective immediately, all jobs which cannot complete their run by 12:00 am May 18th will not be allocated until maintenance is over.  We have reserved a maintenance window starting at 12:00 am and ending at 1:00 pm the same day, though we anticipate the maintenance operations will take a much shorter time than that.  We will email another notice indicating when maintenance is done so that you may submit jobs again.

Reason for Maintenance Window

As some of you may have noticed, Boqueron is currently set up in a way that allows a single compute core to run multiple jobs at once.  That is to say, a user’s job currently does not actually reserve compute cores; the cores are shared among various jobs.  This is not the intended behavior of Boqueron.  Not only does this hurt jobs’ performance, but it also places a heavy load on the compute nodes.

To fix this, Slurm (the resource manager) must be switched off and reconfigured.  Switching off Slurm would kill any jobs that are running at the moment, so we need to create a maintenance window to ensure that no jobs are killed in the process.

What effect will this change have on future jobs?

After the change, we anticipate that all user jobs will run much faster than they do right now.  There’s a small trade-off, though: since cores will now be reserved, we’ll start seeing jobs actually waiting in line to be allocated.  So far, because of the core sharing, most jobs have started running almost as soon as they were submitted (but with considerably degraded performance).  We anticipate this will change, and that jobs will actually have to wait before being allocated (but they will run much faster once they are allocated).
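The trade-off can be illustrated with a toy model of a single core. This is a simplified sketch, not a measurement of Boqueron: it assumes that jobs sharing a core split its time equally with no extra overhead, that reserved jobs run one at a time in submission order, and that all jobs are submitted at the same moment. The function names are hypothetical.

```python
def shared_finish_times(job_hours):
    """Finish times when all jobs start at once and split one core equally."""
    order = sorted(range(len(job_hours)), key=lambda i: job_hours[i])
    finish = [0.0] * len(job_hours)
    clock = 0.0        # elapsed wall-clock hours
    done_work = 0.0    # compute hours each unfinished job has received so far
    remaining = len(job_hours)
    for i in order:
        # Each unfinished job advances at a rate of 1/remaining of the core.
        clock += (job_hours[i] - done_work) * remaining
        done_work = job_hours[i]
        finish[i] = clock
        remaining -= 1
    return finish


def reserved_finish_times(job_hours):
    """Finish times when each job gets the whole core, one after another."""
    finish, clock = [], 0.0
    for hours in job_hours:
        clock += hours
        finish.append(clock)
    return finish


jobs = [2, 2, 2, 2]  # four jobs needing 2 hours of compute each
print(shared_finish_times(jobs))    # [8.0, 8.0, 8.0, 8.0]
print(reserved_finish_times(jobs))  # [2.0, 4.0, 6.0, 8.0]
```

In this model, every shared job starts immediately but takes 8 hours to finish, while reserved jobs wait their turn yet finish after 5 hours on average. Real nodes have many cores and incur additional overhead when cores are shared, which this sketch ignores.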

As always, if you have any questions, feel free to contact us at help@hpcf.upr.edu.