22 Jun

Boqueron Back Online – June 22, 2020

Boqueron’s replacement switch has finally been installed and Boqueron is back online. Users should be able to log back in and use the cluster normally.

Before resuming your work, we invite you to take a few moments to back up all your data currently on Boqueron. Please remember that, per our Storage Policy:

Boqueron is not meant to provide reliable long-term storage. We do our best to keep users’ data safe, but by making use of the HPCf cluster (and HPCf systems in general), users agree that they are the ones ultimately responsible for keeping their own data safe, including keeping their own backups outside of HPCf systems.

We thank you for your patience during this outage and we apologize for the impact it no doubt had on your work. As always, please send any questions or comments to help@hpcf.upr.edu, and we will gladly get back to you.

03 Jun

Boqueron Unplanned Downtime Status Report – June 3, 2020

The purchase of the replacement switch for Boqueron is underway. Due to the pandemic, it is difficult to estimate how long it will actually take for the switch to ship and reach us, but we wanted to let our users know how things are proceeding.

We will most likely have to wait for the switch to arrive in order to reestablish user access to their data, since the alternatives that we considered do not seem viable.

We will continue to post updates on Boqueron’s status as new developments occur. As always, feel free to write any questions or comments to help@hpcf.upr.edu, and we will get back to you as soon as possible.

26 May

Boqueron Unplanned Downtime Status Report – May 26, 2020

During the past week we identified the main issue that has caused the downtime and we began taking steps to resolve it. The problem lies in Boqueron’s main internal 10 G switch, which, after extensive testing, we have concluded will unfortunately need to be replaced. We have begun the process of obtaining quotes for the new switch to proceed with a purchase.

We are also discussing what our alternatives are during the time it will take for the replacement switch to arrive. We will update as soon as we have a decision regarding this matter.

We would like to at least restore user access to their files as soon as possible, but this too is impacted by the network outage. The /home directory is located on one machine, while the LDAP server (similar to a Windows Active Directory) used to authenticate users so they can log in runs on a separate machine. This separation is by design; it is not an error. Without a means of getting the two machines to talk to each other, users will not be able to log in and reach their files.

We apologize for the inconvenience this outage has no doubt caused to your work. As always, please send any questions or comments to help@hpcf.upr.edu and we will get back to you as soon as possible.

19 May

Boqueron Unplanned Downtime – May 19, 2020

An internal networking issue is disrupting Boqueron’s operations. User data is safe, but it will remain inaccessible to users until the issue is resolved. We are working both remotely and on-site to fix the issue as quickly as possible within the restrictions in place to stop the spread of the coronavirus. We will update with more details as they become available.

As always, if you have any questions or comments, feel free to write to us at help@hpcf.upr.edu.

12 Feb

Boqueron Back Online at Reduced Capacity

This is a quick note to let our users know that the maintenance work scheduled for today, Wednesday, February 12, went as planned and Boqueron is back online at reduced capacity.

As indicated in this week’s earlier post, Boqueron will continue to operate at reduced capacity until the electrical work at the data center is completed.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

10 Feb

Boqueron Planned Maintenance This Week

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, February 15 to Sunday, February 16. The electrical work will require a full power down of Boqueron this Wednesday, February 12. After Boqueron is brought back online (which should happen the same day), it will operate at reduced capacity until the electrical work at the data center is completed. “Reduced capacity” means that a significant portion of Boqueron’s nodes will remain offline in order to lighten the load on the data center’s circuits.

Because of this, we have placed some limits on the jobs Boqueron may currently accept. Specifically, any jobs that cannot complete their run by 8:00 am Wednesday will not be accepted until Boqueron is powered down and brought back online. Furthermore, once Boqueron is brought back online, only half of its nodes will be available to run jobs, so the remaining half will not accept any jobs until the electrical maintenance is completed at the data center.

All of Boqueron’s nodes will be powered down on Wednesday morning. Any jobs currently running will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only informed recently that the date for the electrical work would be this weekend, so we had no way of alerting our users much earlier than this. We apologize for any inconvenience this situation may cause for your work.

Half of Boqueron’s nodes are set to be powered back up on Wednesday, February 12. The rest are set to be powered up again on Monday, February 17. The nodes that will remain powered off throughout the weekend are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.
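For users who want to check a hostname against the list above, the ranges can be expanded with a few lines of Python. This is only a minimal sketch: the helper names `expand_range` and `is_offline` are hypothetical, and the zero-padded three-digit naming is assumed from the node names quoted in this post.

```python
def expand_range(prefix, first, last, width=3):
    """Return zero-padded node names from first through last, inclusive."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


# Ranges copied from the announcement above.
OFFLINE_RANGES = [(11, 20), (31, 40), (51, 60), (71, 80)]

offline_nodes = []
for first, last in OFFLINE_RANGES:
    offline_nodes.extend(expand_range("node", first, last))


def is_offline(hostname):
    """True if the given node name is in the powered-off set."""
    return hostname in offline_nodes


print(len(offline_nodes))                             # number of nodes kept off
print(is_offline("node015"), is_offline("node021"))   # inside vs. outside the ranges
```

The four ranges cover ten nodes each, so the list holds 40 names in total.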

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

20 Nov

Boqueron Decreased Capacity During the Weekend

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, November 23 to Sunday, November 24. The electrical work requires that we power down a significant portion of Boqueron’s nodes in order to lighten the load on the data center’s circuits.

Because of this, we have decided to stop accepting new jobs on half of Boqueron’s nodes effective immediately. These “drained” nodes will be powered down on Friday morning. Any jobs currently running on these nodes will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only informed today that the date for the electrical work would be this weekend, so we had no way of alerting our users earlier. We apologize for any inconvenience this situation may cause for your work.

All the affected nodes will be powered on again on Monday November 25. The nodes affected by this maintenance are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

01 May

HPCf Helpdesk Unscheduled Downtime and Alternative Contact

The HPCf Helpdesk is experiencing unexpected downtime. We are currently hard at work fixing the issue, but until everything has been resolved, we will not be able to check the email address help@hpcf.upr.edu for incoming requests. Until maintenance work has finished, please direct any and all questions, requests, comments, or issues to the new, temporary address help.temp@hpcf.upr.edu. We are sorry for the inconvenience, and we thank you for your patience and continued support of the HPCf.

26 Jan

Boqueron Scheduled Partial Downtime Jan 30 – Jan 31, 2017

The staff from the data center where Boqueron is hosted will be carrying out some electrical work next week that will impact Boqueron. To cooperate with the efforts and to protect as many jobs as possible from getting spontaneously killed, we have opened a maintenance window starting at 7:30am Monday January 30th and ending at 12:00pm Tuesday January 31st. Any newly submitted jobs that cannot complete before the maintenance window begins will be held in the queue until the window ends.
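The hold rule described above can be sketched in Python as a quick way to check a planned job against the window. This is an illustration only, not the scheduler's actual logic: the function names are hypothetical, and it assumes a job would start the moment it is submitted and run for its full requested time limit.

```python
from datetime import datetime, timedelta

# Maintenance window from the announcement above (local time).
WINDOW_START = datetime(2017, 1, 30, 7, 30)   # 7:30am Monday January 30th
WINDOW_END = datetime(2017, 1, 31, 12, 0)     # 12:00pm Tuesday January 31st


def fits_before_window(submit_time, time_limit):
    """True if a job started at submit_time ends by the window start."""
    return submit_time + time_limit <= WINDOW_START


def earliest_start(submit_time, time_limit):
    """Earliest time the job could start under the assumptions stated above."""
    if submit_time >= WINDOW_END or fits_before_window(submit_time, time_limit):
        return submit_time
    # Otherwise the job is held in the queue until the window ends.
    return WINDOW_END


submitted = datetime(2017, 1, 27, 9, 0)
print(earliest_start(submitted, timedelta(hours=48)))  # short job, can run right away
print(earliest_start(submitted, timedelta(hours=96)))  # long job, held until the window ends
```

Requesting a shorter time limit, where the job allows it, is what lets a job run before the window rather than wait in the queue.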

During this window, the Boqueron login node will continue to operate, and the file systems /home and /work should continue to be available, so you should be able to access your files during this time. The worker nodes will be powered down, however.

Jobs that are currently running will be allowed to continue to run, but if any are still running when the maintenance window begins, they will unfortunately be killed.

We realize this announcement comes on rather short notice, but please understand that HPCf staff were notified of this electrical work only yesterday afternoon.

We apologize for any inconvenience you may experience from this maintenance window, and we thank you for your cooperation. As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

31 Oct

Boqueron Power Outage During the Halloween Weekend

An unexpected outage in the data center where Boqueron is hosted took down 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed. We have since turned the nodes back on and they should now be working as usual.

To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41~60) will be rebooted at a later time, and so they will remain closed off to new jobs until then (they will be listed as either “draining” or “drained” by Slurm). The jobs that are currently running there will be allowed to run uninterrupted.

The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems. Other HPCf services such as our web site, however, were affected. They are also back online, and you should not experience any issues when using them.

We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.

As always, if you have any questions, problems, or comments, you may write to us at help@hpcf.upr.edu.