12 Feb

Boqueron Back Online at Reduced Capacity

This is a quick note to let our users know that the maintenance work scheduled for today, Wednesday, February 12, went as planned and that Boqueron is back online at reduced capacity.

As indicated in our earlier post this week, Boqueron will continue to operate at reduced capacity until the electrical work at the data center is completed.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

10 Feb

Boqueron Planned Maintenance This Week

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, February 15, to Sunday, February 16. The electrical work will require a full power down of Boqueron this Wednesday, February 12. After Boqueron is brought back online (which should happen the same day), it will operate at reduced capacity until the electrical work at the data center is completed. “Reduced capacity” means that a significant portion of Boqueron’s nodes will remain offline in order to lighten the load on the data center’s circuits.

Because of this, we have placed some limits on the jobs Boqueron may currently accept. Specifically, any jobs that cannot complete their run by 8:00 am Wednesday will not be accepted until Boqueron has been powered down and brought back online. Furthermore, once Boqueron is back online, only half of its nodes will be available to run jobs; the remaining half will not accept any jobs until the electrical maintenance at the data center is completed.

All of Boqueron’s nodes will be powered down on Wednesday morning. Any jobs currently running will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only recently informed that the electrical work would take place this weekend, so we had no way of alerting our users much earlier. We apologize for any inconvenience this situation may cause for your work.

Half of Boqueron’s nodes are set to be powered back on this Wednesday, February 12. The rest are set to be powered on again on Monday, February 17. The nodes that will remain powered off throughout the weekend are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

20 Nov

Boqueron Decreased Capacity During the Weekend

The data center where the Boqueron cluster is hosted will be undergoing some electrical maintenance from Saturday, November 23, to Sunday, November 24. The electrical work requires that we power down a significant portion of Boqueron’s nodes in order to lighten the load on the data center’s circuits.

Because of this, we have decided to stop accepting new jobs on half of Boqueron’s nodes effective immediately. These “drained” nodes will be powered down on Friday morning. Any jobs currently running on these nodes will continue as usual, but, if they are still running by the time the nodes are powered down, they will unfortunately be killed.

We realize this information comes on somewhat short notice, but we at the HPCf were only informed today that the electrical work would take place this weekend, so we had no way of alerting our users earlier. We apologize for any inconvenience this situation may cause for your work.

All the affected nodes will be powered on again on Monday, November 25. The nodes affected by this maintenance are the following: node011 through node020, node031 through node040, node051 through node060, and node071 through node080.

As always, if you have any questions or comments, please send them to help@hpcf.upr.edu.

18 Jul

Boqueron /work Unexpected Downtime Resolved

Boqueron’s /work file system recently experienced unexpected downtime due to a hardware issue. Our colleagues at Central Administration provided us with the replacement hardware we needed, and the issue appears to have finally been resolved. As far as we can tell, there has not been any data loss; however, we still encourage users to connect to Boqueron and verify the integrity of their files. We’d also like to take this opportunity to remind our users that, per our Usage Policies, the /work directory in Boqueron is built specifically for performance and not for reliability, and that, as such, data in users’ /work directories is not and will not be backed up. Users are responsible for maintaining their own backups of any data they keep in their /work directories.

We apologize for any delay in communication regarding this issue. The HPCf has been understaffed for the past two weeks, which is precisely when the hardware problem occurred. All HPCf staff should be back at the office starting Monday, July 23.

We apologize for the inconvenience this issue has no doubt caused, and we thank you for your patience and continued support for the HPCf. As always, you may direct any questions or comments to help@hpcf.upr.edu.

01 May

HPCf Helpdesk Unscheduled Downtime and Alternative Contact

The HPCf Helpdesk is experiencing unexpected downtime. We are currently hard at work fixing the issue, but until everything has been resolved, we will not be able to check the email address help@hpcf.upr.edu for incoming requests. Until maintenance work has finished, please direct any and all questions, requests, comments, or issues to the new, temporary address help.temp@hpcf.upr.edu. We are sorry for the inconvenience, and we thank you for your patience and continued support of the HPCf.

19 Jan

Boqueron Power Outage on Friday, January 19, 2018

On the morning of Friday, January 19, there was a power outage at the data center where HPCf computers are hosted. This outage was outside of HPCf’s control, and it unfortunately knocked out 60 of Boqueron’s nodes, which caused many running jobs to be killed. The power outage has been fixed, and most of Boqueron’s features should now be working as normal. Some of the nodes may take a bit longer to reconnect to the cluster, however, so we kindly ask for your patience in this matter. We apologize for the inconvenience this situation has caused our users. If you have any questions or experience any errors, please write to us at help@hpcf.upr.edu, and we will gladly help you out.

11 Oct

Status Report After Hurricane Maria

We hope that this message finds you all safely.

As you all probably know, Hurricane Maria has brought an unprecedented amount of devastation to all of Puerto Rico. Fortunately, the HPCf has not suffered much damage, and our staff and equipment are safe. Yesterday, 20 days after Maria’s passage through the island, we finally managed to restore most of our virtual machines, which host a significant number of HPCf services, including our website.

At the time of writing, our cluster Boqueron seems to be working as usual, and users should be able to log in and submit jobs. During the hurricane, there was a power outage that knocked out 20 of Boqueron’s nodes and wiped out any jobs that were running there. Out of those 20 nodes, 19 have been restored. If you encounter any issues with Boqueron, please let us know.

While our machines are working close to normal, the HPCf offices themselves are still not fully operational. We have no electricity there yet, and so HPCf staff may be slow to respond to user tickets and inquiries. We thank you for your understanding during this difficult time.

We hope that you are all safe and that we can all recover quickly from Hurricane Maria’s devastation.

18 Sep

HPCf Services During Hurricane Maria

As you all probably know, Hurricane Maria is, at the time of writing, set to impact Puerto Rico starting this Wednesday as a Category 4 hurricane. We are writing this post to explain what to expect from HPCf services during this time. In short, HPCf services will operate just as they did during Hurricane Irma.

The HPCf machines are currently hosted at a private data center outside of any UPR campus or property. Our systems should be protected during the hurricane, and HPCf services (including jobs running on Boqueron) should continue to run as usual. If you manage to get electricity and an Internet connection (or if you are currently outside of Puerto Rico), you could, in theory, continue to work with HPCf resources even during the hurricane and its immediate aftermath.

That said, HPCf staff will not be available to provide regular support or maintenance to HPCf resources during the hurricane. That means that any help tickets will, unfortunately, remain unanswered until Hurricane Maria has safely moved away from Puerto Rico. How quickly HPCf staff will be able to respond to support tickets during the aftermath of Maria will largely depend on the damage that Maria causes nationwide.

We are optimistic that our computers will not suffer downtime during the hurricane, but please keep in mind that there is always a possibility that an outage could occur at the data center, and that such an outage could have an unpredictable impact on user data. If you currently have absolutely crucial data on HPCf systems that you have not yet backed up (data that you absolutely cannot afford to lose), please make time during your hurricane preparations to back up your data. As stated in our Storage Policy, we do our best to protect user data, but users are ultimately responsible for keeping their data safe.

We appreciate your understanding, and we hope you all stay safe during this major weather event.

05 Sep

HPCf Services During Hurricane Irma

As you all know, Hurricane Irma is, at the time of writing, set to impact Puerto Rico starting tomorrow as a Category 5 hurricane. We are writing this post to explain what to expect from HPCf services during this time.

The HPCf machines are currently hosted at a private data center outside of any UPR campus or property. Our systems should be protected during the hurricane, and HPCf services (including jobs running on Boqueron) should continue to run as usual. If you manage to get electricity and an Internet connection (or if you are currently outside of Puerto Rico), you could, in theory, continue to work with HPCf resources even during the hurricane and its immediate aftermath.

That said, HPCf staff will not be available to provide regular support or maintenance to HPCf resources during the hurricane. That means that any help tickets will, unfortunately, remain unanswered until Hurricane Irma has safely moved away from Puerto Rico. How quickly HPCf staff will be able to respond to support tickets during the aftermath of Irma will largely depend on the damage that Irma causes nationwide.

We are optimistic that our computers will not suffer downtime during the hurricane, but please keep in mind that there is always a possibility that an outage could occur at the data center, and that such an outage could have an unpredictable impact on user data. If you currently have absolutely crucial data on HPCf systems that you have not yet backed up (data that you absolutely cannot afford to lose), please make time during your hurricane preparations to back up your data. As stated in our Storage Policy, we do our best to protect user data, but users are ultimately responsible for keeping their data safe.

We appreciate your understanding, and we hope you all stay safe during this unprecedented weather event.