01 May

HPCf Helpdesk Unscheduled Downtime and Alternative Contact

The HPCf Helpdesk is experiencing unexpected downtime. We are hard at work fixing the issue, but until it has been resolved, we will not be able to check the email address help@hpcf.upr.edu for incoming requests. Until maintenance work has finished, please direct any questions, requests, comments, or issues to the new, temporary address help.temp(at)hpcf.upr.edu. We are sorry for the inconvenience, and we thank you for your patience and continued support of the HPCf.

19 Jan

Boqueron Power Outage on Friday January 19, 2018

On the morning of Friday, January 19, there was a power outage at the data center where HPCf computers are hosted. This outage was outside of HPCf’s control, and it unfortunately knocked out 60 of Boqueron’s nodes, which resulted in many jobs getting killed unexpectedly. Power has been restored, and most of Boqueron’s features should now be working as normal. Some of the nodes may take a bit longer to reconnect to the cluster, however, so we kindly ask for your patience in this matter. We apologize for the inconvenience this situation has caused our users. If you have any questions or experience any errors, please write to us at help(at)hpcf.upr.edu, and we will gladly help you out.

11 Oct

Status Report After Hurricane Maria

We hope that this message finds you all safe.

As you all probably know, hurricane Maria has brought an unprecedented amount of devastation to all of Puerto Rico. The HPCf has luckily not suffered much damage, and our staff and equipment are safe. Yesterday, 20 days after Maria’s passage through the island, we finally managed to restore most of our virtual machines, which host a significant number of HPCf services, including our website.

At the time of writing, our cluster Boqueron seems to be working as usual, and users should be able to log in and submit jobs. During the hurricane, there was a power outage that knocked out 20 of Boqueron’s nodes and wiped out any jobs that were running there. Out of those 20 nodes, 19 have been restored. If you encounter any issues with Boqueron, please let us know.

While our machines are working close to normal, the HPCf offices themselves are still not fully operational. We have no electricity there yet, and so HPCf staff may be slow to respond to user tickets and inquiries. We thank you for your understanding during this difficult time.

We hope that you are all safe and that we can all recover quickly from hurricane Maria’s devastation.

18 Sep

HPCf Services During Hurricane Maria

As you all probably know, hurricane Maria is, at the time of writing, set to impact Puerto Rico starting this Wednesday as a category 4 hurricane. We are writing this post to explain what to expect from HPCf services during this time. In short, HPCf services will operate just like they did during hurricane Irma.

The HPCf machines are currently hosted at a private data center outside of any UPR campus or property. Our systems should be protected during the hurricane, and HPCf services (including jobs running on Boqueron) should continue to run as usual. If you manage to get electricity and an Internet connection (or if you are currently outside of Puerto Rico), you could, in theory, continue to work with HPCf resources even during the hurricane and its immediate aftermath.

That said, HPCf staff will not be available to provide regular support or maintenance to HPCf resources during the hurricane. That means that any help tickets will, unfortunately, remain unanswered until hurricane Maria has safely moved away from Puerto Rico. How quickly HPCf staff will be able to respond to support tickets during the aftermath of Maria will largely depend on the damage that Maria causes nationwide.

We are optimistic that our computers will not suffer downtime during the hurricane, but please keep in mind that it is still always a possibility that an outage could occur at the data center, and that such an outage could have unpredictable impact on user data. If you currently have absolutely crucial data on HPCf systems that you have not yet backed up–data that you absolutely cannot afford to lose–please make time during your hurricane preparations to back up your data. As stated in our Storage Policy, we do our best to protect user data, but users are ultimately responsible for keeping their data safe.

We appreciate your understanding, and we wish that you all stay safe during this major weather event.

05 Sep

HPCf Services During Hurricane Irma

As you all know, hurricane Irma is, at the time of writing, set to impact Puerto Rico starting tomorrow as a category 5 hurricane. We are writing this post to explain what to expect from HPCf services during this time.

The HPCf machines are currently hosted at a private data center outside of any UPR campus or property. Our systems should be protected during the hurricane, and HPCf services (including jobs running on Boqueron) should continue to run as usual. If you manage to get electricity and an Internet connection (or if you are currently outside of Puerto Rico), you could, in theory, continue to work with HPCf resources even during the hurricane and its immediate aftermath.

That said, HPCf staff will not be available to provide regular support or maintenance to HPCf resources during the hurricane. That means that any help tickets will, unfortunately, remain unanswered until hurricane Irma has safely moved away from Puerto Rico. How quickly HPCf staff will be able to respond to support tickets during the aftermath of Irma will largely depend on the damage that Irma causes nationwide.

We are optimistic that our computers will not suffer downtime during the hurricane, but please keep in mind that it is still always a possibility that an outage could occur at the data center, and that such an outage could have unpredictable impact on user data. If you currently have absolutely crucial data on HPCf systems that you have not yet backed up–data that you absolutely cannot afford to lose–please make time during your hurricane preparations to back up your data. As stated in our Storage Policy, we do our best to protect user data, but users are ultimately responsible for keeping their data safe.

We appreciate your understanding, and we wish that you all stay safe during this unprecedented weather event.

19 Jun

Old Cluster Nanobio’s Retirement

We at the HPCf have decided that our old cluster Nanobio (Boqueron’s predecessor) will be retired at the end of July of this year. On August 1st, 2017, Nanobio will be powered down, and any data left in any of its file systems (including /home and /work) will no longer be available from that time onward. As we have done since Boqueron went online in March 2016, we encourage all users who still have any data on Nanobio to move or copy it to some other system as soon as possible. Transferring data to Boqueron is one option, but please keep in mind that, per our usage policies, users should transfer to Boqueron only the data that they will actually need for their work there; Boqueron is not designed or intended for long-term storage. Users will be responsible for making sure they do not incur any data loss as a result of Nanobio’s retirement.

Any users whose HPCf accounts were created after Boqueron went online may disregard this notice since they do not have a Nanobio account. As always, if you have any questions or comments, feel free to send them to help(at)hpcf.upr.edu.

26 Jan

Boqueron Scheduled Partial Downtime Jan 30 – Jan 31, 2017

The staff from the data center where Boqueron is hosted will be carrying out some electrical work next week that will impact Boqueron. To cooperate with their efforts and to protect as many jobs as possible from getting killed unexpectedly, we have opened a maintenance window starting at 7:30am Monday January 30th and ending at 12:00pm Tuesday January 31st. Any newly submitted jobs that cannot complete before the maintenance window begins will be held in the queue until the window ends.
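
To make the hold rule above concrete, here is a minimal sketch of the arithmetic involved. It is not Slurm's actual scheduling logic; it assumes idle nodes are available and that the only constraint is the maintenance window, and the function name `earliest_start` is hypothetical. The window boundaries are the ones stated in this announcement.

```python
from datetime import datetime, timedelta

# Maintenance window boundaries as stated in the announcement.
WINDOW_START = datetime(2017, 1, 30, 7, 30)
WINDOW_END = datetime(2017, 1, 31, 12, 0)


def earliest_start(submit_time, time_limit):
    """Return the earliest time a job could start under the hold rule.

    Simplifying assumption: free nodes are always available, so the
    maintenance window is the only thing that can delay a job.
    """
    if submit_time >= WINDOW_END:
        # The window is already over; nothing holds the job.
        return submit_time
    if submit_time + time_limit <= WINDOW_START:
        # The job's requested time limit ends before the window begins.
        return submit_time
    # Otherwise the job is held in the queue until the window ends.
    return WINDOW_END


if __name__ == "__main__":
    friday_morning = datetime(2017, 1, 27, 9, 0)
    print(earliest_start(friday_morning, timedelta(hours=48)))
    print(earliest_start(friday_morning, timedelta(hours=72)))
```

In this model, a job submitted on Friday morning with a 48-hour time limit can start right away, while the same job with a 72-hour limit would wait until the window closes. Requesting a realistic time limit therefore improves the chances of a job running before the window.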

During this window, the Boqueron login node will continue to operate, and the file systems /home and /work should continue to be available, so you should be able to access your files during this time.  The worker nodes will be powered down, however.

Jobs that are currently running will be allowed to continue, but any that are still running when the maintenance window begins will unfortunately be killed.

We realize this announcement comes on rather short notice, but please understand that HPCf staff was only notified of this electrical work yesterday afternoon.

We apologize for any inconvenience you may experience from this maintenance window, and we thank you for your cooperation.  As always, if you have any questions or comments, please send them to help(at)hpcf.upr.edu.

31 Oct

Boqueron Power Outage During the Halloween Weekend

An unexpected outage in the data center where Boqueron is hosted wiped out 60 of Boqueron’s nodes over the weekend. All jobs running on those nodes were killed.  We have since turned the nodes back on and they should now be working as usual.

To make sure the entire cluster has recovered from the outage, the 20 nodes that remained up during the outage (nodes 41~60) will be rebooted at a later time, and so they will remain closed off to new jobs until then (they will be listed as either “draining” or “drained” by Slurm). The jobs that are currently running there will be allowed to run uninterrupted.
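
For users who would like to see which nodes are currently closed off, the following is a minimal sketch that filters the output of Slurm's `sinfo` command for nodes in the “draining” or “drained” state. The function names and the node names in the test data are hypothetical (Boqueron's actual node naming is not stated here), and the sketch assumes `sinfo` is available on the login node.

```python
import subprocess

# States that Slurm reports for nodes closed off to new jobs.
DRAIN_STATES = ("draining", "drained")


def closed_nodes(sinfo_output):
    """Return the sorted names of nodes that are draining or drained.

    Expects one "<node> <state>" pair per line, as produced by
    `sinfo -N -h -o "%N %T"`.
    """
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # Slurm may append a flag character such as "*" to the state.
        if state.rstrip("*~#!%$@^+-").lower() in DRAIN_STATES:
            # A node in several partitions is listed once per partition;
            # the set removes the duplicates.
            nodes.add(name)
    return sorted(nodes)


def query_sinfo():
    """Run sinfo in node-oriented mode, without a header line."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for node in closed_nodes(query_sinfo()):
        print(node)
```

Keeping the parsing separate from the `sinfo` call means the filter can be tried on saved output without access to the cluster.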

The outage did not affect Boqueron’s login node or Boqueron’s /home or /work file systems.  Other HPCf services such as our web site, however, were affected.  They are also back online, and you should not experience any issues when using them.

We are currently in talks with the administrators at the data center to assess the situation and prevent it from occurring again.

As always, if you have any questions, problems, or comments, you may write to us at help(at)hpcf.upr.edu.

15 Aug

/work Filesystem Outage Has Been Fixed

Early today, Boqueron’s /work had a hiccup and it went offline for a few hours.  This caused some weird behavior, including not being able to sign in through certain software such as SCP programs.  We’ve fixed the issue, but because jobs run on /work, the running jobs that remained after /work went offline eventually got killed.  Everything should be fine now and you should be able to resubmit your jobs with no problems.  No data loss should have occurred as a result of this error.

Please remember that per HPCf Usage Policies, data in /work is not backed up.  Always make sure to move important data off of /work and into a more permanent storage location, such as a computer in your laboratory or a personal workstation.
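
When moving data off of /work, it can be worth confirming that the copy is complete before deleting anything. The sketch below is one possible approach, not an official HPCf tool: it builds a SHA-256 manifest of a directory tree so that a manifest computed on the cluster can be compared with one computed on the destination. All function names are hypothetical, and how the files are transferred is left to the tool you already use.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differences(source_manifest, copy_manifest):
    """Return relative paths missing from the copy or differing in content."""
    return sorted(
        rel for rel, digest in source_manifest.items()
        if copy_manifest.get(rel) != digest
    )
```

An empty result from `differences(manifest(source_dir), manifest(copy_dir))` indicates that every file in the source has an identical counterpart in the copy. Files that exist only in the copy are deliberately ignored.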

As always, if you have any other issues or questions, please let us know at help(at)hpcf.upr.edu.

01 Jun

Boqueron Unscheduled Maintenance During Morning of June 1st, 2016

As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today.  This maintenance unfortunately had the effect of killing all running and pending jobs in the process.

What happened?

An issue in the way Slurm (the queue manager) was interacting with the cluster manager software that we use on Boqueron (Bright CM) was causing Slurm’s configuration to get rewritten spontaneously when certain actions were taken in the cluster.  Every time this happened, jobs got killed.  Today, we contacted Bright support and they were kind enough to help us out through a live screen-sharing session.  The changes they had to make required the Slurm configuration to get rewritten much like those other times, and so earlier today jobs got killed as well.

Is the issue resolved?

Yes. The issue had been bugging us for a few weeks, but it should now be completely resolved.

Can I submit jobs again?  Won’t they get killed again?

Yes, you may submit jobs again; and no, they should not get killed again.  No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.

But I’m afraid they’ll get killed again!

You shouldn’t be afraid. We recognize that the recent shakiness of the queues may have caused user confidence to drop, but again, the core issue should now be resolved and we expect stable times for our cluster. *knocks on wood*

We do apologize for the inconvenience this has created.  For any further questions or comments, feel free to contact us at help(at)hpcf.upr.edu.