Jose Bonilla

01 Jun

Boqueron Unscheduled Maintenance During Morning of June 1st, 2016

As you may have noticed, Boqueron underwent unscheduled, urgent maintenance earlier today. This maintenance unfortunately had the effect of killing all running and pending jobs in the process.

What happened?

An issue in the way Slurm (the queue manager) was interacting with the cluster manager software that we use on Boqueron (Bright CM) was causing Slurm’s configuration to get rewritten spontaneously when certain actions were taken in the cluster. Every time this happened, jobs got killed. Today, we contacted Bright support and they were kind enough to help us out through a live screen-sharing session. The changes they had to make required the Slurm configuration to get rewritten much like those other times, and so earlier today jobs got killed as well.

Is the issue resolved?

Yes. The issue was bugging us for a few weeks now, but it should now be completely resolved.

Can I submit jobs again? Won’t they get killed again?

Yes, you may submit jobs again; and no, they should not get killed again. No system is perfect, but the solution we arrived at today with Bright support should result in continuous, stable queue operation under normal, day-to-day circumstances.

But I’m afraid they’ll get killed again!

You shouldn’t be afraid. We recognize that the recent shakiness of the queues would cause user confidence to drop, but again, the core issue should now be resolved and we expect stable times for our cluster. *knocks on wood*

We do apologize for the inconvenience this has created. For any further questions or comments, feel free to contact us at help(at)hpcf.upr.edu.

18 May

Notice: Boqueron Maintenance Completed – May 18, 2016

We have completed the maintenance work on Boqueron that had been scheduled for today. Core sharing should now be disabled. Users may now resubmit jobs to Boqueron. Everything should be working fine, but we encourage users to submit at least one test job first to ensure everything is working correctly. Please report any issues or questions to help(at)hpcf.upr.edu.

10 May

Notice of Boqueron Scheduled Maintenance – Wed May 18, 2016

Boqueron will be undergoing scheduled maintenance on Wednesday May 18th, 2016. Effective immediately, all jobs which cannot complete their run by 12:00 am May 18th will not be allocated until maintenance is over. We have reserved a maintenance window starting at 12:00 am and ending at 1:00 pm the same day, though we anticipate the maintenance operations will take a much shorter time than that. We will email another notice indicating when maintenance is done so that you may submit jobs again.

Reason for Maintenance Window

As some of you may have noticed, Boqueron is currently set up in a way that allows a single compute core to run multiple jobs at once. That is to say, a user’s job currently does not actually reserve compute cores, the cores are shared among various jobs. This is not the intended behavior of Boqueron. Not only does this hurt jobs’ performance, but it places a heavy load on the compute nodes.

To fix this, Slurm (the resource manager) must be switched off and reconfigured. Switching off Slurm would kill any jobs that are running at the moment, so we need to create a maintenance window to ensure that no jobs are killed in the process.

What effect will this change have on future jobs?

After the change, we anticipate that all user jobs will run much faster than they do right now. There’s a small trade-off, though: since cores will now be reserved, we’ll start seeing jobs actually waiting in line to be allocated. So far, because of the core sharing, most jobs are usually run almost as soon as they are submitted (but they run under a considerably degraded performance). We anticipate this will change, and that jobs will actually have to wait before being allocated (but they will run much faster once they are allocated).

As always, if you have any questions, feel free to contact us at help(at)hpcf.upr.edu.

16 Mar

Boqueron – All Users Have Been Added, plus VASP and Gaussian 09

The build of our new cluster Boqueron has progressed well in the past few weeks, and we are pleased to announce that all currently-registered users have finally been added to Boqueron. If you believe that you were registered to receive an account but have not received one yet, please write to help(at)hpcf.upr.edu so that we can help you out.

Additionally, Gaussian 09 and VASP 5 are now available on our cluster for those users who are authorized by the programs’ respective licensors to use them. If you need any help getting them set up, please let us know. VASP in particular is still being tested on Boqueron since some users have run into some errors that apparently are known issues with VASP.

So if you will use VASP (which you can only do if we have received prior written authorization from VASP that you may do so), please keep in mind we’re still doing a test drive, and please report any errors to us so that we can get VASP running well as soon as possible.

08 Mar

New Cluster Build – Beta Launch is Live

We are pleased to announce that Boqueron’s beta has launched. We are adding users progressively instead of all at once in order to ease the transition from Nanobio. Users will know that they have been added when they receive an email message with details on logging in to Boqueron. If you haven’t received your email message yet, you will soon. Users are encouraged to explore the cluster and compare how it works vs. how Nanobio works.

Software is still being added to the cluster, but even if your software is not yet available, you can start working on becoming familiar with Boqueron and transferring the files you will need for your work.

Also, please remember to request the software that you will need if it has not yet been added to Boqueron. You can do so by sending us a message at help(at)hpcf.upr.edu with the subject line

[Boqueron Software Request]

We thank you for your patience throughout this long process and hope that you will enjoy using Boqueron for your work.

26 Feb

New Cluster Build – Finishing Touches and Beta Launch

Boqueron

We are pleased to announce that our new cluster build has been making great strides lately and that we expect our new cluster to be up and running by sometime next week (finally!). We can now reveal that the name of our new cluster will be Boqueron (pronounced boh-keh-ron) and that it will initially feature over 2200 computation cores, and 200 TB of /work space served over a 10Gbps and QDR Infiniband backbone.

Registered users should soon be receiving email messages with details on logging in to Boqueron. When the cluster launches, however, you might notice that not all the software you used on Nanobio is readily available on Boqueron. This is by design. Instead of importing all the software on Nanobio that users may or may not actually use, we want you to request the software that you will actually use so that we can have a sort of clean slate with Boqueron. In that sense, Boqueron’s launch will be like a sort of beta launch.

What this means for you

We need our users to request the software that they will actually use on Boqueron. You can either wait until the launch of the new cluster to do so, or you can do so right now. Just drop us a line at help(at)hpcf.upr.edu with the subject line

[Boqueron Software Request]

and we will queue your requested software for installation. We can’t guarantee that all the requested software will be available at launch time, but the sooner you request it, the sooner it’ll be installed.

A note on license-based software

All software has a license of some sort. However, some licenses are more restrictive than others. These “more-restrictive-license” software is what I refer to as “license-based software”. License-based software might take a bit longer to be available on Boqueron precisely because of their more restrictive and sometimes for-profit nature. Some vendors will readily transfer our Nanobio licenses to be used on Boqueron, but others have much stricter processes and policies that will make their transfer over to Boqueron slower. We ask for your patience with the availability of these software.

You might be wondering: “Well, why did you wait until the cluster was ready to launch to deal with the licenses?” The answer is that we have to wait until out system is actually up and running before we can ask vendors to authorize it for running their software. We can’t ask vendors to authorize a machine that doesn’t exist yet.

What software is currently available on Boqueron?

Just like in Nanobio, once you log in to Boqueron, you can run

$ module avail

to see what software is currently installed on Boqueron. I will offer a preview, however, of software that is already installed so that you don’t have to request it.

gcc/5.1.0
openmpi/1.8.8
samtools 1.2
bcftools 1.2
bowtie2 2.2.6
boost (headers only) 1.59.0
python2 2.7.11
python3 3.5.1
pip2 7.1.2
jdk 1.8.0_65
R 3.2.3
trinityrnaseq 2.1.1
dsk 2.0.7
kanalyze 0.9.7
jellyfish 2.2.4
tophat 2.1.0
gatk/queue 3.5
trinotate 2.0.2
ncbi-blast+ 2.2.31
hmmer 3.1b2
signalp 4.1
tmhmm 2.0c
rnammer 1.2
Blast+ dbs:
 - SwissProt
 - Uniref90
HMMER dbs:
 - Pfam-A

Do you have documentation on Boqueron available?

Boqueron’s documentation is still a work in progress, but you are free to browse it if you wish to start becoming familiar with the new cluster. It’s available here.

As always, we’d like to remind you that any dates we give for the new cluster are always subject to change. That’s just the nature of working with computer resources. Unforeseen circumstances always come up. We thank you for your patience throughout this long process. If you have any further questions or comments, feel free to contact us at help(at)hpcf.upr.edu.

12 Feb

New Cluster Build – More Nodes Will Be Moved From Nanobio to New Cluster

The build of our new cluster has been progressing well and we are now ready to migrate more nodes from Nanobio to the new cluster. This means that Nanobio will lose even more computational power, which also means that jobs will remain waiting in queue for an even longer period of time.

Effective immediately, all access to the ib-compute-0-n nodes has been closed. Jobs already running on these nodes will be allowed to run in their entirety (or until they are killed by the system if they reach the 1 week max running time).

Additionally, any jobs that were submitted in 2015 that are currently running (these are jobs that were submitted before we put in place the maximum running time restriction) will be allowed to run until no later than Monday February 15, 2016 at 12:00pm. Any jobs that are still running at that time will be killed.

We apologize for the inconvenience and as always, we thank you for your patience during this time. We anticipate the new cluster will be up sooner rather than later.

P.S. Remember that if your Nanobio account was created before November 2014, you need to re-register your account by filling out the form on our site. More details in our FAQ.

27 Jan

New Cluster Build – Notice of Delay – 1/27/16

The build of our new cluster has been unexpectedly delayed due to unforeseen circumstances. The new estimated date for the new cluster to go online is tentatively by sometime in February, provided the current situation is resolved quickly. We apologize for the inconvenience this causes and thank you for your continued patience and support. Following is an explanation of the current delay.

What has caused the delay

Our consultant discovered an unforeseen incompatibility between two critical software components of the cluster. This has caused us to have to install a new version of some of these components (including the kernel).

08 Dec

New Cluster Build – Notice of Delay – 12/8/2015

The build of our new cluster has been unexpectedly delayed due to unforeseen circumstances. The new estimated date for the new cluster to go online is tentatively by December 18th, provided the current situation is resolved quickly. We apologize for the inconvenience this causes and thank you for your continued patience and support. In the meantime, we reinstalled twenty nodes back into Nanobio to help alleviate the traffic jam in the queues. Following is an explanation of the current delay.

What has caused the delay

One of the new machines we purchased for the new cluster came with a defective crucial part. It took us a while to pinpoint that that was the case, due in part to the variety of tests that have to be run to determine that the problem is indeed a hardware issue. We then waited a few weeks for the new part to come in, but it seems a mistake was made in shipment and the wrong part was sent, so we are now making arrangements to have another replacement sent to us. We will keep you updated on the process.

12 Nov

New Cluster Installation FAQ

With all the news about a new cluster being built at the HPCf, you no doubt have many questions regarding Nanobio and the new cluster. In order to help you with these, I decided to put up a FAQ on our site. We hope you find it useful. As always, any further questions can be directed at help@hpcf.upr.edu.

So, what's going on?
Why is Nanobio so slow?
Will Nanobio get faster again?
So, you're killing Nanobio?
Will my Nanobio data be transferred to the new cluster?
Who will have access to the new cluster? Do I have to register for a new account?
Will I keep the same username and password?
When will the new cluster be up?
Will the new cluster work the same way as Nanobio?
What's taking you so long?
How powerful is the new cluster?
Is there anything I can do to help?
What if I have more questions?

Nanobio is an old cluster. After many months of planning what was the best way to proceed with an upgrade to Nanobio, we at the HPCf finally decided that it would be best to build a new separate cluster rather than updating the current cluster. At the time of writing we are currently involved in the installation of this new cluster.

A lot of the compute nodes of Nanobio were capable of being integrated into the new cluster. Instead of leaving these compute nodes behind on Nanobio, we removed them from Nanobio and installed them into the new cluster. This has an obvious impact on Nanobio’s performance, which was already over-capacity to begin with.

No. The hardware that has been migrated to the new cluster will stay there for good. In fact, some more compute nodes from Nanobio will be migrated to the new cluster.

Not quite. The new cluster will be Nanobio’s successor, but we currently have no plans to shut down Nanobio for good (subject to change). We’ve yet to decide the fate of Nanobio, but one possibility is to keep it running as a separate, less powerful cluster for jobs that may not necessarily require that many resources.

No. The new cluster is completely separate from Nanobio. Any data migration will have to be carried out by the user. The user is ultimately responsible for ensuring that their HPCf data is properly backed up.

As Nanobio’s successor, the new cluster will be available for all current and future HPCf users. If you already have a Nanobio account that was created after our current web site went up (after November 2014) then you will automatically be added as a user to the new cluster. If you registered your current Nanobio account before the new site went up, then you will have to register again through our site here. If you are unsure whether you have to reregister or not, you can ask us to verify your account at help@hpcf.upr.edu.

Your username will be the same as your current Nanobio username. You will likely be required to enter a new password, though. We will let you know more details later.

The current tentative timeline is for the new cluster to be up sometime in February.

The way the new cluster is built is very similar to Nanobio, but some changes will be evident to users. The most noticeable one will be that the new cluster will use a new software for managing the job queues. Nanobio uses SGE and the new cluster will use Slurm. There will be a bit a of a learning curve for users, but we will have documentation and tutorials up on our site for your reference, and as always we’ll be glad to answer your questions at help@hpcf.upr.edu.

Another big change will be the addition of quotas. A cluster’s resources are limited, and in order ensure that these resources are shared fairly (or as close to fair as possible) we will be enforcing quotas throughout the new cluster. There will be quotas for both /home consumption and /work consumption, as well as for the running time of jobs. The quotas on /home and /work will be applied not to individual users, but to entire research groups.

Otherwise, the cluster interface should feel mostly the same to users: its OS is based on CentOS Linux, it will have modules for loading specific software, a high-performance /work filesystem for running jobs on and a slower /home filesystem for temporarily storing files.

Computer clusters are complex systems that incorporate many different software and hardware components and try to make them play nicely together. These components include: a Linux OS, Infiniband networking, Ethernet networking, a Lustre parallel file system, a cluster manager software, the actual machines that run all these components, and many many others. There’s also the added work of physically assembling the cluster in our data center. Sometimes getting all the components to play nicely takes a while, and so we thank you for your patience as we continue to work hard to get the new cluster ready for our users.

We’ll have final numbers soon. But generally it is much bigger, faster, and more efficient than Nanobio in virtually every meaningful way.

Yes. You can tell your research colleagues about the new cluster and ask them to verify if they need to create a new account or not. Also, you could start identifying any data that you know for sure you will need to transfer to the new cluster to continue your work. Keep in mind that the new cluster is not a storage server. The data that you transfer should be data that you need for actually running jobs on the cluster. High performance computer clusters in general are not designed to be reliable, long-term storage solutions for users, and as such users will be ultimately responsible for making sure their data is properly backed up.

You can also share this FAQ with other Nanobio users so they’ll be up to date with what’s going on.

As always, we’ll be glad to respond to them at help(at)hpcf.upr.edu. Any other questions you ask may be incorporated into this FAQ.