WELCOME TO THE HIGH PERFORMANCE COMPUTING FACILITY
DEVELOPING TECHNOLOGY AND INFRASTRUCTURE FOR THE RESEARCH AND EDUCATION COMMUNITY
The High Performance Computing facility of the University of Puerto Rico is presently developing a technology and service infrastructure for the research and education community of the University. This infrastructure is built from the following components:
- Advanced Research Network
- Core High Performance Computational Resources
- Services in Support of Users of Computational Resources
- Standards and Architecture
- Evaluation of Emerging Technologies
Use of HPCf resources requires that you acknowledge our sponsoring institutions: the University of Puerto Rico; the Puerto Rico INBRE grant P20 GM103475 from the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); and awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) of the National Science Foundation.
We have completed the maintenance work on Boqueron that was scheduled for today. Core sharing should now be disabled, and users may resubmit jobs to Boqueron. We expect everything to work normally, but we encourage users to submit at least one test job first to confirm. Please report any issues or questions to email@example.com.
Boqueron will undergo scheduled maintenance on Wednesday, May 18th, 2016. Effective immediately, any job that cannot complete its run by 12:00 am on May 18th will not be allocated until maintenance is over. We have reserved a maintenance window starting at 12:00 am and ending at 1:00 pm the same day, though we anticipate that the maintenance operations will take much less time than that. We will email another notice when maintenance is done so that you may submit jobs again.
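To estimate whether a job can still be allocated before the window, the rule above can be sketched as a small check. This is only an illustration: it assumes the scheduler compares the job's requested time limit against the start of the window, and the function name and example times are hypothetical.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, as stated in the notice (12:00 am, May 18th, 2016).
MAINTENANCE_START = datetime(2016, 5, 18, 0, 0)


def can_start_before_maintenance(now, time_limit, window_start=MAINTENANCE_START):
    """Return True if a job starting at `now` with the requested `time_limit`
    would end at or before `window_start`.

    Assumption: the decision is based on the requested time limit,
    not on how long the job actually ends up running.
    """
    return now + time_limit <= window_start


# Hypothetical example: jobs that could start at 6:00 pm on May 17th.
start = datetime(2016, 5, 17, 18, 0)
print(can_start_before_maintenance(start, timedelta(hours=4)))   # True
print(can_start_before_maintenance(start, timedelta(hours=12)))  # False
```

Under this assumption, requesting a time limit close to what the job really needs makes it more likely that the job fits before the window.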
Reason for Maintenance Window
As some of you may have noticed, Boqueron is currently set up in a way that allows a single compute core to run multiple jobs at once. That is to say, a user's job does not actually reserve compute cores; instead, the cores are shared among various jobs. This is not the intended behavior of Boqueron. Not only does it hurt job performance, it also places a heavy load on the compute nodes.
To fix this, Slurm (the resource manager) must be switched off and reconfigured. Switching off Slurm would kill any jobs that are running at the moment, so we need to create a maintenance window to ensure that no jobs are killed in the process.
What effect will this change have on future jobs?
After the change, we anticipate that all user jobs will run much faster than they do right now. There is a small trade-off, though: since cores will now be reserved, jobs will start waiting in line to be allocated. So far, because of core sharing, most jobs have started running almost as soon as they were submitted, but with considerably degraded performance. We anticipate that this will change: jobs will have to wait before being allocated, but they will run much faster once they are.
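The trade-off can be illustrated with a toy model. This is an idealized sketch, not a description of how Slurm schedules jobs on Boqueron: it assumes identical single-core jobs that are all submitted at the same moment, and it ignores the extra overhead that core sharing places on the nodes. The function names and numbers are hypothetical.

```python
def shared_finish_times(n_jobs, hours_per_job, cores):
    """Core sharing: every job starts immediately, but the jobs split the
    available cores evenly, so each one is slowed down by n_jobs / cores."""
    slowdown = max(1.0, n_jobs / cores)
    return [hours_per_job * slowdown] * n_jobs


def reserved_finish_times(n_jobs, hours_per_job, cores):
    """Reserved cores: jobs run at full speed in waves of `cores` jobs,
    so later jobs wait in the queue before they start."""
    return [hours_per_job * (i // cores + 1) for i in range(n_jobs)]


# Hypothetical workload: 8 one-hour jobs on 4 cores.
shared = shared_finish_times(8, 1.0, 4)
reserved = reserved_finish_times(8, 1.0, 4)

print(sum(shared) / len(shared))      # 2.0 hours on average
print(sum(reserved) / len(reserved))  # 1.5 hours on average
```

Even in this idealized model, where sharing costs nothing extra, the jobs that wait in line finish no later than they would have under sharing, and the jobs at the front of the queue finish sooner.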
As always, if you have any questions, feel free to contact us at firstname.lastname@example.org.