Notice of Boqueron Scheduled Maintenance – Wed May 18, 2016
Boqueron will undergo scheduled maintenance on Wednesday, May 18, 2016. Effective immediately, any job that cannot complete its run by 12:00 am on May 18 will not be allocated until maintenance is over. We have reserved a maintenance window starting at 12:00 am and ending at 1:00 pm the same day, though we anticipate that the maintenance itself will take much less time than that. We will email another notice when maintenance is done so that you may submit jobs again.
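If you want to check whether a job can still be allocated before the window, the arithmetic can be sketched as below. This is only an illustration, not the scheduler's actual logic: it assumes the job would start as soon as it is submitted, that the scheduler compares the job's requested time limit against the start of the window, and that times are in the cluster's local time. The function names are hypothetical. In Slurm, the time limit is typically requested with the `--time` option of `sbatch`.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from this notice
# (assumed to be in the cluster's local time).
MAINTENANCE_START = datetime(2016, 5, 18, 0, 0)


def fits_before_maintenance(start_time, time_limit, window_start=MAINTENANCE_START):
    """Return True if a job starting at start_time with the given time
    limit would end no later than the start of the maintenance window."""
    return start_time + time_limit <= window_start


def max_time_limit(start_time, window_start=MAINTENANCE_START):
    """Return the largest time limit that still fits before the window,
    or a zero duration if the window has already started."""
    return max(window_start - start_time, timedelta(0))


if __name__ == "__main__":
    start = datetime(2016, 5, 17, 20, 0)  # hypothetical start: 8:00 pm on May 17
    print(fits_before_maintenance(start, timedelta(hours=3)))
    print(fits_before_maintenance(start, timedelta(hours=5)))
    print(max_time_limit(start))
```

Under these assumptions, a job started at 8:00 pm on May 17 fits if it requests four hours or less. Requesting a realistic, shorter time limit may therefore let a job run before the window instead of waiting until maintenance is over.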
Reason for Maintenance Window
As some of you may have noticed, Boqueron is currently set up in a way that allows a single compute core to run multiple jobs at once. In other words, a user’s job does not actually reserve compute cores; instead, the cores are shared among several jobs. This is not the intended behavior of Boqueron. Not only does it hurt job performance, but it also places a heavy load on the compute nodes.
To fix this, Slurm (the resource manager) must be switched off and reconfigured. Switching off Slurm would kill any jobs running at that moment, so we need a maintenance window to ensure that no jobs are killed in the process.
What effect will this change have on future jobs?
After the change, we anticipate that all user jobs will run much faster than they do right now. There’s a small trade-off, though: since cores will now be reserved, jobs will start waiting in line to be allocated. Until now, because of the core sharing, most jobs have started running almost as soon as they were submitted, but with considerably degraded performance. We anticipate that jobs will now have to wait before being allocated, but that they will run much faster once they are.
As always, if you have any questions, feel free to contact us at help(at)hpcf.upr.edu.