Boqueron

Boqueron is the main scientific computation cluster at the HPCf.  It provides over 2,240 compute cores and 200 terabytes of high-performance storage served over a 10G and QDR InfiniBand backbone.  Jobs are managed by the Slurm Workload Manager and may use distributed-memory, shared-memory, or hybrid parallelism.

To work in a high-performance environment such as Boqueron, there is some information you should know.  Below are some important topics about working on Boqueron.  Feel free to refer to them whenever you find yourself stuck on something.  If you have a question that is not answered in the topics below, let us know at help@hpcf.upr.edu, and we’ll be glad to help you out.

  • Logging in to Boqueron
  • Submitting Jobs
  • /home and /work
  • Finding Software
  • Compiler-Specific Software

Logging in to Boqueron

Your Boqueron username and password are for logging in to Boqueron through SSH.  How you connect to Boqueron will depend on your operating system.

Note: Do not try to connect to Boqueron by pointing your Web browser to http://boqueron.hpcf.upr.edu.

Linux or Mac OS X

On Linux or Mac OS X, the process to log in to Boqueron is quite straightforward.

  1. Open a Terminal window.
  2. Type in the following command (without the leading $):
    $ ssh <username>@boqueron.hpcf.upr.edu

    substituting your actual username for <username>.  Then hit Enter.

  3. If this is your first time connecting, you will now be asked if you want to trust the connection to Boqueron; answer “yes” and hit Enter again.
  4. You will then be asked for your password.  This completes your login.

Windows

Unlike Linux or Mac OS X, Windows does not provide an SSH-capable terminal by default.  Therefore, you’ll need to download software that provides such functionality.

The most popular tool (and our recommendation) is PuTTY, which you can download for free from the PuTTY website.

After you download and install PuTTY (or whichever software you chose), you can connect to Boqueron with the following information.

  • Hostname: boqueron.hpcf.upr.edu
  • Port: 22
  • Username: Your Boqueron username
  • Password: Your Boqueron password

Then click your client’s connect button (in PuTTY, the button is labeled Open).  If this is your first time connecting, you will be asked whether you want to trust the connection to Boqueron; answer “yes”.

You should now be connected to Boqueron.

Android

Though you’ll find yourself connecting to Boqueron mostly through your workstation or laptop, you might find it useful sometimes to connect to Boqueron through your smartphone or tablet.  Apps such as JuiceSSH allow you to establish SSH connections from your Android device.  Just remember to use boqueron.hpcf.upr.edu as the hostname/address of the server you want to connect to, and make sure you are connecting using port 22.

Submitting Jobs

Boqueron provides a high-performance file system for you to work in, appropriately called /work.  HPCf guidelines require that you always keep any files related to work you are doing in the /work file system.  When you log in to Boqueron, you are located in your home directory, which is meant to hold the final results of jobs you’ve run.  Therefore, you will need to change into /work before submitting any jobs to be run on Boqueron.  To do so, simply run the command

$ cd $WORK

and you’ll be in your work directory.

Before submitting a job, always remember to have all the job-related data in your /work directory.
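For example, you could stage a job script and its input data into your work directory like this (a minimal sketch; the filenames myjob.sh and input.dat are hypothetical):

```shell
# Copy the job script and its input data from /home into /work
# (myjob.sh and input.dat are hypothetical filenames)
cp "$HOME/myjob.sh" "$HOME/input.dat" "$WORK/"

# Change into the work directory before submitting
cd "$WORK"
```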

Creating a Submit Script

Once you have all your data in place, you are ready to create a “submit script”, which is simply a file that specifies the instructions that your job will run.

The following is an example submit script slurm-script.sh:

#!/bin/bash
#SBATCH --mem-per-cpu=<megabytes>
#SBATCH --time=<hh:mm:ss>
#SBATCH --job-name=<job name>
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --workdir=<the directory where job will run>
#SBATCH --error=<error filename>
#SBATCH --output=<output filename>

#This script simply prints out current clock time in 12-hour format

/bin/date +%r

The lines that start with #SBATCH are options passed to the Slurm resource manager (the software that schedules and manages jobs on Boqueron).  There are many other options that may be specified, but these defaults work for most jobs.  You don’t need to concern yourself with the line beginning with #!; just make sure it’s included at the very top of your script.  Any other lines starting with the # character are “comments”, which are simply arbitrary text that is ignored by the system.

The options given to Slurm in the #SBATCH lines take specific values after the equals sign (=).  Replace the parts with angle brackets (<, >) in the example above with the specific values that your job will use.  For example, if your job will run for at most 1 hour, you would write the following #SBATCH option:

#SBATCH --time=01:00:00

Note: There is a 1 week maximum running time limit for all jobs on Boqueron.  You must specify a value less than or equal to 1 week in order for your job to run.  Specifying a time limit greater than 1 week will result in your job staying stuck in its queue forever (until it is manually cancelled), waiting for a node that can accommodate it.

Below the lines starting with #SBATCH (and any optional comments) is where you place the actual commands your job will execute.  In this example, we simply execute the date command to print the current clock time.
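Putting the pieces together, a filled-in version of the template might look like the following (the memory limit, job name, email address, work directory, and file names are all illustrative):

```shell
#!/bin/bash
#SBATCH --mem-per-cpu=1024
#SBATCH --time=01:00:00
#SBATCH --job-name=clock-test
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/hpcf/jcbonilla
#SBATCH --error=clock-test.err
#SBATCH --output=clock-test.out

# Print the current clock time in 12-hour format
/bin/date +%r
```

Note that to the shell the #SBATCH lines are just comments; only Slurm reads them.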

Special Note on Parallel Jobs

For parallel jobs, you will also need to specify the ntasks option and the cpus-per-task option in your submit script.

#SBATCH --ntasks=<number of tasks>

This will tell Slurm to allocate a number of computation cores equal to the number of parallel tasks you specify.  For example, to run an MPI job with 5 cores, you would specify the following.

#SBATCH --ntasks=5

You will also need to specify that you wish to run your MPI job on the mpi partition (Slurm calls queues “partitions”).

#SBATCH --partition=mpi
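Combining these options, a submit script for a hypothetical 5-core MPI job might look like the following sketch (my_mpi_program is an illustrative executable name, and srun is one common way to launch MPI programs under Slurm; the appropriate launcher may vary with the MPI library you use):

```shell
#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --partition=mpi
#SBATCH --time=02:00:00
#SBATCH --job-name=mpi-test
#SBATCH --error=mpi-test.err
#SBATCH --output=mpi-test.out

# Load the compiler and MPI modules the program was built with
module load gcc/4.8.5
module load mvapich2

# Launch the program on the 5 allocated cores
# (my_mpi_program is hypothetical; mpirun/mpiexec may be
# appropriate instead of srun depending on the MPI library)
srun ./my_mpi_program
```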

Submitting Your Job

After you have created a submit script, simply use the sbatch command to submit your job to Slurm.

$ sbatch slurm-script.sh

Your job will start running as soon as Slurm finds a suitable place for it in the cluster.

Monitoring Your Job

To monitor the status of your jobs on Boqueron, simply type the squeue command, passing your username as an argument.

$ squeue -u <your username>

Cancelling Your Job

Sometimes you’ll need to cancel a job.  To do so, you would use the scancel command, passing the job’s ID as an argument.

$ scancel <job id>

Running Interactive Jobs

Sometimes the software that you would like to run needs to be run interactively.  For these kinds of jobs, you need to use the srun command to request Slurm to allocate you an interactive session on one of the worker nodes.

$ srun --pty bash

Note that the --pty bash argument is required.
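srun also accepts the same kinds of resource options as the #SBATCH lines, so you can bound your interactive session explicitly.  For example (the specific limits shown are illustrative):

```shell
# Request an interactive bash session on a worker node:
# 1 task, 1 CPU, and a 1-hour time limit (values are illustrative)
srun --ntasks=1 --cpus-per-task=1 --time=01:00:00 --pty bash
```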

Boqueron’s /home and /work

Boqueron features two very important and very different filesystems that users need to understand well in order to correctly use cluster resources: /home and /work.  Below you will find all the fundamental information you need to know about these filesystems.

/home

/home is a relatively small filesystem where users land when they connect to Boqueron.  It is meant to provide a holding space for your output and other valuable data until they can be moved elsewhere outside of Boqueron.

Structure of /home

/home is subdivided into research group directories, and each group directory contains a subdirectory for each of that group’s members.

For example, user jcbonilla from group hpcf has the following as his $HOME directory:

/home/hpcf/jcbonilla

User olena also from group hpcf has the following $HOME:

/home/hpcf/olena

But user mfurukaw from group hcc_unl has the following:

/home/hcc_unl/mfurukaw

The Intended Purpose of /home

/home is meant to be a place for users to temporarily keep data that they either need for their computations or that are output of their computations.

As mentioned in the HPCf Usage Policies, /home (and any other filesystem on Boqueron) is not meant to be used as a place for long-term storage.  Files that are contained in /home are files that are expected to be moved by their owners to a place outside of Boqueron in the near future.  We do our best to keep users’ data safe, but each user is ultimately responsible for keeping his or her own data safe.  Please make sure to keep backups of your data outside of Boqueron.

Quotas on /home

/home enforces a limit on how much of its space each research group may consume.  These limits are known as quotas.  The quota on /home for each research group is currently set to 100 GB per group.

To confirm the quota consumption on /home for your group, you can use the following command:

$ chkhomequot

How jobs interact with /home

/home is not meant to be used as a filesystem where jobs can be run from.  Because of this, /home is mounted as a read-only filesystem on the cluster’s worker nodes.

What this means in concrete terms is that if you launch a job from /home it will fail because the job will need to create output and error files, but /home cannot be written to from the worker nodes.

In order for your jobs to run, you must launch them from /work.

/work

/work is a relatively large filesystem that is meant to be used as so-called “scratch space”, that is to say, space on Boqueron that is meant for making computations and obtaining results that will then be stored elsewhere.  /work is the filesystem from which you submit your jobs and in which you keep their input data.

Structure of /work

/work is subdivided into research group directories, and each group directory contains a subdirectory for each of that group’s members.

For example, user jcbonilla from group hpcf has the following as his $WORK directory:

/work/hpcf/jcbonilla

User olena also from group hpcf has the following $WORK:

/work/hpcf/olena

But user mfurukaw from group hcc_unl has the following:

/work/hcc_unl/mfurukaw

The Intended Purpose of /work

/work is meant to be a place for users to run their jobs from.  /work provides far superior performance to any other filesystem on Boqueron, and as such will speed up your jobs by keeping I/O operations from slowing them down.

The trade-off to this high performance provided by /work is that it is not built to be a reliable, long-term storage filesystem.

Furthermore, as mentioned in the HPCf Usage Policies, since the data on /work is intended to be just the data you need to run your jobs, data in users’ /work directory is not and will not be backed up.  Users should not keep anything that they are not willing to lose inside their /work directory.  Please make sure to always keep backups of your data.

Example workflow

If all this sounds a little overwhelming, we recommend trying out the following workflow for Boqueron:

  1. Write or move your source code files and/or submit scripts into your $HOME.

  2. Make a copy of those sources into your $WORK.  This step is important.  Should anything happen to data on /work, you would only be risking a copy of your sources.

  3. Copy any input data into your $WORK.  If the input data is small, you can keep the original in /home and just copy it over to /work.  If the input data is quite large, you will likely have to keep the original outside of Boqueron and copy it into your $WORK for running your jobs.

  4. Once your job has successfully run, move your results from /work to another location for safekeeping.  Additionally, you should delete any input data that you will no longer need for other jobs.
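The workflow above can be sketched as a sequence of shell commands (all filenames are hypothetical):

```shell
# 1. Keep the originals in your $HOME
#    (write my-script.sh, slurm-script.sh, and input.dat there first)

# 2 & 3. Copy the sources and input data into your $WORK
cp "$HOME/my-script.sh" "$HOME/slurm-script.sh" "$HOME/input.dat" "$WORK/"

# Submit the job from /work
cd "$WORK"
sbatch slurm-script.sh

# 4. After the job finishes, move the results out of /work
#    for safekeeping and delete input data you no longer need
mv "$WORK/results.out" "$HOME/"
rm "$WORK/input.dat"
```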

Quotas on /work

/work enforces a limit on how much of its space each research group may consume.  These limits are known as quotas.  The quota on /work for each research group is currently set to 6 TB per group.

To confirm the quota consumption on /work for your group, you can use the following command:

$ chklustrequot

Finding Software

Since a cluster is such a complex system, a mechanism is needed to organize and keep track of the software that is installed in it.  Oftentimes a cluster will even need to have multiple versions of the same piece of software available.  If not handled correctly, this situation could turn into a real hassle for both users and system administrators.  To avoid this, Boqueron uses module files to keep software organized and easy to manage.

Tip: If you just came here looking for info on MPI modules, jump to the section Loading MPI Modules below.

What are module files?

Briefly, module files are simply files that define where a specific piece of software’s installation is located on Boqueron.  A software installation and its module file together make up a module.  When users need a particular version of a piece of software for their work, they simply load the module for that software and version, and the software becomes available.

Basic module commands

Following are some of the basic commands you will need to find and use software on Boqueron.

View all available modules

To view a list of all modules available on Boqueron, simply issue the following command.

$ module avail

Sometimes, the list of modules available will change depending on other modules you currently have loaded.  For more details on this, see the section Loading MPI Modules below.

View available modules of a specific software

Sometimes you might want to view the available modules for just a single piece of software (for example, the same software may have various versions installed on Boqueron).  In that case, you can issue the module avail command along with the name of the software to see its modules.  For example, to see all modules available for gcc, you would type the following.

$ module avail gcc

Load a module

Once you know which module you wish to use, you can load it through the module load command.  For example, to load the python2/2.7.11 module, you would type the following.

$ module load python2/2.7.11

You can leave off the module version and just enter the software name.

$ module load python2

However, we recommend only doing this for software that has a single version, so as to avoid confusion as to which version of the software gets loaded.

View currently loaded modules

To view the modules you currently have loaded, you would use the module list command.

$ module list

Unload a module

Unloading is just as simple as loading.  Here’s an example for unloading the python2/2.7.11 module we loaded earlier.

$ module unload python2/2.7.11

Or simply,

$ module unload python2

Get help for a specific module

To get information about a module, you can use the module help command.

$ module help egglib/2.1.11

This is especially useful when loading modules for libraries, since you will need to know the location of a library’s lib and include directories to compile against it.
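For instance, once module help has told you where a library’s lib and include directories live, a compilation might look like the following sketch (the paths, the program name, and the linker flag are purely hypothetical; use whatever module help actually reports):

```shell
# Load the library's module first (egglib/2.1.11 is from the example above)
module load egglib/2.1.11

# Compile, pointing the compiler at the include and lib directories
# reported by "module help" (these paths and the -l flag are hypothetical)
gcc -I/opt/egglib/2.1.11/include -L/opt/egglib/2.1.11/lib \
    myprog.c -o myprog -legglib
```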

What happens after I load a module?

Loading a module makes that module’s software available to you.  This is better illustrated by example.

First, let’s see what happens when we connect to Boqueron and try running Python 3 before loading its module.

$ ssh <username>@boqueron.hpcf.upr.edu
$ python3
-bash: python3: command not found

Now let’s try running Python 3 after we load its module.

$ module load python3
$ python3
Python 3.5.1 (default, Dec 10 2015, 15:18:52) 
[GCC 5.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Here we see that loading Python 3’s module made the command python3 available to our environment.

We can also see modules at work in changing and managing different versions of the same software.  For example, let’s check what version of gcc we have when we log in to Boqueron.

$ ssh <username>@boqueron.hpcf.upr.edu
$ gcc -v
...
gcc version 5.1.0 (GCC) 

Now let’s see our result after we load the gcc/4.9.3 module.

$ module load gcc/4.9.3 
$ gcc -v
...
gcc version 4.9.3 (GCC)

The most important point here is that after loading the module, we still just used the same gcc command as before loading it.

This is one of the conveniences of using modules: to use different versions of the same software you don’t have to change the actual commands you use, you just have to change the context in which the same commands are made.  This makes your code and job scripts easier to manage.

Loading MPI Modules

On Boqueron, different versions of MPI are available for different compilers.  Therefore, to load MPI, you must load its related compiler first.  An MPI module will not be visible until you load a compiler that has an MPI version installed.  This situation is demonstrated in the following example using gcc ver. 4.8.5 and mvapich2 ver. 2.2b.

$ module avail mvapich2
$ module load gcc/4.8.5 
$ module avail mvapich2

-------------- /cm/shared/modulefiles_gcc/4.8.5 --------------
mvapich2/2.2b

The first time module avail mvapich2 was run, nothing was returned, indicating that there are no modules available for mvapich2.  However, when the same command was run after loading the gcc/4.8.5 module, an mvapich2/2.2b module appeared as available.

In short, you can only load an MPI module if you have loaded its associated compiler first.

$ module load gcc/4.8.5 
$ module load mvapich2

Here’s another loading example but with gcc 4.7.4 and openmpi 1.8.8.

$ module load gcc/4.7.4 
$ module load openmpi/1.8.8

To find out if a specific compiler has an MPI version installed with it, simply load the compiler’s module and then run module avail to see if MPI modules have become available.  The modules unique to that compiler will be presented in a separate section.

Compiler-Specific Software

Certain software on Boqueron is tied to specific compilers.  To be able to load that software’s module, you have to load its compiler’s module first (more info can be found in the previous section, Finding Software).  To help you navigate these special modules, we’ve put together the following list of the software that is tied to each compiler on Boqueron.

GCC 4.7.4

mvapich2/2.2b  
openmpi/1.10.2 
openmpi/1.8.8
openmpi/2.0.2

GCC 4.8.5

mvapich2/2.2b  
openmpi/1.10.2 
openmpi/1.8.8
openmpi/2.0.2

GCC 4.9.3

boost/1.60.0 
jdftx/1.2.1 
maker/2.31.8 
memesuite_parallel/4.11.1 
netcdf/c/4.4.1 
openmpi/1.10.2 
openmpi/2.0.2 
qe/5.4.0 
raxml/8.2.6
gpaw/0.11.0 
ldhelmet/1.7 
mash/0a9a3f3 
mvapich2/2.2b 
netcdf/fortran/4.4.4 
openmpi/1.8.8 
qe/5.3.0 
qe-gipaw/5.4 
vinalc/1.1.2

GCC 6.3.0

fac/1.1.1 
fac/1.1.4 
openmpi/2.0.2 
qe/6.1

PGI 14.9

mpich/3.2
openmpi/1.10.2 
openmpi/1.8.8

Intel 2013

lammps/30Jul16 
mvapich2/2.2b 
netcdf/c/4.4.1 
netcdf/fortran/4.4.4 
openmpi/1.10.2 
openmpi/1.8.8 
vasp/5.4.1 
vasp-vtst/5.4.1