Boqueron is the main scientific computation cluster at the HPCf. It provides over 2240 compute cores and 200 terabytes of high-performance storage, served over a QDR InfiniBand and 10G Ethernet backbone. Jobs are managed by the Slurm Workload Manager and may run using distributed-memory, shared-memory, or hybrid parallelism.
To work in a high-performance environment such as Boqueron, there is some information that you should know. The topics below cover the essentials of working on Boqueron. Feel free to refer to them whenever you find yourself stuck on something. If you have a question that is not answered in the topics below, let us know at firstname.lastname@example.org, and we’ll be glad to help you out.
- Logging in to Boqueron
- Submitting Jobs
- /home and /work
- Finding Software
- Compiler-Specific Software
Logging in to Boqueron
Your Boqueron username and password are for logging in to Boqueron through SSH. How you connect to Boqueron will depend on your operating system.
Note: Do not try to connect to Boqueron by pointing your Web browser to http://boqueron.hpcf.upr.edu.
Linux or Mac OS X
On Linux or Mac OS X, the process to log in to Boqueron is quite straightforward.
- Open a Terminal window.
- Type in the following command (without the leading $):
$ ssh <username>@boqueron.hpcf.upr.edu
substituting your actual username for <username>. Then hit Enter.
- If this is your first time connecting, you will now be asked if you want to trust the connection to Boqueron; answer “yes” and hit Enter again.
- You will then be asked for your password. This completes your login.
Windows
Unlike Linux or Mac OS X, Windows does not provide an SSH-capable terminal by default. Therefore, you’ll need to download software that provides such functionality.
The most popular tool (and our recommendation) is PuTTY, which you can download for free.
After you download and install PuTTY (or whichever software you chose), you can connect to Boqueron with the following information.
- Hostname: boqueron.hpcf.upr.edu
- Port: 22
- Username: Your Boqueron username
- Password: Your Boqueron password
Click on the Connect button. If this is your first time connecting, you will now be asked if you want to trust the connection to Boqueron; answer “yes” and hit Enter again.
You should now be connected to Boqueron.
Though you’ll find yourself connecting to Boqueron mostly through your workstation or laptop, you might find it useful sometimes to connect to Boqueron through your smartphone or tablet. Apps such as JuiceSSH allow you to establish SSH connections from your Android device. Just remember to use boqueron.hpcf.upr.edu as the hostname/address of the server you want to connect to, and make sure you are connecting using port 22.
Submitting Jobs
Boqueron provides a high-performance file system for you to work in, appropriately called /work. HPCf guidelines require that you always keep any files related to work you are doing in the /work file system. When you log in to Boqueron, you are placed in your home directory, which is meant to hold the final results of jobs you’ve run. Therefore, you will need to change into your /work directory before submitting any jobs to be run on Boqueron. To do so, simply run the command
$ cd $WORK
and you’ll be in your work directory.
Before submitting a job, always remember to have all the job-related data in your /work directory.
Creating a Submit Script
Once you have all your data in place, you are ready to create a “submit script” which is simply a file that specifies the instructions that your job will run.
The following is an example submit script
#!/bin/bash
#SBATCH --time=<hh:mm:ss>
#SBATCH --job-name=<job name>
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --workdir=<the directory where job will run>
#SBATCH --error=<error filename>
#SBATCH --output=<output filename>
#This script simply prints out current clock time in 12-hour format
/bin/date +%r
The lines that start with #SBATCH are options passed to the Slurm resource manager (the software that schedules and manages jobs on Boqueron). There are many other options that may be specified, but these defaults work for most jobs. You don’t need to concern yourself with the line beginning with #!, just make sure it’s included at the very top of your script. Any other lines starting with the # character are “comments”, which are simply arbitrary text that is ignored by the system.
The options given to Slurm in the #SBATCH lines take specific values after the equals sign (=). Replace the parts with angle brackets (<, >) in the example above by the specific values that your job will use. For example, if your job will run for at most 1 hour, you would write the following:
#SBATCH --time=01:00:00
Note: There is a 1 week maximum running time limit for all jobs on Boqueron. You must specify a value less than or equal to 1 week in order for your job to run. Specifying a time limit greater than 1 week will result in your job staying stuck in its queue forever (until it is manually cancelled), waiting for a node that can accommodate it.
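As a minimal sketch of the limit described in the note above, the following shell helpers convert a --time value written as hh:mm:ss into seconds and compare it with one week (604800 seconds). The function names hms_to_seconds and within_week_limit are hypothetical and are not part of Slurm or Boqueron, and the sketch assumes the plain hh:mm:ss form used in this guide.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Slurm): check a --time value before submitting.

# Convert hh:mm:ss to seconds. Assumes the plain hh:mm:ss form used in this guide.
# awk reads each field as a decimal number, so "08" is handled correctly.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) only if the value is at most 1 week (604800 seconds).
within_week_limit() {
    [ "$(hms_to_seconds "$1")" -le 604800 ]
}

if within_week_limit "01:00:00"; then
    echo "01:00:00 is within the 1 week limit"
fi
```

A check like this could be run by hand before calling sbatch, so that a job with an oversized time limit is caught before it sits in the queue.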
Below the lines starting with #SBATCH (and any optional comments) is where you place the actual commands your job will execute. In this example, we simply execute the date command to print the current time.
Special Note on Parallel Jobs
For parallel jobs, you will also need to specify the --ntasks parameter and, optionally, the --cpus-per-task parameter in your submit script.
#SBATCH --ntasks=<number of tasks>
#SBATCH --cpus-per-task=<number of cpus per task>
The --ntasks parameter tells Slurm how many parallel processes (“tasks”) to launch for your job, while the --cpus-per-task parameter tells Slurm how many cores to allocate for each of those processes. If the --cpus-per-task parameter is not specified in your job script, Slurm will automatically set it to 1.
For example, to run an MPI job with 5 processes (and 1 core per process) you would specify the following.
#SBATCH --ntasks=5
You will also need to specify that you wish to run your MPI job on the mpi partition (Slurm calls queues “partitions”).
#SBATCH --partition=mpi
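Putting the parallel options together, a submit script for the 5-process MPI job might look like the sketch below, written as a small shell snippet that creates the script file. The job name, the 1 hour time limit, and the use of a temporary file are placeholders chosen for illustration, and the final line of the generated script only stands in for whatever command launches your MPI program.

```shell
# Write an example MPI submit script. A temporary file stands in for the
# script file you would normally create in your /work directory.
script=$(mktemp)

cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=mpi-example
#SBATCH --partition=mpi
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1

# Placeholder: replace with the command that launches your MPI program.
echo "MPI launch command goes here"
EOF

# Show the Slurm options that were written.
grep '^#SBATCH' "$script"
```

The --cpus-per-task line is spelled out only for clarity, since 1 is the value Slurm would use anyway.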
The --mem-per-cpu parameter is used to specify the amount of memory (RAM) in MB to allocate per requested compute core. This option also impacts non-parallel jobs; since non-parallel jobs run on a single core, the --mem-per-cpu parameter defines the amount of memory for the entire job. If this parameter is not specified, Boqueron will automatically set it to 500.
Once a job starts running, its allocated memory will be available exclusively to that job, so please be mindful when setting this value so as to not reserve unnecessarily large amounts of memory that could instead be used by other users to run their jobs.
If the total amount of requested memory (--mem-per-cpu multiplied by --cpus-per-task multiplied by --ntasks) is larger than what is physically possible for the cluster to accommodate, Slurm will reject your job with an error indicating that no such allocation is possible.
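The multiplication described above can be sketched as a small shell function. The name total_mem_mb is hypothetical. The defaults of 500 MB per core and 1 core per task are the ones stated in this guide, while the default of 1 task is an assumption made for the sketch.

```shell
# Hypothetical helper: total memory (in MB) implied by the three parameters.
# Arguments, in order: mem-per-cpu (MB), cpus-per-task, ntasks.
# Defaults: 500 and 1 come from this guide; ntasks=1 is an assumption.
total_mem_mb() {
    mem_per_cpu=${1:-500}
    cpus_per_task=${2:-1}
    ntasks=${3:-1}
    echo $(( $mem_per_cpu * $cpus_per_task * $ntasks ))
}

# Example: 2000 MB per core, 2 cores per task, 5 tasks.
total_mem_mb 2000 2 5    # prints 20000
```

Running the numbers like this before submitting makes it easier to notice a request that reserves far more memory than the job needs.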
Submitting Your Job
After you have created a submit script, simply use the sbatch command to submit your job to Slurm.
$ sbatch slurm-script.sh
Your job will start running as soon as Slurm finds a suitable place for it in the cluster.
Monitoring Your Job
To monitor the status of your jobs on Boqueron, simply type the squeue command, passing your username as an argument.
$ squeue -u <your username>
Cancelling Your Job
Sometimes you’ll need to cancel a job. To do so, you would use the scancel command, passing the job’s ID as an argument.
$ scancel <job id>
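The job ID needed by scancel is reported when the job is submitted: sbatch normally confirms a submission with a line of the form "Submitted batch job <id>". The sketch below extracts the ID from such a line. The function name job_id_from_sbatch_output is hypothetical, and the sample message uses a made-up ID so that the snippet can be tried without Slurm.

```shell
# Hypothetical helper: pull the job ID out of sbatch's confirmation message,
# which normally reads "Submitted batch job <id>".
job_id_from_sbatch_output() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On Boqueron the message would come from: sbatch slurm-script.sh
# A sample message with a made-up ID is used here instead.
job_id=$(job_id_from_sbatch_output "Submitted batch job 12345")
echo "To cancel: scancel $job_id"
```

If you lose track of the ID, the squeue command shown earlier lists the IDs of all your queued and running jobs.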
Running Interactive Jobs
Sometimes the software that you would like to run needs to be run interactively. For these kinds of jobs, you need to use the srun command to request Slurm to allocate you an interactive session on one of the worker nodes.
$ srun --pty bash
Note that the --pty bash argument is required.
Boqueron’s /home and /work
Boqueron features two very important and very different filesystems that users need to understand well in order to correctly use cluster resources: /home and /work. Below you will find all the fundamental information you need to know about these filesystems.
/home is a relatively small filesystem where users land when they connect to Boqueron. It is meant to provide a holding space for your output and other valuable data until they can be moved elsewhere outside of Boqueron.
Structure of /home
/home is subdivided in research group directories, and each group directory contains yet another subdirectory for each of that group’s members.
For example, user jcbonilla from group hpcf has the following as his $HOME directory:
/home/hpcf/jcbonilla
User olena also from group hpcf has the following $HOME:
/home/hpcf/olena
But user mfurukaw from group hcc_unl has the following:
/home/hcc_unl/mfurukaw
The Intended Purpose of /home
/home is meant to be a place for users to temporarily keep data that they either need for their computations or that are output of their computations.
As mentioned in the HPCf Usage Policies, /home (and any other filesystem on Boqueron) is not meant to be used as a place for long-term storage. Files that are contained in /home are files that are expected to be moved by their owners to a place outside of Boqueron in the near future. We do our best to keep users’ data safe, but each user is ultimately responsible for keeping his or her own data safe. Please make sure to keep backups of your data outside of Boqueron.
Quotas on /home
/home enforces a limit on how much of its space each research group may consume. These limits are known as quotas. The quota on /home for each research group is currently set to 100 GB per group.
To confirm the quota consumption on /home for your group, you can use the following command:
How jobs interact with /home
/home is not meant to be used as a filesystem where jobs can be run from. Because of this, /home is mounted as a read-only filesystem on the cluster’s worker nodes.
What this means in concrete terms is that if you launch a job from /home it will fail because the job will need to create output and error files, but /home cannot be written to from the worker nodes.
In order for your jobs to run, you must launch them from /work.
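One way to avoid launching a job from /home by accident is to check the current directory before submitting. The following is a minimal sketch of such a check. The function name is_under_work is hypothetical, and the sketch assumes that your work directory is reached through a path that literally begins with /work rather than through a symbolic link.

```shell
# Hypothetical guard: succeed only if the given directory is inside /work.
is_under_work() {
    case "$1" in
        /work|/work/*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Typical use before submitting (slurm-script.sh is the example name used
# earlier in this guide):
#   is_under_work "$PWD" && sbatch slurm-script.sh
if is_under_work "$PWD"; then
    echo "OK: $PWD is under /work"
else
    echo "Not under /work: run 'cd \$WORK' before submitting"
fi
```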
/work is a relatively large filesystem that is meant to be used as so-called “scratch space”, that is to say, space on Boqueron that is meant for making computations and obtaining results that will then be stored elsewhere. /work is the filesystem from which you submit your jobs and where you keep their input data.
Structure of /work
/work is subdivided in research group directories, and each group directory contains yet another subdirectory for each of that group’s members.
For example, user jcbonilla from group hpcf has the following as his $WORK directory:
/work/hpcf/jcbonilla
User olena also from group hpcf has the following $WORK:
/work/hpcf/olena
But user mfurukaw from group hcc_unl has the following:
/work/hcc_unl/mfurukaw
The Intended Purpose of /work
/work is meant to be a place for users to run their jobs from. /work provides far better performance than any other filesystem on Boqueron, and as such helps your jobs run faster by reducing the time they spend on I/O operations.
The trade-off to this high performance provided by /work is that it is not built to be a reliable, long-term storage filesystem.
Furthermore, as mentioned in the HPCf Usage Policies, since the data on /work is intended to be just the data you need to run your jobs, data in users’ /work directory is not and will not be backed up. Users should not keep anything that they are not willing to lose inside their /work directory. Please make sure to always keep backups of your data.
If all this sounds a little overwhelming, we recommend trying out the following workflow for Boqueron:
1. Write or move your source code files and/or submit scripts into your $HOME.
2. Make a copy of those sources into your $WORK. This step is important. Should anything happen to data on /work, you would only be risking a copy of your sources.
3. Copy any input data into your $WORK. If the input data is small, you can keep the original in /home and just copy it over to /work. If the input data is quite large, you will likely have to keep the original outside of Boqueron and copy it into your $WORK for running your jobs.
4. Once your job has successfully run, move your results from /work to another location for safekeeping. Additionally, you should delete any input data that you will no longer need for other jobs.
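The workflow above can be sketched with ordinary file commands. To keep the sketch safe to try anywhere, two temporary directories stand in for $HOME and $WORK, and every file and directory name (myproject, input.dat, results.out) is hypothetical. On Boqueron you would use your real $HOME and $WORK and submit the job with sbatch where indicated.

```shell
# Stand-in directories so the sketch can be tried anywhere; on Boqueron you
# would use "$HOME" and "$WORK" instead. All file names are hypothetical.
home_dir=$(mktemp -d)
work_dir=$(mktemp -d)

# Step 1: sources and submit scripts live in the home directory.
mkdir -p "$home_dir/myproject"
echo '#!/bin/bash' > "$home_dir/myproject/slurm-script.sh"

# Step 2: copy (do not move) them into the work directory.
cp -r "$home_dir/myproject" "$work_dir/"

# Step 3: copy input data into the work directory as well.
echo "sample input" > "$home_dir/input.dat"
cp "$home_dir/input.dat" "$work_dir/myproject/"

# The job would be submitted from "$work_dir/myproject" at this point.
# A sample result file is created here in place of real job output.
echo "sample result" > "$work_dir/myproject/results.out"

# Step 4: move results out of the work directory, then delete input data
# that is no longer needed.
mv "$work_dir/myproject/results.out" "$home_dir/"
rm "$work_dir/myproject/input.dat"
```

Because step 2 copies rather than moves, the originals in the home directory survive even if the copy under the work directory is lost.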
Quotas on /work
/work enforces a limit on how much of its space each research group may consume. These limits are known as quotas. The quota on /work for each research group is currently set to 6 TB per group.
To confirm the quota consumption on /work for your group, you can use the following command:
Finding Software
Since a cluster is such a complex system, a mechanism is needed to organize and keep track of the software that is installed in it. Oftentimes a cluster will even need to have multiple versions of the same piece of software available. If not handled correctly, this situation could turn into a real hassle for both users and system administrators. To avoid this, Boqueron uses module files to keep software organized and easy to manage.
Tip: If you just came here looking for info on MPI modules, see the section Loading MPI Modules below.
What are module files?
Briefly, module files are simply files that define where a specific piece of software’s installation is located in Boqueron. The collection of a software’s installation and its module file together make up a module. When a user needs a particular software of a particular version for their work, he or she simply needs to load the module of the software and version they need, and the software will become available.
Basic module commands
Following are some of the basic commands you will need to find and use software on Boqueron.
View all available modules
To view a list of all modules available on Boqueron simply issue the following command.
$ module avail
Sometimes, the list of modules available will change depending on other modules you currently have loaded. For more details on this, see the section Loading MPI Modules below.
View available modules of a specific software
Sometimes you might want to view the available modules for just a single piece of software (for example, the same software may have various versions installed on Boqueron). In that case, you can issue the module avail command along with the name of the software to see its modules. For example, to see all modules available for gcc, you would type the following.
$ module avail gcc
Load a module
Once you know which module you wish to use, you can load it through the module load command. For example, to load the python2/2.7.11 module, you would type the following.
$ module load python2/2.7.11
You can leave off the module version and just enter the software name.
$ module load python2
However, we recommend only doing this for software that has a single version so as to avoid confusion as to which version of the software gets loaded.
View currently loaded modules
To view the modules you currently have loaded, you would use the module list command.
$ module list
Unload a module
Unloading is just as simple as loading. Here’s an example for unloading the python2/2.7.11 module we loaded earlier.
$ module unload python2/2.7.11
$ module unload python2
Get help for a specific module
To get information about a module, you can use the module help command.
$ module help egglib/2.1.11
This is especially useful when loading modules for libraries, since the user will need to know the location of specific include directories to use them for compilations.
What happens after I load a module?
Loading a module makes that module’s software available to you. This is better illustrated by example.
First, let’s see what happens when we connect to Boqueron and try running Python 3 before loading its module.
$ ssh <username>@boqueron.hpcf.upr.edu
$ python3
-bash: python3: command not found
Now let’s try running Python 3 after we load its module.
$ module load python3
$ python3
Python 3.5.1 (default, Dec 10 2015, 15:18:52)
[GCC 5.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Here we see that loading Python 3’s module made the command python3 available to our environment.
We can also see modules at work in changing and managing different versions of the same software. For example, let’s check what version of gcc we have when we log in to Boqueron.
$ ssh <username>@boqueron.hpcf.upr.edu
$ gcc -v
...
gcc version 5.1.0 (GCC)
Now let’s see our result after we load the gcc/4.9.3 module.
$ module load gcc/4.9.3
$ gcc -v
...
gcc version 4.9.3 (GCC)
The most important point here is that after loading the module, we still just used the same gcc command as before loading it.
This is one of the conveniences of using modules: to use different versions of the same software you don’t have to change the actual commands you use, you just have to change the context in which the same commands are made. This makes your code and job scripts easier to manage.
Loading MPI Modules
In Boqueron, there are different versions of MPI available for different compilers. Therefore to load MPI, you must load its related compiler first. MPI’s module will not be visible until you load a compiler that has an MPI version installed. This situation is demonstrated in the following example using gcc ver. 4.8.5 and mvapich2 ver. 2.2b.
$ module avail mvapich2
$ module load gcc/4.8.5
$ module avail mvapich2
-------------- /cm/shared/modulefiles_gcc/4.8.5 --------------
mvapich2/2.2b
The first time module avail mvapich2 was run, nothing was returned, indicating that there are no modules available for mvapich2. However, when the same command was run after loading the gcc/4.8.5 module, an mvapich2/2.2b module appeared as available.
In short, you can only load an MPI module if you have loaded its associated compiler first.
$ module load gcc/4.8.5
$ module load mvapich2
Here’s another loading example but with gcc 4.7.4 and openmpi 1.8.8.
$ module load gcc/4.7.4
$ module load openmpi/1.8.8
To find out if a specific compiler has an MPI version installed with it, simply load the compiler’s module and then run module avail to see if MPI modules have become available. The modules unique to that compiler will be presented in a separate section.
Compiler-Specific Software
Certain software on Boqueron is tied to specific compilers. To be able to load that software’s module you have to load its compiler’s module first (more info can be found in the section Finding Software). To help you navigate these special modules, we’ve put together the following list of the software that is tied to each compiler on Boqueron.
mvapich2/2.2b openmpi/1.10.2 openmpi/1.8.8 openmpi/2.0.2
mvapich2/2.2b openmpi/1.10.2 openmpi/1.8.8 openmpi/2.0.2
boost/1.60.0 jdftx/1.2.1 maker/2.31.8 memesuite_parallel/4.11.1 netcdf/c/4.4.1 openmpi/1.10.2 openmpi/2.0.2 qe/5.4.0 raxml/8.2.6 gpaw/0.11.0 ldhelmet/1.7 mash/0a9a3f3 mvapich2/2.2b netcdf/fortran/4.4.4 openmpi/1.8.8 qe/5.3.0 qe-gipaw/5.4 vinalc/1.1.2
fac/1.1.1 fac/1.1.4 openmpi/2.0.2 qe/6.1
mpich/3.2 openmpi/1.10.2 openmpi/1.8.8
lammps/30Jul16 mvapich2/2.2b netcdf/c/4.4.1 netcdf/fortran/4.4.4 openmpi/1.10.2 openmpi/1.8.8 vasp/5.4.1 vasp-vtst/5.4.1