Boqueron is the main scientific computation cluster at the HPCf. It provides over 2,240 compute cores and 200 terabytes of high-performance storage, served over a QDR InfiniBand and 10G Ethernet backbone. Jobs are managed by the Slurm Workload Manager and may use distributed-memory, shared-memory, or hybrid parallelism.
To work in a high-performance environment such as Boqueron, there is some information that you should know. The topics below cover the essentials of working on Boqueron. Feel free to refer to them whenever you find yourself stuck on something. If you have a question that is not answered in the topics below, let us know at help(at)hpcf.upr.edu, and we’ll be glad to help you out.
Logging in to Boqueron
Your Boqueron username and password are for logging in to Boqueron through SSH. How you connect to Boqueron will depend on your operating system.
Note: Do not try to connect to Boqueron by pointing your Web browser to http://boqueron.hpcf.upr.edu.
Linux or Mac OS X
On Linux or Mac OS X, the process to log in to Boqueron is quite straightforward.
- Open a Terminal window.
- Type in the following command (without the leading $), substituting your actual username for [username]. Then hit Enter.
$ ssh [username]@boqueron.hpcf.upr.edu
- If this is your first time connecting, you will now be asked if you want to trust the connection to Boqueron; answer "yes" and hit Enter again.
- You will then be asked for your password. This completes your login.
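If you log in often, the steps above can be shortened with an SSH host alias. The following is only a sketch: the alias name boqueron and the function name add_boqueron_alias are illustrative choices, not HPCf conventions, and it assumes an OpenSSH client that reads ~/.ssh/config.

```shell
#!/bin/bash
# Sketch: append a "boqueron" host alias to an SSH client config file.
# The alias and function names are illustrative, not HPCf conventions.
add_boqueron_alias() {
    local user="$1"                          # your Boqueron username
    local config="${2:-$HOME/.ssh/config}"   # config file to append to
    mkdir -p "$(dirname "$config")"
    cat >> "$config" <<EOF
Host boqueron
    HostName boqueron.hpcf.upr.edu
    User $user
    Port 22
EOF
}

# Usage (run once):
#   add_boqueron_alias your_username
# Afterwards, "ssh boqueron" should behave like the full ssh command above.
```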
Windows
Unlike Linux or Mac OS X, Windows does not provide an SSH-capable terminal by default. Therefore, you'll need to download software that provides such functionality. The most popular tool (and our recommendation) is PuTTY, which you can download for free. After you download and install PuTTY (or whichever software you chose), you can connect to Boqueron with the following information.
- Hostname: boqueron.hpcf.upr.edu
- Port: 22
- Username: Your Boqueron username
- Password: Your Boqueron password
Android
Though you'll find yourself connecting to Boqueron mostly through your workstation or laptop, you might sometimes find it useful to connect to Boqueron through your smartphone or tablet. Apps such as JuiceSSH allow you to establish SSH connections from your Android device. Just remember to use boqueron.hpcf.upr.edu as the hostname/address of the server you want to connect to, and make sure you are connecting using port 22.
Submitting Jobs
Boqueron provides a high-performance file system for you to work in, appropriately called /work. HPCf guidelines require that you always keep any files related to work you are doing in the /work file system. When you log in to Boqueron, you are placed in your home directory, which is meant to hold the final results of jobs you've run. Therefore, you will need to change into /work before submitting any jobs to be run on Boqueron. To do so, simply run the command
$ cd $WORK
and you'll be in your work directory. Before submitting a job, always remember to have all the job-related data in your /work directory.
Creating a Submit Script
Once you have all your data in place, you are ready to create a "submit script", which is simply a file that specifies the instructions that your job will run. The following is an example submit script.
#!/bin/bash
#SBATCH --time=[hh:mm:ss]
#SBATCH --job-name=[job name]
#SBATCH --mail-user=[your email address]
#SBATCH --mail-type=ALL
#SBATCH --workdir=[the directory where job will run]
#SBATCH --error=[error filename]
#SBATCH --output=[output filename]
#This script simply prints out current clock time in 12-hour format
/bin/date +%r
The lines that start with #SBATCH are options passed to the Slurm resource manager (the software that schedules and manages jobs on Boqueron). There are many other options that may be specified, but these defaults work for most jobs. You don't need to concern yourself with the line beginning with #!; just make sure it's included at the very top of your script. Any other lines starting with the # character are "comments", which are simply arbitrary text that is ignored by the system. The options given to Slurm in the #SBATCH lines take specific values after the equals sign (=). Replace the parts in square brackets ([, ]) in the example above with the specific values that your job will use. For example, if your job will run for at most 1 hour, you would write the following.
#SBATCH --time=01:00:00
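As a sketch of what a fully filled-in script might look like, the helper below writes one to a file. The job name, e-mail address, and directory are made-up placeholders, and write_example_job is a hypothetical helper name. The --workdir option is kept as this guide uses it; newer Slurm releases may name it --chdir, so check man sbatch on the cluster.

```shell
#!/bin/bash
# Sketch: write a filled-in copy of the example submit script to a file.
# All values below (job name, e-mail, directory) are made-up placeholders.
write_example_job() {
    local out="${1:-slurm-script.sh}"
    cat > "$out" <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=clock-test
#SBATCH --mail-user=someone@example.edu
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/mygroup/myuser/clock-test
#SBATCH --error=clock-test.err
#SBATCH --output=clock-test.out

# Print the current clock time in 12-hour format
/bin/date +%r
EOF
}

# Usage:
#   write_example_job slurm-script.sh
```

Because the #SBATCH lines are ordinary comments to bash, the generated file can also be run directly with bash as a quick sanity check before it is submitted.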
Note: There is a 1 week maximum running time limit for all jobs on Boqueron. You must specify a value less than or equal to 1 week in order for your job to run. Specifying a time limit greater than 1 week will result in your job staying stuck in its queue forever (until it is manually cancelled), waiting for a node that can accommodate it.
Below the lines starting with #SBATCH (and any optional comments) is where you would place the actual commands your job will execute. In this example, we simply execute the date command to print the current time.
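Since a job that asks for more than 1 week waits in the queue indefinitely, it can help to check the time limit before submitting. The function below is a minimal sketch, not an HPCf tool: within_time_limit is a hypothetical name, and it only understands the hh:mm:ss form used in this guide.

```shell
#!/bin/bash
# Sketch: check that an hh:mm:ss time limit does not exceed 1 week.
# Only the hh:mm:ss form used in this guide is handled.
MAX_SECONDS=$((7 * 24 * 3600))   # 1 week = 604800 seconds

within_time_limit() {
    # Reject anything that is not hours:MM:SS
    [[ "$1" =~ ^[0-9]+:[0-9]{2}:[0-9]{2}$ ]] || return 2
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values such as "08" are not read as octal
    local total=$((10#$h * 3600 + 10#$m * 60 + 10#$s))
    [ "$total" -le "$MAX_SECONDS" ]
}

# Usage:
#   within_time_limit 01:00:00 && echo "OK to submit"
```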
Special Note on Parallel Jobs
For parallel jobs, you will also need to specify the --ntasks parameter and, optionally, the --cpus-per-task parameter in your submit script.
#SBATCH --ntasks=[number of tasks]
#SBATCH --cpus-per-task=[number of cpus per task]
The --ntasks parameter tells Slurm how many parallel processes ("tasks") to launch for your job, while the --cpus-per-task parameter tells Slurm how many cores to allocate for each of those processes. If the --cpus-per-task parameter is not specified in your job script, Slurm will automatically set it to 1. For example, to run an MPI job with 5 processes (and 1 core per process) you would specify the following.
#SBATCH --ntasks=5
You will also need to specify that you wish to run your MPI job on the mpi partition (Slurm calls queues "partitions").
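A rough sketch of how these settings might come together in one MPI submit script follows. It assumes the partition is selected with Slurm's --partition option and that the program is started with srun; the right launcher (srun or mpirun) depends on how MPI was built, and my_mpi_program and write_mpi_job are hypothetical names.

```shell
#!/bin/bash
# Sketch: write an MPI submit script for 5 tasks on the "mpi" partition.
# The job name and executable are placeholders.
write_mpi_job() {
    local out="${1:-mpi-job.sh}"
    cat > "$out" <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=mpi-test
#SBATCH --partition=mpi
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1

# Load the compiler first, then its MPI module
module load gcc/4.8.5
module load mvapich2

# Hypothetical executable; the launcher depends on the MPI build
srun ./my_mpi_program
EOF
}

# Usage:
#   write_mpi_job mpi-job.sh
```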
The --mem-per-cpu parameter is used to specify the amount of memory (RAM), in MB, to allocate per requested compute core. This option also affects non-parallel jobs: since non-parallel jobs run on a single core, the --mem-per-cpu parameter defines the amount of memory for the entire job. If this parameter is not specified, Boqueron will automatically set it to 500 MB. Once a job starts running, its allocated memory is available exclusively to that job, so please be mindful when setting this value so as not to reserve unnecessarily large amounts of memory that other users could use to run their jobs. If the total amount of requested memory (--mem-per-cpu multiplied by the number of requested cores, which --ntasks and --cpus-per-task determine) is larger than what the cluster can physically accommodate, Slurm will reject your job with an error indicating that no such allocation is possible.
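To get a feel for how much memory a job reserves in total, the arithmetic can be sketched as below. This assumes the total is simply tasks times cores per task times memory per core; total_mem_mb is a hypothetical helper, and its defaults mirror the ones described above (1 core per task, 500 MB per core).

```shell
#!/bin/bash
# Sketch: estimate the total memory (in MB) a job requests.
# Assumes total = ntasks * cpus-per-task * mem-per-cpu.
total_mem_mb() {
    local ntasks="$1"
    local cpus_per_task="${2:-1}"    # default when --cpus-per-task is omitted
    local mem_per_cpu="${3:-500}"    # default when --mem-per-cpu is omitted
    echo $(( ntasks * cpus_per_task * mem_per_cpu ))
}

total_mem_mb 5            # 5 tasks with defaults: prints 2500
total_mem_mb 4 2 2000     # 4 tasks, 2 cores each, 2000 MB per core: prints 16000
```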
Submitting Your Job
After you have created a submit script, simply use the sbatch command to submit your job to Slurm.
$ sbatch slurm-script.sh
Your job will start running as soon as Slurm finds a suitable place for it in the cluster.
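When a job is submitted, sbatch reports a line of the form "Submitted batch job" followed by the job ID. A small sketch for capturing that ID in a script follows; parse_job_id is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: extract the job ID from sbatch's confirmation message.
parse_job_id() {
    # The message reads "Submitted batch job <id>"; the ID is the fourth word
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the cluster:
#   jobid="$(sbatch slurm-script.sh | parse_job_id)"
#   echo "Job ID: $jobid"
```

Keeping the ID in a variable makes it easier to monitor or cancel that specific job later.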
Monitoring Your Job
To monitor the status of your jobs on Boqueron, simply run the squeue command, passing your username as an argument.
$ squeue -u [your username]
Cancelling Your Job
Sometimes you'll need to cancel a job. To do so, use the scancel command, passing the job's ID as an argument.
$ scancel [job id]
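For scripted workflows, monitoring can be sketched as a small polling loop around squeue. The helper names job_is_active and wait_for_job are hypothetical, and the sketch assumes that a finished job no longer appears in squeue's listing.

```shell
#!/bin/bash
# Sketch: poll squeue until a job is no longer listed.
job_is_active() {
    # -h suppresses the header line, -j selects a single job ID
    [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]
}

wait_for_job() {
    local jobid="$1"
    local interval="${2:-60}"   # seconds between checks
    while job_is_active "$jobid"; do
        sleep "$interval"
    done
}

# Usage on the cluster:
#   wait_for_job 12345 && echo "Job 12345 is no longer in the queue"
#   job_is_active 12345 && scancel 12345   # cancel only if still queued or running
```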
Running Interactive Jobs
Sometimes the software that you would like to run needs to be run interactively. For these kinds of jobs, you need to use the srun command to ask Slurm to allocate you an interactive session on one of the worker nodes.
$ srun --pty bash
Note that the --pty bash argument is required.
Boqueron's /home and /work
Boqueron features two very important and very different filesystems that users need to understand well in order to use cluster resources correctly: /home and /work. Below you will find all the fundamental information you need to know about these filesystems.
/home
/home is a relatively small filesystem where users land when they connect to Boqueron. It is meant to provide a holding space for your output and other valuable data until they can be moved elsewhere outside of Boqueron.
Structure of /home
/home is subdivided into research group directories, and each group directory contains a subdirectory for each of that group's members. For example, user jcbonilla from group hpcf has the following as his $HOME directory:
/home/hpcf/jcbonilla
User olena, also from group hpcf, has the following $HOME:
/home/hpcf/olena
User mfurukaw from group hcc_unl, however, has a $HOME under the hcc_unl group directory rather than under hpcf.
The Intended Purpose of /home
/home is meant to be a place for users to temporarily keep data that they either need for their computations or that are the output of their computations. As mentioned in the HPCf Usage Policies, /home (and any other filesystem on Boqueron) is not meant to be used as a place for long-term storage. Files kept in /home are expected to be moved by their owners to a place outside of Boqueron in the near future. We do our best to keep users' data safe, but each user is ultimately responsible for keeping his or her own data safe. Please make sure to keep backups of your data outside of Boqueron.
Quotas on /home
/home enforces a limit on how much of its space each research group may consume. These limits are known as quotas. The quota on /home for each research group is currently set to 100 GB per group. To confirm the quota consumption on /home for your group, you can use the following command:
How jobs interact with /home
/home is not meant to be used as a filesystem to run jobs from. Because of this, /home is mounted as a read-only filesystem on the cluster's worker nodes. In concrete terms, this means that a job launched from /home will fail: the job needs to create output and error files, but /home cannot be written to from the worker nodes. In order for your jobs to run, you must launch them from /work.
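One way to avoid launching a job from /home by accident is to check the current directory before calling sbatch. The following is a minimal sketch that relies on the $WORK variable used elsewhere in this guide; in_work_dir is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: succeed only when the current directory is inside $WORK.
in_work_dir() {
    [ -n "$WORK" ] || return 1
    case "$PWD/" in
        "$WORK"/*) return 0 ;;   # $WORK itself or any subdirectory
        *)         return 1 ;;
    esac
}

# Usage:
#   in_work_dir && sbatch slurm-script.sh || echo "Please cd into \$WORK first"
```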
/work
/work is a relatively large filesystem that is meant to be used as so-called "scratch space", that is, space on Boqueron meant for running computations and obtaining results that will then be stored elsewhere. /work is the filesystem from which you submit your jobs and where you keep their input data.
Structure of /work
/work is subdivided into research group directories, and each group directory contains a subdirectory for each of that group's members. For example, user jcbonilla from group hpcf has the following as his $WORK directory:
/work/hpcf/jcbonilla
User olena, also from group hpcf, has the following $WORK:
/work/hpcf/olena
User mfurukaw from group hcc_unl, however, has a $WORK under the hcc_unl group directory rather than under hpcf.
The Intended Purpose of /work
/work is meant to be a place for users to run their jobs from. /work provides far better performance than any other filesystem on Boqueron, so your jobs will not be slowed down by I/O operations. The trade-off for this high performance is that /work is not built to be a reliable, long-term storage filesystem. Furthermore, as mentioned in the HPCf Usage Policies, since the data on /work is intended to be just the data you need to run your jobs, data in users’ /work directories is not and will not be backed up. Users should not keep anything that they are not willing to lose inside their /work directory. Please make sure to always keep backups of your data.
Example workflow
If all this sounds a little overwhelming, we recommend trying out the following workflow for Boqueron:
1. Write or move your source code files and/or submit scripts into your $HOME.
2. Make a copy of those sources into your $WORK. This step is important. Should anything happen to data on /work, you would only be risking a copy of your sources.
3. Copy any input data into your $WORK. If the input data is small, you can keep the original in /home and just copy it over to /work. If the input data is quite large, you will likely have to keep the original outside of Boqueron and copy it into your $WORK for running your jobs.
4. Once your job has successfully run, move your results from /work to another location for safekeeping. Additionally, you should delete any input data that you will no longer need for other jobs.
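The workflow above can be sketched as two small shell helpers. This is only an illustration: stage_project and collect_results are hypothetical names, and the directory layout in the usage notes is made up.

```shell
#!/bin/bash
# Sketch of the staging workflow: copy sources into $WORK, later move results out.
stage_project() {
    # Steps 2-3: copy a project directory into $WORK, leaving the original in place
    local src="$1"
    mkdir -p "$WORK"
    cp -r "$src" "$WORK/"
}

collect_results() {
    # Step 4: move a results file out of $WORK to a safer destination
    local project="$1" results="$2" dest="$3"
    mkdir -p "$dest"
    mv "$WORK/$project/$results" "$dest/"
}

# Usage (paths are placeholders):
#   stage_project "$HOME/myproject"
#   cd "$WORK/myproject" && sbatch slurm-script.sh
#   collect_results myproject output.dat "$HOME/myproject/results"
```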
Quotas on /work
/work enforces a limit on how much of its space each research group may consume. These limits are known as quotas. The quota on /work for each research group is currently set to 6 TB per group. To confirm the quota consumption on /work for your group, you can use the following command:
Finding Software
Since a cluster is such a complex system, a mechanism is needed to organize and keep track of the software that is installed on it. Oftentimes a cluster will even need to have multiple versions of the same piece of software available. If not handled correctly, this situation could turn into a real hassle for both users and system administrators. To avoid this, Boqueron uses module files to keep software organized and easy to manage.
Tip: If you just came here looking for info on MPI modules, see the section Loading MPI Modules below.
What are module files?
Briefly, module files are simply files that define where a specific piece of software's installation is located on Boqueron. A software's installation and its module file together make up a module. When a user needs a particular version of a particular piece of software for their work, they simply load the module for that software and version, and the software becomes available.
Basic module commands
The following are some of the basic commands you will need to find and use software on Boqueron.
View all available modules
To view a list of all modules available on Boqueron, simply issue the following command.
$ module avail
Sometimes, the list of available modules will change depending on other modules you currently have loaded. For more details on this, see the section Loading MPI Modules below.
View available modules of a specific software
Sometimes you might want to view the available modules for just a single piece of software (for example, the same software may have various versions installed on Boqueron). In that case, you can issue the module avail command along with the name of the software to see its modules. For example, to see all modules available for gcc, you would type the following.
$ module avail gcc
Load a module
Once you know which module you wish to use, you can load it through the module load command. For example, to load the python2/2.7.11 module, you would type the following.
$ module load python2/2.7.11
You can leave off the module version and just enter the software name.
$ module load python2
However, we recommend only doing this for software that has a single version, so as to avoid confusion as to which version of the software gets loaded.
View currently loaded modules
To view the modules you currently have loaded, you would use the module list command.
$ module list
Unload a module
Unloading is just as simple as loading. Here's an example for unloading the python2/2.7.11 module we loaded earlier.
$ module unload python2/2.7.11
Or simply,
$ module unload python2
Get help for a specific module
To get information about a module, you can use the module help command.
$ module help egglib/2.1.11
This is especially useful when loading modules for libraries, since the user will need to know the location of specific include directories to use them for compilations.
What happens after I load a module?
Loading a module makes that module's software available to you. This is better illustrated by example. First, let's see what happens when we connect to Boqueron and try running Python 3 before loading its module.
$ ssh [username]@boqueron.hpcf.upr.edu
$ python3
-bash: python3: command not found
Now let's try running Python 3 after we load its module.
$ module load python3
$ python3
Python 3.5.1 (default, Dec 10 2015, 15:18:52)
[GCC 5.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Here we see that loading Python 3's module made the command python3 available to our environment. We can also see modules at work in changing and managing different versions of the same software. For example, let's check what version of gcc we have when we log in to Boqueron.
$ ssh [username]@boqueron.hpcf.upr.edu
$ gcc -v
...
gcc version 5.1.0 (GCC)
Now let's see our result after we load the gcc/4.9.3 module.
$ module load gcc/4.9.3
$ gcc -v
...
gcc version 4.9.3 (GCC)
The most important point here is that after loading the module, we still used the same gcc command as before loading it. This is one of the conveniences of using modules: to use different versions of the same software you don't have to change the actual commands you use, you just have to change the context in which those commands are run. This makes your code and job scripts easier to manage.
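In a job script, the same idea can be sketched as a guard that loads a module only when the command it provides is missing. The helper name ensure_command is hypothetical, and the sketch assumes the module command is available in the shell where it runs.

```shell
#!/bin/bash
# Sketch: make sure a command is available, loading a module if it is not.
ensure_command() {
    local cmd="$1" mod="$2"
    if ! command -v "$cmd" >/dev/null 2>&1; then
        # Command not found yet, so try loading the module that provides it
        module load "$mod"
    fi
    # Succeed only if the command can now be found
    command -v "$cmd" >/dev/null 2>&1
}

# Usage in a submit script:
#   ensure_command python3 python3 || { echo "python3 is unavailable" >&2; exit 1; }
```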
Loading MPI Modules
On Boqueron, there are different versions of MPI available for different compilers. Therefore, to load MPI, you must load its related compiler first. MPI's module will not be visible until you load a compiler that has an MPI version installed. This situation is demonstrated in the following example using gcc ver. 4.8.5 and mvapich2.
$ module avail mvapich2
$ module load gcc/4.8.5
$ module avail mvapich2
-------------- /cm/shared/modulefiles_gcc/4.8.5 --------------
mvapich2/2.2b
The first time module avail mvapich2 was run, nothing was returned, indicating that there are no modules available for mvapich2. However, when the same command was run after loading the gcc/4.8.5 module, the mvapich2/2.2b module appeared as available. In short, you can only load an MPI module if you have loaded its associated compiler first.
$ module load gcc/4.8.5
$ module load mvapich2
Here's another loading example, but with gcc 4.7.4 and openmpi 1.8.8.
$ module load gcc/4.7.4
$ module load openmpi/1.8.8
To find out if a specific compiler has an MPI version installed with it, simply load the compiler's module and then run module avail to see if MPI modules have become available. The modules unique to that compiler will be presented in a separate section.
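The load order described above can be wrapped in a small helper so that a job script never tries to load MPI before its compiler. This is a sketch under one assumption: that module load reports failure through its exit status, which may not hold for every version of the module system. The name load_mpi is hypothetical.

```shell
#!/bin/bash
# Sketch: load a compiler module first, then the MPI module built for it.
load_mpi() {
    local compiler="$1" mpi="$2"
    # Stop early if the compiler module could not be loaded
    module load "$compiler" || return 1
    module load "$mpi"
}

# Usage:
#   load_mpi gcc/4.8.5 mvapich2
#   load_mpi gcc/4.7.4 openmpi/1.8.8
```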