Boqueron supports running Spark as a regular Slurm job. In this article we discuss the steps that users need to follow to ensure Spark runs correctly on Boqueron.
The approach we take to working with Spark at the HPCf is heavily based on this documentation piece by Princeton.
Launching Spark from a Slurm script
At its most basic level, Spark’s job is to set up a cluster on which Spark applications can run. This is accomplished by launching a “master process” on one computer and a number of “worker processes” on other computers; once this master-worker cluster is set up, users can run Spark applications on it.
At the HPCf, we have a script called spark-start that, when run within a Slurm job, will take care of setting up the Spark cluster within the resources allocated by Slurm.
Let’s look at an example Slurm script to launch a Spark cluster at the HPCf:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=<job name>
#SBATCH --nodes=5
#SBATCH --mem=10G
#SBATCH --cpus-per-task=3
#SBATCH --mail-user=<user email address>
#SBATCH --mail-type=ALL

module load spark
spark-start
spark-submit <your spark job>
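Assuming the script is saved to a file — spark_job.slurm is a hypothetical name used here for illustration — it is submitted and monitored with the standard Slurm commands:

```shell
# Submit the Spark job script to Slurm (spark_job.slurm is a placeholder name)
sbatch spark_job.slurm

# Check the state of your jobs in the queue
squeue -u $USER
```

These commands require a Slurm cluster, so run them from a Boqueron login node.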
This script asks for 1 hour of running time, 5 nodes, 10 GB of memory per node, and 3 cores (CPUs) per node. All Spark nodes (the master and the workers) will be launched with these resource specifications. Note that setting --mem, --cpus-per-task, and --nodes is required for a Spark job to run; spark-start will exit with an error otherwise. Also, the value of --nodes must be greater than or equal to 2.
Keep in mind that the master node is included in the --nodes count, so a count of 5 will launch 1 master and 4 workers.
When spark-start runs, it will choose a master among the nodes Slurm has allocated for the job, appoint the rest as workers, and connect them to the master. You do not need to specify anything for spark-start to work. It runs on its own and writes to the job’s output file to let you know which node it has chosen as master. You can confirm the identity of the master node there, or you can simply wait for the master to write its log file (the log file will appear in the same directory the job was submitted from). The master’s log will contain the master’s name in the log’s filename.
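In a stock Spark installation, the standalone daemons name their log files after the process class and the host they run on, so a pattern match from the submission directory is a quick way to spot the master’s log and read off its hostname. The exact filename format is an assumption here and may vary with the Spark version installed on Boqueron:

```shell
# List the master's log file(s) in the submission directory;
# the master's hostname appears at the end of the filename
# (filename format may differ by Spark version -- adjust the pattern if needed)
ls spark-*Master*.out
```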
If spark-start runs successfully, the Spark cluster will be set up and spark-submit will then run your Spark job.
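As a concrete example, the Spark distribution ships with a Pi-estimation demo. Assuming the spark module exposes the distribution under SPARK_HOME (the standard Spark layout; only read the variable — do not set it, per the note below), the last line of the Slurm script could be:

```shell
# Run the Python Pi-estimation example bundled with Spark.
# $SPARK_HOME/examples/... is the standard Spark layout; confirm the path
# on Boqueron before relying on it. "100" is the number of partitions.
spark-submit $SPARK_HOME/examples/src/main/python/pi.py 100
```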
Note: Manually specifying variables such as SPARK_MASTER or SPARK_HOME will interfere with spark-start and will create unpredictable behavior. Do not specify any Spark variables manually.
Monitoring Spark
After spark-start runs successfully, the Spark master and workers will begin writing their log files in the same directory from which the Spark job was launched. You will also see Slurm’s own output file being generated. These log files contain a lot of useful information, including the names of the nodes in the Spark cluster.
Once you have confirmed the name of the master node, you can use Firefox to connect to it and monitor the Spark cluster. To do so, first make sure that, when you connect to Boqueron through SSH, you use the -X option like this:
ssh -X <username>@boqueron.hpcf.upr.edu
The -X option allows Boqueron to forward graphical data to your local computer, thus allowing you to launch Firefox from within Boqueron and view it on your computer.
After you’ve launched the Spark job, run the following command:
firefox --no-remote http://<master node name>:8030
substituting the actual name of the master node for <master node name>. This will launch Firefox and connect to the master’s Web UI.
Note: Windows users connecting through PuTTY will have to check the Enable X11 Forwarding option under Connection > SSH > X11 to allow forwarding of graphical data from Boqueron.