Boqueron supports running Spark as a regular Slurm job. In this article we discuss the steps that users need to follow to ensure Spark runs correctly on Boqueron.
The approach we take to working with Spark at the HPCf is heavily based on this documentation piece by Princeton.
Launching Spark from a Slurm script
At its most basic level, Spark’s goal is to make it easy to set up a cluster on which to run Spark applications. This is accomplished by launching a “master process” on one computer and a number of “worker processes” on other computers; once this master-worker cluster is set up, users can run Spark applications on it.
At the HPCf, we have a script called
spark-start that, when run within a Slurm job, will take care of setting up the Spark cluster within the resources allocated by Slurm.
Let’s look at an example Slurm script to launch a Spark cluster at the HPCf:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=<job name>
#SBATCH --nodes=5
#SBATCH --mem=10G
#SBATCH --cpus-per-task=3
#SBATCH --mail-user=<user email address>
#SBATCH --mail-type=ALL

module load spark
spark-start
spark-submit <your spark job>
This script asks for 1 hour of running time, 5 nodes, 10 GB of memory per node, and 3 cores (CPUs) per node. All Spark nodes (the master and the workers) will be launched with these resource specifications. Note that setting --nodes is required for a Spark job to run; spark-start will exit with an error otherwise. Also, the value of --nodes must be greater than or equal to 2.
Keep in mind that the master node is included in the
--nodes count, so a count of 5 will launch 1 master and 4 workers.
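To make the counting concrete, here is a one-line shell sketch (the 5-node figure mirrors the example script above):

```shell
# One of the allocated nodes becomes the Spark master;
# the remaining nodes become workers.
nodes=5
workers=$((nodes - 1))
echo "$workers"   # prints 4
```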
When spark-start runs, it will choose a master from among the nodes Slurm has allocated for the job, appoint the rest as workers, and connect them to the master. You do not need to specify anything for spark-start to work: it will run on its own and write to the job’s output file to let you know which node it has chosen as master. You can confirm the identity of the master node there, or you can simply wait for the master to write its log file (the log file will appear in the same directory the job was submitted from). The master’s log will contain the master’s name in the log’s filename.
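As an illustration, Spark’s standalone daemons conventionally name their logs spark-<user>-org.apache.spark.deploy.master.Master-1-<hostname>.out. Assuming that convention holds on Boqueron (an assumption; check your own job’s files), the master’s hostname can be pulled out of the filename with shell parameter expansion:

```shell
# Hypothetical log filename following Spark's usual standalone naming scheme
log="spark-alice-org.apache.spark.deploy.master.Master-1-node-42.out"

# Strip everything up to and including "Master-1-" ...
host="${log##*Master-1-}"
# ... then drop the ".out" suffix to leave just the hostname.
host="${host%.out}"

echo "$host"   # prints node-42
```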
Once spark-start runs successfully, the Spark cluster will be set up, and spark-submit will then run your Spark job.
Note: Manually specifying variables such as SPARK_HOME will interfere with spark-start and create unpredictable behavior. Do not specify any Spark variables manually.
When spark-start runs successfully, the Spark master and workers will begin writing their log files in the same directory from which the Spark job was launched. You will also see Slurm’s own output file being generated. These log files contain a lot of useful information, including the names of the nodes in the Spark cluster.
Once you have confirmed the name of the master node, you can use Firefox to connect to it and monitor the Spark cluster. To do so, first make sure that you set the -X option when you connect to Boqueron through SSH, like this:
ssh -X <username>@boqueron.hpcf.upr.edu
The -X option allows Boqueron to forward graphical data to your local computer, which lets you launch Firefox from within Boqueron and view it on your own computer.
After you’ve launched the Spark job, run the following command:
firefox --no-remote http://<master node name>:8030
substituting the actual name of the master node for
<master node name>. This will launch Firefox and connect to the master’s Web UI.
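For instance, assuming the log files identified node-42 as the master (a hypothetical name), the URL passed to Firefox could be built like this:

```shell
# Hypothetical master node name taken from the Spark logs
master="node-42"

# The master's Web UI address, using the port given above
url="http://${master}:8030"

echo "$url"   # prints http://node-42:8030
```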
Note: Windows users connecting through PuTTY will have to set the Enable X11 Forwarding option under Connections > SSH > X11 to allow forwarding of graphical data from Boqueron.