
Cluster Configuration


Setting up Spark in Cluster Mode

  1. Download Apache Spark from https://spark.apache.org/downloads.html
  2. Extract the downloaded file.
  3. Inside the extracted Apache Spark folder, the conf directory contains templates for the configuration files used to run Spark (see the sketch after this list).
  4. Below are the main settings that need to be changed.
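A minimal sketch of steps 1-3 on a Linux shell; the archive name and version below are placeholders for whichever Spark release you downloaded from the page above:

```bash
# extract the downloaded archive (file name is a placeholder)
tar -xzf spark-x.y.z-bin-hadoopN.N.tgz
cd spark-x.y.z-bin-hadoopN.N/conf

# the conf directory ships *.template files; copy the one we will edit next
cp spark-env.sh.template spark-env.sh
```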

First, JAVA_HOME is a must. Scala is optional, but Spark needs a JVM to run, and JAVA_HOME is where it looks to find what it requires. export JAVA_HOME=/home/shaik/src/jdk1.8.0_131

SPARK_WORKER_CORES determines the number of cores you want to allocate to Spark programs. If your machine has N cores (see the lscpu command), it is better to allocate at most N-2 of them to the Spark workers. export SPARK_WORKER_CORES=10

SPARK_LOCAL_IP is the address a node uses to identify itself in the cluster. Be aware of the /etc/hosts file, which is very useful for letting the machines reach each other by name (a sample is sketched below). export SPARK_LOCAL_IP=127.0.0.1
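For instance, if the nodes should address each other by hostname, every machine's /etc/hosts can map those names to cluster IPs; the hostnames and addresses below are purely illustrative:

```
# /etc/hosts (hypothetical hostnames and addresses)
192.168.1.10   spark-master
192.168.1.11   spark-worker-1
192.168.1.12   spark-worker-2
```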

SPARK_MASTER_PORT is the port on which the Spark master listens; workers and jobs connect to it via spark://host:port. export SPARK_MASTER_PORT=8000

SPARK_MASTER_HOST is the host/IP address of the Spark master. export SPARK_MASTER_HOST=127.0.0.1

SPARK_MASTER_WEBUI_PORT sets a non-default port for the master's web UI (just as SPARK_MASTER_PORT does for the master itself). export SPARK_MASTER_WEBUI_PORT=8888

LIVY_SPARK_MASTER tells Livy which Spark master to submit jobs to on the cluster. export LIVY_SPARK_MASTER=spark://127.0.0.1:8000

ZEPPELIN_LIVY_URL is what Zeppelin uses to reach Livy (%livy.pyspark); the Livy server runs on the host:port given in the value. export ZEPPELIN_LIVY_URL=http://127.0.0.1:5000
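Putting the above together, the environment configuration might look like the sketch below, assuming (as this page does) that all of these exports are placed in conf/spark-env.sh, created from spark-env.sh.template, or in the shell environment used to start the services. The paths, ports, and addresses are the example values from above and should be adjusted to your cluster:

```bash
# conf/spark-env.sh -- example values from this page, adjust for your machines
export JAVA_HOME=/home/shaik/src/jdk1.8.0_131
export SPARK_WORKER_CORES=10
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_MASTER_PORT=8000
export SPARK_MASTER_WEBUI_PORT=8888

# Livy / Zeppelin integration, as listed above
export LIVY_SPARK_MASTER=spark://127.0.0.1:8000
export ZEPPELIN_LIVY_URL=http://127.0.0.1:5000
```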

Master & Slave

Since we are not using a YARN-based cluster, we need to tell the master which nodes are its slaves. (A master node can be a slave as well; the master can also spawn threads on its own machine to perform computations.)

  1. Create a file named just slaves, with no extension, in the conf directory
  2. Add the list of nodes (IPs) you want to be part of the Spark cluster, one per line (a sample is shown below)
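A minimal sketch of the slaves file, assuming the master machine also acts as a worker plus two hypothetical worker IPs:

```
# conf/slaves -- one worker host/IP per line (addresses are placeholders)
127.0.0.1
192.168.1.11
192.168.1.12
```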

Talk Less Do More

To reduce the verbosity of the logs, we can modify the log4j.properties file (again under conf). The line that needs to be changed is log4j.rootCategory=INFO, console, which becomes log4j.rootCategory=ERROR, console.
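A sketch of the relevant line in conf/log4j.properties after the change, assuming the file was first created by copying log4j.properties.template:

```properties
# conf/log4j.properties -- only ERROR-level messages (and above) reach the console
log4j.rootCategory=ERROR, console
```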

Using Zeppelin to Run Spark Jobs Smoothly

  1. First, we need to download Zeppelin from https://zeppelin.apache.org/download.html
  2. There are two options available there -- one with all interpreters pre-installed and the other with only the required interpreters
  3. After extracting it on your local system, change to the conf directory
  4. In the zeppelin-env.sh file, the configurations below need to be added (a consolidated sketch follows at the end of this section)
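For example, assuming Zeppelin was extracted to ~/zeppelin (the path is a placeholder), the env file can be created from the shipped template before editing it:

```bash
cd ~/zeppelin/conf                           # placeholder path to the extracted Zeppelin
cp zeppelin-env.sh.template zeppelin-env.sh  # create the editable file from the template
```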

Informing Zeppelin about the Spark master. export MASTER=spark://127.0.0.1:8000

Telling Zeppelin which additional spark-submit options we use to run our Spark jobs; since we use Tellurium/Antimony, below is a sample of how it can be done. export SPARK_SUBMIT_OPTIONS="--conf spark.executorEnv.PYTHONPATH=/home/shaik/gsoc/tellurium --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/python2.7/site-packages/antimony"

Providing the Python path so that PySpark can run and import the required Python libraries. export PYTHONPATH=/home/shaik/gsoc/tellurium:$PYTHONPATH

We can also keep Zeppelin from overusing our Spark resources by limiting how much memory and how many cores it may take, so that resources are shared among the multiple Spark jobs running (Zeppelin included). export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=20g -Dspark.cores.max=12"
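Putting these together, conf/zeppelin-env.sh might look like the following sketch; the paths and addresses are the example values used above and should be adapted to your setup:

```bash
# conf/zeppelin-env.sh -- example values from this page
export MASTER=spark://127.0.0.1:8000
export SPARK_SUBMIT_OPTIONS="--conf spark.executorEnv.PYTHONPATH=/home/shaik/gsoc/tellurium --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/python2.7/site-packages/antimony"
export PYTHONPATH=/home/shaik/gsoc/tellurium:$PYTHONPATH
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=20g -Dspark.cores.max=12"
```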