Cluster Configuration
- Download Apache Spark from https://spark.apache.org/downloads.html
- Extract the downloaded archive
- Inside the extracted Spark folder, the conf directory contains templates for the configuration files Spark reads at startup
- The main settings that need to change are listed below; they go into conf/spark-env.sh, which is created by copying its template (see the sketch after this list)
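A minimal sketch of these first steps, assuming a Spark 2.x binary tarball (the exact file name depends on the version chosen on the downloads page):

```bash
# Extract the downloaded release (file name is an example; use the one you downloaded)
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7

# Spark only reads spark-env.sh, so activate the template first;
# all of the exports below go into this file
cp conf/spark-env.sh.template conf/spark-env.sh
```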
First, JAVA_HOME is a must. Installing Scala separately is optional, but Spark needs a JVM to run, and JAVA_HOME is where it looks for one.
export JAVA_HOME=/home/shaik/src/jdk1.8.0_131
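A quick way to confirm the variable points at a working JDK (the path is the example one used throughout this walkthrough):

```bash
# Should print the JDK version, e.g. java version "1.8.0_131"
$JAVA_HOME/bin/java -version
```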
SPARK_WORKER_CORES determines the number of cores you want to allocate to Spark worker processes. If your machine has N cores (check with the lscpu command), it is better to allocate at most N-2 cores to Spark, leaving the rest for the OS and other processes.
export SPARK_WORKER_CORES=10
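If you would rather derive the value than hard-code it, a sketch using the standard nproc utility:

```bash
# Allocate all but two of the machine's cores to Spark workers
export SPARK_WORKER_CORES=$(( $(nproc) - 2 ))
```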
SPARK_LOCAL_IP is the address a node binds to and uses to identify itself in the cluster. Be aware of the /etc/hosts file, which is very useful for letting the machines in the cluster resolve and communicate with each other.
export SPARK_LOCAL_IP=127.0.0.1
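For a multi-node cluster, each machine's /etc/hosts can map peer hostnames to addresses; the entries below are placeholders, not values from this setup:

```
# /etc/hosts (placeholder hostnames and addresses)
192.168.1.100   spark-master
192.168.1.101   spark-worker1
192.168.1.102   spark-worker2
```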
The port on which the Spark master listens for connections from workers and clients (the default is 7077; the web UI port is configured separately below).
export SPARK_MASTER_PORT=8000
The host (IP) of the Spark master.
export SPARK_MASTER_HOST=127.0.0.1
SPARK_MASTER_WEBUI_PORT sets a non-default port for the master's web UI (the default is 8080).
export SPARK_MASTER_WEBUI_PORT=8888
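Taken together, these settings make the master URL spark://127.0.0.1:8000 (reused below for Livy and Zeppelin) and put the master web UI on port 8888. Once the master has been started, a quick sanity check could be:

```bash
# Expect HTTP 200 from the master web UI (assumes the master is already running)
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8888
```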
This tells Livy where the Spark master is, so that it can submit jobs to the Spark cluster.
export LIVY_SPARK_MASTER=spark://127.0.0.1:8000
This is for Zeppelin to use Livy (%livy.pyspark). The Livy server must be running on the host:port given in the value.
export ZEPPELIN_LIVY_URL=http://127.0.0.1:5000
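Assuming the Livy server is up on port 5000 as configured above, its REST API offers a simple way to confirm that Zeppelin will be able to reach it:

```bash
# Livy's sessions endpoint; an empty session list confirms the server is reachable
curl http://127.0.0.1:5000/sessions
```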
Since we are not using a YARN-based cluster, we need to tell the master which nodes are its slaves. (The master node can be a slave as well; it can spawn worker processes on its own machine to perform computations.)
- Create a file named just slaves, with no extension, under the conf directory
- Add the list of nodes (IPs) you want to be part of the Spark cluster, one per line (see the example below)
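A sample conf/slaves for a small cluster (placeholder IPs; for a single-machine setup the file can contain just 127.0.0.1):

```
# conf/slaves: one worker per line
192.168.1.100
192.168.1.101
192.168.1.102
```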
In order to reduce the verbosity of logs, we can modify the log4j.properties file (again under conf). The line that needs to be changed is log4j.rootCategory=INFO, console, which becomes log4j.rootCategory=ERROR, console.
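The same change as a pair of shell commands, in case you prefer not to edit the file by hand (a sketch; adjust if your template differs):

```bash
# Activate the logging template, then log only errors to the console
cp conf/log4j.properties.template conf/log4j.properties
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' conf/log4j.properties
```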
Zeppelin Configuration
- First we need to download Zeppelin from https://zeppelin.apache.org/download.html
- There are two options available there: one with all interpreters pre-installed and the other with only the required interpreters
- After extracting it on your local system, change directory to conf
- In the zeppelin-env.sh file (created by copying its template, like spark-env.sh above), the configurations below need to be added
Informing Zeppelin about the Spark master:
export MASTER=spark://127.0.0.1:8000
Telling Zeppelin about the other spark-submit options we use to run our Spark jobs; since we use Tellurium/antimony, below is a sample of how it can be done:
export SPARK_SUBMIT_OPTIONS="--conf spark.executorEnv.PYTHONPATH=/home/shaik/gsoc/tellurium --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/python2.7/site-packages/antimony"
Providing the Python path for PySpark to run and be able to import the required Python libraries:
export PYTHONPATH=/home/shaik/gsoc/tellurium:$PYTHONPATH
We can also keep Zeppelin from overusing our Spark resources: the limits below cap executor memory and total cores so that they can be shared among the multiple Spark jobs running (Zeppelin included):
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=20g -Dspark.cores.max=12"
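Putting the pieces together, the resulting conf/zeppelin-env.sh for this walkthrough would contain the following (paths and ports mirror the exports above):

```bash
# conf/zeppelin-env.sh: collected settings from this walkthrough
export MASTER=spark://127.0.0.1:8000
export SPARK_SUBMIT_OPTIONS="--conf spark.executorEnv.PYTHONPATH=/home/shaik/gsoc/tellurium --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/python2.7/site-packages/antimony"
export PYTHONPATH=/home/shaik/gsoc/tellurium:$PYTHONPATH
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=20g -Dspark.cores.max=12"
export ZEPPELIN_LIVY_URL=http://127.0.0.1:5000
```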