Skip to content

Latest commit

 

History

History
169 lines (124 loc) · 7.54 KB

yarn.md

File metadata and controls

169 lines (124 loc) · 7.54 KB

Configuring Job Server for YARN

(Looking for contributors for this page)

(I would like to thank Jon Buffington for sharing the config tips below.... @velvia)

Note: This is for yarn with docker. If you are looking to deploy on a yarn cluster via EMR, then this link would be more useful EMR

Configuring the Spark-Jobserver Docker package to run in Yarn-Client Mode

To run the Spark-Jobserver in yarn-client mode you have to do a little bit extra of configuration. You can either follow the instructions here for a little bit of explanations or check out the example repository and adjust it to your needson your own. First of all make sure that your have a correct docker installation on the host that shall run the spark-jobserver.

I suggest you to create a new directory for your custom config files.

Files we need:

  • docker.conf
  • dockerfile
  • cluster-config directory with hdfs-site.xml and yarn-site.xml (You should have these files already)

Example docker.conf (important settings are marked with # important):

spark {
  master = "yarn-client" # important
  master = ${?SPARK_MASTER}

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    jobdao = spark.jobserver.io.JobSqlDAO

    sqldao {
      # Directory where default H2 driver stores its data. Only needed for H2.
      rootdir = /database

      # Full JDBC URL / init string.  Sorry, needs to match above.
      # Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
      jdbc.url = "jdbc:h2:file:/database/h2-db"
    }
  }

  # predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1           # Number of cores to allocate.  Required.
  #     memory-per-node = 512m         # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }

  # universal context configuration.  These settings can be overridden, see README.md
  context-settings {
    # choose a port that is free on your system and also the 16 (depends on max retries for submitting the job) next portnumbers should be free 
    spark.driver.port = 32456 # important
    # defines the place where your spark-assembly jar is located in your hdfs
    spark.yarn.jar = "hdfs://hadoopHDFSCluster/spark_jars/spark-assembly-1.6.0-hadoop2.6.0.jar" # important
    
    # defines the YARN queue the job is submitted to
    #spark.yarn.queue = root.myYarnQueue

    num-cpu-cores = 2           # Number of cores to allocate.  Required.
    memory-per-node = 512m         # Executor memory per node, -Xmx style eg 512m, #1G, etc.

    # in case spark distribution should be accessed from HDFS (as opposed to being installed on every mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"

    # uris of jars to be loaded into the classpath for this context. Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]

    # If you wish to pass any settings directly to the sparkConf as-is, add them here in passthrough,
    # such as hadoop connection settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  home = "/usr/local/spark"
}

Now that we have a docker.conf that should work we can create our dockerfile to create a docker container:

dockerfile:

FROM sparkjobserver/spark-jobserver:0.7.0.mesos-0.25.0.spark-1.6.2
EXPOSE 32456-32472                                    # Expose driver port range (spark.driver.port + 16)
ADD /path/to/your/docker.conf /app/docker.conf        # Add the docker.conf to the container
ADD /path/to/your/cluster-config /app/cluster-config  # Add the yarn-site.xml and hfds-site.xml to the container
ENV YARN_CONF_DIR=/app/cluster-config                 # Set env variables for spark
ENV HADOOP_CONF_DIR=/app/cluster-config               # Set env variables for spark

Your dockercontainer is now ready to be build:

docker build -t=your-container-name /directory/with/your/dockerfile

Output should look like this:

Sending build context to Docker daemon  21.5 kB
Step 0 : FROM sparkjobserver/spark-jobserver:0.7.0.mesos-0.25.0.spark-1.6.2
 ---> 7a188f2d0dff
Step 1 : EXPOSE 32456-32472
 ---> f1c91bbaa2d8
Step 2 : ADD ./docker.conf /app/docker.conf
 ---> c7c983e279e3
Step 3 : ADD ./cluster-config /app/cluster-config
 ---> 684951fb6bec
Step 4 : ENV YARN_CONF_DIR /app/cluster-config
 ---> 2f4cbf17443a
Step 5 : ENV HADOOP_CONF_DIR /app/cluster-config
 ---> ca0460d53b28
Successfully built ca0460d53b28

The last step to your jarn-client is to run the docker container that you have just created:

docker run -d -p 8090:8090 -p 32456-32472:32456-32472 --net=host your-container-name

The -p parameters are necessary to publish the ports for the rest interface and to communicate with the cluster. The --net parameter is necessary to make the docker container accessible from the spark cluster. I have found no way to do this in bridging mode (pullrequests appreciated).

Your Spark-Jobserver should now be up and running

Modifying the start-server.sh script

The start-server.sh script does not contain a --master option. In some clusters this means that the job will be launched in cluster mode. To prevent this modify the start script and add:

--master yarn-client

Important Context Settings for yarn

I recently responded to a private question about configuring job-server AWS EMR running Spark and wanted to share with the group.

We are successfully using job-server running on AWS EMR with Spark 1.3.0 in one case and 1.2.1 in another. We found that configuring the job-server app context correctly is critical to for Spark/YARN to maximize resources. For example, one of our clusters is composed of 4 slave r3.xlarge instances. The following snippet allowed us to create the expected number of executors with the most RAM:

...
contexts {
  shared {
    num-cpu-cores = 1 # shared tasks work best in parallel.
    memory-per-node = 4608M # trial-and-error discovered memory per node
    spark.executor.instances = 17 # 4 r3.xlarge instances with 4 cores each = 16 + 1 master
    spark.scheduler.mode = "FAIR"
    spark.scheduler.allocation.file = "/home/hadoop/spark/job-server/job_poolconfig.xml"
  }
}
...

� It was trial and error to find the best memory-per-node setting. If you over allocate memory per node, YARN will not allocate the expected executors.

ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil

(Thanks to user @apivovarov - I believe solution is for non-Docker)

I defined the following env vars in ~/.bashrc

export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/usr/lib/spark

I solved it by adding export EXTRA_JAR=$SPARK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar.

Ok, I solved my issues with jobserver. I use bin/server_package.sh ec2 script to build tar.gz distribution. I extracted dist on the server and run jobserver using server_start.sh script. Now Yarn works.