# Cifar10 Tensorflow Estimator Example With YARN Native Runtime

## Prepare data for training

CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.

1. Check out https://github.com/tensorflow/models/:

   ```bash
   git clone https://github.com/tensorflow/models/
   ```

2. Go to `models/tutorials/image/cifar10_estimator`.

3. Generate the data with the following command (requires Tensorflow to be installed):

   ```bash
   python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
   ```

4. Upload the data to HDFS:

   ```bash
   hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
   ```
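Putting the four steps together, a minimal end-to-end preparation script might look like the following sketch (it assumes `git`, a `python` with Tensorflow, and the `hadoop` CLI are all on the `PATH`; adjust paths to your environment):

```bash
#!/usr/bin/env bash
set -e

# 1. Check out the Tensorflow models repository and enter the example dir
git clone https://github.com/tensorflow/models/
cd models/tutorials/image/cifar10_estimator

# 2. Generate CIFAR-10 TFRecords locally (requires Tensorflow installed)
python generate_cifar10_tfrecords.py --data-dir=cifar-10-data

# 3. Upload the generated records to HDFS and verify the upload
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
hadoop fs -ls /dataset/cifar-10-data
```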

**Warning:**

YARN service doesn't allow multiple services with the same name. If you want to reuse a service name, delete the existing service first:

```bash
yarn application -destroy <service-name>
```
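For example, to clean up an earlier run that used the job name from the examples below:

```bash
# Remove the leftover service named tf-job-001 so the name can be reused
yarn application -destroy tf-job-001
```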

## Prepare Docker images

Refer to Write Dockerfile to build a Docker image, or use a prebuilt one:

- `hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0`
- `hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-gpu:1.0.0`
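To use a prebuilt image, you can pull it ahead of time on the nodes that will run containers (shown for the CPU image; the GPU image works the same way). Depending on your registry configuration, YARN may also pull it on demand:

```bash
docker pull hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0
```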

## Run Tensorflow jobs

Set `submarine.runtime.class` to `YarnServiceRuntimeFactory` in `submarine-site.xml`.

```xml
<property>
  <name>submarine.runtime.class</name>
  <value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
  <description>RuntimeFactory for Submarine jobs</description>
</property>
```

The `submarine-site.xml` file is located under `${SUBMARINE_HOME}/conf`.
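If you are creating the file from scratch, the sketch below writes a minimal configuration containing only this property (Hadoop-style site files wrap properties in a `<configuration>` element; in a real deployment, merge the property into your existing file rather than overwriting it):

```bash
# Minimal sketch: write submarine-site.xml with only the runtime property
cat > "${SUBMARINE_HOME}/conf/submarine-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>submarine.runtime.class</name>
    <value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
    <description>RuntimeFactory for Submarine jobs</description>
  </property>
</configuration>
EOF
```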

### Run standalone training

```bash
SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
   --name tf-job-001 --verbose --docker_image <image> \
   --input_path hdfs://default/dataset/cifar-10-data \
   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
   --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
   --tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
```
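After submitting, you can check the state of the service with the standard YARN application CLI:

```bash
# Show the status of the submitted service, including its components
yarn app -status tf-job-001
```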

Explanations:

- When access to HDFS is required, two environment variables must be specified, DOCKER_JAVA_HOME and DOCKER_HADOOP_HDFS_HOME, so that the libhdfs libraries inside the Docker image can be found. We will try to eliminate the need to specify these in the future.
- Docker images for the worker and for Tensorboard can be specified separately. In this case Tensorboard doesn't need a GPU, so we use a CPU Docker image for it. (The same applies to the parameter server in the distributed example below.)
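As a quick sanity check before submitting (a sketch, assuming the prebuilt CPU image from above), you can verify that the two paths passed via `--env` actually exist inside the image:

```bash
# Both paths must exist inside the image for HDFS access to work
docker run --rm hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0 \
  ls -d /usr/lib/jvm/java-8-openjdk-amd64/jre /hadoop-current
```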

### Run distributed training

```bash
SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
   --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
   --input_path hdfs://default/dataset/cifar-10-data \
   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
   --num_workers 2 \
   --worker_resources memory=8G,vcores=2,gpu=1 \
   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
   --ps_docker_image tf-1.13.1-cpu:0.0.1 \
   --num_ps 1 --ps_resources memory=4G,vcores=2,gpu=0 \
   --ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
   --tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
```

Explanations:

- A `num_workers` value greater than 1 indicates distributed training.
- Parameters, resources, and the Docker image of the parameter server can be specified separately. In many cases the parameter server doesn't require a GPU.

For the meaning of the individual parameters, see the QuickStart page.
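Once the job is running, the per-component logs excerpted below can also be retrieved with the standard YARN log CLI (assuming log aggregation is enabled on your cluster):

```bash
# Fetch aggregated logs; the application id is printed at submit time
# and also appears in `yarn app -status tf-job-001`
yarn logs -applicationId <application-id>
```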

### Outputs of distributed training

Sample output of master:

```
...
allow_soft_placement: true
, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe77cb15050>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:29:14.656022: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:8000}
2018-05-06 22:29:14.656097: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.656112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.659359: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:Restoring parameters from hdfs://default/tmp/cifar-10-jobdir/model.ckpt-0
INFO:tensorflow:Evaluation [1/625]
INFO:tensorflow:Evaluation [2/625]
INFO:tensorflow:Evaluation [3/625]
INFO:tensorflow:Evaluation [4/625]
INFO:tensorflow:Evaluation [5/625]
INFO:tensorflow:Evaluation [6/625]
...
INFO:tensorflow:Validation (step 1): loss = 1220.6445, global_step = 1, accuracy = 0.1
INFO:tensorflow:loss = 6.3980675, step = 0
INFO:tensorflow:loss = 6.3980675, learning_rate = 0.1
INFO:tensorflow:global_step/sec: 2.34092
INFO:tensorflow:Average examples/sec: 1931.22 (1931.22), step = 100
INFO:tensorflow:Average examples/sec: 354.236 (38.6479), step = 110
INFO:tensorflow:Average examples/sec: 211.096 (38.7693), step = 120
INFO:tensorflow:Average examples/sec: 156.533 (38.1633), step = 130
INFO:tensorflow:Average examples/sec: 128.6 (38.7372), step = 140
INFO:tensorflow:Average examples/sec: 111.533 (39.0239), step = 150
```

Sample output of worker:

```
, '_tf_random_seed': None, '_task_type': u'worker', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc2a490b050>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:45.807936: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808040: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8000}
2018-05-06 22:28:45.809919: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:loss = 5.319096, step = 0
INFO:tensorflow:loss = 5.319096, learning_rate = 0.1
INFO:tensorflow:Average examples/sec: 49.2338 (49.2338), step = 10
INFO:tensorflow:Average examples/sec: 52.117 (55.3589), step = 20
INFO:tensorflow:Average examples/sec: 53.2754 (55.7541), step = 30
INFO:tensorflow:Average examples/sec: 53.8388 (55.6028), step = 40
INFO:tensorflow:Average examples/sec: 54.1082 (55.2134), step = 50
INFO:tensorflow:Average examples/sec: 54.3141 (55.3676), step = 60
```

Sample output of ps:

```
...
, '_tf_random_seed': None, '_task_type': u'ps', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4be54dff90>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:42.562316: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.562408: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2018-05-06 22:28:42.562433: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.564242: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
```

Notes:

When using the YARN native service runtime, you can view the training history of multiple jobs from the Tensorboard link:

*(Screenshot: Tensorboard showing the training history of multiple jobs.)*

## Run Tensorboard to monitor your jobs

```bash
# Cleanup the previous tensorboard service if needed:
#   yarn application -destroy tensorboard-service

SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
  --name tensorboard-service \
  --verbose \
  --docker_image <your-docker-image> \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --worker_resources memory=2G,vcores=2 \
  --worker_launch_cmd "pwd" \
  --tensorboard
```
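Once the service is up, the Tensorboard link can be retrieved from the service status output (a sketch; the quicklink is also shown in the YARN UI):

```bash
# The status output of a YARN service includes its quicklinks,
# among them the Tensorboard URL
yarn app -status tensorboard-service
```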