CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
- Check out https://github.com/tensorflow/models/:
  git clone https://github.com/tensorflow/models/
- Go to models/tutorials/image/cifar10_estimator
- Generate the data using the following command (requires TensorFlow to be installed):
  python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
- Upload the data to HDFS:
  hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
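Before submitting a job, it can help to sanity-check that the data is where the job will look for it. This is only a sketch; the exact TFRecord file names depend on the version of the models repository.

# Local check: the generator should have produced TFRecord files
ls -lh cifar-10-data/
# HDFS check: confirm the upload before referencing it via --input_path
hadoop fs -ls /dataset/cifar-10-data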
Warning:
Please note that YARN service doesn't allow multiple services with the same name, so if you want to reuse a service name, delete the existing service first by running:
yarn application -destroy <service-name>
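For example, assuming the job name tf-job-001 used later in this guide (a hedged sketch; a running service has to be stopped before it can be destroyed):

# List applications currently known to the ResourceManager
yarn app -list
# Stop the old service if it is still running, then destroy it so the name can be reused
yarn app -stop tf-job-001
yarn app -destroy tf-job-001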
Refer to Write Dockerfile to build a Docker image, or use one of the prebuilt images:
- hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0
- hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-gpu:1.0.0
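If you go with the prebuilt images, they need to be available on every NodeManager host that may run the containers; a minimal sketch:

# CPU image (parameter server, Tensorboard, and CPU-only workers)
docker pull hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0
# GPU image (only needed on GPU-equipped nodes)
docker pull hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-gpu:1.0.0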
Set submarine.runtime.class to YarnServiceRuntimeFactory in submarine-site.xml.
<property>
  <name>submarine.runtime.class</name>
  <value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
  <description>RuntimeFactory for Submarine jobs</description>
</property>
The submarine-site.xml file is located under ${SUBMARINE_HOME}/conf.
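A quick way to confirm the setting is picked up (a sketch that simply assumes the ${SUBMARINE_HOME}/conf layout described above):

grep -A 2 "submarine.runtime.class" ${SUBMARINE_HOME}/conf/submarine-site.xml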
SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
--name tf-job-001 --verbose --docker_image <image> \
--input_path hdfs://default/dataset/cifar-10-data \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
--num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
--tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
Explanations:
- When access to HDFS is required, two environment variables must be specified: DOCKER_JAVA_HOME and DOCKER_HADOOP_HDFS_HOME, so that the libhdfs libraries inside the Docker image can be located. We will try to eliminate the need to specify these in the future.
- The Docker images for the worker and Tensorboard can be specified separately. In this case, Tensorboard doesn't need a GPU, so we use the CPU Docker image for Tensorboard. (The same applies to the parameter server in the distributed example below.)
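After submission, the job can be checked from the YARN side. The following is a sketch using the standard YARN service CLI, assuming the job name tf-job-001 from the command above; the exact output format depends on your Hadoop version.

# Show the service spec, the state of each component (worker, tensorboard),
# and the registered quicklinks such as the Tensorboard link
yarn app -status tf-job-001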
SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
--name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
--input_path hdfs://default/dataset/cifar-10-data \
--env(s) (same as standalone)
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
--ps_docker_image tf-1.13.1-cpu:0.0.1 \
--num_ps 1 --ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
Explanations:
- num_workers > 1 indicates that this is a distributed training job.
- The parameters, resources, and Docker image of the parameter server can be specified separately. In many cases, the parameter server doesn't require a GPU.
For the meaning of the individual parameters, see the QuickStart page!
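The per-role outputs shown in the next section can also be retrieved with the standard YARN log CLI once log aggregation has collected them. A sketch; the application ID is a placeholder you would take from yarn app -list.

# Find the application ID of the submitted job
yarn app -list
# Dump the aggregated container logs; master, worker and ps outputs
# appear under their respective containers
yarn logs -applicationId <application-id>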
Outputs of distributed training
Sample output of master:
...
allow_soft_placement: true
, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe77cb15050>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:29:14.656022: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:8000}
2018-05-06 22:29:14.656097: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.656112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.659359: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:Restoring parameters from hdfs://default/tmp/cifar-10-jobdir/model.ckpt-0
INFO:tensorflow:Evaluation [1/625]
INFO:tensorflow:Evaluation [2/625]
INFO:tensorflow:Evaluation [3/625]
INFO:tensorflow:Evaluation [4/625]
INFO:tensorflow:Evaluation [5/625]
INFO:tensorflow:Evaluation [6/625]
...
INFO:tensorflow:Validation (step 1): loss = 1220.6445, global_step = 1, accuracy = 0.1
INFO:tensorflow:loss = 6.3980675, step = 0
INFO:tensorflow:loss = 6.3980675, learning_rate = 0.1
INFO:tensorflow:global_step/sec: 2.34092
INFO:tensorflow:Average examples/sec: 1931.22 (1931.22), step = 100
INFO:tensorflow:Average examples/sec: 354.236 (38.6479), step = 110
INFO:tensorflow:Average examples/sec: 211.096 (38.7693), step = 120
INFO:tensorflow:Average examples/sec: 156.533 (38.1633), step = 130
INFO:tensorflow:Average examples/sec: 128.6 (38.7372), step = 140
INFO:tensorflow:Average examples/sec: 111.533 (39.0239), step = 150
Sample output of worker:
, '_tf_random_seed': None, '_task_type': u'worker', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc2a490b050>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:45.807936: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808040: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8000}
2018-05-06 22:28:45.809919: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:loss = 5.319096, step = 0
INFO:tensorflow:loss = 5.319096, learning_rate = 0.1
INFO:tensorflow:Average examples/sec: 49.2338 (49.2338), step = 10
INFO:tensorflow:Average examples/sec: 52.117 (55.3589), step = 20
INFO:tensorflow:Average examples/sec: 53.2754 (55.7541), step = 30
INFO:tensorflow:Average examples/sec: 53.8388 (55.6028), step = 40
INFO:tensorflow:Average examples/sec: 54.1082 (55.2134), step = 50
INFO:tensorflow:Average examples/sec: 54.3141 (55.3676), step = 60
Sample output of ps:
...
, '_tf_random_seed': None, '_task_type': u'ps', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4be54dff90>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:42.562316: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.562408: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2018-05-06 22:28:42.562433: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.564242: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
When using the YARN native service runtime, you can view the training history of multiple jobs from a single Tensorboard link by launching a standalone Tensorboard service:
# Cleanup previous tensorboard service if needed
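# A hedged example: "tensorboard-service" matches the --name used below.
# Stop the old instance if it is still running, then destroy it so the name can be reused.
yarn app -stop tensorboard-service
yarn app -destroy tensorboard-service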
SUBMARINE_VERSION=0.4.0-SNAPSHOT
CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
${SUBMARINE_HOME}/conf: \
java org.apache.submarine.client.cli.Cli job run \
--name tensorboard-service \
--verbose \
--docker_image <your-docker-image> \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
--worker_resources memory=2G,vcores=2 \
--worker_launch_cmd "pwd" \
--tensorboard