TensorFlow Cloud

What is this repo?

The TensorFlow Cloud repository provides APIs that will allow to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.

Installation

Requirements

Python >= 3.5
A Google Cloud project
An authenticated GCP account
Google AI platform APIs enabled for your GCP account. We use the AI platform for deploying docker images on GCP.
Either a functioning version of docker if you want to use a local docker process for your build, or create a cloud storage bucket to use with Google Cloud build for docker image build and publishing.
Authenticate to your Docker Container Registry
(optional) nbconvert if you are using a notebook file as entry_point as shown in usage guide #4.

For detailed end to end setup instructions, please see Setup instructions.

Install latest release

pip install -U tensorflow-cloud

Install from source

git clone https://github.com/tensorflow/cloud.git
cd cloud
pip install src/python/.

High level overview

TensorFlow Cloud package provides the run API for training your models on GCP. To start, let's walk through a simple workflow using this API.

Let's begin with a Keras model training code such as the following, saved as mnist_example.py.

import tensorflow as tf

(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()

x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype('float32') / 255

model = tf.keras.Sequential([
  tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, batch_size=128)

After you have tested this model on your local environment for a few epochs, probably with a small dataset, you can train the model on Google Cloud by writing the following simple script scale_mnist.py.

import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.py')

Running scale_mnist.py will automatically apply TensorFlow one device strategy and train your model at scale on Google Cloud Platform. Please see the usage guide section for detailed instructions and additional API parameters.

You will see an output similar to the following on your console. This information can be used to track the training job status.

user@desktop$ python scale_mnist.py
Job submitted successfully.
Your job ID is:  tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e
Please access your job logs at the following URL:
https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123

Setup instructions

End to end instructions to help set up your environment for Tensorflow Cloud.

Create a new local directory

mkdir tensorflow_cloud
cd tensorflow_cloud

Make sure you have python >= 3.5

python -V

Set up virtual environment

virtualenv tfcloud --python=python3
source tfcloud/bin/activate

Set up your Google Cloud project

Verify that gcloud sdk is installed.

which gcloud

Set default gcloud project

export PROJECT_ID=<your-project-id>
gcloud config set project $PROJECT_ID

Authenticate your GCP account

Create a service account.

export SA_NAME=<your-sa-name>
gcloud iam service-accounts create $SA_NAME
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --role 'roles/editor'

Create a key for your service account.

gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

Create a Cloud Storage bucket. Using Google Cloud build is the recommended method for building and publishing docker images, although we optionally allow for local docker daemon process depending on your specific needs.

BUCKET_NAME="your-bucket-name"
REGION="us-central1"
gcloud auth login
gsutil mb -l $REGION gs://$BUCKET_NAME

(optional for local docker setup)

sudo dockerd

Install nbconvert if you plan to use a notebook file entry_point as shown in usage guide #4.

pip install nbconvert

Install latest release of tensorflow-cloud

pip install tensorflow-cloud

Usage guide

As described in the high level overview, the run API allows you to train your models at scale on GCP. The run API can be used in four different ways. This is defined by where you are running the API (Terminal vs IPython notebook), and your entry_point parameter. entry_point is an optional Python script or notebook file path to the file that contains your TensorFlow Keras training code. This is the most important parameter in the API.

run(entry_point=None,
    requirements_txt=None,
    distribution_strategy='auto',
    docker_base_image=None,
    chief_config='auto',
    worker_config='auto',
    worker_count=0,
    entry_point_args=None,
    stream_logs=False,
    docker_image_bucket_name=None,
    **kwargs)

1. Using a python file as entry_point.

If you have your tf.keras model in a python file (mnist_example.py), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.py')

Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.

2. Using a notebook file as entry_point.

If you have your tf.keras model in a notebook file (mnist_example.ipynb), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.ipynb')

Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.

3. Using run within a python script that contains the tf.keras model.

You can use the run API from within your python file that contains the tf.keras model (mnist_scale.py). In this use case, entry_point should be None. The run API can be called anywhere and the entire file will be executed remotely. The API can be called at the end to run the script locally for debugging purposes (possibly with fewer epochs and other flags).

import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_cloud as tfc

tfc.run(
    entry_point=None,
    distribution_strategy='auto',
    requirements_txt='requirements.txt',
    chief_config=tfc.MachineConfig(
            cpu_cores=8,
            memory=30,
            accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
            accelerator_count=2),
    worker_count=0)

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000
BATCH_SIZE = 64


def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label


train_dataset = mnist_train.map(scale).cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(
        28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(train_dataset, epochs=12)

4. Using run within a notebook script that contains the tf.keras model.

In this use case, entry_point should be None and docker_image_bucket_name must be specified, to ensure the build can be stored and published.

Cluster and distribution strategy configuration

By default, run API takes care of wrapping your model code in a TensorFlow distribution strategy based on the cluster configuration you have provided.

No distribution

CPU chief config and no additional workers

tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'])

OneDeviceStrategy

1 GPU on chief (defaults to AcceleratorType.NVIDIA_TESLA_T4) and no additional workers.

tfc.run(entry_point='mnist_example.py')

MirroredStrategy

Chief config with multiple GPUS (AcceleratorType.NVIDIA_TESLA_V100).

tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X'])

MultiWorkerMirroredStrategy

Chief config with 1 GPU and 2 workers each with 8 GPUs (AcceleratorType.NVIDIA_TESLA_V100).

tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_1X'],
        worker_count=2,
        worker_config=tfc.COMMON_MACHINE_CONFIGS['V100_8X'])

TPUStrategy

Chief config with 1 CPU and 1 worker with TPU.

tfc.run(entry_point="mnist_example.py",
        chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
        worker_count=1,
        worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"])

Please note that TPUStrategy with TensorFlow Cloud works only with TF version 2.1 as this is the latest version supported by AI Platform cloud TPU

Custom distribution strategy

If you would like to take care of specifying distribution strategy in your model code and do not want run API to create a strategy, then set distribution_stategy as None. This will be required for example when you are using strategy.experimental_distribute_dataset.

tfc.run(entry_point='mnist_example.py',
        distribution_strategy=None,
        worker_count=2)

What happens when you call run?

The API call will encompass the following:

Making code entities such as a Keras script/notebook, cloud and distribution ready.
Converting this distribution entity into a docker container with the required dependencies.
Deploy this container at scale and train using TensorFlow distribution strategies.
Stream logs and monitor them on hosted TensorBoard, manage checkpoint storage.

By default, we will use local docker daemon for building and publishing docker images to Google container registry. Images are published to gcr.io/your-gcp-project-id. If you specify docker_image_bucket_name, then we will use Google Cloud build to build and publish docker images.

We use Google AI platform for deploying docker images on GCP.

Please note that, when entry_point argument is specified, all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.

Please see run API documentation for detailed information on the parameters and how you can modify the above processes to suit your needs.

End to end examples

cd src/python/tensorflow_cloud/core
python tests/examples/call_run_on_script_with_keras_fit.py

Running unit tests

pytest src/python/tensorflow_cloud/core/tests/unit/

Local vs remote training

Things to keep in mind when running your jobs remotely:

[Coming soon]

Debugging workflow

Here are some tips for fixing unexpected issues.

Operation disallowed within distribution strategy scope

Error like: Creating a generator within a strategy scope is disallowed, because there is ambiguity on how to replicate a generator (e.g. should it be copied so that each replica gets the same random numbers, or 'split' so that each replica gets different random numbers).

Solution: Passing distribution_strategy='auto' to run API wraps all of your script in a TF distribution strategy based on the cluster configuration provided. You will see the above error or something similar to it, if for some reason an operation is not allowed inside distribution strategy scope. To fix the error, please pass None to the distribution_strategy param and create a strategy instance as part of your training code as shown in this example.

Version not supported for TPU training

Error like: There was an error submitting the job.Field: tpu_tf_version Error: The specified runtime version '2.3' is not supported for TPU training. Please specify a different runtime version.

Solution: Please use TF version 2.1. See TPU Strategy in Cluster and distribution strategy configuration section.

TF nightly build.

Warning like: Docker base image '2.4.0.dev20200720' does not exist. Using the latest TF nightly build.

Solution: If you do not provide docker_base_image param, then by default we use pre-built TF docker images as base image. If you do not have TF installed on the environment where run is called, then TF docker image for the latest stable release will be used. Otherwise, the version of the docker image will match the locally installed TF version. However, pre-built TF docker images aren't available for TF nightlies except for the latest. So, if your local TF is an older nightly version, we upgrade to the latest nightly automatically and raise this warning.

Coming up

Distributed Keras tuner support.

Contributing

We welcome community contributions, see CONTRIBUTING.md and, for style help, Writing TensorFlow documentation guide.

License

Apache License 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 251 Commits
.github/workflows		.github/workflows
images		images
src		src
.gitignore		.gitignore
CODEOWNERS		CODEOWNERS
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TensorFlow Cloud

What is this repo?

Installation

Requirements

Install latest release

Install from source

High level overview

Setup instructions

Usage guide

Cluster and distribution strategy configuration

What happens when you call run?

End to end examples

Running unit tests

Local vs remote training

Debugging workflow

Operation disallowed within distribution strategy scope

Version not supported for TPU training

TF nightly build.

Coming up

Contributing

License

About

Releases

Packages

Languages

License

erlang-network/cloud

Folders and files

Latest commit

History

Repository files navigation

TensorFlow Cloud

What is this repo?

Installation

Requirements

Install latest release

Install from source

High level overview

Setup instructions

Usage guide

Cluster and distribution strategy configuration

What happens when you call run?

End to end examples

Running unit tests

Local vs remote training

Debugging workflow

Operation disallowed within distribution strategy scope

Version not supported for TPU training

TF nightly build.

Coming up

Contributing

License

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages