This repository contains several files for building Apache Spark-focused container images, targeted for use on OpenShift Origin.
By default, it will build the following images into your local Docker registry:

- `openshift-spark`: Apache Spark, Python 2.7
- `openshift-spark-py36`: Apache Spark, Python 3.6
For Spark versions, please see the `image.yaml` file.
Create all images and save them in the local Docker registry.
make
Tag and push the images to the designated reference.
make push SPARK_IMAGE=[REGISTRY_HOST[:REGISTRY_PORT]/]NAME[:TAG]
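For example, pushing to a private registry might look like the following; the registry host, port, organization, and tag here are placeholders, not values defined by this repository:

$ make push SPARK_IMAGE=registry.example.com:5000/myorg/openshift-spark:latest  # placeholder image reference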
There are several ways to customize the construction and build process. This project uses the GNU Make tool for the build workflow; see the Makefile for more information. For container specification and construction, the container image creation tool concreate is used as the primary point of configuration; see the `image.yaml` file for more information.
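A quick way to inspect the build workflow before changing it is a dry run, which prints the commands the default target would execute without running them (standard GNU Make behavior, not specific to this project):

$ make -n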
This repository also supports building 'incomplete' versions of the images which contain tooling for OpenShift but lack an actual Spark distribution. An s2i workflow can be used with these partial images to install a Spark distribution of a user's choosing. This gives users an alternative to checking out the repository and modifying build files if they want to run a custom Spark distribution. By default, the partial images built will be:

- `openshift-spark-inc`: Apache Spark, Python 2.7
- `openshift-spark-inc-py36`: Apache Spark, Python 3.6
To build the partial images, use make with Makefile.inc
make -f Makefile.inc
Tag and push the images to the designated reference.
make -f Makefile.inc push SPARK_IMAGE=[REGISTRY_HOST[:REGISTRY_PORT]/]NAME[:TAG]
To produce a final image, a source-to-image build must be performed which takes
a Spark distribution as input. This can be done in OpenShift or locally using
the s2i tool if it's installed.
The final images created can be used just like the `openshift-spark` and `openshift-spark-py36` images described above.
To complete the Python 2.7 image using the s2i tool:
$ mkdir build_input
$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz -O build_input/spark-2.3.0-bin-hadoop2.7.tgz
$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz.md5 -O build_input/spark-2.3.0-bin-hadoop2.7.tgz.md5
$ s2i build build_input radanalyticsio/openshift-spark-inc openshift-spark
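If the build succeeds, the completed image will be available locally; a quick check with the standard Docker CLI (nothing specific to this repository) is:

$ docker images openshift-spark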
To complete the Python 2.7 image using OpenShift, for example:
$ oc new-build --name=openshift-spark --docker-image=radanalyticsio/openshift-spark-inc --binary
$ oc start-build openshift-spark --from-file=https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
(note that the value of `--from-file` could also be the `build_input` directory from the s2i example above)
This will write the completed image to an imagestream called `openshift-spark` in the current project.
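To confirm that the build populated the imagestream (assuming the `oc` client is logged in and the current project is the one used for the build):

$ oc get imagestream openshift-spark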
Note that all of the images described here will respond to a 'usage' command for reference. For example
$ docker run --rm openshift-spark:latest usage