Multistage build for spark_driver reduces image size #3

Open · wants to merge 1 commit into base: main
Conversation

aaalexlit

I stumbled upon this repo while doing the "Big Data, Hadoop, and Spark Basics" course on edX.
The thing is that `RUN rm ...` doesn't make the image smaller, because it creates yet another layer while leaving the previous one in the image.
I propose the following changes to the Dockerfile to turn it into a multistage build and consequently make the resulting image considerably smaller.
Note that you'll have to change build.sh correspondingly to use the target argument `--target spark`.
Also note that I had to change to spark-3.1.3, since spark-3.1.2 is no longer available from apache.org.
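The actual Dockerfile isn't reproduced in this conversation, but a minimal multistage sketch illustrates why this shrinks the image. Note that the base images, stage names, Spark download URL, and paths below are my assumptions for illustration, not the file proposed in this PR:

```dockerfile
# Illustrative sketch only — base images, stage name, and paths are assumed.

# Stage 1: download and unpack Spark. Everything in this stage, including
# the downloaded archive, is discarded unless explicitly copied out.
FROM alpine:3 AS downloader
ARG SPARK_VERSION=3.1.3
RUN wget -qO- "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.2.tgz" \
      | tar -xzf - -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop3.2" /opt/spark

# Stage 2 ("spark", the --target of the build): copy only the unpacked
# distribution, so the archive never appears in any layer of the final image.
FROM openjdk:11-jre-slim AS spark
COPY --from=downloader /opt/spark /opt/spark
ENV SPARK_HOME=/opt/spark
```

Unlike `RUN rm` in a single-stage build, the intermediate layers of the `downloader` stage never become part of the tagged image, so the size reduction is real rather than hidden behind a later layer.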

I checked that the proposed changes work:
1. I built the image using `docker build --target spark -t wr0ngc0degen/spark_driver .`
2. I pushed the image built with this Dockerfile to https://hub.docker.com/r/wr0ngc0degen/spark_driver (that's my identity on Docker Hub)
3. I replaced the line `image: romeokienzler/spark_driver:3.1.2` with `image: wr0ngc0degen/spark_driver`
4. I ran the "Apache Spark on Kubernetes Lab" with the updates and everything worked as expected
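The build.sh change mentioned above could look something like this (the script's actual contents aren't shown in the PR, so this is a hedged sketch; the image name matches the author's test build):

```shell
#!/bin/sh
# Illustrative build script — only the --target flag is the change this PR requires.
# --target spark stops the build at the named stage, so only that stage's
# layers end up in the tagged image.
docker build --target spark -t wr0ngc0degen/spark_driver .
docker push wr0ngc0degen/spark_driver
```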