
Container Scheduling - A meta issue #3922

Closed
jmchilton opened this issue Apr 11, 2017 · 12 comments

Comments

@jmchilton
Member

This meta-issue tracks the major tasks needed to allow Galaxy to efficiently and robustly schedule "containers". Currently Galaxy for the most part assumes it can run a copy of itself on the remote job server - this is fine for running Docker below a SLURM job - but it isn't a possibility if Galaxy is talking directly to a container-scheduling service. Galaxy has experimental support for Kubernetes and Condor container scheduling, so such scheduling is in theory possible if the required disks are all exposed correctly - but there are several things that should be done to make it more robust.

  • Provide tool- and repository-centric UI elements to determine if containers are available for tools - and if so, information about the containers. This should largely be modeled after @mvdbeek's work on Conda UI elements, I believe.
  • Holistic approach to container caching. Holistic Approach to Container Caching #3673 (UI elements here require the above UI to be in place first)
  • Refactor job scheduling stuff to treat embed_metadata_in_job uniformly. embed_metadata_in_job uniformity #1894
  • More options for post-job metadata evaluation. More Options for Efficient Post-Job Metadata Evaluation #3921
  • Ensure all best-practice tools support BioContainers Mulled Support - Part II planemo#646
    • Set up infrastructure to support publishing BioContainers with multiple requirements.
    • Start running tools-iuc PRs with planemo test --dockerized also.
  • Provide documentation, test cases, and VMs (or docker-compose setups) for testing Kubernetes within Galaxy.
  • Provide documentation, test cases, and VMs (or docker-compose setups) for testing Condor container scheduling within Galaxy.
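
For reference, dispatching to the experimental Kubernetes runner mentioned above goes through Galaxy's job configuration. The following is only a minimal, hypothetical job_conf.xml sketch - the plugin ships with Galaxy, but the kubeconfig path and persistent volume claim are placeholders and parameter names may vary between releases:

<job_conf>
    <plugins>
        <!-- Kubernetes runner that ships with Galaxy; the k8s_* values below
             are illustrative only. -->
        <plugin id="k8s" type="runner" load="galaxy.jobs.runners.kubernetes:KubernetesJobRunner">
            <param id="k8s_config_path">/path/to/kubeconfig</param>
            <!-- Hypothetical claim shared between the Galaxy pod and job pods. -->
            <param id="k8s_persistent_volume_claims">galaxy-pvc:/galaxy_data</param>
        </plugin>
    </plugins>
    <destinations default="k8s_default">
        <destination id="k8s_default" runner="k8s">
            <!-- Enable container resolution (explicit or BioContainers) for tools
                 sent to this destination. -->
            <param id="docker_enabled">true</param>
        </destination>
    </destinations>
</job_conf>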
@jmchilton
Member Author

Ping @afgane - in response to my status update a couple of days ago you asked about my plans involving container scheduling. I think this is a good summary of the most immediate framework shortcomings - at least a couple of months' worth of work in the right direction, I think. There are some more nebulous things - like how to deal effectively with volumes - that I want to consider, but I'd like to have a setup for testing such a system first (I'm thinking something like: do Pulsar staging to create a volume, pass it off to the container service, and read from it afterward - but I don't know what the APIs look like, for instance).

@afgane
Contributor

afgane commented Apr 12, 2017

To back up a bit, all of the topics you listed are internal to Galaxy with the aim of enabling (efficient) use of containers in place of locally installed tools. On a larger scale, I see three parts in this space; I'll document those here so anyone can chime in:

  1. Make Galaxy utilize containerized tools (original topic on this issue)
  2. Provide a container execution environment for Galaxy
  3. Manage Galaxy itself as a containerized (micro)service

For item 2, this can be a VM (as you indicate), a dedicated container service (such as a local cluster with a container manager or a cloud container service), or an on-demand provisioned cluster. The plan is for CloudMan to become this: it would supply the necessary infrastructure and properly configure it so Galaxy can consume it. CloudMan would handle resource provisioning and configuration so Galaxy would just get a handle to a ready-to-use container cluster manager. As you indicate, storage provisioning and data staging are still a bit of an open question here, particularly for the case of federated infrastructure.

For item 3, the idea is that Galaxy can be provisioned as a containerized service using a recipe (e.g., Helm charts or a Docker Compose file) that will deploy all of its required components as dedicated services. @pcm32 has done a fair bit of work here with the Phenomenal project. For the cloud deployment scenarios, this can be used on top of the provisioned infrastructure to also deploy Galaxy.

@nuwang and I are working on item 2 at the moment (with plans to progress to item 3 upon completion) and wonder whether a functional container cluster is all that's required from Galaxy's perspective, or if there are other requirements?

@jmchilton
Member Author

jmchilton commented Apr 12, 2017

To back up a bit, all of the topics you listed are internal to Galaxy with the aim of enabling (efficient) use of containers in place of locally installed tools.

Roughly speaking - yes I would say this is the case. Stepping back is great and I'm glad you outlined those three things. I'd just add the caveat that 1 and 3 can be fairly decoupled. For instance,

  • docker-galaxy-stable can be run as a single container or decomposed into several - but either way I believe it should be able to talk to an external SLURM cluster and run traditional (non-containerized) jobs.
  • One can easily imagine setting up a fairly traditional Galaxy host with a permanent stable IP address on bare metal and then having it submit jobs to some attached container scheduler.
  • If you had access to a container cluster, it is currently likely more robust to just throw up a shared file system across the nodes, run SLURM in every instance, and schedule jobs as regular cluster jobs.

That said - it would be nice to be able to say for each of Amazon EC2 Container Service, Kubernetes, Mesos, and Google Container Engine we provide some sort of complete and robust solution that delivers both 1 and 3 in a container-centric manner (i.e. not setting up SLURM and just pretending it is a cluster).

For item 3, the idea is that Galaxy can be provisioned as a containerized service using a recipe

Great - especially great if this work dovetails with docker-galaxy-stable and we can find more synergy between CloudMan and that approach.

wonder whether a functional container cluster is all that's required from Galaxy's perspective, or if there are other requirements

  • Conda dependencies for the complete toolset one desires to use, and

Either:

  • A big shared file system (I think), with jobs scheduled as containers

or

  • Run Pulsar in a container (for as many containers as you wish to allocate), set it up to target a message queue, set up Conda auto-install to install dependencies as needed inside the container, and let Pulsar do the staging (see the sketch below).
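
A rough, hypothetical job_conf.xml sketch of that second option - the AMQP URL and paths are placeholders, and the Pulsar side would still need its own configuration inside the container:

<job_conf>
    <plugins>
        <!-- Pulsar runner driven by a message queue; mq.example.org is a placeholder broker. -->
        <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <param id="amqp_url">amqp://guest:guest@mq.example.org:5672//</param>
        </plugin>
    </plugins>
    <destinations default="pulsar_container">
        <destination id="pulsar_container" runner="pulsar_mq">
            <!-- Staging directory inside the Pulsar container (hypothetical path). -->
            <param id="jobs_directory">/pulsar/staging</param>
            <!-- Let the remote Pulsar resolve tool dependencies (e.g. via Conda auto-install). -->
            <param id="dependency_resolution">remote</param>
        </destination>
    </destinations>
</job_conf>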

@pcm32
Member

pcm32 commented Apr 14, 2017

Hi all,

Just to comment on points enumerated by @afgane, we have our implementation of point 3 "Manage Galaxy itself as a containerized (micro)service". We have been using this for months now in a number of cloud deployments at PhenoMeNal through the following Helm charts:

https://github.com/pcm32/galaxy-helm-charts
(this is a parallel repo to galaxyproject/galaxy-kubernetes, as I needed some admin rights that I didn't have in the beginning, but I intend to put everything there in the short term, as @afgane seems happy with the overall Helm approach - please correct me if I'm wrong).

Regarding documentation on using Galaxy within Kubernetes, as asked by @jmchilton, I have written documentation in our project wiki:
https://github.com/phnmnl/phenomenal-h2020/wiki/QuickStart-Installation-for-Local-PhenoMeNal-Workflow

And some legacy explanations of what happens behind the Helm curtains:
https://github.com/phnmnl/phenomenal-h2020/wiki/galaxy-with-k8s

In particular, for usage of Kubernetes as an external scheduler (not running Galaxy within it):
https://github.com/phnmnl/phenomenal-h2020/wiki/Galaxy-outside-Kubernetes

We have been using Galaxy within Kubernetes, plus job offloading to containers through the same k8s cluster, for months now in PhenoMeNal - both locally through minikube for development and on cloud deployments (Amazon, Google, OpenStack) - so I wonder what it would take for the Kubernetes runner to graduate out of the "experimental support" mentioned.

Regarding docker-galaxy-stable, I don't think that a general purpose image should be used for container orchestration, as it includes so much that is not needed. We are using an image built from scratch that is around 1/4 of the weight. I have, though, added to the Helm charts the ability to set the container image you wish to use instead of the one we are using, but that image would need to include a couple of directories where we have some Ansible/shell routines for runtime setup of databases, users, and workflows, plus some minor environment setup. I intend to put everything needed in the galaxyproject/galaxy-kubernetes git repo soon.

Hope this is useful!

@jmchilton
Member Author

@pcm32 Thanks for the great outline of the awesome work you are doing - this is tremendously exciting stuff!

I wanted to offer my counter-perspective on one point of disagreement since I had a chat with @afgane about it that was cut short at a recent team meeting. But I offer my opinion with the huge caveat that I don't claim to offer a best practice or any particular depth of knowledge in this realm - I'm just one guy with a fairly uninformed opinion.

Regarding docker-galaxy-stable, I don't think that a general purpose image should be used for container orchestration, as it includes so much that is not needed.

I understand the impulse and I cannot fault you - there is a very common tension between reuse and complexity in play with this choice, at the very least. I tend to side with reuse in these situations - that is my bent - I don't want to miss the opportunity to benefit from the innovations, testing, and flexibility that the amazing community puts into that project, and I wouldn't want to miss the opportunity to contribute Kubernetes goodies back to that community.

That said - docker-galaxy-stable is a build artifact (a build artifact with a fantastic community and great documentation), but the ultimate source of reuse could easily be a level below - in the Ansible projects that build into docker-galaxy-stable. They are a lot to go through, but ultimately you should be able to configure them to build tiny single-purpose containers if that is a primary architectural goal. Any innovations we've made to make these recipes compose well or to optimize Galaxy in non-Kubernetes-specific ways could be shared with other projects.

@bgruening
Member

I agree with @jmchilton here and raised exactly this point a few months ago already.

Regarding the following statement:

Regarding docker-galaxy-stable, I don't think that a general purpose image should be used for container orchestration, as it includes so much that is not needed.

It's important to know that we have put a significant amount of work into addressing different use cases with the same code base - which is what John was referring to as the Ansible projects.

  1. One compact Galaxy Docker image able to install Tool Shed dependencies (the traditional one)
  2. A slim version of 1) that works only with Conda packages and is minimal in size
  3. A composed image, split into multiple parts, using upstream-maintained containers where available. This also comes with its own SLURM or HTCondor cluster if needed.

If time permits, our plan is to use kompose to run the compose setup on k8s. With this we hope to have one set of Ansible roles yielding a stack that works everywhere from HPC and VMs to clouds in all their different forms.
I know there is a lot of optimization to be done - in 3h I was able to shrink the container by 200 MB ... but I hope we are on the right track.

@pcm32
Member

pcm32 commented Apr 25, 2017

I agree on a lower-level setup based on Ansible, to which we should also move the packages being installed (which tend to be the major bulk). Then, through different Ansible calls, you could have different sets of packages installed for different flavours of containers. Currently, for our orchestration-oriented container, our main package installation looks like this:

RUN apt-get -qq update && apt-get install --no-install-recommends -y apt-transport-https software-properties-common wget && \
    apt-get update -qq && \
    apt-get install --no-install-recommends -y mercurial python-psycopg2 sudo python-virtualenv \
    libyaml-dev libffi-dev libssl-dev \
    curl git python-pip python-gnuplot python-psutil && \
    pip install --upgrade pip && \
    apt-get purge -y software-properties-common && \
    apt-get autoremove -y && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

which avoids SLURM, supervisor, PostgreSQL, the Tool Shed, Docker, ProFTPD, etc. - so many things that we don't need in an orchestration scenario because they are optionally provisioned by other containers externally or by the orchestrator itself, which acts as a cluster administrator and process administrator, among other things. That is where I would hope such a minimal container for orchestration would go, which is different from a minimal container for other purposes, in my opinion.

@bgruening
Member

@pcm32
Member

pcm32 commented Apr 25, 2017

Yes, I have seen it, but it starts from galaxy-base, which includes all of the things I mentioned ;-), if memory serves.

@jmchilton
Member Author

I haven't looped back into this thread for some time, but @bgruening and I have been working on some of these deployment issues. On my local minikube setup I have various versions of docker-galaxy-stable's compose setup working in a traditional SLURM cluster mode as well as using the Kubernetes job runner - with both hard-coded tool containers and automatically discovered BioContainers. So ansible-galaxy-extras has been updated with Kubernetes support, for instance, as well as support for dispatching to different runners based on container availability. We are still working hard at getting this stuff to test consistently as part of docker-galaxy-stable's testing framework (bgruening/docker-galaxy-stable#347).
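
For context, the "automatically discovered BioContainers" piece is driven by Galaxy's container resolvers configuration - roughly the sketch below (exact tag and option names may have shifted between releases, so treat this as illustrative):

<containers_resolvers>
    <!-- Prefer containers declared explicitly in the tool XML. -->
    <explicit />
    <!-- Fall back to BioContainers ("mulled") images derived from the tool's
         requirement tags. -->
    <mulled />
</containers_resolvers>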

@afgane I see that you are planning on replacing CloudMan's backend with a Kubernetes based infrastructure - do you have more details on how you are going to deploy that? Any chance we can convince you to build it on docker-galaxy-stable's work - if not are there other ways we can coordinate?

@bgruening and I put together a BoF to talk about this and some other issues -
https://gcc2017.sched.com/event/BCVj/containerized-galaxy-deployments-and-advanced-testing-bof. I'm sure we will all disagree about many things across a broad range of issues - but it would be nice to all sit down and talk through planned approaches.

@afgane
Contributor

afgane commented Jun 19, 2017

Great news John!
The goal with CloudMan has all along been to plug in underneath all the other existing components, so yes, all the docker-galaxy-stable work (or other similar solutions) should sit on top of the infrastructure and runtime environment that CloudMan provides. Here's a planning presentation for how this is envisioned: https://docs.google.com/presentation/d/1h9PVEGdVIHEat_JWTjYZWuU1R23IeGlMW8PZd8yJVuE/edit#slide=id.p.

The way this works in practice is that we use the new CloudLaunch to deploy a bare-bones VM and have CloudLaunch point an Ansible playbook at it. This playbook deploys Rancher, sets up the K8S environment, and starts CloudMan within it. CloudMan then takes over to provide management controls for Galaxy and the infrastructure. Clearly, the playbook can be made to do different things too. The goal is that it should also be possible to use CloudMan outside the cloud context as a way to manage Galaxy on already-provisioned infrastructure - hence containerizing everything is so desirable. Much (i.e., most) of this is still in early development, but things are happening daily.

@mvdbeek
Member

mvdbeek commented Apr 16, 2021

So much has changed in the meantime, but the initial goals here seem like they're ✅

mvdbeek closed this as completed Apr 16, 2021