Container Scheduling - A meta issue #3922
Comments
Ping @afgane - in response to my status update a couple of days ago you asked about my plans for container scheduling. I think this is a good summary of the most immediate framework shortcomings - at least a couple of months' worth of work in the right direction. There are some more nebulous things - like how to deal effectively with volumes - that I want to consider, but I'd like to have a setup for testing such a system first (I'm thinking something like: do Pulsar staging to create a volume, pass it off to the container service, and read from it afterward - but I don't know what the APIs look like, for instance).
To back up a bit, all of the topics you listed are internal to Galaxy with the aim of enabling (efficient) use of containers in place of locally installed tools. On a larger scale, I see three parts in this space; I'll document those here so anyone can chime in:

1. Enable Galaxy to schedule and run tools in containers (the internal work listed above).
2. Provision and configure the infrastructure where those containers run.
3. Manage Galaxy itself as a containerized (micro)service.
For item 2, this can be a VM (as you indicate), a dedicated container service (such as a local cluster with a container manager, or a cloud container service), or an on-demand provisioned cluster. The plan is for CloudMan to become this: it can be used to supply the necessary infrastructure and configure it properly so Galaxy can consume it. CloudMan would handle resource provisioning and configuration so Galaxy would just get a handle to a ready-to-use container cluster manager. As you indicate, storage provisioning and data staging are still a bit of an open question here, particularly for the case of federated infrastructure.

For item 3, the idea is that Galaxy can be provisioned as a containerized service using a recipe (e.g., Helm charts or a Docker Compose file) that deploys all of its required components as dedicated services. @pcm32 has done a fair bit of work here with the PhenoMeNal project. For the cloud deployment scenarios, this can be used on top of the provisioned infrastructure to also deploy Galaxy.

@nuwang and I are working on item 2 at the moment (with plans to progress to item 3 upon completion) and wonder whether a functional container cluster is all that's required from Galaxy's perspective, or if there are other requirements?
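To make item 3 a bit more concrete, the kind of recipe I have in mind looks roughly like the sketch below - image names, environment variables, and ports are illustrative placeholders, not the actual docker-galaxy-stable or Helm artifacts:

```yaml
# Illustrative Compose-style recipe splitting Galaxy into dedicated services.
# All names and values here are placeholders for the sake of the example.
version: "2"
services:
  galaxy-web:
    image: example/galaxy-web:latest        # hypothetical Galaxy web/handler image
    environment:
      GALAXY_CONFIG_DATABASE_CONNECTION: postgresql://galaxy:galaxy@galaxy-db/galaxy
    ports:
      - "8080:80"
    depends_on:
      - galaxy-db
  galaxy-db:
    image: postgres:9.6                     # the database runs as its own service
    environment:
      POSTGRES_USER: galaxy
      POSTGRES_PASSWORD: galaxy
      POSTGRES_DB: galaxy
```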
Roughly speaking - yes I would say this is the case. Stepping back is great and I'm glad you outlined those three things. I'd just add the caveat that 1 and 3 can be fairly decoupled. For instance,
That said - it would be nice to be able to say that for each of Amazon EC2 Container Service, Kubernetes, Mesos, and Google Container Engine we provide some sort of complete and robust solution that delivers both 1 and 3 in a container-centric manner (i.e. not setting up SLURM and just pretending it is a cluster).
Great - especially great if this work dovetails with docker-galaxy-stable and we can find more synergy between CloudMan and that approach.
Either:
or
Hi all,

Just to comment on the points enumerated by @afgane: we have our own implementation of point 3, "Manage Galaxy itself as a containerized (micro)service". We have been using this for months now in a number of cloud deployments at PhenoMeNal through the following Helm charts: https://github.com/pcm32/galaxy-helm-charts

Regarding documentation on using Galaxy within Kubernetes, as asked by @jmchilton, I have written documentation in our project wiki:

And some legacy explanations of what happens behind the Helm curtains:

In particular, for usage of Kubernetes as an external scheduler (not running Galaxy within it):

We have been using Galaxy within Kubernetes, plus job offloading to containers through the same k8s cluster, for months now in PhenoMeNal - both locally through minikube for development and on cloud deployments (Amazon, Google, OpenStack) - so I wonder what it would take for the Kubernetes runner to graduate out of the "experimental support" mentioned.

Regarding docker-galaxy-stable, I don't think a general-purpose image should be used for container orchestration, as it includes so much that is not needed. We are using an image built from scratch that is around a quarter of the size. I have, though, added to the Helm charts the ability to set the container image you wish to use instead of ours, but that image would need to include a couple of directories where we keep some Ansible/shell routines for runtime setup of databases, users, and workflows, plus some minor environment setup. I intend to put everything needed in the galaxyproject/galaxy-kubernetes git repo soon.

Hope this is useful!
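P.S. For anyone unfamiliar with what the job offloading amounts to: conceptually, each Galaxy tool execution becomes a Kubernetes Job run against a volume shared with Galaxy. A rough sketch of such an object follows - names, the image, and paths are made up for illustration and are not what the runner actually emits:

```yaml
# Rough illustration of a Kubernetes Job for a single tool execution.
# All names, the image, and the mount paths are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: galaxy-tool-job-42
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: tool
          image: biocontainers/samtools:v1.3_cv3     # example BioContainers image
          command: ["/bin/sh", "-c", "samtools --version"]
          volumeMounts:
            - name: galaxy-data
              mountPath: /galaxy/database            # shared Galaxy working directory
      volumes:
        - name: galaxy-data
          persistentVolumeClaim:
            claimName: galaxy-pvc                    # claim shared with the Galaxy pods
```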
@pcm32 Thanks for the great outline of the awesome work you are doing, this is tremendously exciting stuff! I wanted to offer my counter-perspective on one point of disagreement since I had a chat with @afgane about it that was cut short at a recent team meeting. But I offer my opinion with the huge caveat that I don't claim to offer a best practice or any particular depth of knowledge in this realm - I'm just one guy with a fairly uninformed opinion.
I understand the impulse and I cannot fault you - there is a very common tension between reuse and complexity in play with this choice, at the very least. I tend to side with reuse in these situations - that is my bent - I don't want to miss the opportunity to benefit from the innovations, testing, and flexibility that the amazing community puts into that project - and I wouldn't want to miss the opportunity to contribute Kubernetes goodies back to that community. That said - docker-galaxy-stable is a build artifact (a build artifact with a fantastic community and great documentation), but the ultimate source of reuse could easily be a level below - in the Ansible projects that build into docker-galaxy-stable. They are a lot to go through - but ultimately you should be able to configure them to build tiny single-purpose containers if that is a primary architectural goal. Any innovations we've made to make these recipes compose well or to optimize Galaxy in non-Kubernetes-specific ways could be shared with other projects.
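As a sketch of what I mean by reusing that layer, something along these lines could build a single-purpose image - the role names and variables below are placeholders, not the actual roles from ansible-galaxy-extras or the other projects behind docker-galaxy-stable:

```yaml
# Hypothetical playbook composing only the roles a single-purpose image needs.
# Role names and variables are placeholders for illustration.
- hosts: localhost
  connection: local
  vars:
    galaxy_manage_database: false     # database is provided by another container
    galaxy_manage_cluster: false      # no SLURM or other local cluster in the image
  roles:
    - role: galaxy_core               # placeholder: install Galaxy itself
    - role: galaxy_runtime_config     # placeholder: minimal runtime configuration
```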
I agree with @jmchilton here and raised exactly this point a few months ago already. Regarding the following statement:
It's important to know that we have put a significant amount of work into addressing different use cases with the same code base - the Ansible projects John was referring to.
If time permits, our plan is to use kompose to run the compose setup on k8s. With this we hope to have one set of Ansible roles from which a stack can be built for HPC, VMs, and clouds in all their different forms.
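In practice that conversion should be as simple as pointing kompose at the existing compose file (`kompose convert -f docker-compose.yml` emits a Kubernetes manifest per service); the excerpt below is purely illustrative, with a placeholder image name:

```yaml
# Excerpt of a compose service that kompose would translate into a Kubernetes
# Deployment plus Service. The image name and port mapping are placeholders.
services:
  galaxy-web:
    image: example/galaxy-web:latest   # placeholder image name
    ports:
      - "8080:80"                      # becomes a Kubernetes Service after conversion
```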
I agree on a lower-level setup based on Ansible, into which we should also move the package installation (which tends to be the main bulk). Then, through different Ansible calls, you could have different sets of packages installed for different flavours of containers. Currently, for our orchestration-oriented container, our main package installation looks like this:
which avoids slurm, supervisor, postgres, the toolshed, docker, proftpd, etc. - many things that we don't need in an orchestration scenario because they are optionally provisioned by other containers or by the orchestrator itself, which acts as the cluster administrator and process administrator, among other things. That is where I would hope such a minimal container for orchestration would go, which is different from a minimal container for other purposes, in my opinion.
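To illustrate the shape of that split (this is not our actual package list, just a hypothetical Ansible task showing the idea of installing only what the orchestrated container itself needs):

```yaml
# Hypothetical Ansible task for an orchestration-oriented image: only packages
# Galaxy itself needs, with no slurm/postgres/proftpd/supervisor. The package
# list is illustrative, not the actual PhenoMeNal one.
- name: Install minimal runtime packages for the Galaxy container
  apt:
    name:
      - python-virtualenv
      - git
      - nginx-light
    state: present
    update_cache: yes
```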
@pcm32 that is more or less what we did here: https://github.com/bgruening/docker-galaxy-stable/blob/dev/compose/galaxy-web/Dockerfile
Yes, I have seen it, but it starts from galaxy-base, which includes all of the things I mentioned ;-), if memory serves.
I haven't looped back into this thread for some time, but @bgruening and I have been working on some of these deployment issues. On my local minikube setup I have various versions of docker-galaxy-stable's compose setup working in a traditional SLURM cluster mode as well as using the Kubernetes job runner - with both hard-coded tool containers and automatically discovered BioContainers. ansible-galaxy-extras has been updated with Kubernetes support, for instance, and with support for dispatching to different runners based on container availability. We are still working hard at getting this stuff to test consistently as part of docker-galaxy-stable's testing framework (bgruening/docker-galaxy-stable#347). @afgane I see that you are planning on replacing CloudMan's backend with Kubernetes-based infrastructure - do you have more details on how you are going to deploy that? Any chance we can convince you to build it on docker-galaxy-stable's work - and if not, are there other ways we can coordinate? @bgruening and I put together a BoF to talk about this and some other issues -
Great news John! The way this works in practice is that we use the new CloudLaunch to deploy a bare-bones VM and have CloudLaunch point an Ansible playbook at it. This playbook deploys Rancher, sets up the K8S environment, and starts CloudMan within it. CloudMan then takes over to provide management controls for Galaxy and the infrastructure. Clearly, the playbook can be made to do different things too. The goal is that it should be possible to use CloudMan outside the cloud context as well, as a way to manage Galaxy on provisioned infrastructure - hence containerizing everything is so desirable. Much (i.e., most) of this is still under early development, but things are happening daily.
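In outline, the bootstrap playbook does something like the sketch below - the module calls, the Rancher invocation, and paths are illustrative placeholders rather than the real CloudLaunch/CloudMan recipe:

```yaml
# Illustrative outline of the bootstrap playbook CloudLaunch points at the VM.
# Task details, the Rancher invocation, and paths are placeholders.
- hosts: new_vm
  become: yes
  tasks:
    - name: Install Docker so Rancher can run
      apt:
        name: docker.io
        state: present
    - name: Start the Rancher server container
      command: docker run -d --restart=unless-stopped -p 8080:8080 rancher/server
    - name: Launch CloudMan inside the new Kubernetes environment
      command: kubectl apply -f /opt/cloudman/cloudman.yaml   # placeholder manifest
```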
So much has changed in the meantime, but the initial goals here seem like they're ✅
This meta-issue tracks the major tasks needed to allow Galaxy to efficiently and robustly schedule "containers". Currently, Galaxy for the most part assumes a copy of Galaxy can run on the remote job server - this is fine for running Docker below a SLURM job, but isn't a possibility if Galaxy is talking directly to a container-scheduling service. Galaxy has experimental support for Kubernetes and Condor container scheduling, so such scheduling is in theory possible if the required disks are all exposed correctly - but there are several things that should be done to make it more robust.
- Support embed_metadata_in_job uniformly (embed_metadata_in_job uniformity #1894).
- Support planemo test --dockerized also.