Converge to CloudLaunch? #6
Definitely - can you read my mind? :) I have a few ideas how we can use this information, e.g. for finding training resources (human and metal) for GTN or generating the ... If this idea takes off we can create some mini-ontology to automatically filter this json file for Cloudman, GTN, ELIXIR, de.NBI and so on. I can imagine ELIXIR is interested to see how many Galaxy instances are deployed in Europe. I hope I have not produced something redundant here, I was not aware of a geojson collection of already known Galaxy servers. |
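A minimal sketch of the filtering idea in the comment above. It assumes each GeoJSON feature carries a `tags` list in its properties; that field name and the sample entries are hypothetical and are not defined anywhere in this thread.

```python
def filter_features(collection, wanted_tag):
    """Return a new FeatureCollection keeping only features tagged wanted_tag."""
    kept = [
        feature
        for feature in collection.get("features", [])
        # "tags" is a hypothetical property used only for this illustration
        if wanted_tag in feature.get("properties", {}).get("tags", [])
    ]
    return {"type": "FeatureCollection", "features": kept}


# Made-up sample data, only to show the shape of the input
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [7.85, 47.99]},
            "properties": {"name": "Instance A", "tags": ["GTN", "de.NBI"]},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [144.96, -37.81]},
            "properties": {"name": "Instance B", "tags": ["Cloudman"]},
        },
    ],
}

gtn_only = filter_features(sample, "GTN")
gtn_names = [f["properties"]["name"] for f in gtn_only["features"]]
```

The result is itself a valid FeatureCollection, so the filtered subset could be dropped onto a map the same way as the full file.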
I was hoping we can go the other way and replace the public galaxy servers page with a page like this one that's more descriptive and interactive. In addition to the ideas you listed, I'd really like to see a cross-Galaxy tool search. I keep wanting to chat to @martenson about this idea since he's been working with the search functionality (@martenson thoughts?). I guess we can ingest data from here into the CloudLaunch at some point so effort wouldn't need to be duplicated. As the features are added to CloudLaunch, the idea is that any user can add their instance by just filling out a form there. I guess we'd then have tags to include public instance, themed instances, instances accessible only to the given user, etc. |
I think GRT is the solution to all of the problems :)
What more could we want? (Serious question) By default galaxy ships with part of a GRT configuration. Cloudlaunch could automatically register an instance for users and automatically apply tags to that instance like "public" or "private", and "cloudlaunch"; it would even be nice to have the infrastructure it runs on tagged. |
@afgane more than a year ago I added this to the API: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/webapps/galaxy/api/tools.py#L67 which allows cheaply querying a remote Galaxy for a tool's presence. It might be around that time when we can assume a decent number of public instances is updated to have this functionality and we can build some aggregated search UI. My idea was that you search on top of the Tool Shed list of tools with fulltext and then with the tool_id(s) you query all known public Galaxies to see where you can run the tool (or set of tools). GRT has a slightly different approach as it plans to fetch all tools and then perform the search locally (right, @erasche ?). |
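The aggregated search described above could be sketched roughly as follows. The per-instance check is passed in as a callable (`has_tool`, a hypothetical name) because the exact request and response of the linked API endpoint are not spelled out in this thread; in real use it would issue the HTTP query against each Galaxy. The instance URLs and tool ids are placeholders.

```python
def galaxies_with_all_tools(tool_ids, galaxy_urls, has_tool):
    """Return the Galaxy URLs where has_tool(url, tool_id) holds for every tool id."""
    return [
        url
        for url in galaxy_urls
        if all(has_tool(url, tool_id) for tool_id in tool_ids)
    ]


# Stand-in for the remote lookup: a made-up table of installed tools per instance
installed = {
    "https://a.example.org": {"bowtie2", "samtools"},
    "https://b.example.org": {"bowtie2"},
}


def fake_has_tool(url, tool_id):
    # A real implementation would ask the remote Galaxy instead of a local dict
    return tool_id in installed.get(url, set())


matches = galaxies_with_all_tools(
    ["bowtie2", "samtools"], list(installed), fake_has_tool
)
```

Keeping the lookup injectable means the same function works whether the answer comes from a live query per instance or from a locally fetched tool list, which is the difference between the two approaches mentioned above.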
Sounds very good, but what about Galaxies behind firewalls?
|
You have my vote on this, but I don't want to be so disruptive again ;)
Many thoughts :) but I think this is a different discussion. The entire federation idea is awesome and something we should be heading towards; tool search would be the logical first step I think. I am currently trying to push the ELIXIR registry to implement a Galaxy tool search so if you like I would try to organise a meeting with @joncison and see where we can join forces.
Sounds great! |
I'm sceptical about this, haven't we learnt that any kind of automatic "call home" feature, no matter how transparent it is, gets a bad reputation? For tool-search I don't think a pushing mechanism will work, we create a single point of failure for such an important feature. I think crawling is the way to go here, Google is quite successful with this I have heard ;) - but again I had the same discussion with @joncison from ELIXIR-registry fame. Don't get me wrong, I like the original GRT idea and I think for this it is ideally suited. Some big, very public instances can collect data (and communicate this to their users) and we can improve Galaxy with it. I don't see everyone using this in the future, for this we have too many private and highly secure instances (without any internet connection, behind firewalls, etc.). Registering (manually) pure geo data and saying "Here I am, you cannot see my Galaxy instance but I'm a Galaxy expert, get in touch!" is the lowest barrier that we can offer. Just my 2 cents, I'm completely fine with abandoning this idea in favour of GRT or something else. It was a test and I appreciate the discussion - that was the aim before going public :) |
We could use the URL from the geojson file to get all public instances. We need to communicate this properly though, |
Now that we also collect metadata, I'll add a flag for not submitting job/tool data. They shouldn't be forced to submit any data they don't want to, I don't want this either. It could be that it is because I work on it and am biased in favour of GRT, but it feels much less like an automated call home, and much more like an opt-in registry. Perhaps the language needs to be updated to reflect this better.
Is this really such a concern? That other people learn that I have version X of tool Y? Are tool IDs and names really sensitive data? (Are there any examples of tools that someone really would do anything to not have exposed?)
You create an SPOF either way. Either with a pulling crawler and its site to display results, or with the point that data is pushed to. I do not see the difference until we move to a scale with multiple servers and failover/handoff logic. At that scale, neither of them suffers from the SPOF problem, but we are a long way from that.
A push model additionally gets us around the firewall problem. Internal-to-university galaxies can still advertise their tools (to their own users) despite the rest of the world not being able to access them.
I hear what you are saying, that there are high security instances with different priorities. In this case, GRT is no better than the manual galaxy-maps/registry + tool search engine approach here. So, we ignore these cases because they are not relevant to the discussion. For every other galaxy, that is public, that is willing to be indexed, that is willing to have a pin on a map, GRT is a fairly convenient web-form place to do this. (I'll add a pin selector today rather than just lat/lon fields, given how much people struggle with geojson's choices there.)
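On the lat/lon point: GeoJSON positions are written longitude first, then latitude, which is the opposite of how most people enter coordinates in a form. A small sketch of converting form-style input into a GeoJSON feature follows; the `name` and `url` property keys and the sample values are assumptions for illustration, not the schema of this repository or of GRT.

```python
def galaxy_feature(name, lat, lon, url=None):
    """Build a GeoJSON Point feature from form-style latitude/longitude input."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude must be in [-90, 90] and longitude in [-180, 180]")
    properties = {"name": name}  # hypothetical property keys
    if url is not None:
        properties["url"] = url
    return {
        "type": "Feature",
        # GeoJSON order is [longitude, latitude], not [latitude, longitude]
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


pin = galaxy_feature("Example Galaxy", lat=47.99, lon=7.85, url="https://galaxy.example.org")
```

The range check catches the most obvious swaps (a latitude above 90), though it cannot detect swapped values that happen to fall within both ranges.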
Registering just the location of a trainer (and not their associated Galaxies) seems to be different than what was originally proposed. Is this in scope? Speaking of scope, should it be expanded to:
My apologies on this front Björn, for attacking your idea outright, but yes, discussion is what we were going for, hope this did not come off too aggressively/in a mean spirit, not my intention. |
Not sure about the difference ;)
Yes. I know of people who strictly don't want to share anything, not even what they are working on, because this can give others a clue about which new technologies they are working on. I'm not allowed to file github issues for tools I will work on and such stuff :(
Searching on demand (crawling) does not need such a failover it just searches, no storing needed.
Now you are talking about local GRTs, aren't you? Then it's the same for local search crawlers.
Let's not mix up the search discussion with the "Galaxy registry discussion" here.
For me they are. As stated in the readme, I really want to register and map instances that are not able to submit data to the public (no matter in which way :))
I don't think Galaxy should overtake the registry part, for this we have the ELIXIR registry, that does way more than the registration of Galaxy instances or tools and we should support them.
My initial aim was to collect instances and finally creating one giant map of Galaxies :) |
The technical difference is non-existent. But clearly you had a negative perception of "automated call-home."
I'm sorry to hear that, that is unfortunate. Science should not (have to?) be so secretive. I think this is solved in both cases, the crawling scenario would only see the public api/tools list, the GRT scenario would use a blacklist.
Oh, wow, you mean dynamically searching? Whenever someone puts in a query for "bowtie", some service talks to every galaxy in the universe at once to find out who has bowtie? I'm sure the ELIXIR people have plans for this that are hopefully more sophisticated than this.
No, I wasn't. If you have internet access + a university firewall + no external access to inside, then GRT would continue to function since you can push metadata out to the central GRT with your galaxy's information. And if you don't wish to send job logs, GRT supports just registering the name / location.
Sorry for this, it comes up with GRT since that encompasses both of these functionalities in one project. Registering human resources not attached to a Galaxy would be a strong point in favour of a separate project to track human resources, but I am worried that it is the minority case. Most of us are admins + trainers and our training is integrally related to our administration and the presence of a public Galaxy instance on which we train people.
I will be curious to see how popular such a thing would be. My hypothesis is: If their galaxy is not internet connected, how often are they trying to direct people to them as a training resource? I believe that will be a minority of cases. People trying to do training would more likely have some form of public galaxy that their trainees can use, or would not have a galaxy they admin but instead just want to register themselves as a human resource. GRT supports this: registering, mapping (pub + priv), guiding users (website), central bragging point for their galaxy to funding agencies (badges with "we run #1 most jobs out of all galaxies" (a really important point, imo, GRT has user/job statistics, so it can share your ranking, and you can share that with funding agencies)), and (possibly) connecting admins. Especially if we added the admin's info to the GRT page.
ELIXIR was just tool registry, or ...? What new features are they adding? I do think galaxy should handle galaxy registry. It seems very strange to be sending all of our galactic instance metadata to a completely separate organisation, separate funding, to track Galaxies across the universe. |
We are mixing things up here. Please let's move the search discussion to a different thread. This is about registering Galaxy instances, private and public ones, reusable for whatever comes to our mind because it is structured in a standard format (geojson).
Ah got it!
Back to the SPOF :)
Another argument is that a trainer cannot necessarily configure Galaxy to send this data, or that you need to convince your admin to send data ...
I know a few instances that are restricted and not public, yet they advertise it and offer services to others. They require VPN access etc. But training aside, it's also to state that there is someone with Galaxy experience, maybe an admin whom I can talk to.
Eric I'm not arguing against GRT here :) It is a great project and very useful for some use cases.
Workflows, Galaxy instances, they have a large ToDo list afaik.
They register services and tools, Galaxy is one service among many and this fits nicely. Not discussing the search idea here - what counts for me and some kind of map project is to get as many people on board as possible, as easily as possible. I don't see this happening with GRT quickly. Especially not if I assume that not everyone will activate this for the given reasons. Do you intend to activate it by default? |
Ok, sure, that's fine. Not talking about federation either, that's completely separate.
We have these everywhere. We are not talking about building and deploying infinitely scalable services to AWS + GCP with multi-region failover, why bring this up? Is this such a big concern?
But a trainer can register their galaxy. This does not require admin access to a galaxy, or server access at all. You can register your galaxy with whatever subset of data you want (name, description, location) and not send job logs. This is fine by GRT. The data would have to be updated manually, but that is no different from galaxy-maps, just through a web interface instead of hand edited files and PRs.
Very interesting! People do such strange things! I would argue that this case is covered in GRT, through the "register through the website and do nothing else" case.
I know this Björn, I just strongly believe that GRT completely covers this precise use case here, and possibly Cloudlaunch's as well.
Sure. That's fine, we don't ask for all of their metadata. You opt-in to providing as much as you want during 1) registration, and 2) regular crontab sending, if and only if you are wishing to submit job run logs.
This is the current state. Again, no activation necessary, it's a website you can sign up at and register your galaxy.
I'm already using it as test-data ;) |
Update: https://oc.hx42.org/grt/galaxy/ Internally I've exposed this as an API endpoint in GRT which can show the geojson data for all or one galaxy (depending on which map you wish to place, no need for you to fetch data about the entire world). You could embed these maps or use the GeoJSON any other way that you want, much like you were suggesting. |
Since when is "We have/use this everywhere" an excuse to introduce even more of this ;)
I see where you are heading, if you add more and more things to GRT that's fine and awesome. It's not what I had in mind for GRT - I saw it as a collection of job metadata. But if you try to get more and more of these features in and support the community this is great!!! I'm still concerned about the overlap with the ELIXIR project and would like both projects to talk to each other and not replicate work. There is already https://github.com/C3BI-pasteur-fr/ReGaTE very similar to what GRT is now becoming (or always was ... :)) |
Fair point.
Ah, we're treating github as not a SPOF. This is reasonable.
This was not my original plan either, but it grew very naturally:
So, keep status quo? Sounds good to me. I have a command to import the geojson from this repo, so will continue to do that going forward until we see if GRT reaches momentum/dies.
yeah, that is something else to work out. Thanks for the link. |
@afgane I seem to have scared you off, any follow up comments on what cloudlaunch might wish to do? :) |
Yes we do, as we all have backups, whatever comes next will have a github importer (see google-code). But I guess this comment was just peeveing ;) |
well, I like the GRT, it is a great idea, but it should be something where people (ie "Galaxy Servers") can sign up themselves. So I don't like the idea of your script to import the geojson from this repository and create a "Listing of public Galaxy instances". There are at least two servers on your list which are not public. |
@hrhotz you're right, that is misinformation from that bit of text, since it is a listing of public + private* instances. * private here meaning not open to public registration, but the fact that they exist is public, otherwise they would not have been mentioned in this geojson file. Would that be preferable? Or would you rather that I do not import those into GRT at all? |
Just briefly folks (it's ELIXIR deliverable time :-/) I'm including @hmenager who is leading the parts of ELIXIR / bio.tools work concerning Galaxy integration (broadly), including https://github.com/C3BI-pasteur-fr/ReGaTE. We'd be happy to talk more of course, in due course, on how we can play nice with GRT, CloudLaunch etc. Cheers! |
That's ok. Can you please make the corresponding changes on your GRT page -
Well, the geojson file is on github you can do with this file whatever you |
The general idea behind CloudLaunch is to facilitate access to Galaxy instances (really, any application service), whether they are on a cloud, a laptop, running in a container somewhere, as a public instance or pretty much anything in between. As far as more specific features go, the version being developed continues to allow launching instances on the clouds but it also allows (or will allow) linking to existing instances (on or off the cloud), searching for tools across those instances, sharing of instances, viewing and controlling cloud resources. The idea is that public instances get listed in a similar fashion to how @tnabtaf maintains https://wiki.galaxyproject.org/PublicGalaxyServers but that individuals can also register their own instances freely by logging in and filling out a form. Instance listings won't all need to be public either but can be private or shared so when someone logs in, they see a list of instances they have access to, i.e. those that are public, have been shared with them or were added by them (e.g., launched cloud instances). BTW, the name CloudLaunch comes from the original version of the app where the goal was to exclusively launch Galaxy on the Cloud instances on AWS and, later, OpenStack clouds. With that, the name CloudLaunch may imply too much of a cloud-centric view, and we can certainly change the name if it would add to the clarity of the app's purpose. At the same time, with the cloud becoming more omnipresent and Galaxy instances trending toward support for bursting and federation, everything will be coming from some version of the cloud before long. 
Linking all of this back to my understanding of the GRT, it feels like CloudLaunch and GRT could really be merged into one project: (1) the app would allow listing of instances (public, private or shared); (2) querying across all of those (either for specific tools or by asking questions: "If I'm mapping a 32 Gb FastQ dataset against a 1Mbp genome, what are the likely minimum/optimal compute requirements"); and (3) launching new instances if a suitable one does not already exist. Comments about that thought? |
If this is so, then there might be a branding issue. Edit: ok, you mention this later.
I did not know cloudlaunch was picking up the complete feature set of GRT. That is all in the roadmap? This would mean we have four cross-galaxy tool search efforts? @martenson's, GRT's, ELIXIR's, and cloudlaunch's?
Yes, this was one of the (recently added) goals of GRT as well, minus the "all instances that they have access to" portion.
Yes, definitely. Cloud is one of those meaningless business-y terms anyway that has not so much information content.
I'm killing the django frontend to GRT as it is. I was separating it out into a react-js project, but if you want to re-implement the frontend in CloudLaunch and consume our API, I won't say no to that. But I am amazed that this is all in scope for cloudlaunch. This seems like a lot of functionality that has not been discussed before as a priority/goal for the project? Looking at the roadmap, not a single one of these features that GRT provides, that you discuss being in the interest of cloudlaunch, is on there. galaxyproject/galaxy#1928 |
Seems to be trending that way... I guess things are ripe for something like this. More specifically, I have not looked at the details yet but my understanding of @martenson's effort was that it is a set of API endpoints that would enable external Galaxy tool searches - it just needs a UI. ELIXIR just came up on my radar and prompted this issue. Since GCC and before this discussion, my understanding of the GRT was that it's primarily a job/data collection engine. As far as the roadmap goes, this got bunched under the "All new CloudLaunch" bullet item (you may remember that the cloud branch of the project got no discussion at the team meeting); I just realized it's not linked there but some of the specifics were outlined in this issue: galaxyproject/cloudlaunch#49. Given that issue prompted no discussion, the idea evolved a bit since then without everything being documented (but those ideas were presented at GCC). Whatever the service is called, it really seems natural to aggregate instances launched in the cloud with the ones that exist permanently so that users can create their own lists of instances they use and access them from one place. In the long run, I feel that we and the community would be better served if the efforts unify and converge. In the short term though, this would slow things down to figure out the proper architecture and for everyone to get familiarized with it all. As far as the timeline goes, GRT seems to be chugging along; CloudLaunch has not seen any visible development since GCC and will largely be on the back burner for the upcoming month. What would you like to see? |
It was until this repo opened up and I realised "hey, we've already got infrastructure in every galaxy ≥16.07 for doing this."
Thanks for linking this, had forgotten about that issue. CloudLaunch needed more attention during that meeting, you're right. I've always paid less attention to Cloudlaunch because I don't directly use clouds for any of my work.
agreed. This sounds like a nice feature.
somewhat agreed.
I'm not sure, so I'm going to outline my thoughts as they come to me.
|
I kind of feel there is really more overlap than there are differences: everything revolves around a directory of Galaxy instances, static or temporary ones. CloudLaunch adds the ability to provision additional ones while GRT aggregates job data. Badges, flavors, search etc. are more of a UI feature, which is enabled by the directory concept and supported by data from the GRT. |
While I agree that the data is really related and belongs in a single store...I quite strongly do not think the audiences for those data are the same.
(Speculation: those who can create, will, rather than be stuck in a queue with normal users looking to discover/access open galaxies that might have less hardware than the create-users can afford?) This is why I would hesitate on voting merge, because GRT has a very distinct audience. It would be strange to say "hey, tool devs and cluster admins of your private local galaxy, go to CloudLaunch, the galaxy service to launch cloud images, that's where our job data is." That feels like two completely unrelated things shoved into one project, at least given the history of CloudLaunch. I somewhat feel that given their separate audiences, that at least the frontends could stay separate with no great loss, and we'd all have the same backend and be able to take advantage of that data. From the GRT side, until I picked up some of the GTN goals by adding maps, I had zero interest in end users. I only cared about admins and tool devs. |
Hey folks, a couple of notes explaining the scope of the ELIXIR registry (dev.bio.tools), in case it informs the discussion. We're focused on "discovery" of tools and services (that means find, understand, compare, select) by providing basic (but supporting quite comprehensive) description of tools. Also "interoperability" as a secondary concern (mostly boiling down to annotation of supported data formats, and providing some information about service endpoints and command-line spec - but not straying deep into the latter - we want to transform / support CWL). We won't be collating data on tool usage, job performance, popularity, or be a repo for code or job data, but we'd like in due course to expose the results of scientific benchmarking of tools and technical service monitoring (up time etc.) - this is a separate concern within ELIXIR, which bio.tools will expose. bio.tools won't provide a facility for running tools either (although a service broker has been mooted within ELIXIR). The major task for bio.tools in the 1st instance is producing - and maintaining (through a distributed curation effort) - a high quality and comprehensive set of tool descriptions. By tool I mean all types of application software broadly. Once we have that, we can then link unique tools to the various online services where they can be used, that includes of course Galaxy instances. So information about such servers and the tools they contain is very interesting, hence regate and similar such efforts. Yes, it would be awesome to search for "bowtie" and get basic information about it, including all the places you can run it. Lot of work to get there, though! I think it's OK for different portals serving different needs / audiences, we just want (obviously) to avoid redundant efforts where we can. At the very least, share data and try our best to coordinate. Best of luck with all the efforts here! |
Took me a bit of time to catch up with the discussion :-) These all seem like very good ideas, with the main issue being the apparent overlap. To what extent can these be broken apart into different, standalone web-services (micro-services if you will), which can then be aggregated as desired? For example, I think that @erasche makes a good point about not putting too much of this stuff into CloudLaunch, it may expand scope significantly to the point where it becomes hard to manage. Looking at the stated project scope of each, some of this functionality doesn't seem to fit in all that well with either CloudLaunch or GRT. Perhaps the thing to do is to expand on @bgruening's idea and simply create a separate service altogether for this, dedicated to aggregating and filtering available Galaxy servers? For CloudLaunch specifically, it may make more sense to simply query a remote URL, and fetch a list of available Galaxy Servers. Whether this list comes from a geojson file hosted on Github, or whether it comes from GRT, or from some other service altogether, won't matter so much as long as the data is in a documented JSON response. The only complication is that CloudLaunch should also allow people to register a newly launched instance as a publicly available server if they so desire. That makes the idea of a programmable web service with CRUD operations more attractive, as opposed to a simpler geojson file which needs to be manually updated (although I'm certainly not immune to the charms of a geojson file - it's simple and reliable). Users can be redirected to this third project if they wish to find servers with specific characteristics, tools etc. Alternatively, cloudlaunch could consume the service directly where it makes sense. As it stands now, CloudLaunch already contains this web service, with a django-rest-framework browsable API, so the code can directly be spun off as a separate project should this route seem more attractive. 
However, it's not clear to me to what extent a separate service would benefit GRT, or whether it makes sense to fold everything into GRT etc. etc. I do think that it's too risky to fold GRT and CloudLaunch into one project - the scope seems too expansive. |
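To illustrate the "it won't matter where the list comes from, as long as it is a documented JSON response" point from the comment above, here is a rough sketch of a consumer that only depends on the document being a GeoJSON FeatureCollection. The `name` and `url` property keys are assumptions, and fetching the text (from Github, GRT, or another service) is deliberately left outside the function.

```python
import json


def parse_server_list(geojson_text):
    """Turn a GeoJSON FeatureCollection (as text) into flat server records."""
    data = json.loads(geojson_text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    servers = []
    for feature in data.get("features", []):
        props = feature.get("properties") or {}
        # GeoJSON stores positions as [longitude, latitude]
        lon, lat = feature["geometry"]["coordinates"][:2]
        servers.append(
            {
                "name": props.get("name"),  # hypothetical property keys
                "url": props.get("url"),
                "lat": lat,
                "lon": lon,
            }
        )
    return servers


# Made-up response body standing in for whatever service provides the list
response_text = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [7.85, 47.99]},
                "properties": {"name": "Example Galaxy", "url": "https://galaxy.example.org"},
            }
        ],
    }
)

servers = parse_server_list(response_text)
```

A consumer written this way would not need to change if the source moved from a static file to a web service with CRUD operations, provided the response shape stays the same.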
I appear to be on the thin end here rooting for the merge. Some of my thinking comes from the fact I see CloudLaunch being less and less just a service for launching Galaxy cloud instances and more of a hub for accessing and discovering services/tools (despite its name, but I've discussed that already above; again, I'm perfectly fine with changing the name). From the cloud perspective, an increasingly large number of new hardware installations are starting to be managed by cloud middleware (e.g., campus clusters). Then, there are academic clouds (e.g., Jetstream, NeCTAR, EGI Federated Cloud), all of which are typically 'free' to use. So I think a growing number of people will want to deploy Galaxy for the Cloud and either launch longer running, shared instances or point their users to a launcher for self-provisioning instances. This would cause the number of running instances to expand and users will probably want to be able to discover those, group them and share them. With that, I feel the launch process will become this seemingly minor thing that happens as a sideline and automatically behind the scenes while the discovery and service groups are the focus for users. For example, a user sees a public instance that has the right toolset but the instance has quotas or other access restrictions. The user can then use the flavor launcher to automatically create a clone of that instance on their own infrastructure (and make it available to the rest of their group or use it for training). Although still a bit far out, to me, these suggest that launch and discovery belong together from the user's perspective. The performance piece comes in to help or, ideally, automatically decide what infrastructure to use for the deployment (both for dedicated cloud instances and for bursting workers on long-running instances). 
(Not knowing what the user may use a dedicated instance for, this may be putting the cart before the horse but I think we could work with that.) Technically, I feel that if more of us put a joint effort towards a single app, its development will likely move along faster than if we develop three separate apps. Particularly, three apps that are largely based on the same framework. With that, I'd be interested in seeing a single backend with multiple UIs (ideally, those would use the same technology so components can be interchanged but that may be a stretch). |
Sorry for the delayed reply @afgane
It feels like a lot of work for marginal returns. And if the backend of GRT is part of cloudlaunch, you have this 90% orthogonal service as part of the codebase. The vast majority of my PRs to update GRT wouldn't apply to any cloudlaunch devs, and vice versa. That feels weird to me, but that isn't really a quantifiable/useful statement, so let us ignore it.
I can definitely sympathise with this desire, I really do like your vision of this!
Sure, from a user perspective this is ideal. And I understand that having the full list of Galaxies within the cloudlaunch codebase would be easier for you to track which a user has access to, more so than the tenuous connections that happen over REST. I can definitely see that.
(Talked to Björn, his comments suggested that just two would make sense, trainers do not want a separate app.)
Ok, I had a look at doing this, and will PR my models because this discussion has gone on long enough, this will get GRT's backend deployed much sooner than waiting on someone to deploy it for me on servers I don't have cli access to, and everything will be happy-enough, I've seen the light ;) |
I just came across this repo and was wondering if it would be desirable to channel the effort going into this to the new version of CloudLaunch that has basically the same feature: https://beta.launch.usegalaxy.org/public_appliances