Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple bash action #179

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Simple bash action #179

wants to merge 10 commits into from

Conversation

ewels
Copy link
Member

@ewels ewels commented Dec 5, 2024

Attempt at a vastly simpler composite action using a few lines of bash.

Comes with some limitations:

  • Dropping support for all
    • It's now called dist and I don't think I've ever seen anyone use this 🤔
  • Now uses a regular Nextflow installation, instead of using GitHub assets
    • I don't remember why we used GitHub assets here before..
  • No longer supports partial version numbers
    • Again not sure that anyone used this, I didn't realise that the action could do it until today tbh

On the plus side:

  • No typescript, so no node, no build step, just a super simple bash script
  • Much much easier to test and maintain

@ewels ewels requested review from MillironX and mashehu December 5, 2024 13:15
@ewels
Copy link
Member Author

ewels commented Dec 5, 2024

@MillironX @mashehu curious to see what you both think of this.

Bit of a radical approach but I wonder if a massive simplification like this will make maintenance a lot easier. Especially thinking about the bus factor here as it's such a central piece of our infrastructure now.

@MillironX
Copy link
Member

This is going to be a long post here. I don't really want to try and argue either way, just to lay out some of my justifications of thinking that went into the creation of this action and shaped it the way it is (esp. since I'm realizing just how poorly I've documented the thing). I've got finals next week, so as long as setup-nextflow is working, I don't want to touch it until exams are over, but I still want to throw these thoughts out there for folks to chew on. I'm also going to be a terrible scientist here and not cite my sources b/c this is more of a record of my thoughts, even if they were wrong.

Why Typescript

One of the main reasons was that I was using partial version matching a lot at the time (I don't hardly anymore, tho). It still boggles my mind that Nextflow doesn't support YY.MM version aliases, but that can be a discussion for another day.

Another reason was how GitHub Actions (GHA) is architected. From my understanding, GHA doesn't guarantee that it will bring a full FHS file system with it between action steps. When using docker actions or using a run: step, GHA will perform the action within a virtual machine, then store any "relevant" docker mounts and mount them in the same place in the next step. How does GHA determine which mount points are "relevant?" Nobody knows. The only documentation I could find was that $HOME was guaranteed to be persisted between steps, but everything else was iffy. Even the location of repository checkout is not guaranteed to be consistent. Changes made to $PATH would be persisted, but note that with other tools also modifying $PATH, the lookup order could get really weird. One of the ways to combat those changes was to leverage GHA's API and tool-cache. This was how most of the other setup-* actions function (I used setup-julia as my main inspiration). This is easiest to do from Javascript/Typescript. As I read it, Typescript actions run in an ephemeral, minimal virtual machine with only access to GitHub-approved APIs. This should (hypothetically) run faster, and because every change made is explicitly declared, any changes made in the environment are guaranteed to be persisted to all future steps.

Caching should be much easier in Typescript. Once again, this is because the Typescript APIs allow for explicit declaration of what is cached, and what identifier is assigned to it. tool-cache should have managed this for us, and I thought it did, but I later learned that tool-cache only applies to some runners, and even then, caching the nextflow script file would provide minimal speedup compared to caching the Java files that Nextflow downloads at first run (#78).

Addressing changes

  • Dropping support for all
    I've worked with one HPC that used this, and it was so outdated... I was going to propose dropping this in v3 anyway.
  • Now uses a regular Nextflow installation, instead of using GitHub assets
    Because a "regular Nextflow installation" is just pulling GitHub assets anyway. Managing tarballs/script files is (hypothetically) more reproducible within a GHA environment than running a bash script.
  • No longer supports partial version numbers
    Already addressed

Why not GitHub Actions

I made a modest proposal that we should get away from using GHA for doing CI, instead hooking into something like Earthly or Dagger (thanks @edmundmiller) for the unit testing or really anything that requires Nextflow. I'm going to make the case here that GHA is not reproducible, and is therefore antithetical to the premise of Nextflow.

GHA has no concept of dependencies. This dead horse has already been beat better than I can by fasterthanlime in his video Why GitHub Actions Feels Bad. This is what bit us on the Java breakage. The fact that we cannot specify that Nextflow has a dependency on Java is scary. The fact that it worked for so long is scarier. The fact that ubuntu-latest will change soon and could have broken or fixed things means we are building on a shaky foundation.

GHA changes its behavior drastically based on the environment it runs in. Last night, I ran local tests on setup-nextflow with act. Without the Java install step, the tests failed, once I added it back in, the tests passed. Then I pushed a branch and created a pull request. The tests all passed. So I merged, tagged and released...and running the action in a repo that wasn't this one failed. Why is that allowed to happen? This has been a repeated theme with this action, I feel like last night's escapade is a prime example.

GHA is not testable locally. Yes, there's act, but the behavior of act is what GHA should be (per GHA's documentation), and not what GHA is. I noticed that you removed the .actrc file, and in previous drafts were questioning the example.yml workflows. This is on me for not documenting this well enough, so I'll explain myself here. .actrc makes sure that we can actually test the action locally without downloading a 50+GB docker image (and provides consistency with Apple Silicon systems), and example.yml is the way to test the action in a workflow (the unit tests only test Typescript functions). Even with those files, clearly there is a huge disconnect between running locally with act and running on a real GHA runner.

Bottom line

Clearly ripping out the CI system is a big decision, and I'm not asking it to be made lightly. The point of this section is to highlight the issues I've had with GHA as a whole and to jumpstart a discussion. So far it seems like @edmundmiller is the only one with any interest in a different setup, tho.

My maintainership issues

I made setup-nextflow as somewhat of an experiment, and I'm glad it's been a help to so many people. At the same time, setup-nextflow (and GHA in general) has proven to be very brittle and to break at inopportune times. I'd love to help make nf-core's CI more future-proof, but I do have to recognize that I'm no longer on a grant that lets me work on nf-core stuff, and my free time is limited. If making this into Bash makes it more accessible to nf-core, then that could be worth losing the speed/caching/purity/etc.


Like I said, I've got other priorities for the next couple weeks, so I just wanted to throw my general thoughts out there for now.

@edmundmiller
Copy link
Collaborator

  1. Great work @ewels! I think this is much more in line with something maintainable
  2. Awesome write-up @MillironX. It's kind of the same issue I've been having with modules. Everyone relies on the CI to work globally, but it's held together with Ducktape and Yaml.

I think this action lost the plot. I think if we have in the README

    steps:
      - uses: actions/checkout@v3
+     - uses: actions/setup-java@v4
+     - uses: nf-core/setup-nextflow@v3
-     - uses: nf-core/setup-nextflow@v1
      - run: nextflow run ${GITHUB_WORKSPACE}

And users will be happy.

A lot of this came out of #5 and #19, which was working around Nextflow release oddities.

@ewels
Copy link
Member Author

ewels commented Dec 5, 2024

@edmundmiller yes, but:

  • I'm not sure that we need users to add setup-java. It's in the composite action in this PR, so no changes needed. I'm not sure what benefit would come from requiring users to do this?
  • This PR still uses https://nf-co.re/nextflow_version so we're still working around the GitHub API limits and Nextflow release oddities in the same way as before really

@MillironX thanks for the great writeup, much appreciated. I'm keenly aware of how much work you've invested in this already and want to make sure that it's clear that it's recognised how key this action has become in much of the nf-core (and wider Nextflow) ecosystem. The fact that it's just worked for a long time now is significant, given how heavily it's used.

Couple of quick thoughts:

The only documentation I could find was that $HOME was guaranteed to be persisted between steps, but everything else was iffy.

What do we consider to be important here, /usr/local/bin/nextflow and ~/.nextflow right? I guess if we're worried we could move the nextflow executable to ~/bin instead? (and add to the $PATH if needed?)

In #78 there's a fair bit of discussion about knowing when to invalidate the cache, but I don't think that we need to care about that. As long as NXF_VER is set to the right thing then the Nextflow executable should handle all of this itself when it runs and fetch / use only the correct assets. Caching those assets is only really useful to reduce bandwidth and speed up run times in my opinion. ie. we don't need to care all that much if they're correct, I don't think that it would affect validity of test results.

Because a "regular Nextflow installation" is just pulling GitHub assets anyway. Managing tarballs/script files is (hypothetically) more reproducible within a GHA environment than running a bash script.

As above, Nextflow should be handling assets for regular (non -all/-dist) installs anyway, so I don't think that fetching from the GitHub release really has any effect. It's just the bash executable and it's self-updating based on NXF_VER anyway.

Why not GitHub Actions

Sure maybe one day, but not any day soon. GHA may not be the best solution but it is by far the easiest. Everything in nf-core is massively baked into the GitHub ecosystem and we intentionally gravitate towards using it whenever we can.

In the early days we used Travis for CI (before GHA was a thing) and it was crappy. A lot of work put on core team members who had access to go in and do stuff and a lot of extra maintenance work. Adoption of GHA was a revolution for us because it democratised access. Also it was zero effort setup for anyone forking / building off nf-core for their own stuff.

I agree that writing and testing GHA is a pain in the ass, but given the volume that we run it at I think that once set up it tends to be very robust. You just have to test a lot in a manual way. Yesterday I uncovered the same bug you saw overnight by testing a separate private repo on GitHub with the branch - I've never bothered using act because I usually find it more trouble than it's worth for exactly this reason. It's not true in its replication.

Generally my preference is very strongly in favour of reducing the number of services and external dependencies that we rely on, especially for anything touching distributed resources (such as pipelines / repositories). I've been burnt a lot of times and it's easy to think that the grass is greener on the other side :)

Maintainership

Although this event was a bit of a 🔥 I think it's good that it's prompted us to review this. It's clear that it's a fairly key part of our infra now (given how many people it affected) and also that its current state was not easy for others to maintain (it took @mashehu and I together about 1.5 days to piece together all the different parts).

My motivation for this simplification is to make the action so simple that anyone can look at it and fix it with minimal prior experience. I hope that this will go some way to making it less brittle.


Finally, the Nextflow team has had several new engineers join recently and we were already in the middle of reviewing the build and release system. The Capsule system was already removed in the summer and we're hoping to improve process for a bunch of how this works.

Specifically I have written an internal postmortem on this and suggested a few concrete things:

  1. We should change NXF_EDGE to install stable if that's newer than the most recent edge release
  2. We should aim to have nightly / regular dev builds to test a canary CI job against in tools to give us advance warning of this type of event, before it hits an edge release
  3. We should investigate supporting installation of YY.MM version numbers that use the latest patch release.

I can't promise when or if any of this will change, but it is at least being discussed 😄 If it all comes through then this action can hopefully get even simpler (or even be deprecated entirely).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants