Use Ubuntu packages for FastTree and other programs? #127

Open
victorlin opened this issue Dec 23, 2022 · 6 comments
Labels
proposal Proposals that warrant further discussion

Comments

@victorlin
Member

Context

FastTree is currently built from source for the nextstrain/base Docker image.

While working on #123, I discovered that it is also available as the Ubuntu package fasttree, which can be installed directly via apt-get install fasttree. This made me wonder whether we should install from that package instead of building from source.

Up-sides to installing from Ubuntu's APT package manager

  1. It reduces build times (upon cache miss) and avoids reinventing the wheel, since we would not have to figure out software-specific build instructions.
  2. The package manager can define/install dependencies that would otherwise have to be handled separately within the Dockerfile.
  3. If the desired binaries are not yet available via the package manager, updating the package has the potential to benefit a wider community (any Ubuntu-using bioinformatician in addition to Nextstrain users).

Notes on the above, with examples:

  1. The Ubuntu package mafft is available for both amd64 and arm64, whereas we have a TODO to figure out how to build it from source.
  2. This helps prevent issues such as "FastTreeDblMP doesn't work in docker" (#123).
  3. Two examples:
    1. Augur prefers FastTreeDblMP which is built for the Docker image. The Ubuntu fasttree package provides less-optimal versions. The FastTreeDblMP build instructions can be copied over to the Ubuntu package builder to benefit non-Nextstrain users.
    2. The Ubuntu package iqtree is only available as amd64. The Dockerfile also only downloads a pre-built binary for amd64. There is a TODO to build from source which would provide arm64-native IQ-TREE binaries in the nextstrain/base image. This could instead be done in the Ubuntu package builder to benefit non-Nextstrain users.

Considerations

  1. Are the Ubuntu package maintainers trustworthy?
  2. Can the version be pinned?
  3. Is it up to date with the latest version?
  4. Does it have the right binary?
  5. If the answer to (3) and/or (4) is "no", is it easy/quick to propose changes to the package and make a new version available?

I'm not familiar with Ubuntu packages, but it seems like all those questions can be answered by clicking around the package websites.

victorlin added the question (Question about the project) and proposal (Proposals that warrant further discussion) labels on Dec 23, 2022
@huddlej
Contributor

huddlej commented Dec 23, 2022

If we are considering installation from prebuilt binaries, we might also consider installing these tools with Conda. We already rely on Conda binaries in our workflow-specific environment files and our nextstrain-base environment. We could have micromamba installed in our first pass of the Docker build and use that to install the third-party binaries we want. I'd trust these binaries that we use in multiple places over the binaries from Ubuntu that we rarely use.

@victorlin
Member Author

@huddlej fair point about considering Conda. One thing to note is that it's best not to compromise availability of platform-specific binaries.

We have a (historical?) preference for using Bioconda. From what I can tell, their package builders do not support aarch64/arm64, only noarch for pure Python packages. This means there is no way to get an arm64-native binary from the fasttree recipe (as of now).

The alternative for FastTree would be to make it available on a different channel that supports building and hosting both amd64 and arm64 binaries - either something established like conda-forge or our own channel (this was mentioned recently, I forget where). However, this comes with increased complexity as we'd have to not only figure out how to build the program but also maintain the Conda "feedstock"/"recipe"/whatever.

@victorlin
Member Author

victorlin commented Dec 23, 2022

This issue can be thought of as a discussion of the "Adding a new software program" section in the README which I added recently:

docker-base/README.md

Lines 90 to 113 in 3ab045e

### Adding a new software program
To add a software program to `nextstrain/base`, follow steps in this order:
1. Check if it is available via the Ubuntu package manager. You can use
`apt-cache search` or [Ubuntu Packages Search](https://packages.ubuntu.com/)
if you do not have an Ubuntu machine. If available, add it to the `apt-get
install` command following `FROM … AS final`
([example](https://github.com/nextstrain/docker-base/commit/8f5e059ce897a85194f35517e56b31424e89472e)).
2. Check if it is available via PyPI. You can search on [PyPI's
website](https://pypi.org/search/). If available, add an install command to
the section labeled with `Install programs via pip`.
3. Check if a pre-built binary for the `linux/amd64` platform (name contains
`linux` and `amd64`/`x86_64`) is available on the software's website (e.g.
GitHub release assets). If available, add a download command to the section
labeled with `Download pre-built programs`.
   - If a pre-built binary supporting `linux/arm64` (name contains `linux` and
     `arm64`/`aarch64`) is also available, that should be used conditionally on
     `ARG`s `TARGETPLATFORM` or `TARGETOS`+`TARGETARCH` in the Dockerfile. See
     existing usage of those arguments for examples.
4. The last resort is to build from source. Look for instructions on the
software's website. Add a build command to the section labeled with `Build
programs from source`. Note that this can require platform-specific
instructions.
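
For step 3 of the README excerpt above, the conditional choice between `linux/amd64` and `linux/arm64` assets could look roughly like the following inside a `RUN` step. This is only a sketch: `asset_arch` is a hypothetical helper name, and the mapping assumes the upstream project names its assets with `x86_64` and `aarch64`, which varies by project (some use `amd64` and `arm64` directly).

```shell
# Map Docker's TARGETARCH value to the architecture string assumed to
# appear in a release asset name. Fails for anything unrecognized so the
# build stops early instead of downloading the wrong binary.
asset_arch() {
    case "$1" in
        amd64) echo x86_64 ;;
        arm64) echo aarch64 ;;
        *)
            echo "unsupported TARGETARCH: $1" >&2
            return 1
            ;;
    esac
}

# Hypothetical usage in a Dockerfile stage that declares `ARG TARGETARCH`:
#   arch="$(asset_arch "$TARGETARCH")"
```

Failing on unknown architectures keeps a new platform from silently falling back to an emulated binary.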

There are a few separate things I'd like to discuss:

  1. Should we limit usage of the Ubuntu package manager to only "official" packages (i.e. linked from the program's website)? Otherwise, if one were to follow these instructions as-is for FastTree, they would install the program using apt-get install fasttree.
  2. If building from source, should we consider migrating those build commands to (any) package manager for the up-sides noted in the issue description?
  3. As @huddlej proposes, should we also include Conda channel(s) somewhere in the search for existing binaries?

@ivan-aksamentov
Member

The problem with installing from package managers (dpkg or Conda) is that they do not always make it easy to swap versions. For dpkg, once released in a given distro version, these packages pretty much never update.

Considering how buggy and low-quality the software in this field can be, scientists sometimes have to search for the only one version that works for a given input (for example, one that does not crash or produce garbage results). This is especially true for iqtree: you can find the adventures Cornelius went through with it on Slack.

To speed up the Docker build, instead of building inside the container, we could add scripts to make prebuilt installable tarballs or debs once, host them on S3 or GH releases, and then just untar them in the Dockerfile.
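
A rough sketch of the "just untar" step, assuming the tarball has already been downloaded into the build stage. The helper name `verify_and_unpack` is hypothetical, and the checksum comparison is an extra precaution not mentioned above; it assumes GNU `sha256sum` is available, as it is on Debian-based images.

```shell
# Verify an archive against an expected SHA-256 digest, then unpack it.
# Usage: verify_and_unpack ARCHIVE EXPECTED_SHA256 DEST_DIR
verify_and_unpack() {
    archive=$1
    expected=$2
    dest=$3

    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha256sum "$archive" | cut -d ' ' -f 1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $archive" >&2
        return 1
    fi

    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
}

# Hypothetical usage (TARBALL_URL and TARBALL_SHA256 are placeholders):
#   curl -fsSL "$TARBALL_URL" -o /tmp/tool.tar.gz
#   verify_and_unpack /tmp/tool.tar.gz "$TARBALL_SHA256" /usr/local
```

Recording the digest next to the URL in the Dockerfile would also pin the exact build being installed.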

@victorlin
Member Author

> For dpkg, once released in a given distro version, these packages pretty much never update.

> scientists sometimes have to search for the only one version that works for a given input

These relate to considerations (3) "Is it up to date with the latest version?" and (2) "Can the version be pinned?". It seems like fasttree is up-to-date and can be pinned by apt-get install fasttree=2.1.11-2. So packages should be evaluated on a case-by-case basis, and I wouldn't rule it out based on these two points alone.
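
As a minimal sketch of how the "can it be pinned?" check might be scripted: the function names below are hypothetical, while `fasttree` and `2.1.11-2` are the package and version mentioned above. Whether that exact version is still offered depends on the repositories configured in the base image.

```shell
# Build the "name=version" argument that apt-get accepts for an exact version.
pinned_spec() {
    printf '%s=%s\n' "$1" "$2"
}

# Show the installed and candidate versions the configured repositories
# offer for a package (requires a prior `apt-get update`).
show_versions() {
    apt-cache policy "$1"
}

# Install exactly the requested version; apt-get fails if it is unavailable.
install_pinned() {
    apt-get install --yes --no-install-recommends "$(pinned_spec "$1" "$2")"
}

# Hypothetical usage in a Dockerfile RUN step:
#   apt-get update && show_versions fasttree && install_pinned fasttree 2.1.11-2
```

A pinned install that fails loudly when the version disappears from the repository is arguably preferable to silently picking up a different build.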

> To speed up the Docker build, instead of building inside the container, we could add scripts to make prebuilt installable tarballs or debs once, host them on S3 or GH releases, and then just untar them in the Dockerfile.

Speed of the Docker build is the least important benefit I see from using package managers – the current Docker caching (at least for pinned programs) is already effective at reducing the number of times a program is built.

A more important benefit would be using package managers for dependency management, since it is currently disjoint in the Dockerfile (see #126).

@tsibley
Member

tsibley commented Jan 4, 2023

Lots of intersectional considerations here.

For background context, the reason we were compiling FastTree (and others) in the first place, IIRC, was that we started off using an Alpine base image which either did not provide packages for these programs or potentially packaged versions that were too out of date. We did not reconsider this when switching to a Debian-based base image (e.g. python:3.10-slim-bullseye).

Also, to be precise, we're talking about using Debian packages here not Ubuntu's repackaging of them, as our base distro is Debian "bullseye" (via python:3.10-slim-bullseye). Bullseye is the current Debian stable release. The relevant package is https://packages.debian.org/bullseye/fasttree.

I've added my thoughts on @victorlin's questions below.

> 1. Are the Ubuntu package maintainers trustworthy?

Yes, absolutely. Or at least as much as any other maintainers we implicitly trust, and we already trust Debian maintainers as a whole (a very broad and varied group, not necessarily the specific maintainers of this package) quite a bit.

> 2. Can the version be pinned?
> 3. Is it up to date with the latest version?

For FastTree, yes to both, as you've noted.

> 4. Does it have the right binary?

Yes, but not under the conventional name. See below.

> 5. If the answer to (3) and/or (4) is "no", is it easy/quick to propose changes to the package and make a new version available?

Maybe. We can certainly propose packaging patches, and those will go thru Debian's process at the pace set by the maintainer team for this package. That pace is an unknown, but we could look at past changes to gauge; ~all communication is open/public. But depending on the scope of changes, those may or may not be able to be included in the stable release we're using.

> Augur prefers FastTreeDblMP which is built for the Docker image. The Ubuntu fasttree package provides less-optimal versions. The FastTreeDblMP build instructions can be copied over to the Ubuntu package builder to benefit non-Nextstrain users.

The Debian package actually enables double-precision for both binaries it produces, but it doesn't include the conventional Dbl moniker in the names like we do and the official binary does.

Relatedly, there's a case to be made that we shouldn't ever use a FastTree version compiled without double precision, as the results are likely to be wrong for our use cases (cf. "Not so fast, FastTree" and the Debian package's only-ever bug report). To enforce this, I think we'd have to do something equivalent to

if ! FastTree |& grep -qi 'double precision'; then
    echo FastTree must be compiled with USE_DOUBLE >&2
    exit 1
fi

in Augur (or some external wrapper Augur calls instead).

Note that Conda, like Ubuntu, also omits the Dbl moniker but does compile with USE_DOUBLE. We may decide that enforcing this in Augur isn't worth it, and that we'll instead ensure our runtimes are safe.
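
Because the binary name differs between packagings, a wrapper could take the name as a parameter. The sketch below is one hedged way to structure that; the function names are hypothetical, and it assumes, as the snippet above does, that running the binary with no arguments prints text containing "double precision" when built with USE_DOUBLE.

```shell
# Succeed only if the given text mentions double precision
# (case-insensitive), mirroring the grep in the snippet above.
mentions_double_precision() {
    printf '%s\n' "$1" | grep -qi 'double precision'
}

# Run the named FastTree binary with no arguments, capture stdout and
# stderr, and fail with a message if double precision is not mentioned.
require_double_precision() {
    fasttree_bin=$1
    # The binary's exit status is ignored; only its output matters here.
    fasttree_output=$("$fasttree_bin" 2>&1) || true
    if ! mentions_double_precision "$fasttree_output"; then
        echo "$fasttree_bin must be compiled with USE_DOUBLE" >&2
        return 1
    fi
}

# Hypothetical usage:
#   require_double_precision FastTreeDblMP || exit 1
```

Separating the text check from the invocation keeps the matching rule testable without a FastTree binary installed.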

> The Ubuntu package iqtree is only available as amd64. The Dockerfile also only downloads a pre-built binary for amd64. There is a TODO to build from source which would provide arm64-native IQ-TREE binaries in the nextstrain/base image. This could instead be done in the Ubuntu package builder to benefit non-Nextstrain users.

The Ubuntu package is not directly relevant here (per the note about the distro we use above), but Debian also does not compile its iqtree package for arm64. We could potentially contribute to the Debian packaging.

However, note that the iqtree package for Debian bullseye is 1.x not 2.x as we currently use, so we'd have to downgrade, which I think is probably a nonstarter?

> Should we limit usage of the Ubuntu package manager to only "official" packages (i.e. linked from the program's website)? Otherwise, if one were to follow these instructions as-is for FastTree, they would install the program using apt-get install fasttree.

We should prefer packaged versions, esp. from the base distro, as long as they're suitable. "Are they suitable?" likely mostly means, "Are they current/new enough versions?"

> If building from source, should we consider migrating those build commands to (any) package manager for the up-sides noted in the issue description?

If necessary and applicable, but not always. Also, this isn't always feasible, cf. the discussion above about packaging policies/cadences.

> As @huddlej proposes, should we also include Conda channel(s) somewhere in the search for existing binaries?

Conda packages bring along other issues. For example, they expect to bring along everything but libc, so things like openssl and other common shared libs will get duplicated (increasing image size, increasing complexity of library interactions at runtime, and more). I'm reluctant to mix Conda packages with non-Conda packages for these reasons.

That said, we might take a step back and consider building the container image entirely from a static Conda environment. We've (or at least I've) considered this before, but decided it wasn't worth it then. Maybe that's changed, particularly in light of our new Conda runtime defined by a locked package? There are downsides though, like a tighter coupling between runtimes and what they can support (e.g. architectures). Tighter is good in some ways but worse in others. Also, other considerations aside, we may not want to put all our eggs in Conda's basket.


Relatedly, but not yet mentioned, is RAxML. We could consider moving to the Debian package, which is available for both amd64 and arm64 and includes the raxmlHPC-PTHREADS-AVX variant we currently use.
