Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(uv): parse the dist-manifest.json to not hardcode sha256 in rules_python #2578

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

aignas
Copy link
Collaborator

@aignas aignas commented Jan 25, 2025

This is using the dist-manifest.json on the GH releases page so that
we can get the expected sha256 value of each available file and
download all of the usable archives. This means that rules_python no
longer needs to be updated for uv version bumps.

The remaining bits for closing the ticket:

  • Finalize the lock interface.
  • Add the locking target to the pip.parse hub repo if pyproject.toml
    is passed in.
  • Add a rule/target for venv creation.

Work towards #1975, stacked on #2580.

@aignas aignas requested a review from rickeylev as a code owner January 25, 2025 05:28
@aignas aignas requested a review from groodt January 25, 2025 05:31
@aignas
Copy link
Collaborator Author

aignas commented Jan 25, 2025

Not fully sure why it failed.

@aignas aignas marked this pull request as draft January 25, 2025 06:19
@aignas
Copy link
Collaborator Author

aignas commented Jan 25, 2025

I'll do the moving of files around in a separate PR so that the functionality con stay in this PR.

aignas added a commit to aignas/rules_python that referenced this pull request Jan 25, 2025
This PR starts establishing a structure that will eventually become a
part of our API. This is a prerequisite for bazelbuild#2578 which removes the
versions.bzl file in favour of a more dynamic configuration of the
extension. We also remove the `defs.bzl` to establish a one symbol per
file convention.

Things that I wish we could change is `//python/uv:extensions.bzl` and
the fact that we have `extensions` in the load path. I think it cannot
be removed, because that may break the BCR test. On the other hand,
maybe we could remove it and do an alpha release to verify this
assumption.

Work towards bazelbuild#1975
@aignas aignas force-pushed the feat/improve-uv-mgmt branch from f94d74c to 087fb57 Compare January 25, 2025 07:52
@aignas aignas changed the title feat: improve the uv mgmt and the structure feat(uv): parse the dist-manifest.json to not hardcode sha256 in rules_python Jan 25, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 26, 2025
This PR starts establishing a structure that will eventually become a
part of our API. This is a prerequisite for #2578 which removes the
versions.bzl file in favour of a more dynamic configuration of the
extension. We also remove the `defs.bzl` to establish a one symbol per
file convention.

Things that I wish we could change is `//python/uv:extensions.bzl` and
the fact that we have `extensions` in the load path. I think it cannot
be removed, because that may break the BCR test. On the other hand,
maybe we could remove it and do an alpha release to verify this
assumption.

Work towards #1975
Copy link
Collaborator

@rickeylev rickeylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only had an extra minute free, so didn't get far in review.

I see some new tag classes added. Can you give a quick overview of the thinking?

sha256 values for each binary.
"""
dist_manifest = module_ctx.path(_DIST_MANIFEST_JSON)
module_ctx.download(base_url + "/" + _DIST_MANIFEST_JSON, output = dist_manifest)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want a sha for the dist manifest download?

Using untrusted shas to verify a download is equivalent to not using shas

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the sha256, I am downloading the manifest that only has links and I don't see how the extra sha256 value here could be beneficial.

This has been inspired by https://github.com/bazel-contrib/rules_go/blob/7f6a9bf5870f2b5ffbba1615658676dcabf9edc7/go/private/sdk.bzl#L84 where rules_go is downloading things from the manifest if the user does not specify anything.

Maybe we should instead an optional sha256s dict where we have platform labels as keys and the sha values as values? And we could print a warning with buildozer command to add the said sha256s values to their MODULE.bazel file if they want things to be deterministic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I see your point now. Hm.

On the one hand, not using shas means, if someone MITM the manifest, then they can control what actual uv stuff is downloaded and then used.

On the other hand, using a sha means we are tied to a particular manifest and don't benefit from the automatic-ness of using the manifest. Which is pretty appealing behavior.

Hm, not sure how, or if, we can split the difference. These seem at odds.

Also:

  • Does the sha arg of download() allow bazel to better cache it?
  • Does the sha arg feed into MODULE.lock? I would expect so, since this is module-phase behavior.
  • Without the sha arg set, does Bazel print a warning? (http_archive does, not sure if download does)

@aignas
Copy link
Collaborator Author

aignas commented Jan 27, 2025

The main idea would be to leverage the dist-manifest.json files to download things. rules_python can set the defaults and eventually register the toolchains (if nothing is specified, we just register a noop toolchain).

The users can use the following tag_classes to interact:

  • uv.config - set the base_url for where to download things from.
  • uv.toolchain - set the version and the name of the toolchains repository. Though the name setting is probably not needed if rules_python is planning to register the toolchains.
  • uv.platform - add the platform definitions for the toolchain binaries.

After having written this, maybe we should just have:

  • uv.platform - as above
  • uv.config - set the version and the base_url of the toolchain the name being hardcoded.

My thinking was that any user may want to do the following to customize the uv usage in the locking rule we provide:

So all of the above could be done by:

uv.config(version="0.5.24") # override the version
uv.config(base_url="my_internal") # override the base URL
uv.platform(...) # add an extra platform to support locking on

And if they wish, they can just ignore this and register their own uv toolchain.

This is using the `dist-manifest.json` on the GH releases page so that
we can get the expected `sha256` value of each available file and
download all of the usable archives. This means that `rules_python` no
longer needs to be updated for `uv` version bumps.

The remaining bits for closing the ticket:
- [ ] Finalize the `lock` interface.
- [ ] Add it to the `pip.parse` hub repo if `pyproject.toml` is passed
  in.
- [ ] Add a rule/target for `venv` creation.

Work towards bazelbuild#1975.
@aignas aignas force-pushed the feat/improve-uv-mgmt branch from 9144ce8 to b8fa595 Compare January 27, 2025 10:13
@aignas aignas marked this pull request as ready for review January 27, 2025 10:14
@aignas
Copy link
Collaborator Author

aignas commented Jan 27, 2025

Marking as ready to be reviewed even though this is not yet ready to be merged, but I would like feedbakc on the APIs.

@rickeylev
Copy link
Collaborator

+1 on uv.config.

I get uv.toolchain. It's telling what version of uv we want to use.

I don't fully grok uv.platform. Is the idea that a uv toolchain is registered for every platform() call? Ah -- it's carrying the constraint labels that get put onto the toolchain. Plus also telling the repo name to use. Is that it? Because putting those on uv.toolchain() is tricky; it would require passing e.g. {"linux_86": [//os:linux, //cpu:x86]}. Similar to requirements_by_platform. Is this what you meant in the maintainers meeting, when you mentioned "we define the platforms and their constraints" ?

I think I like it. The constraints for a platform are going to be the same, regardless of e.g. uv version.

Is part of the idea here that rules_python calls uv.platform() for the common ones? And if a user has a special platform, e.g there's a musl build of uv. they would do: uv.platform("musl-link", constraints=[os:linux, glibc:musl]) ?

},
)

platform = tag_class(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that isn't clear from the doc here is how the word "for" is operating.

Does this define the platforms uv will run on, or the platform that uv will generate a lockfile for? I'm assuming the former.

@aignas
Copy link
Collaborator Author

aignas commented Jan 29, 2025

+1 on uv.config.

I get uv.toolchain. It's telling what version of uv we want to use.

Maybe we should just have uv.config instead of the separate tag classes that have barely any api?

I don't fully grok uv.platform. Is the idea that a uv toolchain is registered for every platform() call? Ah -- it's carrying the constraint labels that get put onto the toolchain. Plus also telling the repo name to use. Is that it? Because putting those on uv.toolchain() is tricky; it would require passing e.g. {"linux_86": [//os:linux, //cpu:x86]}. Similar to requirements_by_platform. Is this what you meant in the maintainers meeting, when you mentioned "we define the platforms and their constraints" ?

Correct. For uv we have to have some labels that we can put into tcw on the toolchain, so I am doing a very similar thing to what the python extension is doing under the hood.

I think I like it. The constraints for a platform are going to be the same, regardless of e.g. uv version.

Yes that is the idea.

Is part of the idea here that rules_python calls uv.platform() for the common ones? And if a user has a special platform, e.g there's a musl build of uv. they would do: uv.platform("musl-link", constraints=[os:linux, glibc:musl]) ?

Also correct.

@rickeylev
Copy link
Collaborator

rickeylev commented Jan 30, 2025

See below -- thoughts?

Maybe we should just have uv.config instead of the separate tag classes that have barely any api?

Hm. Let see what some psuedo code looks like and what it might imply.

uv.config(base_url="A") # L1
uv.config(version="0.1.0") # L2
uv.config(version="0.2.0", base_url="B") # L3
uv.config(platform="darwin_x86", constraints=[os:darwin, cpu:x86]) # L4
uv.config(platform="linux_arm", constraints=[os:linux, cpu_arm]) # L5
uv.config(version="0.3.0", platform="linux_musl_arm", constraints=[os:linux, cpu:arm, glibc:musl]) # L6
uv.config(version="0.4.0", constraints=[//user:flag]) # L7

I just merged all 3 tag classes and threw it in a blender like a code fuzzer, for exploratory purposes. I'm also throwing in complications like we see from the pip/python extensions just for fun.

Aesthetically, that doesn't look too bad. It seems fairly clear to understand? The docs for the config "function" will be a bit unwieldy, but that seems a minor loss.

Three observations


First: Recall that tag_class calls preserve their order. So, one way to interpret that pseudo code is:

By default, download from url A (L1)
By default, use version 0.1.0 (L2)
By default, every registration will attempt to register
  darwin_x86 (L4)
  linux_arm (L5)
  with their respective constraints.

Also register version 0.2.0 and download it from B. (L3)
  implicit: for platforms darwin_x86, linux_arm
Also register version 0.3.0 (L6)
  explicit: but don't use the default platforms. Instead just "linux_musl_arm"
  implicit: download it from A
  i.e. toolchain(name="0.3.0_linux_musl_arm", version=0.3.0,
                 constraints=[linux, arm, musl])
Also register version 0.4.0 (L7)
  explicit: add constraint //user:flag
  implicit: for platforms darwin_x86, linux_arm, download from A
  i.e. toolchain(name="0.4.0_{darwin,linux}", version=0.4.0,
                 target_settings[//user:flag], ...)

It's kind of neat I was able to be so specific.

An interesting implication of respecting the order of calls is I can easily insert toolchains earlier in the ordering. This allows me to define a toolchain with more specific criteria (linux, arm, musl) so that it matches before one with more broad criteria (linux, arm).

Hm, though, things that affect "defaults" seem worthy of looking more distinct (different "function call"), e.g. L1,2,4,5. The order of those doesn't matter. With the sort-of exception of L1 ("first defined version is default").

(edit: inspired aside: maybe add e.g. python.rules_python_internal_set_defaults(...). It'd remove the "if module == rules_python" block that gets interjected into the normal code flow and allow it to be handled specially in a more first-class manner.)


Second: as I wrote the pseudo code, I was reminded of builder patterns, in particular Java protos. These are pretty good at giving fine-grained control in pretty clear and understandable ways:

config = ConfigProto.newBuilder()

tc = config.addToolchain().setVersion("0.1.0").setUrl("B")
  .setConstraints([...])
tc.addPlatform().setName("darwin_x86").setConstraints(...)
tc.addPlatform()...etc...

My point being, were it available, we'd probably use a builder style API. The accumulative nature of tag_class calls seems akin to that.


Third: we have multiple dimensions of interaction: version x platform constraints x url x arbitrary constraints. What we're trying to do give users enough control over those variables and their expansion that they get what they need without too much of what they don't need. This is a pretty tall ask of any API.


Fourth: I'm reminded of the recent PR that reads the is_default setting from a file. Another approach would be to offload everything into a separate config file altogether. This seems like overkill, but it did pop into my head.

@aignas
Copy link
Collaborator Author

aignas commented Feb 1, 2025

I actually like the expressivity of the API you came up with for the platform definitions. Maybe it would be an interesting thing to have two things:

uv.defaults() # used by `rules_python` to set the default platforms, version, etc
uv.config() # used by the users to customize things similar to what you suggested above

I think that would be also interesting in the pip.parse case for:

pip.parse(hub_name = "foo", requirements_lock = "default_requirements_lock.txt")
pip.parse(hub_name = "foo", requirements_lock = "requirements.linux_x86_64.txt", constraint_values = ["@platforms//os:linux", "@platforms//cpu:x86_64"])
pip.parse(hub_name = "foo", requirements_lock = "requirements.gpu.linux_x86_64.txt", constraint_values = ["@platforms//os:linux", "@platforms//cpu:x86_64"], flag_values = ["@//:my_gpu_flag": "enabled"])

This is related to #2548 and could potentially be a good way to allow users to modify things.

However, the pip.parse and the uv.config implications/complications are a little bit worrying:

  • How do you validate things correctly? The user might be mistakenly missing some of the arguments and we are interpreting them as "the guy must want to append to the previous declaration".
  • Bazel quite often is thought to be too complex for "regular human beings", who are not working at FAANG and I am curious if we are increasing the entry barrier here?

Maybe we could solve the concerns above by adding another attribute called add, which users would have to use for this different behaviour or we can have a separate tag_class that is doing either building or defining everything at once. Thus, users cannot get into a situation where they are making mistakes.

I am thinking that the following would be unambiguous:

uv.config() # the builder pattern
...
uv.toolchain() # non-builder pattern

# likewise for pip
pip.parse() # non-builder pattern
pip.config() # builder pattern

I'll think about this more and when I have time to finish implementation might go with the builder/non-builder patterns present in the same bzlmod extension.

@rickeylev
Copy link
Collaborator

have an add attr/tag_class

Yeah, +1 to something like this. That config() really means "merge these values into the current state" doesn't well jive with the noun "config" word, and the "merge" semantics for the verb word seem out of place.

add_config(), merge_config(), mixin_config() ?

I like uv.defaults() for setting the defaults.

@aignas
Copy link
Collaborator Author

aignas commented Feb 2, 2025

Other ideas for naming: append_config, build_config, append_last_config, merge_last_config, add_or_merge_config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants