improved packaging #26
absolutely! for additional context, here's some discussion of project goals:
Responding to your specific suggestions:
tl;dr: I would absolutely love to have your support and invite your contributions. The current packaging (especially the installation) is a consequence of the problem I am trying to solve, and I am definitely open to solving it in other ways.
ok sounds good, I'll get started on the pypi part
hmm there's just one last thing I'm wondering: is it better to build our own thing (like you did), or plug ourselves into an existing platform like huggingface models, https://hub.jina.ai/, or https://github.com/UKPLab/sentence-transformers?
this was the approach I was originally planning to take as well, but @apolinario made a great point in an earlier discussion that this approach doesn't scale well. I created napm as an alternative pattern for managing dependencies that couldn't be pip-installed: we could probably extend the functionality/recipe for packages that have a setup.py but aren't hosted on pypi. it's a sort of hacky solution, but it doesn't commit us to making sure 10s or 100s of forks are kept up to date. Have you had a chance to take a look at the CLOOB example I linked?
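To make the pattern concrete, here's a rough sketch of the napm-style approach described above: instead of maintaining pip-installable forks, clone a repo into a local cache and put it on `sys.path`. All names here (`pseudoinstall`, `NAPM_CACHE`, `repo_cache_path`) are illustrative placeholders, not napm's actual API.

```python
# Hypothetical sketch of the napm-style pattern: "pseudo-install" a repo
# that isn't pip-installable by cloning it once and adding it to sys.path.
import subprocess
import sys
from pathlib import Path

NAPM_CACHE = Path.home() / ".cache" / "napm"  # placeholder cache location


def repo_cache_path(repo_url: str) -> Path:
    """Derive a stable cache location from the repository URL."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[:-4]
    return NAPM_CACHE / name


def pseudoinstall(repo_url: str, ref: str = "main") -> Path:
    """Shallow-clone the repo (once) and make it importable via sys.path."""
    target = repo_cache_path(repo_url)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "clone", "--depth", "1", "--branch", ref, repo_url, str(target)],
            check=True,
        )
    if str(target) not in sys.path:
        sys.path.insert(0, str(target))
    return target
```

The upside of this shape is exactly the one argued above: nothing needs to be forked or kept in sync with upstream, at the cost of bypassing pip's dependency resolution entirely.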
That's definitely something we could consider further. I suspect this approach has similar scaling issues to the forking thing above, but that might be something we could get around with clever CI.

I've generally experienced a lot of headaches with HF's CLIP implementation, and at present the only models mmc is still having trouble with are the ones it pulls from HF (CLIP-fa and SBERT-mCLIP). I'm a little concerned about tacitly directing researchers to host their CLIP models on HF if it means they'd be more likely to gravitate towards the troublesome HF CLIP implementation. If we go the HF route, we might want to start by adding a more HF-friendly base CLIP implementation or improving the one that's up there.

I haven't used Jina AI in a while, but a teammate was recently singing the praises of jina's clip-as-a-service inference API. I'd already been thinking about adding a class to mmc to support inferencing off APIs like jina's and HF's (#24): if we could set up hosted inference APIs for the models we hope to support, that could be an extremely powerful paradigm. I think Jina's CaaS thing is essentially a product, not a generic inference API they're planning to let people plug their own models into for free. However, as I'm sure you're already aware, this is a feature the HF hub already supports.

So I guess it all comes down to a cost-benefit analysis. We're currently using a duct-tape-and-bubblegum approach that lets us add support for new models fairly quickly without requiring that we maintain forks or push changes upstream, at the cost of ignoring a lot of dependency-management best practices and crossing our fingers that nothing breaks. We could prioritize improved stability, but that would come at the cost of a potentially much higher LOE to onboard new models, and it commits us to ongoing maintenance that the hackier approach abstracts back to the reference repo.
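As a rough illustration of the API-backed class floated in #24: a perceptor that delegates text embedding to a hosted inference endpoint instead of loading weights locally. The class name, endpoint, and response shape below are all assumptions for illustration; injecting the `post` callable keeps the HTTP transport swappable for testing.

```python
# Hypothetical sketch of an API-backed perceptor: embed text by calling a
# hosted inference endpoint (e.g. a jina- or HF-style service) over HTTP.
import json
import urllib.request


class APIPerceptor:
    def __init__(self, endpoint, token=None, post=None):
        self.endpoint = endpoint
        self.token = token
        self._post = post or self._http_post  # swap in a fake for testing

    def _http_post(self, payload):
        """Default transport: POST JSON to the endpoint, return parsed JSON."""
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        if self.token:
            req.add_header("Authorization", f"Bearer {self.token}")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def embed_text(self, text):
        """Return the embedding vector for `text` from the hosted API."""
        return self._post({"inputs": text})["embedding"]
```

The appeal is that a whole family of hosted models could be supported with one thin class rather than one vendored codebase per model.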
...maybe we can have our cake and eat it too? We can use the pattern we've already established here to get simple support quickly, and migrate more popular models to hosted (jina/hf) and/or better-supported (upstream PRs, forks) solutions. Maybe we could get AK or someone who plays with HF-hub/spaces to chime in on this?
Yeah, all good points.
@rom1504 just checking in. have you had a chance to take a stab at this? would love it if we could distribute this on pypi rather than asking users to build the wheel locally |
Hey, sorry, I didn't have time yet. If you want to start things yourself, you could start by copying https://github.com/rom1504/python-template/blob/main/.github/workflows/python-publish.yml into the same folder. Here's an example of a recent PR I just did to do this in another repo: TheoCoombes/ClipCap#5
You'd need a setup.py too; you can take an example from the template repo as well. I'll take a look soon.
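For reference, a minimal setup.py along the lines rom1504 suggests might look like the sketch below. The distribution name, version, and dependency list are placeholders. One known constraint worth flagging: PyPI rejects uploads whose `install_requires` contains direct URL references (the `name @ git+https://...` form), which is why github dependencies complicate publishing.

```python
# Minimal placeholder setup.py sketch, modeled on rom1504's python-template.
from setuptools import find_packages, setup

if __name__ == "__main__":
    setup(
        name="mmc",                      # placeholder distribution name
        version="0.1.0",                 # placeholder version
        packages=find_packages(),
        install_requires=[
            "torch",                     # PyPI-hosted deps are fine
            # "clip @ git+https://github.com/openai/CLIP.git",
            # ^ direct git references like this are rejected by PyPI
        ],
        python_requires=">=3.7",
    )
```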
I think it's a bit more involved than that because of the github dependencies. This is the reason I didn't put it on PyPI myself already: the wheel failed a validation of some kind, and I don't think PyPI will host wheels that block the dependency-resolution flow. Not sure what I did with my notes.

One approach I was thinking we could take (which would also scale better than the current approach of requiring users to install the codebases for every CLIP variant mmc supports) would be an internal download/install mechanism that ships with mmc, similar to how nltk's model and corpus downloads work. That specific approach is a bit clunky, but it's my current thinking anyway. Another tack we could take would be to push the installations into the loaders: if a pip-installable git dependency can't be imported, the loader installs it on demand.
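The "push installation into the loaders" idea could be sketched like this: try the import, and if the pip-installable git dependency is missing, install it from its git URL and retry. The `require` helper and its signature are illustrative, not mmc's actual API.

```python
# Sketch of install-on-demand inside a loader: import a module, and if it
# is missing, pip-install it from a git URL and import again.
import importlib
import subprocess
import sys


def require(module_name, git_url):
    """Import module_name, pip-installing it from git_url on first failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", f"git+{git_url}"],
            check=True,
        )
        return importlib.import_module(module_name)
```

A loader would then call something like `clip = require("clip", "https://github.com/openai/CLIP")` before touching the model, so users only pay the install cost for the perceptors they actually use.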
Why can't we simply fix all the dependencies to be installable from pypi?
that was the approach I was originally taking, and apolinario convinced me otherwise. See the discussion here (tl;dr: scaling concerns): #14

Also, making these dependencies installable from PyPI is itself potentially non-trivial. Consider for example Kat's CLOOB, which itself has three github dependencies: webdataset, jmp, and CLIP. CLIP does have a PyPI-hosted fork, but it hasn't been updated since October, so its clip.available_models() is missing a few late-release checkpoints. webdataset is only needed for training, so that's a droppable dependency, but jmp is still needed for jax inference. So there's at least potential for a PyPI-hosted pytorch-only CLOOB perceptor, but you can see how these edge cases and caveats stack up. And that was an example authored by an extremely talented developer; a lot of the dependencies relevant to this discussion will probably be much messier.
Yes, I do think this is the only scalable approach. Properly publishing all these packages would help not only this aggregator package but also all direct usage of those packages.
If you mean non-scalable as in "this will install too many things," that could be a better point, but then my opinion is the same: we can fix the packaging of all the dependencies. If some rare dependencies absolutely need too many things, then we make them optional and tell people to pip install them themselves (and/or run your function to install them, which amounts to the same thing).
I meant "non-scalable" in the sense "this will likely create a lot of technical burden on the mmc maintainers," which is mostly myself. If you want to take responsibility for converting mmc dependencies into installable packages and keeping installable forks up to date with upstream changes, I have no problem redirecting mmc to point to your forks. I'm not personally inclined to take on that responsibility for the majority of mmc's dependencies.

My current strategic plan is to prioritize making these pre-trained models broadly available quickly: if certain models shipped in this fashion prove to be particularly useful, then it might make sense to create and maintain PyPI-hosted, installable forks of those specific repositories, the intention then being to facilitate downstream users getting access to those tools without dealing with the tangled mess the rest of the models bring with them. I'd like to have this hosted on PyPI, but asking users to have one slightly weird package setup doesn't feel like a big ask to me if it replaces a bunch of other (potentially weirder) package setups. It's not perfect, but it's a huge improvement over how our target users are currently being forced to do things.

So to reiterate: my concern here is about how your suggestion would impact my personal workload. If you would like to take responsibility for that work, or can find someone else who is open to it, I have no problem redirecting pyproject dependencies to point to PyPI instead of github. But creating and maintaining those forks is not a responsibility I'm inclined to volunteer for at the moment, for a variety of reasons which I think I've already explained to the best of my ability.
also: I didn't even know clip-anytorch existed.
I see!
Also, maybe to clarify one thing: the reason I want this packaged properly, with true dependencies on pypi and all, is that it's the only way I know to make it clean enough that I can depend on it from other places with confidence that it will keep working.
ah I've got this one too: https://pypi.org/project/taming-transformers-rom1504/#history :D
Hi,
As part of our package to easily evaluate clip models, LAION-AI/CLIP_benchmark#1
and my inference lib https://github.com/rom1504/clip-retrieval
I'm interested to have a package like this
however, here is what's missing here:
I may be interested to contribute all that, but first I'd like to check with you whether you're ok with those kinds of changes
thanks