
[Improvement] Pip could resume a package download halfway when the connection is poor #4796

Open
winstonma opened this issue Oct 20, 2017 · 28 comments · May be fixed by #12991
Labels
C: download About fetching data from PyPI and other sources state: awaiting PR Feature discussed, PR is needed type: enhancement Improvements to functionality

Comments

@winstonma

winstonma commented Oct 20, 2017

  • Pip version: 9.0.1
  • Python version: 3.6.2
  • Operating system: macOS 10.13

Description

When I have a poor internet connection (the network cuts out unexpectedly), updating a pip package is painful. When I retry the pip install, it stops midway and gives me the same md5 error.

My workaround is:

  1. Download the package from PyPI (using a browser or wget, both of which can retry/resume)
  2. pip install the downloaded file
  3. Remove the downloaded file

If pip's downloader had a resume feature, the problem would be solved.

What I've run

pip install -U jupyterlab in poor network conditions

Collecting jupyterlab
  Downloading jupyterlab-0.28.4-py2.py3-none-any.whl (8.7MB)
    4% |█▋                              | 430kB 1.1MB/s eta 0:00:08
THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    jupyterlab from https://pypi.python.org/packages/b1/6d/d1d033186a07e08af9dc09db41401af7d6e18f98b73bd3bef75a1139dd1b/jupyterlab-0.28.4-py2.py3-none-any.whl#md5=9a93b1dc85f5924151f0ae9670024bd0:
        Expected md5 9a93b1dc85f5924151f0ae9670024bd0
             Got        4b6835257af9609a227a72b18ea011e3
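
For reference, the kind of resume logic I'm asking for looks roughly like this (a minimal sketch using requests; the Range header is what wget and browsers use to resume, and the server must support it — the function and names are illustrative, not pip internals):

    import hashlib
    import os

    import requests

    def download_with_resume(url, dest, expected_md5=None):
        """Fetch url into dest, resuming any partial file already on disk."""
        headers = {}
        if os.path.exists(dest):
            # Ask the server for only the bytes we are missing.
            headers["Range"] = "bytes=%d-" % os.path.getsize(dest)
        with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            # 206 means the Range header was honoured; a plain 200 means
            # the server sent the whole file, so start over from scratch.
            mode = "ab" if resp.status_code == 206 else "wb"
            with open(dest, mode) as f:
                for chunk in resp.iter_content(chunk_size=64 * 1024):
                    f.write(chunk)
        if expected_md5 is not None:
            with open(dest, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest != expected_md5:
                raise ValueError("md5 mismatch: got %s" % digest)
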
@pradyunsg pradyunsg added type: enhancement Improvements to functionality C: download About fetching data from PyPI and other sources labels Oct 20, 2017
@CTimmerman

CTimmerman commented Jul 9, 2018

I don't know how pip's hashing works, but here's some working, simple, modular resume code in a single file/function: https://gist.github.com/CTimmerman/ccf884f8c8dcc284588f1811ed99be6c

@seandepagnier

I have a poor connection and I often resume pip manually using wget.

This is easy for a wheel using wget -c; you can then install the wheel with pip. But when it's a tarball I have to use the setup script, and I don't get the same result, even though it works in the end.

@chrahunt chrahunt added type: feature request Request for a new feature and removed type: feature request Request for a new feature labels Dec 17, 2019
@chrahunt
Member

This should be easier to implement now since all the logic regarding downloads is isolated in pip._internal.network.download.

@johny65

johny65 commented May 14, 2020

Any updates on this? I was installing a huge package (specifically TensorFlow, 500+ MB), and for some reason pip was killed at 99% of the download... Re-running the command started the download again from 0...

@pradyunsg
Member

@johny65 No updates.

Folks are welcome to contribute this functionality to pip. As noted by @chrahunt, there's a clear part of the codebase for these changes to be made in. :)

@McSinyx
Contributor

McSinyx commented May 15, 2020

I have a few questions about the design for this enhancement. First, why (or how) does this happen?

When I have poor internet connection (the network is cut unexpectedly) [...] When I retry the pip install, it would stop at the midpoint and give me the same md5 error.

My guess is that back then wheels were stored directly in the cache dir instead of being downloaded to a temporary location, as is done now. If so, the hashing error should already be solved.

However, because the wheel being downloaded now lives in a directory that is cleaned up afterwards, do we want to expose that mechanism as something configurable (e.g. pip install --wheel-dir=<user-assigned path> <packages>), or do we want to recommend that people with poor connections run pip download -d <user-assigned path> <packages> and then pip install? Personally I prefer the latter approach, where we'd need to make pip download write directly to the specified dir, and I'm not sure whether doing that would break any existing use case.
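
A sketch of the latter workflow as it exists today (the directory path is illustrative):

    pip download -d ~/wheels <packages>
    pip install --no-index --find-links ~/wheels <packages>

The second command installs only from the local directory (--no-index prevents any network access), so a failed download can simply be retried before installing.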

@ShashankAW

Any updates on this? I was installing a huge package (specifically TensorFlow, 500+ MB), and for some reason pip was killed at 99% of the download... Re-running the command started the download again from 0...

Same with PyTorch, which was 1 GB in size. A day's quota just got exhausted with no fruitful result.

@uranusjr
Member

FWIW, you can always curl manually (applying whatever resuming logic you need and checking the integrity yourself) and pip install the downloaded file instead.

@pradyunsg pradyunsg added the state: awaiting PR Feature discussed, PR is needed label Dec 1, 2020
@pradyunsg
Member

Folks are welcome to contribute this functionality to pip.

@yichi-yang

Folks are welcome to contribute this functionality to pip.

I'd like to give this a try and have created a proof-of-concept PR here: #11180.

I'm not quite sure what the command line options for this feature should look like. I imagine we will need new options to turn the feature on/off and to limit the number of retries (this is different from the --retries switch). So maybe --resume-incomplete-download to opt in and --resume-attempts to set the limit?

@uranusjr
Member

If this gets implemented, I would want it to be enabled by default, and to fall back automatically to the previous implementation if resuming is not successful (e.g. if the server does not support resuming). This matches the behaviour of normal downloading clients, e.g. web browsers.

@yichi-yang

If this gets implemented, I would want it to be enabled by default, and to fall back automatically to the previous implementation if resuming is not successful (e.g. if the server does not support resuming). This matches the behaviour of normal downloading clients, e.g. web browsers.

How about the number of attempts? Should we keep making new requests as long as the responses have a successful status code (e.g. 200) and non-empty bodies (i.e. some progress is made on each request)?
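
For concreteness, the policy I have in mind looks roughly like this (a sketch; download_range is a hypothetical helper that performs one Range request and appends whatever bytes it receives, and dest is assumed to already exist as a partial file):

    import os

    def fetch_while_progressing(url, dest, total_size, max_stalls=5):
        """Keep issuing range requests while each attempt makes progress."""
        stalls = 0
        while os.path.getsize(dest) < total_size:
            before = os.path.getsize(dest)
            try:
                download_range(url, dest, start=before)  # hypothetical helper
            except OSError:
                pass  # a dropped connection counts as zero progress
            if os.path.getsize(dest) == before:
                stalls += 1
                if stalls >= max_stalls:
                    raise RuntimeError("no progress after %d attempts" % stalls)
            else:
                stalls = 0  # progress was made, reset the stall counter
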

@uranusjr
Member

Instead of trying to guess how many attempts is reasonable, perhaps pip should store the incomplete download somewhere (e.g. in cache?) and resume it on the next pip install. This also better matches browser behaviour—the download is not re-attempted automatically, but the user can click a button to resume.

@CTimmerman

CTimmerman commented Jun 12, 2022

If-Unmodified-Since should ensure it's the same file and therefore safe to resume. https://gist.github.com/CTimmerman/ccf884f8c8dcc284588f1811ed99be6c
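
A minimal sketch of that validation (assuming last_modified was saved from the Last-Modified header of the response that started the download; the function name is illustrative):

    import requests

    def resume_if_unchanged(url, dest, offset, last_modified):
        """Append the missing bytes to dest, but only if the remote
        file has not changed since the partial download began."""
        headers = {
            "Range": "bytes=%d-" % offset,
            "If-Unmodified-Since": last_modified,
        }
        resp = requests.get(url, headers=headers, stream=True, timeout=30)
        if resp.status_code == 412:
            return False  # Precondition Failed: file changed, restart from zero
        resp.raise_for_status()
        if resp.status_code != 206:
            return False  # server ignored the Range header; restart from zero
        with open(dest, "ab") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
        return True
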

@yichi-yang

Instead of trying to guess how many attempts is reasonable, perhaps pip should store the incomplete download somewhere (e.g. in cache?) and resume it on the next pip install. This also better matches browser behaviour—the download is not re-attempted automatically, but the user can click a button to resume.

Currently pip uses CacheControl to handle HTTP caching, but CacheControl doesn't cache responses with incomplete bodies (or Range requests with status code 206), so it doesn't help in our case (an incomplete download). It seems to me that implementing a cache independent of the existing HTTP and wheel caches, for the sole purpose of resuming failed downloads, would be a lot of work.

Also, I'm not sure the browser behavior is desirable in this case. With large wheels (e.g. pytorch > 2 GB) and my crappy Internet, a download consistently fails 4~5 times before completing. If users are installing many large packages (e.g. from a requirements.txt), having to resume manually multiple times can be annoying. That's why I think opt-in might work better: in most cases resuming is not required, and when it is, we can present a warning informing users that 1) the download is incomplete, and 2) they can use a command line option to automatically resume the download next time.

@pradyunsg
Member

One caveat with trying to mimic the browser is that, unlike a browser's UI, which lets the user cancel/pause/resume any specific download, pip doesn't have such a rich user interface in the CLI.

We'd need to provide at least one knob for this resuming behaviour -- either opt-in or opt-out. I think that when you're not in "resume my downloads" mode, pip should also clean up any existing incomplete downloads.

That said, picking between opt-in and opt-out is not really a blocker to implementing either behaviour. It's a matter of changing a flag's default value in the PR (let's use a flag with values like --incomplete-downloads=resume/discard for handling this), which is easy enough. :)

@yichi-yang

yichi-yang commented Jul 17, 2022

I think my PR #11180 is ready for a first round of review. Suggestions for more meaningful flag names, log messages, and exception messages are welcome.

@Rom1deTroyes

Having the same problem downloading PyTorch + OpenCV on a Streamlit project for the third time today (connection lost after 6 hours...), I wonder whether making pip able to use an external downloader could be a thing? yt-dl provides:

    --external-downloader COMMAND        Use the specified external downloader.
                                         Currently supports aria2c,avconv,axel,
                                         curl,ffmpeg,httpie,wget
    --external-downloader-args ARGS      Give these arguments to the external
                                         downloader

Something like pip install --external-downloader wget --external-downloader-args '-r' requirements.txt?
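
Under the hood that could be little more than shelling out, e.g. (a sketch, not an existing pip option; wget's -c flag is what enables resuming):

    import subprocess

    def external_download(url, dest, downloader="wget", extra_args=()):
        # wget -c resumes a partial download; -O pins the output path.
        cmd = [downloader, "-c", "-O", dest, *extra_args, url]
        subprocess.run(cmd, check=True)  # raises if the downloader fails
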

@CTimmerman

Having the same problem downloading PyTorch + OpenCV on a Streamlit project for the third time today (connection lost after 6 hours...), I wonder whether making pip able to use an external downloader could be a thing? yt-dl provides:

    --external-downloader COMMAND        Use the specified external downloader.
                                         Currently supports aria2c,avconv,axel,
                                         curl,ffmpeg,httpie,wget
    --external-downloader-args ARGS      Give these arguments to the external
                                         downloader

Something like pip install --external-downloader wget --external-downloader-args '-r' requirements.txt?

Which of those also work on Windows? Resuming HTTP downloads is simple, as evidenced by the PR at #11180, which is fine by me, but I feel it's such a basic feature that it should be supported upstream.

@pradyunsg
Member

We're not going to use an external programme for network interaction within pip. This should be implemented as logic within pip itself.

@Nneji123

What's the progress on this feature? It's annoying to try to install packages like TensorFlow and PyTorch and then get errors when the downloads are almost complete.

@yichi-yang

What's the progress on this feature? It's annoying to try to install packages like TensorFlow and PyTorch and then get errors when the downloads are almost complete.

I have a proof-of-concept PR here: #11180. It's been a while since I last worked on it, and there has been some discussion about the user interface that I haven't incorporated into the PR.

Personally I feel the major problems are:

  1. We need to decide whether this is better fixed upstream (though I think parts of the resume logic will have to be handled by pip in either case).
  2. What user interface should we use?

I think it would be nice to have some input from the maintainers, e.g. priorities, expectations, etc.

@uranusjr
Member

By upstream do you mean requests? As for which UX to use, I don't think anyone has really expressed strong opinions, only pointed out things the end product needs to handle. So the best way to drive this forward would be to implement what you feel is best and see what people think of it.

@yichi-yang

By upstream do you mean requests? As for which UX to use, I don't think anyone has really expressed strong opinions, only pointed out things the end product needs to handle. So the best way to drive this forward would be to implement what you feel is best and see what people think of it.

Sounds good. I'll update that PR when I get time (I've been busy lately).
By upstream I'm referring to the issue that requests doesn't enforce the Content-Length check: psf/requests#4956.
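
For context, enforcing that check manually is straightforward (a sketch of what pip has to do itself as long as requests doesn't; the function name is illustrative):

    import requests

    def download_checked(url, dest):
        """Stream url to dest and verify the body length ourselves."""
        with requests.get(url, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            expected = int(resp.headers.get("Content-Length", -1))
            received = 0
            with open(dest, "wb") as f:
                for chunk in resp.iter_content(chunk_size=64 * 1024):
                    f.write(chunk)
                    received += len(chunk)
        if expected >= 0 and received != expected:
            raise IOError("incomplete body: got %d of %d bytes"
                          % (received, expected))
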

@nbkgit

nbkgit commented Apr 20, 2024

It's 2024 and there is still no resume for large packages. The connection is closed by the server and I have to start numpy and pyspark over and over again. A resume would save a lot of resources, as pip currently retrieves the same stream all over again.
I'm sorry that I'm not versed enough to write it myself, but it is necessary.

@mrlectus

mrlectus commented May 3, 2024

It's 2024 and there is still no resume for large packages. The connection is closed by the server and I have to start numpy and pyspark over and over again. A resume would save a lot of resources, as pip currently retrieves the same stream all over again. I'm sorry that I'm not versed enough to write it myself, but it is necessary.

Yes, very necessary.

@thk686

thk686 commented Jul 15, 2024

Currently in the western Amazon on a Starlink connection, trying to download birdnetlib, and this is killing me. It would be so much better to use the rsync protocol with checksums.

@gmargaritis

Hello everyone 👋

I opened a PR for this one (#12991). Happy to hear your thoughts and finally get it merged!
