Wget truncates an overlong local filename (here, 288 characters) to 236 characters, as seen in the log:
The name is too long, 288 chars total.
Trying to shorten...
New name is www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.
(I believe this is because a filename can be at most 255 characters, and wget reserves a 19-character "chomp buffer" for appending things like .tmp, /index.html, etc.)
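The arithmetic in that hypothesis (255 - 19 = 236) can be checked against the log with a minimal sketch. The constants come from the explanation above; the function name shorten and the plain prefix cut are assumptions for illustration, not wget's actual implementation.

```python
# Sketch of the truncation arithmetic described above.
# Assumption: the name is simply cut to (limit - reserve) characters.
MAX_NAME_LENGTH = 255  # filename limit quoted in the text
CHOMP_BUFFER = 19      # space reserved for suffixes such as .tmp

LOCAL_PATH = (
    "www.consumerfinance.gov/ask-cfpb/"
    "i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-"
    "a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-"
    "if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-"
    "full-amount-what-should-i-do-en-835/index.html"
)


def shorten(path: str, max_len: int = MAX_NAME_LENGTH,
            reserve: int = CHOMP_BUFFER) -> str:
    """Return path unchanged if it fits, else cut it to max_len - reserve."""
    limit = max_len - reserve
    return path if len(path) <= limit else path[:limit]


if __name__ == "__main__":
    short = shorten(LOCAL_PATH)
    print(len(LOCAL_PATH), len(short))
    print(short)
```

Under these assumptions the shortened name ends in "-a-loan-fo", which matches the "New name" line in the log.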
Note that this new filename doesn't have an extension. Wget then tries to download to this location:
--2020-10-30 18:57:07-- https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/
...
Saving to ‘www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp’
...
Removing www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp since it should be rejected.
But it then removes the downloaded file because it doesn't end in .html! (I believe wget appends the .tmp suffix while downloading and then correctly ignores it when checking the extension.)
The diffs and output files in this repo become harder to use if some files don't end in .html. They become harder to edit locally and generally just less usable for downstream applications.
Problem 1 above (non-HTML files getting rejected) was resolved by #15.
Problem 2 above (downstream code relying on .html) will be resolved by #19.
Problem 3 above (complexity of having truncated filenames) is still a potential issue. Let's wait to see what kinds of issues this creates before we decide if we need to fix this or what the best solution might be.
Wget truncates long URLs when storing them as HTML files on disk.
For example, this URL is 286 characters:
https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/
When saved as a file named index.html under the www.consumerfinance.gov folder, the local path would be 288 characters:
www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/index.html
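The two character counts can be reproduced with a short sketch. It assumes the local path is the URL without its scheme, plus index.html when the URL ends in a slash; the function name url_to_local_path is hypothetical, and wget's real naming rules cover more cases (query strings, escaping, and so on).

```python
from urllib.parse import urlsplit

URL = (
    "https://www.consumerfinance.gov/ask-cfpb/"
    "i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-"
    "a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-"
    "if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-"
    "full-amount-what-should-i-do-en-835/"
)


def url_to_local_path(url: str) -> str:
    """Map a URL to host + path, adding index.html for directory-style URLs."""
    parts = urlsplit(url)
    path = parts.netloc + parts.path
    if path.endswith("/"):
        path += "index.html"
    return path


if __name__ == "__main__":
    print(len(URL))                     # length of the URL
    print(len(url_to_local_path(URL)))  # length of the local path
```

Dropping "https://" removes 8 characters and appending "index.html" adds 10, which is where the difference of 2 comes from.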
Current behavior
This truncating behavior causes a few problems:
1. Because the crawl uses --accept html, files that get truncated to end with some other extension (or get no extension at all) will be deleted, and won't be tracked in this repository. (See also "Crawl seems to be missing some pages" #9 (comment); this probably needs to be fixed by instead specifically using --reject on certain non-HTML files like .pdf, .jpg, etc.) (FIXED by "Reject non-HTML instead of accepting only HTML" #15)
2. Downstream code relying on .html won't work correctly. For example, generate_summary.sh currently deliberately diffs only HTML. Additionally, a file could (at least in theory) get truncated to end in an extension specified in our .gitignore!
3. The diffs and output files in this repo become harder to use if some files don't end in .html. They become harder to edit locally and generally just less usable for downstream applications.

Expected behavior
It would be better if we could somehow ensure that all files always get saved consistently as .html, or at least make it easier to track when this happens. As far as I can tell we can't do this with wget itself.
One idea would be to write a script that parses our wget.log file to generate a list of URLs and their truncated filenames. We could then have some other script that "corrects" those filenames.
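A minimal sketch of the log-parsing idea might look like the following. It assumes the log contains request lines of the form "--DATE TIME-- URL" and "Saving to ‘path’" lines like the ones quoted in this issue; the function names and the handling of the .tmp suffix are hypothetical, and the exact log wording may differ between wget versions and locales.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Patterns modelled on the log excerpts in this issue (assumed format).
REQUEST_RE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")
SAVING_RE = re.compile(r"^Saving to:? [‘'](.+)[’']\s*$")


def pair_urls_with_files(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (url, saved_path) pairs, dropping a trailing .tmp suffix."""
    current_url = None
    for line in lines:
        request = REQUEST_RE.match(line)
        if request:
            current_url = request.group(1)
            continue
        saving = SAVING_RE.match(line)
        if saving and current_url is not None:
            path = saving.group(1)
            if path.endswith(".tmp"):
                path = path[: -len(".tmp")]
            yield current_url, path


def find_non_html(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the pairs whose saved path does not end in .html."""
    return [(url, path) for url, path in pair_urls_with_files(lines)
            if not path.endswith(".html")]


if __name__ == "__main__":
    with open("wget.log", encoding="utf-8") as log:
        for url, path in find_non_html(log):
            print(f"{url}\t{path}")
```

The resulting list could then feed the second script that "corrects" the filenames; how to choose the corrected name is left open here, as in the issue.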