Long URLs get truncated by wget #13

chosak · 2020-11-02T20:24:55Z

Wget truncates long URLs when storing them as HTML files on disk.

For example, this URL is 286 characters:

https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/

When saved as a file named index.html under the www.consumerfinance.gov folder, the resulting path would be 288 characters:

www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/index.html

Wget truncates this to 236 characters, as seen in the log:

The name is too long, 288 chars total.
Trying to shorten...
New name is www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.

(I believe this is done because a filename can be at most 255 characters, and wget reserves a 19-character "chomp buffer" for appending things like .tmp, /index.html, etc.)
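
A rough sketch of that arithmetic, assuming the 255 and 19 values above are right (they are taken from this issue, not checked against wget's source). shorten_name is a hypothetical helper for illustration, not wget's own code:

```python
NAME_MAX = 255      # assumed maximum filename length, per the note above
CHOMP_BUFFER = 19   # assumed reserve for suffixes such as .tmp or /index.html


def shorten_name(name, name_max=NAME_MAX, chomp_buffer=CHOMP_BUFFER):
    """Cut name down to the assumed limit, mimicking the truncation in the log."""
    limit = name_max - chomp_buffer  # 255 - 19 = 236
    return name if len(name) <= limit else name[:limit]


# A 288-character name, like the one in the log, ends up 236 characters long.
print(len(shorten_name("x" * 288)))  # prints 236
```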

Note that this new filename doesn't have an extension. Wget then tries to download to this location:

--2020-10-30 18:57:07--  https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/
...
Saving to ‘www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp’
...
Removing www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp since it should be rejected.

But it then removes the downloaded file because the name doesn't end in .html! (I believe that wget appends the .tmp suffix while downloading and then correctly ignores it when checking the filename.)
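
A simplified model of that check, assuming --accept html behaves like a plain suffix match and that the .tmp suffix is ignored as guessed above. is_accepted is a hypothetical helper, and wget's real matching rules (wildcard patterns, case handling) are not reproduced here:

```python
def is_accepted(saved_name, accept=("html",)):
    """Return True if the name would survive a suffix-only accept list."""
    # Assumption: the temporary .tmp suffix is not considered by the check.
    if saved_name.endswith(".tmp"):
        saved_name = saved_name[: -len(".tmp")]
    return any(saved_name.endswith(suffix) for suffix in accept)


# A normal page is kept; the truncated name from the log is not.
print(is_accepted("www.consumerfinance.gov/ask-cfpb/index.html.tmp"))  # True
print(is_accepted("www.consumerfinance.gov/ask-cfpb/a-loan-fo.tmp"))   # False
```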

Current behavior

This truncating behavior causes a few problems:

1. Because we only --accept html, files that get truncated to end with some other extension (or with no extension at all) will be deleted and won't be tracked in this repository. (See also Crawl seems to be missing some pages #9 (comment); this probably needs to be fixed by instead specifically using --reject on certain non-HTML files like .pdf, .jpg, etc.) (FIXED by Reject non-HTML instead of accepting only HTML #15)
2. Any downstream code that depends on these files ending in .html won't work correctly. For example, generate_summary.sh currently deliberately diffs only HTML. Additionally, a file could (at least in theory) get truncated to end in an extension specified in our .gitignore!
3. The diffs and output files in this repo become harder to use if some files don't end in .html. They become harder to edit locally and generally just less usable for downstream applications.

Expected behavior

It would be better if we could somehow ensure that all files always get saved consistently as .html or, at least, make it easier to track when this happens. As far as I can tell, we can't do this with wget itself.

One idea would be to write a script that parses our wget.log file to generate a list of URLs and their truncated filenames. We could then have some other script that "corrects" those filenames.
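
A minimal sketch of what that first script might look like. It assumes the log contains the two line shapes shown in the excerpts above (a timestamped request line followed by a "Saving to" line) and that a directory URL normally maps to host/path/index.html. The names expected_path and find_truncated are hypothetical, and query strings or other wget filename rules are not handled:

```python
import re

# Line shapes assumed from the log excerpts in this issue.
REQUEST_RE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")
SAVING_RE = re.compile(r"^Saving to:? [‘`'](.+)[’']$")


def expected_path(url):
    """Guess the untruncated local path for a URL (simplified assumption)."""
    path = re.sub(r"^[a-z]+://", "", url)
    return path + "index.html" if path.endswith("/") else path


def find_truncated(log_lines):
    """Return (url, saved_path) pairs where the saved path differs from the guess."""
    results = []
    current_url = None
    for raw in log_lines:
        line = raw.rstrip()
        request = REQUEST_RE.match(line)
        if request:
            current_url = request.group(1)
            continue
        saving = SAVING_RE.match(line)
        if saving and current_url:
            saved = saving.group(1)
            if saved.endswith(".tmp"):  # drop the temporary download suffix
                saved = saved[: -len(".tmp")]
            if saved != expected_path(current_url):
                results.append((current_url, saved))
            current_url = None
    return results
```

It could be run over the log with something like `find_truncated(open("wget.log", encoding="utf-8"))`; a second script could then use the resulting pairs to rename or otherwise "correct" the files.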

chosak · 2020-11-05T20:15:36Z

Problem 1 above (non-HTML files getting rejected) was resolved by #15.

Problem 2 above (downstream code relying on .html) will be resolved by #19.

Problem 3 above (complexity of having truncated filenames) is still a potential issue. Let's wait to see what kinds of issues this creates before we decide if we need to fix this or what the best solution might be.

This was referenced Nov 2, 2020

Reject non-HTML instead of accepting only HTML #15

Merged

Improve summary by diffing all intended files #19

Merged

chosak mentioned this issue Dec 2, 2020

Make it easier to map search results to URLs #30

Open
