Crawl seems to be missing some pages #9
Comments
The above example is wrong! Confusingly, those sub-pages of /owning-a-home/compare/ are actually under /owning-a-home/process/compare/, and those are getting crawled properly. A more thorough crawl still needs a closer review.
An example that fails: on Submit a complaint, the "Start a new complaint" button points to https://www.consumerfinance.gov/complaint/getting-started. Because we use Django's This URL shows up in wget's rejection log as being rejected because of reason
Fixed by #15; see for example that /complaint/getting-started/index.html is properly saved.
The crawl results seem to miss some pages, even though they should fall within the current specified depth.
Current behavior
Crawl results are incomplete. For example, last night's run used --depth=4, but the results under /owning-a-home/compare/ include only two sub-pages even though the page sidebar lists 5 sub-pages. From the page root, these links should only be at depth 3.
Expected behavior
All pages within the specified depth are crawled.
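One way to check this expectation is to compute which pages fall within a given depth and compare that set against the crawl output. The following is only a rough sketch, not the crawler's own logic: it assumes depth is counted in link hops from the start page (start page at depth 0), which may differ from how the crawler counts, and the link map, the paths, and the function names `pages_within_depth` and `missing_pages` are all hypothetical.

```python
from collections import deque


def pages_within_depth(links, root, max_depth):
    """Return {url: depth} for pages reachable from root in at most max_depth hops.

    `links` is a hypothetical mapping of page URL -> list of URLs it links to.
    """
    depths = {root: 0}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # pages at the limit are kept, but their links are not followed
        for target in links.get(url, ()):
            if target not in depths:  # first visit in BFS gives the shortest depth
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths


def missing_pages(links, root, max_depth, crawled):
    """List pages expected within max_depth that are absent from the crawl results."""
    expected = pages_within_depth(links, root, max_depth)
    return sorted(set(expected) - set(crawled))


# Made-up link structure and crawl output, for illustration only.
example_links = {
    "/": ["/a/"],
    "/a/": ["/a/b/", "/a/c/"],
    "/a/b/": ["/a/b/d/"],
    "/a/b/d/": ["/a/b/d/e/"],
}
example_crawled = ["/", "/a/", "/a/b/"]

print(missing_pages(example_links, "/", 3, example_crawled))
# prints ['/a/b/d/', '/a/c/']
```

Any page this reports as missing is a candidate for closer inspection, for instance to see whether its URL was rejected for a reason unrelated to depth.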