Crawling multiple web pages #54

Open
joeyzhou98 opened this issue Aug 28, 2019 · 2 comments

@joeyzhou98

This might already be answered, but as far as I know there is no way to crawl and download files from multiple sub pages starting from one main page. For example, here:

[screenshot: Zenodo search results page listing many datasets]
We can see there are multiple datasets I want to download; however, there are no direct href download links on this page. I would need to click on a dataset I am interested in, and only then is there a download href link for its files:

[screenshot: an individual dataset page with a download link]

Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?

@kyleam
Collaborator

kyleam commented Aug 29, 2019

Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?

I don't know datalad-crawler's internals well. Poking around in the repo, I'd guess the way to do this would be with a recurse node. pipelines/abstractsonline.py seems to provide the clearest example. But looking at modules like pipelines/{openfmri,crcns}.py, I'd guess the preferred design is to make the pipeline work at the individual dataset level and then define superdataset_pipeline.
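
Very roughly, and untested, I'd imagine an individual-dataset pipeline looking something like the sketch below; the href regex is just a placeholder, not something checked against Zenodo's actual pages:

```python
# Hypothetical per-dataset pipeline sketch (untested); it follows the general
# shape of the modules in datalad_crawler/pipelines/.
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.annex import Annexificator


def pipeline(url):
    # Annexificator turns each matched URL into an annexed file in the dataset
    annex = Annexificator()
    return [
        crawl_url(url),                           # fetch the single record page
        a_href_match(r'.*/record/\d+/files/.*'),  # placeholder pattern for file links
        annex,                                    # add each matched URL to git-annex
        annex.finalize(),                         # commit the result
    ]
```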

@yarikoptic will be able to give a more informed response.

Two comments not directly related to your question:

  • I'm assuming the goal is to create a pipeline that works with a Zenodo dataset and then provide a DataLad superdataset that contains a collection of Zenodo datasets of interest, where that collection is much smaller than the 41,396 results you're showing in your screenshot.
  • I wondered whether Zenodo has an API for downloading. It seems like they do, but it's in beta (a rough sketch of querying it follows below).
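
For reference, here is a quick sketch of hitting that API directly with requests, outside of datalad-crawler. The query string and the exact response fields (hits.hits, files, links.self) are assumptions on my part and may need adjusting:

```python
# Rough sketch of querying the Zenodo records API directly (field names are
# assumptions and may differ, since the API is still in beta).
import requests

resp = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "electrophysiology", "size": 10},  # hypothetical search term
)
resp.raise_for_status()
for record in resp.json()["hits"]["hits"]:
    # open-access records list their files with direct download links
    for f in record.get("files", []):
        print(f["links"]["self"])
```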

@yarikoptic
Member

Yeap, you would probably want to first establish a pipeline that creates subdatasets (one per Zenodo dataset page), as @kyleam has pointed out, and then have each dataset crawled independently.

If you want/need to crawl into other pages, you can provide matchers to crawl_url, which can then be used to crawl multiple pages. See e.g. the superdataset pipeline at https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/crcns.py#L141, where we need to crawl multiple pages to identify all datasets.
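
Roughly (untested; the regexes and the 'zenodo' template name are placeholders), a Zenodo superdataset pipeline following that crcns pattern could look like:

```python
# Sketch of a superdataset pipeline in the spirit of crcns.py (untested;
# regexes and the 'zenodo' template name are placeholders).
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.annex import Annexificator


def superdataset_pipeline(url="https://zenodo.org/search?q=electrophysiology"):
    annex = Annexificator(no_annex=True, allow_dirty=True)
    return [
        crawl_url(
            url,
            # matchers let crawl_url follow additional pages, e.g. pagination
            matchers=[a_href_match(r'.*\?page=\d+.*')],
        ),
        # each link to an individual record page becomes a subdataset;
        # the named group feeds data_fields below
        a_href_match(r'.*zenodo\.org/record/(?P<dataset>\d+)$'),
        annex.initiate_dataset(
            template='zenodo',          # assumed name of the per-dataset pipeline
            data_fields=['dataset'],
        ),
    ]
```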
