Crawling multiple web pages #54

Open
joeyzhou98 opened this issue Aug 28, 2019 · 2 comments

@joeyzhou98

This might already be answered, but as far as I know there is no way to crawl and download files from multiple sub pages starting from one main page. For example, here:

[screenshot: Zenodo search results page listing many datasets]
We can see there are multiple datasets I want to download; however, there are no direct href download links on this page. I would need to click on a dataset I am interested in, and only then is there a download href link for its files:

[screenshot: an individual dataset page with a download link]

Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?

@kyleam
Collaborator

kyleam commented Aug 29, 2019

Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?

I don't know datalad-crawler's internals well. Poking around in the repo, I'd guess the way to do this would be with a recurse node. pipelines/abstractsonline.py seems to provide the clearest example. But looking at modules like pipelines/{openfmri,crcns}.py, I'd guess the preferred design is to make the pipeline work at the individual dataset level and then define superdataset_pipeline.
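
Very roughly, and untested, I'd imagine an individual-dataset pipeline looking something like the sketch below; the href regex is just a placeholder, not something checked against Zenodo's actual pages:

```python
# Hypothetical per-dataset pipeline sketch (untested); it follows the general
# shape of the modules in datalad_crawler/pipelines/.
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.annex import Annexificator


def pipeline(url):
    # Annexificator turns each matched URL into an annexed file in the dataset
    annex = Annexificator()
    return [
        crawl_url(url),                           # fetch the single record page
        a_href_match(r'.*/record/\d+/files/.*'),  # placeholder pattern for file links
        annex,                                    # add each matched URL to git-annex
        annex.finalize(),                         # commit the result
    ]
```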

@yarikoptic will be able to give a more informed response.

Two comments not directly related to your question:

  • I'm assuming the goal is to create a pipeline that works with a Zenodo dataset and then provide a DataLad superdataset that contains a collection of Zenodo datasets of interest, where that collection is much smaller than the 41,396 results you're showing in your screenshot.
  • I wondered whether Zenodo has an API for downloading. It seems like they do, but it's in beta (a rough sketch of querying it follows below).
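
For reference, here is a quick sketch of hitting that API directly with requests, outside of datalad-crawler. The query string and the exact response fields (hits.hits, files, links.self) are assumptions on my part and may need adjusting:

```python
# Rough sketch of querying the Zenodo records API directly (field names are
# assumptions and may differ, since the API is still in beta).
import requests

resp = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "electrophysiology", "size": 10},  # hypothetical search term
)
resp.raise_for_status()
for record in resp.json()["hits"]["hits"]:
    # open-access records list their files with direct download links
    for f in record.get("files", []):
        print(f["links"]["self"])
```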

@yarikoptic
Member

Yeap, you would probably want to first establish a pipeline that creates subdatasets (one per Zenodo dataset page), as @kyleam has pointed out, and then have each dataset crawled independently.

If you want/need to crawl into other pages, you can provide matchers to crawl_url, which can then be used to crawl multiple pages. See e.g. the superdataset pipeline at https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/crcns.py#L141, where we need to crawl multiple pages to identify all datasets.
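
Roughly (untested; the regexes and the 'zenodo' template name are placeholders), a Zenodo superdataset pipeline following that crcns pattern could look like:

```python
# Sketch of a superdataset pipeline in the spirit of crcns.py (untested;
# regexes and the 'zenodo' template name are placeholders).
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.annex import Annexificator


def superdataset_pipeline(url="https://zenodo.org/search?q=electrophysiology"):
    annex = Annexificator(no_annex=True, allow_dirty=True)
    return [
        crawl_url(
            url,
            # matchers let crawl_url follow additional pages, e.g. pagination
            matchers=[a_href_match(r'.*\?page=\d+.*')],
        ),
        # each link to an individual record page becomes a subdataset;
        # the named group feeds data_fields below
        a_href_match(r'.*zenodo\.org/record/(?P<dataset>\d+)$'),
        annex.initiate_dataset(
            template='zenodo',          # assumed name of the per-dataset pipeline
            data_fields=['dataset'],
        ),
    ]
```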
