This project fetches the DOM of multiple URLs using Puppeteer, saves the content as HTML files, and processes several URLs concurrently. You can configure how many URLs are processed at the same time by passing a concurrency parameter.
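Conceptually, the concurrency limit works like a small promise pool: a fixed number of workers pull URLs from a shared queue until it is empty. The helper below is a hedged sketch of that idea (the function name `runWithConcurrency` is illustrative, not the project's actual code):

```javascript
// Run async task functions with at most `limit` running at once.
// Sketch only -- dom-scraper's real implementation may differ.
async function runWithConcurrency(tasks, limit) {
  const results = [];
  let next = 0;

  async function worker() {
    // Each worker claims the next unclaimed task index and runs it.
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  // Start `limit` workers (or fewer, if there are fewer tasks).
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

With this shape, processing five URLs at a time is just `runWithConcurrency(urls.map(u => () => fetchDom(u)), 5)`, where `fetchDom` stands in for the Puppeteer fetch step.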
- Clone the repository (or download the project files):

  ```bash
  git clone https://github.com/dudekm/dom-scraper.git
  ```

- Navigate to the project directory:

  ```bash
  cd dom-scraper
  ```

- Install the required Node.js packages:

  ```bash
  npm install
  ```

  This will install `puppeteer` and the other dependencies defined in `package.json`.
The script accepts three parameters from the command line:

- URL list file: a `.txt` file with one URL per line.
- Output directory: the directory where the HTML files will be saved.
- Concurrency: the number of URLs to process concurrently.
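A minimal sketch of how these three positional parameters might be read from `process.argv` (the actual `index.js` may parse them differently; `parseArgs` is a hypothetical helper name):

```javascript
// Parse the three positional CLI parameters.
// Sketch only -- the project's index.js may differ.
function parseArgs(argv) {
  // argv[0] is the node binary, argv[1] is the script path.
  const [urlFile, outputDir, concurrencyRaw] = argv.slice(2);
  if (!urlFile || !outputDir) {
    throw new Error('Usage: node index.js <urls.txt> <outputDir> <concurrency>');
  }
  const concurrency = Number.parseInt(concurrencyRaw ?? '1', 10);
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error('Concurrency must be a positive integer');
  }
  return { urlFile, outputDir, concurrency };
}
```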
An example `urls.txt`:

```text
https://example.com
https://another-example.com
https://example.org
https://another-site.com
```
Once you have installed the necessary dependencies and created your `urls.txt` file, you can run the project using Node.js.
To run the scraper, use the following command:

```bash
node index.js <urls.txt> <outputDir> <concurrency>
```

For example:

```bash
node index.js urls.txt ./dom_output 5
```
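Because each fetched page is written to the output directory as an HTML file, the filename has to be derived from the URL somehow. One common approach (an assumption for illustration; dom-scraper's actual naming scheme may differ) is to sanitize the URL's host and path:

```javascript
// Turn a URL into a filesystem-safe filename, e.g.
// "https://example.com/page" -> "example.com_page.html".
// Hypothetical helper -- not necessarily what dom-scraper does.
function urlToFilename(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/\/+$/, '')               // drop trailing slashes
    .replace(/[^a-zA-Z0-9.-]+/g, '_'); // replace unsafe characters
  return `${slug}.html`;
}
```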