{
"name": "scrape-it",
"description": "A Node.js scraper for humans.",
"keywords": [
"scrape",
"it",
"a",
"scraping",
"module",
"for",
"humans"
],
"license": "MIT",
"version": "6.1.3",
"main": "lib/index.js",
"types": "lib/index.d.ts",
"scripts": {
"test": "node test"
},
"author": "Ionică Bizău <[email protected]> (https://ionicabizau.net)",
"contributors": [
"ComFreek <[email protected]> (https://github.com/ComFreek)",
"Jim Buck <[email protected]> (https://github.com/JimmyBoh)",
"Non <[email protected]> (https://github.com/fadingNA)"
],
"repository": {
"type": "git",
"url": "git+ssh://[email protected]/IonicaBizau/scrape-it.git"
},
"bugs": {
"url": "https://github.com/IonicaBizau/scrape-it/issues"
},
"homepage": "https://github.com/IonicaBizau/scrape-it#readme",
"blah": {
"h_img": "https://i.imgur.com/j3Z0rbN.png",
"cli": "scrape-it-cli",
"description": [
"----",
"",
"<p align=\"center\">",
"Sponsored with :heart: by:",
"<br/><br/>",
"<a href=\"https://serpapi.com\">Serpapi.com</a> is a platform that allows you to scrape Google and other search engines from our fast, easy, and complete API.<br>",
"<a href=\"https://serpapi.com\"><img src=\"https://i.imgur.com/0Pk6Ysp.png\" width=\"250\" /></a>",
"<br/><br/>",
"",
"[Capsolver.com](https://www.capsolver.com/?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it) is an AI-powered service that specializes in solving various types of captchas automatically. It supports captchas such as [reCAPTCHA V2](https://docs.capsolver.com/guide/captcha/ReCaptchaV2.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [reCAPTCHA V3](https://docs.capsolver.com/guide/captcha/ReCaptchaV3.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [hCaptcha](https://docs.capsolver.com/guide/captcha/HCaptcha.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [FunCaptcha](https://docs.capsolver.com/guide/captcha/FunCaptcha.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [DataDome](https://docs.capsolver.com/guide/captcha/DataDome.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [AWS Captcha](https://docs.capsolver.com/guide/captcha/awsWaf.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [Geetest](https://docs.capsolver.com/guide/captcha/Geetest.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), and Cloudflare [Captcha](https://docs.capsolver.com/guide/antibots/cloudflare_turnstile.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it) / [Challenge 5s](https://docs.capsolver.com/guide/antibots/cloudflare_challenge.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), [Imperva / Incapsula](https://docs.capsolver.com/guide/antibots/imperva.html?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), among others. For developers, Capsolver offers API integration options detailed in their [documentation](https://docs.capsolver.com/?utm_source=github&utm_medium=banner_github&utm_campaign=scrape-it), facilitating the integration of captcha solving into applications.\nThey also provide browser extensions for [Chrome](https://chromewebstore.google.com/detail/captcha-solver-auto-captc/pgojnojmmhpofjgdmaebadhbocahppod) and [Firefox](https://addons.mozilla.org/es/firefox/addon/capsolver-captcha-solver/), making it easy to use their service directly within a browser. Different pricing packages are available to accommodate varying needs, ensuring flexibility for users.",
"<a href=\"https://capsolver.com/?utm_source=github&utm_medium=github_banner&utm_campaign=scrape-it\"><img src=\"https://i.imgur.com/lCngxre.jpeg\"/></a>",
"</p>",
"",
"----"
],
"installation": [
{
"h2": "FAQ"
},
{
"p": "Here are some frequent questions and their answers."
},
{
"h3": "1. How to scrape ajax pages?"
},
{
"p": "`scrape-it` has only a simple request module for making requests, so it cannot parse ajax pages directly. In general, you will run into one of these scenarios:"
},
{
"ol": [
"**The ajax response is in JSON format.** In this case, you can make the request directly, without needing a scraping library.",
"**The ajax response gives you HTML back.** Instead of calling the main website (e.g. example.com), pass the ajax URL (e.g. `example.com/api/that-endpoint`) to `scrape-it` and you will be able to parse the response.",
"**The ajax request is so complicated that you don't want to reverse-engineer it.** In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content, then use the `.scrapeHTML` method from `scrape-it` once the HTML has loaded on the page."
]
},
{
"h3": "2. Crawling"
},
{
"p": "There is no fancy way to crawl pages with `scrape-it`. For simple scenarios, you can parse the list of URLs from the initial page and then, using Promises, parse each page. Alternatively, you can use a separate crawler to download the website and then use the `.scrapeHTML` method to scrape the local files."
},
{
"h3": "3. Local files"
},
{
"p": "Use the `.scrapeHTML` method to parse HTML read from local files with `fs.readFile`."
}
]
},
"dependencies": {
"assured": "^1.0.15",
"cheerio": "^1.0.0",
"cheerio-req": "^2.0.0",
"scrape-it-core": "^1.0.0",
"typpy": "^2.3.13"
},
"devDependencies": {
"@types/cheerio": "^0.22.35",
"@types/node": "^22.7.4",
"lien": "^3.4.2",
"tester": "^1.4.5",
"ts-node": "^10.9.2",
"typescript": "^5.6.2"
},
"files": [
"bin/",
"app/",
"lib/",
"dist/",
"src/",
"scripts/",
"resources/",
"menu/",
"cli.js",
"index.js",
"index.d.ts",
"package-lock.json",
"bloggify.js",
"bloggify.json",
"bloggify/"
]
}