Merge pull request #25 from chriSmile0/process_usage
Process usage
chriSmile0 authored May 20, 2024
2 parents 84e871c + abc43aa commit f873d7d
Showing 19 changed files with 1,294 additions and 344 deletions.
22 changes: 22 additions & 0 deletions LICENSE.md
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2024 chriSmile0

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

83 changes: 48 additions & 35 deletions README.md
@@ -1,5 +1,25 @@
# Scrapper

## STATUS [PENDING]

## UPDATE 05/20/2024
**At this time `Intermarche`, `SystemeU` and `Leclerc` use `DataDome` protection**
- `Intermarche` -> I cannot bypass the new version of DataDome -> target on hold
- `SystemeU` -> Bypass works against the old version of DataDome on this website
- `Leclerc` -> Bypass OK

## PRESHOT 2024 TARGET EVOLUTION
- `SystemeU` -> Update the version of the DataDome solution
- `Auchan` and `Carrefour` -> Add the DataDome solution
- `Monoprix` -> No protection
- `Leclerc` -> Need to rebuild the website pathing to use the DataDome solution correctly

## PRESHOT 2024 TOOL EVOLUTION
- `php-webdriver` -> May soon be deprecated for web scraping
- `puppeteer` -> Needs more updates to hide headless mode (waiting)
- `playwright` -> Microsoft tool (Ubuntu 20.* or newer)
- `selenium` -> Next tool to test for scraping targets (well-known tool)

## Disclaimer
- **_This tool is not for collecting personal information_**
- Please respect the [GDPR rules](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679)
@@ -11,7 +31,7 @@ It's possible to look the content of a website pages with the browser with this
- `CTRL+SHIFT+I` for web inspector -> Console, possibility to change the displayed content

Or with special library and framework like :
- Selenium (PHP)
- Selenium (Python,...)
- Goutte (Symfony)
- Scrapy (Python)

@@ -25,8 +45,6 @@ Or API :
- With [puppeteer](https://github.com/puppeteer/puppeteer)
- With [puppeteer-extra](https://github.com/berstend/puppeteer-extra)



## Why
- For my project **PriceComparator**
- Developing your own tools is important for understanding and learning many things.
@@ -36,36 +54,31 @@ Or API :
<summary>Paths</summary>
<pre>
dev
├── JSON_updates.php
└── copy_all_leclerc.html
├── copy_all_leclerc.html
└── JSON_updates.php
project
├── project.php
└── infos_programs.php
├── infos_programs.php
└── project.php
src
├── test_rq_submod.js
├── test_extra_puppeteer.js
├── scrapper_systemeu.php
├── scrapper.php
├── scrapper_monoprix.php
├── scrapper_leclerc.php
├── scrapper_intermarche.php
├── scrapper_carrefour.php
├── scrapper_auchan.php
├── scrape_su.js
├── control_google_.js
├── DatadomeBreaker/
├── libJSON/
├── scrape.js
├── products_su.txt
├── libJSON
│ └── leclercs.json
├── DatadomeBreaker
├── screen_deps
├── README.md
├── package.json
├── outs
├── canvas_lib
└── break.js
├── scrape_su.js
├── scrapper_auchan.php
├── scrapper_carrefour.php
├── scrapper_intermarche.php
├── scrapper_leclerc.php
├── scrapper_monoprix.php
├── scrapper.php
├── scrapper_systemeu.php
├── test_extra_puppeteer.js
└── test_rq_submod.js
your_project
├── process_p.php
├── proofs/
├── README.md
└── example.php
└── usage.php
composer.json
package.json
README.md
@@ -82,7 +95,7 @@ README.md
### LIKE A PROJECT :
- `composer require php-webdriver/php-webdriver`
- `project.php` to learn how the different tools work
- `src/scrapper*.php` the different files for the scraping task
- `scrapper*.php` the different files for the scraping task
- `vendor` add lib for php-webdriver
- `node_modules(hide with .gitignore)` for node.js module
- `*.json/*.txt` for the various tests used to build an efficient scraping program
@@ -91,7 +104,7 @@ README.md

## Version

### V1.2
### V1.4.1
- Basic version of scrapper :
- [x] http, https
- [x] html content generate by JS -> `puppeteer`
@@ -119,15 +132,15 @@ README.md
- [x] usage of `puppeteer` or `php-webdriver` is possible
- [x] products for all stores in the target country
- [ ] NoBot Solutions
- [Intermarché](https://www.intermarche.com) :
- [Intermarché](https://www.intermarche.com) [**BLOCKED**] :
- [x] parse specific JS -> json
- [x] usage of `php-webdriver`
- [ ] NoBot Solutions
- [Systeme_U](https://www.magasins-u.com) :
- [x] NoBot Solutions -> **DataDome** Solution -> `NEW_VERSION`
- [SystemeU](https://www.magasins-u.com) [**UPDATE SOON FOR NEW PUPPETEER VERSION**]:
- [x] parse specific JS -> json (products only on the display page)
- [ ] usage of `puppeteer` or `php-webdriver` **IMPOSSIBLE**
- [x] NoBot Solutions -> **DataDome** Solution
- [x] NoBot Solutions -> **DataDome** Solution -> `OLD VERSION`
- [x] Necessary to use `puppeteer-extra-plugin-stealth`


## Features
## Features
7 changes: 3 additions & 4 deletions composer.json
@@ -1,6 +1,6 @@
{
"description": "scrapping of http(s) file",
"license": "proprietary",
"description": "Scrapping French 'Drive' Supermarket Website",
"license": "MIT",
"name": "chrismile0/scrapper",
"autoload": {
"psr-4": {
@@ -21,5 +21,4 @@
"require": {
"php-webdriver/webdriver": "^1.13"
}

}
}
1 change: 0 additions & 1 deletion dev/JSON_updates.php
@@ -1,6 +1,5 @@
<?php

use function PHPSTORM_META\type;

$test = "0011_2_23_Aix_25km";
$test2 = "0012_3_24_Paris_800km";
13 changes: 10 additions & 3 deletions package.json
@@ -1,8 +1,15 @@
{
"dependencies": {
"puppeteer": "^21.7.0",
"chromedriver": "^123.0.4",
"puppeteer": "^22.6.5",
"puppeteer-extra": "^3.3.6",
"puppeteer-extra-plugin-adblocker": "^2.13.6",
"puppeteer-extra-plugin-stealth": "^2.11.2"
"puppeteer-extra-plugin-stealth": "^2.11.2",
"selenium-stealth": "^1.0.1",
"selenium-webdriver": "^4.20.0",
"geckodriver":"4.4.0"
},
"devDependencies": {
"@types/node": "^20.12.7"
}
}
}
26 changes: 8 additions & 18 deletions project/infos_programs.php
@@ -54,7 +54,7 @@ function print_info_scrapper_leclerc() {
(Not usage of the second parameters for the moment)
[MAIN]
php $prog [url] [research_product_type] --with-openssl
php $prog [research_product_type] --with-openssl
[PARAMETERS]
[research_product_type] -> the product we search,\033[01;37m lardons/allumettes\033[0m
@@ -86,15 +86,13 @@ function print_info_scrapper_carrefour() {
(Not usage of the second parameters for the moment)
[MAIN]
php $prog [url] [research_product_type] --with-openssl
php $prog [research_product_type] --with-openssl
[PARAMETERS]
[URL] -> \033[01;37m https://www.carrefour.fr/s?q=lardons \033[0m
[research_product_type] -> the product we search,\033[01;37m lardons/allumettes\033[0m\n\n";
echo "[**END CARREFOUR SCRAPPER INFORMATIONS**]\n";
}

// NEXT RELEASE
/**
* [BRIEF] [INFO_PRINTER_SCRAPPER_AUCHAN]
* @param void
@@ -119,10 +117,9 @@ function print_info_scrapper_auchan() {
(Not usage of the second parameters for the moment)
[MAIN]
php $prog [url] [research_product_type] [town] --with-openssl
php $prog [research_product_type] [town] --with-openssl
[PARAMETERS]
[URL] -> \033[01;37m https://www.auchan.fr \033[0m
[research_product_type] -> the product we search,\033[01;37m lardons/oeufs\033[0m
[town] -> the research area \033[01;37mParis/Lyon\033[0m\n\n";

@@ -143,16 +140,14 @@ function print_info_scrapper_systemeu() {
For scrap this set of data we use \033[01;37m puppeteer-extra-plugin-stealth\033[0m
This js/array contain many data per product but we extract many of
these for our tool (the most useful (price,brand) for the moment
(The all items is on the same URL (not necessary to go to another url)
these for our tool (the most useful (price,brand) for the moment.
The result is an array of all products we are in the URL target
[MAIN]
php $prog [url] [research_product_type] [town] --with-openssl
php $prog [research_product_type] [town] --with-openssl
[PARAMETERS]
[URL] -> \033[01;37m https://www.coursesu.com/drive/home \033[0m
[research_product_type] -> the product we search,\033[01;37m lardons/allumettes\033[0m
[town] -> the research area \033[01;37mParis/Lyon\033[0m\n\n";

@@ -174,16 +169,12 @@ function print_info_scrapper_monoprix() {
\033[01;37m puppeteer\033[0m
This js/array contain many data per product but we extract many of
these for our tool (the most useful (price,brand) for the moment
(The all items is on the same URL (not necessary to go to another url)
The result is an array of all products we are in the URL target
these for our tool (the most useful (price,brand) for the moment.
[MAIN]
php $prog [url] [research_product_type] --with-openssl
php $prog [research_product_type] --with-openssl
[PARAMETERS]
[URL] -> \033[01;37m https://courses.monoprix.fr/products/search?q=\033[0m
[research_product_type] -> the product we search,\033[01;37m lardons/allumettes\033[0m\n\n";

echo "[**END MONOPRIX SCRAPPER INFORMATIONS**]\n";
@@ -213,10 +204,9 @@ function print_info_scrapper_intermarche() {
(Not usage of the second parameters for the moment)
[MAIN]
php $prog [url] [research_product_type] [town] --with-openssl
php $prog [research_product_type] [town] --with-openssl
[PARAMETERS]
[URL] -> \033[01;37m https://www.intermarche.fr \033[0m
[research_product_type] -> the product we search,\033[01;37m lardons/oeufs\033[0m
[town] -> the research area \033[01;37m Paris/Lyon\033[0m\n\n";

7 changes: 4 additions & 3 deletions project/project.php
@@ -1,6 +1,6 @@
<?php
require_once('infos_programs.php');
$version = "1.0";
$version = "1.4";
$programs = [ "scrapper.php","scrapper_leclerc.php","scrapper_carrefour.php",
"scrapper_intermarche.php","scrapper_auchan.php",
"scrapper_monoprix.php", "scrapper_systemeu.php"
@@ -82,7 +82,8 @@ function print_help() {
*/
function print_version() {
echo "Version of Scrapping program : ". $GLOBALS["version"] ."\n";
echo "Copyright @-2024 [:chriSmile0:] \n";
$dt = new DateTime("now", new DateTimeZone('America/New_York'));
echo "Copyright @-".$dt->format('Y')." [:chriSmile0:] \n";
}

/**
@@ -115,5 +116,5 @@ function main($argc, $argv) : bool {
echo "EXECUTION FINISH WITH SUCCESS \n";
return 1;
}
//main($argc,$argv);
main($argc,$argv);
?>
31 changes: 31 additions & 0 deletions src/control_google_.js
@@ -0,0 +1,31 @@
const {By, Builder, Browser} = require('selenium-webdriver');
const assert = require("assert");

(async function firstTest() { // TEST SOON WAITED WORK, TRY CLICK
let driver;

try {
driver = await new Builder().forBrowser(Browser.CHROME).build();
await driver.get('https://intermarche.com');

let title = await driver.getTitle();
console.log(title);
//assert.equal("Web form", title);

await driver.manage().setTimeouts({implicit: 500});

/*let textBox = await driver.findElement(By.name('my-text'));
let submitButton = await driver.findElement(By.css('button'));
await textBox.sendKeys('Selenium');
await submitButton.click();
let message = await driver.findElement(By.id('message'));
let value = await message.getText();
assert.equal("Received!", value);*/
} catch (e) {
console.log(e)
} finally {
await driver.quit();
}
}())
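A `finally` block that calls `driver.quit()` unconditionally will itself throw if `build()` failed, because `driver` is then still `undefined`. The following is a minimal sketch of a guard, not the repository's code: `withDriver` and `makeStubFactory` are hypothetical names introduced here for illustration, and the stub stands in for a real Selenium driver so the snippet runs without a browser.

```javascript
// Hypothetical helper: `withDriver` is NOT part of selenium-webdriver or this repository.
// `buildDriver` is any async factory returning an object with an async quit() method.
async function withDriver(buildDriver, task) {
  let driver;
  try {
    driver = await buildDriver();
    return await task(driver);
  } finally {
    // Quit only when the driver was actually created.
    if (driver) {
      await driver.quit();
    }
  }
}

// Hypothetical stub factory standing in for
// `new Builder().forBrowser(Browser.CHROME).build()`.
function makeStubFactory(log) {
  return async () => ({
    getTitle: async () => 'stub title',
    quit: async () => { log.push('quit'); },
  });
}
```

With the real library, the factory would presumably be `() => new Builder().forBrowser(Browser.CHROME).build()` and the task would contain the `driver.get(...)` and `driver.getTitle()` calls from the file above.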
8 changes: 4 additions & 4 deletions src/scrape_su.js
@@ -39,7 +39,7 @@ puppeteer.use(AdblockerPlugin({ blockTrackers: true }))

async function scrape(url,town,product) {
// -- ERROR BOT DETECTION AFTER MANY GOOD TRY WITH THIS SEQUENCE OF INSTRUCTIONS -- //
const browser = await puppeteer.launch({headless: true});
const browser = await puppeteer.launch({headless: 'new'});
const page = await browser.newPage();
await page.goto(url);
const url_ = await page.url();
@@ -50,13 +50,14 @@ async function scrape(url,town,product) {
console.log(dD);
if(dD != -1) {
console.log("DataDome activate");
break_js.loadedBrk(page,url,'#captcha__puzzle','.slider',"canva_rd.png","screen_su.png");
//break_js.loadedBrk(page,url,'#captcha__puzzle','.slider',"canva_rd.png","screen_su.png");
}
else {
console.log("No DataDome");
}
// -----------------------NO DETECTION BOT USAGE (15 hit OK) ----------------------------//
await page.waitForTimeout(2000);
//await page.waitForTimeout(2000);
await page.screenshot({path:'screen_home_su.png'});
await page.waitForSelector('#popin_tc_privacy_button_2');
await page.click('#popin_tc_privacy_button_2');
// -----------------------NO DETECTION BOT USAGE (15 hit OK)----------------------------//
@@ -129,7 +130,6 @@ async function scrape(url,town,product) {
await browser.close();
}
url = "https://www.coursesu.com/drive/home";
//url_t = "http://localhost/tests_htmls/sliders/slider.html";
town = argv[2];
product = argv[3];
scrape(url,town,product);
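The script compares `dD` against `-1` and reads `town` and `product` from positional arguments; the lines that compute `dD` are not shown in this diff. The sketch below assumes an `indexOf`-style check on the page URL or content. The names `detectDataDome`, `parseArgs` and the marker string are illustrative assumptions, not the repository's actual values.

```javascript
// Assumption: the marker is illustrative only; the real check in scrape_su.js
// lives in lines that this diff does not show.
const DATADOME_MARKER = 'captcha-delivery.com';

// Hypothetical helper: returns the index of the marker in `text`, or -1 if absent,
// matching the `dD != -1` comparison used in the script.
function detectDataDome(text, marker = DATADOME_MARKER) {
  return text.indexOf(marker);
}

// Hypothetical helper: reads town and product from a process.argv-shaped array
// (argv[2] and argv[3]) and fails early when either one is missing.
function parseArgs(argv) {
  const [town, product] = argv.slice(2);
  if (!town || !product) {
    throw new Error('usage: node scrape_su.js <town> <product>');
  }
  return { town, product };
}
```

Failing early on missing arguments avoids launching a browser with `undefined` values for `town` or `product`.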