Release #27

Merged 3 commits on Jun 20, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -5,5 +5,5 @@ test*.js
*.phar
*.lock
*-lock.json
-screen_*.png
+*.png
vendor
24 changes: 12 additions & 12 deletions README.md
@@ -1,12 +1,12 @@
# Scrapper

-## STATUS [PENDING]
+## STATUS [ACTIVE]

-## UPDATE 05/20/2024
+## UPDATE 06/20/2024
**At this time `Intermarche`,`SystemeU` and `Leclerc` use `Datadome` protection**
- `Intermarche` -> Impossible for me to bypass the new version of Datadome -> Target waiting
-- `SystemeU` -> Bypass the old version of Datadome in this website
-- `Leclerc` -> Bypass OK
+- `SystemeU` -> Same to Intermarche , bypass with proxy and IP Rotating is possible
+- `Leclerc` -> Same to Intermarche

## PRESHOT 2024 TARGET EVOLUTION
- `SystemeU` -> Update the version of the DataDome Solution
@@ -54,7 +54,6 @@ Or API :
<summary>Paths</summary>
<pre>
dev
-├── copy_all_leclerc.html
└── JSON_updates.php
project
├── infos_programs.php
@@ -104,7 +103,7 @@ README.md

## Version

-### V1.4.1
+### V1.5
- Basic version of scrapper :
- [x] http, https
- [x] html content generate by JS -> `puppeteer`
@@ -114,11 +113,11 @@ README.md

- Specific version for specific website :
- The french supermarket compagny :
-- [Leclerc](https://leclerc.fr) :
+- [Leclerc](https://leclerc.fr) [**BLOCKED**]:
- [x] parse specific JS -> json
- [x] usage of https of [basic version](scrapper.php) :
- [x] NoBot Solutions **DataDome** Solution
-- [x] Bypass NoBot Solutions with knownledge of all stores (`libJSON/leclercs.json`)
+- Try Bypass NoBot Solutions with knownledge of all stores (`libJSON/leclercs.json`) (works before Datadome Solution buy)
- [Carrefour](https://www.carrefour.fr) :
- [x] parse specific JS -> json
- [x] usage of `php-webdriver`
@@ -135,12 +134,13 @@ README.md
- [Intermaché](https://www.intermarche.com) [**BLOCKED**] :
- [x] parse specific JS -> json
- [x] usage of `php-webdriver`
-- [x] NoBot Solutions -> **DataDome** Solution -> `NEW_VERSION`
-- [SystemeU](https://www.magasins-u.com) [**UPDATE SOON FOR NEW PUPPETEER VERSION**]:
+- [x] NoBot Solutions -> **DataDome** Solution
+- [SystemeU](https://www.magasins-u.com) [**BLOCKED**]:
- [x] parse specific JS -> json (products only on the display page)
- [ ] usage of `puppeteer` or `php-webdriver` **IMPOSSIBLE**
-- [x] NoBot Solutions -> **DataDome** Solution -> `OLD VERSION`
-- [x] Necessary to use `puppeteer-extra-plugin-stealth`
+- [x] NoBot Solutions -> **DataDome** Solution
+- [x] Necessary to use `puppeteer-extra-plugin-stealth` -> not enough
+- Try Bypass with src/libJSON/* (scrape2() in `scrape_su.js`) but blocked again


## Features
87 changes: 79 additions & 8 deletions dev/JSON_updates.php
@@ -1,6 +1,7 @@
<?php


+//-------------------------------- LECLERC SUB_PART --------------------------//
$test = "0011_2_23_Aix_25km";
$test2 = "0012_3_24_Paris_800km";
$keys = ["LeclercCode","idBaseLeclerc","PostalCode","NameCity","DistanceToTarget"];
@@ -58,20 +59,25 @@ function my_json_rec_encode(string $actual,$associative_tabs, string $tabs) {
for($cpt = 0 ; $cpt < $size-1;$cpt++) {
$next = $associative_tabs[$keys[$cpt]];
$rtn .= "$tabs\"".$keys[$cpt] . "\" : " . my_json_rec_encode($actual,$next,$tabs."\t")."";
-$rtn .= (is_array($next)) ? ((array_keys($next)[0]!="0") ? "$tabs},\n" : "$tabs],\n") : ",\n";
+$rtn .= (is_array($next)) ? ((array_keys($next)[0]!="0") ? "$tabs},\n" : ((sizeof($next) > 1) ? "$tabs],\n" : ",\n" )) : ",\n";
}
$next = $associative_tabs[$keys[$size-1]];
$rtn .= "$tabs\"".$keys[$size-1] . "\" : " . my_json_rec_encode($actual,$next,$tabs."\t")."";
-$rtn .= (is_array($next)) ? ((array_keys($next)[0]!="0") ? "$tabs}\n" : "$tabs]\n") : "\n";
+$rtn .= (is_array($next)) ? ((array_keys($next)[0]!="0") ? "$tabs}\n" : ((sizeof($next) > 1) ? "$tabs],\n" : "\n" )) : "\n";

}
else {
-$rtn .= $actual . "[\n";
$size = sizeof($associative_tabs);
-for($cpt = 0 ; $cpt < $size-1;$cpt++) {
-$rtn .= "$tabs".$associative_tabs[$cpt].",\n";
+if($size == 1) {
+$rtn .= $associative_tabs[0] ."";
+}
+else {
+$rtn .= $actual . "[\n";
+for($cpt = 0 ; $cpt < $size-1;$cpt++) {
+$rtn .= "$tabs".$associative_tabs[$cpt].",\n";
}
+$rtn .= "$tabs".$associative_tabs[$size-1]."\n";
+}
-$rtn .= "$tabs".$associative_tabs[$size-1]."\n";
}
return $rtn;
}
@@ -90,7 +96,72 @@ function my_json_rec_encode(string $actual,$associative_tabs, string $tabs) {

}

-$file_content = file_get_contents('copy_all_leclerc.html');
+//-------------------------------- SYSTEMU SUB_PART --------------------------//
+
+/*$file_content = file_get_contents('copy_all_leclerc.html');
$rtn = array_lines_to_my_json(explode("\n",$file_content),$keys,"PostalCode");
-echo $rtn;
+echo $rtn;*/
+/*function extract_href_su() {
+$file_content = file_get_contents('copy_all_systemeu.html');
+$lines = explode("\nurl",$file_content);
+$sub_lines = "";
+foreach($lines as $l) {
+$sub_lines .= "url".substr($l,0,strpos($l,">"))."\n";
+}
+return $sub_lines;
+}
+echo extract_href_su();*/
+function town_in_specific_syntax(string $town) : string {
+$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
+'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
+'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
+'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
+'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', '&amp;' => '');
+$str = strtr( $town, $unwanted_array );
+$tr = str_replace(array(' ','+','\''), '',$str);
+return strtolower($tr);
+}
+
+function change_lines_su() {
+$file_content = file_get_contents('links_idf_systemeu.html');
+$lines = explode("\n",$file_content);
+$sub_lines = "";
+foreach($lines as $l) {
+$town_type = (explode(";",town_in_specific_syntax($l)));
+$sub_lines .= $town_type[1] . "-".$town_type[0]."\n";
+}
+return $sub_lines;
+}
+//echo change_lines_su();
+
+function create_json_per_ens() {
+$file_content = file_get_contents('links_systemeu_sort.html');
+$lines = explode("\n",$file_content);
+$types = ["uexpress","hyperu","superu"];
+$arr = array();//array([$types[0],$types[1],$types[2]]);
+foreach($lines as $l) {
+$i = 0;
+$l = substr($l,31);
+$t = "";
+for($i = 0; $i < 3; $i++)
+if((strpos($l,$t=$types[$i])===0))
+break;
+
+$arr[$t][] = "\"".substr($l,strlen($t)+1)."\"";
+}
+return $arr;
+}
+
+function create_json_per_city() {
+$file_content = file_get_contents('links_systemeu_sort.html');
+$lines = explode("\n",$file_content);
+foreach($lines as $l) {
+$l = substr($l,31);
+$t = substr($l,0,$p=strpos($l,"-"));
+$arr[substr($l,$p+1)][] = "\"".$t."\"";
+}
+return $arr;
+}
+//echo my_json_encode(create_json_per_ens());
+//echo my_json_encode(create_json_per_city());
?>
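
The grouping step performed by `create_json_per_ens()` in `dev/JSON_updates.php` can be sketched in standalone form. This is a minimal sketch under stated assumptions, not the code merged in this PR: `group_slugs_by_type` is a hypothetical helper, the input is assumed to be already reduced to `type-city` slugs (the PR strips a 31-character prefix from each line first), and PHP's built-in `json_encode` stands in for the repository's `my_json_encode`. Unlike the loop in the diff, which appears to file a blank or unrecognised line under the last type it tried, the sketch skips such lines.

```php
<?php
// Hypothetical helper (not part of the PR): group "type-city" slugs by store type.
function group_slugs_by_type(array $slugs, array $types): array {
    $groups = [];
    foreach ($slugs as $slug) {
        $slug = trim($slug);
        if ($slug === '') {
            continue; // ignore blank lines, e.g. the one left by a trailing newline
        }
        foreach ($types as $type) {
            // Accept the slug only when it starts with "<type>-".
            if (strpos($slug, $type . '-') === 0) {
                $groups[$type][] = substr($slug, strlen($type) + 1);
                continue 2; // slug handled, move on to the next one
            }
        }
        // Slugs that match no known type are dropped (assumption: they are noise).
    }
    return $groups;
}

// Illustrative data only; the real input file is not shown in the diff.
$slugs  = ["superu-paris", "hyperu-aix", "superu-lyon", "", "unknown-nice"];
$groups = group_slugs_by_type($slugs, ["uexpress", "hyperu", "superu"]);

echo json_encode($groups, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), "\n";
```

Using `json_encode` also avoids hand-quoting each value and hand-placing commas and brackets, which is what the `my_json_rec_encode` changes in this PR adjust for single-element arrays.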