Activation mechanism documentation added (#1935)

A few site checks fixed
soxoj · Dec 6, 2024 · f04de78
1 parent 260b80c

Showing 5 changed files with 145 additions and 66 deletions.
diff --git a/docs/source/development.rst b/docs/source/development.rst
@@ -110,6 +110,65 @@ There are few options for sites data.json helpful in various cases:
 - ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
 - ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
 
+.. _activation-mechanism:
+
+Activation mechanism
+--------------------
+
+The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
+
+Let's study the Vimeo site check record from the Maigret database:
+
+.. code-block:: json
+
+      "Vimeo": {
+          "tags": [
+              "us",
+              "video"
+          ],
+          "headers": {
+              "Authorization": "jwt eyJ0..."
+          },
+          "activation": {
+              "url": "https://vimeo.com/_rv/viewer",
+              "marks": [
+                  "Something strange occurred. Please get in touch with the app's creator."
+              ],
+              "method": "vimeo"
+          },
+          "urlProbe": "https://api.vimeo.com/users/{username}?fields=name...",
+          "checkType": "status_code",
+          "alexaRank": 148,
+          "urlMain": "https://vimeo.com/",
+          "url": "https://vimeo.com/{username}",
+          "usernameClaimed": "blue",
+          "usernameUnclaimed": "noonewouldeverusethis7"
+      },
+
+The activation method is:
+
+.. code-block:: python
+
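+    def vimeo(site, logger, cookies={}):
+        headers = dict(site.headers)  # work on a copy of the stored headers
+        if "Authorization" in headers:
+            del headers["Authorization"]  # drop the stale token before asking for a new one
+        import requests
+
+        r = requests.get(site.activation["url"], headers=headers)
+        jwt_token = r.json()["jwt"]  # the activation URL responds with JSON holding a new JWT
+        site.headers["Authorization"] = "jwt " + jwt_token  # store it for later checks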
+
+Here's how the activation process works when a JWT token becomes invalid:
+
+1. The site check makes an HTTP request to ``urlProbe`` with the invalid token
+2. The response contains an error message specified in the ``activation``/``marks`` field
+3. When this error is detected, the ``vimeo`` activation function is triggered
+4. The activation function obtains a new JWT token and updates it in the site check record
+5. On the next site check (either through retry or a new Maigret run), the valid token is used and the check succeeds
+
+Examples of activation method implementations are available in the `activation.py <https://github.com/soxoj/maigret/blob/main/maigret/activation.py>`_ file.
+
 How to publish new version of Maigret
 -------------------------------------
 

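Note on the five steps documented in development.rst above: the flow can be sketched offline as follows. This is only an illustration under stated assumptions, not Maigret's actual checking code. `FakeSite`, `check_once`, `fake_fetch`, and `fake_vimeo` are hypothetical names introduced here; only the `activation` keys (`marks`, `method`) and the error mark string come from the Vimeo record in the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Error mark copied from the Vimeo "activation" record in the diff
ERROR_MARK = "Something strange occurred. Please get in touch with the app's creator."


@dataclass
class FakeSite:
    """Hypothetical stand-in for a site check record, not the real Maigret class."""
    headers: Dict[str, str] = field(default_factory=dict)
    activation: Dict[str, Any] = field(default_factory=dict)


def needs_activation(site: FakeSite, body: str) -> bool:
    """Steps 2-3: the response contains one of the configured error marks."""
    return any(mark in body for mark in site.activation.get("marks", []))


def check_once(site: FakeSite,
               fetch: Callable[[FakeSite], str],
               activators: Dict[str, Callable[[FakeSite], None]]) -> str:
    """Run one probe; if an error mark is found, refresh credentials for the next probe."""
    body = fetch(site)
    if needs_activation(site, body):
        # Step 4: call the activation function named in the record
        activators[site.activation["method"]](site)
    return body


def fake_fetch(site: FakeSite) -> str:
    """Fake transport (assumption): accepts only the 'fresh' token."""
    if site.headers.get("Authorization") == "jwt fresh":
        return '{"name": "blue"}'
    return ERROR_MARK


def fake_vimeo(site: FakeSite) -> None:
    """Fake activator (assumption): stores a new token without any network call."""
    site.headers["Authorization"] = "jwt fresh"


site = FakeSite(
    headers={"Authorization": "jwt stale"},
    activation={"marks": [ERROR_MARK], "method": "vimeo"},
)
activators = {"vimeo": fake_vimeo}

first = check_once(site, fake_fetch, activators)   # step 1: probe with the stale token
second = check_once(site, fake_fetch, activators)  # step 5: the retry uses the new token
print(first == ERROR_MARK, second)
```

The first probe still returns the error body, which matches the documentation's point that a retry or another run is needed before the check succeeds.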
diff --git a/docs/source/features.rst b/docs/source/features.rst
@@ -147,16 +147,32 @@ Archives and mirrors checking
 
 The Maigret database contains not only the original websites, but also mirrors, archives, and aggregators. For example:
 
-- `Reddit BigData search <https://camas.github.io/reddit-search/>`_
 - `Picuki <https://www.picuki.com/>`_, Instagram mirror
-- `Twitter shadowban <https://shadowban.eu/>`_ checker
+- (no longer available) `Reddit BigData search <https://camas.github.io/reddit-search/>`_
+- (no longer available) `Twitter shadowban <https://shadowban.eu/>`_ checker
 
 It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)
 
+Activation
+----------
+The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
+
+It works by implementing a custom function that:
+
+1. Makes a specialized HTTP request to a specific website endpoint
+2. Processes the response
+3. Updates the headers/cookies for that site in the local Maigret database
+
+Since activation only triggers after encountering specific errors, a retry (or another Maigret run) is needed to obtain a valid response with the updated authentication.
+
+The activation mechanism is enabled by default and cannot currently be disabled.
+
+For more details, see :ref:`activation-mechanism` in the Development section.
+
 .. _extracting-information-from-pages:
 
-Extractiion of information from account pages
----------------------------------------------
+Extraction of information from account pages
+--------------------------------------------
 
 Maigret can parse URLs and content of web pages by URLs to extract info about account owner and other meta information.
 

diff --git a/maigret/resources/data.json b/maigret/resources/data.json
@@ -5260,19 +5260,18 @@
             "regexCheck": "^[a-zA-Z0-9_\\.]{3,49}(?<!\\.com|\\.org|\\.net)$",
             "checkType": "message",
             "absenceStrs": [
-                "EventProfilerImpl"
+                "rsrcTags"
             ],
             "presenseStrs": [
-                "userID"
+                "first_name"
             ],
             "headers": {
-                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
-                "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
+                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
             },
             "alexaRank": 10,
             "urlMain": "https://www.facebook.com/",
             "url": "https://www.facebook.com/{username}",
-            "usernameClaimed": "blue",
+            "usernameClaimed": "zuck",
             "usernameUnclaimed": "noonewouldeverusethis7",
             "tags": [
                 "networking"
@@ -6459,7 +6458,8 @@
             "urlMain": "https://shadowban.eu",
             "url": "https://shadowban.eu/{username}",
             "usernameClaimed": "alex",
-            "usernameUnclaimed": "noonewouldeverusethis7"
+            "usernameUnclaimed": "noonewouldeverusethis7",
+            "disabled": true
         },
         "Gamblejoe": {
             "tags": [
@@ -7013,7 +7013,7 @@
             "alexaRank": 1,
             "urlMain": "https://play.google.com/store",
             "url": "https://play.google.com/store/apps/developer?id={username}",
-            "usernameClaimed": "Skyeng",
+            "usernameClaimed": "OpenAI",
             "usernameUnclaimed": "noonewouldeverusethis7"
         },
         "Gorod.dp.ua": {
@@ -13445,7 +13445,7 @@
                 "Sorry, nobody on Reddit goes by that name."
             ],
             "presenseStrs": [
-                "Post Karma"
+                "Post karma"
             ],
             "alexaRank": 19,
             "urlMain": "https://www.reddit.com/",
@@ -17350,16 +17350,16 @@
                 "video"
             ],
             "headers": {
-                "Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE2OTgyMzM1MjAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbH0.e_hVzSccYGkrjpNoW3b5JpvCWVsNADv50DqFDFt_3No"
+                "Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3MzM0NDE4ODAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYzRlNDQ4ZTgtZmFmNC00OWY1LTkyYmMtZWVmZWMzNWNlOTM1In0.nm4mnYvn8hm3u5gfNXh1r451U-R5O2MFOqz40DqixQo"
             },
             "activation": {
                 "url": "https://vimeo.com/_rv/viewer",
                 "marks": [
-                    "Something strange occurred. Please contact the app owners."
+                    "Something strange occurred. Please get in touch with the app's creator."
                 ],
                 "method": "vimeo"
             },
-            "urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories&fetch_user_profile=1",
+            "urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Cmetadata.connections.vimeo_experts.is_enrolled%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories%2Cis_expert%2Cprofile_discovery%2Cwebsites%2Ccontact_emails&fetch_user_profile=1",
             "checkType": "status_code",
             "alexaRank": 148,
             "urlMain": "https://vimeo.com/",
@@ -18466,7 +18466,8 @@
             "url": "https://yandex.ru/collections/api/users/{username}/",
             "source": "Yandex",
             "usernameClaimed": "yandex",
-            "usernameUnclaimed": "noonewouldeverusethis7"
+            "usernameUnclaimed": "noonewouldeverusethis7",
+            "disabled": true
         },
         "YandexCollections API (by yandex_public_id)": {
             "tags": [
@@ -18666,41 +18667,47 @@
             "tags": [
                 "video"
             ],
+            "headers": {
+                "User-Agent": "curl/8.6.0",
+                "Accept": "*/*"
+            },
             "regexCheck": "^[^\\/]+$",
             "checkType": "message",
             "presenseStrs": [
-                "href=\"/feed/channel"
+                "visitorData",
+                "userAgent"
             ],
             "absenceStrs": [
-                "Error - Invidious",
-                "This channel does not exist"
+                "404 Not Found"
             ],
             "alexaRank": 2,
             "urlMain": "https://www.youtube.com/",
-            "url": "https://www.youtube.com/{username}",
-            "urlProbe": "https://invidious.slipfox.xyz/c/{username}",
+            "url": "https://www.youtube.com/@{username}",
             "usernameClaimed": "test",
-            "usernameUnclaimed": "noonewouldeverusethis7"
+            "usernameUnclaimed": "noonewouldeverusethis777"
         },
         "YouTube User": {
             "tags": [
                 "video"
             ],
+            "headers": {
+                "User-Agent": "curl/8.6.0",
+                "Accept": "*/*"
+            },
             "regexCheck": "^[^\\/]+$",
             "checkType": "message",
             "presenseStrs": [
-                "href=\"/feed/channel"
+                "visitorData",
+                "userAgent"
             ],
             "absenceStrs": [
-                "Error - Invidious",
-                "This channel does not exist"
+                "404 Not Found"
             ],
             "alexaRank": 2,
             "urlMain": "https://www.youtube.com/",
-            "url": "https://www.youtube.com/{username}",
-            "urlProbe": "https://invidious.slipfox.xyz/user/{username}",
-            "usernameClaimed": "blue",
-            "usernameUnclaimed": "noonewouldeverusethis7"
+            "url": "https://www.youtube.com/@{username}",
+            "usernameClaimed": "test",
+            "usernameUnclaimed": "noonewouldeverusethis777"
         },
         "Yummly": {
             "tags": [

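Note on the Vimeo `Authorization` header replaced in the data.json hunk above: a JWT carries its expiry in the `exp` claim of its payload, which is one way to see why a stored token eventually goes stale and activation is needed. The sketch below decodes the payload of the new header value from the diff without verifying the signature. The helper name `jwt_payload` is hypothetical, and the "jwt <token>" header layout is taken from the record itself.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_payload(header_value: str) -> dict:
    """Decode the unverified payload of a header like 'jwt <header>.<payload>.<signature>'."""
    token = header_value.split(" ", 1)[1]
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding, so restore the padding first
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# New header value, copied from the data.json diff
NEW_HEADER = (
    "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9."
    "eyJleHAiOjE3MzM0NDE4ODAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6"
    "InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYzRlNDQ4ZTgtZmFmNC00OWY1LTky"
    "YmMtZWVmZWMzNWNlOTM1In0."
    "nm4mnYvn8hm3u5gfNXh1r451U-R5O2MFOqz40DqixQo"
)

payload = jwt_payload(NEW_HEADER)
expires_at = datetime.fromtimestamp(payload["exp"], tz=timezone.utc)
print(payload["exp"], expires_at.isoformat())
```

Any token committed to data.json will be past its `exp` soon after the commit, so in practice the activation path is the normal way the Vimeo check obtains a working token.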
diff --git a/sites.md b/sites.md
@@ -22,7 +22,7 @@ Rank data fetched from Alexa by domains.
 1. ![](https://www.google.com/s2/favicons?domain=https://pt.bongacams.com) [BongaCams (https://pt.bongacams.com)](https://pt.bongacams.com)*: top 50, cz, webcam*
 1. ![](https://www.google.com/s2/favicons?domain=https://www.instagram.com/) [Instagram (https://www.instagram.com/)](https://www.instagram.com/)*: top 50, photo*, search is disabled
 1. ![](https://www.google.com/s2/favicons?domain=https://www.twitch.tv/) [Twitch (https://www.twitch.tv/)](https://www.twitch.tv/)*: top 50, streaming, us*
-1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*
+1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*, search is disabled
 1. ![](https://www.google.com/s2/favicons?domain=https://stackoverflow.com) [StackOverflow (https://stackoverflow.com)](https://stackoverflow.com)*: top 50, coding*
 1. ![](https://www.google.com/s2/favicons?domain=https://www.ebay.com/) [Ebay (https://www.ebay.com/)](https://www.ebay.com/)*: top 50, shopping, us*
 1. ![](https://www.google.com/s2/favicons?domain=https://naver.com) [Naver (https://naver.com)](https://naver.com)*: top 50, kr*
@@ -804,7 +804,7 @@ Rank data fetched from Alexa by domains.
 1. ![](https://www.google.com/s2/favicons?domain=https://forums.gentoo.org) [gentoo (https://forums.gentoo.org)](https://forums.gentoo.org)*: top 100K, fi, forum, in*
 1. ![](https://www.google.com/s2/favicons?domain=https://community.asterisk.org) [community.asterisk.org (https://community.asterisk.org)](https://community.asterisk.org)*: top 100K, forum, in, ir, jp, us*
 1. ![](https://www.google.com/s2/favicons?domain=https://www.gapyear.com) [Gapyear (https://www.gapyear.com)](https://www.gapyear.com)*: top 100K, gb, in*
-1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa*
+1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa*, search is disabled
 1. ![](https://www.google.com/s2/favicons?domain=https://psyera.ru) [Psyera (https://psyera.ru)](https://psyera.ru)*: top 100K, ru*
 1. ![](https://www.google.com/s2/favicons?domain=http://forum.mfd.ru) [mfd (http://forum.mfd.ru)](http://forum.mfd.ru)*: top 100K, forum, ru*
 1. ![](https://www.google.com/s2/favicons?domain=https://forum.mirf.ru/) [mirf (https://forum.mirf.ru/)](https://forum.mirf.ru/)*: top 100K, forum, ru*
@@ -3130,21 +3130,20 @@ Rank data fetched from Alexa by domains.
 1. ![](https://www.google.com/s2/favicons?domain=https://massagerepublic.com) [massagerepublic.com (https://massagerepublic.com)](https://massagerepublic.com)*: top 100M*
 1. ![](https://www.google.com/s2/favicons?domain=https://mynickname.com) [mynickname.com (https://mynickname.com)](https://mynickname.com)*: top 100M*
 
-The list was updated at (2024-11-30)
-
+The list was updated at (2024-12-06)
 ## Statistics
 
-Enabled/total sites: 2693/3126 = 86.15%
+Enabled/total sites: 2691/3126 = 86.08%
 
-Incomplete message checks: 404/2693 = 15.0% (false positive risks)
+Incomplete message checks: 405/2691 = 15.05% (false positive risks)
 
-Status code checks: 618/2694 = 22.94% (false positive risks)
+Status code checks: 719/2691 = 26.72% (false positive risks)
 
-False positive risk (total): 37.97%
+False positive risk (total): 41.77%
 
 Top 20 profile URLs:
 - (796)	`{urlMain}/index/8-0-{username} (uCoz)`
-- (302)	`/{username}`
+- (300)	`/{username}`
 - (221)	`{urlMain}{urlSubpath}/members/?username={username} (XenForo)`
 - (160)	`/user/{username}`
 - (133)	`{urlMain}{urlSubpath}/member.php?username={username} (vBulletin)`
@@ -3154,7 +3153,7 @@ Top 20 profile URLs:
 - (88)	`/users/{username}`
 - (87)	`{urlMain}/u/{username}/summary (Discourse)`
 - (54)	`/wiki/User:{username}`
-- (49)	`/@{username}`
+- (51)	`/@{username}`
 - (42)	`SUBDOMAIN`
 - (41)	`/members/?username={username}`
 - (32)	`/members/{username}`
@@ -3164,25 +3163,24 @@ Top 20 profile URLs:
 - (17)	`/forum/members/?username={username}`
 - (17)	`/search.php?keywords=&terms=all&author={username}`
 
-
 Top 20 tags:
-- (1104)	`NO_TAGS` (non-standard)
-- (735)	`forum`
-- (80)	`gaming`
-- (48)	`photo`
-- (41)	`coding`
-- (30)	`tech`
-- (29)	`news`
-- (27)	`blog`
-- (23)	`music`
-- (18)	`finance`
-- (18)	`crypto`
-- (17)	`sharing`
-- (16)	`freelance`
-- (15)	`art`
-- (15)	`shopping`
-- (13)	`sport`
-- (13)	`business`
-- (12)	`movies`
-- (11)	`hobby`
-- (11)	`education`
+- (327)	`NO_TAGS` (non-standard)
+- (307)	`forum`
+- (50)	`gaming`
+- (26)	`coding`
+- (21)	`photo`
+- (20)	`blog`
+- (19)	`news`
+- (15)	`music`
+- (14)	`tech`
+- (12)	`sharing`
+- (12)	`freelance`
+- (12)	`finance`
+- (10)	`dating`
+- (10)	`art`
+- (10)	`shopping`
+- (10)	`movies`
+- (8)	`hobby`
+- (8)	`crypto`
+- (7)	`sport`
+- (7)	`hacking`
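Note on the Statistics hunk in sites.md above: the new percentages can be rechecked with a few lines of arithmetic. The sketch assumes the total false positive risk is the sum of the two partial ratios over the enabled-site count; that assumption matches the new figures in the diff but is not confirmed by the source. The helper `pct` is a hypothetical name.

```python
def pct(part: int, total: int) -> str:
    """Format part/total as a percentage with two decimals."""
    return f"{100 * part / total:.2f}%"


# Counts copied from the updated statistics in sites.md
enabled, total_sites = 2691, 3126
incomplete_message_checks = 405
status_code_checks = 719

print("Enabled/total sites:", pct(enabled, total_sites))
print("Incomplete message checks:", pct(incomplete_message_checks, enabled))
print("Status code checks:", pct(status_code_checks, enabled))
# Assumption: total risk is the sum of both risky check types over enabled sites
print("False positive risk (total):",
      pct(incomplete_message_checks + status_code_checks, enabled))
```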
diff --git a/tests/test_activation.py b/tests/test_activation.py
@@ -22,14 +22,13 @@
 """
 
 
-@pytest.mark.skip(reason="periodically fails")
 @pytest.mark.slow
-def test_twitter_activation(default_db):
-    twitter_site = default_db.sites_dict['Twitter']
-    token1 = twitter_site.headers['x-guest-token']
+def test_vimeo_activation(default_db):
+    vimeo_site = default_db.sites_dict['Vimeo']
+    token1 = vimeo_site.headers['Authorization']
 
-    ParsingActivator.twitter(twitter_site, Mock())
-    token2 = twitter_site.headers['x-guest-token']
+    ParsingActivator.vimeo(vimeo_site, Mock())
+    token2 = vimeo_site.headers['Authorization']
 
     assert token1 != token2