Activation mechanism documentation added (#1935)
Few site checks fixed
soxoj authored Dec 6, 2024
1 parent 260b80c commit f04de78
Showing 5 changed files with 145 additions and 66 deletions.
59 changes: 59 additions & 0 deletions docs/source/development.rst
@@ -110,6 +110,65 @@ There are few options for sites data.json helpful in various cases:
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
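
As an illustration of ``regexCheck``, the sketch below applies the pattern from the Facebook record in ``data.json`` (with the JSON escaping removed) to a few usernames. The helper name ``is_checkable`` and the use of ``re.search`` are assumptions made for this example; how Maigret applies the pattern internally is not shown here.

```python
import re

# Pattern copied from the Facebook entry in data.json, with the JSON
# double backslashes unescaped. It accepts 3-49 characters from
# [a-zA-Z0-9_.] and rejects names ending in .com, .org or .net.
USERNAME_RE = re.compile(r"^[a-zA-Z0-9_\.]{3,49}(?<!\.com|\.org|\.net)$")


def is_checkable(username: str) -> bool:
    """Hypothetical helper: True if the username passes the regex filter."""
    return USERNAME_RE.search(username) is not None


results = {name: is_checkable(name) for name in ("zuck", "ab", "example.com")}
```

A username that fails the pattern can be skipped for that site, which avoids the false positives a domain-like string such as ``example.com`` could otherwise produce.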

.. _activation-mechanism:

Activation mechanism
--------------------

The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.

Let's study the Vimeo site check record from the Maigret database:

.. code-block:: json

    "Vimeo": {
        "tags": [
            "us",
            "video"
        ],
        "headers": {
            "Authorization": "jwt eyJ0..."
        },
        "activation": {
            "url": "https://vimeo.com/_rv/viewer",
            "marks": [
                "Something strange occurred. Please get in touch with the app's creator."
            ],
            "method": "vimeo"
        },
        "urlProbe": "https://api.vimeo.com/users/{username}?fields=name...",
        "checkType": "status_code",
        "alexaRank": 148,
        "urlMain": "https://vimeo.com/",
        "url": "https://vimeo.com/{username}",
        "usernameClaimed": "blue",
        "usernameUnclaimed": "noonewouldeverusethis7"
    },

The activation method is:

.. code-block:: python

    def vimeo(site, logger, cookies={}):
        headers = dict(site.headers)
        if "Authorization" in headers:
            del headers["Authorization"]
        import requests
        r = requests.get(site.activation["url"], headers=headers)
        jwt_token = r.json()["jwt"]
        site.headers["Authorization"] = "jwt " + jwt_token

Here's how the activation process works when a JWT token becomes invalid:

1. The site check makes an HTTP request to ``urlProbe`` with the invalid token
2. The response contains an error message specified in the ``activation``/``marks`` field
3. When this error is detected, the ``vimeo`` activation function is triggered
4. The activation function obtains a new JWT token and updates it in the site check record
5. On the next site check (either through retry or a new Maigret run), the valid token is used and the check succeeds

Examples of activation mechanism implementations are available in the `activation.py <https://github.com/soxoj/maigret/blob/main/maigret/activation.py>`_ file.
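
The five steps above can be sketched as a small detect, activate, and retry loop. This is a simplified illustration rather than Maigret's actual control flow: ``SiteRecord``, ``needs_activation``, ``check_with_activation``, and the fake ``fetch``/``activate`` callables are hypothetical names, and the immediate retry is an assumption (in Maigret the fresh token may only be used on a later retry or run).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SiteRecord:
    """Hypothetical stand-in for a site check record (not Maigret's class)."""
    headers: Dict[str, str] = field(default_factory=dict)
    marks: List[str] = field(default_factory=list)


def needs_activation(site: SiteRecord, response_text: str) -> bool:
    """Steps 2-3: an activation mark in the response signals a stale token."""
    return any(mark in response_text for mark in site.marks)


def check_with_activation(
    site: SiteRecord,
    fetch: Callable[[SiteRecord], str],
    activate: Callable[[SiteRecord], None],
    max_retries: int = 1,
) -> str:
    """Run the check, refreshing credentials when a mark is detected."""
    text = fetch(site)                      # step 1: probe with current token
    for _ in range(max_retries):
        if not needs_activation(site, text):
            break
        activate(site)                      # step 4: update site.headers
        text = fetch(site)                  # step 5: retry with the new token
    return text


# Offline demonstration with fake network functions (no real HTTP calls).
def fake_fetch(site: SiteRecord) -> str:
    if site.headers.get("Authorization") == "jwt fresh":
        return '{"name": "blue"}'
    return "Something strange occurred. Please get in touch with the app's creator."


def fake_activate(site: SiteRecord) -> None:
    site.headers["Authorization"] = "jwt fresh"


site = SiteRecord(
    headers={"Authorization": "jwt stale"},
    marks=["Something strange occurred."],
)
result = check_with_activation(site, fake_fetch, fake_activate)
```

Passing ``fetch`` and ``activate`` as callables keeps the sketch testable without network access; a real activation function would perform the HTTP request itself, as ``vimeo`` does above.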

How to publish new version of Maigret
-------------------------------------

24 changes: 20 additions & 4 deletions docs/source/features.rst
@@ -147,16 +147,32 @@ Archives and mirrors checking

The Maigret database contains not only the original websites, but also mirrors, archives, and aggregators. For example:

- `Reddit BigData search <https://camas.github.io/reddit-search/>`_
- `Picuki <https://www.picuki.com/>`_, Instagram mirror
- `Twitter shadowban <https://shadowban.eu/>`_ checker
- (no longer available) `Reddit BigData search <https://camas.github.io/reddit-search/>`_
- (no longer available) `Twitter shadowban <https://shadowban.eu/>`_ checker

It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)

Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.

It works by implementing a custom function that:

1. Makes a specialized HTTP request to a specific website endpoint
2. Processes the response
3. Updates the headers/cookies for that site in the local Maigret database

Since activation only triggers after encountering specific errors, a retry (or another Maigret run) is needed to obtain a valid response with the updated authentication.

The activation mechanism is enabled by default, and cannot be disabled at the moment.

For more details, see :ref:`activation-mechanism` in the Development section.
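
A custom activation function following the three steps listed above might look like the sketch below. It is modeled on the Vimeo example from the Development section, but ``refresh_token``, the injected ``http_get`` callable, and the ``example.invalid`` URL are assumptions introduced so the example runs without network access; they are not part of Maigret's API.

```python
import json
from typing import Callable, Dict


def refresh_token(
    headers: Dict[str, str],
    activation_url: str,
    http_get: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Hypothetical activation helper returning refreshed headers."""
    # 1. Make a specialized request, leaving out the stale credential.
    request_headers = {k: v for k, v in headers.items() if k != "Authorization"}
    body = http_get(activation_url, request_headers)

    # 2. Process the response: here a JSON body with a "jwt" field is assumed.
    token = json.loads(body)["jwt"]

    # 3. Return updated headers; the caller stores them for the site.
    updated = dict(headers)
    updated["Authorization"] = "jwt " + token
    return updated


# Offline demonstration with a fake HTTP getter (no real request is made).
def fake_get(url: str, headers: Dict[str, str]) -> str:
    assert "Authorization" not in headers
    return json.dumps({"jwt": "new-token"})


new_headers = refresh_token(
    {"User-Agent": "demo", "Authorization": "jwt old"},
    "https://example.invalid/viewer",
    fake_get,
)
```

Returning a new dictionary instead of mutating the input is a design choice of this sketch; the documented ``vimeo`` function updates ``site.headers`` in place.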

.. _extracting-information-from-pages:

Extractiion of information from account pages
---------------------------------------------
Extraction of information from account pages
--------------------------------------------

Maigret can parse URLs and the content of the pages they point to in order to extract information about the account owner and other metadata.

57 changes: 32 additions & 25 deletions maigret/resources/data.json
@@ -5260,19 +5260,18 @@
"regexCheck": "^[a-zA-Z0-9_\\.]{3,49}(?<!\\.com|\\.org|\\.net)$",
"checkType": "message",
"absenceStrs": [
"EventProfilerImpl"
"rsrcTags"
],
"presenseStrs": [
"userID"
"first_name"
],
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
},
"alexaRank": 10,
"urlMain": "https://www.facebook.com/",
"url": "https://www.facebook.com/{username}",
"usernameClaimed": "blue",
"usernameClaimed": "zuck",
"usernameUnclaimed": "noonewouldeverusethis7",
"tags": [
"networking"
@@ -6459,7 +6458,8 @@
"urlMain": "https://shadowban.eu",
"url": "https://shadowban.eu/{username}",
"usernameClaimed": "alex",
"usernameUnclaimed": "noonewouldeverusethis7"
"usernameUnclaimed": "noonewouldeverusethis7",
"disabled": true
},
"Gamblejoe": {
"tags": [
@@ -7013,7 +7013,7 @@
"alexaRank": 1,
"urlMain": "https://play.google.com/store",
"url": "https://play.google.com/store/apps/developer?id={username}",
"usernameClaimed": "Skyeng",
"usernameClaimed": "OpenAI",
"usernameUnclaimed": "noonewouldeverusethis7"
},
"Gorod.dp.ua": {
@@ -13445,7 +13445,7 @@
"Sorry, nobody on Reddit goes by that name."
],
"presenseStrs": [
"Post Karma"
"Post karma"
],
"alexaRank": 19,
"urlMain": "https://www.reddit.com/",
@@ -17350,16 +17350,16 @@
"video"
],
"headers": {
"Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE2OTgyMzM1MjAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbH0.e_hVzSccYGkrjpNoW3b5JpvCWVsNADv50DqFDFt_3No"
"Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3MzM0NDE4ODAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYzRlNDQ4ZTgtZmFmNC00OWY1LTkyYmMtZWVmZWMzNWNlOTM1In0.nm4mnYvn8hm3u5gfNXh1r451U-R5O2MFOqz40DqixQo"
},
"activation": {
"url": "https://vimeo.com/_rv/viewer",
"marks": [
"Something strange occurred. Please contact the app owners."
"Something strange occurred. Please get in touch with the app's creator."
],
"method": "vimeo"
},
"urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories&fetch_user_profile=1",
"urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Cmetadata.connections.vimeo_experts.is_enrolled%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories%2Cis_expert%2Cprofile_discovery%2Cwebsites%2Ccontact_emails&fetch_user_profile=1",
"checkType": "status_code",
"alexaRank": 148,
"urlMain": "https://vimeo.com/",
@@ -18466,7 +18466,8 @@
"url": "https://yandex.ru/collections/api/users/{username}/",
"source": "Yandex",
"usernameClaimed": "yandex",
"usernameUnclaimed": "noonewouldeverusethis7"
"usernameUnclaimed": "noonewouldeverusethis7",
"disabled": true
},
"YandexCollections API (by yandex_public_id)": {
"tags": [
@@ -18666,41 +18667,47 @@
"tags": [
"video"
],
"headers": {
"User-Agent": "curl/8.6.0",
"Accept": "*/*"
},
"regexCheck": "^[^\\/]+$",
"checkType": "message",
"presenseStrs": [
"href=\"/feed/channel"
"visitorData",
"userAgent"
],
"absenceStrs": [
"Error - Invidious",
"This channel does not exist"
"404 Not Found"
],
"alexaRank": 2,
"urlMain": "https://www.youtube.com/",
"url": "https://www.youtube.com/{username}",
"urlProbe": "https://invidious.slipfox.xyz/c/{username}",
"url": "https://www.youtube.com/@{username}",
"usernameClaimed": "test",
"usernameUnclaimed": "noonewouldeverusethis7"
"usernameUnclaimed": "noonewouldeverusethis777"
},
"YouTube User": {
"tags": [
"video"
],
"headers": {
"User-Agent": "curl/8.6.0",
"Accept": "*/*"
},
"regexCheck": "^[^\\/]+$",
"checkType": "message",
"presenseStrs": [
"href=\"/feed/channel"
"visitorData",
"userAgent"
],
"absenceStrs": [
"Error - Invidious",
"This channel does not exist"
"404 Not Found"
],
"alexaRank": 2,
"urlMain": "https://www.youtube.com/",
"url": "https://www.youtube.com/{username}",
"urlProbe": "https://invidious.slipfox.xyz/user/{username}",
"usernameClaimed": "blue",
"usernameUnclaimed": "noonewouldeverusethis7"
"url": "https://www.youtube.com/@{username}",
"usernameClaimed": "test",
"usernameUnclaimed": "noonewouldeverusethis777"
},
"Yummly": {
"tags": [
60 changes: 29 additions & 31 deletions sites.md
@@ -22,7 +22,7 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://pt.bongacams.com) [BongaCams (https://pt.bongacams.com)](https://pt.bongacams.com)*: top 50, cz, webcam*
1. ![](https://www.google.com/s2/favicons?domain=https://www.instagram.com/) [Instagram (https://www.instagram.com/)](https://www.instagram.com/)*: top 50, photo*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://www.twitch.tv/) [Twitch (https://www.twitch.tv/)](https://www.twitch.tv/)*: top 50, streaming, us*
1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*
1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://stackoverflow.com) [StackOverflow (https://stackoverflow.com)](https://stackoverflow.com)*: top 50, coding*
1. ![](https://www.google.com/s2/favicons?domain=https://www.ebay.com/) [Ebay (https://www.ebay.com/)](https://www.ebay.com/)*: top 50, shopping, us*
1. ![](https://www.google.com/s2/favicons?domain=https://naver.com) [Naver (https://naver.com)](https://naver.com)*: top 50, kr*
@@ -804,7 +804,7 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://forums.gentoo.org) [gentoo (https://forums.gentoo.org)](https://forums.gentoo.org)*: top 100K, fi, forum, in*
1. ![](https://www.google.com/s2/favicons?domain=https://community.asterisk.org) [community.asterisk.org (https://community.asterisk.org)](https://community.asterisk.org)*: top 100K, forum, in, ir, jp, us*
1. ![](https://www.google.com/s2/favicons?domain=https://www.gapyear.com) [Gapyear (https://www.gapyear.com)](https://www.gapyear.com)*: top 100K, gb, in*
1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa*
1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://psyera.ru) [Psyera (https://psyera.ru)](https://psyera.ru)*: top 100K, ru*
1. ![](https://www.google.com/s2/favicons?domain=http://forum.mfd.ru) [mfd (http://forum.mfd.ru)](http://forum.mfd.ru)*: top 100K, forum, ru*
1. ![](https://www.google.com/s2/favicons?domain=https://forum.mirf.ru/) [mirf (https://forum.mirf.ru/)](https://forum.mirf.ru/)*: top 100K, forum, ru*
@@ -3130,21 +3130,20 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://massagerepublic.com) [massagerepublic.com (https://massagerepublic.com)](https://massagerepublic.com)*: top 100M*
1. ![](https://www.google.com/s2/favicons?domain=https://mynickname.com) [mynickname.com (https://mynickname.com)](https://mynickname.com)*: top 100M*

The list was updated at (2024-11-30)

The list was updated at (2024-12-06)
## Statistics

Enabled/total sites: 2693/3126 = 86.15%
Enabled/total sites: 2691/3126 = 86.08%

Incomplete message checks: 404/2693 = 15.0% (false positive risks)
Incomplete message checks: 405/2691 = 15.05% (false positive risks)

Status code checks: 618/2694 = 22.94% (false positive risks)
Status code checks: 719/2691 = 26.72% (false positive risks)

False positive risk (total): 37.97%
False positive risk (total): 41.77%

Top 20 profile URLs:
- (796) `{urlMain}/index/8-0-{username} (uCoz)`
- (302) `/{username}`
- (300) `/{username}`
- (221) `{urlMain}{urlSubpath}/members/?username={username} (XenForo)`
- (160) `/user/{username}`
- (133) `{urlMain}{urlSubpath}/member.php?username={username} (vBulletin)`
@@ -3154,7 +3153,7 @@ Top 20 profile URLs:
- (88) `/users/{username}`
- (87) `{urlMain}/u/{username}/summary (Discourse)`
- (54) `/wiki/User:{username}`
- (49) `/@{username}`
- (51) `/@{username}`
- (42) `SUBDOMAIN`
- (41) `/members/?username={username}`
- (32) `/members/{username}`
@@ -3164,25 +3163,24 @@ Top 20 profile URLs:
- (17) `/forum/members/?username={username}`
- (17) `/search.php?keywords=&terms=all&author={username}`


Top 20 tags:
- (1104) `NO_TAGS` (non-standard)
- (735) `forum`
- (80) `gaming`
- (48) `photo`
- (41) `coding`
- (30) `tech`
- (29) `news`
- (27) `blog`
- (23) `music`
- (18) `finance`
- (18) `crypto`
- (17) `sharing`
- (16) `freelance`
- (15) `art`
- (15) `shopping`
- (13) `sport`
- (13) `business`
- (12) `movies`
- (11) `hobby`
- (11) `education`
- (327) `NO_TAGS` (non-standard)
- (307) `forum`
- (50) `gaming`
- (26) `coding`
- (21) `photo`
- (20) `blog`
- (19) `news`
- (15) `music`
- (14) `tech`
- (12) `sharing`
- (12) `freelance`
- (12) `finance`
- (10) `dating`
- (10) `art`
- (10) `shopping`
- (10) `movies`
- (8) `hobby`
- (8) `crypto`
- (7) `sport`
- (7) `hacking`
11 changes: 5 additions & 6 deletions tests/test_activation.py
@@ -22,14 +22,13 @@
"""


-@pytest.mark.skip(reason="periodically fails")
 @pytest.mark.slow
-def test_twitter_activation(default_db):
-    twitter_site = default_db.sites_dict['Twitter']
-    token1 = twitter_site.headers['x-guest-token']
+def test_vimeo_activation(default_db):
+    vimeo_site = default_db.sites_dict['Vimeo']
+    token1 = vimeo_site.headers['Authorization']
 
-    ParsingActivator.twitter(twitter_site, Mock())
-    token2 = twitter_site.headers['x-guest-token']
+    ParsingActivator.vimeo(vimeo_site, Mock())
+    token2 = vimeo_site.headers['Authorization']
 
     assert token1 != token2
