Activation mechanism documentation added (#1935)

Few site checks fixed
Soxoj
2024-12-06 01:35:19 +01:00
committed by GitHub
parent 260b80c2f1
commit f04de78682
5 changed files with 145 additions and 66 deletions
+59
@@ -110,6 +110,65 @@ There are few options for sites data.json helpful in various cases:
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site - ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives - ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
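As an illustration of ``regexCheck``, here is a minimal sketch that applies the pattern from the Facebook record in this commit to a few candidate usernames. The helper name ``passes_regex_check`` is hypothetical, and matching with ``re.search`` is an assumption about how such a pattern could be applied, not a statement about Maigret's internals.

```python
import re

# Pattern taken from the Facebook record's "regexCheck" value (JSON escapes removed).
FACEBOOK_REGEX_CHECK = r"^[a-zA-Z0-9_\.]{3,49}(?<!\.com|\.org|\.net)$"


def passes_regex_check(username: str, regex_check: str) -> bool:
    """Hypothetical helper: return True if the username matches the site's pattern."""
    return re.search(regex_check, username) is not None


# Usernames that fail the pattern can be skipped without any HTTP request.
for candidate in ("zuck", "example.com", "ab"):
    print(candidate, passes_regex_check(candidate, FACEBOOK_REGEX_CHECK))
```

Filtering this way avoids requests for strings that cannot be valid usernames on the site, such as domain-like names or names that are too short.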
.. _activation-mechanism:
Activation mechanism
--------------------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
Let's study the Vimeo site check record from the Maigret database:
.. code-block:: json

    "Vimeo": {
        "tags": [
            "us",
            "video"
        ],
        "headers": {
            "Authorization": "jwt eyJ0..."
        },
        "activation": {
            "url": "https://vimeo.com/_rv/viewer",
            "marks": [
                "Something strange occurred. Please get in touch with the app's creator."
            ],
            "method": "vimeo"
        },
        "urlProbe": "https://api.vimeo.com/users/{username}?fields=name...",
        "checkType": "status_code",
        "alexaRank": 148,
        "urlMain": "https://vimeo.com/",
        "url": "https://vimeo.com/{username}",
        "usernameClaimed": "blue",
        "usernameUnclaimed": "noonewouldeverusethis7"
    },

The activation method is:
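.. code-block:: python

    def vimeo(site, logger, cookies={}):
        # Copy the stored headers and drop the stale token before requesting a new one
        headers = dict(site.headers)
        if "Authorization" in headers:
            del headers["Authorization"]
        import requests

        # The activation URL responds with JSON that carries a fresh token in "jwt"
        r = requests.get(site.activation["url"], headers=headers)
        jwt_token = r.json()["jwt"]
        site.headers["Authorization"] = "jwt " + jwt_token
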
Here's how the activation process works when a JWT token becomes invalid:
1. The site check makes an HTTP request to ``urlProbe`` with the invalid token
2. The response contains an error message specified in the ``activation``/``marks`` field
3. When this error is detected, the ``vimeo`` activation function is triggered
4. The activation function obtains a new JWT token and updates it in the site check record
5. On the next site check (either through retry or a new Maigret run), the valid token is used and the check succeeds
Examples of activation mechanism implementations are available in the `activation.py <https://github.com/soxoj/maigret/blob/main/maigret/activation.py>`_ file.
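The five steps above can be sketched as a small, network-free simulation. Everything here is hypothetical scaffolding rather than Maigret's own code: ``SiteStub``, ``needs_activation``, ``check_with_activation``, the fake fetch function, and the in-loop retry are assumptions made only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class SiteStub:
    """Hypothetical stand-in for a site check record (not Maigret's class)."""
    headers: Dict[str, str]
    activation: Dict[str, object]


def needs_activation(site: SiteStub, response_text: str) -> bool:
    """Step 2: the response contains one of the ``activation``/``marks`` strings."""
    return any(mark in response_text for mark in site.activation.get("marks", []))


def check_with_activation(
    site: SiteStub,
    fetch: Callable[[Dict[str, str]], Tuple[int, str]],
    activators: Dict[str, Callable[[SiteStub], None]],
    max_attempts: int = 2,
) -> Tuple[int, str]:
    """Steps 1-5: probe, detect the mark, activate, then retry with new headers."""
    status, text = 0, ""
    for _ in range(max_attempts):
        status, text = fetch(dict(site.headers))        # step 1 (and step 5 on retry)
        if not needs_activation(site, text):            # step 2
            break
        activators[site.activation["method"]](site)     # steps 3 and 4
    return status, text


MARK = "Something strange occurred. Please get in touch with the app's creator."


def fake_fetch(headers: Dict[str, str]) -> Tuple[int, str]:
    """Pretend API: accepts only the fresh token, otherwise returns the error mark."""
    if headers.get("Authorization") == "jwt fresh":
        return 200, '{"name": "blue"}'
    return 401, MARK


def fake_vimeo(site: SiteStub) -> None:
    """Pretend activator: stores a new token in the site record."""
    site.headers["Authorization"] = "jwt fresh"


site = SiteStub(
    headers={"Authorization": "jwt stale"},
    activation={"marks": [MARK], "method": "vimeo"},
)
status, text = check_with_activation(site, fake_fetch, {"vimeo": fake_vimeo})
print(status, site.headers["Authorization"])
```

The first probe fails with the mark, the activator replaces the token, and only the second probe succeeds, which is why a retry or a new run is needed after activation.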
How to publish new version of Maigret How to publish new version of Maigret
------------------------------------- -------------------------------------
+20 -4
@@ -147,16 +147,32 @@ Archives and mirrors checking
The Maigret database contains not only the original websites, but also mirrors, archives, and aggregators. For example: The Maigret database contains not only the original websites, but also mirrors, archives, and aggregators. For example:
- `Reddit BigData search <https://camas.github.io/reddit-search/>`_
- `Picuki <https://www.picuki.com/>`_, Instagram mirror - `Picuki <https://www.picuki.com/>`_, Instagram mirror
- `Twitter shadowban <https://shadowban.eu/>`_ checker - (no longer available) `Reddit BigData search <https://camas.github.io/reddit-search/>`_
- (no longer available) `Twitter shadowban <https://shadowban.eu/>`_ checker
It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.) It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)
Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
It works by implementing a custom function that:
1. Makes a specialized HTTP request to a specific website endpoint
2. Processes the response
3. Updates the headers/cookies for that site in the local Maigret database
Since activation only triggers after encountering specific errors, a retry (or another Maigret run) is needed to obtain a valid response with the updated authentication.
The activation mechanism is enabled by default, and cannot be disabled at the moment.
For more details, see :ref:`activation-mechanism` in the Development section.
.. _extracting-information-from-pages: .. _extracting-information-from-pages:
Extractiion of information from account pages Extraction of information from account pages
--------------------------------------------- --------------------------------------------
Maigret can parse URLs and content of web pages by URLs to extract info about account owner and other meta information. Maigret can parse URLs and content of web pages by URLs to extract info about account owner and other meta information.
+32 -25
@@ -5260,19 +5260,18 @@
"regexCheck": "^[a-zA-Z0-9_\\.]{3,49}(?<!\\.com|\\.org|\\.net)$", "regexCheck": "^[a-zA-Z0-9_\\.]{3,49}(?<!\\.com|\\.org|\\.net)$",
"checkType": "message", "checkType": "message",
"absenceStrs": [ "absenceStrs": [
"EventProfilerImpl" "rsrcTags"
], ],
"presenseStrs": [ "presenseStrs": [
"userID" "first_name"
], ],
"headers": { "headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
}, },
"alexaRank": 10, "alexaRank": 10,
"urlMain": "https://www.facebook.com/", "urlMain": "https://www.facebook.com/",
"url": "https://www.facebook.com/{username}", "url": "https://www.facebook.com/{username}",
"usernameClaimed": "blue", "usernameClaimed": "zuck",
"usernameUnclaimed": "noonewouldeverusethis7", "usernameUnclaimed": "noonewouldeverusethis7",
"tags": [ "tags": [
"networking" "networking"
@@ -6459,7 +6458,8 @@
"urlMain": "https://shadowban.eu", "urlMain": "https://shadowban.eu",
"url": "https://shadowban.eu/{username}", "url": "https://shadowban.eu/{username}",
"usernameClaimed": "alex", "usernameClaimed": "alex",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7",
"disabled": true
}, },
"Gamblejoe": { "Gamblejoe": {
"tags": [ "tags": [
@@ -7013,7 +7013,7 @@
"alexaRank": 1, "alexaRank": 1,
"urlMain": "https://play.google.com/store", "urlMain": "https://play.google.com/store",
"url": "https://play.google.com/store/apps/developer?id={username}", "url": "https://play.google.com/store/apps/developer?id={username}",
"usernameClaimed": "Skyeng", "usernameClaimed": "OpenAI",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7"
}, },
"Gorod.dp.ua": { "Gorod.dp.ua": {
@@ -13445,7 +13445,7 @@
"Sorry, nobody on Reddit goes by that name." "Sorry, nobody on Reddit goes by that name."
], ],
"presenseStrs": [ "presenseStrs": [
"Post Karma" "Post karma"
], ],
"alexaRank": 19, "alexaRank": 19,
"urlMain": "https://www.reddit.com/", "urlMain": "https://www.reddit.com/",
@@ -17350,16 +17350,16 @@
"video" "video"
], ],
"headers": { "headers": {
"Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE2OTgyMzM1MjAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbH0.e_hVzSccYGkrjpNoW3b5JpvCWVsNADv50DqFDFt_3No" "Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3MzM0NDE4ODAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYzRlNDQ4ZTgtZmFmNC00OWY1LTkyYmMtZWVmZWMzNWNlOTM1In0.nm4mnYvn8hm3u5gfNXh1r451U-R5O2MFOqz40DqixQo"
}, },
"activation": { "activation": {
"url": "https://vimeo.com/_rv/viewer", "url": "https://vimeo.com/_rv/viewer",
"marks": [ "marks": [
"Something strange occurred. Please contact the app owners." "Something strange occurred. Please get in touch with the app's creator."
], ],
"method": "vimeo" "method": "vimeo"
}, },
"urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories&fetch_user_profile=1", "urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Cmetadata.connections.vimeo_experts.is_enrolled%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories%2Cis_expert%2Cprofile_discovery%2Cwebsites%2Ccontact_emails&fetch_user_profile=1",
"checkType": "status_code", "checkType": "status_code",
"alexaRank": 148, "alexaRank": 148,
"urlMain": "https://vimeo.com/", "urlMain": "https://vimeo.com/",
@@ -18466,7 +18466,8 @@
"url": "https://yandex.ru/collections/api/users/{username}/", "url": "https://yandex.ru/collections/api/users/{username}/",
"source": "Yandex", "source": "Yandex",
"usernameClaimed": "yandex", "usernameClaimed": "yandex",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7",
"disabled": true
}, },
"YandexCollections API (by yandex_public_id)": { "YandexCollections API (by yandex_public_id)": {
"tags": [ "tags": [
@@ -18666,41 +18667,47 @@
"tags": [ "tags": [
"video" "video"
], ],
"headers": {
"User-Agent": "curl/8.6.0",
"Accept": "*/*"
},
"regexCheck": "^[^\\/]+$", "regexCheck": "^[^\\/]+$",
"checkType": "message", "checkType": "message",
"presenseStrs": [ "presenseStrs": [
"href=\"/feed/channel" "visitorData",
"userAgent"
], ],
"absenceStrs": [ "absenceStrs": [
"Error - Invidious", "404 Not Found"
"This channel does not exist"
], ],
"alexaRank": 2, "alexaRank": 2,
"urlMain": "https://www.youtube.com/", "urlMain": "https://www.youtube.com/",
"url": "https://www.youtube.com/{username}", "url": "https://www.youtube.com/@{username}",
"urlProbe": "https://invidious.slipfox.xyz/c/{username}",
"usernameClaimed": "test", "usernameClaimed": "test",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis777"
}, },
"YouTube User": { "YouTube User": {
"tags": [ "tags": [
"video" "video"
], ],
"headers": {
"User-Agent": "curl/8.6.0",
"Accept": "*/*"
},
"regexCheck": "^[^\\/]+$", "regexCheck": "^[^\\/]+$",
"checkType": "message", "checkType": "message",
"presenseStrs": [ "presenseStrs": [
"href=\"/feed/channel" "visitorData",
"userAgent"
], ],
"absenceStrs": [ "absenceStrs": [
"Error - Invidious", "404 Not Found"
"This channel does not exist"
], ],
"alexaRank": 2, "alexaRank": 2,
"urlMain": "https://www.youtube.com/", "urlMain": "https://www.youtube.com/",
"url": "https://www.youtube.com/{username}", "url": "https://www.youtube.com/@{username}",
"urlProbe": "https://invidious.slipfox.xyz/user/{username}", "usernameClaimed": "test",
"usernameClaimed": "blue", "usernameUnclaimed": "noonewouldeverusethis777"
"usernameUnclaimed": "noonewouldeverusethis7"
}, },
"Yummly": { "Yummly": {
"tags": [ "tags": [
+29 -31
@@ -22,7 +22,7 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://pt.bongacams.com) [BongaCams (https://pt.bongacams.com)](https://pt.bongacams.com)*: top 50, cz, webcam* 1. ![](https://www.google.com/s2/favicons?domain=https://pt.bongacams.com) [BongaCams (https://pt.bongacams.com)](https://pt.bongacams.com)*: top 50, cz, webcam*
1. ![](https://www.google.com/s2/favicons?domain=https://www.instagram.com/) [Instagram (https://www.instagram.com/)](https://www.instagram.com/)*: top 50, photo*, search is disabled 1. ![](https://www.google.com/s2/favicons?domain=https://www.instagram.com/) [Instagram (https://www.instagram.com/)](https://www.instagram.com/)*: top 50, photo*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://www.twitch.tv/) [Twitch (https://www.twitch.tv/)](https://www.twitch.tv/)*: top 50, streaming, us* 1. ![](https://www.google.com/s2/favicons?domain=https://www.twitch.tv/) [Twitch (https://www.twitch.tv/)](https://www.twitch.tv/)*: top 50, streaming, us*
1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing* 1. ![](https://www.google.com/s2/favicons?domain=https://yandex.ru/collections/) [YandexCollections API (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://stackoverflow.com) [StackOverflow (https://stackoverflow.com)](https://stackoverflow.com)*: top 50, coding* 1. ![](https://www.google.com/s2/favicons?domain=https://stackoverflow.com) [StackOverflow (https://stackoverflow.com)](https://stackoverflow.com)*: top 50, coding*
1. ![](https://www.google.com/s2/favicons?domain=https://www.ebay.com/) [Ebay (https://www.ebay.com/)](https://www.ebay.com/)*: top 50, shopping, us* 1. ![](https://www.google.com/s2/favicons?domain=https://www.ebay.com/) [Ebay (https://www.ebay.com/)](https://www.ebay.com/)*: top 50, shopping, us*
1. ![](https://www.google.com/s2/favicons?domain=https://naver.com) [Naver (https://naver.com)](https://naver.com)*: top 50, kr* 1. ![](https://www.google.com/s2/favicons?domain=https://naver.com) [Naver (https://naver.com)](https://naver.com)*: top 50, kr*
@@ -804,7 +804,7 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://forums.gentoo.org) [gentoo (https://forums.gentoo.org)](https://forums.gentoo.org)*: top 100K, fi, forum, in* 1. ![](https://www.google.com/s2/favicons?domain=https://forums.gentoo.org) [gentoo (https://forums.gentoo.org)](https://forums.gentoo.org)*: top 100K, fi, forum, in*
1. ![](https://www.google.com/s2/favicons?domain=https://community.asterisk.org) [community.asterisk.org (https://community.asterisk.org)](https://community.asterisk.org)*: top 100K, forum, in, ir, jp, us* 1. ![](https://www.google.com/s2/favicons?domain=https://community.asterisk.org) [community.asterisk.org (https://community.asterisk.org)](https://community.asterisk.org)*: top 100K, forum, in, ir, jp, us*
1. ![](https://www.google.com/s2/favicons?domain=https://www.gapyear.com) [Gapyear (https://www.gapyear.com)](https://www.gapyear.com)*: top 100K, gb, in* 1. ![](https://www.google.com/s2/favicons?domain=https://www.gapyear.com) [Gapyear (https://www.gapyear.com)](https://www.gapyear.com)*: top 100K, gb, in*
1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa* 1. ![](https://www.google.com/s2/favicons?domain=https://shadowban.eu) [Twitter Shadowban (https://shadowban.eu)](https://shadowban.eu)*: top 100K, jp, sa*, search is disabled
1. ![](https://www.google.com/s2/favicons?domain=https://psyera.ru) [Psyera (https://psyera.ru)](https://psyera.ru)*: top 100K, ru* 1. ![](https://www.google.com/s2/favicons?domain=https://psyera.ru) [Psyera (https://psyera.ru)](https://psyera.ru)*: top 100K, ru*
1. ![](https://www.google.com/s2/favicons?domain=http://forum.mfd.ru) [mfd (http://forum.mfd.ru)](http://forum.mfd.ru)*: top 100K, forum, ru* 1. ![](https://www.google.com/s2/favicons?domain=http://forum.mfd.ru) [mfd (http://forum.mfd.ru)](http://forum.mfd.ru)*: top 100K, forum, ru*
1. ![](https://www.google.com/s2/favicons?domain=https://forum.mirf.ru/) [mirf (https://forum.mirf.ru/)](https://forum.mirf.ru/)*: top 100K, forum, ru* 1. ![](https://www.google.com/s2/favicons?domain=https://forum.mirf.ru/) [mirf (https://forum.mirf.ru/)](https://forum.mirf.ru/)*: top 100K, forum, ru*
@@ -3130,21 +3130,20 @@ Rank data fetched from Alexa by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://massagerepublic.com) [massagerepublic.com (https://massagerepublic.com)](https://massagerepublic.com)*: top 100M* 1. ![](https://www.google.com/s2/favicons?domain=https://massagerepublic.com) [massagerepublic.com (https://massagerepublic.com)](https://massagerepublic.com)*: top 100M*
1. ![](https://www.google.com/s2/favicons?domain=https://mynickname.com) [mynickname.com (https://mynickname.com)](https://mynickname.com)*: top 100M* 1. ![](https://www.google.com/s2/favicons?domain=https://mynickname.com) [mynickname.com (https://mynickname.com)](https://mynickname.com)*: top 100M*
The list was updated at (2024-11-30) The list was updated at (2024-12-06)
## Statistics ## Statistics
Enabled/total sites: 2693/3126 = 86.15% Enabled/total sites: 2691/3126 = 86.08%
Incomplete message checks: 404/2693 = 15.0% (false positive risks) Incomplete message checks: 405/2691 = 15.05% (false positive risks)
Status code checks: 618/2694 = 22.94% (false positive risks) Status code checks: 719/2691 = 26.72% (false positive risks)
False positive risk (total): 37.97% False positive risk (total): 41.77%
Top 20 profile URLs: Top 20 profile URLs:
- (796) `{urlMain}/index/8-0-{username} (uCoz)` - (796) `{urlMain}/index/8-0-{username} (uCoz)`
- (302) `/{username}` - (300) `/{username}`
- (221) `{urlMain}{urlSubpath}/members/?username={username} (XenForo)` - (221) `{urlMain}{urlSubpath}/members/?username={username} (XenForo)`
- (160) `/user/{username}` - (160) `/user/{username}`
- (133) `{urlMain}{urlSubpath}/member.php?username={username} (vBulletin)` - (133) `{urlMain}{urlSubpath}/member.php?username={username} (vBulletin)`
@@ -3154,7 +3153,7 @@ Top 20 profile URLs:
- (88) `/users/{username}` - (88) `/users/{username}`
- (87) `{urlMain}/u/{username}/summary (Discourse)` - (87) `{urlMain}/u/{username}/summary (Discourse)`
- (54) `/wiki/User:{username}` - (54) `/wiki/User:{username}`
- (49) `/@{username}` - (51) `/@{username}`
- (42) `SUBDOMAIN` - (42) `SUBDOMAIN`
- (41) `/members/?username={username}` - (41) `/members/?username={username}`
- (32) `/members/{username}` - (32) `/members/{username}`
@@ -3164,25 +3163,24 @@ Top 20 profile URLs:
- (17) `/forum/members/?username={username}` - (17) `/forum/members/?username={username}`
- (17) `/search.php?keywords=&terms=all&author={username}` - (17) `/search.php?keywords=&terms=all&author={username}`
Top 20 tags: Top 20 tags:
- (1104) `NO_TAGS` (non-standard) - (327) `NO_TAGS` (non-standard)
- (735) `forum` - (307) `forum`
- (80) `gaming` - (50) `gaming`
- (48) `photo` - (26) `coding`
- (41) `coding` - (21) `photo`
- (30) `tech` - (20) `blog`
- (29) `news` - (19) `news`
- (27) `blog` - (15) `music`
- (23) `music` - (14) `tech`
- (18) `finance` - (12) `sharing`
- (18) `crypto` - (12) `freelance`
- (17) `sharing` - (12) `finance`
- (16) `freelance` - (10) `dating`
- (15) `art` - (10) `art`
- (15) `shopping` - (10) `shopping`
- (13) `sport` - (10) `movies`
- (13) `business` - (8) `hobby`
- (12) `movies` - (8) `crypto`
- (11) `hobby` - (7) `sport`
- (11) `education` - (7) `hacking`
+5 -6
@@ -22,14 +22,13 @@ httpbin.org FALSE / FALSE 0 a b
""" """
@pytest.mark.skip(reason="periodically fails")
@pytest.mark.slow @pytest.mark.slow
def test_twitter_activation(default_db): def test_vimeo_activation(default_db):
twitter_site = default_db.sites_dict['Twitter'] vimeo_site = default_db.sites_dict['Vimeo']
token1 = twitter_site.headers['x-guest-token'] token1 = vimeo_site.headers['Authorization']
ParsingActivator.twitter(twitter_site, Mock()) ParsingActivator.vimeo(vimeo_site, Mock())
token2 = twitter_site.headers['x-guest-token'] token2 = vimeo_site.headers['Authorization']
assert token1 != token2 assert token1 != token2
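The test above calls the live Vimeo endpoint, so it depends on the network. As a complement, a possible offline variant is sketched below. It does not use ``ParsingActivator``: ``refresh_jwt`` is a hypothetical stand-in that mirrors the documented ``vimeo`` function but takes the HTTP getter as a parameter, so that a ``Mock`` can replace it.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def refresh_jwt(site, http_get):
    """Hypothetical activator: same steps as the documented vimeo function,
    with the HTTP getter injected instead of importing requests."""
    headers = dict(site.headers)
    headers.pop("Authorization", None)  # do not send the stale token
    response = http_get(site.activation["url"], headers=headers)
    site.headers["Authorization"] = "jwt " + response.json()["jwt"]


def test_activation_replaces_token():
    # Stand-in for a site record; only the attributes used by the activator exist.
    site = SimpleNamespace(
        headers={"Authorization": "jwt old", "User-Agent": "test"},
        activation={"url": "https://vimeo.com/_rv/viewer"},
    )
    fake_response = Mock()
    fake_response.json.return_value = {"jwt": "new-token"}
    http_get = Mock(return_value=fake_response)

    token1 = site.headers["Authorization"]
    refresh_jwt(site, http_get)
    token2 = site.headers["Authorization"]

    assert token1 != token2
    assert token2 == "jwt new-token"
    # The stale Authorization header must not be forwarded to the activation URL.
    http_get.assert_called_once_with(
        "https://vimeo.com/_rv/viewer", headers={"User-Agent": "test"}
    )


test_activation_replaces_token()
```

Injecting the getter keeps the check deterministic, while the slow test against the real site remains useful for detecting changes on Vimeo's side.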