mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-07 06:24:35 +00:00
feat(core): add POST request support, new sites, migrate to Majestic Million ranking (#2317)
* feat(core): add POST request support, new sites, migrate to Majestic Million ranking

  - Added native POST request support to the Maigret engine (`requestMethod`, `requestPayload`) to enable querying modern JSON registration endpoints.
  - Replaced the discontinued Alexa rank API with the Majestic Million dataset for global popularity sorting and automated CI updates.
  - Fixed multiple false positives among the top 500 sites and bypassed standard anti-bot protections using custom User-Agents.
  - Updated public documentation and internal playbooks to reflect the new features.

* feat(data): apply all data.json site check updates from main branch

  - Added CTFtime and PentesterLab (new sites added in main)
  - Removed forums.imore.com (deleted in main as a dead site)
  - Disabled 5 sites per main branch fixes: Librusec, MirTesen, amateurvoyeurforum.com, forums.stevehoffman.tv, vegalab
  - Fixed 5 site checks per main branch: SoundCloud, Taplink, Setlist, RoyalCams, club.cnews.ru (switched from status_code to message checkType with proper markers)

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/a1d194d9-c0ff-4e2b-974c-c5e4b59548bf

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
@@ -157,6 +157,7 @@ Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtu

- For **Kaggle**, additionally: **`headers`** and **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation. A sketch of such an entry follows this list.
- For **Weblate / Anubis anti-bot**: setting `headers` to a basic script User-Agent (e.g. `python-requests/2.25.1`) rather than the default browser UA completely bypassed the Anubis proof-of-work challenge (served as an HTTP 307 redirect), restoring the framework's native HTTP 404 response.
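
A minimal sketch of such a JSON-API entry, written as a Python dict mirroring the `data.json` schema (key names follow the options documented further below; the marker strings are illustrative, not verified against the live API):

```python
# Hypothetical data.json entry (as a Python dict) for an SPA site whose
# profiles are checked via a JSON API instead of the HTML shell.
picsart_entry = {
    "checkType": "message",  # match substrings in the response body
    "url": "https://api.picsart.com/users/show/{username}.json",
    "presenseStrs": ['"status":"success"'],  # account exists
    "absenceStrs": ["user_not_found"],       # account is free
    # a plain script UA also sidesteps Anubis-style PoW screens (HTTP 307)
    "headers": {"User-Agent": "python-requests/2.25.1"},
}
```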
### What required disabling checks
@@ -76,8 +76,11 @@ Practical observations from fixing top-ranked sites. Full details: section **7**

| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both existing and non-existing usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **Inspect the Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"requestMethod": "POST"` and `"requestPayload": {"username": "{username}"}` in `data.json` to query them. |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}` template. Use `%7B`/`%7D` for literal braces in `urlProbe` — `.format()` ignores percent-encoded characters. See the sketch after this table. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs with an HTTP 307 redirect. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores the default detection logic. |
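
The brace-encoding trick is easy to verify standalone — `str.format` substitutes only the real `{username}` placeholder and leaves percent-encoded braces untouched (the endpoint and query below are illustrative):

```python
# Hypothetical GraphQL urlProbe with literal braces percent-encoded
# as %7B/%7D so they survive Maigret's .format() templating.
url_probe = (
    "https://gql.example.com/?query="
    "query%20%7Buser(username:%22{username}%22)%7Bid%7D%7D"
)

print(url_probe.format(username="machine42"))
# https://gql.example.com/?query=query%20%7Buser(username:%22machine42%22)%7Bid%7D%7D
```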
## 8. Documentation maintenance
@@ -42,14 +42,14 @@ include certain categories while excluding others. Read more

``-a``, ``--all-sites`` - Use all sites for scan **(default: top 500)**.

-``--top-sites`` - Count of sites for scan ranked by Alexa Top
+``--top-sites`` - Count of sites for scan ranked by Majestic Million
**(default: top 500)**.

-**Mirrors:** After the top *N* sites by Alexa rank are chosen (respecting
+**Mirrors:** After the top *N* sites by Majestic Million rank are chosen (respecting
``--tags``, ``--use-disabled-sites``, etc.), Maigret may add extra sites
whose database field ``source`` names a **parent platform** that itself falls
-in the Alexa top *N* when ranking **including disabled** sites. For example,
-if ``Twitter`` ranks in the first 500 by Alexa, a mirror such as ``memory.lol``
+in the Majestic Million top *N* when ranking **including disabled** sites. For example,
+if ``Twitter`` ranks in the first 500 by Majestic Million, a mirror such as ``memory.lol``
(with ``source: Twitter``) is included even though it has no rank and would
otherwise be cut off. The same applies to Instagram-related mirrors (e.g.
Picuki) when ``Instagram`` is in that parent top *N* by rank—even if the
@@ -22,9 +22,15 @@ The supported methods (``checkType`` values in ``data.json``) are:

- ``status_code`` - checks that the status code of the response is 2XX
- ``response_url`` - checks that there is no redirect and the response is 2XX

+.. note::
+   Maigret natively treats specific anti-bot HTTP status codes (like LinkedIn's ``HTTP 999``) as a standard "Not Found / Available" signal instead of raising an infrastructure Server Error, gracefully preventing false positives.
+
See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.

-**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Alexa top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
+.. note::
+   Maigret now uses the **Majestic Million** dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in ``data.json`` and internal site models remains named ``alexaRank`` and ``alexa_rank``.
+
+**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
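
A minimal sketch of this mirror-inclusion rule, illustrative only — site objects with `name`, `alexa_rank`, `source`, and `disabled` attributes are assumed; this is not the actual implementation:

```python
def select_top_sites(sites, n):
    # Rank *including* disabled sites to decide which parents are "top N".
    ranked = sorted(sites, key=lambda s: s.alexa_rank)
    top_names = {s.name for s in ranked[:n]}
    top = [s for s in ranked[:n] if not s.disabled]

    # Add mirrors whose `source` names a parent platform in the top N,
    # even if the mirror itself is unranked and would be cut off.
    mirrors = [
        s for s in sites
        if getattr(s, 'source', None) in top_names and s not in top
    ]
    return top + mirrors
```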
Testing
-------
@@ -114,6 +120,8 @@ There are few options for sites data.json helpful in various cases:

- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false positives
+- ``requestMethod`` - sets the HTTP method to use (e.g., ``POST``); by default Maigret uses GET or HEAD
+- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), extremely useful for querying GraphQL or modern JSON APIs
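
A hedged sketch of an entry combining both options, modeled on the Discord-style username check mentioned in the playbook (the endpoint and marker strings are hypothetical):

```python
# Hypothetical data.json entry (as a Python dict) probing a JSON
# registration endpoint with the native POST support.
post_entry = {
    "urlMain": "https://example.com",
    "url": "https://example.com/users/{username}",
    "urlProbe": "https://example.com/api/v9/unique-username",
    "requestMethod": "POST",
    "requestPayload": {"username": "{username}"},  # {username} is templated
    "checkType": "message",
    "presenseStrs": ['"taken": true'],   # username exists
    "absenceStrs": ['"taken": false'],   # username is free
}
```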
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -13,7 +13,7 @@ Use Cases
---------

-1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Alexa rank) from the Maigret DB.
+1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Majestic Million rank) from the Maigret DB.

.. code-block:: console
+50 −12
@@ -61,30 +61,49 @@ class SimpleAiohttpChecker(CheckerBase):
        self.headers = None
        self.allow_redirects = True
        self.timeout = 0
+        self.method = 'get'
+        self.payload = None

-    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
+    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
        self.url = url
        self.headers = headers
        self.allow_redirects = allow_redirects
        self.timeout = timeout
        self.method = method
+        self.payload = payload
        return None

    async def close(self):
        pass

    async def _make_request(
-        self, session, url, headers, allow_redirects, timeout, method, logger
+        self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
    ) -> Tuple[str, int, Optional[CheckError]]:
        try:
-            request_method = session.get if method == 'get' else session.head
-            async with request_method(
-                url=url,
-                headers=headers,
-                allow_redirects=allow_redirects,
-                timeout=timeout,
-            ) as response:
+            if method.lower() == 'get':
+                request_method = session.get
+            elif method.lower() == 'post':
+                request_method = session.post
+            elif method.lower() == 'head':
+                request_method = session.head
+            else:
+                request_method = session.get
+
+            kwargs = {
+                'url': url,
+                'headers': headers,
+                'allow_redirects': allow_redirects,
+                'timeout': timeout,
+            }
+            if payload and method.lower() == 'post':
+                if headers and headers.get('Content-Type') == 'application/x-www-form-urlencoded':
+                    kwargs['data'] = payload
+                else:
+                    kwargs['json'] = payload
+
+            async with request_method(**kwargs) as response:
                status_code = response.status
                response_content = await response.content.read()
                charset = response.charset or "utf-8"
@@ -141,6 +160,7 @@ class SimpleAiohttpChecker(CheckerBase):
            self.timeout,
            self.method,
            self.logger,
+            self.payload,
        )

        if error and str(error) == "Invalid proxy response":
@@ -165,7 +185,7 @@ class AiodnsDomainResolver(CheckerBase):
        self.logger = kwargs.get('logger', Mock())
        self.resolver = aiodns.DNSResolver(loop=loop)

-    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
+    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
        self.url = url
        return None
@@ -191,7 +211,7 @@ class CheckerMock:
    def __init__(self, *args, **kwargs):
        pass

-    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
+    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
        return None

    async def check(self) -> Tuple[str, int, Optional[CheckError]]:
@@ -220,6 +240,11 @@ def detect_error_page(
    if status_code == 403 and not ignore_403:
        return CheckError("Access denied", "403 status code, use proxy/vpn")

+    elif status_code == 999:
+        # LinkedIn anti-bot / HTTP 999 workaround. It shouldn't trigger an infrastructure
+        # Server Error because it represents a valid "Not Found / Blocked" state for the username.
+        pass
+
    elif status_code >= 500:
        return CheckError("Server", f"{status_code} status code")
@@ -494,7 +519,9 @@ def make_site_result(
        for k, v in site.get_params.items():
            url_probe += f"&{k}={v}"

-    if site.check_type == "status_code" and site.request_head_only:
+    if site.request_method:
+        request_method = site.request_method.lower()
+    elif site.check_type == "status_code" and site.request_head_only:
        # In most cases when we are detecting by status code,
        # it is not necessary to get the entire body: we can
        # detect fine with just the HEAD response.
@@ -505,6 +532,15 @@ def make_site_result(
        # not respond properly unless we request the whole page.
        request_method = 'get'

+    payload = None
+    if site.request_payload:
+        payload = {}
+        for k, v in site.request_payload.items():
+            if isinstance(v, str):
+                payload[k] = v.format(username=username)
+            else:
+                payload[k] = v
+
    if site.check_type == "response_url":
        # Site forwards request to a different URL if username not
        # found. Disallow the redirect so we can capture the
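
The templating step above can be illustrated standalone (hypothetical payload values):

```python
request_payload = {"username": "{username}", "consent": True}

payload = {
    k: (v.format(username="machine42") if isinstance(v, str) else v)
    for k, v in request_payload.items()
}
print(payload)  # {'username': 'machine42', 'consent': True}
```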
@@ -521,6 +557,7 @@ def make_site_result(
        headers=headers,
        allow_redirects=allow_redirects,
        timeout=options['timeout'],
+        payload=payload,
    )

    # Store future request object in the results object
@@ -577,6 +614,7 @@ async def check_site_for_username(
        allow_redirects=checker.allow_redirects,
        timeout=checker.timeout,
        method=checker.method,
+        payload=getattr(checker, 'payload', None),
    )
    response = await checker.check()
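
Putting the pieces together, a minimal usage sketch of the extended checker — assuming the class manages its own aiohttp session as in the code above; the endpoint is hypothetical:

```python
import asyncio
from maigret.checking import SimpleAiohttpChecker

async def main():
    checker = SimpleAiohttpChecker()
    # POST a JSON payload to a hypothetical username-availability endpoint.
    checker.prepare(
        url="https://example.com/api/v9/unique-username",
        method="post",
        payload={"username": "machine42"},
    )
    html_text, status_code, error = await checker.check()
    print(status_code, error)

asyncio.run(main())
```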
+24557 −24380 (file diff suppressed because it is too large)
@@ -65,6 +65,10 @@ class MaigretSite:
    url_probe = None
    # Type of check to perform
    check_type = ""
+    # HTTP request method (GET, POST, HEAD, etc.)
+    request_method = ""
+    # HTTP request payload (for POST, PUT, etc.)
+    request_payload: Dict[str, Any] = {}
    # Whether to only send HEAD requests (GET by default)
    request_head_only = ""
    # GET parameters to include in requests
@@ -137,6 +141,8 @@ class MaigretSite:
        'regex_check',
        'url_probe',
        'check_type',
+        'request_method',
+        'request_payload',
        'request_head_only',
        'get_params',
        'presense_strs',
+48 −37
@@ -24,36 +24,44 @@ RANKS.update({
    '100000000': '100M',
})

-SEMAPHORE = threading.Semaphore(20)
-
-
-def get_rank(domain_to_query, site, print_errors=True):
-    with SEMAPHORE:
-        # Retrieve ranking data via alexa API
-        url = f"http://data.alexa.com/data?cli=10&url={domain_to_query}"
-        xml_data = requests.get(url).text
-        root = ET.fromstring(xml_data)
-        try:
-            #Get ranking for this site.
-            site.alexa_rank = int(root.find('.//REACH').attrib['RANK'])
-            # country = root.find('.//COUNTRY')
-            # if not country is None and country.attrib:
-            #     country_code = country.attrib['CODE']
-            #     tags = set(site.tags)
-            #     if country_code:
-            #         tags.add(country_code.lower())
-            #     site.tags = sorted(list(tags))
-            # if site.type != 'username':
-            #     site.disabled = False
-        except Exception as e:
-            if print_errors:
-                logging.error(e)
-                # We did not find the rank for some reason.
-                print(f"Error retrieving rank information for '{domain_to_query}'")
-                print(f" Returned XML is |{xml_data}|")
-            return
+import csv
+import io
+from urllib.parse import urlparse
+
+
+def fetch_majestic_million():
+    print("Fetching Majestic Million CSV (this may take a few seconds)...")
+    ranks = {}
+    url = "https://downloads.majestic.com/majestic_million.csv"
+    try:
+        response = requests.get(url, stream=True)
+        response.raise_for_status()
+
+        csv_file = io.StringIO(response.text)
+        reader = csv.reader(csv_file)
+        next(reader)  # skip headers
+
+        for row in reader:
+            if not row or len(row) < 3:
+                continue
+            rank = int(row[0])
+            domain = row[2].lower()
+            ranks[domain] = rank
+    except Exception as e:
+        logging.error(f"Error fetching Majestic Million: {e}")
+
+    print(f"Loaded {len(ranks)} domains from Majestic Million.")
+    return ranks
+
+
+def get_base_domain(url):
+    try:
+        netloc = urlparse(url).netloc
+        if netloc.startswith('www.'):
+            netloc = netloc[4:]
+        return netloc.lower()
+    except Exception:
+        return ""


def get_step_rank(rank):
@@ -91,30 +99,33 @@ def main():
    with open("sites.md", "w") as site_file:
        site_file.write(f"""
## List of supported sites (search methods): total {len(sites_subset)}\n
-Rank data fetched from Alexa by domains.
+Rank data fetched from Majestic Million by domains.

""")

+    majestic_ranks = {}
+    if args.with_rank:
+        majestic_ranks = fetch_majestic_million()
+
    for site in sites_subset:
        if not args.with_rank:
            break
        url_main = site.url_main

        if site.alexa_rank < sys.maxsize and args.empty_only:
            continue
        if args.exclude_engine_list and site.engine in args.exclude_engine_list:
            continue
-        site.alexa_rank = 0
-        th = threading.Thread(target=get_rank, args=(url_main, site,))
-        pool.append((site.name, url_main, th))
-        th.start()
+
+        domain = get_base_domain(site.url_main)
+
+        if domain in majestic_ranks:
+            site.alexa_rank = majestic_ranks[domain]
+        else:
+            site.alexa_rank = sys.maxsize

+    # In-memory matching complete, no threads to join
    if args.with_rank:
-        index = 1
-        for site_name, url_main, th in pool:
-            th.join()
-            sys.stdout.write("\r{0}".format(f"Updated {index} out of {len(sites_subset)} entries"))
-            sys.stdout.flush()
-            index = index + 1
+        print("Successfully updated ranks matching Majestic Million dataset.")

    sites_full_list = [(s, int(s.alexa_rank)) for s in sites_subset]
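
For reference, the Majestic Million CSV starts with a header row and carries the global rank in the first column and the bare domain in the third, which is why the loop reads `row[0]` and `row[2]`. The rows below are illustrative, not real data:

```python
# Illustrative shape of Majestic Million rows after the header
# (GlobalRank, TldRank, Domain, TLD, ...):
sample_rows = [
    ["1", "1", "example-a.com", "com"],
    ["2", "2", "example-b.org", "org"],
]
ranks = {row[2].lower(): int(row[0]) for row in sample_rows}
print(ranks)  # {'example-a.com': 1, 'example-b.org': 2}
```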