feat(core): add POST request support, new sites, migrate to Majestic Million ranking (#2317)

* feat(core): add POST request support, new sites, migrate to Majestic Million ranking
- Added native POST request support to the Maigret engine (requestMethod, requestPayload) to enable querying modern JSON registration endpoints.
- Replaced the discontinued Alexa rank API with the Majestic Million dataset for global popularity sorting and automated CI updates.
- Fixed multiple false positives among top 500 sites and bypassed standard anti-bot protections using custom User-Agents.
- Updated public documentation and internal playbooks to reflect the new features.

* feat(data): apply all data.json site check updates from main branch

- Added CTFtime and PentesterLab (new sites added in main)
- Removed forums.imore.com (deleted in main as dead site)
- Disabled 5 sites per main branch fixes: Librusec, MirTesen, amateurvoyeurforum.com, forums.stevehoffman.tv, vegalab
- Fixed 5 site checks per main branch: SoundCloud, Taplink, Setlist, RoyalCams, club.cnews.ru (switched from status_code to message checkType with proper markers)

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/a1d194d9-c0ff-4e2b-974c-c5e4b59548bf

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This commit is contained in:
Soxoj
2026-03-24 22:08:42 +01:00
committed by Soxoj
parent e95d061460
commit 06c3360e3d
10 changed files with 27299 additions and 26944 deletions
+4 -4
View File
@@ -42,14 +42,14 @@ include certain categories while excluding others. Read more
``-a``, ``--all-sites`` - Use all sites for scan **(default: top 500)**.
``--top-sites`` - Count of sites for scan ranked by Alexa Top
``--top-sites`` - Count of sites for scan ranked by Majestic Million
**(default: top 500)**.
**Mirrors:** After the top *N* sites by Alexa rank are chosen (respecting
**Mirrors:** After the top *N* sites by Majestic Million rank are chosen (respecting
``--tags``, ``--use-disabled-sites``, etc.), Maigret may add extra sites
whose database field ``source`` names a **parent platform** that itself falls
in the Alexa top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Alexa, a mirror such as ``memory.lol``
in the Majestic Million top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Majestic Million, a mirror such as ``memory.lol``
(with ``source: Twitter``) is included even though it has no rank and would
otherwise be cut off. The same applies to Instagram-related mirrors (e.g.
Picuki) when ``Instagram`` is in that parent top *N* by rank—even if the
+9 -1
View File
@@ -22,9 +22,15 @@ The supported methods (``checkType`` values in ``data.json``) are:
- ``status_code`` - checks that status code of the response is 2XX
- ``response_url`` - check if there is not redirect and the response is 2XX
.. note::
Maigret natively treats specific anti-bot HTTP status codes (like LinkedIn's ``HTTP 999``) as a standard "Not Found/Available" signal instead of throwing an infrastructure Server Error, gracefully preventing false positives.
See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.
**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Alexa top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
.. note::
Maigret now uses the **Majestic Million** dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in `data.json` and internal site models remains named ``alexaRank`` and ``alexa_rank``.
**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
Testing
-------
@@ -114,6 +120,8 @@ There are few options for sites data.json helpful in various cases:
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - set the HTTP method to use (e.g., ``POST``). By default, Maigret natively defaults to GET or HEAD.
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), extremely useful for parsing GraphQL or modern JSON APIs.
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1 -1
View File
@@ -13,7 +13,7 @@ Use Cases
---------
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Alexa rank) from the Maigret DB.
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Majestic Million rank) from the Maigret DB.
.. code-block:: console