feat(core): add POST request support, new sites, migrate to Majestic Million ranking (#2317)

* feat(core): add POST request support, new sites, migrate to Majestic Million ranking
- Added native POST request support to the Maigret engine (requestMethod, requestPayload) to enable querying modern JSON registration endpoints.
- Replaced the discontinued Alexa rank API with the Majestic Million dataset for global popularity sorting and automated CI updates.
- Fixed multiple false positives among top 500 sites and bypassed standard anti-bot protections using custom User-Agents.
- Updated public documentation and internal playbooks to reflect the new features.

* feat(data): apply all data.json site check updates from main branch

- Added CTFtime and PentesterLab (new sites added in main)
- Removed forums.imore.com (deleted in main as dead site)
- Disabled 5 sites per main branch fixes: Librusec, MirTesen, amateurvoyeurforum.com, forums.stevehoffman.tv, vegalab
- Fixed 5 site checks per main branch: SoundCloud, Taplink, Setlist, RoyalCams, club.cnews.ru (switched from status_code to message checkType with proper markers)

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/a1d194d9-c0ff-4e2b-974c-c5e4b59548bf

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This commit is contained in:
Soxoj
2026-03-24 22:08:42 +01:00
committed by Soxoj
parent e95d061460
commit 06c3360e3d
10 changed files with 27299 additions and 26944 deletions
+1
@@ -157,6 +157,7 @@ Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtu
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis Anti-Bot**: setting `headers` with a basic script User-Agent (e.g. `python-requests/2.25.1`) instead of the default browser UA bypassed the Anubis proof-of-work challenge (an HTTP 307 redirect) entirely, restoring the framework's native HTTP 404 for non-existent users.
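As a sketch, a data.json entry applying this User-Agent trick might look like the following (URLs and field values here are illustrative, not the exact shipped entry):

```json
"Weblate": {
  "urlMain": "https://hosted.weblate.org/",
  "url": "https://hosted.weblate.org/user/{username}/",
  "checkType": "status_code",
  "headers": {
    "User-Agent": "python-requests/2.25.1"
  }
}
```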
### What required disabling checks
+4 -1
@@ -76,8 +76,11 @@ Practical observations from fixing top-ranked sites. Full details: section **7**
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). Don't assume POST-only — Maigret can use GET `urlProbe` for GraphQL. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"requestMethod": "POST"` and `"requestPayload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe`, since `.format()` ignores percent-encoded characters. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic. |
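The brace-encoding row can be verified in isolation: Python's `str.format` rewrites only literal `{`/`}` pairs, so percent-encoded `%7B`/`%7D` survive while `{username}` is substituted (the probe URL below is a made-up illustration, not a real Maigret entry):

```python
# Hypothetical GraphQL probe URL with literal braces percent-encoded so that
# str.format only touches the {username} placeholder.
probe = "https://gql.example.com/?query=query%20%7Buser(username:%22{username}%22)%20%7Bid%7D%7D"

url = probe.format(username="machine42")
assert "machine42" in url              # placeholder substituted
assert "%7B" in url and "%7D" in url   # encoded braces left untouched
```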
## 8. Documentation maintenance
+4 -4
@@ -42,14 +42,14 @@ include certain categories while excluding others. Read more
``-a``, ``--all-sites`` - Use all sites for scan **(default: top 500)**.
``--top-sites`` - Count of sites for scan ranked by Alexa Top
``--top-sites`` - Count of sites for scan ranked by Majestic Million
**(default: top 500)**.
**Mirrors:** After the top *N* sites by Alexa rank are chosen (respecting
**Mirrors:** After the top *N* sites by Majestic Million rank are chosen (respecting
``--tags``, ``--use-disabled-sites``, etc.), Maigret may add extra sites
whose database field ``source`` names a **parent platform** that itself falls
in the Alexa top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Alexa, a mirror such as ``memory.lol``
in the Majestic Million top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Majestic Million, a mirror such as ``memory.lol``
(with ``source: Twitter``) is included even though it has no rank and would
otherwise be cut off. The same applies to Instagram-related mirrors (e.g.
Picuki) when ``Instagram`` is in that parent top *N* by rank—even if the
+9 -1
@@ -22,9 +22,15 @@ The supported methods (``checkType`` values in ``data.json``) are:
- ``status_code`` - checks that the status code of the response is 2XX
- ``response_url`` - checks that there is no redirect and the response is 2XX
.. note::
    Maigret treats specific anti-bot HTTP status codes (like LinkedIn's ``HTTP 999``) as a standard "not found / unavailable" signal instead of raising an infrastructure Server Error, preventing false positives.
See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.
**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Alexa top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
.. note::
Maigret now uses the **Majestic Million** dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in `data.json` and internal site models remains named ``alexaRank`` and ``alexa_rank``.
**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
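The mirror-inclusion rule can be sketched as a simplified standalone model (an illustration only, not Maigret's actual implementation; the dict field names are assumptions):

```python
import sys

def select_top_sites(sites, n):
    """Pick the top-N ranked sites, then pull in mirrors of top-N parents."""
    # Rank all sites (including disabled ones); unranked sites sort last.
    ranked = sorted(sites, key=lambda s: s.get("rank", sys.maxsize))
    top = ranked[:n]
    top_names = {s["name"] for s in top}
    # A mirror rides along if its `source` parent made the top N,
    # even though the mirror itself has no rank of its own.
    mirrors = [
        s for s in sites
        if s.get("source") in top_names and s["name"] not in top_names
    ]
    return top + mirrors
```

For example, if ``Twitter`` is in the top N, an unranked mirror with ``source: Twitter`` (such as ``memory.lol``) is still selected.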
Testing
-------
@@ -114,6 +120,8 @@ There are few options for sites data.json helpful in various cases:
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - sets the HTTP method to use (e.g., ``POST``); if not set, Maigret defaults to GET or HEAD
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), useful for querying GraphQL or modern JSON APIs
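Combined, a POST-based entry for a hypothetical JSON registration endpoint could look like this (site name, URLs, and markers are invented for illustration; marker field spellings follow the existing database, e.g. ``presenseStrs``):

```json
"ExampleApp": {
  "urlMain": "https://example.app/",
  "url": "https://example.app/{username}",
  "urlProbe": "https://example.app/api/check-username",
  "requestMethod": "POST",
  "requestPayload": {"username": "{username}"},
  "checkType": "message",
  "presenseStrs": ["\"taken\": true"],
  "absenseStrs": ["\"taken\": false"]
}
```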
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1 -1
@@ -13,7 +13,7 @@ Use Cases
---------
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Alexa rank) from the Maigret DB.
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Majestic Million rank) from the Maigret DB.
.. code-block:: console
+50 -12
@@ -61,30 +61,49 @@ class SimpleAiohttpChecker(CheckerBase):
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None
async def close(self):
pass
async def _make_request(
self, session, url, headers, allow_redirects, timeout, method, logger
self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
) -> Tuple[str, int, Optional[CheckError]]:
try:
request_method = session.get if method == 'get' else session.head
async with request_method(
url=url,
headers=headers,
allow_redirects=allow_redirects,
timeout=timeout,
) as response:
if method.lower() == 'get':
request_method = session.get
elif method.lower() == 'post':
request_method = session.post
elif method.lower() == 'head':
request_method = session.head
else:
request_method = session.get
kwargs = {
'url': url,
'headers': headers,
'allow_redirects': allow_redirects,
'timeout': timeout,
}
if payload and method.lower() == 'post':
if headers and headers.get('Content-Type') == 'application/x-www-form-urlencoded':
kwargs['data'] = payload
else:
kwargs['json'] = payload
async with request_method(**kwargs) as response:
status_code = response.status
response_content = await response.content.read()
charset = response.charset or "utf-8"
@@ -141,6 +160,7 @@ class SimpleAiohttpChecker(CheckerBase):
self.timeout,
self.method,
self.logger,
self.payload,
)
if error and str(error) == "Invalid proxy response":
@@ -165,7 +185,7 @@ class AiodnsDomainResolver(CheckerBase):
self.logger = kwargs.get('logger', Mock())
self.resolver = aiodns.DNSResolver(loop=loop)
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
return None
@@ -191,7 +211,7 @@ class CheckerMock:
def __init__(self, *args, **kwargs):
pass
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
return None
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
@@ -220,6 +240,11 @@ def detect_error_page(
if status_code == 403 and not ignore_403:
return CheckError("Access denied", "403 status code, use proxy/vpn")
elif status_code == 999:
# LinkedIn anti-bot / HTTP 999 workaround. It shouldn't trigger an infrastructure
# Server Error because it represents a valid "Not Found / Blocked" state for the username.
pass
elif status_code >= 500:
return CheckError("Server", f"{status_code} status code")
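The status triage above can be restated as a tiny standalone helper (a hypothetical simplification for illustration, not the actual `detect_error_page`):

```python
def classify_status(status_code, ignore_403=False):
    """Map an HTTP status code to an error label, or None for a usable response."""
    if status_code == 403 and not ignore_403:
        return "Access denied"   # 403: suggest proxy/vpn
    if status_code == 999:
        return None              # LinkedIn anti-bot: valid negative signal, not an error
    if status_code >= 500:
        return "Server error"
    return None
```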
@@ -494,7 +519,9 @@ def make_site_result(
for k, v in site.get_params.items():
url_probe += f"&{k}={v}"
if site.check_type == "status_code" and site.request_head_only:
if site.request_method:
request_method = site.request_method.lower()
elif site.check_type == "status_code" and site.request_head_only:
# In most cases when we are detecting by status code,
# it is not necessary to get the entire body: we can
# detect fine with just the HEAD response.
@@ -505,6 +532,15 @@ def make_site_result(
# not respond properly unless we request the whole page.
request_method = 'get'
payload = None
if site.request_payload:
payload = {}
for k, v in site.request_payload.items():
if isinstance(v, str):
payload[k] = v.format(username=username)
else:
payload[k] = v
if site.check_type == "response_url":
# Site forwards request to a different URL if username not
# found. Disallow the redirect so we can capture the
@@ -521,6 +557,7 @@ def make_site_result(
headers=headers,
allow_redirects=allow_redirects,
timeout=options['timeout'],
payload=payload,
)
# Store future request object in the results object
@@ -577,6 +614,7 @@ async def check_site_for_username(
allow_redirects=checker.allow_redirects,
timeout=checker.timeout,
method=checker.method,
payload=getattr(checker, 'payload', None),
)
response = await checker.check()
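The payload templating step in `make_site_result` can be exercised on its own (a standalone rephrasing of the loop, under the assumption that only string values carry the `{username}` placeholder):

```python
def render_payload(template, username):
    """Substitute {username} in string values; pass other value types through untouched."""
    payload = {}
    for k, v in template.items():
        payload[k] = v.format(username=username) if isinstance(v, str) else v
    return payload
```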
+24557 -24380
File diff suppressed because it is too large
+6
@@ -65,6 +65,10 @@ class MaigretSite:
url_probe = None
# Type of check to perform
check_type = ""
# HTTP request method (GET, POST, HEAD, etc.)
request_method = ""
# HTTP request payload (for POST, PUT, etc.)
request_payload: Dict[str, Any] = {}
# Whether to only send HEAD requests (GET by default)
request_head_only = ""
# GET parameters to include in requests
@@ -137,6 +141,8 @@ class MaigretSite:
'regex_check',
'url_probe',
'check_type',
'request_method',
'request_payload',
'request_head_only',
'get_params',
'presense_strs',
+2617 -2506
File diff suppressed because it is too large
+48 -37
@@ -24,36 +24,44 @@ RANKS.update({
'100000000': '100M',
})
SEMAPHORE = threading.Semaphore(20)
def get_rank(domain_to_query, site, print_errors=True):
with SEMAPHORE:
# Retrieve ranking data via alexa API
url = f"http://data.alexa.com/data?cli=10&url={domain_to_query}"
xml_data = requests.get(url).text
root = ET.fromstring(xml_data)
import csv
import io
from urllib.parse import urlparse
def fetch_majestic_million():
print("Fetching Majestic Million CSV (this may take a few seconds)...")
ranks = {}
url = "https://downloads.majestic.com/majestic_million.csv"
try:
#Get ranking for this site.
site.alexa_rank = int(root.find('.//REACH').attrib['RANK'])
# country = root.find('.//COUNTRY')
# if not country is None and country.attrib:
# country_code = country.attrib['CODE']
# tags = set(site.tags)
# if country_code:
# tags.add(country_code.lower())
# site.tags = sorted(list(tags))
# if site.type != 'username':
# site.disabled = False
except Exception as e:
if print_errors:
logging.error(e)
# We did not find the rank for some reason.
print(f"Error retrieving rank information for '{domain_to_query}'")
print(f" Returned XML is |{xml_data}|")
response = requests.get(url, stream=True)
response.raise_for_status()
return
csv_file = io.StringIO(response.text)
reader = csv.reader(csv_file)
next(reader) # skip headers
for row in reader:
if not row or len(row) < 3:
continue
rank = int(row[0])
domain = row[2].lower()
ranks[domain] = rank
except Exception as e:
logging.error(f"Error fetching Majestic Million: {e}")
print(f"Loaded {len(ranks)} domains from Majestic Million.")
return ranks
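The CSV handling in `fetch_majestic_million` assumes the Majestic Million layout of GlobalRank in column 0 and Domain in column 2 after a header row; the same parse on a small made-up sample looks like this:

```python
import csv
import io

# Made-up two-row sample in the Majestic Million column layout
# (GlobalRank, TldRank, Domain, ...).
sample = "GlobalRank,TldRank,Domain\n1,1,google.com\n2,2,facebook.com\n"

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row, as fetch_majestic_million does
ranks = {row[2].lower(): int(row[0]) for row in reader if len(row) >= 3}
```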
def get_base_domain(url):
try:
netloc = urlparse(url).netloc
if netloc.startswith('www.'):
netloc = netloc[4:]
return netloc.lower()
except Exception:
return ""
def get_step_rank(rank):
@@ -91,30 +99,33 @@ def main():
with open("sites.md", "w") as site_file:
site_file.write(f"""
## List of supported sites (search methods): total {len(sites_subset)}\n
Rank data fetched from Alexa by domains.
Rank data fetched from Majestic Million by domains.
""")
majestic_ranks = {}
if args.with_rank:
majestic_ranks = fetch_majestic_million()
for site in sites_subset:
if not args.with_rank:
break
url_main = site.url_main
if site.alexa_rank < sys.maxsize and args.empty_only:
continue
if args.exclude_engine_list and site.engine in args.exclude_engine_list:
continue
site.alexa_rank = 0
th = threading.Thread(target=get_rank, args=(url_main, site,))
pool.append((site.name, url_main, th))
th.start()
domain = get_base_domain(site.url_main)
if domain in majestic_ranks:
site.alexa_rank = majestic_ranks[domain]
else:
site.alexa_rank = sys.maxsize
# In memory matching complete, no threads to join
if args.with_rank:
index = 1
for site_name, url_main, th in pool:
th.join()
sys.stdout.write("\r{0}".format(f"Updated {index} out of {len(sites_subset)} entries"))
sys.stdout.flush()
index = index + 1
print("Successfully updated ranks matching Majestic Million dataset.")
sites_full_list = [(s, int(s.alexa_rank)) for s in sites_subset]