mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-17 11:55:36 +00:00
fix(checking): block URL-incompatible usernames before request
@@ -95,6 +95,13 @@ Each site entry uses one of three `checkType` modes to decide whether a profile
**Errors vs absence.** Anything that means "the server can't answer right now" (rate limits, captchas, "Checking your browser", "unusual traffic", maintenance pages) belongs in `errors`, which maps the substring to a human-readable error string, not in `absenceStrs`. The `errors` mechanism produces an UNKNOWN result instead of a false CLAIMED or false AVAILABLE.

**`regexCheck` and non-ASCII usernames.** When `{username}` is interpolated into a URL **path segment** and the username contains characters that need percent-encoding (Cyrillic, Chinese, Korean, spaces, etc.), Maigret skips the site with a `URL-incompatible username` error rather than sending a request that would land on a generic listing/homepage and trip overly broad `presenseStrs`. This default avoids the cascade of false positives observed in [#459](https://github.com/soxoj/maigret/issues/459) and [#2633](https://github.com/soxoj/maigret/issues/2633). Two corollaries for site entries:

- If your site legitimately accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the actual format with an explicit `regexCheck`. For example, a MediaWiki-style wiki could use `"regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"`; a Japanese blog platform might use `"regexCheck": "^[\\w\\-_\\.]+$"` (Python's `\w` matches Unicode word characters). Don't paper this over with `regexCheck: "."`; pick a regex that reflects what the site actually accepts.
- If `{username}` is in a query string (`?name={username}`) or only in `requestPayload`, the default has no effect: query/body values are URL-encoded as parameters and most APIs handle that fine.

The default kicks in *only* when no per-site `regexCheck` is set. Existing per-site regexes always win.
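
The precedence described above can be sketched in a few lines of Python. This is an illustration only, not Maigret's code: `should_request` is a hypothetical name, and applying `regexCheck` with `re.search` is an assumption made for the example.

```python
import re
from typing import Optional
from urllib.parse import quote


def should_request(
    template: str, username: str, regex_check: Optional[str] = None
) -> bool:
    """Hypothetical helper: decide whether a site check is worth sending."""
    if regex_check:
        # An explicit per-site regex always wins over the default.
        return re.search(regex_check, username) is not None
    path_part, _sep, _query = template.partition("?")
    if "{username}" not in path_part:
        # Placeholder is in the query string or absent from the URL.
        return True
    # With safe="", quote() leaves only ASCII letters, digits and "_.-~" intact.
    return quote(username, safe="") == username


profile = "https://example.com/u/{username}"
print(should_request(profile, "alice.bob_42"))                  # True
print(should_request(profile, "Александр"))                     # False
print(should_request(profile, "Александр", r"^[\w\-_\.]+$"))    # True
print(should_request("https://example.com/api?name={username}", "홍길동"))  # True
```

The third call shows the opt-in: the same Cyrillic username that the default rejects is accepted once the site declares a Unicode-aware `regexCheck`.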

Full reference for `checkType`, `urlProbe`, `engine`, and the rest of the `data.json` schema is in the [development guide](docs/source/development.rst), section *How to fix false-positives*.

### Editing `data.json` safely

@@ -134,11 +134,50 @@ There are few options for sites data.json helpful in various cases:
- ``engine`` - a predefined check for the sites of certain type (e.g. forums), see the ``engines`` section in the JSON file
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives (see ``regexCheck`` and the non-ASCII default below)
- ``requestMethod`` - set the HTTP method to use (e.g., ``POST``). By default, Maigret uses GET or HEAD.
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), useful for querying GraphQL or other JSON APIs.
- ``protection`` - a list of protection types detected on the site (see below).

``regexCheck`` and non-ASCII usernames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When ``{username}`` is interpolated into a URL **path segment** and the user-supplied username contains characters that would be percent-encoded by :py:func:`urllib.parse.quote` (Cyrillic, Chinese, Korean, Arabic, spaces, etc.), Maigret skips the site with a ``URL-incompatible username`` error rather than sending a request that would land on a generic listing/homepage and trip overly broad ``presenseStrs``. This default closes the cascade of false positives observed in `issue #459 <https://github.com/soxoj/maigret/issues/459>`_ and `issue #2633 <https://github.com/soxoj/maigret/issues/2633>`_.

Scope of the default:

- Active **only** when ``{username}`` is in the URL path of ``url`` (or ``urlProbe`` if set), e.g. ``https://example.com/u/{username}``.
- **Not** active when ``{username}`` is in the query string (``?name={username}``) or only in ``requestPayload``: those values are URL-encoded as parameters and most APIs handle them fine.
- **Always** deferred when the site has its own ``regexCheck``; an explicit per-site rule wins.

Opting a site into broader matching:

If a site genuinely accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the format it actually accepts with an explicit ``regexCheck``. A few worked examples:

- A MediaWiki-style wiki that allows any character except the MediaWiki-forbidden punctuation:

  .. code-block:: json

     {
         "url": "https://wiki.example/wiki/User:{username}",
         "regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"
     }

- A Japanese blog platform that allows Unicode word characters + dash + dot:

  .. code-block:: json

     {
         "url": "https://blog.example/{username}",
         "regexCheck": "^[\\w\\-_\\.]+$"
     }

  In Python's regex engine, ``\w`` (written ``\\w`` in JSON) in a ``str`` pattern matches Unicode word characters by default, so Hiragana / Hangul / Cyrillic / etc. all pass.

**Do not** paper this over with ``"regexCheck": "."``. That is a placeholder, not a description of what the site accepts; it will let any string through, including URLs and emails that other parts of Maigret may pick up and feed back into recursive search (see ``parse_usernames`` in ``checking.py``).

The complementary direction also matters: if you notice an existing site with a too-permissive ``regexCheck`` (e.g. ``"^[^\\.]+$"``, which means "anything but a dot" and so lets non-ASCII through), tighten it to the actual accepted character class for the site (typically ``"^[A-Za-z0-9_-]+$"`` for ASCII slugs) when fixing related false-positives.

``protection`` (site protection tracking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -49,6 +49,34 @@ SUPPORTED_IDS = (
BAD_CHARS = "#"


def _username_fits_url_template(site: MaigretSite, username: str) -> bool:
    """Decide whether a username can be safely substituted into a site's URL
    path without producing a percent-encoded slug that the site cannot match.

    Rationale: most sites that interpolate ``{username}`` into a URL path
    segment treat the slug as an ASCII identifier. When a username contains
    non-ASCII characters (or other reserved characters), ``urllib.parse.quote``
    percent-encodes the bytes; the site typically cannot resolve such a slug
    and falls back to a generic listing/homepage that trips overly-broad
    ``presenseStrs`` markers, producing a false CLAIMED. See issues #459 and
    #2633. Sites that genuinely accept broader character sets (e.g. wikis
    that allow Unicode usernames) opt into permissive matching by setting
    their own ``regexCheck``; in that case this helper is bypassed entirely.

    Returns True when the check should proceed, False when the result is
    inherently unreliable and the site should be skipped (ILLEGAL).
    """
    if site.regex_check:
        return True
    template = site.url_probe or site.url or ""
    if "{username}" not in template:
        return True
    path_part, _sep, _query = template.partition("?")
    if "{username}" not in path_part:
        return True
    return quote(username, safe='') == username


def build_cloudflare_bypass_config(
    settings_obj: Optional[Any], force_enable: bool = False
) -> Optional[Dict[str, Any]]:
@@ -880,6 +908,23 @@ def make_site_result(
        results_site["http_status"] = ""
        results_site["response_text"] = ""
        # query_notify.update(results_site["status"])
    # username would be percent-encoded into a path segment — see #459/#2633.
    elif not _username_fits_url_template(site, username):
        results_site["status"] = MaigretCheckResult(
            username,
            site.name,
            url,
            MaigretCheckStatus.ILLEGAL,
            error=CheckError(
                'URL-incompatible username',
                'username contains characters that would be percent-encoded '
                'in this site\'s URL path; result would be unreliable. Add a '
                '`regexCheck` to opt this site in if it accepts these chars.'
            ),
        )
        results_site["url_user"] = ""
        results_site["http_status"] = ""
        results_site["response_text"] = ""
    else:
        # URL of user on site (if it exists)
        results_site["url_user"] = url

@@ -1,6 +1,6 @@
{
    "version": 1,
    "updated_at": "2026-05-16T16:00:20Z",
    "updated_at": "2026-05-16T19:48:44Z",
    "sites_count": 3155,
    "min_maigret_version": "0.6.1",
    "data_sha256": "0997b68c05eedb6e714432ed79580688d4923c56ef1ebf46db69b90039ef00d7",

@@ -13,6 +13,7 @@ from maigret.checking import (
    timeout_check,
    debug_response_logging,
    process_site_result,
    _username_fits_url_template,
)
from maigret.errors import CheckError
from maigret.result import MaigretCheckResult, MaigretCheckStatus
@@ -144,6 +145,79 @@ def test_detect_error_page_instagram_login_wall():
    assert "rate-limited" in err.desc


def _site_for_url(url_pattern, regex_check=None, url_probe=None):
    """Build a minimal MaigretSite stub for the URL-template helper tests."""
    raw = {
        "url": url_pattern,
        "urlMain": "https://example.com/",
        "checkType": "message",
        "usernameClaimed": "alice",
        "usernameUnclaimed": "noone",
    }
    if regex_check is not None:
        raw["regexCheck"] = regex_check
    if url_probe is not None:
        raw["urlProbe"] = url_probe
    return MaigretSite("Example", raw)


# Regression tests for #459 / #2633 — usernames that would be percent-encoded
# into a URL path segment trip generic presence markers on fallback pages.
def test_username_fits_path_segment_ascii_slug_passes():
    site = _site_for_url("https://example.com/u/{username}")
    assert _username_fits_url_template(site, "alice") is True
    assert _username_fits_url_template(site, "alice-bob") is True
    assert _username_fits_url_template(site, "alice.bob_42") is True


def test_username_fits_path_segment_non_ascii_blocked():
    site = _site_for_url("https://example.com/u/{username}")
    # Cyrillic
    assert _username_fits_url_template(site, "Александр") is False
    # Chinese
    assert _username_fits_url_template(site, "快嘴摩卡酱") is False
    # Korean
    assert _username_fits_url_template(site, "홍길동") is False
    # Space (also percent-encoded)
    assert _username_fits_url_template(site, "alice bob") is False


def test_username_fits_query_string_is_unconstrained():
    """If {username} sits in the query string, the value is URL-encoded as a
    parameter and most APIs handle that fine — don't block."""
    site = _site_for_url("https://example.com/api/users?name={username}")
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
    assert _username_fits_url_template(site, "Александр") is True


def test_username_fits_explicit_regex_check_bypasses_helper():
    """When the site declares its own regexCheck, the helper defers entirely."""
    # Permissive site: accepts anything via Unicode-friendly regex.
    site = _site_for_url(
        "https://wiki.example/User:{username}", regex_check=r"^[\w\- .]+$"
    )
    assert _username_fits_url_template(site, "Александр") is True
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True


def test_username_fits_url_probe_overrides_url():
    """urlProbe is the actual request URL; the helper must use it when set."""
    # Path-segment url, but urlProbe is a clean query API → no validation
    site = _site_for_url(
        "https://example.com/u/{username}",
        url_probe="https://example.com/api/u?name={username}",
    )
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True


def test_username_fits_post_payload_sites_skipped():
    """Sites with {username} only in requestPayload (no {username} in URL
    template at all) should pass unconditionally — payload is JSON-encoded,
    not URL-path-encoded."""
    site = _site_for_url("https://api.example.com/check")
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True


def test_detect_error_page_instagram_marker_no_false_positive_on_profile():
    """The login-wall marker must NOT match a real profile page. On a claimed
    user page, `routePath` carries the user-route template