Mirror of https://github.com/soxoj/maigret.git, synced 2026-05-17 11:55:36 +00:00
Compare commits (3 commits):

| SHA1 |
|---|
| ff00a51840 |
| 073c20338b |
| d1ff1d0e66 |
@@ -95,6 +95,13 @@ Each site entry uses one of three `checkType` modes to decide whether a profile
**Errors vs absence.** Anything that means "the server can't answer right now" — rate limits, captchas, "Checking your browser", "unusual traffic", maintenance pages — belongs in `errors` (mapping the substring to a human-readable error string), not in `absenceStrs`. The `errors` mechanism produces an UNKNOWN result instead of a false CLAIMED or false AVAILABLE.
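As an illustration of this split, here is a minimal sketch (not Maigret's actual implementation; names are invented) of how an `errors` substring map short-circuits the presence/absence logic:

```python
# Hypothetical sketch of the `errors` mechanism described above: keys are
# substrings searched for in the response body, values are human-readable
# descriptions. Any hit means "server can't answer now", so the result is
# UNKNOWN rather than CLAIMED/AVAILABLE.
ERRORS = {
    "Checking your browser": "Cloudflare JS challenge",
    "unusual traffic": "Rate limit",
}

def classify(html: str) -> str:
    for marker, description in ERRORS.items():
        if marker in html:
            return "UNKNOWN: " + description
    # fall through to the normal presenseStrs/absenceStrs decision
    return "run presence/absence check"

print(classify("<h1>Checking your browser before accessing...</h1>"))
# → UNKNOWN: Cloudflare JS challenge
```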
**`regexCheck` and non-ASCII usernames.** When `{username}` is interpolated into a URL **path segment** and the username contains characters that need percent-encoding (Cyrillic, Chinese, Korean, spaces, etc.), Maigret skips the site with an `URL-incompatible username` error rather than sending a request that would land on a generic listing/homepage and trip overly-broad `presenseStrs`. This default avoids the cascade of false positives observed in [#459](https://github.com/soxoj/maigret/issues/459) and [#2633](https://github.com/soxoj/maigret/issues/2633). Two corollaries for site entries:

- If your site legitimately accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the actual format with an explicit `regexCheck`. For example, a MediaWiki-style wiki could use `"regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"`; a Japanese blog platform might use `"regexCheck": "^[\\w\\-_\\.]+$"` (Python's `\w` matches Unicode letters). Don't paper this over with `regexCheck: "."` — pick a regex that reflects what the site actually accepts.
- If `{username}` is in a query string (`?name={username}`) or only in `requestPayload`, the default has no effect — query/body values are URL-encoded as parameters and most APIs handle that fine.
The default kicks in *only* when no per-site `regexCheck` is set. Existing per-site regexes always win.
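The "percent-encoding leaves the username unchanged" test behind this default can be sketched in a few lines (a simplification; Maigret's actual helper also inspects where `{username}` sits in the URL template):

```python
from urllib.parse import quote

def fits_path_segment(username: str) -> bool:
    # Safe only if percent-encoding is a no-op: ASCII letters, digits and
    # "_.-~" survive quote() untouched; Cyrillic, CJK, spaces etc. do not.
    return quote(username, safe="") == username

print(fits_path_segment("alice_42"))   # True
print(fits_path_segment("Александр"))  # False
print(fits_path_segment("alice bob"))  # False
```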
Full reference for `checkType`, `urlProbe`, `engine`, and the rest of the `data.json` schema is in the [development guide](docs/source/development.rst), section *How to fix false-positives*.

### Editing `data.json` safely
@@ -134,11 +134,50 @@ There are few options for sites data.json helpful in various cases:
- ``engine`` - a predefined check for sites of a certain type (e.g. forums), see the ``engines`` section in the JSON file
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives (see ``regexCheck`` and the non-ASCII default below)
- ``requestMethod`` - set the HTTP method to use (e.g., ``POST``); by default, Maigret uses GET (or HEAD when ``requestHeadOnly`` is set)
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), useful for GraphQL and modern JSON APIs
- ``protection`` - a list of protection types detected on the site (see below).
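To make ``requestPayload`` concrete, here is a hypothetical sketch (field names invented for illustration, not taken from Maigret's source) of how a ``{username}`` placeholder in payload values could be filled in before the POST body is serialized:

```python
import json

def render_payload(template: dict, username: str) -> str:
    # Substitute {username} in every string value, then serialize the dict
    # as the JSON request body; non-string values pass through unchanged.
    rendered = {
        key: value.replace("{username}", username) if isinstance(value, str) else value
        for key, value in template.items()
    }
    return json.dumps(rendered)

body = render_payload({"username": "{username}", "limit": 1}, "alice")
print(body)  # {"username": "alice", "limit": 1}
```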
``regexCheck`` and non-ASCII usernames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When ``{username}`` is interpolated into a URL **path segment** and the user-supplied username contains characters that would be percent-encoded by :py:func:`urllib.parse.quote` (Cyrillic, Chinese, Korean, Arabic, spaces, etc.), Maigret skips the site with an ``URL-incompatible username`` error rather than sending a request that would land on a generic listing/homepage and trip overly-broad ``presenseStrs``. This default closes the cascade of false positives observed in `issue #459 <https://github.com/soxoj/maigret/issues/459>`_ and `issue #2633 <https://github.com/soxoj/maigret/issues/2633>`_.

Scope of the default:
- Active **only** when ``{username}`` is in the URL path of ``url`` (or ``urlProbe`` if set), e.g. ``https://example.com/u/{username}``.
- **Not** active when ``{username}`` is in the query string (``?name={username}``) or only in ``requestPayload`` — those values are URL-encoded as parameters and most APIs handle them fine.
- **Always** deferred when the site has its own ``regexCheck`` — an explicit per-site rule wins.
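The path-versus-query distinction above can be sketched as follows (a simplification of the real helper; splitting on the first ``?`` is the assumed behavior):

```python
def username_in_path(template: str) -> bool:
    # Only the part of the URL template before '?' counts as "path";
    # a {username} in the query string (or absent entirely) is not blocked.
    path_part, _, _query = template.partition("?")
    return "{username}" in path_part

print(username_in_path("https://example.com/u/{username}"))          # True
print(username_in_path("https://example.com/api?name={username}"))   # False
print(username_in_path("https://api.example.com/check"))             # False
```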
Opting a site into broader matching:

If a site genuinely accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the actual accepted format with an explicit ``regexCheck`` that matches what the site really accepts. A few worked examples:

- A MediaWiki-style wiki that allows any character except the MediaWiki-forbidden punctuation:
  .. code-block:: json

     {
       "url": "https://wiki.example/wiki/User:{username}",
       "regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"
     }

- A Japanese blog platform that allows Unicode word characters plus dash and dot:
  .. code-block:: json

     {
       "url": "https://blog.example/{username}",
       "regexCheck": "^[\\w\\-_\\.]+$"
     }

In Python's regex engine, ``\w`` in a ``str`` pattern matches Unicode letters by default, so Hiragana, Hangul, Cyrillic, etc. all pass.
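This is easy to verify directly with Python's ``re`` module, using the blog-platform regex from the example above:

```python
import re

# \w in a str pattern is Unicode-aware, so CJK/Hangul/Cyrillic usernames
# match, while path separators are still rejected.
pattern = re.compile(r"^[\w\-_\.]+$")

print(bool(pattern.fullmatch("ひらがな")))   # True
print(bool(pattern.fullmatch("홍길동")))     # True
print(bool(pattern.fullmatch("Александр")))  # True
print(bool(pattern.fullmatch("a/b")))        # False
```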
**Do not** paper this over with ``"regexCheck": "."`` — that's a placeholder, not a description of what the site accepts; it will let any string through, including URLs and emails that other parts of Maigret may pick up and feed back into recursive search (see ``parse_usernames`` in ``checking.py``).

The complementary direction also matters: if you notice an existing site with a too-permissive ``regexCheck`` (e.g. ``"^[^\\.]+$"``, which means "anything but a dot" and lets non-ASCII through), tighten it to the actual accepted character class for the site (typically ``"^[A-Za-z0-9_-]+$"`` for ASCII slugs) when fixing related false positives.
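The difference between the loose and tightened classes is easy to check:

```python
import re

# "Anything but a dot" admits non-ASCII usernames; the ASCII slug class
# rejects them, which is what path-segment sites usually need.
loose = re.compile(r"^[^\.]+$")
tight = re.compile(r"^[A-Za-z0-9_-]+$")

print(bool(loose.fullmatch("Александр")))  # True  (too permissive)
print(bool(tight.fullmatch("Александр")))  # False (correctly rejected)
print(bool(tight.fullmatch("user_01")))    # True
```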
``protection`` (site protection tracking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -49,6 +49,34 @@ SUPPORTED_IDS = (
BAD_CHARS = "#"


def _username_fits_url_template(site: MaigretSite, username: str) -> bool:
    """Decide whether a username can be safely substituted into a site's URL
    path without producing a percent-encoded slug that the site cannot match.

    Rationale: most sites that interpolate ``{username}`` into a URL path
    segment treat the slug as an ASCII identifier. When a username contains
    non-ASCII characters (or other reserved characters), ``urllib.parse.quote``
    percent-encodes the bytes; the site typically cannot resolve such a slug
    and falls back to a generic listing/homepage that trips overly-broad
    ``presenseStrs`` markers, producing a false CLAIMED. See issues #459 and
    #2633. Sites that genuinely accept broader character sets (e.g. wikis
    that allow Unicode usernames) opt into permissive matching by setting
    their own ``regexCheck``; in that case this helper is bypassed entirely.

    Returns True when the check should proceed, False when the result is
    inherently unreliable and the site should be skipped (ILLEGAL).
    """
    if site.regex_check:
        return True
    template = site.url_probe or site.url or ""
    if "{username}" not in template:
        return True
    path_part, _sep, _query = template.partition("?")
    if "{username}" not in path_part:
        return True
    return quote(username, safe='') == username


def build_cloudflare_bypass_config(
    settings_obj: Optional[Any], force_enable: bool = False
) -> Optional[Dict[str, Any]]:
def build_cloudflare_bypass_config(
|
||||
settings_obj: Optional[Any], force_enable: bool = False
|
||||
) -> Optional[Dict[str, Any]]:
|
||||
@@ -880,6 +908,23 @@ def make_site_result(
        results_site["http_status"] = ""
        results_site["response_text"] = ""
        # query_notify.update(results_site["status"])
    # username would be percent-encoded into a path segment — see #459/#2633.
    elif not _username_fits_url_template(site, username):
        results_site["status"] = MaigretCheckResult(
            username,
            site.name,
            url,
            MaigretCheckStatus.ILLEGAL,
            error=CheckError(
                'URL-incompatible username',
                'username contains characters that would be percent-encoded '
                'in this site\'s URL path; result would be unreliable. Add a '
                '`regexCheck` to opt this site in if it accepts these chars.'
            ),
        )
        results_site["url_user"] = ""
        results_site["http_status"] = ""
        results_site["response_text"] = ""
    else:
        # URL of user on site (if it exists)
        results_site["url_user"] = url
@@ -57,7 +57,8 @@
    "\"routePath\":null"
],
"errors": {
    "Login • Instagram": "Login required",
    "\"routePath\":\"\\/\"": "Login required (rate-limited or session blocked)"
},
"alexaRank": 4,
"urlMain": "https://www.instagram.com/",
@@ -3766,7 +3767,7 @@
"absenceStrs": [
    "Couldn't find any profile with name"
],
"regexCheck": "^[A-Za-z0-9_]{3,16}$",
"usernameClaimed": "blue",
"usernameUnclaimed": "noonewouldeverusethis7",
"alexaRank": 1635,
@@ -8217,7 +8218,17 @@
"Namuwiki": {
    "url": "https://namu.wiki/w/%EC%82%AC%EC%9A%A9%EC%9E%90:{username}",
    "urlMain": "https://namu.wiki/",
    "checkType": "message",
    "presenseStrs": [
        "<meta property=\"og:title\""
    ],
    "absenceStrs": [
        "새 문서 만들기"
    ],
    "regexCheck": "^[\\w\\-_.]+$",
    "protection": [
        "cf_js_challenge"
    ],
    "usernameClaimed": "namu",
    "usernameUnclaimed": "noonewouldeverusethis7",
    "alexaRank": 7047,
@@ -13241,7 +13252,7 @@
    "ru"
],
"checkType": "response_url",
"regexCheck": "^[A-Za-z0-9_.]+$",
"alexaRank": 29071,
"urlMain": "https://studfile.net",
"url": "https://studfile.net/users/{username}/",
@@ -15602,7 +15613,7 @@
"tags": [
    "coding"
],
"regexCheck": "^[A-Za-z0-9_-]+$",
"checkType": "message",
"absenceStrs": [
    "<title>Users - Hacking with Swift</title>"
@@ -17095,7 +17106,7 @@
"tags": [
    "hacking"
],
"regexCheck": "^[A-Za-z0-9_-]+$",
"checkType": "message",
"absenceStrs": [
    "Cannot Retrieve Information For The Specified Username"
@@ -17555,7 +17566,7 @@
"errors": {
    "An error has occurred.": "Site error"
},
"regexCheck": "^[A-Za-z0-9_-]+$",
"checkType": "message",
"absenceStrs": [
    "No such user."
@@ -20679,7 +20690,7 @@
"tags": [
    "ru"
],
"regexCheck": "^[A-Za-z0-9_-]+$",
"checkType": "message",
"absenceStrs": [
    "Указанный пользователь не найден"
@@ -20811,7 +20822,7 @@
"tags": [
    "hu"
],
"regexCheck": "^[A-Za-z0-9_-]+$",
"checkType": "message",
"absenceStrs": [
    "<title>Log in - Chan4Chan</title>"
@@ -1,8 +1,8 @@
{
    "version": 1,
    "updated_at": "2026-05-17T08:44:03Z",
    "sites_count": 3155,
    "min_maigret_version": "0.6.1",
    "data_sha256": "896a15cfb0de131848de5ae915a81d60d9d86a3e4537dc1004adeab29ceb4b43",
    "data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
}
@@ -3159,16 +3159,16 @@ Rank data fetched from Majestic Million by domains.
1.  [GreasyFork (https://greasyfork.org)](https://greasyfork.org)*: top 100M, coding*
1.  [Faceit (https://faceit.com/)](https://faceit.com/)*: top 100M, gaming*

The list was updated at (2026-05-17)

## Statistics

Enabled/total sites: 2522/3155 = 79.94%

Incomplete message checks: 311/2522 = 12.33% (false positive risks)

Status code checks: 634/2522 = 25.14% (false positive risks)

False positive risk (total): 37.47%

Sites with probing: 500px, Armchairgm, BinarySearch (disabled), BleachFandom, Bluesky, BongaCams, Boosty, BuyMeACoffee, Calendly, Cent, Chess, Code Sandbox (disabled), Code Snippet Wiki, DailyMotion, Discord, Diskusjon.no, Disqus, Docker Hub, Duolingo, Faceit, FandomCommunityCentral, GitHub, GitLab, Google Plus (archived), Gravatar, HackTheBox, Hackerrank, Hashnode, Holopin, Imgur, Issuu, Keybase, Kick, Kvinneguiden, LeetCode, Lesswrong, Livejasmin, LocalCryptos (disabled), Medium, MicrosoftLearn, MixCloud, Monkeytype, NPM, Niftygateway, Omg.lol, OnlyFans, Paragraph, Picsart, Plurk, Polarsteps, Rarible, Reddit, Reddit Search (Pushshift) (disabled), Revolut.me, RoyalCams, Scratch, Soop, SportsTracker, Spotify, StackOverflow, Substack, TAP'D, Topcoder, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Vivino, Warframe Market, Warpcast, Weibo, Wikipedia, Yapisal (disabled), YouNow, en.brickimedia.org, forums.grandstream.com, nightbot, notabug.org, qiwi.me (disabled)
@@ -13,6 +13,7 @@ from maigret.checking import (
    timeout_check,
    debug_response_logging,
    process_site_result,
    _username_fits_url_template,
)
from maigret.errors import CheckError
from maigret.result import MaigretCheckResult, MaigretCheckStatus
@@ -126,6 +127,113 @@ def test_detect_error_page_ok():
    assert detect_error_page("hello world", 200, {}, ignore_403=False) is None
def test_detect_error_page_instagram_login_wall():
    """Regression for #11: when Instagram serves the login wall (typically the
    response after rate-limiting an unauthenticated client), the JSON state
    contains `"routePath":"\\/"` (root path) rather than a username route. The
    Instagram entry in data.json carries this marker in `errors` so the result
    surfaces as UNKNOWN instead of a false AVAILABLE.
    """
    instagram_errors = {
        "Login • Instagram": "Login required",
        '"routePath":"\\/"': "Login required (rate-limited or session blocked)",
    }
    login_wall_html = '...{"routePath":"\\/"},"timeSpent":...'
    err = detect_error_page(login_wall_html, 200, instagram_errors, ignore_403=False)
    assert err is not None
    assert err.type == "Site-specific"
    assert "rate-limited" in err.desc
def _site_for_url(url_pattern, regex_check=None, url_probe=None):
    """Build a minimal MaigretSite stub for the URL-template helper tests."""
    raw = {
        "url": url_pattern,
        "urlMain": "https://example.com/",
        "checkType": "message",
        "usernameClaimed": "alice",
        "usernameUnclaimed": "noone",
    }
    if regex_check is not None:
        raw["regexCheck"] = regex_check
    if url_probe is not None:
        raw["urlProbe"] = url_probe
    return MaigretSite("Example", raw)
# Regression tests for #459 / #2633 — usernames that would be percent-encoded
# into a URL path segment trip generic presence markers on fallback pages.
def test_username_fits_path_segment_ascii_slug_passes():
    site = _site_for_url("https://example.com/u/{username}")
    assert _username_fits_url_template(site, "alice") is True
    assert _username_fits_url_template(site, "alice-bob") is True
    assert _username_fits_url_template(site, "alice.bob_42") is True
def test_username_fits_path_segment_non_ascii_blocked():
    site = _site_for_url("https://example.com/u/{username}")
    # Cyrillic
    assert _username_fits_url_template(site, "Александр") is False
    # Chinese
    assert _username_fits_url_template(site, "快嘴摩卡酱") is False
    # Korean
    assert _username_fits_url_template(site, "홍길동") is False
    # Space (also percent-encoded)
    assert _username_fits_url_template(site, "alice bob") is False
def test_username_fits_query_string_is_unconstrained():
    """If {username} sits in the query string, the value is URL-encoded as a
    parameter and most APIs handle that fine — don't block."""
    site = _site_for_url("https://example.com/api/users?name={username}")
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
    assert _username_fits_url_template(site, "Александр") is True
def test_username_fits_explicit_regex_check_bypasses_helper():
    """When the site declares its own regexCheck, the helper defers entirely."""
    # Permissive site: accepts anything via Unicode-friendly regex.
    site = _site_for_url(
        "https://wiki.example/User:{username}", regex_check=r"^[\w\- .]+$"
    )
    assert _username_fits_url_template(site, "Александр") is True
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
def test_username_fits_url_probe_overrides_url():
    """urlProbe is the actual request URL; the helper must use it when set."""
    # Path-segment url, but urlProbe is a clean query API → no validation
    site = _site_for_url(
        "https://example.com/u/{username}",
        url_probe="https://example.com/api/u?name={username}",
    )
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
def test_username_fits_post_payload_sites_skipped():
    """Sites with {username} only in requestPayload (no {username} in URL
    template at all) should pass unconditionally — payload is JSON-encoded,
    not URL-path-encoded."""
    site = _site_for_url("https://api.example.com/check")
    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
def test_detect_error_page_instagram_marker_no_false_positive_on_profile():
    """The login-wall marker must NOT match a real profile page. On a claimed
    user page, `routePath` carries the user-route template
    (`"routePath":"\\/{username}\\/..."`); the closing-quote form
    `"routePath":"\\/"` only appears on the login wall.
    """
    instagram_errors = {
        '"routePath":"\\/"': "Login required (rate-limited or session blocked)",
    }
    profile_html = (
        'foo,"routePath":"\\/{username}\\/{?tab}\\/{?view_type}\\/",bar'
    )
    err = detect_error_page(profile_html, 200, instagram_errors, ignore_403=False)
    assert err is None
def test_parse_usernames_single_username():
    logger = Mock()
    result = parse_usernames({"profile_username": "alice"}, logger)