mirror of https://github.com/soxoj/maigret.git
synced 2026-05-17 11:55:36 +00:00

Compare commits (2 commits)

| Author | SHA1 | Date |
|---|---|---|
|  | ceed9aa9cc |  |
|  | 51a5169987 |  |
```diff
@@ -95,13 +95,6 @@ Each site entry uses one of three `checkType` modes to decide whether a profile
 
 **Errors vs absence.** Anything that means "the server can't answer right now" — rate limits, captchas, "Checking your browser", "unusual traffic", maintenance pages — belongs in `errors` (mapping the substring to a human-readable error string), not in `absenceStrs`. The `errors` mechanism produces an UNKNOWN result instead of a false CLAIMED or false AVAILABLE.
 
-**`regexCheck` and non-ASCII usernames.** When `{username}` is interpolated into a URL **path segment** and the username contains characters that need percent-encoding (Cyrillic, Chinese, Korean, spaces, etc.), Maigret skips the site with an `URL-incompatible username` error rather than send a request that would land on a generic listing/homepage and trip overly-broad `presenseStrs`. This default avoids the cascade of false-positives observed in [#459](https://github.com/soxoj/maigret/issues/459) and [#2633](https://github.com/soxoj/maigret/issues/2633). Two corollaries for site entries:
-
-- If your site legitimately accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the actual format with an explicit `regexCheck`. For example, a MediaWiki-style wiki could use `"regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"`; a Japanese blog platform might use `"regexCheck": "^[\\w\\-_\\.]+$"` (Python's `\w` matches Unicode letters). Don't paper this over with `regexCheck: "."` — pick a regex that reflects what the site actually accepts.
-- If `{username}` is in a query string (`?name={username}`) or only in `requestPayload`, the default has no effect — query/body values are URL-encoded as parameters and most APIs handle that fine.
-
-The default kicks in *only* when no per-site `regexCheck` is set. Existing per-site regexes always win.
-
 Full reference for `checkType`, `urlProbe`, `engine`, and the rest of the `data.json` schema is in the [development guide](docs/source/development.rst), section *How to fix false-positives*.
 
 ### Editing `data.json` safely
```
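Not part of the diff, but for orientation: the shape the *Errors vs absence* paragraph describes, sketched as a Python literal (data.json itself carries the same structure in JSON; the marker strings below are invented for illustration):

```python
# Hypothetical fragment of a site entry: `errors` maps page substrings that
# mean "the server can't answer" to readable error names, yielding UNKNOWN;
# `absenceStrs` stays reserved for genuine "no such profile" markers.
site_entry = {
    "errors": {
        "Checking your browser": "Cloudflare anti-bot challenge",
        "unusual traffic": "Rate limit page",
    },
    "absenceStrs": ["Profile not found"],
}
```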
```diff
@@ -134,50 +134,11 @@ There are few options for sites data.json helpful in various cases:
 - ``engine`` - a predefined check for the sites of certain type (e.g. forums), see the ``engines`` section in the JSON file
 - ``headers`` - a dictionary of additional headers to be sent to the site
 - ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
-- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives (see ``regexCheck`` and the non-ASCII default below)
+- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
 - ``requestMethod`` - set the HTTP method to use (e.g., ``POST``). By default, Maigret natively defaults to GET or HEAD.
 - ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), extremely useful for parsing GraphQL or modern JSON APIs.
 - ``protection`` - a list of protection types detected on the site (see below).
 
-``regexCheck`` and non-ASCII usernames
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-When ``{username}`` is interpolated into a URL **path segment** and the user-supplied username contains characters that would be percent-encoded by :py:func:`urllib.parse.quote` (Cyrillic, Chinese, Korean, Arabic, spaces, etc.), Maigret skips the site with an ``URL-incompatible username`` error rather than send a request that would land on a generic listing/homepage and trip overly-broad ``presenseStrs``. This default closes the cascade of false-positives observed in `issue #459 <https://github.com/soxoj/maigret/issues/459>`_ and `issue #2633 <https://github.com/soxoj/maigret/issues/2633>`_.
-
-Scope of the default:
-
-- Active **only** when ``{username}`` is in the URL path of ``url`` (or ``urlProbe`` if set), e.g. ``https://example.com/u/{username}``.
-- **Not** active when ``{username}`` is in the query string (``?name={username}``) or only in ``requestPayload`` — those values are URL-encoded as parameters and most APIs handle them fine.
-- **Always** deferred when the site has its own ``regexCheck`` — an explicit per-site rule wins.
-
-Opting a site into broader matching:
-
-If a site genuinely accepts non-ASCII characters in the URL path (a wiki that mounts Unicode usernames, a Russian forum that serves Cyrillic slugs, etc.), declare the actual accepted format with an explicit ``regexCheck`` that matches your reality. A few worked examples:
-
-- A MediaWiki-style wiki that allows any character except the MediaWiki-forbidden punctuation:
-
-  .. code-block:: json
-
-     {
-        "url": "https://wiki.example/wiki/User:{username}",
-        "regexCheck": "^[^\\/\\\\#<>\\[\\]\\|{}]+$"
-     }
-
-- A Japanese blog platform that allows Unicode word characters + dash + dot:
-
-  .. code-block:: json
-
-     {
-        "url": "https://blog.example/{username}",
-        "regexCheck": "^[\\w\\-_\\.]+$"
-     }
-
-In Python's regex engine, ``\\w`` against a ``str`` pattern matches Unicode letters by default, so Hiragana / Hangul / Cyrillic / etc. all pass.
-
-**Do not** paper this over with ``"regexCheck": "."`` — that's a placeholder, not a description of what the site accepts; it will let any string through, including URLs and emails that other parts of Maigret may pick up and feed back into recursive search (see ``parse_usernames`` in ``checking.py``).
-
-The complementary direction also matters: if you notice an existing site with a too-permissive ``regexCheck`` (e.g. ``"^[^\\.]+$"``, which means "anything but a dot" — that gladly lets non-ASCII through), tighten it to the actual accepted character class for the site (typically ``"^[A-Za-z0-9_-]+$"`` for ASCII slugs) when fixing related false-positives.
-
 ``protection`` (site protection tracking)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
```
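The ``\w`` behaviour cited in the removed section is easy to verify with the standard library alone; a quick illustrative check, not code from the commit:

```python
# Python's re module: \w in a str pattern matches Unicode word characters,
# which is why the permissive JSON pattern "^[\\w\\-_\\.]+$" admits
# Hiragana / Hangul / Cyrillic names.
import re

pattern = re.compile(r"^[\w\-_\.]+$")
for name in ("alice", "Александр", "홍길동", "ひらがな", "alice bob"):
    print(f"{name!r}: {bool(pattern.fullmatch(name))}")
# Everything above matches except 'alice bob' (the space is not in the class).
```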
+11 -49

```diff
@@ -31,7 +31,7 @@ from .executors import AsyncioQueueGeneratorExecutor
 from .result import MaigretCheckResult, MaigretCheckStatus
 from .sites import MaigretDatabase, MaigretSite
 from .types import QueryOptions, QueryResultWrapper
-from .utils import ascii_data_display, get_random_user_agent
+from .utils import ascii_data_display, get_random_user_agent, is_plausible_username
 
 
 SUPPORTED_IDS = (
@@ -49,34 +49,6 @@ SUPPORTED_IDS = (
 BAD_CHARS = "#"
 
 
-def _username_fits_url_template(site: MaigretSite, username: str) -> bool:
-    """Decide whether a username can be safely substituted into a site's URL
-    path segment without producing a percent-encoded slug that the site cannot match.
-
-    Rationale: most sites that interpolate ``{username}`` into a URL path
-    segment treat the slug as an ASCII identifier. When a username contains
-    non-ASCII characters (or other reserved characters), ``urllib.parse.quote``
-    percent-encodes the bytes; the site typically cannot resolve such a slug
-    and falls back to a generic listing/homepage that trips overly-broad
-    ``presenseStrs`` markers, producing a false CLAIMED. See issues #459 and
-    #2633. Sites that genuinely accept broader character sets (e.g. wikis
-    that allow Unicode usernames) opt into permissive matching by setting
-    their own ``regexCheck``; in that case this helper is bypassed entirely.
-
-    Returns True when the check should proceed, False when the result is
-    inherently unreliable and the site should be skipped (ILLEGAL).
-    """
-    if site.regex_check:
-        return True
-    template = site.url_probe or site.url or ""
-    if "{username}" not in template:
-        return True
-    path_part, _sep, _query = template.partition("?")
-    if "{username}" not in path_part:
-        return True
-    return quote(username, safe='') == username
-
-
 def build_cloudflare_bypass_config(
     settings_obj: Optional[Any], force_enable: bool = False
 ) -> Optional[Dict[str, Any]]:
@@ -667,7 +639,6 @@ def process_site_result(
 
     html_text, status_code, check_error = response
 
-    # TODO: add elapsed request time counting
     response_time = None
 
     if logger.level == logging.DEBUG:
@@ -701,7 +672,6 @@ def process_site_result(
                 f"Failed activation {method} for site {site.name}: {str(e)}",
                 exc_info=True,
             )
-            # TODO: temporary check error
 
     site_name = site.pretty_name
     # presense flags
@@ -908,23 +878,6 @@ def make_site_result(
         results_site["http_status"] = ""
         results_site["response_text"] = ""
         # query_notify.update(results_site["status"])
-    # username would be percent-encoded into a path segment — see #459/#2633.
-    elif not _username_fits_url_template(site, username):
-        results_site["status"] = MaigretCheckResult(
-            username,
-            site.name,
-            url,
-            MaigretCheckStatus.ILLEGAL,
-            error=CheckError(
-                'URL-incompatible username',
-                'username contains characters that would be percent-encoded '
-                'in this site\'s URL path; result would be unreliable. Add a '
-                '`regexCheck` to opt this site in if it accepts these chars.'
-            ),
-        )
-        results_site["url_user"] = ""
-        results_site["http_status"] = ""
-        results_site["response_text"] = ""
     else:
         # URL of user on site (if it exists)
         results_site["url_user"] = url
@@ -1341,7 +1294,6 @@ async def site_self_check(
         )
 
         # don't disable entries with other ids types
-        # TODO: make normal checking
         if site.name not in results_dict:
             logger.info(results_dict)
             changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
@@ -1570,13 +1522,23 @@ def parse_usernames(extracted_ids_data, logger) -> Dict:
     new_usernames = {}
     for k, v in extracted_ids_data.items():
         if "username" in k and not "usernames" in k:
+            if is_plausible_username(v):
                 new_usernames[v] = "username"
+            else:
+                logger.debug(
+                    f"Rejected non-username value extracted under key {k!r}: {v!r}"
+                )
         elif "usernames" in k:
             try:
                 tree = ast.literal_eval(v)
                 if isinstance(tree, list):
                     for n in tree:
+                        if is_plausible_username(n):
                             new_usernames[n] = "username"
+                        else:
+                            logger.debug(
+                                f"Rejected non-username item from list under key {k!r}: {n!r}"
+                            )
             except Exception as e:
                 logger.warning(e)
         if k in SUPPORTED_IDS:
```
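The removed helper's decision reduces to a single standard-library call. A dependency-free sketch of the same rule (the function name `fits` is mine, not maigret's; the explicit `regexCheck` bypass is omitted):

```python
# Sketch of the removed path-safety rule, with no maigret imports:
# a username may be substituted into a URL *path* only when percent-encoding
# it is a no-op; query-string and payload-only templates are unconstrained.
from urllib.parse import quote

def fits(template: str, username: str) -> bool:
    path_part = template.partition("?")[0]
    if "{username}" not in path_part:
        return True  # query-string / payload-only site: nothing to validate
    return quote(username, safe='') == username

print(fits("https://example.com/u/{username}", "Александр"))         # False
print(fits("https://example.com/api?name={username}", "Александр"))  # True
```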
```diff
@@ -77,7 +77,6 @@ ERRORS_TYPES = {
     'Connecting failure': 'Try to decrease number of parallel connections (e.g. -n 10)',
 }
 
-# TODO: checking for reason
 ERRORS_REASONS = {
     'Login required': 'Add authorization cookies through `--cookies-jar-file` (see cookies.txt)',
 }
```
+11 -1

```diff
@@ -55,7 +55,7 @@ from .report import (
 from .sites import MaigretDatabase
 from .submit import Submitter
 from .types import QueryResultWrapper
-from .utils import get_dict_ascii_tree
+from .utils import get_dict_ascii_tree, is_plausible_username
 from .settings import Settings
 from .permutator import Permute
 
@@ -85,13 +85,23 @@ def extract_ids_from_page(url, logger, timeout=5) -> dict:
     for k, v in info.items():
         # TODO: merge with the same functionality in checking module
         if 'username' in k and not 'usernames' in k:
+            if is_plausible_username(v):
                 results[v] = 'username'
+            else:
+                logger.debug(
+                    f"Rejected non-username value extracted under key {k!r}: {v!r}"
+                )
         elif 'usernames' in k:
             try:
                 tree = ast.literal_eval(v)
                 if isinstance(tree, list):
                     for n in tree:
+                        if is_plausible_username(n):
                             results[n] = 'username'
+                        else:
+                            logger.debug(
+                                f"Rejected non-username item from list under key {k!r}: {n!r}"
+                            )
             except Exception as e:
                 logger.warning(e)
         if k in SUPPORTED_IDS:
```
```diff
@@ -516,7 +516,6 @@ def generate_report_context(username_results: list):
                     tag = pycountry.countries.search_fuzzy(v)[
                         0
                     ].alpha_2.lower()  # type: ignore[attr-defined]
-                    # TODO: move countries to another struct
                     tags[tag] = tags.get(tag, 0) + 1
                 except Exception as e:
                     logging.debug(
@@ -568,7 +567,6 @@ def generate_report_context(username_results: list):
 
     return {
         "username": first_username,
-        # TODO: return brief list
         "brief": brief,
         "results": username_results,
         "first_seen": first_seen,
```
```diff
@@ -3767,7 +3767,7 @@
         "absenceStrs": [
             "Couldn't find any profile with name"
         ],
-        "regexCheck": "^[A-Za-z0-9_]{3,16}$",
+        "regexCheck": "^.{1,25}$",
        "usernameClaimed": "blue",
        "usernameUnclaimed": "noonewouldeverusethis7",
        "alexaRank": 1635,
@@ -8218,17 +8218,7 @@
     "Namuwiki": {
         "url": "https://namu.wiki/w/%EC%82%AC%EC%9A%A9%EC%9E%90:{username}",
         "urlMain": "https://namu.wiki/",
-        "checkType": "message",
-        "presenseStrs": [
-            "<meta property=\"og:title\""
-        ],
-        "absenceStrs": [
-            "새 문서 만들기"
-        ],
-        "regexCheck": "^[\\w\\-_.]+$",
-        "protection": [
-            "cf_js_challenge"
-        ],
+        "checkType": "status_code",
        "usernameClaimed": "namu",
        "usernameUnclaimed": "noonewouldeverusethis7",
        "alexaRank": 7047,
@@ -13252,7 +13242,7 @@
            "ru"
        ],
        "checkType": "response_url",
-        "regexCheck": "^[A-Za-z0-9_.]+$",
+        "regexCheck": "^[^-]+$",
        "alexaRank": 29071,
        "urlMain": "https://studfile.net",
        "url": "https://studfile.net/users/{username}/",
@@ -15613,7 +15603,7 @@
        "tags": [
            "coding"
        ],
-        "regexCheck": "^[A-Za-z0-9_-]+$",
+        "regexCheck": "^[^\\.]+$",
        "checkType": "message",
        "absenceStrs": [
            "<title>Users - Hacking with Swift</title>"
@@ -17106,7 +17096,7 @@
        "tags": [
            "hacking"
        ],
-        "regexCheck": "^[A-Za-z0-9_-]+$",
+        "regexCheck": "^[^\\.]+$",
        "checkType": "message",
        "absenceStrs": [
            "Cannot Retrieve Information For The Specified Username"
@@ -17566,7 +17556,7 @@
        "errors": {
            "An error has occurred.": "Site error"
        },
-        "regexCheck": "^[A-Za-z0-9_-]+$",
+        "regexCheck": "^[^\\.]+$",
        "checkType": "message",
        "absenceStrs": [
            "No such user."
@@ -20690,7 +20680,7 @@
        "tags": [
            "ru"
        ],
-        "regexCheck": "^[A-Za-z0-9_-]+$",
+        "regexCheck": "^[^\\.]+$",
        "checkType": "message",
        "absenceStrs": [
            "Указанный пользователь не найден"
@@ -20822,7 +20812,7 @@
        "tags": [
            "hu"
        ],
-        "regexCheck": "^[A-Za-z0-9_-]+$",
+        "regexCheck": "^[^\\.]+$",
        "checkType": "message",
        "absenceStrs": [
            "<title>Log in - Chan4Chan</title>"
```
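For context on these `regexCheck` swaps, a small illustrative comparison of what each variant accepts (not part of the commit):

```python
# "Anything but a dot" passes non-ASCII and space-containing strings that
# the explicit ASCII character class rejects.
import re

loose = re.compile(r"^[^\.]+$")          # "^[^\\.]+$" in JSON escaping
tight = re.compile(r"^[A-Za-z0-9_-]+$")  # "^[A-Za-z0-9_-]+$"
for name in ("alice", "Александр", "alice bob", "alice.bob"):
    print(f"{name!r}: loose={bool(loose.fullmatch(name))} tight={bool(tight.fullmatch(name))}")
# 'alice' passes both; 'Александр' and 'alice bob' pass only the loose form;
# 'alice.bob' fails both (the dot is excluded / not in the class).
```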
```diff
@@ -1,8 +1,8 @@
 {
     "version": 1,
-    "updated_at": "2026-05-17T08:44:03Z",
+    "updated_at": "2026-05-16T16:00:20Z",
     "sites_count": 3155,
     "min_maigret_version": "0.6.1",
-    "data_sha256": "896a15cfb0de131848de5ae915a81d60d9d86a3e4537dc1004adeab29ceb4b43",
+    "data_sha256": "0997b68c05eedb6e714432ed79580688d4923c56ef1ebf46db69b90039ef00d7",
     "data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
 }
```
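The manifest pairs `data_url` with `data_sha256`; a consumer could verify a downloaded database like this (an illustrative sketch, not code from the repository):

```python
# Fetch data.json and compare its SHA-256 against the manifest's data_sha256
# for the commit you checked out.
import hashlib
import urllib.request

URL = "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
with urllib.request.urlopen(URL) as resp:
    digest = hashlib.sha256(resp.read()).hexdigest()
print(digest)  # should equal the manifest's data_sha256 field
```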
```diff
@@ -127,3 +127,29 @@ def get_match_ratio(base_strs: list):
 
 def generate_random_username():
     return ''.join(random.choices(string.ascii_lowercase, k=10))
+
+
+def is_plausible_username(value: Any) -> bool:
+    """Reject obviously non-username strings extracted from sites' identity data.
+
+    Extractor schemes occasionally populate fields named like ``*_username``
+    with URLs (e.g. ``instagram_username`` -> ``https://instagram.com/X``) or
+    emails (e.g. ``your_username`` -> ``user@example.com``). Feeding such a
+    value back into a site URL template produces broken requests on every
+    subsequent site, which manifests as a cascade of false errors and the
+    "wrong username" symptom in #1403.
+    """
+    if not isinstance(value, str):
+        return False
+    s = value.strip()
+    if not s:
+        return False
+    if "://" in s or s.startswith(("http://", "https://", "www.", "//")):
+        return False
+    if "/" in s:
+        return False
+    if any(c.isspace() for c in s):
+        return False
+    if "@" in s and "." in s:
+        return False
+    return True
```
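Observed behaviour of the new helper, assuming maigret is installed so it can be imported from `maigret.utils` exactly as the diff adds it (values taken from the tests later in this comparison):

```python
from maigret.utils import is_plausible_username

assert is_plausible_username("alice")                           # plain handle
assert is_plausible_username("Алиса")                           # non-ASCII is allowed here
assert not is_plausible_username("https://gravatar.com/alice")  # leaked URL (#1403)
assert not is_plausible_username("alice@example.com")           # leaked email
print("is_plausible_username behaves as the tests specify")
```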
```diff
@@ -3159,16 +3159,16 @@ Rank data fetched from Majestic Million by domains.
 1.  [GreasyFork (https://greasyfork.org)](https://greasyfork.org)*: top 100M, coding*
 1.  [Faceit (https://faceit.com/)](https://faceit.com/)*: top 100M, gaming*
 
-The list was updated at (2026-05-17)
+The list was updated at (2026-05-15)
 ## Statistics
 
 Enabled/total sites: 2522/3155 = 79.94%
 
 Incomplete message checks: 311/2522 = 12.33% (false positive risks)
 
-Status code checks: 634/2522 = 25.14% (false positive risks)
+Status code checks: 635/2522 = 25.18% (false positive risks)
 
-False positive risk (total): 37.47%
+False positive risk (total): 37.51%
 
 Sites with probing: 500px, Armchairgm, BinarySearch (disabled), BleachFandom, Bluesky, BongaCams, Boosty, BuyMeACoffee, Calendly, Cent, Chess, Code Sandbox (disabled), Code Snippet Wiki, DailyMotion, Discord, Diskusjon.no, Disqus, Docker Hub, Duolingo, Faceit, FandomCommunityCentral, GitHub, GitLab, Google Plus (archived), Gravatar, HackTheBox, Hackerrank, Hashnode, Holopin, Imgur, Issuu, Keybase, Kick, Kvinneguiden, LeetCode, Lesswrong, Livejasmin, LocalCryptos (disabled), Medium, MicrosoftLearn, MixCloud, Monkeytype, NPM, Niftygateway, Omg.lol, OnlyFans, Paragraph, Picsart, Plurk, Polarsteps, Rarible, Reddit, Reddit Search (Pushshift) (disabled), Revolut.me, RoyalCams, Scratch, Soop, SportsTracker, Spotify, StackOverflow, Substack, TAP'D, Topcoder, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Vivino, Warframe Market, Warpcast, Weibo, Wikipedia, Yapisal (disabled), YouNow, en.brickimedia.org, forums.grandstream.com, nightbot, notabug.org, qiwi.me (disabled)
 
```
+27 -74

```diff
@@ -13,7 +13,6 @@ from maigret.checking import (
     timeout_check,
     debug_response_logging,
     process_site_result,
-    _username_fits_url_template,
 )
 from maigret.errors import CheckError
 from maigret.result import MaigretCheckResult, MaigretCheckStatus
@@ -145,79 +144,6 @@ def test_detect_error_page_instagram_login_wall():
     assert "rate-limited" in err.desc
 
 
-def _site_for_url(url_pattern, regex_check=None, url_probe=None):
-    """Build a minimal MaigretSite stub for the URL-template helper tests."""
-    raw = {
-        "url": url_pattern,
-        "urlMain": "https://example.com/",
-        "checkType": "message",
-        "usernameClaimed": "alice",
-        "usernameUnclaimed": "noone",
-    }
-    if regex_check is not None:
-        raw["regexCheck"] = regex_check
-    if url_probe is not None:
-        raw["urlProbe"] = url_probe
-    return MaigretSite("Example", raw)
-
-
-# Regression tests for #459 / #2633 — usernames that would be percent-encoded
-# into a URL path segment trip generic presence markers on fallback pages.
-def test_username_fits_path_segment_ascii_slug_passes():
-    site = _site_for_url("https://example.com/u/{username}")
-    assert _username_fits_url_template(site, "alice") is True
-    assert _username_fits_url_template(site, "alice-bob") is True
-    assert _username_fits_url_template(site, "alice.bob_42") is True
-
-
-def test_username_fits_path_segment_non_ascii_blocked():
-    site = _site_for_url("https://example.com/u/{username}")
-    # Cyrillic
-    assert _username_fits_url_template(site, "Александр") is False
-    # Chinese
-    assert _username_fits_url_template(site, "快嘴摩卡酱") is False
-    # Korean
-    assert _username_fits_url_template(site, "홍길동") is False
-    # Space (also percent-encoded)
-    assert _username_fits_url_template(site, "alice bob") is False
-
-
-def test_username_fits_query_string_is_unconstrained():
-    """If {username} sits in the query string, the value is URL-encoded as a
-    parameter and most APIs handle that fine — don't block."""
-    site = _site_for_url("https://example.com/api/users?name={username}")
-    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
-    assert _username_fits_url_template(site, "Александр") is True
-
-
-def test_username_fits_explicit_regex_check_bypasses_helper():
-    """When the site declares its own regexCheck, the helper defers entirely."""
-    # Permissive site: accepts anything via Unicode-friendly regex.
-    site = _site_for_url(
-        "https://wiki.example/User:{username}", regex_check=r"^[\w\- .]+$"
-    )
-    assert _username_fits_url_template(site, "Александр") is True
-    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
-
-
-def test_username_fits_url_probe_overrides_url():
-    """urlProbe is the actual request URL; the helper must use it when set."""
-    # Path-segment url, but urlProbe is a clean query API → no validation
-    site = _site_for_url(
-        "https://example.com/u/{username}",
-        url_probe="https://example.com/api/u?name={username}",
-    )
-    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
-
-
-def test_username_fits_post_payload_sites_skipped():
-    """Sites with {username} only in requestPayload (no {username} in URL
-    template at all) should pass unconditionally — payload is JSON-encoded,
-    not URL-path-encoded."""
-    site = _site_for_url("https://api.example.com/check")
-    assert _username_fits_url_template(site, "快嘴摩卡酱") is True
-
-
 def test_detect_error_page_instagram_marker_no_false_positive_on_profile():
     """The login-wall marker must NOT match a real profile page. On a claimed
     user page, `routePath` carries the user-route template
@@ -254,6 +180,33 @@ def test_parse_usernames_malformed_list():
     assert logger.warning.called
 
 
+def test_parse_usernames_rejects_url_value():
+    """Regression for #1403: extractors sometimes return a URL under a *_username
+    key; that URL must not be fed back as a candidate username."""
+    logger = Mock()
+    result = parse_usernames(
+        {"instagram_username": "https://instagram.com/zuck"}, logger
+    )
+    assert result == {}
+
+
+def test_parse_usernames_rejects_email_value():
+    """Regression for #1403: e.g. socid_extractor's 'your_username' returns an
+    email under a key matching the username heuristic."""
+    logger = Mock()
+    result = parse_usernames({"your_username": "alice@example.com"}, logger)
+    assert result == {}
+
+
+def test_parse_usernames_filters_urls_inside_list():
+    logger = Mock()
+    result = parse_usernames(
+        {"other_usernames": "['alice', 'https://example.com/bob']"}, logger
+    )
+    # 'alice' should survive; the URL should be dropped.
+    assert result == {"alice": "username"}
+
+
 def test_parse_usernames_supported_id():
     logger = Mock()
     # "telegram" is in SUPPORTED_IDS per socid_extractor
```
```diff
@@ -10,6 +10,7 @@ from maigret.utils import (
     URLMatcher,
     get_dict_ascii_tree,
     get_match_ratio,
+    is_plausible_username,
 )
 
 
@@ -144,3 +145,52 @@ def test_get_match_ratio():
     fun = get_match_ratio(["test", "maigret", "username"])
 
     assert fun("test") == 1
+
+
+# Regression tests for #1403 — Gravatar URL leaking into next-iteration username.
+# Extractor schemes occasionally store URLs/emails under '*_username' keys; without
+# validation these were fed back into the search loop and produced cascades of false
+# errors. See maigret/utils.py::is_plausible_username.
+def test_is_plausible_username_accepts_bare_usernames():
+    assert is_plausible_username("alice")
+    assert is_plausible_username("alice.bob")
+    assert is_plausible_username("alice_bob-42")
+    assert is_plausible_username("Алиса")
+
+
+def test_is_plausible_username_rejects_urls():
+    assert not is_plausible_username("https://gravatar.com/alice")
+    assert not is_plausible_username("http://example.com/user/alice")
+    assert not is_plausible_username("//example.com/alice")
+    assert not is_plausible_username("www.facebook.com/zuck")
+
+
+def test_is_plausible_username_accepts_http_prefixed_handles():
+    """Don't over-match: bare names that just happen to start with 'http' or 'www'
+    are legitimate (e.g. the httpie CLI maintainer's handle)."""
+    assert is_plausible_username("httpie")
+    assert is_plausible_username("http_user")
+    assert is_plausible_username("wwwsuperstar")
+
+
+def test_is_plausible_username_rejects_path_like():
+    assert not is_plausible_username("user/alice")
+    assert not is_plausible_username("alice/")
+
+
+def test_is_plausible_username_rejects_emails():
+    assert not is_plausible_username("alice@example.com")
+    assert not is_plausible_username("user@maigret.io")
+
+
+def test_is_plausible_username_rejects_whitespace_and_empty():
+    assert not is_plausible_username("")
+    assert not is_plausible_username(" ")
+    assert not is_plausible_username("alice bob")
+    assert not is_plausible_username("alice\nbob")
+
+
+def test_is_plausible_username_rejects_non_strings():
+    assert not is_plausible_username(None)
+    assert not is_plausible_username(42)
+    assert not is_plausible_username(["alice"])
```
```diff
@@ -165,7 +165,6 @@ if __name__ == '__main__':
     sites = {site.name: site for site in sites_subset}
     engines = db.engines
 
-    # TODO: usernames extractors
     ok_usernames = ['alex', 'god', 'admin', 'red', 'blue', 'john']
     if args.username:
         ok_usernames = [args.username] + ok_usernames
```