mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-16 19:35:38 +00:00
Compare commits
4 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | ceed9aa9cc | | |
| | 51a5169987 | | |
| | 3e77c13743 | | |
| | c5885331d6 | | |
+22
-9
@@ -51,19 +51,32 @@ pip install --upgrade certifi
|
||||
|
||||
If you are behind a corporate proxy, set `HTTPS_PROXY` / `HTTP_PROXY` environment variables and pass `--proxy "$HTTPS_PROXY"` so Maigret uses the same route.
|
||||
|
||||
## ".onion / .i2p sites are skipped"
|
||||
## Running over Tor, I2P, or Tails OS
|
||||
|
||||
These sites only load through the matching gateway. Start your Tor or I2P daemon first, then:
|
||||
Two different goals, two different flags:
|
||||
|
||||
```bash
|
||||
# Tor
|
||||
maigret user --tor-proxy socks5://127.0.0.1:9050
|
||||
- **Route only `.onion` / `.i2p` sites through their gateway** (clearweb checks still use your direct connection). Use `--tor-proxy` / `--i2p-proxy`:
|
||||
```bash
|
||||
maigret user --tor-proxy socks5://127.0.0.1:9050 # only .onion goes via Tor
|
||||
maigret user --i2p-proxy http://127.0.0.1:4444 # only .i2p goes via I2P
|
||||
```
|
||||
Without these flags, `.onion` / `.i2p` sites are silently skipped.
|
||||
|
||||
# I2P
|
||||
maigret user --i2p-proxy http://127.0.0.1:4444
|
||||
```
|
||||
- **Route the whole run through Tor / a proxy** (e.g. on Tails OS, or to anonymise the scan). Use `--proxy`:
|
||||
```bash
|
||||
# system tor daemon (apt install tor, Tails)
|
||||
maigret user --proxy socks5://127.0.0.1:9050 --timeout 60 --retries 2
|
||||
|
||||
Maigret does not launch or manage these daemons — they must already be running.
|
||||
# Tor Browser bundle (different SOCKS port!)
|
||||
maigret user --proxy socks5://127.0.0.1:9150 --timeout 60 --retries 2
|
||||
```
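Because the system `tor` daemon and the Tor Browser bundle listen on different SOCKS ports, it can help to check which one accepts connections before starting a long run. The following is a minimal sketch, assuming Tor runs locally on `127.0.0.1`; the helper name `first_open_port` is hypothetical and not part of Maigret. It only confirms that something accepts TCP connections on the port, not that it is really a SOCKS proxy.

```python
import socket
from typing import Iterable, Optional


def first_open_port(
    host: str = "127.0.0.1",
    ports: Iterable[int] = (9050, 9150),  # system tor, then Tor Browser
    timeout: float = 1.0,
) -> Optional[int]:
    """Return the first port on `host` that accepts a TCP connection, else None."""
    for port in ports:
        try:
            # create_connection raises OSError (refused, timed out, ...) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            continue
    return None


if __name__ == "__main__":
    port = first_open_port()
    if port is None:
        print("No Tor SOCKS port found on 9050 or 9150; is Tor running?")
    else:
        print(f"maigret user --proxy socks5://127.0.0.1:{port} --timeout 60 --retries 2")
```

The script prints a ready-to-copy command line built from the flags shown above, so the port never has to be guessed.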
|
||||
Most public WAFs block Tor exits, so expect more UNKNOWNs over Tor than on a residential line — this is the cost of anonymity, not a bug. Raising `--timeout` to 60 and adding `--retries 2` materially reduces noise.
|
||||
|
||||
On Tails, `torsocks maigret …` / `torify maigret …` do **not** work — Maigret's HTTP client bypasses libc, so the wrapper has no effect. Use `--proxy` instead. To install Maigret over Tor: `torsocks pip install --user maigret`.
|
||||
|
||||
Maigret does not launch or manage Tor / I2P daemons — they must already be running.
|
||||
|
||||
For the full walkthrough (Tor Browser vs system `tor` ports, Tails persistence, reports paths), see the [Tor, I2P, and proxies](https://maigret.readthedocs.io/en/latest/tor-and-proxies.html) page on readthedocs.
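The routing rule described in this section can be summarised in a few lines of Python. This is a sketch of the documented behaviour only, not Maigret's actual implementation: the function name `route_for`, its parameters, and its return shape are hypothetical, and it assumes that `.onion` / `.i2p` sites are skipped when no matching gateway is given, as stated above.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def route_for(
    site_url: str,
    proxy: Optional[str] = None,      # value of --proxy, if any
    tor_proxy: Optional[str] = None,  # value of --tor-proxy, if any
    i2p_proxy: Optional[str] = None,  # value of --i2p-proxy, if any
) -> Tuple[str, Optional[str]]:
    """Return ("proxy", url), ("direct", None) or ("skip", None) for a site URL."""
    host = (urlsplit(site_url).hostname or "").lower()

    # Darknet hosts only use their matching gateway; without it they are skipped.
    if host.endswith(".onion"):
        return ("proxy", tor_proxy) if tor_proxy else ("skip", None)
    if host.endswith(".i2p"):
        return ("proxy", i2p_proxy) if i2p_proxy else ("skip", None)

    # Clearweb hosts ignore --tor-proxy / --i2p-proxy entirely.
    return ("proxy", proxy) if proxy else ("direct", None)
```

For example, `route_for("https://example.com/user", tor_proxy="socks5://127.0.0.1:9050")` yields `("direct", None)`, which is why `--tor-proxy` alone does not anonymise a scan.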
|
||||
|
||||
## "The PDF / XMind / HTML report looks wrong"
|
||||
|
||||
|
||||
@@ -63,6 +63,29 @@ from slow sites. On the other hand, this may cause a long delay to
gather all results. The choice of the right timeout should be carried
out taking into account the bandwidth of the Internet connection.

Network and proxy options
~~~~~~~~~~~~~~~~~~~~~~~~~

``--proxy PROXY_URL`` / ``-p PROXY_URL`` - Route **every** check through
the given HTTP or SOCKS proxy. Examples: ``socks5://127.0.0.1:1080``,
``http://user:pass@proxy.example:3128``. This is the flag to use for
routing the whole run through Tor (``--proxy socks5://127.0.0.1:9050``),
a residential proxy, or any corporate gateway. No default.

``--tor-proxy TOR_PROXY_URL`` - Gateway used **only** for ``.onion``
sites in the database **(default: socks5://127.0.0.1:9050)**. Clearweb
sites are unaffected: for them Maigret uses your direct connection or
``--proxy`` if you set one. Without this flag, ``.onion`` sites are
silently skipped.

``--i2p-proxy I2P_PROXY_URL`` - Gateway used **only** for ``.i2p``
sites in the database **(default: http://127.0.0.1:4444)**. Same
"only matching protocol" rule as ``--tor-proxy``.

Maigret does not start the Tor or I2P daemon for you; launch it first.
For a full walkthrough (Tor Browser vs system ``tor`` port numbers,
Tails OS recipe, timeout/retry tuning), see :doc:`tor-and-proxies`.

``--cookies-jar-file`` - File with custom cookies in Netscape format
(aka cookies.txt). You can install an extension to your browser to
download your own cookies (`Chrome <https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid>`_, `Firefox <https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/>`_).
|
||||
|
||||
@@ -30,6 +30,7 @@ You may be interested in:
|
||||
- :doc:`Command line options <command-line-options>`
|
||||
- :doc:`Features list <features>`
|
||||
- :doc:`Library usage <library-usage>`
|
||||
- :doc:`Tor, I2P, and proxies <tor-and-proxies>`
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
@@ -40,13 +41,19 @@ You may be interested in:
|
||||
usage-examples
|
||||
command-line-options
|
||||
features
|
||||
library-usage
|
||||
philosophy
|
||||
supported-identifier-types
|
||||
tags
|
||||
settings
|
||||
development
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
:caption: Advanced usage
|
||||
|
||||
library-usage
|
||||
settings
|
||||
tor-and-proxies
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
:caption: Use cases
|
||||
|
||||
@@ -0,0 +1,122 @@
.. _tor-and-proxies:

Tor, I2P, and proxies
=====================

Maigret can route checks through an HTTP/SOCKS proxy, the Tor network, or I2P. Three CLI flags cover three distinct goals; knowing which one you need is the most common stumbling block.

``--proxy`` vs ``--tor-proxy`` (and ``--i2p-proxy``)
----------------------------------------------------

The most-asked question (see `issue #544 <https://github.com/soxoj/maigret/issues/544>`_):

- **You want every check to go through Tor** (e.g. you're on Tails OS, or behind a country-level block, or your IP is rate-limited). → Use ``--proxy``, pointing at your Tor SOCKS port:

  .. code-block:: console

     maigret <username> --proxy socks5://127.0.0.1:9050

- **You want to reach ``.onion`` sites in the Maigret database**, while the rest of the run still uses your normal connection. → Use ``--tor-proxy``:

  .. code-block:: console

     maigret <username> --tor-proxy socks5://127.0.0.1:9050

  ``--tor-proxy`` is **only** consulted for sites whose ``url`` is a ``.onion`` host. For every other site Maigret uses your direct connection (or ``--proxy`` if set). Without ``--tor-proxy``, ``.onion`` sites are silently skipped.

The same split applies to ``--i2p-proxy``: it is consulted only for ``.i2p`` hosts, never for clearweb sites.

Defaults: ``--tor-proxy`` defaults to ``socks5://127.0.0.1:9050`` and ``--i2p-proxy`` to ``http://127.0.0.1:4444``. ``--proxy`` has no default. Maigret does **not** launch ``tor`` or an I2P router for you; start the daemon first.

Tor Browser vs system ``tor``: port numbers
-------------------------------------------

The SOCKS port differs by Tor installation:

- **System ``tor`` daemon** (``apt install tor``, ``brew install tor``, Tails) listens on ``9050``.
- **Tor Browser bundle** ships its own ``tor`` listening on ``9150``.

If the connection is refused, try the other port:

.. code-block:: console

   # system tor
   maigret <username> --proxy socks5://127.0.0.1:9050

   # Tor Browser running in the background
   maigret <username> --proxy socks5://127.0.0.1:9150

A note on results over Tor
--------------------------

Most public WAFs (Cloudflare, DDoS-Guard, AWS WAF, Akamai) block Tor exit nodes by default, usually more aggressively than they block datacenter IPs. A Tor run typically produces **more UNKNOWNs and fewer CLAIMEDs** than the same run from a residential connection. This is not a bug in Maigret; it is the cost of anonymity.

Recommended flags for a Tor run:

.. code-block:: console

   maigret <username> --proxy socks5://127.0.0.1:9050 --timeout 60 --retries 2

- ``--timeout 60``: Tor circuits add 1–3 seconds per request; the default 30 s causes spurious timeouts.
- ``--retries 2``: retries cover transient circuit failures, which are common on Tor.
- Optional ``-n 20``: lowering concurrency (default 100) reduces the chance of exits rate-limiting you.

If you mainly need to bypass WAFs (rather than to remain anonymous), a residential proxy will usually outperform Tor by a wide margin. See the **"Lots of sites fail / timeout / return 403"** section in `TROUBLESHOOTING.md <https://github.com/soxoj/maigret/blob/main/TROUBLESHOOTING.md>`_.

Running on Tails OS
-------------------

Tails forces every outbound connection through Tor at the network layer. Maigret needs no special configuration to comply; pointing ``--proxy`` at the Tails Tor daemon is enough:

.. code-block:: console

   maigret <username> --proxy socks5://127.0.0.1:9050 --timeout 60

Things that are **not** needed:

- ``torsocks maigret …`` and ``torify maigret …``: these wrap libc socket calls, but Maigret's HTTP client (``aiohttp`` / ``curl_cffi``) bypasses libc for network I/O, so the wrapper has no effect. Use ``--proxy`` instead.
- ``--tor-proxy``: on Tails, *everything* must go via Tor (the OS enforces this), so the niche "only .onion via Tor" mode that ``--tor-proxy`` provides does not apply.

Installation over Tor on Tails
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``pip`` itself does not know about Tor; on Tails you need ``torsocks`` to wrap it:

.. code-block:: console

   torsocks pip install --user maigret

After install, the binary lands in ``~/.local/bin/maigret``. If you get ``maigret: command not found``, either add ``~/.local/bin`` to ``PATH`` or invoke it as ``python3 -m maigret <username>``.

Persisting Maigret across Tails sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tails wipes ``~/.local/`` on reboot unless you configure the Persistent Storage to keep it. This is Tails configuration, not Maigret configuration; see the official Tails docs:

- `Persistent Storage on Tails <https://tails.boum.org/doc/persistent_storage/>`_
- `Configuring Persistent Storage features <https://tails.boum.org/doc/persistent_storage/configure/>`_

A step-by-step recipe contributed by a user (persisting ``~/.local/lib/python3.9`` and ``~/.local/bin`` and patching ``.bashrc``) is in `issue #544 <https://github.com/soxoj/maigret/issues/544#issuecomment-1356469171>`_. Treat it as a starting point: the Python version and Tails internals change between Tails releases.

Reports on Tails: where to save them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The default ``reports/`` directory lives next to the working directory and is wiped with the amnesiac session. To save reports somewhere persistent, either pass ``-fo``:

.. code-block:: console

   maigret <username> --html -fo "/home/amnesia/Persistent/maigret-reports"

or set ``"reports_path"`` in your ``settings.json`` to a persistent path. See :doc:`settings`.

Programmatic equivalents (Python library)
-----------------------------------------

The same options are available through the Python API. See :doc:`library-usage`; the relevant keyword arguments are ``proxy=``, ``tor_proxy=`` and ``i2p_proxy=``, accepting the same URL formats as the CLI flags.

See also
--------

- :doc:`command-line-options`: full reference for the three flags.
- `TROUBLESHOOTING.md <https://github.com/soxoj/maigret/blob/main/TROUBLESHOOTING.md>`_: quick recipes for ``.onion`` / I2P sites and for WAF-induced 403s.
- :doc:`library-usage`: proxy options for embedded use.
|
||||
+11
-4
@@ -31,7 +31,7 @@ from .executors import AsyncioQueueGeneratorExecutor
|
||||
from .result import MaigretCheckResult, MaigretCheckStatus
|
||||
from .sites import MaigretDatabase, MaigretSite
|
||||
from .types import QueryOptions, QueryResultWrapper
|
||||
from .utils import ascii_data_display, get_random_user_agent
|
||||
from .utils import ascii_data_display, get_random_user_agent, is_plausible_username
|
||||
|
||||
|
||||
SUPPORTED_IDS = (
|
||||
@@ -639,7 +639,6 @@ def process_site_result(
|
||||
|
||||
html_text, status_code, check_error = response
|
||||
|
||||
# TODO: add elapsed request time counting
|
||||
response_time = None
|
||||
|
||||
if logger.level == logging.DEBUG:
|
||||
@@ -673,7 +672,6 @@ def process_site_result(
|
||||
f"Failed activation {method} for site {site.name}: {str(e)}",
|
||||
exc_info=True,
|
||||
)
|
||||
# TODO: temporary check error
|
||||
|
||||
site_name = site.pretty_name
|
||||
# presence flags
|
||||
@@ -1296,7 +1294,6 @@ async def site_self_check(
|
||||
)
|
||||
|
||||
# don't disable entries with other ids types
|
||||
# TODO: make normal checking
|
||||
if site.name not in results_dict:
|
||||
logger.info(results_dict)
|
||||
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
|
||||
@@ -1525,13 +1522,23 @@ def parse_usernames(extracted_ids_data, logger) -> Dict:
|
||||
new_usernames = {}
|
||||
for k, v in extracted_ids_data.items():
|
||||
if "username" in k and not "usernames" in k:
|
||||
if is_plausible_username(v):
|
||||
new_usernames[v] = "username"
|
||||
else:
|
||||
logger.debug(
|
||||
f"Rejected non-username value extracted under key {k!r}: {v!r}"
|
||||
)
|
||||
elif "usernames" in k:
|
||||
try:
|
||||
tree = ast.literal_eval(v)
|
||||
if isinstance(tree, list):
|
||||
for n in tree:
|
||||
if is_plausible_username(n):
|
||||
new_usernames[n] = "username"
|
||||
else:
|
||||
logger.debug(
|
||||
f"Rejected non-username item from list under key {k!r}: {n!r}"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(e)
|
||||
if k in SUPPORTED_IDS:
|
||||
|
||||
@@ -77,7 +77,6 @@ ERRORS_TYPES = {
|
||||
'Connecting failure': 'Try to decrease number of parallel connections (e.g. -n 10)',
|
||||
}
|
||||
|
||||
# TODO: checking for reason
|
||||
ERRORS_REASONS = {
|
||||
'Login required': 'Add authorization cookies through `--cookies-jar-file` (see cookies.txt)',
|
||||
}
|
||||
|
||||
+11
-1
@@ -55,7 +55,7 @@ from .report import (
|
||||
from .sites import MaigretDatabase
|
||||
from .submit import Submitter
|
||||
from .types import QueryResultWrapper
|
||||
from .utils import get_dict_ascii_tree
|
||||
from .utils import get_dict_ascii_tree, is_plausible_username
|
||||
from .settings import Settings
|
||||
from .permutator import Permute
|
||||
|
||||
@@ -85,13 +85,23 @@ def extract_ids_from_page(url, logger, timeout=5) -> dict:
|
||||
for k, v in info.items():
|
||||
# TODO: merge with the same functionality in checking module
|
||||
if 'username' in k and not 'usernames' in k:
|
||||
if is_plausible_username(v):
|
||||
results[v] = 'username'
|
||||
else:
|
||||
logger.debug(
|
||||
f"Rejected non-username value extracted under key {k!r}: {v!r}"
|
||||
)
|
||||
elif 'usernames' in k:
|
||||
try:
|
||||
tree = ast.literal_eval(v)
|
||||
if isinstance(tree, list):
|
||||
for n in tree:
|
||||
if is_plausible_username(n):
|
||||
results[n] = 'username'
|
||||
else:
|
||||
logger.debug(
|
||||
f"Rejected non-username item from list under key {k!r}: {n!r}"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(e)
|
||||
if k in SUPPORTED_IDS:
|
||||
|
||||
@@ -516,7 +516,6 @@ def generate_report_context(username_results: list):
|
||||
tag = pycountry.countries.search_fuzzy(v)[
|
||||
0
|
||||
].alpha_2.lower() # type: ignore[attr-defined]
|
||||
# TODO: move countries to another struct
|
||||
tags[tag] = tags.get(tag, 0) + 1
|
||||
except Exception as e:
|
||||
logging.debug(
|
||||
@@ -568,7 +567,6 @@ def generate_report_context(username_results: list):
|
||||
|
||||
return {
|
||||
"username": first_username,
|
||||
# TODO: return brief list
|
||||
"brief": brief,
|
||||
"results": username_results,
|
||||
"first_seen": first_seen,
|
||||
|
||||
@@ -57,7 +57,8 @@
|
||||
"\"routePath\":null"
|
||||
],
|
||||
"errors": {
|
||||
"Login • Instagram": "Login required"
|
||||
"Login • Instagram": "Login required",
|
||||
"\"routePath\":\"\\/\"": "Login required (rate-limited or session blocked)"
|
||||
},
|
||||
"alexaRank": 4,
|
||||
"urlMain": "https://www.instagram.com/",
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
{
|
||||
"version": 1,
|
||||
"updated_at": "2026-05-15T18:46:56Z",
|
||||
"updated_at": "2026-05-16T16:00:20Z",
|
||||
"sites_count": 3155,
|
||||
"min_maigret_version": "0.6.1",
|
||||
"data_sha256": "df2ab3dbc96bdcdc8aa4e9da485df75ce6c3274814080f00a35e89f7f43783e1",
|
||||
"data_sha256": "0997b68c05eedb6e714432ed79580688d4923c56ef1ebf46db69b90039ef00d7",
|
||||
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
|
||||
}
|
||||
@@ -127,3 +127,29 @@ def get_match_ratio(base_strs: list):

def generate_random_username():
    return ''.join(random.choices(string.ascii_lowercase, k=10))


def is_plausible_username(value: Any) -> bool:
    """Reject obviously non-username strings extracted from sites' identity data.

    Extractor schemes occasionally populate fields named like ``*_username``
    with URLs (e.g. ``instagram_username`` -> ``https://instagram.com/X``) or
    emails (e.g. ``your_username`` -> ``user@example.com``). Feeding such a
    value back into a site URL template produces broken requests on every
    subsequent site, which manifests as a cascade of false errors and the
    "wrong username" symptom in #1403.
    """
    if not isinstance(value, str):
        return False
    s = value.strip()
    if not s:
        return False
    if "://" in s or s.startswith(("http://", "https://", "www.", "//")):
        return False
    if "/" in s:
        return False
    if any(c.isspace() for c in s):
        return False
    if "@" in s and "." in s:
        return False
    return True
|
||||
|
||||
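The same filtering loop now appears in two modules of this change, and one of the hunks carries a TODO about merging them. One possible shape for a shared helper is sketched below. This is an illustration under stated assumptions, not code from the repository: `collect_usernames` is a hypothetical name, the `SUPPORTED_IDS` branch of the real functions is left out, and `is_plausible_username` is repeated here in condensed form only so the example runs on its own.

```python
import ast
from typing import Any, Dict


def is_plausible_username(value: Any) -> bool:
    """Condensed copy of the validator added in this change (same checks)."""
    if not isinstance(value, str):
        return False
    s = value.strip()
    if not s or "://" in s or "/" in s or s.startswith("www."):
        return False
    if any(c.isspace() for c in s):
        return False
    return not ("@" in s and "." in s)


def collect_usernames(extracted: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical shared helper: keep only plausible usernames from extracted ids."""
    found: Dict[str, str] = {}
    for key, value in extracted.items():
        if "usernames" in key:
            # Plural keys hold a Python-literal list encoded as a string.
            try:
                candidates = ast.literal_eval(value)
            except (ValueError, SyntaxError, TypeError):
                continue
            if not isinstance(candidates, list):
                continue
        elif "username" in key:
            candidates = [value]
        else:
            continue
        for candidate in candidates:
            if is_plausible_username(candidate):
                found[candidate] = "username"
    return found
```

With this helper, `collect_usernames({"other_usernames": "['alice', 'https://example.com/bob']"})` keeps `alice` and drops the URL, matching the behaviour the regression tests in this change expect from `parse_usernames`.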
@@ -126,6 +126,40 @@ def test_detect_error_page_ok():
    assert detect_error_page("hello world", 200, {}, ignore_403=False) is None


def test_detect_error_page_instagram_login_wall():
    """Regression for #11: when Instagram serves the login wall (typically the
    response after rate-limiting an unauthenticated client), the JSON state
    contains `"routePath":"\\/"` (root path) rather than a username route. The
    Instagram entry in data.json carries this marker in `errors` so the result
    surfaces as UNKNOWN instead of a false AVAILABLE.
    """
    instagram_errors = {
        "Login • Instagram": "Login required",
        '"routePath":"\\/"': "Login required (rate-limited or session blocked)",
    }
    login_wall_html = '...{"routePath":"\\/"},"timeSpent":...'
    err = detect_error_page(login_wall_html, 200, instagram_errors, ignore_403=False)
    assert err is not None
    assert err.type == "Site-specific"
    assert "rate-limited" in err.desc


def test_detect_error_page_instagram_marker_no_false_positive_on_profile():
    """The login-wall marker must NOT match a real profile page. On a claimed
    user page, `routePath` carries the user-route template
    (`"routePath":"\\/{username}\\/..."`); the closing-quote form
    `"routePath":"\\/"` only appears on the login wall.
    """
    instagram_errors = {
        '"routePath":"\\/"': "Login required (rate-limited or session blocked)",
    }
    profile_html = (
        'foo,"routePath":"\\/{username}\\/{?tab}\\/{?view_type}\\/",bar'
    )
    err = detect_error_page(profile_html, 200, instagram_errors, ignore_403=False)
    assert err is None
|
||||
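The reason the closing quote matters can be shown without Maigret at all. The sketch below assumes that error markers are matched as plain substrings of the response body; that is an assumption about how `detect_error_page` works, and `first_error_marker` is a hypothetical stand-in, not the real function.

```python
from typing import Dict, Optional


def first_error_marker(html: str, errors: Dict[str, str]) -> Optional[str]:
    """Return the description of the first marker found in `html`, else None."""
    for marker, description in errors.items():
        if marker in html:  # plain substring match (assumed behaviour)
            return description
    return None


# Marker and sample bodies copied from the tests in this change.
instagram_errors = {
    '"routePath":"\\/"': "Login required (rate-limited or session blocked)",
}
login_wall_html = '...{"routePath":"\\/"},"timeSpent":...'
profile_html = 'foo,"routePath":"\\/{username}\\/{?tab}\\/{?view_type}\\/",bar'
```

On the login wall the value of `routePath` ends immediately after `\/`, so the marker including the closing quote is present. On a profile page the value continues with `{username}`, so the same marker never occurs as a substring and the page is not misreported.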
|
||||
|
||||
def test_parse_usernames_single_username():
|
||||
logger = Mock()
|
||||
result = parse_usernames({"profile_username": "alice"}, logger)
|
||||
@@ -146,6 +180,33 @@ def test_parse_usernames_malformed_list():
|
||||
assert logger.warning.called
|
||||
|
||||
|
||||
def test_parse_usernames_rejects_url_value():
|
||||
"""Regression for #1403: extractors sometimes return a URL under a *_username
|
||||
key; that URL must not be fed back as a candidate username."""
|
||||
logger = Mock()
|
||||
result = parse_usernames(
|
||||
{"instagram_username": "https://instagram.com/zuck"}, logger
|
||||
)
|
||||
assert result == {}
|
||||
|
||||
|
||||
def test_parse_usernames_rejects_email_value():
|
||||
"""Regression for #1403: e.g. socid_extractor's 'your_username' returns an
|
||||
email under a key matching the username heuristic."""
|
||||
logger = Mock()
|
||||
result = parse_usernames({"your_username": "alice@example.com"}, logger)
|
||||
assert result == {}
|
||||
|
||||
|
||||
def test_parse_usernames_filters_urls_inside_list():
|
||||
logger = Mock()
|
||||
result = parse_usernames(
|
||||
{"other_usernames": "['alice', 'https://example.com/bob']"}, logger
|
||||
)
|
||||
# 'alice' should survive; the URL should be dropped.
|
||||
assert result == {"alice": "username"}
|
||||
|
||||
|
||||
def test_parse_usernames_supported_id():
|
||||
logger = Mock()
|
||||
# "telegram" is in SUPPORTED_IDS per socid_extractor
|
||||
|
||||
@@ -10,6 +10,7 @@ from maigret.utils import (
|
||||
URLMatcher,
|
||||
get_dict_ascii_tree,
|
||||
get_match_ratio,
|
||||
is_plausible_username,
|
||||
)
|
||||
|
||||
|
||||
@@ -144,3 +145,52 @@ def test_get_match_ratio():
    fun = get_match_ratio(["test", "maigret", "username"])

    assert fun("test") == 1


# Regression tests for #1403: Gravatar URL leaking into next-iteration username.
# Extractor schemes occasionally store URLs/emails under '*_username' keys; without
# validation these were fed back into the search loop and produced cascades of false
# errors. See maigret/utils.py::is_plausible_username.
def test_is_plausible_username_accepts_bare_usernames():
    assert is_plausible_username("alice")
    assert is_plausible_username("alice.bob")
    assert is_plausible_username("alice_bob-42")
    assert is_plausible_username("Алиса")


def test_is_plausible_username_rejects_urls():
    assert not is_plausible_username("https://gravatar.com/alice")
    assert not is_plausible_username("http://example.com/user/alice")
    assert not is_plausible_username("//example.com/alice")
    assert not is_plausible_username("www.facebook.com/zuck")


def test_is_plausible_username_accepts_http_prefixed_handles():
    """Don't over-match: bare names that just happen to start with 'http' or 'www'
    are legitimate (e.g. the httpie CLI maintainer's handle)."""
    assert is_plausible_username("httpie")
    assert is_plausible_username("http_user")
    assert is_plausible_username("wwwsuperstar")


def test_is_plausible_username_rejects_path_like():
    assert not is_plausible_username("user/alice")
    assert not is_plausible_username("alice/")


def test_is_plausible_username_rejects_emails():
    assert not is_plausible_username("alice@example.com")
    assert not is_plausible_username("user@maigret.io")


def test_is_plausible_username_rejects_whitespace_and_empty():
    assert not is_plausible_username("")
    assert not is_plausible_username(" ")
    assert not is_plausible_username("alice bob")
    assert not is_plausible_username("alice\nbob")


def test_is_plausible_username_rejects_non_strings():
    assert not is_plausible_username(None)
    assert not is_plausible_username(42)
    assert not is_plausible_username(["alice"])
|
||||
@@ -165,7 +165,6 @@ if __name__ == '__main__':
|
||||
sites = {site.name: site for site in sites_subset}
|
||||
engines = db.engines
|
||||
|
||||
# TODO: usernames extractors
|
||||
ok_usernames = ['alex', 'god', 'admin', 'red', 'blue', 'john']
|
||||
if args.username:
|
||||
ok_usernames = [args.username] + ok_usernames
|
||||
|
||||