Mirror of https://github.com/soxoj/maigret.git, synced 2026-05-09 16:14:32 +00:00
Cloudflare bypass webgate (#2628)
@@ -268,6 +268,19 @@ maigret user --i2p-proxy http://127.0.0.1:4444
Start your Tor / I2P daemon before running the command; Maigret does not manage these gateways.

### Cloudflare bypass

> **Experimental.** The Cloudflare webgate is under active development; the configuration schema, CLI behaviour, and the set of routed sites may change without backwards-compatibility guarantees.

A subset of sites in the database requires a real browser to solve a JavaScript challenge. Maigret can offload these checks to a local [FlareSolverr](https://github.com/FlareSolverr/FlareSolverr) instance:

```bash
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
maigret --cloudflare-bypass <username>
```

The bypass is opt-in (`--cloudflare-bypass` or `cloudflare_bypass.enabled` in `settings.json`) and only fires for sites whose `protection` field contains one of the configured trigger values (`cf_js_challenge`, `cf_firewall`, or `webgate` by default). See the [feature docs](https://maigret.readthedocs.io/en/latest/features.html#cloudflare-bypass) for backend options and configuration.

## Contributing

Add new sites or fix existing ones surgically in `data.json` (no `json.load`/`json.dump`), then run `./utils/update_site_data.py` to regenerate `sites.md` and the database metadata, and open a pull request. For more details, see the [CONTRIBUTING guide](https://github.com/soxoj/maigret/blob/main/CONTRIBUTING.md) and [development docs](https://maigret.readthedocs.io/en/latest/development.html). Release history: [CHANGELOG.md](CHANGELOG.md).
@@ -95,6 +95,17 @@ the run after the explicit update finishes.
``--retries RETRIES`` - Count of attempts to restart temporarily failed
requests.

``--cloudflare-bypass`` *(experimental)* - Route checks for sites tagged
``protection: ["cf_js_challenge"]`` / ``["cf_firewall"]`` / ``["webgate"]``
through a local Chrome-based solver (FlareSolverr by default). The bypass
is opt-in: without this flag (or
``settings.cloudflare_bypass.enabled = true``) those sites are checked
the usual way, which Cloudflare almost always blocks: you get an UNKNOWN
status with a JS-challenge / firewall error rather than a real result.
Configure the backend in ``settings.cloudflare_bypass.modules``.
See :ref:`cloudflare-bypass`. **Experimental**: the flag, schema and
routing rules may change without backwards-compatibility guarantees.

.. _custom-database:

Using a custom sites database
@@ -237,6 +237,61 @@ The Maigret database contains not only the original websites, but also mirrors,
It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)

.. _cloudflare-bypass:

Cloudflare webgate bypass
-------------------------

.. warning::

   **Experimental feature.** The Cloudflare webgate is under active
   development. The configuration schema, CLI flag behaviour, and the set
   of sites that route through it may change without backwards-compatibility
   guarantees. Expect rough edges (CF rate limits, occasional solver
   failures) and report issues so they can be ironed out.

Some sites sit behind a full Cloudflare JavaScript challenge or a CF firewall
hard block. These are tagged ``protection: ["cf_js_challenge"]`` or
``protection: ["cf_firewall"]`` in the database and are normally kept disabled
because neither aiohttp nor curl_cffi can solve the JS challenge on their own.

Maigret can offload these checks to a local Chrome-based solver. Two backends
are supported, configured in ``settings.json`` under
``cloudflare_bypass.modules`` (the first reachable module wins; subsequent
ones are tried as a fallback chain):

* **FlareSolverr** (recommended). Runs a real Chrome instance and exposes a
  JSON API. The upstream HTTP status, headers and final URL are preserved, so
  ``checkType: status_code`` and ``checkType: response_url`` keep working
  through the bypass.

  .. code-block:: console

     docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest

* **CloudflareBypassForScraping** (legacy fallback). Returns rendered HTML
  only, so the upstream status code is lost: ``checkType: message`` keeps
  working but ``status_code`` checks misfire (treated as 200 on success).
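
To make the ``json_api`` backend more concrete, here is a minimal sketch of the JSON body that a FlareSolverr-style request carries. The field names (``cmd``, ``url``, ``maxTimeout``, ``session``) are the ones used by the checker code in this commit; the helper name ``build_flaresolverr_body`` and the sample URL and session values are hypothetical and exist only for illustration.

```python
from typing import Any, Dict


def build_flaresolverr_body(
    url: str, session: str, max_timeout_ms: int = 60000, method: str = "get"
) -> Dict[str, Any]:
    """Build the JSON body for one solver request (illustrative helper)."""
    # POST probes map to "request.post"; everything else is sent as a GET.
    cmd = "request.post" if method.lower() == "post" else "request.get"
    return {
        "cmd": cmd,
        "url": url,
        "maxTimeout": max_timeout_ms,  # solver budget in milliseconds
        "session": session,            # lets the solver reuse cookies
    }


# Sample values are assumptions, not taken from a real run.
body = build_flaresolverr_body(
    "https://example.com/user/alice", "maigret-1234-example.com"
)
print(body)
```

The body would then be POSTed to the module's ``url`` (``http://localhost:8191/v1`` in the default settings).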

Activate the bypass either with the CLI flag::

    maigret --cloudflare-bypass <username>

or by setting ``cloudflare_bypass.enabled`` to ``true`` in ``settings.json``.
The bypass only fires for sites whose ``protection`` field intersects
``cloudflare_bypass.trigger_protection`` (default
``["cf_js_challenge", "cf_firewall", "webgate"]``); all other sites use the
normal aiohttp / curl_cffi path.
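
The routing rule above boils down to a set-intersection test. The following sketch restates it under stated assumptions: ``needs_webgate`` and ``DEFAULT_TRIGGERS`` are illustrative names, and the function takes plain lists rather than Maigret's own site objects.

```python
from typing import Iterable, List

# Default trigger list as documented for cloudflare_bypass.trigger_protection.
DEFAULT_TRIGGERS = ["cf_js_challenge", "cf_firewall", "webgate"]


def needs_webgate(
    site_protection: Iterable[str],
    trigger_protection: List[str],
    bypass_enabled: bool,
) -> bool:
    """Return True when a site's check should go through the solver."""
    if not bypass_enabled:
        # Opt-in: without the flag or the setting, nothing is routed.
        return False
    # Route only when at least one protection tag is a configured trigger.
    return any(tag in trigger_protection for tag in site_protection)


print(needs_webgate(["cf_firewall"], DEFAULT_TRIGGERS, True))
print(needs_webgate(["tls_fingerprint"], DEFAULT_TRIGGERS, True))
```

A site tagged only ``tls_fingerprint`` is not routed, which matches the statement that all other sites stay on the normal aiohttp / curl_cffi path.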

If all configured modules are unreachable, affected sites get an UNKNOWN
status with an actionable error pointing at the first module's URL; the
fix is almost always to start the FlareSolverr container.

FlareSolverr session reuse is automatic: Maigret derives one session per
target host, named ``<session_prefix>-<pid>-<host>``, so cf_clearance cookies
are shared between checks of the same domain (5–10× faster on subsequent
requests to that host).
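
As a rough illustration of host-scoped sessions, the sketch below derives a session name from a prefix, a process ID and the target host, mirroring the ``session_id`` property added in this commit. The standalone function ``session_id_for`` and its ``pid`` parameter are assumptions made so the example can run outside Maigret.

```python
import os
import re
from typing import Optional
from urllib.parse import urlparse


def session_id_for(
    url: str, session_prefix: str = "maigret", pid: Optional[int] = None
) -> str:
    """Derive a per-host session name (illustrative, not Maigret's API)."""
    pid = os.getpid() if pid is None else pid
    # Fall back to "default" when the URL has no parseable host.
    host = urlparse(url or "").hostname or "default"
    # Keep only characters that are safe in a session identifier.
    host_safe = re.sub(r"[^a-zA-Z0-9.-]", "_", host)
    return f"{session_prefix}-{pid}-{host_safe}"


# Two probes against the same host share one session name.
first = session_id_for("https://www.example.com/u/alice", pid=4242)
second = session_id_for("https://www.example.com/u/bob", pid=4242)
print(first, first == second)
```

Because the name depends only on the prefix, the PID and the host, every check against one domain in a run lands in the same solver session.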

Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
@@ -125,3 +125,25 @@ After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.

Optional: Cloudflare bypass solver
----------------------------------

.. warning::

   **Experimental.** The Cloudflare webgate is under active development;
   the configuration schema and CLI behaviour may change without
   backwards-compatibility guarantees.

Sites tagged ``cf_js_challenge`` / ``cf_firewall`` need a real browser to pass
their JavaScript challenge. To check those sites you can run a local
`FlareSolverr <https://github.com/FlareSolverr/FlareSolverr>`_ instance;
Maigret will route protected checks to it when ``--cloudflare-bypass`` is set:

.. code-block:: bash

   docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest

This is **optional**: Maigret runs without it; only sites whose
``protection`` field intersects ``settings.cloudflare_bypass.trigger_protection``
require the solver. See :ref:`cloudflare-bypass` for details.
@@ -102,6 +102,95 @@ This is recommended for **Docker containers**, **CI pipelines**, and **air-gappe
**Using a custom database** with ``--db`` always skips auto-update: you are explicitly choosing your data source.

Cloudflare webgate
------------------

.. warning::

   **Experimental.** The ``cloudflare_bypass`` block is under active
   development; field names, defaults, and the trigger-protection routing
   rules may change without backwards-compatibility guarantees.

The ``cloudflare_bypass`` block in ``settings.json`` configures the optional
bypass described in :ref:`cloudflare-bypass`. Default value:

.. code-block:: json

   {
     "cloudflare_bypass": {
       "enabled": false,
       "session_prefix": "maigret",
       "trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
       "modules": [
         {
           "name": "flaresolverr",
           "method": "json_api",
           "url": "http://localhost:8191/v1",
           "max_timeout_ms": 60000
         },
         {
           "name": "chrome_webgate",
           "method": "url_rewrite",
           "url": "http://localhost:8000/html?url={url}&retries=1"
         }
       ]
     }
   }

**Fields.**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Field
     - Description
   * - ``enabled``
     - When ``true``, the bypass is active for every run; when ``false``
       (the default), it activates only on ``--cloudflare-bypass``.
   * - ``trigger_protection``
     - List of ``site.protection`` values that route a check through the
       webgate. Sites whose protection is empty or doesn't intersect this
       list use the default (aiohttp / curl_cffi) checker.
   * - ``session_prefix``
     - Prefix for the FlareSolverr ``session`` field. Maigret appends the
       process PID, so concurrent runs don't collide, and the target host,
       so each host gets its own session. Reusing a session
       caches cf_clearance between checks of the same domain.
   * - ``modules``
     - Ordered list of backend modules. The first reachable module
       handles the check; later ones serve as a fallback chain.

**Module methods.**

* ``json_api``: FlareSolverr-compatible POST endpoint at ``url``.
  Preserves real upstream HTTP status, headers and final URL.
  Optional ``max_timeout_ms`` (default ``60000``) is the per-request
  budget the solver is allowed to spend on the JS challenge.
* ``url_rewrite``: legacy CloudflareBypassForScraping endpoint. The
  ``url`` must contain a ``{url}`` placeholder; the original probe URL
  is URL-encoded and substituted in. Returns rendered HTML only, so
  ``checkType: status_code`` and ``response_url`` checks misfire under
  this method (treated as a synthetic HTTP 200 on success).
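
The ``{url}`` substitution used by ``url_rewrite`` can be sketched in a few lines. The template is the default from ``settings.json``; the helper name ``rewrite_probe_url`` and the sample probe URL are hypothetical.

```python
from urllib.parse import quote_plus


def rewrite_probe_url(template: str, probe_url: str) -> str:
    """Substitute the URL-encoded probe URL into a proxy URL template."""
    if "{url}" not in template:
        raise ValueError("template has no {url} placeholder")
    # quote_plus escapes ':', '/', '?' and '=' so the probe URL survives
    # as a single query-string value.
    return template.format(url=quote_plus(probe_url))


template = "http://localhost:8000/html?url={url}&retries=1"
rewritten = rewrite_probe_url(template, "https://example.com/user/alice?tab=1")
print(rewritten)
```

Encoding the whole probe URL keeps its own query string from being mistaken for parameters of the proxy endpoint.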

**Optional** ``proxy`` **field** (``json_api`` only).

A module may carry a ``proxy`` entry that the solver routes the upstream
request through. This is useful when a site enforces ``ip_reputation`` rules
that block the solver host. Two forms are accepted:

.. code-block:: json

   { "proxy": "socks5://localhost:1080" }

.. code-block:: json

   { "proxy": { "url": "http://gw.example:3128",
                "username": "u",
                "password": "p" } }

Only ``url``/``username``/``password`` are forwarded; other keys are
dropped. Cloudflare ``Error 1015 / 1020`` responses indicate the IP is
rate-limited or banned; switch the proxy rather than retrying.
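
The two accepted ``proxy`` forms can be reduced to one shape before they are forwarded. The sketch below follows the filtering rule stated above (only ``url``, ``username`` and ``password`` survive); ``normalize_proxy`` and ``ALLOWED_PROXY_KEYS`` are illustrative names rather than part of Maigret's API.

```python
from typing import Any, Dict, Optional

ALLOWED_PROXY_KEYS = ("url", "username", "password")


def normalize_proxy(proxy: Any) -> Optional[Dict[str, str]]:
    """Turn either accepted proxy form into a dict, or None if unusable."""
    if isinstance(proxy, str) and proxy:
        # Short form: a bare proxy URL.
        return {"url": proxy}
    if isinstance(proxy, dict) and proxy.get("url"):
        # Long form: keep only the keys that are forwarded to the solver.
        return {k: v for k, v in proxy.items() if k in ALLOWED_PROXY_KEYS}
    return None


short = normalize_proxy("socks5://localhost:1080")
long_form = normalize_proxy(
    {"url": "http://gw.example:3128", "username": "u", "password": "p", "extra": 1}
)
print(short, long_form)
```

A dict without a ``url`` key yields ``None`` here, so a half-configured proxy is ignored instead of being sent to the solver.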

.. _ai-analysis-settings:

AI analysis
+287
-3
@@ -2,6 +2,7 @@
import ast
import asyncio
import logging
import os
import random
import re
import ssl
@@ -48,6 +49,53 @@ SUPPORTED_IDS = (
BAD_CHARS = "#"


def build_cloudflare_bypass_config(
    settings_obj: Optional[Any], force_enable: bool = False
) -> Optional[Dict[str, Any]]:
    """Resolve Cloudflare webgate config from settings + CLI flag.

    Returns ``None`` when bypass is inactive or no usable module is configured.
    Otherwise returns a dict consumed by ``CloudflareWebgateChecker``:

    - ``trigger_protection``: list of ``site.protection`` values that
      activate the bypass (e.g. ``["cf_js_challenge", "cf_firewall", "webgate"]``)
    - ``modules``: ordered list of backend modules to try; each entry has
      ``name``, ``method`` (``json_api`` for FlareSolverr, ``url_rewrite``
      for CloudflareBypassForScraping), and a method-specific ``url`` plus
      optional ``max_timeout_ms``.
    - ``session_prefix``: prefix for FlareSolverr session reuse.
    """
    raw = {}
    if settings_obj is not None:
        raw = getattr(settings_obj, "cloudflare_bypass", {}) or {}
    enabled = bool(force_enable) or bool(raw.get("enabled", False))
    if not enabled:
        return None

    modules_raw = raw.get("modules") or []
    valid_modules: List[Dict[str, Any]] = []
    for module in modules_raw:
        method = module.get("method")
        url = module.get("url")
        if method == "json_api" and url:
            valid_modules.append(dict(module))
        elif method == "url_rewrite" and url and "{url}" in url:
            valid_modules.append(dict(module))
    if not valid_modules:
        return None

    trigger = raw.get("trigger_protection") or [
        "cf_js_challenge",
        "cf_firewall",
        "webgate",
    ]
    return {
        "trigger_protection": list(trigger),
        "modules": valid_modules,
        "session_prefix": raw.get("session_prefix", "maigret"),
    }


class CheckerBase:
    pass
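
The activation rule in `build_cloudflare_bypass_config` (CLI flag or settings switch, plus at least one usable module) can be restated as a small standalone sketch. This is an approximation for illustration, not the function itself: `usable_modules` and `bypass_active` are hypothetical names, and the inputs are plain dicts instead of a `Settings` object.

```python
from typing import Any, Dict, List


def usable_modules(modules: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only modules whose method and url satisfy the documented rules."""
    usable = []
    for module in modules:
        method, url = module.get("method"), module.get("url")
        if method == "json_api" and url:
            usable.append(dict(module))
        elif method == "url_rewrite" and url and "{url}" in url:
            # url_rewrite needs a placeholder for the probe URL.
            usable.append(dict(module))
    return usable


def bypass_active(
    cli_flag: bool, settings_enabled: bool, modules: List[Dict[str, Any]]
) -> bool:
    """The bypass is on when it is enabled AND a usable module exists."""
    return (cli_flag or settings_enabled) and bool(usable_modules(modules))


modules = [
    {"name": "flaresolverr", "method": "json_api", "url": "http://localhost:8191/v1"},
    {"name": "broken", "method": "url_rewrite", "url": "http://localhost:8000/html"},
]
print(bypass_active(True, False, modules), len(usable_modules(modules)))
```

The second module is dropped because its `url` lacks the `{url}` placeholder, which is the same condition the commit checks before keeping a `url_rewrite` entry.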
@@ -287,6 +335,221 @@ class CurlCffiChecker(CheckerBase):
            return None, 0, CheckError("Unexpected", str(e))


class CloudflareWebgateChecker(CheckerBase):
    """Sends checks through a Cloudflare-bypass proxy.

    Supports two backends, selected per module by its ``method`` in settings;
    modules are tried in order until one succeeds:

    - ``json_api`` (FlareSolverr): POST to ``/v1`` with ``cmd: request.get``
      (or ``request.post``). Preserves real upstream status_code, headers
      and final URL; a drop-in replacement for SimpleAiohttpChecker.
    - ``url_rewrite`` (CloudflareBypassForScraping ``/html`` endpoint):
      legacy mode. Returns rendered HTML only. Real upstream status is
      lost (proxy answers 200 on success). status_code / response_url
      check types degrade to "200 if HTML returned, AVAILABLE otherwise".
    """

    SESSION_PREFIX_DEFAULT = "maigret"

    def __init__(self, *args, **kwargs):
        self.logger = kwargs.get('logger', Mock())
        config = kwargs.get('config') or {}
        self._modules: List[Dict[str, Any]] = []
        for raw in config.get('modules') or []:
            module = dict(raw)
            module.setdefault('method', 'json_api')
            module.setdefault('name', module.get('method'))
            self._modules.append(module)
        if not self._modules:
            raise ValueError("CloudflareWebgateChecker requires at least one module")
        # Session ID is computed per-request from the target host. Sharing a
        # single session across hosts caused FlareSolverr to break in
        # practice (TLS state / cookies leaking between domains), so each
        # host gets its own Chrome instance.
        self._session_prefix = (
            f"{config.get('session_prefix', self.SESSION_PREFIX_DEFAULT)}-{os.getpid()}"
        )
        self.url = None
        self.headers = None
        self.allow_redirects = True
        self.timeout = 0
        self.method = 'get'
        self.payload = None

    @property
    def session_id(self) -> str:
        """FlareSolverr session ID, scoped per target host."""
        from urllib.parse import urlparse

        host = urlparse(self.url or "").hostname or "default"
        host_safe = re.sub(r"[^a-zA-Z0-9.-]", "_", host)
        return f"{self._session_prefix}-{host_safe}"

    def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
        self.url = url
        self.headers = headers or {}
        self.allow_redirects = allow_redirects
        self.timeout = timeout
        self.method = method
        self.payload = payload
        return None

    async def close(self):
        pass

    async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
        attempts: List[str] = []
        last_error: Optional[CheckError] = None
        for module in self._modules:
            method = module.get('method')
            module_name = module.get('name', method or '?')
            if method == 'json_api':
                result = await self._check_flaresolverr(module)
            elif method == 'url_rewrite':
                result = await self._check_url_rewrite(module)
            else:
                self.logger.warning(
                    f"Webgate module '{module_name}' has unknown method "
                    f"'{method}', skipping"
                )
                attempts.append(f"{module_name}:unknown-method")
                continue
            body, status, err = result
            if err is None:
                return result
            last_error = err
            attempts.append(f"{module_name}:{err.type}")
            self.logger.info(
                f"Webgate module '{module_name}' failed for {self.url}: "
                f"{err.type}: {err.desc}. Trying next module if any."
            )
        # All modules failed. Give the user a single, actionable error with
        # the first module's URL; that's almost always FlareSolverr, and
        # the most common failure is "user forgot to start the container".
        primary = self._modules[0]
        primary_url = primary.get('url', '?')
        primary_method = primary.get('method', '?')
        # Plain string: the hint has no placeholders, so no f-prefix is needed.
        hint = (
            "docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest"
            if primary_method == 'json_api'
            else "start the local proxy container"
        )
        last_desc = last_error.desc if last_error else "unknown"
        return None, 0, CheckError(
            "Webgate unavailable",
            f"all {len(self._modules)} module(s) failed [{', '.join(attempts)}]. "
            f"Last error: {last_desc}. "
            f"Is the solver running at {primary_url}? (hint: {hint})",
        )

    async def _check_flaresolverr(
        self, module: Dict[str, Any]
    ) -> Tuple[Optional[str], int, Optional[CheckError]]:
        endpoint = module.get('url') or 'http://localhost:8191/v1'
        max_timeout_ms = int(module.get('max_timeout_ms', 60000))
        post_method = self.method.lower() == 'post'
        cmd = "request.post" if post_method else "request.get"

        body: Dict[str, Any] = {
            "cmd": cmd,
            "url": self.url,
            "maxTimeout": max_timeout_ms,
            "session": self.session_id,
        }

        proxy = module.get('proxy')
        if isinstance(proxy, str) and proxy:
            body["proxy"] = {"url": proxy}
        elif isinstance(proxy, dict) and proxy.get("url"):
            body["proxy"] = {k: v for k, v in proxy.items() if k in ("url", "username", "password")}

        if post_method and self.payload is not None:
            # FlareSolverr expects postData as urlencoded string for form data,
            # but if site.request_payload is JSON we still send it.
            body["postData"] = (
                "&".join(f"{k}={quote(str(v))}" for k, v in self.payload.items())
            )

        timeout = max(int(self.timeout) if self.timeout else 30, max_timeout_ms / 1000 + 5)

        try:
            async with ClientSession() as session:
                async with session.post(
                    endpoint, json=body, timeout=timeout
                ) as resp:
                    if resp.status >= 500:
                        return None, 0, CheckError(
                            "Webgate", f"FlareSolverr {resp.status}"
                        )
                    data = await resp.json()
        except (ClientConnectorError, ServerDisconnectedError) as e:
            return None, 0, CheckError("Webgate unreachable", str(e))
        except asyncio.TimeoutError:
            return None, 0, CheckError("Webgate timeout", endpoint)
        except Exception as e:
            self.logger.debug(e, exc_info=True)
            return None, 0, CheckError("Webgate", str(e))

        if data.get("status") != "ok":
            return None, 0, CheckError("Webgate", data.get("message", "unknown"))

        solution = data.get("solution") or {}
        upstream_status = int(solution.get("status") or 0)
        response_text = solution.get("response") or ""

        # Diagnostic: warn if FlareSolverr returned the CF challenge page
        # itself (challenge not fully solved) rather than the real content.
        # When this happens with sites that have weak presenseStrs/absenceStrs,
        # maigret's default-true presence rule produces false CLAIMED.
        cf_markers = ("Just a moment", "_cf_chl_opt", "cf-mitigated", "challenges.cloudflare.com")
        if response_text and any(m in response_text for m in cf_markers):
            self.logger.warning(
                f"Webgate response from {self.url} still contains CF challenge "
                f"markers (status={upstream_status}, body={len(response_text)}b). "
                f"FlareSolverr likely did not solve the challenge — site checks "
                f"with weak markers may produce false CLAIMED."
            )

        self.logger.info(
            f"Webgate response: url={self.url} status={upstream_status} "
            f"body_len={len(response_text)}"
        )
        return response_text, upstream_status, None

    async def _check_url_rewrite(
        self, module: Dict[str, Any]
    ) -> Tuple[Optional[str], int, Optional[CheckError]]:
        url_template = module.get('url') or ''
        if "{url}" not in url_template:
            return None, 0, CheckError(
                "Webgate", f"module '{module.get('name')}' url has no {{url}} placeholder"
            )
        from urllib.parse import quote_plus

        proxy_url = url_template.format(url=quote_plus(self.url))
        timeout = self.timeout if self.timeout else 30
        try:
            async with ClientSession() as session:
                async with session.get(proxy_url, timeout=timeout) as resp:
                    if resp.status >= 500:
                        return None, 0, CheckError(
                            "Webgate", f"url_rewrite proxy {resp.status}"
                        )
                    body = await resp.text()
        except (ClientConnectorError, ServerDisconnectedError) as e:
            return None, 0, CheckError("Webgate unreachable", str(e))
        except asyncio.TimeoutError:
            return None, 0, CheckError("Webgate timeout", proxy_url)
        except Exception as e:
            self.logger.debug(e, exc_info=True)
            return None, 0, CheckError("Webgate", str(e))

        # url_rewrite mode CANNOT recover the upstream HTTP status.
        # We assume 200 when HTML is returned; status_code/response_url
        # check types will misfire (see docs).
        return body, 200, None


class CheckerMock:
    def __init__(self, *args, **kwargs):
        pass
@@ -547,9 +810,24 @@ def make_site_result(
     # workaround to prevent slash errors
     url = re.sub("(?<!:)/+", "/", url)
 
-    # Select checker: use curl_cffi for sites requiring TLS impersonation
+    # Select checker. Order of precedence:
+    #   1. Cloudflare webgate (FlareSolverr / CloudflareBypassForScraping) when
+    #      bypass is active and site.protection requests it.
+    #   2. curl_cffi for sites requiring TLS impersonation.
+    #   3. Default protocol-specific checker (aiohttp).
+    cf_bypass = options.get("cloudflare_bypass")
+    needs_webgate = bool(cf_bypass) and any(
+        p in cf_bypass["trigger_protection"] for p in site.protection
+    )
     needs_impersonation = 'tls_fingerprint' in site.protection
-    if needs_impersonation and CURL_CFFI_AVAILABLE:
+
+    if needs_webgate:
+        checker = CloudflareWebgateChecker(logger=logger, config=cf_bypass)
+        logger.info(
+            f"Using Cloudflare webgate for {site.name} "
+            f"(protection: {list(site.protection)})"
+        )
+    elif needs_impersonation and CURL_CFFI_AVAILABLE:
         checker = CurlCffiChecker(logger=logger, browser_emulate='chrome')
     elif needs_impersonation and not CURL_CFFI_AVAILABLE:
         logger.warning(
@@ -761,6 +1039,7 @@ async def maigret(
    cookies=None,
    retries=0,
    check_domains=False,
    cloudflare_bypass: Optional[Dict[str, Any]] = None,
    *args,
    **kwargs,
) -> QueryResultWrapper:
@@ -859,6 +1138,7 @@ async def maigret(
    options["timeout"] = timeout
    options["id_type"] = id_type
    options["forced"] = forced
    options["cloudflare_bypass"] = cloudflare_bypass

    # results from analysis of all sites
    all_results: Dict[str, QueryResultWrapper] = {}
@@ -962,6 +1242,7 @@ async def site_self_check(
    cookies=None,
    auto_disable=False,
    diagnose=False,
    cloudflare_bypass: Optional[Dict[str, Any]] = None,
):
    """
    Self-check a site configuration.
@@ -1002,6 +1283,7 @@ async def site_self_check(
        tor_proxy=tor_proxy,
        i2p_proxy=i2p_proxy,
        cookies=cookies,
        cloudflare_bypass=cloudflare_bypass,
    )

    # don't disable entries with other ids types
@@ -1130,6 +1412,7 @@ async def self_check(
    auto_disable=False,
    diagnose=False,
    no_progressbar=False,
    cloudflare_bypass: Optional[Dict[str, Any]] = None,
) -> dict:
    """
    Run self-check on sites.
@@ -1158,7 +1441,8 @@ async def self_check(
     for _, site in all_sites.items():
         check_coro = site_self_check(
             site, logger, sem, db, silent, proxy, tor_proxy, i2p_proxy,
-            skip_errors=True, auto_disable=auto_disable, diagnose=diagnose
+            skip_errors=True, auto_disable=auto_disable, diagnose=diagnose,
+            cloudflare_bypass=cloudflare_bypass,
         )
         future = asyncio.ensure_future(check_coro)
         tasks.append((site.name, future))
@@ -34,6 +34,7 @@ from .checking import (
    self_check,
    BAD_CHARS,
    maigret,
    build_cloudflare_bypass_config,
)
from . import errors
from .notify import QueryNotifyPrint
@@ -281,6 +282,13 @@ def setup_arguments_parser(settings: Settings):
        default=settings.domain_search,
        help="Enable (experimental) feature of checking domains on usernames.",
    )
    parser.add_argument(
        "--cloudflare-bypass",
        action="store_true",
        default=False,
        help="Enable Cloudflare webgate bypass for sites with protection cf_js_challenge / cf_firewall / webgate. "
        "Requires a local solver instance, FlareSolverr by default (see settings.json -> cloudflare_bypass.modules[0].url).",
    )

    filter_group = parser.add_argument_group(
        'Site filtering', 'Options to set site search scope'
@@ -552,6 +560,20 @@ async def main():
    arg_parser = setup_arguments_parser(settings)
    args = arg_parser.parse_args()

    # Resolve Cloudflare webgate config (CLI flag OR settings.cloudflare_bypass.enabled)
    cf_bypass_config = build_cloudflare_bypass_config(
        settings, force_enable=args.cloudflare_bypass
    )
    if cf_bypass_config:
        modules_summary = ", ".join(
            f"{m.get('name', m.get('method'))}({m.get('url')})"
            for m in cf_bypass_config["modules"]
        )
        logger.info(
            f"Cloudflare webgate active: triggers={cf_bypass_config['trigger_protection']}, "
            f"modules=[{modules_summary}]"
        )

    # Re-set logging level based on args
    if args.debug:
        log_level = logging.DEBUG
@@ -682,6 +704,7 @@ async def main():
            auto_disable=args.auto_disable,
            diagnose=args.diagnose,
            no_progressbar=args.no_progressbar,
            cloudflare_bypass=cf_bypass_config,
        )

        is_need_update = check_result.get('needs_update', False)
@@ -816,6 +839,7 @@ async def main():
            no_progressbar=args.no_progressbar,
            retries=args.retries,
            check_domains=args.with_domains,
            cloudflare_bypass=cf_bypass_config,
        )

        if not args.ai:
+352
-118
File diff suppressed because it is too large
@@ -61,5 +61,25 @@
     "web_interface_port": 5000,
     "no_autoupdate": false,
     "db_update_meta_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/db_meta.json",
-    "autoupdate_check_interval_hours": 24
+    "autoupdate_check_interval_hours": 24,
+    "cloudflare_bypass": {
+        "enabled": false,
+        "session_prefix": "maigret",
+        "trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
+        "modules": [
+            {
+                "name": "flaresolverr",
+                "method": "json_api",
+                "url": "http://localhost:8191/v1",
+                "max_timeout_ms": 60000,
+                "comment": "FlareSolverr (https://github.com/FlareSolverr/FlareSolverr). docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest"
+            },
+            {
+                "name": "chrome_webgate",
+                "method": "url_rewrite",
+                "url": "http://localhost:8000/html?url={url}&retries=1",
+                "comment": "CloudflareBypassForScraping fallback. WARNING: returns rendered HTML only — checkType: status_code and response_url misfire."
+            }
+        ]
+    }
 }
@@ -47,6 +47,7 @@ class Settings:
    no_autoupdate: bool
    db_update_meta_url: str
    autoupdate_check_interval_hours: int
    cloudflare_bypass: dict

    # submit mode settings
    presence_strings: list
@@ -113,6 +113,7 @@ class Submitter:
            cookies=self.args.cookie_file,
            # Don't skip errors in submit mode - we need to check both false positives/true negatives
            skip_errors=False,
            cloudflare_bypass=getattr(self, 'cloudflare_bypass', None),
        )
        return changes
@@ -100,7 +100,7 @@ Rank data fetched from Majestic Million by domains.
|
||||
1.  [OP.GG LoL Vietnam (https://www.op.gg/)](https://www.op.gg/)*: top 500, gaming, vn*
|
||||
1.  [OP.GG LoL Thailand (https://www.op.gg/)](https://www.op.gg/)*: top 500, gaming, th*
|
||||
1.  [Xing (https://www.xing.com/)](https://www.xing.com/)*: top 500, de, eu*
|
||||
1.  [Patreon (https://www.patreon.com/)](https://www.patreon.com/)*: top 500, finance*, search is disabled
|
||||
1.  [Patreon (https://www.patreon.com/)](https://www.patreon.com/)*: top 500, finance*
|
||||
1.  [DeviantART (https://deviantart.com)](https://deviantart.com)*: top 500, art, photo*
|
||||
1.  [Gofundme (https://www.gofundme.com)](https://www.gofundme.com)*: top 500, finance*
|
||||
1.  [Zhihu (https://www.zhihu.com/)](https://www.zhihu.com/)*: top 500, cn*, search is disabled
|
||||
@@ -170,7 +170,7 @@ Rank data fetched from Majestic Million by domains.
|
||||
1.  [LiveInternet (https://www.liveinternet.ru)](https://www.liveinternet.ru)*: top 5K, ru*
|
||||
1.  [BuyMeACoffee (https://www.buymeacoffee.com/)](https://www.buymeacoffee.com/)*: top 5K, freelance*
|
||||
1.  [Gitea (https://gitea.com/)](https://gitea.com/)*: top 5K, coding*
|
||||
1.  [Genius (https://genius.com/)](https://genius.com/)*: top 5K, music*, search is disabled
|
||||
1.  [Genius (https://genius.com/)](https://genius.com/)*: top 5K, music*
|
||||
1.  [Techrepublic (https://www.techrepublic.com)](https://www.techrepublic.com)*: top 5K, news, tech*
|
||||
1.  [HubPages (https://hubpages.com/)](https://hubpages.com/)*: top 5K, blog*
|
||||
1.  [Artstation (https://www.artstation.com)](https://www.artstation.com)*: top 5K, art, stock*
|
||||
@@ -182,7 +182,7 @@ Rank data fetched from Majestic Million by domains.
|
||||
1.  [AllTrails (https://www.alltrails.com/)](https://www.alltrails.com/)*: top 5K, sport, travel*, search is disabled
|
||||
1.  [Habr (https://habr.com/)](https://habr.com/)*: top 5K, blog, discussion, ru*
|
||||
1.  [AllRecipes (https://www.allrecipes.com/)](https://www.allrecipes.com/)*: top 5K, hobby*
|
||||
1.  [Redbubble (https://www.redbubble.com/)](https://www.redbubble.com/)*: top 5K, shopping*, search is disabled
|
||||
1.  [Redbubble (https://www.redbubble.com/)](https://www.redbubble.com/)*: top 5K, shopping*
|
||||
1.  [Diigo (https://www.diigo.com/)](https://www.diigo.com/)*: top 5K, bookmarks*
|
||||
1.  [Windy (https://windy.com/)](https://windy.com/)*: top 5K, maps*
|
||||
1.  [Codecanyon (https://codecanyon.net)](https://codecanyon.net)*: top 5K, coding, shopping*
|
||||
@@ -270,7 +270,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Hackaday (https://hackaday.io/)](https://hackaday.io/)*: top 5K, hobby, tech*
1.  [AnimeNewsNetwork (https://www.animenewsnetwork.com)](https://www.animenewsnetwork.com)*: top 5K, anime, news*
1.  [LibraryThing (https://www.librarything.com/)](https://www.librarything.com/)*: top 5K, books*
1.  [Fodors (https://www.fodors.com)](https://www.fodors.com)*: top 5K, travel*, search is disabled
1.  [Fodors (https://www.fodors.com)](https://www.fodors.com)*: top 5K, travel*
1.  [Designs99 (https://99designs.com)](https://99designs.com)*: top 5K, design, photo*
1.  [Periscope (https://www.pscp.tv)](https://www.pscp.tv)*: top 5K, streaming, video*
1.  [Freesound (https://freesound.org/)](https://freesound.org/)*: top 5K, music*
@@ -415,13 +415,13 @@ Rank data fetched from Majestic Million by domains.
1.  [TheStudentRoom (https://www.thestudentroom.co.uk)](https://www.thestudentroom.co.uk)*: top 100K, forum, gb*, search is disabled
1.  [Codementor (https://www.codementor.io/)](https://www.codementor.io/)*: top 100K, coding*
1.  [N4g (https://n4g.com/)](https://n4g.com/)*: top 100K, gaming, news*
1.  [Lomography (https://www.lomography.com)](https://www.lomography.com)*: top 100K, photo*, search is disabled
1.  [Lomography (https://www.lomography.com)](https://www.lomography.com)*: top 100K, photo*
1.  [pixelfed.social (https://pixelfed.social/)](https://pixelfed.social/)*: top 100K, art, photo*
1.  [Hackerearth (https://www.hackerearth.com)](https://www.hackerearth.com)*: top 100K, freelance*, search is disabled
1.  [Weedmaps (https://weedmaps.com)](https://weedmaps.com)*: top 100K, us*
1.  [Redtube (https://www.redtube.com/)](https://www.redtube.com/)*: top 100K, porn*
1.  [Neoseeker (https://www.neoseeker.com)](https://www.neoseeker.com)*: top 100K, forum, gaming*
1.  [Liberapay (https://liberapay.com)](https://liberapay.com)*: top 100K, finance*, search is disabled
1.  [Liberapay (https://liberapay.com)](https://liberapay.com)*: top 100K, finance*
1.  [Sythe (https://www.sythe.org)](https://www.sythe.org)*: top 100K, forum*
1.  [FilmWeb (https://www.filmweb.pl/user/adam)](https://www.filmweb.pl/user/adam)*: top 100K, movies, pl*
1.  [Listal (https://listal.com/)](https://listal.com/)*: top 100K, movies, music*
@@ -430,7 +430,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Spatial (https://www.spatial.io)](https://www.spatial.io)*: top 100K, crypto, gaming*
1.  [NN.RU (https://www.nn.ru/)](https://www.nn.ru/)*: top 100K, ru*
1.  [Paragraph (https://paragraph.com)](https://paragraph.com)*: top 100K, blog, crypto*
1.  [Huntingnet (https://www.huntingnet.com)](https://www.huntingnet.com)*: top 100K, us*, search is disabled
1.  [Huntingnet (https://www.huntingnet.com)](https://www.huntingnet.com)*: top 100K, us*
1.  [telescope.ac (https://telescope.ac)](https://telescope.ac)*: top 100K, blog*, search is disabled
1.  [chaos.social (https://chaos.social/)](https://chaos.social/)*: top 100K, social*, search is disabled
1.  [mastodon.social (https://chaos.social/)](https://chaos.social/)*: top 100K, social*
@@ -522,7 +522,7 @@ Rank data fetched from Majestic Million by domains.
1.  [mastodon.cloud (https://mastodon.cloud/)](https://mastodon.cloud/)*: top 100K, pk*
1.  [1x (https://1x.com)](https://1x.com)*: top 100K, photo*
1.  [PatientsLikeMe (https://www.patientslikeme.com)](https://www.patientslikeme.com)*: top 100K, medicine, us*
1.  [Picuki (https://www.picuki.com/)](https://www.picuki.com/)*: top 100K, photo*, search is disabled
1.  [Picuki (https://www.tikvib.com/)](https://www.tikvib.com/)*: top 100K, video*
1.  [Pokecommunity (https://www.pokecommunity.com)](https://www.pokecommunity.com)*: top 100K, forum, gaming*
1.  [Eintracht (https://eintracht.de)](https://eintracht.de)*: top 100K, tr*
1.  [Datpiff (https://www.datpiff.com)](https://www.datpiff.com)*: top 100K, us*
@@ -623,14 +623,14 @@ Rank data fetched from Majestic Million by domains.
1.  [Mywed (https://mywed.com/ru)](https://mywed.com/ru)*: top 100K, ru*
1.  [Golbis (https://golbis.com)](https://golbis.com)*: top 100K, ru*
1.  [Soop (https://www.sooplive.co.kr/)](https://www.sooplive.co.kr/)*: top 100K, kr*
1.  [Freelancehunt (https://freelancehunt.com)](https://freelancehunt.com)*: top 100K, freelance, ru, ua*, search is disabled
1.  [Freelancehunt (https://freelancehunt.com)](https://freelancehunt.com)*: top 100K, freelance, ru, ua*
1.  [Atcoder (https://atcoder.jp/)](https://atcoder.jp/)*: top 100K, coding, jp*
1.  [Livejasmin (https://www.livejasmin.com/)](https://www.livejasmin.com/)*: top 100K, us, webcam*
1.  [Wanelo (https://wanelo.com/)](https://wanelo.com/)*: top 100K, shopping*, search is disabled
1.  [Motherless (https://motherless.com/)](https://motherless.com/)*: top 100K, porn*
1.  [Fanlore (http://fanlore.org)](http://fanlore.org)*: top 100K, us*, search is disabled
1.  [Fanlore (http://fanlore.org)](http://fanlore.org)*: top 100K, us*
1.  [Jetpunk (https://www.jetpunk.com)](https://www.jetpunk.com)*: top 100K, gaming*
1.  [Icobench (https://icobench.com)](https://icobench.com)*: top 100K, kr, ru*, search is disabled
1.  [Icobench (https://icobench.com)](https://icobench.com)*: top 100K, kr, ru*
1.  [Rappad (https://www.rappad.co)](https://www.rappad.co)*: top 100K, music*
1.  [Maxpark (https://maxpark.com)](https://maxpark.com)*: top 100K, news, ru*, search is disabled
1.  [savingadvice.com (https://savingadvice.com)](https://savingadvice.com)*: top 100K, finance*
@@ -671,7 +671,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Rmmedia (https://rmmedia.ru)](https://rmmedia.ru)*: top 100K, forum, ru*
1.  [Trashbox.ru (https://trashbox.ru/)](https://trashbox.ru/)*: top 100K, az, ru*
1.  [Ddo (https://www.ddo.com)](https://www.ddo.com)*: top 100K, forum*, search is disabled
1.  [Hometheaterforum (https://www.hometheaterforum.com)](https://www.hometheaterforum.com)*: top 100K, forum, us*, search is disabled
1.  [Hometheaterforum (https://www.hometheaterforum.com)](https://www.hometheaterforum.com)*: top 100K, forum, us*
1.  [VLR (https://www.vlr.gg)](https://www.vlr.gg)*: top 100K, gaming*
1.  [HackingWithSwift (https://www.hackingwithswift.com)](https://www.hackingwithswift.com)*: top 100K, coding*
1.  [Partyflock (https://partyflock.nl)](https://partyflock.nl)*: top 100K, nl*
@@ -682,7 +682,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Medikforum (https://www.medikforum.ru)](https://www.medikforum.ru)*: top 100K, de, forum, nl, ru, ua*, search is disabled
1.  [mynickname.com (https://mynickname.com)](https://mynickname.com)*: top 100K, social*
1.  [appleinsider.ru (https://appleinsider.ru)](https://appleinsider.ru)*: top 100K, news, ru, tech*
1.  [ImgInn (https://imginn.com)](https://imginn.com)*: top 100K, photo*, search is disabled
1.  [ImgInn (https://imginn.com)](https://imginn.com)*: top 100K, photo*
1.  [RPGGeek (https://rpggeek.com)](https://rpggeek.com)*: top 100K, gaming*, search is disabled
1.  [Suomi24 (https://www.suomi24.fi)](https://www.suomi24.fi)*: top 100K, fi, jp*
1.  [Ethereum-magicians (https://ethereum-magicians.org)](https://ethereum-magicians.org)*: top 100K, cr, forum*
@@ -763,7 +763,7 @@ Rank data fetched from Majestic Million by domains.
1.  [FreelanceJob (https://www.freelancejob.ru)](https://www.freelancejob.ru)*: top 10M, ru*, search is disabled
1.  [Football (https://www.rusfootball.info/)](https://www.rusfootball.info/)*: top 10M, ru*
1.  [Beerintheevening (http://www.beerintheevening.com)](http://www.beerintheevening.com)*: top 10M, gb*
1.  [FortniteTracker (https://fortnitetracker.com/challenges)](https://fortnitetracker.com/challenges)*: top 10M, gaming*, search is disabled
1.  [FortniteTracker (https://fortnitetracker.com/challenges)](https://fortnitetracker.com/challenges)*: top 10M, gaming*
1.  [Heavy R (https://www.heavy-r.com/)](https://www.heavy-r.com/)*: top 10M, porn*
1.  [Coolminiornot (http://www.coolminiornot.com)](http://www.coolminiornot.com)*: top 10M, forum, sg*, search is disabled
1.  [1001tracklists (https://www.1001tracklists.com)](https://www.1001tracklists.com)*: top 10M, music*
@@ -779,7 +779,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Professionali (https://professionali.ru)](https://professionali.ru)*: top 10M, ru*
1.  [Listography (https://listography.com/adam)](https://listography.com/adam)*: top 10M, sharing*
1.  [The AnswerBank (https://www.theanswerbank.co.uk)](https://www.theanswerbank.co.uk)*: top 10M, gb, q&a*, search is disabled
1.  [Bdoutdoors (https://www.bdoutdoors.com)](https://www.bdoutdoors.com)*: top 10M, us*, search is disabled
1.  [Bdoutdoors (https://www.bdoutdoors.com)](https://www.bdoutdoors.com)*: top 10M, us*
1.  [millerovo161.ru (http://millerovo161.ru)](http://millerovo161.ru)*: top 10M, forum, ru*
1.  [Shikimori (https://shikimori.one)](https://shikimori.one)*: top 10M, ru*
1.  [KharkovForum (https://www.kharkovforum.com/)](https://www.kharkovforum.com/)*: top 10M, forum, ua*, search is disabled
@@ -796,7 +796,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Fluther (https://www.fluther.com/)](https://www.fluther.com/)*: top 10M, q&a*
1.  [Sbazar.cz (https://www.sbazar.cz/)](https://www.sbazar.cz/)*: top 10M, cz, shopping*
1.  [vintage-mustang.com (https://vintage-mustang.com)](https://vintage-mustang.com)*: top 10M, forum, us*
1.  [forum.hr (http://www.forum.hr)](http://www.forum.hr)*: top 10M, forum, hr*, search is disabled
1.  [forum.hr (https://www.forum.hr)](https://www.forum.hr)*: top 10M, forum, hr*
1.  [school2dobrinka.ru (http://school2dobrinka.ru)](http://school2dobrinka.ru)*: top 10M, education, ru*
1.  [Kosmetista (https://kosmetista.ru)](https://kosmetista.ru)*: top 10M, ru*
1.  [Pbnation (https://www.pbnation.com/)](https://www.pbnation.com/)*: top 10M, ca*, search is disabled
@@ -880,7 +880,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Proglib (https://proglib.io)](https://proglib.io)*: top 10M, ru*
1.  [nightbot (https://nightbot.tv/)](https://nightbot.tv/)*: top 10M, jp*
1.  [Hunttalk (https://www.hunttalk.com)](https://www.hunttalk.com)*: top 10M, forum, us*, search is disabled
1.  [DMOJ (https://dmoj.ca/)](https://dmoj.ca/)*: top 10M, ca, coding*, search is disabled
1.  [DMOJ (https://dmoj.ca/)](https://dmoj.ca/)*: top 10M, ca, coding*
1.  [Truesteamachievements (https://truesteamachievements.com)](https://truesteamachievements.com)*: top 10M, az, gb*
1.  [TheFastlaneForum (https://www.thefastlaneforum.com)](https://www.thefastlaneforum.com)*: top 10M, forum, us*, search is disabled
1.  [lada-vesta.net (http://www.lada-vesta.net)](http://www.lada-vesta.net)*: top 10M, auto, forum, ru*
@@ -944,7 +944,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Gps-data-team (https://www.gps-data-team.com)](https://www.gps-data-team.com)*: top 10M, maps*, search is disabled
1.  [Soberu (https://yasobe.ru)](https://yasobe.ru)*: top 10M, ru*, search is disabled
1.  [Imood (https://www.imood.com/)](https://www.imood.com/)*: top 10M, blog*
1.  [Elakiri (https://elakiri.com)](https://elakiri.com)*: top 10M, lk*, search is disabled
1.  [Elakiri (https://elakiri.com)](https://elakiri.com)*: top 10M, lk*
1.  [Countable (https://www.countable.us/)](https://www.countable.us/)*: top 10M, us*, search is disabled
1.  [shipmodeling.ru (https://www.shipmodeling.ru/phpbb)](https://www.shipmodeling.ru/phpbb)*: top 10M, forum, ru*
1.  [Armtorg (https://armtorg.ru/)](https://armtorg.ru/)*: top 10M, forum, ru*
@@ -979,7 +979,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Mdshooters (https://www.mdshooters.com)](https://www.mdshooters.com)*: top 10M, forum, us*, search is disabled
1.  [Prodaman (https://prodaman.ru)](https://prodaman.ru)*: top 10M, ru*
1.  [mikrob.ru (https://mikrob.ru)](https://mikrob.ru)*: top 10M, forum, ru*
1.  [Gardrops (https://www.gardrops.com)](https://www.gardrops.com)*: top 10M, shopping, tr*, search is disabled
1.  [Gardrops (https://www.gardrops.com)](https://www.gardrops.com)*: top 10M, shopping, tr*
1.  [Zagony (https://zagony.ru)](https://zagony.ru)*: top 10M, ru*, search is disabled
1.  [Pogovorim (https://pogovorim.by)](https://pogovorim.by)*: top 10M, by, ru*, search is disabled
1.  [sniperforums.com (https://sniperforums.com)](https://sniperforums.com)*: top 10M, forum*
@@ -1754,7 +1754,7 @@ Rank data fetched from Majestic Million by domains.
1.  [social.tchncs.de (https://social.tchncs.de/)](https://social.tchncs.de/)*: top 100M, de*
1.  [alliedmods (https://forums.alliedmods.net/)](https://forums.alliedmods.net/)*: top 100M, forum, gb, jp, tr, uz*, search is disabled
1.  [GameRevolution (https://forums.gamerevolution.com)](https://forums.gamerevolution.com)*: top 100M, forum, gaming*
1.  [Pathofexile (https://ru.pathofexile.com)](https://ru.pathofexile.com)*: top 100M, ru*
1.  [Pathofexile (https://ru.pathofexile.com)](https://ru.pathofexile.com)*: top 100M, ru*, search is disabled
1.  [boards.theforce.net (https://boards.theforce.net)](https://boards.theforce.net)*: top 100M*, search is disabled
1.  [Justlanded (https://community.justlanded.com)](https://community.justlanded.com)*: top 100M*
1.  [igromania (http://forum.igromania.ru/)](http://forum.igromania.ru/)*: top 100M, forum, gaming, ru*
@@ -1826,7 +1826,7 @@ Rank data fetched from Majestic Million by domains.
1.  [forum.ubuntu-it.org (https://forum.ubuntu-it.org)](https://forum.ubuntu-it.org)*: top 100M, ch, forum, it*
1.  [forum.endeavouros.com (https://forum.endeavouros.com)](https://forum.endeavouros.com)*: top 100M, forum*
1.  [forum.newlcn.com (http://forum.newlcn.com)](http://forum.newlcn.com)*: top 100M, forum*
1.  [discussion.squadhelp.com (https://discussion.squadhelp.com)](https://discussion.squadhelp.com)*: top 100M, forum*
1.  [discussion.squadhelp.com (https://discussion.squadhelp.com)](https://discussion.squadhelp.com)*: top 100M, forum*, search is disabled
1.  [discuss.flarum.org (https://discuss.flarum.org)](https://discuss.flarum.org)*: top 100M*
1.  [mirf (https://forum.mirf.ru/)](https://forum.mirf.ru/)*: top 100M, forum, ru*, search is disabled
1.  [kpyto.pp.net.ua (http://kpyto.pp.net.ua)](http://kpyto.pp.net.ua)*: top 100M, ua*
@@ -1885,7 +1885,7 @@ Rank data fetched from Majestic Million by domains.
1.  [forum.gong.bg (https://forum.gong.bg)](https://forum.gong.bg)*: top 100M, bg, forum*
1.  [Velomania (https://forum.velomania.ru/)](https://forum.velomania.ru/)*: top 100M, forum, ru*, search is disabled
1.  [bbs.evony.com (http://bbs.evony.com)](http://bbs.evony.com)*: top 100M, forum, pk, tr*, search is disabled
1.  [forum.vectric.com (https://forum.vectric.com)](https://forum.vectric.com)*: top 100M, forum*
1.  [forum.vectric.com (https://forum.vectric.com)](https://forum.vectric.com)*: top 100M, forum*, search is disabled
1.  [Bratsk Forum (http://forum.bratsk.org)](http://forum.bratsk.org)*: top 100M, forum, ru*
1.  [Runnersworld (https://forums.runnersworld.co.uk/)](https://forums.runnersworld.co.uk/)*: top 100M, forum, sport*, search is disabled
1.  [Qwas (http://forum.qwas.ru)](http://forum.qwas.ru)*: top 100M, forum, ru*
@@ -2141,7 +2141,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Fireworktv (https://fireworktv.com)](https://fireworktv.com)*: top 100M, jp*, search is disabled
1.  [Flbord (https://flbord.com)](https://flbord.com)*: top 100M, ru, ua*, search is disabled
1.  [Fm-forum (https://fm-forum.ru)](https://fm-forum.ru)*: top 100M, forum, ru*, search is disabled
1.  [Forum.glow-dm.ru (http://forum.glow-dm.ru)](http://forum.glow-dm.ru)*: top 100M, forum, ru*
1.  [Forum.glow-dm.ru (http://forum.glow-dm.ru)](http://forum.glow-dm.ru)*: top 100M, forum, ru*, search is disabled
1.  [Forum.jambox.ru (https://forum.jambox.ru)](https://forum.jambox.ru)*: top 100M, forum, ru*
1.  [Forum.quake2.com.ru (http://forum.quake2.com.ru/)](http://forum.quake2.com.ru/)*: top 100M, forum, ru*, search is disabled
1.  [Forum29 (http://forum29.net)](http://forum29.net)*: top 100M, forum, ru*, search is disabled
@@ -2199,7 +2199,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Invalidnost (https://www.invalidnost.com)](https://www.invalidnost.com)*: top 100M, ru*
1.  [IonicFramework ()]()*: top 100M*
1.  [Ispdn (http://ispdn.ru)](http://ispdn.ru)*: top 100M, ru*
1.  [Itforums (https://itforums.ru)](https://itforums.ru)*: top 100M, forum, ru*
1.  [Itforums (https://itforums.ru)](https://itforums.ru)*: top 100M, forum, ru*, search is disabled
1.  [Itfy (https://itfy.org)](https://itfy.org)*: top 100M, ru*
1.  [Jbzd ()]()*: top 100M*
1.  [Jeja.pl ()]()*: top 100M*
@@ -2278,7 +2278,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Ninjakiwi ()]()*: top 100M*, search is disabled
1.  [NationalgunForum (https://www.nationalgunforum.com)](https://www.nationalgunforum.com)*: top 100M, ca, forum*, search is disabled
1.  [Naturalworld (https://naturalworld.guru)](https://naturalworld.guru)*: top 100M, ru*
1.  [Needrom ()]()*: top 100M*, search is disabled
1.  [Needrom ()]()*: top 100M*
1.  [No-jus (https://no-jus.com)](https://no-jus.com)*: top 100M, ru*, search is disabled
1.  [Numizmat (https://numizmat-forum.ru)](https://numizmat-forum.ru)*: top 100M, forum, ru*
1.  [Nyaa.si ()]()*: top 100M*
@@ -2306,7 +2306,7 @@ Rank data fetched from Majestic Million by domains.
1.  [Polczat.pl ()]()*: top 100M*
1.  [Policja2009 ()]()*: top 100M*
1.  [Polleverywhere ()]()*: top 100M*
1.  [Polymart ()]()*: top 100M*
1.  [Polymart ()]()*: top 100M*, search is disabled
1.  [PornhubPornstars ()]()*: top 100M*
1.  [Poshmark ()]()*: top 100M*
1.  [Pro-cats (http://pro-cats.ru)](http://pro-cats.ru)*: top 100M, ru*
@@ -3158,16 +3158,16 @@ Rank data fetched from Majestic Million by domains.
1.  [AirNFTs (https://app.airnfts.com)](https://app.airnfts.com)*: top 100M, crypto, nft*
1.  [GreasyFork (https://greasyfork.org)](https://greasyfork.org)*: top 100M, coding*

The list was updated at (2026-05-05)
The list was updated at (2026-05-08)
## Statistics

Enabled/total sites: 2510/3154 = 79.58%
Enabled/total sites: 2524/3154 = 80.03%

Incomplete message checks: 308/2510 = 12.27% (false positive risks)
Incomplete message checks: 311/2524 = 12.32% (false positive risks)

Status code checks: 631/2510 = 25.14% (false positive risks)
Status code checks: 636/2524 = 25.2% (false positive risks)

False positive risk (total): 37.41%
False positive risk (total): 37.52%

Sites with probing: 500px, Armchairgm, BinarySearch (disabled), BleachFandom, Bluesky, BongaCams, Boosty, BuyMeACoffee, Calendly, Cent, Chess, Code Sandbox (disabled), Code Snippet Wiki, DailyMotion, Discord, Diskusjon.no, Disqus, Docker Hub, Duolingo, FandomCommunityCentral, GitHub, GitLab, Google Plus (archived), Gravatar, HackTheBox, Hackerrank, Hashnode, Holopin, Imgur, Issuu, Keybase, Kick, Kvinneguiden, LeetCode, Lesswrong, Livejasmin, LocalCryptos (disabled), Medium, MicrosoftLearn, MixCloud, Monkeytype, NPM, Niftygateway, Omg.lol, OnlyFans, Paragraph, Picsart, Plurk, Polarsteps, Rarible, Reddit, Reddit Search (Pushshift) (disabled), Revolut.me, RoyalCams, Scratch, Soop, SportsTracker, Spotify, StackOverflow, Substack, TAP'D, Topcoder, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Vivino, Warframe Market, Warpcast, Weibo, Wikipedia, Yapisal (disabled), YouNow, en.brickimedia.org, forums.grandstream.com, nightbot, notabug.org, qiwi.me (disabled)
@@ -3198,10 +3198,10 @@ Top 20 profile URLs:

Sites by engine:
- `uCoz`: 634/709 (89.4%)
- `XenForo`: 179/223 (80.3%)
- `phpBB/Search`: 120/127 (94.5%)
- `vBulletin`: 30/120 (25.0%)
- `Discourse`: 85/92 (92.4%)
- `XenForo`: 177/223 (79.4%)
- `phpBB/Search`: 119/127 (93.7%)
- `vBulletin`: 31/120 (25.8%)
- `Discourse`: 84/92 (91.3%)
- `phpBB`: 21/27 (77.8%)
- `engine404`: 19/23 (82.6%)
- `op.gg`: 17/17 (100.0%)
@@ -3217,7 +3217,7 @@ Top 20 tags:
- (749) `forum`
- (128) `gaming`
- (88) `coding`
- (58) `photo`
- (57) `photo`
- (46) `tech`
- (45) `social`
- (42) `news`
@@ -3226,8 +3226,8 @@ Top 20 tags:
- (31) `shopping`
- (29) `crypto`
- (27) `finance`
- (25) `video`
- (25) `sharing`
- (24) `video`
- (23) `education`
- (22) `freelance`
- (21) `art`
@@ -53,6 +53,7 @@ DEFAULT_ARGS: Dict[str, Any] = {
    'ai_model': 'gpt-4o',
    'no_autoupdate': False,
    'force_update': False,
    'cloudflare_bypass': False,
}
@@ -0,0 +1,256 @@
"""Tests for the Cloudflare webgate config + checker."""

import json
from types import SimpleNamespace

from mock import Mock
import pytest

from maigret.checking import (
    CloudflareWebgateChecker,
    build_cloudflare_bypass_config,
)


def _settings(payload):
    return SimpleNamespace(cloudflare_bypass=payload)


def test_config_disabled_by_default():
    s = _settings({"enabled": False, "modules": [{"method": "json_api", "url": "x"}]})
    assert build_cloudflare_bypass_config(s, force_enable=False) is None


def test_config_force_enable_overrides_disabled_settings():
    s = _settings({"enabled": False, "modules": [{"method": "json_api", "url": "http://x:8191/v1"}]})
    cfg = build_cloudflare_bypass_config(s, force_enable=True)
    assert cfg is not None
    assert cfg["modules"][0]["url"] == "http://x:8191/v1"


def test_config_drops_invalid_modules():
    s = _settings({
        "enabled": True,
        "modules": [
            {"method": "url_rewrite", "url": "http://x:8000/html"},  # missing {url}
            {"method": "json_api", "url": "http://x:8191/v1"},
            {"method": "unknown", "url": "http://x"},
        ],
    })
    cfg = build_cloudflare_bypass_config(s)
    assert len(cfg["modules"]) == 1
    assert cfg["modules"][0]["method"] == "json_api"


def test_config_returns_none_when_no_valid_modules():
    s = _settings({"enabled": True, "modules": [{"method": "url_rewrite", "url": "no-placeholder"}]})
    assert build_cloudflare_bypass_config(s) is None


def test_config_default_trigger_protection():
    s = _settings({"enabled": True, "modules": [{"method": "json_api", "url": "http://x:8191/v1"}]})
    cfg = build_cloudflare_bypass_config(s)
    assert "cf_js_challenge" in cfg["trigger_protection"]
    assert "cf_firewall" in cfg["trigger_protection"]
    assert "webgate" in cfg["trigger_protection"]


@pytest.mark.asyncio
async def test_flaresolverr_success(httpserver):
    httpserver.expect_request("/v1", method="POST").respond_with_json({
        "status": "ok",
        "solution": {"status": 404, "response": "<html>missing</html>", "url": "https://site/missing"},
    })
    config = {
        "modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}],
        "session_prefix": "test",
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/missing", timeout=5)
    body, status, err = await c.check()
    assert err is None
    assert status == 404  # upstream status preserved; fixes status_code checktype
    assert "missing" in body


@pytest.mark.asyncio
async def test_flaresolverr_solver_error_propagates(httpserver):
    httpserver.expect_request("/v1", method="POST").respond_with_json({
        "status": "error",
        "message": "Challenge could not be solved",
    })
    config = {
        "modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}],
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", timeout=5)
    body, status, err = await c.check()
    assert err is not None
    assert "Challenge could not be solved" in err.desc


@pytest.mark.asyncio
async def test_falls_back_to_next_module_on_failure(httpserver):
    # Bind only the second module; the first is unreachable.
    httpserver.expect_request("/v1", method="POST").respond_with_json({
        "status": "ok",
        "solution": {"status": 200, "response": "ok-from-second", "url": "https://x"},
    })
    config = {
        "modules": [
            {"name": "broken", "method": "json_api", "url": "http://127.0.0.1:1/v1"},
            {"name": "good", "method": "json_api", "url": httpserver.url_for("/v1")},
        ],
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", timeout=5)
    body, status, err = await c.check()
    assert err is None
    assert status == 200
    assert body == "ok-from-second"


@pytest.mark.asyncio
async def test_url_rewrite_returns_html_with_synthetic_200(httpserver):
    # CloudflareBypassForScraping returns just the rendered HTML, no JSON wrapper.
    httpserver.expect_request("/html").respond_with_data(
        "<html>profile body</html>", status=200, content_type="text/html"
    )
    config = {
        "modules": [{
            "name": "cbfs",
            "method": "url_rewrite",
            "url": httpserver.url_for("/html") + "?url={url}",
        }],
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", timeout=5)
    body, status, err = await c.check()
    assert err is None
    assert status == 200  # synthetic: url_rewrite cannot recover real status
    assert "profile body" in body


@pytest.mark.asyncio
async def test_all_modules_unreachable_actionable_error():
    config = {
        "modules": [
            {"name": "fs", "method": "json_api", "url": "http://127.0.0.1:1/v1"},
            {"name": "cbfs", "method": "url_rewrite", "url": "http://127.0.0.1:2/html?url={url}"},
        ],
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", timeout=2)
    body, status, err = await c.check()
    assert err is not None
    assert err.type == "Webgate unavailable"
    # Per-module attempt summary helps users see WHICH backend failed
    assert "fs:" in err.desc and "cbfs:" in err.desc
    # Primary URL is shown so the user knows where to look
    assert "http://127.0.0.1:1/v1" in err.desc
    # FlareSolverr docker hint when primary is json_api
    assert "flaresolverr" in err.desc.lower()


@pytest.mark.asyncio
async def test_session_is_scoped_per_host(httpserver):
    seen_sessions = []

    def handler(request):
        seen_sessions.append(request.get_json()["session"])
        return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}

    httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
    config = {"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}]}
    c = CloudflareWebgateChecker(logger=Mock(), config=config)

    c.prepare(url="https://patreon.com/foo", timeout=5)
    await c.check()
    c.prepare(url="https://patreon.com/bar", timeout=5)
    await c.check()
    c.prepare(url="https://lomography.com/baz", timeout=5)
    await c.check()

    assert seen_sessions[0] == seen_sessions[1], "same host -> same session"
    assert seen_sessions[0] != seen_sessions[2], "different host -> different session"
    assert "patreon.com" in seen_sessions[0]
    assert "lomography.com" in seen_sessions[2]


@pytest.mark.asyncio
async def test_flaresolverr_request_body_shape(httpserver):
    captured = {}

    def handler(request):
        captured["body"] = request.get_json()
        return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}

    httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
    config = {"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}]}
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", headers={"User-Agent": "test-ua/1.0"}, timeout=5)
    await c.check()
    body = captured["body"]
    assert body["cmd"] == "request.get"
    assert body["url"] == "https://site/page"
    assert body["session"].startswith("maigret-")
    # userAgent was removed in FlareSolverr v2; the impersonated browser's
    # own UA must be used to keep TLS+UA consistent.
    assert "userAgent" not in body
    assert "proxy" not in body


@pytest.mark.asyncio
async def test_flaresolverr_proxy_string_passed_through(httpserver):
    captured = {}

    def handler(request):
        captured["body"] = request.get_json()
        return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}

    httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
    config = {
        "modules": [
            {
                "name": "fs",
                "method": "json_api",
                "url": httpserver.url_for("/v1"),
                "proxy": "socks5://localhost:1080",
            }
        ]
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", headers={}, timeout=5)
    await c.check()
    assert captured["body"]["proxy"] == {"url": "socks5://localhost:1080"}


@pytest.mark.asyncio
async def test_flaresolverr_proxy_dict_with_credentials(httpserver):
    captured = {}

    def handler(request):
        captured["body"] = request.get_json()
        return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}

    httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
    config = {
        "modules": [
            {
                "name": "fs",
                "method": "json_api",
                "url": httpserver.url_for("/v1"),
                "proxy": {
                    "url": "http://proxy.example:3128",
                    "username": "u",
                    "password": "p",
                    "stripped_extra": "ignored",
                },
            }
        ]
    }
    c = CloudflareWebgateChecker(logger=Mock(), config=config)
    c.prepare(url="https://site/page", headers={}, timeout=5)
    await c.check()
    proxy = captured["body"]["proxy"]
    assert proxy == {"url": "http://proxy.example:3128", "username": "u", "password": "p"}
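
The tests above pin down two behaviours: which module entries survive config validation, and what the JSON body sent to a `json_api` backend looks like. The following is a minimal sketch of logic that would satisfy those assertions, not the code in `maigret.checking`. The helper names (`filter_modules`, `normalize_proxy`, `build_request_body`) and the exact `<prefix>-<host>` session format are assumptions made for illustration; the tests only require that the session starts with `maigret-` by default and contains the host.

```python
from typing import Any, Dict, List, Optional, Union
from urllib.parse import urlsplit

# Methods exercised by the tests; anything else is dropped.
VALID_METHODS = ("json_api", "url_rewrite")
# Proxy keys the tests expect to survive; other keys are stripped.
PROXY_KEYS = ("url", "username", "password")


def filter_modules(modules: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only module entries the tests treat as valid."""
    kept = []
    for module in modules:
        method = module.get("method")
        url = module.get("url", "")
        if method not in VALID_METHODS or not url:
            continue
        # url_rewrite backends need a {url} placeholder for the target URL.
        if method == "url_rewrite" and "{url}" not in url:
            continue
        kept.append(module)
    return kept


def normalize_proxy(proxy: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: accept a proxy string or dict, return a clean dict."""
    if isinstance(proxy, str):
        return {"url": proxy}
    return {key: proxy[key] for key in PROXY_KEYS if key in proxy}


def build_request_body(
    target_url: str,
    proxy: Optional[Union[str, Dict[str, Any]]] = None,
    session_prefix: str = "maigret",
) -> Dict[str, Any]:
    """Hypothetical helper: build a request body shaped like the one the tests capture."""
    host = urlsplit(target_url).hostname or ""
    body: Dict[str, Any] = {
        "cmd": "request.get",
        "url": target_url,
        # Assumed format: one session per host, so cookies are reused per site.
        "session": f"{session_prefix}-{host}",
    }
    # No userAgent key: the tests assert it must be absent.
    if proxy:
        body["proxy"] = normalize_proxy(proxy)
    return body
```

Scoping the session by host matches `test_session_is_scoped_per_host`: two URLs on `patreon.com` would share a session, while a `lomography.com` URL would get a separate one.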