Cloudflare bypass webgate (#2628)

Soxoj
2026-05-09 10:48:43 +03:00
committed by GitHub
parent b98a134fcf
commit 5c93b206e7
14 changed files with 1170 additions and 159 deletions
+11
@@ -95,6 +95,17 @@ the run after the explicit update finishes.
``--retries RETRIES`` - Count of attempts to restart temporarily failed
requests.
``--cloudflare-bypass`` *(experimental)* - Route checks for sites tagged
``protection: ["cf_js_challenge"]`` / ``["cf_firewall"]`` / ``["webgate"]``
through a local Chrome-based solver (FlareSolverr by default). The bypass
is opt-in — without this flag (or
``settings.cloudflare_bypass.enabled = true``) those sites are checked
the usual way, which Cloudflare almost always blocks: you get an UNKNOWN
status with a JS-challenge / firewall error rather than a real result.
Configure the backend in ``settings.cloudflare_bypass.modules``.
See :ref:`cloudflare-bypass`. **Experimental** — the flag, schema and
routing rules may change without backwards-compatibility guarantees.
.. _custom-database:
Using a custom sites database
+55
@@ -237,6 +237,61 @@ The Maigret database contains not only the original websites, but also mirrors,
It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)
.. _cloudflare-bypass:
Cloudflare webgate bypass
-------------------------
.. warning::
**Experimental feature.** The Cloudflare webgate is under active
development. The configuration schema, CLI flag behaviour, and the set
of sites that route through it may change without backwards-compatibility
guarantees. Expect rough edges (CF rate limits, occasional solver
failures) and report issues so they can be ironed out.
Some sites sit behind a full Cloudflare JavaScript challenge or a CF firewall
hard block — these are tagged ``protection: ["cf_js_challenge"]`` or
``protection: ["cf_firewall"]`` in the database and are normally kept disabled
because neither aiohttp nor curl_cffi can solve the JS challenge on their own.
Maigret can offload these checks to a local Chrome-based solver. Two backends
are supported, configured in ``settings.json`` under
``cloudflare_bypass.modules`` (the first reachable module wins; subsequent
ones are tried as a fallback chain):
* **FlareSolverr** (recommended). Runs a real Chrome instance and exposes a
JSON API. The upstream HTTP status, headers and final URL are preserved, so
``checkType: status_code`` and ``checkType: response_url`` keep working
through the bypass.
.. code-block:: console
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
* **CloudflareBypassForScraping** (legacy fallback). Returns rendered HTML
only, so the upstream status code is lost — ``checkType: message`` keeps
working but ``status_code`` checks misfire (treated as 200 on success).
Activate the bypass either with the CLI flag::
maigret --cloudflare-bypass <username>
or by setting ``cloudflare_bypass.enabled`` to ``true`` in ``settings.json``.
The bypass only fires for sites whose ``protection`` field intersects
``cloudflare_bypass.trigger_protection`` (default
``["cf_js_challenge", "cf_firewall", "webgate"]``); all other sites use the
normal aiohttp / curl_cffi path.
If all configured modules are unreachable, affected sites get an UNKNOWN
status with an actionable error pointing at the first module's URL — the
fix is almost always to start the FlareSolverr container.
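The fallback chain and the "all modules unreachable" case described above can be sketched as follows. This is only an illustration under stated assumptions, not Maigret's actual implementation: ``tcp_reachable``, ``pick_module`` and ``unreachable_hint`` are hypothetical names, a plain TCP connect stands in for whatever reachability probe Maigret really uses, and the wording of the hint is invented.

```python
import socket
from urllib.parse import urlsplit


def tcp_reachable(url, timeout=2.0):
    """Hypothetical probe: True if a TCP connection to the module's host/port succeeds."""
    parts = urlsplit(url)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_module(modules, is_reachable=tcp_reachable):
    """Return the first reachable module in configuration order, or None."""
    for module in modules:
        if is_reachable(module["url"]):
            return module
    return None


def unreachable_hint(modules):
    """Build an error message that points at the first configured module's URL."""
    first_url = modules[0]["url"]
    return f"No Cloudflare bypass module is reachable; is the solver running at {first_url}?"


modules = [
    {"name": "flaresolverr", "method": "json_api", "url": "http://localhost:8191/v1"},
    {"name": "chrome_webgate", "method": "url_rewrite",
     "url": "http://localhost:8000/html?url={url}&retries=1"},
]
```

Passing the probe in as ``is_reachable`` keeps the selection logic testable without a running solver.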
FlareSolverr session reuse is automatic: Maigret pins a single
``session: <session_prefix>-<pid>`` per run, so cf_clearance cookies are
shared between checks of the same domain (5-10× faster on subsequent
requests to that host).
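As a rough sketch of the session pinning described above, the request body sent to FlareSolverr could be built like this. The ``cmd``, ``url``, ``maxTimeout`` and ``session`` keys follow FlareSolverr's ``/v1`` JSON API; the helper name ``build_flaresolverr_payload`` and its parameters are hypothetical, and the way Maigret actually assembles the request may differ.

```python
import os


def build_flaresolverr_payload(url, session_prefix="maigret",
                               max_timeout_ms=60000, pid=None):
    """Build a FlareSolverr request.get body pinned to one session per process."""
    # Assumption: the session name is "<session_prefix>-<pid>", as the text describes.
    pid = os.getpid() if pid is None else pid
    return {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": max_timeout_ms,
        "session": f"{session_prefix}-{pid}",
    }


payload = build_flaresolverr_payload("https://example.com/user/alice", pid=1234)
```

Because every check in a run carries the same ``session`` value, the solver can reuse the browser context and its cookies instead of solving the challenge again for each request.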
Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
+22
@@ -125,3 +125,25 @@ After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.
Optional: Cloudflare bypass solver
----------------------------------
.. warning::
**Experimental.** The Cloudflare webgate is under active development;
the configuration schema and CLI behaviour may change without
backwards-compatibility guarantees.
Sites tagged ``cf_js_challenge`` / ``cf_firewall`` need a real browser to pass
their JavaScript challenge. To check those sites you can run a local
`FlareSolverr <https://github.com/FlareSolverr/FlareSolverr>`_ instance —
Maigret will route protected checks to it when ``--cloudflare-bypass`` is set:
.. code-block:: bash
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
This is **optional** — Maigret runs without it; only sites whose
``protection`` field intersects ``settings.cloudflare_bypass.trigger_protection``
require the solver. See :ref:`cloudflare-bypass` for details.
+89
@@ -102,6 +102,95 @@ This is recommended for **Docker containers**, **CI pipelines**, and **air-gappe
**Using a custom database** with ``--db`` always skips auto-update — you are explicitly choosing your data source.
Cloudflare webgate
------------------
.. warning::
**Experimental.** The ``cloudflare_bypass`` block is under active
development; field names, defaults, and the trigger-protection routing
rules may change without backwards-compatibility guarantees.
The ``cloudflare_bypass`` block in ``settings.json`` configures the optional
bypass described in :ref:`cloudflare-bypass`. Default value:
.. code-block:: json
{
"cloudflare_bypass": {
"enabled": false,
"session_prefix": "maigret",
"trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
"modules": [
{
"name": "flaresolverr",
"method": "json_api",
"url": "http://localhost:8191/v1",
"max_timeout_ms": 60000
},
{
"name": "chrome_webgate",
"method": "url_rewrite",
"url": "http://localhost:8000/html?url={url}&retries=1"
}
]
}
}
**Fields.**
.. list-table::
:header-rows: 1
:widths: 30 70
* - Field
- Description
* - ``enabled``
- When ``true``, the bypass is active for every run; when ``false``
(the default), it activates only on ``--cloudflare-bypass``.
* - ``trigger_protection``
- List of ``site.protection`` values that route a check through the
webgate. Sites whose protection is empty or doesn't intersect this
list use the default (aiohttp / curl_cffi) checker.
* - ``session_prefix``
- Prefix for the FlareSolverr ``session`` field. Maigret appends the
process PID so concurrent runs don't collide. Reusing a session
caches cf_clearance between checks of the same domain.
* - ``modules``
- Ordered list of backend modules. The first reachable module
handles the check; later ones serve as a fallback chain.
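The interaction between ``enabled``, the CLI flag and ``trigger_protection`` can be illustrated with a small sketch. The function name ``should_use_bypass`` and its arguments are hypothetical; only the rule itself (flag or setting enables the bypass, and a site is routed through it when its protection tags intersect the trigger list) comes from the description above.

```python
DEFAULT_TRIGGER_PROTECTION = ["cf_js_challenge", "cf_firewall", "webgate"]


def should_use_bypass(site_protection, settings_enabled=False, cli_flag=False,
                      trigger_protection=DEFAULT_TRIGGER_PROTECTION):
    """Return True when a site's check should be routed through the webgate."""
    # The bypass is opt-in: either the setting or the CLI flag must turn it on.
    if not (settings_enabled or cli_flag):
        return False
    # An empty or missing protection list never intersects the trigger list.
    return bool(set(site_protection or []) & set(trigger_protection))
```

For example, a site tagged only ``ip_reputation`` stays on the default checker even when the flag is set, because that tag is not in the default trigger list.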
**Module methods.**
* ``json_api`` — FlareSolverr-compatible POST endpoint at ``url``.
Preserves real upstream HTTP status, headers and final URL.
Optional ``max_timeout_ms`` (default ``60000``) is the per-request
budget the solver is allowed to spend on the JS challenge.
* ``url_rewrite`` — legacy CloudflareBypassForScraping endpoint. The
``url`` must contain a ``{url}`` placeholder; the original probe URL
is URL-encoded and substituted in. Returns rendered HTML only —
``checkType: status_code`` and ``response_url`` checks misfire under
this method (treated as a synthetic HTTP 200 on success).
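The ``{url}`` substitution used by ``url_rewrite`` can be sketched with the standard library. ``rewrite_url`` is a hypothetical helper name, and encoding every reserved character (``safe=""``) is an assumption about how strictly the probe URL is escaped.

```python
from urllib.parse import quote


def rewrite_url(template, probe_url):
    """Substitute the URL-encoded probe URL into a url_rewrite template."""
    if "{url}" not in template:
        raise ValueError("url_rewrite template must contain a {url} placeholder")
    # safe="" also encodes "/" and ":" so the probe URL fits in one query parameter.
    return template.replace("{url}", quote(probe_url, safe=""))


rewritten = rewrite_url(
    "http://localhost:8000/html?url={url}&retries=1",
    "https://example.com/user/alice?tab=1",
)
```

Using ``str.replace`` rather than ``str.format`` avoids errors if the template ever contains other literal braces.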
**Optional ``proxy`` field (``json_api`` only).**
A module may carry a ``proxy`` entry that the solver routes the upstream
request through. Useful when a site enforces ``ip_reputation`` rules
that block the solver host. Two forms are accepted:
.. code-block:: json
{ "proxy": "socks5://localhost:1080" }
.. code-block:: json
{ "proxy": { "url": "http://gw.example:3128",
"username": "u",
"password": "p" } }
Only ``url``/``username``/``password`` are forwarded; other keys are
dropped. Cloudflare ``Error 1015 / 1020`` responses indicate the IP is
rate-limited or banned — switch the proxy rather than retrying.
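Accepting both ``proxy`` forms and dropping unknown keys could look roughly like the sketch below. ``normalize_proxy`` is a hypothetical name, and emitting a ``{"url": ...}`` object for the string form is an assumption based on the proxy object that FlareSolverr's API accepts.

```python
ALLOWED_PROXY_KEYS = ("url", "username", "password")


def normalize_proxy(proxy):
    """Turn either accepted proxy form into the object forwarded to the solver."""
    if proxy is None:
        return None
    if isinstance(proxy, str):
        # Short form: a bare proxy URL.
        return {"url": proxy}
    if isinstance(proxy, dict):
        # Long form: forward only the whitelisted keys, drop everything else.
        return {key: proxy[key] for key in ALLOWED_PROXY_KEYS if key in proxy}
    raise TypeError("proxy must be a string or an object")
```

Whitelisting keys keeps stray settings out of the request sent to the solver.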
.. _ai-analysis-settings:
AI analysis