mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-09 16:14:32 +00:00
Cloudflare bypass webgate (#2628)
@@ -95,6 +95,17 @@ the run after the explicit update finishes.
``--retries RETRIES`` - Count of attempts to restart temporarily failed
requests.

``--cloudflare-bypass`` *(experimental)* - Route checks for sites tagged
``protection: ["cf_js_challenge"]`` / ``["cf_firewall"]`` / ``["webgate"]``
through a local Chrome-based solver (FlareSolverr by default). The bypass
is opt-in: without this flag (or
``settings.cloudflare_bypass.enabled = true``) those sites are checked
the usual way, which Cloudflare almost always blocks, so you get an UNKNOWN
status with a JS-challenge / firewall error rather than a real result.
Configure the backend in ``settings.cloudflare_bypass.modules``.
See :ref:`cloudflare-bypass`. **Experimental**: the flag, schema and
routing rules may change without backwards-compatibility guarantees.

.. _custom-database:

Using a custom sites database

@@ -237,6 +237,61 @@ The Maigret database contains not only the original websites, but also mirrors,
It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)

.. _cloudflare-bypass:

Cloudflare webgate bypass
-------------------------

.. warning::

   **Experimental feature.** The Cloudflare webgate is under active
   development. The configuration schema, CLI flag behaviour, and the set
   of sites that route through it may change without backwards-compatibility
   guarantees. Expect rough edges (CF rate limits, occasional solver
   failures) and report issues so they can be ironed out.


Some sites sit behind a full Cloudflare JavaScript challenge or a CF firewall
hard block. These are tagged ``protection: ["cf_js_challenge"]`` or
``protection: ["cf_firewall"]`` in the database and are normally kept disabled
because neither aiohttp nor curl_cffi can solve the JS challenge on their own.

Maigret can offload these checks to a local Chrome-based solver. Two backends
are supported, configured in ``settings.json`` under
``cloudflare_bypass.modules`` (modules are tried in order: the first
reachable one handles the check, and later ones act as fallbacks):

* **FlareSolverr** (recommended). Runs a real Chrome instance and exposes a
  JSON API. The upstream HTTP status, headers and final URL are preserved, so
  ``checkType: status_code`` and ``checkType: response_url`` keep working
  through the bypass.

  .. code-block:: console

     docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest

* **CloudflareBypassForScraping** (legacy fallback). Returns rendered HTML
  only, so the upstream status code is lost: ``checkType: message`` keeps
  working but ``status_code`` checks misfire (treated as 200 on success).


Activate the bypass either with the CLI flag::

    maigret --cloudflare-bypass <username>

or by setting ``cloudflare_bypass.enabled`` to ``true`` in ``settings.json``.
The bypass only fires for sites whose ``protection`` field intersects
``cloudflare_bypass.trigger_protection`` (default
``["cf_js_challenge", "cf_firewall", "webgate"]``); all other sites use the
normal aiohttp / curl_cffi path.

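
The routing rule above amounts to a set intersection. The following is a
minimal sketch of that rule, not Maigret's actual implementation: the function
name ``needs_bypass`` and its signature are hypothetical, and it assumes a
site's ``protection`` tags are available as a list of strings.

.. code-block:: python

   from typing import Iterable, Optional

   # Default taken from the documented ``trigger_protection`` setting.
   DEFAULT_TRIGGERS = ("cf_js_challenge", "cf_firewall", "webgate")


   def needs_bypass(
       site_protection: Optional[Iterable[str]],
       trigger_protection: Iterable[str] = DEFAULT_TRIGGERS,
       enabled: bool = True,
   ) -> bool:
       """Return True when a check should be routed through the solver.

       Hypothetical helper: the bypass fires only if it is enabled and the
       site's protection tags share at least one value with the trigger list.
       """
       if not enabled:
           return False
       return bool(set(site_protection or ()) & set(trigger_protection))


   print(needs_bypass(["cf_firewall"]))    # tagged site goes to the solver
   print(needs_bypass(["ip_reputation"]))  # other tags use the normal path

A site with an empty or missing ``protection`` field never matches, which
mirrors the "all other sites use the normal path" behaviour described above.
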

If all configured modules are unreachable, affected sites get an UNKNOWN
status with an actionable error pointing at the first module's URL. The
fix is almost always to start the FlareSolverr container.

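
The fallback chain can be pictured as "first reachable module wins, otherwise
report the first module's URL". The sketch below illustrates that idea under
stated assumptions: ``pick_module`` and ``tcp_reachable`` are hypothetical
names, and probing reachability with a plain TCP connection is an assumption,
not necessarily how Maigret detects an unavailable solver.

.. code-block:: python

   import socket
   from typing import Callable, Dict, Sequence
   from urllib.parse import urlsplit


   def tcp_reachable(url: str, timeout: float = 1.0) -> bool:
       """Hypothetical probe: try to open a TCP connection to the URL's host."""
       parts = urlsplit(url)
       port = parts.port or (443 if parts.scheme == "https" else 80)
       try:
           with socket.create_connection((parts.hostname, port), timeout=timeout):
               return True
       except OSError:
           return False


   def pick_module(
       modules: Sequence[Dict[str, str]],
       is_reachable: Callable[[str], bool] = tcp_reachable,
   ) -> Dict[str, str]:
       """Return the first reachable module, preserving the configured order."""
       for module in modules:
           if is_reachable(module["url"]):
               return module
       # Nothing answered: point the user at the first configured module.
       first_url = modules[0]["url"] if modules else "<no modules configured>"
       raise RuntimeError(
           f"No Cloudflare bypass module is reachable; is the solver running at {first_url}?"
       )

Passing ``is_reachable`` as a parameter keeps the selection logic testable
without a running solver.
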

FlareSolverr session reuse is automatic: Maigret pins a single
``session: <session_prefix>-<pid>`` per run, so cf_clearance cookies are
shared between checks of the same domain (5–10× faster on subsequent
requests to that host).

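
To make the session naming concrete, here is a rough sketch of what a request
to a FlareSolverr-style JSON API might look like. The payload keys (``cmd``,
``url``, ``session``, ``maxTimeout``) and the ``solution`` fields follow
FlareSolverr's published ``/v1`` API; the helper names and the way they are
combined are assumptions for illustration, not Maigret's internal code.

.. code-block:: python

   import os
   from typing import Any, Dict, Tuple


   def session_name(prefix: str = "maigret") -> str:
       """Build the ``<session_prefix>-<pid>`` name described above."""
       return f"{prefix}-{os.getpid()}"


   def build_request(url: str, session: str, max_timeout_ms: int = 60000) -> Dict[str, Any]:
       """Hypothetical helper: JSON body for a FlareSolverr ``request.get`` call."""
       return {
           "cmd": "request.get",
           "url": url,
           "session": session,            # reused for every check in this run
           "maxTimeout": max_timeout_ms,  # budget for solving the challenge
       }


   def parse_solution(body: Dict[str, Any]) -> Tuple[int, str, str]:
       """Extract upstream status, final URL and HTML from a solver reply."""
       if body.get("status") != "ok":
           raise RuntimeError(body.get("message") or "solver returned an error")
       solution = body["solution"]
       return solution["status"], solution["url"], solution["response"]


   payload = build_request("https://example.com/u/alice", session_name())
   print(payload["session"])

Because the session name is derived from the process ID, two Maigret runs
started in parallel end up with distinct solver sessions.
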

Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.

@@ -125,3 +125,25 @@ After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.

Optional: Cloudflare bypass solver
----------------------------------

.. warning::

   **Experimental.** The Cloudflare webgate is under active development;
   the configuration schema and CLI behaviour may change without
   backwards-compatibility guarantees.

Sites tagged ``cf_js_challenge`` / ``cf_firewall`` need a real browser to pass
their JavaScript challenge. To check those sites you can run a local
`FlareSolverr <https://github.com/FlareSolverr/FlareSolverr>`_ instance.
Maigret will route protected checks to it when ``--cloudflare-bypass`` is set:

.. code-block:: bash

   docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest

This is **optional**: Maigret runs without it, and only sites whose
``protection`` field intersects ``settings.cloudflare_bypass.trigger_protection``
require the solver. See :ref:`cloudflare-bypass` for details.

@@ -102,6 +102,95 @@ This is recommended for **Docker containers**, **CI pipelines**, and **air-gappe
**Using a custom database** with ``--db`` always skips auto-update, because you are explicitly choosing your data source.

Cloudflare webgate
------------------

.. warning::

   **Experimental.** The ``cloudflare_bypass`` block is under active
   development; field names, defaults, and the trigger-protection routing
   rules may change without backwards-compatibility guarantees.

The ``cloudflare_bypass`` block in ``settings.json`` configures the optional
bypass described in :ref:`cloudflare-bypass`. Default value:

.. code-block:: json

   {
     "cloudflare_bypass": {
       "enabled": false,
       "session_prefix": "maigret",
       "trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
       "modules": [
         {
           "name": "flaresolverr",
           "method": "json_api",
           "url": "http://localhost:8191/v1",
           "max_timeout_ms": 60000
         },
         {
           "name": "chrome_webgate",
           "method": "url_rewrite",
           "url": "http://localhost:8000/html?url={url}&retries=1"
         }
       ]
     }
   }

**Fields.**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Field
     - Description
   * - ``enabled``
     - When ``true``, the bypass is active for every run; when ``false``
       (the default), it activates only on ``--cloudflare-bypass``.
   * - ``trigger_protection``
     - List of ``site.protection`` values that route a check through the
       webgate. Sites whose protection is empty or doesn't intersect this
       list use the default (aiohttp / curl_cffi) checker.
   * - ``session_prefix``
     - Prefix for the FlareSolverr ``session`` field. Maigret appends the
       process PID so concurrent runs don't collide. Reusing a session
       caches cf_clearance between checks of the same domain.
   * - ``modules``
     - Ordered list of backend modules. The first reachable module
       handles the check; later ones serve as a fallback chain.

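
The interaction between ``enabled`` and the CLI flag reduces to a logical OR.
A small sketch of that rule, assuming the settings file has already been
parsed into a dictionary; ``bypass_active`` is a hypothetical name and does
not refer to a real Maigret function.

.. code-block:: python

   import json
   from typing import Any, Dict


   def bypass_active(settings: Dict[str, Any], cli_flag: bool = False) -> bool:
       """Hypothetical helper: the bypass is on if either source enables it."""
       block = settings.get("cloudflare_bypass", {})
       return bool(cli_flag or block.get("enabled", False))


   # With the documented default (``enabled: false``) only the flag turns it on.
   settings = json.loads('{"cloudflare_bypass": {"enabled": false}}')
   print(bypass_active(settings))
   print(bypass_active(settings, cli_flag=True))
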
**Module methods.**

* ``json_api``: FlareSolverr-compatible POST endpoint at ``url``.
  Preserves the real upstream HTTP status, headers and final URL.
  Optional ``max_timeout_ms`` (default ``60000``) is the per-request
  budget the solver is allowed to spend on the JS challenge.
* ``url_rewrite``: legacy CloudflareBypassForScraping endpoint. The
  ``url`` must contain a ``{url}`` placeholder; the original probe URL
  is URL-encoded and substituted in. Returns rendered HTML only, so
  ``checkType: status_code`` and ``response_url`` checks misfire under
  this method (treated as a synthetic HTTP 200 on success).

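
The ``url_rewrite`` substitution can be sketched in a few lines. This is an
illustration only: ``rewrite_url`` is a hypothetical helper, and encoding
every reserved character (``safe=""``) is an assumption about how strictly
the probe URL is escaped.

.. code-block:: python

   from urllib.parse import quote


   def rewrite_url(template: str, probe_url: str) -> str:
       """Substitute the URL-encoded probe URL into the ``{url}`` placeholder."""
       if "{url}" not in template:
           raise ValueError("url_rewrite module URL must contain a {url} placeholder")
       # str.replace is used instead of str.format so that other braces in the
       # template are left untouched.
       return template.replace("{url}", quote(probe_url, safe=""))


   print(rewrite_url(
       "http://localhost:8000/html?url={url}&retries=1",
       "https://example.com/u/alice?tab=1",
   ))
   # http://localhost:8000/html?url=https%3A%2F%2Fexample.com%2Fu%2Falice%3Ftab%3D1&retries=1

Encoding the probe URL keeps its own query string from being confused with
the solver endpoint's parameters such as ``retries``.
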

**Optional proxy field** (``json_api`` modules only).

A module may carry a ``proxy`` entry that the solver routes the upstream
request through. Useful when a site enforces ``ip_reputation`` rules
that block the solver host. Two forms are accepted:

.. code-block:: json

   { "proxy": "socks5://localhost:1080" }

.. code-block:: json

   { "proxy": { "url": "http://gw.example:3128",
                "username": "u",
                "password": "p" } }

Only ``url``/``username``/``password`` are forwarded; other keys are
dropped. Cloudflare ``Error 1015 / 1020`` responses indicate the IP is
rate-limited or banned, so switch the proxy rather than retrying.

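
The two accepted ``proxy`` forms can be normalised to one shape before being
passed on. The sketch below assumes the string form is equivalent to a
dictionary with only ``url`` set; ``normalize_proxy`` is a hypothetical name
used for illustration.

.. code-block:: python

   from typing import Any, Dict, Optional

   # Only these keys are forwarded to the solver; anything else is dropped.
   ALLOWED_PROXY_KEYS = ("url", "username", "password")


   def normalize_proxy(proxy: Any) -> Optional[Dict[str, str]]:
       """Turn either accepted ``proxy`` form into a filtered dictionary."""
       if proxy is None:
           return None
       if isinstance(proxy, str):
           return {"url": proxy}
       if isinstance(proxy, dict):
           if "url" not in proxy:
               raise ValueError("proxy mapping must contain a 'url' key")
           return {key: proxy[key] for key in ALLOWED_PROXY_KEYS if key in proxy}
       raise TypeError("proxy must be a string or a mapping")


   print(normalize_proxy("socks5://localhost:1080"))
   print(normalize_proxy({"url": "http://gw.example:3128", "username": "u",
                          "password": "p", "label": "ignored"}))
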

.. _ai-analysis-settings:

AI analysis