Compare commits

1 Commits

Author SHA1 Message Date
Soxoj ffb4c1856c Make xhtml2pdf optional, fix install on Linux without libcairo
Move xhtml2pdf to the new [pdf] extra so default `pip install maigret`
no longer pulls pycairo (which has no Linux/macOS wheels and breaks the
build without libcairo2-dev). save_pdf_report now raises a clear
RuntimeError pointing to `pip install 'maigret[pdf]'`, and the CLI
turns it into a friendly warning instead of a crash. Adds tests
covering the missing-extra path, plus per-OS install docs.

Fix for #2657, #2534
2026-05-15 12:17:10 +02:00
12 changed files with 265 additions and 95 deletions
+2
@@ -173,6 +173,8 @@ docker build --target web -t maigret-web . # Web UI image
Build errors? See the [troubleshooting guide](https://maigret.readthedocs.io/en/latest/installation.html#troubleshooting).
PDF reports (`--pdf`) are an optional extra — install with `pip install 'maigret[pdf]'`. They need system-level graphics libraries on Linux/macOS; see the [PDF reports section](https://maigret.readthedocs.io/en/latest/installation.html#optional-pdf-reports-maigret-pdf) for per-OS install steps.
## Usage
### Examples
+137
@@ -58,6 +58,17 @@ Maigret ships with a bundled site database. After installation from PyPI (or any
# usage
maigret username
PDF report support is shipped as an **optional extra** because it relies on
system-level graphics libraries that pip cannot install for you. If you plan to
use ``--pdf``, install Maigret with the ``pdf`` extra:
.. code-block:: bash
pip3 install 'maigret[pdf]'
See :ref:`pdf-extra` below for the full background on why PDF support is
optional and how to fix the most common build errors.
Development version (GitHub)
----------------------------
@@ -126,6 +137,132 @@ After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.
.. _pdf-extra:
Optional: PDF reports (``maigret[pdf]``)
----------------------------------------
The ``--pdf`` report format is shipped as an optional extra. To enable it:
.. code-block:: bash
pip3 install 'maigret[pdf]'
If PDF support is not installed and you pass ``--pdf``, Maigret prints a
warning and continues without crashing — every other output format
(``--html``, ``--json``, ``--csv``, ``--txt``, ``--xmind``, ``--graph``)
keeps working.
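A quick way to tell ahead of time whether the extra is present is to ask Python whether ``xhtml2pdf`` can be located. The sketch below is not part of Maigret: ``pdf_extra_available`` is a hypothetical helper name, and it only confirms that the package is discoverable, not that its native dependencies load correctly.

```python
import importlib.util


def pdf_extra_available(module_name: str = "xhtml2pdf") -> bool:
    """Return True if the import system can locate the given module.

    Hypothetical helper for illustration; Maigret itself does not ship it.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialised modules;
        # treat those cases as "not available".
        return False


if __name__ == "__main__":
    if pdf_extra_available():
        print("PDF extra looks installed")
    else:
        print("PDF extra missing; install with: pip install 'maigret[pdf]'")
```

A check like this is cheap because ``find_spec`` does not import a top-level module, so it avoids the start-up cost that the lazy import in Maigret is designed to dodge.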
Why is PDF optional?
~~~~~~~~~~~~~~~~~~~~
Maigret renders PDFs by converting an HTML template, and that conversion
pipeline ultimately depends on the ``cairo`` graphics library through a
chain of Python packages roughly shaped like::
maigret[pdf] → xhtml2pdf → svglib → rlPyCairo → pycairo → libcairo2 (system)
The bottom of that chain is a C library — ``libcairo2`` — that has to exist
on the host *before* pip can build the Python bindings. The Python binding
package (``pycairo``) currently ships **only Windows wheels** on PyPI; on
Linux and macOS pip falls back to building from source, and the build fails
the moment ``pkg-config`` cannot find ``cairo``. The error looks like::
../cairo/meson.build:31:12: ERROR: Dependency "cairo" not found (tried pkg-config)
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
Pulling in this whole chain on every Maigret install, just so the much smaller
group of users who actually want PDFs can have them, is a poor trade, so
``xhtml2pdf`` is gated behind the ``pdf`` extra.
Installing the system prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Install the cairo headers, ``pkg-config``, and a working C toolchain
*before* running ``pip install 'maigret[pdf]'``.
**Debian / Ubuntu / Linux Mint / Kali:**
.. code-block:: bash
sudo apt update
sudo apt install -y libcairo2-dev pkg-config python3-dev build-essential
pip3 install --upgrade pip setuptools wheel
pip3 install 'maigret[pdf]'
**Fedora / RHEL / CentOS:**
.. code-block:: bash
sudo dnf install -y cairo-devel pkgconfig python3-devel gcc
pip3 install 'maigret[pdf]'
**Arch Linux:**
.. code-block:: bash
sudo pacman -S cairo pkgconf base-devel
pip3 install 'maigret[pdf]'
**Alpine Linux:**
.. code-block:: bash
sudo apk add cairo-dev pkgconf python3-dev build-base
pip3 install 'maigret[pdf]'
**macOS (Homebrew):**
.. code-block:: bash
brew install cairo pkg-config
pip3 install --upgrade pip setuptools wheel
pip3 install 'maigret[pdf]'
**Windows:**
No system packages are needed — ``pycairo`` ships prebuilt wheels for
Windows. Just run:
.. code-block:: bash
pip install "maigret[pdf]"
**Google Cloud Shell / Colab / Replit / generic CI:**
These environments behave like Debian/Ubuntu — install the same
``libcairo2-dev pkg-config python3-dev build-essential`` packages before
``pip install 'maigret[pdf]'``. If you do not control the base image and
cannot ``apt install``, skip the extra and use ``--html`` reports instead;
HTML reports contain the same data and open in any browser.
``maigret: command not found`` after install
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If pip prints warnings like::
WARNING: The scripts maigret and update_sitesmd are installed in
'/home/<user>/.local/bin' which is not on PATH.
…and ``maigret --version`` then fails with ``command not found``, your
``--user`` install put the entry-point script in a directory the shell does
not search. Add it to ``PATH``:
.. code-block:: bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Or install into a virtual environment, where the entry point lands in the
venv's ``bin/`` automatically:
.. code-block:: bash
python3 -m venv ~/.venvs/maigret
source ~/.venvs/maigret/bin/activate
pip install 'maigret[pdf]' # or just `pip install maigret`
Optional: Cloudflare bypass solver
----------------------------------
+2 -11
@@ -273,7 +273,6 @@ class CurlCffiChecker(CheckerBase):
def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
self.browser_emulate = kwargs.get('browser_emulate', 'chrome')
self.proxy = kwargs.get('proxy')
self.url = None
self.headers = None
self.allow_redirects = True
@@ -295,10 +294,7 @@ class CurlCffiChecker(CheckerBase):
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
try:
session_kwargs = {}
if self.proxy:
session_kwargs['proxies'] = {'http': self.proxy, 'https': self.proxy}
async with CurlCffiAsyncSession(**session_kwargs) as session:
async with CurlCffiAsyncSession() as session:
# Strip the User-Agent so curl_cffi can use the impersonated browser's
# matching UA. Mixing a random UA with a Chrome TLS fingerprint trips
# composite bot scoring (e.g. Cloudflare returns a JS challenge for
@@ -832,11 +828,7 @@ def make_site_result(
f"(protection: {list(site.protection)})"
)
elif needs_impersonation and CURL_CFFI_AVAILABLE:
checker = CurlCffiChecker(
logger=logger,
browser_emulate='chrome',
proxy=options.get('proxy'),
)
checker = CurlCffiChecker(logger=logger, browser_emulate='chrome')
elif needs_impersonation and not CURL_CFFI_AVAILABLE:
logger.warning(
f"Site {site.name} requires TLS impersonation (curl_cffi) but it's not installed. "
@@ -1147,7 +1139,6 @@ async def maigret(
options["id_type"] = id_type
options["forced"] = forced
options["cloudflare_bypass"] = cloudflare_bypass
options["proxy"] = proxy
# results from analysis of all sites
all_results: Dict[str, QueryResultWrapper] = {}
+8 -2
@@ -908,8 +908,14 @@ async def main():
if args.pdf:
username = username.replace('/', '_')
filename = report_filepath_tpl.format(username=username, postfix='.pdf')
save_pdf_report(filename, report_context)
query_notify.warning(f'PDF report on all usernames saved in {filename}')
try:
save_pdf_report(filename, report_context)
except RuntimeError as e:
query_notify.warning(str(e))
else:
query_notify.warning(
f'PDF report on all usernames saved in {filename}'
)
if args.md:
username = username.replace('/', '_')
+13 -3
@@ -78,13 +78,23 @@ def save_html_report(filename: str, context: dict):
f.write(filled_template)
PDF_EXTRA_HINT = (
"PDF reports require the optional 'pdf' extra. "
"Install it with: pip install 'maigret[pdf]'"
)
def save_pdf_report(filename: str, context: dict):
# Imported lazily so that users without the optional 'pdf' extra
# can still import maigret.report and use other report formats.
try:
from xhtml2pdf import pisa # type: ignore[import-untyped]
except ImportError as e:
raise RuntimeError(PDF_EXTRA_HINT) from e
template, css = generate_report_template(is_pdf=True)
filled_template = template.render(**context)
# moved here to speed up the launch of Maigret
from xhtml2pdf import pisa # type: ignore[import-untyped]
with open(filename, "w+b") as f:
pisa.pisaDocument(io.StringIO(filled_template), dest=f, default_css=css)
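The two hunks above work as a pair: the report function converts a missing optional dependency into a `RuntimeError` with an install hint, and the CLI turns that error into a warning. A minimal standalone sketch of the same pattern follows. `load_optional` and `report_status` are hypothetical names used only for illustration, and the returned strings stand in for Maigret's real notifier.

```python
import importlib

PDF_HINT = (
    "PDF reports require the optional 'pdf' extra. "
    "Install it with: pip install 'maigret[pdf]'"
)


def load_optional(module_name: str, hint: str):
    """Import a module lazily, raising RuntimeError with a hint if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # `from e` keeps the original ImportError reachable via __cause__.
        raise RuntimeError(hint) from e


def report_status(module_name: str) -> str:
    """Caller side: degrade to a warning instead of crashing."""
    try:
        load_optional(module_name, PDF_HINT)
    except RuntimeError as e:
        return f"warning: {e}"
    else:
        # Runs only when the import succeeded, mirroring the CLI's else branch.
        return "saved"
```

Putting the success message in the `else` branch means it is emitted only when no exception was raised, so a failed save can never be followed by a "saved" notice.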
+10 -3
@@ -467,7 +467,7 @@
],
"urlMain": "http://en.gravatar.com/",
"url": "http://en.gravatar.com/{username}",
"usernameClaimed": "automattic",
"usernameClaimed": "blue",
"usernameUnclaimed": "noonewouldeverusethis7"
},
"Reddit": {
@@ -863,7 +863,14 @@
"tags": [
"links"
],
"checkType": "status_code",
"checkType": "message",
"absenceStrs": [
"The page youre looking for doesnt exist.",
"Want this to be your username?"
],
"presenseStrs": [
"@container/profile-container"
],
"urlMain": "https://linktr.ee",
"url": "https://linktr.ee/{username}",
"usernameUnclaimed": "noonewouldeverusethis7",
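Switching the check type from `status_code` to `message` means the result now depends on strings in the page body rather than on the HTTP status. The sketch below shows one plausible way to evaluate such markers. It is an assumption made for illustration, not Maigret's actual matching logic, and `looks_claimed` is a hypothetical name.

```python
from typing import Sequence


def looks_claimed(html: str, presence: Sequence[str], absence: Sequence[str]) -> bool:
    """Guess whether a profile page exists from marker strings.

    Assumed rule: any absence marker means "not found"; otherwise at least
    one presence marker must appear (if any are configured).
    """
    if any(marker in html for marker in absence):
        return False
    if not presence:
        # No presence markers configured: lack of absence markers is enough.
        return True
    return any(marker in html for marker in presence)
```

Requiring a presence marker as well as the lack of an absence marker reduces false positives on pages such as error screens or bot challenges that contain neither.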
@@ -3130,7 +3137,7 @@
"alexaRank": 1126,
"urlMain": "https://duolingo.com/",
"url": "https://www.duolingo.com/profile/{username}",
"usernameClaimed": "duolingo",
"usernameClaimed": "blue",
"usernameUnclaimed": "noonewouldeverusethis7"
},
"kofi": {
+2 -2
@@ -1,8 +1,8 @@
{
"version": 1,
"updated_at": "2026-05-13T10:39:40Z",
"updated_at": "2026-05-15T10:17:13Z",
"sites_count": 3154,
"min_maigret_version": "0.6.0",
"data_sha256": "f86d77a18bcd1d353933b64d99953634ce5e2966860f25bacd5e3de5659fb8a7",
"data_sha256": "1787a341c90d91a56507ae704c8471743709b56d85d6c3dfa8c56189dccbc6dd",
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
}
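The `data_sha256` field changes whenever the site database changes. Assuming it holds the hex-encoded SHA-256 digest of the raw bytes of the data file (the diff does not state this explicitly), a downloaded copy could be checked roughly as follows. `matches_manifest` is a hypothetical helper, not a Maigret function.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(payload).hexdigest()


def matches_manifest(payload: bytes, manifest: dict) -> bool:
    """Compare a payload's digest with the manifest's `data_sha256` field.

    Assumes the field is a lowercase hex digest of the raw file bytes.
    """
    return sha256_hex(payload) == manifest.get("data_sha256")
```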
Generated
+13 -13
@@ -1302,14 +1302,14 @@ lxml = ["lxml ; platform_python_implementation == \"CPython\""]
[[package]]
name = "idna"
version = "3.15"
version = "3.14"
description = "Internationalized Domain Names in Applications (IDNA)"
optional = false
python-versions = ">=3.8"
groups = ["main"]
files = [
{file = "idna-3.15-py3-none-any.whl", hash = "sha256:048adeaf8c2d788c40fee287673ccaa74c24ffd8dcf09ffa555a2fbb59f10ac8"},
{file = "idna-3.15.tar.gz", hash = "sha256:ca962446ea538f7092a95e057da437618e886f4d349216d2b1e294abfdb65fdc"},
{file = "idna-3.14-py3-none-any.whl", hash = "sha256:e677eaf072e290f7b725f9acf0b3a2bd55f9fd6f7c70abe5f0e34823d0accf69"},
{file = "idna-3.14.tar.gz", hash = "sha256:466d810d7a2cc1022bea9b037c39728d51ae7dad40d480fc9b7d7ecf98ba8ee3"},
]
[package.extras]
@@ -2884,19 +2884,19 @@ Werkzeug = ">=2.0.0"
[[package]]
name = "pytest-rerunfailures"
version = "16.2"
version = "16.1"
description = "pytest plugin to re-run tests to eliminate flaky failures"
optional = false
python-versions = ">=3.10"
groups = ["dev"]
files = [
{file = "pytest_rerunfailures-16.2-py3-none-any.whl", hash = "sha256:c22a53d2827becc76f057d4ded123c0e726523f2f0e5f0bb4efb31fd59e1f14e"},
{file = "pytest_rerunfailures-16.2.tar.gz", hash = "sha256:5f5a32f15674a3d54f7598388fcd3cc1bc5c37284731a4704a44485dcdda5e23"},
{file = "pytest_rerunfailures-16.1-py3-none-any.whl", hash = "sha256:5d11b12c0ca9a1665b5054052fcc1084f8deadd9328962745ef6b04e26382e86"},
{file = "pytest_rerunfailures-16.1.tar.gz", hash = "sha256:c38b266db8a808953ebd71ac25c381cb1981a78ff9340a14bcb9f1b9bff1899e"},
]
[package.dependencies]
packaging = ">=17.1"
pytest = ">=8.1,<8.2.2 || >8.2.2"
pytest = ">=7.4,<8.2.2 || >8.2.2"
[[package]]
name = "python-bidi"
@@ -3239,14 +3239,14 @@ png = ["pypng"]
[[package]]
name = "reportlab"
version = "4.5.1"
version = "4.5.0"
description = "The Reportlab Toolkit"
optional = false
python-versions = "<4,>=3.9"
groups = ["main", "dev"]
files = [
{file = "reportlab-4.5.1-py3-none-any.whl", hash = "sha256:06fce8cb56c83307cfa4909cdf4e6a2ddbb44e5d6ef4d2edca896d7e9769f091"},
{file = "reportlab-4.5.1.tar.gz", hash = "sha256:9fdf68f4de9171ec66acb4a5feed8f8ca2af43479e707a6fbb0daa75d88e5494"},
{file = "reportlab-4.5.0-py3-none-any.whl", hash = "sha256:b8cc8996947d84e805368b47b2376070966f091d029351a0d8a1f238984c2c7f"},
{file = "reportlab-4.5.0.tar.gz", hash = "sha256:e595932789ab7a107ba253e83f7815622708a9fd49920d0d6a909880eb66ac75"},
]
[package.dependencies]
@@ -3262,14 +3262,14 @@ shaping = ["uharfbuzz"]
[[package]]
name = "requests"
version = "2.34.2"
version = "2.33.1"
description = "Python HTTP for Humans."
optional = false
python-versions = ">=3.10"
groups = ["main"]
files = [
{file = "requests-2.34.2-py3-none-any.whl", hash = "sha256:2a0d60c172f83ac6ab31e4554906c0f3b3588d37b5cb939b1c061f4907e278e0"},
{file = "requests-2.34.2.tar.gz", hash = "sha256:f288924cae4e29463698d6d60bc6a4da69c89185ad1e0bcc4104f584e960b9ed"},
{file = "requests-2.33.1-py3-none-any.whl", hash = "sha256:4e6d1ef462f3626a1f0a0a9c42dd93c63bad33f9f1c1937509b8c5c8718ab56a"},
{file = "requests-2.33.1.tar.gz", hash = "sha256:18817f8c57c6263968bc123d237e3b8b08ac046f5456bd1e307ee8f4250d3517"},
]
[package.dependencies]
+9 -1
@@ -69,7 +69,7 @@ torrequest = "^0.1.0"
alive_progress = "^3.2.0"
typing-extensions = "^4.14.1"
webencodings = "^0.5.1"
xhtml2pdf = "^0.2.11"
xhtml2pdf = {version = "^0.2.11", optional = true}
XMind = "^1.2.0"
yarl = "^1.20.1"
networkx = "^2.6.3"
@@ -82,6 +82,13 @@ platformdirs = "^4.3.8"
curl-cffi = ">=0.14,<1.0"
[tool.poetry.extras]
# Install PDF support with: pip install 'maigret[pdf]'
# Skipped by default because the underlying `pycairo` has no Linux/macOS
# wheels on PyPI and requires system libcairo + pkg-config to build.
pdf = ["xhtml2pdf"]
[tool.poetry.group.dev.dependencies]
# How to add a new dev dependency: poetry add black --group dev
# Install dev dependencies with: poetry install --with dev
@@ -92,6 +99,7 @@ pytest-cov = ">=6,<8"
pytest-httpserver = "^1.0.0"
pytest-rerunfailures = ">=15.1,<17.0"
reportlab = "^4.4.3"
xhtml2pdf = "^0.2.11"
mypy = ">=1.14.1,<3.0.0"
tuna = "^0.5.11"
coverage = "^7.9.2"
+3 -3
@@ -3158,16 +3158,16 @@ Rank data fetched from Majestic Million by domains.
1. ![](https://www.google.com/s2/favicons?domain=https://app.airnfts.com) [AirNFTs (https://app.airnfts.com)](https://app.airnfts.com)*: top 100M, crypto, nft*
1. ![](https://www.google.com/s2/favicons?domain=https://greasyfork.org) [GreasyFork (https://greasyfork.org)](https://greasyfork.org)*: top 100M, coding*
The list was updated at (2026-05-13)
The list was updated at (2026-05-15)
## Statistics
Enabled/total sites: 2524/3154 = 80.03%
Incomplete message checks: 311/2524 = 12.32% (false positive risks)
Status code checks: 637/2524 = 25.24% (false positive risks)
Status code checks: 636/2524 = 25.2% (false positive risks)
False positive risk (total): 37.56%
False positive risk (total): 37.52%
Sites with probing: 500px, Armchairgm, BinarySearch (disabled), BleachFandom, Bluesky, BongaCams, Boosty, BuyMeACoffee, Calendly, Cent, Chess, Code Sandbox (disabled), Code Snippet Wiki, DailyMotion, Discord, Diskusjon.no, Disqus, Docker Hub, Duolingo, FandomCommunityCentral, GitHub, GitLab, Google Plus (archived), Gravatar, HackTheBox, Hackerrank, Hashnode, Holopin, Imgur, Issuu, Keybase, Kick, Kvinneguiden, LeetCode, Lesswrong, Livejasmin, LocalCryptos (disabled), Medium, MicrosoftLearn, MixCloud, Monkeytype, NPM, Niftygateway, Omg.lol, OnlyFans, Paragraph, Picsart, Plurk, Polarsteps, Rarible, Reddit, Reddit Search (Pushshift) (disabled), Revolut.me, RoyalCams, Scratch, Soop, SportsTracker, Spotify, StackOverflow, Substack, TAP'D, Topcoder, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Vivino, Warframe Market, Warpcast, Weibo, Wikipedia, Yapisal (disabled), YouNow, en.brickimedia.org, forums.grandstream.com, nightbot, notabug.org, qiwi.me (disabled)
+1 -57
@@ -329,14 +329,10 @@ class _FakeCurlResponse:
class _FakeCurlSession:
"""Captures constructor + .get/.post/.head call kwargs for assertions."""
"""Captures the kwargs of the last .get/.post/.head call for assertions."""
last_method = None
last_kwargs = None
last_init_kwargs = None
def __init__(self, **kwargs):
type(self).last_init_kwargs = kwargs
async def __aenter__(self):
return self
@@ -366,7 +362,6 @@ def fake_curl_cffi(monkeypatch):
from maigret import checking
_FakeCurlSession.last_method = None
_FakeCurlSession.last_kwargs = None
_FakeCurlSession.last_init_kwargs = None
monkeypatch.setattr(checking, 'CurlCffiAsyncSession', _FakeCurlSession)
return _FakeCurlSession
@@ -480,54 +475,3 @@ async def test_curl_cffi_strips_ua_for_post_too(fake_curl_cffi):
assert sent['json'] == {"username": "test"}
assert "User-Agent" not in sent['headers']
assert sent['headers'].get("Content-Type") == "application/json"
@pytest.mark.asyncio
async def test_curl_cffi_forwards_proxy_to_async_session(fake_curl_cffi):
"""Regression for #2648: when --proxy is set, the proxy URL must be
forwarded to curl_cffi's AsyncSession via the `proxies` kwarg on the
session constructor. Otherwise sites with `tls_fingerprint` protection
(Instagram, Reddit, SoundCloud, Threads, ...) silently bypass the
configured proxy and connect direct.
"""
from maigret.checking import CurlCffiChecker
proxy = "http://user:pass@proxy.example.com:8080"
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome', proxy=proxy)
checker.prepare(
url='https://example.com/u/test',
headers=None,
allow_redirects=True,
timeout=10,
method='get',
)
await checker.check()
init = fake_curl_cffi.last_init_kwargs
assert init is not None, "CurlCffiAsyncSession was never constructed"
# curl_cffi expects the standard requests-style {scheme: url} mapping
assert init.get('proxies') == {'http': proxy, 'https': proxy}
@pytest.mark.asyncio
async def test_curl_cffi_no_proxy_omits_proxies_kwarg(fake_curl_cffi):
"""Counterpart to the proxy-forwarding test: when no proxy is configured,
the `proxies` kwarg must NOT appear on the AsyncSession constructor.
Passing `proxies=None` or an empty mapping would let curl_cffi inherit
the process-wide HTTPS_PROXY env var unintentionally.
"""
from maigret.checking import CurlCffiChecker
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome')
checker.prepare(
url='https://example.com/u/test',
headers=None,
allow_redirects=True,
timeout=10,
method='get',
)
await checker.check()
init = fake_curl_cffi.last_init_kwargs
assert init is not None, "CurlCffiAsyncSession was never constructed"
assert 'proxies' not in init
+65
@@ -3,6 +3,9 @@
import copy
import json
import os
import subprocess
import sys
import textwrap
import pytest
from io import StringIO
@@ -442,6 +445,68 @@ def test_pdf_report():
assert os.path.exists(report_name)
def test_save_pdf_report_raises_helpful_error_without_xhtml2pdf(
monkeypatch, tmp_path
):
# Setting an entry to None makes a subsequent `import` raise ImportError —
# this simulates the optional 'pdf' extra not being installed without
# actually uninstalling xhtml2pdf from the test environment.
monkeypatch.setitem(sys.modules, 'xhtml2pdf', None)
monkeypatch.setitem(sys.modules, 'xhtml2pdf.pisa', None)
context = generate_report_context(TEST)
target = tmp_path / "report.pdf"
with pytest.raises(RuntimeError) as excinfo:
save_pdf_report(str(target), context)
msg = str(excinfo.value)
assert "maigret[pdf]" in msg
assert "pip install" in msg
assert not target.exists()
def test_xhtml2pdf_is_not_module_level_dependency():
# Guard against a regression where someone hoists `import xhtml2pdf` /
# `from xhtml2pdf import pisa` to the top of maigret/report.py — that
# would force every Maigret user to install the optional extra.
import maigret.report as report_module
module_globals = vars(report_module)
assert 'xhtml2pdf' not in module_globals
assert 'pisa' not in module_globals
def test_import_maigret_without_xhtml2pdf():
# End-to-end check: spawn a fresh interpreter where xhtml2pdf is blocked
# before any maigret module is loaded, and confirm the package, the
# report module, and save_pdf_report itself all import cleanly. Mirrors
# what a user without the [pdf] extra installed would experience.
code = textwrap.dedent(
"""
import sys
sys.modules['xhtml2pdf'] = None
sys.modules['xhtml2pdf.pisa'] = None
import maigret
import maigret.report
from maigret.report import save_pdf_report
assert callable(save_pdf_report)
print("OK")
"""
)
result = subprocess.run(
[sys.executable, "-c", code],
capture_output=True,
text=True,
)
assert result.returncode == 0, (
f"stdout={result.stdout!r} stderr={result.stderr!r}"
)
assert "OK" in result.stdout
def test_text_report():
context = generate_report_context(TEST)
report_text = get_plaintext_report(context)
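The new tests rely on a documented behaviour of Python's import system: an entry of `None` in `sys.modules` makes any later import of that name raise `ModuleNotFoundError`, a subclass of `ImportError`. The standalone sketch below demonstrates the trick outside pytest. `import_is_blocked` is a hypothetical helper, and it restores the previous `sys.modules` state by hand where the tests use `monkeypatch`.

```python
import importlib
import sys

_MISSING = object()  # sentinel: the name had no sys.modules entry at all


def import_is_blocked(name: str) -> bool:
    """Set sys.modules[name] to None and report whether importing then fails."""
    saved = sys.modules.get(name, _MISSING)
    sys.modules[name] = None  # a None entry halts the import machinery
    try:
        importlib.import_module(name)
    except ImportError:
        return True
    else:
        return False
    finally:
        # Put things back exactly as they were before the experiment.
        if saved is _MISSING:
            del sys.modules[name]
        else:
            sys.modules[name] = saved
```

Because the block happens at the `sys.modules` level, it works whether or not the package is installed, which is why the tests can simulate a missing extra without uninstalling `xhtml2pdf`.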