Overhaul site tags and naming: add social tag to 33 networks, fill mi… (#2430)

* Overhaul site tags and naming: add social tag to 33 networks, fill missing tags for 213 top-1000 sites, clean up false us/in country tags (~374 sites), normalize site names to Title Case, add tag validation tests, document tagging and naming rules
Remove LLM folder: ask @soxoj for the up-to-date version!

* Remove LLM/ from version control

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Soxoj
2026-03-28 19:48:16 +01:00
committed by Soxoj
parent c2f912b164
commit f41f9439fc
9 changed files with 310 additions and 1891 deletions
+1 -1
View File
@@ -42,4 +42,4 @@ settings.json
# other
*.egg-info
build
build
-558
View File
@@ -1,558 +0,0 @@
# Site checks — guide (Maigret)
Working document for future changes: workflow, findings from reviews, and practical steps. See also [`site-checks-playbook.md`](site-checks-playbook.md) (short checklist), [`socid_extractor_improvements.log`](socid_extractor_improvements.log) (proposals for upstream identity extraction), and the code in [`maigret/checking.py`](../maigret/checking.py).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **this file** and [`site-checks-playbook.md`](site-checks-playbook.md) in sync (see the section at the end). If you change rules about the JSON API check or the `socid_extractor` log format, update **[`socid_extractor_improvements.log`](socid_extractor_improvements.log)** (template / header) together with this guide.
---
## 1. How checks work
Logic lives in `process_site_result` ([`maigret/checking.py`](../maigret/checking.py)):
| `checkType` | Meaning |
|-------------|---------|
| `message` | Profile is “found” if the HTML contains **none** of the `absenceStrs` substrings **and** at least one `presenseStrs` marker matches. If `presenseStrs` is **empty**, presence is treated as true for **any** page (risky configuration). |
| `status_code` | HTTP **2xx** is enough — only safe if the server does **not** return 200 for “user not found”. |
| `response_url` | Custom flow with **redirects disabled** so the status/URL of the *first* response can be used. |
For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets **`allow_redirects=True`**: the client follows redirects and `process_site_result` sees the **final** response body and status (not the pre-redirect hop). You do **not** need to “turn on” follow-redirect separately for most sites.
Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).
### `urlProbe`: probe URL vs reported profile URL
- **`url`** — pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
- **`urlProbe`** (optional) — if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHubs `https://github.com/{username}` + `https://api.github.com/users/{username}`).
If `urlProbe` is omitted, the probe URL defaults to `url`.
### Redirects and final URL as a signal
If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
- **Final URL** after redirects (e.g. profile canonical URL vs `/404` path).
- **Redirect chain** length or target host (e.g. lander vs profile).
If that differs reliably, you may be able to use **`checkType`: `response_url`** in [`data.json`](../maigret/resources/data.json) (no auto-follow) or extend logic — but only when the difference is stable.
**Server-side HTTP vs client-side navigation.** Maigret follows **HTTP** redirects only; it does **not** run JavaScript. If the browser shows a navigation to `/u/name/posts` or `/not-found` **after** the SPA bundle loads, that may never appear as an extra hop in `curl`/aiohttp — only a **trailing-slash** `301` might show up. Always confirm with `curl -sIL` / a small script whether the **Location** chain differs for real vs fake users before relying on URL-based rules.
**Empirical check (claimed vs non-existent usernames, `GET` with follow redirects, no JS):**
| Site | Result |
|------|--------|
| **Kaskus** | No HTTP redirects beyond the request path; same generic `<title>` and near-identical body length — **no** discriminating signal from redirects alone. |
| **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow — not a profile-vs-missing distinction. |
| **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |
**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).
For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).
---
## 2. Standard checks: public JSON API and `socid_extractor` log
### 2.1 Public JSON API (always)
When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).
This is a **standard** part of site-check work, not an optional extra.
### 2.2 Mandatory: [`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)
If you discover **either**:
1. **JSON embedded in HTML** with user/profile fields (inline scripts, `__NEXT_DATA__`, `application/ld+json`, hydration blobs, etc.), or
2. A **standalone JSON HTTP response** (public API) with user/profile data for that service,
you **must append** a proposal block to **[`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)**.
**Why:** Maigret calls [`socid_extractor.extract`](https://pypi.org/project/socid-extractor/) on the response body ([`extract_ids_data` in `checking.py`](../maigret/checking.py)) to fill `ids_data`. New payloads usually need a **new scheme** upstream (`flags`, `regex`, optional `extract_json`, `fields`, optional `url_mutations` / `transforms`), matching patterns such as **`GitHub API`** or **`Gitlab API`** in `socid_extractor`s `schemes.py`.
**Each log entry must include:**
- **Date** — ISO `YYYY-MM-DD` (day you add the entry).
- **Example username** — Prefer the sites `usernameClaimed` from `data.json`, or any account that reproduces the payload.
- **Proposal** — Use the **block template** in the log file: detection idea, optional URL mutation, and field mappings in the same style as existing schemes.
If the service is **already covered** by an existing `socid_extractor` scheme, add a **short** entry anyway (date, example username, scheme name, “already implemented”) so there is an audit trail.
Do **not** paste secrets, cookies, or full private JSON; short key names and structure hints are enough.
---
## 3. Improvement workflow
### Phase A — Reproduce
1. Targeted run:
```bash
maigret --db /path/to/maigret/resources/data.json \
TEST_USERNAME \
--site "SiteName" \
--print-not-found --print-errors \
--no-progressbar -vv
```
2. Run separately with a **real** existing username and a **definitely non-existent** one (as `usernameClaimed` / `usernameUnclaimed` in JSON).
3. If needed: `-vvv` and `debug.log` (raw response).
4. Automated pair check:
```bash
maigret --db ... --self-check --site "SiteName" --no-progressbar
```
### Phase B — Classify the cause
| Symptom | Likely cause |
|---------|----------------|
| False “found” with `status_code` | Soft 404 (200 on a “not found” page). |
| False “found” with `message` | Overly broad `presenseStrs` (`name`, `email`, JSON keys) or stale `absenceStrs`. |
| Same HTML for different users | SPA / skeleton shell before hydration — also compare **final URL / redirect chain** (see above); if still identical, often `disabled`. |
| Login page instead of profile | XenForo etc.: guest, `ignore403`, “must be logged in” strings. |
| reCAPTCHA / “Checking your browser” / “not a bot” | Bot protection; Maigrets default User-Agent may worsen the response. |
| Redirect to another domain / lander | Stale URL template. |
### Phase C — Edits in [`data.json`](../maigret/resources/data.json)
**CRITICAL — surgical edits only.** `data.json` is a ~36 000-line file. **Never** rewrite it via `json.load()` + `json.dump()` — this reformats every line and produces a 70 000-line diff that is impossible to review. Instead, make **targeted text-level edits** (find the site's block, change only the specific lines). Use the `Edit` tool (or equivalent line-precise method), not a full JSON round-trip. The same rule applies to scripts: if a helper writes `data.json`, it must preserve the original formatting of untouched entries.
1. Update `url` / `urlMain` if needed (HTTPS, new profile path).
2. Replace inappropriate `status_code` with `message` (or `response_url`), choosing:
- **`absenceStrs`** — only what reliably appears on the “user does not exist” page;
- **`presenseStrs`** — narrow markers of a real profile (avoid generic words).
3. For XenForo: override only fields that differ in the site entry; do not break the global `engines` template.
4. Refresh `usernameClaimed` / `usernameUnclaimed` if reference accounts disappeared.
5. Set **`headers`** (e.g. another `User-Agent`) if the site serves a captcha only to “suspicious” clients.
6. Use **`errors`**: HTML substring → meaningful check error (UNKNOWN), so it is not confused with “available”.
### Phase D — Decision criteria
| Outcome | When to use |
|---------|-------------|
| **Check fixed** | The `claimed` / `unclaimed` pair behaves predictably, `--self-check` passes, no regression on a similar site with the same engine. |
| **Check disabled** (`disabled: true`) | Cloudflare / anti-bot / login required / indistinguishable SPA without stable markers. |
| **Entry removed** | **Only** if the domain/service is gone (NXDOMAIN, clearly dead project), not “because it is hard to fix”. |
### Phase E — Before commit
- `maigret --self-check` for affected sites.
- `make test`.
---
## 4. Findings from reviews (concrete site batch)
Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtube, Toms Guide, Kaggle, Kaskus, Livemaster, TechPowerUp, authorSTREAM, Bibsonomy, Bulbagarden, iXBT, Serebii, Picsart, Hashnode, hi5.
### What most often broke checks
1. **`status_code` where content checks are needed** — soft 404 with status 200.
2. **Broad `presenseStrs`** — matches on error pages or generic SPA shells.
3. **XenForo + guest** — HTML includes strings like “You must be logged in” that overlap the engine template.
4. **User-Agent** — on some sites (e.g. Kaggle) the default UA triggered a reCAPTCHA page instead of profile HTML; a deliberate `User-Agent` in site `headers` helped.
5. **SPAs and redirects** — identical first HTML, redirect to lander / another product (hi5 → Tagged), URL format changes by region (Mercado Livre).
### What worked as a fix
- Switching to **`message`** with narrow strings from **`<title>`** or unique markup where stable (**Kaggle**, **Mercado Livre**, **Hashnode**).
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis Anti-Bot**: Setting `headers` with a basic script User-Agent (e.g. `python-requests/2.25.1`) rather than the default browser UA completely bypassed the Anubis Proof-of-Work challenge HTTP 307 redirect, instantly recovering the native HTTP 404 framework.
### What required disabling checks
Where you **cannot** reliably tell “profile exists” from “no profile” without bypassing protection, login, or full JS:
- Anti-bot / captcha / “not a bot” page;
- Guest-only access to the needed page;
- SPA with indistinguishable first response;
- Forums returning **403** and a login page instead of a member profile for the member-search URL;
- Stale URLs that redirect to a stub.
In those cases **`disabled: true`** is better than false “found”; remove the DB entry only on **actual** domain death.
### Code notes
- For the `status_code` branch in `process_site_result`, use **strict** comparison `check_type == "status_code"`, not a substring match inside `"status_code"`.
- Treat empty `presenseStrs` with `message` as risky: when debugging, watch DEBUG-level logs if that diagnostics exists in code.
---
## 5. Future ideas (Maigret improvements)
- A mode or script: one site, two usernames, print statuses and first N bytes of the response (wrapper around `maigret()`).
- Document in CLI help that **`--use-disabled-sites`** is needed to analyze disabled entries.
---
## 6. Development utilities
### 6.1 `utils/site_check.py` — Single site diagnostics
A comprehensive utility for testing individual sites with multiple modes:
```bash
# Basic comparison of claimed vs unclaimed (aiohttp)
python utils/site_check.py --site "VK" --check-claimed
# Test via Maigret's checker directly
python utils/site_check.py --site "VK" --maigret
# Compare aiohttp vs Maigret results (find discrepancies)
python utils/site_check.py --site "VK" --compare-methods
# Full diagnosis with recommendations
python utils/site_check.py --site "VK" --diagnose
# Test with custom URL
python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2
# Find a valid username for a site
python utils/site_check.py --site "VK" --find-user
```
**Key features:**
- `--maigret` — Uses Maigret's actual checking code, not raw aiohttp
- `--compare-methods` — Shows if aiohttp and Maigret see different results (useful for debugging)
- `--diagnose` — Validates checkType against actual responses, suggests fixes
- Color output with markers detection (captcha, cloudflare, login, etc.)
- `--json` flag for machine-readable output
**When to use each mode:**
| Mode | Use case |
|------|----------|
| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
| `--maigret` | Verify Maigret's actual behavior matches expectations |
| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
| `--diagnose` | Full analysis when a site is broken, get fix recommendations |
### 6.2 `utils/check_top_n.py` — Mass site checking
Batch-check top N sites by Alexa rank with categorized reporting:
```bash
# Check top 100 sites
python utils/check_top_n.py --top 100
# Faster with more parallelism
python utils/check_top_n.py --top 100 --parallel 10
# Output JSON report
python utils/check_top_n.py --top 100 --output report.json
# Only show broken sites
python utils/check_top_n.py --top 100 --only-broken
```
**Output categories:**
- `working` — Site check passes
- `broken` — Check fails (wrong status, missing markers)
- `timeout` — Request timed out
- `anti_bot` — 403/429 or captcha detected
- `error` — Connection or other errors
- `disabled` — Already disabled in data.json
**Report includes:**
- Summary counts by category
- List of broken sites with issues
- Recommendations for fixes (e.g., "Switch to checkType: status_code")
### 6.3 Self-check behavior (`--self-check`)
The self-check command has been improved to be less aggressive:
```bash
# Check sites WITHOUT auto-disabling (default)
maigret --self-check --site "VK"
# Auto-disable failing sites (old behavior)
maigret --self-check --site "VK" --auto-disable
# Show detailed diagnosis for each failure
maigret --self-check --site "VK" --diagnose
```
**Behavior changes:**
| Flag | Effect |
|------|--------|
| `--self-check` alone | Reports issues but does NOT disable sites |
| `--auto-disable` | Automatically disables sites that fail (opt-in) |
| `--diagnose` | Prints detailed diagnosis with recommendations |
**Why this matters:**
- Old behavior was too aggressive — sites got disabled without explanation
- New behavior reports issues and suggests fixes
- Explicit `--auto-disable` required to modify database
---
## 7. Lessons learned (practical observations)
Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).
### 7.1 JSON API is the first thing to look for
Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:
- **Reddit**: `https://api.reddit.com/user/{username}/about` — returns JSON with user data or `{"message": "Not Found", "error": 404}`.
- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` — returns JSON with `userName` field or HTTP 404.
This confirms the playbook recommendation: always check for `/api/`, `.json`, GraphQL endpoints before giving up on a site.
### 7.2 `urlProbe` is a powerful tool
It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:
```json
{
"url": "https://www.reddit.com/user/{username}",
"urlProbe": "https://api.reddit.com/user/{username}/about",
"checkType": "message",
"presenseStrs": ["\"name\":"],
"absenceStrs": ["Not Found"]
}
```
The check hits the API, but reports display `www.reddit.com/user/blue`.
### 7.3 aiohttp ≠ curl ≠ requests
Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting** — the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.
**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.
```python
# This returns 403 for Wikipedia even with browser UA:
async with aiohttp.ClientSession() as session:
async with session.get(url, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
print(resp.status) # 403
```
### 7.4 HTTP 403 in Maigret can mean different things
Initially it seemed Wikipedia was returning 403, but `curl` showed 200. Only `debug.log` revealed the real picture — aiohttp was getting blocked at TLS level.
**Lesson:** Use `-vvv` flag and inspect `debug.log` for raw response status and body. The warning message alone may be misleading.
### 7.5 Dead services migrate, not disappear
MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:
1. Keep old entries with `disabled: true` as historical record.
2. Create a new entry for the current service with working API.
This preserves audit trail and avoids breaking existing workflows.
### 7.6 `status_code` is more reliable than `message` for APIs
Microsoft Learn API returns HTTP 404 for non-existent users — a clean signal without HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:
```json
{
"checkType": "status_code",
"urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
}
```
No need for fragile string matching when the API speaks HTTP correctly.
### 7.8 Engine templates can silently break across many sites
The **vBulletin** engine template has `absenceStrs` in six languages ("This user has not registered…", two Russian variants, Turkish, Ukrainian, Dutch). In a comprehensive audit of all 57 enabled vBulletin sites (2026-03-27), **26 were broken** (46%). The root causes were **not** template marker mismatch in most cases:
| Category | Count | Examples |
|----------|-------|---------|
| Cloudflare challenge (403 `cf-mitigated`) | 7 | Mpgh, TheStudentRoom, SevenForums, alliedmods |
| Dead/unreachable | 5 | Tanks, holodforum.ru, Microchip |
| Server-side 403 (non-CF) | 5 | scaleforum.ru, forum-history.ru, Gorod.dp.ua |
| Redirect/domain moved | 5 | Warface, Revelation, Stratege |
| Login required to view profiles | 4 | goha, Animeforum, WiredNewYork |
Only the "login required" category relates to the template markers: when a forum requires authentication, the member.php page shows a generic response without the "user not registered" text. All 26 sites were disabled.
**Note on Russian translations:** Two distinct Russian vBulletin translations exist in the wild:
- `"Этот пользователь ещё не зарегистрирован, поэтому его профиль недоступен."` (standard)
- `"Пользователь не зарегистрирован и не имеет профиля для просмотра."` (goha.ru variant)
Both are now in the engine template.
**Lesson:** When a whole engine class shows high failure rates, categorize failures first — most are site-level infrastructure issues (CF, dead, auth), not template problems. Batch-disable broken sites rather than patching individually. Only investigate the template itself if the HTTP response is 200 but markers don't match.
### 7.9 Search-by-author URLs are architecturally unreliable
Several sites (OnanistovNet, Shoppingzone, Pogovorim, Astrogalaxy, Sexwin) used a phpBB-style `search.php?keywords=&terms=all&author={username}` URL as the check endpoint. This searches for **posts** by that author, not for the user account itself. Even if the markers worked, a user who exists but has zero posts would be indistinguishable from a non-existent user. And in practice, the sites changed their response format — some now return HTTP 404, others dropped the expected Russian absence text altogether.
**Lesson:** Avoid author-search URLs as the check endpoint; they test "has posts" rather than "account exists" and are doubly fragile (both logic mismatch and format drift).
### 7.10 Some sites generate a page for any path — permanent false positives
Two distinct patterns:
- **Pbase** creates a stub page titled "pbase Artist {username}" for **every** URL, real or fake. Both return HTTP 200 with nearly identical content (~3.3 KB). No markers can distinguish them.
- **ffm.bio** is even trickier: for the non-existent username `a.slomkoowski` it generated a page titled "mr.a" with description "a is a", apparently fuzzy-matching the path to the closest real entry. Both return HTTP 200 with large, content-rich pages.
**Lesson:** Before writing markers for a site, verify that the "unclaimed" URL actually produces an **error-like** response (different status, different title, unique error text). If the site always returns a plausible-looking page, no combination of `presenseStrs` / `absenceStrs` will help — `disabled: true` is the only safe option.
### 7.11 TLS fingerprinting can degrade over time (Kaggle)
Kaggle was previously fixed with a custom `User-Agent` header and `errors` for the "Checking your browser" captcha page. In the latest batch review, aiohttp receives HTTP 404 with identical content for **both** claimed and unclaimed usernames — the site now blocks the entire request before it reaches the profile page. This matches the TLS fingerprinting pattern seen earlier with Wikipedia (section 7.3), but here the degradation happened **after** a working fix was already in place.
**Lesson:** Sites that rely on bot-detection can tighten their rules at any time. A working `User-Agent` override today may fail tomorrow. When a previously fixed site starts returning identical responses for both usernames, suspect TLS fingerprinting first, and accept `disabled: true` if no public API is available.
### 7.12 API endpoints may bypass Cloudflare even when the main site is blocked
All four Fandom wikis returned HTTP 403 with a Cloudflare "Just a moment..." challenge when aiohttp accessed the user profile page (`/wiki/User:{username}`). However, the **MediaWiki API** on the same domain (`/api.php?action=query&list=users&ususers={username}&format=json`) returned clean JSON without any challenge. Similarly, **Substack** served a captcha-laden SPA for `/@{username}`, but its `public_profile` API (`/api/v1/user/{username}/public_profile`) responded with proper JSON and correct HTTP 404 for missing users.
This is likely because API routes are excluded from the Cloudflare WAF rules or use a different pipeline than the HTML-serving paths.
**Lesson:** When a site's main pages are blocked by Cloudflare or similar WAF, still check API endpoints on the **same domain** — they may not go through the same protection layer. This is especially true for:
- MediaWiki's `api.php` on wiki farms (Fandom, Wikia, self-hosted MediaWiki)
- REST API paths (`/api/v1/`, `/api/v2/`) on SPA-heavy sites
- Internal data endpoints that the SPA itself calls
### 7.13 GraphQL APIs often support GET, not just POST
**hashnode** exposes a GraphQL endpoint at `https://gql.hashnode.com`. While GraphQL is typically associated with POST requests, many implementations also support **GET** with the query passed as a URL parameter. This is critical for Maigret, which only supports GET/HEAD for `urlProbe`.
```
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":{"name":"Melwin D'Almeida","username":"melwinalm"}}}
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22a.slomkoowski%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":null}}
```
**Lesson:** Before giving up on a GraphQL-only site, try the same query via GET with `?query=...` (URL-encoded). Many GraphQL servers accept both methods.
### 7.14 URL-encoding resolves template placeholder conflicts
The hashnode GraphQL query `{user(username: "{username}") { name }}` contains curly braces that conflict with Maigret's `{username}` placeholder — Python's `str.format()` would raise a `KeyError` on `{user(username...}`.
The fix: URL-encode the GraphQL braces (`{` → `%7B`, `}` → `%7D`) but leave `{username}` as-is. Python's `.format()` only interprets literal `{…}` as placeholders, not `%7B…%7D`, and the GraphQL server decodes the percent-encoding on its end:
```
urlProbe: https://gql.hashnode.com?query=%7Buser(username%3A%20%22{username}%22)%20%7B%20name%20username%20%7D%7D
```
After `.format(username="melwinalm")`:
```
https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
```
**Lesson:** When a `urlProbe` needs literal curly braces (GraphQL, JSON in URL, etc.), percent-encode them. This is a general technique for any `data.json` URL field processed by `.format()`.
### 7.15 Rate-limit responses belong in `errors`, not `absenceStrs`
When a site's API returns a rate-limit response, the text may **not** match the `absenceStrs` entry — either because the wording varies between API versions (`"The resource is being rate limited"` vs `"You are being rate limited."`) or because the JSON structure differs entirely. If the rate-limit string is in `absenceStrs` and the actual response uses a different phrasing, **no** absence string matches. With empty `presenseStrs` (presence always true), the result is a false **CLAIMED**.
**Fix:** Move rate-limit strings out of `absenceStrs` and into `errors` (mapping to `"Rate limited"` or similar). The `errors` mechanism produces an **UNKNOWN** result instead of CLAIMED or NOT FOUND, which is the correct semantic: rate limiting means "we don't know", not "user exists" or "user doesn't exist".
```json
{
"absenceStrs": ["{\"taken\":false}"],
"errors": {
"The resource is being rate limited": "Rate limited",
"You are being rate limited": "Rate limited"
}
}
```
**General rule:** Any response that means "I can't answer right now" (rate limit, maintenance page, CAPTCHA, temporary ban) should go into `errors`, never into `absenceStrs` or `presenseStrs`. Only strings that reliably indicate "user does / does not exist" belong in the presence/absence lists.
**Discord example (2026-03-24):** The POST API at `discord.com/api/v9/unique-username/username-attempt-unauthed` returns `{"taken":true}` / `{"taken":false}` normally, but under load returns varying rate-limit messages. Keeping only `{"taken":false}` in `absenceStrs` and all rate-limit variants in `errors` eliminates the transient false positives the Maigret bot was reporting.
### 7.16 Non-UTF-8 page encoding silently breaks string markers
**opennet.ru** serves pages in **KOI8-R** encoding. The `absenceStrs` value `"Имя участника не найдено"` is stored as UTF-8 bytes in `data.json`, but the HTTP response body contains the same text encoded as KOI8-R bytes. Since Maigret (and aiohttp) compares raw bytes by default, the substring is **never found** — the absence check silently fails, and empty `presenseStrs` (presence always true) produces a false CLAIMED.
**How to detect:** If `absenceStrs` contains non-ASCII text and the check fails despite the string visibly appearing on the page in a browser, inspect the `Content-Type` header or raw bytes for a non-UTF-8 `charset` (KOI8-R, Windows-1251, ISO-8859-*, etc.). Also check with `curl -s URL | iconv -f KOI8-R -t UTF-8` to confirm.
**Lesson:** Maigret has no built-in charset transcoding for marker comparison. If a site serves a non-UTF-8 charset and the relevant markers contain non-ASCII characters, string matching will fail. Options:
- Find ASCII-only markers that work in any encoding (HTML tags, class names, English text).
- Use a JSON API endpoint (APIs almost always return UTF-8).
- If neither is available, `disabled: true`.
### 7.17 ARIA and HTML boilerplate attributes are dangerous `presenseStrs`
SlideShare had `"polite"` in `presenseStrs`, matching the standard `aria-live="polite"` attribute. This attribute appears on virtually any modern web page — including anti-bot challenge pages, error pages, and homepage redirects. When the real profile page is replaced by such a generic page, `absenceStrs` don't match (different content) but `presenseStrs` still fires → false CLAIMED.
**Common traps:** `polite`, `alert`, `status`, `navigation`, `assertive`, `banner`, `main`, `complementary`, `contentinfo` — all standard ARIA landmark/live-region values present on most pages.
**Lesson:** Never use single generic words that are part of HTML/ARIA boilerplate as `presenseStrs`. Profile markers should be **specific to the profile page structure**: unique CSS classes (e.g. `"profile-card"`), `<title>` fragments with the site name (e.g. `"- salon24.pl</title>"`), or JSON field names from API responses (e.g. `"displayName"`).
### 7.18 Anti-bot challenge pages can pass through `message` checks as false CLAIMED
When a site intermittently serves an anti-bot challenge page (e.g. SlideShare's "Client Challenge", Cloudflare "Just a moment..."), a specific failure mode occurs with `checkType: "message"`:
1. The challenge HTML replaces the real profile/error page.
2. `absenceStrs` don't match (challenge page has different content than "user not found").
3. If `presenseStrs` is empty (presence always true) **or** contains a broad marker that matches the challenge HTML → result is **CLAIMED**.
This is different from a simple "anti-bot → disable" situation because the challenge may be **intermittent** — the check works most of the time but produces sporadic false positives under load or for specific IPs.
**Fix:** Add the challenge page's distinctive text to `errors`:
```json
{
"errors": {
"Client Challenge": "Anti-bot challenge",
"Just a moment": "Cloudflare challenge",
"Checking your browser": "Anti-bot challenge"
}
}
```
The `errors` mechanism produces **UNKNOWN** instead of CLAIMED, which is correct: "we got a challenge page, not a profile page, so we don't know."
**Lesson:** When fixing a site that is **intermittently** reported as false positive, check whether the failure happens only when anti-bot protection triggers. If so, adding challenge markers to `errors` is better than disabling the entire check.
### 7.19 Redirect-to-homepage as a "user not found" signal
Some sites (e.g. **Salon24.pl**) redirect non-existent user URLs to the **homepage** via HTTP 301/302, while existing users get a 200 with profile content. Since Maigret follows redirects by default (`allow_redirects=True` for `message`/`status_code` checks), it sees the **final** page — the homepage.
This creates a usable signal for `checkType: "message"`:
- **`presenseStrs`** with a fragment unique to profile pages (e.g. `"- salon24.pl</title>"` which appears in `"test 1 - salon24.pl</title>"` on profiles but not on the generic homepage title).
- No `absenceStrs` needed — the homepage simply doesn't contain the profile-specific marker.
**Lesson:** When a site returns the same HTTP 200 for both users (after redirect-follow), compare the **final page content** for both. If unclaimed lands on the homepage, use a profile-specific `presenseStrs` marker rather than trying to find an absence string on the homepage.
### 7.20 Non-standard HTTP status codes from anti-bot systems
Anti-bot systems don't always use standard 403/429 codes. Observed examples:
- **HTTP 468** (forum.exkavator.ru) — custom Tengine anti-bot status.
- **HTTP 520530** — Cloudflare-specific error codes (520 = unknown error, 521 = web server down, 522 = connection timed out, 523 = origin unreachable, 524 = timeout, 525 = SSL handshake failed, 526 = invalid SSL, 530 = with 1xxx error).
**Lesson:** When diagnosing a site that returns connection errors or unexpected statuses in Maigret, check with `curl -sIL` first. If the status code is non-standard (not 2xx/3xx/4xx/5xx from the origin), it's likely an intermediary (CDN, WAF, anti-bot) and the site should be `disabled: true`.
### 7.21 `site_check.py --diagnose` does not test POST APIs
The `utils/site_check.py --diagnose` tool performs raw aiohttp GET requests to compare claimed/unclaimed responses. For sites that use `requestMethod: "POST"` (e.g. Discord, Holopin), the diagnose tool will show the site as broken because GET to a POST endpoint returns different content (often the site's homepage or an error page).
**Workaround:** For POST-based checks, verify manually with `curl -X POST` or use `maigret --self-check --site "SiteName"` which respects the full configuration including request method and payload.
### 7.7 The playbook classification works
The decision tree from the documentation accurately describes real-world cases:
| Situation | Playbook says | Actual result |
|-----------|---------------|---------------|
| Captcha (Baidu) | `disabled: true` | Correct |
| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |
---
## Documentation maintenance
For any of the changes below, **always** keep these artifacts in sync — this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
- Maigret code changes (including [`maigret/checking.py`](../maigret/checking.py), request executors, CLI);
- New or changed search tools / helper utilities for site checks;
- Changes to rules or semantics of `checkType`, `data.json` fields, self-check, etc.;
- Changes to the **public JSON API** diagnostic step or **mandatory** `socid_extractor` logging rules.
Prefer updating the guide, playbook, and log template in one commit or in the same task so instructions do not diverge. **Append-only:** new proposals go at the bottom of `socid_extractor_improvements.log`; do not delete historical entries when editing the template.
-142
View File
@@ -1,142 +0,0 @@
# Site checks — playbook (Maigret)
Short checklist for edits to [`maigret/resources/data.json`](../maigret/resources/data.json) and, when needed, [`maigret/checking.py`](../maigret/checking.py). Full guide: [`site-checks-guide.md`](site-checks-guide.md). Upstream extraction proposals: [`socid_extractor_improvements.log`](socid_extractor_improvements.log).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **both** this file and [`site-checks-guide.md`](site-checks-guide.md) (see the “Documentation maintenance” section at the end of that file). When JSON API / `socid_extractor` logging rules change, update the **template header** in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) in the same change.
## 0. Standard checks (do alongside reproduce / classify)
- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).
## 1. Reproduce
- Run a targeted check:
`maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv`
- Compare an **existing** and a **non-existent** username (as `usernameClaimed` / `usernameUnclaimed` in JSON).
- With `-vvv`, inspect `debug.log` (raw response in the log).
## 2. Classify the cause
| Symptom | Typical cause | Action |
|--------|-----------------|--------|
| HTTP 200 for “user does not exist” | Soft 404 | Move from `status_code` to `message` or `response_url`; add `absenceStrs` / narrow `presenseStrs` |
| Generic words match (`name`, `email`) | `presenseStrs` too broad | Remove generic markers; add profile-specific ones. **Avoid** ARIA/boilerplate words (`polite`, `alert`, `navigation`, etc.) — see 7.17 in guide |
| Same HTML without JS | SPA / skeleton shell | Compare **final URL and HTTP redirects** (Maigret already follows redirects by default). If the browser shows extra routes (`/posts`, `/not-found`) only **after JS**, they will **not** appear to Maigret — try a **public JSON/API** endpoint for the same site if one exists. See **Redirects and final URL** and **Picsart** in [`site-checks-guide.md`](site-checks-guide.md). |
| Unclaimed redirects to homepage | Site returns 301/302 to main page | Use `presenseStrs` with a profile-specific marker (e.g. title fragment unique to profile pages). See 7.19 in guide |
| 403 / “Log in” / guest-only | Auth or anti-bot required | `disabled: true` |
| reCAPTCHA / “Checking your browser” / “Client Challenge” | Bot protection | Add challenge text to `errors` (→ UNKNOWN). Try a reasonable `User-Agent` in `headers`. If intermittent, `errors` is better than `disabled`. See 7.18 in guide |
| Non-standard HTTP code (468, 520530) | CDN/WAF anti-bot | `disabled: true`. Check with `curl -sIL` to confirm the code comes from an intermediary. See 7.20 in guide |
| Non-ASCII `absenceStrs` not matching despite visible text | Page encoding ≠ UTF-8 | Check `Content-Type` for charset (KOI8-R, Windows-1251, etc.). Use ASCII-only markers, a JSON API, or `disabled: true`. See 7.16 in guide |
| Domain does not resolve / persistent timeout | Dead service | Remove entry **only** after confirming the domain is dead |
## 3. Data edits
**CRITICAL — surgical edits only.** Never rewrite `data.json` via `json.load()` + `json.dump()` — this reformats the entire ~36 000-line file and produces an unreviewable diff. Make targeted, line-level edits to only the fields you are changing. See Phase C in [`site-checks-guide.md`](site-checks-guide.md).
1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
- **Never** use ARIA/boilerplate words as `presenseStrs` (`polite`, `alert`, `navigation`, `status`, `main`, etc.).
- If markers contain **non-ASCII text**, verify the page charset is UTF-8. Non-UTF-8 pages (KOI8-R, Windows-1251) will silently fail byte comparison — prefer ASCII-only markers or a JSON API.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
5. Add **anti-bot challenge text** to `errors` (not `absenceStrs`) when the site intermittently serves challenge pages. Common patterns: `”Client Challenge”`, `”Just a moment”`, `”Checking your browser”`, `”Attention Required”`. This produces UNKNOWN instead of false CLAIMED.
## 4. Verify
- `maigret --self-check --site "SiteName" --db ...` for touched entries.
- `make test` before commit.
## 5. Code notes
- `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
- Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.
## 6. Development utilities
Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).
| Command | Purpose |
|---------|---------|
| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |
## 7. Quick tips (lessons learned)
Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).
| Tip | Why it matters |
|-----|----------------|
| **API first** | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints. |
| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
| **Use `debug.log`** | Run with `-vvv` to see raw response. Warning messages alone can be misleading. |
| **`status_code` for clean APIs** | If API returns proper 404 for missing users, prefer `status_code` over `message`. |
| **Migrate, don't delete** | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |
| **Engine templates break silently** | vBulletin `absenceStrs` failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first. |
| **Search-by-author is unreliable** | phpBB `search.php?author=` checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs. |
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe``.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic. |
| **Rate-limit → `errors`, not `absenceStrs`** | Rate-limit wording varies across API versions. If the phrasing doesn't match `absenceStrs` and `presenseStrs` is empty, the result is a false CLAIMED. Put all "can't answer right now" strings (rate limit, CAPTCHA, maintenance) in `errors` so the result is UNKNOWN. |
| **Non-UTF-8 encoding breaks markers** | opennet.ru serves KOI8-R; UTF-8 `absenceStrs` never match raw bytes. Use ASCII-only markers, a JSON API, or `disabled: true`. |
| **ARIA attrs are presenseStrs traps** | `"polite"`, `"alert"`, `"navigation"` match `aria-live`/ARIA landmarks on any page including anti-bot challenges. Use profile-specific markers instead. |
| **Anti-bot challenge + broad markers = false CLAIMED** | Challenge pages bypass `absenceStrs` but match broad `presenseStrs`. Add challenge text (e.g. `"Client Challenge"`) to `errors` → UNKNOWN. Better than disabling for intermittent issues. |
| **Redirect-to-homepage as signal** | Salon24.pl 301-redirects unclaimed users to homepage. Use `presenseStrs` with a profile-only marker (e.g. `"- salon24.pl</title>"`). |
| **Non-standard anti-bot HTTP codes** | HTTP 468 (Tengine), 520530 (Cloudflare) — not standard 403/429. Check with `curl -sIL`; if code is from intermediary → `disabled: true`. |
| **`--diagnose` doesn't test POST** | `site_check.py --diagnose` uses GET only. For POST APIs (Discord, Holopin), verify with `curl -X POST` or `maigret --self-check`. |
## 8. Site naming rules
Site names in `data.json` are the **keys** of the `"sites"` object and appear in user-facing reports. Follow these rules:
| Rule | Example | Counter-example |
|------|---------|-----------------|
| **Title Case** by default | `Hacker News`, `Product Hunt` | ~~`hackernews`~~, ~~`product hunt`~~ |
| **Lowercase** if the brand is written that way | `kofi`, `note`, `hi5` | ~~`Kofi`~~, ~~`Note`~~ |
| **No domain suffix** unless it is part of the recognized brand | `Flickr`, `Calendly`, `Upwork` | ~~`www.flickr.com`~~, ~~`calendly.com`~~ |
| **Domain OK** when the brand is commonly written with it | `last.fm`, `VC.ru`, `Archive.org` | |
| **No full UPPERCASE** unless the brand is an acronym/initialism | `VK`, `CNET`, `ICQ`, `IFTTT` | ~~`BOOTH`~~, ~~`VSCO`~~`Booth`, `VSCO` (brand) |
| **`{username}` templates** in names are OK | `{username}.tilda.ws` | |
| **Spaces** are allowed when the brand uses them | `Star Citizen`, `Google Maps` | |
| **No `www.` or `https://`** prefix | `Flickr`, `Change.org` | ~~`www.flickr.com`~~, ~~`https:`~~ |
When in doubt, check how the service refers to itself on its homepage or in its page title.
## 9. Tagging rules
### Country tags (ISO 3166-1 alpha-2)
The goal of a country tag is to **attribute a person to their country of origin or residence**, not to be a perfect truth source.
| Scenario | Action | Example |
|----------|--------|---------|
| Site is global, account says nothing about country | **No country tag** | GitHub, YouTube, Reddit, Medium, Udemy |
| Account implies connection to a specific country | **Add country tag** | VK → `ru`, Naver → `kr`, Zhihu → `cn` |
| Service used mostly in a few specific countries | **Multiple country tags OK** | Xing → `de`, `eu` |
| Very local/regional site | **Must have country tag** | Nairaland → `ng`, 4pda → `ru` |
**Do NOT** assign country tags based on traffic statistics (e.g. Alexa/SimilarWeb audience data). A site popular in India by traffic is not "Indian" if it is used globally. The `in` tag was previously over-applied this way.
### Category tags
- Every tag used in `data.json` must be registered in the `"tags"` array at the bottom of the file. The `test_tags_validity` test enforces this.
- Do not use platform/software names as tags (`writefreely`, `pixelfed`). Use category names instead (`blog`, `photo`).
- Avoid 2-letter category tags that collide with ISO country codes (e.g. `ai` = Anguilla). The `is_country_tag()` function treats any 2-letter tag as a country code.
- Keep existing category tags when modifying country tags.
- Top-50 sites by alexaRank must have at least one category tag (enforced by `test_top_sites_have_category_tag`).
## 10. Documentation maintenance
When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.
-4
View File
@@ -1,4 +0,0 @@
include LICENSE
include README.md
include requirements.txt
include maigret/resources/*
+8 -2
View File
@@ -139,12 +139,18 @@ class SimpleAiohttpChecker(CheckerBase):
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
from aiohttp_socks import ProxyConnector
# Use a real SSL context instead of ssl=False to avoid TLS fingerprinting
# blocks by Cloudflare and similar WAFs. Certificate verification is
# disabled to handle sites with invalid/expired certs.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
connector = (
ProxyConnector.from_url(self.proxy)
if self.proxy
else TCPConnector(ssl=False)
else TCPConnector(ssl=ssl_context)
)
connector.verify_ssl = False
async with ClientSession(
connector=connector,
+1
View File
@@ -58,6 +58,7 @@ COMMON_ERRORS = {
'Censorship', 'MGTS'
),
'Incapsula incident ID': CheckError('Bot protection', 'Incapsula'),
'<title>DDoS-Guard</title>': CheckError('Bot protection', 'DDoS-Guard'),
'Сайт заблокирован хостинг-провайдером': CheckError(
'Site-specific', 'Site is disabled (Beget)'
),
File diff suppressed because it is too large Load Diff
-3
View File
@@ -1,3 +0,0 @@
[mutmut]
paths_to_mutate=maigret/
tests_dir=tests/
+170 -270
View File
File diff suppressed because it is too large Load Diff