mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-06 22:19:01 +00:00
Sites re-check (#2423)
@@ -115,6 +115,8 @@ Do **not** paste secrets, cookies, or full private JSON; short key names and str
### Phase C — Edits in [`data.json`](../maigret/resources/data.json)
**CRITICAL — surgical edits only.** `data.json` is a ~36 000-line file. **Never** rewrite it via `json.load()` + `json.dump()` — this reformats every line and produces a 70 000-line diff that is impossible to review. Instead, make **targeted text-level edits** (find the site's block, change only the specific lines). Use the `Edit` tool (or equivalent line-precise method), not a full JSON round-trip. The same rule applies to scripts: if a helper writes `data.json`, it must preserve the original formatting of untouched entries.
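
A minimal sketch of such a text-level edit, assuming each site block opens with a line like `"SiteName": {` and that the field being changed sits on a single line. The helper name `set_site_field` is hypothetical and not part of Maigret:

```python
import re


def set_site_field(text: str, site: str, field: str, new_value: str) -> str:
    """Replace one single-line field in one site's block; keep all other lines byte-identical.

    `new_value` is the raw JSON text of the value, e.g. '"https://example.com/{username}"'.
    Hypothetical helper: it assumes the block opens with a line like  "SiteName": {
    """
    lines = text.splitlines(keepends=True)
    # Locate the opening line of the site's block.
    start = next((i for i, line in enumerate(lines)
                  if line.strip().startswith(f'"{site}": {{')), None)
    if start is None:
        raise KeyError(f"site {site!r} not found")
    indent = len(lines[start]) - len(lines[start].lstrip())
    field_re = re.compile(rf'^(\s*"{re.escape(field)}":\s*)(.*?)(,?)(\r?\n?)$')
    for i in range(start + 1, len(lines)):
        line = lines[i]
        # The block ends at the closing brace indented like its opening line.
        if line.strip() in ("}", "},") and len(line) - len(line.lstrip()) == indent:
            break
        match = field_re.match(line)
        if match:
            prefix, _old_value, comma, newline = match.groups()
            lines[i] = f"{prefix}{new_value}{comma}{newline}"
            return "".join(lines)
    raise KeyError(f"{field!r} not found in block {site!r}")
```

Because only one line is rewritten, the resulting diff contains exactly that line; multi-line values such as arrays would still need a manual edit.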
1. Update `url` / `urlMain` if needed (HTTPS, new profile path).
2. Replace inappropriate `status_code` with `message` (or `response_url`), choosing:
   - **`absenceStrs`**: only what reliably appears on the “user does not exist” page;
@@ -360,9 +362,25 @@ No need for fragile string matching when the API speaks HTTP correctly.
### 7.8 Engine templates can silently break across many sites
The **vBulletin** engine template has `absenceStrs` in five languages ("This user has not registered…", "Пользователь не зарегистрирован…", etc.). In a batch review of ~12 vBulletin forums (oneclickchicks, mirf, Pesiq, VKMOnline, forum.zone-game.info, etc.), **none** of the absence strings matched — the forums returned identical pages for both claimed and unclaimed usernames. Root cause: many of these forums require login to view member profiles, so they serve a generic page (no "user not registered" message at all) instead of an informative error.
The **vBulletin** engine template has `absenceStrs` in six languages ("This user has not registered…", two Russian variants, Turkish, Ukrainian, Dutch). In a comprehensive audit of all 57 enabled vBulletin sites (2026-03-27), **26 were broken** (46%). The root causes were **not** template marker mismatch in most cases:
**Lesson:** When a whole engine class shows false positives, do not patch sites one by one — check whether the **engine template** itself still matches the actual error pages. A template written for one version/language pack may silently stop working after a forum upgrade or config change.
| Category | Count | Examples |
|----------|-------|----------|
| Cloudflare challenge (403 `cf-mitigated`) | 7 | Mpgh, TheStudentRoom, SevenForums, alliedmods |
| Dead/unreachable | 5 | Tanks, holodforum.ru, Microchip |
| Server-side 403 (non-CF) | 5 | scaleforum.ru, forum-history.ru, Gorod.dp.ua |
| Redirect/domain moved | 5 | Warface, Revelation, Stratege |
| Login required to view profiles | 4 | goha, Animeforum, WiredNewYork |
Only the "login required" category relates to the template markers: when a forum requires authentication, the member.php page shows a generic response without the "user not registered" text. All 26 sites were disabled.
**Note on Russian translations:** Two distinct Russian vBulletin translations exist in the wild:
- `"Этот пользователь ещё не зарегистрирован, поэтому его профиль недоступен."` (standard)
- `"Пользователь не зарегистрирован и не имеет профиля для просмотра."` (goha.ru variant)
Both are now in the engine template.
**Lesson:** When a whole engine class shows high failure rates, categorize failures first — most are site-level infrastructure issues (CF, dead, auth), not template problems. Batch-disable broken sites rather than patching individually. Only investigate the template itself if the HTTP response is 200 but markers don't match.
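
The "categorize first" step can be sketched roughly as follows. This is an illustrative triage function, not code from Maigret; the name `categorize_failure`, its parameters, and the category labels are assumptions modelled on the table above:

```python
from typing import Mapping, Optional, Sequence
from urllib.parse import urlsplit


def categorize_failure(
    requested_url: str,
    final_url: Optional[str],
    status: Optional[int],
    headers: Mapping[str, str],
    body: str,
    absence_markers: Sequence[str],
) -> str:
    """Assign one coarse category to a single probe of an unclaimed username (hypothetical helper)."""
    if status is None or final_url is None:
        return "dead/unreachable"  # DNS failure, refused connection, timeout
    lowered = {key.lower() for key in headers}
    if status == 403 and "cf-mitigated" in lowered:
        return "cloudflare challenge"
    if status == 403:
        return "server-side 403"
    if urlsplit(final_url).hostname != urlsplit(requested_url).hostname:
        return "redirect/domain moved"
    if status == 200 and not any(marker in body for marker in absence_markers):
        # Only this bucket justifies looking at the engine template itself.
        return "markers do not match"
    return "ok"
```

Counting the returned labels across an engine's sites shows quickly whether the template or the infrastructure is at fault.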
### 7.9 Search-by-author URLs are architecturally unreliable
@@ -447,6 +465,74 @@ When a site's API returns a rate-limit response, the text may **not** match the
**Discord example (2026-03-24):** The POST API at `discord.com/api/v9/unique-username/username-attempt-unauthed` returns `{"taken":true}` / `{"taken":false}` normally, but under load returns varying rate-limit messages. Keeping only `{"taken":false}` in `absenceStrs` and all rate-limit variants in `errors` eliminates the transient false positives the Maigret bot was reporting.
### 7.16 Non-UTF-8 page encoding silently breaks string markers
**opennet.ru** serves pages in **KOI8-R** encoding. The `absenceStrs` value `"Имя участника не найдено"` is stored as UTF-8 bytes in `data.json`, but the HTTP response body contains the same text encoded as KOI8-R bytes. Since Maigret (and aiohttp) compares raw bytes by default, the substring is **never found** — the absence check silently fails, and empty `presenseStrs` (presence always true) produces a false CLAIMED.
**How to detect:** If `absenceStrs` contains non-ASCII text and the check fails despite the string visibly appearing on the page in a browser, inspect the `Content-Type` header or raw bytes for a non-UTF-8 `charset` (KOI8-R, Windows-1251, ISO-8859-*, etc.). Also check with `curl -s URL | iconv -f KOI8-R -t UTF-8` to confirm.
**Lesson:** Maigret has no built-in charset transcoding for marker comparison. If a site serves a non-UTF-8 charset and the relevant markers contain non-ASCII characters, string matching will fail. Options:
- Find ASCII-only markers that work in any encoding (HTML tags, class names, English text).
- Use a JSON API endpoint (APIs almost always return UTF-8).
- If neither is available, `disabled: true`.
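
The mismatch can be reproduced offline. The snippet below simulates a KOI8-R response body rather than fetching opennet.ru, so the HTML wrapper is an assumption; only the marker text comes from the case above:

```python
marker = "Имя участника не найдено"

# Simulate a KOI8-R response body as it arrives on the wire.
raw_body = f"<html><body>{marker}</body></html>".encode("koi8_r")

# UTF-8 marker bytes compared against KOI8-R body bytes never match.
found_as_bytes = marker.encode("utf-8") in raw_body

# Decoding with the page's real charset first makes the same marker visible.
found_after_decode = marker in raw_body.decode("koi8_r")

# An ASCII-only marker is byte-identical in both encodings.
ascii_found = b"<body>" in raw_body

print(found_as_bytes, found_after_decode, ascii_found)
```

This is why the ASCII-only option is the cheapest fix: it does not depend on how the response is decoded.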
### 7.17 ARIA and HTML boilerplate attributes are dangerous `presenseStrs`
SlideShare had `"polite"` in `presenseStrs`, matching the standard `aria-live="polite"` attribute. This attribute appears on virtually any modern web page — including anti-bot challenge pages, error pages, and homepage redirects. When the real profile page is replaced by such a generic page, `absenceStrs` don't match (different content) but `presenseStrs` still fires → false CLAIMED.
**Common traps:** `polite`, `alert`, `status`, `navigation`, `assertive`, `banner`, `main`, `complementary`, `contentinfo` — all standard ARIA landmark/live-region values present on most pages.
**Lesson:** Never use single generic words that are part of HTML/ARIA boilerplate as `presenseStrs`. Profile markers should be **specific to the profile page structure**: unique CSS classes (e.g. `"profile-card"`), `<title>` fragments with the site name (e.g. `"- salon24.pl</title>"`), or JSON field names from API responses (e.g. `"displayName"`).
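
A small lint along these lines could flag such markers before they land in `data.json`. The function `risky_presence_markers` is a hypothetical helper, and the token set simply mirrors the "Common traps" list above:

```python
# Tokens copied from the "Common traps" list; extend as new traps are found.
GENERIC_TOKENS = {
    "polite", "alert", "status", "navigation", "assertive",
    "banner", "main", "complementary", "contentinfo",
}


def risky_presence_markers(presense_strs):
    """Return markers that are bare ARIA/boilerplate words (case-insensitive)."""
    return [m for m in presense_strs if m.strip().lower() in GENERIC_TOKENS]
```

For example, `risky_presence_markers(["polite", "profile-card"])` reports only `"polite"`, since `"profile-card"` is specific to a profile layout.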
### 7.18 Anti-bot challenge pages can pass through `message` checks as false CLAIMED
When a site intermittently serves an anti-bot challenge page (e.g. SlideShare's "Client Challenge", Cloudflare "Just a moment..."), a specific failure mode occurs with `checkType: "message"`:
1. The challenge HTML replaces the real profile/error page.
2. `absenceStrs` don't match (challenge page has different content than "user not found").
3. If `presenseStrs` is empty (presence always true) **or** contains a broad marker that matches the challenge HTML → result is **CLAIMED**.
This is different from a simple "anti-bot → disable" situation because the challenge may be **intermittent** — the check works most of the time but produces sporadic false positives under load or for specific IPs.
**Fix:** Add the challenge page's distinctive text to `errors`:
```json
{
    "errors": {
        "Client Challenge": "Anti-bot challenge",
        "Just a moment": "Cloudflare challenge",
        "Checking your browser": "Anti-bot challenge"
    }
}
```
The `errors` mechanism produces **UNKNOWN** instead of CLAIMED, which is correct: "we got a challenge page, not a profile page, so we don't know."
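
The effect of the fix can be illustrated with a simplified model of a `message` check. This is a sketch of the decision order described in this section, not Maigret's actual implementation; the function name `classify` and the `AVAILABLE` label are assumptions:

```python
from typing import Mapping, Sequence


def classify(body: str,
             errors: Mapping[str, str],
             absence_strs: Sequence[str],
             presense_strs: Sequence[str]) -> str:
    """Simplified model: errors win, then absence markers, then presence markers."""
    if any(marker in body for marker in errors):
        return "UNKNOWN"
    if any(marker in body for marker in absence_strs):
        return "AVAILABLE"
    # Empty presenseStrs means "presence always true".
    if not presense_strs or any(marker in body for marker in presense_strs):
        return "CLAIMED"
    return "AVAILABLE"


challenge_page = "<html><title>Client Challenge</title></html>"
without_fix = classify(challenge_page, {}, ["User not found"], [])
with_fix = classify(challenge_page,
                    {"Client Challenge": "Anti-bot challenge"},
                    ["User not found"], [])
print(without_fix, with_fix)
```

In this model the same challenge page flips from a false CLAIMED to UNKNOWN once its text is listed in `errors`.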
**Lesson:** When fixing a site that is **intermittently** reported as false positive, check whether the failure happens only when anti-bot protection triggers. If so, adding challenge markers to `errors` is better than disabling the entire check.
### 7.19 Redirect-to-homepage as a "user not found" signal
Some sites (e.g. **Salon24.pl**) redirect non-existent user URLs to the **homepage** via HTTP 301/302, while existing users get a 200 with profile content. Since Maigret follows redirects by default (`allow_redirects=True` for `message`/`status_code` checks), it sees the **final** page — the homepage.
This creates a usable signal for `checkType: "message"`:
- **`presenseStrs`** with a fragment unique to profile pages (e.g. `"- salon24.pl</title>"` which appears in `"test 1 - salon24.pl</title>"` on profiles but not on the generic homepage title).
- No `absenceStrs` needed: the homepage simply doesn't contain the profile-specific marker.
**Lesson:** When a site returns the same HTTP 200 for both users (after redirect-follow), compare the **final page content** for both. If unclaimed lands on the homepage, use a profile-specific `presenseStrs` marker rather than trying to find an absence string on the homepage.
### 7.20 Non-standard HTTP status codes from anti-bot systems
Anti-bot systems don't always use standard 403/429 codes. Observed examples:
- **HTTP 468** (forum.exkavator.ru): custom Tengine anti-bot status.
- **HTTP 520–530**: Cloudflare-specific error codes (520 = unknown error, 521 = web server down, 522 = connection timed out, 523 = origin unreachable, 524 = a timeout occurred, 525 = SSL handshake failed, 526 = invalid SSL certificate, 530 = returned alongside a 1xxx error).
**Lesson:** When diagnosing a site that returns connection errors or unexpected statuses in Maigret, check with `curl -sIL` first. If the status code is one the origin application would not normally emit (such as 468 or 520–530), it most likely comes from an intermediary (CDN, WAF, anti-bot), and the site should be `disabled: true`.
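
As a rough illustration, the codes listed in this section can be captured in a one-line predicate. The function name is hypothetical and the set of codes is limited to the examples observed above, so it is a heuristic rather than a complete catalogue:

```python
def looks_like_intermediary_status(status: int) -> bool:
    """True for the anti-bot/CDN codes observed above: 468 (Tengine) and 520-530 (Cloudflare)."""
    return status == 468 or 520 <= status <= 530
```

A `True` result is a hint to confirm with `curl -sIL`, not proof that an intermediary produced the response.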
### 7.21 `site_check.py --diagnose` does not test POST APIs
The `utils/site_check.py --diagnose` tool performs raw aiohttp GET requests to compare claimed/unclaimed responses. For sites that use `requestMethod: "POST"` (e.g. Discord, Holopin), the diagnose tool will show the site as broken because GET to a POST endpoint returns different content (often the site's homepage or an error page).
**Workaround:** For POST-based checks, verify manually with `curl -X POST` or use `maigret --self-check --site "SiteName"` which respects the full configuration including request method and payload.
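
For a scripted equivalent of the manual `curl -X POST` check, a request can be built with the standard library. The URL is the Discord endpoint mentioned earlier in this guide; the payload shape `{"username": ...}` and the helper name `build_post_probe` are assumptions, so take the real payload from the site's entry in `data.json`:

```python
import json
import urllib.request


def build_post_probe(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a manual check."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_post_probe(
    "https://discord.com/api/v9/unique-username/username-attempt-unauthed",
    {"username": "example_user"},  # assumed payload shape
)
# To send it: urllib.request.urlopen(request, timeout=10).read()
```

Sending the request once for a claimed and once for an unclaimed username gives the two bodies to compare, which is what `--diagnose` cannot do for POST endpoints.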
### 7.7 The playbook classification works
The decision tree from the documentation accurately describes real-world cases:
@@ -21,18 +21,26 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource
| Symptom | Typical cause | Action |
|---------|---------------|--------|
| HTTP 200 for “user does not exist” | Soft 404 | Move from `status_code` to `message` or `response_url`; add `absenceStrs` / narrow `presenseStrs` |
| Generic words match (`name`, `email`) | `presenseStrs` too broad | Remove generic markers; add profile-specific ones |
| Generic words match (`name`, `email`) | `presenseStrs` too broad | Remove generic markers; add profile-specific ones. **Avoid** ARIA/boilerplate words (`polite`, `alert`, `navigation`, etc.); see 7.17 in guide |
| Same HTML without JS | SPA / skeleton shell | Compare **final URL and HTTP redirects** (Maigret already follows redirects by default). If the browser shows extra routes (`/posts`, `/not-found`) only **after JS**, they will **not** appear to Maigret; try a **public JSON/API** endpoint for the same site if one exists. See **Redirects and final URL** and **Picsart** in [`site-checks-guide.md`](site-checks-guide.md). |
| Unclaimed redirects to homepage | Site returns 301/302 to main page | Use `presenseStrs` with a profile-specific marker (e.g. title fragment unique to profile pages). See 7.19 in guide |
| 403 / “Log in” / guest-only | Auth or anti-bot required | `disabled: true` |
| reCAPTCHA / “Checking your browser” | Bot protection | Try a reasonable `User-Agent` in `headers`; else `errors` + UNKNOWN or `disabled` |
| reCAPTCHA / “Checking your browser” / “Client Challenge” | Bot protection | Add challenge text to `errors` (→ UNKNOWN). Try a reasonable `User-Agent` in `headers`. If intermittent, `errors` is better than `disabled`. See 7.18 in guide |
| Non-standard HTTP code (468, 520–530) | CDN/WAF anti-bot | `disabled: true`. Check with `curl -sIL` to confirm the code comes from an intermediary. See 7.20 in guide |
| Non-ASCII `absenceStrs` not matching despite visible text | Page encoding ≠ UTF-8 | Check `Content-Type` for charset (KOI8-R, Windows-1251, etc.). Use ASCII-only markers, a JSON API, or `disabled: true`. See 7.16 in guide |
| Domain does not resolve / persistent timeout | Dead service | Remove entry **only** after confirming the domain is dead |
## 3. Data edits
**CRITICAL — surgical edits only.** Never rewrite `data.json` via `json.load()` + `json.dump()` — this reformats the entire ~36 000-line file and produces an unreviewable diff. Make targeted, line-level edits to only the fields you are changing. See Phase C in [`site-checks-guide.md`](site-checks-guide.md).
1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
   - **Never** use ARIA/boilerplate words as `presenseStrs` (`polite`, `alert`, `navigation`, `status`, `main`, etc.).
   - If markers contain **non-ASCII text**, verify the page charset is UTF-8. Non-UTF-8 pages (KOI8-R, Windows-1251) will silently fail byte comparison; prefer ASCII-only markers or a JSON API.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
5. Add **anti-bot challenge text** to `errors` (not `absenceStrs`) when the site intermittently serves challenge pages. Common patterns: `"Client Challenge"`, `"Just a moment"`, `"Checking your browser"`, `"Attention Required"`. This produces UNKNOWN instead of false CLAIMED.
## 4. Verify
@@ -82,6 +90,12 @@ Practical observations from fixing top-ranked sites. Full details: section **7**
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe`; `.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic. |
| **Rate-limit → `errors`, not `absenceStrs`** | Rate-limit wording varies across API versions. If the phrasing doesn't match `absenceStrs` and `presenseStrs` is empty, the result is a false CLAIMED. Put all "can't answer right now" strings (rate limit, CAPTCHA, maintenance) in `errors` so the result is UNKNOWN. |
| **Non-UTF-8 encoding breaks markers** | opennet.ru serves KOI8-R; UTF-8 `absenceStrs` never match raw bytes. Use ASCII-only markers, a JSON API, or `disabled: true`. |
| **ARIA attrs are presenseStrs traps** | `"polite"`, `"alert"`, `"navigation"` match `aria-live`/ARIA landmarks on any page including anti-bot challenges. Use profile-specific markers instead. |
| **Anti-bot challenge + broad markers = false CLAIMED** | Challenge pages bypass `absenceStrs` but match broad `presenseStrs`. Add challenge text (e.g. `"Client Challenge"`) to `errors` → UNKNOWN. Better than disabling for intermittent issues. |
| **Redirect-to-homepage as signal** | Salon24.pl 301-redirects unclaimed users to homepage. Use `presenseStrs` with a profile-only marker (e.g. `"- salon24.pl</title>"`). |
| **Non-standard anti-bot HTTP codes** | HTTP 468 (Tengine), 520–530 (Cloudflare) are not standard 403/429. Check with `curl -sIL`; if code is from intermediary → `disabled: true`. |
| **`--diagnose` doesn't test POST** | `site_check.py --diagnose` uses GET only. For POST APIs (Discord, Holopin), verify with `curl -X POST` or `maigret --self-check`. |
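
The brace-encoding row in the table above can be checked directly with Python's `str.format()`. The GraphQL-style URL below is a made-up example on `example.com`, not a real site entry:

```python
# Percent-encoded braces are plain text to str.format(); only {username} is substituted.
template = "https://example.com/graphql?query=%7Buser(login:%22{username}%22)%7Bid%7D%7D"
probe_url = template.format(username="alice")

# The same query with literal braces is parsed as format fields and fails.
literal = 'https://example.com/graphql?query={user(login:"{username}"){id}}'
try:
    literal.format(username="alice")
    literal_braces_ok = True
except (KeyError, IndexError, ValueError):
    literal_braces_ok = False

print(probe_url)
print(literal_braces_ok)
```

The encoded template survives formatting unchanged apart from the username, while the literal one raises before any request is made.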
## 8. Documentation maintenance