Improve site-check quality: fix broken site configs, add diagnostic utilities, and make self-check report-only by default with opt-in auto-disable. (#2301)

- Fix VK and TradingView checkType; add Reddit and Microsoft Learn API-style probes where appropriate; adjust or disable entries that are unreliable under anti-bot protection.
- Self-check: stop aggressive auto-disable; default to reporting issues only; add --auto-disable and --diagnose for optional fixes and deeper output.
- Tooling: add utils/site_check.py and utils/check_top_n.py (and related helpers) to inspect and rank site behavior against the top-N list.
- Scope: aligns with fixing top-traffic / high-impact sites and making diagnostics repeatable without silently flipping disabled flags.
2026-05-07 06:24:35 +00:00 · 2026-03-22 16:48:35 +01:00
parent 4784ecdacc
commit c9ab9d676b
14 changed files with 1959 additions and 65 deletions
@@ -20,6 +20,13 @@ For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets

 Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).

+### `urlProbe`: probe URL vs reported profile URL
+
+- **`url`** — pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
+- **`urlProbe`** (optional) — if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHub’s `https://github.com/{username}` + `https://api.github.com/users/{username}`).
+
+If `urlProbe` is omitted, the probe URL defaults to `url`.
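
A minimal sketch of the rule above, not Maigret's actual code: `resolve_urls` is a hypothetical helper, the site entry is a plain dict, and only the `{username}` and `{urlMain}` placeholders are handled. It illustrates that the reported link is always built from `url`, while the request target falls back to `url` only when `urlProbe` is missing.

```python
from urllib.parse import quote


def resolve_urls(site: dict, username: str) -> tuple[str, str]:
    """Return (reported_url, probe_url) for a site entry (hypothetical helper)."""

    def fill(pattern: str) -> str:
        # Substitute placeholders; the username segment is URL-encoded.
        pattern = pattern.replace("{urlMain}", site.get("urlMain", ""))
        return pattern.replace("{username}", quote(username, safe=""))

    reported = fill(site["url"])
    # Fall back to the profile pattern when no separate probe is configured.
    probe = fill(site.get("urlProbe") or site["url"])
    return reported, probe


picsart = {
    "url": "https://picsart.com/u/{username}",
    "urlProbe": "https://api.picsart.com/users/show/{username}.json",
}
print(resolve_urls(picsart, "john doe"))
```

With the Picsart patterns from this section, the call prints the web profile link first and the API URL second, both with the username percent-encoded.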
+
 ### Redirects and final URL as a signal

 If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
@@ -39,7 +46,7 @@ If that differs reliably, you may be able to use **`checkType`: `response_url`**
 | **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow — not a profile-vs-missing distinction. |
 | **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |

-**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Pointing the site entry’s **`url`** at this endpoint with **`checkType`: `message`** and narrow `presenseStrs` / `absenceStrs` restores a reliable check without a headless browser.
+**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).

 For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).

@@ -49,7 +56,7 @@ For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unloc

 ### 2.1 Public JSON API (always)

-When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, prefer it for the site entry’s **`url`** in [`data.json`](../maigret/resources/data.json) (see **Picsart** above).
+When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).

 This is a **standard** part of site-check work, not an optional extra.
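
The comparison step can be sketched as a small pure function. This is an illustration under assumptions, not part of Maigret: `suggest_signal` is a hypothetical name, the inputs are `(status_code, body)` pairs collected by hand (for example with `curl`), and the sample bodies are abbreviated.

```python
def suggest_signal(claimed: tuple[int, str], unclaimed: tuple[int, str]) -> str:
    """Suggest which signal separates a claimed from an unclaimed response."""
    c_status, c_body = claimed
    u_status, u_body = unclaimed

    def ok(status: int) -> bool:
        return 200 <= status < 300

    # A 2xx for the real user and a non-2xx for the fake one is the cleanest case.
    if ok(c_status) and not ok(u_status):
        return "status_code"
    # Same status class: fall back to comparing the bodies.
    if c_body != u_body:
        return "message"
    return "no difference"


claimed = (200, '{"status":"success"}')
unclaimed = (200, '{"reason":"user_not_found"}')
print(suggest_signal(claimed, unclaimed))  # message
```

A single pair only hints at a signal; repeat the comparison with several claimed and unclaimed usernames before relying on it.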

@@ -177,6 +184,192 @@ In those cases **`disabled: true`** is better than false “found”; remove the

 ---

+## 6. Development utilities
+
+### 6.1 `utils/site_check.py` — Single site diagnostics
+
+A comprehensive utility for testing individual sites with multiple modes:
+
+```bash
+# Basic comparison of claimed vs unclaimed (aiohttp)
+python utils/site_check.py --site "VK" --check-claimed
+
+# Test via Maigret's checker directly
+python utils/site_check.py --site "VK" --maigret
+
+# Compare aiohttp vs Maigret results (find discrepancies)
+python utils/site_check.py --site "VK" --compare-methods
+
+# Full diagnosis with recommendations
+python utils/site_check.py --site "VK" --diagnose
+
+# Test with custom URL
+python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2
+
+# Find a valid username for a site
+python utils/site_check.py --site "VK" --find-user
+```
+
+**Key features:**
+- `--maigret` — Uses Maigret's actual checking code, not raw aiohttp
+- `--compare-methods` — Shows if aiohttp and Maigret see different results (useful for debugging)
+- `--diagnose` — Validates checkType against actual responses, suggests fixes
+- Color output with marker detection (captcha, Cloudflare, login, etc.)
+- `--json` flag for machine-readable output
+
+**When to use each mode:**
+
+| Mode | Use case |
+|------|----------|
+| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
+| `--maigret` | Verify Maigret's actual behavior matches expectations |
+| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
+| `--diagnose` | Full analysis when a site is broken, get fix recommendations |
+
+### 6.2 `utils/check_top_n.py` — Mass site checking
+
+Batch-check top N sites by Alexa rank with categorized reporting:
+
+```bash
+# Check top 100 sites
+python utils/check_top_n.py --top 100
+
+# Faster with more parallelism
+python utils/check_top_n.py --top 100 --parallel 10
+
+# Output JSON report
+python utils/check_top_n.py --top 100 --output report.json
+
+# Only show broken sites
+python utils/check_top_n.py --top 100 --only-broken
+```
+
+**Output categories:**
+- `working` — Site check passes
+- `broken` — Check fails (wrong status, missing markers)
+- `timeout` — Request timed out
+- `anti_bot` — 403/429 or captcha detected
+- `error` — Connection or other errors
+- `disabled` — Already disabled in data.json
+
+**Report includes:**
+- Summary counts by category
+- List of broken sites with issues
+- Recommendations for fixes (e.g., "Switch to checkType: status_code")
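
One way to picture how a single result could land in the categories listed above. This is a hedged sketch: `categorize`, its parameters, and the precedence order are assumptions made for illustration, and the real `utils/check_top_n.py` may decide differently.

```python
from typing import Optional


def categorize(disabled: bool, status: Optional[int], body: str = "",
               timed_out: bool = False, check_passed: bool = False) -> str:
    """Map one site result to an output category (hypothetical helper)."""
    if disabled:
        return "disabled"      # already disabled in data.json
    if timed_out:
        return "timeout"
    if status is None:
        return "error"         # no HTTP response at all
    if status in (403, 429) or "captcha" in body.lower():
        return "anti_bot"
    return "working" if check_passed else "broken"


print(categorize(False, 200, check_passed=True))
print(categorize(False, 429))
print(categorize(False, None, timed_out=True))
```

Checking `disabled` and `timeout` first keeps those cases from being misreported as `broken`, which is the category that calls for a config fix.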
+
+### 6.3 Self-check behavior (`--self-check`)
+
+By default, self-check now only reports issues; disabling failing sites is opt-in:
+
+```bash
+# Check sites WITHOUT auto-disabling (default)
+maigret --self-check --site "VK"
+
+# Auto-disable failing sites (old behavior)
+maigret --self-check --site "VK" --auto-disable
+
+# Show detailed diagnosis for each failure
+maigret --self-check --site "VK" --diagnose
+```
+
+**Behavior changes:**
+
+| Flag | Effect |
+|------|--------|
+| `--self-check` alone | Reports issues but does NOT disable sites |
+| `--auto-disable` | Automatically disables sites that fail (opt-in) |
+| `--diagnose` | Prints detailed diagnosis with recommendations |
+
+**Why this matters:**
+- Old behavior was too aggressive — sites got disabled without explanation
+- New behavior reports issues and suggests fixes
+- An explicit `--auto-disable` is required to modify the database
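
The report-only default can be sketched as follows. `run_self_check`, the `sites` and `failures` dicts, and the failure reason are hypothetical stand-ins, not Maigret's internals; the point is only that the data changes when the opt-in flag is set and not otherwise.

```python
def run_self_check(sites: dict, failures: dict, auto_disable: bool = False) -> list:
    """Report failing sites; change `sites` only when auto_disable is True."""
    report = []
    for name, reason in failures.items():
        report.append(f"{name}: {reason}")
        if auto_disable:
            sites[name]["disabled"] = True  # opt-in side effect
    return report


sites = {"VK": {"disabled": False}}
failures = {"VK": "claimed username not detected"}

print(run_self_check(sites, failures))  # report only
print(sites["VK"]["disabled"])          # still False
```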
+
+---
+
+## 7. Lessons learned (practical observations)
+
+Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).
+
+### 7.1 JSON API is the first thing to look for
+
+Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:
+
+- **Reddit**: `https://api.reddit.com/user/{username}/about` — returns JSON with user data or `{"message": "Not Found", "error": 404}`.
+- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` — returns JSON with `userName` field or HTTP 404.
+
+This confirms the playbook recommendation: always check for `/api/`, `.json`, GraphQL endpoints before giving up on a site.
+
+### 7.2 `urlProbe` is a powerful tool
+
+It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:
+
+```json
+{
+  "url": "https://www.reddit.com/user/{username}",
+  "urlProbe": "https://api.reddit.com/user/{username}/about",
+  "checkType": "message",
+  "presenseStrs": ["\"name\":"],
+  "absenceStrs": ["Not Found"]
+}
+```
+
+The check hits the API, but for the username `blue` reports display `www.reddit.com/user/blue`.
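
A simplified sketch of how a `message` check could evaluate the Reddit entry above. `message_check` is a hypothetical function, not Maigret's checker, and the "found" body is an abbreviated, illustrative response. An empty presence list is treated as "presence always true", as the playbook notes.

```python
def message_check(body: str, presense_strs: list[str], absence_strs: list[str]) -> bool:
    """Return True when the body looks like an existing profile."""
    absence_hit = any(marker in body for marker in absence_strs)
    # An empty presence list counts as "presence always true".
    presence_hit = not presense_strs or any(marker in body for marker in presense_strs)
    return presence_hit and not absence_hit


presense = ['"name":']
absence = ["Not Found"]

found_body = '{"data": {"name": "blue"}}'  # abbreviated, illustrative
missing_body = '{"message": "Not Found", "error": 404}'

print(message_check(found_body, presense, absence))    # True
print(message_check(missing_body, presense, absence))  # False
```

Narrow markers such as `"name":` (with the quotes and colon) reduce the chance that unrelated text on an error page matches.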
+
+### 7.3 aiohttp ≠ curl ≠ requests
+
+Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting** — the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.
+
+**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.
+
+```python
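+import aiohttp
+
+async def main(url):
+    # This returned 403 for Wikipedia even with a browser User-Agent:
+    async with aiohttp.ClientSession() as session:
+        async with session.get(url, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
+            print(resp.status)  # 403
+# Run with asyncio.run(main(url)) after `import asyncio`.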
+```
+
+### 7.4 HTTP 403 in Maigret can mean different things
+
+At first it looked as if Wikipedia was returning 403 to everyone, but `curl` showed 200. Only `debug.log` revealed the real picture: aiohttp was being blocked at the TLS level.
+
+**Lesson:** Use `-vvv` flag and inspect `debug.log` for raw response status and body. The warning message alone may be misleading.
+
+### 7.5 Dead services migrate rather than disappear
+
+MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:
+
+1. Keep old entries with `disabled: true` as a historical record.
+2. Create a new entry for the current service with a working API.
+
+This preserves the audit trail and avoids breaking existing workflows.
+
+### 7.6 `status_code` is more reliable than `message` for APIs
+
+The Microsoft Learn API returns HTTP 404 for non-existent users, a clean signal that needs no HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:
+
+```json
+{
+  "checkType": "status_code",
+  "urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
+}
+```
+
+No need for fragile string matching when the API speaks HTTP correctly.
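
For contrast with string matching, a status-based decision fits in one comparison. This sketch assumes that any 2xx response counts as "profile exists"; `status_code_check` is a hypothetical name, and the exact rule in Maigret's checker may differ.

```python
def status_code_check(status: int) -> bool:
    """Treat any 2xx response as 'profile exists' (assumed rule)."""
    return 200 <= status < 300


for status in (200, 404):
    verdict = "claimed" if status_code_check(status) else "available"
    print(status, verdict)
```

Note that under this rule a 403 or 429 from anti-bot protection also reads as "available", which is why `status_code` only suits endpoints that answer reliably.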
+
+### 7.7 The playbook classification works
+
+The decision tree from the documentation accurately describes real-world cases:
+
+| Situation | Playbook says | Actual result |
+|-----------|---------------|---------------|
+| Captcha (Baidu) | `disabled: true` | Correct |
+| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
+| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
+| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |
+
+---
+
 ## Documentation maintenance

 For any of the changes below, **always** keep these artifacts in sync — this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
@@ -6,7 +6,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource

 ## 0. Standard checks (do alongside reproduce / classify)

-- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). Prefer it in `url` when it differentiates claimed vs unclaimed users better than HTML. Details: section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
+- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
 - **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).

 ## 1. Reproduce
@@ -29,7 +29,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource

 ## 3. Data edits

-1. Update `url` / `urlMain` if needed (HTTPS redirects).
+1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
 2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
 3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
 4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
@@ -44,6 +44,34 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource
 - `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
 - Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.

-## 6. Documentation maintenance
+## 6. Development utilities
+
+Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).
+
+| Command | Purpose |
+|---------|---------|
+| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
+| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
+| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
+| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
+| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
+| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
+| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
+| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |
+
+## 7. Quick tips (lessons learned)
+
+Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).
+
+| Tip | Why it matters |
+|-----|----------------|
+| **API first** | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints. |
+| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
+| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
+| **Use `debug.log`** | Run with `-vvv` to see raw response. Warning messages alone can be misleading. |
+| **`status_code` for clean APIs** | If API returns proper 404 for missing users, prefer `status_code` over `message`. |
+| **Migrate, don't delete** | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |
+
+## 8. Documentation maintenance

 When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.