mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-07 06:24:35 +00:00
Improve site-check quality: fix broken site configs, add diagnostic utilities, and make self-check report-only by default with opt-in auto-disable. (#2301)
- Fix VK and TradingView checkType; add Reddit and Microsoft Learn API-style probes where appropriate; adjust or disable entries that are unreliable under anti-bot protection.
- Self-check: stop aggressive auto-disable; default to reporting issues only; add --auto-disable and --diagnose for optional fixes and deeper output.
- Tooling: add utils/site_check.py and utils/check_top_n.py (and related helpers) to inspect and rank site behavior against the top-N list.
- Scope: aligns with fixing top-traffic / high-impact sites and making diagnostics repeatable without silently flipping disabled flags.
+195
-2
@@ -20,6 +20,13 @@ For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets
Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).

### `urlProbe`: probe URL vs reported profile URL

- **`url`**: pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
- **`urlProbe`** (optional): if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHub’s `https://github.com/{username}` + `https://api.github.com/users/{username}`).

If `urlProbe` is omitted, the probe URL defaults to `url`.
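
The relationship between `url` and `urlProbe` can be sketched as follows. This is a minimal illustration, not Maigret's code: the function name `build_urls` and the plain-dict site entry are assumptions made for the example, and only the fallback rule and the URL-encoding of the username come from the text above.

```python
from typing import Tuple
from urllib.parse import quote


def build_urls(site: dict, username: str) -> Tuple[str, str]:
    """Return (profile_url, probe_url) for a site entry (hypothetical helper)."""
    fields = {
        "username": quote(username),  # the username segment is URL-encoded
        "urlMain": site.get("urlMain", ""),
        "urlSubpath": site.get("urlSubpath", ""),
    }

    def fill(pattern: str) -> str:
        # Plain replacement avoids str.format() choking on other braces in a URL.
        for key, value in fields.items():
            pattern = pattern.replace("{" + key + "}", value)
        return pattern

    profile_url = fill(site["url"])
    # The probe URL falls back to `url` when `urlProbe` is not set.
    probe_url = fill(site.get("urlProbe") or site["url"])
    return profile_url, probe_url


picsart = {
    "url": "https://picsart.com/u/{username}",
    "urlProbe": "https://api.picsart.com/users/show/{username}.json",
}
print(build_urls(picsart, "john doe"))
```

With the Picsart entry, the first element is the link shown in reports and the second is the address the request is sent to; an entry without `urlProbe` yields the same URL twice.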
### Redirects and final URL as a signal

If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
@@ -39,7 +46,7 @@ If that differs reliably, you may be able to use **`checkType`: `response_url`**
| **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow, so this is not a profile-vs-missing distinction. |
| **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |

**Picsart: workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Pointing the site entry’s **`url`** at this endpoint with **`checkType`: `message`** and narrow `presenseStrs` / `absenceStrs` restores a reliable check without a headless browser.
**Picsart: workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).

For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).
@@ -49,7 +56,7 @@ For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unloc
### 2.1 Public JSON API (always)

When diagnosing a site, especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users, **routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, prefer it for the site entry’s **`url`** in [`data.json`](../maigret/resources/data.json) (see **Picsart** above).
When diagnosing a site, especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users, **routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).

This is a **standard** part of site-check work, not an optional extra.
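
The "reliably different" test can be expressed as a small comparison of two captured responses. The sketch below is illustrative only: `Probe` and `distinguishing_signal` are hypothetical names, the responses are passed in as plain data rather than fetched, and the claimed body is a shortened stand-in for the Picsart JSON described earlier.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Probe:
    """Status code and body captured for one username (hypothetical type)."""
    status: int
    body: str


def distinguishing_signal(
    claimed: Probe, unclaimed: Probe, markers: Iterable[str]
) -> Optional[str]:
    """Suggest a checkType that separates the two responses, or None."""
    if claimed.status != unclaimed.status:
        return "status_code"
    # A marker is useful only if it appears in exactly one of the two bodies.
    one_sided = [m for m in markers if (m in claimed.body) != (m in unclaimed.body)]
    return "message" if one_sided else None


claimed = Probe(200, '{"status":"success","username":"blue"}')  # shortened stand-in
unclaimed = Probe(200, '{"reason":"user_not_found"}')
markers = ['"status":"success"', '"reason":"user_not_found"']
print(distinguishing_signal(claimed, unclaimed, markers))
```

A result of `None` means neither the status code nor the chosen markers separate the two usernames, which is the case where an API endpoint or `disabled: true` should be considered.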
@@ -177,6 +184,192 @@ In those cases **`disabled: true`** is better than false “found”; remove the
---

## 6. Development utilities

### 6.1 `utils/site_check.py`: single-site diagnostics

A utility for testing individual sites in several modes:

```bash
# Basic comparison of claimed vs unclaimed (aiohttp)
python utils/site_check.py --site "VK" --check-claimed

# Test via Maigret's checker directly
python utils/site_check.py --site "VK" --maigret

# Compare aiohttp vs Maigret results (find discrepancies)
python utils/site_check.py --site "VK" --compare-methods

# Full diagnosis with recommendations
python utils/site_check.py --site "VK" --diagnose

# Test with custom URL
python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2

# Find a valid username for a site
python utils/site_check.py --site "VK" --find-user
```
**Key features:**
- `--maigret`: uses Maigret's actual checking code, not raw aiohttp
- `--compare-methods`: shows whether aiohttp and Maigret see different results (useful for debugging)
- `--diagnose`: validates `checkType` against actual responses and suggests fixes
- Color output with detection of markers (captcha, Cloudflare, login, etc.)
- `--json` flag for machine-readable output

**When to use each mode:**

| Mode | Use case |
|------|----------|
| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
| `--maigret` | Verify Maigret's actual behavior matches expectations |
| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
| `--diagnose` | Full analysis when a site is broken, with fix recommendations |
### 6.2 `utils/check_top_n.py`: mass site checking

Batch-checks the top N sites by Alexa rank and reports the results by category:

```bash
# Check top 100 sites
python utils/check_top_n.py --top 100

# Faster with more parallelism
python utils/check_top_n.py --top 100 --parallel 10

# Output JSON report
python utils/check_top_n.py --top 100 --output report.json

# Only show broken sites
python utils/check_top_n.py --top 100 --only-broken
```
**Output categories:**
- `working`: site check passes
- `broken`: check fails (wrong status, missing markers)
- `timeout`: request timed out
- `anti_bot`: 403/429 or captcha detected
- `error`: connection or other errors
- `disabled`: already disabled in `data.json`

**Report includes:**
- Summary counts by category
- List of broken sites with issues
- Recommendations for fixes (e.g., "Switch to checkType: status_code")
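
One way to read the category list is as an ordered set of rules. The following is a sketch under that assumption and is not the code in `utils/check_top_n.py`: the function name `categorize`, its parameters, and the order in which the rules are applied are all hypothetical; only the category names and the 403/429/captcha rule come from the list above.

```python
from typing import Optional


def categorize(
    disabled: bool,
    status: Optional[int] = None,
    body: str = "",
    timed_out: bool = False,
    error: Optional[str] = None,
    check_passed: bool = False,
) -> str:
    """Map one site result to a report category (illustrative precedence)."""
    if disabled:
        return "disabled"  # already disabled in data.json
    if timed_out:
        return "timeout"
    if error is not None:
        return "error"  # connection or other errors
    if status in (403, 429) or "captcha" in body.lower():
        return "anti_bot"
    return "working" if check_passed else "broken"


print(categorize(False, status=429))
print(categorize(False, status=200, check_passed=True))
```

Checking `anti_bot` before `broken` keeps blocked sites out of the list of entries that need a config fix, which matches the intent of reporting them separately.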
### 6.3 Self-check behavior (`--self-check`)

The self-check command is now report-only by default and no longer disables sites on its own:

```bash
# Check sites WITHOUT auto-disabling (default)
maigret --self-check --site "VK"

# Auto-disable failing sites (old behavior)
maigret --self-check --site "VK" --auto-disable

# Show detailed diagnosis for each failure
maigret --self-check --site "VK" --diagnose
```
**Behavior changes:**

| Flag | Effect |
|------|--------|
| `--self-check` alone | Reports issues but does NOT disable sites |
| `--auto-disable` | Automatically disables sites that fail (opt-in) |
| `--diagnose` | Prints detailed diagnosis with recommendations |

**Why this matters:**
- The old behavior was too aggressive: sites were disabled without explanation
- The new behavior reports issues and suggests fixes
- An explicit `--auto-disable` is required to modify the database
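
The difference between report-only and opt-in disabling can be shown in a few lines. This sketch does not reproduce Maigret's self-check code: `apply_self_check`, the dict-based site database, and the issue strings are hypothetical, and only the rule "report always, modify only with `--auto-disable`" is taken from the table above.

```python
from typing import Dict, List


def apply_self_check(
    sites: Dict[str, dict], failures: Dict[str, str], auto_disable: bool = False
) -> List[str]:
    """Report every failure; flip `disabled` only when explicitly asked to."""
    report = []
    for name, issue in failures.items():
        report.append(f"{name}: {issue}")
        if auto_disable:
            sites[name]["disabled"] = True  # opt-in modification of the database
    return report


db = {"VK": {"disabled": False}}
print(apply_self_check(db, {"VK": "claimed user not detected"}))
print(db["VK"]["disabled"])
```

Run without `auto_disable`, the database entry is left untouched and the caller only receives the report lines.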
---

## 7. Lessons learned (practical observations)

Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).

### 7.1 JSON API is the first thing to look for

Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:

- **Reddit**: `https://api.reddit.com/user/{username}/about` returns JSON with user data or `{"message": "Not Found", "error": 404}`.
- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` returns JSON with a `userName` field or HTTP 404.

This confirms the playbook recommendation: always check for `/api/`, `.json`, and GraphQL endpoints before giving up on a site.
### 7.2 `urlProbe` is a powerful tool

It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:

```json
{
  "url": "https://www.reddit.com/user/{username}",
  "urlProbe": "https://api.reddit.com/user/{username}/about",
  "checkType": "message",
  "presenseStrs": ["\"name\":"],
  "absenceStrs": ["Not Found"]
}
```

The check hits the API, but reports display `www.reddit.com/user/blue`.
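
How a `message` check combines the two string lists can be sketched as below. This is an approximation for illustration, not the code in `maigret/checking.py`: the function name `message_check` is hypothetical and the claimed body is a made-up, shortened response. The rule that an empty `presenseStrs` counts as "presence always true" is taken from the playbook notes.

```python
from typing import List


def message_check(body: str, presense_strs: List[str], absence_strs: List[str]) -> bool:
    """Return True when the body looks like an existing profile."""
    absent = any(s in body for s in absence_strs)
    # An empty presence list means "presence always true".
    present = any(s in body for s in presense_strs) if presense_strs else True
    return present and not absent


claimed_body = '{"kind": "t2", "data": {"name": "blue"}}'  # illustrative stand-in
missing_body = '{"message": "Not Found", "error": 404}'

print(message_check(claimed_body, ['"name":'], ["Not Found"]))
print(message_check(missing_body, ['"name":'], ["Not Found"]))
```

Because an absence hit overrides a presence hit, narrow absence strings matter: a string that also occurs on real profiles would turn every account into "not found".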
### 7.3 aiohttp ≠ curl ≠ requests

Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting**: the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.

**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.

```python
import asyncio
import aiohttp

async def main(url: str) -> None:
    # This returned 403 for Wikipedia even with a browser UA:
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
            print(resp.status)  # 403 in the test described above

# Run with: asyncio.run(main(url))
```
### 7.4 HTTP 403 in Maigret can mean different things

Initially it seemed Wikipedia was returning 403, but `curl` showed 200. Only `debug.log` revealed the real picture: aiohttp was being blocked at the TLS level.

**Lesson:** Use the `-vvv` flag and inspect `debug.log` for the raw response status and body. The warning message alone may be misleading.

### 7.5 Dead services migrate rather than disappear

MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:

1. Keep old entries with `disabled: true` as a historical record.
2. Create a new entry for the current service with a working API.

This preserves the audit trail and avoids breaking existing workflows.
### 7.6 `status_code` is more reliable than `message` for APIs

The Microsoft Learn API returns HTTP 404 for non-existent users, a clean signal that needs no HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:

```json
{
  "checkType": "status_code",
  "urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
}
```

No need for fragile string matching when the API speaks HTTP correctly.
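
A status-based decision for a well-behaved API can be sketched as follows. The mapping is an assumption made for this example and is deliberately more conservative than a plain found/not-found split; it is not Maigret's rule, and the function name `classify_status` is hypothetical.

```python
def classify_status(status: int) -> str:
    """Interpret an API status code for a profile lookup (illustrative rule)."""
    if 200 <= status < 300:
        return "claimed"  # the API returned a profile
    if status == 404:
        return "available"  # the API says the user does not exist
    # 403, 429 and similar codes say nothing about the account itself.
    return "unknown"


for code in (200, 404, 403):
    print(code, classify_status(code))
```

Keeping a separate `unknown` outcome reflects the earlier observation that a 403 may come from anti-bot protection rather than from a missing account.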
### 7.7 The playbook classification works

The decision tree from the documentation accurately describes real-world cases:

| Situation | Playbook says | Actual result |
|-----------|---------------|---------------|
| Captcha (Baidu) | `disabled: true` | Correct |
| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |

---
## Documentation maintenance

For any of the changes below, **always** keep these artifacts in sync: this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
@@ -6,7 +6,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource
## 0. Standard checks (do alongside reproduce / classify)

- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). Prefer it in `url` when it differentiates claimed vs unclaimed users better than HTML. Details: section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).

## 1. Reproduce
@@ -29,7 +29,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource
## 3. Data edits

1. Update `url` / `urlMain` if needed (HTTPS redirects).
1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
@@ -44,6 +44,34 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource
- `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
- Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.

## 6. Documentation maintenance
## 6. Development utilities

Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).

| Command | Purpose |
|---------|---------|
| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |
## 7. Quick tips (lessons learned)

Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).

| Tip | Why it matters |
|-----|----------------|
| **API first** | For Reddit and Microsoft Learn, the APIs worked when the web pages were blocked. Always check `/api/` and `.json` endpoints. |
| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
| **Use `debug.log`** | Run with `-vvv` to see the raw response. Warning messages alone can be misleading. |
| **`status_code` for clean APIs** | If the API returns a proper 404 for missing users, prefer `status_code` over `message`. |
| **Migrate, don't delete** | MSDN → Microsoft Learn: keep the old entry disabled, create a new one for the current service. |
## 8. Documentation maintenance

When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.