Site checks — playbook (Maigret)

Short checklist for edits to maigret/resources/data.json and, when needed, maigret/checking.py. Full guide: site-checks-guide.md. Upstream extraction proposals: socid_extractor_improvements.log.

Documentation maintenance: whenever you improve Maigret, add search tooling, or change check logic, update both this file and site-checks-guide.md (see the “Documentation maintenance” section at the end of that file). When JSON API / socid_extractor logging rules change, update the template header in socid_extractor_improvements.log in the same change.

0. Standard checks (do alongside reproduce / classify)

Public JSON API: always look for a stable JSON (or GraphQL JSON) profile endpoint (/api/, .json, mobile-style URLs). When the API is more reliable than HTML, set urlProbe to that endpoint and keep url as the human-readable profile link (e.g. https://picsart.com/u/{username}). If there is no separate profile URL, use the API as url only. Details: urlProbe and section 2.1 in site-checks-guide.md.
socid_extractor log (mandatory): if you find embedded user JSON in HTML or a standalone JSON profile API, append a dated entry (with example username) to socid_extractor_improvements.log. Details: section 2.2 in site-checks-guide.md.

1. Reproduce

Run a targeted check:
maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv
Compare an existing and a non-existent username (as usernameClaimed / usernameUnclaimed in JSON).
With -vvv, inspect debug.log (raw response in the log).

2. Classify the cause

Symptom	Typical cause	Action
HTTP 200 for “user does not exist”	Soft 404	Move from `status_code` to `message` or `response_url`; add `absenceStrs` / narrow `presenseStrs`
Generic words match (`name`, `email`)	`presenseStrs` too broad	Remove generic markers; add profile-specific ones
Same HTML without JS	SPA / skeleton shell	Compare final URL and HTTP redirects (Maigret already follows redirects by default). If the browser shows extra routes (`/posts`, `/not-found`) only after JS, they will not appear to Maigret — try a public JSON/API endpoint for the same site if one exists. See Redirects and final URL and Picsart in `site-checks-guide.md`.
403 / “Log in” / guest-only	Auth or anti-bot required	`disabled: true`
reCAPTCHA / “Checking your browser”	Bot protection	Try a reasonable `User-Agent` in `headers`; else `errors` + UNKNOWN or `disabled`
Domain does not resolve / persistent timeout	Dead service	Remove entry only after confirming the domain is dead

3. Data edits

Update url / urlMain if needed (HTTPS redirects). Use optional urlProbe when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
For message: always tune string pairs so absenceStrs fire on “no user” pages and presenseStrs fire on real profiles without false absence hits.
Engine (engine, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
Keep status_code only if the response reliably differs by status code without soft 404.

4. Verify

maigret --self-check --site "SiteName" --db ... for touched entries.
make test before commit.

5. Code notes

process_site_result uses strict comparison to "status_code" for checkType (not a substring trick).
Empty presenseStrs with message means “presence always true”; a debug line is logged only at DEBUG level.

6. Development utilities

Quick reference for site check utilities. Full details: section 6 in site-checks-guide.md.

Command	Purpose
`python utils/site_check.py --site "X" --check-claimed`	Quick aiohttp comparison
`python utils/site_check.py --site "X" --maigret`	Test via Maigret checker
`python utils/site_check.py --site "X" --compare-methods`	Find aiohttp vs Maigret discrepancies
`python utils/site_check.py --site "X" --diagnose`	Full diagnosis with fix recommendations
`python utils/check_top_n.py --top 100`	Mass-check top 100 sites
`maigret --self-check --site "X"`	Self-check (reports only, no auto-disable)
`maigret --self-check --site "X" --auto-disable`	Self-check with auto-disable
`maigret --self-check --site "X" --diagnose`	Self-check with detailed diagnosis

7. Quick tips (lessons learned)

Practical observations from fixing top-ranked sites. Full details: section 7 in site-checks-guide.md.

Tip	Why it matters
API first	Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints.
`urlProbe` separates check from display	Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link.
aiohttp ≠ curl	Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly.
Use `debug.log`	Run with `-vvv` to see raw response. Warning messages alone can be misleading.
`status_code` for clean APIs	If API returns proper 404 for missing users, prefer `status_code` over `message`.
Migrate, don't delete	MSDN → Microsoft Learn: keep old entry disabled, create new one for current service.
Engine templates break silently	vBulletin `absenceStrs` failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first.
Search-by-author is unreliable	phpBB `search.php?author=` checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs.
Some sites always generate a page	Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`.
TLS fingerprinting degrades over time	Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists.
API endpoints bypass Cloudflare	Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain.
Inspect Network tab for POST APIs	Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them!
Strict JSON markers are bulletproof	When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations.
GraphQL supports GET too	hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL.
URL-encode braces for template safety	GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe` — `.format()` ignores percent-encoded chars.
Anti-bot bypass via simple UA	"Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic.
Rate-limit → `errors`, not `absenceStrs`	Rate-limit wording varies across API versions. If the phrasing doesn't match `absenceStrs` and `presenseStrs` is empty, the result is a false CLAIMED. Put all "can't answer right now" strings (rate limit, CAPTCHA, maintenance) in `errors` so the result is UNKNOWN.

8. Documentation maintenance

When you change Maigret, add search tools, or change check logic, keep this playbook, site-checks-guide.md, and (when applicable) the template in socid_extractor_improvements.log aligned. New log entries are append-only at the bottom of that file.

8.6 KiB Raw Blame History