8.6 KiB
Site checks — playbook (Maigret)
Short checklist for edits to maigret/resources/data.json and, when needed, maigret/checking.py. Full guide: site-checks-guide.md. Upstream extraction proposals: socid_extractor_improvements.log.
Documentation maintenance: whenever you improve Maigret, add search tooling, or change check logic, update both this file and site-checks-guide.md (see the “Documentation maintenance” section at the end of that file). When JSON API / socid_extractor logging rules change, update the template header in socid_extractor_improvements.log in the same change.
0. Standard checks (do alongside reproduce / classify)
- Public JSON API: always look for a stable JSON (or GraphQL JSON) profile endpoint (
/api/,.json, mobile-style URLs). When the API is more reliable than HTML, seturlProbeto that endpoint and keepurlas the human-readable profile link (e.g.https://picsart.com/u/{username}). If there is no separate profile URL, use the API asurlonly. Details:urlProbeand section 2.1 insite-checks-guide.md. socid_extractorlog (mandatory): if you find embedded user JSON in HTML or a standalone JSON profile API, append a dated entry (with example username) tosocid_extractor_improvements.log. Details: section 2.2 insite-checks-guide.md.
1. Reproduce
- Run a targeted check:
maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv - Compare an existing and a non-existent username (as
usernameClaimed/usernameUnclaimedin JSON). - With
-vvv, inspectdebug.log(raw response in the log).
2. Classify the cause
| Symptom | Typical cause | Action |
|---|---|---|
| HTTP 200 for “user does not exist” | Soft 404 | Move from status_code to message or response_url; add absenceStrs / narrow presenseStrs |
Generic words match (name, email) |
presenseStrs too broad |
Remove generic markers; add profile-specific ones |
| Same HTML without JS | SPA / skeleton shell | Compare final URL and HTTP redirects (Maigret already follows redirects by default). If the browser shows extra routes (/posts, /not-found) only after JS, they will not appear to Maigret — try a public JSON/API endpoint for the same site if one exists. See Redirects and final URL and Picsart in site-checks-guide.md. |
| 403 / “Log in” / guest-only | Auth or anti-bot required | disabled: true |
| reCAPTCHA / “Checking your browser” | Bot protection | Try a reasonable User-Agent in headers; else errors + UNKNOWN or disabled |
| Domain does not resolve / persistent timeout | Dead service | Remove entry only after confirming the domain is dead |
3. Data edits
- Update
url/urlMainif needed (HTTPS redirects). Use optionalurlProbewhen the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI). - For
message: always tune string pairs soabsenceStrsfire on “no user” pages andpresenseStrsfire on real profiles without false absence hits. - Engine (
engine, e.g. XenForo): override only differing fields in the site entry so other sites are not broken. - Keep
status_codeonly if the response reliably differs by status code without soft 404.
4. Verify
maigret --self-check --site "SiteName" --db ...for touched entries.make testbefore commit.
5. Code notes
process_site_resultuses strict comparison to"status_code"forcheckType(not a substring trick).- Empty
presenseStrswithmessagemeans “presence always true”; a debug line is logged only at DEBUG level.
6. Development utilities
Quick reference for site check utilities. Full details: section 6 in site-checks-guide.md.
| Command | Purpose |
|---|---|
python utils/site_check.py --site "X" --check-claimed |
Quick aiohttp comparison |
python utils/site_check.py --site "X" --maigret |
Test via Maigret checker |
python utils/site_check.py --site "X" --compare-methods |
Find aiohttp vs Maigret discrepancies |
python utils/site_check.py --site "X" --diagnose |
Full diagnosis with fix recommendations |
python utils/check_top_n.py --top 100 |
Mass-check top 100 sites |
maigret --self-check --site "X" |
Self-check (reports only, no auto-disable) |
maigret --self-check --site "X" --auto-disable |
Self-check with auto-disable |
maigret --self-check --site "X" --diagnose |
Self-check with detailed diagnosis |
7. Quick tips (lessons learned)
Practical observations from fixing top-ranked sites. Full details: section 7 in site-checks-guide.md.
| Tip | Why it matters |
|---|---|
| API first | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check /api/, .json endpoints. |
urlProbe separates check from display |
Check via API, show human URL in reports. Example: Reddit API → www.reddit.com/user/ link. |
| aiohttp ≠ curl | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
Use debug.log |
Run with -vvv to see raw response. Warning messages alone can be misleading. |
status_code for clean APIs |
If API returns proper 404 for missing users, prefer status_code over message. |
| Migrate, don't delete | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |
| Engine templates break silently | vBulletin absenceStrs failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first. |
| Search-by-author is unreliable | phpBB search.php?author= checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs. |
| Some sites always generate a page | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — disabled: true. |
| TLS fingerprinting degrades over time | Kaggle's custom User-Agent fix stopped working — aiohttp now gets 404 for both usernames. Accept disabled: true when no API exists. |
| API endpoints bypass Cloudflare | Fandom api.php and Substack /api/v1/ returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| Inspect Network tab for POST APIs | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated POST endpoints for username checks. Maigret supports this natively: define "request_method": "POST" and "request_payload": {"username": "{username}"} in data.json to query them! |
| Strict JSON markers are bulletproof | When probing APIs, use checkType: "message" with exact JSON substrings (like "{\"taken\": false}"). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| GraphQL supports GET too | hashnode GraphQL works via GET ?query=... (URL-encoded). You can use either native POST payloads or GET urlProbe for GraphQL. |
| URL-encode braces for template safety | GraphQL {...} conflicts with Maigret's {username}. Use %7B/%7D for literal braces in urlProbe — .format() ignores percent-encoded chars. |
| Anti-bot bypass via simple UA | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding "headers": {"User-Agent": "python-requests/2.25.1"} circumvents the scraper filter and restores default detection logic. |
Rate-limit → errors, not absenceStrs |
Rate-limit wording varies across API versions. If the phrasing doesn't match absenceStrs and presenseStrs is empty, the result is a false CLAIMED. Put all "can't answer right now" strings (rate limit, CAPTCHA, maintenance) in errors so the result is UNKNOWN. |
8. Documentation maintenance
When you change Maigret, add search tools, or change check logic, keep this playbook, site-checks-guide.md, and (when applicable) the template in socid_extractor_improvements.log aligned. New log entries are append-only at the bottom of that file.