Files
maigret/LLM/site-checks-playbook.md
T
Soxoj 5fa86187f5 feat(sites): fix false positives: disable 74 broken sites, fix 8 with API probes and better markers (#2302)
- Disable 74 sites: Cloudflare/captcha blocks, identical responses,
    dead domains, vBulletin/phpBB engine failures
  - Fix Roblox, Salon24.pl, Planetaexcel → status_code (clear 404 signal)
  - Fix en.brickimedia.org → message with "noarticletext" absenceStr
  - Fix Arduino → narrower title-based presenseStrs/absenceStrs
  - Re-enable Fandom (3 wikis) via MediaWiki api.php urlProbe
  - Re-enable Substack via /api/v1/user/{}/public_profile urlProbe
  - Re-enable hashnode via GraphQL GET urlProbe (URL-encoded query)
  - Document lessons: engine template drift, search-by-author fragility,
    always-200 sites, TLS degradation, API bypassing Cloudflare,
    GraphQL GET support, URL-encoding for template safety
2026-04-08 00:48:36 +02:00

7.4 KiB

Site checks — playbook (Maigret)

Short checklist for edits to maigret/resources/data.json and, when needed, maigret/checking.py. Full guide: site-checks-guide.md. Upstream extraction proposals: socid_extractor_improvements.log.

Documentation maintenance: whenever you improve Maigret, add search tooling, or change check logic, update both this file and site-checks-guide.md (see the “Documentation maintenance” section at the end of that file). When JSON API / socid_extractor logging rules change, update the template header in socid_extractor_improvements.log in the same change.

0. Standard checks (do alongside reproduce / classify)

  • Public JSON API: always look for a stable JSON (or GraphQL JSON) profile endpoint (/api/, .json, mobile-style URLs). When the API is more reliable than HTML, set urlProbe to that endpoint and keep url as the human-readable profile link (e.g. https://picsart.com/u/{username}). If there is no separate profile URL, use the API as url only. Details: urlProbe and section 2.1 in site-checks-guide.md.
  • socid_extractor log (mandatory): if you find embedded user JSON in HTML or a standalone JSON profile API, append a dated entry (with example username) to socid_extractor_improvements.log. Details: section 2.2 in site-checks-guide.md.

1. Reproduce

  • Run a targeted check:
    maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv
  • Compare an existing and a non-existent username (as usernameClaimed / usernameUnclaimed in JSON).
  • With -vvv, inspect debug.log (raw response in the log).

2. Classify the cause

Symptom Typical cause Action
HTTP 200 for “user does not exist” Soft 404 Move from status_code to message or response_url; add absenceStrs / narrow presenseStrs
Generic words match (name, email) presenseStrs too broad Remove generic markers; add profile-specific ones
Same HTML without JS SPA / skeleton shell Compare final URL and HTTP redirects (Maigret already follows redirects by default). If the browser shows extra routes (/posts, /not-found) only after JS, they will not appear to Maigret — try a public JSON/API endpoint for the same site if one exists. See Redirects and final URL and Picsart in site-checks-guide.md.
403 / “Log in” / guest-only Auth or anti-bot required disabled: true
reCAPTCHA / “Checking your browser” Bot protection Try a reasonable User-Agent in headers; else errors + UNKNOWN or disabled
Domain does not resolve / persistent timeout Dead service Remove entry only after confirming the domain is dead

3. Data edits

  1. Update url / urlMain if needed (HTTPS redirects). Use optional urlProbe when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
  2. For message: always tune string pairs so absenceStrs fire on “no user” pages and presenseStrs fire on real profiles without false absence hits.
  3. Engine (engine, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
  4. Keep status_code only if the response reliably differs by status code without soft 404.

4. Verify

  • maigret --self-check --site "SiteName" --db ... for touched entries.
  • make test before commit.

5. Code notes

  • process_site_result uses strict comparison to "status_code" for checkType (not a substring trick).
  • Empty presenseStrs with message means “presence always true”; a debug line is logged only at DEBUG level.

6. Development utilities

Quick reference for site check utilities. Full details: section 6 in site-checks-guide.md.

Command Purpose
python utils/site_check.py --site "X" --check-claimed Quick aiohttp comparison
python utils/site_check.py --site "X" --maigret Test via Maigret checker
python utils/site_check.py --site "X" --compare-methods Find aiohttp vs Maigret discrepancies
python utils/site_check.py --site "X" --diagnose Full diagnosis with fix recommendations
python utils/check_top_n.py --top 100 Mass-check top 100 sites
maigret --self-check --site "X" Self-check (reports only, no auto-disable)
maigret --self-check --site "X" --auto-disable Self-check with auto-disable
maigret --self-check --site "X" --diagnose Self-check with detailed diagnosis

7. Quick tips (lessons learned)

Practical observations from fixing top-ranked sites. Full details: section 7 in site-checks-guide.md.

Tip Why it matters
API first Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check /api/, .json endpoints.
urlProbe separates check from display Check via API, show human URL in reports. Example: Reddit API → www.reddit.com/user/ link.
aiohttp ≠ curl Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly.
Use debug.log Run with -vvv to see raw response. Warning messages alone can be misleading.
status_code for clean APIs If API returns proper 404 for missing users, prefer status_code over message.
Migrate, don't delete MSDN → Microsoft Learn: keep old entry disabled, create new one for current service.
Engine templates break silently vBulletin absenceStrs failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first.
Search-by-author is unreliable phpBB search.php?author= checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs.
Some sites always generate a page Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — disabled: true.
TLS fingerprinting degrades over time Kaggle's custom User-Agent fix stopped working — aiohttp now gets 404 for both usernames. Accept disabled: true when no API exists.
API endpoints bypass Cloudflare Fandom api.php and Substack /api/v1/ returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain.
GraphQL supports GET too hashnode GraphQL works via GET ?query=... (URL-encoded). Don't assume POST-only — Maigret can use GET urlProbe for GraphQL.
URL-encode braces for template safety GraphQL {...} conflicts with Maigret's {username}. Use %7B/%7D for literal braces in urlProbe.format() ignores percent-encoded chars.

8. Documentation maintenance

When you change Maigret, add search tools, or change check logic, keep this playbook, site-checks-guide.md, and (when applicable) the template in socid_extractor_improvements.log aligned. New log entries are append-only at the bottom of that file.