Files
maigret/LLM/site-checks-playbook.md
T

8.6 KiB

Site checks — playbook (Maigret)

Short checklist for edits to maigret/resources/data.json and, when needed, maigret/checking.py. Full guide: site-checks-guide.md. Upstream extraction proposals: socid_extractor_improvements.log.

Documentation maintenance: whenever you improve Maigret, add search tooling, or change check logic, update both this file and site-checks-guide.md (see the “Documentation maintenance” section at the end of that file). When JSON API / socid_extractor logging rules change, update the template header in socid_extractor_improvements.log in the same change.

0. Standard checks (do alongside reproduce / classify)

  • Public JSON API: always look for a stable JSON (or GraphQL JSON) profile endpoint (/api/, .json, mobile-style URLs). When the API is more reliable than HTML, set urlProbe to that endpoint and keep url as the human-readable profile link (e.g. https://picsart.com/u/{username}). If there is no separate profile URL, use the API as url only. Details: urlProbe and section 2.1 in site-checks-guide.md.
  • socid_extractor log (mandatory): if you find embedded user JSON in HTML or a standalone JSON profile API, append a dated entry (with example username) to socid_extractor_improvements.log. Details: section 2.2 in site-checks-guide.md.

1. Reproduce

  • Run a targeted check:
    maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv
  • Compare an existing and a non-existent username (as usernameClaimed / usernameUnclaimed in JSON).
  • With -vvv, inspect debug.log (raw response in the log).

2. Classify the cause

Symptom Typical cause Action
HTTP 200 for “user does not exist” Soft 404 Move from status_code to message or response_url; add absenceStrs / narrow presenseStrs
Generic words match (name, email) presenseStrs too broad Remove generic markers; add profile-specific ones
Same HTML without JS SPA / skeleton shell Compare final URL and HTTP redirects (Maigret already follows redirects by default). If the browser shows extra routes (/posts, /not-found) only after JS, they will not appear to Maigret — try a public JSON/API endpoint for the same site if one exists. See Redirects and final URL and Picsart in site-checks-guide.md.
403 / “Log in” / guest-only Auth or anti-bot required disabled: true
reCAPTCHA / “Checking your browser” Bot protection Try a reasonable User-Agent in headers; else errors + UNKNOWN or disabled
Domain does not resolve / persistent timeout Dead service Remove entry only after confirming the domain is dead

3. Data edits

  1. Update url / urlMain if needed (HTTPS redirects). Use optional urlProbe when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
  2. For message: always tune string pairs so absenceStrs fire on “no user” pages and presenseStrs fire on real profiles without false absence hits.
  3. Engine (engine, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
  4. Keep status_code only if the response reliably differs by status code without soft 404.

4. Verify

  • maigret --self-check --site "SiteName" --db ... for touched entries.
  • make test before commit.

5. Code notes

  • process_site_result uses strict comparison to "status_code" for checkType (not a substring trick).
  • Empty presenseStrs with message means “presence always true”; a debug line is logged only at DEBUG level.

6. Development utilities

Quick reference for site check utilities. Full details: section 6 in site-checks-guide.md.

Command Purpose
python utils/site_check.py --site "X" --check-claimed Quick aiohttp comparison
python utils/site_check.py --site "X" --maigret Test via Maigret checker
python utils/site_check.py --site "X" --compare-methods Find aiohttp vs Maigret discrepancies
python utils/site_check.py --site "X" --diagnose Full diagnosis with fix recommendations
python utils/check_top_n.py --top 100 Mass-check top 100 sites
maigret --self-check --site "X" Self-check (reports only, no auto-disable)
maigret --self-check --site "X" --auto-disable Self-check with auto-disable
maigret --self-check --site "X" --diagnose Self-check with detailed diagnosis

7. Quick tips (lessons learned)

Practical observations from fixing top-ranked sites. Full details: section 7 in site-checks-guide.md.

Tip Why it matters
API first Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check /api/, .json endpoints.
urlProbe separates check from display Check via API, show human URL in reports. Example: Reddit API → www.reddit.com/user/ link.
aiohttp ≠ curl Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly.
Use debug.log Run with -vvv to see raw response. Warning messages alone can be misleading.
status_code for clean APIs If API returns proper 404 for missing users, prefer status_code over message.
Migrate, don't delete MSDN → Microsoft Learn: keep old entry disabled, create new one for current service.
Engine templates break silently vBulletin absenceStrs failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first.
Search-by-author is unreliable phpBB search.php?author= checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs.
Some sites always generate a page Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — disabled: true.
TLS fingerprinting degrades over time Kaggle's custom User-Agent fix stopped working — aiohttp now gets 404 for both usernames. Accept disabled: true when no API exists.
API endpoints bypass Cloudflare Fandom api.php and Substack /api/v1/ returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain.
Inspect Network tab for POST APIs Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated POST endpoints for username checks. Maigret supports this natively: define "request_method": "POST" and "request_payload": {"username": "{username}"} in data.json to query them!
Strict JSON markers are bulletproof When probing APIs, use checkType: "message" with exact JSON substrings (like "{\"taken\": false}"). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations.
GraphQL supports GET too hashnode GraphQL works via GET ?query=... (URL-encoded). You can use either native POST payloads or GET urlProbe for GraphQL.
URL-encode braces for template safety GraphQL {...} conflicts with Maigret's {username}. Use %7B/%7D for literal braces in urlProbe.format() ignores percent-encoded chars.
Anti-bot bypass via simple UA "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding "headers": {"User-Agent": "python-requests/2.25.1"} circumvents the scraper filter and restores default detection logic.
Rate-limit → errors, not absenceStrs Rate-limit wording varies across API versions. If the phrasing doesn't match absenceStrs and presenseStrs is empty, the result is a false CLAIMED. Put all "can't answer right now" strings (rate limit, CAPTCHA, maintenance) in errors so the result is UNKNOWN.

8. Documentation maintenance

When you change Maigret, add search tools, or change check logic, keep this playbook, site-checks-guide.md, and (when applicable) the template in socid_extractor_improvements.log aligned. New log entries are append-only at the bottom of that file.