How to contribute

Hey! I'm really glad you're reading this. Maigret supports a large number of sites, and keeping every one of those checks working is hard. That's why every fix matters.

Code of Conduct

Please read and follow the Code of Conduct to foster a welcoming and inclusive community.

Local setup

Install Maigret with development dependencies via Poetry:

git clone https://github.com/soxoj/maigret && cd maigret
poetry install --with dev

Activate the repo's git hooks once after cloning:

git config --local core.hooksPath .githooks/

The pre-commit hook does two things every time you commit changes that touch the site database:

  • regenerates the database signature maigret/resources/db_meta.json (used to detect compatible auto-updates), and
  • regenerates sites.md (the human-readable list of supported sites with per-engine statistics).

It also auto-stages the regenerated files so they land in the same commit as your edits. Always run git commit from inside the repo so the hook can fire — without it, your PR will land with a stale signature and a stale sites.md, and database auto-update will misbehave for users on your branch.

How to contribute

There are two main ways to help.

1. Add a new site

Beginner. Use the --submit mode — Maigret takes a single existing-account URL, auto-detects the site engine, picks presenseStrs / absenceStrs, and offers to add the entry:

maigret --submit https://example.com/users/alice

--submit works well when the site has clean status codes and no anti-bot protection. It will not discover a public JSON API (urlProbe), classify protection (tls_fingerprint, cf_js_challenge, ip_reputation, ...), or recognise SPA / soft-404 pages. For those, fall back to manual editing.

Advanced. Edit maigret/resources/data.json by hand — see Editing data.json safely below. There is also an add-a-site issue template if you want a maintainer to do it for you.

2. Fix existing sites

The most useful work in this project is keeping checks accurate over time. Sites change layout, switch engines, add Cloudflare, redirect to login walls — every fix is welcome.

Where to start. Good candidates:

  • Issues with the false-positive label, especially those opened automatically by the Telegram bot.
  • Sites currently disabled: true in data.json — many were disabled on a transient symptom and have since healed.
  • Sites for which --self-check --diagnose reports a problem.
  • A focused audit of one engine (vBulletin, XenForo, phpBB, Discourse, Flarum, ...). Engine-wide breakage usually has a single root cause and several sites can be fixed in one PR.

Diagnose with built-in tools.

By default, Maigret skips entries with disabled: true in every mode (--self-check, --site, plain search). Whenever your target is a disabled site — diagnosing it, validating a fix, running the two-filter check below — pass --use-disabled-sites explicitly. Without the flag, the site is silently dropped from the run and you get an empty result that looks like "everything's fine".

  • Per-site diagnosis with recommendations:

    maigret --self-check --site "SiteName" --diagnose
    # add --use-disabled-sites if the entry is currently disabled
    

    Without --auto-disable, this only reports — it never edits the database. Add --auto-disable only when you really want to write the result back.

  • Single-site comparison of claimed vs unclaimed responses (status, markers, headers):

    python utils/site_check.py --site "SiteName" --diagnose
    python utils/site_check.py --site "SiteName" --compare-methods   # raw aiohttp vs Maigret's checker
    
  • Mass check of top-N sites:

    python utils/check_top_n.py --top 100 --only-broken
    

Understanding checkType

Each site entry uses one of three checkType modes to decide whether a profile exists. Picking the right one for your site is the most important data-modeling decision in data.json:

  • message (most common, most flexible) — Maigret fetches the page and inspects the HTML body. The profile is reported as found when the body contains at least one substring from presenseStrs and none of the substrings from absenceStrs. Pick narrow, profile-specific markers: a <title> fragment unique to profile pages, a CSS class only rendered on profiles (e.g. "profile-card"), or a JSON field name from an embedded data blob ("displayName":). Avoid generic words (name, email) and HTML/ARIA boilerplate (polite, alert, navigation, status) — they match on every page including error and anti-bot challenge pages, and produce false positives. If the marker contains non-ASCII text, double-check the page is UTF-8 (some legacy sites serve KOI8-R or Windows-1251, in which case byte-level matching silently fails — prefer ASCII markers or a JSON API).

  • status_code — Maigret only looks at the HTTP status code; 2xx means "found", anything else means "not found". Use this only when the site reliably returns proper status codes — typically clean JSON APIs that return HTTP 200 for real users and HTTP 404 for missing ones. Don't use it for sites that return HTTP 200 with a soft "user not found" page (this is the single most common cause of false-positive checks).

  • response_url — Maigret follows the redirect chain and inspects the final URL. Useful when the server reliably redirects missing-user URLs to a different path (e.g. /login, /404, the homepage) while existing-user URLs stay put. For most sites message is a better fit; reach for response_url only when a redirect-based signal is genuinely the most stable one.
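
The three rules above can be sketched as one small decision function. This is a simplified illustration of the behaviour described here, not Maigret's actual checker: the function name, its parameters, and the "final URL equals requested URL" test for response_url are assumptions made for the example.

```python
from typing import Sequence


def profile_found(
    check_type: str,
    status: int,
    body: str,
    requested_url: str,
    final_url: str,
    presence: Sequence[str] = (),
    absence: Sequence[str] = (),
) -> bool:
    """Hypothetical helper: apply one checkType rule to an already-fetched response."""
    if check_type == "message":
        # Found: at least one presence marker and no absence marker in the body.
        # Assumption: an empty presence list never matches in this sketch.
        has_presence = any(marker in body for marker in presence)
        has_absence = any(marker in body for marker in absence)
        return has_presence and not has_absence
    if check_type == "status_code":
        # Only the HTTP status matters: 2xx means "found".
        return 200 <= status < 300
    if check_type == "response_url":
        # Assumption: "stays put" is modelled as the final URL being unchanged.
        return final_url == requested_url
    raise ValueError(f"unknown checkType: {check_type}")
```

Note how a soft-404 page (HTTP 200 plus a "user not found" body) passes the status_code branch but fails the message branch once a suitable absence marker is listed.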

urlProbe (optional, works with any checkType). If the most reliable signal lives at a different URL than the public profile page — a JSON API, a GraphQL endpoint, a mobile-app route — set urlProbe to that URL. Maigret fetches urlProbe for the check, but reports continue to show the human-readable url so users see a profile link they can click. Examples: GitHub uses https://github.com/{username} as url and https://api.github.com/users/{username} as urlProbe; Picsart uses the web profile as url and https://api.picsart.com/users/show/{username}.json as urlProbe. A clean public API is almost always more stable than parsing HTML — it's worth probing for one before settling on message against the SPA shell.
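
As a rough sketch of the url / urlProbe split, the helper below picks the URL to fetch and the URL to show. The function name is hypothetical and the entry holds only the two fields discussed here; a real data.json entry carries more keys.

```python
from typing import Dict, Tuple


def probe_and_report_urls(entry: Dict[str, str], username: str) -> Tuple[str, str]:
    """Hypothetical helper: return (URL to fetch for the check, URL to show in reports)."""
    report_url = entry["url"].format(username=username)
    # Fall back to the public profile URL when no urlProbe is configured.
    probe_url = entry.get("urlProbe", entry["url"]).format(username=username)
    return probe_url, report_url


# The GitHub URLs quoted above; every other field of the entry is omitted.
github = {
    "url": "https://github.com/{username}",
    "urlProbe": "https://api.github.com/users/{username}",
}

probe, report = probe_and_report_urls(github, "alice")
```

Here `probe` points at the API while `report` stays the clickable profile link.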

Errors vs absence. Anything that means "the server can't answer right now" — rate limits, captchas, "Checking your browser", "unusual traffic", maintenance pages — belongs in errors (mapping the substring to a human-readable error string), not in absenceStrs. The errors mechanism produces an UNKNOWN result instead of a false CLAIMED or false AVAILABLE.
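
A minimal sketch of why the order matters: error markers are checked before presence and absence markers, so a challenge page can never be mistaken for a missing profile. The function name and the sample error message are assumptions; only the three result names come from the text above.

```python
from typing import Dict, Optional, Sequence, Tuple


def classify(
    body: str,
    errors: Dict[str, str],
    presence: Sequence[str],
    absence: Sequence[str],
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (result, error message or None)."""
    # "The server can't answer right now" wins over everything else.
    for marker, message in errors.items():
        if marker in body:
            return "UNKNOWN", message
    has_presence = any(marker in body for marker in presence)
    has_absence = any(marker in body for marker in absence)
    if has_presence and not has_absence:
        return "CLAIMED", None
    return "AVAILABLE", None


# The error text on the right-hand side is made up for the example.
errors = {"Checking your browser": "Anti-bot challenge"}
result = classify("Checking your browser before accessing", errors, ["profile-card"], ["User not found"])
```

If "Checking your browser" were listed in absenceStrs instead, the same page would be reported as AVAILABLE, which is a false result.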

Full reference for checkType, urlProbe, engine, and the rest of the data.json schema is in the development guide, section How to fix false-positives.

Editing data.json safely

data.json is a single ~36 000-line JSON file. Make surgical, line-level edits only. Never rewrite it by reading it into a Python dict and dumping it back — json.load + json.dump reformats every entry and produces an unreviewable 70 000-line diff. The same rule applies to any helper script that touches the file: it must preserve the original formatting of untouched entries.
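
To see why a load-and-dump round trip is a problem, the sketch below counts changed lines for the same logical edit done two ways. The sample document, its inline-array formatting, and the helper name are invented for illustration and are not taken from the real data.json.

```python
import difflib
import json

# Made-up miniature database; the real file's layout may differ.
ORIGINAL = """{
    "sites": {
        "ExampleSite": {
            "tags": ["blog", "us"],
            "checkType": "status_code",
            "disabled": true,
            "url": "https://example.com/users/{username}"
        }
    }
}
"""


def changed_lines(before: str, after: str) -> int:
    """Count added and removed lines in a zero-context unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="", n=0)
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Surgical edit: drop exactly one line and leave everything else byte-identical.
surgical = "\n".join(
    line for line in ORIGINAL.splitlines() if line.strip() != '"disabled": true,'
) + "\n"

# Round trip: the same logical change, but the serializer re-wraps untouched lines.
data = json.loads(ORIGINAL)
del data["sites"]["ExampleSite"]["disabled"]
roundtrip = json.dumps(data, indent=4) + "\n"

surgical_count = changed_lines(ORIGINAL, surgical)
roundtrip_count = changed_lines(ORIGINAL, roundtrip)
```

Both results parse to the same data, yet the round trip also rewrites the untouched tags line. A helper script can run a check like `changed_lines` on its own output to confirm it touched only the lines it meant to.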

If your editor reformats JSON on save, disable that for data.json before editing.

Two-filter validation when re-enabling a site

Removing disabled: true requires two independent checks. --self-check alone is not sufficient — it only verifies the two specific usernames recorded in the entry, so a site that returns CLAIMED for any arbitrary username will still pass the self-check.

# Filter 1: self-check on the recorded claimed/unclaimed pair
maigret --self-check --site "SiteName" --use-disabled-sites

# Filter 2: live probe with a clearly fake username — nothing should match
maigret noonewouldeverusethis7 --site "SiteName" --use-disabled-sites --print-not-found

Both filters need --use-disabled-sites, since a candidate for re-enable still has disabled: true in the working tree until your edit lands. If you forget the flag, both commands silently no-op.

If the second command reports [+] for the fake username, the check is a false positive — do not enable. This step takes seconds and is non-negotiable for any re-enable PR.
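
If you re-enable sites often, the two filters can be scripted. The sketch below only builds the two command lines shown above and checks captured output for the [+] marker; the function names are hypothetical, and actually running the commands is left to the caller.

```python
from typing import List


def two_filter_commands(site: str, fake_username: str = "noonewouldeverusethis7") -> List[List[str]]:
    """Hypothetical helper: argument lists for Filter 1 and Filter 2."""
    self_check = ["maigret", "--self-check", "--site", site, "--use-disabled-sites"]
    fake_probe = [
        "maigret", fake_username,
        "--site", site,
        "--use-disabled-sites",
        "--print-not-found",
    ]
    return [self_check, fake_probe]


def fake_probe_is_false_positive(output: str) -> bool:
    """A [+] line for a clearly fake username means the check cannot be trusted."""
    return "[+]" in output


commands = two_filter_commands("SiteName")
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`, with the second command's stdout fed to `fake_probe_is_false_positive`.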

Site naming, tags, and protection

  • Site naming conventions (Title Case by default, brand-specific exceptions, no www. prefix, etc.) are documented in the development guide, section Site naming conventions.

  • Country tags (us, ru, kr, ...) attribute an account to a country of origin or residence — they're not a traffic-share label. Global services (GitHub, YouTube, Reddit) get no country tag; regional services (VK → ru, Naver → kr) must have one. Don't assign a country tag from Alexa/SimilarWeb audience stats.

  • Category tags must come from the canonical "tags" array at the bottom of data.json. The test_tags_validity test fails if you introduce an unregistered tag. If no existing tag fits well, either pick the closest reasonable match or add the new tag to the canonical list as an explicit, separate change. Don't use platform names (writefreely, pixelfed) — use category names (blog, photo).

  • Protection tags (tls_fingerprint, ip_reputation, cf_js_challenge, cf_firewall, aws_waf_js_challenge, ddos_guard_challenge, js_challenge, custom_bot_protection) describe the kind of anti-bot protection a site uses. One of them — tls_fingerprint — is load-bearing: when a site fingerprints the TLS handshake (JA3/JA4) and blocks non-browser clients, tagging it with tls_fingerprint makes Maigret automatically swap its HTTP client to curl_cffi with Chrome browser emulation, which is usually enough to pass. The site stays enabled — no disabled: true is needed. Examples: Instagram, NPM, Codepen, Kickstarter, Letterboxd. The remaining tags are documentation-only and pair with disabled: true until a per-provider solver is integrated. The full taxonomy and the rules for picking the right tag are in the development guide, section protection (site protection tracking). Don't add a protection tag without empirical evidence it applies in the current environment.
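
The category-tag rule can be checked before you push with something like the sketch below. It is not the project's test_tags_validity test; it assumes a top-level "sites" mapping next to the canonical "tags" array, and it assumes any two-letter lowercase tag is a country tag and skips it.

```python
from typing import Dict, List


def unregistered_tags(db: dict) -> Dict[str, List[str]]:
    """Hypothetical helper: map site name to tags missing from the canonical list."""
    canonical = set(db["tags"])
    problems: Dict[str, List[str]] = {}
    for name, site in db["sites"].items():
        unknown = [
            tag
            for tag in site.get("tags", [])
            # Assumption: two-letter lowercase tags are country codes.
            if tag not in canonical and not (len(tag) == 2 and tag.islower())
        ]
        if unknown:
            problems[name] = unknown
    return problems


# Made-up miniature database for the example.
sample_db = {
    "sites": {
        "Example Blog": {"tags": ["blog", "us"]},
        "Example Photos": {"tags": ["pixelfed"]},
    },
    "tags": ["blog", "photo"],
}

problems = unregistered_tags(sample_db)
```

In this sample the platform name pixelfed is flagged, while replacing it with the category photo would pass.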

Testing

CI runs the same checks on every PR, but please run them locally first:

make format     # auto-format with black
make lint       # flake8 / mypy
make test       # pytest with coverage

Submitting changes

Open a GitHub PR against main. Always write a clear commit message:

$ git commit -m "A brief summary of the commit
>
> A paragraph describing what changed and its impact."

One-line messages are fine for small changes; bigger changes should explain the why in the body.

Coding conventions

General

  • Follow PEP 8 for Python.
  • Make sure all tests pass before opening the PR.

Code style

  • Indentation: 4 spaces per level.
  • Imports: standard library first, third-party next, project-local last; group them logically.

Naming

  • Variables and functions: snake_case.
  • Classes: CamelCase.
  • Constants: UPPER_CASE.
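
A short illustration of the conventions above. Every name in it is hypothetical and exists only to show import grouping and naming style.

```python
# Standard library imports come first.
import re
from typing import Optional

# Third-party imports would form the next group.
# Project-local imports would form the last group.

DEFAULT_TIMEOUT_SECONDS = 30  # constant: UPPER_CASE


class SiteEntry:  # class: CamelCase
    def __init__(self, name: str, url: str) -> None:
        self.name = name
        self.url = url

    def profile_url(self, username: str) -> str:  # method: snake_case
        return self.url.format(username=username)


def extract_title(html: str) -> Optional[str]:  # function: snake_case
    """Return the text of the first <title> element, if any."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()
```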

Start reading the code and you'll get the hang of it.

Getting help

If you're stuck on something — a check that won't behave, a setup error, an unclear field in data.json, or just want to discuss an approach before opening a PR — there are two places to ask:

  • GitHub Discussions — searchable, public, good for technical questions and design ideas. Prefer this for anything other contributors might run into too.
  • Telegram: @soxoj — direct channel to the maintainer, good for quick questions and informal chat.

Bug reports and feature requests still belong in GitHub Issues.

License

Maigret is MIT-licensed; by submitting a contribution you agree to publish it under the same license. There is no CLA.