Compare commits

...

75 Commits

Author SHA1 Message Date
Soxoj 5830c9ce72 Add response time measurement 2026-04-21 02:02:24 +02:00
Soxoj d6905a8fd8 Fix site checks: 4 fixed, 14 → ip_reputation, 8 disabled, 5 dead deleted (#2537) 2026-04-21 00:40:24 +02:00
Soxoj 7d216638fa fix site checks: 14 sites → ip_reputation, 7 disabled, 5 dead deleted (#2536) 2026-04-20 23:51:18 +02:00
Soxoj fb71f26fd0 Fix site checks: recover 6 CF sites via tls_fingerprint, 500px GraphQL, delete 4 dead domains (#2535) 2026-04-20 22:41:51 +02:00
dependabot[bot] 621b104523 build(deps): bump lxml from 6.0.4 to 6.1.0 (#2533)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.4 to 6.1.0.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.4...lxml-6.1.0)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 20:48:47 +02:00
Soxoj 37ce4fe728 Update of Readme and documentation (#2514)
* Big readme update

* Readme and documentation update

* Readme structure update

* Small fixes

* Changelog update
2026-04-17 17:42:36 +02:00
Soxoj f74f82ee13 Fixed: Hack MD, DailyKos, Mywed, WikimapiaSearch, TikTok Online Viewer (#2526, #2522, #2523, #2500, #2496); Disabled: Radiokot, Lurkmore, Mylespaul, AppleDiscussions, Loveplanet (#2524, #2511, #2498) (#2528) 2026-04-17 17:04:50 +02:00
dependabot[bot] 7e6d70a680 build(deps): bump pypdf from 6.10.0 to 6.10.2 (#2527)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.10.0 to 6.10.2.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.10.0...6.10.2)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.10.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-17 15:00:33 +02:00
dependabot[bot] e900d4a853 build(deps): bump chardet from 7.4.2 to 7.4.3 (#2521)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.2 to 7.4.3.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.2...7.4.3)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-16 12:05:44 +02:00
dependabot[bot] 9ee4eb9b69 build(deps): bump pillow from 12.1.1 to 12.2.0 (#2520)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 12.1.1 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/12.1.1...12.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-14 11:01:45 +02:00
dependabot[bot] 53f21eda98 build(deps-dev): bump mypy from 1.20.0 to 1.20.1 (#2518)
Bumps [mypy](https://github.com/python/mypy) from 1.20.0 to 1.20.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.20.0...v1.20.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.20.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 16:47:55 +02:00
dependabot[bot] 1a0f36ffb6 build(deps): bump chardet from 7.4.1 to 7.4.2 (#2517)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.1 to 7.4.2.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.1...7.4.2)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 15:15:41 +02:00
dependabot[bot] 14114e681c build(deps): bump lxml from 6.0.3 to 6.0.4 (#2519)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.3 to 6.0.4.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.3...lxml-6.0.4)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.0.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 15:15:17 +02:00
dependabot[bot] bc0649e6a8 build(deps-dev): bump tuna from 0.5.11 to 0.5.13 (#2516)
Bumps [tuna](https://github.com/nschloe/tuna) from 0.5.11 to 0.5.13.
- [Release notes](https://github.com/nschloe/tuna/releases)
- [Commits](https://github.com/nschloe/tuna/compare/v0.5.11...v0.5.13)

---
updated-dependencies:
- dependency-name: tuna
  dependency-version: 0.5.13
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 11:36:27 +02:00
Soxoj 8267367bed Support Python 3.14 in tests (#2515) 2026-04-11 15:11:09 +02:00
Michael Sargis 28cb6c9ffb Remove duplicate attribute initialization in SimpleAiohttpChecker.__init__ (#2513)
self.allow_redirects and self.timeout were each initialized twice in
SimpleAiohttpChecker.__init__, which is redundant code.

Co-authored-by: zocomputer <help@zocomputer.com>
2026-04-11 10:35:56 +02:00
dependabot[bot] 7a31328325 build(deps): bump pypdf from 6.9.2 to 6.10.0 (#2512)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.9.2 to 6.10.0.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.9.2...6.10.0)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.10.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-11 00:14:13 +02:00
Soxoj 7fd9bb3692 Update workflow to trigger on published releases (#2508) 2026-04-10 12:34:00 +02:00
Soxoj 385f9f5bb3 Bump to 0.6.0 (#2506) 2026-04-10 12:30:43 +02:00
Soxoj dc8751ac55 Added 3 sites, fixed 6, disabled 8 (#2505) 2026-04-10 12:26:41 +02:00
Copilot 9303b1686d Disable Kinja.com site check (#2503) 2026-04-10 12:16:28 +02:00
dependabot[bot] aa80bd4232 build(deps): bump lxml from 6.0.2 to 6.0.3 (#2501)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.2 to 6.0.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.2...lxml-6.0.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.0.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 11:08:04 +02:00
dependabot[bot] f5c4b1c35d build(deps): bump socid-extractor from 0.0.27 to 0.0.28 (#2502)
Bumps [socid-extractor](https://github.com/soxoj/socid-extractor) from 0.0.27 to 0.0.28.
- [Release notes](https://github.com/soxoj/socid-extractor/releases)
- [Changelog](https://github.com/soxoj/socid-extractor/blob/master/CHANGELOG.md)
- [Commits](https://github.com/soxoj/socid-extractor/compare/0.0.27...v0.0.28)

---
updated-dependencies:
- dependency-name: socid-extractor
  dependency-version: 0.0.28
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 11:05:54 +02:00
Soxoj 5e24117e93 Fix false positives (#2499)
* Re-disable 29 false positives from #2478
2026-04-09 17:48:45 +02:00
Soxoj 777e503e30 Re-enable 69 stale-disabled sites validated via self-check (#2478)
Total: 2539 → 2608 enabled sites (+69).
2026-04-09 12:27:48 +02:00
dependabot[bot] c222c96aeb build(deps): bump platformdirs from 4.9.4 to 4.9.6 (#2477)
Bumps [platformdirs](https://github.com/tox-dev/platformdirs) from 4.9.4 to 4.9.6.
- [Release notes](https://github.com/tox-dev/platformdirs/releases)
- [Changelog](https://github.com/tox-dev/platformdirs/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/platformdirs/compare/4.9.4...4.9.6)

---
updated-dependencies:
- dependency-name: platformdirs
  dependency-version: 4.9.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-09 10:05:17 +02:00
Soxoj b213f6e079 vBulletin cleanup, Flarum sites, engine stats, UA bump (#2476) 2026-04-09 01:17:24 +02:00
dependabot[bot] 9354331874 build(deps): bump cryptography from 46.0.6 to 46.0.7 (#2475)
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.6 to 46.0.7.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.6...46.0.7)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-09 01:14:44 +02:00
dependabot[bot] 8a82eb6ee6 build(deps): bump chardet from 7.4.0.post2 to 7.4.1 (#2472)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.0.post2 to 7.4.1.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.0.post2...7.4.1)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-08 20:28:55 +02:00
github-actions[bot] a61f3b32c4 Updated site list and statistics (#2474)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 14:34:48 +02:00
Copilot fbb8255518 Update HackTheBox and Wikipedia to use new API endpoints (#2470)
* Initial plan

* Update HackTheBox and Wikipedia to use new API endpoints for username checking

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6dc9147c-787f-4f4f-8903-7b9873ac6ac9

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 10:23:07 +02:00
dependabot[bot] 9bad5d8269 build(deps-dev): bump pytest from 9.0.2 to 9.0.3 (#2473)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 9.0.2 to 9.0.3.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/9.0.2...9.0.3)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.3
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-08 09:30:36 +02:00
Olivier Cervello a8e7ab4540 Bump lxml minimum to 6.0.2 for Python 3.14 compatibility (#2279)
* Bump lxml minimum to 6.0.2 for Python 3.14 compatibility

lxml 5.x fails to build on Python 3.14 due to incompatible pointer
types in Cython-generated C code. lxml 6.0.2 compiles correctly.

Fixes #2266

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update poetry.lock to match pyproject.toml changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 00:55:21 +02:00
Soxoj 6db1df2ddb Fix failing test for custom DB path resolution (#2468)
* Fix `--db` bug

* Fix test_resolve_db_path_custom_file to create the file before testing

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/3ea7b2e8-0565-4fca-8ec2-eff8eb4ee617

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-08 00:53:57 +02:00
Copilot 23adc178ea Fix crash on -a --self-check by adding exception handling to site check coroutines (#2466)
* Initial plan

* Fix crash on -a --self-check by adding exception handling in site_self_check and self_check

Wrap the body of site_self_check in try/except to catch unexpected errors
and always return a valid changes dict. Also add a safety-net try/except
in self_check around awaiting individual site check futures so that a
single site failure doesn't crash the entire self-check process.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Restore @pytest.mark.slow on test_maigret_results

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Document --self-check error resilience, --auto-disable, and --diagnose in docs/

Update command-line-options.rst with expanded --self-check description
and new --auto-disable and --diagnose entries. Add a "Database self-check"
section to features.rst explaining error-resilient behaviour and usage
examples. Update usage-examples.rst to reference --auto-disable.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/af1f0f09-9112-4902-8475-e81d235ff3ed

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-07 19:44:09 +02:00
Soxoj 6834483360 Fix Spotify, add Spotify Community forum (#2467) 2026-04-07 18:25:13 +02:00
Copilot 6ed8fdefcc Add installation troubleshooting for missing system dependencies (#2465)
* Initial plan

* Add installation troubleshooting section for system dependency errors

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6c3a5612-bdd5-4611-ba77-aea7ab52e304

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Simplify README troubleshooting to a link to the full docs

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6c557093-0643-4980-93ad-973e2d3141ef

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-07 17:54:02 +02:00
Soxoj 3fd34afb77 Sites fixes (#2464) 2026-04-06 21:41:16 +02:00
Soxoj ad95302745 Add Markdown reports for LLM analysis (#2463) 2026-04-06 18:26:43 +02:00
dependabot[bot] 44a6c729e3 build(deps): bump curl-cffi from 0.14.0 to 0.15.0 (#2462)
Bumps [curl-cffi](https://github.com/lexiforest/curl_cffi) from 0.14.0 to 0.15.0.
- [Release notes](https://github.com/lexiforest/curl_cffi/releases)
- [Changelog](https://github.com/lexiforest/curl_cffi/blob/main/docs/changelog.rst)
- [Commits](https://github.com/lexiforest/curl_cffi/compare/v0.14.0...v0.15.0)

---
updated-dependencies:
- dependency-name: curl-cffi
  dependency-version: 0.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 15:04:48 +02:00
Soxoj 6d0a22b738 False positive fixes (#2460)
* Fix false positives: APClips, Taplink, gentoo, Discord.bio, ChaturBate; disable 7Cups, playtime, openriskmanual, reactos; update tags

* Fix db_meta.json regeneration in update_site_data.py (inline instead of module import)

* Fix false positives: disable Bit.ly, Firearmstalk, Needrom, Travelblog; fix gentoo, Discord.bio, brickimedia via API; remove dead sites dreamhost, typepad
2026-04-04 19:08:51 +02:00
Soxoj abce3c9be4 Fix false positives (#2459)
* Fix false positives: APClips, Taplink, gentoo, Discord.bio, ChaturBate; disable 7Cups, playtime, openriskmanual, reactos; update tags

* Fix db_meta.json regeneration in update_site_data.py (inline instead of module import)
2026-04-04 18:22:21 +02:00
Soxoj 269d50eedc DB update mechanism (#2458)
* Database update mechanism
2026-04-04 18:00:50 +02:00
Soxoj e8f4318e5d Added Crypto/Web3 site checks (#2457) 2026-04-04 16:49:12 +02:00
Soxoj 75289c78bf Update of MIT License (#2455) 2026-04-03 18:02:54 +02:00
Julio César Suástegui eeb38ccdc0 fix(data): update InterPals absence string to match current site response (#2442)
The previous absence string 'The requested user does not exist or is inactive'
no longer matches the live site response. InterPals now returns 'User not found'
for non-existent profiles, causing false positives for all username searches.

Tested against interpals.net/noneownsthisusername (non-existent) and
interpals.net/blue (claimed) to confirm detection accuracy.

Closes #2433

Co-authored-by: Julio César Suástegui <juliosuas@users.noreply.github.com>
2026-04-03 13:43:33 +02:00
Soxoj d136014576 Multiple lint and types fixes (#2454) 2026-04-02 21:01:49 +02:00
Soxoj 5d502eaef6 Add site protection tracking system, fix broken site checks (Instagra… (#2452)
* Add site protection tracking system, fix broken site checks (Instagram, StackOverflow, LeetCode, Boosty, LiveLib), preserve unicode in data.json

* Update poetry.lock by running poetry lock

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/14333f41-67d5-4e28-a782-9730b31fc667

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-02 20:28:20 +02:00
dependabot[bot] 9e8a701c54 build(deps): bump aiohttp from 3.13.4 to 3.13.5 (#2448)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.13.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-02 08:11:42 +02:00
dependabot[bot] 7b67c61240 build(deps-dev): bump mypy from 1.19.1 to 1.20.0 (#2447)
Bumps [mypy](https://github.com/python/mypy) from 1.19.1 to 1.20.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.19.1...v1.20.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.20.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-01 20:12:29 +02:00
dependabot[bot] 0e113c4592 build(deps): bump requests from 2.33.0 to 2.33.1 (#2444)
Bumps [requests](https://github.com/psf/requests) from 2.33.0 to 2.33.1.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.33.0...v2.33.1)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 16:49:17 +02:00
dependabot[bot] fb4e17be92 build(deps): bump pygments from 2.18.0 to 2.20.0 (#2440)
Bumps [pygments](https://github.com/pygments/pygments) from 2.18.0 to 2.20.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](https://github.com/pygments/pygments/compare/2.18.0...2.20.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-version: 2.20.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 20:25:10 +02:00
dependabot[bot] adb19e5930 build(deps): bump aiohttp from 3.13.3 to 3.13.4 (#2435)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.13.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 14:27:54 +02:00
dependabot[bot] 116fae3e0f build(deps): bump platformdirs from 4.5.0 to 4.9.4 (#2434)
Bumps [platformdirs](https://github.com/tox-dev/platformdirs) from 4.5.0 to 4.9.4.
- [Release notes](https://github.com/tox-dev/platformdirs/releases)
- [Changelog](https://github.com/tox-dev/platformdirs/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/platformdirs/compare/4.5.0...4.9.4)

---
updated-dependencies:
- dependency-name: platformdirs
  dependency-version: 4.9.4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 11:09:27 +02:00
dependabot[bot] bf495cd57e build(deps): bump chardet from 5.2.0 to 7.4.0.post2 (#2436)
Bumps [chardet](https://github.com/chardet/chardet) from 5.2.0 to 7.4.0.post2.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/5.2.0...7.4.0.post2)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.0.post2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 11:09:14 +02:00
dependabot[bot] e49aa533df build(deps): bump multidict from 6.7.0 to 6.7.1 (#2396)
Bumps [multidict](https://github.com/aio-libs/multidict) from 6.7.0 to 6.7.1.
- [Release notes](https://github.com/aio-libs/multidict/releases)
- [Changelog](https://github.com/aio-libs/multidict/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/multidict/compare/v6.7.0...v6.7.1)

---
updated-dependencies:
- dependency-name: multidict
  dependency-version: 6.7.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-29 12:20:48 +02:00
Soxoj 5aa7f6429b Overhaul site tags and naming: add social tag to 33 networks, fill mi… (#2430)
* Overhaul site tags and naming: add social tag to 33 networks, fill missing tags for 213 top-1000 sites, clean up false us/in country tags (~374 sites), normalize site names to Title Case, add tag validation tests, document tagging and naming rules
Remove LLM folder: ask @soxoj for the up-to-date version!

* Remove LLM/ from version control

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 19:48:16 +01:00
Soxoj a5d337b765 Tags and site names improvements (#2427)
- Added social tag to social networks (33 sites)
- Fixed wrong tags (8 sites)
- Filled empty tags for 213 sites in top-1000
- Country tag cleanup (~374 sites)
- Site naming normalization (75 sites)
- New tests (3)
- Documentation updates
2026-03-28 15:42:12 +01:00
dependabot[bot] 5aa0c908b0 build(deps): bump cryptography from 46.0.5 to 46.0.6 (#2422)
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.5 to 46.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.5...46.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.6
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-28 10:10:35 +01:00
Soxoj 51b452ad71 Add urlProbes (#2425) 2026-03-28 00:08:02 +01:00
Soxoj fa1a4d1b4a Sites re-check (#2423) 2026-03-27 22:41:55 +01:00
dependabot[bot] 184519b202 build(deps): bump soupsieve from 2.8 to 2.8.3 (#2404)
Bumps [soupsieve](https://github.com/facelessuser/soupsieve) from 2.8 to 2.8.3.
- [Release notes](https://github.com/facelessuser/soupsieve/releases)
- [Commits](https://github.com/facelessuser/soupsieve/compare/2.8...2.8.3)

---
updated-dependencies:
- dependency-name: soupsieve
  dependency-version: 2.8.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 22:41:40 +01:00
dependabot[bot] a203eecbb2 build(deps-dev): bump pytest from 9.0.1 to 9.0.2 (#2381)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 9.0.1 to 9.0.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/9.0.1...9.0.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 22:15:56 +01:00
dependabot[bot] dde1cd5d78 build(deps): bump psutil from 7.1.3 to 7.2.2 (#2406)
Bumps [psutil](https://github.com/giampaolo/psutil) from 7.1.3 to 7.2.2.
- [Changelog](https://github.com/giampaolo/psutil/blob/master/docs/changelog.rst)
- [Commits](https://github.com/giampaolo/psutil/compare/v7.1.3...v7.2.2)

---
updated-dependencies:
- dependency-name: psutil
  dependency-version: 7.2.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 15:58:24 +01:00
dependabot[bot] 547512519b build(deps): bump pyinstaller from 6.16.0 to 6.19.0 (#2405)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.16.0 to 6.19.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.16.0...v6.19.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.19.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 09:19:04 +01:00
Soxoj b333a2e2b2 Readme update: commercial use (#2403) 2026-03-26 21:51:53 +01:00
dependabot[bot] 2835ec71c7 build(deps): bump requests from 2.32.5 to 2.33.0 (#2394)
Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.5...v2.33.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 21:10:51 +01:00
github-actions[bot] af67a6a3f3 Updated site list and statistics (#2399)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-03-26 16:36:23 +01:00
dependabot[bot] 4f737b5260 build(deps-dev): bump pytest-httpserver from 1.1.0 to 1.1.5 (#2397)
Bumps [pytest-httpserver](https://github.com/csernazs/pytest-httpserver) from 1.1.0 to 1.1.5.
- [Release notes](https://github.com/csernazs/pytest-httpserver/releases)
- [Changelog](https://github.com/csernazs/pytest-httpserver/blob/master/CHANGES.rst)
- [Commits](https://github.com/csernazs/pytest-httpserver/compare/1.1.0...1.1.5)

---
updated-dependencies:
- dependency-name: pytest-httpserver
  dependency-version: 1.1.5
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 16:12:53 +01:00
dependabot[bot] 185e09e4ea build(deps): bump pypdf from 6.9.1 to 6.9.2 (#2392)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.9.1 to 6.9.2.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.9.1...6.9.2)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.9.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 23:33:41 +01:00
dependabot[bot] 5865e0f375 build(deps): bump yarl from 1.22.0 to 1.23.0 (#2383)
---
updated-dependencies:
- dependency-name: yarl
  dependency-version: 1.23.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 16:59:46 +01:00
dependabot[bot] 815c8cb2f3 build(deps): bump asgiref from 3.11.0 to 3.11.1 (#2384)
Bumps [asgiref](https://github.com/django/asgiref) from 3.11.0 to 3.11.1.
- [Changelog](https://github.com/django/asgiref/blob/main/CHANGELOG.txt)
- [Commits](https://github.com/django/asgiref/compare/3.11.0...3.11.1)

---
updated-dependencies:
- dependency-name: asgiref
  dependency-version: 3.11.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 15:03:02 +01:00
Soxoj 656fe1df24 Added Max.ru check; --no-progressbar flag fixed (#2386) 2026-03-25 11:48:12 +01:00
dependabot[bot] 1c5dc5f152 build(deps): bump pycountry from 24.6.1 to 26.2.16 (#2382)
Bumps [pycountry](https://github.com/pycountry/pycountry) from 24.6.1 to 26.2.16.
- [Release notes](https://github.com/pycountry/pycountry/releases)
- [Changelog](https://github.com/pycountry/pycountry/blob/main/HISTORY.txt)
- [Commits](https://github.com/pycountry/pycountry/compare/24.6.1...26.2.16)

---
updated-dependencies:
- dependency-name: pycountry
  dependency-version: 26.2.16
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 09:54:11 +01:00
Soxoj bc3d9faad9 Fix false-positive site checks reported by Maigret Bot (#2376) 2026-03-24 23:01:11 +01:00
59 changed files with 6952 additions and 6143 deletions
+8 -1
View File
@@ -1,3 +1,10 @@
#!/bin/sh
echo 'Activating update_sitesmd hook script...'
poetry run update_sitesmd
poetry run update_sitesmd
echo 'Regenerating db_meta.json...'
python3 utils/generate_db_meta.py
git add maigret/resources/db_meta.json
git add maigret/resources/data.json
git add sites.md
+1 -1
View File
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13"]
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
steps:
- name: Checkout
+23 -14
View File
@@ -1,21 +1,30 @@
name: Upload Python Package to PyPI when a Release is Created
name: Upload Python Package to PyPI when a Release is Published
on:
release:
types: [created]
push:
tags:
- "v*"
permissions:
id-token: write
contents: read
types: [published]
jobs:
build-and-publish:
pypi-publish:
name: Publish release to PyPI
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/maigret
permissions:
id-token: write
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- run: uv build
- name: Publish to PyPI (Trusted Publishing)
uses: pypa/gh-action-pypi-publish@release/v1
- name: Set up Python
uses: actions/setup-python@v4
with:
packages-dir: dist
python-version: "3.x"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
- name: Build package
run: |
python -m build
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
+3
View File
@@ -27,6 +27,9 @@ jobs:
pip3 install .
python3 ./utils/update_site_data.py --empty-only
- name: Regenerate db_meta.json
run: python3 utils/generate_db_meta.py
- name: Remove ambiguous main tag
run: git tag -d main || true
+2 -1
View File
@@ -42,4 +42,5 @@ settings.json
# other
*.egg-info
build
build
LLM
+191
View File
@@ -1,5 +1,196 @@
# Changelog
## [0.6.0] - 2025-04-10
## What's Changed
* Updated workflows: added 3.13 to test, updated pypi upload by @soxoj in https://github.com/soxoj/maigret/pull/2111
* Bump pypdf from 5.1.0 to 6.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2122
* Bump coverage from 7.9.2 to 7.10.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2117
* Bump soupsieve from 2.6 to 2.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2118
* Bump mock from 5.1.0 to 5.2.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2116
* Bump pytest-asyncio from 1.0.0 to 1.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2114
* Bump pytest-cov from 6.0.0 to 6.2.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2115
* Bump xhtml2pdf from 0.2.16 to 0.2.17 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2149
* Bump requests from 2.32.4 to 2.32.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2165
* Bump lxml from 5.3.0 to 6.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2146
* Bump aiodns from 3.2.0 to 3.5.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2148
* Bump alive-progress from 3.2.0 to 3.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2145
* Bump certifi from 2025.6.15 to 2025.8.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2147
* Disabled some sites giving false positive results by @soxoj in https://github.com/soxoj/maigret/pull/2170
* Bump flask from 3.1.1 to 3.1.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2175
* Bump pyinstaller from 6.11.1 to 6.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2174
* Bump mypy from 1.14.1 to 1.17.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2173
* Bump pytest from 8.3.4 to 8.4.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2172
* Bump flake8 from 7.1.1 to 7.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2171
* Bump aiohttp from 3.12.14 to 3.12.15 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2181
* Bump coverage from 7.10.3 to 7.10.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2180
* Bump psutil from 6.1.1 to 7.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2179
* Bump lxml from 6.0.0 to 6.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2178
* Bump multidict from 6.6.3 to 6.6.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2177
* Bump soupsieve from 2.7 to 2.8 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2185
* Bump typing-extensions from 4.14.1 to 4.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2182
* Bump python-bidi from 0.6.3 to 0.6.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2183
* Bump platformdirs from 4.3.8 to 4.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2184
* Make web interface accessible for Docker deployment by default by @soxoj in https://github.com/soxoj/maigret/pull/2189
* Bump coverage from 7.10.5 to 7.10.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2192
* Bump pytest-rerunfailures from 15.1 to 16.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2191
* Bump pytest-rerunfailures from 15.1 to 16.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2193
* Bump pytest from 8.4.1 to 8.4.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2194
* Bump pytest-cov from 6.2.1 to 6.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2195
* Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2196
* Bump mypy from 1.17.1 to 1.18.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2197
* Bump black from 25.1.0 to 25.9.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2203
* Bump mypy from 1.18.1 to 1.18.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2202
* Bump pytest-asyncio from 1.1.0 to 1.2.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2200
* Bump pyinstaller from 6.15.0 to 6.16.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2199
* Bump reportlab from 4.4.3 to 4.4.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2206
* Bump coverage from 7.10.6 to 7.10.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2207
* Bump psutil from 7.0.0 to 7.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2201
* Bump asgiref from 3.9.1 to 3.9.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2204
* Bump lxml from 6.0.1 to 6.0.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2208
* Bump platformdirs from 4.4.0 to 4.5.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2223
* Bump asgiref from 3.9.2 to 3.10.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2220
* Bump yarl from 1.20.1 to 1.22.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2221
* Bump markupsafe from 3.0.2 to 3.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2209
* Bump multidict from 6.6.4 to 6.7.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2224
* Bump idna from 3.10 to 3.11 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2227
* Bump aiohttp from 3.12.15 to 3.13.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2225
* Bump coverage from 7.10.7 to 7.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2230
* Bump certifi from 2025.8.3 to 2025.10.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2228
* Bump pytest-rerunfailures from 16.0.1 to 16.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2229
* Bump attrs from 25.3.0 to 25.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2226
* Bump aiohttp from 3.13.0 to 3.13.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2237
* Bump pypdf from 6.0.0 to 6.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2233
* Bump black from 25.9.0 to 25.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2239
* Bump python-bidi from 0.6.6 to 0.6.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2234
* Bump psutil from 7.1.0 to 7.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2240
* Bump coverage from 7.11.0 to 7.12.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2241
* Bump werkzeug from 3.1.3 to 3.1.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2248
* Bump pypdf from 6.1.3 to 6.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2245
* Bump asgiref from 3.10.0 to 3.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2243
* Bump pytest-asyncio from 1.2.0 to 1.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2242
* Bump aiohttp from 3.13.2 to 3.13.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2261
* Bump pytest from 8.4.2 to 9.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2244
* Bump mypy from 1.18.2 to 1.19.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2250
* ♻️ Refactor: Hardcoded relative path for database file by @tang-vu in https://github.com/soxoj/maigret/pull/2285
* ✨ Quality: Missing tests for settings cascade and override logic by @tang-vu in https://github.com/soxoj/maigret/pull/2287
* ✨ Quality: Unexpanded tilde in file path by @tang-vu in https://github.com/soxoj/maigret/pull/2283
* Bump urllib3 from 2.5.0 to 2.6.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2262
* Bump pillow from 11.0.0 to 12.1.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2271
* Bump black from 25.11.0 to 26.3.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2280
* Bump cryptography from 44.0.1 to 46.0.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2270
* Bump pypdf from 6.4.0 to 6.9.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2281
* Dockerfile fix by @soxoj in https://github.com/soxoj/maigret/pull/2290
* Fixed false positives in top-500 by @soxoj in https://github.com/soxoj/maigret/pull/2292
* Update Telegram bot link in README by @soxoj in https://github.com/soxoj/maigret/pull/2293
* Pyinstaller GitHub workflow fix by @soxoj in https://github.com/soxoj/maigret/pull/2298
* Twitter fixed, mirrors mechanism improvement by @soxoj in https://github.com/soxoj/maigret/pull/2299
* build(deps): bump flask from 3.1.2 to 3.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2289
* Bump reportlab from 4.4.4 to 4.4.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2251
* build(deps): bump werkzeug from 3.1.4 to 3.1.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2288
* Bump certifi from 2025.10.5 to 2025.11.12 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2249
* Update Telegram bot link in README by @soxoj in https://github.com/soxoj/maigret/pull/2300
* Improve site-check quality by @soxoj in https://github.com/soxoj/maigret/pull/2301
* feat(sites): fix false positives: disable 74 broken sites, fix 8 with… by @soxoj in https://github.com/soxoj/maigret/pull/2302
* Update sites list workflow by @soxoj in https://github.com/soxoj/maigret/pull/2303
* Bump svglib from 1.5.1 to 1.6.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2205
* feat(workflow): fix update site data workflow dependency by @soxoj in https://github.com/soxoj/maigret/pull/2306
* Re-enable taplink.cc with browser User-Agent to bypass Cloudflare by @Copilot in https://github.com/soxoj/maigret/pull/2308
* feat(workflow): fix update site data workflow err by @soxoj in https://github.com/soxoj/maigret/pull/2312
* Update site data workflow fix: remove ambiguous main tag by @soxoj in https://github.com/soxoj/maigret/pull/2313
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2314
* Fix Love.Mail.ru: update to numeric-only identifiers and new profile URL by @Copilot in https://github.com/soxoj/maigret/pull/2307
* Remove dead site xxxforum.org by @Copilot in https://github.com/soxoj/maigret/pull/2310
* Disable forums.developer.nvidia.com (auth-gated user profiles) by @Copilot in https://github.com/soxoj/maigret/pull/2305
* Pin requests-toolbelt>=1.0.0 to fix urllib3 v2 incompatibility by @Copilot in https://github.com/soxoj/maigret/pull/2316
* build(deps): bump reportlab from 4.4.5 to 4.4.10 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2323
* build(deps-dev): bump coverage from 7.12.0 to 7.13.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2321
* build(deps-dev): bump pytest-cov from 7.0.0 to 7.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2320
* build(deps): bump aiohttp-socks from 0.10.1 to 0.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2319
* Disable false-positive site probe: amateurvoyeurforum.com by @Copilot in https://github.com/soxoj/maigret/pull/2332
* Disable forums.stevehoffman.tv due to false positives by @Copilot in https://github.com/soxoj/maigret/pull/2331
* [WIP] Fix false-positive probe for vegalab site by @Copilot in https://github.com/soxoj/maigret/pull/2336
* Fix RoyalCams site check using BongaCams white-label pattern by @Copilot in https://github.com/soxoj/maigret/pull/2334
* Fix Setlist site check: switch to message checkType with proper markers by @Copilot in https://github.com/soxoj/maigret/pull/2333
* [WIP] Fix invalid link on forums.imore.com by @Copilot in https://github.com/soxoj/maigret/pull/2337
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2315
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2339
* Fix false-positive site probe: Re-enable Taplink with message checkType by @Copilot in https://github.com/soxoj/maigret/pull/2326
* build(deps): bump aiodns from 3.5.0 to 4.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2345
* build(deps-dev): bump mypy from 1.19.0 to 1.19.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2347
* Disable Librusec site check (false positive) by @Copilot in https://github.com/soxoj/maigret/pull/2349
* Disable MirTesen site check (false positive) by @Copilot in https://github.com/soxoj/maigret/pull/2350
* build(deps): bump attrs from 25.4.0 to 26.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2344
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2341
* feat: add cybersecurity platforms + re-enable Root-Me by @juliosuas in https://github.com/soxoj/maigret/pull/2318
* Fix club.cnews.ru false positive: switch from status_code to message checkType by @Copilot in https://github.com/soxoj/maigret/pull/2342
* Fix SoundCloud false-positive: switch to message-based check by @Copilot in https://github.com/soxoj/maigret/pull/2355
* build(deps): bump certifi from 2025.11.12 to 2026.2.25 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2346
* feat: add tag blacklisting via `--exclude-tags` by @Copilot in https://github.com/soxoj/maigret/pull/2352
* Fix domain substring matching and NoneType crash in submit dialog by @Copilot in https://github.com/soxoj/maigret/pull/2367
* feat(core): add POST request support, new sites, migrate to Majestic Million ranking by @soxoj in https://github.com/soxoj/maigret/pull/2317
* Fix update-site-data workflow race condition on branch push by @Copilot in https://github.com/soxoj/maigret/pull/2366
* Fix false-positive site checks reported by Maigret Bot by @soxoj in https://github.com/soxoj/maigret/pull/2376
* build(deps): bump pycountry from 24.6.1 to 26.2.16 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2382
* Added Max.ru check; --no-progressbar flag fixed by @soxoj in https://github.com/soxoj/maigret/pull/2386
* build(deps): bump asgiref from 3.11.0 to 3.11.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2384
* build(deps): bump yarl from 1.22.0 to 1.23.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2383
* build(deps): bump pypdf from 6.9.1 to 6.9.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2392
* build(deps-dev): bump pytest-httpserver from 1.1.0 to 1.1.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2397
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2399
* build(deps): bump requests from 2.32.5 to 2.33.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2394
* Readme update: commercial use by @soxoj in https://github.com/soxoj/maigret/pull/2403
* build(deps): bump pyinstaller from 6.16.0 to 6.19.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2405
* build(deps): bump psutil from 7.1.3 to 7.2.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2406
* build(deps-dev): bump pytest from 9.0.1 to 9.0.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2381
* build(deps): bump soupsieve from 2.8 to 2.8.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2404
* Sites re-check by @soxoj in https://github.com/soxoj/maigret/pull/2423
* Add urlProbes by @soxoj in https://github.com/soxoj/maigret/pull/2425
* build(deps): bump cryptography from 46.0.5 to 46.0.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2422
* Tags and site names improvements by @soxoj in https://github.com/soxoj/maigret/pull/2427
* Overhaul site tags and naming: add social tag to 33 networks, fill mi… by @soxoj in https://github.com/soxoj/maigret/pull/2430
* build(deps): bump multidict from 6.7.0 to 6.7.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2396
* build(deps): bump chardet from 5.2.0 to 7.4.0.post2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2436
* build(deps): bump platformdirs from 4.5.0 to 4.9.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2434
* build(deps): bump aiohttp from 3.13.3 to 3.13.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2435
* build(deps): bump pygments from 2.18.0 to 2.20.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2440
* build(deps): bump requests from 2.33.0 to 2.33.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2444
* build(deps-dev): bump mypy from 1.19.1 to 1.20.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2447
* build(deps): bump aiohttp from 3.13.4 to 3.13.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2448
* Add site protection tracking system, fix broken site checks (Instagra… by @soxoj in https://github.com/soxoj/maigret/pull/2452
* Multiple lint and types fixes by @soxoj in https://github.com/soxoj/maigret/pull/2454
* fix(data): update InterPals absence string to match current site response by @juliosuas in https://github.com/soxoj/maigret/pull/2442
* Update of MIT License by @soxoj in https://github.com/soxoj/maigret/pull/2455
* Added Crypto/Web3 site checks by @soxoj in https://github.com/soxoj/maigret/pull/2457
* DB update mechanism by @soxoj in https://github.com/soxoj/maigret/pull/2458
* Fix false positives by @soxoj in https://github.com/soxoj/maigret/pull/2459
* False positive fixes by @soxoj in https://github.com/soxoj/maigret/pull/2460
* build(deps): bump curl-cffi from 0.14.0 to 0.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2462
* Add Markdown reports for LLM analysis by @soxoj in https://github.com/soxoj/maigret/pull/2463
* Sites fixes by @soxoj in https://github.com/soxoj/maigret/pull/2464
* Add installation troubleshooting for missing system dependencies by @Copilot in https://github.com/soxoj/maigret/pull/2465
* Fix Spotify, add Spotify Community forum by @soxoj in https://github.com/soxoj/maigret/pull/2467
* Fix crash on `-a --self-check` by adding exception handling to site check coroutines by @Copilot in https://github.com/soxoj/maigret/pull/2466
* Fix failing test for custom DB path resolution by @soxoj in https://github.com/soxoj/maigret/pull/2468
* Bump lxml minimum to 6.0.2 for Python 3.14 compatibility by @ocervell in https://github.com/soxoj/maigret/pull/2279
* build(deps-dev): bump pytest from 9.0.2 to 9.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2473
* Update HackTheBox and Wikipedia to use new API endpoints by @Copilot in https://github.com/soxoj/maigret/pull/2470
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2474
* build(deps): bump chardet from 7.4.0.post2 to 7.4.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2472
* build(deps): bump cryptography from 46.0.6 to 46.0.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2475
* vBulletin cleanup, Flarum sites, engine stats, UA bump by @soxoj in https://github.com/soxoj/maigret/pull/2476
* build(deps): bump platformdirs from 4.9.4 to 4.9.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2477
* Re-enable 69 stale-disabled sites validated via self-check by @soxoj in https://github.com/soxoj/maigret/pull/2478
* Fix false positives by @soxoj in https://github.com/soxoj/maigret/pull/2499
* build(deps): bump socid-extractor from 0.0.27 to 0.0.28 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2502
* build(deps): bump lxml from 6.0.2 to 6.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2501
* Disable Kinja.com site check by @Copilot in https://github.com/soxoj/maigret/pull/2503
* Added 3 sites, fixed 6, disabled 8 by @soxoj in https://github.com/soxoj/maigret/pull/2505
* Bump to 0.6.0 by @soxoj in https://github.com/soxoj/maigret/pull/2506
* Update workflow to trigger on published releases by @soxoj in https://github.com/soxoj/maigret/pull/2508
**Full Changelog**: https://github.com/soxoj/maigret/compare/v0.5.0...v0.6.0
## [0.5.0] - 2025-08-10
* Site Supression by @C3n7ral051nt4g3ncy in https://github.com/soxoj/maigret/pull/627
* Bump yarl from 1.7.2 to 1.8.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/626
+1 -2
View File
@@ -1,7 +1,6 @@
MIT License
Copyright (c) 2019 Sherlock Project
Copyright (c) 2020-2021 Soxoj
Copyright (c) 2020-2026 Soxoj
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
-452
View File
@@ -1,452 +0,0 @@
# Site checks — guide (Maigret)
Working document for future changes: workflow, findings from reviews, and practical steps. See also [`site-checks-playbook.md`](site-checks-playbook.md) (short checklist), [`socid_extractor_improvements.log`](socid_extractor_improvements.log) (proposals for upstream identity extraction), and the code in [`maigret/checking.py`](../maigret/checking.py).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **this file** and [`site-checks-playbook.md`](site-checks-playbook.md) in sync (see the section at the end). If you change rules about the JSON API check or the `socid_extractor` log format, update **[`socid_extractor_improvements.log`](socid_extractor_improvements.log)** (template / header) together with this guide.
---
## 1. How checks work
Logic lives in `process_site_result` ([`maigret/checking.py`](../maigret/checking.py)):
| `checkType` | Meaning |
|-------------|---------|
| `message` | Profile is “found” if the HTML contains **none** of the `absenceStrs` substrings **and** at least one `presenseStrs` marker matches. If `presenseStrs` is **empty**, presence is treated as true for **any** page (risky configuration). |
| `status_code` | HTTP **2xx** is enough — only safe if the server does **not** return 200 for “user not found”. |
| `response_url` | Custom flow with **redirects disabled** so the status/URL of the *first* response can be used. |
For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets **`allow_redirects=True`**: the client follows redirects and `process_site_result` sees the **final** response body and status (not the pre-redirect hop). You do **not** need to “turn on” follow-redirect separately for most sites.
Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).
### `urlProbe`: probe URL vs reported profile URL
- **`url`** — pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
- **`urlProbe`** (optional) — if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHubs `https://github.com/{username}` + `https://api.github.com/users/{username}`).
If `urlProbe` is omitted, the probe URL defaults to `url`.
### Redirects and final URL as a signal
If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
- **Final URL** after redirects (e.g. profile canonical URL vs `/404` path).
- **Redirect chain** length or target host (e.g. lander vs profile).
If that differs reliably, you may be able to use **`checkType`: `response_url`** in [`data.json`](../maigret/resources/data.json) (no auto-follow) or extend logic — but only when the difference is stable.
**Server-side HTTP vs client-side navigation.** Maigret follows **HTTP** redirects only; it does **not** run JavaScript. If the browser shows a navigation to `/u/name/posts` or `/not-found` **after** the SPA bundle loads, that may never appear as an extra hop in `curl`/aiohttp — only a **trailing-slash** `301` might show up. Always confirm with `curl -sIL` / a small script whether the **Location** chain differs for real vs fake users before relying on URL-based rules.
**Empirical check (claimed vs non-existent usernames, `GET` with follow redirects, no JS):**
| Site | Result |
|------|--------|
| **Kaskus** | No HTTP redirects beyond the request path; same generic `<title>` and near-identical body length — **no** discriminating signal from redirects alone. |
| **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow — not a profile-vs-missing distinction. |
| **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |
**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).
For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).
---
## 2. Standard checks: public JSON API and `socid_extractor` log
### 2.1 Public JSON API (always)
When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).
This is a **standard** part of site-check work, not an optional extra.
### 2.2 Mandatory: [`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)
If you discover **either**:
1. **JSON embedded in HTML** with user/profile fields (inline scripts, `__NEXT_DATA__`, `application/ld+json`, hydration blobs, etc.), or
2. A **standalone JSON HTTP response** (public API) with user/profile data for that service,
you **must append** a proposal block to **[`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)**.
**Why:** Maigret calls [`socid_extractor.extract`](https://pypi.org/project/socid-extractor/) on the response body ([`extract_ids_data` in `checking.py`](../maigret/checking.py)) to fill `ids_data`. New payloads usually need a **new scheme** upstream (`flags`, `regex`, optional `extract_json`, `fields`, optional `url_mutations` / `transforms`), matching patterns such as **`GitHub API`** or **`Gitlab API`** in `socid_extractor`s `schemes.py`.
**Each log entry must include:**
- **Date** — ISO `YYYY-MM-DD` (day you add the entry).
- **Example username** — Prefer the sites `usernameClaimed` from `data.json`, or any account that reproduces the payload.
- **Proposal** — Use the **block template** in the log file: detection idea, optional URL mutation, and field mappings in the same style as existing schemes.
If the service is **already covered** by an existing `socid_extractor` scheme, add a **short** entry anyway (date, example username, scheme name, “already implemented”) so there is an audit trail.
Do **not** paste secrets, cookies, or full private JSON; short key names and structure hints are enough.
---
## 3. Improvement workflow
### Phase A — Reproduce
1. Targeted run:
```bash
maigret --db /path/to/maigret/resources/data.json \
TEST_USERNAME \
--site "SiteName" \
--print-not-found --print-errors \
--no-progressbar -vv
```
2. Run separately with a **real** existing username and a **definitely non-existent** one (as `usernameClaimed` / `usernameUnclaimed` in JSON).
3. If needed: `-vvv` and `debug.log` (raw response).
4. Automated pair check:
```bash
maigret --db ... --self-check --site "SiteName" --no-progressbar
```
### Phase B — Classify the cause
| Symptom | Likely cause |
|---------|----------------|
| False “found” with `status_code` | Soft 404 (200 on a “not found” page). |
| False “found” with `message` | Overly broad `presenseStrs` (`name`, `email`, JSON keys) or stale `absenceStrs`. |
| Same HTML for different users | SPA / skeleton shell before hydration — also compare **final URL / redirect chain** (see above); if still identical, often `disabled`. |
| Login page instead of profile | XenForo etc.: guest, `ignore403`, “must be logged in” strings. |
| reCAPTCHA / “Checking your browser” / “not a bot” | Bot protection; Maigrets default User-Agent may worsen the response. |
| Redirect to another domain / lander | Stale URL template. |
### Phase C — Edits in [`data.json`](../maigret/resources/data.json)
1. Update `url` / `urlMain` if needed (HTTPS, new profile path).
2. Replace inappropriate `status_code` with `message` (or `response_url`), choosing:
- **`absenceStrs`** — only what reliably appears on the “user does not exist” page;
- **`presenseStrs`** — narrow markers of a real profile (avoid generic words).
3. For XenForo: override only fields that differ in the site entry; do not break the global `engines` template.
4. Refresh `usernameClaimed` / `usernameUnclaimed` if reference accounts disappeared.
5. Set **`headers`** (e.g. another `User-Agent`) if the site serves a captcha only to “suspicious” clients.
6. Use **`errors`**: HTML substring → meaningful check error (UNKNOWN), so it is not confused with “available”.
### Phase D — Decision criteria
| Outcome | When to use |
|---------|-------------|
| **Check fixed** | The `claimed` / `unclaimed` pair behaves predictably, `--self-check` passes, no regression on a similar site with the same engine. |
| **Check disabled** (`disabled: true`) | Cloudflare / anti-bot / login required / indistinguishable SPA without stable markers. |
| **Entry removed** | **Only** if the domain/service is gone (NXDOMAIN, clearly dead project), not “because it is hard to fix”. |
### Phase E — Before commit
- `maigret --self-check` for affected sites.
- `make test`.
---
## 4. Findings from reviews (concrete site batch)
Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtube, Toms Guide, Kaggle, Kaskus, Livemaster, TechPowerUp, authorSTREAM, Bibsonomy, Bulbagarden, iXBT, Serebii, Picsart, Hashnode, hi5.
### What most often broke checks
1. **`status_code` where content checks are needed** — soft 404 with status 200.
2. **Broad `presenseStrs`** — matches on error pages or generic SPA shells.
3. **XenForo + guest** — HTML includes strings like “You must be logged in” that overlap the engine template.
4. **User-Agent** — on some sites (e.g. Kaggle) the default UA triggered a reCAPTCHA page instead of profile HTML; a deliberate `User-Agent` in site `headers` helped.
5. **SPAs and redirects** — identical first HTML, redirect to lander / another product (hi5 → Tagged), URL format changes by region (Mercado Livre).
### What worked as a fix
- Switching to **`message`** with narrow strings from **`<title>`** or unique markup where stable (**Kaggle**, **Mercado Livre**, **Hashnode**).
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis Anti-Bot**: Setting `headers` with a basic script User-Agent (e.g. `python-requests/2.25.1`) rather than the default browser UA completely bypassed the Anubis Proof-of-Work challenge HTTP 307 redirect, instantly recovering the native HTTP 404 framework.
### What required disabling checks
Where you **cannot** reliably tell “profile exists” from “no profile” without bypassing protection, login, or full JS:
- Anti-bot / captcha / “not a bot” page;
- Guest-only access to the needed page;
- SPA with indistinguishable first response;
- Forums returning **403** and a login page instead of a member profile for the member-search URL;
- Stale URLs that redirect to a stub.
In those cases **`disabled: true`** is better than false “found”; remove the DB entry only on **actual** domain death.
### Code notes
- For the `status_code` branch in `process_site_result`, use **strict** comparison `check_type == "status_code"`, not a substring match inside `"status_code"`.
- Treat empty `presenseStrs` with `message` as risky: when debugging, watch DEBUG-level logs if that diagnostics exists in code.
---
## 5. Future ideas (Maigret improvements)
- A mode or script: one site, two usernames, print statuses and first N bytes of the response (wrapper around `maigret()`).
- Document in CLI help that **`--use-disabled-sites`** is needed to analyze disabled entries.
---
## 6. Development utilities
### 6.1 `utils/site_check.py` — Single site diagnostics
A comprehensive utility for testing individual sites with multiple modes:
```bash
# Basic comparison of claimed vs unclaimed (aiohttp)
python utils/site_check.py --site "VK" --check-claimed
# Test via Maigret's checker directly
python utils/site_check.py --site "VK" --maigret
# Compare aiohttp vs Maigret results (find discrepancies)
python utils/site_check.py --site "VK" --compare-methods
# Full diagnosis with recommendations
python utils/site_check.py --site "VK" --diagnose
# Test with custom URL
python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2
# Find a valid username for a site
python utils/site_check.py --site "VK" --find-user
```
**Key features:**
- `--maigret` — Uses Maigret's actual checking code, not raw aiohttp
- `--compare-methods` — Shows if aiohttp and Maigret see different results (useful for debugging)
- `--diagnose` — Validates checkType against actual responses, suggests fixes
- Color output with markers detection (captcha, cloudflare, login, etc.)
- `--json` flag for machine-readable output
**When to use each mode:**
| Mode | Use case |
|------|----------|
| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
| `--maigret` | Verify Maigret's actual behavior matches expectations |
| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
| `--diagnose` | Full analysis when a site is broken, get fix recommendations |
### 6.2 `utils/check_top_n.py` — Mass site checking
Batch-check top N sites by Alexa rank with categorized reporting:
```bash
# Check top 100 sites
python utils/check_top_n.py --top 100
# Faster with more parallelism
python utils/check_top_n.py --top 100 --parallel 10
# Output JSON report
python utils/check_top_n.py --top 100 --output report.json
# Only show broken sites
python utils/check_top_n.py --top 100 --only-broken
```
**Output categories:**
- `working` — Site check passes
- `broken` — Check fails (wrong status, missing markers)
- `timeout` — Request timed out
- `anti_bot` — 403/429 or captcha detected
- `error` — Connection or other errors
- `disabled` — Already disabled in data.json
**Report includes:**
- Summary counts by category
- List of broken sites with issues
- Recommendations for fixes (e.g., "Switch to checkType: status_code")
### 6.3 Self-check behavior (`--self-check`)
The self-check command has been improved to be less aggressive:
```bash
# Check sites WITHOUT auto-disabling (default)
maigret --self-check --site "VK"
# Auto-disable failing sites (old behavior)
maigret --self-check --site "VK" --auto-disable
# Show detailed diagnosis for each failure
maigret --self-check --site "VK" --diagnose
```
**Behavior changes:**
| Flag | Effect |
|------|--------|
| `--self-check` alone | Reports issues but does NOT disable sites |
| `--auto-disable` | Automatically disables sites that fail (opt-in) |
| `--diagnose` | Prints detailed diagnosis with recommendations |
**Why this matters:**
- Old behavior was too aggressive — sites got disabled without explanation
- New behavior reports issues and suggests fixes
- Explicit `--auto-disable` required to modify database
---
## 7. Lessons learned (practical observations)
Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).
### 7.1 JSON API is the first thing to look for
Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:
- **Reddit**: `https://api.reddit.com/user/{username}/about` — returns JSON with user data or `{"message": "Not Found", "error": 404}`.
- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` — returns JSON with `userName` field or HTTP 404.
This confirms the playbook recommendation: always check for `/api/`, `.json`, GraphQL endpoints before giving up on a site.
### 7.2 `urlProbe` is a powerful tool
It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:
```json
{
"url": "https://www.reddit.com/user/{username}",
"urlProbe": "https://api.reddit.com/user/{username}/about",
"checkType": "message",
"presenseStrs": ["\"name\":"],
"absenceStrs": ["Not Found"]
}
```
The check hits the API, but reports display `www.reddit.com/user/blue`.
### 7.3 aiohttp ≠ curl ≠ requests
Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting** — the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.
**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.
```python
# This returns 403 for Wikipedia even with browser UA:
async with aiohttp.ClientSession() as session:
async with session.get(url, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
print(resp.status) # 403
```
### 7.4 HTTP 403 in Maigret can mean different things
Initially it seemed Wikipedia was returning 403, but `curl` showed 200. Only `debug.log` revealed the real picture — aiohttp was getting blocked at TLS level.
**Lesson:** Use `-vvv` flag and inspect `debug.log` for raw response status and body. The warning message alone may be misleading.
### 7.5 Dead services migrate, not disappear
MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:
1. Keep old entries with `disabled: true` as historical record.
2. Create a new entry for the current service with working API.
This preserves audit trail and avoids breaking existing workflows.
### 7.6 `status_code` is more reliable than `message` for APIs
Microsoft Learn API returns HTTP 404 for non-existent users — a clean signal without HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:
```json
{
"checkType": "status_code",
"urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
}
```
No need for fragile string matching when the API speaks HTTP correctly.
### 7.8 Engine templates can silently break across many sites
The **vBulletin** engine template has `absenceStrs` in five languages ("This user has not registered…", "Пользователь не зарегистрирован…", etc.). In a batch review of ~12 vBulletin forums (oneclickchicks, mirf, Pesiq, VKMOnline, forum.zone-game.info, etc.), **none** of the absence strings matched — the forums returned identical pages for both claimed and unclaimed usernames. Root cause: many of these forums require login to view member profiles, so they serve a generic page (no "user not registered" message at all) instead of an informative error.
**Lesson:** When a whole engine class shows false positives, do not patch sites one by one — check whether the **engine template** itself still matches the actual error pages. A template written for one version/language pack may silently stop working after a forum upgrade or config change.
### 7.9 Search-by-author URLs are architecturally unreliable
Several sites (OnanistovNet, Shoppingzone, Pogovorim, Astrogalaxy, Sexwin) used a phpBB-style `search.php?keywords=&terms=all&author={username}` URL as the check endpoint. This searches for **posts** by that author, not for the user account itself. Even if the markers worked, a user who exists but has zero posts would be indistinguishable from a non-existent user. And in practice, the sites changed their response format — some now return HTTP 404, others dropped the expected Russian absence text altogether.
**Lesson:** Avoid author-search URLs as the check endpoint; they test "has posts" rather than "account exists" and are doubly fragile (both logic mismatch and format drift).
### 7.10 Some sites generate a page for any path — permanent false positives
Two distinct patterns:
- **Pbase** creates a stub page titled "pbase Artist {username}" for **every** URL, real or fake. Both return HTTP 200 with nearly identical content (~3.3 KB). No markers can distinguish them.
- **ffm.bio** is even trickier: for the non-existent username `a.slomkoowski` it generated a page titled "mr.a" with description "a is a", apparently fuzzy-matching the path to the closest real entry. Both return HTTP 200 with large, content-rich pages.
**Lesson:** Before writing markers for a site, verify that the "unclaimed" URL actually produces an **error-like** response (different status, different title, unique error text). If the site always returns a plausible-looking page, no combination of `presenseStrs` / `absenceStrs` will help — `disabled: true` is the only safe option.
### 7.11 TLS fingerprinting can degrade over time (Kaggle)
Kaggle was previously fixed with a custom `User-Agent` header and `errors` for the "Checking your browser" captcha page. In the latest batch review, aiohttp receives HTTP 404 with identical content for **both** claimed and unclaimed usernames — the site now blocks the entire request before it reaches the profile page. This matches the TLS fingerprinting pattern seen earlier with Wikipedia (section 7.3), but here the degradation happened **after** a working fix was already in place.
**Lesson:** Sites that rely on bot-detection can tighten their rules at any time. A working `User-Agent` override today may fail tomorrow. When a previously fixed site starts returning identical responses for both usernames, suspect TLS fingerprinting first, and accept `disabled: true` if no public API is available.
### 7.12 API endpoints may bypass Cloudflare even when the main site is blocked
All four Fandom wikis returned HTTP 403 with a Cloudflare "Just a moment..." challenge when aiohttp accessed the user profile page (`/wiki/User:{username}`). However, the **MediaWiki API** on the same domain (`/api.php?action=query&list=users&ususers={username}&format=json`) returned clean JSON without any challenge. Similarly, **Substack** served a captcha-laden SPA for `/@{username}`, but its `public_profile` API (`/api/v1/user/{username}/public_profile`) responded with proper JSON and correct HTTP 404 for missing users.
This is likely because API routes are excluded from the Cloudflare WAF rules or use a different pipeline than the HTML-serving paths.
**Lesson:** When a site's main pages are blocked by Cloudflare or similar WAF, still check API endpoints on the **same domain** — they may not go through the same protection layer. This is especially true for:
- MediaWiki's `api.php` on wiki farms (Fandom, Wikia, self-hosted MediaWiki)
- REST API paths (`/api/v1/`, `/api/v2/`) on SPA-heavy sites
- Internal data endpoints that the SPA itself calls
### 7.13 GraphQL APIs often support GET, not just POST
**hashnode** exposes a GraphQL endpoint at `https://gql.hashnode.com`. While GraphQL is typically associated with POST requests, many implementations also support **GET** with the query passed as a URL parameter. This is critical for Maigret, which only supports GET/HEAD for `urlProbe`.
```
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":{"name":"Melwin D'Almeida","username":"melwinalm"}}}
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22a.slomkoowski%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":null}}
```
**Lesson:** Before giving up on a GraphQL-only site, try the same query via GET with `?query=...` (URL-encoded). Many GraphQL servers accept both methods.
### 7.14 URL-encoding resolves template placeholder conflicts
The hashnode GraphQL query `{user(username: "{username}") { name }}` contains curly braces that conflict with Maigret's `{username}` placeholder — Python's `str.format()` would raise a `KeyError` on `{user(username...}`.
The fix: URL-encode the GraphQL braces (`{``%7B`, `}``%7D`) but leave `{username}` as-is. Python's `.format()` only interprets literal `{…}` as placeholders, not `%7B…%7D`, and the GraphQL server decodes the percent-encoding on its end:
```
urlProbe: https://gql.hashnode.com?query=%7Buser(username%3A%20%22{username}%22)%20%7B%20name%20username%20%7D%7D
```
After `.format(username="melwinalm")`:
```
https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
```
**Lesson:** When a `urlProbe` needs literal curly braces (GraphQL, JSON in URL, etc.), percent-encode them. This is a general technique for any `data.json` URL field processed by `.format()`.
### 7.7 The playbook classification works
The decision tree from the documentation accurately describes real-world cases:
| Situation | Playbook says | Actual result |
|-----------|---------------|---------------|
| Captcha (Baidu) | `disabled: true` | Correct |
| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |
---
## Documentation maintenance
For any of the changes below, **always** keep these artifacts in sync — this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
- Maigret code changes (including [`maigret/checking.py`](../maigret/checking.py), request executors, CLI);
- New or changed search tools / helper utilities for site checks;
- Changes to rules or semantics of `checkType`, `data.json` fields, self-check, etc.;
- Changes to the **public JSON API** diagnostic step or **mandatory** `socid_extractor` logging rules.
Prefer updating the guide, playbook, and log template in one commit or in the same task so instructions do not diverge. **Append-only:** new proposals go at the bottom of `socid_extractor_improvements.log`; do not delete historical entries when editing the template.
-87
View File
@@ -1,87 +0,0 @@
# Site checks — playbook (Maigret)
Short checklist for edits to [`maigret/resources/data.json`](../maigret/resources/data.json) and, when needed, [`maigret/checking.py`](../maigret/checking.py). Full guide: [`site-checks-guide.md`](site-checks-guide.md). Upstream extraction proposals: [`socid_extractor_improvements.log`](socid_extractor_improvements.log).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **both** this file and [`site-checks-guide.md`](site-checks-guide.md) (see the “Documentation maintenance” section at the end of that file). When JSON API / `socid_extractor` logging rules change, update the **template header** in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) in the same change.
## 0. Standard checks (do alongside reproduce / classify)
- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).
## 1. Reproduce
- Run a targeted check:
`maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv`
- Compare an **existing** and a **non-existent** username (as `usernameClaimed` / `usernameUnclaimed` in JSON).
- With `-vvv`, inspect `debug.log` (raw response in the log).
## 2. Classify the cause
| Symptom | Typical cause | Action |
|--------|-----------------|--------|
| HTTP 200 for “user does not exist” | Soft 404 | Move from `status_code` to `message` or `response_url`; add `absenceStrs` / narrow `presenseStrs` |
| Generic words match (`name`, `email`) | `presenseStrs` too broad | Remove generic markers; add profile-specific ones |
| Same HTML without JS | SPA / skeleton shell | Compare **final URL and HTTP redirects** (Maigret already follows redirects by default). If the browser shows extra routes (`/posts`, `/not-found`) only **after JS**, they will **not** appear to Maigret — try a **public JSON/API** endpoint for the same site if one exists. See **Redirects and final URL** and **Picsart** in [`site-checks-guide.md`](site-checks-guide.md). |
| 403 / “Log in” / guest-only | Auth or anti-bot required | `disabled: true` |
| reCAPTCHA / “Checking your browser” | Bot protection | Try a reasonable `User-Agent` in `headers`; else `errors` + UNKNOWN or `disabled` |
| Domain does not resolve / persistent timeout | Dead service | Remove entry **only** after confirming the domain is dead |
## 3. Data edits
1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
## 4. Verify
- `maigret --self-check --site "SiteName" --db ...` for touched entries.
- `make test` before commit.
## 5. Code notes
- `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
- Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.
## 6. Development utilities
Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).
| Command | Purpose |
|---------|---------|
| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |
## 7. Quick tips (lessons learned)
Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).
| Tip | Why it matters |
|-----|----------------|
| **API first** | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints. |
| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
| **Use `debug.log`** | Run with `-vvv` to see raw response. Warning messages alone can be misleading. |
| **`status_code` for clean APIs** | If API returns proper 404 for missing users, prefer `status_code` over `message`. |
| **Migrate, don't delete** | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |
| **Engine templates break silently** | vBulletin `absenceStrs` failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first. |
| **Search-by-author is unreliable** | phpBB `search.php?author=` checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs. |
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe``.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic. |
## 8. Documentation maintenance
When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.
-4
View File
@@ -1,4 +0,0 @@
include LICENSE
include README.md
include requirements.txt
include maigret/resources/*
+134 -73
View File
@@ -1,7 +1,7 @@
# Maigret
<p align="center">
<p align="center">
<div align="center">
<div>
<a href="https://pypi.org/project/maigret/">
<img alt="PyPI version badge for Maigret" src="https://img.shields.io/pypi/v/maigret?style=flat-square" />
</a>
@@ -17,53 +17,97 @@
<a href="https://github.com/soxoj/maigret">
<img alt="View count for Maigret project" src="https://komarev.com/ghpvc/?username=maigret&color=brightgreen&label=views&style=flat-square" />
</a>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="300"/>
</p>
</p>
</div>
<br>
<div>
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="300" alt="Maigret logo"/>
</div>
<br>
</div>
<i>The Commissioner Jules Maigret is a fictional French police detective, created by Georges Simenon. His investigation method is based on understanding the personality of different people and their interactions.</i>
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required.
<b>👉👉👉 [Online Telegram bot](https://t.me/maigret_search_bot)</b>
## Contents
## About
- [In one minute](#in-one-minute)
- [Main features](#main-features)
- [Demo](#demo)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [Commercial Use](#commercial-use)
- [About](#about)
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys are required. Maigret is an easy-to-use and powerful fork of [Sherlock](https://github.com/sherlock-project/sherlock).
<a id="one-minute"></a>
## In one minute
Currently supports more than 3000 sites ([full list](https://github.com/soxoj/maigret/blob/main/sites.md)), search is launched against 500 popular sites in descending order of popularity by default. Also supported checking Tor sites, I2P sites, and domains (via DNS resolving).
Ensure you have Python 3.10 or higher.
## Powered By Maigret
```bash
pip install maigret
maigret YOUR_USERNAME
```
These are professional tools for social media content analysis and OSINT investigations that use Maigret (banners are clickable).
No install? Try the [Telegram bot](https://t.me/maigret_search_bot) or a [Cloud Shell](#cloud-shells).
Want a web UI? See [how to launch it](#web-interface).
See also: [Quick start](https://maigret.readthedocs.io/en/latest/quick-start.html).
## Main features
- Supports 3,000+ sites ([see full list](https://github.com/soxoj/maigret/blob/main/sites.md)). A default run checks the 500 highest-ranked sites by traffic; pass `-a` to scan everything, or `--tags` to narrow by category/country.
- Embeddable in Python projects — import `maigret` and run searches programmatically (see [library usage](https://maigret.readthedocs.io/en/latest/library-usage.html)).
- [Extracts](https://github.com/soxoj/socid_extractor) all available information about the account owner from profile pages and site APIs, including links to other accounts.
- Performs recursive search using discovered usernames and other IDs.
- Allows filtering by tags (site categories, countries).
- Detects and partially bypasses blocks, censorship, and CAPTCHA.
- Fetches an [auto-updated site database](https://maigret.readthedocs.io/en/latest/settings.html#database-auto-update) from GitHub each run (once per 24 hours), and falls back to the built-in database if offline.
- Works with Tor and I2P websites; able to check domains.
- Ships with a [web interface](#web-interface) for browsing results as a graph and downloading reports in every format from a single page.
For the complete feature list, see the [features documentation](https://maigret.readthedocs.io/en/latest/features.html).
### Used by
Professional OSINT and social-media analysis tools built on Maigret:
<a href="https://github.com/SocialLinks-IO/sociallinks-api"><img height="60" alt="Social Links API" src="https://github.com/user-attachments/assets/789747b2-d7a0-4d4e-8868-ffc4427df660"></a>
<a href="https://sociallinks.io/products/sl-crimewall"><img height="60" alt="Social Links Crimewall" src="https://github.com/user-attachments/assets/0b18f06c-2f38-477b-b946-1be1a632a9d1"></a>
<a href="https://usersearch.ai/"><img height="60" alt="UserSearch" src="https://github.com/user-attachments/assets/66daa213-cf7d-40cf-9267-42f97cf77580"></a>
## Main features
## Demo
* Profile page parsing, [extraction](https://github.com/soxoj/socid_extractor) of personal info, links to other profiles, etc.
* Recursive search by new usernames and other IDs found
* Search by tags (site categories, countries)
* Censorship and captcha detection
* Requests retries
### Video
See the full description of Maigret features [in the documentation](https://maigret.readthedocs.io/en/latest/features.html).
<a href="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ">
<img src="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ.svg" alt="asciicast" width="600">
</a>
### Reports
[PDF report](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.pdf), [HTML report](https://htmlpreview.github.io/?https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.html)
![HTML report screenshot](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_html_screenshot.png)
![XMind 8 report screenshot](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_xmind_screenshot.png)
[Full console output](https://raw.githubusercontent.com/soxoj/maigret/main/static/recursive_search.md)
## Installation
‼️ Maigret is available online via [official Telegram bot](https://t.me/maigret_search_bot). Consider using it if you don't want to install anything.
Already ran the [In one minute](#one-minute) steps? You're set. Below are alternative methods.
Don't want to install anything? Use the [Telegram bot](https://t.me/maigret_search_bot).
### Windows
Standalone EXE-binaries for Windows are located in [Releases section](https://github.com/soxoj/maigret/releases) of GitHub repository.
Download a standalone EXE from [Releases](https://github.com/soxoj/maigret/releases). Video guide: https://youtu.be/qIgwTZOmMmM.
Video guide on how to run it: https://youtu.be/qIgwTZOmMmM.
<a id="cloud-shells"></a>
### Cloud Shells
### Installation in Cloud Shells
You can launch Maigret using cloud shells and Jupyter notebooks. Press one of the buttons below and follow the instructions to launch it in your browser.
Run Maigret in the browser via cloud shells or Jupyter notebooks:
[![Open in Cloud Shell](https://user-images.githubusercontent.com/27065646/92304704-8d146d80-ef80-11ea-8c29-0deaabb1c702.png)](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=README.md)
<a href="https://repl.it/github/soxoj/maigret"><img src="https://replit.com/badge/github/soxoj/maigret" alt="Run on Replit" height="50"></a>
@@ -71,12 +115,7 @@ You can launch Maigret using cloud shells and Jupyter notebooks. Press one of th
<a href="https://colab.research.google.com/gist/soxoj/879b51bc3b2f8b695abb054090645000/maigret-collab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="45"></a>
<a href="https://mybinder.org/v2/gist/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a/HEAD"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder" height="45"></a>
### Local installation
Maigret can be installed using pip, Docker, or simply can be launched from the cloned repo.
**NOTE**: Python 3.10 or higher and pip is required, **Python 3.11 is recommended.**
### Local installation (pip)
```bash
# install from pypi
@@ -86,7 +125,7 @@ pip3 install maigret
maigret username
```
### Cloning a repository
### From source
```bash
# or clone and install manually
@@ -112,7 +151,13 @@ docker run -v /mydir:/app/reports soxoj/maigret:latest username --html
docker build -t maigret .
```
## Usage examples
### Troubleshooting
Build errors? See the [troubleshooting guide](https://maigret.readthedocs.io/en/latest/installation.html#troubleshooting).
## Usage
### Examples
```bash
# make HTML, PDF, and Xmind8 reports
@@ -120,6 +165,12 @@ maigret user --html
maigret user --pdf
maigret user --xmind #Output not compatible with xmind 2022+
# machine-readable exports
maigret user --json ndjson # newline-delimited JSON (also: --json simple)
maigret user --csv
maigret user --txt
maigret user --graph # interactive D3 graph (HTML)
# search on sites marked with tags photo & dating
maigret user --tags photo,dating
@@ -130,11 +181,12 @@ maigret user --tags us
maigret user1 user2 user3 -a
```
Use `maigret --help` to get full options description. Also options [are documented](https://maigret.readthedocs.io/en/latest/command-line-options.html).
Run `maigret --help` for all options. Docs: [CLI options](https://maigret.readthedocs.io/en/latest/command-line-options.html), [more examples](https://maigret.readthedocs.io/en/latest/usage-examples.html). Running into 403s or timeouts? See [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
<a id="web-interface"></a>
### Web interface
You can run Maigret with a web interface, where you can view the graph with results and download reports of all formats on a single page.
Maigret has a built-in web UI with a results graph and downloadable reports.
<details>
<summary>Web Interface Screenshots</summary>
@@ -145,62 +197,71 @@ You can run Maigret with a web interface, where you can view the graph with resu
</details>
Instructions:
1. Run Maigret with the ``--web`` flag and specify the port number.
```console
maigret --web 5000
```
2. Open http://127.0.0.1:5000 in your browser and enter one or more usernames to make a search.
3. Wait a bit for the search to complete and view the graph with results, the table with all accounts found, and download reports of all formats.
Open http://127.0.0.1:5000, enter a username, and view results.
### Python library
**Maigret can be embedded in your own Python projects.** The CLI is a thin wrapper around an async function you can call directly — build custom pipelines, feed results into your own tooling, or run it inside a larger OSINT workflow.
See the full [library usage guide](https://maigret.readthedocs.io/en/latest/library-usage.html) for a working example, async patterns, and how to filter sites by tag.
### Useful CLI flags
- `--parse URL` — parse a profile page, extract IDs/usernames, and use them to kick off a recursive search.
- `--permute` — generate likely username variants from two or more inputs (e.g. `john doe``johndoe`, `j.doe`, …) and search for all of them.
- `--self-check [--auto-disable]` — verify `usernameClaimed` / `usernameUnclaimed` pairs against live sites for maintainers auditing the database.
### Tor / I2P / proxies
Maigret can route checks through a proxy, Tor, or I2P — useful for `.onion` / `.i2p` sites and for bypassing WAFs that block datacenter IPs.
```bash
# any HTTP/SOCKS proxy
maigret user --proxy socks5://127.0.0.1:1080
# Tor (default gateway socks5://127.0.0.1:9050)
maigret user --tor-proxy socks5://127.0.0.1:9050
# I2P (default gateway http://127.0.0.1:4444)
maigret user --i2p-proxy http://127.0.0.1:4444
```
Start your Tor / I2P daemon before running the command — Maigret does not manage these gateways.
## Contributing
Maigret has open-source code, so you may contribute your own sites by adding them to `data.json` file, or bring changes to it's code!
Add or fix new sites surgically in `data.json` (no `json.load`/`json.dump`), then run `./utils/update_site_data.py` to regenerate `sites.md` and the database metadata, and open a pull request. For more details, see the [CONTRIBUTING guide](https://github.com/soxoj/maigret/blob/main/CONTRIBUTING.md) and [development docs](https://maigret.readthedocs.io/en/latest/development.html). Release history: [CHANGELOG.md](CHANGELOG.md).
For more information about development and contribution, please read the [development documentation](https://maigret.readthedocs.io/en/latest/development.html).
## Commercial Use
## Demo with page parsing and recursive username search
The open-source Maigret is MIT-licensed and free for commercial use without restriction — but site checks break over time and need active maintenance.
### Video (asciinema)
For serious commercial use — with a **daily-updated site database** or a **username-check API** — reach out: 📧 [maigret@soxoj.com](mailto:maigret@soxoj.com)
<a href="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ">
<img src="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ.svg" alt="asciicast" width="600">
</a>
- Private site database — 5 000+ sites, updated daily (separate from the public open-source database)
- Username check API — integrate Maigret into your product
### Reports
## About
[PDF report](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.pdf), [HTML report](https://htmlpreview.github.io/?https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.html)
### Disclaimer
![HTML report screenshot](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_html_screenshot.png)
**For educational and lawful purposes only.** You are responsible for complying with all applicable laws (GDPR, CCPA, etc.) in your jurisdiction. The authors bear no responsibility for misuse.
![XMind 8 report screenshot](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_xmind_screenshot.png)
### Feedback
[Full console output](https://raw.githubusercontent.com/soxoj/maigret/main/static/recursive_search.md)
[Open an issue](https://github.com/soxoj/maigret/issues) · [GitHub Discussions](https://github.com/soxoj/maigret/discussions) · [Telegram](https://t.me/soxoj)
## Disclaimer
### SOWEL classification
**This tool is intended for educational and lawful purposes only.** The developers do not endorse or encourage any illegal activities or misuse of this tool. Regulations regarding the collection and use of personal data vary by country and region, including but not limited to GDPR in the EU, CCPA in the USA, and similar laws worldwide.
It is your sole responsibility to ensure that your use of this tool complies with all applicable laws and regulations in your jurisdiction. Any illegal use of this tool is strictly prohibited, and you are fully accountable for your actions.
The authors and developers of this tool bear no responsibility for any misuse or unlawful activities conducted by its users.
## Feedback
If you have any questions, suggestions, or feedback, please feel free to [open an issue](https://github.com/soxoj/maigret/issues), create a [GitHub discussion](https://github.com/soxoj/maigret/discussions), or contact the author directly via [Telegram](https://t.me/soxoj).
## SOWEL classification
This tool uses the following OSINT techniques:
OSINT techniques used:
- [SOTL-2.2. Search For Accounts On Other Platforms](https://sowel.soxoj.com/other-platform-accounts)
- [SOTL-6.1. Check Logins Reuse To Find Another Account](https://sowel.soxoj.com/logins-reuse)
- [SOTL-6.2. Check Nicknames Reuse To Find Another Account](https://sowel.soxoj.com/nicknames-reuse)
## License
### License
MIT © [Maigret](https://github.com/soxoj/maigret)<br/>
MIT © [Sherlock Project](https://github.com/sherlock-project/)<br/>
Original Creator of Sherlock Project - [Siddharth Dushantha](https://github.com/sdushantha)
MIT © [Maigret](https://github.com/soxoj/maigret)
+91
View File
@@ -0,0 +1,91 @@
# Troubleshooting
Common issues when running Maigret and how to fix them. If none of this helps, [open an issue](https://github.com/soxoj/maigret/issues) with the output of `maigret --version` and the exact command you ran.
## "Lots of sites fail / timeout / return 403"
This is by far the most common report. It almost always comes from anti-bot protection (Cloudflare, DDoS-Guard, Akamai, etc.) or a slow network — not from a bug in Maigret.
**Results vary a lot depending on where you run from.** The same command on the same username can produce very different output on:
- **Mobile internet** (4G/5G) — usually the best results. Carrier NAT shares your IP with thousands of real users, so WAFs rarely block it.
- **Home broadband** — generally good, though some ISPs are reputation-flagged.
- **Hosting / cloud / VPS infrastructure** (AWS, GCP, DigitalOcean, Hetzner, etc.) — the worst case. Datacenter IP ranges are blanket-blocked or challenged by most WAFs, so you will see many false negatives and 403s.
If a run looks suspiciously empty, **try a different network before assuming Maigret is broken**: tether from your phone, switch between Wi-Fi and mobile, or move the run off a VPS onto a residential machine. Comparing results across two networks is also the fastest way to tell whether a missing account is genuinely missing or just blocked on the current IP.
Once you have a sense of the baseline, try these tweaks in order:
1. **Raise the timeout.** The default is 30 seconds. On mobile networks or for slow sites, bump it:
```bash
maigret user --timeout 60
```
2. **Retry failed checks.** Transient 5xx / timeouts often clear on a second try:
```bash
maigret user --retries 2
```
3. **Lower parallelism.** Some WAFs rate-limit aggressively. Maigret defaults to 100 concurrent connections (`-n` / `--max-connections`) — dropping this makes you look less like a scanner:
```bash
maigret user -n 20
```
4. **Route through a residential proxy.** Datacenter IPs (AWS, GCP, DigitalOcean) are blanket-blocked by many WAFs. A residential / mobile proxy usually fixes this:
```bash
maigret user --proxy http://user:pass@residential-proxy:port
```
Note: Tor (`--tor-proxy`) rarely helps here — most WAFs block Tor exit nodes just as aggressively as datacenter IPs. Use Tor only when you actually need to reach `.onion` sites (see below).
If specific sites *always* fail regardless of the above, they are likely broken in the database (stale markers, new WAF, site redesign). Report them with `--print-errors` output so a maintainer can look at the check config.
## "No results at all" / "maigret: command not found"
- **`command not found`** — `pip install maigret` put the binary under `~/.local/bin` (Linux/macOS) or `%APPDATA%\Python\Scripts` (Windows). Add that directory to `PATH`, or run `python3 -m maigret user` instead.
- **Empty output** — check that you actually passed a username; `maigret` alone prints help. Also confirm Python 3.10+ with `python3 --version`.
## "SSL / certificate errors"
Usually caused by a corporate MITM proxy or an outdated `certifi` bundle.
```bash
pip install --upgrade certifi
```
If you are behind a corporate proxy, set `HTTPS_PROXY` / `HTTP_PROXY` environment variables and pass `--proxy "$HTTPS_PROXY"` so Maigret uses the same route.
## ".onion / .i2p sites are skipped"
These sites only load through the matching gateway. Start your Tor or I2P daemon first, then:
```bash
# Tor
maigret user --tor-proxy socks5://127.0.0.1:9050
# I2P
maigret user --i2p-proxy http://127.0.0.1:4444
```
Maigret does not launch or manage these daemons — they must already be running.
## "The PDF / XMind / HTML report looks wrong"
- **PDF** — requires `weasyprint` and its system dependencies (Pango, Cairo, GDK-PixBuf). On Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. macOS: `brew install pango`.
- **XMind** — the `--xmind` flag generates **XMind 8** files. XMind 2022+ (Zen / XMind 2023) uses a different format and will not open them. Use XMind 8 or convert via `--html`.
- **HTML** looks unstyled — open it through a local file path (`file:///...`), not via a preview pane that strips CSS.
## "The site database is out of date"
Maigret auto-fetches a fresh `data.json` from GitHub once every 24 hours. To force-refresh now:
```bash
maigret user --force-update
```
To run entirely against the local built-in copy (e.g. offline):
```bash
maigret user --no-autoupdate
```
## Still stuck?
- [Open an issue](https://github.com/soxoj/maigret/issues) — include your OS, Python version, Maigret version, and the full command.
- Ask in [GitHub Discussions](https://github.com/soxoj/maigret/discussions) or the [Telegram](https://t.me/soxoj) channel.
+106 -7
View File
@@ -82,11 +82,63 @@ id types, sites will be filtered automatically.
ids. Useful for repeated scanning with found known irrelevant usernames.
``--db`` - Load Maigret database from a JSON file or an online, valid,
JSON file.
JSON file. See :ref:`custom-database` below.
``--no-autoupdate`` - Disable the automatic database update check that
runs at startup. The currently cached (or bundled) database is used
as-is.
``--force-update`` - Force a database update check at startup, ignoring
the usual check interval. Implies ``--no-autoupdate`` for the rest of
the run after the explicit update finishes.
``--retries RETRIES`` - Count of attempts to restart temporarily failed
requests.
.. _custom-database:
Using a custom sites database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``--db`` flag accepts three forms:
1. **HTTP(S) URL** — fetched as-is, e.g.
``--db https://example.com/my_db.json``.
2. **Local file path** — absolute (``--db /tmp/private.json``) or
relative to the current working directory
(``--db LLM/maigret_private_db.json``).
3. **Module-relative path** — kept for backwards compatibility, resolved
against the installed ``maigret/`` package directory (e.g. the
default ``resources/data.json``).
Resolution order for local paths: the path is first tried as given
(absolute or cwd-relative); if that file does not exist, Maigret falls
back to the legacy module-relative resolution. If neither location
contains the file, Maigret exits with an error rather than silently
loading the bundled database.
When ``--db`` points to a custom file, automatic database updates are
skipped — the file is used exactly as provided.
On every run Maigret prints the database it actually loaded, for
example::
[+] Using sites database: /path/to/maigret_private_db.json (6 sites)
If loading the requested database fails for any other reason (corrupt
JSON, missing required keys, …), Maigret prints a warning, falls back
to the bundled database, and reports the fallback explicitly::
[-] Falling back to bundled database: /…/maigret/resources/data.json
[+] Using sites database: /…/maigret/resources/data.json (3154 sites)
A typical invocation against a private database, with auto-update
disabled and all sites scanned, looks like::
python3 -m maigret username \
--db LLM/maigret_private_db.json \
--no-autoupdate -a
Reports
-------
@@ -106,6 +158,9 @@ username).
``-J``, ``--json`` - Generate a JSON report of specific type: simple,
ndjson (one report per username). E.g. ``--json ndjson``
``-M``, ``--md`` - Generate a Markdown report (general report on all
usernames). See :ref:`markdown-report` below.
``-fo``, ``--folderoutput`` - Results will be saved to this folder,
``results`` by default. Will be created if doesnt exist.
@@ -130,16 +185,60 @@ Other operations modes
``--version`` - Display version information and dependencies.
``--self-check`` - Do self-checking for sites and database and disable
non-working ones **for current search session** by default. Its useful
for testing new internet connection (it depends on provider/hosting on
which sites there will be censorship stub or captcha display). After
checking Maigret asks if you want to save updates, answering y/Y will
rewrite the local database.
``--self-check`` - Do self-checking for sites and database. Each site is
tested by looking up its known-claimed and known-unclaimed usernames and
verifying that the results match expectations. Individual site failures
(network errors, unexpected exceptions, etc.) are caught and logged
without stopping the overall process, so the check always runs to
completion. After checking, Maigret reports a summary of issues found.
If any sites were disabled (see ``--auto-disable``), Maigret asks if you
want to save updates; answering y/Y will rewrite the local database.
``--auto-disable`` - Used with ``--self-check``: automatically disable
sites that fail checks (incorrect detection of claimed/unclaimed
usernames, connection errors, or unexpected exceptions). Without this
flag, ``--self-check`` only **reports** issues without modifying the
database.
``--diagnose`` - Used with ``--self-check``: print detailed diagnosis
information for each failing site, including the check type, the list
of issues found, and recommendations (e.g. suggesting a different
``checkType``).
``--submit URL`` - Do an automatic analysis of the given account URL or
site main page URL to determine the site engine and methods to check
account presence. After checking Maigret asks if you want to add the
site, answering y/Y will rewrite the local database.
.. _markdown-report:
Markdown report (LLM-friendly)
------------------------------
The ``--md`` / ``-M`` flag generates a Markdown report designed for both human reading and analysis by AI assistants (ChatGPT, Claude, etc.).
.. code-block:: console
maigret username --md
The report includes:
- **Summary** with aggregated personal data (all fullnames, locations, bios found across accounts), country tags, website tags, first/last seen timestamps.
- **Per-account sections** with profile URL, site tags, and all extracted fields (username, bio, follower count, linked accounts, etc.).
- **Possible false positives** disclaimer explaining that accounts may belong to different people.
- **Ethical use** notice about applicable data protection laws.
**Using with AI tools:**
The Markdown format is optimized for LLM context windows. You can feed the report directly to an AI assistant for follow-up analysis:
.. code-block:: console
# Generate the report
maigret johndoe --md
# Feed it to an AI tool
cat reports/report_johndoe.md | llm "Analyze this OSINT report and summarize key findings"
The structured Markdown with per-site sections makes it easy for AI tools to extract relationships, cross-reference identities, and identify patterns across accounts.
+41 -1
View File
@@ -69,6 +69,21 @@ Use the following commands to check Maigret:
make speed
Site naming conventions
-----------------------------------------------
Site names are the keys in ``data.json`` and appear in user-facing reports. Follow these rules:
- **Title Case** by default: ``Product Hunt``, ``Hacker News``.
- **Lowercase** only if the brand itself is written that way: ``kofi``, ``note``, ``hi5``.
- **No domain suffix** (``calendly.com````Calendly``), unless the domain is part of the recognized brand name: ``last.fm``, ``VC.ru``, ``Archive.org``.
- **No full UPPERCASE** unless the brand is an acronym: ``VK``, ``CNET``, ``ICQ``, ``IFTTT``.
- **No** ``www.`` **or** ``https://`` **prefix** in the name.
- **Spaces** are allowed when the brand uses them: ``Star Citizen``, ``Google Maps``.
- **{username} templates** in names are acceptable: ``{username}.tilda.ws``.
When in doubt, check how the service refers to itself on its homepage.
How to fix false-positives
-----------------------------------------------
@@ -81,7 +96,7 @@ You should make your git commits from your maigret git repo folder, or else the
If you already know which site has a false-positive and want to fix it specifically, go to the next step.
Otherwise, simply run a search with a random username (e.g. `laiuhi3h4gi3u4hgt`) and check the results.
Alternatively, you can use `the Telegram bot <https://t.me/osint_maigret_bot>`_.
Alternatively, you can use `the Telegram bot <https://t.me/maigret_search_bot>`_.
2. Open the account link in your browser and check:
@@ -122,6 +137,31 @@ There are few options for sites data.json helpful in various cases:
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - set the HTTP method to use (e.g., ``POST``). By default, Maigret natively defaults to GET or HEAD.
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), extremely useful for parsing GraphQL or modern JSON APIs.
- ``protection`` - a list of protection types detected on the site (see below).
``protection`` (site protection tracking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``protection`` field records what kind of anti-bot protection a site uses. Maigret reads this field and automatically applies the appropriate bypass mechanism.
Supported values:
- ``tls_fingerprint`` — the site fingerprints the TLS handshake (JA3/JA4) and blocks non-browser clients. Maigret automatically uses ``curl_cffi`` with Chrome browser emulation to bypass this. Requires the ``curl_cffi`` package (included as a dependency). Examples: Instagram, NPM, Codepen, Kickstarter, Letterboxd.
- ``ip_reputation`` — the site blocks requests from datacenter/cloud IPs regardless of headers or TLS. Cannot be bypassed automatically; run Maigret from a regular internet connection (not a datacenter) or use a proxy (``--proxy``). Examples: Reddit, Patreon, Figma.
- ``js_challenge`` — the site serves a JavaScript challenge page (e.g. "Just a moment...") that cannot be solved without a browser. Maigret detects challenge signatures and returns UNKNOWN instead of a false positive.
- ``aws_waf_js_challenge`` — the site is protected by AWS WAF with a JavaScript challenge. Symptom: HTTP 202 with empty body and ``x-amzn-waf-action: challenge`` header (a token-granting challenge that requires executing the CAPTCHA/challenge JS bundle). Neither ``curl_cffi`` TLS impersonation nor User-Agent changes bypass this — a real browser or the official AWS WAF challenge-solver SDK is required. Currently marked for documentation only; sites using this protection stay ``disabled: true`` until a solver is integrated. Example: Dreamwidth.
Example:
.. code-block:: json
"Instagram": {
"url": "https://www.instagram.com/{username}/",
"checkType": "message",
"presenseStrs": ["\"routePath\":\"\\/"],
"absenceStrs": ["\"routePath\":null"],
"protection": ["tls_fingerprint"]
}
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+29
View File
@@ -170,6 +170,35 @@ Maigret will do retries of the requests with temporary errors got (connection fa
One attempt by default, can be changed with option ``--retries N``.
Database self-check
-------------------
Maigret includes a self-check mode (``--self-check``) that validates every site
in the database by looking up its known-claimed and known-unclaimed usernames
and verifying that the detection results match expectations.
The self-check is **error-resilient**: if an individual site check raises an
unexpected exception (e.g. a network error or a parsing failure), the error is
caught, logged, and recorded as an issue — the remaining sites continue to be
checked without interruption. This means the process always runs to completion,
even when checking hundreds of sites with ``-a --self-check``.
Use ``--auto-disable`` together with ``--self-check`` to automatically disable
sites that fail checks. Without it, issues are only reported. Use ``--diagnose``
to print detailed per-site diagnosis including the check type, specific issues,
and recommendations.
.. code-block:: console
# Report-only mode (no changes to the database)
maigret --self-check
# Automatically disable failing sites and save updates
maigret -a --self-check --auto-disable
# Show detailed diagnosis for each failing site
maigret -a --self-check --diagnose
Archives and mirrors checking
-----------------------------
+2
View File
@@ -29,6 +29,7 @@ You may be interested in:
- :doc:`Usage examples <usage-examples>`
- :doc:`Command line options <command-line-options>`
- :doc:`Features list <features>`
- :doc:`Library usage <library-usage>`
.. toctree::
:hidden:
@@ -39,6 +40,7 @@ You may be interested in:
usage-examples
command-line-options
features
library-usage
philosophy
supported-identifier-types
tags
+38 -3
View File
@@ -4,7 +4,7 @@ Installation
============
Maigret can be installed using pip, Docker, or simply can be launched from the cloned repo.
Also, it is available online via `official Telegram bot <https://t.me/osint_maigret_bot>`_,
Also, it is available online via `official Telegram bot <https://t.me/maigret_search_bot>`_,
source code of a bot is `available on GitHub <https://github.com/soxoj/maigret-tg-bot>`_.
Windows Standalone EXE-binaries
@@ -45,8 +45,7 @@ Press one of the buttons below and follow the instructions to launch it in your
Local installation from PyPi
----------------------------
Please note that the sites database in the PyPI package may be outdated.
If you encounter frequent false positive results, we recommend installing the latest development version from GitHub instead.
Maigret ships with a bundled site database. After installation from PyPI (or any other method), it can **automatically fetch a newer compatible database from GitHub** when you run it—see :ref:`database-auto-update` in :doc:`settings`.
.. note::
Python 3.10 or higher and pip is required, **Python 3.11 is recommended.**
@@ -90,3 +89,39 @@ Docker
# manual build
docker build -t maigret .
Troubleshooting
---------------
If you encounter build errors during installation such as ``cannot find ft2build.h``
or errors related to ``reportlab`` / ``_renderPM``, you need to install system-level
dependencies required to compile native extensions.
**Debian/Ubuntu/Kali:**
.. code-block:: bash
sudo apt install -y libfreetype6-dev libjpeg-dev libffi-dev
**Fedora/RHEL/CentOS:**
.. code-block:: bash
sudo dnf install -y freetype-devel libjpeg-devel libffi-devel
**Arch Linux:**
.. code-block:: bash
sudo pacman -S freetype2 libjpeg-turbo libffi
**macOS (Homebrew):**
.. code-block:: bash
brew install freetype
After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.
+139
View File
@@ -0,0 +1,139 @@
.. _library-usage:
Library usage
=============
Maigret's CLI is a thin wrapper around an async Python API. You can embed Maigret in your own tools, pipelines, and OSINT workflows — no need to shell out.
This page covers the common patterns. For the full argument list of the underlying function, see ``maigret.checking.maigret`` in the source.
Installation
------------
.. code-block:: bash
pip install maigret
Minimal example
---------------
A working end-to-end search against the top 500 sites:
.. code-block:: python
import asyncio
import logging
from maigret import search as maigret_search
from maigret.sites import MaigretDatabase
# Load the bundled site database
db = MaigretDatabase().load_from_path(
"maigret/resources/data.json"
)
# Pick which sites to scan (same filtering the CLI uses)
sites = db.ranked_sites_dict(top=500)
results = asyncio.run(
maigret_search(
username="soxoj",
site_dict=sites,
logger=logging.getLogger("maigret"),
timeout=30,
is_parsing_enabled=True,
)
)
for site_name, result in results.items():
if result["status"].is_found():
print(site_name, result["url_user"])
Key points:
- ``maigret_search`` is an ``async`` function — wrap it with ``asyncio.run(...)`` or ``await`` it from inside your own event loop.
- ``is_parsing_enabled=True`` turns on ``socid_extractor`` so ``result["ids_data"]`` is populated with profile fields (bio, linked accounts, uids, etc.).
- Each entry in the returned dict has a ``"status"`` object with ``is_found()``, plus ``url_user``, ``http_status``, ``rank``, ``ids_data``, and more.
Filtering sites
---------------
``ranked_sites_dict`` accepts the same filters as the CLI:
.. code-block:: python
# All sites tagged as coding, top 200 by rank
sites = db.ranked_sites_dict(top=200, tags=["coding"])
# Exclude NSFW and dating sites
sites = db.ranked_sites_dict(excluded_tags=["nsfw", "dating"])
# Only specific sites by name
sites = db.ranked_sites_dict(names=["GitHub", "Reddit", "VK"])
# Include disabled sites (useful for maintenance / self-check)
sites = db.ranked_sites_dict(disabled=True)
Running inside an existing event loop
-------------------------------------
If your application already runs an asyncio loop (FastAPI, aiohttp server, a Discord bot, etc.), ``await`` ``maigret_search`` directly instead of calling ``asyncio.run``:
.. code-block:: python
async def check_username(username: str) -> dict:
results = await maigret_search(
username=username,
site_dict=sites,
logger=logger,
timeout=30,
)
return {
name: r["url_user"]
for name, r in results.items()
if r["status"].is_found()
}
Routing through a proxy
-----------------------
The same proxy / Tor / I2P flags the CLI exposes are plain keyword arguments:
.. code-block:: python
results = await maigret_search(
username="soxoj",
site_dict=sites,
logger=logger,
proxy="socks5://127.0.0.1:1080",
tor_proxy="socks5://127.0.0.1:9050", # used for .onion sites
i2p_proxy="http://127.0.0.1:4444", # used for .i2p sites
timeout=30,
)
Full function signature
-----------------------
.. code-block:: python
async def maigret(
username: str,
site_dict: Dict[str, MaigretSite],
logger,
query_notify=None,
proxy=None,
tor_proxy=None,
i2p_proxy=None,
timeout=30,
is_parsing_enabled=False,
id_type="username",
debug=False,
forced=False,
max_connections=100,
no_progressbar=False,
cookies=None,
retries=0,
check_domains=False,
) -> QueryResultWrapper
See :doc:`command-line-options` for a description of each option — the semantics match the CLI flags one-to-one.
+24
View File
@@ -3,6 +3,10 @@
Philosophy
==========
*The Commissioner Jules Maigret is a fictional French police detective, created by Georges Simenon.
His investigation method is based on understanding the personality of different people and their
interactions.*
TL;DR: Username => Dossier
Maigret is designed to gather all the available information about person by his username.
@@ -15,3 +19,23 @@ All this information forms some dossier, but it also useful for other tools and
Each collected piece of data has a label of a certain format (for example, ``follower_count`` for the number
of subscribers or ``created_at`` for account creation time) so that it can be parsed and analyzed by various
systems and stored in databases.
Origins
-------
Maigret started from studying what OSINT investigators actually use in practice — and from
the realization that many popular tools do not deliver real investigative value. The original
research behind this observation is summarized in the article
`What's wrong with namecheckers <https://soxoj.medium.com/whats-wrong-with-namecheckers-981e5cba600e>`_.
For a broader landscape of username-checking tools, see the curated
`OSINT namecheckers list <https://github.com/soxoj/osint-namecheckers-list>`_.
Two ideas grew out of that research:
- `socid-extractor <https://github.com/soxoj/socid-extractor>`_ — a library focused on pulling
structured identity data (user IDs, full names, linked accounts, bios, timestamps, etc.) out of
account pages and public API responses, so that finding an account is not the end of the pipeline.
- **Maigret** itself — which started as a fork of
`Sherlock <https://github.com/sherlock-project/sherlock>`_ but has long since outgrown the
original project in coverage, extraction depth, and check reliability. Today Maigret is used
as a component by major OSINT vendors in their commercial products.
+74
View File
@@ -27,3 +27,77 @@ Missing any of these files is not an error.
If the next settings file contains already known option,
this option will be rewrited. So it is possible to make
custom configuration for different users and directories.
.. _database-auto-update:
Database auto-update
--------------------
Maigret ships with a bundled site database, but it gets outdated between releases. To keep the database current, Maigret automatically checks for updates on startup.
**How it works:**
1. On startup, Maigret checks if more than 24 hours have passed since the last update check.
2. If so, it fetches a lightweight metadata file (~200 bytes) from GitHub to see if a newer database is available.
3. If a newer, compatible database exists, Maigret downloads it to ``~/.maigret/data.json`` and uses it instead of the bundled copy.
4. If the download fails or the new database is incompatible with your Maigret version, the bundled database is used as a fallback.
The downloaded database has **higher priority** than the bundled one — it replaces, not overlays.
**Status messages** are printed only when an action occurs:
.. code-block:: text
[*] DB auto-update: checking for updates...
[+] DB auto-update: database updated successfully (3180 sites)
[*] DB auto-update: database is up to date (3157 sites)
[!] DB auto-update: latest database requires maigret >= 0.6.0, you have 0.5.0
**Forcing an update:**
Use the ``--force-update`` flag to check for updates immediately, ignoring the check interval:
.. code-block:: console
maigret username --force-update
The update happens at startup, then the search continues normally with the freshly downloaded database.
**Disabling auto-update:**
Use the ``--no-autoupdate`` flag to skip the update check entirely:
.. code-block:: console
maigret username --no-autoupdate
Or set it permanently in ``~/.maigret/settings.json``:
.. code-block:: json
{
"no_autoupdate": true
}
This is recommended for **Docker containers**, **CI pipelines**, and **air-gapped environments**.
**Configuration options** (in ``settings.json``):
.. list-table::
:header-rows: 1
:widths: 35 15 50
* - Setting
- Default
- Description
* - ``no_autoupdate``
- ``false``
- Disable auto-update entirely
* - ``autoupdate_check_interval_hours``
- ``24``
- How often to check for updates (in hours)
* - ``db_update_meta_url``
- GitHub raw URL
- URL of the metadata file (for custom mirrors)
**Using a custom database** with ``--db`` always skips auto-update — you are explicitly choosing your data source.
+6 -1
View File
@@ -10,7 +10,12 @@ The use of tags allows you to select a subset of the sites from big Maigret DB f
There are several types of tags:
1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). These tags reflect the site language and regional origin of its users and are then used to locate the owner of a username. If the regional origin is difficult to establish or a site is positioned as worldwide, `no country code is given`. There could be multiple country code tags for one site.
1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). A country tag means that having an account on the site implies a connection to that country — either origin or residence. The goal is attribution, not perfect accuracy.
- **Global sites** (GitHub, YouTube, Reddit, Medium, etc.) get **no country tag** — an account there says nothing about where a person is from.
- **Regional/local sites** where an account implies a specific country **must** have a country tag: ``VK````ru``, ``Naver````kr``, ``Zhihu````cn``.
- Multiple country tags are allowed when a service is used predominantly in a few countries (e.g. ``Xing````de``, ``eu``).
- Do **not** assign country tags based on traffic statistics alone — a site popular in India by traffic is not "Indian" if it is used globally.
2. **Site engines**. Most of them are forum engines now: ``uCoz``, ``vBulletin``, ``XenForo`` et al. Full list of engines stored in the Maigret database.
+1 -1
View File
@@ -33,7 +33,7 @@ Use Cases
If you experience many false positives, you can do the following:
- Install the last development version of Maigret from GitHub
- Run Maigret with ``--self-check`` flag and agree on disabling of problematic sites
- Run Maigret with ``--self-check --auto-disable`` flag and agree on disabling of problematic sites
3. Search for accounts with username ``machine42`` and generate HTML and PDF reports.
+1 -1
View File
@@ -1,3 +1,3 @@
"""Maigret version file"""
__version__ = '0.5.0'
__version__ = '0.6.0'
+3 -14
View File
@@ -30,17 +30,6 @@ class ParsingActivator:
jwt_token = r.json()["jwt"]
site.headers["Authorization"] = "jwt " + jwt_token
@staticmethod
def spotify(site, logger, cookies={}):
headers = dict(site.headers)
if "Authorization" in headers:
del headers["Authorization"]
import requests
r = requests.get(site.activation["url"])
bearer_token = r.json()["accessToken"]
site.headers["authorization"] = f"Bearer {bearer_token}"
@staticmethod
def weibo(site, logger):
headers = dict(site.headers)
@@ -54,7 +43,7 @@ class ParsingActivator:
logger.debug(
f"1 stage: {'success' if r.status_code == 302 else 'no 302 redirect, fail!'}"
)
location = r.headers.get("Location")
location = r.headers.get("Location", "")
# 2 stage: go to passport visitor page
headers["Referer"] = location
@@ -84,9 +73,9 @@ def import_aiohttp_cookies(cookiestxt_filename):
cookies = CookieJar()
cookies_list = []
for domain in cookies_obj._cookies.values():
for domain in cookies_obj._cookies.values(): # type: ignore[attr-defined]
for key, cookie in list(domain.values())[0].items():
c = Morsel()
c: Morsel = Morsel()
c.set(key, cookie.value, cookie.value)
c["domain"] = cookie.domain
c["path"] = cookie.path
+251 -136
View File
@@ -6,7 +6,8 @@ import random
import re
import ssl
import sys
from typing import Dict, List, Optional, Tuple
import time
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import quote
# Third party imports
@@ -15,7 +16,7 @@ from alive_progress import alive_bar
from aiohttp import ClientSession, TCPConnector, http_exceptions
from aiohttp.client_exceptions import ClientConnectorError, ServerDisconnectedError
from python_socks import _errors as proxy_errors
from socid_extractor import extract
from socid_extractor import extract # type: ignore[import-not-found]
try:
from mock import Mock
@@ -61,8 +62,6 @@ class SimpleAiohttpChecker(CheckerBase):
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
@@ -80,7 +79,7 @@ class SimpleAiohttpChecker(CheckerBase):
async def _make_request(
self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
) -> Tuple[str, int, Optional[CheckError]]:
) -> Tuple[Optional[str], int, Optional[CheckError]]:
try:
if method.lower() == 'get':
request_method = session.get
@@ -136,15 +135,21 @@ class SimpleAiohttpChecker(CheckerBase):
logger.debug(e, exc_info=True)
return None, 0, CheckError("Unexpected", str(e))
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
from aiohttp_socks import ProxyConnector
# Use a real SSL context instead of ssl=False to avoid TLS fingerprinting
# blocks by Cloudflare and similar WAFs. Certificate verification is
# disabled to handle sites with invalid/expired certs.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
connector = (
ProxyConnector.from_url(self.proxy)
if self.proxy
else TCPConnector(ssl=False)
else TCPConnector(ssl=ssl_context)
)
connector.verify_ssl = False
async with ClientSession(
connector=connector,
@@ -189,7 +194,7 @@ class AiodnsDomainResolver(CheckerBase):
self.url = url
return None
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
status = 404
error = None
text = ''
@@ -207,6 +212,76 @@ class AiodnsDomainResolver(CheckerBase):
return text, status, error
try:
from curl_cffi.requests import AsyncSession as CurlCffiAsyncSession
CURL_CFFI_AVAILABLE = True
except ImportError:
CURL_CFFI_AVAILABLE = False
class CurlCffiChecker(CheckerBase):
"""Checker using curl_cffi to emulate browser TLS fingerprint and bypass WAF."""
def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
self.browser_emulate = kwargs.get('browser_emulate', 'chrome')
self.url = None
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None
async def close(self):
pass
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
try:
async with CurlCffiAsyncSession() as session:
kwargs = {
'url': self.url,
'headers': self.headers,
'allow_redirects': self.allow_redirects,
'timeout': self.timeout if self.timeout else 10,
'impersonate': self.browser_emulate,
}
if self.payload and self.method.lower() == 'post':
kwargs['json'] = self.payload
if self.method.lower() == 'post':
response = await session.post(**kwargs)
elif self.method.lower() == 'head':
response = await session.head(**kwargs)
else:
response = await session.get(**kwargs)
status_code = response.status_code
decoded_content = response.text
self.logger.debug(decoded_content)
error = CheckError("Connection lost") if status_code == 0 else None
return decoded_content, status_code, error
except asyncio.TimeoutError as e:
return None, 0, CheckError("Request timeout", str(e))
except KeyboardInterrupt:
return None, 0, CheckError("Interrupted")
except Exception as e:
self.logger.debug(e, exc_info=True)
return None, 0, CheckError("Unexpected", str(e))
class CheckerMock:
def __init__(self, *args, **kwargs):
pass
@@ -214,7 +289,7 @@ class CheckerMock:
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
return None
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
await asyncio.sleep(0)
return '', 0, None
@@ -260,7 +335,12 @@ def debug_response_logging(url, html_text, status_code, check_error):
def process_site_result(
response, query_notify, logger, results_info: QueryResultWrapper, site: MaigretSite
response,
query_notify,
logger,
results_info: QueryResultWrapper,
site: MaigretSite,
response_time: Optional[float] = None,
):
if not response:
return results_info
@@ -288,9 +368,6 @@ def process_site_result(
html_text, status_code, check_error = response
# TODO: add elapsed request time counting
response_time = None
if logger.level == logging.DEBUG:
debug_response_logging(url, html_text, status_code, check_error)
@@ -463,8 +540,18 @@ def make_site_result(
# workaround to prevent slash errors
url = re.sub("(?<!:)/+", "/", url)
# always clearweb_checker for now
checker = options["checkers"][site.protocol]
# Select checker: use curl_cffi for sites requiring TLS impersonation
needs_impersonation = 'tls_fingerprint' in site.protection
if needs_impersonation and CURL_CFFI_AVAILABLE:
checker = CurlCffiChecker(logger=logger, browser_emulate='chrome')
elif needs_impersonation and not CURL_CFFI_AVAILABLE:
logger.warning(
f"Site {site.name} requires TLS impersonation (curl_cffi) but it's not installed. "
"Install with: pip install curl_cffi"
)
checker = options["checkers"][site.protocol]
else:
checker = options["checkers"][site.protocol]
# site check is disabled
if site.disabled and not options['forced']:
@@ -583,7 +670,10 @@ async def check_site_for_username(
print(f"error, no checker for {site.name}")
return site.name, default_result
elapsed = 0.0
t0 = time.perf_counter()
response = await checker.check()
elapsed += time.perf_counter() - t0
html_text = response[0] if response and response[0] else ""
# Retry once after token-style activation (e.g. Twitter guest token refresh).
@@ -616,10 +706,13 @@ async def check_site_for_username(
method=checker.method,
payload=getattr(checker, 'payload', None),
)
t1 = time.perf_counter()
response = await checker.check()
elapsed += time.perf_counter() - t1
response_result = process_site_result(
response, query_notify, logger, default_result, site
response, query_notify, logger, default_result, site,
response_time=elapsed,
)
query_notify.update(response_result['status'], site.similar_search)
@@ -799,7 +892,7 @@ async def maigret(
with alive_bar(
len(tasks_dict), title="Searching", force_tty=True, disable=no_progressbar
) as progress:
async for result in executor.run(tasks_dict.values()):
async for result in executor.run(list(tasks_dict.values())): # type: ignore[arg-type]
cur_results.append(result)
progress()
@@ -875,135 +968,149 @@ async def site_self_check(
If False (default), only report issues without disabling.
diagnose: If True, print detailed diagnosis information.
"""
changes = {
changes: Dict[str, Any] = {
"disabled": False,
"issues": [],
"recommendations": [],
}
check_data = [
(site.username_claimed, MaigretCheckStatus.CLAIMED),
(site.username_unclaimed, MaigretCheckStatus.AVAILABLE),
]
try:
check_data = [
(site.username_claimed, MaigretCheckStatus.CLAIMED),
(site.username_unclaimed, MaigretCheckStatus.AVAILABLE),
]
logger.info(f"Checking {site.name}...")
logger.info(f"Checking {site.name}...")
results_cache = {}
results_cache = {}
for username, status in check_data:
async with semaphore:
results_dict = await maigret(
username=username,
site_dict={site.name: site},
logger=logger,
timeout=30,
id_type=site.type,
forced=True,
no_progressbar=True,
retries=1,
proxy=proxy,
tor_proxy=tor_proxy,
i2p_proxy=i2p_proxy,
cookies=cookies,
)
# don't disable entries with other ids types
# TODO: make normal checking
if site.name not in results_dict:
logger.info(results_dict)
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
if auto_disable:
changes["disabled"] = True
continue
logger.debug(results_dict)
result = results_dict[site.name]["status"]
results_cache[username] = results_dict[site.name]
if result.error and 'Cannot connect to host' in result.error.desc:
changes["issues"].append(f"Cannot connect to host")
if auto_disable:
changes["disabled"] = True
site_status = result.status
if site_status != status:
if site_status == MaigretCheckStatus.UNKNOWN:
msgs = site.absence_strs
etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
for username, status in check_data:
async with semaphore:
results_dict = await maigret(
username=username,
site_dict={site.name: site},
logger=logger,
timeout=30,
id_type=site.type,
forced=True,
no_progressbar=True,
retries=1,
proxy=proxy,
tor_proxy=tor_proxy,
i2p_proxy=i2p_proxy,
cookies=cookies,
)
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check
# e.g. many sites protected by cloudflare and available in general
if skip_errors:
pass
# don't disable in case of available username
elif status == MaigretCheckStatus.CLAIMED and auto_disable:
changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED:
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
else:
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
# don't disable entries with other ids types
# TODO: make normal checking
if site.name not in results_dict:
logger.info(results_dict)
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
if auto_disable:
changes["disabled"] = True
continue
logger.debug(results_dict)
result = results_dict[site.name]["status"]
results_cache[username] = results_dict[site.name]
if result.error and 'Cannot connect to host' in result.error.desc:
changes["issues"].append("Cannot connect to host")
if auto_disable:
changes["disabled"] = True
logger.info(f"Site {site.name} checking is finished")
site_status = result.status
# Generate recommendations based on issues
if changes["issues"] and len(results_cache) == 2:
claimed_result = results_cache.get(site.username_claimed, {})
unclaimed_result = results_cache.get(site.username_unclaimed, {})
if site_status != status:
if site_status == MaigretCheckStatus.UNKNOWN:
msgs = site.absence_strs
etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
)
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check
# e.g. many sites protected by cloudflare and available in general
if skip_errors:
pass
# don't disable in case of available username
elif status == MaigretCheckStatus.CLAIMED and auto_disable:
changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED:
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
else:
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
claimed_http = claimed_result.get("http_status")
unclaimed_http = unclaimed_result.get("http_status")
logger.info(f"Site {site.name} checking is finished")
if claimed_http and unclaimed_http:
if claimed_http != unclaimed_http and site.check_type != "status_code":
changes["recommendations"].append(
f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
)
# Generate recommendations based on issues
if changes["issues"] and len(results_cache) == 2:
claimed_result = results_cache.get(site.username_claimed, {})
unclaimed_result = results_cache.get(site.username_unclaimed, {})
# Print diagnosis if requested
if diagnose and changes["issues"]:
print(f"\n--- {site.name} DIAGNOSIS ---")
print(f" Check type: {site.check_type}")
print(f" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(f" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
claimed_http = claimed_result.get("http_status")
unclaimed_http = unclaimed_result.get("http_status")
# Only modify site if auto_disable is enabled
if auto_disable and changes["disabled"] != site.disabled:
site.disabled = changes["disabled"]
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
db.update_site(site)
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
if claimed_http and unclaimed_http:
if claimed_http != unclaimed_http and site.check_type != "status_code":
changes["recommendations"].append(
f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
)
# remove service tag "unchecked"
if "unchecked" in site.tags:
site.tags.remove("unchecked")
db.update_site(site)
# Print diagnosis if requested
if diagnose and changes["issues"]:
print(f"\n--- {site.name} DIAGNOSIS ---")
print(f" Check type: {site.check_type}")
print(" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
# Only modify site if auto_disable is enabled
if auto_disable and changes["disabled"] != site.disabled:
site.disabled = changes["disabled"]
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
db.update_site(site)
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
# remove service tag "unchecked"
if "unchecked" in site.tags:
site.tags.remove("unchecked")
db.update_site(site)
except Exception as e:
logger.warning(
f"Self-check of {site.name} failed with unexpected error: {e}",
exc_info=True,
)
changes["issues"].append(f"Unexpected error: {e}")
if auto_disable and not site.disabled:
changes["disabled"] = True
site.disabled = True
db.update_site(site)
if not silent:
print(f"Disabled site {site.name} (unexpected error)...")
return changes
@@ -1019,6 +1126,7 @@ async def self_check(
i2p_proxy=None,
auto_disable=False,
diagnose=False,
no_progressbar=False,
) -> dict:
"""
Run self-check on sites.
@@ -1053,9 +1161,20 @@ async def self_check(
tasks.append((site.name, future))
if tasks:
with alive_bar(len(tasks), title='Self-checking', force_tty=True) as progress:
with alive_bar(len(tasks), title='Self-checking', force_tty=True, disable=no_progressbar) as progress:
for site_name, f in tasks:
result = await f
try:
result = await f
except Exception as e:
logger.warning(
f"Self-check task for {site_name} raised unexpected error: {e}",
exc_info=True,
)
result = {
"disabled": False,
"issues": [f"Unexpected error: {e}"],
"recommendations": [],
}
result['site_name'] = site_name
all_results.append(result)
progress() # Update the progress bar
@@ -1091,10 +1210,6 @@ async def self_check(
needs_update = total_disabled != 0 or unchecked_new_count != unchecked_old_count
# For backwards compatibility, return bool if auto_disable is True
if auto_disable:
return needs_update
return {
'needs_update': needs_update,
'results': all_results,
@@ -1118,7 +1233,7 @@ def parse_usernames(extracted_ids_data, logger) -> Dict:
elif "usernames" in k:
try:
tree = ast.literal_eval(v)
if type(tree) == list:
if isinstance(tree, list):
for n in tree:
new_usernames[n] = "username"
except Exception as e:
+342
View File
@@ -0,0 +1,342 @@
"""
Database auto-update logic for maigret.
Checks a lightweight meta file to determine if a newer site database is available,
downloads it if compatible, and caches it locally in ~/.maigret/.
"""
import hashlib
import json
import logging
import os
import os.path as path
import tempfile
from datetime import datetime, timezone
from typing import Optional
import requests
from colorama import Fore, Style
from .__version__ import __version__
logger = logging.getLogger("maigret")
_use_color = True
def _print_info(msg: str) -> None:
text = f"[*] {msg}"
if _use_color:
print(Style.BRIGHT + Fore.GREEN + text + Style.RESET_ALL)
else:
print(text)
def _print_success(msg: str) -> None:
text = f"[+] {msg}"
if _use_color:
print(Style.BRIGHT + Fore.GREEN + text + Style.RESET_ALL)
else:
print(text)
def _print_warning(msg: str) -> None:
text = f"[!] {msg}"
if _use_color:
print(Style.BRIGHT + Fore.YELLOW + text + Style.RESET_ALL)
else:
print(text)
DEFAULT_META_URL = (
"https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/db_meta.json"
)
DEFAULT_CHECK_INTERVAL_HOURS = 24
MAIGRET_HOME = path.expanduser("~/.maigret")
CACHED_DB_PATH = path.join(MAIGRET_HOME, "data.json")
STATE_PATH = path.join(MAIGRET_HOME, "autoupdate_state.json")
BUNDLED_DB_PATH = path.join(path.dirname(path.realpath(__file__)), "resources", "data.json")
def _parse_version(version_str: str) -> tuple:
"""Parse a version string like '0.5.0' into a comparable tuple (0, 5, 0)."""
try:
return tuple(int(x) for x in version_str.strip().split("."))
except (ValueError, AttributeError):
return (0, 0, 0)
def _ensure_maigret_home() -> None:
os.makedirs(MAIGRET_HOME, exist_ok=True)
def _load_state() -> dict:
try:
with open(STATE_PATH, "r", encoding="utf-8") as f:
return json.load(f)
except (FileNotFoundError, json.JSONDecodeError, OSError):
return {}
def _save_state(state: dict) -> None:
_ensure_maigret_home()
tmp_path = STATE_PATH + ".tmp"
try:
with open(tmp_path, "w", encoding="utf-8") as f:
json.dump(state, f, indent=2, ensure_ascii=False)
os.replace(tmp_path, STATE_PATH)
except OSError:
try:
os.unlink(tmp_path)
except OSError:
pass
def _needs_check(state: dict, interval_hours: int) -> bool:
last_check = state.get("last_check_at")
if not last_check:
return True
try:
last_dt = datetime.fromisoformat(last_check.replace("Z", "+00:00"))
elapsed = (datetime.now(timezone.utc) - last_dt).total_seconds() / 3600
return elapsed >= interval_hours
except (ValueError, TypeError):
return True
def _fetch_meta(meta_url: str, timeout: int = 10) -> Optional[dict]:
try:
response = requests.get(meta_url, timeout=timeout)
if response.status_code == 200:
return response.json()
except Exception:
pass
return None
def _is_version_compatible(meta: dict) -> bool:
min_ver = meta.get("min_maigret_version", "0.0.0")
return _parse_version(__version__) >= _parse_version(min_ver)
def _is_update_available(meta: dict, state: dict) -> bool:
if not path.isfile(CACHED_DB_PATH):
return True
remote_date = meta.get("updated_at", "")
cached_date = state.get("last_meta", {}).get("updated_at", "")
return remote_date > cached_date
def _download_and_verify(data_url: str, expected_sha256: str, timeout: int = 60) -> Optional[str]:
_ensure_maigret_home()
tmp_fd, tmp_path = tempfile.mkstemp(dir=MAIGRET_HOME, suffix=".json")
try:
response = requests.get(data_url, timeout=timeout)
if response.status_code != 200:
return None
content = response.content
actual_sha256 = hashlib.sha256(content).hexdigest()
if actual_sha256 != expected_sha256:
_print_warning("DB auto-update: SHA-256 mismatch, download rejected")
return None
# Validate JSON structure
data = json.loads(content)
if not all(k in data for k in ("sites", "engines", "tags")):
_print_warning("DB auto-update: invalid database structure")
return None
os.write(tmp_fd, content)
os.close(tmp_fd)
tmp_fd = None
os.replace(tmp_path, CACHED_DB_PATH)
return CACHED_DB_PATH
except Exception:
return None
finally:
if tmp_fd is not None:
os.close(tmp_fd)
try:
os.unlink(tmp_path)
except OSError:
pass
def _best_local() -> str:
"""Return cached DB if it exists and is valid, otherwise bundled."""
if path.isfile(CACHED_DB_PATH):
try:
with open(CACHED_DB_PATH, "r", encoding="utf-8") as f:
data = json.load(f)
if "sites" in data:
return CACHED_DB_PATH
except (json.JSONDecodeError, OSError):
pass
return BUNDLED_DB_PATH
def _now_iso() -> str:
return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def resolve_db_path(
db_file_arg: str,
no_autoupdate: bool = False,
meta_url: str = DEFAULT_META_URL,
check_interval_hours: int = DEFAULT_CHECK_INTERVAL_HOURS,
color: bool = True,
) -> str:
"""
Determine which database file to use, potentially downloading an update.
Returns the path to the database file that should be loaded.
"""
global _use_color
_use_color = color
default_db_name = "resources/data.json"
# User specified a custom DB — skip auto-update
is_url = db_file_arg.startswith("http://") or db_file_arg.startswith("https://")
is_default = db_file_arg == default_db_name
if is_url:
return db_file_arg
if not is_default:
# Try the path as-is (absolute or relative to cwd) first.
if path.isfile(db_file_arg):
return path.abspath(db_file_arg)
# Fall back to legacy behavior: resolve relative to the maigret module dir.
module_relative = path.join(path.dirname(path.realpath(__file__)), db_file_arg)
if module_relative != db_file_arg and path.isfile(module_relative):
return module_relative
if module_relative != db_file_arg:
raise FileNotFoundError(
f"Custom database file not found: {db_file_arg!r} "
f"(also tried {module_relative!r})"
)
raise FileNotFoundError(f"Custom database file not found: {db_file_arg!r}")
# Auto-update disabled
if no_autoupdate:
return _best_local()
# Check interval
_ensure_maigret_home()
state = _load_state()
if not _needs_check(state, check_interval_hours):
return _best_local()
# Time to check
_print_info("DB auto-update: checking for updates...")
meta = _fetch_meta(meta_url)
if meta is None:
_print_warning("DB auto-update: could not reach update server, using local database")
state["last_check_at"] = _now_iso()
_save_state(state)
return _best_local()
# Version compatibility
if not _is_version_compatible(meta):
min_ver = meta.get("min_maigret_version", "?")
_print_warning(
f"DB auto-update: latest database requires maigret >= {min_ver}, "
f"you have {__version__}. Please upgrade with: pip install -U maigret"
)
state["last_check_at"] = _now_iso()
_save_state(state)
return _best_local()
# Check if update available
if not _is_update_available(meta, state):
sites_count = meta.get("sites_count", "?")
_print_info(f"DB auto-update: database is up to date ({sites_count} sites)")
state["last_check_at"] = _now_iso()
state["last_meta"] = meta
_save_state(state)
return _best_local()
# Download update
new_count = meta.get("sites_count", "?")
old_count = state.get("last_meta", {}).get("sites_count")
if old_count:
_print_info(f"DB auto-update: downloading updated database ({new_count} sites, was {old_count})...")
else:
_print_info(f"DB auto-update: downloading database ({new_count} sites)...")
data_url = meta.get("data_url", "")
expected_sha = meta.get("data_sha256", "")
result = _download_and_verify(data_url, expected_sha)
if result is None:
_print_warning("DB auto-update: download failed, using local database")
state["last_check_at"] = _now_iso()
_save_state(state)
return _best_local()
_print_success(f"DB auto-update: database updated successfully ({new_count} sites)")
state["last_check_at"] = _now_iso()
state["last_meta"] = meta
state["cached_db_sha256"] = expected_sha
_save_state(state)
return CACHED_DB_PATH
def force_update(
meta_url: str = DEFAULT_META_URL,
color: bool = True,
) -> bool:
"""
Force check for database updates and download if available.
Returns True if database was updated, False otherwise.
"""
global _use_color
_use_color = color
_ensure_maigret_home()
_print_info("DB update: checking for updates...")
meta = _fetch_meta(meta_url)
if meta is None:
_print_warning("DB update: could not reach update server")
return False
if not _is_version_compatible(meta):
min_ver = meta.get("min_maigret_version", "?")
_print_warning(
f"DB update: latest database requires maigret >= {min_ver}, "
f"you have {__version__}. Please upgrade with: pip install -U maigret"
)
return False
state = _load_state()
new_count = meta.get("sites_count", "?")
old_count = state.get("last_meta", {}).get("sites_count")
if not _is_update_available(meta, state):
_print_info(f"DB update: database is already up to date ({new_count} sites)")
state["last_check_at"] = _now_iso()
state["last_meta"] = meta
_save_state(state)
return False
if old_count:
_print_info(f"DB update: downloading updated database ({new_count} sites, was {old_count})...")
else:
_print_info(f"DB update: downloading database ({new_count} sites)...")
data_url = meta.get("data_url", "")
expected_sha = meta.get("data_sha256", "")
result = _download_and_verify(data_url, expected_sha)
if result is None:
_print_warning("DB update: download failed")
return False
_print_success(f"DB update: database updated successfully ({new_count} sites)")
state["last_check_at"] = _now_iso()
state["last_meta"] = meta
state["cached_db_sha256"] = expected_sha
_save_state(state)
return True
+2
View File
@@ -58,6 +58,8 @@ COMMON_ERRORS = {
'Censorship', 'MGTS'
),
'Incapsula incident ID': CheckError('Bot protection', 'Incapsula'),
'<title>Client Challenge</title>': CheckError('Bot protection', 'Anti-bot challenge'),
'<title>DDoS-Guard</title>': CheckError('Bot protection', 'DDoS-Guard'),
'Сайт заблокирован хостинг-провайдером': CheckError(
'Site-specific', 'Site is disabled (Beget)'
),
+7 -6
View File
@@ -1,4 +1,5 @@
import asyncio
import inspect
import sys
import time
from typing import Any, Iterable, List, Callable
@@ -103,7 +104,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.workers_count = kwargs.get('in_parallel', 10)
self.queue = asyncio.Queue(self.workers_count)
self.queue: asyncio.Queue = asyncio.Queue(self.workers_count)
self.timeout = kwargs.get('timeout')
# Pass a progress function; alive_bar by default
self.progress_func = kwargs.get('progress_func', alive_bar)
@@ -113,7 +114,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
async def increment_progress(self, count):
"""Update progress by calling the provided progress function."""
if self.progress:
if asyncio.iscoroutinefunction(self.progress):
if inspect.iscoroutinefunction(self.progress):
await self.progress(count)
else:
self.progress(count)
@@ -124,7 +125,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
"""Stop the progress tracking."""
if hasattr(self.progress, "close") and self.progress:
close_func = self.progress.close
if asyncio.iscoroutinefunction(close_func):
if inspect.iscoroutinefunction(close_func):
await close_func()
else:
close_func()
@@ -184,10 +185,10 @@ class AsyncioQueueGeneratorExecutor:
# Deprecated: will be removed soon, don't use it
def __init__(self, *args, **kwargs):
self.workers_count = kwargs.get('in_parallel', 10)
self.queue = asyncio.Queue()
self.queue: asyncio.Queue = asyncio.Queue()
self.timeout = kwargs.get('timeout')
self.logger = kwargs['logger']
self._results = asyncio.Queue()
self._results: asyncio.Queue = asyncio.Queue()
self._stop_signal = object()
async def worker(self):
@@ -209,7 +210,7 @@ class AsyncioQueueGeneratorExecutor:
result = kwargs.get('default')
await self._results.put(result)
except Exception as e:
self.logger.error(f"Error in worker: {e}")
self.logger.error(f"Error in worker: {e}", exc_info=True)
finally:
self.queue.task_done()
+79 -12
View File
@@ -13,7 +13,7 @@ from argparse import ArgumentParser, RawDescriptionHelpFormatter
from typing import List, Tuple
import os.path as path
from socid_extractor import extract, parse
from socid_extractor import extract, parse # type: ignore[import-not-found]
from .__version__ import __version__
from .checking import (
@@ -37,6 +37,7 @@ from .report import (
get_plaintext_report,
sort_report_by_data_points,
save_graph_report,
save_markdown_report,
)
from .sites import MaigretDatabase
from .submit import Submitter
@@ -75,7 +76,7 @@ def extract_ids_from_page(url, logger, timeout=5) -> dict:
elif 'usernames' in k:
try:
tree = ast.literal_eval(v)
if type(tree) == list:
if isinstance(tree, list):
for n in tree:
results[n] = 'username'
except Exception as e:
@@ -201,6 +202,20 @@ def setup_arguments_parser(settings: Settings):
default=settings.sites_db_path,
help="Load Maigret database from a JSON file or HTTP web resource.",
)
parser.add_argument(
"--no-autoupdate",
action="store_true",
dest="no_autoupdate",
default=settings.no_autoupdate,
help="Disable automatic database updates on startup.",
)
parser.add_argument(
"--force-update",
action="store_true",
dest="force_update",
default=False,
help="Force check for database updates and download if available.",
)
parser.add_argument(
"--cookies-jar-file",
metavar="COOKIE_FILE",
@@ -451,6 +466,14 @@ def setup_arguments_parser(settings: Settings):
default=settings.pdf_report,
help="Generate a PDF report (general report on all usernames).",
)
report_group.add_argument(
"-M",
"--md",
action="store_true",
dest="md",
default=settings.md_report,
help="Generate a Markdown report (general report on all usernames).",
)
report_group.add_argument(
"-G",
"--graph",
@@ -543,9 +566,25 @@ async def main():
else:
args.exclude_tags = []
db_file = args.db_file \
if (args.db_file.startswith("http://") or args.db_file.startswith("https://")) \
else path.join(path.dirname(path.realpath(__file__)), args.db_file)
from .db_updater import resolve_db_path, force_update, BUNDLED_DB_PATH
if args.force_update:
force_update(
meta_url=settings.db_update_meta_url,
color=not args.no_color,
)
try:
db_file = resolve_db_path(
db_file_arg=args.db_file,
no_autoupdate=args.no_autoupdate or args.force_update,
meta_url=settings.db_update_meta_url,
check_interval_hours=settings.autoupdate_check_interval_hours,
color=not args.no_color,
)
except FileNotFoundError as e:
logger.error(str(e))
sys.exit(2)
if args.top_sites == 0 or args.all_sites:
args.top_sites = sys.maxsize
@@ -560,7 +599,21 @@ async def main():
)
# Create object with all information about sites we are aware of.
db = MaigretDatabase().load_from_path(db_file)
try:
db = MaigretDatabase().load_from_path(db_file)
query_notify.success(f'Using sites database: {db_file} ({len(db.sites)} sites)')
except Exception as e:
logger.warning(f"Failed to load database from {db_file}: {e}")
if db_file != BUNDLED_DB_PATH:
query_notify.warning(
f'Falling back to bundled database: {BUNDLED_DB_PATH}'
)
db = MaigretDatabase().load_from_path(BUNDLED_DB_PATH)
query_notify.success(
f'Using sites database: {BUNDLED_DB_PATH} ({len(db.sites)} sites)'
)
else:
raise
get_top_sites_for_id = lambda x: db.ranked_sites_dict(
top=args.top_sites,
tags=args.tags,
@@ -600,13 +653,10 @@ async def main():
i2p_proxy=args.i2p_proxy,
auto_disable=args.auto_disable,
diagnose=args.diagnose,
no_progressbar=args.no_progressbar,
)
# Handle both old (bool) and new (dict) return types
if isinstance(check_result, dict):
is_need_update = check_result.get('needs_update', False)
else:
is_need_update = check_result
is_need_update = check_result.get('needs_update', False)
if is_need_update:
if input('Do you want to save changes permanently? [Yn]\n').lower() in (
@@ -772,7 +822,7 @@ async def main():
# reporting for all the result
if general_results:
if args.html or args.pdf:
if args.html or args.pdf or args.md:
query_notify.warning('Generating report info...')
report_context = generate_report_context(general_results)
# determine main username
@@ -792,6 +842,23 @@ async def main():
save_pdf_report(filename, report_context)
query_notify.warning(f'PDF report on all usernames saved in {filename}')
if args.md:
username = username.replace('/', '_')
filename = report_filepath_tpl.format(username=username, postfix='.md')
run_flags = []
if args.tags:
run_flags.append(f"--tags {args.tags}")
if args.site_list:
run_flags.append(f"--site {','.join(args.site_list)}")
if args.all_sites:
run_flags.append("--all-sites")
run_info = {
"sites_count": sum(len(d) for _, _, d in general_results),
"flags": " ".join(run_flags) if run_flags else None,
}
save_markdown_report(filename, report_context, run_info=run_info)
query_notify.warning(f'Markdown report on all usernames saved in {filename}')
if args.graph:
username = username.replace('/', '_')
filename = report_filepath_tpl.format(
+3 -4
View File
@@ -1,7 +1,6 @@
"""Sherlock Notify Module
"""Console and query notification helpers.
This module defines the objects for notifying the caller about the
results of queries.
This module defines objects for notifying the caller about the results of queries.
"""
import sys
@@ -174,7 +173,7 @@ class QueryNotifyPrint(QueryNotify):
else:
return self.make_simple_terminal_notify(*args)
def start(self, message, id_type):
def start(self, message=None, id_type="username"):
"""Notify Start.
Will print the title to the standard output.
+151 -12
View File
@@ -7,7 +7,7 @@ import os
from datetime import datetime
from typing import Dict, Any
import xmind
import xmind # type: ignore[import-untyped]
from dateutil.tz import gettz
from dateutil.parser import parse as parse_datetime_str
from jinja2 import Template
@@ -79,7 +79,7 @@ def save_pdf_report(filename: str, context: dict):
filled_template = template.render(**context)
# moved here to speed up the launch of Maigret
from xhtml2pdf import pisa
from xhtml2pdf import pisa # type: ignore[import-untyped]
with open(filename, "w+b") as f:
pisa.pisaDocument(io.StringIO(filled_template), dest=f, default_css=css)
@@ -91,9 +91,9 @@ def save_json_report(filename: str, username: str, results: dict, report_type: s
class MaigretGraph:
other_params = {'size': 10, 'group': 3}
site_params = {'size': 15, 'group': 2}
username_params = {'size': 20, 'group': 1}
other_params: dict = {'size': 10, 'group': 3}
site_params: dict = {'size': 15, 'group': 2}
username_params: dict = {'size': 20, 'group': 1}
def __init__(self, graph):
self.G = graph
@@ -121,12 +121,12 @@ class MaigretGraph:
def save_graph_report(filename: str, username_results: list, db: MaigretDatabase):
import networkx as nx
G = nx.Graph()
G: Any = nx.Graph()
graph = MaigretGraph(G)
base_site_nodes = {}
site_account_nodes = {}
processed_values = {} # Track processed values to avoid duplicates
processed_values: Dict[str, Any] = {} # Track processed values to avoid duplicates
for username, id_type, results in username_results:
# Add username node, using normalized version directly if different
@@ -239,7 +239,7 @@ def save_graph_report(filename: str, username_results: list, db: MaigretDatabase
G.remove_nodes_from(single_degree_sites)
# Generate interactive visualization
from pyvis.network import Network
from pyvis.network import Network # type: ignore[import-untyped]
nt = Network(notebook=True, height="750px", width="100%")
nt.from_nx(G)
@@ -257,6 +257,144 @@ def get_plaintext_report(context: dict) -> str:
return output.strip()
def _md_format_value(value) -> str:
"""Format a value for Markdown output, detecting links."""
if isinstance(value, list):
return ", ".join(str(v) for v in value)
s = str(value)
if s.startswith("http://") or s.startswith("https://"):
return f"[{s}]({s})"
return s
def save_markdown_report(filename: str, context: dict, run_info: dict = None):
username = context.get("username", "unknown")
generated_at = context.get("generated_at", "")
brief = context.get("brief", "")
countries = context.get("countries_tuple_list", [])
interests = context.get("interests_tuple_list", [])
first_seen = context.get("first_seen")
results = context.get("results", [])
# Collect ALL values for key fields across all accounts
all_fields: Dict[str, list] = {}
last_seen = None
for _, _, data in results:
for _, v in data.items():
if not v.get("found") or v.get("is_similar"):
continue
ids_data = v.get("ids_data", {})
# Map multiple source fields to unified output fields
field_sources = {
"fullname": ("fullname", "name"),
"location": ("location", "country", "city", "country_code", "locale", "region"),
"gender": ("gender",),
"bio": ("bio", "about", "description"),
}
for out_field, source_keys in field_sources.items():
for src in source_keys:
val = ids_data.get(src)
if val:
all_fields.setdefault(out_field, [])
val_str = str(val)
if val_str not in all_fields[out_field]:
all_fields[out_field].append(val_str)
# Track last_seen
for ts_field in ("last_online", "latest_activity_at", "updated_at"):
ts = ids_data.get(ts_field)
if ts and (last_seen is None or str(ts) > str(last_seen)):
last_seen = ts
lines = []
lines.append(f"# Report by searching on username \"{username}\"\n")
# Generated line with run info
gen_line = f"Generated at {generated_at} by [Maigret](https://github.com/soxoj/maigret)"
if run_info:
parts = []
if run_info.get("sites_count"):
parts.append(f"{run_info['sites_count']} sites checked")
if run_info.get("flags"):
parts.append(f"flags: `{run_info['flags']}`")
if parts:
gen_line += f" ({', '.join(parts)})"
lines.append(f"{gen_line}\n")
# Summary
lines.append("## Summary\n")
lines.append(f"{brief}\n")
if all_fields:
lines.append("**Information extracted from accounts:**\n")
for field, values in all_fields.items():
title = CaseConverter.snake_to_title(field)
lines.append(f"- {title}: {'; '.join(values)}")
lines.append("")
if countries:
geo = ", ".join(f"{code} (x{count})" for code, count in countries)
lines.append(f"**Country tags:** {geo}\n")
if interests:
tags = ", ".join(f"{tag} (x{count})" for tag, count in interests)
lines.append(f"**Website tags:** {tags}\n")
if first_seen:
lines.append(f"**First seen:** {first_seen}")
if last_seen:
lines.append(f"**Last seen:** {last_seen}")
if first_seen or last_seen:
lines.append("")
# Accounts found
lines.append("## Accounts found\n")
for u, id_type, data in results:
for site_name, v in data.items():
if not v.get("found") or v.get("is_similar"):
continue
lines.append(f"### {site_name}\n")
lines.append(f"- **URL:** [{v.get('url_user', '')}]({v.get('url_user', '')})")
tags = v.get("status") and v["status"].tags or []
if tags:
lines.append(f"- **Tags:** {', '.join(tags)}")
lines.append("")
ids_data = v.get("ids_data", {})
if ids_data:
for field, value in ids_data.items():
if field == "image":
continue
title = CaseConverter.snake_to_title(field)
lines.append(f"- {title}: {_md_format_value(value)}")
lines.append("")
# Possible false positives
lines.append("## Possible false positives\n")
lines.append(
f"This report was generated by searching for accounts matching the username `{username}`. "
f"Accounts listed above may belong to different people who happen to use the same "
f"or similar username. Results without extracted personal information could contain "
f"some false positive findings. Always verify findings before drawing conclusions.\n"
)
# Ethical use
lines.append("## Ethical use\n")
lines.append(
"This report is a result of a technical collection of publicly available information "
"from online accounts and does not constitute personal data processing. If you intend "
"to use this data for personal data processing or collection purposes, ensure your use "
"complies with applicable laws and regulations in your jurisdiction (such as GDPR, "
"CCPA, and similar).\n"
)
with open(filename, "w", encoding="utf-8") as f:
f.write("\n".join(lines))
"""
REPORTS GENERATING
"""
@@ -353,11 +491,12 @@ def generate_report_context(username_results: list):
if k in ["country", "locale"]:
try:
if is_country_tag(k):
tag = pycountry.countries.get(alpha_2=v).alpha_2.lower()
country = pycountry.countries.get(alpha_2=v)
tag = country.alpha_2.lower() # type: ignore[union-attr]
else:
tag = pycountry.countries.search_fuzzy(v)[
0
].alpha_2.lower()
].alpha_2.lower() # type: ignore[attr-defined]
# TODO: move countries to another struct
tags[tag] = tags.get(tag, 0) + 1
except Exception as e:
@@ -513,8 +652,8 @@ def add_xmind_subtopic(userlink, k, v, supposed_data):
def design_xmind_sheet(sheet, username, results):
alltags = {}
supposed_data = {}
alltags: Dict[str, Any] = {}
supposed_data: Dict[str, Any] = {}
sheet.setTitle("%s Analysis" % (username))
root_topic1 = sheet.getRootTopic()
+2460 -3349
View File
File diff suppressed because it is too large Load Diff
+8
View File
@@ -0,0 +1,8 @@
{
"version": 1,
"updated_at": "2026-04-21T00:02:26Z",
"sites_count": 3141,
"min_maigret_version": "0.6.0",
"data_sha256": "d93fb2d051328b60126c98fbf02841a6974549f0c8c9220a207a9172b3ee0c90",
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
}
+5 -1
View File
@@ -54,5 +54,9 @@
"graph_report": false,
"pdf_report": false,
"html_report": false,
"web_interface_port": 5000
"md_report": false,
"web_interface_port": 5000,
"no_autoupdate": false,
"db_update_meta_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/db_meta.json",
"autoupdate_check_interval_hours": 24
}
+4
View File
@@ -42,7 +42,11 @@ class Settings:
pdf_report: bool
html_report: bool
graph_report: bool
md_report: bool
web_interface_port: int
no_autoupdate: bool
db_update_meta_url: str
autoupdate_check_interval_hours: int
# submit mode settings
presence_strings: list
+34 -7
View File
@@ -92,10 +92,12 @@ class MaigretSite:
# Alexa traffic rank
alexa_rank = None
# Source (in case a site is a mirror of another site)
source = None
source: Optional[str] = None
# URL protocol (http/https)
protocol = ''
# Protection types detected on this site (e.g. ["tls_fingerprint", "ddos_guard"])
protection: List[str] = []
def __init__(self, name, information):
self.name = name
@@ -173,7 +175,7 @@ class MaigretSite:
self.__dict__[CaseConverter.camel_to_snake(group)],
)
self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check)
self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check or "")
def detect_username(self, url: str) -> Optional[str]:
if self.url_regexp:
@@ -462,9 +464,9 @@ class MaigretDatabase:
"tags": self._tags,
}
json_data = json.dumps(db_data, indent=4)
json_data = json.dumps(db_data, indent=4, ensure_ascii=False)
with open(filename, "w") as f:
with open(filename, "w", encoding="utf-8") as f:
f.write(json_data)
return self
@@ -564,7 +566,7 @@ class MaigretDatabase:
def get_scan_stats(self, sites_dict):
sites = sites_dict or self.sites_dict
found_flags = {}
found_flags: Dict[str, int] = {}
for _, s in sites.items():
if "presense_flag" in s.stats:
flag = s.stats["presense_flag"]
@@ -585,8 +587,10 @@ class MaigretDatabase:
def get_db_stats(self, is_markdown=False):
# Initialize counters
sites_dict = self.sites_dict
urls = {}
tags = {}
urls: Dict[str, int] = {}
tags: Dict[str, int] = {}
engine_total: Dict[str, int] = {}
engine_enabled: Dict[str, int] = {}
disabled_count = 0
message_checks_one_factor = 0
status_checks = 0
@@ -609,6 +613,14 @@ class MaigretDatabase:
elif site.check_type == 'status_code':
status_checks += 1
# Count engines
if site.engine:
engine_total[site.engine] = engine_total.get(site.engine, 0) + 1
if not site.disabled:
engine_enabled[site.engine] = (
engine_enabled.get(site.engine, 0) + 1
)
# Count tags
if not site.tags:
tags["NO_TAGS"] = tags.get("NO_TAGS", 0) + 1
@@ -645,11 +657,26 @@ class MaigretDatabase:
f"Sites with probing: {', '.join(sorted(site_with_probing))}",
f"Sites with activation: {', '.join(sorted(site_with_activation))}",
self._format_top_items("profile URLs", urls, 20, is_markdown),
self._format_engine_stats(engine_total, engine_enabled, is_markdown),
self._format_top_items("tags", tags, 20, is_markdown, self._tags),
]
return separator.join(output)
def _format_engine_stats(self, engine_total, engine_enabled, is_markdown):
"""Format per-engine enabled/total counts, sorted by total descending."""
output = "Sites by engine:\n"
for engine, total in sorted(
engine_total.items(), key=lambda x: x[1], reverse=True
):
enabled = engine_enabled.get(engine, 0)
perc = round(100 * enabled / total, 1) if total else 0.0
if is_markdown:
output += f"- `{engine}`: {enabled}/{total} ({perc}%)\n"
else:
output += f"{enabled}/{total} ({perc}%)\t{engine}\n"
return output
def _format_top_items(
self, title, items_dict, limit, is_markdown, valid_items=None
):
+27 -24
View File
@@ -6,8 +6,7 @@ import logging
from typing import Any, Dict, List, Optional, Tuple
from aiohttp import ClientSession, TCPConnector
from aiohttp_socks import ProxyConnector
import cloudscraper
import cloudscraper # type: ignore[import-untyped]
from colorama import Fore, Style
from .activation import import_aiohttp_cookies
@@ -68,8 +67,10 @@ class Submitter:
else:
cookie_jar = import_aiohttp_cookies(args.cookie_file)
connector = ProxyConnector.from_url(proxy) if proxy else TCPConnector(ssl=False)
connector.verify_ssl = False
ssl_context = __import__('ssl').create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = __import__('ssl').CERT_NONE
connector = ProxyConnector.from_url(proxy) if proxy else TCPConnector(ssl=ssl_context)
self.session = ClientSession(
connector=connector, trust_env=True, cookie_jar=cookie_jar
)
@@ -88,7 +89,9 @@ class Submitter:
alexa_rank = 0
try:
alexa_rank = int(root.find('.//REACH').attrib['RANK'])
reach_elem = root.find('.//REACH')
if reach_elem is not None:
alexa_rank = int(reach_elem.attrib['RANK'])
except Exception:
pass
@@ -127,7 +130,7 @@ class Submitter:
async def detect_known_engine(
self, url_exists, url_mainpage, session, follow_redirects, headers
) -> [List[MaigretSite], str]:
) -> Tuple[List[MaigretSite], str]:
session = session or self.session
resp_text, _ = await self.get_html_response_to_compare(
@@ -191,8 +194,9 @@ class Submitter:
# TODO: replace with checking.py/SimpleAiohttpChecker call
@staticmethod
async def get_html_response_to_compare(
url: str, session: ClientSession = None, redirects=False, headers: Dict = None
url: str, session: Optional[ClientSession] = None, redirects=False, headers: Optional[Dict] = None
):
assert session is not None, "session must not be None"
async with session.get(
url, allow_redirects=redirects, headers=headers
) as response:
@@ -211,10 +215,10 @@ class Submitter:
username: str,
url_exists: str,
cookie_filename="", # TODO: use cookies
session: ClientSession = None,
session: Optional[ClientSession] = None,
follow_redirects=False,
headers: dict = None,
) -> Tuple[List[str], List[str], str, str]:
headers: Optional[dict] = None,
) -> Tuple[Optional[List[str]], Optional[List[str]], str, str]:
random_username = generate_random_username()
url_of_non_existing_account = url_exists.lower().replace(
@@ -269,11 +273,8 @@ class Submitter:
tokens_a = set(re.split(f'[{self.SEPARATORS}]', first_html_response))
tokens_b = set(re.split(f'[{self.SEPARATORS}]', second_html_response))
a_minus_b = tokens_a.difference(tokens_b)
b_minus_a = tokens_b.difference(tokens_a)
a_minus_b = list(map(lambda x: x.strip('\\'), a_minus_b))
b_minus_a = list(map(lambda x: x.strip('\\'), b_minus_a))
a_minus_b: List[str] = [x.strip('\\') for x in tokens_a.difference(tokens_b)]
b_minus_a: List[str] = [x.strip('\\') for x in tokens_b.difference(tokens_a)]
# Filter out strings containing usernames
a_minus_b = [s for s in a_minus_b if username.lower() not in s.lower()]
@@ -378,7 +379,7 @@ class Submitter:
).strip()
if field in ['tags', 'presense_strs', 'absence_strs']:
new_value = list(map(str.strip, new_value.split(',')))
new_value = list(map(str.strip, new_value.split(','))) # type: ignore[assignment]
if new_value:
setattr(site, field, new_value)
@@ -424,12 +425,12 @@ class Submitter:
f"{Fore.YELLOW}[!] Sites with domain \"{domain_raw}\" already exists in the Maigret database!{Style.RESET_ALL}"
)
status = lambda s: "(disabled)" if s.disabled else ""
site_status = lambda s: "(disabled)" if s.disabled else ""
url_block = lambda s: f"\n\t{s.url_main}\n\t{s.url}"
print(
"\n".join(
[
f"{site.name} {status(site)}{url_block(site)}"
f"{site.name} {site_status(site)}{url_block(site)}"
for site in matched_sites
]
)
@@ -497,7 +498,7 @@ class Submitter:
)
print('Detecting site engine, please wait...')
sites = []
sites: List[MaigretSite] = []
text = None
try:
sites, text = await self.detect_known_engine(
@@ -510,7 +511,7 @@ class Submitter:
except KeyboardInterrupt:
print('Engine detect process is interrupted.')
if 'cloudflare' in text.lower():
if text and 'cloudflare' in text.lower():
print(
'Cloudflare protection detected. I will use cloudscraper for further work'
)
@@ -573,6 +574,8 @@ class Submitter:
found = True
break
assert chosen_site is not None, "No sites to check"
if not found:
print(
f"{Fore.RED}[!] The check for site '{chosen_site.name}' failed!{Style.RESET_ALL}"
@@ -631,8 +634,8 @@ class Submitter:
# chosen_site.alexa_rank = rank
self.logger.info(chosen_site.json)
site_data = chosen_site.strip_engine_data()
self.logger.info(site_data.json)
stripped_site = chosen_site.strip_engine_data()
self.logger.info(stripped_site.json)
if old_site:
# Update old site with new values and log changes
@@ -651,7 +654,7 @@ class Submitter:
for field, display_name in fields_to_check.items():
old_value = getattr(old_site, field)
new_value = getattr(site_data, field)
new_value = getattr(stripped_site, field)
if field == 'tags' and not new_tags:
continue
if str(old_value) != str(new_value):
@@ -661,7 +664,7 @@ class Submitter:
old_site.__dict__[field] = new_value
# update the site
final_site = old_site if old_site else site_data
final_site = old_site if old_site else stripped_site
self.db.update_site(final_site)
# save the db in file
+6 -3
View File
@@ -8,7 +8,7 @@ from typing import Any
DEFAULT_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
]
@@ -71,7 +71,10 @@ class URLMatcher:
def ascii_data_display(data: str) -> Any:
return ast.literal_eval(data)
try:
return ast.literal_eval(data)
except (ValueError, SyntaxError):
return data
def get_dict_ascii_tree(items, prepend="", new_line=True):
@@ -86,7 +89,7 @@ def get_dict_ascii_tree(items, prepend="", new_line=True):
new_result + new_line if num != len(items) - 1 else last_result + new_line
)
if type(item) == tuple:
if isinstance(item, tuple):
field_name, field_value = item
if field_value.startswith("['"):
is_last_item = num == len(items) - 1
+3 -2
View File
@@ -13,6 +13,7 @@ import os
import asyncio
from datetime import datetime
from threading import Thread
from typing import Any, Dict
import maigret
import maigret.settings
from maigret.sites import MaigretDatabase
@@ -23,7 +24,7 @@ app = Flask(__name__)
app.secret_key = os.getenv('FLASK_SECRET_KEY', os.urandom(24).hex())
# add background job tracking
background_jobs = {}
background_jobs: Dict[str, Any] = {}
job_results = {}
# Configuration
@@ -260,7 +261,7 @@ def search():
target=process_search_task, args=(usernames, options, timestamp)
),
}
background_jobs[timestamp]['thread'].start()
background_jobs[timestamp]['thread'].start() # type: ignore[union-attr]
return redirect(url_for('status', timestamp=timestamp))
Generated
+1002 -862
View File
File diff suppressed because it is too large Load Diff
+2 -2
View File
@@ -1,5 +1,5 @@
maigret @ https://github.com/soxoj/maigret/archive/refs/heads/main.zip
pefile==2023.2.7 # do not bump while pyinstaller is 6.11.1, there is a conflict
psutil==7.1.3
pyinstaller==6.16.0
psutil==7.2.2
pyinstaller==6.19.0
pywin32-ctypes==0.2.3
+12 -6
View File
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "maigret"
version = "0.5.0"
version = "0.6.0"
description = "🕵️‍♂️ Collect a dossier on a person by username from thousands of sites."
authors = ["Soxoj <soxoj@protonmail.com>"]
readme = "README.md"
@@ -15,6 +15,11 @@ repository = "https://github.com/soxoj/maigret"
classifiers = [
"Development Status :: 5 - Production/Stable",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Intended Audience :: Information Technology",
"Operating System :: OS Independent",
"License :: OSI Approved :: MIT License",
@@ -38,18 +43,18 @@ arabic-reshaper = "^3.0.0"
async-timeout = "^5.0.1"
attrs = ">=25.3,<27.0"
certifi = ">=2025.6.15,<2027.0.0"
chardet = "^5.0.0"
chardet = ">=5,<8"
colorama = "^0.4.6"
future = "^1.0.0"
future-annotations= "^1.0.0"
html5lib = "^1.1"
idna = "^3.4"
Jinja2 = "^3.1.6"
lxml = ">=5.3,<7.0"
lxml = ">=6.0.2,<7.0"
MarkupSafe = "^3.0.2"
mock = "^5.1.0"
multidict = "^6.6.3"
pycountry = "^24.6.1"
pycountry = ">=24.6.1,<27.0.0"
PyPDF2 = "^3.0.1"
PySocks = "^1.7.1"
python-bidi = "^0.6.3"
@@ -57,7 +62,7 @@ requests = "^2.32.4"
requests-futures = "^1.0.2"
requests-toolbelt = "^1.0.0"
six = "^1.17.0"
socid-extractor = "^0.0.27"
socid-extractor = ">=0.0.27,<0.0.29"
soupsieve = "^2.6"
stem = "^1.8.1"
torrequest = "^0.1.0"
@@ -74,6 +79,7 @@ cloudscraper = "^1.2.71"
flask = {extras = ["async"], version = "^3.1.1"}
asgiref = "^3.9.1"
platformdirs = "^4.3.8"
curl-cffi = ">=0.14,<1.0"
[tool.poetry.group.dev.dependencies]
@@ -94,4 +100,4 @@ black = ">=25.1,<27.0"
[tool.poetry.scripts]
# Run with: poetry run maigret <username>
maigret = "maigret.maigret:run"
update_sitesmd = "utils.update_site_data:main"
update_sitesmd = "utils.update_site_data:main"
+1
View File
@@ -3,4 +3,5 @@
filterwarnings =
error
ignore::UserWarning
ignore:codecs.open\(\) is deprecated:DeprecationWarning:xmind.core.saver
asyncio_mode=auto
-3
View File
@@ -1,3 +0,0 @@
[mutmut]
paths_to_mutate=maigret/
tests_dir=tests/
+928 -1024
View File
File diff suppressed because it is too large Load Diff
+1 -1
View File
@@ -3,7 +3,7 @@ icon: static/maigret.png
name: maigret
summary: 🕵️‍♂️ Collect a dossier on a person by username from thousands of sites.
description: |
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required. Maigret is an easy-to-use and powerful fork of Sherlock.
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required.
Currently supported more than 3000 sites, search is launched against 500 popular sites in descending order of popularity by default. Also supported checking of Tor sites, I2P sites, and domains (via DNS resolving).
+72
View File
@@ -1,7 +1,12 @@
import asyncio
import logging
from mock import Mock
import pytest
from maigret import search
from maigret.checking import check_site_for_username, process_site_result
from maigret.result import MaigretCheckResult, MaigretCheckStatus
def site_result_except(server, username, **kwargs):
@@ -67,3 +72,70 @@ async def test_checking_by_message_negative(httpserver, local_test_db):
result = await search('unclaimed', site_dict=sites_dict, logger=Mock())
assert result['Message']['status'].is_found() is True
def test_process_site_result_threads_response_time(local_test_db):
"""process_site_result must thread the response_time kwarg into the result's query_time."""
site = local_test_db.sites_dict['StatusCode']
results_info = {
'username': 'claimed',
'parsing_enabled': False,
'url_user': site.url.replace('{username}', 'claimed'),
'status': None,
'rank': 0,
'url_main': site.url_main,
'ids_data': {},
}
response = ('body', 200, None)
logger = logging.getLogger('test')
query_notify = Mock()
out = process_site_result(
response, query_notify, logger, results_info, site,
response_time=1.234,
)
assert out['status'].query_time == pytest.approx(1.234)
def test_process_site_result_defaults_response_time_to_none(local_test_db):
"""Omitting response_time keeps query_time as None (backward compatible)."""
site = local_test_db.sites_dict['StatusCode']
results_info = {
'username': 'claimed',
'parsing_enabled': False,
'url_user': site.url.replace('{username}', 'claimed'),
'status': None,
'rank': 0,
'url_main': site.url_main,
'ids_data': {},
}
out = process_site_result(
('body', 200, None), Mock(), logging.getLogger('test'), results_info, site,
)
assert out['status'].query_time is None
@pytest.mark.slow
@pytest.mark.asyncio
async def test_query_time_populated_from_http_check(httpserver, local_test_db):
"""check_site_for_username measures HTTP round-trip and populates query_time."""
sites_dict = local_test_db.sites_dict
# Delay the response on the test HTTP server to produce a measurable query_time.
DELAY = 0.25
def delayed_handler(request):
import time as _time
_time.sleep(DELAY)
from werkzeug.wrappers import Response
return Response('ok', status=200)
httpserver.expect_request('/url', query_string='id=claimed').respond_with_handler(delayed_handler)
result = await search('claimed', site_dict={'StatusCode': sites_dict['StatusCode']}, logger=Mock())
status = result['StatusCode']['status']
assert status.is_found() is True
assert isinstance(status.query_time, float)
assert status.query_time >= DELAY
# Upper bound: the measurement should not wildly exceed the server delay.
assert status.query_time < DELAY + 5.0
+3
View File
@@ -48,6 +48,9 @@ DEFAULT_ARGS: Dict[str, Any] = {
'web': None,
'with_domains': False,
'xmind': False,
'md': False,
'no_autoupdate': False,
'force_update': False,
}
+83
View File
@@ -4,6 +4,30 @@ import pytest
from maigret.utils import is_country_tag
TOP_SITES_ALEXA_RANK_LIMIT = 50
KNOWN_SOCIAL_DOMAINS = [
"facebook.com",
"instagram.com",
"twitter.com",
"tiktok.com",
"vk.com",
"reddit.com",
"pinterest.com",
"snapchat.com",
"linkedin.com",
"tumblr.com",
"threads.net",
"bsky.app",
"myspace.com",
"weibo.com",
"mastodon.social",
"gab.com",
"minds.com",
"clubhouse.com",
]
@pytest.mark.slow
def test_tags_validity(default_db):
unknown_tags = set()
@@ -19,3 +43,62 @@ def test_tags_validity(default_db):
# if you see "unchecked" tag error, please, do
# maigret --db `pwd`/maigret/resources/data.json --self-check --tag unchecked --use-disabled-sites
assert unknown_tags == set()
@pytest.mark.slow
def test_top_sites_have_category_tag(default_db):
"""Top sites by alexaRank must have at least one category tag (not just country codes)."""
sites_ranked = sorted(
[s for s in default_db.sites if s.alexa_rank],
key=lambda s: s.alexa_rank,
)[:TOP_SITES_ALEXA_RANK_LIMIT]
missing_category = []
for site in sites_ranked:
category_tags = [t for t in site.tags if not is_country_tag(t)]
if not category_tags:
missing_category.append(f"{site.name} (rank {site.alexa_rank})")
assert missing_category == [], (
f"{len(missing_category)} top-{TOP_SITES_ALEXA_RANK_LIMIT} sites have no category tag: "
+ ", ".join(missing_category[:20])
)
@pytest.mark.slow
def test_no_unused_tags_in_registry(default_db):
"""Every tag in the registry should be used by at least one site."""
all_used_tags = set()
for site in default_db.sites:
for tag in site.tags:
if not is_country_tag(tag):
all_used_tags.add(tag)
registered_tags = set(default_db._tags)
unused = registered_tags - all_used_tags
assert unused == set(), f"Tags registered but not used by any site: {unused}"
@pytest.mark.slow
def test_social_networks_have_social_tag(default_db):
"""Known social network domains must have the 'social' tag."""
from urllib.parse import urlparse
missing_social = []
for site in default_db.sites:
url = site.url_main or ""
try:
hostname = urlparse(url).hostname or ""
except Exception:
continue
for domain in KNOWN_SOCIAL_DOMAINS:
if hostname == domain or hostname.endswith("." + domain):
if "social" not in site.tags:
missing_social.append(f"{site.name} ({domain})")
break
assert missing_social == [], (
f"{len(missing_social)} known social networks missing 'social' tag: "
+ ", ".join(missing_social)
)
+236
View File
@@ -0,0 +1,236 @@
"""Tests for the database auto-update system."""
import json
import os
import hashlib
from datetime import datetime, timezone, timedelta
from unittest.mock import patch, MagicMock
import pytest
from maigret.db_updater import (
_parse_version,
_needs_check,
_is_version_compatible,
_is_update_available,
_load_state,
_save_state,
_best_local,
_now_iso,
resolve_db_path,
force_update,
CACHED_DB_PATH,
BUNDLED_DB_PATH,
STATE_PATH,
MAIGRET_HOME,
)
def test_parse_version():
assert _parse_version("0.5.0") == (0, 5, 0)
assert _parse_version("1.2.3") == (1, 2, 3)
assert _parse_version("bad") == (0, 0, 0)
assert _parse_version("") == (0, 0, 0)
def test_needs_check_no_state():
assert _needs_check({}, 24) is True
def test_needs_check_recent():
state = {"last_check_at": _now_iso()}
assert _needs_check(state, 24) is False
def test_needs_check_expired():
old_time = (datetime.now(timezone.utc) - timedelta(hours=25)).strftime("%Y-%m-%dT%H:%M:%SZ")
state = {"last_check_at": old_time}
assert _needs_check(state, 24) is True
def test_needs_check_corrupt():
state = {"last_check_at": "not-a-date"}
assert _needs_check(state, 24) is True
def test_version_compatible():
with patch("maigret.db_updater.__version__", "0.5.0"):
assert _is_version_compatible({"min_maigret_version": "0.5.0"}) is True
assert _is_version_compatible({"min_maigret_version": "0.4.0"}) is True
assert _is_version_compatible({"min_maigret_version": "0.6.0"}) is False
assert _is_version_compatible({}) is True # missing field = compatible
def test_update_available_no_cache(tmp_path):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "nonexistent.json")):
assert _is_update_available({"updated_at": "2026-01-01T00:00:00Z"}, {}) is True
def test_update_available_newer(tmp_path):
cache = tmp_path / "data.json"
cache.write_text("{}")
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
state = {"last_meta": {"updated_at": "2026-01-01T00:00:00Z"}}
meta = {"updated_at": "2026-02-01T00:00:00Z"}
assert _is_update_available(meta, state) is True
def test_update_available_same(tmp_path):
cache = tmp_path / "data.json"
cache.write_text("{}")
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
state = {"last_meta": {"updated_at": "2026-01-01T00:00:00Z"}}
meta = {"updated_at": "2026-01-01T00:00:00Z"}
assert _is_update_available(meta, state) is False
def test_load_state_missing(tmp_path):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "missing.json")):
assert _load_state() == {}
def test_load_state_corrupt(tmp_path):
corrupt = tmp_path / "state.json"
corrupt.write_text("not json{{{")
with patch("maigret.db_updater.STATE_PATH", str(corrupt)):
assert _load_state() == {}
def test_save_and_load_state(tmp_path):
state_file = tmp_path / "state.json"
with patch("maigret.db_updater.STATE_PATH", str(state_file)):
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
_save_state({"last_check_at": "2026-01-01T00:00:00Z"})
loaded = _load_state()
assert loaded["last_check_at"] == "2026-01-01T00:00:00Z"
def test_best_local_with_valid_cache(tmp_path):
cache = tmp_path / "data.json"
cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
assert _best_local() == str(cache)
def test_best_local_with_corrupt_cache(tmp_path):
cache = tmp_path / "data.json"
cache.write_text("not json")
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
assert _best_local() == BUNDLED_DB_PATH
def test_best_local_no_cache(tmp_path):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
assert _best_local() == BUNDLED_DB_PATH
def test_resolve_db_path_custom_url():
result = resolve_db_path("https://example.com/db.json")
assert result == "https://example.com/db.json"
def test_resolve_db_path_custom_file(tmp_path):
custom_db = tmp_path / "custom" / "path.json"
custom_db.parent.mkdir(parents=True)
custom_db.write_text("{}")
result = resolve_db_path(str(custom_db))
assert result.endswith("custom/path.json")
def test_resolve_db_path_no_autoupdate(tmp_path):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
result = resolve_db_path("resources/data.json", no_autoupdate=True)
assert result == BUNDLED_DB_PATH
def test_resolve_db_path_no_autoupdate_with_cache(tmp_path):
cache = tmp_path / "data.json"
cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
result = resolve_db_path("resources/data.json", no_autoupdate=True)
assert result == str(cache)
@patch("maigret.db_updater._fetch_meta")
def test_resolve_db_path_network_failure(mock_fetch, tmp_path):
mock_fetch.return_value = None
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
result = resolve_db_path("resources/data.json")
assert result == BUNDLED_DB_PATH
# --- force_update tests ---
@patch("maigret.db_updater._fetch_meta")
def test_force_update_network_failure(mock_fetch, tmp_path):
mock_fetch.return_value = None
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
assert force_update() is False
@patch("maigret.db_updater._fetch_meta")
def test_force_update_incompatible_version(mock_fetch, tmp_path):
mock_fetch.return_value = {"min_maigret_version": "99.0.0", "sites_count": 100}
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
assert force_update() is False
@patch("maigret.db_updater._download_and_verify")
@patch("maigret.db_updater._fetch_meta")
def test_force_update_success(mock_fetch, mock_download, tmp_path):
mock_fetch.return_value = {
"min_maigret_version": "0.1.0",
"sites_count": 3200,
"updated_at": "2099-01-01T00:00:00Z",
"data_url": "https://example.com/data.json",
"data_sha256": "abc123",
}
mock_download.return_value = str(tmp_path / "data.json")
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
assert force_update() is True
state = _load_state()
assert state["last_meta"]["sites_count"] == 3200
@patch("maigret.db_updater._fetch_meta")
def test_force_update_already_up_to_date(mock_fetch, tmp_path):
cache = tmp_path / "data.json"
cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
state_file = tmp_path / "state.json"
state_file.write_text(json.dumps({
"last_check_at": _now_iso(),
"last_meta": {"updated_at": "2026-01-01T00:00:00Z", "sites_count": 3000},
}))
mock_fetch.return_value = {
"min_maigret_version": "0.1.0",
"sites_count": 3000,
"updated_at": "2026-01-01T00:00:00Z",
}
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(state_file)):
with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
assert force_update() is False
@patch("maigret.db_updater._download_and_verify")
@patch("maigret.db_updater._fetch_meta")
def test_force_update_download_fails(mock_fetch, mock_download, tmp_path):
mock_fetch.return_value = {
"min_maigret_version": "0.1.0",
"sites_count": 3200,
"updated_at": "2099-01-01T00:00:00Z",
"data_url": "https://example.com/data.json",
"data_sha256": "abc123",
}
mock_download.return_value = None
with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
assert force_update() is False
+2 -2
View File
@@ -36,7 +36,7 @@ def test_notify_about_errors():
},
}
results = notify_about_errors(results, query_notify=None, show_statistics=True)
notifications = notify_about_errors(results, query_notify=None, show_statistics=True)
# Check the output
expected_output = [
@@ -55,4 +55,4 @@ def test_notify_about_errors():
('Access denied: 25.0%', '!'),
('You can see detailed site check errors with a flag `--print-errors`', '-'),
]
assert results == expected_output
assert notifications == expected_output
+10 -9
View File
@@ -3,6 +3,7 @@
import pytest
import asyncio
import logging
from typing import Any, List, Tuple, Callable, Dict
from maigret.executors import (
AsyncioSimpleExecutor,
AsyncioProgressbarExecutor,
@@ -21,7 +22,7 @@ async def func(n):
@pytest.mark.asyncio
async def test_simple_asyncio_executor():
tasks = [(func, [n], {}) for n in range(10)]
tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
executor = AsyncioSimpleExecutor(logger=logger)
assert await executor.run(tasks) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert executor.execution_time > 0.2
@@ -30,7 +31,7 @@ async def test_simple_asyncio_executor():
@pytest.mark.asyncio
async def test_asyncio_progressbar_executor():
tasks = [(func, [n], {}) for n in range(10)]
tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarExecutor(logger=logger)
# no guarantees for the results order
@@ -41,7 +42,7 @@ async def test_asyncio_progressbar_executor():
@pytest.mark.asyncio
async def test_asyncio_progressbar_semaphore_executor():
tasks = [(func, [n], {}) for n in range(10)]
tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarSemaphoreExecutor(logger=logger, in_parallel=5)
# no guarantees for the results order
@@ -53,7 +54,7 @@ async def test_asyncio_progressbar_semaphore_executor():
@pytest.mark.slow
@pytest.mark.asyncio
async def test_asyncio_progressbar_queue_executor():
tasks = [(func, [n], {}) for n in range(10)]
tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=2)
assert await executor.run(tasks) == [0, 1, 3, 2, 4, 6, 7, 5, 9, 8]
@@ -81,22 +82,22 @@ async def test_asyncio_progressbar_queue_executor():
@pytest.mark.asyncio
async def test_asyncio_queue_generator_executor():
tasks = [(func, [n], {}) for n in range(10)]
tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=2)
results = [result async for result in executor.run(tasks)]
results = [result async for result in executor.run(tasks)] # type: ignore[arg-type]
assert results == [0, 1, 3, 2, 4, 6, 7, 5, 9, 8]
assert executor.execution_time > 0.5
assert executor.execution_time < 0.6
executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=3)
results = [result async for result in executor.run(tasks)]
results = [result async for result in executor.run(tasks)] # type: ignore[arg-type]
assert results == [0, 3, 1, 4, 6, 2, 7, 9, 5, 8]
assert executor.execution_time > 0.4
assert executor.execution_time < 0.5
executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=5)
results = [result async for result in executor.run(tasks)]
results = [result async for result in executor.run(tasks)] # type: ignore[arg-type]
assert results in (
[0, 3, 6, 1, 4, 7, 9, 2, 5, 8],
[0, 3, 6, 1, 4, 9, 7, 2, 5, 8],
@@ -105,7 +106,7 @@ async def test_asyncio_queue_generator_executor():
assert executor.execution_time < 0.4
executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=10)
results = [result async for result in executor.run(tasks)]
results = [result async for result in executor.run(tasks)] # type: ignore[arg-type]
assert results == [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
assert executor.execution_time > 0.2
assert executor.execution_time < 0.3
+84 -2
View File
@@ -2,6 +2,7 @@
import asyncio
import copy
from unittest.mock import patch
import pytest
from mock import Mock
@@ -11,7 +12,8 @@ from maigret.maigret import (
extract_ids_from_page,
extract_ids_from_results,
)
from maigret.sites import MaigretSite
from maigret.checking import site_self_check
from maigret.sites import MaigretSite, MaigretDatabase
from maigret.result import MaigretCheckResult, MaigretCheckStatus
from tests.conftest import RESULTS_EXAMPLE
@@ -37,6 +39,86 @@ async def test_self_check_db(test_db):
assert test_db.sites_dict['InvalidInactive'].disabled is True
@pytest.mark.slow
@pytest.mark.asyncio
async def test_self_check_no_progressbar(test_db):
"""Verify that no_progressbar=True disables the alive_bar in self_check."""
logger = Mock()
with patch('maigret.checking.alive_bar') as mock_alive_bar:
mock_bar = Mock()
mock_alive_bar.return_value.__enter__ = Mock(return_value=mock_bar)
mock_alive_bar.return_value.__exit__ = Mock(return_value=False)
await self_check(
test_db, test_db.sites_dict, logger, silent=True,
no_progressbar=True,
)
# First call is the self-check progress bar; subsequent calls are
# from inner search() invocations.
self_check_call = mock_alive_bar.call_args_list[0]
_, kwargs = self_check_call
assert kwargs.get('title') == 'Self-checking'
assert kwargs.get('disable') is True
@pytest.mark.slow
@pytest.mark.asyncio
async def test_self_check_progressbar_enabled_by_default(test_db):
"""Verify that alive_bar is enabled by default (no_progressbar=False)."""
logger = Mock()
with patch('maigret.checking.alive_bar') as mock_alive_bar:
mock_bar = Mock()
mock_alive_bar.return_value.__enter__ = Mock(return_value=mock_bar)
mock_alive_bar.return_value.__exit__ = Mock(return_value=False)
await self_check(
test_db, test_db.sites_dict, logger, silent=True,
)
self_check_call = mock_alive_bar.call_args_list[0]
_, kwargs = self_check_call
assert kwargs.get('title') == 'Self-checking'
assert kwargs.get('disable') is False
@pytest.mark.asyncio
async def test_site_self_check_handles_exception(test_db):
"""Verify that site_self_check catches unexpected exceptions and returns a valid result."""
logger = Mock()
sem = asyncio.Semaphore(1)
site = test_db.sites_dict['ValidActive']
with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
result = await site_self_check(site, logger, sem, test_db)
assert isinstance(result, dict)
assert "issues" in result
assert len(result["issues"]) > 0
assert any("Unexpected error" in issue for issue in result["issues"])
@pytest.mark.asyncio
async def test_self_check_handles_task_exception(test_db):
"""Verify that self_check continues when individual site checks raise exceptions."""
logger = Mock()
with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
result = await self_check(
test_db, test_db.sites_dict, logger, silent=True,
no_progressbar=True,
)
assert isinstance(result, dict)
assert 'results' in result
assert len(result['results']) == len(test_db.sites_dict)
for r in result['results']:
assert 'site_name' in r
assert 'issues' in r
@pytest.mark.slow
@pytest.mark.skip(reason="broken, fixme")
def test_maigret_results(test_db):
@@ -112,7 +194,7 @@ def test_extract_ids_from_page(test_db):
def test_extract_ids_from_results(test_db):
TEST_EXAMPLE = copy.deepcopy(RESULTS_EXAMPLE)
TEST_EXAMPLE: dict = copy.deepcopy(RESULTS_EXAMPLE)
TEST_EXAMPLE['Reddit']['ids_usernames'] = {'test1': 'yandex_public_id'}
TEST_EXAMPLE['Reddit']['ids_links'] = ['https://www.reddit.com/user/test2']
+1 -1
View File
@@ -6,7 +6,7 @@ import os
import pytest
from io import StringIO
import xmind
import xmind # type: ignore[import-untyped]
from jinja2 import Template
from maigret.report import (
+3 -1
View File
@@ -1,8 +1,10 @@
"""Maigret Database test functions"""
from typing import Any, Dict
from maigret.sites import MaigretDatabase, MaigretSite
EXAMPLE_DB = {
EXAMPLE_DB: Dict[str, Any] = {
'engines': {
"XenForo": {
"presenseStrs": ["XenForo"],
+2 -2
View File
@@ -28,7 +28,7 @@ async def test_detect_known_engine(test_db, local_test_db):
url_exists = "https://devforum.zoom.us/u/adam"
url_mainpage = "https://devforum.zoom.us/"
# Mock extract_username_dialog to return "adam"
submitter.extract_username_dialog = MagicMock(return_value="adam")
submitter.extract_username_dialog = MagicMock(return_value="adam") # type: ignore[method-assign]
sites, resp_text = await submitter.detect_known_engine(
url_exists, url_mainpage, session=None, follow_redirects=False, headers=None
@@ -111,7 +111,7 @@ async def test_check_features_manually_success(settings):
@pytest.mark.slow
@pytest.mark.asyncio
async def test_check_features_manually_success(settings):
async def test_check_features_manually_cloudflare(settings):
# Setup
db = MaigretDatabase()
logger = logging.getLogger("test_logger")
+59
View File
@@ -0,0 +1,59 @@
"""Generate db_meta.json from data.json for the auto-update system."""
import argparse
import hashlib
import json
import os.path as path
import sys
from datetime import datetime, timezone
RESOURCES_DIR = path.join(path.dirname(path.dirname(path.abspath(__file__))), "maigret", "resources")
DATA_JSON_PATH = path.join(RESOURCES_DIR, "data.json")
META_JSON_PATH = path.join(RESOURCES_DIR, "db_meta.json")
DEFAULT_DATA_URL = "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
def get_current_version():
version_file = path.join(path.dirname(path.dirname(path.abspath(__file__))), "maigret", "__version__.py")
with open(version_file) as f:
for line in f:
if line.startswith("__version__"):
return line.split("=")[1].strip().strip("'\"")
return "0.0.0"
def main():
parser = argparse.ArgumentParser(description="Generate db_meta.json from data.json")
parser.add_argument("--min-version", default=None, help="Minimum compatible maigret version (default: current version)")
parser.add_argument("--data-url", default=DEFAULT_DATA_URL, help="URL where data.json can be downloaded")
args = parser.parse_args()
min_version = args.min_version or get_current_version()
with open(DATA_JSON_PATH, "rb") as f:
raw = f.read()
sha256 = hashlib.sha256(raw).hexdigest()
data = json.loads(raw)
sites_count = len(data.get("sites", {}))
meta = {
"version": 1,
"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
"sites_count": sites_count,
"min_maigret_version": min_version,
"data_sha256": sha256,
"data_url": args.data_url,
}
with open(META_JSON_PATH, "w", encoding="utf-8") as f:
json.dump(meta, f, indent=4, ensure_ascii=False)
print(f"Generated {META_JSON_PATH}")
print(f" sites: {sites_count}")
print(f" sha256: {sha256[:16]}...")
print(f" min_version: {min_version}")
if __name__ == "__main__":
main()
+67 -9
View File
@@ -26,6 +26,7 @@ sys.path.insert(0, str(Path(__file__).parent.parent))
try:
import aiohttp
from yarl import URL as YarlURL
except ImportError:
print("aiohttp not installed. Run: pip install aiohttp")
sys.exit(1)
@@ -74,8 +75,14 @@ def color(text: str, c: str) -> str:
async def check_url_aiohttp(url: str, headers: dict = None, follow_redirects: bool = True,
timeout: int = 15, ssl_verify: bool = False) -> dict:
"""Check a URL using aiohttp and return detailed response info."""
timeout: int = 15, ssl_verify: bool = False,
method: str = "GET", payload: dict = None) -> dict:
"""Check a URL using aiohttp and return detailed response info.
Args:
method: HTTP method ("GET" or "POST").
payload: JSON payload for POST requests (dict, will be serialized).
"""
headers = headers or DEFAULT_HEADERS.copy()
result = {
"method": "aiohttp",
@@ -96,7 +103,14 @@ async def check_url_aiohttp(url: str, headers: dict = None, follow_redirects: bo
timeout_obj = aiohttp.ClientTimeout(total=timeout)
async with aiohttp.ClientSession(connector=connector, timeout=timeout_obj) as session:
async with session.get(url, headers=headers, allow_redirects=follow_redirects) as resp:
# Use encoded=True if URL contains percent-encoded chars to prevent double-encoding
request_url = YarlURL(url, encoded=True) if '%' in url else url
request_kwargs = dict(headers=headers, allow_redirects=follow_redirects)
if method.upper() == "POST" and payload is not None:
request_kwargs["json"] = payload
request_fn = session.post if method.upper() == "POST" else session.get
async with request_fn(request_url, **request_kwargs) as resp:
result["status"] = resp.status
result["final_url"] = str(resp.url)
@@ -438,21 +452,54 @@ async def diagnose_site(site_config: dict, site_name: str) -> dict:
print(f" {color('[!]', Colors.RED)} No usernameClaimed defined")
return diagnosis
# Build full URL
# Build full URL (display URL)
url_template = url.replace("{urlMain}", url_main).replace("{urlSubpath}", site_config.get("urlSubpath", ""))
# Build probe URL (what Maigret actually requests)
url_probe = site_config.get("urlProbe", "")
if url_probe:
probe_template = url_probe.replace("{urlMain}", url_main).replace("{urlSubpath}", site_config.get("urlSubpath", ""))
else:
probe_template = url_template
# Detect request method and payload
request_method = site_config.get("requestMethod", "GET").upper()
request_payload_template = site_config.get("requestPayload")
headers = DEFAULT_HEADERS.copy()
# For API probes (urlProbe, POST), use neutral Accept header instead of text/html
# which can cause servers to return HTML instead of JSON
if url_probe or request_method == "POST":
headers["Accept"] = "*/*"
if site_config.get("headers"):
headers.update(site_config["headers"])
if url_probe:
print(f" urlProbe: {url_probe}")
if request_method != "GET":
print(f" requestMethod: {request_method}")
if request_payload_template:
print(f" requestPayload: {request_payload_template}")
# 2. Connectivity test
print(f"\n--- {color('2. CONNECTIVITY TEST', Colors.BOLD)} ---")
url_claimed = url_template.replace("{username}", claimed)
url_unclaimed = url_template.replace("{username}", unclaimed)
probe_claimed = probe_template.replace("{username}", claimed)
probe_unclaimed = probe_template.replace("{username}", unclaimed)
# Build payloads with username substituted
payload_claimed = None
payload_unclaimed = None
if request_payload_template and request_method == "POST":
payload_claimed = json.loads(
json.dumps(request_payload_template).replace("{username}", claimed)
)
payload_unclaimed = json.loads(
json.dumps(request_payload_template).replace("{username}", unclaimed)
)
result_claimed, result_unclaimed = await asyncio.gather(
check_url_aiohttp(url_claimed, headers),
check_url_aiohttp(url_unclaimed, headers)
check_url_aiohttp(probe_claimed, headers, method=request_method, payload=payload_claimed),
check_url_aiohttp(probe_unclaimed, headers, method=request_method, payload=payload_unclaimed)
)
print(f" Claimed ({claimed}): status={result_claimed['status']}, error={result_claimed['error']}")
@@ -523,7 +570,18 @@ async def diagnose_site(site_config: dict, site_name: str) -> dict:
diagnosis["warnings"].append(f"absenceStrs not found in unclaimed page")
print(f" {color('[WARN]', Colors.YELLOW)} absenceStrs not found in unclaimed page")
if presense_found_claimed and not absence_found_claimed and absence_found_unclaimed:
# Check works if: claimed is detected as present AND unclaimed is detected as absent.
# Presence detection: presenseStrs found (or empty = always true).
# Absence detection: absenceStrs found in unclaimed (or empty = never, rely on presenseStrs only).
# With only presenseStrs: works if found in claimed but NOT in unclaimed.
# With only absenceStrs: works if found in unclaimed but NOT in claimed.
# With both: standard combination.
claimed_is_present = presense_found_claimed and not absence_found_claimed
unclaimed_is_absent = (
(absence_strs and absence_found_unclaimed) or
(presense_strs and not presense_found_unclaimed)
)
if claimed_is_present and unclaimed_is_absent:
print(f" {color('[OK]', Colors.GREEN)} Message check should work correctly")
diagnosis["working"] = True
+84
View File
@@ -4,6 +4,7 @@ This module generates the listing of supported sites in file `SITES.md`
and pretty prints file with sites data.
"""
import sys
import socket
import requests
import logging
import threading
@@ -64,6 +65,49 @@ def get_base_domain(url):
return ""
def check_dns(domain, timeout=5):
"""Check if a domain resolves via DNS. Returns True if it resolves."""
try:
socket.setdefaulttimeout(timeout)
socket.getaddrinfo(domain, None)
return True
except (socket.gaierror, socket.timeout, OSError):
return False
def check_sites_dns(sites):
"""Check DNS resolution for all sites. Returns a set of site names that failed."""
SKIP_TLDS = ('.onion', '.i2p')
domains = {}
for site in sites:
domain = get_base_domain(site.url_main)
if domain and not any(domain.endswith(tld) for tld in SKIP_TLDS):
domains.setdefault(domain, []).append(site)
failed_sites = set()
results = {}
def resolve(domain):
results[domain] = check_dns(domain)
threads = []
for domain in domains:
t = threading.Thread(target=resolve, args=(domain,))
threads.append(t)
t.start()
for t in threads:
t.join()
for domain, resolved in results.items():
if not resolved:
for site in domains[domain]:
failed_sites.add(site.name)
logging.warning(f"DNS resolution failed for {domain}")
return failed_sites
def get_step_rank(rank):
def get_readable_rank(r):
return RANKS[str(r)]
@@ -86,6 +130,8 @@ def main():
parser.add_argument('--empty-only', help='update only sites without rating', action='store_true')
parser.add_argument('--exclude-engine', help='do not update score with certain engine',
action="append", dest="exclude_engine_list", default=[])
parser.add_argument('--dns-check', help='disable sites whose domains do not resolve via DNS',
action='store_true')
pool = list()
@@ -103,6 +149,24 @@ Rank data fetched from Majestic Million by domains.
""")
if args.dns_check:
print("Checking DNS resolution for all site domains...")
failed = check_sites_dns(sites_subset)
disabled_count = 0
re_enabled_count = 0
for site in sites_subset:
if site.name in failed:
if not site.disabled:
site.disabled = True
disabled_count += 1
print(f" Disabled {site.name}: DNS does not resolve ({get_base_domain(site.url_main)})")
else:
if site.disabled:
# Re-enable previously disabled site if DNS now resolves
# (only if it was likely disabled due to DNS failure)
pass
print(f"DNS check complete: {disabled_count} site(s) disabled, {len(failed)} domain(s) unresolvable.")
majestic_ranks = {}
if args.with_rank:
majestic_ranks = fetch_majestic_million()
@@ -153,6 +217,26 @@ Rank data fetched from Majestic Million by domains.
site_file.write(f'\nThe list was updated at ({datetime.now(timezone.utc).date()})\n')
db.save_to_file(args.base_file)
# Regenerate db_meta.json to stay in sync with data.json
try:
import hashlib, json, os
db_data_raw = open(args.base_file, 'rb').read()
db_data_parsed = json.loads(db_data_raw)
meta = {
"version": 1,
"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
"sites_count": len(db_data_parsed.get("sites", {})),
"min_maigret_version": "0.5.0",
"data_sha256": hashlib.sha256(db_data_raw).hexdigest(),
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json",
}
meta_path = os.path.join(os.path.dirname(args.base_file), "db_meta.json")
with open(meta_path, "w", encoding="utf-8") as mf:
json.dump(meta, mf, indent=4, ensure_ascii=False)
print(f"Updated {meta_path} ({meta['sites_count']} sites)")
except Exception as e:
print(f"Warning: could not regenerate db_meta.json: {e}")
statistics_text = db.get_db_stats(is_markdown=True)
site_file.write('## Statistics\n\n')
site_file.write(statistics_text)