Compare commits

122 Commits

Author SHA1 Message Date
Soxoj bbeb7e9015 Merge branch 'main' into check-error-test 2026-05-09 11:00:11 +03:00
Soxoj 199426fb17 Add test for CheckError bug 2026-05-09 10:59:16 +03:00
idontknow 9838176205 Fix context field using class instead of instance in error handling (#2627)
In process_site_result(), when a check_error is present, the context
field was set to str(CheckError) (the class itself) instead of
str(check_error) (the error instance). This caused the context to
contain the string representation of the class rather than the actual
error message.

Before fix: context = "<class 'maigret.errors.CheckError'>"
After fix: context = "Request timeout error: slow server"
2026-05-09 10:58:06 +03:00
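The class-versus-instance mix-up described in #2627 can be reproduced in a few lines. This is only a sketch: the `CheckError` class below is a hypothetical stand-in whose constructor and `__str__` are assumptions, not the real `maigret.errors.CheckError`, and the module name in the class repr depends on where the code runs.

```python
class CheckError:
    """Hypothetical stand-in for maigret.errors.CheckError (assumed shape)."""

    def __init__(self, typename: str, desc: str = "") -> None:
        self.typename = typename
        self.desc = desc

    def __str__(self) -> str:
        # Render "<type> error: <description>", or just "<type> error".
        if not self.desc:
            return f"{self.typename} error"
        return f"{self.typename} error: {self.desc}"


check_error = CheckError("Request timeout", "slow server")

# Bug: str() applied to the class yields the class repr, not a message.
buggy_context = str(CheckError)

# Fix: str() applied to the instance calls __str__ and yields the message.
fixed_context = str(check_error)

print(buggy_context)
print(fixed_context)
```

Because `str()` on a class object never reaches the instance-level `__str__`, the buggy form raises no error and silently stores an unhelpful string.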
Soxoj 5c93b206e7 Cloudflare bypass webgate (#2628) 2026-05-09 10:48:43 +03:00
dependabot[bot] b98a134fcf build(deps-dev): bump mypy from 1.20.2 to 2.0.0 (#2625)
Bumps [mypy](https://github.com/python/mypy) from 1.20.2 to 2.0.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.20.2...v2.0.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 2.0.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-07 23:52:15 +03:00
dependabot[bot] 1258ee0898 build(deps): bump python-bidi from 0.6.7 to 0.6.9 (#2622)
Bumps [python-bidi](https://github.com/MeirKriheli/python-bidi) from 0.6.7 to 0.6.9.
- [Release notes](https://github.com/MeirKriheli/python-bidi/releases)
- [Changelog](https://github.com/MeirKriheli/python-bidi/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/MeirKriheli/python-bidi/compare/v0.6.7...v0.6.9)

---
updated-dependencies:
- dependency-name: python-bidi
  dependency-version: 0.6.9
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-06 10:55:08 +02:00
Soxoj 79e93ab715 AI mode documentation (#2620) 2026-05-05 22:21:00 +02:00
Danilo Salve 52c8917e2c refactor: reduce the cognitive complexity of get_ai_analysis (#2581) 2026-05-05 20:52:34 +02:00
Soxoj 846feb6e7e Add web interface tests (#2619) 2026-05-05 19:32:01 +02:00
Sayon Dey c510734e5e Fix network graph height to use viewport units (#2590) 2026-05-05 18:46:47 +02:00
Soxoj 03b62027f6 Fixed duplicates of YouTube and Periscope (#2618) 2026-05-05 14:02:37 +02:00
Soxoj f293bff417 Fix site checks: 7 fixed, 1 disabled, 1 dead deleted (#2616) 2026-05-04 23:40:58 +02:00
github-actions[bot] 341db55099 Updated site list and statistics (#2615)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-05-04 14:24:49 +02:00
Soxoj a77a8b3e84 Reddit fix (#2614) 2026-05-04 14:12:22 +02:00
Soxoj 3ff05b240a Fix site checks: 8 → ip_reputation, 6 fixed, 9 disabled, 1 dead deleted (#2611) 2026-05-03 20:02:45 +02:00
Sayon Dey 05d1eb6fb0 Improved Python Package Workflow (#2594) 2026-05-03 11:25:06 +02:00
Sayon Dey 6cf5604075 Improve startup error message for missing dependencies (#2593)
* Improve startup error message for missing dependencies

* Enhance error message for missing dependencies

Updated import error message to include installation instructions for PyPI and cloned repository.

* Enhance missing dependency error message

Updated error message for missing dependency to include installation instructions for both PyPI and local repository.
2026-05-03 11:10:31 +02:00
github-actions[bot] ff0ffce427 Updated site list and statistics (#2607)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-05-03 10:49:46 +02:00
HaiTao Wu ac1e3d33ec docs: add Simplified Chinese README translation (#2606) 2026-05-03 10:35:15 +02:00
Julio César Suástegui 8b5dce1d3c fix: disable RomanticCollection check (#2588)
* fix: disable RomanticCollection check

* chore: regenerate db metadata

---------

Co-authored-by: Julio César Suástegui <juliosuas@users.noreply.github.com>
2026-05-02 15:29:45 +02:00
Sayon Dey f897598f98 Fix outdated Google Colab setup instructions (#2591) 2026-05-02 15:21:16 +02:00
Soxoj 606fba01b4 Update CONTRIBUTING.md with instructions for developers (#2589) 2026-05-02 10:39:56 +02:00
egrezeli 9dbefcef11 Fix ID extraction crash when regex groups are optional (#2572)
* Fix ID extraction crash when regex groups are optional

Handle None capture groups in username/id extraction and add regression coverage for optional trailing groups.

* Remove leftover line that overwrote safe _id in extract_id_from_url
2026-05-01 00:14:40 +02:00
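The failure mode behind #2572 can be sketched as follows, assuming the crash came from treating an unmatched optional group as a string. The pattern, the URLs, and the helper name `extract_ids` are hypothetical illustrations, not the project's actual code.

```python
import re
from typing import List

# Hypothetical URL pattern with an optional trailing capture group.
PROFILE_PATTERN = r"https?://example\.com/users/(\w+)(?:/(\d+))?"


def extract_ids(pattern: str, url: str) -> List[str]:
    """Return the captured groups that actually took part in the match."""
    match = re.search(pattern, url)
    if match is None:
        return []
    # Match.groups() reports optional groups that did not participate as
    # None, so filter them out before calling any string method.
    return [group.strip() for group in match.groups() if group is not None]


print(extract_ids(PROFILE_PATTERN, "https://example.com/users/alice"))
print(extract_ids(PROFILE_PATTERN, "https://example.com/users/alice/42"))
```

Without the `is not None` filter, the first call would raise `AttributeError` on the missing trailing group.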
dependabot[bot] 533884bad5 build(deps): bump reportlab from 4.4.10 to 4.5.0 (#2578)
Bumps [reportlab](https://www.reportlab.com/) from 4.4.10 to 4.5.0.

---
updated-dependencies:
- dependency-name: reportlab
  dependency-version: 4.5.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-30 22:44:05 +02:00
github-actions[bot] 12c8721a16 Updated site list and statistics (#2576)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-29 17:20:42 +02:00
Soxoj b79f8aca28 Add site checks: 18 new sites (#2575) 2026-04-29 16:55:47 +02:00
dependabot[bot] 1a9fe77d6e build(deps): bump arabic-reshaper from 3.0.0 to 3.0.1 (#2573)
Bumps [arabic-reshaper](https://github.com/mpcabd/python-arabic-reshaper) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/mpcabd/python-arabic-reshaper/releases)
- [Commits](https://github.com/mpcabd/python-arabic-reshaper/compare/v3.0.0...v3.0.1)

---
updated-dependencies:
- dependency-name: arabic-reshaper
  dependency-version: 3.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 12:37:52 +02:00
Soxoj 1352bd35c6 Fix site checks: 5 fixed, 4 disabled; fix UA leak bug (#2569) 2026-04-26 14:51:44 +02:00
Soxoj 3960510b63 Fix site checks: 7 fixed, 1 disabled (#2565)
False-positive site probe issues #2531, #2542, #2556, #2559, #2560, #2561, #2563, #2496.
2026-04-26 12:34:52 +02:00
Soxoj a7bda700b4 Add Docker web image with multi-stage building (#2564) 2026-04-26 11:45:08 +02:00
Soxoj e962b8c693 Fix site checks: 5 fixed; readme fix (#2562)
* Fix site checks: 5 fixed; readme fix

* Logging improvements

* Improve YouTube data extraction
2026-04-25 18:15:38 +02:00
Julio César Suástegui c6cfef84ce test: loosen executor timing upper bounds for slower CI (#2558)
The <0.3/<0.4/etc. upper bounds don't leave room for Darwin or
emulated/aarch64 runners, which have been seeing 0.7s+ on tests
that expected <0.3s.

Bumped each upper bound by 0.7s. Lower bounds are unchanged; they
still validate that tasks ran in parallel rather than serially.

refs #679

Co-authored-by: Julio César Suástegui <juliosuas@users.noreply.github.com>
2026-04-25 15:24:43 +02:00
dependabot[bot] b0ed09eb3e build(deps): bump idna from 3.12 to 3.13 (#2553)
Bumps [idna](https://github.com/kjd/idna) from 3.12 to 3.13.
- [Release notes](https://github.com/kjd/idna/releases)
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst)
- [Commits](https://github.com/kjd/idna/compare/v3.12...v3.13)

---
updated-dependencies:
- dependency-name: idna
  dependency-version: '3.13'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-25 15:23:35 +02:00
dependabot[bot] 4e3bd3ab58 build(deps): bump pyinstaller from 6.19.0 to 6.20.0 (#2554)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.19.0 to 6.20.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.19.0...v6.20.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 16:25:18 +02:00
Soxoj 77c11df119 Fix Google Cloud Shell launch (#2557) 2026-04-23 21:45:27 +02:00
Soxoj 25026e21ea Fix site checks: 4 → ip_reputation, 9 fixed, 16 disabled, 3 dead dele… (#2555)
* Fix site checks: 4 → ip_reputation, 9 fixed, 16 disabled, 3 dead deleted; clarify ip_reputation tag semantics

* Improved test coverage
2026-04-23 21:17:07 +02:00
Soxoj b1004588af AI mode (#2529)
* Add AI mode
2026-04-23 12:12:54 +02:00
dependabot[bot] 4bd2f7cb35 build(deps): bump certifi from 2026.2.25 to 2026.4.22 (#2552)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2026.2.25 to 2026.4.22.
- [Commits](https://github.com/certifi/python-certifi/compare/2026.02.25...2026.04.22)

---
updated-dependencies:
- dependency-name: certifi
  dependency-version: 2026.4.22
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-23 09:52:52 +02:00
Soxoj 5e1cc45c17 Fix site checks: 12 fixed, 19 disabled; add new protection tags (#2550) 2026-04-22 20:25:41 +02:00
Soxoj d9b361b626 Fix site checks: 3 → ip_reputation, 10 fixed, 6 disabled, 2 dead deleted (#2549) 2026-04-22 12:46:53 +02:00
dependabot[bot] bfc6601c96 build(deps): bump idna from 3.11 to 3.12 (#2548)
Bumps [idna](https://github.com/kjd/idna) from 3.11 to 3.12.
- [Release notes](https://github.com/kjd/idna/releases)
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst)
- [Commits](https://github.com/kjd/idna/compare/v3.11...v3.12)

---
updated-dependencies:
- dependency-name: idna
  dependency-version: '3.12'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-22 10:26:13 +02:00
dependabot[bot] 53ff696707 build(deps-dev): bump mypy from 1.20.1 to 1.20.2 (#2547)
Bumps [mypy](https://github.com/python/mypy) from 1.20.1 to 1.20.2.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.20.1...v1.20.2)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.20.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-22 10:24:17 +02:00
Soxoj 0131f0b64c Add OnlyFans with activation mechanism; updated site ranks (#2546) 2026-04-21 19:03:45 +02:00
github-actions[bot] a5e558c5e8 Updated site list and statistics (#2545)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-21 18:27:51 +02:00
Soxoj e8393bfce3 Fix site checks: 3 fixed, 2 → ip_reputation, 7 disabled, 1 dead deleted (#2543) 2026-04-21 16:02:36 +02:00
github-actions[bot] 519eeb4d21 Updated site list and statistics (#2541)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-21 11:27:40 +02:00
Soxoj 98f03c153b Add 3 crypto sites (Polymarket, Zora, Revolut.me), added crypto inves… (#2538)
* Add 3 crypto sites (Polymarket, Zora, Revolut.me), added crypto investigation use case page in docs

* Added fintech tag

* Updated sites metadata
2026-04-21 11:08:48 +02:00
Soxoj 1f823e8322 Fix site checks: 3 fixed, 2 → ip_reputation, 7 disabled, 1 dead deleted (#2539) 2026-04-21 10:58:45 +02:00
Soxoj d6905a8fd8 Fix site checks: 4 fixed, 14 → ip_reputation, 8 disabled, 5 dead deleted (#2537) 2026-04-21 00:40:24 +02:00
Soxoj 7d216638fa fix site checks: 14 sites → ip_reputation, 7 disabled, 5 dead deleted (#2536) 2026-04-20 23:51:18 +02:00
Soxoj fb71f26fd0 Fix site checks: recover 6 CF sites via tls_fingerprint, 500px GraphQL, delete 4 dead domains (#2535) 2026-04-20 22:41:51 +02:00
dependabot[bot] 621b104523 build(deps): bump lxml from 6.0.4 to 6.1.0 (#2533)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.4 to 6.1.0.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.4...lxml-6.1.0)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 20:48:47 +02:00
Soxoj 37ce4fe728 Update of Readme and documentation (#2514)
* Big readme update

* Readme and documentation update

* Readme structure update

* Small fixes

* Changelog update
2026-04-17 17:42:36 +02:00
Soxoj f74f82ee13 Fixed: Hack MD, DailyKos, Mywed, WikimapiaSearch, TikTok Online Viewer (#2526, #2522, #2523, #2500, #2496); Disabled: Radiokot, Lurkmore, Mylespaul, AppleDiscussions, Loveplanet (#2524, #2511, #2498) (#2528) 2026-04-17 17:04:50 +02:00
dependabot[bot] 7e6d70a680 build(deps): bump pypdf from 6.10.0 to 6.10.2 (#2527)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.10.0 to 6.10.2.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.10.0...6.10.2)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.10.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-17 15:00:33 +02:00
dependabot[bot] e900d4a853 build(deps): bump chardet from 7.4.2 to 7.4.3 (#2521)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.2 to 7.4.3.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.2...7.4.3)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-16 12:05:44 +02:00
dependabot[bot] 9ee4eb9b69 build(deps): bump pillow from 12.1.1 to 12.2.0 (#2520)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 12.1.1 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/12.1.1...12.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-14 11:01:45 +02:00
dependabot[bot] 53f21eda98 build(deps-dev): bump mypy from 1.20.0 to 1.20.1 (#2518)
Bumps [mypy](https://github.com/python/mypy) from 1.20.0 to 1.20.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.20.0...v1.20.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.20.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 16:47:55 +02:00
dependabot[bot] 1a0f36ffb6 build(deps): bump chardet from 7.4.1 to 7.4.2 (#2517)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.1 to 7.4.2.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.1...7.4.2)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 15:15:41 +02:00
dependabot[bot] 14114e681c build(deps): bump lxml from 6.0.3 to 6.0.4 (#2519)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.3 to 6.0.4.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.3...lxml-6.0.4)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.0.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 15:15:17 +02:00
dependabot[bot] bc0649e6a8 build(deps-dev): bump tuna from 0.5.11 to 0.5.13 (#2516)
Bumps [tuna](https://github.com/nschloe/tuna) from 0.5.11 to 0.5.13.
- [Release notes](https://github.com/nschloe/tuna/releases)
- [Commits](https://github.com/nschloe/tuna/compare/v0.5.11...v0.5.13)

---
updated-dependencies:
- dependency-name: tuna
  dependency-version: 0.5.13
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 11:36:27 +02:00
Soxoj 8267367bed Support Python 3.14 in tests (#2515) 2026-04-11 15:11:09 +02:00
Michael Sargis 28cb6c9ffb Remove duplicate attribute initialization in SimpleAiohttpChecker.__init__ (#2513)
self.allow_redirects and self.timeout were each initialized twice in
SimpleAiohttpChecker.__init__, which is redundant code.

Co-authored-by: zocomputer <help@zocomputer.com>
2026-04-11 10:35:56 +02:00
dependabot[bot] 7a31328325 build(deps): bump pypdf from 6.9.2 to 6.10.0 (#2512)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.9.2 to 6.10.0.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.9.2...6.10.0)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.10.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-11 00:14:13 +02:00
Soxoj 7fd9bb3692 Update workflow to trigger on published releases (#2508) 2026-04-10 12:34:00 +02:00
Soxoj 385f9f5bb3 Bump to 0.6.0 (#2506) 2026-04-10 12:30:43 +02:00
Soxoj dc8751ac55 Added 3 sites, fixed 6, disabled 8 (#2505) 2026-04-10 12:26:41 +02:00
Copilot 9303b1686d Disable Kinja.com site check (#2503) 2026-04-10 12:16:28 +02:00
dependabot[bot] aa80bd4232 build(deps): bump lxml from 6.0.2 to 6.0.3 (#2501)
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.2 to 6.0.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-6.0.2...lxml-6.0.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.0.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 11:08:04 +02:00
dependabot[bot] f5c4b1c35d build(deps): bump socid-extractor from 0.0.27 to 0.0.28 (#2502)
Bumps [socid-extractor](https://github.com/soxoj/socid-extractor) from 0.0.27 to 0.0.28.
- [Release notes](https://github.com/soxoj/socid-extractor/releases)
- [Changelog](https://github.com/soxoj/socid-extractor/blob/master/CHANGELOG.md)
- [Commits](https://github.com/soxoj/socid-extractor/compare/0.0.27...v0.0.28)

---
updated-dependencies:
- dependency-name: socid-extractor
  dependency-version: 0.0.28
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 11:05:54 +02:00
Soxoj 5e24117e93 Fix false positives (#2499)
* Re-disable 29 false positives from #2478
2026-04-09 17:48:45 +02:00
Soxoj 777e503e30 Re-enable 69 stale-disabled sites validated via self-check (#2478)
Total: 2539 → 2608 enabled sites (+69).
2026-04-09 12:27:48 +02:00
dependabot[bot] c222c96aeb build(deps): bump platformdirs from 4.9.4 to 4.9.6 (#2477)
Bumps [platformdirs](https://github.com/tox-dev/platformdirs) from 4.9.4 to 4.9.6.
- [Release notes](https://github.com/tox-dev/platformdirs/releases)
- [Changelog](https://github.com/tox-dev/platformdirs/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/platformdirs/compare/4.9.4...4.9.6)

---
updated-dependencies:
- dependency-name: platformdirs
  dependency-version: 4.9.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-09 10:05:17 +02:00
Soxoj b213f6e079 vBulletin cleanup, Flarum sites, engine stats, UA bump (#2476) 2026-04-09 01:17:24 +02:00
dependabot[bot] 9354331874 build(deps): bump cryptography from 46.0.6 to 46.0.7 (#2475)
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.6 to 46.0.7.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.6...46.0.7)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-09 01:14:44 +02:00
dependabot[bot] 8a82eb6ee6 build(deps): bump chardet from 7.4.0.post2 to 7.4.1 (#2472)
Bumps [chardet](https://github.com/chardet/chardet) from 7.4.0.post2 to 7.4.1.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/7.4.0.post2...7.4.1)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-08 20:28:55 +02:00
github-actions[bot] a61f3b32c4 Updated site list and statistics (#2474)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 14:34:48 +02:00
Copilot fbb8255518 Update HackTheBox and Wikipedia to use new API endpoints (#2470)
* Initial plan

* Update HackTheBox and Wikipedia to use new API endpoints for username checking

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6dc9147c-787f-4f4f-8903-7b9873ac6ac9

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 10:23:07 +02:00
dependabot[bot] 9bad5d8269 build(deps-dev): bump pytest from 9.0.2 to 9.0.3 (#2473)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 9.0.2 to 9.0.3.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/9.0.2...9.0.3)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.3
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-08 09:30:36 +02:00
Olivier Cervello a8e7ab4540 Bump lxml minimum to 6.0.2 for Python 3.14 compatibility (#2279)
* Bump lxml minimum to 6.0.2 for Python 3.14 compatibility

lxml 5.x fails to build on Python 3.14 due to incompatible pointer
types in Cython-generated C code. lxml 6.0.2 compiles correctly.

Fixes #2266

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update poetry.lock to match pyproject.toml changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-08 00:55:21 +02:00
Soxoj 6db1df2ddb Fix failing test for custom DB path resolution (#2468)
* Fix `--db` bug

* Fix test_resolve_db_path_custom_file to create the file before testing

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/3ea7b2e8-0565-4fca-8ec2-eff8eb4ee617

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-08 00:53:57 +02:00
Copilot 23adc178ea Fix crash on -a --self-check by adding exception handling to site check coroutines (#2466)
* Initial plan

* Fix crash on -a --self-check by adding exception handling in site_self_check and self_check

Wrap the body of site_self_check in try/except to catch unexpected errors
and always return a valid changes dict. Also add a safety-net try/except
in self_check around awaiting individual site check futures so that a
single site failure doesn't crash the entire self-check process.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Restore @pytest.mark.slow on test_maigret_results

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Document --self-check error resilience, --auto-disable, and --diagnose in docs/

Update command-line-options.rst with expanded --self-check description
and new --auto-disable and --diagnose entries. Add a "Database self-check"
section to features.rst explaining error-resilient behaviour and usage
examples. Update usage-examples.rst to reference --auto-disable.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/af1f0f09-9112-4902-8475-e81d235ff3ed

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-07 19:44:09 +02:00
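The two layers of protection described in #2466 (a try/except inside each site check, plus a safety net around awaiting each task) might look roughly like the sketch below. The function names echo the commit message, but the signatures, the shape of the `changes` dict, and the simulated failure are all assumptions made for illustration.

```python
import asyncio
from typing import Dict, List


async def site_self_check(site_name: str) -> dict:
    """Check one site and always return a valid changes dict (assumed shape)."""
    changes = {"disabled": False, "issues": []}
    try:
        # Placeholder for the real probing logic; "broken" simulates a crash.
        if site_name == "broken":
            raise RuntimeError("unexpected response")
    except Exception as exc:
        # First layer: record the error instead of letting it propagate.
        changes["issues"].append(f"Unexpected error: {exc}")
    return changes


async def self_check(site_names: List[str]) -> Dict[str, dict]:
    """Run all site checks concurrently; one failure must not stop the rest."""
    tasks = {name: asyncio.create_task(site_self_check(name)) for name in site_names}
    results: Dict[str, dict] = {}
    for name, task in tasks.items():
        try:
            results[name] = await task
        except Exception as exc:
            # Second layer: safety net in case a check raises anyway.
            results[name] = {"disabled": False, "issues": [f"Check crashed: {exc}"]}
    return results


results = asyncio.run(self_check(["ok", "broken"]))
print(results)
```

The inner handler keeps the return type stable, and the outer one covers errors that slip past it, so a single site cannot abort the whole run.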
Soxoj 6834483360 Fix Spotify, add Spotify Community forum (#2467) 2026-04-07 18:25:13 +02:00
Copilot 6ed8fdefcc Add installation troubleshooting for missing system dependencies (#2465)
* Initial plan

* Add installation troubleshooting section for system dependency errors

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6c3a5612-bdd5-4611-ba77-aea7ab52e304

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Simplify README troubleshooting to a link to the full docs

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/6c557093-0643-4980-93ad-973e2d3141ef

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-04-07 17:54:02 +02:00
Soxoj 3fd34afb77 Sites fixes (#2464) 2026-04-06 21:41:16 +02:00
Soxoj ad95302745 Add Markdown reports for LLM analysis (#2463) 2026-04-06 18:26:43 +02:00
dependabot[bot] 44a6c729e3 build(deps): bump curl-cffi from 0.14.0 to 0.15.0 (#2462)
Bumps [curl-cffi](https://github.com/lexiforest/curl_cffi) from 0.14.0 to 0.15.0.
- [Release notes](https://github.com/lexiforest/curl_cffi/releases)
- [Changelog](https://github.com/lexiforest/curl_cffi/blob/main/docs/changelog.rst)
- [Commits](https://github.com/lexiforest/curl_cffi/compare/v0.14.0...v0.15.0)

---
updated-dependencies:
- dependency-name: curl-cffi
  dependency-version: 0.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 15:04:48 +02:00
Soxoj 6d0a22b738 False positive fixes (#2460)
* Fix false positives: APClips, Taplink, gentoo, Discord.bio, ChaturBate; disable 7Cups, playtime, openriskmanual, reactos; update tags

* Fix db_meta.json regeneration in update_site_data.py (inline instead of module import)

* Fix false positives: disable Bit.ly, Firearmstalk, Needrom, Travelblog; fix gentoo, Discord.bio, brickimedia via API; remove dead sites dreamhost, typepad
2026-04-04 19:08:51 +02:00
Soxoj abce3c9be4 Fix false positives (#2459)
* Fix false positives: APClips, Taplink, gentoo, Discord.bio, ChaturBate; disable 7Cups, playtime, openriskmanual, reactos; update tags

* Fix db_meta.json regeneration in update_site_data.py (inline instead of module import)
2026-04-04 18:22:21 +02:00
Soxoj 269d50eedc DB update mechanism (#2458)
* Database update mechanism
2026-04-04 18:00:50 +02:00
Soxoj e8f4318e5d Added Crypto/Web3 site checks (#2457) 2026-04-04 16:49:12 +02:00
Soxoj 75289c78bf Update of MIT License (#2455) 2026-04-03 18:02:54 +02:00
Julio César Suástegui eeb38ccdc0 fix(data): update InterPals absence string to match current site response (#2442)
The previous absence string 'The requested user does not exist or is inactive'
no longer matches the live site response. InterPals now returns 'User not found'
for non-existent profiles, causing false positives for all username searches.

Tested against interpals.net/noneownsthisusername (non-existent) and
interpals.net/blue (claimed) to confirm detection accuracy.

Closes #2433

Co-authored-by: Julio César Suástegui <juliosuas@users.noreply.github.com>
2026-04-03 13:43:33 +02:00
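A stale absence string produces false positives in the way #2442 describes, as the following minimal sketch shows. It assumes presence is decided by searching the response body for absence markers. The two marker strings come from the commit message, while `profile_exists` and the sample page are hypothetical.

```python
from typing import Iterable

# Marker strings quoted in the commit message.
OLD_ABSENCE = "The requested user does not exist or is inactive"
NEW_ABSENCE = "User not found"


def profile_exists(page_text: str, absence_markers: Iterable[str]) -> bool:
    """Treat a profile as present unless an absence marker is in the page."""
    return not any(marker in page_text for marker in absence_markers)


# Hypothetical response body for a username that does not exist.
missing_page = "<html><body>User not found</body></html>"

stale_result = profile_exists(missing_page, [OLD_ABSENCE])  # false positive
fixed_result = profile_exists(missing_page, [NEW_ABSENCE])  # correctly absent

print(stale_result, fixed_result)
```

With the outdated marker, no absence string ever matches, so every username looks claimed.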
Soxoj d136014576 Multiple lint and types fixes (#2454) 2026-04-02 21:01:49 +02:00
Soxoj 5d502eaef6 Add site protection tracking system, fix broken site checks (Instagra… (#2452)
* Add site protection tracking system, fix broken site checks (Instagram, StackOverflow, LeetCode, Boosty, LiveLib), preserve unicode in data.json

* Update poetry.lock by running poetry lock

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/14333f41-67d5-4e28-a782-9730b31fc667

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-02 20:28:20 +02:00
dependabot[bot] 9e8a701c54 build(deps): bump aiohttp from 3.13.4 to 3.13.5 (#2448)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.13.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-02 08:11:42 +02:00
dependabot[bot] 7b67c61240 build(deps-dev): bump mypy from 1.19.1 to 1.20.0 (#2447)
Bumps [mypy](https://github.com/python/mypy) from 1.19.1 to 1.20.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.19.1...v1.20.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.20.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-01 20:12:29 +02:00
dependabot[bot] 0e113c4592 build(deps): bump requests from 2.33.0 to 2.33.1 (#2444)
Bumps [requests](https://github.com/psf/requests) from 2.33.0 to 2.33.1.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.33.0...v2.33.1)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 16:49:17 +02:00
dependabot[bot] fb4e17be92 build(deps): bump pygments from 2.18.0 to 2.20.0 (#2440)
Bumps [pygments](https://github.com/pygments/pygments) from 2.18.0 to 2.20.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](https://github.com/pygments/pygments/compare/2.18.0...2.20.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-version: 2.20.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 20:25:10 +02:00
dependabot[bot] adb19e5930 build(deps): bump aiohttp from 3.13.3 to 3.13.4 (#2435)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.13.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 14:27:54 +02:00
dependabot[bot] 116fae3e0f build(deps): bump platformdirs from 4.5.0 to 4.9.4 (#2434)
Bumps [platformdirs](https://github.com/tox-dev/platformdirs) from 4.5.0 to 4.9.4.
- [Release notes](https://github.com/tox-dev/platformdirs/releases)
- [Changelog](https://github.com/tox-dev/platformdirs/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/platformdirs/compare/4.5.0...4.9.4)

---
updated-dependencies:
- dependency-name: platformdirs
  dependency-version: 4.9.4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 11:09:27 +02:00
dependabot[bot] bf495cd57e build(deps): bump chardet from 5.2.0 to 7.4.0.post2 (#2436)
Bumps [chardet](https://github.com/chardet/chardet) from 5.2.0 to 7.4.0.post2.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Changelog](https://github.com/chardet/chardet/blob/main/docs/changelog.rst)
- [Commits](https://github.com/chardet/chardet/compare/5.2.0...7.4.0.post2)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 7.4.0.post2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 11:09:14 +02:00
dependabot[bot] e49aa533df build(deps): bump multidict from 6.7.0 to 6.7.1 (#2396)
Bumps [multidict](https://github.com/aio-libs/multidict) from 6.7.0 to 6.7.1.
- [Release notes](https://github.com/aio-libs/multidict/releases)
- [Changelog](https://github.com/aio-libs/multidict/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/multidict/compare/v6.7.0...v6.7.1)

---
updated-dependencies:
- dependency-name: multidict
  dependency-version: 6.7.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-29 12:20:48 +02:00
Soxoj 5aa7f6429b Overhaul site tags and naming: add social tag to 33 networks, fill mi… (#2430)
* Overhaul site tags and naming: add social tag to 33 networks, fill missing tags for 213 top-1000 sites, clean up false us/in country tags (~374 sites), normalize site names to Title Case, add tag validation tests, document tagging and naming rules
Remove LLM folder: ask @soxoj for the up-to-date version!

* Remove LLM/ from version control

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 19:48:16 +01:00
Soxoj a5d337b765 Tags and site names improvements (#2427)
- Added social tag to social networks (33 sites)
- Fixed wrong tags (8 sites)
- Filled empty tags for 213 sites in top-1000
- Country tag cleanup (~374 sites)
- Site naming normalization (75 sites)
- New tests (3)
- Documentation updates
2026-03-28 15:42:12 +01:00
dependabot[bot] 5aa0c908b0 build(deps): bump cryptography from 46.0.5 to 46.0.6 (#2422)
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.5 to 46.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.5...46.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.6
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-28 10:10:35 +01:00
Soxoj 51b452ad71 Add urlProbes (#2425) 2026-03-28 00:08:02 +01:00
Soxoj fa1a4d1b4a Sites re-check (#2423) 2026-03-27 22:41:55 +01:00
dependabot[bot] 184519b202 build(deps): bump soupsieve from 2.8 to 2.8.3 (#2404)
Bumps [soupsieve](https://github.com/facelessuser/soupsieve) from 2.8 to 2.8.3.
- [Release notes](https://github.com/facelessuser/soupsieve/releases)
- [Commits](https://github.com/facelessuser/soupsieve/compare/2.8...2.8.3)

---
updated-dependencies:
- dependency-name: soupsieve
  dependency-version: 2.8.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 22:41:40 +01:00
dependabot[bot] a203eecbb2 build(deps-dev): bump pytest from 9.0.1 to 9.0.2 (#2381)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 9.0.1 to 9.0.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/9.0.1...9.0.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 22:15:56 +01:00
dependabot[bot] dde1cd5d78 build(deps): bump psutil from 7.1.3 to 7.2.2 (#2406)
Bumps [psutil](https://github.com/giampaolo/psutil) from 7.1.3 to 7.2.2.
- [Changelog](https://github.com/giampaolo/psutil/blob/master/docs/changelog.rst)
- [Commits](https://github.com/giampaolo/psutil/compare/v7.1.3...v7.2.2)

---
updated-dependencies:
- dependency-name: psutil
  dependency-version: 7.2.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 15:58:24 +01:00
dependabot[bot] 547512519b build(deps): bump pyinstaller from 6.16.0 to 6.19.0 (#2405)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.16.0 to 6.19.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.16.0...v6.19.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.19.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-27 09:19:04 +01:00
Soxoj b333a2e2b2 Readme update: commercial use (#2403) 2026-03-26 21:51:53 +01:00
dependabot[bot] 2835ec71c7 build(deps): bump requests from 2.32.5 to 2.33.0 (#2394)
Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.5...v2.33.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 21:10:51 +01:00
github-actions[bot] af67a6a3f3 Updated site list and statistics (#2399)
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
2026-03-26 16:36:23 +01:00
dependabot[bot] 4f737b5260 build(deps-dev): bump pytest-httpserver from 1.1.0 to 1.1.5 (#2397)
Bumps [pytest-httpserver](https://github.com/csernazs/pytest-httpserver) from 1.1.0 to 1.1.5.
- [Release notes](https://github.com/csernazs/pytest-httpserver/releases)
- [Changelog](https://github.com/csernazs/pytest-httpserver/blob/master/CHANGES.rst)
- [Commits](https://github.com/csernazs/pytest-httpserver/compare/1.1.0...1.1.5)

---
updated-dependencies:
- dependency-name: pytest-httpserver
  dependency-version: 1.1.5
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 16:12:53 +01:00
dependabot[bot] 185e09e4ea build(deps): bump pypdf from 6.9.1 to 6.9.2 (#2392)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.9.1 to 6.9.2.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.9.1...6.9.2)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.9.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 23:33:41 +01:00
dependabot[bot] 5865e0f375 build(deps): bump yarl from 1.22.0 to 1.23.0 (#2383)
---
updated-dependencies:
- dependency-name: yarl
  dependency-version: 1.23.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 16:59:46 +01:00
dependabot[bot] 815c8cb2f3 build(deps): bump asgiref from 3.11.0 to 3.11.1 (#2384)
Bumps [asgiref](https://github.com/django/asgiref) from 3.11.0 to 3.11.1.
- [Changelog](https://github.com/django/asgiref/blob/main/CHANGELOG.txt)
- [Commits](https://github.com/django/asgiref/compare/3.11.0...3.11.1)

---
updated-dependencies:
- dependency-name: asgiref
  dependency-version: 3.11.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 15:03:02 +01:00
Soxoj 656fe1df24 Added Max.ru check; --no-progressbar flag fixed (#2386) 2026-03-25 11:48:12 +01:00
dependabot[bot] 1c5dc5f152 build(deps): bump pycountry from 24.6.1 to 26.2.16 (#2382)
Bumps [pycountry](https://github.com/pycountry/pycountry) from 24.6.1 to 26.2.16.
- [Release notes](https://github.com/pycountry/pycountry/releases)
- [Changelog](https://github.com/pycountry/pycountry/blob/main/HISTORY.txt)
- [Commits](https://github.com/pycountry/pycountry/compare/24.6.1...26.2.16)

---
updated-dependencies:
- dependency-name: pycountry
  dependency-version: 26.2.16
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-25 09:54:11 +01:00
Soxoj bc3d9faad9 Fix false-positive site checks reported by Maigret Bot (#2376) 2026-03-24 23:01:11 +01:00
72 changed files with 12802 additions and 8156 deletions
+8 -1
@@ -1,3 +1,10 @@
 #!/bin/sh
 echo 'Activating update_sitesmd hook script...'
-poetry run update_sitesmd
+poetry run update_sitesmd
+echo 'Regenerating db_meta.json...'
+python3 utils/generate_db_meta.py
+git add maigret/resources/db_meta.json
+git add maigret/resources/data.json
+git add sites.md
+48 -10
@@ -2,7 +2,7 @@ name: Build docker image and push to DockerHub
on:
push:
branches: [ main ]
branches: [ main, dev ]
jobs:
docker:
@@ -10,24 +10,62 @@ jobs:
steps:
-
name: Set up QEMU
uses: docker/setup-qemu-action@v1
uses: docker/setup-qemu-action@v3
-
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@v3
-
name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_HUB_USERNAME }}
password: ${{ secrets.DOCKER_HUB_ACCESS_TOKEN }}
-
name: Build and push
id: docker_build
uses: docker/build-push-action@v2
name: Extract metadata (CLI)
id: meta_cli
uses: docker/metadata-action@v5
with:
images: ${{ secrets.DOCKER_HUB_USERNAME }}/maigret
tags: |
type=raw,value=latest,enable={{is_default_branch}}
type=ref,event=branch
type=sha,prefix=
-
name: Extract metadata (Web UI)
id: meta_web
uses: docker/metadata-action@v5
with:
images: ${{ secrets.DOCKER_HUB_USERNAME }}/maigret
tags: |
type=raw,value=web,enable={{is_default_branch}}
type=ref,event=branch,suffix=-web
type=sha,prefix=web-
-
name: Build and push (CLI, default)
id: docker_build_cli
uses: docker/build-push-action@v6
with:
push: true
tags: ${{ secrets.DOCKER_HUB_USERNAME }}/maigret:latest
target: cli
tags: ${{ steps.meta_cli.outputs.tags }}
labels: ${{ steps.meta_cli.outputs.labels }}
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
-
name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
name: Build and push (Web UI)
id: docker_build_web
uses: docker/build-push-action@v6
with:
push: true
target: web
tags: ${{ steps.meta_web.outputs.tags }}
labels: ${{ steps.meta_web.outputs.labels }}
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
-
name: Image digests
run: |
echo "cli: ${{ steps.docker_build_cli.outputs.digest }}"
echo "web: ${{ steps.docker_build_web.outputs.digest }}"
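The two metadata steps above decide which image tags each push produces. The sketch below restates those rules in plain Python to show the expected result for a push to `main` versus `dev`. The helper `expected_tags` is hypothetical, and it assumes that `main` is the default branch and that `type=sha` emits a 7-character short SHA.

```python
# Hypothetical helper mirroring the tag rules in the workflow; not part of the repository.

def expected_tags(branch: str, sha: str, default_branch: str = "main") -> dict[str, list[str]]:
    """Return the tag names each image would receive for a branch push."""
    short_sha = sha[:7]  # assumption: type=sha uses a 7-character short SHA
    is_default = branch == default_branch

    # CLI image: latest (default branch only), branch name, bare short SHA.
    cli = ["latest"] if is_default else []
    cli += [branch, short_sha]

    # Web UI image: web (default branch only), <branch>-web, web-<short SHA>.
    web = ["web"] if is_default else []
    web += [f"{branch}-web", f"web-{short_sha}"]

    return {"cli": cli, "web": web}


main_tags = expected_tags("main", "0123456789abcdef")
dev_tags = expected_tags("dev", "0123456789abcdef")
```

Under these assumptions a push to `dev` never moves `latest` or `web`, so only default-branch builds update the floating tags.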
+35 -28
@@ -2,41 +2,48 @@ name: Linting and testing
on:
push:
branches: [ main ]
branches: [main]
pull_request:
branches: [ main ]
branches: [main]
types: [opened, synchronize, reopened]
jobs:
build:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13"]
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install system dependencies
run: |
sudo apt-get update && sudo apt-get install -y libcairo2-dev
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install poetry
python -m poetry install --with dev
- name: Test with Coverage and Pytest (Fail if coverage is low)
run: |
poetry run coverage run --source=./maigret -m pytest --reruns 3 --reruns-delay 5 tests
poetry run coverage report --fail-under=60
poetry run coverage html
- name: Upload coverage report
uses: actions/upload-artifact@v4
with:
name: htmlcov-${{ strategy.job-index }}
path: htmlcov
- name: Checkout
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install system dependencies
run: |
sudo apt-get update
sudo apt-get install -y libcairo2-dev
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install poetry
python -m poetry install --with dev
- name: Test with Coverage and Pytest (fail if coverage is low)
run: |
poetry run coverage run --source=./maigret -m pytest --reruns 3 --reruns-delay 5 tests
poetry run coverage report --fail-under=60
poetry run coverage html
- name: Upload coverage report
uses: actions/upload-artifact@v4
with:
name: htmlcov-${{ strategy.job-index }}
path: htmlcov
+23 -14
@@ -1,21 +1,30 @@
name: Upload Python Package to PyPI when a Release is Created
name: Upload Python Package to PyPI when a Release is Published
on:
release:
types: [created]
push:
tags:
- "v*"
permissions:
id-token: write
contents: read
types: [published]
jobs:
build-and-publish:
pypi-publish:
name: Publish release to PyPI
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/maigret
permissions:
id-token: write
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- run: uv build
- name: Publish to PyPI (Trusted Publishing)
uses: pypa/gh-action-pypi-publish@release/v1
- name: Set up Python
uses: actions/setup-python@v4
with:
packages-dir: dist
python-version: "3.x"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
- name: Build package
run: |
python -m build
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
+3
@@ -27,6 +27,9 @@ jobs:
 pip3 install .
 python3 ./utils/update_site_data.py --empty-only
+- name: Regenerate db_meta.json
+run: python3 utils/generate_db_meta.py
 - name: Remove ambiguous main tag
 run: git tag -d main || true
+3 -1
@@ -42,4 +42,6 @@ settings.json
 # other
 *.egg-info
-build
+build
+LLM
+lib
+191
@@ -1,5 +1,196 @@
# Changelog
## [0.6.0] - 2026-04-10
## What's Changed
* Updated workflows: added 3.13 to test, updated pypi upload by @soxoj in https://github.com/soxoj/maigret/pull/2111
* Bump pypdf from 5.1.0 to 6.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2122
* Bump coverage from 7.9.2 to 7.10.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2117
* Bump soupsieve from 2.6 to 2.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2118
* Bump mock from 5.1.0 to 5.2.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2116
* Bump pytest-asyncio from 1.0.0 to 1.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2114
* Bump pytest-cov from 6.0.0 to 6.2.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2115
* Bump xhtml2pdf from 0.2.16 to 0.2.17 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2149
* Bump requests from 2.32.4 to 2.32.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2165
* Bump lxml from 5.3.0 to 6.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2146
* Bump aiodns from 3.2.0 to 3.5.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2148
* Bump alive-progress from 3.2.0 to 3.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2145
* Bump certifi from 2025.6.15 to 2025.8.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2147
* Disabled some sites giving false positive results by @soxoj in https://github.com/soxoj/maigret/pull/2170
* Bump flask from 3.1.1 to 3.1.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2175
* Bump pyinstaller from 6.11.1 to 6.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2174
* Bump mypy from 1.14.1 to 1.17.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2173
* Bump pytest from 8.3.4 to 8.4.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2172
* Bump flake8 from 7.1.1 to 7.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2171
* Bump aiohttp from 3.12.14 to 3.12.15 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2181
* Bump coverage from 7.10.3 to 7.10.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2180
* Bump psutil from 6.1.1 to 7.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2179
* Bump lxml from 6.0.0 to 6.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2178
* Bump multidict from 6.6.3 to 6.6.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2177
* Bump soupsieve from 2.7 to 2.8 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2185
* Bump typing-extensions from 4.14.1 to 4.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2182
* Bump python-bidi from 0.6.3 to 0.6.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2183
* Bump platformdirs from 4.3.8 to 4.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2184
* Make web interface accessible for Docker deployment by default by @soxoj in https://github.com/soxoj/maigret/pull/2189
* Bump coverage from 7.10.5 to 7.10.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2192
* Bump pytest-rerunfailures from 15.1 to 16.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2191
* Bump pytest-rerunfailures from 15.1 to 16.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2193
* Bump pytest from 8.4.1 to 8.4.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2194
* Bump pytest-cov from 6.2.1 to 6.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2195
* Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2196
* Bump mypy from 1.17.1 to 1.18.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2197
* Bump black from 25.1.0 to 25.9.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2203
* Bump mypy from 1.18.1 to 1.18.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2202
* Bump pytest-asyncio from 1.1.0 to 1.2.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2200
* Bump pyinstaller from 6.15.0 to 6.16.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2199
* Bump reportlab from 4.4.3 to 4.4.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2206
* Bump coverage from 7.10.6 to 7.10.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2207
* Bump psutil from 7.0.0 to 7.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2201
* Bump asgiref from 3.9.1 to 3.9.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2204
* Bump lxml from 6.0.1 to 6.0.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2208
* Bump platformdirs from 4.4.0 to 4.5.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2223
* Bump asgiref from 3.9.2 to 3.10.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2220
* Bump yarl from 1.20.1 to 1.22.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2221
* Bump markupsafe from 3.0.2 to 3.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2209
* Bump multidict from 6.6.4 to 6.7.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2224
* Bump idna from 3.10 to 3.11 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2227
* Bump aiohttp from 3.12.15 to 3.13.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2225
* Bump coverage from 7.10.7 to 7.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2230
* Bump certifi from 2025.8.3 to 2025.10.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2228
* Bump pytest-rerunfailures from 16.0.1 to 16.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2229
* Bump attrs from 25.3.0 to 25.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2226
* Bump aiohttp from 3.13.0 to 3.13.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2237
* Bump pypdf from 6.0.0 to 6.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2233
* Bump black from 25.9.0 to 25.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2239
* Bump python-bidi from 0.6.6 to 0.6.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2234
* Bump psutil from 7.1.0 to 7.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2240
* Bump coverage from 7.11.0 to 7.12.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2241
* Bump werkzeug from 3.1.3 to 3.1.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2248
* Bump pypdf from 6.1.3 to 6.4.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2245
* Bump asgiref from 3.10.0 to 3.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2243
* Bump pytest-asyncio from 1.2.0 to 1.3.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2242
* Bump aiohttp from 3.13.2 to 3.13.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2261
* Bump pytest from 8.4.2 to 9.0.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2244
* Bump mypy from 1.18.2 to 1.19.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2250
* ♻️ Refactor: Hardcoded relative path for database file by @tang-vu in https://github.com/soxoj/maigret/pull/2285
* ✨ Quality: Missing tests for settings cascade and override logic by @tang-vu in https://github.com/soxoj/maigret/pull/2287
* ✨ Quality: Unexpanded tilde in file path by @tang-vu in https://github.com/soxoj/maigret/pull/2283
* Bump urllib3 from 2.5.0 to 2.6.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2262
* Bump pillow from 11.0.0 to 12.1.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2271
* Bump black from 25.11.0 to 26.3.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2280
* Bump cryptography from 44.0.1 to 46.0.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2270
* Bump pypdf from 6.4.0 to 6.9.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2281
* Dockerfile fix by @soxoj in https://github.com/soxoj/maigret/pull/2290
* Fixed false positives in top-500 by @soxoj in https://github.com/soxoj/maigret/pull/2292
* Update Telegram bot link in README by @soxoj in https://github.com/soxoj/maigret/pull/2293
* Pyinstaller GitHub workflow fix by @soxoj in https://github.com/soxoj/maigret/pull/2298
* Twitter fixed, mirrors mechanism improvement by @soxoj in https://github.com/soxoj/maigret/pull/2299
* build(deps): bump flask from 3.1.2 to 3.1.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2289
* Bump reportlab from 4.4.4 to 4.4.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2251
* build(deps): bump werkzeug from 3.1.4 to 3.1.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2288
* Bump certifi from 2025.10.5 to 2025.11.12 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2249
* Update Telegram bot link in README by @soxoj in https://github.com/soxoj/maigret/pull/2300
* Improve site-check quality by @soxoj in https://github.com/soxoj/maigret/pull/2301
* feat(sites): fix false positives: disable 74 broken sites, fix 8 with… by @soxoj in https://github.com/soxoj/maigret/pull/2302
* Update sites list workflow by @soxoj in https://github.com/soxoj/maigret/pull/2303
* Bump svglib from 1.5.1 to 1.6.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2205
* feat(workflow): fix update site data workflow dependency by @soxoj in https://github.com/soxoj/maigret/pull/2306
* Re-enable taplink.cc with browser User-Agent to bypass Cloudflare by @Copilot in https://github.com/soxoj/maigret/pull/2308
* feat(workflow): fix update site data workflow err by @soxoj in https://github.com/soxoj/maigret/pull/2312
* Update site data workflow fix: remove ambiguous main tag by @soxoj in https://github.com/soxoj/maigret/pull/2313
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2314
* Fix Love.Mail.ru: update to numeric-only identifiers and new profile URL by @Copilot in https://github.com/soxoj/maigret/pull/2307
* Remove dead site xxxforum.org by @Copilot in https://github.com/soxoj/maigret/pull/2310
* Disable forums.developer.nvidia.com (auth-gated user profiles) by @Copilot in https://github.com/soxoj/maigret/pull/2305
* Pin requests-toolbelt>=1.0.0 to fix urllib3 v2 incompatibility by @Copilot in https://github.com/soxoj/maigret/pull/2316
* build(deps): bump reportlab from 4.4.5 to 4.4.10 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2323
* build(deps-dev): bump coverage from 7.12.0 to 7.13.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2321
* build(deps-dev): bump pytest-cov from 7.0.0 to 7.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2320
* build(deps): bump aiohttp-socks from 0.10.1 to 0.11.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2319
* Disable false-positive site probe: amateurvoyeurforum.com by @Copilot in https://github.com/soxoj/maigret/pull/2332
* Disable forums.stevehoffman.tv due to false positives by @Copilot in https://github.com/soxoj/maigret/pull/2331
* [WIP] Fix false-positive probe for vegalab site by @Copilot in https://github.com/soxoj/maigret/pull/2336
* Fix RoyalCams site check using BongaCams white-label pattern by @Copilot in https://github.com/soxoj/maigret/pull/2334
* Fix Setlist site check: switch to message checkType with proper markers by @Copilot in https://github.com/soxoj/maigret/pull/2333
* [WIP] Fix invalid link on forums.imore.com by @Copilot in https://github.com/soxoj/maigret/pull/2337
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2315
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2339
* Fix false-positive site probe: Re-enable Taplink with message checkType by @Copilot in https://github.com/soxoj/maigret/pull/2326
* build(deps): bump aiodns from 3.5.0 to 4.0.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2345
* build(deps-dev): bump mypy from 1.19.0 to 1.19.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2347
* Disable Librusec site check (false positive) by @Copilot in https://github.com/soxoj/maigret/pull/2349
* Disable MirTesen site check (false positive) by @Copilot in https://github.com/soxoj/maigret/pull/2350
* build(deps): bump attrs from 25.4.0 to 26.1.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2344
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2341
* feat: add cybersecurity platforms + re-enable Root-Me by @juliosuas in https://github.com/soxoj/maigret/pull/2318
* Fix club.cnews.ru false positive: switch from status_code to message checkType by @Copilot in https://github.com/soxoj/maigret/pull/2342
* Fix SoundCloud false-positive: switch to message-based check by @Copilot in https://github.com/soxoj/maigret/pull/2355
* build(deps): bump certifi from 2025.11.12 to 2026.2.25 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2346
* feat: add tag blacklisting via `--exclude-tags` by @Copilot in https://github.com/soxoj/maigret/pull/2352
* Fix domain substring matching and NoneType crash in submit dialog by @Copilot in https://github.com/soxoj/maigret/pull/2367
* feat(core): add POST request support, new sites, migrate to Majestic Million ranking by @soxoj in https://github.com/soxoj/maigret/pull/2317
* Fix update-site-data workflow race condition on branch push by @Copilot in https://github.com/soxoj/maigret/pull/2366
* Fix false-positive site checks reported by Maigret Bot by @soxoj in https://github.com/soxoj/maigret/pull/2376
* build(deps): bump pycountry from 24.6.1 to 26.2.16 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2382
* Added Max.ru check; --no-progressbar flag fixed by @soxoj in https://github.com/soxoj/maigret/pull/2386
* build(deps): bump asgiref from 3.11.0 to 3.11.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2384
* build(deps): bump yarl from 1.22.0 to 1.23.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2383
* build(deps): bump pypdf from 6.9.1 to 6.9.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2392
* build(deps-dev): bump pytest-httpserver from 1.1.0 to 1.1.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2397
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2399
* build(deps): bump requests from 2.32.5 to 2.33.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2394
* Readme update: commercial use by @soxoj in https://github.com/soxoj/maigret/pull/2403
* build(deps): bump pyinstaller from 6.16.0 to 6.19.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2405
* build(deps): bump psutil from 7.1.3 to 7.2.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2406
* build(deps-dev): bump pytest from 9.0.1 to 9.0.2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2381
* build(deps): bump soupsieve from 2.8 to 2.8.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2404
* Sites re-check by @soxoj in https://github.com/soxoj/maigret/pull/2423
* Add urlProbes by @soxoj in https://github.com/soxoj/maigret/pull/2425
* build(deps): bump cryptography from 46.0.5 to 46.0.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2422
* Tags and site names improvements by @soxoj in https://github.com/soxoj/maigret/pull/2427
* Overhaul site tags and naming: add social tag to 33 networks, fill mi… by @soxoj in https://github.com/soxoj/maigret/pull/2430
* build(deps): bump multidict from 6.7.0 to 6.7.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2396
* build(deps): bump chardet from 5.2.0 to 7.4.0.post2 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2436
* build(deps): bump platformdirs from 4.5.0 to 4.9.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2434
* build(deps): bump aiohttp from 3.13.3 to 3.13.4 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2435
* build(deps): bump pygments from 2.18.0 to 2.20.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2440
* build(deps): bump requests from 2.33.0 to 2.33.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2444
* build(deps-dev): bump mypy from 1.19.1 to 1.20.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2447
* build(deps): bump aiohttp from 3.13.4 to 3.13.5 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2448
* Add site protection tracking system, fix broken site checks (Instagra… by @soxoj in https://github.com/soxoj/maigret/pull/2452
* Multiple lint and types fixes by @soxoj in https://github.com/soxoj/maigret/pull/2454
* fix(data): update InterPals absence string to match current site response by @juliosuas in https://github.com/soxoj/maigret/pull/2442
* Update of MIT License by @soxoj in https://github.com/soxoj/maigret/pull/2455
* Added Crypto/Web3 site checks by @soxoj in https://github.com/soxoj/maigret/pull/2457
* DB update mechanism by @soxoj in https://github.com/soxoj/maigret/pull/2458
* Fix false positives by @soxoj in https://github.com/soxoj/maigret/pull/2459
* False positive fixes by @soxoj in https://github.com/soxoj/maigret/pull/2460
* build(deps): bump curl-cffi from 0.14.0 to 0.15.0 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2462
* Add Markdown reports for LLM analysis by @soxoj in https://github.com/soxoj/maigret/pull/2463
* Sites fixes by @soxoj in https://github.com/soxoj/maigret/pull/2464
* Add installation troubleshooting for missing system dependencies by @Copilot in https://github.com/soxoj/maigret/pull/2465
* Fix Spotify, add Spotify Community forum by @soxoj in https://github.com/soxoj/maigret/pull/2467
* Fix crash on `-a --self-check` by adding exception handling to site check coroutines by @Copilot in https://github.com/soxoj/maigret/pull/2466
* Fix failing test for custom DB path resolution by @soxoj in https://github.com/soxoj/maigret/pull/2468
* Bump lxml minimum to 6.0.2 for Python 3.14 compatibility by @ocervell in https://github.com/soxoj/maigret/pull/2279
* build(deps-dev): bump pytest from 9.0.2 to 9.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2473
* Update HackTheBox and Wikipedia to use new API endpoints by @Copilot in https://github.com/soxoj/maigret/pull/2470
* Automated Sites List Update by @github-actions[bot] in https://github.com/soxoj/maigret/pull/2474
* build(deps): bump chardet from 7.4.0.post2 to 7.4.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2472
* build(deps): bump cryptography from 46.0.6 to 46.0.7 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2475
* vBulletin cleanup, Flarum sites, engine stats, UA bump by @soxoj in https://github.com/soxoj/maigret/pull/2476
* build(deps): bump platformdirs from 4.9.4 to 4.9.6 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2477
* Re-enable 69 stale-disabled sites validated via self-check by @soxoj in https://github.com/soxoj/maigret/pull/2478
* Fix false positives by @soxoj in https://github.com/soxoj/maigret/pull/2499
* build(deps): bump socid-extractor from 0.0.27 to 0.0.28 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2502
* build(deps): bump lxml from 6.0.2 to 6.0.3 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/2501
* Disable Kinja.com site check by @Copilot in https://github.com/soxoj/maigret/pull/2503
* Added 3 sites, fixed 6, disabled 8 by @soxoj in https://github.com/soxoj/maigret/pull/2505
* Bump to 0.6.0 by @soxoj in https://github.com/soxoj/maigret/pull/2506
* Update workflow to trigger on published releases by @soxoj in https://github.com/soxoj/maigret/pull/2508
**Full Changelog**: https://github.com/soxoj/maigret/compare/v0.5.0...v0.6.0
## [0.5.0] - 2025-08-10
* Site Supression by @C3n7ral051nt4g3ncy in https://github.com/soxoj/maigret/pull/627
* Bump yarl from 1.7.2 to 1.8.1 by @dependabot[bot] in https://github.com/soxoj/maigret/pull/626
+159 -27
@@ -1,53 +1,185 @@
# How to contribute
Hey! I'm really glad you're reading this. Maigret contains a lot of sites, and it is very hard to keep all the sites operational. That's why any fix is important.
## Code of Conduct
Please read and follow the [Code of Conduct](CODE_OF_CONDUCT.md) to foster a welcoming and inclusive community.
## How to add a new site
## Local setup
#### Beginner level
Install Maigret with development dependencies via [Poetry](https://python-poetry.org/):
You can use Maigret **submit mode** (`maigret --submit URL`) to add a new site or update an existing site. In this mode Maigret automatically analyzes the given account URL or site main page URL to determine the site engine and the methods to check account presence. After checking, Maigret asks if you want to add the site; answering y/Y will rewrite the local database.
```bash
git clone https://github.com/soxoj/maigret && cd maigret
poetry install --with dev
```
#### Advanced level
Activate the repo's git hooks **once after cloning**:
You can edit [the database JSON file](https://github.com/soxoj/maigret/blob/main/maigret/resources/data.json) (`./maigret/resources/data.json`) manually.
```bash
git config --local core.hooksPath .githooks/
```
The pre-commit hook does two things every time you commit changes that touch the site database:
- regenerates the database signature `maigret/resources/db_meta.json` (used to detect compatible auto-updates), and
- regenerates `sites.md` (the human-readable list of supported sites with per-engine statistics).
It also auto-stages the regenerated files so they land in the same commit as your edits. **Always run `git commit` from inside the repo so the hook can fire** — without it, your PR will land with a stale signature and a stale `sites.md`, and database auto-update will misbehave for users on your branch.
## How to contribute
There are two main ways to help.
### 1. Add a new site
**Beginner.** Use the `--submit` mode — Maigret takes a single existing-account URL, auto-detects the site engine, picks `presenseStrs` / `absenceStrs`, and offers to add the entry:
```bash
maigret --submit https://example.com/users/alice
```
`--submit` works well when the site has clean status codes and no anti-bot protection. It will *not* discover a public JSON API (`urlProbe`), classify protection (`tls_fingerprint`, `cf_js_challenge`, `ip_reputation`, ...), or recognise SPA / soft-404 pages. For those, fall back to manual editing.
**Advanced.** Edit `maigret/resources/data.json` by hand — see *Editing `data.json` safely* below. There is also an `add-a-site` issue template if you want a maintainer to do it for you.
### 2. Fix existing sites
The most useful work in this project is keeping checks accurate over time. Sites change layout, switch engines, add Cloudflare, redirect to login walls — every fix is welcome.
**Where to start.** Good candidates:
- Issues with the `false-positive` label, especially those opened automatically by the Telegram bot.
- Sites currently `disabled: true` in `data.json` — many were disabled on a transient symptom and have since healed.
- Sites for which `--self-check --diagnose` reports a problem.
- A focused audit of one engine (vBulletin, XenForo, phpBB, Discourse, Flarum, ...). Engine-wide breakage usually has a single root cause and several sites can be fixed in one PR.
**Diagnose with built-in tools.**
> By default, Maigret skips entries with `disabled: true` in every mode (`--self-check`, `--site`, plain search). Whenever your target is a disabled site — diagnosing it, validating a fix, running the two-filter check below — pass **`--use-disabled-sites`** explicitly. Without the flag, the site is silently dropped from the run and you get an empty result that looks like "everything's fine".
- Per-site diagnosis with recommendations:
```bash
maigret --self-check --site "SiteName" --diagnose
# add --use-disabled-sites if the entry is currently disabled
```
Without `--auto-disable`, this only reports — it never edits the database. Add `--auto-disable` only when you really want to write the result back.
- Single-site comparison of claimed vs unclaimed responses (status, markers, headers):
```bash
python utils/site_check.py --site "SiteName" --diagnose
python utils/site_check.py --site "SiteName" --compare-methods # raw aiohttp vs Maigret's checker
```
- Mass check of top-N sites:
```bash
python utils/check_top_n.py --top 100 --only-broken
```
### Understanding `checkType`
Each site entry uses one of three `checkType` modes to decide whether a profile exists. Picking the right one for your site is the most important data-modeling decision in `data.json`:
- **`message`** (most common, most flexible) — Maigret fetches the page and inspects the HTML body. The profile is reported as found when the body contains at least one substring from `presenseStrs` **and** none of the substrings from `absenceStrs`. Pick narrow, profile-specific markers: a `<title>` fragment unique to profile pages, a CSS class only rendered on profiles (e.g. `"profile-card"`), or a JSON field name from an embedded data blob (`"displayName":`). Avoid generic words (`name`, `email`) and HTML/ARIA boilerplate (`polite`, `alert`, `navigation`, `status`) — they match on every page including error and anti-bot challenge pages, and produce false positives. If the marker contains non-ASCII text, double-check the page is UTF-8 (some legacy sites serve KOI8-R or Windows-1251, in which case byte-level matching silently fails — prefer ASCII markers or a JSON API).
- **`status_code`** — Maigret only looks at the HTTP status code; 2xx means "found", anything else means "not found". Use this only when the site reliably returns proper status codes — typically clean JSON APIs that return HTTP 200 for real users and HTTP 404 for missing ones. Don't use it for sites that return HTTP 200 with a soft "user not found" page (this is the single most common cause of false-positive checks).
- **`response_url`** — Maigret follows the redirect chain and inspects the final URL. Useful when the server reliably redirects missing-user URLs to a different path (e.g. `/login`, `/404`, the homepage) while existing-user URLs stay put. For most sites `message` is a better fit; reach for `response_url` only when a redirect-based signal is genuinely the most stable one.
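As a rough illustration of the `message` and `status_code` rules above, the sketch below restates them in Python. It is not Maigret's actual implementation: the function name `is_profile_found` and its parameters are hypothetical, `response_url` is left out, and the sketch does not model what happens when `presenseStrs` is empty.

```python
from typing import Sequence


def is_profile_found(
    check_type: str,
    status: int,
    body: str,
    presence: Sequence[str],
    absence: Sequence[str],
) -> bool:
    """Simplified restatement of the checkType rules (illustration only)."""
    if check_type == "message":
        # Found: at least one presence marker AND no absence marker in the body.
        has_presence = any(marker in body for marker in presence)
        has_absence = any(marker in body for marker in absence)
        return has_presence and not has_absence
    if check_type == "status_code":
        # Found: any 2xx status; the body is ignored entirely.
        return 200 <= status < 300
    raise ValueError(f"checkType not covered by this sketch: {check_type}")
```

The sketch makes the soft-404 problem easy to see: with `status_code`, a "user not found" page served with HTTP 200 is indistinguishable from a real profile, while `message` can still reject it through `absenceStrs`.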
**`urlProbe` (optional, works with any `checkType`).** If the most reliable signal lives at a different URL than the public profile page — a JSON API, a GraphQL endpoint, a mobile-app route — set `urlProbe` to that URL. Maigret fetches `urlProbe` for the check, but reports continue to show the human-readable `url` so users see a profile link they can click. Examples: GitHub uses `https://github.com/{username}` as `url` and `https://api.github.com/users/{username}` as `urlProbe`; Picsart uses the web profile as `url` and `https://api.picsart.com/users/show/{username}.json` as `urlProbe`. A clean public API is almost always more stable than parsing HTML — it's worth probing for one before settling on `message` against the SPA shell.
**Errors vs absence.** Anything that means "the server can't answer right now" — rate limits, captchas, "Checking your browser", "unusual traffic", maintenance pages — belongs in `errors` (mapping the substring to a human-readable error string), not in `absenceStrs`. The `errors` mechanism produces an UNKNOWN result instead of a false CLAIMED or false AVAILABLE.
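The difference between `errors` and `absenceStrs` can be sketched as a three-way classification. This is a simplified model under stated assumptions, not Maigret's code: the function name `classify_body` is hypothetical, and checking the error markers before the absence and presence markers is an assumption made for the illustration.

```python
from typing import Mapping, Sequence


def classify_body(
    body: str,
    errors: Mapping[str, str],
    presence: Sequence[str],
    absence: Sequence[str],
) -> str:
    """Return UNKNOWN, AVAILABLE or CLAIMED for a response body (illustration only)."""
    # "The server can't answer right now" wins over everything else.
    for marker, description in errors.items():
        if marker in body:
            return f"UNKNOWN: {description}"
    # An explicit "no such user" marker means the username is free.
    if any(marker in body for marker in absence):
        return "AVAILABLE"
    # Otherwise the profile counts as found only if a presence marker matches.
    if any(marker in body for marker in presence):
        return "CLAIMED"
    return "AVAILABLE"
```

If a challenge-page string were placed in `absenceStrs` instead, the same response would be reported as AVAILABLE, which hides the fact that the site never answered the question.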
Full reference for `checkType`, `urlProbe`, `engine`, and the rest of the `data.json` schema is in the [development guide](docs/source/development.rst), section *How to fix false-positives*.
### Editing `data.json` safely
`data.json` is a single ~36 000-line JSON file. **Make surgical, line-level edits only.** Never rewrite it by reading it into a Python dict and dumping it back — `json.load` + `json.dump` reformats every entry and produces an unreviewable 70 000-line diff. The same rule applies to any helper script that touches the file: it must preserve the original formatting of untouched entries.
If your editor reformats JSON on save, disable that for `data.json` before editing.
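For helper scripts, one way to respect this rule is to treat the file as plain text and replace a single, uniquely matching snippet. The sketch below is only an illustration; the helper name `replace_once` is hypothetical and not part of the Maigret repository.

```python
def replace_once(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old`, leaving every other byte untouched."""
    count = text.count(old)
    if count != 1:
        # Refuse ambiguous or missing matches instead of guessing.
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)
```

A script would read `data.json` with `open(path, encoding="utf-8", newline="")`, pass the content through `replace_once`, and write it back the same way, so that line endings and the formatting of untouched entries stay as they were. Because nothing is parsed or re-serialized, the resulting diff contains only the edited line.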
### Two-filter validation when re-enabling a site
Removing `disabled: true` requires **two** independent checks. `--self-check` alone is not sufficient — it only verifies the two specific usernames recorded in the entry, so a site that returns CLAIMED for *any* arbitrary username will still pass the self-check.
```bash
# Filter 1: self-check on the recorded claimed/unclaimed pair
maigret --self-check --site "SiteName" --use-disabled-sites
# Filter 2: live probe with a clearly fake username — nothing should match
maigret noonewouldeverusethis7 --site "SiteName" --use-disabled-sites --print-not-found
```
Both filters need `--use-disabled-sites`, since a candidate for re-enable still has `disabled: true` in the working tree until your edit lands. If you forget the flag, both commands silently no-op.
If the second command reports `[+]` for the fake username, the check is a false positive — do not enable. This step takes seconds and is non-negotiable for any re-enable PR.
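If you re-enable several sites at once, the two commands can be generated per site. The sketch below only builds the argument lists, using the flags shown above; the function name `two_filter_commands` and the random fake-username scheme are assumptions made for this example.

```python
import secrets
from typing import List, Optional


def two_filter_commands(site: str, fake_username: Optional[str] = None) -> List[List[str]]:
    """Build the argv lists for both validation filters (illustration only)."""
    # A random suffix keeps the fake username unlikely to exist anywhere.
    fake = fake_username or "nouser" + secrets.token_hex(6)
    common = ["--site", site, "--use-disabled-sites"]
    return [
        # Filter 1: self-check on the recorded claimed/unclaimed pair.
        ["maigret", "--self-check", *common],
        # Filter 2: live probe with a username that should not exist.
        ["maigret", fake, *common, "--print-not-found"],
    ]
```

Each list can be passed to `subprocess.run(cmd)`. The output still has to be read by a person: a `[+]` line for the fake username in the second run means the check is a false positive.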
## Site naming, tags, and protection
- **Site naming conventions** (Title Case by default, brand-specific exceptions, no `www.` prefix, etc.) are documented in the [development guide](docs/source/development.rst), section *Site naming conventions*.
- **Country tags** (`us`, `ru`, `kr`, ...) attribute an account to a country of origin or residence — they're not a traffic-share label. Global services (GitHub, YouTube, Reddit) get **no** country tag; regional services (VK → `ru`, Naver → `kr`) **must** have one. Don't assign a country tag from Alexa/SimilarWeb audience stats.
- **Category tags** must come from the canonical `"tags"` array at the bottom of `data.json`. The `test_tags_validity` test fails if you introduce an unregistered tag. If no existing tag fits well, either pick the closest reasonable match or add the new tag to the canonical list as an explicit, separate change. Don't use platform names (`writefreely`, `pixelfed`) — use category names (`blog`, `photo`).
- **Protection tags** (`tls_fingerprint`, `ip_reputation`, `cf_js_challenge`, `cf_firewall`, `aws_waf_js_challenge`, `ddos_guard_challenge`, `js_challenge`, `custom_bot_protection`) describe the kind of anti-bot protection a site uses. One of them — **`tls_fingerprint`** — is load-bearing: when a site fingerprints the TLS handshake (JA3/JA4) and blocks non-browser clients, tagging it with `tls_fingerprint` makes Maigret automatically swap its HTTP client to [`curl_cffi`](https://github.com/lexiforest/curl_cffi) with Chrome browser emulation, which is usually enough to pass. The site stays `enabled` — no `disabled: true` is needed. Examples: Instagram, NPM, Codepen, Kickstarter, Letterboxd. The remaining tags are documentation-only and pair with `disabled: true` until a per-provider solver is integrated. The full taxonomy and the rules for picking the right tag are in the [development guide](docs/source/development.rst), section *protection (site protection tracking)*. Don't add a protection tag without empirical evidence it applies in the current environment.
## Testing
There are CI checks for every PR to the Maigret repository, but it is better to run `make format`, `make lint` and `make test` locally first to make sure your changes are correct.
CI runs the same checks on every PR, but please run them locally first:
```bash
make format # auto-format with black
make lint # flake / mypy
make test # pytest with coverage
```
## Submitting changes
To submit your changes you must [send a GitHub PR](https://github.com/soxoj/maigret/pulls) to the Maigret project.
Always write a clear log message for your commits. One-line messages are fine for small changes, but bigger changes should look like this:
Open a [GitHub PR](https://github.com/soxoj/maigret/pulls) against `main`. Always write a clear log message:
$ git commit -m "A brief summary of the commit
>
> A paragraph describing what changed and its impact."
```
$ git commit -m "A brief summary of the commit
>
> A paragraph describing what changed and its impact."
```
One-line messages are fine for small changes; bigger changes should explain the *why* in the body.
## Coding conventions
### General Guidelines
### General
- Try to follow [PEP 8](https://www.python.org/dev/peps/pep-0008/) for Python code style.
- Ensure your code passes all tests before submitting a pull request.
- Follow [PEP 8](https://www.python.org/dev/peps/pep-0008/) for Python.
- Make sure all tests pass before opening the PR.
### Code Style
### Code style
- **Indentation**: Use 4 spaces per indentation level.
- **Imports**:
- Standard library imports should be placed at the top.
- Third-party imports should follow.
- Group imports logically.
- **Indentation**: 4 spaces per level.
- **Imports**: standard library first, third-party next, project-local last; group them logically.
### Naming Conventions
### Naming
- **Variables and Functions**: Use `snake_case`.
- **Classes**: Use `CamelCase`.
- **Constants**: Use `UPPER_CASE`.
Start reading the code and you'll get the hang of it. ;)
- **Variables and functions**: `snake_case`.
- **Classes**: `CamelCase`.
- **Constants**: `UPPER_CASE`.
Start reading the code and you'll get the hang of it.
## Getting help
If you're stuck on something — a check that won't behave, a setup error, an unclear field in `data.json`, or just want to discuss an approach before opening a PR — there are two places to ask:
- [GitHub Discussions](https://github.com/soxoj/maigret/discussions) — searchable, public, good for technical questions and design ideas. Prefer this for anything other contributors might run into too.
- Telegram: [@soxoj](https://t.me/soxoj) — direct channel to the maintainer, good for quick questions and informal chat.
Bug reports and feature requests still belong in [GitHub Issues](https://github.com/soxoj/maigret/issues).
## License
Maigret is MIT-licensed; by submitting a contribution you agree to publish it under the same license. There is no CLA.
+10 -1
@@ -1,4 +1,4 @@
FROM python:3.11-slim
FROM python:3.11-slim AS base
LABEL maintainer="Soxoj <soxoj@protonmail.com>"
WORKDIR /app
RUN pip install --no-cache-dir --upgrade pip
@@ -15,4 +15,13 @@ COPY . .
RUN YARL_NO_EXTENSIONS=1 python3 -m pip install --no-cache-dir .
# For production use, set FLASK_HOST to a specific IP address for security
ENV FLASK_HOST=0.0.0.0
# Web UI variant: auto-launches the web interface on $PORT
FROM base AS web
ENV PORT=5000
EXPOSE 5000
ENTRYPOINT ["sh", "-c", "exec maigret --web \"$PORT\""]
# Default variant (last stage = `docker build .` target): CLI, backwards-compatible
FROM base AS cli
ENTRYPOINT ["maigret"]
+1 -2
@@ -1,7 +1,6 @@
MIT License
Copyright (c) 2019 Sherlock Project
Copyright (c) 2020-2021 Soxoj
Copyright (c) 2020-2026 Soxoj
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
-452
@@ -1,452 +0,0 @@
# Site checks — guide (Maigret)
Working document for future changes: workflow, findings from reviews, and practical steps. See also [`site-checks-playbook.md`](site-checks-playbook.md) (short checklist), [`socid_extractor_improvements.log`](socid_extractor_improvements.log) (proposals for upstream identity extraction), and the code in [`maigret/checking.py`](../maigret/checking.py).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **this file** and [`site-checks-playbook.md`](site-checks-playbook.md) in sync (see the section at the end). If you change rules about the JSON API check or the `socid_extractor` log format, update **[`socid_extractor_improvements.log`](socid_extractor_improvements.log)** (template / header) together with this guide.
---
## 1. How checks work
Logic lives in `process_site_result` ([`maigret/checking.py`](../maigret/checking.py)):
| `checkType` | Meaning |
|-------------|---------|
| `message` | Profile is “found” if the HTML contains **none** of the `absenceStrs` substrings **and** at least one `presenseStrs` marker matches. If `presenseStrs` is **empty**, presence is treated as true for **any** page (risky configuration). |
| `status_code` | HTTP **2xx** is enough — only safe if the server does **not** return 200 for “user not found”. |
| `response_url` | Custom flow with **redirects disabled** so the status/URL of the *first* response can be used. |
For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets **`allow_redirects=True`**: the client follows redirects and `process_site_result` sees the **final** response body and status (not the pre-redirect hop). You do **not** need to “turn on” follow-redirect separately for most sites.
Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).
### `urlProbe`: probe URL vs reported profile URL
- **`url`** — pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
- **`urlProbe`** (optional) — if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHub's `https://github.com/{username}` + `https://api.github.com/users/{username}`).
If `urlProbe` is omitted, the probe URL defaults to `url`.
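The relationship between the two fields can be sketched as follows. This is a simplified model, not the code in `make_site_result`: the function name `build_urls` and the plain-dict site entry are assumptions made for the example.

```python
from typing import Tuple
from urllib.parse import quote


def build_urls(site: dict, username: str) -> Tuple[str, str]:
    """Return (reported profile URL, URL actually requested) for one site entry."""

    def fill(template: str) -> str:
        # The username segment is URL-encoded; the other placeholders are copied as-is.
        return template.format(
            username=quote(username),
            urlMain=site.get("urlMain", ""),
            urlSubpath=site.get("urlSubpath", ""),
        )

    profile_url = fill(site["url"])
    # Without urlProbe, the request goes to the profile URL itself.
    probe_url = fill(site.get("urlProbe") or site["url"])
    return profile_url, probe_url
```

For the GitHub pair mentioned above, the first element is the link shown in reports and the second is the API URL that is actually fetched.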
### Redirects and final URL as a signal
If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
- **Final URL** after redirects (e.g. profile canonical URL vs `/404` path).
- **Redirect chain** length or target host (e.g. lander vs profile).
If that differs reliably, you may be able to use **`checkType`: `response_url`** in [`data.json`](../maigret/resources/data.json) (no auto-follow) or extend logic — but only when the difference is stable.
**Server-side HTTP vs client-side navigation.** Maigret follows **HTTP** redirects only; it does **not** run JavaScript. If the browser shows a navigation to `/u/name/posts` or `/not-found` **after** the SPA bundle loads, that may never appear as an extra hop in `curl`/aiohttp — only a **trailing-slash** `301` might show up. Always confirm with `curl -sIL` / a small script whether the **Location** chain differs for real vs fake users before relying on URL-based rules.
**Empirical check (claimed vs non-existent usernames, `GET` with follow redirects, no JS):**
| Site | Result |
|------|--------|
| **Kaskus** | No HTTP redirects beyond the request path; same generic `<title>` and near-identical body length — **no** discriminating signal from redirects alone. |
| **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow — not a profile-vs-missing distinction. |
| **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |
**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).
For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).
---
## 2. Standard checks: public JSON API and `socid_extractor` log
### 2.1 Public JSON API (always)
When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).
This is a **standard** part of site-check work, not an optional extra.
### 2.2 Mandatory: [`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)
If you discover **either**:
1. **JSON embedded in HTML** with user/profile fields (inline scripts, `__NEXT_DATA__`, `application/ld+json`, hydration blobs, etc.), or
2. A **standalone JSON HTTP response** (public API) with user/profile data for that service,
you **must append** a proposal block to **[`LLM/socid_extractor_improvements.log`](socid_extractor_improvements.log)**.
**Why:** Maigret calls [`socid_extractor.extract`](https://pypi.org/project/socid-extractor/) on the response body ([`extract_ids_data` in `checking.py`](../maigret/checking.py)) to fill `ids_data`. New payloads usually need a **new scheme** upstream (`flags`, `regex`, optional `extract_json`, `fields`, optional `url_mutations` / `transforms`), matching patterns such as **`GitHub API`** or **`Gitlab API`** in `socid_extractor`'s `schemes.py`.
**Each log entry must include:**
- **Date** — ISO `YYYY-MM-DD` (day you add the entry).
- **Example username** — Prefer the site's `usernameClaimed` from `data.json`, or any account that reproduces the payload.
- **Proposal** — Use the **block template** in the log file: detection idea, optional URL mutation, and field mappings in the same style as existing schemes.
If the service is **already covered** by an existing `socid_extractor` scheme, add a **short** entry anyway (date, example username, scheme name, “already implemented”) so there is an audit trail.
Do **not** paste secrets, cookies, or full private JSON; short key names and structure hints are enough.
---
## 3. Improvement workflow
### Phase A — Reproduce
1. Targeted run:
```bash
maigret --db /path/to/maigret/resources/data.json \
TEST_USERNAME \
--site "SiteName" \
--print-not-found --print-errors \
--no-progressbar -vv
```
2. Run separately with a **real** existing username and a **definitely non-existent** one (as `usernameClaimed` / `usernameUnclaimed` in JSON).
3. If needed: `-vvv` and `debug.log` (raw response).
4. Automated pair check:
```bash
maigret --db ... --self-check --site "SiteName" --no-progressbar
```
### Phase B — Classify the cause
| Symptom | Likely cause |
|---------|----------------|
| False “found” with `status_code` | Soft 404 (200 on a “not found” page). |
| False “found” with `message` | Overly broad `presenseStrs` (`name`, `email`, JSON keys) or stale `absenceStrs`. |
| Same HTML for different users | SPA / skeleton shell before hydration — also compare **final URL / redirect chain** (see above); if still identical, often `disabled`. |
| Login page instead of profile | XenForo etc.: guest, `ignore403`, “must be logged in” strings. |
| reCAPTCHA / “Checking your browser” / “not a bot” | Bot protection; Maigret's default User-Agent may worsen the response. |
| Redirect to another domain / lander | Stale URL template. |
### Phase C — Edits in [`data.json`](../maigret/resources/data.json)
1. Update `url` / `urlMain` if needed (HTTPS, new profile path).
2. Replace inappropriate `status_code` with `message` (or `response_url`), choosing:
- **`absenceStrs`** — only what reliably appears on the “user does not exist” page;
- **`presenseStrs`** — narrow markers of a real profile (avoid generic words).
3. For XenForo: override only fields that differ in the site entry; do not break the global `engines` template.
4. Refresh `usernameClaimed` / `usernameUnclaimed` if reference accounts disappeared.
5. Set **`headers`** (e.g. another `User-Agent`) if the site serves a captcha only to “suspicious” clients.
6. Use **`errors`**: HTML substring → meaningful check error (UNKNOWN), so it is not confused with “available”.
### Phase D — Decision criteria
| Outcome | When to use |
|---------|-------------|
| **Check fixed** | The `claimed` / `unclaimed` pair behaves predictably, `--self-check` passes, no regression on a similar site with the same engine. |
| **Check disabled** (`disabled: true`) | Cloudflare / anti-bot / login required / indistinguishable SPA without stable markers. |
| **Entry removed** | **Only** if the domain/service is gone (NXDOMAIN, clearly dead project), not “because it is hard to fix”. |
### Phase E — Before commit
- `maigret --self-check` for affected sites.
- `make test`.
---
## 4. Findings from reviews (concrete site batch)
Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtube, Tom's Guide, Kaggle, Kaskus, Livemaster, TechPowerUp, authorSTREAM, Bibsonomy, Bulbagarden, iXBT, Serebii, Picsart, Hashnode, hi5.
### What most often broke checks
1. **`status_code` where content checks are needed** — soft 404 with status 200.
2. **Broad `presenseStrs`** — matches on error pages or generic SPA shells.
3. **XenForo + guest** — HTML includes strings like “You must be logged in” that overlap the engine template.
4. **User-Agent** — on some sites (e.g. Kaggle) the default UA triggered a reCAPTCHA page instead of profile HTML; a deliberate `User-Agent` in site `headers` helped.
5. **SPAs and redirects** — identical first HTML, redirect to lander / another product (hi5 → Tagged), URL format changes by region (Mercado Livre).
### What worked as a fix
- Switching to **`message`** with narrow strings from **`<title>`** or unique markup where stable (**Kaggle**, **Mercado Livre**, **Hashnode**).
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis Anti-Bot**: setting `headers` with a basic script User-Agent (e.g. `python-requests/2.25.1`) instead of the default browser UA avoided the Anubis proof-of-work challenge (an HTTP 307 redirect), so the site again returned its native HTTP 404 for missing users.
### What required disabling checks
Where you **cannot** reliably tell “profile exists” from “no profile” without bypassing protection, login, or full JS:
- Anti-bot / captcha / “not a bot” page;
- Guest-only access to the needed page;
- SPA with indistinguishable first response;
- Forums returning **403** and a login page instead of a member profile for the member-search URL;
- Stale URLs that redirect to a stub.
In those cases **`disabled: true`** is better than false “found”; remove the DB entry only on **actual** domain death.
### Code notes
- For the `status_code` branch in `process_site_result`, use **strict** comparison `check_type == "status_code"`, not a substring match inside `"status_code"`.
- Treat empty `presenseStrs` with `message` as risky: when debugging, watch the DEBUG-level logs, if that diagnostic exists in the code.
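The substring pitfall from the first note can be shown in a few lines. This is a sketch only; the two helper names are hypothetical and are not taken from `maigret/checking.py`.

```python
def is_status_code_check_loose(check_type):
    # Risky pattern: substring test against the literal "status_code".
    return check_type in "status_code"


def is_status_code_check_strict(check_type):
    # Strict equality, as recommended above.
    return check_type == "status_code"


for value in ("status_code", "status", "code", "", "message"):
    loose = is_status_code_check_loose(value)
    strict = is_status_code_check_strict(value)
    print(repr(value), loose, strict)
```

With the loose test, `"status"`, `"code"`, and even an empty string are all treated as `status_code`, so a missing or mistyped `checkType` can silently take the wrong branch.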
---
## 5. Future ideas (Maigret improvements)
- A mode or script: one site, two usernames, print statuses and first N bytes of the response (wrapper around `maigret()`).
- Document in CLI help that **`--use-disabled-sites`** is needed to analyze disabled entries.
---
## 6. Development utilities
### 6.1 `utils/site_check.py` — Single site diagnostics
A comprehensive utility for testing individual sites with multiple modes:
```bash
# Basic comparison of claimed vs unclaimed (aiohttp)
python utils/site_check.py --site "VK" --check-claimed
# Test via Maigret's checker directly
python utils/site_check.py --site "VK" --maigret
# Compare aiohttp vs Maigret results (find discrepancies)
python utils/site_check.py --site "VK" --compare-methods
# Full diagnosis with recommendations
python utils/site_check.py --site "VK" --diagnose
# Test with custom URL
python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2
# Find a valid username for a site
python utils/site_check.py --site "VK" --find-user
```
**Key features:**
- `--maigret` — Uses Maigret's actual checking code, not raw aiohttp
- `--compare-methods` — Shows if aiohttp and Maigret see different results (useful for debugging)
- `--diagnose` — Validates checkType against actual responses, suggests fixes
- Color output with markers detection (captcha, cloudflare, login, etc.)
- `--json` flag for machine-readable output
**When to use each mode:**
| Mode | Use case |
|------|----------|
| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
| `--maigret` | Verify Maigret's actual behavior matches expectations |
| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
| `--diagnose` | Full analysis when a site is broken, get fix recommendations |
### 6.2 `utils/check_top_n.py` — Mass site checking
Batch-check top N sites by Alexa rank with categorized reporting:
```bash
# Check top 100 sites
python utils/check_top_n.py --top 100
# Faster with more parallelism
python utils/check_top_n.py --top 100 --parallel 10
# Output JSON report
python utils/check_top_n.py --top 100 --output report.json
# Only show broken sites
python utils/check_top_n.py --top 100 --only-broken
```
**Output categories:**
- `working` — Site check passes
- `broken` — Check fails (wrong status, missing markers)
- `timeout` — Request timed out
- `anti_bot` — 403/429 or captcha detected
- `error` — Connection or other errors
- `disabled` — Already disabled in data.json
**Report includes:**
- Summary counts by category
- List of broken sites with issues
- Recommendations for fixes (e.g., "Switch to checkType: status_code")
### 6.3 Self-check behavior (`--self-check`)
The self-check command has been improved to be less aggressive:
```bash
# Check sites WITHOUT auto-disabling (default)
maigret --self-check --site "VK"
# Auto-disable failing sites (old behavior)
maigret --self-check --site "VK" --auto-disable
# Show detailed diagnosis for each failure
maigret --self-check --site "VK" --diagnose
```
**Behavior changes:**
| Flag | Effect |
|------|--------|
| `--self-check` alone | Reports issues but does NOT disable sites |
| `--auto-disable` | Automatically disables sites that fail (opt-in) |
| `--diagnose` | Prints detailed diagnosis with recommendations |
**Why this matters:**
- Old behavior was too aggressive — sites got disabled without explanation
- New behavior reports issues and suggests fixes
- Explicit `--auto-disable` required to modify database
---
## 7. Lessons learned (practical observations)
Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).
### 7.1 JSON API is the first thing to look for
Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:
- **Reddit**: `https://api.reddit.com/user/{username}/about` — returns JSON with user data or `{"message": "Not Found", "error": 404}`.
- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` — returns JSON with `userName` field or HTTP 404.
This confirms the playbook recommendation: always check for `/api/`, `.json`, GraphQL endpoints before giving up on a site.
### 7.2 `urlProbe` is a powerful tool
It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:
```json
{
    "url": "https://www.reddit.com/user/{username}",
    "urlProbe": "https://api.reddit.com/user/{username}/about",
    "checkType": "message",
    "presenseStrs": ["\"name\":"],
    "absenceStrs": ["Not Found"]
}
```
The check hits the API, but reports display `www.reddit.com/user/blue`.
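The split between the probed URL and the displayed URL can be sketched as below. This is an illustration under assumptions, not Maigret's implementation: `resolve_urls` is a hypothetical helper, and falling back to `url` when `urlProbe` is absent is assumed from the description above.

```python
def resolve_urls(site, username):
    """Return (probe_url, display_url) for one data.json-style entry."""
    display_url = site["url"].format(username=username)
    # Assumption: fall back to `url` when no `urlProbe` is configured.
    probe_template = site.get("urlProbe") or site["url"]
    return probe_template.format(username=username), display_url


reddit = {
    "url": "https://www.reddit.com/user/{username}",
    "urlProbe": "https://api.reddit.com/user/{username}/about",
}

probe, shown = resolve_urls(reddit, "blue")
print(probe)  # https://api.reddit.com/user/blue/about
print(shown)  # https://www.reddit.com/user/blue
```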
### 7.3 aiohttp ≠ curl ≠ requests
Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting** — the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.
**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.
```python
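import aiohttp

async def check(url):
    # This returns 403 for Wikipedia even with a browser UA:
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
            print(resp.status)  # 403

# Run with asyncio.run(check(url)), where url is the profile URL under test.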
```
### 7.4 HTTP 403 in Maigret can mean different things
Initially it seemed Wikipedia was returning 403, but `curl` showed 200. Only `debug.log` revealed the real picture — aiohttp was getting blocked at TLS level.
**Lesson:** Use `-vvv` flag and inspect `debug.log` for raw response status and body. The warning message alone may be misleading.
### 7.5 Dead services migrate, not disappear
MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:
1. Keep old entries with `disabled: true` as historical record.
2. Create a new entry for the current service with working API.
This preserves audit trail and avoids breaking existing workflows.
### 7.6 `status_code` is more reliable than `message` for APIs
Microsoft Learn API returns HTTP 404 for non-existent users — a clean signal without HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:
```json
{
    "checkType": "status_code",
    "urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
}
```
No need for fragile string matching when the API speaks HTTP correctly.
### 7.8 Engine templates can silently break across many sites
The **vBulletin** engine template has `absenceStrs` in five languages ("This user has not registered…", "Пользователь не зарегистрирован…", etc.). In a batch review of ~12 vBulletin forums (oneclickchicks, mirf, Pesiq, VKMOnline, forum.zone-game.info, etc.), **none** of the absence strings matched — the forums returned identical pages for both claimed and unclaimed usernames. Root cause: many of these forums require login to view member profiles, so they serve a generic page (no "user not registered" message at all) instead of an informative error.
**Lesson:** When a whole engine class shows false positives, do not patch sites one by one — check whether the **engine template** itself still matches the actual error pages. A template written for one version/language pack may silently stop working after a forum upgrade or config change.
### 7.9 Search-by-author URLs are architecturally unreliable
Several sites (OnanistovNet, Shoppingzone, Pogovorim, Astrogalaxy, Sexwin) used a phpBB-style `search.php?keywords=&terms=all&author={username}` URL as the check endpoint. This searches for **posts** by that author, not for the user account itself. Even if the markers worked, a user who exists but has zero posts would be indistinguishable from a non-existent user. And in practice, the sites changed their response format — some now return HTTP 404, others dropped the expected Russian absence text altogether.
**Lesson:** Avoid author-search URLs as the check endpoint; they test "has posts" rather than "account exists" and are doubly fragile (both logic mismatch and format drift).
### 7.10 Some sites generate a page for any path — permanent false positives
Two distinct patterns:
- **Pbase** creates a stub page titled "pbase Artist {username}" for **every** URL, real or fake. Both return HTTP 200 with nearly identical content (~3.3 KB). No markers can distinguish them.
- **ffm.bio** is even trickier: for the non-existent username `a.slomkoowski` it generated a page titled "mr.a" with description "a is a", apparently fuzzy-matching the path to the closest real entry. Both return HTTP 200 with large, content-rich pages.
**Lesson:** Before writing markers for a site, verify that the "unclaimed" URL actually produces an **error-like** response (different status, different title, unique error text). If the site always returns a plausible-looking page, no combination of `presenseStrs` / `absenceStrs` will help — `disabled: true` is the only safe option.
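A rough pre-check for this lesson might look like the sketch below. It is not part of Maigret or of `utils/site_check.py`; the function names, the `(status, html)` tuple shape, and the 0.95 similarity threshold are arbitrary choices for illustration.

```python
import re
from difflib import SequenceMatcher


def page_title(html):
    """Extract the <title> text, or "" if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def looks_distinguishable(claimed, unclaimed, user_claimed, user_unclaimed, threshold=0.95):
    """claimed/unclaimed are (status, html) tuples fetched for the two usernames."""
    (status_a, html_a), (status_b, html_b) = claimed, unclaimed
    if status_a != status_b:
        return True
    # Mask usernames so an echoed path does not count as a real difference.
    norm_a = html_a.replace(user_claimed, "{u}")
    norm_b = html_b.replace(user_unclaimed, "{u}")
    if page_title(norm_a) != page_title(norm_b):
        return True
    return SequenceMatcher(None, norm_a, norm_b).ratio() < threshold


# Made-up stub pages in the style of the Pbase case.
stub_a = (200, "<title>pbase Artist alice</title><p>stub page</p>")
stub_b = (200, "<title>pbase Artist zzqqx</title><p>stub page</p>")
print(looks_distinguishable(stub_a, stub_b, "alice", "zzqqx"))  # False
```

A `False` result suggests that no marker pair will work. A `True` result is only a necessary condition: a site that fuzzy-matches to another real profile, as ffm.bio did, would still pass this check and needs manual inspection.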
### 7.11 TLS fingerprinting can degrade over time (Kaggle)
Kaggle was previously fixed with a custom `User-Agent` header and `errors` for the "Checking your browser" captcha page. In the latest batch review, aiohttp receives HTTP 404 with identical content for **both** claimed and unclaimed usernames — the site now blocks the entire request before it reaches the profile page. This matches the TLS fingerprinting pattern seen earlier with Wikipedia (section 7.3), but here the degradation happened **after** a working fix was already in place.
**Lesson:** Sites that rely on bot-detection can tighten their rules at any time. A working `User-Agent` override today may fail tomorrow. When a previously fixed site starts returning identical responses for both usernames, suspect TLS fingerprinting first, and accept `disabled: true` if no public API is available.
### 7.12 API endpoints may bypass Cloudflare even when the main site is blocked
All four Fandom wikis returned HTTP 403 with a Cloudflare "Just a moment..." challenge when aiohttp accessed the user profile page (`/wiki/User:{username}`). However, the **MediaWiki API** on the same domain (`/api.php?action=query&list=users&ususers={username}&format=json`) returned clean JSON without any challenge. Similarly, **Substack** served a captcha-laden SPA for `/@{username}`, but its `public_profile` API (`/api/v1/user/{username}/public_profile`) responded with proper JSON and correct HTTP 404 for missing users.
This is likely because API routes are excluded from the Cloudflare WAF rules or use a different pipeline than the HTML-serving paths.
**Lesson:** When a site's main pages are blocked by Cloudflare or similar WAF, still check API endpoints on the **same domain** — they may not go through the same protection layer. This is especially true for:
- MediaWiki's `api.php` on wiki farms (Fandom, Wikia, self-hosted MediaWiki)
- REST API paths (`/api/v1/`, `/api/v2/`) on SPA-heavy sites
- Internal data endpoints that the SPA itself calls
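For the MediaWiki case, the JSON can be interpreted along these lines. This is a sketch assuming the classic `format=json` response shape, where an unknown user entry carries a `missing` key and a malformed name carries `invalid`; the sample payloads are illustrative, not captured responses, and `mediawiki_user_exists` is a hypothetical helper.

```python
import json


def mediawiki_user_exists(api_json_text):
    """Interpret a `list=users` response: True, False, or None if unclear."""
    data = json.loads(api_json_text)
    users = data.get("query", {}).get("users", [])
    if not users:
        return None
    entry = users[0]
    # Assumption: unknown or malformed names carry "missing" / "invalid" keys.
    if "missing" in entry or "invalid" in entry:
        return False
    return "userid" in entry


# Illustrative payloads only.
found = '{"batchcomplete":"","query":{"users":[{"userid":123,"name":"Example"}]}}'
absent = '{"batchcomplete":"","query":{"users":[{"name":"A.slomkoowski","missing":""}]}}'

print(mediawiki_user_exists(found))
print(mediawiki_user_exists(absent))
```

In `data.json` terms this would correspond to a `message` check with a narrow presence marker such as `"userid"` and an absence marker such as `"missing"`, provided the real responses match this shape.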
### 7.13 GraphQL APIs often support GET, not just POST
**Hashnode** exposes a GraphQL endpoint at `https://gql.hashnode.com`. While GraphQL is typically associated with POST requests, many implementations also support **GET** with the query passed as a URL parameter. This is critical for Maigret, which only supports GET/HEAD for `urlProbe`.
```
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":{"name":"Melwin D'Almeida","username":"melwinalm"}}}
GET https://gql.hashnode.com?query=%7Buser(username%3A%20%22a.slomkoowski%22)%20%7B%20name%20username%20%7D%7D
→ {"data":{"user":null}}
```
**Lesson:** Before giving up on a GraphQL-only site, try the same query via GET with `?query=...` (URL-encoded). Many GraphQL servers accept both methods.
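The GET URLs above can be reproduced with the standard library. This is a small sketch: `build_probe_url` and `user_exists` are hypothetical helper names, and parentheses are kept unescaped only to match the URLs shown in this section.

```python
import json
from urllib.parse import quote

GQL_ENDPOINT = "https://gql.hashnode.com"


def build_probe_url(username):
    """Build a GET URL carrying the GraphQL query as a URL-encoded parameter."""
    query = '{user(username: "%s") { name username }}' % username
    # Keep parentheses literal to match the URLs shown above.
    return GQL_ENDPOINT + "?query=" + quote(query, safe="()")


def user_exists(response_text):
    """A null `user` in the JSON body means the account was not found."""
    return json.loads(response_text)["data"]["user"] is not None


print(build_probe_url("melwinalm"))
print(user_exists('{"data":{"user":null}}'))  # False
```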
### 7.14 URL-encoding resolves template placeholder conflicts
The Hashnode GraphQL query `{user(username: "{username}") { name }}` contains curly braces that conflict with Maigret's `{username}` placeholder: Python's `str.format()` would treat `{user(username...}` as a replacement field and raise a `KeyError`.
The fix: URL-encode the GraphQL braces (`{` → `%7B`, `}` → `%7D`) but leave `{username}` as-is. Python's `.format()` only interprets literal `{…}` as placeholders, not `%7B…%7D`, and the GraphQL server decodes the percent-encoding on its end:
```
urlProbe: https://gql.hashnode.com?query=%7Buser(username%3A%20%22{username}%22)%20%7B%20name%20username%20%7D%7D
```
After `.format(username="melwinalm")`:
```
https://gql.hashnode.com?query=%7Buser(username%3A%20%22melwinalm%22)%20%7B%20name%20username%20%7D%7D
```
**Lesson:** When a `urlProbe` needs literal curly braces (GraphQL, JSON in URL, etc.), percent-encode them. This is a general technique for any `data.json` URL field processed by `.format()`.
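The difference between raw and percent-encoded braces is easy to check directly. This is a minimal sketch; `try_format` is a hypothetical helper, and it catches several exception types because the exact error depends on how `str.format()` parses the template.

```python
RAW = 'https://gql.hashnode.com?query={user(username: "{username}") { name username }}'
ENCODED = (
    "https://gql.hashnode.com?query="
    "%7Buser(username%3A%20%22{username}%22)%20%7B%20name%20username%20%7D%7D"
)


def try_format(template, username):
    """Return the formatted URL, or None if str.format() rejects the template."""
    try:
        return template.format(username=username)
    except (KeyError, IndexError, ValueError):
        # Literal braces are parsed as replacement fields and fail to resolve.
        return None


print(try_format(RAW, "melwinalm"))  # None
print(try_format(ENCODED, "melwinalm"))
```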
### 7.7 The playbook classification works
The decision tree from the documentation accurately describes real-world cases:
| Situation | Playbook says | Actual result |
|-----------|---------------|---------------|
| Captcha (Baidu) | `disabled: true` | Correct |
| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |
---
## Documentation maintenance
For any of the changes below, **always** keep these artifacts in sync — this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
- Maigret code changes (including [`maigret/checking.py`](../maigret/checking.py), request executors, CLI);
- New or changed search tools / helper utilities for site checks;
- Changes to rules or semantics of `checkType`, `data.json` fields, self-check, etc.;
- Changes to the **public JSON API** diagnostic step or **mandatory** `socid_extractor` logging rules.
Prefer updating the guide, playbook, and log template in one commit or in the same task so instructions do not diverge. **Append-only:** new proposals go at the bottom of `socid_extractor_improvements.log`; do not delete historical entries when editing the template.
-87
@@ -1,87 +0,0 @@
# Site checks — playbook (Maigret)
Short checklist for edits to [`maigret/resources/data.json`](../maigret/resources/data.json) and, when needed, [`maigret/checking.py`](../maigret/checking.py). Full guide: [`site-checks-guide.md`](site-checks-guide.md). Upstream extraction proposals: [`socid_extractor_improvements.log`](socid_extractor_improvements.log).
**Documentation maintenance:** whenever you improve Maigret, add search tooling, or change check logic, update **both** this file and [`site-checks-guide.md`](site-checks-guide.md) (see the “Documentation maintenance” section at the end of that file). When JSON API / `socid_extractor` logging rules change, update the **template header** in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) in the same change.
## 0. Standard checks (do alongside reproduce / classify)
- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).
## 1. Reproduce
- Run a targeted check:
`maigret USER --db /path/to/maigret/resources/data.json --site "SiteName" --print-not-found --print-errors --no-progressbar -vv`
- Compare an **existing** and a **non-existent** username (as `usernameClaimed` / `usernameUnclaimed` in JSON).
- With `-vvv`, inspect `debug.log` (raw response in the log).
## 2. Classify the cause
| Symptom | Typical cause | Action |
|--------|-----------------|--------|
| HTTP 200 for “user does not exist” | Soft 404 | Move from `status_code` to `message` or `response_url`; add `absenceStrs` / narrow `presenseStrs` |
| Generic words match (`name`, `email`) | `presenseStrs` too broad | Remove generic markers; add profile-specific ones |
| Same HTML without JS | SPA / skeleton shell | Compare **final URL and HTTP redirects** (Maigret already follows redirects by default). If the browser shows extra routes (`/posts`, `/not-found`) only **after JS**, they will **not** appear to Maigret — try a **public JSON/API** endpoint for the same site if one exists. See **Redirects and final URL** and **Picsart** in [`site-checks-guide.md`](site-checks-guide.md). |
| 403 / “Log in” / guest-only | Auth or anti-bot required | `disabled: true` |
| reCAPTCHA / “Checking your browser” | Bot protection | Try a reasonable `User-Agent` in `headers`; else `errors` + UNKNOWN or `disabled` |
| Domain does not resolve / persistent timeout | Dead service | Remove entry **only** after confirming the domain is dead |
## 3. Data edits
1. Update `url` / `urlMain` if needed (HTTPS redirects). Use optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
## 4. Verify
- `maigret --self-check --site "SiteName" --db ...` for touched entries.
- `make test` before commit.
## 5. Code notes
- `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
- Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.
## 6. Development utilities
Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).
| Command | Purpose |
|---------|---------|
| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |
## 7. Quick tips (lessons learned)
Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).
| Tip | Why it matters |
|-----|----------------|
| **API first** | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints. |
| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
| **Use `debug.log`** | Run with `-vvv` to see raw response. Warning messages alone can be misleading. |
| **`status_code` for clean APIs** | If API returns proper 404 for missing users, prefer `status_code` over `message`. |
| **Migrate, don't delete** | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |
| **Engine templates break silently** | vBulletin `absenceStrs` failed on ~12 forums at once — many require login, showing a generic page with no error text. Check the engine template first. |
| **Search-by-author is unreliable** | phpBB `search.php?author=` checks for posts, not accounts. A user with zero posts looks identical to a non-existent user. Avoid these URLs. |
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | Hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe`; `.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept modern browser UAs via HTTP 307. Hardcoding `"headers": {"User-Agent": "python-requests/2.25.1"}` circumvents the scraper filter and restores default detection logic. |
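The POST tip from the table can be sketched as follows. Everything site-specific here is a placeholder: the URL and the `taken` markers are hypothetical, and substituting `{username}` into each string value of the payload is an assumption about how the payload is filled, not confirmed behaviour.

```python
import json

# Hypothetical entry: URL and markers are placeholders, not a real data.json record.
entry = {
    "url": "https://example.com/api/username-check",
    "checkType": "message",
    "request_method": "POST",
    "request_payload": {"username": "{username}"},
    "presenseStrs": ['"taken": true'],
    "absenceStrs": ['"taken": false'],
}


def render_payload(payload, username):
    """Fill {username} into every string value of the payload (assumed behaviour)."""
    return {
        key: value.format(username=username) if isinstance(value, str) else value
        for key, value in payload.items()
    }


body = json.dumps(render_payload(entry["request_payload"], "alice"))
print(entry["request_method"], entry["url"], body)
```

Exact JSON substrings such as `"taken": false` only work if the API's whitespace and key order are stable, so it is worth confirming them against a raw response in `debug.log`.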
## 8. Documentation maintenance
When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.
-4
@@ -1,4 +0,0 @@
include LICENSE
include README.md
include requirements.txt
include maigret/resources/*
+259 -149
@@ -1,7 +1,7 @@
# Maigret
<p align="center">
<p align="center">
<div align="center">
<div>
<a href="https://pypi.org/project/maigret/">
<img alt="PyPI version badge for Maigret" src="https://img.shields.io/pypi/v/maigret?style=flat-square" />
</a>
@@ -17,154 +17,73 @@
<a href="https://github.com/soxoj/maigret">
<img alt="View count for Maigret project" src="https://komarev.com/ghpvc/?username=maigret&color=brightgreen&label=views&style=flat-square" />
</a>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="300"/>
</p>
</p>
</div>
<br>
<div>
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="300" alt="Maigret logo"/>
</div>
<br>
<div>
<b>English</b> · <a href="README.zh-CN.md">简体中文</a>
</div>
<br>
</div>
<i>The Commissioner Jules Maigret is a fictional French police detective, created by Georges Simenon. His investigation method is based on understanding the personality of different people and their interactions.</i>
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required.
<b>👉👉👉 [Online Telegram bot](https://t.me/maigret_search_bot)</b>
## Contents
## About
- [In one minute](#in-one-minute)
- [Main features](#main-features)
- [Demo](#demo)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [Commercial Use](#commercial-use)
- [About](#about)
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys are required. Maigret is an easy-to-use and powerful fork of [Sherlock](https://github.com/sherlock-project/sherlock).
<a id="one-minute"></a>
## In one minute
Currently supports more than 3000 sites ([full list](https://github.com/soxoj/maigret/blob/main/sites.md)); by default, the search runs against the 500 most popular sites in descending order of popularity. Checking Tor sites, I2P sites, and domains (via DNS resolution) is also supported.
Ensure you have Python 3.10 or higher.
## Powered By Maigret
```bash
pip install maigret
maigret YOUR_USERNAME
```
These are professional tools for social media content analysis and OSINT investigations that use Maigret (banners are clickable).
No install? Try the [Telegram bot](https://t.me/maigret_search_bot) or a [Cloud Shell](#cloud-shells).
Want a web UI? See [how to launch it](#web-interface).
See also: [Quick start](https://maigret.readthedocs.io/en/latest/quick-start.html).
## Main features
- Supports 3,000+ sites ([see full list](https://github.com/soxoj/maigret/blob/main/sites.md)). A default run checks the 500 highest-ranked sites by traffic; pass `-a` to scan everything, or `--tags` to narrow by category/country.
- Embeddable in Python projects — import `maigret` and run searches programmatically (see [library usage](https://maigret.readthedocs.io/en/latest/library-usage.html)).
- [Extracts](https://github.com/soxoj/socid_extractor) all available information about the account owner from profile pages and site APIs, including links to other accounts.
- Performs recursive search using discovered usernames and other IDs.
- Allows filtering by tags (site categories, countries).
- Detects and partially bypasses blocks, censorship, and CAPTCHA.
- Fetches an [auto-updated site database](https://maigret.readthedocs.io/en/latest/settings.html#database-auto-update) from GitHub each run (once per 24 hours), and falls back to the built-in database if offline.
- Works with Tor and I2P websites; able to check domains.
- Ships with a [web interface](#web-interface) for browsing results as a graph and downloading reports in every format from a single page.
- Optional [AI analysis mode](#ai-analysis) (`--ai`) that turns raw findings into a short investigation summary using an OpenAI-compatible API.
For the complete feature list, see the [features documentation](https://maigret.readthedocs.io/en/latest/features.html).
### Used by
Professional OSINT and social-media analysis tools built on Maigret:
<a href="https://github.com/SocialLinks-IO/sociallinks-api"><img height="60" alt="Social Links API" src="https://github.com/user-attachments/assets/789747b2-d7a0-4d4e-8868-ffc4427df660"></a>
<a href="https://sociallinks.io/products/sl-crimewall"><img height="60" alt="Social Links Crimewall" src="https://github.com/user-attachments/assets/0b18f06c-2f38-477b-b946-1be1a632a9d1"></a>
<a href="https://usersearch.ai/"><img height="60" alt="UserSearch" src="https://github.com/user-attachments/assets/66daa213-cf7d-40cf-9267-42f97cf77580"></a>
## Main features
## Demo
* Profile page parsing, [extraction](https://github.com/soxoj/socid_extractor) of personal info, links to other profiles, etc.
* Recursive search by new usernames and other IDs found
* Search by tags (site categories, countries)
* Censorship and captcha detection
* Requests retries
See the full description of Maigret features [in the documentation](https://maigret.readthedocs.io/en/latest/features.html).
## Installation
‼️ Maigret is available online via [official Telegram bot](https://t.me/maigret_search_bot). Consider using it if you don't want to install anything.
### Windows
Standalone EXE-binaries for Windows are located in [Releases section](https://github.com/soxoj/maigret/releases) of GitHub repository.
Video guide on how to run it: https://youtu.be/qIgwTZOmMmM.
### Installation in Cloud Shells
You can launch Maigret using cloud shells and Jupyter notebooks. Press one of the buttons below and follow the instructions to launch it in your browser.
[![Open in Cloud Shell](https://user-images.githubusercontent.com/27065646/92304704-8d146d80-ef80-11ea-8c29-0deaabb1c702.png)](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=README.md)
<a href="https://repl.it/github/soxoj/maigret"><img src="https://replit.com/badge/github/soxoj/maigret" alt="Run on Replit" height="50"></a>
<a href="https://colab.research.google.com/gist/soxoj/879b51bc3b2f8b695abb054090645000/maigret-collab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="45"></a>
<a href="https://mybinder.org/v2/gist/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a/HEAD"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder" height="45"></a>
### Local installation
Maigret can be installed using pip or Docker, or launched directly from the cloned repo.
**NOTE**: Python 3.10 or higher and pip are required; **Python 3.11 is recommended.**
```bash
# install from pypi
pip3 install maigret
# usage
maigret username
```
### Cloning a repository
```bash
# or clone and install manually
git clone https://github.com/soxoj/maigret && cd maigret
# build and install
pip3 install .
# usage
maigret username
```
### Docker
```bash
# official image
docker pull soxoj/maigret
# usage
docker run -v /mydir:/app/reports soxoj/maigret:latest username --html
# manual build
docker build -t maigret .
```
## Usage examples
```bash
# make HTML, PDF, and Xmind8 reports
maigret user --html
maigret user --pdf
maigret user --xmind #Output not compatible with xmind 2022+
# search on sites marked with tags photo & dating
maigret user --tags photo,dating
# search on sites marked with tag us
maigret user --tags us
# search for three usernames on all available sites
maigret user1 user2 user3 -a
```
Use `maigret --help` for a full description of the options. The options are also [documented online](https://maigret.readthedocs.io/en/latest/command-line-options.html).
### Web interface
You can run Maigret with a web interface, where you can view the graph with results and download reports of all formats on a single page.
<details>
<summary>Web Interface Screenshots</summary>
![Web interface: how to start](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot_start.png)
![Web interface: results](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot.png)
</details>
Instructions:
1. Run Maigret with the ``--web`` flag and specify the port number.
```console
maigret --web 5000
```
2. Open http://127.0.0.1:5000 in your browser and enter one or more usernames to make a search.
3. Wait a bit for the search to complete and view the graph with results, the table with all accounts found, and download reports of all formats.
## Contributing
Maigret is open source, so you can contribute your own sites by adding them to the `data.json` file, or make changes to its code!
For more information about development and contribution, please read the [development documentation](https://maigret.readthedocs.io/en/latest/development.html).
## Demo with page parsing and recursive username search
### Video (asciinema)
### Video
<a href="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ">
<img src="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ.svg" alt="asciicast" width="600">
@@ -180,27 +99,218 @@ For more information about development and contribution, please read the [develo
[Full console output](https://raw.githubusercontent.com/soxoj/maigret/main/static/recursive_search.md)
## Disclaimer
## Installation
**This tool is intended for educational and lawful purposes only.** The developers do not endorse or encourage any illegal activities or misuse of this tool. Regulations regarding the collection and use of personal data vary by country and region, including but not limited to GDPR in the EU, CCPA in the USA, and similar laws worldwide.
Already ran the [In one minute](#one-minute) steps? You're set. Below are alternative methods.
It is your sole responsibility to ensure that your use of this tool complies with all applicable laws and regulations in your jurisdiction. Any illegal use of this tool is strictly prohibited, and you are fully accountable for your actions.
Don't want to install anything? Use the [Telegram bot](https://t.me/maigret_search_bot).
The authors and developers of this tool bear no responsibility for any misuse or unlawful activities conducted by its users.
### Windows
## Feedback
Download a standalone EXE from [Releases](https://github.com/soxoj/maigret/releases). Video guide: https://youtu.be/qIgwTZOmMmM.
If you have any questions, suggestions, or feedback, please feel free to [open an issue](https://github.com/soxoj/maigret/issues), create a [GitHub discussion](https://github.com/soxoj/maigret/discussions), or contact the author directly via [Telegram](https://t.me/soxoj).
<a id="cloud-shells"></a>
### Cloud Shells
## SOWEL classification
Run Maigret in the browser via cloud shells or Jupyter notebooks:
This tool uses the following OSINT techniques:
<a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=cloudshell-tutorial.md"><img src="https://user-images.githubusercontent.com/27065646/92304704-8d146d80-ef80-11ea-8c29-0deaabb1c702.png" alt="Open in Cloud Shell" height="50"></a>
<a href="https://repl.it/github/soxoj/maigret"><img src="https://replit.com/badge/github/soxoj/maigret" alt="Run on Replit" height="50"></a>
<a href="https://colab.research.google.com/gist/soxoj/879b51bc3b2f8b695abb054090645000/maigret-collab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="45"></a>
<a href="https://mybinder.org/v2/gist/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a/HEAD"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder" height="45"></a>
### Local installation (pip)
```bash
# install from pypi
pip3 install maigret
# usage
maigret username
```
### From source
```bash
# or clone and install manually
git clone https://github.com/soxoj/maigret && cd maigret
# build and install
pip3 install .
# usage
maigret username
```
### Docker
Two image variants are published:
- `soxoj/maigret:latest` — CLI mode (default)
- `soxoj/maigret:web` — auto-launches the [web interface](#web-interface)
```bash
# official image (CLI)
docker pull soxoj/maigret
# CLI usage
docker run -v /mydir:/app/reports soxoj/maigret:latest username --html
# Web UI (open http://localhost:5000)
docker run -p 5000:5000 soxoj/maigret:web
# Web UI on a custom port
docker run -e PORT=8080 -p 8080:8080 soxoj/maigret:web
# manual build
docker build -t maigret . # CLI image (default target)
docker build --target web -t maigret-web . # Web UI image
```
### Troubleshooting
Build errors? See the [troubleshooting guide](https://maigret.readthedocs.io/en/latest/installation.html#troubleshooting).
## Usage
### Examples
```bash
# make HTML, PDF, and Xmind8 reports
maigret user --html
maigret user --pdf
maigret user --xmind #Output not compatible with xmind 2022+
# machine-readable exports
maigret user --json ndjson # newline-delimited JSON (also: --json simple)
maigret user --csv
maigret user --txt
maigret user --graph # interactive D3 graph (HTML)
# search on sites marked with tags photo & dating
maigret user --tags photo,dating
# search on sites marked with tag us
maigret user --tags us
# search for three usernames on all available sites
maigret user1 user2 user3 -a
# AI-assisted investigation summary (needs OPENAI_API_KEY)
maigret user --ai
```
Run `maigret --help` for all options. Docs: [CLI options](https://maigret.readthedocs.io/en/latest/command-line-options.html), [more examples](https://maigret.readthedocs.io/en/latest/usage-examples.html). Running into 403s or timeouts? See [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
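The `--json ndjson` export can be consumed one line at a time from your own tooling. A minimal sketch, assuming only that each non-empty line is one JSON document. The report's field names are not specified here, so the records are treated as opaque; the `read_ndjson` helper and the sample data are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample data; real Maigret records will have their own fields.
sample = io.StringIO('{"site": "ExampleA"}\n\n{"site": "ExampleB"}\n')
records = list(read_ndjson(sample))
print(len(records))
```

To read a real report, pass an open file handle instead of the `io.StringIO` sample.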
<a id="web-interface"></a>
### Web interface
Maigret has a built-in web UI with a results graph and downloadable reports.
<details>
<summary>Web Interface Screenshots</summary>
![Web interface: how to start](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot_start.png)
![Web interface: results](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot.png)
</details>
```console
maigret --web 5000
```
Open http://127.0.0.1:5000, enter a username, and view results.
### Python library
**Maigret can be embedded in your own Python projects.** The CLI is a thin wrapper around an async function you can call directly — build custom pipelines, feed results into your own tooling, or run it inside a larger OSINT workflow.
See the full [library usage guide](https://maigret.readthedocs.io/en/latest/library-usage.html) for a working example, async patterns, and how to filter sites by tag.
### Useful CLI flags
- `--parse URL` — parse a profile page, extract IDs/usernames, and use them to kick off a recursive search.
- `--permute` — generate likely username variants from two or more inputs (e.g. `john doe` → `johndoe`, `j.doe`, …) and search for all of them.
- `--self-check [--auto-disable]` — verify that each site's `usernameClaimed` / `usernameUnclaimed` pair still behaves as expected against the live site (intended for maintainers auditing the database).
- `--ai` / `--ai-model` — run the [AI analysis](#ai-analysis) over the search results and stream a short investigation summary to the terminal.
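To illustrate the kind of variants `--permute` produces, here is a rough sketch. It is not Maigret's actual algorithm: the separator list, the initial-plus-name rule, and the `username_variants` function are assumptions made for this example.

```python
from itertools import permutations


def username_variants(parts: list[str]) -> set[str]:
    """Combine name parts into plausible usernames (illustrative rules only)."""
    cleaned = [p.lower() for p in parts if p]  # drop empty parts
    separators = ["", ".", "_", "-"]           # assumed separator set
    variants: set[str] = set()
    for first, second in permutations(cleaned, 2):
        for sep in separators:
            variants.add(f"{first}{sep}{second}")     # e.g. johndoe, john.doe
            variants.add(f"{first[0]}{sep}{second}")  # e.g. jdoe, j.doe
    return variants


variants = username_variants(["john", "doe"])
print(sorted(variants))
```

The resulting set could then be passed to the CLI as multiple usernames in a single run.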
<a id="ai-analysis"></a>
### AI analysis
`--ai` collects the search results, builds an internal Markdown report, and sends it to an OpenAI-compatible chat completion endpoint to produce a short, neutral investigation summary (likely real name, location, occupation, interests, languages, confidence, follow-up leads). Per-site progress is suppressed and the model's output is streamed to stdout.
```bash
export OPENAI_API_KEY=sk-...
maigret user --ai
# pick a different model
maigret user --ai --ai-model gpt-4o-mini
```
The key can also be set as `openai_api_key` in `settings.json`. The endpoint defaults to `https://api.openai.com/v1`, but `openai_api_base_url` in `settings.json` can point to any OpenAI-compatible API (Azure OpenAI, OpenRouter, a local server, …). See the [settings docs](https://maigret.readthedocs.io/en/latest/settings.html) for the full list of options.
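As a sketch of the settings described above, the snippet below merges the two OpenAI-related keys into an existing settings dictionary without dropping other entries. The key names come from the text; the base URL value, the `existing` dictionary, and the `merge_ai_settings` helper are hypothetical, and where `settings.json` lives on disk is not covered here.

```python
import json


def merge_ai_settings(existing: dict, api_key: str, base_url: str) -> dict:
    """Return a copy of the settings with the AI-related keys set."""
    merged = dict(existing)  # keep unrelated settings untouched
    merged["openai_api_key"] = api_key
    merged["openai_api_base_url"] = base_url
    return merged


# Placeholder values only: not a real key, and a hypothetical local server.
existing = {"some_other_option": True}
settings = merge_ai_settings(existing, "sk-...", "http://127.0.0.1:8000/v1")
print(json.dumps(settings, indent=2))
```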
### Tor / I2P / proxies
Maigret can route checks through a proxy, Tor, or I2P — useful for `.onion` / `.i2p` sites and for bypassing WAFs that block datacenter IPs.
```bash
# any HTTP/SOCKS proxy
maigret user --proxy socks5://127.0.0.1:1080
# Tor (default gateway socks5://127.0.0.1:9050)
maigret user --tor-proxy socks5://127.0.0.1:9050
# I2P (default gateway http://127.0.0.1:4444)
maigret user --i2p-proxy http://127.0.0.1:4444
```
Start your Tor / I2P daemon before running the command — Maigret does not manage these gateways.
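Because Maigret does not manage the gateways, it can help to confirm that something is listening before starting a long run. A minimal sketch using only the standard library; it checks that the port accepts TCP connections, not that the listener is really Tor or I2P. The `gateway_is_up` helper is hypothetical, and the ports are the defaults quoted above.

```python
import socket


def gateway_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Default gateway ports from the examples above.
for name, port in (("Tor", 9050), ("I2P", 4444)):
    state = "listening" if gateway_is_up("127.0.0.1", port) else "not reachable"
    print(f"{name} gateway on 127.0.0.1:{port}: {state}")
```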
### Cloudflare bypass
> **Experimental.** The Cloudflare webgate is under active development; the configuration schema, CLI behaviour, and the set of routed sites may change without backwards-compatibility guarantees.
A subset of sites in the database require a real browser to solve a JavaScript challenge. Maigret can offload these checks to a local [FlareSolverr](https://github.com/FlareSolverr/FlareSolverr) instance:
```bash
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
maigret --cloudflare-bypass <username>
```
The bypass is opt-in (`--cloudflare-bypass` or `cloudflare_bypass.enabled` in `settings.json`) and only fires for sites whose `protection` field matches. See the [feature docs](https://maigret.readthedocs.io/en/latest/features.html#cloudflare-bypass) for backend options and configuration.
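The opt-in rule can be pictured as a simple predicate. This is an illustrative sketch, not Maigret's code: the tag names are the ones listed for `--cloudflare-bypass` in the CLI option docs, while the shape of the `site` dictionary and the `should_route_via_solver` function are assumptions.

```python
# Protection tags that the CLI docs list as routed through the solver.
BYPASS_TAGS = {"cf_js_challenge", "cf_firewall", "webgate"}


def should_route_via_solver(site: dict, bypass_enabled: bool) -> bool:
    """Route a check through the solver only if opted in and the site is tagged."""
    if not bypass_enabled:
        return False  # without the flag, every site is checked the usual way
    return bool(BYPASS_TAGS & set(site.get("protection", [])))


# Hypothetical site entries for demonstration.
protected = {"name": "ExampleProtected", "protection": ["cf_js_challenge"]}
plain = {"name": "ExamplePlain"}
print(should_route_via_solver(protected, bypass_enabled=True))
print(should_route_via_solver(plain, bypass_enabled=True))
```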
## Contributing
Add new sites or fix existing ones with targeted edits to `data.json` (do not read and rewrite the whole file with `json.load`/`json.dump`), then run `./utils/update_site_data.py` to regenerate `sites.md` and the database metadata, and open a pull request. For more details, see the [CONTRIBUTING guide](https://github.com/soxoj/maigret/blob/main/CONTRIBUTING.md) and [development docs](https://maigret.readthedocs.io/en/latest/development.html). Release history: [CHANGELOG.md](CHANGELOG.md).
## Commercial Use
The open-source Maigret is MIT-licensed and free for commercial use without restriction — but site checks break over time and need active maintenance.
For serious commercial use — with a **daily-updated site database** or a **username-check API** — reach out: 📧 [maigret@soxoj.com](mailto:maigret@soxoj.com)
- Private site database — 5 000+ sites, updated daily (separate from the public open-source database)
- Username check API — integrate Maigret into your product
## About
### Disclaimer
**For educational and lawful purposes only.** You are responsible for complying with all applicable laws (GDPR, CCPA, etc.) in your jurisdiction. The authors bear no responsibility for misuse.
### Feedback
[Open an issue](https://github.com/soxoj/maigret/issues) · [GitHub Discussions](https://github.com/soxoj/maigret/discussions) · [Telegram](https://t.me/soxoj)
### SOWEL classification
OSINT techniques used:
- [SOTL-2.2. Search For Accounts On Other Platforms](https://sowel.soxoj.com/other-platform-accounts)
- [SOTL-6.1. Check Logins Reuse To Find Another Account](https://sowel.soxoj.com/logins-reuse)
- [SOTL-6.2. Check Nicknames Reuse To Find Another Account](https://sowel.soxoj.com/nicknames-reuse)
## License
### License
MIT © [Maigret](https://github.com/soxoj/maigret)<br/>
MIT © [Sherlock Project](https://github.com/sherlock-project/)<br/>
Original Creator of Sherlock Project - [Siddharth Dushantha](https://github.com/sdushantha)
MIT © [Maigret](https://github.com/soxoj/maigret)
+310
@@ -0,0 +1,310 @@
# Maigret
<div align="center">
<div>
<a href="https://pypi.org/project/maigret/">
<img alt="Maigret 的 PyPI 版本" src="https://img.shields.io/pypi/v/maigret?style=flat-square" />
</a>
<a href="https://pypi.org/project/maigret/">
<img alt="Maigret 的 PyPI 周下载量" src="https://img.shields.io/pypi/dw/maigret?style=flat-square" />
</a>
<a href="https://github.com/soxoj/maigret">
<img alt="所需最低 Python 版本:3.10+" src="https://img.shields.io/badge/Python-3.10%2B-brightgreen?style=flat-square" />
</a>
<a href="https://github.com/soxoj/maigret/blob/main/LICENSE">
<img alt="Maigret 的开源许可证" src="https://img.shields.io/github/license/soxoj/maigret?style=flat-square" />
</a>
<a href="https://github.com/soxoj/maigret">
<img alt="Maigret 项目访问量" src="https://komarev.com/ghpvc/?username=maigret&color=brightgreen&label=views&style=flat-square" />
</a>
</div>
<br>
<div>
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="300" alt="Maigret logo"/>
</div>
<br>
<div>
<a href="README.md">English</a> · <b>简体中文</b>
</div>
<br>
</div>
**Maigret** 仅凭一个用户名,就能在大量站点上查找其账号,并从网页中收集所有可获取的公开信息,为目标人物生成一份档案。无需任何 API 密钥。
## 目录
- [一分钟上手](#one-minute)
- [核心特性](#main-features)
- [演示](#demo)
- [安装](#installation)
- [使用](#usage)
- [参与贡献](#contributing)
- [商业使用](#commercial-use)
- [关于](#about)
<a id="one-minute"></a>
## 一分钟上手
请先确认本机的 Python 版本不低于 3.10。
```bash
pip install maigret
maigret YOUR_USERNAME
```
不想本地安装?可以试试 [Telegram 机器人](https://t.me/maigret_search_bot),或者使用[云端 Shell](#cloud-shells)。
想要一个 Web 界面?参见[启动方式](#web-interface)。
延伸阅读:[快速入门](https://maigret.readthedocs.io/en/latest/quick-start.html)。
<a id="main-features"></a>
## 核心特性
- 支持 3000+ 站点(完整列表见 [sites.md](https://github.com/soxoj/maigret/blob/main/sites.md))。默认仅检查访问量排名前 500 的站点;加上 `-a` 可全量扫描,或使用 `--tags` 按分类/国家筛选。
- 可作为 Python 库嵌入到自己的项目中——直接 `import maigret` 即可在代码里发起搜索(参见[库使用文档](https://maigret.readthedocs.io/en/latest/library-usage.html))。
- 通过 [socid_extractor](https://github.com/soxoj/socid_extractor) 从个人主页和站点 API 中[提取](https://github.com/soxoj/socid_extractor)账号所有者的所有可获取信息,包括指向其他账号的链接。
- 基于已发现的用户名和其他 ID,执行递归搜索。
- 支持按标签(站点分类、国家)进行筛选。
- 能够检测并部分绕过封锁、审查和 CAPTCHA。
- 每次运行时(每 24 小时一次)从 GitHub 拉取一份[自动更新的站点数据库](https://maigret.readthedocs.io/en/latest/settings.html#database-auto-update);离线时会回退到内置数据库。
- 可访问 Tor 与 I2P 站点;支持检查域名。
- 自带一个 [Web 界面](#web-interface),可在同一页面将结果以图谱方式浏览,并下载各种格式的报告。
- 可选的 [AI 分析模式](#ai-analysis)(`--ai`),通过 OpenAI 兼容 API 将原始搜索结果整理成一份简短的调查摘要。
完整特性列表请见[特性文档](https://maigret.readthedocs.io/en/latest/features.html)。
### 谁在使用
基于 Maigret 构建的专业 OSINT 与社交媒体分析工具:
<a href="https://github.com/SocialLinks-IO/sociallinks-api"><img height="60" alt="Social Links API" src="https://github.com/user-attachments/assets/789747b2-d7a0-4d4e-8868-ffc4427df660"></a>
<a href="https://sociallinks.io/products/sl-crimewall"><img height="60" alt="Social Links Crimewall" src="https://github.com/user-attachments/assets/0b18f06c-2f38-477b-b946-1be1a632a9d1"></a>
<a href="https://usersearch.ai/"><img height="60" alt="UserSearch" src="https://github.com/user-attachments/assets/66daa213-cf7d-40cf-9267-42f97cf77580"></a>
<a id="demo"></a>
## 演示
### 视频
<a href="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ">
<img src="https://asciinema.org/a/Ao0y7N0TTxpS0pisoprQJdylZ.svg" alt="asciicast" width="600">
</a>
### 报告示例
[PDF 报告](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.pdf)、[HTML 报告](https://htmlpreview.github.io/?https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.html)
![HTML 报告截图](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_html_screenshot.png)
![XMind 8 报告截图](https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotography_xmind_screenshot.png)
[完整的命令行输出示例](https://raw.githubusercontent.com/soxoj/maigret/main/static/recursive_search.md)
<a id="installation"></a>
## 安装
如果你已经按[一分钟上手](#one-minute)的步骤跑通了,就无需再装。下面列出几种可选的安装方式。
什么都不想装?直接用 [Telegram 机器人](https://t.me/maigret_search_bot)。
### Windows
从 [Releases](https://github.com/soxoj/maigret/releases) 下载独立的 EXE 文件。视频指引:https://youtu.be/qIgwTZOmMmM。
<a id="cloud-shells"></a>
### 云端 Shell
通过云端 Shell 或 Jupyter Notebook 在浏览器里运行 Maigret:
<a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=cloudshell-tutorial.md"><img src="https://user-images.githubusercontent.com/27065646/92304704-8d146d80-ef80-11ea-8c29-0deaabb1c702.png" alt="Open in Cloud Shell" height="50"></a>
<a href="https://repl.it/github/soxoj/maigret"><img src="https://replit.com/badge/github/soxoj/maigret" alt="Run on Replit" height="50"></a>
<a href="https://colab.research.google.com/gist/soxoj/879b51bc3b2f8b695abb054090645000/maigret-collab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="45"></a>
<a href="https://mybinder.org/v2/gist/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a/HEAD"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder" height="45"></a>
### 本地安装(pip)
```bash
# 从 PyPI 安装
pip3 install maigret
# 使用
maigret username
```
### 从源码安装
```bash
# 也可以克隆仓库后手动安装
git clone https://github.com/soxoj/maigret && cd maigret
# 构建并安装
pip3 install .
# 使用
maigret username
```
### Docker
官方提供两个镜像变体:
- `soxoj/maigret:latest` —— CLI 模式(默认)
- `soxoj/maigret:web` —— 自动启动 [Web 界面](#web-interface)
```bash
# 拉取官方镜像(CLI)
docker pull soxoj/maigret
# CLI 用法
docker run -v /mydir:/app/reports soxoj/maigret:latest username --html
# Web UI(在 http://localhost:5000 打开)
docker run -p 5000:5000 soxoj/maigret:web
# 自定义 Web UI 端口
docker run -e PORT=8080 -p 8080:8080 soxoj/maigret:web
# 手动构建
docker build -t maigret . # CLI 镜像(默认 target)
docker build --target web -t maigret-web . # Web UI 镜像
```
### 故障排查
构建报错?请见[故障排查指南](https://maigret.readthedocs.io/en/latest/installation.html#troubleshooting)。
<a id="usage"></a>
## 使用
### 示例
```bash
# 生成 HTML、PDF、XMind 8 报告
maigret user --html
maigret user --pdf
maigret user --xmind # 与 XMind 2022+ 不兼容
# 机器可读的导出格式
maigret user --json ndjson # 行分隔 JSON(也支持 --json simple)
maigret user --csv
maigret user --txt
maigret user --graph # 交互式 D3 图谱(HTML)
# 仅在带有 photo 与 dating 标签的站点上搜索
maigret user --tags photo,dating
# 仅在带有 us 标签的站点上搜索
maigret user --tags us
# 同时在所有站点上搜索三个用户名
maigret user1 user2 user3 -a
# AI 辅助调查摘要(需要 OPENAI_API_KEY)
maigret user --ai
```
完整选项请运行 `maigret --help`。文档:[命令行选项](https://maigret.readthedocs.io/en/latest/command-line-options.html)、[更多示例](https://maigret.readthedocs.io/en/latest/usage-examples.html)。遇到 403 或超时?参见 [TROUBLESHOOTING.md](TROUBLESHOOTING.md)。
<a id="web-interface"></a>
### Web 界面
Maigret 内置一个 Web UI,提供结果图谱视图和报告下载。
<details>
<summary>Web 界面截图</summary>
![Web 界面:启动页](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot_start.png)
![Web 界面:结果页](https://raw.githubusercontent.com/soxoj/maigret/main/static/web_interface_screenshot.png)
</details>
```console
maigret --web 5000
```
在浏览器中打开 http://127.0.0.1:5000,输入用户名即可查看结果。
### Python 库
**Maigret 可以嵌入到你自己的 Python 项目里使用。** CLI 只是对一个异步函数的薄包装,你完全可以直接调用它——构建自定义流水线、把结果接入自家工具,或将其嵌入更大的 OSINT 工作流。
完整示例(包含异步用法和按标签筛选站点)请参见[库使用指南](https://maigret.readthedocs.io/en/latest/library-usage.html)。
### 常用 CLI 参数
- `--parse URL` —— 解析一个个人主页,从中提取 ID/用户名,并以此为起点发起递归搜索。
- `--permute` —— 基于两个或更多输入生成可能的用户名变体(例如 `john doe` → `johndoe`、`j.doe` …)并对其逐一搜索。
- `--self-check [--auto-disable]` —— 维护者用于核对数据库的工具:针对线上站点验证 `usernameClaimed` / `usernameUnclaimed` 配对是否仍然有效。
- `--ai` / `--ai-model` —— 启用 [AI 分析](#ai-analysis),将搜索结果交给 OpenAI 兼容 API,并把简短的调查摘要流式输出到终端。
<a id="ai-analysis"></a>
### AI 分析
`--ai` 会先收集搜索结果、在内存中构建 Markdown 报告,再将其发送到一个 OpenAI 兼容的 chat completion 接口,生成一份简短、克制的调查摘要(最可能的真实姓名、所在地、职业、兴趣、语言、置信度以及后续线索)。开启该模式后,逐站点的进度输出会被静默,模型的输出会以流式方式打印到 stdout。
```bash
export OPENAI_API_KEY=sk-...
maigret user --ai
# 切换到其它模型
maigret user --ai --ai-model gpt-4o-mini
```
API key 也可以写入 `settings.json` 的 `openai_api_key` 字段。接口地址默认为 `https://api.openai.com/v1`,通过在 `settings.json` 中设置 `openai_api_base_url`,可以指向任何 OpenAI 兼容的服务(Azure OpenAI、OpenRouter、本地推理服务等)。完整选项见[配置文档](https://maigret.readthedocs.io/en/latest/settings.html)。
### Tor / I2P / 代理
Maigret 支持通过代理、Tor 或 I2P 转发请求——这对访问 `.onion` / `.i2p` 站点,以及绕过会拦截数据中心 IP 的 WAF 都很有用。
```bash
# 任意 HTTP/SOCKS 代理
maigret user --proxy socks5://127.0.0.1:1080
# Tor(默认网关 socks5://127.0.0.1:9050)
maigret user --tor-proxy socks5://127.0.0.1:9050
# I2P(默认网关 http://127.0.0.1:4444)
maigret user --i2p-proxy http://127.0.0.1:4444
```
请先启动 Tor / I2P 守护进程再运行上述命令——Maigret 不会替你管理这些网关。
<a id="contributing"></a>
## 参与贡献
请精确地在 `data.json` 里新增或修复站点(不要使用 `json.load`/`json.dump` 整体读写),然后运行 `./utils/update_site_data.py` 重新生成 `sites.md` 和数据库元数据,再提交 Pull Request。更多细节见 [CONTRIBUTING 指南](https://github.com/soxoj/maigret/blob/main/CONTRIBUTING.md) 和[开发文档](https://maigret.readthedocs.io/en/latest/development.html)。版本历史见 [CHANGELOG.md](CHANGELOG.md)。
<a id="commercial-use"></a>
## 商业使用
开源版本的 Maigret 采用 MIT 许可证,可不受限制地用于商业用途——但站点检查会随时间失效,需要持续维护。
如果你有更严肃的商业需求——希望使用**每日更新的站点数据库**或**用户名查询 API**——欢迎联系:📧 [maigret@soxoj.com](mailto:maigret@soxoj.com)
- 私有站点数据库 —— 5000+ 站点,每日更新(独立于公开开源数据库)
- 用户名查询 API —— 将 Maigret 集成进你的产品
<a id="about"></a>
## 关于
### 免责声明
**仅供教育与合法用途。** 使用者需自行承担遵守所在司法辖区相关法律(GDPR、CCPA 等)的责任。作者不对任何滥用行为负责。
### 反馈
[提交 issue](https://github.com/soxoj/maigret/issues) · [GitHub Discussions](https://github.com/soxoj/maigret/discussions) · [Telegram](https://t.me/soxoj)
### SOWEL 分类
涉及到的 OSINT 技术:
- [SOTL-2.2. Search For Accounts On Other Platforms](https://sowel.soxoj.com/other-platform-accounts)
- [SOTL-6.1. Check Logins Reuse To Find Another Account](https://sowel.soxoj.com/logins-reuse)
- [SOTL-6.2. Check Nicknames Reuse To Find Another Account](https://sowel.soxoj.com/nicknames-reuse)
### 许可证
MIT © [Maigret](https://github.com/soxoj/maigret)
+91
@@ -0,0 +1,91 @@
# Troubleshooting
Common issues when running Maigret and how to fix them. If none of this helps, [open an issue](https://github.com/soxoj/maigret/issues) with the output of `maigret --version` and the exact command you ran.
## "Lots of sites fail / timeout / return 403"
This is by far the most common report. It almost always comes from anti-bot protection (Cloudflare, DDoS-Guard, Akamai, etc.) or a slow network — not from a bug in Maigret.
**Results vary a lot depending on where you run from.** The same command on the same username can produce very different output on:
- **Mobile internet** (4G/5G) — usually the best results. Carrier NAT shares your IP with thousands of real users, so WAFs rarely block it.
- **Home broadband** — generally good, though some ISPs are reputation-flagged.
- **Hosting / cloud / VPS infrastructure** (AWS, GCP, DigitalOcean, Hetzner, etc.) — the worst case. Datacenter IP ranges are blanket-blocked or challenged by most WAFs, so you will see many false negatives and 403s.
If a run looks suspiciously empty, **try a different network before assuming Maigret is broken**: tether from your phone, switch between Wi-Fi and mobile, or move the run off a VPS onto a residential machine. Comparing results across two networks is also the fastest way to tell whether a missing account is genuinely missing or just blocked on the current IP.
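One way to compare two runs is to diff the sets of sites where an account was found. A minimal sketch: how the site names are collected from each run is left out, and the sample names and the `compare_runs` helper are hypothetical.

```python
def compare_runs(network_a: set[str], network_b: set[str]) -> dict[str, set[str]]:
    """Split found-site names into shared hits and hits unique to each network."""
    return {
        "both": network_a & network_b,
        "only_a": network_a - network_b,  # possibly blocked on network B
        "only_b": network_b - network_a,  # possibly blocked on network A
    }


# Hypothetical results: a VPS run versus a run tethered through a phone.
vps_run = {"SiteA"}
mobile_run = {"SiteA", "SiteB", "SiteC"}
diff = compare_runs(vps_run, mobile_run)
print(sorted(diff["only_b"]))
```

Sites that appear on only one network are candidates for IP-based blocking rather than genuinely missing accounts.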
Once you have a sense of the baseline, try these tweaks in order:
1. **Raise the timeout.** The default is 30 seconds. On mobile networks or for slow sites, bump it:
```bash
maigret user --timeout 60
```
2. **Retry failed checks.** Transient 5xx / timeouts often clear on a second try:
```bash
maigret user --retries 2
```
3. **Lower parallelism.** Some WAFs rate-limit aggressively. Maigret defaults to 100 concurrent connections (`-n` / `--max-connections`) — dropping this makes you look less like a scanner:
```bash
maigret user -n 20
```
4. **Route through a residential proxy.** Datacenter IPs (AWS, GCP, DigitalOcean) are blanket-blocked by many WAFs. A residential / mobile proxy usually fixes this:
```bash
maigret user --proxy http://user:pass@residential-proxy:port
```
Note: Tor (`--tor-proxy`) rarely helps here — most WAFs block Tor exit nodes just as aggressively as datacenter IPs. Use Tor only when you actually need to reach `.onion` sites (see below).
If specific sites *always* fail regardless of the above, they are likely broken in the database (stale markers, new WAF, site redesign). Report them with `--print-errors` output so a maintainer can look at the check config.
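The tweaks above can be combined into one invocation when scripting repeated runs. A sketch that only assembles the argument list from flags mentioned in this section; the `build_command` helper and its default values are illustrative choices, not recommended settings.

```python
import shlex
from typing import Optional


def build_command(
    username: str,
    timeout: int = 60,
    retries: int = 2,
    max_connections: int = 20,
    proxy: Optional[str] = None,
) -> list[str]:
    """Assemble a maigret command line with the reliability flags from this section."""
    cmd = [
        "maigret", username,
        "--timeout", str(timeout),
        "--retries", str(retries),
        "-n", str(max_connections),
    ]
    if proxy:  # only add the proxy flag when one is supplied
        cmd += ["--proxy", proxy]
    return cmd


command = build_command("user")
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run` without shell quoting concerns.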
## "No results at all" / "maigret: command not found"
- **`command not found`** — `pip install maigret` typically puts the binary in a per-user scripts directory such as `~/.local/bin` (Linux/macOS) or a `Scripts` folder under `%APPDATA%\Python` (Windows), which may not be on `PATH`. Add that directory to `PATH`, or run `python3 -m maigret user` instead.
- **Empty output** — check that you actually passed a username; `maigret` alone prints help. Also confirm Python 3.10+ with `python3 --version`.
## "SSL / certificate errors"
Usually caused by a corporate MITM proxy or an outdated `certifi` bundle.
```bash
pip install --upgrade certifi
```
If you are behind a corporate proxy, set `HTTPS_PROXY` / `HTTP_PROXY` environment variables and pass `--proxy "$HTTPS_PROXY"` so Maigret uses the same route.
## ".onion / .i2p sites are skipped"
These sites only load through the matching gateway. Start your Tor or I2P daemon first, then:
```bash
# Tor
maigret user --tor-proxy socks5://127.0.0.1:9050
# I2P
maigret user --i2p-proxy http://127.0.0.1:4444
```
Maigret does not launch or manage these daemons — they must already be running.
## "The PDF / XMind / HTML report looks wrong"
- **PDF** — requires `weasyprint` and its system dependencies (Pango, Cairo, GDK-PixBuf). On Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. macOS: `brew install pango`.
- **XMind** — the `--xmind` flag generates **XMind 8** files. XMind 2022+ (Zen / XMind 2023) uses a different format and will not open them. Use XMind 8, or generate an HTML report with `--html` instead.
- **HTML** looks unstyled — open it through a local file path (`file:///...`), not via a preview pane that strips CSS.
## "The site database is out of date"
Maigret auto-fetches a fresh `data.json` from GitHub once every 24 hours. To force-refresh now:
```bash
maigret user --force-update
```
To run entirely against the local built-in copy (e.g. offline):
```bash
maigret user --no-autoupdate
```
## Still stuck?
- [Open an issue](https://github.com/soxoj/maigret/issues) — include your OS, Python version, Maigret version, and the full command.
- Ask in [GitHub Discussions](https://github.com/soxoj/maigret/discussions) or the [Telegram](https://t.me/soxoj) channel.
+69
@@ -0,0 +1,69 @@
# Maigret
<div align="center">
<img src="https://raw.githubusercontent.com/soxoj/maigret/main/static/maigret.png" height="220" alt="Maigret logo"/>
</div>
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required.
## Installation
Google Cloud Shell does not ship with all the system libraries Maigret needs (`libcairo2-dev`, `pkg-config`). The helper script below installs them and then builds Maigret from the cloned source.
Copy the command and run it in the Cloud Shell terminal:
```bash
./utils/cloudshell_install.sh
```
When the script finishes, verify the install:
```bash
maigret --version
```
## Usage examples
Run a basic search for a username. By default Maigret checks the **500 highest-ranked sites by traffic** — pass `-a` to scan the full 3,000+ database.
```bash
maigret soxoj
```
Search several usernames at once:
```bash
maigret user1 user2 user3
```
Narrow the run to sites related to cryptocurrency via the `crypto` tag (you can also use country tags):
```bash
maigret vitalik.eth --tags crypto
```
Generate reports in HTML, PDF, and XMind 8 formats:
```bash
maigret soxoj --html
maigret soxoj --pdf
maigret soxoj --xmind
```
Download a generated report from Cloud Shell to your local machine:
```bash
cloudshell download reports/report_soxoj.pdf
```
Tune reliability on flaky networks — raise the timeout and retry failed checks:
```bash
maigret soxoj --timeout 60 --retries 2
```
For the full list of options see `maigret --help` or the [CLI documentation](https://maigret.readthedocs.io/en/latest/command-line-options.html).
## Further reading
Full project documentation: [maigret.readthedocs.io](https://maigret.readthedocs.io/)
+173 -7
@@ -82,11 +82,74 @@ id types, sites will be filtered automatically.
ids. Useful for repeated scanning with found known irrelevant usernames.
``--db`` - Load Maigret database from a JSON file or an online, valid,
JSON file.
JSON file. See :ref:`custom-database` below.
``--no-autoupdate`` - Disable the automatic database update check that
runs at startup. The currently cached (or bundled) database is used
as-is.
``--force-update`` - Force a database update check at startup, ignoring
the usual check interval. Implies ``--no-autoupdate`` for the rest of
the run after the explicit update finishes.
``--retries RETRIES`` - Number of times to retry requests that failed
temporarily.
``--cloudflare-bypass`` *(experimental)* - Route checks for sites tagged
``protection: ["cf_js_challenge"]`` / ``["cf_firewall"]`` / ``["webgate"]``
through a local Chrome-based solver (FlareSolverr by default). The bypass
is opt-in — without this flag (or
``settings.cloudflare_bypass.enabled = true``) those sites are checked
the usual way, which Cloudflare almost always blocks: you get an UNKNOWN
status with a JS-challenge / firewall error rather than a real result.
Configure the backend in ``settings.cloudflare_bypass.modules``.
See :ref:`cloudflare-bypass`. **Experimental** — the flag, schema and
routing rules may change without backwards-compatibility guarantees.
.. _custom-database:
Using a custom sites database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``--db`` flag accepts three forms:
1. **HTTP(S) URL** — fetched as-is, e.g.
``--db https://example.com/my_db.json``.
2. **Local file path** — absolute (``--db /tmp/private.json``) or
relative to the current working directory
(``--db LLM/maigret_private_db.json``).
3. **Module-relative path** — kept for backwards compatibility, resolved
against the installed ``maigret/`` package directory (e.g. the
default ``resources/data.json``).
Resolution order for local paths: the path is first tried as given
(absolute or cwd-relative); if that file does not exist, Maigret falls
back to the legacy module-relative resolution. If neither location
contains the file, Maigret exits with an error rather than silently
loading the bundled database.
When ``--db`` points to a custom file, automatic database updates are
skipped — the file is used exactly as provided.
On every run Maigret prints the database it actually loaded, for
example::
[+] Using sites database: /path/to/maigret_private_db.json (6 sites)
If loading the requested database fails for any other reason (corrupt
JSON, missing required keys, …), Maigret prints a warning, falls back
to the bundled database, and reports the fallback explicitly::
[-] Falling back to bundled database: /…/maigret/resources/data.json
[+] Using sites database: /…/maigret/resources/data.json (3154 sites)
A typical invocation against a private database, with auto-update
disabled and all sites scanned, looks like::
python3 -m maigret username \
--db LLM/maigret_private_db.json \
--no-autoupdate -a
Reports
-------
@@ -106,6 +169,17 @@ username).
``-J``, ``--json`` - Generate a JSON report of specific type: simple,
ndjson (one report per username). E.g. ``--json ndjson``
``-M``, ``--md`` - Generate a Markdown report (general report on all
usernames). See :ref:`markdown-report` below.
``--ai`` - Run an AI-powered analysis of the search results using an
OpenAI-compatible chat completion API. The internal Markdown report is
sent to the model, which returns a short investigation summary that is
streamed to the terminal. See :ref:`ai-analysis` below.
``--ai-model`` - Model name to use with ``--ai``. Defaults to
``openai_model`` from settings (``gpt-4o`` out of the box).
``-fo``, ``--folderoutput`` - Results will be saved to this folder,
``results`` by default. The folder is created if it doesn't exist.
@@ -130,16 +204,108 @@ Other operations modes
``--version`` - Display version information and dependencies.
``--self-check`` - Do self-checking for sites and database and disable
non-working ones **for current search session** by default. It's useful
for testing new internet connection (it depends on provider/hosting on
which sites there will be censorship stub or captcha display). After
checking Maigret asks if you want to save updates, answering y/Y will
rewrite the local database.
``--self-check`` - Do self-checking for sites and database. Each site is
tested by looking up its known-claimed and known-unclaimed usernames and
verifying that the results match expectations. Individual site failures
(network errors, unexpected exceptions, etc.) are caught and logged
without stopping the overall process, so the check always runs to
completion. After checking, Maigret reports a summary of issues found.
If any sites were disabled (see ``--auto-disable``), Maigret asks if you
want to save updates; answering y/Y will rewrite the local database.
``--auto-disable`` - Used with ``--self-check``: automatically disable
sites that fail checks (incorrect detection of claimed/unclaimed
usernames, connection errors, or unexpected exceptions). Without this
flag, ``--self-check`` only **reports** issues without modifying the
database.
``--diagnose`` - Used with ``--self-check``: print detailed diagnosis
information for each failing site, including the check type, the list
of issues found, and recommendations (e.g. suggesting a different
``checkType``).
``--submit URL`` - Do an automatic analysis of the given account URL or
site main page URL to determine the site engine and methods to check
account presence. After checking Maigret asks if you want to add the
site, answering y/Y will rewrite the local database.
.. _markdown-report:
Markdown report (LLM-friendly)
------------------------------
The ``--md`` / ``-M`` flag generates a Markdown report designed for both human reading and analysis by AI assistants (ChatGPT, Claude, etc.).
.. code-block:: console
maigret username --md
The report includes:
- **Summary** with aggregated personal data (all fullnames, locations, bios found across accounts), country tags, website tags, first/last seen timestamps.
- **Per-account sections** with profile URL, site tags, and all extracted fields (username, bio, follower count, linked accounts, etc.).
- **Possible false positives** disclaimer explaining that accounts may belong to different people.
- **Ethical use** notice about applicable data protection laws.
**Using with AI tools:**
The Markdown format is optimized for LLM context windows. You can feed the report directly to an AI assistant for follow-up analysis:
.. code-block:: console
# Generate the report
maigret johndoe --md
# Feed it to an AI tool
cat reports/report_johndoe.md | llm "Analyze this OSINT report and summarize key findings"
The structured Markdown with per-site sections makes it easy for AI tools to extract relationships, cross-reference identities, and identify patterns across accounts.
For a built-in alternative that calls the model for you and prints the
summary directly, see :ref:`ai-analysis` below.
.. _ai-analysis:
AI analysis (built-in)
----------------------
The ``--ai`` flag turns the search results into a short investigation
summary by sending the internal Markdown report to an OpenAI-compatible
chat completion API and streaming the model's reply to the terminal.
.. code-block:: console
export OPENAI_API_KEY=sk-...
maigret username --ai
# use a smaller / cheaper model
maigret username --ai --ai-model gpt-4o-mini
While ``--ai`` is active, per-site progress lines and the short text
report at the end are suppressed so the streamed summary is the main
output. The Markdown report itself is built in memory and is **not**
written to disk by ``--ai`` alone — combine with ``--md`` if you also
want the file on disk.
The summary follows a fixed format with sections for the most likely
real name, location, occupation, interests, languages, main website,
username variants, number of platforms, active years, a confidence
rating, and a short list of follow-up leads. The model is instructed
to rely only on what is supported by the report and to avoid mixing
clearly unrelated profiles into the main identity.
**Configuration.** The API key is resolved from
``settings.openai_api_key`` first, then from the ``OPENAI_API_KEY``
environment variable. The endpoint defaults to
``https://api.openai.com/v1`` and can be redirected to any
OpenAI-compatible service (Azure OpenAI, OpenRouter, a local server,
…) by setting ``openai_api_base_url`` in ``settings.json``. See
:ref:`settings` for the full list of options.
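The lookup order described above could look roughly like this. It is a minimal sketch, not the code Maigret ships: ``resolve_api_key`` and ``resolve_base_url`` are hypothetical helpers, and treating an empty settings value as "unset" is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_api_key(settings: Mapping[str, str],
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical: the settings value wins, the env variable is the fallback."""
    return settings.get("openai_api_key") or environ.get("OPENAI_API_KEY")


def resolve_base_url(settings: Mapping[str, str]) -> str:
    """Hypothetical: use the configured endpoint, else the documented default."""
    return settings.get("openai_api_base_url") or DEFAULT_BASE_URL
```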
.. note::
``--ai`` makes a network request to the configured chat completion
endpoint and sends the full Markdown report (which contains the
gathered profile data). Use it only with providers and accounts
you trust with that data.
+58 -2
@@ -69,6 +69,21 @@ Use the following commands to check Maigret:
make speed
Site naming conventions
-----------------------------------------------
Site names are the keys in ``data.json`` and appear in user-facing reports. Follow these rules:
- **Title Case** by default: ``Product Hunt``, ``Hacker News``.
- **Lowercase** only if the brand itself is written that way: ``kofi``, ``note``, ``hi5``.
- **No domain suffix** (``calendly.com`` → ``Calendly``), unless the domain is part of the recognized brand name: ``last.fm``, ``VC.ru``, ``Archive.org``.
- **No full UPPERCASE** unless the brand is an acronym: ``VK``, ``CNET``, ``ICQ``, ``IFTTT``.
- **No** ``www.`` **or** ``https://`` **prefix** in the name.
- **Spaces** are allowed when the brand uses them: ``Star Citizen``, ``Google Maps``.
- **{username} templates** in names are acceptable: ``{username}.tilda.ws``.
When in doubt, check how the service refers to itself on its homepage.
How to fix false-positives
-----------------------------------------------
@@ -81,7 +96,7 @@ You should make your git commits from your maigret git repo folder, or else the
If you already know which site has a false-positive and want to fix it specifically, go to the next step.
Otherwise, simply run a search with a random username (e.g. `laiuhi3h4gi3u4hgt`) and check the results.
Alternatively, you can use `the Telegram bot <https://t.me/osint_maigret_bot>`_.
Alternatively, you can use `the Telegram bot <https://t.me/maigret_search_bot>`_.
2. Open the account link in your browser and check:
@@ -122,6 +137,47 @@ There are few options for sites data.json helpful in various cases:
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - set the HTTP method to use (e.g., ``POST``). By default, Maigret uses GET or HEAD.
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), extremely useful for parsing GraphQL or modern JSON APIs.
- ``protection`` - a list of protection types detected on the site (see below).
``protection`` (site protection tracking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``protection`` field records what kind of anti-bot protection a site uses. Maigret reads this field and automatically applies the appropriate bypass mechanism where one exists.
Two categories of tag:
- **Load-bearing.** Maigret changes its HTTP client or headers based on the tag. Currently only ``tls_fingerprint`` (switches to ``curl_cffi`` with Chrome-class TLS).
- **Documentation-only.** Maigret does **not** change behavior based on the tag; it records *why* the site is hard so a future solver can target the right set of sites without re-auditing.
Within the documentation-only tags, there is a further split that dictates whether the site is ``disabled: true``:
- ``ip_reputation`` is the **only** doc-tag that **keeps the site enabled**. It means "works for most users, fails from datacenter/cloud IPs." Disabling would silently hide a working site from anyone with a clean IP. The fix is **external** to Maigret (residential IP or ``--proxy``).
- ``cf_js_challenge``, ``cf_firewall``, ``aws_waf_js_challenge``, ``ddos_guard_challenge``, ``custom_bot_protection``, ``js_challenge`` all pair with ``disabled: true``. They mean "does not work for anyone right now"; the tag identifies the provider so that when a bypass ships, every site with that tag can be re-enabled in one pass.
Supported values:
- ``tls_fingerprint`` *(load-bearing; site stays enabled)* — the site fingerprints the TLS handshake (JA3/JA4) and blocks non-browser clients. Maigret automatically uses ``curl_cffi`` with Chrome browser emulation to bypass this. Requires the ``curl_cffi`` package (included as a dependency). Examples: Instagram, NPM, Codepen, Kickstarter, Letterboxd.
- ``ip_reputation`` *(documentation-only; site stays enabled)* — the site blocks requests from datacenter/cloud IPs regardless of headers or TLS. Cannot be bypassed automatically; run Maigret from a regular internet connection (not a datacenter) or use a proxy (``--proxy``). The site is **not** marked ``disabled`` because it continues to work for users on residential IPs. Examples: Reddit, Patreon, Figma, OnlyFans.
- ``cf_js_challenge`` *(documentation-only; pair with ``disabled: true``)* — Cloudflare Managed Challenge / Turnstile JS challenge. Symptom: HTTP 403 with ``cf-mitigated: challenge`` header; body contains ``challenges.cloudflare.com``, ``_cf_chl_opt``, ``window._cf_chl``, or "Just a moment". Not bypassable via ``curl_cffi`` TLS impersonation (verified across Chrome 123/124/131, Safari 17/18, Firefox 133/135, Edge 101 — all return the same 403 challenge page); a real browser executing the challenge JS is required to obtain the clearance cookie. Sites stay ``disabled: true`` until a CF-challenge solver is integrated. Examples: DMOJ, Elakiri, Fanlore, Bdoutdoors, TheStudentRoom, forum.hr.
- ``cf_firewall`` *(documentation-only; pair with ``disabled: true``)* — Cloudflare firewall rule / bot score block (WAF action=block, **not** action=challenge). Symptom: HTTP 403 served by Cloudflare (``server: cloudflare``, ``cf-ray`` header) **without** JS-challenge markers — body typically shows "Access denied", "Attention Required", or just a bare 1015/1016/1020 error page. Unlike ``ip_reputation``, residential IPs are **not** sufficient to bypass — Cloudflare decides based on a composite of bot score, TLS fingerprint, UA, ASN, and custom site-owner rules, so ``curl_cffi`` Chrome impersonation from a residential line still returns 403. Sites stay ``disabled: true`` until a per-site bypass (cookies, real browser, or residential+clean session) is found. Examples: Fark, Fodors, Huntingnet, Hunttalk.
- ``aws_waf_js_challenge`` *(documentation-only; pair with ``disabled: true``)* — the site is protected by AWS WAF with a JavaScript challenge. Symptom: HTTP 202 with empty body and ``x-amzn-waf-action: challenge`` header (a token-granting challenge that requires executing the CAPTCHA/challenge JS bundle). Neither ``curl_cffi`` TLS impersonation nor User-Agent changes bypass this — a real browser or the official AWS WAF challenge-solver SDK is required. Sites stay ``disabled: true`` until a solver is integrated. Example: Dreamwidth.
- ``ddos_guard_challenge`` *(documentation-only; pair with ``disabled: true``)* — DDoS-Guard (ddos-guard.net) anti-bot page. Symptom: HTTP 403 with ``server: ddos-guard`` header; body contains "DDoS-Guard". DDoS-Guard fingerprints different UAs per source IP, so a single User-Agent override does not work across environments; a JS-capable bypass or DDoS-Guard-aware solver is required. Sites stay ``disabled: true`` until a solver is integrated. Example: ForumHouse.
- ``js_challenge`` *(documentation-only; pair with ``disabled: true``)* - **fallback** for JavaScript-challenge systems whose provider cannot be identified (custom in-house challenge pages that are not Cloudflare, AWS WAF, or any other recognized vendor). Prefer a provider-specific tag whenever the provider can be pinned down from response headers or body signatures.
- ``custom_bot_protection`` *(documentation-only; pair with ``disabled: true``)* - **fallback** for non-JS-challenge bot protection served by a custom/in-house system (not Cloudflare, not AWS WAF, not DDoS-Guard). Typical symptom: HTTP 403 from the site's own origin server (``server: nginx``, AWS ELB, etc.) with a branded block page, returned regardless of TLS fingerprint or residential IP. Not generically bypassable; investigate per site (cookies, session, proxy geography). Examples: Hackerearth ("HackerEarth Guardian"), FreelanceJob (nginx-level block).
**Rule: prefer provider-specific protection tags.** When a site is blocked by an identifiable anti-bot vendor, always record the vendor in the tag (``cf_js_challenge``, ``cf_firewall``, ``aws_waf_js_challenge``, ``ddos_guard_challenge``, and future additions such as ``sucuri_challenge``, ``incapsula_challenge``). The generic ``js_challenge`` and ``custom_bot_protection`` tags are reserved for custom/unknown systems. Rationale: bypass solvers are inherently provider-specific (a Cloudflare Turnstile solver does not help with AWS WAF); recording the provider in advance lets us fan out fixes the moment a per-provider solver is added, without re-auditing every disabled site. The same principle applies to other protection categories when the provider is identifiable.
Example:
.. code-block:: json
"Instagram": {
"url": "https://www.instagram.com/{username}/",
"checkType": "message",
"presenseStrs": ["\"routePath\":\"\\/"],
"absenceStrs": ["\"routePath\":null"],
"protection": ["tls_fingerprint"]
}
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -282,7 +338,7 @@ Documentations is auto-generated and auto-deployed from the ``docs`` directory.
To manually update documentation:
1. Change something in the ``.rst`` files in the ``docs/source`` directory.
2. Install ``pip install -r requirements.txt`` in the docs directory.
2. Run ``python -m pip install -e .`` in the docs directory.
3. Run ``make singlehtml`` in the terminal in the docs directory.
4. Open ``build/singlehtml/index.html`` in your browser to see the result.
5. If everything is ok, commit and push your changes to GitHub.
+111
@@ -147,6 +147,33 @@ Also, there is a short text report in the CLI output after the end of a searchin
.. warning::
XMind 8 mindmaps are incompatible with XMind 2022!
AI analysis
-----------
Maigret can produce a short, human-readable investigation summary on top
of the raw search results using the ``--ai`` flag. It builds the
internal Markdown report, sends it to an OpenAI-compatible chat
completion endpoint, and streams the model's reply directly to the
terminal.
.. code-block:: console
export OPENAI_API_KEY=sk-...
maigret username --ai
The summary uses a fixed format with the most likely real name,
location, occupation, interests, languages, main website, username
variants, number of platforms, active years, a confidence rating, and a
short list of follow-up leads. While ``--ai`` is active, per-site
progress and the short text report are suppressed so the streamed
summary is the main output.
The endpoint, model, and API key are configured via ``settings.json``
(``openai_api_key``, ``openai_model``, ``openai_api_base_url``) or the
``OPENAI_API_KEY`` environment variable. Any OpenAI-compatible API can
be used (Azure OpenAI, OpenRouter, a local server, …). See
:ref:`ai-analysis` and :ref:`settings` for details.
Tags
----
@@ -170,6 +197,35 @@ Maigret will do retries of the requests with temporary errors got (connection fa
One attempt by default, can be changed with option ``--retries N``.
Database self-check
-------------------
Maigret includes a self-check mode (``--self-check``) that validates every site
in the database by looking up its known-claimed and known-unclaimed usernames
and verifying that the detection results match expectations.
The self-check is **error-resilient**: if an individual site check raises an
unexpected exception (e.g. a network error or a parsing failure), the error is
caught, logged, and recorded as an issue — the remaining sites continue to be
checked without interruption. This means the process always runs to completion,
even when checking hundreds of sites with ``-a --self-check``.
Use ``--auto-disable`` together with ``--self-check`` to automatically disable
sites that fail checks. Without it, issues are only reported. Use ``--diagnose``
to print detailed per-site diagnosis including the check type, specific issues,
and recommendations.
.. code-block:: console
# Report-only mode (no changes to the database)
maigret --self-check
# Automatically disable failing sites and save updates
maigret -a --self-check --auto-disable
# Show detailed diagnosis for each failing site
maigret -a --self-check --diagnose
Archives and mirrors checking
-----------------------------
@@ -181,6 +237,61 @@ The Maigret database contains not only the original websites, but also mirrors,
It allows getting additional info about the person and checking the existence of the account even if the main site is unavailable (bot protection, captcha, etc.)
.. _cloudflare-bypass:
Cloudflare webgate bypass
-------------------------
.. warning::
**Experimental feature.** The Cloudflare webgate is under active
development. The configuration schema, CLI flag behaviour, and the set
of sites that route through it may change without backwards-compatibility
guarantees. Expect rough edges (CF rate limits, occasional solver
failures) and report issues so they can be ironed out.
Some sites sit behind a full Cloudflare JavaScript challenge or a CF firewall
hard block — these are tagged ``protection: ["cf_js_challenge"]`` or
``protection: ["cf_firewall"]`` in the database and are normally kept disabled
because neither aiohttp nor curl_cffi can solve the JS challenge on their own.
Maigret can offload these checks to a local Chrome-based solver. Two backends
are supported, configured in ``settings.json`` under
``cloudflare_bypass.modules`` (the first reachable module wins; subsequent
ones are tried as a fallback chain):
* **FlareSolverr** (recommended). Runs a real Chrome instance and exposes a
JSON API. The upstream HTTP status, headers and final URL are preserved, so
``checkType: status_code`` and ``checkType: response_url`` keep working
through the bypass.
.. code-block:: console
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
* **CloudflareBypassForScraping** (legacy fallback). Returns rendered HTML
only, so the upstream status code is lost — ``checkType: message`` keeps
working but ``status_code`` checks misfire (treated as 200 on success).
Activate the bypass either with the CLI flag::
maigret --cloudflare-bypass <username>
or by setting ``cloudflare_bypass.enabled`` to ``true`` in ``settings.json``.
The bypass only fires for sites whose ``protection`` field intersects
``cloudflare_bypass.trigger_protection`` (default
``["cf_js_challenge", "cf_firewall", "webgate"]``); all other sites use the
normal aiohttp / curl_cffi path.
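The routing rule above boils down to a set intersection. A minimal sketch, assuming the documented defaults; ``needs_bypass`` is a hypothetical name, not a function from Maigret's code base:

```python
from typing import Iterable

# Default taken from the documented cloudflare_bypass.trigger_protection value.
DEFAULT_TRIGGER_PROTECTION = ("cf_js_challenge", "cf_firewall", "webgate")


def needs_bypass(site_protection: Iterable[str],
                 trigger_protection: Iterable[str] = DEFAULT_TRIGGER_PROTECTION,
                 enabled: bool = False) -> bool:
    """Hypothetical check: route through the solver only when the bypass is
    enabled and the site's protection tags intersect the trigger list."""
    if not enabled:
        return False
    return bool(set(site_protection) & set(trigger_protection))
```

Under this sketch a site tagged only ``tls_fingerprint`` or ``ip_reputation`` never reaches the solver, even when the bypass is enabled.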
If all configured modules are unreachable, affected sites get an UNKNOWN
status with an actionable error pointing at the first module's URL — the
fix is almost always to start the FlareSolverr container.
FlareSolverr session reuse is automatic: Maigret pins a single
``session: <session_prefix>-<pid>`` per run, so cf_clearance cookies are
shared between checks of the same domain (5-10× faster on subsequent
requests to that host).
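To make the session naming concrete, here is a rough sketch of a request body for FlareSolverr's ``request.get`` command. The helper ``build_flaresolverr_payload`` is hypothetical, and how Maigret assembles the request internally is an assumption; only the ``<session_prefix>-<pid>`` pattern comes from the text above.

```python
import os
from typing import Any, Dict, Optional


def build_flaresolverr_payload(url: str,
                               session_prefix: str = "maigret",
                               max_timeout_ms: int = 60000,
                               pid: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON body for a FlareSolverr request."""
    pid = os.getpid() if pid is None else pid
    return {
        "cmd": "request.get",
        "url": url,
        # One session per process, so clearance cookies are reused across
        # checks in the same run but never shared between concurrent runs.
        "session": f"{session_prefix}-{pid}",
        "maxTimeout": max_timeout_ms,
    }
```

The resulting dict would be sent as JSON in a POST to the module's ``url`` (``http://localhost:8191/v1`` in the default configuration).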
Activation
----------
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
+8
@@ -29,6 +29,7 @@ You may be interested in:
- :doc:`Usage examples <usage-examples>`
- :doc:`Command line options <command-line-options>`
- :doc:`Features list <features>`
- :doc:`Library usage <library-usage>`
.. toctree::
:hidden:
@@ -39,8 +40,15 @@ You may be interested in:
usage-examples
command-line-options
features
library-usage
philosophy
supported-identifier-types
tags
settings
development
.. toctree::
:hidden:
:caption: Use cases
use-cases/crypto
+60 -3
@@ -4,7 +4,7 @@ Installation
============
Maigret can be installed using pip, Docker, or simply can be launched from the cloned repo.
Also, it is available online via `official Telegram bot <https://t.me/osint_maigret_bot>`_,
Also, it is available online via `official Telegram bot <https://t.me/maigret_search_bot>`_,
source code of a bot is `available on GitHub <https://github.com/soxoj/maigret-tg-bot>`_.
Windows Standalone EXE-binaries
@@ -45,8 +45,7 @@ Press one of the buttons below and follow the instructions to launch it in your
Local installation from PyPi
----------------------------
Please note that the sites database in the PyPI package may be outdated.
If you encounter frequent false positive results, we recommend installing the latest development version from GitHub instead.
Maigret ships with a bundled site database. After installation from PyPI (or any other method), it can **automatically fetch a newer compatible database from GitHub** when you run it—see :ref:`database-auto-update` in :doc:`settings`.
.. note::
Python 3.10 or higher and pip are required, **Python 3.11 is recommended.**
@@ -90,3 +89,61 @@ Docker
# manual build
docker build -t maigret .
Troubleshooting
---------------
If you encounter build errors during installation such as ``cannot find ft2build.h``
or errors related to ``reportlab`` / ``_renderPM``, you need to install system-level
dependencies required to compile native extensions.
**Debian/Ubuntu/Kali:**
.. code-block:: bash
sudo apt install -y libfreetype6-dev libjpeg-dev libffi-dev
**Fedora/RHEL/CentOS:**
.. code-block:: bash
sudo dnf install -y freetype-devel libjpeg-devel libffi-devel
**Arch Linux:**
.. code-block:: bash
sudo pacman -S freetype2 libjpeg-turbo libffi
**macOS (Homebrew):**
.. code-block:: bash
brew install freetype
After installing the system dependencies, retry the maigret installation.
If you continue to have issues, consider using Docker instead, which includes all
necessary dependencies.
Optional: Cloudflare bypass solver
----------------------------------
.. warning::
**Experimental.** The Cloudflare webgate is under active development;
the configuration schema and CLI behaviour may change without
backwards-compatibility guarantees.
Sites tagged ``cf_js_challenge`` / ``cf_firewall`` need a real browser to pass
their JavaScript challenge. To check those sites you can run a local
`FlareSolverr <https://github.com/FlareSolverr/FlareSolverr>`_ instance —
Maigret will route protected checks to it when ``--cloudflare-bypass`` is set:
.. code-block:: bash
docker run -d -p 8191:8191 --name flaresolverr ghcr.io/flaresolverr/flaresolverr:latest
This is **optional** — Maigret runs without it; only sites whose
``protection`` field intersects ``settings.cloudflare_bypass.trigger_protection``
require the solver. See :ref:`cloudflare-bypass` for details.
+139
@@ -0,0 +1,139 @@
.. _library-usage:
Library usage
=============
Maigret's CLI is a thin wrapper around an async Python API. You can embed Maigret in your own tools, pipelines, and OSINT workflows — no need to shell out.
This page covers the common patterns. For the full argument list of the underlying function, see ``maigret.checking.maigret`` in the source.
Installation
------------
.. code-block:: bash
pip install maigret
Minimal example
---------------
A working end-to-end search against the top 500 sites:
.. code-block:: python
    import asyncio
    import logging

    from maigret import search as maigret_search
    from maigret.sites import MaigretDatabase

    # Load the bundled site database
    db = MaigretDatabase().load_from_path(
        "maigret/resources/data.json"
    )

    # Pick which sites to scan (same filtering the CLI uses)
    sites = db.ranked_sites_dict(top=500)

    results = asyncio.run(
        maigret_search(
            username="soxoj",
            site_dict=sites,
            logger=logging.getLogger("maigret"),
            timeout=30,
            is_parsing_enabled=True,
        )
    )

    for site_name, result in results.items():
        if result["status"].is_found():
            print(site_name, result["url_user"])
Key points:
- ``maigret_search`` is an ``async`` function — wrap it with ``asyncio.run(...)`` or ``await`` it from inside your own event loop.
- ``is_parsing_enabled=True`` turns on ``socid_extractor`` so ``result["ids_data"]`` is populated with profile fields (bio, linked accounts, uids, etc.).
- Each entry in the returned dict has a ``"status"`` object with ``is_found()``, plus ``url_user``, ``http_status``, ``rank``, ``ids_data``, and more.
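Building on the key points above, a small post-processing step might look like this. It is a sketch under the assumption that each result is a dict with the ``status``, ``url_user`` and ``ids_data`` keys listed above; ``summarize_found`` is a hypothetical helper, not part of Maigret.

```python
from typing import Any, Dict, List


def summarize_found(results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only found accounts and their parsed fields."""
    found = []
    for site_name, result in results.items():
        status = result.get("status")
        # Skip entries without a status object or with a negative result.
        if status is None or not status.is_found():
            continue
        found.append({
            "site": site_name,
            "url": result.get("url_user"),
            # ids_data is only populated when parsing is enabled.
            "ids": result.get("ids_data") or {},
        })
    return found
```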
Filtering sites
---------------
``ranked_sites_dict`` accepts the same filters as the CLI:
.. code-block:: python
# All sites tagged as coding, top 200 by rank
sites = db.ranked_sites_dict(top=200, tags=["coding"])
# Exclude NSFW and dating sites
sites = db.ranked_sites_dict(excluded_tags=["nsfw", "dating"])
# Only specific sites by name
sites = db.ranked_sites_dict(names=["GitHub", "Reddit", "VK"])
# Include disabled sites (useful for maintenance / self-check)
sites = db.ranked_sites_dict(disabled=True)
Running inside an existing event loop
-------------------------------------
If your application already runs an asyncio loop (FastAPI, aiohttp server, a Discord bot, etc.), ``await`` ``maigret_search`` directly instead of calling ``asyncio.run``:
.. code-block:: python
    async def check_username(username: str) -> dict:
        results = await maigret_search(
            username=username,
            site_dict=sites,
            logger=logger,
            timeout=30,
        )
        return {
            name: r["url_user"]
            for name, r in results.items()
            if r["status"].is_found()
        }
Routing through a proxy
-----------------------
The same proxy / Tor / I2P flags the CLI exposes are plain keyword arguments:
.. code-block:: python
    results = await maigret_search(
        username="soxoj",
        site_dict=sites,
        logger=logger,
        proxy="socks5://127.0.0.1:1080",
        tor_proxy="socks5://127.0.0.1:9050",  # used for .onion sites
        i2p_proxy="http://127.0.0.1:4444",  # used for .i2p sites
        timeout=30,
    )
Full function signature
-----------------------
.. code-block:: python
    async def maigret(
        username: str,
        site_dict: Dict[str, MaigretSite],
        logger,
        query_notify=None,
        proxy=None,
        tor_proxy=None,
        i2p_proxy=None,
        timeout=30,
        is_parsing_enabled=False,
        id_type="username",
        debug=False,
        forced=False,
        max_connections=100,
        no_progressbar=False,
        cookies=None,
        retries=0,
        check_domains=False,
    ) -> QueryResultWrapper
See :doc:`command-line-options` for a description of each option — the semantics match the CLI flags one-to-one.
+24
@@ -3,6 +3,10 @@
Philosophy
==========
*The Commissioner Jules Maigret is a fictional French police detective, created by Georges Simenon.
His investigation method is based on understanding the personality of different people and their
interactions.*
TL;DR: Username => Dossier
Maigret is designed to gather all the available information about a person by their username.
@@ -15,3 +19,23 @@ All this information forms some dossier, but it also useful for other tools and
Each collected piece of data has a label of a certain format (for example, ``follower_count`` for the number
of subscribers or ``created_at`` for account creation time) so that it can be parsed and analyzed by various
systems and stored in databases.
Origins
-------
Maigret started from studying what OSINT investigators actually use in practice — and from
the realization that many popular tools do not deliver real investigative value. The original
research behind this observation is summarized in the article
`What's wrong with namecheckers <https://soxoj.medium.com/whats-wrong-with-namecheckers-981e5cba600e>`_.
For a broader landscape of username-checking tools, see the curated
`OSINT namecheckers list <https://github.com/soxoj/osint-namecheckers-list>`_.
Two ideas grew out of that research:
- `socid-extractor <https://github.com/soxoj/socid-extractor>`_ — a library focused on pulling
structured identity data (user IDs, full names, linked accounts, bios, timestamps, etc.) out of
account pages and public API responses, so that finding an account is not the end of the pipeline.
- **Maigret** itself — which started as a fork of
`Sherlock <https://github.com/sherlock-project/sherlock>`_ but has long since outgrown the
original project in coverage, extraction depth, and check reliability. Today Maigret is used
as a component by major OSINT vendors in their commercial products.
+211
@@ -27,3 +27,214 @@ Missing any of these files is not an error.
If a later settings file contains an option that is already set,
the earlier value is overwritten. This makes it possible to keep
custom configurations for different users and directories.
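The override behaviour can be pictured as a simple layered merge. This is a sketch only: ``merge_settings`` is a hypothetical helper, and the shallow (top-level) merge is an assumption; nested blocks may be handled differently in Maigret itself.

```python
from typing import Any, Dict, Iterable


def merge_settings(layers: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical: apply settings files in order, later values win."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        # A key seen again in a later file replaces the earlier value.
        merged.update(layer)
    return merged
```

For example, merging ``{"timeout": 30, "retries": 0}`` with a later ``{"timeout": 10}`` keeps ``retries`` and replaces ``timeout``.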
.. _database-auto-update:
Database auto-update
--------------------
Maigret ships with a bundled site database, but it gets outdated between releases. To keep the database current, Maigret automatically checks for updates on startup.
**How it works:**
1. On startup, Maigret checks if more than 24 hours have passed since the last update check.
2. If so, it fetches a lightweight metadata file (~200 bytes) from GitHub to see if a newer database is available.
3. If a newer, compatible database exists, Maigret downloads it to ``~/.maigret/data.json`` and uses it instead of the bundled copy.
4. If the download fails or the new database is incompatible with your Maigret version, the bundled database is used as a fallback.
The downloaded database has **higher priority** than the bundled one — it replaces, not overlays.
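Step 1 of the list above is a plain interval check. A minimal sketch, assuming that a missing timestamp (first run) counts as "due"; ``is_update_check_due`` is a hypothetical name, and the ``force`` argument only models the documented ``--force-update`` behaviour of ignoring the interval.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_update_check_due(last_check: Optional[datetime],
                        now: datetime,
                        autoupdate_check_interval_hours: int = 24,
                        force: bool = False) -> bool:
    """Hypothetical: decide whether to fetch the update metadata file."""
    if force:
        return True
    if last_check is None:
        # Assumption: no recorded check means we should check now.
        return True
    interval = timedelta(hours=autoupdate_check_interval_hours)
    return now - last_check > interval
```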
**Status messages** are printed only when an action occurs:
.. code-block:: text
[*] DB auto-update: checking for updates...
[+] DB auto-update: database updated successfully (3180 sites)
[*] DB auto-update: database is up to date (3157 sites)
[!] DB auto-update: latest database requires maigret >= 0.6.0, you have 0.5.0
**Forcing an update:**
Use the ``--force-update`` flag to check for updates immediately, ignoring the check interval:
.. code-block:: console
maigret username --force-update
The update happens at startup, then the search continues normally with the freshly downloaded database.
**Disabling auto-update:**
Use the ``--no-autoupdate`` flag to skip the update check entirely:
.. code-block:: console
maigret username --no-autoupdate
Or set it permanently in ``~/.maigret/settings.json``:
.. code-block:: json
{
"no_autoupdate": true
}
This is recommended for **Docker containers**, **CI pipelines**, and **air-gapped environments**.
**Configuration options** (in ``settings.json``):
.. list-table::
:header-rows: 1
:widths: 35 15 50
* - Setting
- Default
- Description
* - ``no_autoupdate``
- ``false``
- Disable auto-update entirely
* - ``autoupdate_check_interval_hours``
- ``24``
- How often to check for updates (in hours)
* - ``db_update_meta_url``
- GitHub raw URL
- URL of the metadata file (for custom mirrors)
**Using a custom database** with ``--db`` always skips auto-update — you are explicitly choosing your data source.
Cloudflare webgate
------------------
.. warning::
**Experimental.** The ``cloudflare_bypass`` block is under active
development; field names, defaults, and the trigger-protection routing
rules may change without backwards-compatibility guarantees.
The ``cloudflare_bypass`` block in ``settings.json`` configures the optional
bypass described in :ref:`cloudflare-bypass`. Default value:
.. code-block:: json
{
"cloudflare_bypass": {
"enabled": false,
"session_prefix": "maigret",
"trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
"modules": [
{
"name": "flaresolverr",
"method": "json_api",
"url": "http://localhost:8191/v1",
"max_timeout_ms": 60000
},
{
"name": "chrome_webgate",
"method": "url_rewrite",
"url": "http://localhost:8000/html?url={url}&retries=1"
}
]
}
}
**Fields.**
.. list-table::
:header-rows: 1
:widths: 30 70
* - Field
- Description
* - ``enabled``
- When ``true``, the bypass is active for every run; when ``false``
(the default), it activates only on ``--cloudflare-bypass``.
* - ``trigger_protection``
- List of ``site.protection`` values that route a check through the
webgate. Sites whose protection is empty or doesn't intersect this
list use the default (aiohttp / curl_cffi) checker.
* - ``session_prefix``
- Prefix for the FlareSolverr ``session`` field. Maigret appends the
process PID so concurrent runs don't collide. Reusing a session
caches cf_clearance between checks of the same domain.
* - ``modules``
- Ordered list of backend modules. The first reachable module
handles the check; later ones serve as a fallback chain.
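The ``trigger_protection`` rule amounts to a set intersection. The sketch below is illustrative only: ``should_use_webgate`` is a hypothetical name, and it assumes ``site.protection`` is available as a list of strings.

```python
from typing import Iterable, Optional


def should_use_webgate(
    site_protection: Optional[Iterable[str]],
    trigger_protection: Iterable[str],
) -> bool:
    """Hypothetical helper: True when any site protection value is a trigger."""
    # Empty or missing protection never intersects, so the default checker is used.
    return bool(set(site_protection or ()) & set(trigger_protection))


triggers = ["cf_js_challenge", "cf_firewall", "webgate"]
print(should_use_webgate(["cf_firewall"], triggers))
print(should_use_webgate(["ip_reputation"], triggers))
print(should_use_webgate([], triggers))
```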
**Module methods.**
* ``json_api`` — FlareSolverr-compatible POST endpoint at ``url``.
Preserves real upstream HTTP status, headers and final URL.
Optional ``max_timeout_ms`` (default ``60000``) is the per-request
budget the solver is allowed to spend on the JS challenge.
* ``url_rewrite`` — legacy CloudflareBypassForScraping endpoint. The
``url`` must contain a ``{url}`` placeholder; the original probe URL
is URL-encoded and substituted in. Returns rendered HTML only —
``checkType: status_code`` and ``response_url`` checks misfire under
this method (treated as a synthetic HTTP 200 on success).
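As a rough illustration of the ``url_rewrite`` substitution, the sketch below fills the ``{url}`` placeholder with a percent-encoded probe URL. The helper name ``rewrite_probe_url`` is hypothetical, and encoding every reserved character (``safe=""``) is an assumption; the encoding Maigret applies may be less strict.

```python
from urllib.parse import quote


def rewrite_probe_url(template: str, probe_url: str) -> str:
    """Hypothetical helper: substitute an encoded probe URL into the template."""
    if "{url}" not in template:
        raise ValueError("url_rewrite template must contain a {url} placeholder")
    # Assumption: encode every reserved character so that the probe URL
    # survives as a single query-string value.
    return template.replace("{url}", quote(probe_url, safe=""))


template = "http://localhost:8000/html?url={url}&retries=1"
print(rewrite_probe_url(template, "https://example.com/user/alice"))
```

Using ``str.replace`` rather than ``str.format`` avoids errors if the template happens to contain other braces.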
**Optional ``proxy`` field (``json_api`` only).**
A module may carry a ``proxy`` entry that the solver routes the upstream
request through. Useful when a site enforces ``ip_reputation`` rules
that block the solver host. Two forms are accepted:
.. code-block:: json
{ "proxy": "socks5://localhost:1080" }
.. code-block:: json
{ "proxy": { "url": "http://gw.example:3128",
"username": "u",
"password": "p" } }
Only ``url``/``username``/``password`` are forwarded; other keys are
dropped. Cloudflare ``Error 1015 / 1020`` responses indicate the IP is
rate-limited or banned — switch the proxy rather than retrying.
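The two accepted ``proxy`` forms can be normalized into one shape before they are forwarded to the solver. The sketch below follows the filtering rule described above; ``normalize_proxy`` is a hypothetical standalone name used for illustration, not a documented Maigret function.

```python
from typing import Any, Dict, Optional

FORWARDED_KEYS = ("url", "username", "password")


def normalize_proxy(proxy: Any) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: turn either proxy form into a solver-ready dict."""
    if isinstance(proxy, str) and proxy:
        # Short form: a bare proxy URL.
        return {"url": proxy}
    if isinstance(proxy, dict) and proxy.get("url"):
        # Long form: keep only the forwarded keys, drop everything else.
        return {k: v for k, v in proxy.items() if k in FORWARDED_KEYS}
    # Missing, empty, or malformed entries result in no proxy at all.
    return None


print(normalize_proxy("socks5://localhost:1080"))
print(normalize_proxy({"url": "http://gw.example:3128", "username": "u",
                       "password": "p", "region": "eu"}))
```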
.. _ai-analysis-settings:
AI analysis
-----------
The ``--ai`` flag (see :ref:`ai-analysis`) talks to an OpenAI-compatible
chat completion API. Three settings control how that request is made:
.. list-table::
:header-rows: 1
:widths: 35 25 40
* - Setting
- Default
- Description
* - ``openai_api_key``
- ``""`` (empty)
- API key. If empty, Maigret falls back to the ``OPENAI_API_KEY``
environment variable.
* - ``openai_model``
- ``gpt-4o``
- Default model name. Overridable per-run with ``--ai-model``.
* - ``openai_api_base_url``
- ``https://api.openai.com/v1``
- Base URL of the chat completion API. Point this at any
OpenAI-compatible service (Azure OpenAI, OpenRouter, a local
server, …) to use it instead of OpenAI directly.
Example ``~/.maigret/settings.json`` snippet using a non-OpenAI
endpoint:
.. code-block:: json
{
"openai_api_key": "sk-...",
"openai_model": "gpt-4o-mini",
"openai_api_base_url": "https://openrouter.ai/api/v1"
}
The key resolution order is ``settings.openai_api_key``, then the ``OPENAI_API_KEY``
environment variable; the first non-empty value wins.
.. note::
``--ai`` sends the full internal Markdown report (which contains the
gathered profile data) to the configured endpoint. Only use providers
and accounts you trust with that data.
+6 -1
View File
@@ -10,7 +10,12 @@ The use of tags allows you to select a subset of the sites from big Maigret DB f
There are several types of tags:
1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). These tags reflect the site language and regional origin of its users and are then used to locate the owner of a username. If the regional origin is difficult to establish or a site is positioned as worldwide, `no country code is given`. There could be multiple country code tags for one site.
1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). A country tag means that having an account on the site implies a connection to that country — either origin or residence. The goal is attribution, not perfect accuracy.
- **Global sites** (GitHub, YouTube, Reddit, Medium, etc.) get **no country tag** — an account there says nothing about where a person is from.
- **Regional/local sites** where an account implies a specific country **must** have a country tag: ``VK`` → ``ru``, ``Naver`` → ``kr``, ``Zhihu`` → ``cn``.
- Multiple country tags are allowed when a service is used predominantly in a few countries (e.g. ``Xing`` → ``de``, ``eu``).
- Do **not** assign country tags based on traffic statistics alone — a site popular in India by traffic is not "Indian" if it is used globally.
2. **Site engines**. Most of them are forum engines now: ``uCoz``, ``vBulletin``, ``XenForo`` and others. The full list of engines is stored in the Maigret database.
+1 -1
View File
@@ -33,7 +33,7 @@ Use Cases
If you experience many false positives, you can do the following:
- Install the latest development version of Maigret from GitHub
- Run Maigret with ``--self-check`` flag and agree on disabling of problematic sites
- Run Maigret with ``--self-check --auto-disable`` flag and agree on disabling of problematic sites
3. Search for accounts with username ``machine42`` and generate HTML and PDF reports.
+147
View File
@@ -0,0 +1,147 @@
.. _use-case-crypto:
Cryptocurrency & Web3 Investigations
=====================================
Blockchain transactions are public, but the people behind wallets are not. Maigret helps bridge this gap by finding Web3 accounts tied to a username, revealing the person behind a pseudonymous crypto persona.
Why it matters
--------------
Crypto investigations often start with a wallet address or an ENS name but hit a wall — the blockchain tells you *what* happened, not *who* did it. A username, however, is reused across platforms. If someone trades on OpenSea as ``zachxbt`` and posts on Warpcast as ``zachxbt``, Maigret connects the dots and builds a full profile.
Common scenarios:
- **Scam attribution.** A rug-pull promoter uses the same alias on Fragment (Telegram username marketplace), OpenSea, and a personal blog.
- **Sanctions compliance.** Verifying whether a counterparty's online footprint matches known sanctioned individuals.
- **Due diligence.** Before an OTC deal or DAO vote, checking whether the other party has a consistent online presence or is a freshly created sockpuppet.
- **Stolen funds tracing.** A stolen NFT appears on OpenSea under a new account — but the username matches a Warpcast profile with real-world links.
Supported sites
---------------
Maigret currently checks the following crypto and Web3 platforms:
.. list-table::
:header-rows: 1
:widths: 20 40 40
* - Site
- What it reveals
- Notes
* - **OpenSea**
- NFT collections, trading history, profile bio, linked website
-
* - **Rarible**
- NFT marketplace profile, collections, listing history
- Complements OpenSea for NFT attribution across marketplaces
* - **Zora**
- Zora Network profile, minted NFTs, creator activity
- Ethereum L2 creator platform; useful for on-chain art attribution
* - **Polymarket**
- Prediction-market profile, positions, public portfolio P&L
- Useful for political/financial prediction attribution
* - **Warpcast** (Farcaster)
- Decentralized social profile, posts, follower graph, Farcaster ID
- Every Farcaster ID maps to an Ethereum address via the on-chain ID registry
* - **Fragment**
- Telegram username ownership, TON wallet address, purchase date and price
- Valuable for linking Telegram identities to TON wallets
* - **Paragraph**
- Web3 blog/newsletter, ETH wallet address, linked Twitter handle
- Richest cross-platform data among crypto sites
* - **Tonometerbot**
- TON wallet balance, subscriber count, NFT collection, rankings
- TON blockchain analytics
* - **Spatial**
- Metaverse profile, linked social accounts (Discord, Twitter, Instagram, LinkedIn, TikTok)
- Rich cross-platform links
* - **Revolut.me**
- Payment handle: first/last name, country code, base currency, supported payment methods
- Not strictly Web3, but widely used by crypto OTC traders for fiat off-ramps; the public API returns structured KYC-adjacent data
Real-world example: zachxbt
---------------------------
`ZachXBT <https://twitter.com/zachxbt>`_ is a well-known on-chain investigator. Let's see what Maigret can find from just the username ``zachxbt``:
.. code-block:: console
maigret zachxbt --tags crypto
Maigret finds 5 accounts and automatically extracts structured data from each:
**Fragment** — confirms the Telegram username ``@zachxbt`` is claimed, reveals the TON wallet address (``EQBisZrk...``), purchase price (10 TON), and date (January 2023).
**Paragraph** — the richest result. Returns the real name used on the platform (``ZachXBT``), bio (``Scam survivor turned 2D investigator``), an Ethereum wallet address (``0x23dBf066...``), and a linked Twitter handle (``zachxbt``). The ``wallet_address`` field is especially valuable — it directly links the pseudonym to an on-chain identity.
**Warpcast** — Farcaster profile with a Farcaster ID (``fid: 20931``), profile image, and social graph (33K followers). Every Farcaster ID is tied to an Ethereum address via the on-chain ID registry, so this is another on-chain anchor.
**OpenSea** — NFT marketplace profile with bio (``On-chain sleuth | 10x rug pull survivor``), avatar (hosted on ``seadn.io`` with an Ethereum address in the URL path), and a link to an external investigations page.
**Hive Blog** — blockchain-based blog account created in March 2025. Low activity (1 post), but confirms the username is claimed across blockchain ecosystems.
From a single username, Maigret produces:
- **2 wallet addresses** — one TON (from Fragment), one Ethereum (from Paragraph)
- **1 confirmed Twitter handle**: ``zachxbt`` (from Paragraph)
- **1 Telegram username**: ``@zachxbt`` (from Fragment)
- **1 external link**: ``investigations.notion.site`` (from OpenSea)
- **Social graph data** — 33K Farcaster followers, blog activity timestamps
This is enough to pivot into blockchain analysis tools (Etherscan, Arkham, Nansen) using the wallet addresses, or into social media analysis using the Twitter handle.
Workflow: from username to wallet
---------------------------------
**Step 1: Search crypto platforms**
.. code-block:: console
maigret <username> --tags crypto -v
Review the results. Pay attention to:
- **Fragment** — if the username is claimed, you get a TON wallet address directly.
- **Paragraph** — blog profiles often contain an ETH address and a Twitter handle.
- **Warpcast** — Farcaster IDs map to Ethereum addresses via the on-chain registry.
- **OpenSea** — avatar URLs sometimes contain wallet addresses in the path.
**Step 2: Expand with extracted identifiers**
Maigret automatically extracts additional identifiers from found profiles (real names, linked accounts, profile URLs) and recursively searches for them. This is enabled by default. If Maigret finds a linked Twitter handle on a Paragraph profile, it will automatically search for that handle across all sites.
**Step 3: Cross-reference with non-crypto platforms**
The real power is connecting crypto personas to mainstream accounts. Drop the tag filter:
.. code-block:: console
maigret <username> -a
This checks all 3000+ sites. A match on GitHub, Reddit, or a forum can reveal the person behind the wallet.
Workflow: from wallet to identity
---------------------------------
If you start with a wallet address rather than a username, you can use complementary tools to get a username first:
1. **ENS / Unstoppable Domains** — resolve the wallet address to a human-readable name (``vitalik.eth``). Then search that name in Maigret.
2. **Etherscan labels** — check if the address has a public label (exchange, known entity).
3. **Fragment** — search the TON wallet address to find which Telegram usernames it purchased.
4. **Arkham Intelligence / Nansen** — blockchain attribution platforms that may tag the address with a known identity.
Once you have a username candidate, feed it to Maigret.
Tips
----
- **Username reuse is the #1 signal.** Crypto-native users often reuse their ENS name (``alice.eth``) or a variation (``alice_eth``, ``aliceeth``) across platforms. Try all variations.
- **Fragment is uniquely valuable** because it directly links Telegram usernames to TON wallet addresses — a rare on-chain / off-chain bridge.
- **Warpcast profiles are Ethereum-native.** Every Farcaster account is tied to an Ethereum address via the ID registry contract. If you find a Warpcast profile, you implicitly have a wallet address.
- **Paragraph often has the richest data** — wallet address, Twitter handle, bio, and activity timestamps in a single API response.
- **Use** ``--exclude-tags`` **to skip irrelevant sites** when you're focused on crypto:
.. code-block:: console
maigret alice_eth --exclude-tags porn,dating,forum
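The username-variation tip can be turned into a small helper that prints one Maigret command per candidate. This is only a sketch: ``ens_username_variants`` is a hypothetical name, it covers just the spellings mentioned in the tip, and it assumes a single-label name such as ``alice.eth``.

```python
from typing import List


def ens_username_variants(ens_name: str) -> List[str]:
    """Hypothetical helper: list the spellings of an ENS-style name to try."""
    label, dot, suffix = ens_name.partition(".")
    if not dot:
        # No dot: nothing to vary, search the name as given.
        return [ens_name]
    variants = [ens_name, f"{label}_{suffix}", f"{label}{suffix}"]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


for candidate in ens_username_variants("alice.eth"):
    print(f"maigret {candidate} --tags crypto")
```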
+12 -1
View File
@@ -7,7 +7,18 @@ __author_email__ = 'soxoj@protonmail.com'
from .__version__ import __version__
from .checking import maigret as search
try:
from .checking import maigret as search
except ImportError as e:
raise ImportError(
"Missing required dependency while starting Maigret.\n\n"
"If installed from PyPI:\n"
" pip install -U maigret\n\n"
"If running from a cloned repository:\n"
" pip install -e .\n\n"
"Then run Maigret as:\n"
" python -m maigret <username>"
) from e
from .maigret import main as cli
from .sites import MaigretEngine, MaigretSite, MaigretDatabase
from .notify import QueryNotifyPrint as Notifier
+1 -1
View File
@@ -1,3 +1,3 @@
"""Maigret version file"""
__version__ = '0.5.0'
__version__ = '0.6.0'
+53 -13
View File
@@ -7,7 +7,7 @@ from aiohttp import CookieJar
class ParsingActivator:
@staticmethod
def twitter(site, logger, cookies={}):
def twitter(site, logger, cookies={}, **kwargs):
headers = dict(site.headers)
del headers["x-guest-token"]
import requests
@@ -19,7 +19,7 @@ class ParsingActivator:
site.headers["x-guest-token"] = guest_token
@staticmethod
def vimeo(site, logger, cookies={}):
def vimeo(site, logger, cookies={}, **kwargs):
headers = dict(site.headers)
if "Authorization" in headers:
del headers["Authorization"]
@@ -31,18 +31,58 @@ class ParsingActivator:
site.headers["Authorization"] = "jwt " + jwt_token
@staticmethod
def spotify(site, logger, cookies={}):
headers = dict(site.headers)
if "Authorization" in headers:
del headers["Authorization"]
def onlyfans(site, logger, url=None, **kwargs):
# Signing rules (static_param / checksum_indexes / checksum_constant / format / app_token)
# live in data.json under OnlyFans.activation and rotate upstream every ~13 weeks.
# If "Please refresh the page" keeps firing after activation, refresh them from:
# https://raw.githubusercontent.com/DATAHOARDERS/dynamic-rules/main/onlyfans.json
import hashlib
import secrets
import time as _time
from urllib.parse import urlparse
import requests
r = requests.get(site.activation["url"])
bearer_token = r.json()["accessToken"]
site.headers["authorization"] = f"Bearer {bearer_token}"
act = site.activation
static_param = act["static_param"]
indexes = act["checksum_indexes"]
constant = act["checksum_constant"]
fmt = act["format"]
init_url = act["url"]
user_id = site.headers.get("user-id", "0") or "0"
def _sign(path):
t = str(int(_time.time() * 1000))
msg = "\n".join([static_param, t, path, user_id]).encode()
sha = hashlib.sha1(msg).hexdigest()
cs = sum(ord(sha[i]) for i in indexes) + constant
return t, fmt.format(sha, abs(cs))
if site.headers.get("x-bc", "").strip("0") == "":
site.headers["x-bc"] = secrets.token_hex(20)
if not site.headers.get("cookie"):
init_path = urlparse(init_url).path
t, sg = _sign(init_path)
hdrs = dict(site.headers)
hdrs["time"] = t
hdrs["sign"] = sg
hdrs.pop("cookie", None)
r = requests.get(init_url, headers=hdrs, timeout=15)
jar = "; ".join(f"{k}={v}" for k, v in r.cookies.items())
if jar:
site.headers["cookie"] = jar
logger.debug(f"OnlyFans init: got cookies {list(r.cookies.keys())}")
target_path = urlparse(url).path if url else urlparse(init_url).path
t, sg = _sign(target_path)
site.headers["time"] = t
site.headers["sign"] = sg
logger.debug(f"OnlyFans signed {target_path} time={t}")
@staticmethod
def weibo(site, logger):
def weibo(site, logger, **kwargs):
headers = dict(site.headers)
import requests
@@ -54,7 +94,7 @@ class ParsingActivator:
logger.debug(
f"1 stage: {'success' if r.status_code == 302 else 'no 302 redirect, fail!'}"
)
location = r.headers.get("Location")
location = r.headers.get("Location", "")
# 2 stage: go to passport visitor page
headers["Referer"] = location
@@ -84,9 +124,9 @@ def import_aiohttp_cookies(cookiestxt_filename):
cookies = CookieJar()
cookies_list = []
for domain in cookies_obj._cookies.values():
for domain in cookies_obj._cookies.values(): # type: ignore[attr-defined]
for key, cookie in list(domain.values())[0].items():
c = Morsel()
c: Morsel = Morsel()
c.set(key, cookie.value, cookie.value)
c["domain"] = cookie.domain
c["path"] = cookie.path
+162
View File
@@ -0,0 +1,162 @@
"""Maigret AI Analysis Module
Provides AI-powered analysis of search results using OpenAI-compatible APIs.
"""
import asyncio
import json
import os
import sys
import threading
import aiohttp
def load_ai_prompt() -> str:
"""Load the AI system prompt from the resources directory."""
maigret_path = os.path.dirname(os.path.realpath(__file__))
prompt_path = os.path.join(maigret_path, "resources", "ai_prompt.txt")
with open(prompt_path, "r", encoding="utf-8") as f:
return f.read()
def resolve_api_key(settings) -> str | None:
"""Resolve OpenAI API key from settings or environment variable.
Priority: settings.openai_api_key > OPENAI_API_KEY env var.
"""
key = getattr(settings, "openai_api_key", None)
if key:
return key
return os.environ.get("OPENAI_API_KEY")
class _Spinner:
"""Simple animated spinner for terminal output."""
FRAMES = ["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]
def __init__(self, text=""):
self.text = text
self._stop = threading.Event()
self._thread = None
def start(self):
self._thread = threading.Thread(target=self._spin, daemon=True)
self._thread.start()
def _spin(self):
i = 0
while not self._stop.is_set():
frame = self.FRAMES[i % len(self.FRAMES)]
sys.stderr.write(f"\r{frame} {self.text}")
sys.stderr.flush()
i += 1
self._stop.wait(0.08)
def stop(self):
self._stop.set()
if self._thread:
self._thread.join()
sys.stderr.write("\r\033[2K")
sys.stderr.flush()
async def print_streaming(text: str, delay: float = 0.04):
"""Print text word by word with a delay, simulating streaming LLM output."""
words = text.split(" ")
for i, word in enumerate(words):
if i > 0:
sys.stdout.write(" ")
sys.stdout.write(word)
sys.stdout.flush()
await asyncio.sleep(delay)
sys.stdout.write("\n")
sys.stdout.flush()
async def _check_response(resp):
"""Raise descriptive errors for non-success HTTP responses."""
if resp.status == 401:
raise RuntimeError("Invalid OpenAI API key (HTTP 401)")
if resp.status == 429:
raise RuntimeError("OpenAI API rate limit exceeded (HTTP 429)")
if resp.status != 200:
body = await resp.text()
raise RuntimeError(f"OpenAI API error (HTTP {resp.status}): {body[:500]}")
async def _stream_response(resp, spinner, first_token):
"""Stream tokens from resp, display them, and return (first_token, full_analysis)."""
full_response = []
async for line in resp.content:
decoded = line.decode("utf-8").strip()
if not decoded or not decoded.startswith("data: "):
continue
data_str = decoded[len("data: "):]
if data_str == "[DONE]":
break
try:
chunk = json.loads(data_str)
except json.JSONDecodeError:
continue
delta = chunk.get("choices", [{}])[0].get("delta", {})
content = delta.get("content", "")
if not content:
continue
if first_token:
spinner.stop()
print()
first_token = False
sys.stdout.write(content)
sys.stdout.flush()
full_response.append(content)
return first_token, "".join(full_response)
async def get_ai_analysis(
api_key: str,
markdown_report: str,
model: str = "gpt-4o",
api_base_url: str = "https://api.openai.com/v1",
) -> str:
"""Send the markdown report to an OpenAI-compatible API and return the analysis.
Uses streaming to display tokens as they arrive.
Raises on HTTP errors with descriptive messages.
"""
system_prompt = load_ai_prompt()
url = f"{api_base_url.rstrip('/')}/chat/completions"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
}
payload = {
"model": model,
"stream": True,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": markdown_report},
],
}
spinner = _Spinner("Analysing the data with AI...")
spinner.start()
first_token = True
try:
async with aiohttp.ClientSession() as session:
async with session.post(url, json=payload, headers=headers) as resp:
await _check_response(resp)
first_token, analysis = await _stream_response(resp, spinner, first_token)
except Exception:
spinner.stop()
raise
if first_token:
# No tokens received — stop spinner anyway
spinner.stop()
print()
return analysis
+537 -135
View File
@@ -2,11 +2,12 @@
import ast
import asyncio
import logging
import os
import random
import re
import ssl
import sys
from typing import Dict, List, Optional, Tuple
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import quote
# Third party imports
@@ -15,7 +16,7 @@ from alive_progress import alive_bar
from aiohttp import ClientSession, TCPConnector, http_exceptions
from aiohttp.client_exceptions import ClientConnectorError, ServerDisconnectedError
from python_socks import _errors as proxy_errors
from socid_extractor import extract
from socid_extractor import extract # type: ignore[import-not-found]
try:
from mock import Mock
@@ -48,6 +49,53 @@ SUPPORTED_IDS = (
BAD_CHARS = "#"
def build_cloudflare_bypass_config(
settings_obj: Optional[Any], force_enable: bool = False
) -> Optional[Dict[str, Any]]:
"""Resolve Cloudflare webgate config from settings + CLI flag.
Returns ``None`` when bypass is inactive or no usable module is configured.
Otherwise returns a dict consumed by ``CloudflareWebgateChecker``:
- ``trigger_protection``: list of ``site.protection`` values that
activate the bypass (e.g. ``["cf_js_challenge", "cf_firewall", "webgate"]``)
- ``modules``: ordered list of backend modules to try; each entry has
``name``, ``method`` (``json_api`` for FlareSolverr, ``url_rewrite``
for CloudflareBypassForScraping), and a method-specific ``url`` plus
optional ``max_timeout_ms``.
- ``session_prefix``: prefix for FlareSolverr session reuse.
"""
raw = {}
if settings_obj is not None:
raw = getattr(settings_obj, "cloudflare_bypass", {}) or {}
enabled = bool(force_enable) or bool(raw.get("enabled", False))
if not enabled:
return None
modules_raw = raw.get("modules") or []
valid_modules: List[Dict[str, Any]] = []
for module in modules_raw:
method = module.get("method")
url = module.get("url")
if method == "json_api" and url:
valid_modules.append(dict(module))
elif method == "url_rewrite" and url and "{url}" in url:
valid_modules.append(dict(module))
if not valid_modules:
return None
trigger = raw.get("trigger_protection") or [
"cf_js_challenge",
"cf_firewall",
"webgate",
]
return {
"trigger_protection": list(trigger),
"modules": valid_modules,
"session_prefix": raw.get("session_prefix", "maigret"),
}
class CheckerBase:
pass
@@ -61,8 +109,6 @@ class SimpleAiohttpChecker(CheckerBase):
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
@@ -80,7 +126,7 @@ class SimpleAiohttpChecker(CheckerBase):
async def _make_request(
self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
) -> Tuple[str, int, Optional[CheckError]]:
) -> Tuple[Optional[str], int, Optional[CheckError]]:
try:
if method.lower() == 'get':
request_method = session.get
@@ -136,15 +182,21 @@ class SimpleAiohttpChecker(CheckerBase):
logger.debug(e, exc_info=True)
return None, 0, CheckError("Unexpected", str(e))
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
from aiohttp_socks import ProxyConnector
# Use a real SSL context instead of ssl=False to avoid TLS fingerprinting
# blocks by Cloudflare and similar WAFs. Certificate verification is
# disabled to handle sites with invalid/expired certs.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
connector = (
ProxyConnector.from_url(self.proxy)
if self.proxy
else TCPConnector(ssl=False)
else TCPConnector(ssl=ssl_context)
)
connector.verify_ssl = False
async with ClientSession(
connector=connector,
@@ -189,7 +241,7 @@ class AiodnsDomainResolver(CheckerBase):
self.url = url
return None
async def check(self) -> Tuple[str, int, Optional[CheckError]]:
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
status = 404
error = None
text = ''
@@ -207,6 +259,297 @@ class AiodnsDomainResolver(CheckerBase):
return text, status, error
try:
from curl_cffi.requests import AsyncSession as CurlCffiAsyncSession
CURL_CFFI_AVAILABLE = True
except ImportError:
CURL_CFFI_AVAILABLE = False
class CurlCffiChecker(CheckerBase):
"""Checker using curl_cffi to emulate browser TLS fingerprint and bypass WAF."""
def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
self.browser_emulate = kwargs.get('browser_emulate', 'chrome')
self.url = None
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None
async def close(self):
pass
async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
try:
async with CurlCffiAsyncSession() as session:
# Strip the User-Agent so curl_cffi can use the impersonated browser's
# matching UA. Mixing a random UA with a Chrome TLS fingerprint trips
# composite bot scoring (e.g. Cloudflare returns a JS challenge for
# "Chrome 91 UA + Chrome 131 TLS"). Keep any site-specific custom headers.
headers = {k: v for k, v in (self.headers or {}).items()
if k.lower() not in ('user-agent', 'connection')}
kwargs = {
'url': self.url,
'headers': headers or None,
'allow_redirects': self.allow_redirects,
'timeout': self.timeout if self.timeout else 10,
'impersonate': self.browser_emulate,
}
if self.payload and self.method.lower() == 'post':
kwargs['json'] = self.payload
if self.method.lower() == 'post':
response = await session.post(**kwargs)
elif self.method.lower() == 'head':
response = await session.head(**kwargs)
else:
response = await session.get(**kwargs)
status_code = response.status_code
decoded_content = response.text
self.logger.debug(decoded_content)
error = CheckError("Connection lost") if status_code == 0 else None
return decoded_content, status_code, error
except asyncio.TimeoutError as e:
return None, 0, CheckError("Request timeout", str(e))
except KeyboardInterrupt:
return None, 0, CheckError("Interrupted")
except Exception as e:
self.logger.debug(e, exc_info=True)
return None, 0, CheckError("Unexpected", str(e))
class CloudflareWebgateChecker(CheckerBase):
"""Sends checks through a Cloudflare-bypass proxy.
Supports two backends, selected by ``modules[0].method`` in settings:
- ``json_api`` (FlareSolverr): POST to ``/v1`` with ``cmd: request.get``.
Preserves real upstream status_code, headers and final URL — drop-in
replacement for SimpleAiohttpChecker.
- ``url_rewrite`` (CloudflareBypassForScraping ``/html`` endpoint):
legacy mode. Returns rendered HTML only. Real upstream status is
lost (proxy answers 200 on success). status_code / response_url
check types degrade to "200 if HTML returned, AVAILABLE otherwise".
"""
SESSION_PREFIX_DEFAULT = "maigret"
def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
config = kwargs.get('config') or {}
self._modules: List[Dict[str, Any]] = []
for raw in config.get('modules') or []:
module = dict(raw)
module.setdefault('method', 'json_api')
module.setdefault('name', module.get('method'))
self._modules.append(module)
if not self._modules:
raise ValueError("CloudflareWebgateChecker requires at least one module")
# Session ID is computed per-request from the target host. Sharing a
# single session across hosts caused FlareSolverr to break in
# practice (TLS state / cookies leaking between domains), so each
# host gets its own Chrome instance.
self._session_prefix = (
f"{config.get('session_prefix', self.SESSION_PREFIX_DEFAULT)}-{os.getpid()}"
)
self.url = None
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None
@property
def session_id(self) -> str:
"""FlareSolverr session ID, scoped per target host."""
from urllib.parse import urlparse
host = urlparse(self.url or "").hostname or "default"
host_safe = re.sub(r"[^a-zA-Z0-9.-]", "_", host)
return f"{self._session_prefix}-{host_safe}"
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers or {}
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None
async def close(self):
pass
    async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
        attempts: List[str] = []
        last_error: Optional[CheckError] = None
        for module in self._modules:
            method = module.get('method')
            module_name = module.get('name', method or '?')
            if method == 'json_api':
                result = await self._check_flaresolverr(module)
            elif method == 'url_rewrite':
                result = await self._check_url_rewrite(module)
            else:
                self.logger.warning(
                    f"Webgate module '{module_name}' has unknown method "
                    f"'{method}', skipping"
                )
                attempts.append(f"{module_name}:unknown-method")
                continue
            body, status, err = result
            if err is None:
                return result
            last_error = err
            attempts.append(f"{module_name}:{err.type}")
            self.logger.info(
                f"Webgate module '{module_name}' failed for {self.url}: "
                f"{err.type}: {err.desc}. Trying next module if any."
            )
        # All modules failed. Give the user a single, actionable error with
        # the first module's URL — that's almost always FlareSolverr, and
        # the most common failure is "user forgot to start the container".
        primary = self._modules[0]
        primary_url = primary.get('url', '?')
        primary_method = primary.get('method', '?')
        hint = (
            f"docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest"
            if primary_method == 'json_api'
            else "start the local proxy container"
        )
        last_desc = last_error.desc if last_error else "unknown"
        return None, 0, CheckError(
            "Webgate unavailable",
            f"all {len(self._modules)} module(s) failed [{', '.join(attempts)}]. "
            f"Last error: {last_desc}. "
            f"Is the solver running at {primary_url}? (hint: {hint})",
        )
    async def _check_flaresolverr(
        self, module: Dict[str, Any]
    ) -> Tuple[Optional[str], int, Optional[CheckError]]:
        endpoint = module.get('url') or 'http://localhost:8191/v1'
        max_timeout_ms = int(module.get('max_timeout_ms', 60000))
        post_method = self.method.lower() == 'post'
        cmd = "request.post" if post_method else "request.get"
        body: Dict[str, Any] = {
            "cmd": cmd,
            "url": self.url,
            "maxTimeout": max_timeout_ms,
            "session": self.session_id,
        }
        proxy = module.get('proxy')
        if isinstance(proxy, str) and proxy:
            body["proxy"] = {"url": proxy}
        elif isinstance(proxy, dict) and proxy.get("url"):
            body["proxy"] = {k: v for k, v in proxy.items() if k in ("url", "username", "password")}
        if post_method and self.payload is not None:
            # FlareSolverr expects postData as urlencoded string for form data,
            # but if site.request_payload is JSON we still send it.
            body["postData"] = (
                "&".join(f"{k}={quote(str(v))}" for k, v in self.payload.items())
            )
        timeout = max(int(self.timeout) if self.timeout else 30, max_timeout_ms / 1000 + 5)
        try:
            async with ClientSession() as session:
                async with session.post(
                    endpoint, json=body, timeout=timeout
                ) as resp:
                    if resp.status >= 500:
                        return None, 0, CheckError(
                            "Webgate", f"FlareSolverr {resp.status}"
                        )
                    data = await resp.json()
        except (ClientConnectorError, ServerDisconnectedError) as e:
            return None, 0, CheckError("Webgate unreachable", str(e))
        except asyncio.TimeoutError:
            return None, 0, CheckError("Webgate timeout", endpoint)
        except Exception as e:
            self.logger.debug(e, exc_info=True)
            return None, 0, CheckError("Webgate", str(e))
        if data.get("status") != "ok":
            return None, 0, CheckError("Webgate", data.get("message", "unknown"))
        solution = data.get("solution") or {}
        upstream_status = int(solution.get("status") or 0)
        response_text = solution.get("response") or ""
        # Diagnostic: warn if FlareSolverr returned the CF challenge page
        # itself (challenge not fully solved) rather than the real content.
        # When this happens with sites that have weak presenseStrs/absenceStrs,
        # maigret's default-true presence rule produces false CLAIMED.
        cf_markers = ("Just a moment", "_cf_chl_opt", "cf-mitigated", "challenges.cloudflare.com")
        if response_text and any(m in response_text for m in cf_markers):
            self.logger.warning(
                f"Webgate response from {self.url} still contains CF challenge "
                f"markers (status={upstream_status}, body={len(response_text)}b). "
                f"FlareSolverr likely did not solve the challenge — site checks "
                f"with weak markers may produce false CLAIMED."
            )
        self.logger.info(
            f"Webgate response: url={self.url} status={upstream_status} "
            f"body_len={len(response_text)}"
        )
        return response_text, upstream_status, None
    async def _check_url_rewrite(
        self, module: Dict[str, Any]
    ) -> Tuple[Optional[str], int, Optional[CheckError]]:
        url_template = module.get('url') or ''
        if "{url}" not in url_template:
            return None, 0, CheckError(
                "Webgate", f"module '{module.get('name')}' url has no {{url}} placeholder"
            )
        from urllib.parse import quote_plus
        proxy_url = url_template.format(url=quote_plus(self.url))
        timeout = self.timeout if self.timeout else 30
        try:
            async with ClientSession() as session:
                async with session.get(proxy_url, timeout=timeout) as resp:
                    if resp.status >= 500:
                        return None, 0, CheckError(
                            "Webgate", f"url_rewrite proxy {resp.status}"
                        )
                    body = await resp.text()
        except (ClientConnectorError, ServerDisconnectedError) as e:
            return None, 0, CheckError("Webgate unreachable", str(e))
        except asyncio.TimeoutError:
            return None, 0, CheckError("Webgate timeout", proxy_url)
        except Exception as e:
            self.logger.debug(e, exc_info=True)
            return None, 0, CheckError("Webgate", str(e))
        # url_rewrite mode CANNOT recover the upstream HTTP status.
        # We assume 200 when HTML is returned; status_code/response_url
        # check types will misfire (see docs).
        return body, 200, None
class CheckerMock:
def __init__(self, *args, **kwargs):
pass
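The module fallback in `check()` above can be read as a "first success wins" loop. The following is a minimal standalone sketch of that pattern, not the code from this change: the `Result` tuple, the `first_success` helper, and the `failing`/`working` coroutines are hypothetical names, and errors are plain strings rather than `CheckError` objects.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

# Hypothetical result shape: (body, http_status, error_text_or_None).
Result = Tuple[Optional[str], int, Optional[str]]
Module = Tuple[str, Callable[[], Awaitable[Result]]]


async def first_success(modules: List[Module]) -> Result:
    """Run modules in order and return the first result that carries no error."""
    attempts: List[str] = []
    for name, run in modules:
        body, status, err = await run()
        if err is None:
            return body, status, None
        # Remember why this module failed, then fall through to the next one.
        attempts.append(f"{name}:{err}")
    return None, 0, f"all {len(modules)} module(s) failed [{', '.join(attempts)}]"


async def failing() -> Result:
    return None, 0, "unreachable"


async def working() -> Result:
    return "<html>ok</html>", 200, None


ok_result = asyncio.run(first_success([("solver", failing), ("rewrite", working)]))
failed_result = asyncio.run(first_success([("solver", failing)]))
```

Collecting one `name:error` entry per attempt keeps the final message short while still showing which modules were tried.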
@@ -214,7 +557,7 @@ class CheckerMock:
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
return None
-async def check(self) -> Tuple[str, int, Optional[CheckError]]:
+async def check(self) -> Tuple[Optional[str], int, Optional[CheckError]]:
await asyncio.sleep(0)
return '', 0, None
@@ -271,7 +614,11 @@ def process_site_result(
username = results_info["username"]
is_parsing_enabled = results_info["parsing_enabled"]
url = results_info.get("url_user")
-logger.info(url)
+url_probe = results_info.get("url_probe") or url
+if url_probe != url:
+    logger.info(f"{url_probe} (display: {url})")
+else:
+    logger.info(url)
status = results_info.get("status")
if status is not None:
@@ -368,7 +715,7 @@ def process_site_result(
MaigretCheckStatus.UNKNOWN,
query_time=response_time,
error=check_error,
-context=str(CheckError),
+context=str(check_error),
tags=fulltags,
)
elif check_type == "message":
@@ -463,8 +810,33 @@ def make_site_result(
# workaround to prevent slash errors
url = re.sub("(?<!:)/+", "/", url)
-# always clearweb_checker for now
-checker = options["checkers"][site.protocol]
+# Select checker. Order of precedence:
+# 1. Cloudflare webgate (FlareSolverr / CloudflareBypassForScraping) when
+#    bypass is active and site.protection requests it.
+# 2. curl_cffi for sites requiring TLS impersonation.
+# 3. Default protocol-specific checker (aiohttp).
+cf_bypass = options.get("cloudflare_bypass")
+needs_webgate = bool(cf_bypass) and any(
+    p in cf_bypass["trigger_protection"] for p in site.protection
+)
+needs_impersonation = 'tls_fingerprint' in site.protection
+if needs_webgate:
+    checker = CloudflareWebgateChecker(logger=logger, config=cf_bypass)
+    logger.info(
+        f"Using Cloudflare webgate for {site.name} "
+        f"(protection: {list(site.protection)})"
+    )
+elif needs_impersonation and CURL_CFFI_AVAILABLE:
+    checker = CurlCffiChecker(logger=logger, browser_emulate='chrome')
+elif needs_impersonation and not CURL_CFFI_AVAILABLE:
+    logger.warning(
+        f"Site {site.name} requires TLS impersonation (curl_cffi) but it's not installed. "
+        "Install with: pip install curl_cffi"
+    )
+    checker = options["checkers"][site.protocol]
+else:
+    checker = options["checkers"][site.protocol]
# site check is disabled
if site.disabled and not options['forced']:
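The precedence described in the comment of this hunk (webgate first, then TLS impersonation, then the default checker) can be sketched as a pure function. This is an illustrative sketch only: `pick_checker` and its string return values are hypothetical, while the `trigger_protection` key and the `tls_fingerprint` tag are taken from the diff.

```python
from typing import Any, Dict, Iterable, Optional


def pick_checker(
    protection: Iterable[str],
    cf_bypass: Optional[Dict[str, Any]],
    curl_cffi_available: bool,
) -> str:
    """Return a label for the checker that the precedence rules would select."""
    tags = list(protection)
    # 1. Webgate: bypass config is present and one of the site's tags triggers it.
    if cf_bypass and any(p in cf_bypass["trigger_protection"] for p in tags):
        return "webgate"
    # 2. TLS impersonation, only when the optional dependency is importable.
    if "tls_fingerprint" in tags and curl_cffi_available:
        return "curl_cffi"
    # 3. Everything else, including the missing-dependency case, uses the default.
    return "default"


bypass = {"trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"]}
choice_cf = pick_checker(["cf_js_challenge", "tls_fingerprint"], bypass, True)
choice_tls = pick_checker(["tls_fingerprint"], None, True)
choice_fallback = pick_checker(["tls_fingerprint"], None, False)
```

A site tagged with both a Cloudflare protection and `tls_fingerprint` goes through the webgate, because that branch is tested first.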
@@ -519,6 +891,8 @@ def make_site_result(
for k, v in site.get_params.items():
url_probe += f"&{k}={v}"
results_site["url_probe"] = url_probe
if site.request_method:
request_method = site.request_method.lower()
elif site.check_type == "status_code" and site.request_head_only:
@@ -594,7 +968,7 @@ async def check_site_for_username(
method = act["method"]
try:
activate_fun = getattr(ParsingActivator(), method)
-activate_fun(site, logger)
+activate_fun(site, logger, url=checker.url)
except AttributeError as e:
logger.warning(
f"Activation method {method} for site {site.name} not found!",
@@ -665,6 +1039,7 @@ async def maigret(
cookies=None,
retries=0,
check_domains=False,
cloudflare_bypass: Optional[Dict[str, Any]] = None,
*args,
**kwargs,
) -> QueryResultWrapper:
@@ -763,6 +1138,7 @@ async def maigret(
options["timeout"] = timeout
options["id_type"] = id_type
options["forced"] = forced
options["cloudflare_bypass"] = cloudflare_bypass
# results from analysis of all sites
all_results: Dict[str, QueryResultWrapper] = {}
@@ -799,7 +1175,7 @@ async def maigret(
with alive_bar(
len(tasks_dict), title="Searching", force_tty=True, disable=no_progressbar
) as progress:
-async for result in executor.run(tasks_dict.values()):
+async for result in executor.run(list(tasks_dict.values())):  # type: ignore[arg-type]
cur_results.append(result)
progress()
@@ -866,6 +1242,7 @@ async def site_self_check(
cookies=None,
auto_disable=False,
diagnose=False,
cloudflare_bypass: Optional[Dict[str, Any]] = None,
):
"""
Self-check a site configuration.
@@ -875,135 +1252,150 @@ async def site_self_check(
If False (default), only report issues without disabling.
diagnose: If True, print detailed diagnosis information.
"""
-changes = {
+changes: Dict[str, Any] = {
"disabled": False,
"issues": [],
"recommendations": [],
}
check_data = [
(site.username_claimed, MaigretCheckStatus.CLAIMED),
(site.username_unclaimed, MaigretCheckStatus.AVAILABLE),
]
try:
check_data = [
(site.username_claimed, MaigretCheckStatus.CLAIMED),
(site.username_unclaimed, MaigretCheckStatus.AVAILABLE),
]
logger.info(f"Checking {site.name}...")
logger.info(f"Checking {site.name}...")
results_cache = {}
results_cache = {}
for username, status in check_data:
async with semaphore:
results_dict = await maigret(
username=username,
site_dict={site.name: site},
logger=logger,
timeout=30,
id_type=site.type,
forced=True,
no_progressbar=True,
retries=1,
proxy=proxy,
tor_proxy=tor_proxy,
i2p_proxy=i2p_proxy,
cookies=cookies,
)
# don't disable entries with other ids types
# TODO: make normal checking
if site.name not in results_dict:
logger.info(results_dict)
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
if auto_disable:
changes["disabled"] = True
continue
logger.debug(results_dict)
result = results_dict[site.name]["status"]
results_cache[username] = results_dict[site.name]
if result.error and 'Cannot connect to host' in result.error.desc:
changes["issues"].append(f"Cannot connect to host")
if auto_disable:
changes["disabled"] = True
site_status = result.status
if site_status != status:
if site_status == MaigretCheckStatus.UNKNOWN:
msgs = site.absence_strs
etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
for username, status in check_data:
async with semaphore:
results_dict = await maigret(
username=username,
site_dict={site.name: site},
logger=logger,
timeout=30,
id_type=site.type,
forced=True,
no_progressbar=True,
retries=1,
proxy=proxy,
tor_proxy=tor_proxy,
i2p_proxy=i2p_proxy,
cookies=cookies,
cloudflare_bypass=cloudflare_bypass,
)
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check
# e.g. many sites protected by cloudflare and available in general
if skip_errors:
pass
# don't disable in case of available username
elif status == MaigretCheckStatus.CLAIMED and auto_disable:
changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED:
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
else:
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
# don't disable entries with other ids types
# TODO: make normal checking
if site.name not in results_dict:
logger.info(results_dict)
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
if auto_disable:
changes["disabled"] = True
continue
logger.debug(results_dict)
result = results_dict[site.name]["status"]
results_cache[username] = results_dict[site.name]
if result.error and 'Cannot connect to host' in result.error.desc:
changes["issues"].append("Cannot connect to host")
if auto_disable:
changes["disabled"] = True
logger.info(f"Site {site.name} checking is finished")
site_status = result.status
# Generate recommendations based on issues
if changes["issues"] and len(results_cache) == 2:
claimed_result = results_cache.get(site.username_claimed, {})
unclaimed_result = results_cache.get(site.username_unclaimed, {})
if site_status != status:
if site_status == MaigretCheckStatus.UNKNOWN:
msgs = site.absence_strs
etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
)
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check
# e.g. many sites protected by cloudflare and available in general
if skip_errors:
pass
# don't disable in case of available username
elif status == MaigretCheckStatus.CLAIMED and auto_disable:
changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED:
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
else:
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
claimed_http = claimed_result.get("http_status")
unclaimed_http = unclaimed_result.get("http_status")
logger.info(f"Site {site.name} checking is finished")
if claimed_http and unclaimed_http:
if claimed_http != unclaimed_http and site.check_type != "status_code":
changes["recommendations"].append(
f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
)
# Generate recommendations based on issues
if changes["issues"] and len(results_cache) == 2:
claimed_result = results_cache.get(site.username_claimed, {})
unclaimed_result = results_cache.get(site.username_unclaimed, {})
# Print diagnosis if requested
if diagnose and changes["issues"]:
print(f"\n--- {site.name} DIAGNOSIS ---")
print(f" Check type: {site.check_type}")
print(f" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(f" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
claimed_http = claimed_result.get("http_status")
unclaimed_http = unclaimed_result.get("http_status")
# Only modify site if auto_disable is enabled
if auto_disable and changes["disabled"] != site.disabled:
site.disabled = changes["disabled"]
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
db.update_site(site)
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
if claimed_http and unclaimed_http:
if claimed_http != unclaimed_http and site.check_type != "status_code":
changes["recommendations"].append(
f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
)
# remove service tag "unchecked"
if "unchecked" in site.tags:
site.tags.remove("unchecked")
db.update_site(site)
# Print diagnosis if requested
if diagnose and changes["issues"]:
print(f"\n--- {site.name} DIAGNOSIS ---")
print(f" Check type: {site.check_type}")
print(" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
# Only modify site if auto_disable is enabled
if auto_disable and changes["disabled"] != site.disabled:
site.disabled = changes["disabled"]
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
db.update_site(site)
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
# remove service tag "unchecked"
if "unchecked" in site.tags:
site.tags.remove("unchecked")
db.update_site(site)
except Exception as e:
logger.warning(
f"Self-check of {site.name} failed with unexpected error: {e}",
exc_info=True,
)
changes["issues"].append(f"Unexpected error: {e}")
if auto_disable and not site.disabled:
changes["disabled"] = True
site.disabled = True
db.update_site(site)
if not silent:
print(f"Disabled site {site.name} (unexpected error)...")
return changes
@@ -1019,6 +1411,8 @@ async def self_check(
i2p_proxy=None,
auto_disable=False,
diagnose=False,
no_progressbar=False,
cloudflare_bypass: Optional[Dict[str, Any]] = None,
) -> dict:
"""
Run self-check on sites.
@@ -1047,15 +1441,27 @@ async def self_check(
for _, site in all_sites.items():
check_coro = site_self_check(
site, logger, sem, db, silent, proxy, tor_proxy, i2p_proxy,
-skip_errors=True, auto_disable=auto_disable, diagnose=diagnose
+skip_errors=True, auto_disable=auto_disable, diagnose=diagnose,
+cloudflare_bypass=cloudflare_bypass,
)
future = asyncio.ensure_future(check_coro)
tasks.append((site.name, future))
if tasks:
-with alive_bar(len(tasks), title='Self-checking', force_tty=True) as progress:
+with alive_bar(len(tasks), title='Self-checking', force_tty=True, disable=no_progressbar) as progress:
for site_name, f in tasks:
-result = await f
+try:
+    result = await f
+except Exception as e:
+    logger.warning(
+        f"Self-check task for {site_name} raised unexpected error: {e}",
+        exc_info=True,
+    )
+    result = {
+        "disabled": False,
+        "issues": [f"Unexpected error: {e}"],
+        "recommendations": [],
+    }
result['site_name'] = site_name
all_results.append(result)
progress() # Update the progress bar
@@ -1091,10 +1497,6 @@ async def self_check(
needs_update = total_disabled != 0 or unchecked_new_count != unchecked_old_count
-# For backwards compatibility, return bool if auto_disable is True
-if auto_disable:
-    return needs_update
return {
'needs_update': needs_update,
'results': all_results,
@@ -1118,7 +1520,7 @@ def parse_usernames(extracted_ids_data, logger) -> Dict:
elif "usernames" in k:
try:
tree = ast.literal_eval(v)
-if type(tree) == list:
+if isinstance(tree, list):
for n in tree:
new_usernames[n] = "username"
except Exception as e:
+342
@@ -0,0 +1,342 @@
"""
Database auto-update logic for maigret.
Checks a lightweight meta file to determine if a newer site database is available,
downloads it if compatible, and caches it locally in ~/.maigret/.
"""
import hashlib
import json
import logging
import os
import os.path as path
import tempfile
from datetime import datetime, timezone
from typing import Optional
import requests
from colorama import Fore, Style
from .__version__ import __version__
logger = logging.getLogger("maigret")
_use_color = True
def _print_info(msg: str) -> None:
    text = f"[*] {msg}"
    if _use_color:
        print(Style.BRIGHT + Fore.GREEN + text + Style.RESET_ALL)
    else:
        print(text)

def _print_success(msg: str) -> None:
    text = f"[+] {msg}"
    if _use_color:
        print(Style.BRIGHT + Fore.GREEN + text + Style.RESET_ALL)
    else:
        print(text)

def _print_warning(msg: str) -> None:
    text = f"[!] {msg}"
    if _use_color:
        print(Style.BRIGHT + Fore.YELLOW + text + Style.RESET_ALL)
    else:
        print(text)
DEFAULT_META_URL = (
"https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/db_meta.json"
)
DEFAULT_CHECK_INTERVAL_HOURS = 24
MAIGRET_HOME = path.expanduser("~/.maigret")
CACHED_DB_PATH = path.join(MAIGRET_HOME, "data.json")
STATE_PATH = path.join(MAIGRET_HOME, "autoupdate_state.json")
BUNDLED_DB_PATH = path.join(path.dirname(path.realpath(__file__)), "resources", "data.json")
def _parse_version(version_str: str) -> tuple:
    """Parse a version string like '0.5.0' into a comparable tuple (0, 5, 0)."""
    try:
        return tuple(int(x) for x in version_str.strip().split("."))
    except (ValueError, AttributeError):
        return (0, 0, 0)

def _ensure_maigret_home() -> None:
    os.makedirs(MAIGRET_HOME, exist_ok=True)

def _load_state() -> dict:
    try:
        with open(STATE_PATH, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        return {}

def _save_state(state: dict) -> None:
    _ensure_maigret_home()
    tmp_path = STATE_PATH + ".tmp"
    try:
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2, ensure_ascii=False)
        os.replace(tmp_path, STATE_PATH)
    except OSError:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
def _needs_check(state: dict, interval_hours: int) -> bool:
    last_check = state.get("last_check_at")
    if not last_check:
        return True
    try:
        last_dt = datetime.fromisoformat(last_check.replace("Z", "+00:00"))
        elapsed = (datetime.now(timezone.utc) - last_dt).total_seconds() / 3600
        return elapsed >= interval_hours
    except (ValueError, TypeError):
        return True

def _fetch_meta(meta_url: str, timeout: int = 10) -> Optional[dict]:
    try:
        response = requests.get(meta_url, timeout=timeout)
        if response.status_code == 200:
            return response.json()
    except Exception:
        pass
    return None

def _is_version_compatible(meta: dict) -> bool:
    min_ver = meta.get("min_maigret_version", "0.0.0")
    return _parse_version(__version__) >= _parse_version(min_ver)

def _is_update_available(meta: dict, state: dict) -> bool:
    if not path.isfile(CACHED_DB_PATH):
        return True
    remote_date = meta.get("updated_at", "")
    cached_date = state.get("last_meta", {}).get("updated_at", "")
    return remote_date > cached_date
def _download_and_verify(data_url: str, expected_sha256: str, timeout: int = 60) -> Optional[str]:
    _ensure_maigret_home()
    tmp_fd, tmp_path = tempfile.mkstemp(dir=MAIGRET_HOME, suffix=".json")
    try:
        response = requests.get(data_url, timeout=timeout)
        if response.status_code != 200:
            return None
        content = response.content
        actual_sha256 = hashlib.sha256(content).hexdigest()
        if actual_sha256 != expected_sha256:
            _print_warning("DB auto-update: SHA-256 mismatch, download rejected")
            return None
        # Validate JSON structure
        data = json.loads(content)
        if not all(k in data for k in ("sites", "engines", "tags")):
            _print_warning("DB auto-update: invalid database structure")
            return None
        os.write(tmp_fd, content)
        os.close(tmp_fd)
        tmp_fd = None
        os.replace(tmp_path, CACHED_DB_PATH)
        return CACHED_DB_PATH
    except Exception:
        return None
    finally:
        if tmp_fd is not None:
            os.close(tmp_fd)
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
def _best_local() -> str:
    """Return cached DB if it exists and is valid, otherwise bundled."""
    if path.isfile(CACHED_DB_PATH):
        try:
            with open(CACHED_DB_PATH, "r", encoding="utf-8") as f:
                data = json.load(f)
            if "sites" in data:
                return CACHED_DB_PATH
        except (json.JSONDecodeError, OSError):
            pass
    return BUNDLED_DB_PATH

def _now_iso() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def resolve_db_path(
    db_file_arg: str,
    no_autoupdate: bool = False,
    meta_url: str = DEFAULT_META_URL,
    check_interval_hours: int = DEFAULT_CHECK_INTERVAL_HOURS,
    color: bool = True,
) -> str:
    """
    Determine which database file to use, potentially downloading an update.
    Returns the path to the database file that should be loaded.
    """
    global _use_color
    _use_color = color
    default_db_name = "resources/data.json"
    # User specified a custom DB — skip auto-update
    is_url = db_file_arg.startswith("http://") or db_file_arg.startswith("https://")
    is_default = db_file_arg == default_db_name
    if is_url:
        return db_file_arg
    if not is_default:
        # Try the path as-is (absolute or relative to cwd) first.
        if path.isfile(db_file_arg):
            return path.abspath(db_file_arg)
        # Fall back to legacy behavior: resolve relative to the maigret module dir.
        module_relative = path.join(path.dirname(path.realpath(__file__)), db_file_arg)
        if module_relative != db_file_arg and path.isfile(module_relative):
            return module_relative
        if module_relative != db_file_arg:
            raise FileNotFoundError(
                f"Custom database file not found: {db_file_arg!r} "
                f"(also tried {module_relative!r})"
            )
        raise FileNotFoundError(f"Custom database file not found: {db_file_arg!r}")
    # Auto-update disabled
    if no_autoupdate:
        return _best_local()
    # Check interval
    _ensure_maigret_home()
    state = _load_state()
    if not _needs_check(state, check_interval_hours):
        return _best_local()
    # Time to check
    _print_info("DB auto-update: checking for updates...")
    meta = _fetch_meta(meta_url)
    if meta is None:
        _print_warning("DB auto-update: could not reach update server, using local database")
        state["last_check_at"] = _now_iso()
        _save_state(state)
        return _best_local()
    # Version compatibility
    if not _is_version_compatible(meta):
        min_ver = meta.get("min_maigret_version", "?")
        _print_warning(
            f"DB auto-update: latest database requires maigret >= {min_ver}, "
            f"you have {__version__}. Please upgrade with: pip install -U maigret"
        )
        state["last_check_at"] = _now_iso()
        _save_state(state)
        return _best_local()
    # Check if update available
    if not _is_update_available(meta, state):
        sites_count = meta.get("sites_count", "?")
        _print_info(f"DB auto-update: database is up to date ({sites_count} sites)")
        state["last_check_at"] = _now_iso()
        state["last_meta"] = meta
        _save_state(state)
        return _best_local()
    # Download update
    new_count = meta.get("sites_count", "?")
    old_count = state.get("last_meta", {}).get("sites_count")
    if old_count:
        _print_info(f"DB auto-update: downloading updated database ({new_count} sites, was {old_count})...")
    else:
        _print_info(f"DB auto-update: downloading database ({new_count} sites)...")
    data_url = meta.get("data_url", "")
    expected_sha = meta.get("data_sha256", "")
    result = _download_and_verify(data_url, expected_sha)
    if result is None:
        _print_warning("DB auto-update: download failed, using local database")
        state["last_check_at"] = _now_iso()
        _save_state(state)
        return _best_local()
    _print_success(f"DB auto-update: database updated successfully ({new_count} sites)")
    state["last_check_at"] = _now_iso()
    state["last_meta"] = meta
    state["cached_db_sha256"] = expected_sha
    _save_state(state)
    return CACHED_DB_PATH
def force_update(
    meta_url: str = DEFAULT_META_URL,
    color: bool = True,
) -> bool:
    """
    Force check for database updates and download if available.
    Returns True if database was updated, False otherwise.
    """
    global _use_color
    _use_color = color
    _ensure_maigret_home()
    _print_info("DB update: checking for updates...")
    meta = _fetch_meta(meta_url)
    if meta is None:
        _print_warning("DB update: could not reach update server")
        return False
    if not _is_version_compatible(meta):
        min_ver = meta.get("min_maigret_version", "?")
        _print_warning(
            f"DB update: latest database requires maigret >= {min_ver}, "
            f"you have {__version__}. Please upgrade with: pip install -U maigret"
        )
        return False
    state = _load_state()
    new_count = meta.get("sites_count", "?")
    old_count = state.get("last_meta", {}).get("sites_count")
    if not _is_update_available(meta, state):
        _print_info(f"DB update: database is already up to date ({new_count} sites)")
        state["last_check_at"] = _now_iso()
        state["last_meta"] = meta
        _save_state(state)
        return False
    if old_count:
        _print_info(f"DB update: downloading updated database ({new_count} sites, was {old_count})...")
    else:
        _print_info(f"DB update: downloading database ({new_count} sites)...")
    data_url = meta.get("data_url", "")
    expected_sha = meta.get("data_sha256", "")
    result = _download_and_verify(data_url, expected_sha)
    if result is None:
        _print_warning("DB update: download failed")
        return False
    _print_success(f"DB update: database updated successfully ({new_count} sites)")
    state["last_check_at"] = _now_iso()
    state["last_meta"] = meta
    state["cached_db_sha256"] = expected_sha
    _save_state(state)
    return True
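The download path in this file checks a SHA-256 digest and the JSON structure before swapping the cached database into place with `os.replace`. The sketch below isolates that verify-then-replace step without any network access. It is an illustration under assumptions, not the module's code: `verify_and_install`, `REQUIRED_KEYS`, and the `target_path` parameter are hypothetical names, and the payload is passed in as bytes instead of being downloaded.

```python
import hashlib
import json
import os
import tempfile
from typing import Optional

REQUIRED_KEYS = ("sites", "engines", "tags")


def verify_and_install(content: bytes, expected_sha256: str, target_path: str) -> Optional[str]:
    """Install content at target_path only if its hash and JSON structure check out."""
    if hashlib.sha256(content).hexdigest() != expected_sha256:
        return None
    try:
        data = json.loads(content)
    except ValueError:  # covers json.JSONDecodeError and undecodable bytes
        return None
    if not isinstance(data, dict) or not all(k in data for k in REQUIRED_KEYS):
        return None

    # Create the temp file next to the target so os.replace stays on one filesystem.
    target_dir = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".json")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        os.replace(tmp_path, target_path)
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        return None
    return target_path
```

Because nothing is written to `target_path` until every check has passed, a rejected or interrupted download leaves the previously cached file untouched.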
+2
@@ -58,6 +58,8 @@ COMMON_ERRORS = {
'Censorship', 'MGTS'
),
'Incapsula incident ID': CheckError('Bot protection', 'Incapsula'),
'<title>Client Challenge</title>': CheckError('Bot protection', 'Anti-bot challenge'),
'<title>DDoS-Guard</title>': CheckError('Bot protection', 'DDoS-Guard'),
'Сайт заблокирован хостинг-провайдером': CheckError(
'Site-specific', 'Site is disabled (Beget)'
),
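The two new `COMMON_ERRORS` entries map an HTML marker to a bot-protection error. How such a table is typically consulted can be sketched as a substring scan over the response body. This is an assumption-laden illustration: `MARKERS`, `ErrorInfo`, and `detect_error` are hypothetical names, and errors are plain tuples rather than `CheckError` instances.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for CheckError: (type, description).
ErrorInfo = Tuple[str, str]

MARKERS: Dict[str, ErrorInfo] = {
    "<title>Client Challenge</title>": ("Bot protection", "Anti-bot challenge"),
    "<title>DDoS-Guard</title>": ("Bot protection", "DDoS-Guard"),
}


def detect_error(html: str) -> Optional[ErrorInfo]:
    """Return the error for the first marker found in html, or None."""
    for marker, error in MARKERS.items():
        if marker in html:
            return error
    return None


blocked = detect_error("<html><title>DDoS-Guard</title></html>")
clean = detect_error("<html><title>alice's profile</title></html>")
```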
+7 -6
@@ -1,4 +1,5 @@
import asyncio
import inspect
import sys
import time
from typing import Any, Iterable, List, Callable
@@ -103,7 +104,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.workers_count = kwargs.get('in_parallel', 10)
-self.queue = asyncio.Queue(self.workers_count)
+self.queue: asyncio.Queue = asyncio.Queue(self.workers_count)
self.timeout = kwargs.get('timeout')
# Pass a progress function; alive_bar by default
self.progress_func = kwargs.get('progress_func', alive_bar)
@@ -113,7 +114,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
async def increment_progress(self, count):
"""Update progress by calling the provided progress function."""
if self.progress:
-if asyncio.iscoroutinefunction(self.progress):
+if inspect.iscoroutinefunction(self.progress):
await self.progress(count)
else:
self.progress(count)
@@ -124,7 +125,7 @@ class AsyncioProgressbarQueueExecutor(AsyncExecutor):
"""Stop the progress tracking."""
if hasattr(self.progress, "close") and self.progress:
close_func = self.progress.close
-if asyncio.iscoroutinefunction(close_func):
+if inspect.iscoroutinefunction(close_func):
await close_func()
else:
close_func()
@@ -184,10 +185,10 @@ class AsyncioQueueGeneratorExecutor:
# Deprecated: will be removed soon, don't use it
def __init__(self, *args, **kwargs):
self.workers_count = kwargs.get('in_parallel', 10)
-self.queue = asyncio.Queue()
+self.queue: asyncio.Queue = asyncio.Queue()
self.timeout = kwargs.get('timeout')
self.logger = kwargs['logger']
-self._results = asyncio.Queue()
+self._results: asyncio.Queue = asyncio.Queue()
self._stop_signal = object()
async def worker(self):
@@ -209,7 +210,7 @@ class AsyncioQueueGeneratorExecutor:
result = kwargs.get('default')
await self._results.put(result)
except Exception as e:
-self.logger.error(f"Error in worker: {e}")
+self.logger.error(f"Error in worker: {e}", exc_info=True)
finally:
self.queue.task_done()
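The executor changes swap `asyncio.iscoroutinefunction` for `inspect.iscoroutinefunction` when deciding whether a progress callback must be awaited. A minimal sketch of that dispatch follows; `call_progress`, `sync_progress`, `async_progress`, and `seen` are hypothetical names used only for illustration.

```python
import asyncio
import inspect
from typing import Any, Callable, List


async def call_progress(func: Callable[..., Any], count: int) -> None:
    """Await func(count) if func is a coroutine function, otherwise call it directly."""
    if inspect.iscoroutinefunction(func):
        await func(count)
    else:
        func(count)


seen: List[str] = []


def sync_progress(count: int) -> None:
    seen.append(f"sync:{count}")


async def async_progress(count: int) -> None:
    seen.append(f"async:{count}")


async def demo() -> None:
    await call_progress(sync_progress, 1)
    await call_progress(async_progress, 2)


asyncio.run(demo())
```

The `inspect` variant is the one the standard library documentation recommends going forward, which is presumably why the diff moves to it.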
+195 -26
@@ -13,7 +13,19 @@ from argparse import ArgumentParser, RawDescriptionHelpFormatter
from typing import List, Tuple
import os.path as path
-from socid_extractor import extract, parse
+try:
+    from socid_extractor import extract, parse
+except ImportError as e:
+    raise ImportError(
+        "Missing dependency: socid_extractor\n\n"
+        "If installed from PyPI:\n"
+        " pip install -U maigret\n\n"
+        "If running from a cloned repository:\n"
+        " pip install -e .\n\n"
+        "Then run Maigret as:\n"
+        " python -m maigret <username>"
+    ) from e
from .__version__ import __version__
from .checking import (
@@ -22,6 +34,7 @@ from .checking import (
self_check,
BAD_CHARS,
maigret,
build_cloudflare_bypass_config,
)
from . import errors
from .notify import QueryNotifyPrint
@@ -37,6 +50,7 @@ from .report import (
get_plaintext_report,
sort_report_by_data_points,
save_graph_report,
save_markdown_report,
)
from .sites import MaigretDatabase
from .submit import Submitter
@@ -75,7 +89,7 @@ def extract_ids_from_page(url, logger, timeout=5) -> dict:
elif 'usernames' in k:
try:
tree = ast.literal_eval(v)
-if type(tree) == list:
+if isinstance(tree, list):
for n in tree:
results[n] = 'username'
except Exception as e:
@@ -201,6 +215,20 @@ def setup_arguments_parser(settings: Settings):
default=settings.sites_db_path,
help="Load Maigret database from a JSON file or HTTP web resource.",
)
parser.add_argument(
"--no-autoupdate",
action="store_true",
dest="no_autoupdate",
default=settings.no_autoupdate,
help="Disable automatic database updates on startup.",
)
parser.add_argument(
"--force-update",
action="store_true",
dest="force_update",
default=False,
help="Force check for database updates and download if available.",
)
parser.add_argument(
"--cookies-jar-file",
metavar="COOKIE_FILE",
@@ -254,6 +282,13 @@ def setup_arguments_parser(settings: Settings):
default=settings.domain_search,
help="Enable (experimental) feature of checking domains on usernames.",
)
parser.add_argument(
"--cloudflare-bypass",
action="store_true",
default=False,
help="Enable Cloudflare webgate bypass for sites with protection cf_js_challenge / cf_firewall / webgate. "
"Requires a local CloudflareBypassForScraping instance (see settings.json -> cloudflare_bypass.modules[0].url).",
)
filter_group = parser.add_argument_group(
'Site filtering', 'Options to set site search scope'
@@ -451,6 +486,14 @@ def setup_arguments_parser(settings: Settings):
default=settings.pdf_report,
help="Generate a PDF report (general report on all usernames).",
)
report_group.add_argument(
"-M",
"--md",
action="store_true",
dest="md",
default=settings.md_report,
help="Generate a Markdown report (general report on all usernames).",
)
report_group.add_argument(
"-G",
"--graph",
@@ -471,6 +514,21 @@ def setup_arguments_parser(settings: Settings):
" (one report per username).",
)
report_group.add_argument(
"--ai",
action="store_true",
dest="ai",
default=False,
help="Generate an AI-powered analysis of the search results using OpenAI API. "
"Requires OPENAI_API_KEY env var or openai_api_key in settings.",
)
report_group.add_argument(
"--ai-model",
dest="ai_model",
default=settings.openai_model,
help="OpenAI model to use for AI analysis (default: gpt-4o).",
)
parser.add_argument(
"--reports-sorting",
default=settings.report_sorting,
@@ -502,6 +560,20 @@ async def main():
arg_parser = setup_arguments_parser(settings)
args = arg_parser.parse_args()
# Resolve Cloudflare webgate config (CLI flag OR settings.cloudflare_bypass.enabled)
cf_bypass_config = build_cloudflare_bypass_config(
settings, force_enable=args.cloudflare_bypass
)
if cf_bypass_config:
modules_summary = ", ".join(
f"{m.get('name', m.get('method'))}({m.get('url')})"
for m in cf_bypass_config["modules"]
)
logger.info(
f"Cloudflare webgate active: triggers={cf_bypass_config['trigger_protection']}, "
f"modules=[{modules_summary}]"
)
# Re-set logging level based on args
if args.debug:
log_level = logging.DEBUG
@@ -543,9 +615,25 @@ async def main():
else:
args.exclude_tags = []
db_file = args.db_file \
if (args.db_file.startswith("http://") or args.db_file.startswith("https://")) \
else path.join(path.dirname(path.realpath(__file__)), args.db_file)
from .db_updater import resolve_db_path, force_update, BUNDLED_DB_PATH
if args.force_update:
force_update(
meta_url=settings.db_update_meta_url,
color=not args.no_color,
)
try:
db_file = resolve_db_path(
db_file_arg=args.db_file,
no_autoupdate=args.no_autoupdate or args.force_update,
meta_url=settings.db_update_meta_url,
check_interval_hours=settings.autoupdate_check_interval_hours,
color=not args.no_color,
)
except FileNotFoundError as e:
logger.error(str(e))
sys.exit(2)
if args.top_sites == 0 or args.all_sites:
args.top_sites = sys.maxsize
@@ -557,10 +645,25 @@ async def main():
print_found_only=not args.print_not_found,
skip_check_errors=not args.print_check_errors,
color=not args.no_color,
silent=args.ai,
)
# Create object with all information about sites we are aware of.
db = MaigretDatabase().load_from_path(db_file)
try:
db = MaigretDatabase().load_from_path(db_file)
query_notify.success(f'Using sites database: {db_file} ({len(db.sites)} sites)')
except Exception as e:
logger.warning(f"Failed to load database from {db_file}: {e}")
if db_file != BUNDLED_DB_PATH:
query_notify.warning(
f'Falling back to bundled database: {BUNDLED_DB_PATH}'
)
db = MaigretDatabase().load_from_path(BUNDLED_DB_PATH)
query_notify.success(
f'Using sites database: {BUNDLED_DB_PATH} ({len(db.sites)} sites)'
)
else:
raise
get_top_sites_for_id = lambda x: db.ranked_sites_dict(
top=args.top_sites,
tags=args.tags,
@@ -600,13 +703,11 @@ async def main():
i2p_proxy=args.i2p_proxy,
auto_disable=args.auto_disable,
diagnose=args.diagnose,
no_progressbar=args.no_progressbar,
cloudflare_bypass=cf_bypass_config,
)
# Handle both old (bool) and new (dict) return types
if isinstance(check_result, dict):
is_need_update = check_result.get('needs_update', False)
else:
is_need_update = check_result
is_need_update = check_result.get('needs_update', False)
if is_need_update:
if input('Do you want to save changes permanently? [Yn]\n').lower() in (
@@ -661,17 +762,33 @@ async def main():
+ get_dict_ascii_tree(usernames, prepend="\t")
)
if args.ai:
from .ai import resolve_api_key
if not resolve_api_key(settings):
query_notify.warning(
'AI analysis requires an OpenAI API key. '
'Set OPENAI_API_KEY environment variable or add '
'openai_api_key to settings.json.'
)
sys.exit(1)
if not site_data:
query_notify.warning('No sites to check, exiting!')
sys.exit(2)
query_notify.warning(
f'Starting a search on top {len(site_data)} sites from the Maigret database...'
)
if not args.all_sites:
if args.ai:
query_notify.warning(
'You can run search by full list of sites with flag `-a`', '!'
f'Starting AI-assisted search on top {len(site_data)} sites from the Maigret database...'
)
else:
query_notify.warning(
f'Starting a search on top {len(site_data)} sites from the Maigret database...'
)
if not args.all_sites:
query_notify.warning(
'You can run search by full list of sites with flag `-a`', '!'
)
already_checked = set()
general_results = []
@@ -722,13 +839,15 @@ async def main():
no_progressbar=args.no_progressbar,
retries=args.retries,
check_domains=args.with_domains,
cloudflare_bypass=cf_bypass_config,
)
errs = errors.notify_about_errors(
results, query_notify, show_statistics=args.verbose
)
for e in errs:
query_notify.warning(*e)
if not args.ai:
errs = errors.notify_about_errors(
results, query_notify, show_statistics=args.verbose
)
for e in errs:
query_notify.warning(*e)
if args.reports_sorting == "data":
results = sort_report_by_data_points(results)
@@ -772,7 +891,7 @@ async def main():
# reporting for all the result
if general_results:
if args.html or args.pdf:
if args.html or args.pdf or args.md:
query_notify.warning('Generating report info...')
report_context = generate_report_context(general_results)
# determine main username
@@ -792,6 +911,23 @@ async def main():
save_pdf_report(filename, report_context)
query_notify.warning(f'PDF report on all usernames saved in {filename}')
if args.md:
username = username.replace('/', '_')
filename = report_filepath_tpl.format(username=username, postfix='.md')
run_flags = []
if args.tags:
run_flags.append(f"--tags {args.tags}")
if args.site_list:
run_flags.append(f"--site {','.join(args.site_list)}")
if args.all_sites:
run_flags.append("--all-sites")
run_info = {
"sites_count": sum(len(d) for _, _, d in general_results),
"flags": " ".join(run_flags) if run_flags else None,
}
save_markdown_report(filename, report_context, run_info=run_info)
query_notify.warning(f'Markdown report on all usernames saved in {filename}')
if args.graph:
username = username.replace('/', '_')
filename = report_filepath_tpl.format(
@@ -800,10 +936,43 @@ async def main():
save_graph_report(filename, general_results, db)
query_notify.warning(f'Graph report on all usernames saved in {filename}')
text_report = get_plaintext_report(report_context)
if text_report:
query_notify.info('Short text report:')
print(text_report)
if not args.ai:
text_report = get_plaintext_report(report_context)
if text_report:
query_notify.info('Short text report:')
print(text_report)
if args.ai:
from .ai import get_ai_analysis, resolve_api_key
from .report import generate_markdown_report
api_key = resolve_api_key(settings)
run_flags = []
if args.tags:
run_flags.append(f"--tags {args.tags}")
if args.site_list:
run_flags.append(f"--site {','.join(args.site_list)}")
if args.all_sites:
run_flags.append("--all-sites")
run_info = {
"sites_count": sum(len(d) for _, _, d in general_results),
"flags": " ".join(run_flags) if run_flags else None,
}
md_report = generate_markdown_report(report_context, run_info=run_info)
try:
await get_ai_analysis(
api_key=api_key,
markdown_report=md_report,
model=args.ai_model,
api_base_url=getattr(
settings, 'openai_api_base_url', 'https://api.openai.com/v1'
),
)
except Exception as e:
query_notify.warning(f'AI analysis failed: {e}')
# update database
db.save_to_file(db_file)
+11 -4
@@ -1,7 +1,6 @@
"""Sherlock Notify Module
"""Console and query notification helpers.
This module defines the objects for notifying the caller about the
results of queries.
This module defines objects for notifying the caller about the results of queries.
"""
import sys
@@ -124,6 +123,7 @@ class QueryNotifyPrint(QueryNotify):
print_found_only=False,
skip_check_errors=False,
color=True,
silent=False,
):
"""Create Query Notify Print Object.
@@ -150,6 +150,7 @@ class QueryNotifyPrint(QueryNotify):
self.print_found_only = print_found_only
self.skip_check_errors = skip_check_errors
self.color = color
self.silent = silent
return
@@ -174,7 +175,7 @@ class QueryNotifyPrint(QueryNotify):
else:
return self.make_simple_terminal_notify(*args)
def start(self, message, id_type):
def start(self, message=None, id_type="username"):
"""Notify Start.
Will print the title to the standard output.
@@ -188,6 +189,9 @@ class QueryNotifyPrint(QueryNotify):
Nothing.
"""
if self.silent:
return
title = f"Checking {id_type}"
if self.color:
print(
@@ -237,6 +241,9 @@ class QueryNotifyPrint(QueryNotify):
Return Value:
Nothing.
"""
if self.silent:
return
notify = None
self.result = result
+165 -17
@@ -7,7 +7,7 @@ import os
from datetime import datetime
from typing import Dict, Any
import xmind
import xmind # type: ignore[import-untyped]
from dateutil.tz import gettz
from dateutil.parser import parse as parse_datetime_str
from jinja2 import Template
@@ -30,14 +30,18 @@ UTILS
def filter_supposed_data(data):
# interesting fields
allowed_fields = ["fullname", "gender", "location", "age"]
filtered_supposed_data = {
CaseConverter.snake_to_title(k): v[0]
def _first(v):
if isinstance(v, (list, tuple)):
return v[0] if v else ""
return v
return {
CaseConverter.snake_to_title(k): _first(v)
for k, v in data.items()
if k in allowed_fields
}
return filtered_supposed_data
def sort_report_by_data_points(results):
@@ -79,7 +83,7 @@ def save_pdf_report(filename: str, context: dict):
filled_template = template.render(**context)
# moved here to speed up the launch of Maigret
from xhtml2pdf import pisa
from xhtml2pdf import pisa # type: ignore[import-untyped]
with open(filename, "w+b") as f:
pisa.pisaDocument(io.StringIO(filled_template), dest=f, default_css=css)
@@ -91,9 +95,9 @@ def save_json_report(filename: str, username: str, results: dict, report_type: s
class MaigretGraph:
other_params = {'size': 10, 'group': 3}
site_params = {'size': 15, 'group': 2}
username_params = {'size': 20, 'group': 1}
other_params: dict = {'size': 10, 'group': 3}
site_params: dict = {'size': 15, 'group': 2}
username_params: dict = {'size': 20, 'group': 1}
def __init__(self, graph):
self.G = graph
@@ -121,12 +125,12 @@ class MaigretGraph:
def save_graph_report(filename: str, username_results: list, db: MaigretDatabase):
import networkx as nx
G = nx.Graph()
G: Any = nx.Graph()
graph = MaigretGraph(G)
base_site_nodes = {}
site_account_nodes = {}
processed_values = {} # Track processed values to avoid duplicates
processed_values: Dict[str, Any] = {} # Track processed values to avoid duplicates
for username, id_type, results in username_results:
# Add username node, using normalized version directly if different
@@ -239,9 +243,9 @@ def save_graph_report(filename: str, username_results: list, db: MaigretDatabase
G.remove_nodes_from(single_degree_sites)
# Generate interactive visualization
from pyvis.network import Network
from pyvis.network import Network # type: ignore[import-untyped]
nt = Network(notebook=True, height="750px", width="100%")
nt = Network(notebook=True, height="100vh", width="100%")
nt.from_nx(G)
nt.show(filename)
@@ -257,6 +261,149 @@ def get_plaintext_report(context: dict) -> str:
return output.strip()
def _md_format_value(value) -> str:
"""Format a value for Markdown output, detecting links."""
if isinstance(value, list):
return ", ".join(str(v) for v in value)
s = str(value)
if s.startswith("http://") or s.startswith("https://"):
return f"[{s}]({s})"
return s
def generate_markdown_report(context: dict, run_info: dict = None) -> str:
username = context.get("username", "unknown")
generated_at = context.get("generated_at", "")
brief = context.get("brief", "")
countries = context.get("countries_tuple_list", [])
interests = context.get("interests_tuple_list", [])
first_seen = context.get("first_seen")
results = context.get("results", [])
# Collect ALL values for key fields across all accounts
all_fields: Dict[str, list] = {}
last_seen = None
for _, _, data in results:
for _, v in data.items():
if not v.get("found") or v.get("is_similar"):
continue
ids_data = v.get("ids_data", {})
# Map multiple source fields to unified output fields
field_sources = {
"fullname": ("fullname", "name"),
"location": ("location", "country", "city", "country_code", "locale", "region"),
"gender": ("gender",),
"bio": ("bio", "about", "description"),
}
for out_field, source_keys in field_sources.items():
for src in source_keys:
val = ids_data.get(src)
if val:
all_fields.setdefault(out_field, [])
val_str = str(val)
if val_str not in all_fields[out_field]:
all_fields[out_field].append(val_str)
# Track last_seen
for ts_field in ("last_online", "latest_activity_at", "updated_at"):
ts = ids_data.get(ts_field)
if ts and (last_seen is None or str(ts) > str(last_seen)):
last_seen = ts
lines = []
lines.append(f"# Report by searching on username \"{username}\"\n")
# Generated line with run info
gen_line = f"Generated at {generated_at} by [Maigret](https://github.com/soxoj/maigret)"
if run_info:
parts = []
if run_info.get("sites_count"):
parts.append(f"{run_info['sites_count']} sites checked")
if run_info.get("flags"):
parts.append(f"flags: `{run_info['flags']}`")
if parts:
gen_line += f" ({', '.join(parts)})"
lines.append(f"{gen_line}\n")
# Summary
lines.append("## Summary\n")
lines.append(f"{brief}\n")
if all_fields:
lines.append("**Information extracted from accounts:**\n")
for field, values in all_fields.items():
title = CaseConverter.snake_to_title(field)
lines.append(f"- {title}: {'; '.join(values)}")
lines.append("")
if countries:
geo = ", ".join(f"{code} (x{count})" for code, count in countries)
lines.append(f"**Country tags:** {geo}\n")
if interests:
tags = ", ".join(f"{tag} (x{count})" for tag, count in interests)
lines.append(f"**Website tags:** {tags}\n")
if first_seen:
lines.append(f"**First seen:** {first_seen}")
if last_seen:
lines.append(f"**Last seen:** {last_seen}")
if first_seen or last_seen:
lines.append("")
# Accounts found
lines.append("## Accounts found\n")
for u, id_type, data in results:
for site_name, v in data.items():
if not v.get("found") or v.get("is_similar"):
continue
lines.append(f"### {site_name}\n")
lines.append(f"- **URL:** [{v.get('url_user', '')}]({v.get('url_user', '')})")
tags = v.get("status") and v["status"].tags or []
if tags:
lines.append(f"- **Tags:** {', '.join(tags)}")
lines.append("")
ids_data = v.get("ids_data", {})
if ids_data:
for field, value in ids_data.items():
if field == "image":
continue
title = CaseConverter.snake_to_title(field)
lines.append(f"- {title}: {_md_format_value(value)}")
lines.append("")
# Possible false positives
lines.append("## Possible false positives\n")
lines.append(
f"This report was generated by searching for accounts matching the username `{username}`. "
f"Accounts listed above may belong to different people who happen to use the same "
f"or similar username. Results without extracted personal information could contain "
f"some false positive findings. Always verify findings before drawing conclusions.\n"
)
# Ethical use
lines.append("## Ethical use\n")
lines.append(
"This report is a result of a technical collection of publicly available information "
"from online accounts and does not constitute personal data processing. If you intend "
"to use this data for personal data processing or collection purposes, ensure your use "
"complies with applicable laws and regulations in your jurisdiction (such as GDPR, "
"CCPA, and similar).\n"
)
return "\n".join(lines)
def save_markdown_report(filename: str, context: dict, run_info: dict = None):
content = generate_markdown_report(context, run_info)
with open(filename, "w", encoding="utf-8") as f:
f.write(content)
"""
REPORTS GENERATING
"""
@@ -353,11 +500,12 @@ def generate_report_context(username_results: list):
if k in ["country", "locale"]:
try:
if is_country_tag(k):
tag = pycountry.countries.get(alpha_2=v).alpha_2.lower()
country = pycountry.countries.get(alpha_2=v)
tag = country.alpha_2.lower() # type: ignore[union-attr]
else:
tag = pycountry.countries.search_fuzzy(v)[
0
].alpha_2.lower()
].alpha_2.lower() # type: ignore[attr-defined]
# TODO: move countries to another struct
tags[tag] = tags.get(tag, 0) + 1
except Exception as e:
@@ -513,8 +661,8 @@ def add_xmind_subtopic(userlink, k, v, supposed_data):
def design_xmind_sheet(sheet, username, results):
alltags = {}
supposed_data = {}
alltags: Dict[str, Any] = {}
supposed_data: Dict[str, Any] = {}
sheet.setTitle("%s Analysis" % (username))
root_topic1 = sheet.getRootTopic()
+62
@@ -0,0 +1,62 @@
You are an OSINT analyst that converts raw username-investigation reports into a short, clean human-readable summary.
Your task:
Read the attached account-discovery report and produce a concise report in exactly this style:
# Investigation Summary
Name: <most likely real full name>
Location: <most likely current location>
Occupation: <short combined description based only on strong signals>
Interests: <3-6 broad interests inferred from platform types, bios, and activity>
Languages: <languages supported by strong evidence only>
Website: <main personal website if clearly present>
Username: <main username> (variant: <variant usernames if any>)
Platforms: <number> profiles, active from <first year> to <last year>
Confidence: <High / Medium / Low> — <one short explanation why>
# Other leads
- <lead 1>
- <lead 2>
- <lead 3 if needed>
Rules:
1. Use only information supported by the report.
2. Resolve identity using consistency of username, full name, bio, links, company, and location.
3. Prefer strong repeated signals over one-off weak signals.
4. If one profile clearly conflicts with the rest, mention it in "Other leads" as a likely false positive instead of mixing it into the main identity.
5. Keep the tone analytical and neutral.
6. Do not mention every platform individually.
7. Do not include raw URLs except for the main website.
8. Do not mention NSFW/adult platforms in the main summary unless they are the only source for a critical lead; if such a profile looks inconsistent, mention it only as a likely false positive.
9. "Occupation" should be a compact merged description, for example: "Chief Product Officer (CPO) at ..., entrepreneur, OSINT community founder".
10. "Interests" should be broad categories, not noisy tags. Convert raw platform/tag evidence into natural categories like OSINT, software development, blogging, gaming, streaming, etc.
11. "Languages" should only include languages clearly supported by bios, texts, country tags, or profile content.
12. For "Platforms", count the profiles reported as found by the report summary, not manually deduplicated.
13. For active years, use the earliest and latest reliable dates from the consistent identity cluster. Ignore obvious outlier dates if they belong to likely false positives or weak profiles.
14. For confidence:
- High = strong consistency across username, name, bio, links, location, and/or company
- Medium = partial consistency with some gaps
- Low = mostly username-only matches
15. If some field is not reliably known, omit speculation and use the best cautious wording possible.
16. For "Name", output only the most likely real personal name in clean canonical form.
- Remove nicknames, handles, aliases, or bracketed parts such as "(Soxoj)".
- Example: "Dmitriy (Soxoj) Danilov" -> "Dmitriy Danilov".
17. For "Website", output only the plain domain or URL as text, not a markdown hyperlink.
18. In "Other leads", do not label conflicting profiles as "false positive", "likely unrelated", or "potentially a false positive".
- Instead, use neutral intelligence wording such as:
"Accounts were found that are most likely unrelated to the main identity, but may indicate possible cross-border activity and should be verified."
19. When describing anomalies in "Other leads", prefer cautious investigative phrasing:
- "may be unrelated"
- "requires verification"
- "could indicate separate activity"
- "should be checked manually"
20. Do not include nicknames or aliases inside the Name field unless they are clearly part of the legal or real-world name.
Output requirements:
- Return only the final formatted text.
- Keep it short.
- No preamble, no explanations.
Now analyze the following report
+4606 -4694
File diff suppressed because it is too large
+8
@@ -0,0 +1,8 @@
{
"version": 1,
"updated_at": "2026-05-09T07:59:17Z",
"sites_count": 3154,
"min_maigret_version": "0.6.0",
"data_sha256": "acf9d9fef8412bf05fa09d50c1ae363e5c8394597b1aaa3f98a9a1c4e31ca356",
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"
}
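Review note: the metadata file above carries a `data_sha256` field next to `data_url`. The `db_updater` module is not shown in this diff, so the following is only a minimal sketch of how a downloaded payload could be checked against such a digest. The function name `sha256_matches` and the sample payload are hypothetical and are not taken from the PR.

```python
import hashlib


def sha256_matches(payload: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of payload equals expected_hex."""
    actual = hashlib.sha256(payload).hexdigest()
    # Normalise the expected value so stray whitespace or upper-case hex still compares equal.
    return actual == expected_hex.strip().lower()


# Hypothetical metadata shaped like the file above. The digest is computed
# from the sample payload here, it is NOT the real digest of data.json.
sample_payload = b'{"sites": {}}'
sample_meta = {"data_sha256": hashlib.sha256(sample_payload).hexdigest()}

if sha256_matches(sample_payload, sample_meta["data_sha256"]):
    print("checksum ok")
else:
    print("checksum mismatch, keep the previous database")
```

Comparing digests before replacing the local database is one way to avoid loading a truncated or tampered download; whether the updater in this PR does exactly that cannot be confirmed from the hunks shown.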
+28 -1
View File
@@ -54,5 +54,32 @@
"graph_report": false,
"pdf_report": false,
"html_report": false,
"web_interface_port": 5000
"md_report": false,
"openai_api_key": "",
"openai_model": "gpt-4o",
"openai_api_base_url": "https://api.openai.com/v1",
"web_interface_port": 5000,
"no_autoupdate": false,
"db_update_meta_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/db_meta.json",
"autoupdate_check_interval_hours": 24,
"cloudflare_bypass": {
"enabled": false,
"session_prefix": "maigret",
"trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
"modules": [
{
"name": "flaresolverr",
"method": "json_api",
"url": "http://localhost:8191/v1",
"max_timeout_ms": 60000,
"comment": "FlareSolverr (https://github.com/FlareSolverr/FlareSolverr). docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest"
},
{
"name": "chrome_webgate",
"method": "url_rewrite",
"url": "http://localhost:8000/html?url={url}&retries=1",
"comment": "CloudflareBypassForScraping fallback. WARNING: returns rendered HTML only — checkType: status_code and response_url misfire."
}
]
}
}
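Review note: to see how the `cloudflare_bypass.modules` list above surfaces in the log line added to `main()`, here is a small sketch that applies the same formatting expression to a trimmed copy of the settings fragment. The helper name `summarize_modules` is hypothetical; `build_cloudflare_bypass_config` itself is not shown in this diff, so the dictionary below is assumed to resemble what it returns.

```python
# Trimmed copy of the settings fragment above (comments and timeouts omitted).
cf_bypass_config = {
    "trigger_protection": ["cf_js_challenge", "cf_firewall", "webgate"],
    "modules": [
        {
            "name": "flaresolverr",
            "method": "json_api",
            "url": "http://localhost:8191/v1",
        },
        {
            "name": "chrome_webgate",
            "method": "url_rewrite",
            "url": "http://localhost:8000/html?url={url}&retries=1",
        },
    ],
}


def summarize_modules(config: dict) -> str:
    """Join modules as name(url), falling back to the method when no name is set."""
    return ", ".join(
        f"{m.get('name', m.get('method'))}({m.get('url')})"
        for m in config["modules"]
    )


print(
    f"Cloudflare webgate active: triggers={cf_bypass_config['trigger_protection']}, "
    f"modules=[{summarize_modules(cf_bypass_config)}]"
)
```

Note that the `{url}` placeholder in the `url_rewrite` module is printed literally; the summary does not substitute it.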
+5
@@ -42,7 +42,12 @@ class Settings:
pdf_report: bool
html_report: bool
graph_report: bool
md_report: bool
web_interface_port: int
no_autoupdate: bool
db_update_meta_url: str
autoupdate_check_interval_hours: int
cloudflare_bypass: dict
# submit mode settings
presence_strings: list
+53 -10
@@ -92,10 +92,12 @@ class MaigretSite:
# Alexa traffic rank
alexa_rank = None
# Source (in case a site is a mirror of another site)
source = None
source: Optional[str] = None
# URL protocol (http/https)
protocol = ''
# Protection types detected on this site (e.g. ["tls_fingerprint", "ddos_guard"])
protection: List[str] = []
def __init__(self, name, information):
self.name = name
@@ -173,13 +175,21 @@ class MaigretSite:
self.__dict__[CaseConverter.camel_to_snake(group)],
)
self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check)
self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check or "")
def detect_username(self, url: str) -> Optional[str]:
if self.url_regexp:
match_groups = self.url_regexp.match(url)
if match_groups:
return match_groups.groups()[-1].rstrip("/")
username = next(
(
group.rstrip("/")
for group in reversed(match_groups.groups())
if isinstance(group, str) and group
),
None,
)
return username
return None
@@ -194,8 +204,16 @@ class MaigretSite:
match_groups = self.url_regexp.match(url)
if not match_groups:
return None
_id = match_groups.groups()[-1].rstrip("/")
_id = next(
(
group.rstrip("/")
for group in reversed(match_groups.groups())
if isinstance(group, str) and group
),
None,
)
if _id is None:
return None
_type = self.type
return _id, _type
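Review note: the two hunks above replace `match_groups.groups()[-1].rstrip("/")` with a search for the last non-empty group. A standalone sketch of why that matters, assuming a URL pattern with alternative capture groups: the pattern, the `example.com` URLs, and the helper name `last_non_empty_group` are illustrative and not taken from the Maigret database.

```python
import re
from typing import Optional


def last_non_empty_group(match: Optional[re.Match]) -> Optional[str]:
    """Return the last captured group that is a non-empty string, without a trailing slash."""
    if match is None:
        return None
    return next(
        (
            group.rstrip("/")
            for group in reversed(match.groups())
            if isinstance(group, str) and group
        ),
        None,
    )


# Illustrative pattern: only one of the two groups participates in any match.
pattern = re.compile(r"^https?://example\.com/(?:u/([^/]+)|profile/([^/]+))/?$")

m = pattern.match("https://example.com/u/alice")
# m.groups() is ("alice", None) here, so groups()[-1].rstrip("/") would raise
# AttributeError on None; scanning in reverse skips the unmatched group.
print(last_non_empty_group(m))
```

When no group captured anything, the helper returns `None`, which is the case the new `if _id is None: return None` guard handles.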
@@ -462,9 +480,9 @@ class MaigretDatabase:
"tags": self._tags,
}
json_data = json.dumps(db_data, indent=4)
json_data = json.dumps(db_data, indent=4, ensure_ascii=False)
with open(filename, "w") as f:
with open(filename, "w", encoding="utf-8") as f:
f.write(json_data)
return self
@@ -564,7 +582,7 @@ class MaigretDatabase:
def get_scan_stats(self, sites_dict):
sites = sites_dict or self.sites_dict
found_flags = {}
found_flags: Dict[str, int] = {}
for _, s in sites.items():
if "presense_flag" in s.stats:
flag = s.stats["presense_flag"]
@@ -585,8 +603,10 @@ class MaigretDatabase:
def get_db_stats(self, is_markdown=False):
# Initialize counters
sites_dict = self.sites_dict
urls = {}
tags = {}
urls: Dict[str, int] = {}
tags: Dict[str, int] = {}
engine_total: Dict[str, int] = {}
engine_enabled: Dict[str, int] = {}
disabled_count = 0
message_checks_one_factor = 0
status_checks = 0
@@ -609,6 +629,14 @@ class MaigretDatabase:
elif site.check_type == 'status_code':
status_checks += 1
# Count engines
if site.engine:
engine_total[site.engine] = engine_total.get(site.engine, 0) + 1
if not site.disabled:
engine_enabled[site.engine] = (
engine_enabled.get(site.engine, 0) + 1
)
# Count tags
if not site.tags:
tags["NO_TAGS"] = tags.get("NO_TAGS", 0) + 1
@@ -645,11 +673,26 @@ class MaigretDatabase:
f"Sites with probing: {', '.join(sorted(site_with_probing))}",
f"Sites with activation: {', '.join(sorted(site_with_activation))}",
self._format_top_items("profile URLs", urls, 20, is_markdown),
self._format_engine_stats(engine_total, engine_enabled, is_markdown),
self._format_top_items("tags", tags, 20, is_markdown, self._tags),
]
return separator.join(output)
def _format_engine_stats(self, engine_total, engine_enabled, is_markdown):
"""Format per-engine enabled/total counts, sorted by total descending."""
output = "Sites by engine:\n"
for engine, total in sorted(
engine_total.items(), key=lambda x: x[1], reverse=True
):
enabled = engine_enabled.get(engine, 0)
perc = round(100 * enabled / total, 1) if total else 0.0
if is_markdown:
output += f"- `{engine}`: {enabled}/{total} ({perc}%)\n"
else:
output += f"{enabled}/{total} ({perc}%)\t{engine}\n"
return output
def _format_top_items(
self, title, items_dict, limit, is_markdown, valid_items=None
):
+28 -24
@@ -6,8 +6,7 @@ import logging
from typing import Any, Dict, List, Optional, Tuple
from aiohttp import ClientSession, TCPConnector
from aiohttp_socks import ProxyConnector
import cloudscraper
import cloudscraper # type: ignore[import-untyped]
from colorama import Fore, Style
from .activation import import_aiohttp_cookies
@@ -68,8 +67,10 @@ class Submitter:
else:
cookie_jar = import_aiohttp_cookies(args.cookie_file)
connector = ProxyConnector.from_url(proxy) if proxy else TCPConnector(ssl=False)
connector.verify_ssl = False
ssl_context = __import__('ssl').create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = __import__('ssl').CERT_NONE
connector = ProxyConnector.from_url(proxy) if proxy else TCPConnector(ssl=ssl_context)
self.session = ClientSession(
connector=connector, trust_env=True, cookie_jar=cookie_jar
)
@@ -88,7 +89,9 @@ class Submitter:
alexa_rank = 0
try:
alexa_rank = int(root.find('.//REACH').attrib['RANK'])
reach_elem = root.find('.//REACH')
if reach_elem is not None:
alexa_rank = int(reach_elem.attrib['RANK'])
except Exception:
pass
@@ -110,6 +113,7 @@ class Submitter:
cookies=self.args.cookie_file,
# Don't skip errors in submit mode - we need check both false positives/true negatives
skip_errors=False,
cloudflare_bypass=getattr(self, 'cloudflare_bypass', None),
)
return changes
@@ -127,7 +131,7 @@ class Submitter:
async def detect_known_engine(
self, url_exists, url_mainpage, session, follow_redirects, headers
) -> [List[MaigretSite], str]:
) -> Tuple[List[MaigretSite], str]:
session = session or self.session
resp_text, _ = await self.get_html_response_to_compare(
@@ -191,8 +195,9 @@ class Submitter:
# TODO: replace with checking.py/SimpleAiohttpChecker call
@staticmethod
async def get_html_response_to_compare(
url: str, session: ClientSession = None, redirects=False, headers: Dict = None
url: str, session: Optional[ClientSession] = None, redirects=False, headers: Optional[Dict] = None
):
assert session is not None, "session must not be None"
async with session.get(
url, allow_redirects=redirects, headers=headers
) as response:
@@ -211,10 +216,10 @@ class Submitter:
username: str,
url_exists: str,
cookie_filename="", # TODO: use cookies
session: ClientSession = None,
session: Optional[ClientSession] = None,
follow_redirects=False,
headers: dict = None,
) -> Tuple[List[str], List[str], str, str]:
headers: Optional[dict] = None,
) -> Tuple[Optional[List[str]], Optional[List[str]], str, str]:
random_username = generate_random_username()
url_of_non_existing_account = url_exists.lower().replace(
@@ -269,11 +274,8 @@ class Submitter:
tokens_a = set(re.split(f'[{self.SEPARATORS}]', first_html_response))
tokens_b = set(re.split(f'[{self.SEPARATORS}]', second_html_response))
a_minus_b = tokens_a.difference(tokens_b)
b_minus_a = tokens_b.difference(tokens_a)
a_minus_b = list(map(lambda x: x.strip('\\'), a_minus_b))
b_minus_a = list(map(lambda x: x.strip('\\'), b_minus_a))
a_minus_b: List[str] = [x.strip('\\') for x in tokens_a.difference(tokens_b)]
b_minus_a: List[str] = [x.strip('\\') for x in tokens_b.difference(tokens_a)]
# Filter out strings containing usernames
a_minus_b = [s for s in a_minus_b if username.lower() not in s.lower()]
@@ -378,7 +380,7 @@ class Submitter:
).strip()
if field in ['tags', 'presense_strs', 'absence_strs']:
new_value = list(map(str.strip, new_value.split(',')))
new_value = list(map(str.strip, new_value.split(','))) # type: ignore[assignment]
if new_value:
setattr(site, field, new_value)
@@ -424,12 +426,12 @@ class Submitter:
f"{Fore.YELLOW}[!] Sites with domain \"{domain_raw}\" already exists in the Maigret database!{Style.RESET_ALL}"
)
status = lambda s: "(disabled)" if s.disabled else ""
site_status = lambda s: "(disabled)" if s.disabled else ""
url_block = lambda s: f"\n\t{s.url_main}\n\t{s.url}"
print(
"\n".join(
[
f"{site.name} {status(site)}{url_block(site)}"
f"{site.name} {site_status(site)}{url_block(site)}"
for site in matched_sites
]
)
@@ -497,7 +499,7 @@ class Submitter:
)
print('Detecting site engine, please wait...')
sites = []
sites: List[MaigretSite] = []
text = None
try:
sites, text = await self.detect_known_engine(
@@ -510,7 +512,7 @@ class Submitter:
except KeyboardInterrupt:
print('Engine detect process is interrupted.')
if 'cloudflare' in text.lower():
if text and 'cloudflare' in text.lower():
print(
'Cloudflare protection detected. I will use cloudscraper for further work'
)
@@ -573,6 +575,8 @@ class Submitter:
found = True
break
assert chosen_site is not None, "No sites to check"
if not found:
print(
f"{Fore.RED}[!] The check for site '{chosen_site.name}' failed!{Style.RESET_ALL}"
@@ -631,8 +635,8 @@ class Submitter:
# chosen_site.alexa_rank = rank
self.logger.info(chosen_site.json)
site_data = chosen_site.strip_engine_data()
self.logger.info(site_data.json)
stripped_site = chosen_site.strip_engine_data()
self.logger.info(stripped_site.json)
if old_site:
# Update old site with new values and log changes
@@ -651,7 +655,7 @@ class Submitter:
for field, display_name in fields_to_check.items():
old_value = getattr(old_site, field)
new_value = getattr(site_data, field)
new_value = getattr(stripped_site, field)
if field == 'tags' and not new_tags:
continue
if str(old_value) != str(new_value):
@@ -661,7 +665,7 @@ class Submitter:
old_site.__dict__[field] = new_value
# update the site
final_site = old_site if old_site else site_data
final_site = old_site if old_site else stripped_site
self.db.update_site(final_site)
# save the db in file
+6 -3
@@ -8,7 +8,7 @@ from typing import Any
DEFAULT_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
]
@@ -71,7 +71,10 @@ class URLMatcher:
def ascii_data_display(data: str) -> Any:
return ast.literal_eval(data)
try:
return ast.literal_eval(data)
except (ValueError, SyntaxError):
return data
def get_dict_ascii_tree(items, prepend="", new_line=True):
@@ -86,7 +89,7 @@ def get_dict_ascii_tree(items, prepend="", new_line=True):
new_result + new_line if num != len(items) - 1 else last_result + new_line
)
if type(item) == tuple:
if isinstance(item, tuple):
field_name, field_value = item
if field_value.startswith("['"):
is_last_item = num == len(items) - 1
+3 -2
@@ -13,6 +13,7 @@ import os
import asyncio
from datetime import datetime
from threading import Thread
from typing import Any, Dict
import maigret
import maigret.settings
from maigret.sites import MaigretDatabase
@@ -23,7 +24,7 @@ app = Flask(__name__)
app.secret_key = os.getenv('FLASK_SECRET_KEY', os.urandom(24).hex())
# add background job tracking
background_jobs = {}
background_jobs: Dict[str, Any] = {}
job_results = {}
# Configuration
@@ -260,7 +261,7 @@ def search():
target=process_search_task, args=(usernames, options, timestamp)
),
}
background_jobs[timestamp]['thread'].start()
background_jobs[timestamp]['thread'].start() # type: ignore[union-attr]
return redirect(url_for('status', timestamp=timestamp))
Generated
+1199 -985
File diff suppressed because it is too large
+2 -2
@@ -1,5 +1,5 @@
maigret @ https://github.com/soxoj/maigret/archive/refs/heads/main.zip
pefile==2023.2.7 # do not bump while pyinstaller is 6.11.1, there is a conflict
psutil==7.1.3
pyinstaller==6.16.0
psutil==7.2.2
pyinstaller==6.20.0
pywin32-ctypes==0.2.3
+13 -7
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "maigret"
version = "0.5.0"
version = "0.6.0"
description = "🕵️‍♂️ Collect a dossier on a person by username from thousands of sites."
authors = ["Soxoj <soxoj@protonmail.com>"]
readme = "README.md"
@@ -15,6 +15,11 @@ repository = "https://github.com/soxoj/maigret"
classifiers = [
"Development Status :: 5 - Production/Stable",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Intended Audience :: Information Technology",
"Operating System :: OS Independent",
"License :: OSI Approved :: MIT License",
@@ -38,18 +43,18 @@ arabic-reshaper = "^3.0.0"
async-timeout = "^5.0.1"
attrs = ">=25.3,<27.0"
certifi = ">=2025.6.15,<2027.0.0"
chardet = "^5.0.0"
chardet = ">=5,<8"
colorama = "^0.4.6"
future = "^1.0.0"
future-annotations= "^1.0.0"
html5lib = "^1.1"
idna = "^3.4"
Jinja2 = "^3.1.6"
lxml = ">=5.3,<7.0"
lxml = ">=6.0.2,<7.0"
MarkupSafe = "^3.0.2"
mock = "^5.1.0"
multidict = "^6.6.3"
pycountry = "^24.6.1"
pycountry = ">=24.6.1,<27.0.0"
PyPDF2 = "^3.0.1"
PySocks = "^1.7.1"
python-bidi = "^0.6.3"
@@ -57,7 +62,7 @@ requests = "^2.32.4"
requests-futures = "^1.0.2"
requests-toolbelt = "^1.0.0"
six = "^1.17.0"
socid-extractor = "^0.0.27"
socid-extractor = ">=0.0.27,<0.0.29"
soupsieve = "^2.6"
stem = "^1.8.1"
torrequest = "^0.1.0"
@@ -74,6 +79,7 @@ cloudscraper = "^1.2.71"
flask = {extras = ["async"], version = "^3.1.1"}
asgiref = "^3.9.1"
platformdirs = "^4.3.8"
curl-cffi = ">=0.14,<1.0"
[tool.poetry.group.dev.dependencies]
@@ -86,7 +92,7 @@ pytest-cov = ">=6,<8"
pytest-httpserver = "^1.0.0"
pytest-rerunfailures = ">=15.1,<17.0"
reportlab = "^4.4.3"
mypy = "^1.14.1"
mypy = ">=1.14.1,<3.0.0"
tuna = "^0.5.11"
coverage = "^7.9.2"
black = ">=25.1,<27.0"
@@ -94,4 +100,4 @@ black = ">=25.1,<27.0"
[tool.poetry.scripts]
# Run with: poetry run maigret <username>
maigret = "maigret.maigret:run"
update_sitesmd = "utils.update_site_data:main"
update_sitesmd = "utils.update_site_data:main"
+1
@@ -3,4 +3,5 @@
filterwarnings =
error
ignore::UserWarning
ignore:codecs.open\(\) is deprecated:DeprecationWarning:xmind.core.saver
asyncio_mode=auto
-3
@@ -1,3 +0,0 @@
[mutmut]
paths_to_mutate=maigret/
tests_dir=tests/
+1311 -1394
File diff suppressed because it is too large
+1 -1
@@ -3,7 +3,7 @@ icon: static/maigret.png
name: maigret
summary: 🕵️‍♂️ Collect a dossier on a person by username from thousands of sites.
description: |
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required. Maigret is an easy-to-use and powerful fork of Sherlock.
**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. No API keys required.
More than 3000 sites are currently supported; by default, the search runs against the 500 most popular sites in descending order of popularity. Tor sites, I2P sites, and domains (via DNS resolution) can also be checked.
+107
@@ -56,3 +56,110 @@ async def test_import_aiohttp_cookies(cookie_test_server):
print(f"Server response: {result}")
assert result == {'cookies': {'a': 'b'}}
# ---- OnlyFans signing tests (pure-compute, no network) ----
class _FakeSite:
"""Minimal stand-in for MaigretSite with the attributes onlyfans() touches."""
def __init__(self, headers=None, activation=None):
self.headers = headers or {}
self.activation = activation or {
"static_param": "jLM8LXHU1CGcuCzPMNwWX9osCScVuP4D",
"checksum_indexes": [28, 3, 16, 32, 25, 24, 23, 0, 26],
"checksum_constant": -180,
"format": "57203:{}:{:x}:69cfa6d8",
"url": "https://onlyfans.com/api2/v2/init",
}
class _FakeResponse:
def __init__(self, cookies=None):
self.cookies = cookies or {}
def test_onlyfans_sets_xbc_when_zero(monkeypatch):
site = _FakeSite(headers={"x-bc": "0", "cookie": "existing=1"})
# Prevent any real network. If _sign path still fires requests.get, fail loudly.
import maigret.activation as act_mod
def boom(*a, **kw): # pragma: no cover - sanity
raise AssertionError("requests.get should not run when cookie is present")
monkeypatch.setattr(act_mod.__dict__.get("requests", None) or __import__("requests"), "get", boom, raising=False)
logger = Mock()
ParsingActivator.onlyfans(site, logger, url="https://onlyfans.com/api2/v2/users/adam")
# x-bc must be rewritten to a non-zero hex token
assert site.headers["x-bc"] != "0"
assert len(site.headers["x-bc"]) == 40 # 20 bytes → 40 hex chars
# time / sign headers set for target URL
assert "time" in site.headers and site.headers["time"].isdigit()
assert site.headers["sign"].startswith("57203:")
def test_onlyfans_fetches_init_cookie_when_missing(monkeypatch):
"""When cookie header is absent, init endpoint is called and its cookies stored."""
site = _FakeSite(headers={"x-bc": "already_set_token", "user-id": "0"})
import requests
captured = {}
def fake_get(url, headers=None, timeout=15):
captured["url"] = url
captured["headers"] = dict(headers or {})
return _FakeResponse(cookies={"sess": "abc123", "csrf": "xyz"})
monkeypatch.setattr(requests, "get", fake_get)
logger = Mock()
ParsingActivator.onlyfans(site, logger, url="https://onlyfans.com/api2/v2/users/adam")
# init request made
assert captured["url"] == site.activation["url"]
# headers passed to init include freshly generated time/sign
assert "time" in captured["headers"]
assert captured["headers"]["sign"].startswith("57203:")
# cookie header populated from response
assert site.headers["cookie"] == "sess=abc123; csrf=xyz"
def test_onlyfans_signature_is_deterministic_for_same_time(monkeypatch):
"""Two calls with patched time produce identical signatures."""
site1 = _FakeSite(headers={"x-bc": "token", "cookie": "c=1"})
site2 = _FakeSite(headers={"x-bc": "token", "cookie": "c=1"})
import maigret.activation
monkeypatch.setattr(maigret.activation, "_time", __import__("time"), raising=False)
fixed = 1_700_000_000.123
import time as time_mod
monkeypatch.setattr(time_mod, "time", lambda: fixed)
logger = Mock()
ParsingActivator.onlyfans(site1, logger, url="https://onlyfans.com/api2/v2/users/adam")
ParsingActivator.onlyfans(site2, logger, url="https://onlyfans.com/api2/v2/users/adam")
assert site1.headers["time"] == site2.headers["time"]
assert site1.headers["sign"] == site2.headers["sign"]
def test_onlyfans_sign_differs_per_path(monkeypatch):
"""Different target URLs must yield different signatures."""
site = _FakeSite(headers={"x-bc": "token", "cookie": "c=1"})
import time as time_mod
monkeypatch.setattr(time_mod, "time", lambda: 1_700_000_000.0)
logger = Mock()
ParsingActivator.onlyfans(site, logger, url="https://onlyfans.com/api2/v2/users/adam")
sig_adam = site.headers["sign"]
ParsingActivator.onlyfans(site, logger, url="https://onlyfans.com/api2/v2/users/bob")
sig_bob = site.headers["sign"]
assert sig_adam != sig_bob
+408
@@ -1,7 +1,22 @@
from argparse import ArgumentTypeError
from mock import Mock
import pytest
from maigret import search
from maigret.checking import (
detect_error_page,
extract_ids_data,
parse_usernames,
update_results_info,
get_failed_sites,
timeout_check,
debug_response_logging,
process_site_result,
)
from maigret.errors import CheckError
from maigret.result import MaigretCheckResult, MaigretCheckStatus
from maigret.sites import MaigretSite
def site_result_except(server, username, **kwargs):
@@ -67,3 +82,396 @@ async def test_checking_by_message_negative(httpserver, local_test_db):
result = await search('unclaimed', site_dict=sites_dict, logger=Mock())
assert result['Message']['status'].is_found() is True
# ---- Pure-function unit tests (no network) ----
def test_detect_error_page_site_specific():
err = detect_error_page(
"Please enable JavaScript to proceed",
200,
{"Please enable JavaScript to proceed": "Scraping protection"},
ignore_403=False,
)
assert err is not None
assert err.type == "Site-specific"
assert err.desc == "Scraping protection"
def test_detect_error_page_403():
err = detect_error_page("some body", 403, {}, ignore_403=False)
assert err is not None
assert err.type == "Access denied"
def test_detect_error_page_403_ignored():
# XenForo engine uses ignore403 because member-not-found also returns 403
assert detect_error_page("not found body", 403, {}, ignore_403=True) is None
def test_detect_error_page_999_linkedin():
# LinkedIn returns 999 on bot suspicion — must NOT be reported as Server error
assert detect_error_page("", 999, {}, ignore_403=False) is None
def test_detect_error_page_500():
err = detect_error_page("", 503, {}, ignore_403=False)
assert err is not None
assert err.type == "Server"
assert "503" in err.desc
def test_detect_error_page_ok():
assert detect_error_page("hello world", 200, {}, ignore_403=False) is None
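The `detect_error_page` tests above pin down a small classification contract: site-specific markers, 403 unless ignored, 5xx as a server error, and LinkedIn's 999 left alone. The sketch below is one way to satisfy those assertions. The function name `classify_response`, the tuple return type, the check ordering, and the description strings are all assumptions, not maigret's actual implementation.

```python
from typing import Dict, Optional, Tuple


def classify_response(
    body: str,
    status: int,
    markers: Dict[str, str],
    ignore_403: bool = False,
) -> Optional[Tuple[str, str]]:
    """Return (error_type, description), or None when the page looks normal.

    Hypothetical stand-in for detect_error_page; it returns a plain tuple
    instead of a CheckError instance.
    """
    # Site-specific markers are checked first (this ordering is an assumption).
    for marker, description in markers.items():
        if marker in body:
            return ("Site-specific", description)
    if status == 403 and not ignore_403:
        return ("Access denied", "403 status code")
    # Only the standard 5xx range counts, so a non-standard 999 passes through.
    if 500 <= status < 600:
        return ("Server", f"{status} status code")
    return None
```

Bounding the server-error check at 600 is what keeps a 999 response from being reported as a server failure.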
def test_parse_usernames_single_username():
logger = Mock()
result = parse_usernames({"profile_username": "alice"}, logger)
assert result == {"alice": "username"}
def test_parse_usernames_list_of_usernames():
logger = Mock()
result = parse_usernames({"other_usernames": "['alice', 'bob']"}, logger)
assert result == {"alice": "username", "bob": "username"}
def test_parse_usernames_malformed_list():
logger = Mock()
result = parse_usernames({"other_usernames": "not-a-list"}, logger)
# should swallow the error and just return empty
assert result == {}
assert logger.warning.called
def test_parse_usernames_supported_id():
logger = Mock()
# "telegram" is in SUPPORTED_IDS per socid_extractor
from maigret.checking import SUPPORTED_IDS
if SUPPORTED_IDS:
key = next(iter(SUPPORTED_IDS))
result = parse_usernames({key: "some_value"}, logger)
assert result.get("some_value") == key
def test_update_results_info_links():
info = {"username": "test"}
result = update_results_info(
info,
{"links": "['https://example.com/a', 'https://example.com/b']", "website": "https://example.com/w"},
{"alice": "username"},
)
assert result["ids_usernames"] == {"alice": "username"}
assert "https://example.com/w" in result["ids_links"]
assert "https://example.com/a" in result["ids_links"]
def test_update_results_info_no_website():
info = {}
result = update_results_info(info, {"links": "[]"}, {})
assert result["ids_links"] == []
def test_extract_ids_data_bad_html_returns_empty():
logger = Mock()
# Random HTML should not raise — returns {} if nothing matches
out = extract_ids_data("<html><body>nothing special</body></html>", logger, Mock(name="Site"))
assert isinstance(out, dict)
def test_get_failed_sites_filters_permanent_errors():
# Temporary errors (Request timeout, Connecting failure, etc.) are retryable → returned.
# Permanent ones (Captcha, Access denied, etc.) and results without error → filtered out.
good_status = MaigretCheckResult("u", "S1", "https://s1", MaigretCheckStatus.CLAIMED)
timeout_err = MaigretCheckResult(
"u", "S2", "https://s2", MaigretCheckStatus.UNKNOWN,
error=CheckError("Request timeout", "slow server"),
)
captcha_err = MaigretCheckResult(
"u", "S3", "https://s3", MaigretCheckStatus.UNKNOWN,
error=CheckError("Captcha", "Cloudflare"),
)
results = {
"S1": {"status": good_status},
"S2": {"status": timeout_err},
"S3": {"status": captcha_err},
"S4": {}, # no status at all
}
failed = get_failed_sites(results)
# Only the temporary-error site is retry-worthy
assert failed == ["S2"]
def test_timeout_check_valid():
assert timeout_check("2.5") == 2.5
assert timeout_check("30") == 30.0
def test_timeout_check_invalid():
with pytest.raises(ArgumentTypeError):
timeout_check("abc")
with pytest.raises(ArgumentTypeError):
timeout_check("0")
with pytest.raises(ArgumentTypeError):
timeout_check("-1")
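The two `timeout_check` tests above describe an argparse type callable that accepts positive numbers and rejects everything else. A minimal sketch under that reading follows; `positive_timeout` is a hypothetical name and the error messages are invented for illustration.

```python
from argparse import ArgumentTypeError


def positive_timeout(value: str) -> float:
    """Convert a CLI string to a positive float, or raise ArgumentTypeError.

    Hypothetical equivalent of timeout_check, inferred only from the tests.
    """
    try:
        timeout = float(value)
    except ValueError:
        raise ArgumentTypeError(f"invalid timeout {value!r}: not a number")
    if timeout <= 0:
        raise ArgumentTypeError(f"invalid timeout {value!r}: must be positive")
    return timeout
```

Passing such a function as `type=` to `ArgumentParser.add_argument` lets argparse turn the exception into a normal usage error.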
def test_debug_response_logging_writes(tmp_path, monkeypatch):
monkeypatch.chdir(tmp_path)
debug_response_logging("https://example.com", "<html>hi</html>", 200, None)
out = (tmp_path / "debug.log").read_text()
assert "https://example.com" in out
assert "200" in out
def test_debug_response_logging_no_response(tmp_path, monkeypatch):
monkeypatch.chdir(tmp_path)
debug_response_logging("https://example.com", None, None, CheckError("Timeout"))
out = (tmp_path / "debug.log").read_text()
assert "No response" in out
def _make_site(data_overrides=None):
base = {
"url": "https://x/{username}",
"urlMain": "https://x",
"checkType": "status_code",
"usernameClaimed": "a",
"usernameUnclaimed": "b",
}
if data_overrides:
base.update(data_overrides)
return MaigretSite("TestSite", base)
def test_process_site_result_no_response_returns_info():
site = _make_site()
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
out = process_site_result(None, Mock(), Mock(), info, site)
assert out is info
def test_process_site_result_status_already_set():
site = _make_site()
pre = MaigretCheckResult("a", "S", "u", MaigretCheckStatus.ILLEGAL)
info = {"username": "a", "parsing_enabled": False, "status": pre, "url_user": "u"}
# Since status is already set, function returns without changes
out = process_site_result(("<html/>", 200, None), Mock(), Mock(), info, site)
assert out["status"] is pre
def test_process_site_result_status_code_claimed():
site = _make_site({"checkType": "status_code"})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
out = process_site_result(("<html/>", 200, None), Mock(), Mock(), info, site)
assert out["status"].status == MaigretCheckStatus.CLAIMED
assert out["http_status"] == 200
def test_process_site_result_status_code_available():
site = _make_site({"checkType": "status_code"})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
out = process_site_result(("<html/>", 404, None), Mock(), Mock(), info, site)
assert out["status"].status == MaigretCheckStatus.AVAILABLE
def test_process_site_result_message_claimed():
site = _make_site({
"checkType": "message",
"presenseStrs": ["profile-name"],
"absenceStrs": ["not found"],
})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
out = process_site_result(("<div class='profile-name'>Alice</div>", 200, None), Mock(), Mock(), info, site)
assert out["status"].status == MaigretCheckStatus.CLAIMED
def test_process_site_result_message_available_by_absence():
site = _make_site({
"checkType": "message",
"presenseStrs": ["profile-name"],
"absenceStrs": ["not found"],
})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
out = process_site_result(("<h1>not found</h1> profile-name too", 200, None), Mock(), Mock(), info, site)
# absence marker wins even if presence marker also appears
assert out["status"].status == MaigretCheckStatus.AVAILABLE
def test_process_site_result_with_error_is_unknown():
site = _make_site({"checkType": "status_code"})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
resp = ("body", 403, CheckError("Captcha", "Cloudflare"))
out = process_site_result(resp, Mock(), Mock(), info, site)
assert out["status"].status == MaigretCheckStatus.UNKNOWN
assert out["status"].error is not None
def test_process_site_result_error_context_uses_instance():
# Regression: context must render the CheckError instance, not the class.
site = _make_site({"checkType": "status_code"})
info = {"username": "a", "parsing_enabled": False, "url_user": "https://x/a"}
err = CheckError("Request timeout", "slow server")
out = process_site_result(("body", 0, err), Mock(), Mock(), info, site)
assert out["status"].context == "Request timeout error: slow server"
assert "class" not in out["status"].context
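The regression test above guards against formatting the exception class instead of the exception instance. The sketch below reproduces the difference with a stand-in class. `DemoCheckError` is hypothetical, and its `__str__` format is inferred from the expected string in the test.

```python
class DemoCheckError:
    """Hypothetical stand-in for maigret.errors.CheckError."""

    def __init__(self, typename: str, desc: str = "") -> None:
        self.type = typename
        self.desc = desc

    def __str__(self) -> str:
        # Format inferred from the test's expected context string.
        return f"{self.type} error: {self.desc}"


err = DemoCheckError("Request timeout", "slow server")

# Bug: str() of the class yields its repr, e.g. "<class '...DemoCheckError'>".
buggy_context = str(DemoCheckError)

# Fix: str() of the instance yields the actual message.
fixed_context = str(err)
```

Both expressions are valid Python and neither raises, which is why a bug of this kind tends to need an explicit assertion on the rendered text to be caught.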
# ---- CurlCffiChecker: TLS impersonation header sanitisation ----
class _FakeCurlResponse:
def __init__(self, text="ok", status_code=200):
self.text = text
self.status_code = status_code
class _FakeCurlSession:
"""Captures the kwargs of the last .get/.post/.head call for assertions."""
last_method = None
last_kwargs = None
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
return False
async def get(self, **kwargs):
type(self).last_method = 'get'
type(self).last_kwargs = kwargs
return _FakeCurlResponse()
async def post(self, **kwargs):
type(self).last_method = 'post'
type(self).last_kwargs = kwargs
return _FakeCurlResponse()
async def head(self, **kwargs):
type(self).last_method = 'head'
type(self).last_kwargs = kwargs
return _FakeCurlResponse()
@pytest.fixture
def fake_curl_cffi(monkeypatch):
"""Replace CurlCffiAsyncSession with a recorder. Resets capture between tests."""
from maigret import checking
_FakeCurlSession.last_method = None
_FakeCurlSession.last_kwargs = None
monkeypatch.setattr(checking, 'CurlCffiAsyncSession', _FakeCurlSession)
return _FakeCurlSession
@pytest.mark.asyncio
async def test_curl_cffi_strips_random_user_agent_to_let_impersonation_drive_ua(fake_curl_cffi):
"""Regression: maigret used to forward `get_random_user_agent()` (often Chrome 91)
to curl_cffi alongside `impersonate="chrome"` (Chrome 131 TLS). Cloudflare composite
bot scoring rejects the resulting "Chrome 91 UA + Chrome 131 TLS" combo with a JS
challenge. The fix strips User-Agent and Connection from the headers passed to
curl_cffi so the impersonation default UA wins.
"""
from maigret.checking import CurlCffiChecker
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome')
checker.prepare(
url='https://example.com/u/test',
headers={
"User-Agent": "Mozilla/5.0 ... Chrome/91.0.4472.124 ...", # maigret default
"Connection": "close", # maigret default
},
allow_redirects=True,
timeout=10,
method='get',
)
await checker.check()
sent = fake_curl_cffi.last_kwargs
assert fake_curl_cffi.last_method == 'get'
assert sent['impersonate'] == 'chrome'
# The whole point of the fix: random UA must not leak through.
assert sent['headers'] is None or 'User-Agent' not in sent['headers']
assert sent['headers'] is None or 'user-agent' not in {k.lower() for k in sent['headers']}
# Connection: close also stripped (interferes with impersonation defaults).
assert sent['headers'] is None or 'Connection' not in sent['headers']
@pytest.mark.asyncio
async def test_curl_cffi_preserves_site_specific_headers(fake_curl_cffi):
"""Site-specific headers (e.g. Content-Type for POST APIs, auth tokens, cookies)
must survive the User-Agent strip; only UA and Connection are removed.
"""
from maigret.checking import CurlCffiChecker
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome')
checker.prepare(
url='https://example.com/api',
headers={
"User-Agent": "Mozilla/5.0 random",
"Connection": "close",
"Content-Type": "application/json",
"X-Csrf-Token": "abc123",
},
allow_redirects=True,
timeout=10,
method='get',
)
await checker.check()
sent_headers = fake_curl_cffi.last_kwargs['headers']
assert sent_headers is not None
assert sent_headers.get("Content-Type") == "application/json"
assert sent_headers.get("X-Csrf-Token") == "abc123"
# Sanity: stripped pair is gone
assert "User-Agent" not in sent_headers
assert "Connection" not in sent_headers
@pytest.mark.asyncio
async def test_curl_cffi_handles_empty_headers(fake_curl_cffi):
"""No headers at all → headers kwarg is None (not an empty dict that could confuse
curl_cffi's impersonation header injection)."""
from maigret.checking import CurlCffiChecker
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome')
checker.prepare(
url='https://example.com/u/test',
headers=None,
allow_redirects=True,
timeout=10,
method='get',
)
await checker.check()
assert fake_curl_cffi.last_kwargs['headers'] is None
assert fake_curl_cffi.last_kwargs['impersonate'] == 'chrome'
@pytest.mark.asyncio
async def test_curl_cffi_strips_ua_for_post_too(fake_curl_cffi):
"""The same UA-strip must apply on POST (e.g. Discord-style POST username probes
with `tls_fingerprint`)."""
from maigret.checking import CurlCffiChecker
checker = CurlCffiChecker(logger=Mock(), browser_emulate='chrome')
checker.prepare(
url='https://example.com/api/check',
headers={
"User-Agent": "Mozilla/5.0 random",
"Content-Type": "application/json",
},
allow_redirects=True,
timeout=10,
method='post',
payload={"username": "test"},
)
await checker.check()
sent = fake_curl_cffi.last_kwargs
assert fake_curl_cffi.last_method == 'post'
assert sent['json'] == {"username": "test"}
assert "User-Agent" not in sent['headers']
assert sent['headers'].get("Content-Type") == "application/json"
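The four `CurlCffiChecker` tests above describe one header-sanitising rule: drop `User-Agent` and `Connection`, keep everything else, and pass `None` when no headers were given. A minimal sketch of such a filter follows. `sanitise_headers` is a hypothetical helper, and returning `None` when nothing is left after filtering is an assumption that the tests permit but do not require.

```python
from typing import Dict, Optional

# Header names (lower-cased) that would conflict with the impersonated browser.
STRIPPED_HEADERS = {"user-agent", "connection"}


def sanitise_headers(headers: Optional[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Remove headers that should be left to the TLS impersonation profile."""
    if not headers:
        return None
    kept = {
        name: value
        for name, value in headers.items()
        if name.lower() not in STRIPPED_HEADERS
    }
    # Assumption: an empty result is reported as None rather than {}.
    return kept or None
```

Comparing lower-cased names means a site entry that spells the header `user-agent` is filtered the same way as `User-Agent`.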
+6
@@ -48,6 +48,12 @@ DEFAULT_ARGS: Dict[str, Any] = {
'web': None,
'with_domains': False,
'xmind': False,
'md': False,
'ai': False,
'ai_model': 'gpt-4o',
'no_autoupdate': False,
'force_update': False,
'cloudflare_bypass': False,
}
+256
@@ -0,0 +1,256 @@
"""Tests for the Cloudflare webgate config + checker."""
import json
from types import SimpleNamespace
from mock import Mock
import pytest
from maigret.checking import (
CloudflareWebgateChecker,
build_cloudflare_bypass_config,
)
def _settings(payload):
return SimpleNamespace(cloudflare_bypass=payload)
def test_config_disabled_by_default():
s = _settings({"enabled": False, "modules": [{"method": "json_api", "url": "x"}]})
assert build_cloudflare_bypass_config(s, force_enable=False) is None
def test_config_force_enable_overrides_disabled_settings():
s = _settings({"enabled": False, "modules": [{"method": "json_api", "url": "http://x:8191/v1"}]})
cfg = build_cloudflare_bypass_config(s, force_enable=True)
assert cfg is not None
assert cfg["modules"][0]["url"] == "http://x:8191/v1"
def test_config_drops_invalid_modules():
s = _settings({
"enabled": True,
"modules": [
{"method": "url_rewrite", "url": "http://x:8000/html"}, # missing {url}
{"method": "json_api", "url": "http://x:8191/v1"},
{"method": "unknown", "url": "http://x"},
],
})
cfg = build_cloudflare_bypass_config(s)
assert len(cfg["modules"]) == 1
assert cfg["modules"][0]["method"] == "json_api"
def test_config_returns_none_when_no_valid_modules():
s = _settings({"enabled": True, "modules": [{"method": "url_rewrite", "url": "no-placeholder"}]})
assert build_cloudflare_bypass_config(s) is None
def test_config_default_trigger_protection():
s = _settings({"enabled": True, "modules": [{"method": "json_api", "url": "http://x:8191/v1"}]})
cfg = build_cloudflare_bypass_config(s)
assert "cf_js_challenge" in cfg["trigger_protection"]
assert "cf_firewall" in cfg["trigger_protection"]
assert "webgate" in cfg["trigger_protection"]
@pytest.mark.asyncio
async def test_flaresolverr_success(httpserver):
httpserver.expect_request("/v1", method="POST").respond_with_json({
"status": "ok",
"solution": {"status": 404, "response": "<html>missing</html>", "url": "https://site/missing"},
})
config = {
"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}],
"session_prefix": "test",
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/missing", timeout=5)
body, status, err = await c.check()
assert err is None
assert status == 404 # upstream status preserved — fixes status_code checktype
assert "missing" in body
@pytest.mark.asyncio
async def test_flaresolverr_solver_error_propagates(httpserver):
httpserver.expect_request("/v1", method="POST").respond_with_json({
"status": "error",
"message": "Challenge could not be solved",
})
config = {
"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}],
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", timeout=5)
body, status, err = await c.check()
assert err is not None
assert "Challenge could not be solved" in err.desc
@pytest.mark.asyncio
async def test_falls_back_to_next_module_on_failure(httpserver):
# Bind only the second module — the first is unreachable.
httpserver.expect_request("/v1", method="POST").respond_with_json({
"status": "ok",
"solution": {"status": 200, "response": "ok-from-second", "url": "https://x"},
})
config = {
"modules": [
{"name": "broken", "method": "json_api", "url": "http://127.0.0.1:1/v1"},
{"name": "good", "method": "json_api", "url": httpserver.url_for("/v1")},
],
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", timeout=5)
body, status, err = await c.check()
assert err is None
assert status == 200
assert body == "ok-from-second"
@pytest.mark.asyncio
async def test_url_rewrite_returns_html_with_synthetic_200(httpserver):
# CloudflareBypassForScraping returns just the rendered HTML, no JSON wrapper.
httpserver.expect_request("/html").respond_with_data(
"<html>profile body</html>", status=200, content_type="text/html"
)
config = {
"modules": [{
"name": "cbfs",
"method": "url_rewrite",
"url": httpserver.url_for("/html") + "?url={url}",
}],
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", timeout=5)
body, status, err = await c.check()
assert err is None
assert status == 200 # synthetic — url_rewrite cannot recover real status
assert "profile body" in body
@pytest.mark.asyncio
async def test_all_modules_unreachable_actionable_error():
config = {
"modules": [
{"name": "fs", "method": "json_api", "url": "http://127.0.0.1:1/v1"},
{"name": "cbfs", "method": "url_rewrite", "url": "http://127.0.0.1:2/html?url={url}"},
],
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", timeout=2)
body, status, err = await c.check()
assert err is not None
assert err.type == "Webgate unavailable"
# Per-module attempt summary helps users see WHICH backend failed
assert "fs:" in err.desc and "cbfs:" in err.desc
# Primary URL is shown so the user knows where to look
assert "http://127.0.0.1:1/v1" in err.desc
# FlareSolverr docker hint when primary is json_api
assert "flaresolverr" in err.desc.lower()
@pytest.mark.asyncio
async def test_session_is_scoped_per_host(httpserver):
seen_sessions = []
def handler(request):
seen_sessions.append(request.get_json()["session"])
return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}
httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
config = {"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}]}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://patreon.com/foo", timeout=5)
await c.check()
c.prepare(url="https://patreon.com/bar", timeout=5)
await c.check()
c.prepare(url="https://lomography.com/baz", timeout=5)
await c.check()
assert seen_sessions[0] == seen_sessions[1], "same host -> same session"
assert seen_sessions[0] != seen_sessions[2], "different host -> different session"
assert "patreon.com" in seen_sessions[0]
assert "lomography.com" in seen_sessions[2]
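The session-scoping test above only requires that requests to the same host share a session name and that the host appears in it. One way to derive such a name is sketched below. `session_name`, the `prefix-host` layout, and the `unknown` fallback are assumptions, not the checker's actual scheme.

```python
from urllib.parse import urlparse


def session_name(url: str, prefix: str = "maigret") -> str:
    """Build a per-host session identifier (hypothetical naming scheme)."""
    host = urlparse(url).hostname or "unknown"
    return f"{prefix}-{host}"


same_a = session_name("https://patreon.com/foo")
same_b = session_name("https://patreon.com/bar")
other = session_name("https://lomography.com/baz")
```

Keying the session on the host would let a solved challenge be reused across usernames on the same site without mixing cookies between sites.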
@pytest.mark.asyncio
async def test_flaresolverr_request_body_shape(httpserver):
captured = {}
def handler(request):
captured["body"] = request.get_json()
return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}
httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
config = {"modules": [{"name": "fs", "method": "json_api", "url": httpserver.url_for("/v1")}]}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", headers={"User-Agent": "test-ua/1.0"}, timeout=5)
await c.check()
body = captured["body"]
assert body["cmd"] == "request.get"
assert body["url"] == "https://site/page"
assert body["session"].startswith("maigret-")
# userAgent was removed in FlareSolverr v2; the impersonated browser's
# own UA must be used to keep TLS+UA consistent.
assert "userAgent" not in body
assert "proxy" not in body
@pytest.mark.asyncio
async def test_flaresolverr_proxy_string_passed_through(httpserver):
captured = {}
def handler(request):
captured["body"] = request.get_json()
return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}
httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
config = {
"modules": [
{
"name": "fs",
"method": "json_api",
"url": httpserver.url_for("/v1"),
"proxy": "socks5://localhost:1080",
}
]
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", headers={}, timeout=5)
await c.check()
assert captured["body"]["proxy"] == {"url": "socks5://localhost:1080"}
@pytest.mark.asyncio
async def test_flaresolverr_proxy_dict_with_credentials(httpserver):
captured = {}
def handler(request):
captured["body"] = request.get_json()
return {"status": "ok", "solution": {"status": 200, "response": "", "url": "x"}}
httpserver.expect_request("/v1", method="POST").respond_with_handler(handler)
config = {
"modules": [
{
"name": "fs",
"method": "json_api",
"url": httpserver.url_for("/v1"),
"proxy": {
"url": "http://proxy.example:3128",
"username": "u",
"password": "p",
"stripped_extra": "ignored",
},
}
]
}
c = CloudflareWebgateChecker(logger=Mock(), config=config)
c.prepare(url="https://site/page", headers={}, timeout=5)
await c.check()
proxy = captured["body"]["proxy"]
assert proxy == {"url": "http://proxy.example:3128", "username": "u", "password": "p"}
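The two proxy tests above imply a normalisation step before the FlareSolverr request body is built: a bare string becomes `{"url": ...}`, and a dict keeps only `url`, `username`, and `password`. A sketch of that step follows; `normalise_proxy` is a hypothetical name, and the checker may implement this differently.

```python
from typing import Any, Dict, Optional, Union

# Keys that the tests above show being forwarded; anything else is dropped.
ALLOWED_PROXY_KEYS = ("url", "username", "password")


def normalise_proxy(
    proxy: Union[str, Dict[str, Any], None],
) -> Optional[Dict[str, Any]]:
    """Turn a module's proxy setting into a request-ready dict."""
    if not proxy:
        # No proxy configured: the caller can omit the "proxy" key entirely.
        return None
    if isinstance(proxy, str):
        return {"url": proxy}
    return {key: proxy[key] for key in ALLOWED_PROXY_KEYS if key in proxy}
```

Filtering against a fixed key list keeps unrelated settings fields out of the payload sent to the solver.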
+83
@@ -4,6 +4,30 @@ import pytest
from maigret.utils import is_country_tag
TOP_SITES_ALEXA_RANK_LIMIT = 50
KNOWN_SOCIAL_DOMAINS = [
"facebook.com",
"instagram.com",
"twitter.com",
"tiktok.com",
"vk.com",
"reddit.com",
"pinterest.com",
"snapchat.com",
"linkedin.com",
"tumblr.com",
"threads.net",
"bsky.app",
"myspace.com",
"weibo.com",
"mastodon.social",
"gab.com",
"minds.com",
"clubhouse.com",
]
@pytest.mark.slow
def test_tags_validity(default_db):
unknown_tags = set()
@@ -19,3 +43,62 @@ def test_tags_validity(default_db):
    # if you see "unchecked" tag error, please, do
    # maigret --db `pwd`/maigret/resources/data.json --self-check --tag unchecked --use-disabled-sites
    assert unknown_tags == set()


@pytest.mark.slow
def test_top_sites_have_category_tag(default_db):
    """Top sites by alexaRank must have at least one category tag (not just country codes)."""
    sites_ranked = sorted(
        [s for s in default_db.sites if s.alexa_rank],
        key=lambda s: s.alexa_rank,
    )[:TOP_SITES_ALEXA_RANK_LIMIT]

    missing_category = []
    for site in sites_ranked:
        category_tags = [t for t in site.tags if not is_country_tag(t)]
        if not category_tags:
            missing_category.append(f"{site.name} (rank {site.alexa_rank})")

    assert missing_category == [], (
        f"{len(missing_category)} top-{TOP_SITES_ALEXA_RANK_LIMIT} sites have no category tag: "
        + ", ".join(missing_category[:20])
    )


@pytest.mark.slow
def test_no_unused_tags_in_registry(default_db):
    """Every tag in the registry should be used by at least one site."""
    all_used_tags = set()
    for site in default_db.sites:
        for tag in site.tags:
            if not is_country_tag(tag):
                all_used_tags.add(tag)

    registered_tags = set(default_db._tags)
    unused = registered_tags - all_used_tags
    assert unused == set(), f"Tags registered but not used by any site: {unused}"


@pytest.mark.slow
def test_social_networks_have_social_tag(default_db):
    """Known social network domains must have the 'social' tag."""
    from urllib.parse import urlparse

    missing_social = []
    for site in default_db.sites:
        url = site.url_main or ""
        try:
            hostname = urlparse(url).hostname or ""
        except Exception:
            continue
        for domain in KNOWN_SOCIAL_DOMAINS:
            if hostname == domain or hostname.endswith("." + domain):
                if "social" not in site.tags:
                    missing_social.append(f"{site.name} ({domain})")
                break

    assert missing_social == [], (
        f"{len(missing_social)} known social networks missing 'social' tag: "
        + ", ".join(missing_social)
    )
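A side note on the host check in `test_social_networks_have_social_tag` (an illustrative sketch, not part of this diff): comparing against `"." + domain` rather than the bare domain is what keeps look-alike hosts from matching. The helper name `matches_domain` is hypothetical and exists only for this example.

```python
from urllib.parse import urlparse


def matches_domain(url: str, domain: str) -> bool:
    """Return True when the URL's host is the domain itself or one of its subdomains."""
    hostname = urlparse(url).hostname or ""
    # The leading dot prevents "notfacebook.com" from matching "facebook.com".
    return hostname == domain or hostname.endswith("." + domain)


exact = matches_domain("https://facebook.com/user", "facebook.com")
subdomain = matches_domain("https://www.facebook.com/user", "facebook.com")
lookalike = matches_domain("https://notfacebook.com/user", "facebook.com")
empty = matches_domain("", "facebook.com")
print(exact, subdomain, lookalike, empty)
```

A plain `hostname.endswith(domain)` would accept the look-alike host, which is why the test spells out both conditions.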
+236
@@ -0,0 +1,236 @@
"""Tests for the database auto-update system."""
import json
import os
import hashlib
from datetime import datetime, timezone, timedelta
from unittest.mock import patch, MagicMock

import pytest

from maigret.db_updater import (
    _parse_version,
    _needs_check,
    _is_version_compatible,
    _is_update_available,
    _load_state,
    _save_state,
    _best_local,
    _now_iso,
    resolve_db_path,
    force_update,
    CACHED_DB_PATH,
    BUNDLED_DB_PATH,
    STATE_PATH,
    MAIGRET_HOME,
)


def test_parse_version():
    assert _parse_version("0.5.0") == (0, 5, 0)
    assert _parse_version("1.2.3") == (1, 2, 3)
    assert _parse_version("bad") == (0, 0, 0)
    assert _parse_version("") == (0, 0, 0)


def test_needs_check_no_state():
    assert _needs_check({}, 24) is True


def test_needs_check_recent():
    state = {"last_check_at": _now_iso()}
    assert _needs_check(state, 24) is False


def test_needs_check_expired():
    old_time = (datetime.now(timezone.utc) - timedelta(hours=25)).strftime("%Y-%m-%dT%H:%M:%SZ")
    state = {"last_check_at": old_time}
    assert _needs_check(state, 24) is True


def test_needs_check_corrupt():
    state = {"last_check_at": "not-a-date"}
    assert _needs_check(state, 24) is True
def test_version_compatible():
    with patch("maigret.db_updater.__version__", "0.5.0"):
        assert _is_version_compatible({"min_maigret_version": "0.5.0"}) is True
        assert _is_version_compatible({"min_maigret_version": "0.4.0"}) is True
        assert _is_version_compatible({"min_maigret_version": "0.6.0"}) is False
        assert _is_version_compatible({}) is True  # missing field = compatible


def test_update_available_no_cache(tmp_path):
    with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "nonexistent.json")):
        assert _is_update_available({"updated_at": "2026-01-01T00:00:00Z"}, {}) is True


def test_update_available_newer(tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text("{}")
    with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
        state = {"last_meta": {"updated_at": "2026-01-01T00:00:00Z"}}
        meta = {"updated_at": "2026-02-01T00:00:00Z"}
        assert _is_update_available(meta, state) is True


def test_update_available_same(tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text("{}")
    with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
        state = {"last_meta": {"updated_at": "2026-01-01T00:00:00Z"}}
        meta = {"updated_at": "2026-01-01T00:00:00Z"}
        assert _is_update_available(meta, state) is False


def test_load_state_missing(tmp_path):
    with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "missing.json")):
        assert _load_state() == {}


def test_load_state_corrupt(tmp_path):
    corrupt = tmp_path / "state.json"
    corrupt.write_text("not json{{{")
    with patch("maigret.db_updater.STATE_PATH", str(corrupt)):
        assert _load_state() == {}
def test_save_and_load_state(tmp_path):
    state_file = tmp_path / "state.json"
    with patch("maigret.db_updater.STATE_PATH", str(state_file)):
        with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
            _save_state({"last_check_at": "2026-01-01T00:00:00Z"})
            loaded = _load_state()
            assert loaded["last_check_at"] == "2026-01-01T00:00:00Z"


def test_best_local_with_valid_cache(tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
    with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
        assert _best_local() == str(cache)


def test_best_local_with_corrupt_cache(tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text("not json")
    with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
        assert _best_local() == BUNDLED_DB_PATH


def test_best_local_no_cache(tmp_path):
    with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
        assert _best_local() == BUNDLED_DB_PATH


def test_resolve_db_path_custom_url():
    result = resolve_db_path("https://example.com/db.json")
    assert result == "https://example.com/db.json"


def test_resolve_db_path_custom_file(tmp_path):
    custom_db = tmp_path / "custom" / "path.json"
    custom_db.parent.mkdir(parents=True)
    custom_db.write_text("{}")
    result = resolve_db_path(str(custom_db))
    assert result.endswith("custom/path.json")


def test_resolve_db_path_no_autoupdate(tmp_path):
    with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
        result = resolve_db_path("resources/data.json", no_autoupdate=True)
    assert result == BUNDLED_DB_PATH


def test_resolve_db_path_no_autoupdate_with_cache(tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
    with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
        result = resolve_db_path("resources/data.json", no_autoupdate=True)
    assert result == str(cache)


@patch("maigret.db_updater._fetch_meta")
def test_resolve_db_path_network_failure(mock_fetch, tmp_path):
    mock_fetch.return_value = None
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
            with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
                result = resolve_db_path("resources/data.json")
    assert result == BUNDLED_DB_PATH
# --- force_update tests ---


@patch("maigret.db_updater._fetch_meta")
def test_force_update_network_failure(mock_fetch, tmp_path):
    mock_fetch.return_value = None
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
            assert force_update() is False


@patch("maigret.db_updater._fetch_meta")
def test_force_update_incompatible_version(mock_fetch, tmp_path):
    mock_fetch.return_value = {"min_maigret_version": "99.0.0", "sites_count": 100}
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
            assert force_update() is False


@patch("maigret.db_updater._download_and_verify")
@patch("maigret.db_updater._fetch_meta")
def test_force_update_success(mock_fetch, mock_download, tmp_path):
    mock_fetch.return_value = {
        "min_maigret_version": "0.1.0",
        "sites_count": 3200,
        "updated_at": "2099-01-01T00:00:00Z",
        "data_url": "https://example.com/data.json",
        "data_sha256": "abc123",
    }
    mock_download.return_value = str(tmp_path / "data.json")
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
            with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
                assert force_update() is True
                state = _load_state()
                assert state["last_meta"]["sites_count"] == 3200


@patch("maigret.db_updater._fetch_meta")
def test_force_update_already_up_to_date(mock_fetch, tmp_path):
    cache = tmp_path / "data.json"
    cache.write_text('{"sites": {}, "engines": {}, "tags": []}')
    state_file = tmp_path / "state.json"
    state_file.write_text(json.dumps({
        "last_check_at": _now_iso(),
        "last_meta": {"updated_at": "2026-01-01T00:00:00Z", "sites_count": 3000},
    }))
    mock_fetch.return_value = {
        "min_maigret_version": "0.1.0",
        "sites_count": 3000,
        "updated_at": "2026-01-01T00:00:00Z",
    }
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(state_file)):
            with patch("maigret.db_updater.CACHED_DB_PATH", str(cache)):
                assert force_update() is False


@patch("maigret.db_updater._download_and_verify")
@patch("maigret.db_updater._fetch_meta")
def test_force_update_download_fails(mock_fetch, mock_download, tmp_path):
    mock_fetch.return_value = {
        "min_maigret_version": "0.1.0",
        "sites_count": 3200,
        "updated_at": "2099-01-01T00:00:00Z",
        "data_url": "https://example.com/data.json",
        "data_sha256": "abc123",
    }
    mock_download.return_value = None
    with patch("maigret.db_updater.MAIGRET_HOME", str(tmp_path)):
        with patch("maigret.db_updater.STATE_PATH", str(tmp_path / "state.json")):
            with patch("maigret.db_updater.CACHED_DB_PATH", str(tmp_path / "missing.json")):
                assert force_update() is False
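The tests above pin down two small behaviours of the updater: version strings that cannot be parsed fall back to `(0, 0, 0)`, and a missing or corrupt `last_check_at` forces a new check. The real `maigret.db_updater` code is not shown in this diff, so the following is only a sketch of logic that would satisfy those tests; the names `parse_version` and `needs_check` are hypothetical stand-ins, and the timestamp format is assumed from the one used in `test_needs_check_expired`.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed from the tests above


def parse_version(text: str) -> tuple:
    """Parse 'X.Y.Z' into a tuple of ints; anything else becomes (0, 0, 0)."""
    try:
        parts = tuple(int(piece) for piece in text.split("."))
    except ValueError:
        return (0, 0, 0)
    return parts if len(parts) == 3 else (0, 0, 0)


def needs_check(state: dict, interval_hours: int) -> bool:
    """Return True when no usable timestamp exists or the interval has elapsed."""
    stamp = state.get("last_check_at")
    if not stamp:
        return True
    try:
        last = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        # A corrupt timestamp is treated like a missing one.
        return True
    return datetime.now(timezone.utc) - last >= timedelta(hours=interval_hours)


recent = datetime.now(timezone.utc).strftime(TIMESTAMP_FORMAT)
stale = (datetime.now(timezone.utc) - timedelta(hours=25)).strftime(TIMESTAMP_FORMAT)
print(parse_version("1.2.3"), needs_check({"last_check_at": stale}, 24))
```

Failing open (checking again when the state is unreadable) means a damaged state file costs one extra network request rather than a permanently stale database.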
+2 -2
@@ -36,7 +36,7 @@ def test_notify_about_errors():
         },
     }
-    results = notify_about_errors(results, query_notify=None, show_statistics=True)
+    notifications = notify_about_errors(results, query_notify=None, show_statistics=True)
     # Check the output
     expected_output = [
@@ -55,4 +55,4 @@ def test_notify_about_errors():
         ('Access denied: 25.0%', '!'),
         ('You can see detailed site check errors with a flag `--print-errors`', '-'),
     ]
-    assert results == expected_output
+    assert notifications == expected_output
+21 -20
@@ -3,6 +3,7 @@
import pytest
import asyncio
import logging
from typing import Any, List, Tuple, Callable, Dict
from maigret.executors import (
AsyncioSimpleExecutor,
AsyncioProgressbarExecutor,
@@ -21,49 +22,49 @@ async def func(n):
 @pytest.mark.asyncio
 async def test_simple_asyncio_executor():
-    tasks = [(func, [n], {}) for n in range(10)]
+    tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
     executor = AsyncioSimpleExecutor(logger=logger)
     assert await executor.run(tasks) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
     assert executor.execution_time > 0.2
-    assert executor.execution_time < 0.3
+    assert executor.execution_time < 1.0
 @pytest.mark.asyncio
 async def test_asyncio_progressbar_executor():
-    tasks = [(func, [n], {}) for n in range(10)]
+    tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
     executor = AsyncioProgressbarExecutor(logger=logger)
     # no guarantees for the results order
     assert sorted(await executor.run(tasks)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
     assert executor.execution_time > 0.2
-    assert executor.execution_time < 0.3
+    assert executor.execution_time < 1.0
 @pytest.mark.asyncio
 async def test_asyncio_progressbar_semaphore_executor():
-    tasks = [(func, [n], {}) for n in range(10)]
+    tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
     executor = AsyncioProgressbarSemaphoreExecutor(logger=logger, in_parallel=5)
     # no guarantees for the results order
     assert sorted(await executor.run(tasks)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
     assert executor.execution_time > 0.2
-    assert executor.execution_time < 0.4
+    assert executor.execution_time < 1.1
 @pytest.mark.slow
 @pytest.mark.asyncio
 async def test_asyncio_progressbar_queue_executor():
-    tasks = [(func, [n], {}) for n in range(10)]
+    tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
     executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=2)
     assert await executor.run(tasks) == [0, 1, 3, 2, 4, 6, 7, 5, 9, 8]
     assert executor.execution_time > 0.5
-    assert executor.execution_time < 0.7
+    assert executor.execution_time < 1.4
     executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=3)
     assert await executor.run(tasks) == [0, 3, 1, 4, 6, 2, 7, 9, 5, 8]
     assert executor.execution_time > 0.4
-    assert executor.execution_time < 0.6
+    assert executor.execution_time < 1.3
     executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=5)
     assert await executor.run(tasks) in (
@@ -71,41 +72,41 @@ async def test_asyncio_progressbar_queue_executor():
         [0, 3, 6, 1, 4, 9, 7, 2, 5, 8],
     )
     assert executor.execution_time > 0.3
-    assert executor.execution_time < 0.5
+    assert executor.execution_time < 1.2
     executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=10)
     assert await executor.run(tasks) == [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
     assert executor.execution_time > 0.2
-    assert executor.execution_time < 0.4
+    assert executor.execution_time < 1.1
 @pytest.mark.asyncio
 async def test_asyncio_queue_generator_executor():
-    tasks = [(func, [n], {}) for n in range(10)]
+    tasks: List[Tuple[Callable, list, dict]] = [(func, [n], {}) for n in range(10)]
     executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=2)
-    results = [result async for result in executor.run(tasks)]
+    results = [result async for result in executor.run(tasks)]  # type: ignore[arg-type]
     assert results == [0, 1, 3, 2, 4, 6, 7, 5, 9, 8]
     assert executor.execution_time > 0.5
-    assert executor.execution_time < 0.6
+    assert executor.execution_time < 1.3
     executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=3)
-    results = [result async for result in executor.run(tasks)]
+    results = [result async for result in executor.run(tasks)]  # type: ignore[arg-type]
     assert results == [0, 3, 1, 4, 6, 2, 7, 9, 5, 8]
     assert executor.execution_time > 0.4
-    assert executor.execution_time < 0.5
+    assert executor.execution_time < 1.2
     executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=5)
-    results = [result async for result in executor.run(tasks)]
+    results = [result async for result in executor.run(tasks)]  # type: ignore[arg-type]
     assert results in (
         [0, 3, 6, 1, 4, 7, 9, 2, 5, 8],
         [0, 3, 6, 1, 4, 9, 7, 2, 5, 8],
     )
     assert executor.execution_time > 0.3
-    assert executor.execution_time < 0.4
+    assert executor.execution_time < 1.1
     executor = AsyncioQueueGeneratorExecutor(logger=logger, in_parallel=10)
-    results = [result async for result in executor.run(tasks)]
+    results = [result async for result in executor.run(tasks)]  # type: ignore[arg-type]
     assert results == [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
     assert executor.execution_time > 0.2
-    assert executor.execution_time < 0.3
+    assert executor.execution_time < 1.0
+84 -2
@@ -2,6 +2,7 @@
import asyncio
import copy
from unittest.mock import patch
import pytest
from mock import Mock
@@ -11,7 +12,8 @@ from maigret.maigret import (
     extract_ids_from_page,
     extract_ids_from_results,
 )
-from maigret.sites import MaigretSite
+from maigret.checking import site_self_check
+from maigret.sites import MaigretSite, MaigretDatabase
 from maigret.result import MaigretCheckResult, MaigretCheckStatus
 from tests.conftest import RESULTS_EXAMPLE
@@ -37,6 +39,86 @@ async def test_self_check_db(test_db):
    assert test_db.sites_dict['InvalidInactive'].disabled is True


@pytest.mark.slow
@pytest.mark.asyncio
async def test_self_check_no_progressbar(test_db):
    """Verify that no_progressbar=True disables the alive_bar in self_check."""
    logger = Mock()

    with patch('maigret.checking.alive_bar') as mock_alive_bar:
        mock_bar = Mock()
        mock_alive_bar.return_value.__enter__ = Mock(return_value=mock_bar)
        mock_alive_bar.return_value.__exit__ = Mock(return_value=False)

        await self_check(
            test_db, test_db.sites_dict, logger, silent=True,
            no_progressbar=True,
        )

        # First call is the self-check progress bar; subsequent calls are
        # from inner search() invocations.
        self_check_call = mock_alive_bar.call_args_list[0]
        _, kwargs = self_check_call
        assert kwargs.get('title') == 'Self-checking'
        assert kwargs.get('disable') is True


@pytest.mark.slow
@pytest.mark.asyncio
async def test_self_check_progressbar_enabled_by_default(test_db):
    """Verify that alive_bar is enabled by default (no_progressbar=False)."""
    logger = Mock()

    with patch('maigret.checking.alive_bar') as mock_alive_bar:
        mock_bar = Mock()
        mock_alive_bar.return_value.__enter__ = Mock(return_value=mock_bar)
        mock_alive_bar.return_value.__exit__ = Mock(return_value=False)

        await self_check(
            test_db, test_db.sites_dict, logger, silent=True,
        )

        self_check_call = mock_alive_bar.call_args_list[0]
        _, kwargs = self_check_call
        assert kwargs.get('title') == 'Self-checking'
        assert kwargs.get('disable') is False


@pytest.mark.asyncio
async def test_site_self_check_handles_exception(test_db):
    """Verify that site_self_check catches unexpected exceptions and returns a valid result."""
    logger = Mock()
    sem = asyncio.Semaphore(1)
    site = test_db.sites_dict['ValidActive']

    with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
        result = await site_self_check(site, logger, sem, test_db)

    assert isinstance(result, dict)
    assert "issues" in result
    assert len(result["issues"]) > 0
    assert any("Unexpected error" in issue for issue in result["issues"])


@pytest.mark.asyncio
async def test_self_check_handles_task_exception(test_db):
    """Verify that self_check continues when individual site checks raise exceptions."""
    logger = Mock()

    with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
        result = await self_check(
            test_db, test_db.sites_dict, logger, silent=True,
            no_progressbar=True,
        )

    assert isinstance(result, dict)
    assert 'results' in result
    assert len(result['results']) == len(test_db.sites_dict)
    for r in result['results']:
        assert 'site_name' in r
        assert 'issues' in r


@pytest.mark.slow
@pytest.mark.skip(reason="broken, fixme")
def test_maigret_results(test_db):
@@ -112,7 +194,7 @@ def test_extract_ids_from_page(test_db):
 def test_extract_ids_from_results(test_db):
-    TEST_EXAMPLE = copy.deepcopy(RESULTS_EXAMPLE)
+    TEST_EXAMPLE: dict = copy.deepcopy(RESULTS_EXAMPLE)
     TEST_EXAMPLE['Reddit']['ids_usernames'] = {'test1': 'yandex_public_id'}
     TEST_EXAMPLE['Reddit']['ids_links'] = ['https://www.reddit.com/user/test2']
+228 -1
@@ -6,12 +6,19 @@ import os
import pytest
from io import StringIO
-import xmind
+import xmind  # type: ignore[import-untyped]
from jinja2 import Template
from maigret.report import (
filter_supposed_data,
sort_report_by_data_points,
_md_format_value,
generate_csv_report,
generate_txt_report,
save_csv_report,
save_txt_report,
save_json_report,
save_markdown_report,
save_xmind_report,
save_html_report,
save_pdf_report,
@@ -456,3 +463,223 @@ def test_text_report_broken():
    assert brief_part in report_text
    assert 'us' in report_text
    assert 'photo' in report_text


def test_filter_supposed_data():
    data = {
        'fullname': ['Alice'],
        'gender': ['female'],
        'location': ['Berlin'],
        'age': ['30'],
        'email': ['x@y.z'],  # not allowed, must be dropped
        'bio': ['hi'],  # not allowed
    }
    result = filter_supposed_data(data)
    assert result == {
        'Fullname': 'Alice',
        'Gender': 'female',
        'Location': 'Berlin',
        'Age': '30',
    }


def test_filter_supposed_data_empty():
    assert filter_supposed_data({}) == {}
    assert filter_supposed_data({'nope': ['v']}) == {}


def test_filter_supposed_data_scalar_values():
    # Strings and scalars must be kept whole — previously v[0] on "Alice"
    # silently returned "A" instead of "Alice".
    data = {
        'fullname': 'Alice',
        'gender': 'female',
        'location': 'Berlin',
        'age': 30,
    }
    assert filter_supposed_data(data) == {
        'Fullname': 'Alice',
        'Gender': 'female',
        'Location': 'Berlin',
        'Age': 30,
    }


def test_filter_supposed_data_empty_list_yields_empty_string():
    # Edge case: list value present but empty should not crash with IndexError.
    assert filter_supposed_data({'fullname': []}) == {'Fullname': ''}


def test_filter_supposed_data_mixed_values():
    # List and scalar mixed in the same payload.
    data = {'fullname': ['Alice', 'Alicia'], 'gender': 'female'}
    assert filter_supposed_data(data) == {
        'Fullname': 'Alice',
        'Gender': 'female',
    }
def test_sort_report_by_data_points():
    status_many = MaigretCheckResult('', '', '', MaigretCheckStatus.CLAIMED)
    status_many.ids_data = {'a': 1, 'b': 2, 'c': 3}
    status_one = MaigretCheckResult('', '', '', MaigretCheckStatus.CLAIMED)
    status_one.ids_data = {'a': 1}
    status_none = MaigretCheckResult('', '', '', MaigretCheckStatus.CLAIMED)
    results = {
        'few': {'status': status_one},
        'many': {'status': status_many},
        'zero': {'status': status_none},
        'nostatus': {},
    }
    sorted_out = sort_report_by_data_points(results)
    keys = list(sorted_out.keys())
    # site with 3 ids_data fields must come first
    assert keys[0] == 'many'
    # site with 1 field next
    assert keys[1] == 'few'


def test_md_format_value_list():
    assert _md_format_value(['a', 'b', 'c']) == 'a, b, c'


def test_md_format_value_url():
    assert _md_format_value('https://example.com') == '[https://example.com](https://example.com)'
    assert _md_format_value('http://x.y') == '[http://x.y](http://x.y)'


def test_md_format_value_plain():
    assert _md_format_value('hello') == 'hello'
    assert _md_format_value(42) == '42'


def test_save_csv_report():
    filename = 'report_test.csv'
    save_csv_report(filename, 'test', EXAMPLE_RESULTS)
    with open(filename) as f:
        content = f.read()
    assert 'username,name,url_main' in content
    assert 'test,GitHub' in content


def test_save_txt_report():
    filename = 'report_test.txt'
    save_txt_report(filename, 'test', EXAMPLE_RESULTS)
    with open(filename) as f:
        content = f.read()
    assert 'https://www.github.com/test' in content
    assert 'Total Websites Username Detected On : 1' in content


def test_save_json_report_simple():
    filename = 'report_test.json'
    save_json_report(filename, 'test', EXAMPLE_RESULTS, 'simple')
    with open(filename) as f:
        data = json.load(f)
    assert 'GitHub' in data


def test_save_json_report_ndjson():
    filename = 'report_test_ndjson.json'
    save_json_report(filename, 'test', EXAMPLE_RESULTS, 'ndjson')
    with open(filename) as f:
        lines = f.readlines()
    assert len(lines) == 1
    assert json.loads(lines[0])['sitename'] == 'GitHub'
def _markdown_context_with_rich_ids():
    """Build a context with found accounts, ids_data (incl. image, url, list) to exercise all branches."""
    found_result = copy.deepcopy(GOOD_RESULT)
    found_result.tags = ['photo', 'us']
    found_result.ids_data = {
        "fullname": "Alice",
        "name": "Alice A.",
        "location": "Berlin",
        "bio": "Photographer",
        "external_url": "https://example.com/profile",
        "image": "https://example.com/avatar.png",  # must be skipped
        "aliases": ["alice", "alicea"],  # list value
        "last_online": "2024-01-02 10:00:00",
    }
    data = {
        'Github': {
            'username': 'alice',
            'parsing_enabled': True,
            'url_main': 'https://github.com/',
            'url_user': 'https://github.com/alice',
            'status': found_result,
            'http_status': 200,
            'is_similar': False,
            'rank': 1,
            'site': MaigretSite('Github', {}),
            'found': True,
            'ids_data': found_result.ids_data,
        },
        'Similar': {
            'username': 'alice',
            'url_user': 'https://other.com/alice',
            'is_similar': True,
            'found': True,
            'status': copy.deepcopy(GOOD_RESULT),
        },
    }
    return {
        'username': 'alice',
        'generated_at': '2024-01-02 10:00',
        'brief': 'Search returned 1 account',
        'countries_tuple_list': [('us', 1)],
        'interests_tuple_list': [('photo', 1)],
        'first_seen': '2023-01-01',
        'results': [('alice', 'username', data)],
    }
def test_save_markdown_report():
    filename = 'report_test.md'
    context = _markdown_context_with_rich_ids()
    save_markdown_report(filename, context, run_info={'sites_count': 100, 'flags': '--top-sites 100'})
    with open(filename) as f:
        content = f.read()
    assert '# Report by searching on username "alice"' in content
    assert '## Summary' in content
    assert '## Accounts found' in content
    assert '### Github' in content
    assert '[https://github.com/alice](https://github.com/alice)' in content
    assert 'Ethical use' in content
    assert '100 sites checked' in content
    # image field must NOT appear in per-site listing
    assert 'avatar.png' not in content
    # list field rendered with join
    assert 'alice, alicea' in content
    # external url formatted as markdown link
    assert '[https://example.com/profile](https://example.com/profile)' in content


def test_save_markdown_report_minimal_context():
    """No run_info, no first_seen — exercise the fallback branches."""
    filename = 'report_test_min.md'
    context = {
        'username': 'bob',
        'brief': 'nothing found',
        'results': [],
    }
    save_markdown_report(filename, context)
    with open(filename) as f:
        content = f.read()
    assert '# Report by searching on username "bob"' in content
    assert '## Summary' in content


def test_get_plaintext_report_minimal():
    """Minimal context without countries/interests."""
    context = {
        'brief': 'Nothing to report.',
        'interests_tuple_list': [],
        'countries_tuple_list': [],
    }
    out = get_plaintext_report(context)
    assert 'Nothing to report.' in out
    assert 'Countries:' not in out
    assert 'Interests' not in out
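The `filter_supposed_data` tests earlier in this file describe a normalisation rule: a list contributes its first element, an empty list becomes an empty string, and a scalar is kept whole rather than indexed. The implementation in `maigret.report` is not part of this diff, so the following is only a sketch of logic consistent with those tests; `ALLOWED_FIELDS`, `first_or_scalar` and `filter_supposed` are hypothetical names.

```python
# Hypothetical mapping, inferred from the keys asserted in the tests above.
ALLOWED_FIELDS = {
    "fullname": "Fullname",
    "gender": "Gender",
    "location": "Location",
    "age": "Age",
}


def first_or_scalar(value):
    """Take the first item of a list or tuple; return any other value unchanged."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else ""
    # Indexing a string here would return its first character, not the value.
    return value


def filter_supposed(data: dict) -> dict:
    """Keep only allowed fields, with display names and normalised values."""
    return {
        ALLOWED_FIELDS[key]: first_or_scalar(value)
        for key, value in data.items()
        if key in ALLOWED_FIELDS
    }


sample = filter_supposed({"fullname": ["Alice", "Alicia"], "gender": "female", "bio": ["hi"]})
print(sample)
```

The explicit `isinstance` check is the point of the sketch: a bare `value[0]` works on lists and strings alike, which is how `"Alice"` could silently turn into `"A"`.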
+21 -1
@@ -1,8 +1,12 @@
"""Maigret Database test functions"""
+import re
+from typing import Any, Dict
 from maigret.sites import MaigretDatabase, MaigretSite
-EXAMPLE_DB = {
+EXAMPLE_DB: Dict[str, Any] = {
     'engines': {
         "XenForo": {
             "presenseStrs": ["XenForo"],
@@ -124,6 +128,22 @@ def test_site_url_detector():
    )


def test_extract_id_from_url_skips_none_groups():
    site = MaigretSite(
        "Example",
        {
            "urlMain": "https://example.com",
            "url": "https://example.com/{username}",
        },
    )
    site.url_regexp = re.compile(r"^https://example\.com/([^/?#]+)(?:/(.*))?$")

    assert site.extract_id_from_url("https://example.com/username") == (
        "username",
        "username",
    )


def test_ranked_sites_dict():
    db = MaigretDatabase()
    db.update_site(MaigretSite('3', {'alexaRank': 1000, 'engine': 'ucoz'}))
+2 -2
@@ -28,7 +28,7 @@ async def test_detect_known_engine(test_db, local_test_db):
     url_exists = "https://devforum.zoom.us/u/adam"
     url_mainpage = "https://devforum.zoom.us/"
     # Mock extract_username_dialog to return "adam"
-    submitter.extract_username_dialog = MagicMock(return_value="adam")
+    submitter.extract_username_dialog = MagicMock(return_value="adam")  # type: ignore[method-assign]
     sites, resp_text = await submitter.detect_known_engine(
         url_exists, url_mainpage, session=None, follow_redirects=False, headers=None
@@ -111,7 +111,7 @@ async def test_check_features_manually_success(settings):
 @pytest.mark.slow
 @pytest.mark.asyncio
-async def test_check_features_manually_success(settings):
+async def test_check_features_manually_cloudflare(settings):
     # Setup
     db = MaigretDatabase()
     logger = logging.getLogger("test_logger")
+172
@@ -0,0 +1,172 @@
"""Smoke tests for the Flask web interface in maigret.web.app.
The goal is to catch breakage in the basic user flow (render index, kick off
search, redirect to results) without making real network calls. Heavy maigret
internals are mocked; the report-generation smoke test keeps `save_graph_report`
unmocked so regressions like `nt.options.groups = ...` (AttributeError on a
plain dict) are caught automatically.
"""
import os

import pytest

import maigret
import maigret.report
from maigret.web import app as web_app_module


CUR_PATH = os.path.dirname(os.path.realpath(__file__))
TEST_DB = os.path.join(CUR_PATH, 'db.json')


class _SyncThread:
    """Drop-in for threading.Thread that runs target synchronously on start()."""

    def __init__(self, target=None, args=(), kwargs=None, **_):
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}

    def start(self):
        self._target(*self._args, **self._kwargs)


@pytest.fixture
def web_app(tmp_path):
    web_app_module.app.config['TESTING'] = True
    web_app_module.app.config['REPORTS_FOLDER'] = str(tmp_path)
    web_app_module.app.config['MAIGRET_DB_FILE'] = TEST_DB
    web_app_module.background_jobs.clear()
    web_app_module.job_results.clear()
    yield web_app_module
    web_app_module.background_jobs.clear()
    web_app_module.job_results.clear()


@pytest.fixture
def client(web_app):
    return web_app.app.test_client()


def test_index_renders(client):
    resp = client.get('/')
    assert resp.status_code == 200
    body = resp.get_data(as_text=True)
    assert 'name="usernames"' in body
    assert '<form' in body
def test_search_empty_input_redirects_to_index(client):
    resp = client.post('/search', data={'usernames': ''})
    assert resp.status_code == 302
    # str.endswith('') is always True, so check the index path directly.
    assert resp.location.endswith('/')
def test_search_redirects_to_status(client, web_app, monkeypatch):
    monkeypatch.setattr(web_app, 'process_search_task', lambda *a, **kw: None)
    monkeypatch.setattr(web_app, 'Thread', _SyncThread)

    resp = client.post('/search', data={'usernames': 'soxoj'})
    assert resp.status_code == 302
    assert '/status/' in resp.location


def test_invalid_timestamp_redirects_to_index(client):
    resp = client.get('/status/nonexistent_ts')
    assert resp.status_code == 302
    assert resp.location.endswith('/')


def test_status_running_renders_status_page(client, web_app, monkeypatch):
    """While the background job is still running, /status/<ts> returns 200."""

    def never_completes(usernames, options, timestamp):
        # leave background_jobs[timestamp]['completed'] as False
        pass

    monkeypatch.setattr(web_app, 'process_search_task', never_completes)
    monkeypatch.setattr(web_app, 'Thread', _SyncThread)

    post = client.post('/search', data={'usernames': 'soxoj'})
    status_resp = client.get(post.location)
    assert status_resp.status_code == 200


def test_completed_search_redirects_to_results(client, web_app, monkeypatch):
    """Happy path: POST /search → background completes → /status/<ts> → /results/<session>."""

    def fake_task(usernames, options, timestamp):
        web_app.job_results[timestamp] = {
            'status': 'completed',
            'session_folder': f'search_{timestamp}',
            'graph_file': f'search_{timestamp}/combined_graph.html',
            'usernames': usernames,
            'individual_reports': [],
        }
        web_app.background_jobs[timestamp]['completed'] = True

    monkeypatch.setattr(web_app, 'process_search_task', fake_task)
    monkeypatch.setattr(web_app, 'Thread', _SyncThread)

    post = client.post('/search', data={'usernames': 'soxoj'})
    assert post.status_code == 302

    status_resp = client.get(post.location)
    assert status_resp.status_code == 302
    assert '/results/search_' in status_resp.location

    results_resp = client.get(status_resp.location)
    assert results_resp.status_code == 200
    assert b'soxoj' in results_resp.data
def test_failed_task_redirects_to_index(client, web_app, monkeypatch):
    def failing_task(usernames, options, timestamp):
        web_app.job_results[timestamp] = {'status': 'failed', 'error': 'boom'}
        web_app.background_jobs[timestamp]['completed'] = True

    monkeypatch.setattr(web_app, 'process_search_task', failing_task)
    monkeypatch.setattr(web_app, 'Thread', _SyncThread)

    post = client.post('/search', data={'usernames': 'soxoj'})
    status_resp = client.get(post.location)
    assert status_resp.status_code == 302
    assert status_resp.location.endswith('/')


def test_real_report_generation_does_not_crash(client, web_app, monkeypatch):
    """End-to-end with mocked maigret.search but REAL report generation.

    This is the regression guard for bugs inside `save_graph_report` and friends
    (e.g. `nt.options.groups = ...` raising AttributeError on a dict). If any of
    the unmocked report functions throws, the task records a failed status and
    this assertion catches it.
    """

    async def fake_search(*args, **kwargs):
        return {}

    monkeypatch.setattr(maigret, 'search', fake_search)
    # Mock the per-username report writers — they are not what we care about here,
    # and pdf/html generation pulls in xhtml2pdf which is slow and brittle.
    monkeypatch.setattr(maigret.report, 'save_csv_report', lambda *a, **kw: None)
    monkeypatch.setattr(maigret.report, 'save_json_report', lambda *a, **kw: None)
    monkeypatch.setattr(maigret.report, 'save_pdf_report', lambda *a, **kw: None)
    monkeypatch.setattr(maigret.report, 'save_html_report', lambda *a, **kw: None)
    monkeypatch.setattr(maigret.report, 'generate_report_context', lambda *a, **kw: {})
    monkeypatch.setattr(web_app, 'Thread', _SyncThread)

    post = client.post('/search', data={'usernames': 'testuser'})
    timestamp = post.location.rsplit('/', 1)[1]

    assert timestamp in web_app.job_results, 'background task did not record any result'
    result = web_app.job_results[timestamp]
    assert result['status'] == 'completed', (
        f"report generation failed: {result.get('error')!r}"
    )
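The `_SyncThread` helper in this file works because the web app looks up `Thread` by name at call time, so swapping the name makes background work finish before `start()` returns. The standalone sketch below shows the same substitution without Flask or maigret; `SyncThread`, `launch_job` and the `thread_cls` parameter are hypothetical names used only for illustration.

```python
import threading


class SyncThread:
    """Stand-in for threading.Thread that runs the target inside start()."""

    def __init__(self, target=None, args=(), kwargs=None, **_ignored):
        # Extra keyword arguments such as daemon=True are accepted and dropped.
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}

    def start(self):
        self._target(*self._args, **self._kwargs)


def launch_job(job_id, results, thread_cls=threading.Thread):
    """Start a worker that records a status for job_id and return the worker."""
    worker = thread_cls(target=results.__setitem__, args=(job_id, "completed"), daemon=True)
    worker.start()
    return worker


results = {}
launch_job("job-1", results, thread_cls=SyncThread)
# With SyncThread the result is already recorded; no join() or sleep is needed.
print(results)
```

Running the target synchronously keeps such tests deterministic, at the cost of not exercising any real concurrency in the code under test.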
+5
@@ -0,0 +1,5 @@
#!/bin/bash
set -e
sudo apt-get update && sudo apt-get install -y libcairo2-dev pkg-config
pip install .
+59
@@ -0,0 +1,59 @@
"""Generate db_meta.json from data.json for the auto-update system."""
import argparse
import hashlib
import json
import os.path as path
import sys
from datetime import datetime, timezone

RESOURCES_DIR = path.join(path.dirname(path.dirname(path.abspath(__file__))), "maigret", "resources")
DATA_JSON_PATH = path.join(RESOURCES_DIR, "data.json")
META_JSON_PATH = path.join(RESOURCES_DIR, "db_meta.json")
DEFAULT_DATA_URL = "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json"


def get_current_version():
    version_file = path.join(path.dirname(path.dirname(path.abspath(__file__))), "maigret", "__version__.py")
    with open(version_file) as f:
        for line in f:
            if line.startswith("__version__"):
                return line.split("=")[1].strip().strip("'\"")
    return "0.0.0"


def main():
    parser = argparse.ArgumentParser(description="Generate db_meta.json from data.json")
    parser.add_argument("--min-version", default=None, help="Minimum compatible maigret version (default: current version)")
    parser.add_argument("--data-url", default=DEFAULT_DATA_URL, help="URL where data.json can be downloaded")
    args = parser.parse_args()

    min_version = args.min_version or get_current_version()

    with open(DATA_JSON_PATH, "rb") as f:
        raw = f.read()

    sha256 = hashlib.sha256(raw).hexdigest()
    data = json.loads(raw)
    sites_count = len(data.get("sites", {}))

    meta = {
        "version": 1,
        "updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sites_count": sites_count,
        "min_maigret_version": min_version,
        "data_sha256": sha256,
        "data_url": args.data_url,
    }

    with open(META_JSON_PATH, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=4, ensure_ascii=False)

    print(f"Generated {META_JSON_PATH}")
    print(f" sites: {sites_count}")
    print(f" sha256: {sha256[:16]}...")
    print(f" min_version: {min_version}")


if __name__ == "__main__":
    main()
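For context, a consumer of db_meta.json could check a downloaded data.json against the recorded digest roughly as sketched below. The `verify_data` helper and the in-memory stand-in for data.json are hypothetical and not part of this change; only the `data_sha256` and `sites_count` field names come from the script above.

```python
import hashlib
import json


def verify_data(raw: bytes, meta: dict) -> bool:
    """Return True if the raw data.json bytes match the digest and site count in meta.

    Hypothetical helper, shown only to illustrate how the generated fields
    could be checked; it is not part of maigret.
    """
    # Compare the digest first so corrupted bytes are never parsed as JSON.
    if hashlib.sha256(raw).hexdigest() != meta.get("data_sha256"):
        return False
    sites = json.loads(raw).get("sites", {})
    return len(sites) == meta.get("sites_count")


# Tiny stand-in for data.json and the metadata generated from it.
raw = json.dumps({"sites": {"Example": {}, "Other": {}}}).encode("utf-8")
meta = {"data_sha256": hashlib.sha256(raw).hexdigest(), "sites_count": 2}

print(verify_data(raw, meta))         # digest and count both match
print(verify_data(raw + b" ", meta))  # a single extra byte changes the digest
```

Hashing the raw bytes, as the generator does, means the check is sensitive to whitespace and key order, so the file has to be compared exactly as downloaded rather than after re-serialization.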
+67 -9
@@ -26,6 +26,7 @@ sys.path.insert(0, str(Path(__file__).parent.parent))
 try:
     import aiohttp
+    from yarl import URL as YarlURL
 except ImportError:
     print("aiohttp not installed. Run: pip install aiohttp")
     sys.exit(1)
@@ -74,8 +75,14 @@ def color(text: str, c: str) -> str:
 async def check_url_aiohttp(url: str, headers: dict = None, follow_redirects: bool = True,
-                            timeout: int = 15, ssl_verify: bool = False) -> dict:
-    """Check a URL using aiohttp and return detailed response info."""
+                            timeout: int = 15, ssl_verify: bool = False,
+                            method: str = "GET", payload: dict = None) -> dict:
+    """Check a URL using aiohttp and return detailed response info.
+
+    Args:
+        method: HTTP method ("GET" or "POST").
+        payload: JSON payload for POST requests (dict, will be serialized).
+    """
     headers = headers or DEFAULT_HEADERS.copy()
     result = {
         "method": "aiohttp",
@@ -96,7 +103,14 @@ async def check_url_aiohttp(url: str, headers: dict = None, follow_redirects: bo
         timeout_obj = aiohttp.ClientTimeout(total=timeout)
         async with aiohttp.ClientSession(connector=connector, timeout=timeout_obj) as session:
-            async with session.get(url, headers=headers, allow_redirects=follow_redirects) as resp:
+            # Use encoded=True if URL contains percent-encoded chars to prevent double-encoding
+            request_url = YarlURL(url, encoded=True) if '%' in url else url
+            request_kwargs = dict(headers=headers, allow_redirects=follow_redirects)
+            if method.upper() == "POST" and payload is not None:
+                request_kwargs["json"] = payload
+            request_fn = session.post if method.upper() == "POST" else session.get
+            async with request_fn(request_url, **request_kwargs) as resp:
                 result["status"] = resp.status
                 result["final_url"] = str(resp.url)
@@ -438,21 +452,54 @@ async def diagnose_site(site_config: dict, site_name: str) -> dict:
         print(f" {color('[!]', Colors.RED)} No usernameClaimed defined")
         return diagnosis
-    # Build full URL
+    # Build full URL (display URL)
     url_template = url.replace("{urlMain}", url_main).replace("{urlSubpath}", site_config.get("urlSubpath", ""))
+    # Build probe URL (what Maigret actually requests)
+    url_probe = site_config.get("urlProbe", "")
+    if url_probe:
+        probe_template = url_probe.replace("{urlMain}", url_main).replace("{urlSubpath}", site_config.get("urlSubpath", ""))
+    else:
+        probe_template = url_template
+    # Detect request method and payload
+    request_method = site_config.get("requestMethod", "GET").upper()
+    request_payload_template = site_config.get("requestPayload")
     headers = DEFAULT_HEADERS.copy()
+    # For API probes (urlProbe, POST), use neutral Accept header instead of text/html
+    # which can cause servers to return HTML instead of JSON
+    if url_probe or request_method == "POST":
+        headers["Accept"] = "*/*"
     if site_config.get("headers"):
         headers.update(site_config["headers"])
+    if url_probe:
+        print(f" urlProbe: {url_probe}")
+    if request_method != "GET":
+        print(f" requestMethod: {request_method}")
+    if request_payload_template:
+        print(f" requestPayload: {request_payload_template}")
     # 2. Connectivity test
     print(f"\n--- {color('2. CONNECTIVITY TEST', Colors.BOLD)} ---")
     url_claimed = url_template.replace("{username}", claimed)
     url_unclaimed = url_template.replace("{username}", unclaimed)
+    probe_claimed = probe_template.replace("{username}", claimed)
+    probe_unclaimed = probe_template.replace("{username}", unclaimed)
+    # Build payloads with username substituted
+    payload_claimed = None
+    payload_unclaimed = None
+    if request_payload_template and request_method == "POST":
+        payload_claimed = json.loads(
+            json.dumps(request_payload_template).replace("{username}", claimed)
+        )
+        payload_unclaimed = json.loads(
+            json.dumps(request_payload_template).replace("{username}", unclaimed)
+        )
     result_claimed, result_unclaimed = await asyncio.gather(
-        check_url_aiohttp(url_claimed, headers),
-        check_url_aiohttp(url_unclaimed, headers)
+        check_url_aiohttp(probe_claimed, headers, method=request_method, payload=payload_claimed),
+        check_url_aiohttp(probe_unclaimed, headers, method=request_method, payload=payload_unclaimed)
     )
     print(f" Claimed ({claimed}): status={result_claimed['status']}, error={result_claimed['error']}")
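The payload substitution in this hunk serializes the template, replaces `{username}` in the JSON text, and parses the result back. A minimal sketch of that round trip follows, with one addition that is not in the patch: the username is JSON-escaped before the textual replace, since a raw quote or backslash in the username would otherwise produce invalid JSON. The `fill_payload` helper and the sample template are hypothetical.

```python
import json


def fill_payload(template: dict, username: str) -> dict:
    """Return a copy of template with every {username} placeholder filled in.

    Hypothetical helper illustrating the dumps/replace/loads round trip.
    """
    # json.dumps(username) yields a quoted JSON string; dropping the outer
    # quotes leaves text that is safe to splice inside another JSON string.
    escaped = json.dumps(username)[1:-1]
    return json.loads(json.dumps(template).replace("{username}", escaped))


# Sample nested template (an assumption, not taken from any site entry).
template = {"query": "user", "variables": {"login": "{username}"}}

print(fill_payload(template, "alice"))
print(fill_payload(template, 'a"b'))  # the quote survives the round trip
print(template)                       # the template itself is not modified
```

Going through the serialized text replaces the placeholder at any nesting depth without walking the structure by hand, at the cost of needing the escaping step for unusual usernames.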
@@ -523,7 +570,18 @@ async def diagnose_site(site_config: dict, site_name: str) -> dict:
             diagnosis["warnings"].append(f"absenceStrs not found in unclaimed page")
             print(f" {color('[WARN]', Colors.YELLOW)} absenceStrs not found in unclaimed page")
-        if presense_found_claimed and not absence_found_claimed and absence_found_unclaimed:
+        # Check works if: claimed is detected as present AND unclaimed is detected as absent.
+        # Presence detection: presenseStrs found (or empty = always true).
+        # Absence detection: absenceStrs found in unclaimed (or empty = never, rely on presenseStrs only).
+        # With only presenseStrs: works if found in claimed but NOT in unclaimed.
+        # With only absenceStrs: works if found in unclaimed but NOT in claimed.
+        # With both: standard combination.
+        claimed_is_present = presense_found_claimed and not absence_found_claimed
+        unclaimed_is_absent = (
+            (absence_strs and absence_found_unclaimed) or
+            (presense_strs and not presense_found_unclaimed)
+        )
+        if claimed_is_present and unclaimed_is_absent:
             print(f" {color('[OK]', Colors.GREEN)} Message check should work correctly")
             diagnosis["working"] = True
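The revised condition can be read as a small pure function, which makes the difference from the old single-line check easier to see. The sketch below mirrors the variable names used in the hunk; the `message_check_works` wrapper and the sample inputs are hypothetical, and it assumes (as the comments in the hunk state) that `presense_found_claimed` is already true when `presenseStrs` is empty.

```python
def message_check_works(presense_strs, absence_strs,
                        presense_found_claimed, absence_found_claimed,
                        presense_found_unclaimed, absence_found_unclaimed):
    """Hypothetical wrapper around the condition introduced in the hunk above."""
    claimed_is_present = presense_found_claimed and not absence_found_claimed
    unclaimed_is_absent = bool(
        (absence_strs and absence_found_unclaimed)
        or (presense_strs and not presense_found_unclaimed)
    )
    return claimed_is_present and unclaimed_is_absent


# Only presenseStrs configured: the marker appears for the claimed name only.
only_presence = message_check_works(["profile"], [], True, False, False, False)

# Only presenseStrs configured, but the marker appears on both pages.
ambiguous = message_check_works(["profile"], [], True, False, True, False)

# Only absenceStrs configured: the "not found" text appears for the unclaimed name.
only_absence = message_check_works([], ["not found"], True, False, False, True)

print(only_presence, ambiguous, only_absence)
```

The first case is the one the old condition rejected, because it required `absence_found_unclaimed` to be true even when no `absenceStrs` were configured.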
+84
@@ -4,6 +4,7 @@ This module generates the listing of supported sites in file `SITES.md`
and pretty prints file with sites data.
"""
import sys
import socket
import requests
import logging
import threading
@@ -64,6 +65,49 @@ def get_base_domain(url):
return ""
def check_dns(domain, timeout=5):
    """Check if a domain resolves via DNS. Returns True if it resolves."""
    try:
        # Note: setdefaulttimeout() is process-wide and applies to newly created
        # socket objects; getaddrinfo() does not create one, so this may not
        # bound the lookup time itself.
        socket.setdefaulttimeout(timeout)
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, socket.timeout, OSError):
        return False


def check_sites_dns(sites):
    """Check DNS resolution for all sites. Returns a set of site names that failed."""
    SKIP_TLDS = ('.onion', '.i2p')
    # Group sites by base domain so each domain is resolved only once.
    domains = {}
    for site in sites:
        domain = get_base_domain(site.url_main)
        if domain and not any(domain.endswith(tld) for tld in SKIP_TLDS):
            domains.setdefault(domain, []).append(site)

    failed_sites = set()
    results = {}

    def resolve(domain):
        results[domain] = check_dns(domain)

    # One thread per unique domain; all lookups run concurrently.
    threads = []
    for domain in domains:
        t = threading.Thread(target=resolve, args=(domain,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    for domain, resolved in results.items():
        if not resolved:
            for site in domains[domain]:
                failed_sites.add(site.name)
            logging.warning(f"DNS resolution failed for {domain}")

    return failed_sites
def get_step_rank(rank):
def get_readable_rank(r):
return RANKS[str(r)]
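`check_sites_dns` above starts one thread per unique domain. As a point of comparison, the same fan-out can be sketched with a bounded worker pool from the standard library. This is an alternative technique, not what the patch does; the `resolves` and `failed_domains` helpers, the `max_workers` value, and the fake resolver are all assumptions made so the example runs without network access.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolves(domain, resolver=socket.getaddrinfo):
    """Return True if resolver(domain, None) succeeds (hypothetical helper)."""
    try:
        resolver(domain, None)
        return True
    except OSError:  # socket.gaierror and socket.timeout both derive from OSError
        return False


def failed_domains(domains, resolver=socket.getaddrinfo, max_workers=20):
    """Resolve domains with at most max_workers threads; return those that failed."""
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with domains.
        outcomes = list(pool.map(lambda d: resolves(d, resolver), domains))
    return {d for d, ok in zip(domains, outcomes) if not ok}


def fake_resolver(host, port):
    """Stand-in resolver: names ending in .invalid fail, everything else resolves."""
    if host.endswith(".invalid"):
        raise socket.gaierror("name or service not known")
    return []


print(failed_domains(["a.example", "b.invalid", "c.example"], resolver=fake_resolver))
```

Capping the number of workers keeps thread usage predictable when the site list contains thousands of domains, and injecting the resolver makes the logic testable offline.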
@@ -86,6 +130,8 @@ def main():
parser.add_argument('--empty-only', help='update only sites without rating', action='store_true')
parser.add_argument('--exclude-engine', help='do not update score with certain engine',
action="append", dest="exclude_engine_list", default=[])
parser.add_argument('--dns-check', help='disable sites whose domains do not resolve via DNS',
action='store_true')
pool = list()
@@ -103,6 +149,24 @@ Rank data fetched from Majestic Million by domains.
""")
    if args.dns_check:
        print("Checking DNS resolution for all site domains...")
        failed = check_sites_dns(sites_subset)
        disabled_count = 0
        re_enabled_count = 0
        for site in sites_subset:
            if site.name in failed:
                if not site.disabled:
                    site.disabled = True
                    disabled_count += 1
                    print(f" Disabled {site.name}: DNS does not resolve ({get_base_domain(site.url_main)})")
            else:
                if site.disabled:
                    # Re-enable previously disabled site if DNS now resolves
                    # (only if it was likely disabled due to DNS failure)
                    pass
        # `failed` holds site names, not domains, so report it as a site count.
        print(f"DNS check complete: {disabled_count} site(s) disabled, {len(failed)} site(s) on unresolvable domains.")
majestic_ranks = {}
if args.with_rank:
majestic_ranks = fetch_majestic_million()
@@ -153,6 +217,26 @@ Rank data fetched from Majestic Million by domains.
site_file.write(f'\nThe list was updated at ({datetime.now(timezone.utc).date()})\n')
db.save_to_file(args.base_file)
# Regenerate db_meta.json to stay in sync with data.json
try:
import hashlib, json, os
db_data_raw = open(args.base_file, 'rb').read()
db_data_parsed = json.loads(db_data_raw)
meta = {
"version": 1,
"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
"sites_count": len(db_data_parsed.get("sites", {})),
"min_maigret_version": "0.5.0",
"data_sha256": hashlib.sha256(db_data_raw).hexdigest(),
"data_url": "https://raw.githubusercontent.com/soxoj/maigret/main/maigret/resources/data.json",
}
meta_path = os.path.join(os.path.dirname(args.base_file), "db_meta.json")
with open(meta_path, "w", encoding="utf-8") as mf:
json.dump(meta, mf, indent=4, ensure_ascii=False)
print(f"Updated {meta_path} ({meta['sites_count']} sites)")
except Exception as e:
print(f"Warning: could not regenerate db_meta.json: {e}")
statistics_text = db.get_db_stats(is_markdown=True)
site_file.write('## Statistics\n\n')
site_file.write(statistics_text)