Compare commits


95 Commits

Author SHA1 Message Date
soxoj 1afdda7336 Merge pull request #119 from soxoj/0.1.20
Bump to 0.1.20
2021-05-02 12:05:08 +03:00
Soxoj 252d12ff9e Bump to 0.1.20 2021-05-02 12:02:53 +03:00
soxoj 6afb17e24f Merge pull request #118 from soxoj/submit-improving-new-sites
Some sites added, submit mode improved
2021-05-02 11:08:52 +03:00
Soxoj 7fdd965bb2 Some sites added, submit mode improved 2021-05-02 11:06:37 +03:00
soxoj 8e30e969f9 Merge pull request #117 from soxoj/retries-refactoring
Introduced `--retries` flag, made thorough refactoring
2021-05-01 23:58:28 +03:00
Soxoj 5ee91f6659 Introduced --retries flag, made thorough refactoring
- updated sites list
- test scripts linting
2021-05-01 23:54:01 +03:00
soxoj 7fd4a2c516 Merge pull request #116 from soxoj/refactoring-errors
Refactoring and linting, added notifications about frequent search errors
2021-04-30 12:06:29 +03:00
Soxoj bfa6afac32 Refactoring and linting, added notifications about frequent search errors 2021-04-30 12:03:13 +03:00
soxoj bfaf276f6e Merge pull request #115 from soxoj/submit-source-improving
Added some new sites, implemented filtering by source site with `--na…
2021-04-29 17:18:31 +03:00
Soxoj c9194b20ba Added some new sites, implemented filtering by source site with --name, improved submit mode 2021-04-29 17:11:17 +03:00
soxoj a30a012550 Merge pull request #114 from soxoj/new-sites-source-feature
Added some new sites and introduced 'source' feature
2021-04-29 15:17:13 +03:00
Soxoj 2cdc9bb276 Added some new sites and introduced 'source' feature 2021-04-29 15:14:21 +03:00
soxoj 99fc6c8a8f Merge pull request #113 from soxoj/errors-stats
Errors stats MVP, some fp fixes
2021-04-25 01:13:39 +03:00
Soxoj b269c4a8e0 Added new modules 2021-04-25 01:12:15 +03:00
Soxoj f43dc5bd6f Errors stats MVP, some fp fixes 2021-04-25 01:08:23 +03:00
soxoj 83cda9e37f Merge pull request #112 from soxoj/tapd-added
Sites update
2021-04-19 00:25:55 +03:00
soxoj cc3df85690 Merge branch 'main' into tapd-added 2021-04-18 22:40:27 +03:00
Soxoj 8007e92021 Sites update 2021-04-18 22:38:30 +03:00
soxoj daaddbde4e Merge pull request #111 from soxoj/fp-fixes-18-04-21
Some false positives fixes
2021-04-18 15:26:11 +03:00
Soxoj cea5073962 Some false positives fixes 2021-04-18 15:20:35 +03:00
soxoj b345512489 Merge pull request #110 from soxoj/0.1.19
Bump to 0.1.19
2021-04-14 23:16:30 +03:00
Soxoj 786cb59145 Bump to 0.1.19 2021-04-14 23:14:33 +03:00
soxoj 481baddec6 Merge pull request #109 from soxoj/fp-fixes
Some false positive fixes
2021-04-12 23:18:47 +03:00
Soxoj ecb3d76581 Some false positive fixes 2021-04-12 23:16:26 +03:00
soxoj 8a8fab5bed Merge pull request #108 from soxoj/async-tasks-timeout
Added asyncio tasks with timeouts, non-blocking work with queue
2021-04-12 23:01:59 +03:00
Soxoj 2fee65fe4e Added asyncio tasks with timeouts, non-blocking work with queue 2021-04-11 17:56:27 +03:00
soxoj dabba859f3 Merge pull request #107 from soxoj/main-module-bugfix
Fixed maigret-as-a-module start
2021-04-06 00:36:45 +03:00
Soxoj 74d4d40abd Fixed maigret-as-a-module start 2021-04-06 00:33:39 +03:00
soxoj d6f6d78d3f Merge pull request #104 from soxoj/ascii-tree-bugfix
Fixed ascii tree bug
2021-04-02 09:08:14 +03:00
Soxoj 1b61c5085e Fixed ascii tree bug 2021-04-02 09:03:22 +03:00
soxoj 01e20518c1 Merge pull request #100 from soxoj/fp-fixes
Fixed some false positives
2021-03-31 23:20:18 +03:00
Soxoj 8477385289 Fixed some false positives 2021-03-31 23:17:47 +03:00
soxoj 491dd8f166 Merge pull request #99 from soxoj/no-progressbar-option
Added `--no-progressbar` flag
2021-03-30 19:47:42 +03:00
Soxoj c64b7a1c85 Added --no-progressbar flag 2021-03-30 19:44:01 +03:00
soxoj 03511a7a8f Merge pull request #97 from soxoj/wizard
Some API improvements
2021-03-30 01:16:12 +03:00
Soxoj 7f1a0fae03 Some API improvements 2021-03-30 01:14:46 +03:00
soxoj b0de174df2 Merge pull request #96 from soxoj/wizard
Added search wizard script as an API usage example
2021-03-30 01:11:12 +03:00
Soxoj b5db3f0035 Added search wizard script as an API usage example 2021-03-30 01:09:06 +03:00
soxoj 53d698bb7b Merge pull request #95 from soxoj/socid-bump
Updated socid_extractor version
2021-03-30 00:37:02 +03:00
soxoj 23fff42ca7 Merge pull request #94 from soxoj/dependabot/pip/lxml-4.6.3
Bump lxml from 4.6.2 to 4.6.3
2021-03-30 00:34:13 +03:00
Soxoj 51d9e6f5f6 Bump to v0.1.17 2021-03-30 00:33:51 +03:00
Soxoj 640c04f20b Updated socid_extractor version 2021-03-30 00:31:40 +03:00
dependabot[bot] 69f78e331b Bump lxml from 4.6.2 to 4.6.3
Bumps [lxml](https://github.com/lxml/lxml) from 4.6.2 to 4.6.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.6.2...lxml-4.6.3)

Signed-off-by: dependabot[bot] <support@github.com>
2021-03-29 21:25:19 +00:00
soxoj 69c315b00e Merge pull request #93 from soxoj/docs-requirements
Documentation and API improving
2021-03-30 00:24:49 +03:00
Soxoj b755628a1d Documentation and API improving 2021-03-30 00:19:17 +03:00
soxoj 7490a412db Merge pull request #92 from soxoj/ignore403-bugfix
Fixed bug with ignore403 for engine-based sites
2021-03-28 17:40:35 +03:00
Soxoj 2741680d4a Fixed bug with ignore403 for engine-based sites 2021-03-28 17:37:18 +03:00
soxoj e5fc221ce2 Merge pull request #91 from soxoj/async-3.6.9-fix
Fix of 3.6.9 asyncio create_task error
2021-03-24 21:43:11 +03:00
Soxoj a044e3dd79 Fix of 3.6.9 asyncio create_task error 2021-03-24 21:37:56 +03:00
soxoj 6da4ff1e7b Merge pull request #89 from soxoj/v0.1.16
Bump to 0.1.16
2021-03-21 18:58:48 +03:00
Soxoj eb2442401d Bump to 0.1.16 2021-03-21 18:50:13 +03:00
soxoj d23d24eeca Merge pull request #88 from soxoj/parsing-mode-improve
Improving "parse" mode for extracting usernames and other info for a …
2021-03-21 18:41:17 +03:00
Soxoj a2ddb15f09 Improving "parse" mode for extracting usernames and other info for a further search 2021-03-21 18:34:57 +03:00
soxoj e90e85d2a9 Merge pull request #85 from soxoj/submit-improving
Improved submit mode, several sites added
2021-03-21 14:04:09 +03:00
Soxoj 2bb01f7019 Improved submit mode, several sites added 2021-03-21 13:59:59 +03:00
soxoj b586a4cd06 Merge pull request #84 from soxoj/ucoz-support
Added support of uID.me and uCoz sites
2021-03-20 23:26:35 +03:00
Soxoj 28733282ab CI reruns 2021-03-20 23:24:55 +03:00
Soxoj 0a7a7ad70d Added support of uID.me and uCoz sites 2021-03-20 23:21:53 +03:00
soxoj c895f6b418 Merge pull request #82 from soxoj/dependabot/pip/jinja2-2.11.3
Bump jinja2 from 2.11.2 to 2.11.3
2021-03-20 20:59:35 +03:00
soxoj a6286a0286 Merge pull request #83 from soxoj/executors-update
Created async requests executors, some sites fixes
2021-03-20 20:59:22 +03:00
Soxoj 314eb25d1f Created async requests executors, some sites fixes 2021-03-20 20:57:07 +03:00
dependabot[bot] fbbc8b49f3 Bump jinja2 from 2.11.2 to 2.11.3
Bumps [jinja2](https://github.com/pallets/jinja) from 2.11.2 to 2.11.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/master/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/2.11.2...2.11.3)

Signed-off-by: dependabot[bot] <support@github.com>
2021-03-20 05:47:45 +00:00
soxoj faa03b62e5 Merge pull request #81 from soxoj/dependabot/pip/pillow-8.1.1
Bump pillow from 8.1.0 to 8.1.1
2021-03-19 21:04:50 +03:00
dependabot[bot] d676f7bb94 Bump pillow from 8.1.0 to 8.1.1
Bumps [pillow](https://github.com/python-pillow/Pillow) from 8.1.0 to 8.1.1.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/master/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/8.1.0...8.1.1)

Signed-off-by: dependabot[bot] <support@github.com>
2021-03-19 15:57:58 +00:00
soxoj d4757aab78 Merge pull request #80 from soxoj/reformatting
Reformat code, some sites added
2021-03-19 01:52:54 +03:00
Soxoj 908176be85 Reformat code, some sites added 2021-03-19 01:48:20 +03:00
soxoj 940f408da3 Merge pull request #79 from soxoj/new-sites-submit
Added new sites through auto submit, some fixes
2021-03-18 23:35:19 +03:00
Soxoj 8c700b9810 Added new sites through auto submit, some fixes 2021-03-18 23:21:33 +03:00
soxoj f9c9af5f41 Merge pull request #78 from soxoj/docker-update-readme
Update README.md
2021-03-16 23:39:33 +03:00
soxoj 57a9a82102 Update README.md 2021-03-16 23:38:58 +03:00
soxoj 9bbca995e9 Merge pull request #77 from vincenttjia/main
Fix Dockerfile
2021-03-16 23:34:17 +03:00
Vincent Tjianattan 39b713497d Fix scipy build dependencies
Fix scipy build dependencies by changing the image from python:3.7-alpine to python:3.7
2021-03-17 00:42:35 +07:00
soxoj 6a84875775 Merge pull request #76 from soxoj/new-sites
Several sites added, Disqus improved, tags fixes
2021-03-15 23:58:09 +03:00
soxoj 84f7d93478 Merge branch 'main' into new-sites 2021-03-15 23:52:52 +03:00
Soxoj 17870ef5c8 Several sites added, Disqus improved, tags fixes 2021-03-15 23:45:20 +03:00
soxoj d3cd5e45a1 Merge pull request #75 from soxoj/collab-badge
Collab link added
2021-03-15 02:52:52 +03:00
soxoj 9a3f2f0aa7 Update README.md 2021-03-15 02:50:54 +03:00
soxoj 4b7d344b41 Merge pull request #73 from soxoj/cloud-based-run
Update README.md
2021-03-15 00:28:19 +03:00
soxoj ac9cfe7885 Update README.md 2021-03-15 00:26:29 +03:00
soxoj 6058a4b70c Fixed repl.it 2021-03-15 00:15:16 +03:00
soxoj 3aa225bda4 Update README.md 2021-03-15 00:13:29 +03:00
soxoj c6661e22ff Merge pull request #72 from soxoj/v0.1.15
Bump to 0.1.15
2021-03-14 20:15:12 +03:00
Soxoj fdb68b5e80 Bump to 0.1.15 2021-03-14 20:11:32 +03:00
soxoj 9fe6b99239 Merge pull request #71 from soxoj/html-report-img-fix
Fixed HTML report images hiding for small screens + some minor fixes
2021-03-14 17:31:12 +03:00
Soxoj b9d303fde3 Fixed HTML report images hiding for small screens + some minor fixes 2021-03-14 16:15:31 +03:00
soxoj d29e88d96f Merge pull request #70 from soxoj/extracting-flag
Added separate `no-extracing` flag to rule page parsing
2021-03-14 13:22:29 +03:00
Soxoj 731a8e01f9 Added separate no-extracing flag to rule page parsing 2021-03-14 13:03:29 +03:00
soxoj cf7acfd8c8 Merge pull request #69 from soxoj/tiktok-fix
TikTok fixes
2021-03-13 00:02:25 +03:00
soxoj 9e6bd05acc Merge pull request #68 from soxoj/ssl-error-catching
Fixed catching of python-specific exception
2021-03-13 00:00:45 +03:00
Soxoj 6ea1dc33f7 TikTok fixes 2021-03-12 23:58:46 +03:00
Soxoj d5bc92d26a Fixed catching of python-specific exception 2021-03-12 23:34:59 +03:00
soxoj f7263c9b3c Merge pull request #67 from soxoj/fp-fixes
Some false positives fixes
2021-03-12 23:31:54 +03:00
Soxoj e6f82a8ba3 Some false positives fixes 2021-03-12 22:53:53 +03:00
soxoj ba7a38092c Merge pull request #65 from soxoj/dependabot/pip/aiohttp-3.7.4
Bump aiohttp from 3.7.3 to 3.7.4
2021-02-26 22:06:04 +03:00
dependabot[bot] 92a1677213 Bump aiohttp from 3.7.3 to 3.7.4
Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.7.3 to 3.7.4.
- [Release notes](https://github.com/aio-libs/aiohttp/releases)
- [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/aiohttp/compare/v3.7.3...v3.7.4)

Signed-off-by: dependabot[bot] <support@github.com>
2021-02-26 03:07:44 +00:00
38 changed files with 10260 additions and 7134 deletions
+3 -3
@@ -15,7 +15,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.6, 3.7, 3.8, 3.9]
+        python-version: [3.6.9, 3.7, 3.8, 3.9]
     steps:
     - uses: actions/checkout@v2
@@ -26,8 +26,8 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        python -m pip install flake8 pytest
+        python -m pip install flake8 pytest pytest-rerunfailures
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
     - name: Test with pytest
       run: |
-        pytest
+        pytest --reruns 3 --reruns-delay 5
+31
@@ -2,6 +2,37 @@
 ## [Unreleased]

+## [0.1.20] - 2021-05-02
+* added `--retries` option
+* added `source` feature for sites' mirrors
+* improved `submit` mode
+* lot of style and logic fixes
+
+## [0.1.19] - 2021-04-14
+* added `--no-progressbar` option
+* fixed ascii tree bug
+* fixed `python -m maigret` run
+* fixed requests freeze with timeout async tasks
+
+## [0.1.18] - 2021-03-30
+* some API improvements
+
+## [0.1.17] - 2021-03-30
+* simplified maigret search API
+* improved documentation
+* fixed 403 response code ignoring bug
+
+## [0.1.16] - 2021-03-21
+* improved URL parsing mode
+* improved sites submit mode
+* added uID.me uguid support
+* improved requests processing
+
+## [0.1.15] - 2021-03-14
+* improved HTML reports
+* fixed python-3.6-specific error
+* false positives fixes
+
 ## [0.1.14] - 2021-02-25
 * added JSON export formats
 * improved tags markup
+6 -6
@@ -1,21 +1,21 @@
-FROM python:3.7-alpine
+FROM python:3.7

 LABEL maintainer="Soxoj <soxoj@protonmail.com>"

 WORKDIR /app

 ADD requirements.txt .

-RUN pip install --upgrade pip \
-    && apk add --update --virtual .build-dependencies \
-    build-base \
+RUN pip install --upgrade pip
+
+RUN apt update -y
+RUN apt install -y\
     gcc \
     musl-dev \
     libxml2 \
     libxml2-dev \
     libxslt-dev \
-    jpeg-dev \
     && YARL_NO_EXTENSIONS=1 python3 -m pip install maigret \
-    && apk del .build-dependencies \
     && rm -rf /var/cache/apk/* \
     /tmp/* \
     /var/tmp/*
+32 -10
@@ -26,6 +26,7 @@ Currently supported more than 2000 sites ([full list](./sites.md)), by default s
 * Search by tags (site categories, countries)
 * Censorship and captcha detection
 * Very few false positives
+* Failed requests' restarts

 ## Installation
@@ -33,20 +34,43 @@ Currently supported more than 2000 sites ([full list](./sites.md)), by default s
 **Python 3.8 is recommended.**

+### Package installing
+
 ```bash
 # install from pypi
-$ pip3 install maigret
+pip3 install maigret

 # or clone and install manually
-$ git clone https://github.com/soxoj/maigret && cd maigret
-$ pip3 install .
+git clone https://github.com/soxoj/maigret && cd maigret
+pip3 install .
 ```

+### Cloning a repository
+
+```bash
+git clone https://github.com/soxoj/maigret && cd maigret
+```
+
+You can use a free virtual machine, the repo will be automatically cloned:
+
 [![Open in Cloud Shell](https://user-images.githubusercontent.com/27065646/92304704-8d146d80-ef80-11ea-8c29-0deaabb1c702.png)](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=README.md) [![Run on Repl.it](https://user-images.githubusercontent.com/27065646/92304596-bf719b00-ef7f-11ea-987f-2c1f3c323088.png)](https://repl.it/github/soxoj/maigret)
+<a href="https://colab.research.google.com/gist//soxoj/879b51bc3b2f8b695abb054090645000/maigret.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="40"></a>
+
+```bash
+pip3 install -r requirements.txt
+```

 ## Using examples

 ```bash
-maigret user
+# for a cloned repo
+./maigret.py user
+
+# for a package
+maigret user
+```
+
+Features:
+
+```bash
 # make HTML and PDF reports
 maigret user --html --pdf
@@ -63,19 +87,17 @@ Run `maigret --help` to get arguments description. Also options are documented i
 With Docker:
 ```
-docker build -t maigret .
-
-docker run maigret user
+# manual build
+docker build -t maigret . && docker run maigret user
+
+# official image
+docker run soxoj/maigret:latest user
 ```

 ## Demo with page parsing and recursive username search

 [PDF report](./static/report_alexaimephotographycars.pdf), [HTML report](https://htmlpreview.github.io/?https://raw.githubusercontent.com/soxoj/maigret/main/static/report_alexaimephotographycars.html)

-```bash
-maigret alexaimephotographycars
-```
-
 ![animation of recursive search](./static/recursive_search.svg)

 ![HTML report screenshot](./static/report_alexaimephotography_html_screenshot.png)
+9 -11
@@ -1,15 +1,13 @@
 # HTTP Cookie File downloaded with cookies.txt by Genuinous @genuinous
 # This file can be used by wget, curl, aria2c and other standard compliant tools.
 # Usage Examples:
-# 1) wget -x --load-cookies cookies.txt "https://xss.is/search/"
-# 2) curl --cookie cookies.txt "https://xss.is/search/"
-# 3) aria2c --load-cookies cookies.txt "https://xss.is/search/"
+# 1) wget -x --load-cookies cookies.txt "https://pixabay.com/users/blue-156711/"
+# 2) curl --cookie cookies.txt "https://pixabay.com/users/blue-156711/"
+# 3) aria2c --load-cookies cookies.txt "https://pixabay.com/users/blue-156711/"
 #
-xss.is	FALSE	/	TRUE	0	xf_csrf	PMnZNsr42HETwYEr
-xss.is	FALSE	/	TRUE	0	xf_from_search	google
-xss.is	FALSE	/	TRUE	1642709308	xf_user	215268%2CZNKB_-64Wk-BOpsdtLYy-1UxfS5zGpxWaiEGUhmX
-xss.is	FALSE	/	TRUE	0	xf_session	sGdxJtP_sKV0LCG8vUQbr6cL670_EFWM
-.xss.is	TRUE	/	FALSE	0	muchacho_cache	["00fbb0f2772c9596b0483d6864563cce"]
-.xss.is	TRUE	/	FALSE	0	muchacho_png	["00fbb0f2772c9596b0483d6864563cce"]
-.xss.is	TRUE	/	FALSE	0	muchacho_etag	["00fbb0f2772c9596b0483d6864563cce"]
-.xss.is	TRUE	/	FALSE	1924905600	2e66e4dd94a7a237d0d1b4d50f01e179_evc	["00fbb0f2772c9596b0483d6864563cce"]
+.pixabay.com	TRUE	/	TRUE	1618356838	__cfduid	d56929cd50d11474f421b849df5758a881615764837
+.pixabay.com	TRUE	/	TRUE	1615766638	__cf_bm	ea8f7c565b44d749f65500f0e45176cebccaeb09-1615764837-1800-AYJIXh2boDJ6HPf44JI9fnteWABHOVvkxiSccACP9EiS1E58UDTGhViXtqjFfVE0QRj1WowP4ss2DzCs+pW+qUc=
+pixabay.com	FALSE	/	FALSE	0	anonymous_user_id	c1e4ee09-5674-4252-aa94-8c47b1ea80ab
+pixabay.com	FALSE	/	FALSE	1647214439	csrftoken	vfetTSvIul7gBlURt6s985JNM18GCdEwN5MWMKqX4yI73xoPgEj42dbNefjGx5fr
+pixabay.com	FALSE	/	FALSE	1647300839	client_width	1680
+pixabay.com	FALSE	/	FALSE	748111764839	is_human	1
Executable
+5
@@ -0,0 +1,5 @@
#!/bin/sh
FILES="maigret wizard.py maigret.py tests"
echo 'black'
black --skip-string-normalization $FILES
Executable
+11
@@ -0,0 +1,11 @@
#!/bin/sh
FILES="maigret wizard.py maigret.py tests"
echo 'syntax errors or undefined names'
flake8 --count --select=E9,F63,F7,F82 --show-source --statistics $FILES
echo 'warning'
flake8 --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --ignore=E731,W503 $FILES
echo 'mypy'
mypy ./maigret ./wizard.py ./tests
+1 -1
@@ -1,4 +1,4 @@
-#! /usr/bin/env python3
+#!/usr/bin/env python3
 import asyncio
 import sys
+4
@@ -1 +1,5 @@
 """Maigret"""
+
+from .checking import maigret as search
+from .sites import MaigretEngine, MaigretSite, MaigretDatabase
+from .notify import QueryNotifyPrint as Notifier
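
With these names re-exported, the package can be driven directly from Python. Below is a rough usage sketch, not code from the repo: the `sites` dict is left as an empty placeholder because the database-loading API is not part of this diff, and the keyword arguments mirror the `maigret()` signature shown in `checking.py` further down.

```python
# Hypothetical usage sketch of the package-level API re-exported above.
# maigret.search is checking.maigret; how to load the bundled sites database
# is not shown in this diff, so `sites` stays an empty placeholder here.
import asyncio
import logging

import maigret

logger = logging.getLogger("maigret-demo")


async def demo():
    sites = {}  # placeholder: should map site names to maigret.MaigretSite objects
    results = await maigret.search(
        username="user",       # example username, as in the README usage examples
        site_dict=sites,
        logger=logger,
        timeout=10,
        no_progressbar=True,
    )
    for sitename, wrapper in results.items():
        print(sitename, wrapper["status"])

# asyncio.run(demo())
```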
+2 -2
@@ -6,7 +6,7 @@ Maigret entrypoint
 import asyncio

-import maigret
+from .maigret import main

 if __name__ == "__main__":
-    asyncio.run(maigret.main())
+    asyncio.run(main())
+27 -27
@@ -1,56 +1,56 @@
-import aiohttp
-from aiohttp import CookieJar
-import asyncio
-import json
 from http.cookiejar import MozillaCookieJar
 from http.cookies import Morsel

 import requests
+from aiohttp import CookieJar


 class ParsingActivator:
     @staticmethod
     def twitter(site, logger, cookies={}):
         headers = dict(site.headers)
-        del headers['x-guest-token']
-        r = requests.post(site.activation['url'], headers=headers)
+        del headers["x-guest-token"]
+        r = requests.post(site.activation["url"], headers=headers)
         logger.info(r)
         j = r.json()
-        guest_token = j[site.activation['src']]
-        site.headers['x-guest-token'] = guest_token
+        guest_token = j[site.activation["src"]]
+        site.headers["x-guest-token"] = guest_token

     @staticmethod
     def vimeo(site, logger, cookies={}):
         headers = dict(site.headers)
-        if 'Authorization' in headers:
-            del headers['Authorization']
-        r = requests.get(site.activation['url'], headers=headers)
-        jwt_token = r.json()['jwt']
-        site.headers['Authorization'] = 'jwt ' + jwt_token
+        if "Authorization" in headers:
+            del headers["Authorization"]
+        r = requests.get(site.activation["url"], headers=headers)
+        jwt_token = r.json()["jwt"]
+        site.headers["Authorization"] = "jwt " + jwt_token

     @staticmethod
     def spotify(site, logger, cookies={}):
         headers = dict(site.headers)
-        if 'Authorization' in headers:
-            del headers['Authorization']
-        r = requests.get(site.activation['url'])
-        bearer_token = r.json()['accessToken']
-        site.headers['authorization'] = f'Bearer {bearer_token}'
+        if "Authorization" in headers:
+            del headers["Authorization"]
+        r = requests.get(site.activation["url"])
+        bearer_token = r.json()["accessToken"]
+        site.headers["authorization"] = f"Bearer {bearer_token}"

     @staticmethod
     def xssis(site, logger, cookies={}):
         if not cookies:
-            logger.debug('You must have cookies to activate xss.is parsing!')
+            logger.debug("You must have cookies to activate xss.is parsing!")
             return
         headers = dict(site.headers)
         post_data = {
-            '_xfResponseType': 'json',
-            '_xfToken': '1611177919,a2710362e45dad9aa1da381e21941a38'
+            "_xfResponseType": "json",
+            "_xfToken": "1611177919,a2710362e45dad9aa1da381e21941a38",
         }
-        headers['content-type'] = 'application/x-www-form-urlencoded; charset=UTF-8'
-        r = requests.post(site.activation['url'], headers=headers, cookies=cookies, data=post_data)
-        csrf = r.json()['csrf']
-        site.get_params['_xfToken'] = csrf
+        headers["content-type"] = "application/x-www-form-urlencoded; charset=UTF-8"
+        r = requests.post(
+            site.activation["url"], headers=headers, cookies=cookies, data=post_data
+        )
+        csrf = r.json()["csrf"]
+        site.get_params["_xfToken"] = csrf


 async def import_aiohttp_cookies(cookiestxt_filename):
@@ -64,8 +64,8 @@ async def import_aiohttp_cookies(cookiestxt_filename):
         for key, cookie in list(domain.values())[0].items():
             c = Morsel()
             c.set(key, cookie.value, cookie.value)
-            c['domain'] = cookie.domain
-            c['path'] = cookie.path
+            c["domain"] = cookie.domain
+            c["path"] = cookie.path
             cookies_list.append((key, c))

     cookies.update_cookies(cookies_list)
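
The cookie jar built by `import_aiohttp_cookies()` above is what the checking code hands to its `aiohttp.ClientSession`. A minimal stand-alone sketch of that wiring follows; the file path and URL are only example values taken from the test cookies file above.

```python
# Sketch: load a Netscape-format cookies.txt into an aiohttp session, the same
# way maigret's checking code does; "cookies.txt" and the URL are examples only.
import asyncio

import aiohttp

from maigret.activation import import_aiohttp_cookies


async def demo():
    cookie_jar = await import_aiohttp_cookies("cookies.txt")
    async with aiohttp.ClientSession(trust_env=True, cookie_jar=cookie_jar) as session:
        async with session.get("https://pixabay.com/users/blue-156711/") as resp:
            print(resp.status)

# asyncio.run(demo())
```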
+406 -287
@@ -1,133 +1,125 @@
import asyncio import asyncio
import logging import logging
from mock import Mock
import re import re
import ssl import ssl
import sys
import tqdm
from typing import Tuple, Optional, Dict, List
import aiohttp import aiohttp
import tqdm.asyncio import tqdm.asyncio
from aiohttp_socks import ProxyConnector from aiohttp_socks import ProxyConnector
from mock import Mock
from python_socks import _errors as proxy_errors from python_socks import _errors as proxy_errors
from socid_extractor import extract from socid_extractor import extract
from .activation import ParsingActivator, import_aiohttp_cookies from .activation import ParsingActivator, import_aiohttp_cookies
from . import errors
from .errors import CheckError
from .executors import (
AsyncExecutor,
AsyncioSimpleExecutor,
AsyncioProgressbarQueueExecutor,
)
from .result import QueryResult, QueryStatus from .result import QueryResult, QueryStatus
from .sites import MaigretDatabase, MaigretSite from .sites import MaigretDatabase, MaigretSite
from .types import QueryOptions, QueryResultWrapper
from .utils import get_random_user_agent
supported_recursive_search_ids = ( supported_recursive_search_ids = (
'yandex_public_id', "yandex_public_id",
'gaia_id', "gaia_id",
'vk_id', "vk_id",
'ok_id', "ok_id",
'wikimapia_uid', "wikimapia_uid",
'steam_id', "steam_id",
"uidme_uguid",
) )
common_errors = { unsupported_characters = "#"
'<title>Attention Required! | Cloudflare</title>': 'Cloudflare captcha',
'Please stand by, while we are checking your browser': 'Cloudflare captcha',
'<title>Доступ ограничен</title>': 'Rostelecom censorship',
'document.getElementById(\'validate_form_submit\').disabled=true': 'Mail.ru captcha',
'Verifying your browser, please wait...<br>DDoS Protection by</font> Blazingfast.io': 'Blazingfast protection',
'404</h1><p class="error-card__description">Мы&nbsp;не&nbsp;нашли страницу': 'MegaFon 404 page',
'Доступ к информационному ресурсу ограничен на основании Федерального закона': 'MGTS censorship',
'Incapsula incident ID': 'Incapsula antibot protection',
}
unsupported_characters = '#'
async def get_response(request_future, site_name, logger): async def get_response(request_future, logger) -> Tuple[str, int, Optional[CheckError]]:
html_text = None html_text = None
status_code = 0 status_code = 0
error: Optional[CheckError] = CheckError("Unknown")
error_text = "General Unknown Error"
expection_text = None
try: try:
response = await request_future response = await request_future
status_code = response.status status_code = response.status
response_content = await response.content.read() response_content = await response.content.read()
charset = response.charset or 'utf-8' charset = response.charset or "utf-8"
decoded_content = response_content.decode(charset, 'ignore') decoded_content = response_content.decode(charset, "ignore")
html_text = decoded_content html_text = decoded_content
if status_code > 0: if status_code == 0:
error_text = None error = CheckError("Connection lost")
else:
error = None
logger.debug(html_text) logger.debug(html_text)
except asyncio.TimeoutError as errt: except asyncio.TimeoutError as e:
error_text = "Timeout Error" error = CheckError("Request timeout", str(e))
expection_text = str(errt) except aiohttp.client_exceptions.ClientConnectorError as e:
except (ssl.SSLCertVerificationError, ssl.SSLError) as err: error = CheckError("Connecting failure", str(e))
error_text = "SSL Error" except aiohttp.http_exceptions.BadHttpMessage as e:
expection_text = str(err) error = CheckError("HTTP", str(e))
except aiohttp.client_exceptions.ClientConnectorError as err: except proxy_errors.ProxyError as e:
error_text = "Error Connecting" error = CheckError("Proxy", str(e))
expection_text = str(err) except KeyboardInterrupt:
except aiohttp.http_exceptions.BadHttpMessage as err: error = CheckError("Interrupted")
error_text = "HTTP Error" except Exception as e:
expection_text = str(err) # python-specific exceptions
except proxy_errors.ProxyError as err: if sys.version_info.minor > 6:
error_text = "Proxy Error" if isinstance(e, ssl.SSLCertVerificationError) or isinstance(
expection_text = str(err) e, ssl.SSLError
except Exception as err: ):
logger.warning(f'Unhandled error while requesting {site_name}: {err}') error = CheckError("SSL", str(e))
logger.debug(err, exc_info=True) else:
error_text = "Some Error" logger.debug(e, exc_info=True)
expection_text = str(err) error = CheckError("Unexpected", str(e))
# TODO: return only needed information return str(html_text), status_code, error
return html_text, status_code, error_text, expection_text
async def update_site_dict_from_response(sitename, site_dict, results_info, semaphore, logger, query_notify):
async with semaphore:
site_obj = site_dict[sitename]
future = site_obj.request_future
if not future:
# ignore: search by incompatible id type
return
response = await get_response(request_future=future,
site_name=sitename,
logger=logger)
site_dict[sitename] = process_site_result(response, query_notify, logger, results_info, site_obj)
# TODO: move to separate class # TODO: move to separate class
def detect_error_page(html_text, status_code, fail_flags, ignore_403): def detect_error_page(
html_text, status_code, fail_flags, ignore_403
) -> Optional[CheckError]:
# Detect service restrictions such as a country restriction # Detect service restrictions such as a country restriction
for flag, msg in fail_flags.items(): for flag, msg in fail_flags.items():
if flag in html_text: if flag in html_text:
return 'Some site error', msg return CheckError("Site-specific", msg)
# Detect common restrictions such as provider censorship and bot protection # Detect common restrictions such as provider censorship and bot protection
for flag, msg in common_errors.items(): err = errors.detect(html_text)
if flag in html_text: if err:
return 'Error', msg return err
# Detect common site errors # Detect common site errors
if status_code == 403 and not ignore_403: if status_code == 403 and not ignore_403:
return 'Access denied', 'Access denied, use proxy/vpn' return CheckError("Access denied", "403 status code, use proxy/vpn")
elif status_code >= 500: elif status_code >= 500:
return f'Error {status_code}', f'Site error {status_code}' return CheckError("Server", f"{status_code} status code")
return None, None return None
def process_site_result(response, query_notify, logger, results_info, site: MaigretSite): def process_site_result(
response, query_notify, logger, results_info: QueryResultWrapper, site: MaigretSite
):
if not response: if not response:
return results_info return results_info
fulltags = site.tags fulltags = site.tags
# Retrieve other site information again # Retrieve other site information again
username = results_info['username'] username = results_info["username"]
is_parsing_enabled = results_info['parsing_enabled'] is_parsing_enabled = results_info["parsing_enabled"]
url = results_info.get("url_user") url = results_info.get("url_user")
logger.debug(url) logger.debug(url)
@@ -139,42 +131,47 @@ def process_site_result(response, query_notify, logger, results_info, site: Maig
# Get the expected check type # Get the expected check type
check_type = site.check_type check_type = site.check_type
# Get the failure messages and comments
failure_errors = site.errors
# TODO: refactor # TODO: refactor
if not response: if not response:
logger.error(f'No response for {site.name}') logger.error(f"No response for {site.name}")
return results_info return results_info
html_text, status_code, error_text, expection_text = response html_text, status_code, check_error = response
site_error_text = '?'
# TODO: add elapsed request time counting # TODO: add elapsed request time counting
response_time = None response_time = None
if logger.level == logging.DEBUG: if logger.level == logging.DEBUG:
with open('debug.txt', 'a') as f: with open("debug.txt", "a") as f:
status = status_code or 'No response' status = status_code or "No response"
f.write(f'url: {url}\nerror: {str(error_text)}\nr: {status}\n') f.write(f"url: {url}\nerror: {check_error}\nr: {status}\n")
if html_text: if html_text:
f.write(f'code: {status}\nresponse: {str(html_text)}\n') f.write(f"code: {status}\nresponse: {str(html_text)}\n")
if status_code and not error_text: # additional check for errors
error_text, site_error_text = detect_error_page(html_text, status_code, failure_errors, if status_code and not check_error:
site.ignore_403) check_error = detect_error_page(
html_text, status_code, site.errors, site.ignore403
)
if site.activation and html_text: if site.activation and html_text:
is_need_activation = any([s for s in site.activation['marks'] if s in html_text]) is_need_activation = any(
[s for s in site.activation["marks"] if s in html_text]
)
if is_need_activation: if is_need_activation:
method = site.activation['method'] method = site.activation["method"]
try: try:
activate_fun = getattr(ParsingActivator(), method) activate_fun = getattr(ParsingActivator(), method)
# TODO: async call # TODO: async call
activate_fun(site, logger) activate_fun(site, logger)
except AttributeError: except AttributeError:
logger.warning(f'Activation method {method} for site {site.name} not found!') logger.warning(
f"Activation method {method} for site {site.name} not found!"
)
except Exception as e:
logger.warning(f"Failed activation {method} for site {site.name}: {e}")
site_name = site.pretty_name
# presense flags # presense flags
# True by default # True by default
presense_flags = site.presense_strs presense_flags = site.presense_strs
@@ -182,55 +179,53 @@ def process_site_result(response, query_notify, logger, results_info, site: Maig
if html_text: if html_text:
if not presense_flags: if not presense_flags:
is_presense_detected = True is_presense_detected = True
site.stats['presense_flag'] = None site.stats["presense_flag"] = None
else: else:
for presense_flag in presense_flags: for presense_flag in presense_flags:
if presense_flag in html_text: if presense_flag in html_text:
is_presense_detected = True is_presense_detected = True
site.stats['presense_flag'] = presense_flag site.stats["presense_flag"] = presense_flag
logger.info(presense_flag) logger.debug(presense_flag)
break break
if error_text is not None: def build_result(status, **kwargs):
logger.debug(error_text) return QueryResult(
result = QueryResult(username, username,
site.name, site_name,
url,
status,
query_time=response_time,
tags=fulltags,
**kwargs,
)
if check_error:
logger.debug(check_error)
result = QueryResult(
username,
site_name,
url, url,
QueryStatus.UNKNOWN, QueryStatus.UNKNOWN,
query_time=response_time, query_time=response_time,
context=f'{error_text}: {site_error_text}', tags=fulltags) error=check_error,
context=str(CheckError),
tags=fulltags,
)
elif check_type == "message": elif check_type == "message":
absence_flags = site.absence_strs
is_absence_flags_list = isinstance(absence_flags, list)
absence_flags_set = set(absence_flags) if is_absence_flags_list else {absence_flags}
# Checks if the error message is in the HTML # Checks if the error message is in the HTML
is_absence_detected = any([(absence_flag in html_text) for absence_flag in absence_flags_set]) is_absence_detected = any(
[(absence_flag in html_text) for absence_flag in site.absence_strs]
)
if not is_absence_detected and is_presense_detected: if not is_absence_detected and is_presense_detected:
result = QueryResult(username, result = build_result(QueryStatus.CLAIMED)
site.name,
url,
QueryStatus.CLAIMED,
query_time=response_time, tags=fulltags)
else: else:
result = QueryResult(username, result = build_result(QueryStatus.AVAILABLE)
site.name,
url,
QueryStatus.AVAILABLE,
query_time=response_time, tags=fulltags)
elif check_type == "status_code": elif check_type == "status_code":
# Checks if the status code of the response is 2XX # Checks if the status code of the response is 2XX
if (not status_code >= 300 or status_code < 200) and is_presense_detected: if is_presense_detected and (not status_code >= 300 or status_code < 200):
result = QueryResult(username, result = build_result(QueryStatus.CLAIMED)
site.name,
url,
QueryStatus.CLAIMED,
query_time=response_time, tags=fulltags)
else: else:
result = QueryResult(username, result = build_result(QueryStatus.AVAILABLE)
site.name,
url,
QueryStatus.AVAILABLE,
query_time=response_time, tags=fulltags)
elif check_type == "response_url": elif check_type == "response_url":
# For this detection method, we have turned off the redirect. # For this detection method, we have turned off the redirect.
# So, there is no need to check the response URL: it will always # So, there is no need to check the response URL: it will always
@@ -238,21 +233,14 @@ def process_site_result(response, query_notify, logger, results_info, site: Maig
# code indicates that the request was successful (i.e. no 404, or # code indicates that the request was successful (i.e. no 404, or
# forward to some odd redirect). # forward to some odd redirect).
if 200 <= status_code < 300 and is_presense_detected: if 200 <= status_code < 300 and is_presense_detected:
result = QueryResult(username, result = build_result(QueryStatus.CLAIMED)
site.name,
url,
QueryStatus.CLAIMED,
query_time=response_time, tags=fulltags)
else: else:
result = QueryResult(username, result = build_result(QueryStatus.AVAILABLE)
site.name,
url,
QueryStatus.AVAILABLE,
query_time=response_time, tags=fulltags)
else: else:
# It should be impossible to ever get here... # It should be impossible to ever get here...
raise ValueError(f"Unknown check type '{check_type}' for " raise ValueError(
f"site '{site.name}'") f"Unknown check type '{check_type}' for " f"site '{site.name}'"
)
extracted_ids_data = {} extracted_ids_data = {}
@@ -260,142 +248,103 @@ def process_site_result(response, query_notify, logger, results_info, site: Maig
try: try:
extracted_ids_data = extract(html_text) extracted_ids_data = extract(html_text)
except Exception as e: except Exception as e:
logger.warning(f'Error while parsing {site.name}: {e}', exc_info=True) logger.warning(f"Error while parsing {site.name}: {e}", exc_info=True)
if extracted_ids_data: if extracted_ids_data:
new_usernames = {} new_usernames = {}
for k, v in extracted_ids_data.items(): for k, v in extracted_ids_data.items():
if 'username' in k: if "username" in k:
new_usernames[v] = 'username' new_usernames[v] = "username"
if k in supported_recursive_search_ids: if k in supported_recursive_search_ids:
new_usernames[v] = k new_usernames[v] = k
results_info['ids_usernames'] = new_usernames results_info["ids_usernames"] = new_usernames
results_info['ids_links'] = eval(extracted_ids_data.get('links', '[]')) results_info["ids_links"] = eval(extracted_ids_data.get("links", "[]"))
result.ids_data = extracted_ids_data result.ids_data = extracted_ids_data
# Notify caller about results of query. # Notify caller about results of query.
query_notify.update(result, site.similar_search) query_notify.update(result, site.similar_search)
# Save status of request # Save status of request
results_info['status'] = result results_info["status"] = result
# Save results from request # Save results from request
results_info['http_status'] = status_code results_info["http_status"] = status_code
results_info['is_similar'] = site.similar_search results_info["is_similar"] = site.similar_search
# results_site['response_text'] = html_text # results_site['response_text'] = html_text
results_info['rank'] = site.alexa_rank results_info["rank"] = site.alexa_rank
return results_info return results_info
async def maigret(username, site_dict, query_notify, logger, def make_site_result(
proxy=None, timeout=None, recursive_search=False, site: MaigretSite, username: str, options: QueryOptions, logger
id_type='username', debug=False, forced=False, ) -> QueryResultWrapper:
max_connections=100, no_progressbar=False, results_site: QueryResultWrapper = {}
cookies=None):
"""Main search func
Checks for existence of username on various social media sites.
Keyword Arguments:
username -- String indicating username that report
should be created against.
site_dict -- Dictionary containing all of the site data.
query_notify -- Object with base type of QueryNotify().
This will be used to notify the caller about
query results.
proxy -- String indicating the proxy URL
timeout -- Time in seconds to wait before timing out request.
Default is no timeout.
recursive_search -- Search for other usernames in website pages & recursive search by them.
Return Value:
Dictionary containing results from report. Key of dictionary is the name
of the social network site, and the value is another dictionary with
the following keys:
url_main: URL of main site.
url_user: URL of user on site (if account exists).
status: QueryResult() object indicating results of test for
account existence.
http_status: HTTP status code of query which checked for existence on
site.
response_text: Text that came back from request. May be None if
there was an HTTP error when checking for existence.
"""
# Notify caller that we are starting the query.
query_notify.start(username, id_type)
# TODO: connector
connector = ProxyConnector.from_url(proxy) if proxy else aiohttp.TCPConnector(ssl=False)
# connector = aiohttp.TCPConnector(ssl=False)
connector.verify_ssl = False
cookie_jar = None
if cookies:
logger.debug(f'Using cookies jar file {cookies}')
cookie_jar = await import_aiohttp_cookies(cookies)
session = aiohttp.ClientSession(connector=connector, trust_env=True, cookie_jar=cookie_jar)
if logger.level == logging.DEBUG:
future = session.get(url='https://icanhazip.com')
ip, status, error, expection = await get_response(future, None, logger)
if ip:
logger.debug(f'My IP is: {ip.strip()}')
else:
logger.debug(f'IP requesting {error}: {expection}')
# Results from analysis of all sites
results_total = {}
# First create futures for all requests. This allows for the requests to run in parallel
for site_name, site in site_dict.items():
if site.type != id_type:
continue
if site.disabled and not forced:
logger.debug(f'Site {site.name} is disabled, skipping...')
continue
# Results from analysis of this specific site
results_site = {}
# Record URL of main site and username # Record URL of main site and username
results_site['username'] = username results_site["site"] = site
results_site['parsing_enabled'] = recursive_search results_site["username"] = username
results_site['url_main'] = site.url_main results_site["parsing_enabled"] = options["parsing"]
results_site['cookies'] = cookie_jar and cookie_jar.filter_cookies(site.url_main) or None results_site["url_main"] = site.url_main
results_site["cookies"] = (
options.get("cookie_jar")
and options["cookie_jar"].filter_cookies(site.url_main)
or None
)
headers = { headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:55.0) Gecko/20100101 Firefox/55.0', "User-Agent": get_random_user_agent(),
} }
headers.update(site.headers) headers.update(site.headers)
if not 'url' in site.__dict__: if "url" not in site.__dict__:
logger.error('No URL for site %s', site.name) logger.error("No URL for site %s", site.name)
# URL of user on site (if it exists) # URL of user on site (if it exists)
url = site.url.format( url = site.url.format(
urlMain=site.url_main, urlMain=site.url_main, urlSubpath=site.url_subpath, username=username
urlSubpath=site.url_subpath,
username=username
) )
# workaround to prevent slash errors
url = re.sub('(?<!:)/+', '/', url)
# Don't make request if username is invalid for the site # workaround to prevent slash errors
if site.regex_check and re.search(site.regex_check, username) is None: url = re.sub("(?<!:)/+", "/", url)
# No need to do the check at the site: this user name is not allowed.
results_site['status'] = QueryResult(username, session = options['session']
site_name,
# site check is disabled
if site.disabled and not options['forced']:
logger.debug(f"Site {site.name} is disabled, skipping...")
results_site["status"] = QueryResult(
username,
site.name,
url, url,
QueryStatus.ILLEGAL) QueryStatus.ILLEGAL,
error=CheckError("Check is disabled"),
)
# current username type could not be applied
elif site.type != options["id_type"]:
results_site["status"] = QueryResult(
username,
site.name,
url,
QueryStatus.ILLEGAL,
error=CheckError('Unsupported identifier type', f'Want "{site.type}"'),
)
# username is not allowed.
elif site.regex_check and re.search(site.regex_check, username) is None:
results_site["status"] = QueryResult(
username,
site.name,
url,
QueryStatus.ILLEGAL,
error=CheckError(
'Unsupported username format', f'Want "{site.regex_check}"'
),
)
results_site["url_user"] = "" results_site["url_user"] = ""
results_site['http_status'] = "" results_site["http_status"] = ""
results_site['response_text'] = "" results_site["response_text"] = ""
query_notify.update(results_site['status']) # query_notify.update(results_site["status"])
else: else:
# URL of user on site (if it exists) # URL of user on site (if it exists)
results_site["url_user"] = url results_site["url_user"] = url
@@ -413,9 +362,9 @@ async def maigret(username, site_dict, query_notify, logger,
) )
for k, v in site.get_params.items(): for k, v in site.get_params.items():
url_probe += f'&{k}={v}' url_probe += f"&{k}={v}"
if site.check_type == 'status_code' and site.request_head_only: if site.check_type == "status_code" and site.request_head_only:
# In most cases when we are detecting by status code, # In most cases when we are detecting by status code,
# it is not necessary to get the entire body: we can # it is not necessary to get the entire body: we can
# detect fine with just the HEAD response. # detect fine with just the HEAD response.
@@ -436,40 +385,203 @@ async def maigret(username, site_dict, query_notify, logger,
# The final result of the request will be what is available. # The final result of the request will be what is available.
allow_redirects = True allow_redirects = True
future = request_method(url=url_probe, headers=headers, future = request_method(
url=url_probe,
headers=headers,
allow_redirects=allow_redirects, allow_redirects=allow_redirects,
timeout=timeout, timeout=options['timeout'],
) )
# Store future in data for access later # Store future request object in the results object
# TODO: move to separate obj results_site["future"] = future
site.request_future = future
# Add this site's results into final dictionary with all of the other results. return results_site
results_total[site_name] = results_site
# TODO: move into top-level function
sem = asyncio.Semaphore(max_connections) async def check_site_for_username(
site, username, options: QueryOptions, logger, query_notify, *args, **kwargs
) -> Tuple[str, QueryResultWrapper]:
default_result = make_site_result(site, username, options, logger)
future = default_result.get("future")
if not future:
return site.name, default_result
tasks = [] response = await get_response(request_future=future, logger=logger)
for sitename, result_obj in results_total.items():
update_site_coro = update_site_dict_from_response(sitename, site_dict, result_obj, sem, logger, query_notify)
future = asyncio.ensure_future(update_site_coro)
tasks.append(future)
if no_progressbar: response_result = process_site_result(
await asyncio.gather(*tasks) response, query_notify, logger, default_result, site
)
return site.name, response_result
async def debug_ip_request(session, logger):
future = session.get(url="https://icanhazip.com")
ip, status, check_error = await get_response(future, logger)
if ip:
logger.debug(f"My IP is: {ip.strip()}")
else: else:
for f in tqdm.asyncio.tqdm.as_completed(tasks): logger.debug(f"IP requesting {check_error.type}: {check_error.desc}")
await f
def get_failed_sites(results: Dict[str, QueryResultWrapper]) -> List[str]:
sites = []
for sitename, r in results.items():
status = r.get('status', {})
if status and status.error:
if errors.is_permanent(status.error.type):
continue
sites.append(sitename)
return sites
async def maigret(
username: str,
site_dict: Dict[str, MaigretSite],
logger,
query_notify=None,
proxy=None,
timeout=None,
is_parsing_enabled=False,
id_type="username",
debug=False,
forced=False,
max_connections=100,
no_progressbar=False,
cookies=None,
retries=0,
) -> QueryResultWrapper:
"""Main search func
Checks for existence of username on certain sites.
Keyword Arguments:
username -- Username string will be used for search.
site_dict -- Dictionary containing sites data in MaigretSite objects.
query_notify -- Object with base type of QueryNotify().
This will be used to notify the caller about
query results.
logger -- Standard Python logger object.
timeout -- Time in seconds to wait before timing out request.
Default is no timeout.
is_parsing_enabled -- Extract additional info from account pages.
id_type -- Type of username to search.
Default is 'username', see all supported here:
https://github.com/soxoj/maigret/wiki/Supported-identifier-types
max_connections -- Maximum number of concurrent connections allowed.
Default is 100.
no_progressbar -- Displaying of ASCII progressbar during scanner.
cookies -- Filename of a cookie jar file to use for each request.
Return Value:
Dictionary containing results from report. Key of dictionary is the name
of the social network site, and the value is another dictionary with
the following keys:
url_main: URL of main site.
url_user: URL of user on site (if account exists).
status: QueryResult() object indicating results of test for
account existence.
http_status: HTTP status code of query which checked for existence on
site.
response_text: Text that came back from request. May be None if
there was an HTTP error when checking for existence.
"""
# notify caller that we are starting the query.
if not query_notify:
query_notify = Mock()
query_notify.start(username, id_type)
# make http client session
connector = (
ProxyConnector.from_url(proxy) if proxy else aiohttp.TCPConnector(ssl=False)
)
connector.verify_ssl = False
cookie_jar = None
if cookies:
logger.debug(f"Using cookies jar file {cookies}")
cookie_jar = await import_aiohttp_cookies(cookies)
session = aiohttp.ClientSession(
connector=connector, trust_env=True, cookie_jar=cookie_jar
)
if logger.level == logging.DEBUG:
await debug_ip_request(session, logger)
# setup parallel executor
executor: Optional[AsyncExecutor] = None
if no_progressbar:
executor = AsyncioSimpleExecutor(logger=logger)
else:
executor = AsyncioProgressbarQueueExecutor(
logger=logger, in_parallel=max_connections, timeout=timeout + 0.5
)
# make options objects for all the requests
options: QueryOptions = {}
options["cookies"] = cookie_jar
options["session"] = session
options["parsing"] = is_parsing_enabled
options["timeout"] = timeout
options["id_type"] = id_type
options["forced"] = forced
# results from analysis of all sites
all_results: Dict[str, QueryResultWrapper] = {}
sites = list(site_dict.keys())
attempts = retries + 1
while attempts:
tasks_dict = {}
for sitename, site in site_dict.items():
if sitename not in sites:
continue
default_result: QueryResultWrapper = {
'site': site,
'status': QueryResult(
username,
sitename,
'',
QueryStatus.UNKNOWN,
error=CheckError('Request failed'),
),
}
tasks_dict[sitename] = (
check_site_for_username,
[site, username, options, logger, query_notify],
{'default': (sitename, default_result)},
)
cur_results = await executor.run(tasks_dict.values())
# wait for executor timeout errors
await asyncio.sleep(1)
all_results.update(cur_results)
sites = get_failed_sites(dict(cur_results))
attempts -= 1
if not sites:
break
if attempts:
query_notify.warning(
f'Restarting checks for {len(sites)} sites... ({attempts} attempts left)'
)
# closing http client session
await session.close() await session.close()
# Notify caller that all queries are finished. # notify caller that all queries are finished
query_notify.finish() query_notify.finish()
return results_total return all_results
def timeout_check(value): def timeout_check(value):
@@ -497,10 +609,11 @@ def timeout_check(value):
return timeout return timeout
async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=False): async def site_self_check(
query_notify = Mock() site: MaigretSite, logger, semaphore, db: MaigretDatabase, silent=False
):
changes = { changes = {
'disabled': False, "disabled": False,
} }
try: try:
@@ -513,29 +626,29 @@ async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=F
logger.error(site.__dict__) logger.error(site.__dict__)
check_data = [] check_data = []
logger.info(f'Checking {site.name}...') logger.info(f"Checking {site.name}...")
for username, status in check_data: for username, status in check_data:
async with semaphore: async with semaphore:
results_dict = await maigret( results_dict = await maigret(
username, username=username,
{site.name: site}, site_dict={site.name: site},
query_notify, logger=logger,
logger,
timeout=30, timeout=30,
id_type=site.type, id_type=site.type,
forced=True, forced=True,
no_progressbar=True, no_progressbar=True,
retries=1,
) )
# don't disable entries with other ids types # don't disable entries with other ids types
# TODO: make normal checking # TODO: make normal checking
if site.name not in results_dict: if site.name not in results_dict:
logger.info(results_dict) logger.info(results_dict)
changes['disabled'] = True changes["disabled"] = True
continue continue
result = results_dict[site.name]['status'] result = results_dict[site.name]["status"]
site_status = result.status site_status = result.status
@@ -544,33 +657,37 @@ async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=F
msgs = site.absence_strs msgs = site.absence_strs
etype = site.check_type etype = site.check_type
logger.warning( logger.warning(
f'Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}') f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
)
# don't disable in case of available username # don't disable in case of available username
if status == QueryStatus.CLAIMED: if status == QueryStatus.CLAIMED:
changes['disabled'] = True changes["disabled"] = True
elif status == QueryStatus.CLAIMED: elif status == QueryStatus.CLAIMED:
logger.warning(f'Not found `{username}` in {site.name}, must be claimed') logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name]) logger.info(results_dict[site.name])
changes['disabled'] = True changes["disabled"] = True
else: else:
logger.warning(f'Found `{username}` in {site.name}, must be available') logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name]) logger.info(results_dict[site.name])
changes['disabled'] = True changes["disabled"] = True
logger.info(f'Site {site.name} checking is finished') logger.info(f"Site {site.name} checking is finished")
if changes['disabled'] != site.disabled: if changes["disabled"] != site.disabled:
site.disabled = changes['disabled'] site.disabled = changes["disabled"]
db.update_site(site) db.update_site(site)
if not silent: if not silent:
action = 'Disabled' if site.disabled else 'Enabled' action = "Disabled" if site.disabled else "Enabled"
print(f'{action} site {site.name}...') print(f"{action} site {site.name}...")
return changes return changes
async def self_check(db: MaigretDatabase, site_data: dict, logger, silent=False, async def self_check(
max_connections=10) -> bool: db: MaigretDatabase, site_data: dict, logger, silent=False, max_connections=10
) -> bool:
sem = asyncio.Semaphore(max_connections) sem = asyncio.Semaphore(max_connections)
tasks = [] tasks = []
all_sites = site_data all_sites = site_data
@@ -592,13 +709,15 @@ async def self_check(db: MaigretDatabase, site_data: dict, logger, silent=False,
total_disabled = disabled_new_count - disabled_old_count total_disabled = disabled_new_count - disabled_old_count
if total_disabled >= 0: if total_disabled >= 0:
message = 'Disabled' message = "Disabled"
else: else:
message = 'Enabled' message = "Enabled"
total_disabled *= -1 total_disabled *= -1
if not silent: if not silent:
print( print(
f'{message} {total_disabled} ({disabled_old_count} => {disabled_new_count}) checked sites. Run with `--info` flag to get more information') f"{message} {total_disabled} ({disabled_old_count} => {disabled_new_count}) checked sites. "
"Run with `--info` flag to get more information"
)
return total_disabled != 0 return total_disabled != 0
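
The centerpiece of the refactored `maigret()` above is its retry loop: each pass stores its results, `get_failed_sites()` keeps only sites whose error type is temporary (per `errors.is_permanent()`), and the loop reruns just those until the attempts budget from `retries` is spent. Condensed into a synchronous sketch under those assumptions (the callable names here are illustrative, not the real API):

```python
# Condensed, synchronous sketch of the retry pattern used by maigret() above.
# check_all and failed_temporarily are illustrative callables, not real API.
from typing import Callable, Dict, Iterable, List


def run_with_retries(
    sites: Iterable[str],
    check_all: Callable[[List[str]], Dict[str, dict]],
    failed_temporarily: Callable[[dict], bool],
    retries: int = 0,
) -> Dict[str, dict]:
    results: Dict[str, dict] = {}
    pending = list(sites)
    attempts = retries + 1  # one regular pass plus `retries` extra passes
    while attempts and pending:
        cur = check_all(pending)  # one executor pass over the remaining sites
        results.update(cur)
        # keep only sites that failed with a retryable (temporary) error
        pending = [name for name, res in cur.items() if failed_temporarily(res)]
        attempts -= 1
    return results
```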
+115
@@ -0,0 +1,115 @@
from typing import Dict, List, Any
from .result import QueryResult
# error got as a result of completed search query
class CheckError:
_type = 'Unknown'
_desc = ''
def __init__(self, typename, desc=''):
self._type = typename
self._desc = desc
def __str__(self):
if not self._desc:
return f'{self._type} error'
return f'{self._type} error: {self._desc}'
@property
def type(self):
return self._type
@property
def desc(self):
return self._desc
COMMON_ERRORS = {
'<title>Attention Required! | Cloudflare</title>': CheckError(
'Captcha', 'Cloudflare'
),
'Please stand by, while we are checking your browser': CheckError(
'Bot protection', 'Cloudflare'
),
'<title>Доступ ограничен</title>': CheckError('Censorship', 'Rostelecom'),
'document.getElementById(\'validate_form_submit\').disabled=true': CheckError(
'Captcha', 'Mail.ru'
),
'Verifying your browser, please wait...<br>DDoS Protection by</font> Blazingfast.io': CheckError(
'Bot protection', 'Blazingfast'
),
'404</h1><p class="error-card__description">Мы&nbsp;не&nbsp;нашли страницу': CheckError(
'Resolving', 'MegaFon 404 page'
),
'Доступ к информационному ресурсу ограничен на основании Федерального закона': CheckError(
'Censorship', 'MGTS'
),
'Incapsula incident ID': CheckError('Bot protection', 'Incapsula'),
}
ERRORS_TYPES = {
'Captcha': 'Try to switch to another IP address or to use service cookies',
'Bot protection': 'Try to switch to another IP address',
'Censorship': 'switch to another internet service provider',
'Request timeout': 'Try to increase timeout or to switch to another internet service provider',
}
TEMPORARY_ERRORS_TYPES = [
'Request timeout',
'Unknown',
'Request failed',
'Connecting failure',
'HTTP',
'Proxy',
'Interrupted',
'Connection lost',
]
THRESHOLD = 3 # percent
def is_important(err_data):
return err_data['perc'] >= THRESHOLD
def is_permanent(err_type):
return err_type not in TEMPORARY_ERRORS_TYPES
def detect(text):
for flag, err in COMMON_ERRORS.items():
if flag in text:
return err
return None
def solution_of(err_type) -> str:
return ERRORS_TYPES.get(err_type, '')
def extract_and_group(search_res: dict) -> List[Dict[str, Any]]:
errors_counts: Dict[str, int] = {}
for r in search_res:
if r and isinstance(r, dict) and r.get('status'):
if not isinstance(r['status'], QueryResult):
continue
err = r['status'].error
if not err:
continue
errors_counts[err.type] = errors_counts.get(err.type, 0) + 1
counts = []
for err, count in sorted(errors_counts.items(), key=lambda x: x[1], reverse=True):
counts.append(
{
'err': err,
'count': count,
'perc': round(count / len(search_res), 2) * 100,
}
)
return counts
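A minimal usage sketch for the new errors module. It assumes the package is importable as maigret and fakes a single search result; the marker string, username and URL are made up, the API calls are the ones defined above.

from maigret import errors
from maigret.result import QueryResult, QueryStatus

# detect() maps known HTML markers to a CheckError
err = errors.detect("Incapsula incident ID: 42")
print(err)  # Bot protection error: Incapsula

# extract_and_group() aggregates errors over finished per-site results
fake = QueryResult('user', 'Example', 'https://example.com/user',
                   QueryStatus.UNKNOWN, error=err)
stats = errors.extract_and_group([{'status': fake}, {'status': None}])
for stat in stats:
    if errors.is_important(stat):
        print(stat['err'], stat['count'], f"{stat['perc']}%", errors.solution_of(stat['err']))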
+118
@@ -0,0 +1,118 @@
import asyncio
import time
import tqdm
import tqdm.asyncio  # makes tqdm.asyncio.tqdm.as_completed available below
import sys
from typing import Iterable, Any, List
from .types import QueryDraft
def create_task_func():
if sys.version_info.minor > 6:
create_asyncio_task = asyncio.create_task
else:
loop = asyncio.get_event_loop()
create_asyncio_task = loop.create_task
return create_asyncio_task
class AsyncExecutor:
def __init__(self, *args, **kwargs):
self.logger = kwargs['logger']
async def run(self, tasks: Iterable[QueryDraft]):
start_time = time.time()
results = await self._run(tasks)
self.execution_time = time.time() - start_time
self.logger.debug(f'Spent time: {self.execution_time}')
return results
async def _run(self, tasks: Iterable[QueryDraft]):
await asyncio.sleep(0)
class AsyncioSimpleExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
async def _run(self, tasks: Iterable[QueryDraft]):
futures = [f(*args, **kwargs) for f, args, kwargs in tasks]
return await asyncio.gather(*futures)
class AsyncioProgressbarExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
async def _run(self, tasks: Iterable[QueryDraft]):
futures = [f(*args, **kwargs) for f, args, kwargs in tasks]
results = []
for f in tqdm.asyncio.tqdm.as_completed(futures):
results.append(await f)
return results
class AsyncioProgressbarSemaphoreExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.semaphore = asyncio.Semaphore(kwargs.get('in_parallel', 1))
async def _run(self, tasks: Iterable[QueryDraft]):
async def _wrap_query(q: QueryDraft):
async with self.semaphore:
f, args, kwargs = q
return await f(*args, **kwargs)
async def semaphore_gather(tasks: Iterable[QueryDraft]):
coros = [_wrap_query(q) for q in tasks]
results = []
for f in tqdm.asyncio.tqdm.as_completed(coros):
results.append(await f)
return results
return await semaphore_gather(tasks)
class AsyncioProgressbarQueueExecutor(AsyncExecutor):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.workers_count = kwargs.get('in_parallel', 10)
self.progress_func = kwargs.get('progress_func', tqdm.tqdm)
self.queue = asyncio.Queue(self.workers_count)
self.timeout = kwargs.get('timeout')
async def worker(self):
while True:
try:
f, args, kwargs = self.queue.get_nowait()
except asyncio.QueueEmpty:
return
query_future = f(*args, **kwargs)
query_task = create_task_func()(query_future)
try:
result = await asyncio.wait_for(query_task, timeout=self.timeout)
except asyncio.TimeoutError:
result = kwargs.get('default')
self.results.append(result)
self.progress.update(1)
self.queue.task_done()
async def _run(self, queries: Iterable[QueryDraft]):
self.results: List[Any] = []
queries_list = list(queries)
min_workers = min(len(queries_list), self.workers_count)
workers = [create_task_func()(self.worker()) for _ in range(min_workers)]
self.progress = self.progress_func(total=len(queries_list))
for t in queries_list:
await self.queue.put(t)
await self.queue.join()
for w in workers:
w.cancel()
self.progress.close()
return self.results
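An illustrative sketch of the queue-based executor above. The maigret.executors module path and the demo coroutine are assumptions; the constructor keywords (logger, in_parallel, timeout) and the (function, args, kwargs) task triples are taken from the code in this diff.

import asyncio
import logging

from maigret.executors import AsyncioProgressbarQueueExecutor

async def check_site(name, delay=0.1):
    await asyncio.sleep(delay)  # imitate a slow HTTP check
    return f'{name}: ok'

async def demo():
    executor = AsyncioProgressbarQueueExecutor(
        logger=logging.getLogger('demo'), in_parallel=5, timeout=1
    )
    # each task is a (coroutine function, args, kwargs) triple, i.e. a QueryDraft
    tasks = [(check_site, (f'site{i}',), {}) for i in range(20)]
    results = await executor.run(tasks)
    print(len(results), 'results in', round(executor.execution_time, 2), 's')

# asyncio.run(demo())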
+386 -172
@@ -1,188 +1,360 @@
""" """
Maigret main module Maigret main module
""" """
import aiohttp
import asyncio
import logging
import os import os
import platform
import sys import sys
import platform
from argparse import ArgumentParser, RawDescriptionHelpFormatter from argparse import ArgumentParser, RawDescriptionHelpFormatter
import requests import requests
from socid_extractor import parse, __version__ as socid_version from socid_extractor import extract, parse, __version__ as socid_version
from .checking import * from .checking import (
timeout_check,
supported_recursive_search_ids,
self_check,
unsupported_characters,
maigret,
)
from . import errors
from .notify import QueryNotifyPrint from .notify import QueryNotifyPrint
from .report import save_csv_report, save_xmind_report, save_html_report, save_pdf_report, \ from .report import (
generate_report_context, save_txt_report, SUPPORTED_JSON_REPORT_FORMATS, check_supported_json_format, \ save_csv_report,
save_json_report save_xmind_report,
save_html_report,
save_pdf_report,
generate_report_context,
save_txt_report,
SUPPORTED_JSON_REPORT_FORMATS,
check_supported_json_format,
save_json_report,
)
from .sites import MaigretDatabase
from .submit import submit_dialog from .submit import submit_dialog
from .utils import get_dict_ascii_tree
__version__ = '0.1.20'


def notify_about_errors(search_results, query_notify):
    errs = errors.extract_and_group(search_results.values())
    was_errs_displayed = False
    for e in errs:
        if not errors.is_important(e):
            continue
        text = f'Too many errors of type "{e["err"]}" ({e["perc"]}%)'
        solution = errors.solution_of(e['err'])
        if solution:
            text = '. '.join([text, solution])

        query_notify.warning(text, '!')
        was_errs_displayed = True

    if was_errs_displayed:
        query_notify.warning(
            'You can see detailed site check errors with a flag `--print-errors`'
        )


def setup_arguments_parser():
    version_string = '\n'.join(
        [
            f'%(prog)s {__version__}',
            f'Socid-extractor: {socid_version}',
            f'Aiohttp: {aiohttp.__version__}',
            f'Requests: {requests.__version__}',
            f'Python: {platform.python_version()}',
        ]
    )
    parser = ArgumentParser(
        formatter_class=RawDescriptionHelpFormatter,
        description=f"Maigret v{__version__}",
    )
    parser.add_argument(
        "--version", action="version", version=version_string,
        help="Display version information and dependencies.",
    )
    parser.add_argument(
        "--info", "-vv", action="store_true", dest="info", default=False,
        help="Display service information.",
    )
    parser.add_argument(
        "--verbose", "-v", action="store_true", dest="verbose", default=False,
        help="Display extra information and metrics.",
    )
    parser.add_argument(
        "-d", "--debug", "-vvv", action="store_true", dest="debug", default=False,
        help="Save debugging information and site responses in debug.txt.",
    )
    parser.add_argument(
        "--site", action="append", metavar='SITE_NAME', dest="site_list", default=[],
        help="Limit analysis to just the listed sites (use several times to specify more than one)",
    )
    parser.add_argument(
        "--proxy", "-p", metavar='PROXY_URL', action="store", dest="proxy", default=None,
        help="Make requests over a proxy. e.g. socks5://127.0.0.1:1080",
    )
    parser.add_argument(
        "--db", metavar="DB_FILE", dest="db_file", default=None,
        help="Load Maigret database from a JSON file or an online, valid, JSON file.",
    )
    parser.add_argument(
        "--cookies-jar-file", metavar="COOKIE_FILE", dest="cookie_file", default=None,
        help="File with cookies.",
    )
    parser.add_argument(
        "--timeout", action="store", metavar='TIMEOUT', dest="timeout",
        type=timeout_check, default=30,
        help="Time (in seconds) to wait for response to requests. "
        "Default timeout of 30.0s. "
        "A longer timeout will be more likely to get results from slow sites. "
        "On the other hand, this may cause a long delay to gather all results. ",
    )
    parser.add_argument(
        "--retries", action="store", type=int, metavar='RETRIES', default=1,
        help="Attempts to restart temporarily failed requests.",
    )
    parser.add_argument(
        "-n", "--max-connections", action="store", type=int, dest="connections", default=100,
        help="Allowed number of concurrent connections.",
    )
    parser.add_argument(
        "-a", "--all-sites", action="store_true", dest="all_sites", default=False,
        help="Use all sites for scan.",
    )
    parser.add_argument(
        "--top-sites", action="store", default=500, type=int,
        help="Count of sites for scan ranked by Alexa Top (default: 500).",
    )
    parser.add_argument(
        "--print-not-found", action="store_true", dest="print_not_found", default=False,
        help="Print sites where the username was not found.",
    )
    parser.add_argument(
        "--print-errors", action="store_true", dest="print_check_errors", default=False,
        help="Print error messages: connection, captcha, site country ban, etc.",
    )
    parser.add_argument(
        "--submit", metavar='EXISTING_USER_URL', type=str, dest="new_site_to_submit", default=False,
        help="URL of an existing profile on a new site to submit.",
    )
    parser.add_argument(
        "--no-color", action="store_true", dest="no_color", default=False,
        help="Don't color terminal output",
    )
    parser.add_argument(
        "--no-progressbar", action="store_true", dest="no_progressbar", default=False,
        help="Don't show progressbar.",
    )
    parser.add_argument(
        "--browse", "-b", action="store_true", dest="browse", default=False,
        help="Browse to all results on default browser.",
    )
    parser.add_argument(
        "--no-recursion", action="store_true", dest="disable_recursive_search", default=False,
        help="Disable recursive search by additional data extracted from pages.",
    )
    parser.add_argument(
        "--no-extracting", action="store_true", dest="disable_extracting", default=False,
        help="Disable parsing pages for additional data and other usernames.",
    )
    parser.add_argument(
        "--self-check", action="store_true", default=False,
        help="Do self check for sites and database and disable non-working ones.",
    )
    parser.add_argument(
        "--stats", action="store_true", default=False, help="Show database statistics."
    )
    parser.add_argument(
        "--use-disabled-sites", action="store_true", default=False,
        help="Use disabled sites to search (may cause many false positives).",
    )
    parser.add_argument(
        "--parse", dest="parse_url", default='',
        help="Parse page by URL and extract username and IDs to use for search.",
    )
    parser.add_argument(
        "--id-type", dest="id_type", default='username',
        help="Specify identifier(s) type (default: username).",
    )
    parser.add_argument(
        "--ignore-ids", action="append", metavar='IGNORED_IDS', dest="ignore_ids_list", default=[],
        help="Do not make search by the specified username or other ids.",
    )
    parser.add_argument(
        "username", nargs='+', metavar='USERNAMES', action="store",
        help="One or more usernames to check with social networks.",
    )
    parser.add_argument(
        "--tags", dest="tags", default='', help="Specify tags of sites."
    )
    # reports options
    parser.add_argument(
        "--folderoutput", "-fo", dest="folderoutput", default="reports",
        help="If using multiple usernames, the output of the results will be saved to this folder.",
    )
    parser.add_argument(
        "-T", "--txt", action="store_true", dest="txt", default=False,
        help="Create a TXT report (one report per username).",
    )
    parser.add_argument(
        "-C", "--csv", action="store_true", dest="csv", default=False,
        help="Create a CSV report (one report per username).",
    )
    parser.add_argument(
        "-H", "--html", action="store_true", dest="html", default=False,
        help="Create an HTML report file (general report on all usernames).",
    )
    parser.add_argument(
        "-X", "--xmind", action="store_true", dest="xmind", default=False,
        help="Generate an XMind 8 mindmap report (one report per username).",
    )
    parser.add_argument(
        "-P", "--pdf", action="store_true", dest="pdf", default=False,
        help="Generate a PDF report (general report on all usernames).",
    )
    parser.add_argument(
        "-J", "--json", action="store", metavar='REPORT_TYPE', dest="json",
        default='', type=check_supported_json_format,
        help=f"Generate a JSON report of specific type: {', '.join(SUPPORTED_JSON_REPORT_FORMATS)}"
        " (one report per username).",
    )
return parser
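# Illustration (not part of the module): the parser built above can be exercised
# directly; flag names and defaults are the ones defined in setup_arguments_parser().
#   args = setup_arguments_parser().parse_args(['--retries', '3', '--no-progressbar', 'alice'])
#   assert args.retries == 3 and args.no_progressbar and args.timeout == 30
#   assert args.username == ['alice'] and args.disable_extracting is False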
async def main():
arg_parser = setup_arguments_parser()
args = arg_parser.parse_args()
    # Logging
    log_level = logging.ERROR
    logging.basicConfig(
        format='[%(filename)s:%(lineno)d] %(levelname)-3s %(asctime)s %(message)s',
        datefmt='%H:%M:%S',
        level=log_level,
    )

    if args.debug:
@@ -199,10 +371,10 @@ async def main():
    usernames = {
        u: args.id_type
        for u in args.username
        if u not in ['-'] and u not in args.ignore_ids_list
    }

    parsing_enabled = not args.disable_extracting
    recursive_search_enabled = not args.disable_recursive_search

    # Make prompts
@@ -210,10 +382,26 @@ async def main():
print("Using the proxy: " + args.proxy) print("Using the proxy: " + args.proxy)
if args.parse_url: if args.parse_url:
page, _ = parse(args.parse_url, cookies_str='') # url, headers
reqs = [(args.parse_url, set())]
try:
# temporary workaround for URL mutations MVP
from socid_extractor import mutate_url
reqs += list(mutate_url(args.parse_url))
except Exception as e:
logger.warning(e)
pass
for req in reqs:
url, headers = req
print(f'Scanning webpage by URL {url}...')
page, _ = parse(url, cookies_str='', headers=headers)
info = extract(page) info = extract(page)
text = 'Extracted ID data from webpage: ' + ', '.join([f'{a}: {b}' for a, b in info.items()]) if not info:
print(text) print('Nothing extracted')
else:
print(get_dict_ascii_tree(info.items(), new_line=False), ' ')
for k, v in info.items(): for k, v in info.items():
if 'username' in k: if 'username' in k:
usernames[v] = 'username' usernames[v] = 'username'
@@ -224,40 +412,49 @@ async def main():
    args.tags = list(set(str(args.tags).split(',')))

    if args.db_file is None:
        args.db_file = os.path.join(
            os.path.dirname(os.path.realpath(__file__)), "resources/data.json"
        )

    if args.top_sites == 0 or args.all_sites:
        args.top_sites = sys.maxsize

    # Create notify object for query results.
    query_notify = QueryNotifyPrint(
        result=None,
        verbose=args.verbose,
        print_found_only=not args.print_not_found,
        skip_check_errors=not args.print_check_errors,
        color=not args.no_color,
    )

    # Create object with all information about sites we are aware of.
    db = MaigretDatabase().load_from_file(args.db_file)
    get_top_sites_for_id = lambda x: db.ranked_sites_dict(
        top=args.top_sites,
        tags=args.tags,
        names=args.site_list,
        disabled=False,
        id_type=x,
    )

    site_data = get_top_sites_for_id(args.id_type)

    if args.new_site_to_submit:
        is_submitted = await submit_dialog(
            db, args.new_site_to_submit, args.cookie_file, logger
        )
        if is_submitted:
            db.save_to_file(args.db_file)

    # Database self-checking
    if args.self_check:
        print('Maigret sites database self-checking...')
        is_need_update = await self_check(
            db, site_data, logger, max_connections=args.connections
        )
        if is_need_update:
            if input('Do you want to save changes permanently? [Yn]\n').lower() == 'y':
                db.save_to_file(args.db_file)
                print('Database was successfully updated.')
            else:
@@ -269,7 +466,6 @@ async def main():
    # Create reports folder if it does not exist
    os.makedirs(args.folderoutput, exist_ok=True)

    # Define one report filename template
    report_filepath_tpl = os.path.join(args.folderoutput, 'report_{username}{postfix}')
@@ -288,9 +484,13 @@ async def main():
        query_notify.warning('No sites to check, exiting!')
        sys.exit(2)
    else:
        query_notify.warning(
            f'Starting a search on top {len(site_data)} sites from the Maigret database...'
        )
        if not args.all_sites:
            query_notify.warning(
                'You can run search by full list of sites with flag `-a`', '!'
            )

    already_checked = set()
    general_results = []
@@ -305,42 +505,53 @@ async def main():
        already_checked.add(username.lower())

        if username in args.ignore_ids_list:
            query_notify.warning(
                f'Skip a search by username {username} cause it\'s marked as ignored.'
            )
            continue

        # check for characters not supported by sites generally
        found_unsupported_chars = set(unsupported_characters).intersection(
            set(username)
        )

        if found_unsupported_chars:
            pretty_chars_str = ','.join(
                map(lambda s: f'"{s}"', found_unsupported_chars)
            )
            query_notify.warning(
                f'Found unsupported URL characters: {pretty_chars_str}, skip search by username "{username}"'
            )
            continue

        sites_to_check = get_top_sites_for_id(id_type)

        results = await maigret(
            username=username,
            site_dict=dict(sites_to_check),
            query_notify=query_notify,
            proxy=args.proxy,
            timeout=args.timeout,
            is_parsing_enabled=parsing_enabled,
            id_type=id_type,
            debug=args.verbose,
            logger=logger,
            cookies=args.cookie_file,
            forced=args.use_disabled_sites,
            max_connections=args.connections,
            no_progressbar=args.no_progressbar,
            retries=args.retries,
        )

        notify_about_errors(results, query_notify)

        general_results.append((username, id_type, results))

        # TODO: tests
        for website_name in results:
            dictionary = results[website_name]
            # TODO: fix no site data issue
            if not dictionary or not recursive_search_enabled:
                continue

            new_usernames = dictionary.get('ids_usernames')
@@ -371,10 +582,13 @@ async def main():
            query_notify.warning(f'TXT report for {username} saved in {filename}')

        if args.json:
            filename = report_filepath_tpl.format(
                username=username, postfix=f'_{args.json}.json'
            )
            save_json_report(filename, username, results, report_type=args.json)
            query_notify.warning(
                f'JSON {args.json} report for {username} saved in {filename}'
            )

    # reporting for all the result
    if general_results:
+63 -52
@@ -4,12 +4,14 @@ This module defines the objects for notifying the caller about the
results of queries.
"""
import sys

from colorama import Fore, Style, init

from .result import QueryStatus
from .utils import get_dict_ascii_tree


class QueryNotify:
    """Query Notify Object.

    Base class that describes methods available to notify the results of
@@ -37,7 +39,7 @@ class QueryNotify():
        return

    def start(self, message=None, id_type="username"):
        """Notify Start.

        Notify method for start of query. This method will be called before
@@ -114,8 +116,14 @@ class QueryNotifyPrint(QueryNotify):
    Query notify class that prints results.
    """

    def __init__(
        self,
        result=None,
        verbose=False,
        print_found_only=False,
        skip_check_errors=False,
        color=True,
    ):
        """Create Query Notify Print Object.

        Contains information about a specific method of notifying the results
@@ -160,38 +168,29 @@ class QueryNotifyPrint(QueryNotify):
title = f"Checking {id_type}" title = f"Checking {id_type}"
if self.color: if self.color:
print(Style.BRIGHT + Fore.GREEN + "[" + print(
Fore.YELLOW + "*" + Style.BRIGHT
Fore.GREEN + f"] {title}" + + Fore.GREEN
Fore.WHITE + f" {message}" + + "["
Fore.GREEN + " on:") + Fore.YELLOW
+ "*"
+ Fore.GREEN
+ f"] {title}"
+ Fore.WHITE
+ f" {message}"
+ Fore.GREEN
+ " on:"
)
else: else:
print(f"[*] {title} {message} on:") print(f"[*] {title} {message} on:")
def warning(self, message, symbol='-'): def warning(self, message, symbol="-"):
msg = f'[{symbol}] {message}' msg = f"[{symbol}] {message}"
if self.color: if self.color:
print(Style.BRIGHT + Fore.YELLOW + msg) print(Style.BRIGHT + Fore.YELLOW + msg)
else: else:
print(msg) print(msg)
def get_additional_data_text(self, items, prepend=''):
text = ''
for num, item in enumerate(items):
box_symbol = '┣╸' if num != len(items) - 1 else '┗╸'
if type(item) == tuple:
field_name, field_value = item
if field_value.startswith('[\''):
is_last_item = num == len(items) - 1
prepend_symbols = ' ' * 3 if is_last_item else ''
field_value = self.get_additional_data_text(eval(field_value), prepend_symbols)
text += f'\n{prepend}{box_symbol}{field_name}: {field_value}'
else:
text += f'\n{prepend}{box_symbol} {item}'
return text
def update(self, result, is_similar=False): def update(self, result, is_similar=False):
"""Notify Update. """Notify Update.
@@ -210,18 +209,20 @@ class QueryNotifyPrint(QueryNotify):
        if not self.result.ids_data:
            ids_data_text = ""
        else:
            ids_data_text = get_dict_ascii_tree(self.result.ids_data.items(), " ")

        def make_colored_terminal_notify(
            status, text, status_color, text_color, appendix
        ):
            text = [
                f"{Style.BRIGHT}{Fore.WHITE}[{status_color}{status}{Fore.WHITE}]"
                + f"{text_color} {text}: {Style.RESET_ALL}"
                + f"{appendix}"
            ]
            return "".join(text)

        def make_simple_terminal_notify(status, text, appendix):
            return f"[{status}] {text}: {appendix}"

        def make_terminal_notify(is_colored=True, *args):
            if is_colored:
@@ -234,45 +235,55 @@ class QueryNotifyPrint(QueryNotify):
        # Output to the terminal is desired.
        if result.status == QueryStatus.CLAIMED:
            color = Fore.BLUE if is_similar else Fore.GREEN
            status = "?" if is_similar else "+"
            notify = make_terminal_notify(
                self.color,
                status,
                result.site_name,
                color,
                color,
                result.site_url_user + ids_data_text,
            )
        elif result.status == QueryStatus.AVAILABLE:
            if not self.print_found_only:
                notify = make_terminal_notify(
                    self.color,
                    "-",
                    result.site_name,
                    Fore.RED,
                    Fore.YELLOW,
                    "Not found!" + ids_data_text,
                )
        elif result.status == QueryStatus.UNKNOWN:
            if not self.skip_check_errors:
                notify = make_terminal_notify(
                    self.color,
                    "?",
                    result.site_name,
                    Fore.RED,
                    Fore.RED,
                    str(self.result.error) + ids_data_text,
                )
        elif result.status == QueryStatus.ILLEGAL:
            if not self.print_found_only:
                text = "Illegal Username Format For This Site!"
                notify = make_terminal_notify(
                    self.color,
                    "-",
                    result.site_name,
                    Fore.RED,
                    Fore.YELLOW,
                    text + ids_data_text,
                )
        else:
            # It should be impossible to ever get here...
            raise ValueError(
                f"Unknown Query Status '{str(result.status)}' for "
                f"site '{self.result.site_name}'"
            )

        if notify:
            sys.stdout.write("\x1b[1K\r")
            print(notify)

        return
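For context, a hedged sketch of driving the notifier in isolation; import paths and constructor keywords are from this changeset, while the username, site and the console output shown in comments are only approximate.

from maigret.notify import QueryNotifyPrint
from maigret.result import QueryResult, QueryStatus

notifier = QueryNotifyPrint(print_found_only=True, color=False)
notifier.start('alice')  # [*] Checking username alice on:
found = QueryResult('alice', 'GitHub', 'https://github.com/alice', QueryStatus.CLAIMED)
notifier.update(found)   # roughly: [+] GitHub: https://github.com/alice
notifier.warning('Search finished', '*')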
+121 -99
@@ -1,90 +1,101 @@
import csv
import io
import json
import logging
import os
from argparse import ArgumentTypeError
from datetime import datetime
from typing import Dict, Any

import pycountry
import xmind
from dateutil.parser import parse as parse_datetime_str
from jinja2 import Template
from xhtml2pdf import pisa

from .result import QueryStatus
from .utils import is_country_tag, CaseConverter, enrich_link_str

SUPPORTED_JSON_REPORT_FORMATS = [
    "simple",
    "ndjson",
]

"""
UTILS
"""


def filter_supposed_data(data):
    # interesting fields
    allowed_fields = ["fullname", "gender", "location", "age"]
    filtered_supposed_data = {
        CaseConverter.snake_to_title(k): v[0]
        for k, v in data.items()
        if k in allowed_fields
    }
    return filtered_supposed_data


"""
REPORTS SAVING
"""


def save_csv_report(filename: str, username: str, results: dict):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        generate_csv_report(username, results, f)


def save_txt_report(filename: str, username: str, results: dict):
    with open(filename, "w", encoding="utf-8") as f:
        generate_txt_report(username, results, f)


def save_html_report(filename: str, context: dict):
    template, _ = generate_report_template(is_pdf=False)
    filled_template = template.render(**context)
    with open(filename, "w") as f:
        f.write(filled_template)


def save_pdf_report(filename: str, context: dict):
    template, css = generate_report_template(is_pdf=True)
    filled_template = template.render(**context)
    with open(filename, "w+b") as f:
        pisa.pisaDocument(io.StringIO(filled_template), dest=f, default_css=css)


def save_json_report(filename: str, username: str, results: dict, report_type: str):
    with open(filename, "w", encoding="utf-8") as f:
        generate_json_report(username, results, f, report_type=report_type)


"""
REPORTS GENERATING
"""


def generate_report_template(is_pdf: bool):
    """
    HTML/PDF template generation
    """

    def get_resource_content(filename):
        return open(os.path.join(maigret_path, "resources", filename)).read()

    maigret_path = os.path.dirname(os.path.realpath(__file__))

    if is_pdf:
        template_content = get_resource_content("simple_report_pdf.tpl")
        css_content = get_resource_content("simple_report_pdf.css")
    else:
        template_content = get_resource_content("simple_report.tpl")
        css_content = None

    template = Template(template_content)
    template.globals["title"] = CaseConverter.snake_to_title  # type: ignore
    template.globals["detect_link"] = enrich_link_str  # type: ignore

    return template, css_content
@@ -92,15 +103,15 @@ def generate_report_context(username_results: list):
    brief_text = []
    usernames = {}
    extended_info_count = 0
    tags: Dict[str, int] = {}
    supposed_data: Dict[str, Any] = {}

    first_seen = None

    for username, id_type, results in username_results:
        found_accounts = 0
        new_ids = []
        usernames[username] = {"type": id_type}

        for website_name in results:
            dictionary = results[website_name]
@@ -108,16 +119,19 @@ def generate_report_context(username_results: list):
            if not dictionary:
                continue

            if dictionary.get("is_similar"):
                continue

            status = dictionary.get("status")
            if not status:  # FIXME: currently in case of timeout
                continue

            if status.ids_data:
                dictionary["ids_data"] = status.ids_data
                extended_info_count += 1

                # detect first seen
                created_at = status.ids_data.get("created_at")
                if created_at:
                    if first_seen is None:
                        first_seen = created_at
@@ -127,37 +141,46 @@ def generate_report_context(username_results: list):
                            new_time = parse_datetime_str(created_at)
                            if new_time < known_time:
                                first_seen = created_at
                        except Exception as e:
                            logging.debug(
                                "Problems with converting datetime %s/%s: %s",
                                first_seen,
                                created_at,
                                str(e),
                            )

                for k, v in status.ids_data.items():
                    # suppose target data
                    field = "fullname" if k == "name" else k
                    if field not in supposed_data:
                        supposed_data[field] = []
                    supposed_data[field].append(v)
                    # suppose country
                    if k in ["country", "locale"]:
                        try:
                            if is_country_tag(k):
                                tag = pycountry.countries.get(alpha_2=v).alpha_2.lower()
                            else:
                                tag = pycountry.countries.search_fuzzy(v)[
                                    0
                                ].alpha_2.lower()
                            # TODO: move countries to another struct
                            tags[tag] = tags.get(tag, 0) + 1
                        except Exception as e:
                            logging.debug(
                                "Pycountry exception: %s", str(e), exc_info=True
                            )

            new_usernames = dictionary.get("ids_usernames")
            if new_usernames:
                for u, utype in new_usernames.items():
                    if u not in usernames:
                        new_ids.append((u, utype))
                        usernames[u] = {"type": utype}

            if status.status == QueryStatus.CLAIMED:
                found_accounts += 1
                dictionary["found"] = True
            else:
                continue
@@ -166,25 +189,24 @@ def generate_report_context(username_results: list):
            for t in status.tags:
                tags[t] = tags.get(t, 0) + 1

        brief_text.append(
            f"Search by {id_type} {username} returned {found_accounts} accounts."
        )

        if new_ids:
            ids_list = []
            for u, t in new_ids:
                ids_list.append(f"{u} ({t})" if t != "username" else u)
            brief_text.append("Found target's other IDs: " + ", ".join(ids_list) + ".")

    brief_text.append(f"Extended info extracted from {extended_info_count} accounts.")

    brief = " ".join(brief_text).strip()
    tuple_sort = lambda d: sorted(d, key=lambda x: x[1], reverse=True)

    if "global" in tags:
        # remove tag 'global' useless for country detection
        del tags["global"]

    first_username = username_results[0][0]
    countries_lists = list(filter(lambda x: is_country_tag(x[0]), tags.items()))
@@ -193,35 +215,33 @@ def generate_report_context(username_results: list):
    filtered_supposed_data = filter_supposed_data(supposed_data)

    return {
        "username": first_username,
        "brief": brief,
        "results": username_results,
        "first_seen": first_seen,
        "interests_tuple_list": tuple_sort(interests_list),
        "countries_tuple_list": tuple_sort(countries_lists),
        "supposed_data": filtered_supposed_data,
        "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }


def generate_csv_report(username: str, results: dict, csvfile):
    writer = csv.writer(csvfile)
    writer.writerow(
        ["username", "name", "url_main", "url_user", "exists", "http_status"]
    )
    for site in results:
        writer.writerow(
            [
                username,
                site,
                results[site]["url_main"],
                results[site]["url_user"],
                str(results[site]["status"].status),
                results[site]["http_status"],
            ]
        )


def generate_txt_report(username: str, results: dict, file):
@@ -234,12 +254,11 @@ def generate_txt_report(username: str, results: dict, file):
if dictionary.get("status").status == QueryStatus.CLAIMED: if dictionary.get("status").status == QueryStatus.CLAIMED:
exists_counter += 1 exists_counter += 1
file.write(dictionary["url_user"] + "\n") file.write(dictionary["url_user"] + "\n")
file.write(f'Total Websites Username Detected On : {exists_counter}') file.write(f"Total Websites Username Detected On : {exists_counter}")
def generate_json_report(username: str, results: dict, file, report_type): def generate_json_report(username: str, results: dict, file, report_type):
exists_counter = 0 is_report_per_line = report_type.startswith("ndjson")
is_report_per_line = report_type.startswith('ndjson')
all_json = {} all_json = {}
for sitename in results: for sitename in results:
@@ -249,20 +268,23 @@ def generate_json_report(username: str, results: dict, file, report_type):
            continue

        data = dict(site_result)
        data["status"] = data["status"].json()

        if is_report_per_line:
            data["sitename"] = sitename
            file.write(json.dumps(data) + "\n")
        else:
            all_json[sitename] = data

    if not is_report_per_line:
        file.write(json.dumps(all_json))


"""
XMIND 8 Functions
"""


def save_xmind_report(filename, username, results):
    if os.path.exists(filename):
        os.remove(filename)
@@ -273,13 +295,12 @@ def save_xmind_report(filename, username, results):
def design_sheet(sheet, username, results):
    alltags = {}
    supposed_data = {}

    sheet.setTitle("%s Analysis" % (username))
    root_topic1 = sheet.getRootTopic()
    root_topic1.setTitle("%s" % (username))

    undefinedsection = root_topic1.addSubTopic()
    undefinedsection.setTitle("Undefined")
@@ -289,7 +310,7 @@ def design_sheet(sheet, username, results):
        dictionary = results[website_name]

        if dictionary.get("status").status == QueryStatus.CLAIMED:
            # first time we found that entry
            for tag in dictionary.get("status").tags:
                if tag.strip() == "":
                    continue
@@ -318,22 +339,22 @@ def design_sheet(sheet, username, results):
                # suppose target data
                if not isinstance(v, list):
                    currentsublabel = userlink.addSubTopic()
                    field = "fullname" if k == "name" else k
                    if field not in supposed_data:
                        supposed_data[field] = []
                    supposed_data[field].append(v)
                    currentsublabel.setTitle("%s: %s" % (k, v))
                else:
                    for currentval in v:
                        currentsublabel = userlink.addSubTopic()
                        field = "fullname" if k == "name" else k
                        if field not in supposed_data:
                            supposed_data[field] = []
                        supposed_data[field].append(currentval)
                        currentsublabel.setTitle("%s: %s" % (k, currentval))

    # add supposed data
    filterede_supposed_data = filter_supposed_data(supposed_data)
    if len(filterede_supposed_data) > 0:
        undefinedsection = root_topic1.addSubTopic()
        undefinedsection.setTitle("SUPPOSED DATA")
        for k, v in filterede_supposed_data.items():
@@ -342,8 +363,9 @@ def design_sheet(sheet, username, results):
def check_supported_json_format(value):
    if value and value not in SUPPORTED_JSON_REPORT_FORMATS:
        raise ArgumentTypeError(
            "JSON report type must be one of the following types: "
            + ", ".join(SUPPORTED_JSON_REPORT_FORMATS)
        )
    return value
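A small sketch of the report helpers above with a hand-made result dict. The file names and the single fake QueryResult are placeholders; a real run passes the structures produced by maigret() itself.

from maigret.report import save_csv_report, save_json_report, generate_report_context
from maigret.result import QueryResult, QueryStatus

status = QueryResult('alice', 'GitHub', 'https://github.com/alice',
                     QueryStatus.CLAIMED, tags=['coding', 'us'])
results = {
    'GitHub': {
        'url_main': 'https://github.com',
        'url_user': 'https://github.com/alice',
        'status': status,
        'http_status': 200,
    }
}
save_csv_report('report_alice.csv', 'alice', results)
save_json_report('report_alice.ndjson', 'alice', results, report_type='ndjson')
context = generate_report_context([('alice', 'username', results)])
print(context['brief'], context['countries_tuple_list'])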
+5608 -3847
File diff suppressed because it is too large
+1 -1
@@ -68,7 +68,7 @@
<div class="row-mb"> <div class="row-mb">
<div class="col-md"> <div class="col-md">
<div class="card flex-md-row mb-4 box-shadow h-md-250"> <div class="card flex-md-row mb-4 box-shadow h-md-250">
<img class="card-img-right flex-auto d-none d-md-block" alt="Photo" style="width: 200px; height: 200px; object-fit: scale-down;" src="{{ v.status.ids_data.image or 'https://i.imgur.com/040fmbw.png' }}" data-holder-rendered="true"> <img class="card-img-right flex-auto d-md-block" alt="Photo" style="width: 200px; height: 200px; object-fit: scale-down;" src="{{ v.status.ids_data.image or 'https://i.imgur.com/040fmbw.png' }}" data-holder-rendered="true">
<div class="card-body d-flex flex-column align-items-start" style="padding-top: 0;"> <div class="card-body d-flex flex-column align-items-start" style="padding-top: 0;">
<h3 class="mb-0" style="padding-top: 1rem;"> <h3 class="mb-0" style="padding-top: 1rem;">
<a class="text-dark" href="{{ v.url_main }}" target="_blank">{{ k }}</a> <a class="text-dark" href="{{ v.url_main }}" target="_blank">{{ k }}</a>
+24 -9
@@ -10,6 +10,7 @@ class QueryStatus(Enum):
    Describes status of query about a given username.
    """

    CLAIMED = "Claimed"  # Username Detected
    AVAILABLE = "Available"  # Username Not Detected
    UNKNOWN = "Unknown"  # Error Occurred While Trying To Detect Username
@@ -27,14 +28,24 @@ class QueryStatus(Enum):
        return self.value


class QueryResult:
    """Query Result Object.

    Describes result of query about a given username.
    """

    def __init__(
        self,
        username,
        site_name,
        site_url_user,
        status,
        ids_data=None,
        query_time=None,
        context=None,
        error=None,
        tags=[],
    ):
        """Create Query Result Object.

        Contains information about a specific method of detecting usernames on
@@ -73,17 +84,21 @@ class QueryResult():
        self.context = context
        self.ids_data = ids_data
        self.tags = tags
        self.error = error

    def json(self):
        return {
            "username": self.username,
            "site_name": self.site_name,
            "url": self.site_url_user,
            "status": str(self.status),
            "ids": self.ids_data or {},
            "tags": self.tags,
        }

    def is_found(self):
        return self.status == QueryStatus.CLAIMED

    def __str__(self):
        """Convert Object To String.
+176 -105
@@ -1,9 +1,9 @@
# -*- coding: future_annotations -*- # ****************************** -*-
"""Maigret Sites Information""" """Maigret Sites Information"""
import copy import copy
import json import json
import re
import sys import sys
from typing import Optional, List, Dict, Any
import requests import requests
@@ -11,18 +11,56 @@ from .utils import CaseConverter, URLMatcher, is_country_tag
# TODO: move to data.json
SUPPORTED_TAGS = [
    "gaming", "coding", "photo", "music", "blog", "finance", "freelance", "dating",
    "tech", "forum", "porn", "erotic", "webcam", "video", "movies", "hacking", "art",
    "discussion", "sharing", "writing", "wiki", "business", "shopping", "sport",
    "books", "news", "documents", "travel", "maps", "hobby", "apps", "classified",
    "career", "geosocial", "streaming", "education", "networking", "torrent",
    "science", "medicine", "reading", "stock",
]


class MaigretEngine:
    site: Dict[str, Any] = {}

    def __init__(self, name, data):
        self.name = name
        self.__dict__.update(data)

    @property
@@ -32,43 +70,49 @@ class MaigretEngine:
class MaigretSite:
    NOT_SERIALIZABLE_FIELDS = [
        "name",
        "engineData",
        "requestFuture",
        "detectedEngine",
        "engineObj",
        "stats",
        "urlRegexp",
    ]

    username_claimed = ""
    username_unclaimed = ""
    url_subpath = ""
    url_main = ""
    url = ""
    disabled = False
    similar_search = False
    ignore403 = False
    tags: List[str] = []

    type = "username"
    headers: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    activation: Dict[str, Any] = {}
    regex_check = None
    url_probe = None
    check_type = ""
    request_head_only = ""
    get_params: Dict[str, Any] = {}

    presense_strs: List[str] = []
    absence_strs: List[str] = []
    stats: Dict[str, Any] = {}

    engine = None
    engine_data: Dict[str, Any] = {}
    engine_obj: Optional["MaigretEngine"] = None

    request_future = None
    alexa_rank = None
    source = None

    def __init__(self, name, information):
        self.name = name

        for k, v in information.items():
            self.__dict__[CaseConverter.camel_to_snake(k)] = v
@@ -83,23 +127,31 @@ class MaigretSite:
return f"{self.name} ({self.url_main})" return f"{self.name} ({self.url_main})"
def update_detectors(self): def update_detectors(self):
if 'url' in self.__dict__: if "url" in self.__dict__:
url = self.url url = self.url
for group in ['urlMain', 'urlSubpath']: for group in ["urlMain", "urlSubpath"]:
if group in url: if group in url:
url = url.replace('{'+group+'}', self.__dict__[CaseConverter.camel_to_snake(group)]) url = url.replace(
"{" + group + "}",
self.__dict__[CaseConverter.camel_to_snake(group)],
)
self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check) self.url_regexp = URLMatcher.make_profile_url_regexp(url, self.regex_check)
def detect_username(self, url: str) -> str: def detect_username(self, url: str) -> Optional[str]:
if self.url_regexp: if self.url_regexp:
import logging
match_groups = self.url_regexp.match(url) match_groups = self.url_regexp.match(url)
if match_groups: if match_groups:
return match_groups.groups()[-1].rstrip('/') return match_groups.groups()[-1].rstrip("/")
return None return None
@property
def pretty_name(self):
if self.source:
return f"{self.name} [{self.source}]"
return self.name
@property @property
def json(self): def json(self):
result = {} result = {}
@@ -107,7 +159,7 @@ class MaigretSite:
# convert to camelCase # convert to camelCase
field = CaseConverter.snake_to_camel(k) field = CaseConverter.snake_to_camel(k)
# strip empty elements # strip empty elements
if v in (False, '', [], {}, None, sys.maxsize, 'username'): if v in (False, "", [], {}, None, sys.maxsize, "username"):
continue continue
if field in self.NOT_SERIALIZABLE_FIELDS: if field in self.NOT_SERIALIZABLE_FIELDS:
continue continue
@@ -115,13 +167,13 @@ class MaigretSite:
return result return result
def update(self, updates: dict) -> MaigretSite: def update(self, updates: "dict") -> "MaigretSite":
self.__dict__.update(updates) self.__dict__.update(updates)
self.update_detectors() self.update_detectors()
return self return self
def update_from_engine(self, engine: MaigretEngine) -> MaigretSite: def update_from_engine(self, engine: MaigretEngine) -> "MaigretSite":
engine_data = engine.site engine_data = engine.site
for k, v in engine_data.items(): for k, v in engine_data.items():
field = CaseConverter.camel_to_snake(k) field = CaseConverter.camel_to_snake(k)
@@ -139,7 +191,7 @@ class MaigretSite:
return self return self
def strip_engine_data(self) -> MaigretSite: def strip_engine_data(self) -> "MaigretSite":
if not self.engine_obj: if not self.engine_obj:
return self return self
@@ -147,7 +199,7 @@ class MaigretSite:
self.url_regexp = None self.url_regexp = None
self_copy = copy.deepcopy(self) self_copy = copy.deepcopy(self)
engine_data = self_copy.engine_obj.site engine_data = self_copy.engine_obj and self_copy.engine_obj.site or {}
site_data_keys = list(self_copy.__dict__.keys()) site_data_keys = list(self_copy.__dict__.keys())
for k in engine_data.keys(): for k in engine_data.keys():
@@ -156,6 +208,7 @@ class MaigretSite:
# remove dict keys # remove dict keys
if isinstance(engine_data[k], dict) and is_exists: if isinstance(engine_data[k], dict) and is_exists:
for f in engine_data[k].keys(): for f in engine_data[k].keys():
if f in self_copy.__dict__[field]:
del self_copy.__dict__[field][f] del self_copy.__dict__[field][f]
continue continue
# remove list items # remove list items
@@ -183,8 +236,15 @@ class MaigretDatabase:
def sites_dict(self): def sites_dict(self):
return {site.name: site for site in self._sites} return {site.name: site for site in self._sites}
def ranked_sites_dict(self, reverse=False, top=sys.maxsize, tags=[], names=[], def ranked_sites_dict(
disabled=True, id_type='username'): self,
reverse=False,
top=sys.maxsize,
tags=[],
names=[],
disabled=True,
id_type="username",
):
""" """
Ranking and filtering of the sites list Ranking and filtering of the sites list
""" """
@@ -192,20 +252,31 @@ class MaigretDatabase:
normalized_tags = list(map(str.lower, tags)) normalized_tags = list(map(str.lower, tags))
is_name_ok = lambda x: x.name.lower() in normalized_names is_name_ok = lambda x: x.name.lower() in normalized_names
is_engine_ok = lambda x: isinstance(x.engine, str) and x.engine.lower() in normalized_tags is_source_ok = lambda x: x.source and x.source.lower() in normalized_names
is_engine_ok = (
lambda x: isinstance(x.engine, str) and x.engine.lower() in normalized_tags
)
is_tags_ok = lambda x: set(x.tags).intersection(set(normalized_tags)) is_tags_ok = lambda x: set(x.tags).intersection(set(normalized_tags))
is_disabled_needed = lambda x: not x.disabled or ('disabled' in tags or disabled) is_disabled_needed = lambda x: not x.disabled or (
"disabled" in tags or disabled
)
is_id_type_ok = lambda x: x.type == id_type is_id_type_ok = lambda x: x.type == id_type
filter_tags_engines_fun = lambda x: not tags or is_engine_ok(x) or is_tags_ok(x) filter_tags_engines_fun = lambda x: not tags or is_engine_ok(x) or is_tags_ok(x)
filter_names_fun = lambda x: not names or is_name_ok(x) filter_names_fun = lambda x: not names or is_name_ok(x) or is_source_ok(x)
filter_fun = lambda x: filter_tags_engines_fun(x) and filter_names_fun(x) \ filter_fun = (
and is_disabled_needed(x) and is_id_type_ok(x) lambda x: filter_tags_engines_fun(x)
and filter_names_fun(x)
and is_disabled_needed(x)
and is_id_type_ok(x)
)
filtered_list = [s for s in self.sites if filter_fun(s)] filtered_list = [s for s in self.sites if filter_fun(s)]
sorted_list = sorted(filtered_list, key=lambda x: x.alexa_rank, reverse=reverse)[:top] sorted_list = sorted(
filtered_list, key=lambda x: x.alexa_rank, reverse=reverse
)[:top]
return {site.name: site for site in sorted_list} return {site.name: site for site in sorted_list}
@property @property
@@ -216,7 +287,7 @@ class MaigretDatabase:
def engines_dict(self): def engines_dict(self):
return {engine.name: engine for engine in self._engines} return {engine.name: engine for engine in self._engines}
def update_site(self, site: MaigretSite) -> MaigretDatabase: def update_site(self, site: MaigretSite) -> "MaigretDatabase":
for s in self._sites: for s in self._sites:
if s.name == site.name: if s.name == site.name:
s = site s = site
@@ -225,21 +296,20 @@ class MaigretDatabase:
self._sites.append(site) self._sites.append(site)
return self return self
def save_to_file(self, filename: str) -> MaigretDatabase: def save_to_file(self, filename: str) -> "MaigretDatabase":
db_data = { db_data = {
'sites': {site.name: site.strip_engine_data().json for site in self._sites}, "sites": {site.name: site.strip_engine_data().json for site in self._sites},
'engines': {engine.name: engine.json for engine in self._engines}, "engines": {engine.name: engine.json for engine in self._engines},
} }
json_data = json.dumps(db_data, indent=4) json_data = json.dumps(db_data, indent=4)
with open(filename, 'w') as f: with open(filename, "w") as f:
f.write(json_data) f.write(json_data)
return self return self
def load_from_json(self, json_data: dict) -> "MaigretDatabase":
def load_from_json(self, json_data: dict) -> MaigretDatabase:
# Add all of site information from the json file to internal site list. # Add all of site information from the json file to internal site list.
site_data = json_data.get("sites", {}) site_data = json_data.get("sites", {})
engines_data = json_data.get("engines", {}) engines_data = json_data.get("engines", {})
@@ -251,32 +321,32 @@ class MaigretDatabase:
try: try:
maigret_site = MaigretSite(site_name, site_data[site_name]) maigret_site = MaigretSite(site_name, site_data[site_name])
engine = site_data[site_name].get('engine') engine = site_data[site_name].get("engine")
if engine: if engine:
maigret_site.update_from_engine(self.engines_dict[engine]) maigret_site.update_from_engine(self.engines_dict[engine])
self._sites.append(maigret_site) self._sites.append(maigret_site)
except KeyError as error: except KeyError as error:
raise ValueError(f"Problem parsing json content for site {site_name}: " raise ValueError(
f"Problem parsing json content for site {site_name}: "
f"Missing attribute {str(error)}." f"Missing attribute {str(error)}."
) )
return self return self
def load_from_str(self, db_str: "str") -> "MaigretDatabase":
def load_from_str(self, db_str: str) -> MaigretDatabase:
try: try:
data = json.loads(db_str) data = json.loads(db_str)
except Exception as error: except Exception as error:
raise ValueError(f"Problem parsing json contents from str" raise ValueError(
f"Problem parsing json contents from str"
f"'{db_str[:50]}'...: {str(error)}." f"'{db_str[:50]}'...: {str(error)}."
) )
return self.load_from_json(data) return self.load_from_json(data)
def load_from_url(self, url: str) -> "MaigretDatabase":
def load_from_url(self, url: str) -> MaigretDatabase: is_url_valid = url.startswith("http://") or url.startswith("https://")
is_url_valid = url.startswith('http://') or url.startswith('https://')
if not is_url_valid: if not is_url_valid:
raise FileNotFoundError(f"Invalid data file URL '{url}'.") raise FileNotFoundError(f"Invalid data file URL '{url}'.")
@@ -284,7 +354,8 @@ class MaigretDatabase:
try: try:
response = requests.get(url=url) response = requests.get(url=url)
except Exception as error: except Exception as error:
raise FileNotFoundError(f"Problem while attempting to access " raise FileNotFoundError(
f"Problem while attempting to access "
f"data file URL '{url}': " f"data file URL '{url}': "
f"{str(error)}" f"{str(error)}"
) )
@@ -293,30 +364,30 @@ class MaigretDatabase:
try: try:
data = response.json() data = response.json()
except Exception as error: except Exception as error:
raise ValueError(f"Problem parsing json contents at " raise ValueError(
f"'{url}': {str(error)}." f"Problem parsing json contents at " f"'{url}': {str(error)}."
) )
else: else:
raise FileNotFoundError(f"Bad response while accessing " raise FileNotFoundError(
f"data file URL '{url}'." f"Bad response while accessing " f"data file URL '{url}'."
) )
return self.load_from_json(data) return self.load_from_json(data)
def load_from_file(self, filename: "str") -> "MaigretDatabase":
def load_from_file(self, filename: str) -> MaigretDatabase:
try: try:
with open(filename, 'r', encoding='utf-8') as file: with open(filename, "r", encoding="utf-8") as file:
try: try:
data = json.load(file) data = json.load(file)
except Exception as error: except Exception as error:
raise ValueError(f"Problem parsing json contents from " raise ValueError(
f"Problem parsing json contents from "
f"file '{filename}': {str(error)}." f"file '{filename}': {str(error)}."
) )
except FileNotFoundError as error: except FileNotFoundError as error:
raise FileNotFoundError(f"Problem while attempting to access " raise FileNotFoundError(
f"data file '{filename}'." f"Problem while attempting to access " f"data file '{filename}'."
) ) from error
return self.load_from_json(data) return self.load_from_json(data)
@@ -324,8 +395,8 @@ class MaigretDatabase:
sites = sites_dict or self.sites_dict sites = sites_dict or self.sites_dict
found_flags = {} found_flags = {}
for _, s in sites.items(): for _, s in sites.items():
if 'presense_flag' in s.stats: if "presense_flag" in s.stats:
flag = s.stats['presense_flag'] flag = s.stats["presense_flag"]
found_flags[flag] = found_flags.get(flag, 0) + 1 found_flags[flag] = found_flags.get(flag, 0) + 1
return found_flags return found_flags
@@ -334,7 +405,7 @@ class MaigretDatabase:
if not sites_dict: if not sites_dict:
sites_dict = self.sites_dict() sites_dict = self.sites_dict()
output = '' output = ""
disabled_count = 0 disabled_count = 0
total_count = len(sites_dict) total_count = len(sites_dict)
urls = {} urls = {}
@@ -345,18 +416,18 @@ class MaigretDatabase:
disabled_count += 1 disabled_count += 1
url = URLMatcher.extract_main_part(site.url) url = URLMatcher.extract_main_part(site.url)
if url.startswith('{username}'): if url.startswith("{username}"):
url = 'SUBDOMAIN' url = "SUBDOMAIN"
elif url == '': elif url == "":
url = f'{site.url} ({site.engine})' url = f"{site.url} ({site.engine})"
else: else:
parts = url.split('/') parts = url.split("/")
url = '/' + '/'.join(parts[1:]) url = "/" + "/".join(parts[1:])
urls[url] = urls.get(url, 0) + 1 urls[url] = urls.get(url, 0) + 1
if not site.tags: if not site.tags:
tags['NO_TAGS'] = tags.get('NO_TAGS', 0) + 1 tags["NO_TAGS"] = tags.get("NO_TAGS", 0) + 1
for tag in site.tags: for tag in site.tags:
if is_country_tag(tag): if is_country_tag(tag):
@@ -364,17 +435,17 @@ class MaigretDatabase:
continue continue
tags[tag] = tags.get(tag, 0) + 1 tags[tag] = tags.get(tag, 0) + 1
output += f'Enabled/total sites: {total_count-disabled_count}/{total_count}\n' output += f"Enabled/total sites: {total_count - disabled_count}/{total_count}\n"
output += 'Top sites\' profile URLs:\n' output += "Top sites' profile URLs:\n"
for url, count in sorted(urls.items(), key=lambda x: x[1], reverse=True)[:20]: for url, count in sorted(urls.items(), key=lambda x: x[1], reverse=True)[:20]:
if count == 1: if count == 1:
break break
output += f'{count}\t{url}\n' output += f"{count}\t{url}\n"
output += 'Top sites\' tags:\n' output += "Top sites' tags:\n"
for tag, count in sorted(tags.items(), key=lambda x: x[1], reverse=True): for tag, count in sorted(tags.items(), key=lambda x: x[1], reverse=True):
mark = '' mark = ""
if not tag in SUPPORTED_TAGS: if tag not in SUPPORTED_TAGS:
mark = ' (non-standard)' mark = " (non-standard)"
output += f'{count}\t{tag}{mark}\n' output += f"{count}\t{tag}{mark}\n"
return output return output
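(Illustrative sketch, not part of the commit.) The refactored ranked_sites_dict() shown above filters by tags, names (now also matched against the new source field), disabled state and id_type. A minimal usage sketch, assuming the MaigretDatabase() constructor takes no arguments and the bundled data file lives at maigret/resources/data.json:

from maigret.sites import MaigretDatabase

# load the sites database (path is an assumption based on the repo layout)
db = MaigretDatabase().load_from_file("maigret/resources/data.json")

# top 100 enabled username-type sites tagged "photo", best Alexa rank first;
# passing names=[...] would also match a site's new `source` field
sites = db.ranked_sites_dict(
    top=100,
    tags=["photo"],
    disabled=False,
    id_type="username",
)
print(len(sites), list(sites)[:5])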
+246 -82
@@ -1,34 +1,58 @@
import asyncio
import difflib import difflib
import json import re
from typing import List
import requests import requests
from mock import Mock
from .checking import * from .activation import import_aiohttp_cookies
from .checking import maigret
from .result import QueryStatus
from .sites import MaigretDatabase, MaigretSite, MaigretEngine
from .utils import get_random_user_agent
DESIRED_STRINGS = ["username", "not found", "пользователь", "profile", "lastname", "firstname", "biography",
"birthday", "репутация", "информация", "e-mail"] DESIRED_STRINGS = [
"username",
"not found",
"пользователь",
"profile",
"lastname",
"firstname",
"biography",
"birthday",
"репутация",
"информация",
"e-mail",
]
SUPPOSED_USERNAMES = ["alex", "god", "admin", "red", "blue", "john"]
HEADERS = {
"User-Agent": get_random_user_agent(),
}
RATIO = 0.6 RATIO = 0.6
TOP_FEATURES = 5 TOP_FEATURES = 5
URL_RE = re.compile(r'https?://(www\.)?') URL_RE = re.compile(r"https?://(www\.)?")
def get_match_ratio(x): def get_match_ratio(x):
return round(max([ return round(
difflib.SequenceMatcher(a=x.lower(), b=y).ratio() max(
for y in DESIRED_STRINGS [difflib.SequenceMatcher(a=x.lower(), b=y).ratio() for y in DESIRED_STRINGS]
]), 2) ),
2,
)
def extract_domain(url): def extract_mainpage_url(url):
return '/'.join(url.split('/', 3)[:3]) return "/".join(url.split("/", 3)[:3])
async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=False): async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=False):
query_notify = Mock()
changes = { changes = {
'disabled': False, "disabled": False,
} }
check_data = [ check_data = [
@@ -36,15 +60,13 @@ async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=F
(site.username_unclaimed, QueryStatus.AVAILABLE), (site.username_unclaimed, QueryStatus.AVAILABLE),
] ]
logger.info(f'Checking {site.name}...') logger.info(f"Checking {site.name}...")
for username, status in check_data: for username, status in check_data:
async with semaphore:
results_dict = await maigret( results_dict = await maigret(
username, username=username,
{site.name: site}, site_dict={site.name: site},
query_notify, logger=logger,
logger,
timeout=30, timeout=30,
id_type=site.type, id_type=site.type,
forced=True, forced=True,
@@ -55,10 +77,10 @@ async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=F
# TODO: make normal checking # TODO: make normal checking
if site.name not in results_dict: if site.name not in results_dict:
logger.info(results_dict) logger.info(results_dict)
changes['disabled'] = True changes["disabled"] = True
continue continue
result = results_dict[site.name]['status'] result = results_dict[site.name]["status"]
site_status = result.status site_status = result.status
@@ -67,48 +89,133 @@ async def site_self_check(site, logger, semaphore, db: MaigretDatabase, silent=F
msgs = site.absence_strs msgs = site.absence_strs
etype = site.check_type etype = site.check_type
logger.warning( logger.warning(
f'Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}') "Error while searching '%s' in %s: %s, %s, check type %s",
username,
site.name,
result.context,
msgs,
etype,
)
# don't disable in case of available username # don't disable in case of available username
if status == QueryStatus.CLAIMED: if status == QueryStatus.CLAIMED:
changes['disabled'] = True changes["disabled"] = True
elif status == QueryStatus.CLAIMED: elif status == QueryStatus.CLAIMED:
logger.warning(f'Not found `{username}` in {site.name}, must be claimed') logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name]) logger.info(results_dict[site.name])
changes['disabled'] = True changes["disabled"] = True
else: else:
logger.warning(f'Found `{username}` in {site.name}, must be available') logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name]) logger.info(results_dict[site.name])
changes['disabled'] = True changes["disabled"] = True
logger.info(f'Site {site.name} checking is finished') logger.info(f"Site {site.name} checking is finished")
return changes return changes
async def submit_dialog(db, url_exists): def generate_additional_fields_dialog(engine: MaigretEngine, dialog):
domain_raw = URL_RE.sub('', url_exists).strip().strip('/') fields = {}
domain_raw = domain_raw.split('/')[0] if 'urlSubpath' in engine.site.get('url', ''):
msg = (
'Detected engine suppose additional URL subpath using (/forum/, /blog/, etc). '
'Enter in manually if it exists: '
)
subpath = input(msg).strip('/')
if subpath:
fields['urlSubpath'] = f'/{subpath}'
return fields
matched_sites = list(filter(lambda x: domain_raw in x.url_main+x.url, db.sites))
if matched_sites:
print(f'Sites with domain "{domain_raw}" already exists in the Maigret database!')
status = lambda s: '(disabled)' if s.disabled else ''
url_block = lambda s: f'\n\t{s.url_main}\n\t{s.url}'
print('\n'.join([f'{site.name} {status(site)}{url_block(site)}' for site in matched_sites]))
return False
url_parts = url_exists.split('/') async def detect_known_engine(
db, url_exists, url_mainpage, logger
) -> List[MaigretSite]:
try:
r = requests.get(url_mainpage)
except Exception as e:
logger.warning(e)
print("Some error while checking main page")
return []
for engine in db.engines:
strs_to_check = engine.__dict__.get("presenseStrs")
if strs_to_check and r and r.text:
all_strs_in_response = True
for s in strs_to_check:
if s not in r.text:
all_strs_in_response = False
sites = []
if all_strs_in_response:
engine_name = engine.__dict__.get("name")
print(f"Detected engine {engine_name} for site {url_mainpage}")
usernames_to_check = SUPPOSED_USERNAMES
supposed_username = extract_username_dialog(url_exists)
if supposed_username:
usernames_to_check = [supposed_username] + usernames_to_check
add_fields = generate_additional_fields_dialog(engine, url_exists)
for u in usernames_to_check:
site_data = {
"urlMain": url_mainpage,
"name": url_mainpage.split("//")[1],
"engine": engine_name,
"usernameClaimed": u,
"usernameUnclaimed": "noonewouldeverusethis7",
**add_fields,
}
logger.info(site_data)
maigret_site = MaigretSite(url_mainpage.split("/")[-1], site_data)
maigret_site.update_from_engine(db.engines_dict[engine_name])
sites.append(maigret_site)
return sites
return []
def extract_username_dialog(url):
url_parts = url.rstrip("/").split("/")
supposed_username = url_parts[-1] supposed_username = url_parts[-1]
new_name = input(f'Is "{supposed_username}" a valid username? If not, write it manually: ') entered_username = input(
if new_name: f'Is "{supposed_username}" a valid username? If not, write it manually: '
supposed_username = new_name )
non_exist_username = 'noonewouldeverusethis7' return entered_username if entered_username else supposed_username
url_user = url_exists.replace(supposed_username, '{username}')
async def check_features_manually(
db, url_exists, url_mainpage, cookie_file, logger, redirects=True
):
supposed_username = extract_username_dialog(url_exists)
non_exist_username = "noonewouldeverusethis7"
url_user = url_exists.replace(supposed_username, "{username}")
url_not_exists = url_exists.replace(supposed_username, non_exist_username) url_not_exists = url_exists.replace(supposed_username, non_exist_username)
a = requests.get(url_exists).text # cookies
b = requests.get(url_not_exists).text cookie_dict = None
if cookie_file:
cookie_jar = await import_aiohttp_cookies(cookie_file)
cookie_dict = {c.key: c.value for c in cookie_jar}
exists_resp = requests.get(
url_exists, cookies=cookie_dict, headers=HEADERS, allow_redirects=redirects
)
logger.debug(exists_resp.status_code)
logger.debug(exists_resp.text)
non_exists_resp = requests.get(
url_not_exists, cookies=cookie_dict, headers=HEADERS, allow_redirects=redirects
)
logger.debug(non_exists_resp.status_code)
logger.debug(non_exists_resp.text)
a = exists_resp.text
b = non_exists_resp.text
tokens_a = set(a.split('"')) tokens_a = set(a.split('"'))
tokens_b = set(b.split('"')) tokens_b = set(b.split('"'))
@@ -116,57 +223,114 @@ async def submit_dialog(db, url_exists):
a_minus_b = tokens_a.difference(tokens_b) a_minus_b = tokens_a.difference(tokens_b)
b_minus_a = tokens_b.difference(tokens_a) b_minus_a = tokens_b.difference(tokens_a)
top_features_count = int(input(f'Specify count of features to extract [default {TOP_FEATURES}]: ') or TOP_FEATURES) if len(a_minus_b) == len(b_minus_a) == 0:
print("The pages for existing and non-existing account are the same!")
presence_list = sorted(a_minus_b, key=get_match_ratio, reverse=True)[:top_features_count] top_features_count = int(
input(f"Specify count of features to extract [default {TOP_FEATURES}]: ")
or TOP_FEATURES
)
print('Detected text features of existing account: ' + ', '.join(presence_list)) presence_list = sorted(a_minus_b, key=get_match_ratio, reverse=True)[
features = input('If features was not detected correctly, write it manually: ') :top_features_count
]
print("Detected text features of existing account: " + ", ".join(presence_list))
features = input("If features was not detected correctly, write it manually: ")
if features: if features:
presence_list = features.split(',') presence_list = features.split(",")
absence_list = sorted(b_minus_a, key=get_match_ratio, reverse=True)[:top_features_count] absence_list = sorted(b_minus_a, key=get_match_ratio, reverse=True)[
print('Detected text features of non-existing account: ' + ', '.join(absence_list)) :top_features_count
features = input('If features was not detected correctly, write it manually: ') ]
print("Detected text features of non-existing account: " + ", ".join(absence_list))
features = input("If features was not detected correctly, write it manually: ")
if features: if features:
absence_list = features.split(',') absence_list = features.split(",")
url_main = extract_domain(url_exists)
site_data = { site_data = {
'absenceStrs': absence_list, "absenceStrs": absence_list,
'presenseStrs': presence_list, "presenseStrs": presence_list,
'url': url_user, "url": url_user,
'urlMain': url_main, "urlMain": url_mainpage,
'usernameClaimed': supposed_username, "usernameClaimed": supposed_username,
'usernameUnclaimed': non_exist_username, "usernameUnclaimed": non_exist_username,
'checkType': 'message', "checkType": "message",
} }
site = MaigretSite(url_main.split('/')[-1], site_data) site = MaigretSite(url_mainpage.split("/")[-1], site_data)
return site
print(site.__dict__)
async def submit_dialog(db, url_exists, cookie_file, logger):
domain_raw = URL_RE.sub("", url_exists).strip().strip("/")
domain_raw = domain_raw.split("/")[0]
# check for existence
matched_sites = list(filter(lambda x: domain_raw in x.url_main + x.url, db.sites))
if matched_sites:
print(
f'Sites with domain "{domain_raw}" already exists in the Maigret database!'
)
status = lambda s: "(disabled)" if s.disabled else ""
url_block = lambda s: f"\n\t{s.url_main}\n\t{s.url}"
print(
"\n".join(
[
f"{site.name} {status(site)}{url_block(site)}"
for site in matched_sites
]
)
)
if input("Do you want to continue? [yN] ").lower() in "n":
return False
url_mainpage = extract_mainpage_url(url_exists)
sites = await detect_known_engine(db, url_exists, url_mainpage, logger)
if not sites:
print("Unable to detect site engine, lets generate checking features")
sites = [
await check_features_manually(
db, url_exists, url_mainpage, cookie_file, logger
)
]
logger.debug(sites[0].__dict__)
sem = asyncio.Semaphore(1) sem = asyncio.Semaphore(1)
log_level = logging.INFO
logging.basicConfig( found = False
format='[%(filename)s:%(lineno)d] %(levelname)-3s %(asctime)s %(message)s', chosen_site = None
datefmt='%H:%M:%S', for s in sites:
level=log_level chosen_site = s
result = await site_self_check(s, logger, sem, db)
if not result["disabled"]:
found = True
break
if not found:
print(
f"Sorry, we couldn't find params to detect account presence/absence in {chosen_site.name}."
)
print(
"Try to run this mode again and increase features count or choose others."
) )
logger = logging.getLogger('site-submit')
logger.setLevel(log_level)
result = await site_self_check(site, logger, sem, db)
if result['disabled']:
print(f'Sorry, we couldn\'t find params to detect account presence/absence in {site.name}.')
print('Try to run this mode again and increase features count or choose others.')
else: else:
if input(f'Site {site.name} successfully checked. Do you want to save it in the Maigret DB? [yY] ') in 'yY': if (
db.update_site(site) input(
f"Site {chosen_site.name} successfully checked. Do you want to save it in the Maigret DB? [Yn] "
).lower()
in "y"
):
logger.debug(chosen_site.json)
site_data = chosen_site.strip_engine_data()
logger.debug(site_data.json)
db.update_site(site_data)
return True return True
return False return False
+11
@@ -0,0 +1,11 @@
from typing import Callable, List, Dict, Tuple, Any
# search query
QueryDraft = Tuple[Callable, List, Dict]
# options dict
QueryOptions = Dict[str, Any]
# TODO: throw out
QueryResultWrapper = Dict[str, Any]
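(Illustrative sketch, not part of the new module above.) A QueryDraft bundles a coroutine with its positional and keyword arguments; the executor tests later in this diff build exactly such tuples. The alias is redefined locally here to keep the example self-contained:

import asyncio
from typing import Any, Callable, Dict, List, Tuple

QueryDraft = Tuple[Callable, List, Dict]

async def check(username: str, site: str = "example") -> str:
    # stand-in coroutine; maigret would perform an HTTP check here
    await asyncio.sleep(0)
    return f"{username}@{site}"

draft: QueryDraft = (check, ["soxoj"], {"site": "GitHub"})

async def run_draft(d: QueryDraft) -> Any:
    func, args, kwargs = d
    return await func(*args, **kwargs)

print(asyncio.run(run_draft(draft)))  # soxoj@GitHub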
+46 -16
@@ -1,58 +1,88 @@
import re import re
import sys import random
DEFAULT_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36",
]
class CaseConverter: class CaseConverter:
@staticmethod @staticmethod
def camel_to_snake(camelcased_string: str) -> str: def camel_to_snake(camelcased_string: str) -> str:
return re.sub(r'(?<!^)(?=[A-Z])', '_', camelcased_string).lower() return re.sub(r"(?<!^)(?=[A-Z])", "_", camelcased_string).lower()
@staticmethod @staticmethod
def snake_to_camel(snakecased_string: str) -> str: def snake_to_camel(snakecased_string: str) -> str:
formatted = ''.join(word.title() for word in snakecased_string.split('_')) formatted = "".join(word.title() for word in snakecased_string.split("_"))
result = formatted[0].lower() + formatted[1:] result = formatted[0].lower() + formatted[1:]
return result return result
@staticmethod @staticmethod
def snake_to_title(snakecased_string: str) -> str: def snake_to_title(snakecased_string: str) -> str:
words = snakecased_string.split('_') words = snakecased_string.split("_")
words[0] = words[0].title() words[0] = words[0].title()
return ' '.join(words) return " ".join(words)
def is_country_tag(tag: str) -> bool: def is_country_tag(tag: str) -> bool:
"""detect if tag represent a country""" """detect if tag represent a country"""
return bool(re.match("^([a-zA-Z]){2}$", tag)) or tag == 'global' return bool(re.match("^([a-zA-Z]){2}$", tag)) or tag == "global"
def enrich_link_str(link: str) -> str: def enrich_link_str(link: str) -> str:
link = link.strip() link = link.strip()
if link.startswith('www.') or (link.startswith('http') and '//' in link): if link.startswith("www.") or (link.startswith("http") and "//" in link):
return f'<a class="auto-link" href="{link}">{link}</a>' return f'<a class="auto-link" href="{link}">{link}</a>'
return link return link
class URLMatcher: class URLMatcher:
_HTTP_URL_RE_STR = '^https?://(www.)?(.+)$' _HTTP_URL_RE_STR = "^https?://(www.)?(.+)$"
HTTP_URL_RE = re.compile(_HTTP_URL_RE_STR) HTTP_URL_RE = re.compile(_HTTP_URL_RE_STR)
UNSAFE_SYMBOLS = '.?' UNSAFE_SYMBOLS = ".?"
@classmethod @classmethod
def extract_main_part(self, url: str) -> str: def extract_main_part(self, url: str) -> str:
match = self.HTTP_URL_RE.search(url) match = self.HTTP_URL_RE.search(url)
if match and match.group(2): if match and match.group(2):
return match.group(2).rstrip('/') return match.group(2).rstrip("/")
return '' return ""
@classmethod @classmethod
def make_profile_url_regexp(self, url: str, username_regexp: str = ''): def make_profile_url_regexp(self, url: str, username_regexp: str = ""):
url_main_part = self.extract_main_part(url) url_main_part = self.extract_main_part(url)
for c in self.UNSAFE_SYMBOLS: for c in self.UNSAFE_SYMBOLS:
url_main_part = url_main_part.replace(c, f'\\{c}') url_main_part = url_main_part.replace(c, f"\\{c}")
username_regexp = username_regexp or '.+?' username_regexp = username_regexp or ".+?"
url_regexp = url_main_part.replace('{username}', f'({username_regexp})') url_regexp = url_main_part.replace("{username}", f"({username_regexp})")
regexp_str = self._HTTP_URL_RE_STR.replace('(.+)', url_regexp) regexp_str = self._HTTP_URL_RE_STR.replace("(.+)", url_regexp)
return re.compile(regexp_str) return re.compile(regexp_str)
def get_dict_ascii_tree(items, prepend="", new_line=True):
text = ""
for num, item in enumerate(items):
box_symbol = "┣╸" if num != len(items) - 1 else "┗╸"
if type(item) == tuple:
field_name, field_value = item
if field_value.startswith("['"):
is_last_item = num == len(items) - 1
prepend_symbols = " " * 3 if is_last_item else ""
field_value = get_dict_ascii_tree(eval(field_value), prepend_symbols)
text += f"\n{prepend}{box_symbol}{field_name}: {field_value}"
else:
text += f"\n{prepend}{box_symbol} {item}"
if not new_line:
text = text[1:]
return text
def get_random_user_agent():
return random.choice(DEFAULT_USER_AGENTS)
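(Usage sketch, illustrative only; the example URL is an assumption.) The helpers changed or added in this file can be exercised like this:

from maigret.utils import (
    CaseConverter,
    URLMatcher,
    get_dict_ascii_tree,
    get_random_user_agent,
)

print(CaseConverter.camel_to_snake("urlMain"))     # url_main
print(CaseConverter.snake_to_camel("check_type"))  # checkType

# build a profile-URL regexp and pull the username back out of a full URL
regexp = URLMatcher.make_profile_url_regexp("https://500px.com/p/{username}")
match = regexp.match("https://500px.com/p/alexaimephotography")
print(match.group(2) if match else None)           # alexaimephotography

# render a small dict as an ASCII tree and pick a default user agent
print(get_dict_ascii_tree({"uid": "1", "name": "Alex"}.items()))
print(get_random_user_agent())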
+4 -6
@@ -1,4 +1,4 @@
aiohttp==3.7.3 aiohttp==3.7.4
aiohttp-socks==0.5.5 aiohttp-socks==0.5.5
arabic-reshaper==2.1.1 arabic-reshaper==2.1.1
async-timeout==3.0.1 async-timeout==3.0.1
@@ -13,22 +13,20 @@ future==0.18.2
future-annotations==1.0.0 future-annotations==1.0.0
html5lib==1.1 html5lib==1.1
idna==2.10 idna==2.10
Jinja2==2.11.2 Jinja2==2.11.3
lxml==4.6.2 lxml==4.6.3
MarkupSafe==1.1.1 MarkupSafe==1.1.1
mock==4.0.2 mock==4.0.2
multidict==5.1.0 multidict==5.1.0
Pillow==8.1.0
pycountry==20.7.3 pycountry==20.7.3
PyPDF2==1.26.0 PyPDF2==1.26.0
PySocks==1.7.1 PySocks==1.7.1
python-bidi==0.4.2 python-bidi==0.4.2
python-socks==1.1.2 python-socks==1.1.2
reportlab==3.5.59
requests>=2.24.0 requests>=2.24.0
requests-futures==1.0.0 requests-futures==1.0.0
six==1.15.0 six==1.15.0
socid-extractor>=0.0.12 socid-extractor>=0.0.16
soupsieve==2.1 soupsieve==2.1
stem==1.8.0 stem==1.8.0
torrequest==0.1.0 torrequest==0.1.0
+6
@@ -1,3 +1,9 @@
[egg_info] [egg_info]
tag_build = tag_build =
tag_date = 0 tag_date = 0
[flake8]
per-file-ignores = __init__.py:F401
[mypy]
ignore_missing_imports = True
+1 -1
@@ -12,7 +12,7 @@ with open('requirements.txt') as rf:
requires = rf.read().splitlines() requires = rf.read().splitlines()
setup(name='maigret', setup(name='maigret',
version='0.1.14', version='0.1.20',
description='Collect a dossier on a person by username from a huge number of sites', description='Collect a dossier on a person by username from a huge number of sites',
long_description=long_description, long_description=long_description,
long_description_content_type="text/markdown", long_description_content_type="text/markdown",
+2199 -2142
File diff suppressed because it is too large
Executable
+2
@@ -0,0 +1,2 @@
#!/bin/sh
pytest tests
+4 -3
@@ -1,11 +1,11 @@
import glob import glob
import logging import logging
import os import os
import pytest import pytest
from _pytest.mark import Mark from _pytest.mark import Mark
from mock import Mock
from maigret.sites import MaigretDatabase, MaigretSite from maigret.sites import MaigretDatabase
CUR_PATH = os.path.dirname(os.path.realpath(__file__)) CUR_PATH = os.path.dirname(os.path.realpath(__file__))
JSON_FILE = os.path.join(CUR_PATH, '../maigret/resources/data.json') JSON_FILE = os.path.join(CUR_PATH, '../maigret/resources/data.json')
@@ -26,7 +26,8 @@ def get_test_reports_filenames():
def remove_test_reports(): def remove_test_reports():
reports_list = get_test_reports_filenames() reports_list = get_test_reports_filenames()
for f in reports_list: os.remove(f) for f in reports_list:
os.remove(f)
logging.error(f'Removed test reports {reports_list}') logging.error(f'Removed test reports {reports_list}')
+4 -2
@@ -1,5 +1,6 @@
"""Maigret activation test functions""" """Maigret activation test functions"""
import json import json
import aiohttp import aiohttp
import pytest import pytest
from mock import Mock from mock import Mock
@@ -43,8 +44,9 @@ async def test_import_aiohttp_cookies():
url = 'https://httpbin.org/cookies' url = 'https://httpbin.org/cookies'
connector = aiohttp.TCPConnector(ssl=False) connector = aiohttp.TCPConnector(ssl=False)
session = aiohttp.ClientSession(connector=connector, trust_env=True, session = aiohttp.ClientSession(
cookie_jar=cookie_jar) connector=connector, trust_env=True, cookie_jar=cookie_jar
)
response = await session.get(url=url) response = await session.get(url=url)
result = json.loads(await response.content.read()) result = json.loads(await response.content.read())
+73
@@ -0,0 +1,73 @@
"""Maigret checking logic test functions"""
import pytest
import asyncio
import logging
from maigret.executors import (
AsyncioSimpleExecutor,
AsyncioProgressbarExecutor,
AsyncioProgressbarSemaphoreExecutor,
AsyncioProgressbarQueueExecutor,
)
logger = logging.getLogger(__name__)
async def func(n):
await asyncio.sleep(0.1 * (n % 3))
return n
@pytest.mark.asyncio
async def test_simple_asyncio_executor():
tasks = [(func, [n], {}) for n in range(10)]
executor = AsyncioSimpleExecutor(logger=logger)
assert await executor.run(tasks) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert executor.execution_time > 0.2
assert executor.execution_time < 0.3
@pytest.mark.asyncio
async def test_asyncio_progressbar_executor():
tasks = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarExecutor(logger=logger)
# no guarantees for the results order
assert sorted(await executor.run(tasks)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert executor.execution_time > 0.2
assert executor.execution_time < 0.3
@pytest.mark.asyncio
async def test_asyncio_progressbar_semaphore_executor():
tasks = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarSemaphoreExecutor(logger=logger, in_parallel=5)
# no guarantees for the results order
assert sorted(await executor.run(tasks)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert executor.execution_time > 0.2
assert executor.execution_time < 0.4
@pytest.mark.asyncio
async def test_asyncio_progressbar_queue_executor():
tasks = [(func, [n], {}) for n in range(10)]
executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=2)
assert await executor.run(tasks) == [0, 1, 3, 2, 4, 6, 7, 5, 9, 8]
assert executor.execution_time > 0.5
assert executor.execution_time < 0.6
executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=3)
assert await executor.run(tasks) == [0, 3, 1, 4, 6, 2, 7, 9, 5, 8]
assert executor.execution_time > 0.4
assert executor.execution_time < 0.5
executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=5)
assert await executor.run(tasks) == [0, 3, 6, 1, 4, 7, 9, 2, 5, 8]
assert executor.execution_time > 0.3
assert executor.execution_time < 0.4
executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=10)
assert await executor.run(tasks) == [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
assert executor.execution_time > 0.2
assert executor.execution_time < 0.3
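(Illustrative sketch based on the tests above, not part of the commit.) The executors accept (coroutine, args, kwargs) task tuples and expose run() plus an execution_time attribute; a minimal standalone run might look like this:

import asyncio
import logging

from maigret.executors import AsyncioProgressbarQueueExecutor

logger = logging.getLogger(__name__)

async def fetch(n: int) -> int:
    # stand-in for a site check coroutine
    await asyncio.sleep(0.1)
    return n * n

async def main():
    tasks = [(fetch, [n], {}) for n in range(5)]
    executor = AsyncioProgressbarQueueExecutor(logger=logger, in_parallel=2)
    results = await executor.run(tasks)
    print(results, executor.execution_time)

asyncio.run(main())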
+9 -18
@@ -1,46 +1,37 @@
"""Maigret main module test functions""" """Maigret main module test functions"""
import asyncio import asyncio
import pytest import pytest
from mock import Mock from mock import Mock
from maigret.maigret import self_check from maigret.maigret import self_check
from maigret.sites import MaigretDatabase, MaigretSite from maigret.sites import MaigretDatabase
EXAMPLE_DB = { EXAMPLE_DB = {
'engines': { 'engines': {},
},
'sites': { 'sites': {
"GooglePlayStore": { "GooglePlayStore": {
"tags": [ "tags": ["global", "us"],
"global",
"us"
],
"disabled": False, "disabled": False,
"checkType": "status_code", "checkType": "status_code",
"alexaRank": 1, "alexaRank": 1,
"url": "https://play.google.com/store/apps/developer?id={username}", "url": "https://play.google.com/store/apps/developer?id={username}",
"urlMain": "https://play.google.com/store", "urlMain": "https://play.google.com/store",
"usernameClaimed": "Facebook_nosuchname", "usernameClaimed": "Facebook_nosuchname",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7",
}, },
"Reddit": { "Reddit": {
"tags": [ "tags": ["news", "social", "us"],
"news",
"social",
"us"
],
"checkType": "status_code", "checkType": "status_code",
"presenseStrs": [ "presenseStrs": ["totalKarma"],
"totalKarma"
],
"disabled": True, "disabled": True,
"alexaRank": 17, "alexaRank": 17,
"url": "https://www.reddit.com/user/{username}", "url": "https://www.reddit.com/user/{username}",
"urlMain": "https://www.reddit.com/", "urlMain": "https://www.reddit.com/",
"usernameClaimed": "blue", "usernameClaimed": "blue",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7",
},
}, },
}
} }
+188 -53
@@ -7,8 +7,16 @@ from io import StringIO
import xmind import xmind
from jinja2 import Template from jinja2 import Template
from maigret.report import generate_csv_report, generate_txt_report, save_xmind_report, save_html_report, \ from maigret.report import (
save_pdf_report, generate_report_template, generate_report_context, generate_json_report generate_csv_report,
generate_txt_report,
save_xmind_report,
save_html_report,
save_pdf_report,
generate_report_template,
generate_report_context,
generate_json_report,
)
from maigret.result import QueryResult, QueryStatus from maigret.result import QueryResult, QueryStatus
EXAMPLE_RESULTS = { EXAMPLE_RESULTS = {
@@ -17,14 +25,16 @@ EXAMPLE_RESULTS = {
'parsing_enabled': True, 'parsing_enabled': True,
'url_main': 'https://www.github.com/', 'url_main': 'https://www.github.com/',
'url_user': 'https://www.github.com/test', 'url_user': 'https://www.github.com/test',
'status': QueryResult('test', 'status': QueryResult(
'test',
'GitHub', 'GitHub',
'https://www.github.com/test', 'https://www.github.com/test',
QueryStatus.CLAIMED, QueryStatus.CLAIMED,
tags=['test_tag']), tags=['test_tag'],
),
'http_status': 200, 'http_status': 200,
'is_similar': False, 'is_similar': False,
'rank': 78 'rank': 78,
} }
} }
@@ -33,74 +43,196 @@ BAD_RESULT = QueryResult('', '', '', QueryStatus.AVAILABLE)
GOOD_500PX_RESULT = copy.deepcopy(GOOD_RESULT) GOOD_500PX_RESULT = copy.deepcopy(GOOD_RESULT)
GOOD_500PX_RESULT.tags = ['photo', 'us', 'global'] GOOD_500PX_RESULT.tags = ['photo', 'us', 'global']
GOOD_500PX_RESULT.ids_data = {"uid": "dXJpOm5vZGU6VXNlcjoyNjQwMzQxNQ==", "legacy_id": "26403415", GOOD_500PX_RESULT.ids_data = {
"username": "alexaimephotographycars", "name": "Alex Aim\u00e9", "uid": "dXJpOm5vZGU6VXNlcjoyNjQwMzQxNQ==",
"legacy_id": "26403415",
"username": "alexaimephotographycars",
"name": "Alex Aim\u00e9",
"website": "www.flickr.com/photos/alexaimephotography/", "website": "www.flickr.com/photos/alexaimephotography/",
"facebook_link": " www.instagram.com/street.reality.photography/", "facebook_link": " www.instagram.com/street.reality.photography/",
"instagram_username": "alexaimephotography", "twitter_username": "Alexaimephotogr"} "instagram_username": "alexaimephotography",
"twitter_username": "Alexaimephotogr",
}
GOOD_REDDIT_RESULT = copy.deepcopy(GOOD_RESULT) GOOD_REDDIT_RESULT = copy.deepcopy(GOOD_RESULT)
GOOD_REDDIT_RESULT.tags = ['news', 'us'] GOOD_REDDIT_RESULT.tags = ['news', 'us']
GOOD_REDDIT_RESULT.ids_data = {"reddit_id": "t5_1nytpy", "reddit_username": "alexaimephotography", GOOD_REDDIT_RESULT.ids_data = {
"reddit_id": "t5_1nytpy",
"reddit_username": "alexaimephotography",
"fullname": "alexaimephotography", "fullname": "alexaimephotography",
"image": "https://styles.redditmedia.com/t5_1nytpy/styles/profileIcon_7vmhdwzd3g931.jpg?width=256&height=256&crop=256:256,smart&frame=1&s=4f355f16b4920844a3f4eacd4237a7bf76b2e97e", "image": "https://styles.redditmedia.com/t5_1nytpy/styles/profileIcon_7vmhdwzd3g931.jpg?width=256&height=256&crop=256:256,smart&frame=1&s=4f355f16b4920844a3f4eacd4237a7bf76b2e97e",
"is_employee": "False", "is_nsfw": "False", "is_mod": "True", "is_following": "True", "is_employee": "False",
"has_user_profile": "True", "hide_from_robots": "False", "is_nsfw": "False",
"created_at": "2019-07-10 12:20:03", "total_karma": "53959", "post_karma": "52738"} "is_mod": "True",
"is_following": "True",
"has_user_profile": "True",
"hide_from_robots": "False",
"created_at": "2019-07-10 12:20:03",
"total_karma": "53959",
"post_karma": "52738",
}
GOOD_IG_RESULT = copy.deepcopy(GOOD_RESULT) GOOD_IG_RESULT = copy.deepcopy(GOOD_RESULT)
GOOD_IG_RESULT.tags = ['photo', 'global'] GOOD_IG_RESULT.tags = ['photo', 'global']
GOOD_IG_RESULT.ids_data = {"instagram_username": "alexaimephotography", "fullname": "Alexaimephotography", GOOD_IG_RESULT.ids_data = {
"instagram_username": "alexaimephotography",
"fullname": "Alexaimephotography",
"id": "6828488620", "id": "6828488620",
"image": "https://scontent-hel3-1.cdninstagram.com/v/t51.2885-19/s320x320/95420076_1169632876707608_8741505804647006208_n.jpg?_nc_ht=scontent-hel3-1.cdninstagram.com&_nc_ohc=jd87OUGsX4MAX_Ym5GX&tp=1&oh=0f42badd68307ba97ec7fb1ef7b4bfd4&oe=601E5E6F", "image": "https://scontent-hel3-1.cdninstagram.com/v/t51.2885-19/s320x320/95420076_1169632876707608_8741505804647006208_n.jpg?_nc_ht=scontent-hel3-1.cdninstagram.com&_nc_ohc=jd87OUGsX4MAX_Ym5GX&tp=1&oh=0f42badd68307ba97ec7fb1ef7b4bfd4&oe=601E5E6F",
"bio": "Photographer \nChild of fine street arts", "bio": "Photographer \nChild of fine street arts",
"external_url": "https://www.flickr.com/photos/alexaimephotography2020/"} "external_url": "https://www.flickr.com/photos/alexaimephotography2020/",
}
GOOD_TWITTER_RESULT = copy.deepcopy(GOOD_RESULT) GOOD_TWITTER_RESULT = copy.deepcopy(GOOD_RESULT)
GOOD_TWITTER_RESULT.tags = ['social', 'us'] GOOD_TWITTER_RESULT.tags = ['social', 'us']
TEST = [('alexaimephotographycars', 'username', { TEST = [
'500px': {'username': 'alexaimephotographycars', 'parsing_enabled': True, 'url_main': 'https://500px.com/', (
'alexaimephotographycars',
'username',
{
'500px': {
'username': 'alexaimephotographycars',
'parsing_enabled': True,
'url_main': 'https://500px.com/',
'url_user': 'https://500px.com/p/alexaimephotographycars', 'url_user': 'https://500px.com/p/alexaimephotographycars',
'ids_usernames': {'alexaimephotographycars': 'username', 'alexaimephotography': 'username', 'ids_usernames': {
'Alexaimephotogr': 'username'}, 'status': GOOD_500PX_RESULT, 'http_status': 200, 'alexaimephotographycars': 'username',
'is_similar': False, 'rank': 2981}, 'alexaimephotography': 'username',
'Reddit': {'username': 'alexaimephotographycars', 'parsing_enabled': True, 'url_main': 'https://www.reddit.com/', 'Alexaimephotogr': 'username',
'url_user': 'https://www.reddit.com/user/alexaimephotographycars', 'status': BAD_RESULT, },
'http_status': 404, 'is_similar': False, 'rank': 17}, 'status': GOOD_500PX_RESULT,
'Twitter': {'username': 'alexaimephotographycars', 'parsing_enabled': True, 'url_main': 'https://www.twitter.com/', 'http_status': 200,
'url_user': 'https://twitter.com/alexaimephotographycars', 'status': BAD_RESULT, 'http_status': 400, 'is_similar': False,
'is_similar': False, 'rank': 55}, 'rank': 2981,
'Instagram': {'username': 'alexaimephotographycars', 'parsing_enabled': True, },
'Reddit': {
'username': 'alexaimephotographycars',
'parsing_enabled': True,
'url_main': 'https://www.reddit.com/',
'url_user': 'https://www.reddit.com/user/alexaimephotographycars',
'status': BAD_RESULT,
'http_status': 404,
'is_similar': False,
'rank': 17,
},
'Twitter': {
'username': 'alexaimephotographycars',
'parsing_enabled': True,
'url_main': 'https://www.twitter.com/',
'url_user': 'https://twitter.com/alexaimephotographycars',
'status': BAD_RESULT,
'http_status': 400,
'is_similar': False,
'rank': 55,
},
'Instagram': {
'username': 'alexaimephotographycars',
'parsing_enabled': True,
'url_main': 'https://www.instagram.com/', 'url_main': 'https://www.instagram.com/',
'url_user': 'https://www.instagram.com/alexaimephotographycars', 'status': BAD_RESULT, 'url_user': 'https://www.instagram.com/alexaimephotographycars',
'http_status': 404, 'is_similar': False, 'rank': 29}}), ('alexaimephotography', 'username', { 'status': BAD_RESULT,
'500px': {'username': 'alexaimephotography', 'parsing_enabled': True, 'url_main': 'https://500px.com/', 'http_status': 404,
'url_user': 'https://500px.com/p/alexaimephotography', 'status': BAD_RESULT, 'http_status': 200, 'is_similar': False,
'is_similar': False, 'rank': 2981}, 'rank': 29,
'Reddit': {'username': 'alexaimephotography', 'parsing_enabled': True, 'url_main': 'https://www.reddit.com/', },
},
),
(
'alexaimephotography',
'username',
{
'500px': {
'username': 'alexaimephotography',
'parsing_enabled': True,
'url_main': 'https://500px.com/',
'url_user': 'https://500px.com/p/alexaimephotography',
'status': BAD_RESULT,
'http_status': 200,
'is_similar': False,
'rank': 2981,
},
'Reddit': {
'username': 'alexaimephotography',
'parsing_enabled': True,
'url_main': 'https://www.reddit.com/',
'url_user': 'https://www.reddit.com/user/alexaimephotography', 'url_user': 'https://www.reddit.com/user/alexaimephotography',
'ids_usernames': {'alexaimephotography': 'username'}, 'status': GOOD_REDDIT_RESULT, 'http_status': 200, 'ids_usernames': {'alexaimephotography': 'username'},
'is_similar': False, 'rank': 17}, 'status': GOOD_REDDIT_RESULT,
'Twitter': {'username': 'alexaimephotography', 'parsing_enabled': True, 'url_main': 'https://www.twitter.com/', 'http_status': 200,
'url_user': 'https://twitter.com/alexaimephotography', 'status': BAD_RESULT, 'http_status': 400, 'is_similar': False,
'is_similar': False, 'rank': 55}, 'rank': 17,
'Instagram': {'username': 'alexaimephotography', 'parsing_enabled': True, 'url_main': 'https://www.instagram.com/', },
'Twitter': {
'username': 'alexaimephotography',
'parsing_enabled': True,
'url_main': 'https://www.twitter.com/',
'url_user': 'https://twitter.com/alexaimephotography',
'status': BAD_RESULT,
'http_status': 400,
'is_similar': False,
'rank': 55,
},
'Instagram': {
'username': 'alexaimephotography',
'parsing_enabled': True,
'url_main': 'https://www.instagram.com/',
'url_user': 'https://www.instagram.com/alexaimephotography', 'url_user': 'https://www.instagram.com/alexaimephotography',
'ids_usernames': {'alexaimephotography': 'username'}, 'status': GOOD_IG_RESULT, 'http_status': 200, 'ids_usernames': {'alexaimephotography': 'username'},
'is_similar': False, 'rank': 29}}), ('Alexaimephotogr', 'username', { 'status': GOOD_IG_RESULT,
'500px': {'username': 'Alexaimephotogr', 'parsing_enabled': True, 'url_main': 'https://500px.com/', 'http_status': 200,
'url_user': 'https://500px.com/p/Alexaimephotogr', 'status': BAD_RESULT, 'http_status': 200, 'is_similar': False,
'is_similar': False, 'rank': 2981}, 'rank': 29,
'Reddit': {'username': 'Alexaimephotogr', 'parsing_enabled': True, 'url_main': 'https://www.reddit.com/', },
'url_user': 'https://www.reddit.com/user/Alexaimephotogr', 'status': BAD_RESULT, 'http_status': 404, },
'is_similar': False, 'rank': 17}, ),
'Twitter': {'username': 'Alexaimephotogr', 'parsing_enabled': True, 'url_main': 'https://www.twitter.com/', (
'url_user': 'https://twitter.com/Alexaimephotogr', 'status': GOOD_TWITTER_RESULT, 'http_status': 400, 'Alexaimephotogr',
'is_similar': False, 'rank': 55}, 'username',
'Instagram': {'username': 'Alexaimephotogr', 'parsing_enabled': True, 'url_main': 'https://www.instagram.com/', {
'url_user': 'https://www.instagram.com/Alexaimephotogr', 'status': BAD_RESULT, 'http_status': 404, '500px': {
'is_similar': False, 'rank': 29}})] 'username': 'Alexaimephotogr',
'parsing_enabled': True,
'url_main': 'https://500px.com/',
'url_user': 'https://500px.com/p/Alexaimephotogr',
'status': BAD_RESULT,
'http_status': 200,
'is_similar': False,
'rank': 2981,
},
'Reddit': {
'username': 'Alexaimephotogr',
'parsing_enabled': True,
'url_main': 'https://www.reddit.com/',
'url_user': 'https://www.reddit.com/user/Alexaimephotogr',
'status': BAD_RESULT,
'http_status': 404,
'is_similar': False,
'rank': 17,
},
'Twitter': {
'username': 'Alexaimephotogr',
'parsing_enabled': True,
'url_main': 'https://www.twitter.com/',
'url_user': 'https://twitter.com/Alexaimephotogr',
'status': GOOD_TWITTER_RESULT,
'http_status': 400,
'is_similar': False,
'rank': 55,
},
'Instagram': {
'username': 'Alexaimephotogr',
'parsing_enabled': True,
'url_main': 'https://www.instagram.com/',
'url_user': 'https://www.instagram.com/Alexaimephotogr',
'status': BAD_RESULT,
'http_status': 404,
'is_similar': False,
'rank': 29,
},
},
),
]
SUPPOSED_BRIEF = """Search by username alexaimephotographycars returned 1 accounts. Found target's other IDs: alexaimephotography, Alexaimephotogr. Search by username alexaimephotography returned 2 accounts. Search by username Alexaimephotogr returned 1 accounts. Extended info extracted from 3 accounts.""" SUPPOSED_BRIEF = """Search by username alexaimephotographycars returned 1 accounts. Found target's other IDs: alexaimephotography, Alexaimephotogr. Search by username alexaimephotography returned 2 accounts. Search by username Alexaimephotogr returned 1 accounts. Extended info extracted from 3 accounts."""
@@ -187,7 +319,10 @@ def test_save_xmind_report():
assert data['topic']['topics'][0]['title'] == 'Undefined' assert data['topic']['topics'][0]['title'] == 'Undefined'
assert data['topic']['topics'][1]['title'] == 'test_tag' assert data['topic']['topics'][1]['title'] == 'test_tag'
assert len(data['topic']['topics'][1]['topics']) == 1 assert len(data['topic']['topics'][1]['topics']) == 1
assert data['topic']['topics'][1]['topics'][0]['label'] == 'https://www.github.com/test' assert (
data['topic']['topics'][1]['topics'][0]['label']
== 'https://www.github.com/test'
)
def test_html_report(): def test_html_report():
+14 -12
@@ -1,7 +1,6 @@
"""Maigret Database test functions""" """Maigret Database test functions"""
from maigret.sites import MaigretDatabase, MaigretSite from maigret.sites import MaigretDatabase, MaigretSite
EXAMPLE_DB = { EXAMPLE_DB = {
'engines': { 'engines': {
"XenForo": { "XenForo": {
@@ -11,25 +10,21 @@ EXAMPLE_DB = {
"The specified member cannot be found. Please enter a member's entire name.", "The specified member cannot be found. Please enter a member's entire name.",
], ],
"checkType": "message", "checkType": "message",
"errors": { "errors": {"You must be logged-in to do that.": "Login required"},
"You must be logged-in to do that.": "Login required" "url": "{urlMain}{urlSubpath}/members/?username={username}",
}, },
"url": "{urlMain}{urlSubpath}/members/?username={username}"
}
}, },
}, },
'sites': { 'sites': {
"Amperka": { "Amperka": {
"engine": "XenForo", "engine": "XenForo",
"rank": 121613, "rank": 121613,
"tags": [ "tags": ["ru"],
"ru"
],
"urlMain": "http://forum.amperka.ru", "urlMain": "http://forum.amperka.ru",
"usernameClaimed": "adam", "usernameClaimed": "adam",
"usernameUnclaimed": "noonewouldeverusethis7" "usernameUnclaimed": "noonewouldeverusethis7",
},
}, },
}
} }
@@ -117,8 +112,14 @@ def test_site_url_detector():
db = MaigretDatabase() db = MaigretDatabase()
db.load_from_json(EXAMPLE_DB) db.load_from_json(EXAMPLE_DB)
assert db.sites[0].url_regexp.pattern == r'^https?://(www.)?forum\.amperka\.ru/members/\?username=(.+?)$' assert (
assert db.sites[0].detect_username('http://forum.amperka.ru/members/?username=test') == 'test' db.sites[0].url_regexp.pattern
== r'^https?://(www.)?forum\.amperka\.ru/members/\?username=(.+?)$'
)
assert (
db.sites[0].detect_username('http://forum.amperka.ru/members/?username=test')
== 'test'
)
def test_ranked_sites_dict(): def test_ranked_sites_dict():
@@ -167,6 +168,7 @@ def test_ranked_sites_dict_disabled():
assert len(db.ranked_sites_dict()) == 2 assert len(db.ranked_sites_dict()) == 2
assert len(db.ranked_sites_dict(disabled=False)) == 1 assert len(db.ranked_sites_dict(disabled=False)) == 1
def test_ranked_sites_dict_id_type(): def test_ranked_sites_dict_id_type():
db = MaigretDatabase() db = MaigretDatabase()
db.update_site(MaigretSite('1', {})) db.update_site(MaigretSite('1', {}))
+63 -3
@@ -1,7 +1,14 @@
"""Maigret utils test functions""" """Maigret utils test functions"""
import itertools import itertools
import re import re
from maigret.utils import CaseConverter, is_country_tag, enrich_link_str, URLMatcher
from maigret.utils import (
CaseConverter,
is_country_tag,
enrich_link_str,
URLMatcher,
get_dict_ascii_tree,
)
def test_case_convert_camel_to_snake(): def test_case_convert_camel_to_snake():
@@ -10,18 +17,28 @@ def test_case_convert_camel_to_snake():
assert b == 'snake_cased_string' assert b == 'snake_cased_string'
def test_case_convert_snake_to_camel(): def test_case_convert_snake_to_camel():
a = 'camel_cased_string' a = 'camel_cased_string'
b = CaseConverter.snake_to_camel(a) b = CaseConverter.snake_to_camel(a)
assert b == 'camelCasedString' assert b == 'camelCasedString'
def test_case_convert_snake_to_title(): def test_case_convert_snake_to_title():
a = 'camel_cased_string' a = 'camel_cased_string'
b = CaseConverter.snake_to_title(a) b = CaseConverter.snake_to_title(a)
assert b == 'Camel cased string' assert b == 'Camel cased string'
def test_case_convert_camel_with_digits_to_snake():
a = 'ignore403'
b = CaseConverter.camel_to_snake(a)
assert b == 'ignore403'
def test_is_country_tag(): def test_is_country_tag():
assert is_country_tag('ru') == True assert is_country_tag('ru') == True
assert is_country_tag('FR') == True assert is_country_tag('FR') == True
@@ -31,9 +48,14 @@ def test_is_country_tag():
assert is_country_tag('global') == True assert is_country_tag('global') == True
def test_enrich_link_str(): def test_enrich_link_str():
assert enrich_link_str('test') == 'test' assert enrich_link_str('test') == 'test'
assert enrich_link_str(' www.flickr.com/photos/alexaimephotography/') == '<a class="auto-link" href="www.flickr.com/photos/alexaimephotography/">www.flickr.com/photos/alexaimephotography/</a>' assert (
enrich_link_str(' www.flickr.com/photos/alexaimephotography/')
== '<a class="auto-link" href="www.flickr.com/photos/alexaimephotography/">www.flickr.com/photos/alexaimephotography/</a>'
)
def test_url_extract_main_part(): def test_url_extract_main_part():
url_main_part = 'flickr.com/photos/alexaimephotography' url_main_part = 'flickr.com/photos/alexaimephotography'
@@ -51,6 +73,7 @@ def test_url_extract_main_part():
assert URLMatcher.extract_main_part(url) == url_main_part assert URLMatcher.extract_main_part(url) == url_main_part
assert not url_regexp.match(url) is None assert not url_regexp.match(url) is None
def test_url_make_profile_url_regexp(): def test_url_make_profile_url_regexp():
url_main_part = 'flickr.com/photos/{username}' url_main_part = 'flickr.com/photos/{username}'
@@ -63,4 +86,41 @@ def test_url_make_profile_url_regexp():
for url_parts in itertools.product(*parts): for url_parts in itertools.product(*parts):
url = ''.join(url_parts) url = ''.join(url_parts)
assert URLMatcher.make_profile_url_regexp(url).pattern == r'^https?://(www.)?flickr\.com/photos/(.+?)$' assert (
URLMatcher.make_profile_url_regexp(url).pattern
== r'^https?://(www.)?flickr\.com/photos/(.+?)$'
)
+def test_get_dict_ascii_tree():
+    data = {
+        'uid': 'dXJpOm5vZGU6VXNlcjoyNjQwMzQxNQ==',
+        'legacy_id': '26403415',
+        'username': 'alexaimephotographycars',
+        'name': 'Alex Aimé',
+        'created_at': '2018-05-04T10:17:01.000+0000',
+        'image': 'https://drscdn.500px.org/user_avatar/26403415/q%3D85_w%3D300_h%3D300/v2?webp=true&v=2&sig=0235678a4f7b65e007e864033ebfaf5ef6d87fad34f80a8639d985320c20fe3b',
+        'image_bg': 'https://drscdn.500px.org/user_cover/26403415/q%3D65_m%3D2048/v2?webp=true&v=1&sig=bea411fb158391a4fdad498874ff17088f91257e59dfb376ff67e3a44c3a4201',
+        'website': 'www.instagram.com/street.reality.photography/',
+        'facebook_link': ' www.instagram.com/street.reality.photography/',
+        'instagram_username': 'Street.Reality.Photography',
+        'twitter_username': 'Alexaimephotogr',
+    }
+
+    ascii_tree = get_dict_ascii_tree(data.items())
+
+    assert (
+        ascii_tree
+        == """
+uid: dXJpOm5vZGU6VXNlcjoyNjQwMzQxNQ==
+legacy_id: 26403415
+username: alexaimephotographycars
+name: Alex Aimé
+created_at: 2018-05-04T10:17:01.000+0000
+image: https://drscdn.500px.org/user_avatar/26403415/q%3D85_w%3D300_h%3D300/v2?webp=true&v=2&sig=0235678a4f7b65e007e864033ebfaf5ef6d87fad34f80a8639d985320c20fe3b
+image_bg: https://drscdn.500px.org/user_cover/26403415/q%3D65_m%3D2048/v2?webp=true&v=1&sig=bea411fb158391a4fdad498874ff17088f91257e59dfb376ff67e3a44c3a4201
+website: www.instagram.com/street.reality.photography/
+facebook_link: www.instagram.com/street.reality.photography/
+instagram_username: Street.Reality.Photography
+twitter_username: Alexaimephotogr"""
+    )
+9 -3
View File
@@ -20,8 +20,9 @@ RANKS.update({
    '5000': '5K',
    '10000': '10K',
    '100000': '100K',
-    '10000000': '1M',
-    '50000000': '10M',
+    '10000000': '10M',
+    '50000000': '50M',
+    '100000000': '100M',
})

SEMAPHORE = threading.Semaphore(10)
@@ -58,8 +59,9 @@ def get_rank(domain_to_query, site, print_errors=True):
def get_step_rank(rank):
    def get_readable_rank(r):
        return RANKS[str(r)]

    valid_step_ranks = sorted(map(int, RANKS.keys()))
-    if rank == 0:
+    if rank == 0 or rank == sys.maxsize:
        return get_readable_rank(valid_step_ranks[-1])
    else:
        return get_readable_rank(list(filter(lambda x: x >= rank, valid_step_ranks))[0])
@@ -73,6 +75,8 @@ if __name__ == '__main__':
help="JSON file with sites data to update.") help="JSON file with sites data to update.")
parser.add_argument('--empty-only', help='update only sites without rating', action='store_true') parser.add_argument('--empty-only', help='update only sites without rating', action='store_true')
parser.add_argument('--exclude-engine', help='do not update score with certain engine',
action="append", dest="exclude_engine_list", default=[])
pool = list() pool = list()
@@ -92,6 +96,8 @@ Rank data fetched from Alexa by domains.
        url_main = site.url_main
        if site.alexa_rank < sys.maxsize and args.empty_only:
            continue
+        if args.exclude_engine_list and site.engine in args.exclude_engine_list:
+            continue
        site.alexa_rank = 0
        th = threading.Thread(target=get_rank, args=(url_main, site))
        pool.append((site.name, url_main, th))
Executable
+71
View File
@@ -0,0 +1,71 @@
#!/usr/bin/env python3
import asyncio
import logging

import maigret

# top popular sites from the Maigret database
TOP_SITES_COUNT = 300
# Maigret HTTP requests timeout
TIMEOUT = 10
# max parallel requests
MAX_CONNECTIONS = 50


if __name__ == '__main__':
    # setup logging and asyncio
    logger = logging.getLogger('maigret')
    logger.setLevel(logging.WARNING)
    loop = asyncio.get_event_loop()

    # setup Maigret
    db = maigret.MaigretDatabase().load_from_file('./maigret/resources/data.json')
    # also can be downloaded from web
    # db = MaigretDatabase().load_from_url(MAIGRET_DB_URL)

    # user input
    username = input('Enter username to search: ')

    sites_count_raw = input(
        f'Select the number of sites to search ({TOP_SITES_COUNT} for default, {len(db.sites_dict)} max): '
    )
    sites_count = int(sites_count_raw) or TOP_SITES_COUNT
    sites = db.ranked_sites_dict(top=sites_count)

    show_progressbar_raw = input('Do you want to show a progressbar? [Yn] ')
    show_progressbar = show_progressbar_raw.lower() != 'n'

    extract_info_raw = input(
        'Do you want to extract additional info from accounts\' pages? [Yn] '
    )
    extract_info = extract_info_raw.lower() != 'n'

    use_notifier_raw = input(
        'Do you want to use notifier for displaying results while searching? [Yn] '
    )
    use_notifier = use_notifier_raw.lower() != 'n'

    notifier = None
    if use_notifier:
        notifier = maigret.Notifier(print_found_only=True, skip_check_errors=True)

    # search!
    search_func = maigret.search(
        username=username,
        site_dict=sites,
        timeout=TIMEOUT,
        logger=logger,
        max_connections=MAX_CONNECTIONS,
        query_notify=notifier,
        no_progressbar=(not show_progressbar),
        is_parsing_enabled=extract_info,
    )

    results = loop.run_until_complete(search_func)

    input('Search completed. Press any key to show results.')

    for sitename, data in results.items():
        is_found = data['status'].is_found()
        print(f'{sitename} - {"Found!" if is_found else "Not found"}')