mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-07 14:34:33 +00:00
Compare commits
33 Commits
| SHA1 |
|---|
| aa6cd0eca9 |
| 38e5d5c664 |
| 8a562d06ae |
| aa50ee9672 |
| 51327f9647 |
| 4a368c9bb6 |
| 6fd5f6e33a |
| fa3db9c39c |
| 5912ad4fbc |
| ee36dc0187 |
| 9eb62e4e22 |
| ead048af93 |
| acc751ff98 |
| b7bdd71cf0 |
| 43f189f774 |
| 5bda7fb339 |
| 414523a8ac |
| 6d4e268706 |
| b696b982f4 |
| d4234036c0 |
| b57c70091c |
| e90df3560b |
| bc6ee48b8c |
| e70bdf3789 |
| 84f9d417cf |
| 4333c40be7 |
| 9e504c0094 |
| 2f752a0368 |
| 53e9dab677 |
| 11b70a2a48 |
| 960708ef2e |
| e6f6d8735d |
| f77d7d307a |
```diff
@@ -0,0 +1,32 @@
+name: Build docker image and push to DockerHub
+
+on:
+  push:
+    branches: [ main ]
+
+jobs:
+  docker:
+    runs-on: ubuntu-latest
+    steps:
+      -
+        name: Set up QEMU
+        uses: docker/setup-qemu-action@v1
+      -
+        name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v1
+      -
+        name: Login to DockerHub
+        uses: docker/login-action@v1
+        with:
+          username: ${{ secrets.DOCKER_HUB_USERNAME }}
+          password: ${{ secrets.DOCKER_HUB_ACCESS_TOKEN }}
+      -
+        name: Build and push
+        id: docker_build
+        uses: docker/build-push-action@v2
+        with:
+          push: true
+          tags: ${{ secrets.DOCKER_HUB_USERNAME }}/maigret:latest
+      -
+        name: Image digest
+        run: echo ${{ steps.docker_build.outputs.digest }}
```
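The workflow above sets up QEMU and Buildx but passes no `platforms` input to the build step, so the image is built only for the runner's native architecture. If multi-arch images were the goal, the build step could be extended as in this sketch (the `platforms` list is an assumption, not part of the commit):

```yaml
      -
        name: Build and push
        id: docker_build
        uses: docker/build-push-action@v2
        with:
          push: true
          # hypothetical multi-arch target list; the QEMU step above
          # is what makes the non-native arm64 build possible
          platforms: linux/amd64,linux/arm64
          tags: ${{ secrets.DOCKER_HUB_USERNAME }}/maigret:latest
```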
```diff
@@ -2,6 +2,11 @@
 
 ## [Unreleased]
 
+## [0.2.3] - 2021-05-12
+* added Yelp and yelp_userid support
+* tags markup stabilization
+* improved errors detection
+
 ## [0.2.2] - 2021-05-07
 * improved ids extractors
 * updated sites and engines
```
````diff
@@ -1,40 +1,55 @@
 # Maigret
-
-
-[](https://gitter.im/maigret-osint/community)
-
 <p align="center">
-<img src="./static/maigret.png" />
+  <a href="https://pypi.org/project/maigret/">
+    <img alt="PyPI" src="https://img.shields.io/pypi/v/maigret?style=flat-square">
+  </a>
+  <a href="https://pypi.org/project/maigret/">
+    <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dw/maigret?style=flat-square">
+  </a>
+  <a href="https://gitter.im/maigret-osint/community">
+    <img alt="Chat - Gitter" src="./static/chat_gitter.svg" />
+  </a>
+  <a href="https://twitter.com/intent/follow?screen_name=sox0j">
+    <img src="https://img.shields.io/twitter/follow/sox0j?label=Follow%20sox0j&style=social&color=blue" alt="Follow @sox0j" />
+  </a>
+</p>
+<p align="center">
+  <img src="./static/maigret.png" height="200"/>
 </p>
 
 <i>The Commissioner Jules Maigret is a fictional French police detective, created by Georges Simenon. His investigation method is based on understanding the personality of different people and their interactions.</i>
 
 ## About
 
-Purpose of Maigret - **collect a dossier on a person by username only**, checking for accounts on a huge number of sites.
+**Maigret** collects a dossier on a person **by username only**, checking for accounts on a huge number of sites and gathering all the available information from web pages. Maigret is an easy-to-use and powerful fork of [Sherlock](https://github.com/sherlock-project/sherlock).
 
-This is a [sherlock](https://github.com/sherlock-project/) fork with cool features under heavy development.
-*Don't forget to regularly update source code from repo*.
-
-Currently supported more than 2000 sites ([full list](./sites.md)), by default search is launched against 500 popular sites in descending order of popularity.
+More than 2000 sites are currently supported ([full list](./sites.md)); by default the search runs against the 500 most popular sites in descending order of popularity.
 
 ## Main features
 
-* Profile pages parsing, [extracting](https://github.com/soxoj/socid_extractor) personal info, links to other profiles, etc.
+* Profile pages parsing, [extraction](https://github.com/soxoj/socid_extractor) of personal info, links to other profiles, etc.
-* Recursive search by new usernames found
+* Recursive search by new usernames and other ids found
 * Search by tags (site categories, countries)
 * Censorship and captcha detection
-* Very few false positives
-* Failed requests' restarts
+* Requests retries
 
+See the full description of Maigret features [in the Wiki](https://github.com/soxoj/maigret/wiki/Features).
 
 ## Installation
 
-**NOTE**: Python 3.6 or higher and pip is required.
-
-**Python 3.8 is recommended.**
+Maigret can be installed using pip, Docker, or simply launched from the cloned repo.
+
+You can also run Maigret in cloud shells (see the buttons below).
+
+[](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=README.md) [](https://repl.it/github/soxoj/maigret)
+
+<a href="https://colab.research.google.com/gist//soxoj/879b51bc3b2f8b695abb054090645000/maigret.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="40"></a>
 
 ### Package installing
 
+**NOTE**: Python 3.6 or higher and pip is required, **Python 3.8 is recommended.**
+
 ```bash
 # install from pypi
 pip3 install maigret
@@ -42,34 +57,36 @@ pip3 install maigret
 # or clone and install manually
 git clone https://github.com/soxoj/maigret && cd maigret
 pip3 install .
+
+# usage
+maigret username
 ```
 
 ### Cloning a repository
 
 ```bash
 git clone https://github.com/soxoj/maigret && cd maigret
-```
-
-You can use a free virtual machine, the repo will be automatically cloned:
-
-[](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/soxoj/maigret&tutorial=README.md) [](https://repl.it/github/soxoj/maigret)
-
-<a href="https://colab.research.google.com/gist//soxoj/879b51bc3b2f8b695abb054090645000/maigret.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="40"></a>
-
-```bash
 pip3 install -r requirements.txt
+
+# usage
+./maigret.py username
 ```
 
-## Using examples
+### Docker
 
 ```bash
-# for a cloned repo
-./maigret.py user
+# official image
+docker pull soxoj/maigret
 
-# for a package
-maigret user
+# usage
+docker run soxoj/maigret:latest username
+
+# manual build
+docker build -t maigret .
 ```
 
-Features:
+## Usage examples
 
 ```bash
 # make HTML and PDF reports
 maigret user --html --pdf
@@ -77,22 +94,12 @@ maigret user --html --pdf
 # search on sites marked with tags photo & dating
 maigret user --tags photo,dating
 
 # search for three usernames on all available sites
 maigret user1 user2 user3 -a
 ```
 
-Run `maigret --help` to get arguments description. Also options are documented in [the Maigret Wiki](https://github.com/soxoj/maigret/wiki/Command-line-options).
+Use `maigret --help` for a full description of the options. Options are also documented in [the Maigret Wiki](https://github.com/soxoj/maigret/wiki/Command-line-options).
 
-With Docker:
-```
-# manual build
-docker build -t maigret . && docker run maigret user
-
-# official image
-docker run soxoj/maigret:latest user
-```
-
 ## Demo with page parsing and recursive username search
````
```diff
@@ -1,5 +1,12 @@
 """Maigret"""
 
+__title__ = 'Maigret'
+__package__ = 'maigret'
+__author__ = 'Soxoj'
+__author_email__ = 'soxoj@protonmail.com'
+
+
+from .__version__ import __version__
 from .checking import maigret as search
 from .sites import MaigretEngine, MaigretSite, MaigretDatabase
 from .notify import QueryNotifyPrint as Notifier
```
```diff
@@ -0,0 +1,3 @@
+"""Maigret version file"""
+
+__version__ = '0.2.3'
```
+10 −3
```diff
@@ -13,6 +13,7 @@ import tqdm.asyncio
 from aiohttp_socks import ProxyConnector
 from python_socks import _errors as proxy_errors
 from socid_extractor import extract
+from aiohttp.client_exceptions import ServerDisconnectedError, ClientConnectorError
 
 from .activation import ParsingActivator, import_aiohttp_cookies
 from . import errors
@@ -36,6 +37,7 @@ SUPPORTED_IDS = (
     "wikimapia_uid",
     "steam_id",
     "uidme_uguid",
+    "yelp_userid",
 )
 
 BAD_CHARS = "#"
@@ -63,8 +65,10 @@ async def get_response(request_future, logger) -> Tuple[str, int, Optional[Check
 
     except asyncio.TimeoutError as e:
         error = CheckError("Request timeout", str(e))
-    except aiohttp.client_exceptions.ClientConnectorError as e:
+    except ClientConnectorError as e:
         error = CheckError("Connecting failure", str(e))
+    except ServerDisconnectedError as e:
+        error = CheckError("Server disconnected", str(e))
     except aiohttp.http_exceptions.BadHttpMessage as e:
         error = CheckError("HTTP", str(e))
     except proxy_errors.ProxyError as e:
@@ -154,7 +158,7 @@ def process_site_result(
     # additional check for errors
     if status_code and not check_error:
         check_error = detect_error_page(
-            html_text, status_code, site.errors, site.ignore403
+            html_text, status_code, site.errors_dict, site.ignore403
         )
 
     # parsing activation
@@ -268,7 +272,10 @@ def process_site_result(
             new_usernames[v] = k
 
         results_info["ids_usernames"] = new_usernames
-        results_info["ids_links"] = eval(extracted_ids_data.get("links", "[]"))
+        links = eval(extracted_ids_data.get("links", "[]"))
+        if "website" in extracted_ids_data:
+            links.append(extracted_ids_data["website"])
+        results_info["ids_links"] = links
         result.ids_data = extracted_ids_data
 
     # Save status of request
```
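The `ids_links` change above now also picks up a profile's `website` field. Its logic can be exercised in isolation with this sketch (the `merge_ids_links` helper name is hypothetical; `ast.literal_eval` stands in for the `eval` call used in the actual code, since the `links` value arrives as a string-serialized list):

```python
import ast


def merge_ids_links(extracted_ids_data: dict) -> list:
    # Deserialize the "links" string into a list (the real code uses eval;
    # ast.literal_eval is the safer equivalent for literal data)
    links = ast.literal_eval(extracted_ids_data.get("links", "[]"))
    # New behavior from the diff: append the extracted "website" field, if any
    if "website" in extracted_ids_data:
        links.append(extracted_ids_data["website"])
    return links


print(merge_ids_links({"links": "['https://a.example']", "website": "https://b.example"}))
# ['https://a.example', 'https://b.example']
```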
+3 −4
```diff
@@ -13,6 +13,7 @@ from typing import List, Tuple
 import requests
 from socid_extractor import extract, parse, __version__ as socid_version
 
+from .__version__ import __version__
 from .checking import (
     timeout_check,
     SUPPORTED_IDS,
@@ -37,8 +38,6 @@ from .submit import submit_dialog
 from .types import QueryResultWrapper
 from .utils import get_dict_ascii_tree
 
-__version__ = '0.2.2'
-
 
 def notify_about_errors(search_results: QueryResultWrapper, query_notify):
     errs = errors.extract_and_group(search_results)
@@ -49,7 +48,7 @@ def notify_about_errors(search_results: QueryResultWrapper, query_notify):
         text = f'Too many errors of type "{e["err"]}" ({e["perc"]}%)'
         solution = errors.solution_of(e['err'])
         if solution:
-            text = '. '.join([text, solution])
+            text = '. '.join([text, solution.capitalize()])
 
         query_notify.warning(text, '!')
         was_errs_displayed = True
@@ -166,7 +165,7 @@ def setup_arguments_parser():
         type=int,
         metavar='RETRIES',
         default=1,
-        help="Attempts to restart temporary failed requests.",
+        help="Attempts to restart temporarily failed requests.",
     )
     parser.add_argument(
         "-n",
```
+2197
-568
File diff suppressed because it is too large
+20 −1
```diff
@@ -53,6 +53,17 @@ SUPPORTED_TAGS = [
     "medicine",
     "reading",
     "stock",
+    "messaging",
+    "trading",
+    "links",
+    "fashion",
+    "tasks",
+    "military",
+    "auto",
+    "gambling",
+    "business",
+    "cybercriminal",
+    "review",
 ]
 
 
@@ -180,6 +191,14 @@ class MaigretSite:
 
         return result
 
+    @property
+    def errors_dict(self) -> dict:
+        errors: Dict[str, str] = {}
+        if self.engine_obj:
+            errors.update(self.engine_obj.site.get('errors', {}))
+        errors.update(self.errors)
+        return errors
+
     def get_url_type(self) -> str:
         url = URLMatcher.extract_main_part(self.url)
         if url.startswith("{username}"):
@@ -456,7 +475,7 @@ class MaigretDatabase:
             output += f"{count}\t{url}\n"
 
         output += "Top tags:\n"
-        for tag, count in sorted(tags.items(), key=lambda x: x[1], reverse=True)[:20]:
+        for tag, count in sorted(tags.items(), key=lambda x: x[1], reverse=True)[:200]:
             mark = ""
             if tag not in SUPPORTED_TAGS:
                 mark = " (non-standard)"
```
-9
@@ -2,7 +2,7 @@ import asyncio
|
|||||||
import difflib
|
import difflib
|
||||||
import re
|
import re
|
||||||
from typing import List
|
from typing import List
|
||||||
|
import xml.etree.ElementTree as ET
|
||||||
import requests
|
import requests
|
||||||
|
|
||||||
from .activation import import_aiohttp_cookies
|
from .activation import import_aiohttp_cookies
|
||||||
@@ -46,6 +46,20 @@ def get_match_ratio(x):
|
|||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_alexa_rank(site_url_main):
|
||||||
|
url = f"http://data.alexa.com/data?cli=10&url={site_url_main}"
|
||||||
|
xml_data = requests.get(url).text
|
||||||
|
root = ET.fromstring(xml_data)
|
||||||
|
alexa_rank = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
alexa_rank = int(root.find('.//REACH').attrib['RANK'])
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return alexa_rank
|
||||||
|
|
||||||
|
|
||||||
def extract_mainpage_url(url):
|
def extract_mainpage_url(url):
|
||||||
return "/".join(url.split("/", 3)[:3])
|
return "/".join(url.split("/", 3)[:3])
|
||||||
|
|
||||||
@@ -133,6 +147,7 @@ async def detect_known_engine(
|
|||||||
) -> List[MaigretSite]:
|
) -> List[MaigretSite]:
|
||||||
try:
|
try:
|
||||||
r = requests.get(url_mainpage)
|
r = requests.get(url_mainpage)
|
||||||
|
logger.debug(r.text)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(e)
|
logger.warning(e)
|
||||||
print("Some error while checking main page")
|
print("Some error while checking main page")
|
||||||
@@ -199,6 +214,7 @@ async def check_features_manually(
|
|||||||
# cookies
|
# cookies
|
||||||
cookie_dict = None
|
cookie_dict = None
|
||||||
if cookie_file:
|
if cookie_file:
|
||||||
|
logger.info(f'Use {cookie_file} for cookies')
|
||||||
cookie_jar = await import_aiohttp_cookies(cookie_file)
|
cookie_jar = await import_aiohttp_cookies(cookie_file)
|
||||||
cookie_dict = {c.key: c.value for c in cookie_jar}
|
cookie_dict = {c.key: c.value for c in cookie_jar}
|
||||||
|
|
||||||
@@ -327,17 +343,26 @@ async def submit_dialog(db, url_exists, cookie_file, logger):
|
|||||||
print(
|
print(
|
||||||
"Try to run this mode again and increase features count or choose others."
|
"Try to run this mode again and increase features count or choose others."
|
||||||
)
|
)
|
||||||
|
return False
|
||||||
else:
|
else:
|
||||||
if (
|
if (
|
||||||
input(
|
input(
|
||||||
f"Site {chosen_site.name} successfully checked. Do you want to save it in the Maigret DB? [Yn] "
|
f"Site {chosen_site.name} successfully checked. Do you want to save it in the Maigret DB? [Yn] "
|
||||||
).lower()
|
)
|
||||||
in "y"
|
.lower()
|
||||||
|
.strip("y")
|
||||||
):
|
):
|
||||||
logger.debug(chosen_site.json)
|
return False
|
||||||
site_data = chosen_site.strip_engine_data()
|
|
||||||
logger.debug(site_data.json)
|
|
||||||
db.update_site(site_data)
|
|
||||||
return True
|
|
||||||
|
|
||||||
return False
|
chosen_site.name = input("Change site name if you want: ") or chosen_site.name
|
||||||
|
chosen_site.tags = input("Site tags: ").split(',')
|
||||||
|
rank = get_alexa_rank(chosen_site.url_main)
|
||||||
|
if rank:
|
||||||
|
print(f'New alexa rank: {rank}')
|
||||||
|
chosen_site.alexa_rank = rank
|
||||||
|
|
||||||
|
logger.debug(chosen_site.json)
|
||||||
|
site_data = chosen_site.strip_engine_data()
|
||||||
|
logger.debug(site_data.json)
|
||||||
|
db.update_site(site_data)
|
||||||
|
return True
|
||||||
|
|||||||
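The new `get_alexa_rank` above fetches Alexa's XML endpoint and reads the `RANK` attribute of the `REACH` element, falling back to 0 on any parse failure. The parsing half can be tried without the network (the `parse_alexa_rank` helper name and the sample XML shape are assumptions for illustration):

```python
import xml.etree.ElementTree as ET


def parse_alexa_rank(xml_data: str) -> int:
    # Same extraction as get_alexa_rank, applied to a pre-fetched payload:
    # find the REACH element anywhere in the tree and read its RANK attribute
    root = ET.fromstring(xml_data)
    try:
        return int(root.find('.//REACH').attrib['RANK'])
    except Exception:
        # missing element or attribute -> treat as "no rank", like the diff does
        return 0


sample = '<ALEXA><SD><REACH RANK="1234"/></SD></ALEXA>'
print(parse_alexa_rank(sample))  # 1234
print(parse_alexa_rank('<ALEXA></ALEXA>'))  # 0
```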
+1 −1
```diff
@@ -26,7 +26,7 @@ python-socks==1.1.2
 requests>=2.24.0
 requests-futures==1.0.0
 six==1.15.0
-socid-extractor>=0.0.16
+socid-extractor>=0.0.19
 soupsieve==2.1
 stem==1.8.0
 torrequest==0.1.0
```
```diff
@@ -12,7 +12,7 @@ with open('requirements.txt') as rf:
     requires = rf.read().splitlines()
 
 setup(name='maigret',
-      version='0.2.2',
+      version='0.2.3',
       description='Collect a dossier on a person by username from a huge number of sites',
       long_description=long_description,
       long_description_content_type="text/markdown",
```
```diff
@@ -103,6 +103,7 @@ def test_saving_site_error():
 
     amperka = db.sites[0]
     assert len(amperka.errors) == 2
+    assert len(amperka.errors_dict) == 2
 
     assert amperka.strip_engine_data().errors == {'error1': 'text1'}
     assert amperka.strip_engine_data().json['errors'] == {'error1': 'text1'}
```
+57
@@ -0,0 +1,57 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
import random
|
||||||
|
from argparse import ArgumentParser, RawDescriptionHelpFormatter
|
||||||
|
|
||||||
|
from maigret.maigret import MaigretDatabase
|
||||||
|
from maigret.submit import get_alexa_rank
|
||||||
|
|
||||||
|
|
||||||
|
def update_tags(site):
|
||||||
|
tags = []
|
||||||
|
if not site.tags:
|
||||||
|
print(f'Site {site.name} doesn\'t have tags')
|
||||||
|
else:
|
||||||
|
tags = site.tags
|
||||||
|
print(f'Site {site.name} tags: ' + ', '.join(tags))
|
||||||
|
|
||||||
|
print(f'URL: {site.url_main}')
|
||||||
|
|
||||||
|
new_tags = set(input('Enter new tags: ').split(', '))
|
||||||
|
if "disabled" in new_tags:
|
||||||
|
new_tags.remove("disabled")
|
||||||
|
site.disabled = True
|
||||||
|
|
||||||
|
print(f'Old alexa rank: {site.alexa_rank}')
|
||||||
|
rank = get_alexa_rank(site.url_main)
|
||||||
|
if rank:
|
||||||
|
print(f'New alexa rank: {rank}')
|
||||||
|
site.alexa_rank = rank
|
||||||
|
|
||||||
|
site.tags = [x for x in list(new_tags) if x]
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
parser = ArgumentParser(formatter_class=RawDescriptionHelpFormatter
|
||||||
|
)
|
||||||
|
parser.add_argument("--base","-b", metavar="BASE_FILE",
|
||||||
|
dest="base_file", default="maigret/resources/data.json",
|
||||||
|
help="JSON file with sites data to update.")
|
||||||
|
|
||||||
|
pool = list()
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
db = MaigretDatabase()
|
||||||
|
db.load_from_file(args.base_file).sites
|
||||||
|
|
||||||
|
while True:
|
||||||
|
site = random.choice(db.sites)
|
||||||
|
if site.engine == 'uCoz':
|
||||||
|
continue
|
||||||
|
|
||||||
|
if not 'in' in site.tags:
|
||||||
|
continue
|
||||||
|
|
||||||
|
update_tags(site)
|
||||||
|
|
||||||
|
db.save_to_file(args.base_file)
|
||||||
+21 −16
```diff
@@ -37,15 +37,15 @@ def get_rank(domain_to_query, site, print_errors=True):
     try:
         #Get ranking for this site.
         site.alexa_rank = int(root.find('.//REACH').attrib['RANK'])
-        country = root.find('.//COUNTRY')
-        if not country is None and country.attrib:
-            country_code = country.attrib['CODE']
-            tags = set(site.tags)
-            if country_code:
-                tags.add(country_code.lower())
-            site.tags = sorted(list(tags))
-        if site.type != 'username':
-            site.disabled = False
+        # country = root.find('.//COUNTRY')
+        # if not country is None and country.attrib:
+        #     country_code = country.attrib['CODE']
+        #     tags = set(site.tags)
+        #     if country_code:
+        #         tags.add(country_code.lower())
+        #     site.tags = sorted(list(tags))
+        # if site.type != 'username':
+        #     site.disabled = False
     except Exception as e:
         if print_errors:
             logging.error(e)
@@ -74,6 +74,7 @@ if __name__ == '__main__':
                         dest="base_file", default="maigret/resources/data.json",
                         help="JSON file with sites data to update.")
 
+    parser.add_argument('--with-rank', help='update with use of local data only', action='store_true')
     parser.add_argument('--empty-only', help='update only sites without rating', action='store_true')
     parser.add_argument('--exclude-engine', help='do not update score with certain engine',
                         action="append", dest="exclude_engine_list", default=[])
@@ -93,22 +94,25 @@ Rank data fetched from Alexa by domains.
 """)
 
     for site in sites_subset:
+        if not args.with_rank:
+            break
         url_main = site.url_main
         if site.alexa_rank < sys.maxsize and args.empty_only:
             continue
         if args.exclude_engine_list and site.engine in args.exclude_engine_list:
             continue
         site.alexa_rank = 0
-        th = threading.Thread(target=get_rank, args=(url_main, site))
+        th = threading.Thread(target=get_rank, args=(url_main, site,))
         pool.append((site.name, url_main, th))
         th.start()
 
-    index = 1
-    for site_name, url_main, th in pool:
-        th.join()
-        sys.stdout.write("\r{0}".format(f"Updated {index} out of {len(sites_subset)} entries"))
-        sys.stdout.flush()
-        index = index + 1
+    if args.with_rank:
+        index = 1
+        for site_name, url_main, th in pool:
+            th.join()
+            sys.stdout.write("\r{0}".format(f"Updated {index} out of {len(sites_subset)} entries"))
+            sys.stdout.flush()
+            index = index + 1
 
     sites_full_list = [(s, s.alexa_rank) for s in sites_subset]
 
@@ -123,6 +127,7 @@ Rank data fetched from Alexa by domains.
     url_main = site.url_main
     valid_rank = get_step_rank(rank)
     all_tags = site.tags
+    all_tags.sort()
     tags = ', ' + ', '.join(all_tags) if all_tags else ''
     note = ''
     if site.disabled:
```