Mirror of https://github.com/soxoj/maigret.git, synced 2026-05-06 14:08:59 +00:00
Improve site-check quality: fix broken site configs, add diagnostic utilities, and make self-check report-only by default with opt-in auto-disable. (#2301)
- Fix VK and TradingView checkType; add Reddit and Microsoft Learn API-style probes where appropriate; adjust or disable entries that are unreliable under anti-bot protection.
- Self-check: stop aggressive auto-disable; default to reporting issues only; add --auto-disable and --diagnose for optional fixes and deeper output.
- Tooling: add utils/site_check.py and utils/check_top_n.py (and related helpers) to inspect and rank site behavior against the top-N list.
- Scope: aligns with fixing top-traffic / high-impact sites and making diagnostics repeatable without silently flipping disabled flags.
+195 -2
@@ -20,6 +20,13 @@ For other `checkType` values, [`make_site_result`](../maigret/checking.py) sets

Sites with an `engine` field (e.g. XenForo) are merged with a template from the `engines` section in [`maigret/resources/data.json`](../maigret/resources/data.json) ([`MaigretSite.update_from_engine`](../maigret/sites.py)).

### `urlProbe`: probe URL vs reported profile URL

- **`url`** — pattern for the **public profile page** users should open (what appears in reports as `url_user`). Supports `{username}`, `{urlMain}`, `{urlSubpath}`; the username segment is URL-encoded when the string is built ([`make_site_result`](../maigret/checking.py)).
- **`urlProbe`** (optional) — if set, Maigret sends the HTTP **GET** (or HEAD where applicable) to **this** URL for the check, instead of to `url`. Same placeholders. Use it when the reliable signal is a **JSON/API** endpoint but the human-facing link must stay on the main site (e.g. `https://picsart.com/u/{username}` + probe `https://api.picsart.com/users/show/{username}.json`, or GitHub’s `https://github.com/{username}` + `https://api.github.com/users/{username}`).

If `urlProbe` is omitted, the probe URL defaults to `url`.
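
As a minimal sketch, the two fields pair up in a `data.json` entry like this (GitHub example from above; all other required fields omitted):

```json
{
  "url": "https://github.com/{username}",
  "urlProbe": "https://api.github.com/users/{username}"
}
```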

### Redirects and final URL as a signal

If the **HTML shell** looks the same for “user exists” and “user does not exist” (typical SPA), it is still worth checking whether the **server** behaves differently:
@@ -39,7 +46,7 @@ If that differs reliably, you may be able to use **`checkType`: `response_url`**
| **Bibsonomy** | Both requests redirect to **`/pow-challenge/?return=/user/...`** (proof-of-work). Only the `return` path changes with the username; **both** existing and fake hit the same challenge flow — not a profile-vs-missing distinction. |
| **Picsart (web UI `https://picsart.com/u/{username}`)** | Only a **trailing-slash** `301`; the first HTML is the same empty app shell (~3 KiB) for real and fake users. Browser-only routes such as `…/posts` vs `…/not-found` are **not** visible as additional HTTP redirects in this pipeline. |

**Picsart — workable check via public API.** The site exposes **`https://api.picsart.com/users/show/{username}.json`**: JSON with `"status":"success"` and a user object when the account exists, and `"reason":"user_not_found"` when it does not. Put that URL in **`urlProbe`**, set **`url`** to the web profile pattern **`https://picsart.com/u/{username}`**, and use **`checkType`: `message`** with narrow `presenseStrs` / `absenceStrs` so reports show the human link while the request hits the API (see **`urlProbe`** above).
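
A sketch of such an entry; the `presenseStrs` / `absenceStrs` values below are illustrative picks from the API responses just described, not necessarily the exact strings shipped in `data.json`:

```json
{
  "url": "https://picsart.com/u/{username}",
  "urlProbe": "https://api.picsart.com/users/show/{username}.json",
  "checkType": "message",
  "presenseStrs": ["\"status\":\"success\""],
  "absenceStrs": ["user_not_found"]
}
```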

For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unlock a safe check without PoW / richer signals; keep **`disabled: true`** until something stable appears (API, SSR markers, etc.).
@@ -49,7 +56,7 @@ For **Kaskus** and **Bibsonomy**, HTTP-level comparison still does **not** unloc

### 2.1 Public JSON API (always)

When diagnosing a site—especially **SPAs**, **soft 404s**, or **near-identical HTML** for real vs fake users—**routinely look for a public JSON (or JSON-like) API** used for profile or user lookup. Typical leads: paths containing `/api/`, `/v1/`, `graphql`, `users/show`, `.json` suffixes, or the same endpoints mobile apps use. Verify with `curl` (or the Maigret request path) that **claimed** and **unclaimed** usernames produce **reliably different** bodies or status codes. If such an endpoint is more stable than HTML, put it in **`urlProbe`** and keep **`url`** as the canonical profile page on the main site (see **`urlProbe`** in section 1). If there is no separate public URL for humans, you may still point **`url`** at the API only (reports will show that URL).
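
For example, probing the Picsart endpoint from above (the usernames are the claimed/unclaimed pair from its `data.json` entry):

```bash
# Claimed user: expect "status":"success" plus a user object
curl -s 'https://api.picsart.com/users/show/adam.json'

# Unclaimed user: expect "reason":"user_not_found"
curl -s 'https://api.picsart.com/users/show/noonewouldeverusethis7.json'
```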

This is a **standard** part of site-check work, not an optional extra.
@@ -177,6 +184,192 @@ In those cases **`disabled: true`** is better than false “found”; remove the

---

## 6. Development utilities

### 6.1 `utils/site_check.py` — Single site diagnostics

A comprehensive utility for testing individual sites with multiple modes:

```bash
# Basic comparison of claimed vs unclaimed (aiohttp)
python utils/site_check.py --site "VK" --check-claimed

# Test via Maigret's checker directly
python utils/site_check.py --site "VK" --maigret

# Compare aiohttp vs Maigret results (find discrepancies)
python utils/site_check.py --site "VK" --compare-methods

# Full diagnosis with recommendations
python utils/site_check.py --site "VK" --diagnose

# Test with a custom URL
python utils/site_check.py --url "https://example.com/{username}" --compare user1 user2

# Find a valid username for a site
python utils/site_check.py --site "VK" --find-user
```

**Key features:**

- `--maigret` — Uses Maigret's actual checking code, not raw aiohttp
- `--compare-methods` — Shows if aiohttp and Maigret see different results (useful for debugging)
- `--diagnose` — Validates `checkType` against actual responses, suggests fixes
- Color output with marker detection (captcha, cloudflare, login, etc.)
- `--json` flag for machine-readable output

**When to use each mode:**

| Mode | Use case |
|------|----------|
| `--check-claimed` | Quick sanity check: do claimed/unclaimed still differ? |
| `--maigret` | Verify Maigret's actual behavior matches expectations |
| `--compare-methods` | Debug "works in curl but fails in Maigret" issues |
| `--diagnose` | Full analysis when a site is broken, with fix recommendations |

### 6.2 `utils/check_top_n.py` — Mass site checking

Batch-check top N sites by Alexa rank with categorized reporting:

```bash
# Check top 100 sites
python utils/check_top_n.py --top 100

# Faster with more parallelism
python utils/check_top_n.py --top 100 --parallel 10

# Output JSON report
python utils/check_top_n.py --top 100 --output report.json

# Only show broken sites
python utils/check_top_n.py --top 100 --only-broken
```

**Output categories:**

- `working` — Site check passes
- `broken` — Check fails (wrong status, missing markers)
- `timeout` — Request timed out
- `anti_bot` — 403/429 or captcha detected
- `error` — Connection or other errors
- `disabled` — Already disabled in data.json

**Report includes:**

- Summary counts by category
- List of broken sites with issues
- Recommendations for fixes (e.g., "Switch to checkType: status_code")

### 6.3 Self-check behavior (`--self-check`)

The self-check command has been improved to be less aggressive:

```bash
# Check sites WITHOUT auto-disabling (default)
maigret --self-check --site "VK"

# Auto-disable failing sites (old behavior)
maigret --self-check --site "VK" --auto-disable

# Show detailed diagnosis for each failure
maigret --self-check --site "VK" --diagnose
```

**Behavior changes:**

| Flag | Effect |
|------|--------|
| `--self-check` alone | Reports issues but does NOT disable sites |
| `--auto-disable` | Automatically disables sites that fail (opt-in) |
| `--diagnose` | Prints detailed diagnosis with recommendations |

**Why this matters:**

- Old behavior was too aggressive — sites got disabled without explanation
- New behavior reports issues and suggests fixes
- Explicit `--auto-disable` is required to modify the database
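
With the report-only default, a failing site produces output along these lines (messages reconstructed from the self-check code in this change; site name and counts are illustrative):

```
Issues found in VK: 1 (not auto-disabled)

Found issues in 1 sites (auto-disable is OFF)
Use --auto-disable to automatically disable failing sites
Use --diagnose to see detailed diagnosis for each site
```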

---

## 7. Lessons learned (practical observations)

Collected from hands-on work fixing top-ranked sites (Reddit, Wikipedia, Microsoft Learn, Baidu, etc.).

### 7.1 JSON API is the first thing to look for

Both Reddit and Microsoft Learn had working public APIs that solved the problem entirely. The web pages were SPAs or blocked by anti-bot measures, but the APIs worked reliably:

- **Reddit**: `https://api.reddit.com/user/{username}/about` — returns JSON with user data or `{"message": "Not Found", "error": 404}`.
- **Microsoft Learn**: `https://learn.microsoft.com/api/profiles/{username}` — returns JSON with a `userName` field or HTTP 404.

This confirms the playbook recommendation: always check for `/api/`, `.json`, and GraphQL endpoints before giving up on a site.
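
A quick shell verification of the Reddit endpoint (the `User-Agent` value matches the header from Reddit's `data.json` entry; the usernames are its claimed/unclaimed pair):

```bash
# Claimed user: JSON with user data, including "name":
curl -s -A 'maigret/0.4' 'https://api.reddit.com/user/blue/about'

# Unclaimed user: {"message": "Not Found", "error": 404}
curl -s -A 'maigret/0.4' 'https://api.reddit.com/user/noonewouldeverusethis7/about'
```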

### 7.2 `urlProbe` is a powerful tool

It separates "what we check" (API) from "what we show the user" (human-readable profile URL). Reddit is a perfect example:

```json
{
  "url": "https://www.reddit.com/user/{username}",
  "urlProbe": "https://api.reddit.com/user/{username}/about",
  "checkType": "message",
  "presenseStrs": ["\"name\":"],
  "absenceStrs": ["Not Found"]
}
```

The check hits the API, but reports display `www.reddit.com/user/blue`.
### 7.3 aiohttp ≠ curl ≠ requests

Wikipedia returned HTTP 200 for `curl` and Python `requests`, but HTTP 403 for `aiohttp`. This is **TLS fingerprinting** — the server identifies the HTTP library by cryptographic characteristics of the TLS handshake, not by headers.

**Key insight:** Changing `User-Agent` does **not** help against TLS fingerprinting. Always test with aiohttp directly (or via Maigret with `-vvv` and `debug.log`), not just `curl`.
```python
import asyncio

import aiohttp

# Example URL: the Wikipedia user-page pattern with the claimed username from data.json
URL = "https://en.wikipedia.org/wiki/User:Hoadlck"

async def main():
    # This returns 403 for Wikipedia even with browser UA:
    async with aiohttp.ClientSession() as session:
        async with session.get(URL, headers={"User-Agent": "Mozilla/5.0 ..."}) as resp:
            print(resp.status)  # 403: blocked at the TLS level, not by headers

asyncio.run(main())
```

### 7.4 HTTP 403 in Maigret can mean different things

Initially it seemed Wikipedia was returning 403, but `curl` showed 200. Only `debug.log` revealed the real picture — aiohttp was getting blocked at the TLS level.

**Lesson:** Use the `-vvv` flag and inspect `debug.log` for the raw response status and body. The warning message alone may be misleading.
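
A typical inspection loop looks roughly like this (the exact `grep` slicing is an assumption; `-vvv` and `debug.log` are from the lesson above):

```bash
# Re-run the single site with maximum verbosity, then read the raw responses
maigret blue --site Wikipedia -vvv
grep -i -A 3 'wikipedia' debug.log
```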

### 7.5 Dead services migrate, not disappear

MSDN Social and TechNet profiles redirected to Microsoft Learn. Instead of deleting old entries:

1. Keep old entries with `disabled: true` as a historical record.
2. Create a new entry for the current service with a working API.

This preserves the audit trail and avoids breaking existing workflows.
### 7.6 `status_code` is more reliable than `message` for APIs

Microsoft Learn API returns HTTP 404 for non-existent users — a clean signal without HTML parsing. For JSON APIs that return proper HTTP status codes, `status_code` is often the best choice:

```json
{
  "checkType": "status_code",
  "urlProbe": "https://learn.microsoft.com/api/profiles/{username}"
}
```

No need for fragile string matching when the API speaks HTTP correctly.
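
The split is easy to confirm from the shell; `curl -w '%{http_code}'` prints only the HTTP status:

```bash
# Existing profile: expect 200; missing profile: expect 404
curl -s -o /dev/null -w '%{http_code}\n' 'https://learn.microsoft.com/api/profiles/blue'
curl -s -o /dev/null -w '%{http_code}\n' 'https://learn.microsoft.com/api/profiles/noonewouldeverusethis7'
```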

### 7.7 The playbook classification works

The decision tree from the documentation accurately describes real-world cases:

| Situation | Playbook says | Actual result |
|-----------|---------------|---------------|
| Captcha (Baidu) | `disabled: true` | Correct |
| TLS fingerprinting (Wikipedia) | `disabled: true` (anti-bot) | Correct |
| Working API available (Reddit, MS Learn) | Use `urlProbe` | Correct |
| Service migrated (MSDN → MS Learn) | Update URL or create new entry | Correct |

---

## Documentation maintenance

For any of the changes below, **always** keep these artifacts in sync — this file ([`site-checks-guide.md`](site-checks-guide.md)), [`site-checks-playbook.md`](site-checks-playbook.md), and (when rules or templates change) the header/template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log):
@@ -6,7 +6,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource

## 0. Standard checks (do alongside reproduce / classify)

- **Public JSON API:** always look for a stable JSON (or GraphQL JSON) profile endpoint (`/api/`, `.json`, mobile-style URLs). When the API is more reliable than HTML, set **`urlProbe`** to that endpoint and keep **`url`** as the human-readable profile link (e.g. `https://picsart.com/u/{username}`). If there is no separate profile URL, use the API as `url` only. Details: **`urlProbe`** and section **2.1** in [`site-checks-guide.md`](site-checks-guide.md).
- **`socid_extractor` log (mandatory):** if you find **embedded user JSON in HTML** or a **standalone JSON profile API**, append a dated entry (with **example username**) to [`socid_extractor_improvements.log`](socid_extractor_improvements.log). Details: section **2.2** in [`site-checks-guide.md`](site-checks-guide.md).

## 1. Reproduce
@@ -29,7 +29,7 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource

## 3. Data edits

1. Update `url` / `urlMain` if needed (HTTPS redirects). Use the optional **`urlProbe`** when the HTTP check should hit a different URL than the profile link shown in reports (API vs web UI).
2. For `message`: **always** tune string pairs so `absenceStrs` fire on “no user” pages and `presenseStrs` fire on real profiles without false absence hits.
3. Engine (`engine`, e.g. XenForo): override only differing fields in the site entry so other sites are not broken.
4. Keep `status_code` only if the response **reliably** differs by status code without soft 404.
@@ -44,6 +44,34 @@ Short checklist for edits to [`maigret/resources/data.json`](../maigret/resource

- `process_site_result` uses strict comparison to `"status_code"` for `checkType` (not a substring trick).
- Empty `presenseStrs` with `message` means “presence always true”; a debug line is logged only at DEBUG level.

## 6. Development utilities

Quick reference for site check utilities. Full details: section **6** in [`site-checks-guide.md`](site-checks-guide.md).

| Command | Purpose |
|---------|---------|
| `python utils/site_check.py --site "X" --check-claimed` | Quick aiohttp comparison |
| `python utils/site_check.py --site "X" --maigret` | Test via Maigret checker |
| `python utils/site_check.py --site "X" --compare-methods` | Find aiohttp vs Maigret discrepancies |
| `python utils/site_check.py --site "X" --diagnose` | Full diagnosis with fix recommendations |
| `python utils/check_top_n.py --top 100` | Mass-check top 100 sites |
| `maigret --self-check --site "X"` | Self-check (reports only, no auto-disable) |
| `maigret --self-check --site "X" --auto-disable` | Self-check with auto-disable |
| `maigret --self-check --site "X" --diagnose` | Self-check with detailed diagnosis |

## 7. Quick tips (lessons learned)

Practical observations from fixing top-ranked sites. Full details: section **7** in [`site-checks-guide.md`](site-checks-guide.md).

| Tip | Why it matters |
|-----|----------------|
| **API first** | Reddit, Microsoft Learn — APIs worked when web pages were blocked. Always check `/api/`, `.json` endpoints. |
| **`urlProbe` separates check from display** | Check via API, show human URL in reports. Example: Reddit API → `www.reddit.com/user/` link. |
| **aiohttp ≠ curl** | Wikipedia returned 200 for curl, 403 for aiohttp (TLS fingerprinting). Always test with Maigret directly. |
| **Use `debug.log`** | Run with `-vvv` to see raw response. Warning messages alone can be misleading. |
| **`status_code` for clean APIs** | If API returns proper 404 for missing users, prefer `status_code` over `message`. |
| **Migrate, don't delete** | MSDN → Microsoft Learn: keep old entry disabled, create new one for current service. |

## 8. Documentation maintenance

When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.
@@ -115,11 +115,22 @@ There are few options for sites data.json helpful in various cases:

- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives

``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default Maigret performs the HTTP request to the same URL as ``url`` (the public profile link pattern).

If you set ``urlProbe`` in ``data.json``, Maigret **fetches** that URL for the presence check (API, GraphQL, JSON endpoint, etc.), while **reports and** ``url_user`` still use ``url`` — the human-readable profile page users should open.

Placeholders: ``{username}``, ``{urlMain}``, ``{urlSubpath}`` (same as for ``url``). Example: GitHub uses ``url`` ``https://github.com/{username}`` and ``urlProbe`` ``https://api.github.com/users/{username}``; Picsart uses the web profile ``https://picsart.com/u/{username}`` and probes ``https://api.picsart.com/users/show/{username}.json``.
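
For instance, the Picsart pairing as a minimal ``data.json`` sketch (other required fields omitted):

.. code-block:: json

    {
        "url": "https://picsart.com/u/{username}",
        "urlProbe": "https://api.picsart.com/users/show/{username}.json"
    }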

Implementation: ``make_site_result`` in `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py>`_.

Site check fixes using LLM
--------------------------

.. note::
    The ``LLM/`` directory at the root of the repository contains detailed instructions for editing site checks (in Markdown format): checklist, full guide to ``checkType`` / ``data.json`` / ``urlProbe``, handling false positives, searching for public JSON APIs, and the proposal log for ``socid_extractor``.

Main files:
+102 -13
@@ -826,9 +826,21 @@ async def site_self_check(
    i2p_proxy=None,
    skip_errors=False,
    cookies=None,
    auto_disable=False,
    diagnose=False,
):
    """
    Self-check a site configuration.

    Args:
        auto_disable: If True, automatically disable sites that fail checks.
            If False (default), only report issues without disabling.
        diagnose: If True, print detailed diagnosis information.
    """
    changes = {
        "disabled": False,
        "issues": [],
        "recommendations": [],
    }

    check_data = [
@@ -838,6 +850,8 @@ async def site_self_check(
    logger.info(f"Checking {site.name}...")

    results_cache = {}

    for username, status in check_data:
        async with semaphore:
            results_dict = await maigret(
@@ -859,15 +873,20 @@ async def site_self_check(
            # TODO: make normal checking
            if site.name not in results_dict:
                logger.info(results_dict)
                changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
                if auto_disable:
                    changes["disabled"] = True
                continue

            logger.debug(results_dict)

            result = results_dict[site.name]["status"]
            results_cache[username] = results_dict[site.name]

            if result.error and 'Cannot connect to host' in result.error.desc:
                changes["issues"].append(f"Cannot connect to host")
                if auto_disable:
                    changes["disabled"] = True

            site_status = result.status
@@ -875,6 +894,8 @@ async def site_self_check(
                if site_status == MaigretCheckStatus.UNKNOWN:
                    msgs = site.absence_strs
                    etype = site.check_type
                    error_msg = f"Error checking {username}: {result.context}"
                    changes["issues"].append(error_msg)
                    logger.warning(
                        f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
                    )
@@ -884,28 +905,62 @@ async def site_self_check(
                    if skip_errors:
                        pass
                    # don't disable in case of available username
                    elif status == MaigretCheckStatus.CLAIMED and auto_disable:
                        changes["disabled"] = True
                elif status == MaigretCheckStatus.CLAIMED:
                    changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
                    logger.warning(
                        f"Not found `{username}` in {site.name}, must be claimed"
                    )
                    logger.info(results_dict[site.name])
                    if auto_disable:
                        changes["disabled"] = True
                else:
                    changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
                    logger.warning(f"Found `{username}` in {site.name}, must be available")
                    logger.info(results_dict[site.name])
                    if auto_disable:
                        changes["disabled"] = True

    logger.info(f"Site {site.name} checking is finished")

    # Generate recommendations based on issues
    if changes["issues"] and len(results_cache) == 2:
        claimed_result = results_cache.get(site.username_claimed, {})
        unclaimed_result = results_cache.get(site.username_unclaimed, {})

        claimed_http = claimed_result.get("http_status")
        unclaimed_http = unclaimed_result.get("http_status")

        if claimed_http and unclaimed_http:
            if claimed_http != unclaimed_http and site.check_type != "status_code":
                changes["recommendations"].append(
                    f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
                )

    # Print diagnosis if requested
    if diagnose and changes["issues"]:
        print(f"\n--- {site.name} DIAGNOSIS ---")
        print(f"  Check type: {site.check_type}")
        print(f"  Issues:")
        for issue in changes["issues"]:
            print(f"    - {issue}")
        if changes["recommendations"]:
            print(f"  Recommendations:")
            for rec in changes["recommendations"]:
                print(f"    -> {rec}")

    # Only modify site if auto_disable is enabled
    if auto_disable and changes["disabled"] != site.disabled:
        site.disabled = changes["disabled"]
        logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
        db.update_site(site)
        if not silent:
            action = "Disabled" if site.disabled else "Enabled"
            print(f"{action} site {site.name}...")
    elif changes["issues"] and not silent and not diagnose:
        # Report issues without disabling
        print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")

    # remove service tag "unchecked"
    if "unchecked" in site.tags:
@@ -924,10 +979,24 @@ async def self_check(
    proxy=None,
    tor_proxy=None,
    i2p_proxy=None,
    auto_disable=False,
    diagnose=False,
) -> dict:
    """
    Run self-check on sites.

    Args:
        auto_disable: If True, automatically disable sites that fail checks.
            If False (default), only report issues without disabling.
        diagnose: If True, print detailed diagnosis for each failing site.

    Returns:
        dict with 'needs_update' bool and 'results' list of check results
    """
    sem = asyncio.Semaphore(max_connections)
    tasks = []
    all_sites = site_data
    all_results = []

    def disabled_count(lst):
        return len(list(filter(lambda x: x.disabled, lst)))
@@ -939,15 +1008,18 @@ async def self_check(
    for _, site in all_sites.items():
        check_coro = site_self_check(
            site, logger, sem, db, silent, proxy, tor_proxy, i2p_proxy,
            skip_errors=True, auto_disable=auto_disable, diagnose=diagnose
        )
        future = asyncio.ensure_future(check_coro)
        tasks.append((site.name, future))

    if tasks:
        with alive_bar(len(tasks), title='Self-checking', force_tty=True) as progress:
            for site_name, f in tasks:
                result = await f
                result['site_name'] = site_name
                all_results.append(result)
                progress()  # Update the progress bar

    unchecked_new_count = len(
@@ -956,7 +1028,10 @@ async def self_check(
    disabled_new_count = disabled_count(all_sites.values())
    total_disabled = disabled_new_count - disabled_old_count

    # Count issues
    total_issues = sum(1 for r in all_results if r.get('issues'))

    if auto_disable and total_disabled:
        if total_disabled >= 0:
            message = "Disabled"
        else:
@@ -968,11 +1043,25 @@ async def self_check(
            f"{message} {total_disabled} ({disabled_old_count} => {disabled_new_count}) checked sites. "
            "Run with `--info` flag to get more information"
        )
    elif total_issues and not silent:
        print(f"\nFound issues in {total_issues} sites (auto-disable is OFF)")
        print("Use --auto-disable to automatically disable failing sites")
        print("Use --diagnose to see detailed diagnosis for each site")

    if unchecked_new_count != unchecked_old_count:
        print(f"Unchecked sites verified: {unchecked_old_count - unchecked_new_count}")

    needs_update = total_disabled != 0 or unchecked_new_count != unchecked_old_count

    # For backwards compatibility, return bool if auto_disable is True
    if auto_disable:
        return needs_update

    return {
        'needs_update': needs_update,
        'results': all_results,
        'total_issues': total_issues,
    }


def extract_ids_data(html_text, logger, site) -> Dict:
+23 -2
@@ -316,7 +316,19 @@ def setup_arguments_parser(settings: Settings):
    "--self-check",
    action="store_true",
    default=settings.self_check_enabled,
    help="Do self check for sites and database. Use --auto-disable to disable failing sites.",
)
modes_group.add_argument(
    "--auto-disable",
    action="store_true",
    default=False,
    help="With --self-check: automatically disable sites that fail checks.",
)
modes_group.add_argument(
    "--diagnose",
    action="store_true",
    default=False,
    help="With --self-check: print detailed diagnosis for each failing site.",
)
modes_group.add_argument(
    "--stats",
@@ -566,7 +578,7 @@ async def main():
    query_notify.success(
        f'Maigret sites database self-check started for {len(site_data)} sites...'
    )
    check_result = await self_check(
        db,
        site_data,
        logger,
@@ -574,7 +586,16 @@ async def main():
        max_connections=args.connections,
        tor_proxy=args.tor_proxy,
        i2p_proxy=args.i2p_proxy,
        auto_disable=args.auto_disable,
        diagnose=args.diagnose,
    )

    # Handle both old (bool) and new (dict) return types
    if isinstance(check_result, dict):
        is_need_update = check_result.get('needs_update', False)
    else:
        is_need_update = check_result

    if is_need_update:
        if input('Do you want to save changes permanently? [Yn]\n').lower() in (
            'y',
+51 -23
@@ -3214,18 +3214,17 @@
        " <h1>404 Page not found</h1>",
        "_404-header",
        "_404-inner-container",
        " no-nav ",
        "not found."
    ],
    "presenseStrs": [
        "\"player_id\":",
        "\"@id\":\"https://api.chess.com/pub/player/"
    ],
    "alexaRank": 211,
    "urlMain": "https://www.chess.com",
    "url": "https://www.chess.com/member/{username}",
    "urlProbe": "https://api.chess.com/pub/player/{username}",
    "usernameClaimed": "sexytwerker69",
    "usernameUnclaimed": "aublurbrxm",
    "headers": {
@@ -4929,6 +4928,7 @@
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"Etsy": {
    "disabled": true,
    "tags": [
        "shopping",
        "us"
@@ -7385,11 +7385,18 @@
    "tags": [
        "in"
    ],
    "checkType": "message",
    "presenseStrs": [
        "id=\"profileApp\""
    ],
    "absenceStrs": [
        "Guru.com - Page Not Found",
        "Guru.com - Content Deleted"
    ],
    "alexaRank": 4420,
    "urlMain": "https://www.guru.com",
    "url": "https://www.guru.com/freelancers/{username}",
    "usernameClaimed": "longhui-zhao",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"GuruShots": {
@@ -10294,6 +10301,19 @@
    "usernameClaimed": "blue",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"MicrosoftLearn": {
    "tags": [
        "tech",
        "us"
    ],
    "checkType": "status_code",
    "alexaRank": 21,
    "urlMain": "https://learn.microsoft.com",
    "url": "https://learn.microsoft.com/en-us/users/{username}",
    "urlProbe": "https://learn.microsoft.com/api/profiles/{username}",
    "usernameClaimed": "blue",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"Minecraft-statistic": {
    "tags": [
        "ru",
@@ -12345,7 +12365,8 @@
    ],
    "alexaRank": 8904,
    "urlMain": "https://picsart.com/",
    "url": "https://picsart.com/u/{username}",
    "urlProbe": "https://api.picsart.com/users/show/{username}.json",
    "usernameClaimed": "adam",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
@@ -12806,6 +12827,7 @@
    "tags": [
        "porn"
    ],
    "disabled": true,
    "checkType": "message",
    "presenseStrs": [
        "profileInformation"
@@ -12817,7 +12839,7 @@
    "alexaRank": 74,
    "urlMain": "https://pornhub.com/",
    "url": "https://pornhub.com/users/{username}",
    "usernameClaimed": "verified",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"PornhubPornstars": {
@@ -13640,14 +13662,18 @@
    ],
    "checkType": "message",
    "absenceStrs": [
        "Not Found"
    ],
    "presenseStrs": [
        "\"name\":"
    ],
    "headers": {
        "User-Agent": "maigret/0.4"
    },
    "alexaRank": 19,
    "urlMain": "https://www.reddit.com/",
    "url": "https://www.reddit.com/user/{username}",
    "urlProbe": "https://api.reddit.com/user/{username}/about",
    "usernameClaimed": "blue",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
@@ -16690,13 +16716,7 @@
        "trading",
        "us"
    ],
    "checkType": "status_code",
    "alexaRank": 61,
    "urlMain": "https://www.tradingview.com/",
    "url": "https://www.tradingview.com/u/{username}",
@@ -17185,6 +17205,7 @@
    "usernameUnclaimed": "noonewouldeverusethis7"
},
"Udemy": {
    "disabled": true,
    "tags": [
        "in"
    ],
@@ -17357,7 +17378,7 @@
    "tags": [
        "ru"
    ],
    "checkType": "status_code",
    "regexCheck": "^(?!id\\d)\\w*$",
    "alexaRank": 27,
    "urlMain": "https://vk.com/",
@@ -17584,7 +17605,7 @@
        "method": "vimeo"
    },
    "headers": {
        "Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3NzQxOTIxNDAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYzdmMWJkYjAtMGZiMi00M2JiLTg0N2YtMGY5ZGViYTdkOGY0In0._ork2l2kSy1Xn4Pj8WmYvUfAezmXJeXxOZCoHAs5Q2M"
    },
    "urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Cmetadata.connections.vimeo_experts.is_enrolled%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories%2Cis_expert%2Cprofile_discovery%2Cwebsites%2Ccontact_emails&fetch_user_profile=1",
    "checkType": "status_code",
@@ -18189,6 +18210,7 @@
    "usernameUnclaimed": "noonewouldeverusethis77777"
},
"Wikipedia": {
    "disabled": true,
    "tags": [
        "wiki"
    ],
@@ -18198,8 +18220,8 @@
        "Wikipedia does not have a"
    ],
    "alexaRank": 12,
    "urlMain": "https://en.wikipedia.org/",
    "url": "https://en.wikipedia.org/wiki/User:{username}",
    "usernameClaimed": "Hoadlck",
    "usernameUnclaimed": "noonewouldeverusethis7"
},
@@ -18743,6 +18765,7 @@
    "usernameUnclaimed": "noonewouldeverusethis77777"
},
"YandexMusic": {
    "disabled": true,
    "tags": [
        "music",
        "ru"
@@ -31073,6 +31096,7 @@
    "alexaRank": 1513399
},
"Baidu": {
    "disabled": true,
    "absenceStrs": [
        "error_404_iframe"
    ],
@@ -31868,6 +31892,7 @@
    ]
},
"rblx.trade": {
    "disabled": true,
    "absenceStrs": [
        "isRblxTradeException"
    ],
@@ -31960,6 +31985,7 @@
    ]
},
"giters.com": {
    "disabled": true,
    "absenceStrs": [
        "This page could not be found"
    ],
@@ -31978,6 +32004,7 @@
    ]
},
"githubplus.com": {
    "disabled": true,
    "absenceStrs": [
        "preconnect"
    ],
@@ -32166,6 +32193,7 @@
    ]
},
"Aparat": {
    "disabled": true,
    "absenceStrs": [
        "404 - Page Not Found"
    ],
@@ -1,5 +1,5 @@

## List of supported sites (search methods): total 3144

Rank data fetched from Alexa by domains.
@@ -8,13 +8,14 @@ Rank data fetched from Alexa by domains.
1.  [GooglePlayStore (https://play.google.com/store)](https://play.google.com/store)*: top 1, apps, us*
1.  [YouTube (https://www.youtube.com/)](https://www.youtube.com/)*: top 2, video*
1.  [YouTube User (https://www.youtube.com/)](https://www.youtube.com/)*: top 2, video*
1.  [Baidu (https://tieba.baidu.com)](https://tieba.baidu.com)*: top 3, cn*, search is disabled
1.  [Facebook (https://www.facebook.com/)](https://www.facebook.com/)*: top 10, networking*
1.  [Amazon (https://amazon.com)](https://amazon.com)*: top 50, us*
1.  [Wikipedia (https://en.wikipedia.org/)](https://en.wikipedia.org/)*: top 50, wiki*, search is disabled
1.  [Reddit (https://www.reddit.com/)](https://www.reddit.com/)*: top 50, discussion, news*
1.  [social.msdn.microsoft.com (https://social.msdn.microsoft.com)](https://social.msdn.microsoft.com)*: top 50, us*, search is disabled
1.  [MicrosoftTechNet (https://social.technet.microsoft.com)](https://social.technet.microsoft.com)*: top 50, us*, search is disabled
1.  [MicrosoftLearn (https://learn.microsoft.com)](https://learn.microsoft.com)*: top 50, tech, us*
1.  [Weibo (https://weibo.com)](https://weibo.com)*: top 50, cn, networking*
1.  [GitHubGist (https://gist.github.com)](https://gist.github.com)*: top 50, coding, sharing*
1.  [VK (https://vk.com/)](https://vk.com/)*: top 50, ru*
@@ -52,7 +53,7 @@ Rank data fetched from Alexa by domains.
1.  [YandexBugbounty (https://yandex.ru/bugbounty/)](https://yandex.ru/bugbounty/)*: top 50, hacking, ru*, search is disabled
1.  [YandexCollections API (by yandex_public_id) (https://yandex.ru/collections/)](https://yandex.ru/collections/)*: top 50, ru, sharing*
1.  [YandexMarket (https://market.yandex.ru/)](https://market.yandex.ru/)*: top 50, ru*
1.  [YandexMusic (https://music.yandex.ru/)](https://music.yandex.ru/)*: top 50, music, ru*, search is disabled
1.  [YandexZnatoki (https://yandex.ru/q/)](https://yandex.ru/q/)*: top 50, ru*
1.  [YandexZenChannel (https://dzen.ru)](https://dzen.ru)*: top 50, ru*
1.  [YandexZenUser (https://zen.yandex.ru)](https://zen.yandex.ru)*: top 50, ru*
@@ -61,18 +62,18 @@ Rank data fetched from Alexa by domains.
1.  [OK (https://ok.ru/)](https://ok.ru/)*: top 100, ru*
1.  [community.adobe.com (https://community.adobe.com)](https://community.adobe.com)*: top 100, us*
1.  [TradingView (https://www.tradingview.com/)](https://www.tradingview.com/)*: top 100, trading, us*
1.  [Aparat (https://www.aparat.com)](https://www.aparat.com)*: top 100, ir, video*, search is disabled
1.  [ChaturBate (https://chaturbate.com)](https://chaturbate.com)*: top 100, us*
1.  [Medium (https://medium.com/)](https://medium.com/)*: top 100, blog, us*, search is disabled
1.  [Livejasmin (https://www.livejasmin.com/)](https://www.livejasmin.com/)*: top 100, us, webcam*
1.  [Pornhub (https://pornhub.com/)](https://pornhub.com/)*: top 100, porn*, search is disabled
1.  [Imgur (https://imgur.com)](https://imgur.com)*: top 100, photo*
1.  [Armchairgm (https://armchairgm.fandom.com/)](https://armchairgm.fandom.com/)*: top 100, us, wiki*
1.  [Battleraprus (https://battleraprus.fandom.com/ru)](https://battleraprus.fandom.com/ru)*: top 100, ru, us, wiki*
1.  [BleachFandom (https://bleach.fandom.com/ru)](https://bleach.fandom.com/ru)*: top 100, ru, wiki*
1.  [Fandom (https://www.fandom.com/)](https://www.fandom.com/)*: top 100, us*
1.  [FandomCommunityCentral (https://community.fandom.com)](https://community.fandom.com)*: top 100, wiki*
1.  [Etsy (https://www.etsy.com/)](https://www.etsy.com/)*: top 100, shopping, us*, search is disabled
1.  [GitHub (https://www.github.com/)](https://www.github.com/)*: top 100, coding*
1.  [Spotify (https://open.spotify.com/)](https://open.spotify.com/)*: top 100, music, us*, search is disabled
1.  [TikTok (https://www.tiktok.com/)](https://www.tiktok.com/)*: top 100, video*
@@ -80,7 +81,7 @@ Rank data fetched from Alexa by domains.
|
|||||||
1.  [Tumblr (https://www.tumblr.com)](https://www.tumblr.com)*: top 500, blog*
|
1.  [Tumblr (https://www.tumblr.com)](https://www.tumblr.com)*: top 500, blog*
|
||||||
1.  [Roblox (https://www.roblox.com/)](https://www.roblox.com/)*: top 500, gaming, us*
|
1.  [Roblox (https://www.roblox.com/)](https://www.roblox.com/)*: top 500, gaming, us*
|
||||||
1.  [SoundCloud (https://soundcloud.com/)](https://soundcloud.com/)*: top 500, music*
|
1.  [SoundCloud (https://soundcloud.com/)](https://soundcloud.com/)*: top 500, music*
|
||||||
1.  [Udemy (https://www.udemy.com)](https://www.udemy.com)*: top 500, in*
|
1.  [Udemy (https://www.udemy.com)](https://www.udemy.com)*: top 500, in*, search is disabled
|
||||||
1.  [discourse.mozilla.org (https://discourse.mozilla.org)](https://discourse.mozilla.org)*: top 500*
|
1.  [discourse.mozilla.org (https://discourse.mozilla.org)](https://discourse.mozilla.org)*: top 500*
|
||||||
1.  [linktr.ee (https://linktr.ee)](https://linktr.ee)*: top 500, links*
|
1.  [linktr.ee (https://linktr.ee)](https://linktr.ee)*: top 500, links*
|
||||||
1.  [xHamster (https://xhamster.com)](https://xhamster.com)*: top 500, porn, us*
|
1.  [xHamster (https://xhamster.com)](https://xhamster.com)*: top 500, porn, us*
|
||||||
@@ -525,7 +526,7 @@ Rank data fetched from Alexa by domains.
|
|||||||
1.  [Neoseeker (https://www.neoseeker.com)](https://www.neoseeker.com)*: top 100K, us*
|
1.  [Neoseeker (https://www.neoseeker.com)](https://www.neoseeker.com)*: top 100K, us*
|
||||||
1.  [InfosecInstitute (https://community.infosecinstitute.com)](https://community.infosecinstitute.com)*: top 100K, us*, search is disabled
|
1.  [InfosecInstitute (https://community.infosecinstitute.com)](https://community.infosecinstitute.com)*: top 100K, us*, search is disabled
|
||||||
1.  [Armorgames (https://armorgames.com)](https://armorgames.com)*: top 100K, gaming, us*
|
1.  [Armorgames (https://armorgames.com)](https://armorgames.com)*: top 100K, gaming, us*
|
||||||
1.  [giters.com (https://giters.com)](https://giters.com)*: top 100K, coding*
|
1.  [giters.com (https://giters.com)](https://giters.com)*: top 100K, coding*, search is disabled
|
||||||
1.  [teamtreehouse.com (https://teamtreehouse.com)](https://teamtreehouse.com)*: top 100K, us*
|
1.  [teamtreehouse.com (https://teamtreehouse.com)](https://teamtreehouse.com)*: top 100K, us*
|
||||||
1.  [Blu-ray (https://forum.blu-ray.com/)](https://forum.blu-ray.com/)*: top 100K, forum, us*, search is disabled
|
1.  [Blu-ray (https://forum.blu-ray.com/)](https://forum.blu-ray.com/)*: top 100K, forum, us*, search is disabled
|
||||||
1.  [TheOdysseyOnline (https://www.theodysseyonline.com)](https://www.theodysseyonline.com)*: top 100K, blog*
|
1.  [TheOdysseyOnline (https://www.theodysseyonline.com)](https://www.theodysseyonline.com)*: top 100K, blog*
|
||||||
@@ -1120,7 +1121,7 @@ Rank data fetched from Alexa by domains.
|
|||||||
1.  [commons.ishtar-collective.net (https://commons.ishtar-collective.net)](https://commons.ishtar-collective.net)*: top 10M, forum, gaming*
|
1.  [commons.ishtar-collective.net (https://commons.ishtar-collective.net)](https://commons.ishtar-collective.net)*: top 10M, forum, gaming*
|
||||||
1.  [4cheat (https://4cheat.ru)](https://4cheat.ru)*: top 10M, forum, ru*, search is disabled
|
1.  [4cheat (https://4cheat.ru)](https://4cheat.ru)*: top 10M, forum, ru*, search is disabled
|
||||||
1.  [svtperformance.com (https://svtperformance.com)](https://svtperformance.com)*: top 10M, forum, us*
|
1.  [svtperformance.com (https://svtperformance.com)](https://svtperformance.com)*: top 10M, forum, us*
|
||||||
1.  [githubplus.com (https://githubplus.com)](https://githubplus.com)*: top 10M, coding*
|
1.  [githubplus.com (https://githubplus.com)](https://githubplus.com)*: top 10M, coding*, search is disabled
|
||||||
1.  [Runitonce (https://www.runitonce.com/)](https://www.runitonce.com/)*: top 10M, ca, us*
|
1.  [Runitonce (https://www.runitonce.com/)](https://www.runitonce.com/)*: top 10M, ca, us*
|
||||||
1.  [Paypal (https://www.paypal.me)](https://www.paypal.me)*: top 10M, finance*
|
1.  [Paypal (https://www.paypal.me)](https://www.paypal.me)*: top 10M, finance*
|
||||||
1.  [Seatracker (https://seatracker.ru/)](https://seatracker.ru/)*: top 10M, ru*
|
1.  [Seatracker (https://seatracker.ru/)](https://seatracker.ru/)*: top 10M, ru*
|
||||||
@@ -1239,7 +1240,7 @@ Rank data fetched from Alexa by domains.
|
|||||||
1.  [Faqusha (https://faqusha.ru)](https://faqusha.ru)*: top 10M, ru*
|
1.  [Faqusha (https://faqusha.ru)](https://faqusha.ru)*: top 10M, ru*
|
||||||
1.  [Skyrimforums (https://skyrimforums.org)](https://skyrimforums.org)*: top 10M, forum, in, us*
|
1.  [Skyrimforums (https://skyrimforums.org)](https://skyrimforums.org)*: top 10M, forum, in, us*
|
||||||
1.  [juce (https://forum.juce.com)](https://forum.juce.com)*: top 10M, ca, forum, us*
|
1.  [juce (https://forum.juce.com)](https://forum.juce.com)*: top 10M, ca, forum, us*
|
||||||
1.  [rblx.trade (https://rblx.trade)](https://rblx.trade)*: top 10M, gaming*
|
1.  [rblx.trade (https://rblx.trade)](https://rblx.trade)*: top 10M, gaming*, search is disabled
|
||||||
1.  [quik (https://forum.quik.ru)](https://forum.quik.ru)*: top 10M, forum, ru*
|
1.  [quik (https://forum.quik.ru)](https://forum.quik.ru)*: top 10M, forum, ru*
|
||||||
1.  [navimba.com (https://navimba.com)](https://navimba.com)*: top 10M*
|
1.  [navimba.com (https://navimba.com)](https://navimba.com)*: top 10M*
|
||||||
1.  [Gardenstew (https://www.gardenstew.com)](https://www.gardenstew.com)*: top 10M, forum, in, us*, search is disabled
|
1.  [Gardenstew (https://www.gardenstew.com)](https://www.gardenstew.com)*: top 10M, forum, in, us*, search is disabled
|
||||||
@@ -3147,18 +3148,18 @@ Rank data fetched from Alexa by domains.
|
|||||||
1.  [OP.GG [Valorant] (https://valorant.op.gg)](https://valorant.op.gg)*: top 100M, gaming*
|
1.  [OP.GG [Valorant] (https://valorant.op.gg)](https://valorant.op.gg)*: top 100M, gaming*
|
||||||
1.  [write.as (https://write.as)](https://write.as)*: top 100M, writefreely*
|
1.  [write.as (https://write.as)](https://write.as)*: top 100M, writefreely*
|
||||||
|
|
||||||
The list was updated at (2026-03-21)
|
The list was updated at (2026-03-22)
|
||||||
## Statistics
|
## Statistics
|
||||||
|
|
||||||
Enabled/total sites: 2650/3143 = 84.31%
|
Enabled/total sites: 2641/3144 = 84.0%
|
||||||
|
|
||||||
Incomplete message checks: 387/2650 = 14.6% (false positive risks)
|
Incomplete message checks: 386/2641 = 14.62% (false positive risks)
|
||||||
|
|
||||||
Status code checks: 607/2650 = 22.91% (false positive risks)
|
Status code checks: 608/2641 = 23.02% (false positive risks)
|
||||||
|
|
||||||
False positive risk (total): 37.51%
|
False positive risk (total): 37.64%
|
||||||
|
|
||||||
Sites with probing: 500px, Aparat, BinarySearch (disabled), BongaCams, BuyMeACoffee, Cent, Disqus, Docker Hub, Duolingo, Gab, GitHub, GitLab, Google Plus (archived), Gravatar, Imgur, Issuu, Keybase, Livejasmin, LocalCryptos (disabled), MixCloud, Niftygateway, Reddit Search (Pushshift) (disabled), SportsTracker, Spotify (disabled), TAP'D, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Weibo, Yapisal (disabled), YouNow, nightbot, notabug.org, polarsteps, qiwi.me (disabled)
|
Sites with probing: 500px, Aparat (disabled), BinarySearch (disabled), BongaCams, BuyMeACoffee, Cent, Chess, Disqus, Docker Hub, Duolingo, Gab, GitHub, GitLab, Google Plus (archived), Gravatar, Imgur, Issuu, Keybase, Livejasmin, LocalCryptos (disabled), MicrosoftLearn, MixCloud, Niftygateway, Picsart, Reddit, Reddit Search (Pushshift) (disabled), SportsTracker, Spotify (disabled), TAP'D, Trello, Twitch, Twitter, Twitter Shadowban (disabled), UnstoppableDomains, Vimeo, Weibo, Yapisal (disabled), YouNow, nightbot, notabug.org, polarsteps, qiwi.me (disabled)
|
||||||
|
|
||||||
Sites with activation: Spotify (disabled), Twitter, Vimeo, Weibo
|
Sites with activation: Spotify (disabled), Twitter, Vimeo, Weibo
|
||||||
|
|
||||||
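(For reference, the "total" figure is simply the sum of the two shares above: 14.62% + 23.02% = 37.64% after this change, and 14.6% + 22.91% = 37.51% before it.)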
@@ -3170,7 +3171,7 @@ Top 20 profile URLs:
- (133) `{urlMain}{urlSubpath}/member.php?username={username} (vBulletin)`
- (127) `{urlMain}{urlSubpath}/search.php?author={username} (phpBB/Search)`
- (118) `/profile/{username}`
-- (111) `/u/{username}`
+- (112) `/u/{username}`
- (88) `/users/{username}`
- (87) `{urlMain}/u/{username}/summary (Discourse)`
- (54) `/@{username}`

@@ -3191,7 +3192,7 @@ Top 20 tags:
- (92) `gaming`
- (48) `photo`
- (41) `coding`
-- (30) `tech`
+- (31) `tech`
- (29) `news`
- (28) `blog`
- (23) `music`
@@ -5,11 +5,13 @@ from typing import Dict, Any

DEFAULT_ARGS: Dict[str, Any] = {
    'all_sites': False,
+    'auto_disable': False,
    'connections': 100,
    'cookie_file': None,
    'csv': False,
    'db_file': 'resources/data.json',
    'debug': False,
+    'diagnose': False,
    'disable_extracting': False,
    'disable_recursive_search': False,
    'folderoutput': 'reports',
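Both new defaults are off, matching the report-only behavior described in the commit message; they back the opt-in flags it names. A sketch of the intended CLI usage (the pairing with maigret's `--self-check` mode is inferred from the commit message, not shown in this hunk):

```
# New default: self-check reports problematic sites but changes nothing
maigret --self-check

# Opt in to auto-disabling broken sites and deeper diagnostic output
maigret --self-check --auto-disable --diagnose
```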
@@ -27,7 +27,9 @@ async def test_self_check_db(test_db):

    assert test_db.sites_dict['ValidActive'].disabled is False
    assert test_db.sites_dict['InvalidInactive'].disabled is True

-    await self_check(test_db, test_db.sites_dict, logger, silent=False)
+    await self_check(
+        test_db, test_db.sites_dict, logger, silent=False, auto_disable=True
+    )

    assert test_db.sites_dict['InvalidActive'].disabled is True
    assert test_db.sites_dict['ValidInactive'].disabled is False
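The updated test pins down the new contract: `self_check` only flips `disabled` flags when `auto_disable=True` is passed explicitly. A minimal sketch of both modes (the report-only assertion follows from the commit message's "reporting issues only" default rather than from this hunk):

```python
# Report-only (new default): broken sites are reported but left untouched.
await self_check(test_db, test_db.sites_dict, logger, silent=False)
assert test_db.sites_dict['InvalidActive'].disabled is False

# Opt-in: the legacy auto-disable behavior, as the updated test asserts.
await self_check(test_db, test_db.sites_dict, logger, silent=False, auto_disable=True)
assert test_db.sites_dict['InvalidActive'].disabled is True
```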
@@ -0,0 +1,63 @@

"""Tests for the Twitter / X site entry and GraphQL probe."""

import re

import pytest
import requests

from maigret.sites import MaigretSite


def _twitter_site(site: MaigretSite) -> None:
    assert site.name == "Twitter"
    assert site.disabled is False
    assert site.check_type == "message"
    assert site.url_probe and "{username}" in site.url_probe
    assert "UserByScreenName" in site.url_probe or "graphql" in site.url_probe
    assert site.regex_check
    assert re.fullmatch(site.regex_check, site.username_claimed)
    assert re.fullmatch(site.regex_check, site.username_unclaimed)
    assert site.absence_strs
    assert site.activation.get("method") == "twitter"
    assert site.activation.get("url")
    assert "authorization" in {k.lower() for k in site.headers.keys()}


def test_twitter_site_entry_config(default_db):
    """Twitter entry in data.json must define probe URL, regex, and activation."""
    site = default_db.sites_dict["Twitter"]
    assert isinstance(site, MaigretSite)
    _twitter_site(site)


@pytest.mark.slow
def test_twitter_graphql_probe_claimed_vs_unclaimed(default_db):
    """
    Live check: guest activation + UserByScreenName GraphQL returns a user for
    usernameClaimed and no user for usernameUnclaimed (same flow as urlProbe).
    """
    site = default_db.sites_dict["Twitter"]
    _twitter_site(site)

    headers = dict(site.headers)
    headers.pop("x-guest-token", None)

    act = requests.post(site.activation["url"], headers=headers, timeout=45)
    assert act.status_code == 200, act.text[:500]
    body = act.json()
    assert "guest_token" in body
    headers["x-guest-token"] = body["guest_token"]

    def fetch(username: str) -> dict:
        url = site.url_probe.format(username=username)
        resp = requests.get(url, headers=headers, timeout=45)
        resp.raise_for_status()
        return resp.json()

    claimed_json = fetch(site.username_claimed)
    assert "data" in claimed_json
    assert claimed_json["data"].get("user") is not None

    unclaimed_json = fetch(site.username_unclaimed)
    data = unclaimed_json.get("data") or {}
    assert data == {} or data.get("user") is None
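To run these tests in isolation (the `slow` marker appears in the file; its registration in the pytest config and the usual maigret test layout are assumptions):

```
# Offline config check only
pytest -k test_twitter_site_entry_config

# Only the live GraphQL probe (needs network access)
pytest -m slow -k twitter
```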
@@ -0,0 +1,480 @@

#!/usr/bin/env python3
"""
Mass site checking utility for Maigret development.
Check top-N sites from data.json and generate a report.

Usage:
    python utils/check_top_n.py --top 100                # Check top 100 sites
    python utils/check_top_n.py --top 50 --parallel 10   # Check with 10 parallel requests
    python utils/check_top_n.py --top 100 --output report.json
    python utils/check_top_n.py --top 100 --fix          # Auto-fix simple issues (flag not wired up below)
"""

import argparse
import asyncio
import json
import sys
import time
from collections import defaultdict
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Add parent dir for imports
sys.path.insert(0, str(Path(__file__).parent.parent))

try:
    import aiohttp
except ImportError:
    print("aiohttp not installed. Run: pip install aiohttp")
    sys.exit(1)


class Colors:
    RED = "\033[91m"
    GREEN = "\033[92m"
    YELLOW = "\033[93m"
    BLUE = "\033[94m"
    CYAN = "\033[96m"
    RESET = "\033[0m"
    BOLD = "\033[1m"


def color(text: str, c: str) -> str:
    return f"{c}{text}{Colors.RESET}"


@dataclass
class SiteCheckResult:
    """Result of checking a single site."""
    site_name: str
    alexa_rank: int
    disabled: bool
    check_type: str

    # Status
    status: str = "unknown"  # working, broken, timeout, error, anti_bot, disabled

    # HTTP results
    claimed_http_status: Optional[int] = None
    unclaimed_http_status: Optional[int] = None
    claimed_error: Optional[str] = None
    unclaimed_error: Optional[str] = None

    # Issues detected
    issues: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    # Recommendations
    recommendations: List[str] = field(default_factory=list)

    # Timing
    check_time_ms: int = 0


DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


async def check_url(url: str, headers: dict, timeout: int = 15) -> dict:
    """Quick URL check returning status and basic info."""
    result = {
        "status": None,
        "final_url": None,
        "content_length": 0,
        "error": None,
        "error_type": None,
        "content": None,
        "markers": {},
    }

    try:
        connector = aiohttp.TCPConnector(ssl=False)
        timeout_obj = aiohttp.ClientTimeout(total=timeout)

        async with aiohttp.ClientSession(connector=connector, timeout=timeout_obj) as session:
            async with session.get(url, headers=headers, allow_redirects=True) as resp:
                result["status"] = resp.status
                result["final_url"] = str(resp.url)

                try:
                    text = await resp.text()
                    result["content_length"] = len(text)
                    result["content"] = text

                    text_lower = text.lower()
                    result["markers"] = {
                        "404_text": any(m in text_lower for m in ["not found", "404", "doesn't exist"]),
                        "captcha": any(m in text_lower for m in ["captcha", "recaptcha", "challenge"]),
                        "cloudflare": "cloudflare" in text_lower,
                        "login": any(m in text_lower for m in ["log in", "login", "sign in"]),
                    }
                except Exception as e:
                    result["error"] = f"Content error: {e}"
                    result["error_type"] = "content"

    except asyncio.TimeoutError:
        result["error"] = "Timeout"
        result["error_type"] = "timeout"
    except aiohttp.ClientError as e:
        result["error"] = str(e)
        result["error_type"] = "client"
    except Exception as e:
        result["error"] = str(e)
        result["error_type"] = "unknown"

    return result


async def check_site(site_name: str, config: dict, timeout: int = 15) -> SiteCheckResult:
    """Check a single site and return detailed result."""
    start_time = time.time()

    result = SiteCheckResult(
        site_name=site_name,
        alexa_rank=config.get("alexaRank", 999999),
        disabled=config.get("disabled", False),
        check_type=config.get("checkType", "status_code"),
    )

    # Skip disabled sites
    if result.disabled:
        result.status = "disabled"
        return result

    # Build URL
    url_template = config.get("url", "")
    url_main = config.get("urlMain", "")
    url_subpath = config.get("urlSubpath", "")
    url_template = url_template.replace("{urlMain}", url_main).replace("{urlSubpath}", url_subpath)

    claimed = config.get("usernameClaimed")
    unclaimed = config.get("usernameUnclaimed", "noonewouldeverusethis7")

    if not claimed:
        result.status = "error"
        result.issues.append("No usernameClaimed defined")
        return result

    # Prepare headers
    headers = DEFAULT_HEADERS.copy()
    if config.get("headers"):
        headers.update(config["headers"])

    # Check both URLs
    url_claimed = url_template.replace("{username}", claimed)
    url_unclaimed = url_template.replace("{username}", unclaimed)

    try:
        claimed_result, unclaimed_result = await asyncio.gather(
            check_url(url_claimed, headers, timeout),
            check_url(url_unclaimed, headers, timeout),
        )
    except Exception as e:
        result.status = "error"
        result.issues.append(f"Check failed: {e}")
        return result

    result.claimed_http_status = claimed_result["status"]
    result.unclaimed_http_status = unclaimed_result["status"]
    result.claimed_error = claimed_result.get("error")
    result.unclaimed_error = unclaimed_result.get("error")

    # Categorize result
    if claimed_result["error_type"] == "timeout" or unclaimed_result["error_type"] == "timeout":
        result.status = "timeout"
        result.issues.append("Request timeout")

    elif claimed_result["status"] == 403 or claimed_result["status"] == 429:
        result.status = "anti_bot"
        result.issues.append(f"Anti-bot protection (HTTP {claimed_result['status']})")

    elif claimed_result.get("markers", {}).get("captcha"):
        result.status = "anti_bot"
        result.issues.append("Captcha detected")

    elif claimed_result.get("markers", {}).get("cloudflare"):
        result.status = "anti_bot"
        result.warnings.append("Cloudflare protection detected")

    elif claimed_result["error"] or unclaimed_result["error"]:
        result.status = "error"
        if claimed_result["error"]:
            result.issues.append(f"Claimed error: {claimed_result['error']}")
        if unclaimed_result["error"]:
            result.issues.append(f"Unclaimed error: {unclaimed_result['error']}")

    else:
        # Validate check type
        check_type = config.get("checkType", "status_code")

        if check_type == "status_code":
            if claimed_result["status"] == unclaimed_result["status"]:
                result.status = "broken"
                result.issues.append(f"Same status code ({claimed_result['status']}) for both")
                # Suggest fix
                if claimed_result["final_url"] != unclaimed_result["final_url"]:
                    result.recommendations.append("Switch to checkType: response_url")
            else:
                result.status = "working"

        elif check_type == "response_url":
            if claimed_result["final_url"] == unclaimed_result["final_url"]:
                result.status = "broken"
                result.issues.append("Same final URL for both")
                if claimed_result["status"] != unclaimed_result["status"]:
                    result.recommendations.append("Switch to checkType: status_code")
            else:
                result.status = "working"

        elif check_type == "message":
            presense_strs = config.get("presenseStrs", [])
            absence_strs = config.get("absenceStrs", [])

            claimed_content = claimed_result.get("content", "") or ""
            unclaimed_content = unclaimed_result.get("content", "") or ""

            presense_ok = not presense_strs or any(s in claimed_content for s in presense_strs)
            absence_claimed = absence_strs and any(s in claimed_content for s in absence_strs)
            absence_unclaimed = absence_strs and any(s in unclaimed_content for s in absence_strs)

            if presense_strs and not presense_ok:
                result.status = "broken"
                result.issues.append(f"presenseStrs not found: {presense_strs}")
                # Check if status_code would work
                if claimed_result["status"] != unclaimed_result["status"]:
                    result.recommendations.append(f"Switch to checkType: status_code ({claimed_result['status']} vs {unclaimed_result['status']})")
            elif absence_claimed:
                result.status = "broken"
                result.issues.append("absenceStrs found in claimed page")
            elif absence_strs and not absence_unclaimed:
                result.status = "broken"
                result.warnings.append("absenceStrs not found in unclaimed page")
            else:
                result.status = "working"

        else:
            result.status = "unknown"
            result.warnings.append(f"Unknown checkType: {check_type}")

    result.check_time_ms = int((time.time() - start_time) * 1000)
    return result


def load_sites(db_path: Path) -> Dict[str, dict]:
    """Load all sites from data.json."""
    with open(db_path) as f:
        data = json.load(f)
    return data.get("sites", {})


def get_top_sites(sites: Dict[str, dict], n: int) -> List[Tuple[str, dict]]:
    """Get top N sites by Alexa rank."""
    ranked = []
    for name, config in sites.items():
        rank = config.get("alexaRank", 999999)
        ranked.append((name, config, rank))

    ranked.sort(key=lambda x: x[2])
    return [(name, config) for name, config, _ in ranked[:n]]


async def check_sites_batch(sites: List[Tuple[str, dict]], parallel: int = 5,
                            timeout: int = 15, progress_callback=None) -> List[SiteCheckResult]:
    """Check multiple sites with parallelism control."""
    results = []
    semaphore = asyncio.Semaphore(parallel)

    async def check_with_semaphore(name, config, index):
        async with semaphore:
            if progress_callback:
                progress_callback(index, len(sites), name)
            return await check_site(name, config, timeout)

    tasks = [
        check_with_semaphore(name, config, i)
        for i, (name, config) in enumerate(sites)
    ]

    results = await asyncio.gather(*tasks)
    return results


def print_progress(current: int, total: int, site_name: str):
    """Print progress indicator."""
    pct = int(current / total * 100)
    bar_width = 30
    filled = int(bar_width * current / total)
    bar = "█" * filled + "░" * (bar_width - filled)
    print(f"\r[{bar}] {pct:3d}% ({current}/{total}) {site_name:<30}", end="", flush=True)


def generate_report(results: List[SiteCheckResult]) -> dict:
    """Generate a summary report from check results."""
    report = {
        "summary": {
            "total": len(results),
            "working": 0,
            "broken": 0,
            "disabled": 0,
            "timeout": 0,
            "anti_bot": 0,
            "error": 0,
            "unknown": 0,
        },
        "by_status": defaultdict(list),
        "issues": [],
        "recommendations": [],
    }

    for r in results:
        report["summary"][r.status] = report["summary"].get(r.status, 0) + 1
        report["by_status"][r.status].append(r.site_name)

        if r.issues:
            report["issues"].append({
                "site": r.site_name,
                "rank": r.alexa_rank,
                "issues": r.issues,
            })

        if r.recommendations:
            report["recommendations"].append({
                "site": r.site_name,
                "rank": r.alexa_rank,
                "recommendations": r.recommendations,
            })

    return report


def print_report(report: dict, results: List[SiteCheckResult]):
    """Print a formatted report to console."""
    summary = report["summary"]

    print(f"\n{'='*60}")
    print(f"{color('SITE CHECK REPORT', Colors.CYAN)}")
    print(f"{'='*60}\n")

    print(f"{color('SUMMARY:', Colors.BOLD)}")
    print(f"  Total sites checked: {summary['total']}")
    print(f"  {color('Working:', Colors.GREEN)} {summary['working']}")
    print(f"  {color('Broken:', Colors.RED)} {summary['broken']}")
    print(f"  {color('Disabled:', Colors.YELLOW)} {summary['disabled']}")
    print(f"  {color('Timeout:', Colors.YELLOW)} {summary['timeout']}")
    print(f"  {color('Anti-bot:', Colors.YELLOW)} {summary['anti_bot']}")
    print(f"  {color('Error:', Colors.RED)} {summary['error']}")

    # Broken sites
    if report["by_status"]["broken"]:
        print(f"\n{color('BROKEN SITES:', Colors.RED)}")
        for site in report["by_status"]["broken"][:20]:
            r = next(x for x in results if x.site_name == site)
            print(f"  - {site} (rank {r.alexa_rank}): {', '.join(r.issues)}")
        if len(report["by_status"]["broken"]) > 20:
            print(f"  ... and {len(report['by_status']['broken']) - 20} more")

    # Timeout sites
    if report["by_status"]["timeout"]:
        print(f"\n{color('TIMEOUT SITES:', Colors.YELLOW)}")
        for site in report["by_status"]["timeout"][:10]:
            print(f"  - {site}")
        if len(report["by_status"]["timeout"]) > 10:
            print(f"  ... and {len(report['by_status']['timeout']) - 10} more")

    # Anti-bot sites
    if report["by_status"]["anti_bot"]:
        print(f"\n{color('ANTI-BOT PROTECTED:', Colors.YELLOW)}")
        for site in report["by_status"]["anti_bot"][:10]:
            r = next(x for x in results if x.site_name == site)
            print(f"  - {site}: {', '.join(r.issues)}")
        if len(report["by_status"]["anti_bot"]) > 10:
            print(f"  ... and {len(report['by_status']['anti_bot']) - 10} more")

    # Recommendations
    if report["recommendations"]:
        print(f"\n{color('RECOMMENDATIONS:', Colors.CYAN)}")
        for rec in report["recommendations"][:15]:
            print(f"  {rec['site']} (rank {rec['rank']}):")
            for r in rec["recommendations"]:
                print(f"    -> {r}")
        if len(report["recommendations"]) > 15:
            print(f"  ... and {len(report['recommendations']) - 15} more")


async def main():
    parser = argparse.ArgumentParser(
        description="Mass site checking for Maigret",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--top", "-n", type=int, default=100,
                        help="Check top N sites by Alexa rank (default: 100)")
    parser.add_argument("--parallel", "-p", type=int, default=5,
                        help="Number of parallel requests (default: 5)")
    parser.add_argument("--timeout", "-t", type=int, default=15,
                        help="Request timeout in seconds (default: 15)")
    parser.add_argument("--output", "-o", help="Output JSON report to file")
    parser.add_argument("--include-disabled", action="store_true",
                        help="Include disabled sites in results")
    parser.add_argument("--only-broken", action="store_true",
                        help="Only show broken sites")
    parser.add_argument("--json", action="store_true",
                        help="Output as JSON only")

    args = parser.parse_args()

    # Load sites
    db_path = Path(__file__).parent.parent / "maigret" / "resources" / "data.json"
    if not db_path.exists():
        print(f"Database not found: {db_path}")
        sys.exit(1)

    sites = load_sites(db_path)
    top_sites = get_top_sites(sites, args.top)

    if not args.json:
        print(f"Checking top {len(top_sites)} sites (parallel={args.parallel}, timeout={args.timeout}s)...")
        print()

    # Run checks
    progress = print_progress if not args.json else None
    results = await check_sites_batch(top_sites, args.parallel, args.timeout, progress)

    if not args.json:
        print()  # Clear progress line

    # Filter results
    if not args.include_disabled:
        results = [r for r in results if r.status != "disabled"]
    if args.only_broken:
        results = [r for r in results if r.status in ("broken", "error", "timeout")]

    # Generate report
    report = generate_report(results)

    # Output
    if args.json:
        output = {
            "report": report,
            "results": [asdict(r) for r in results],
        }
        print(json.dumps(output, indent=2))
    else:
        print_report(report, results)

    # Save to file
    if args.output:
        output = {
            "report": report,
            "results": [asdict(r) for r in results],
        }
        with open(args.output, "w") as f:
            json.dump(output, f, indent=2)
        print(f"\nReport saved to: {args.output}")


if __name__ == "__main__":
    asyncio.run(main())
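Since `--output` writes the same `{"report": ..., "results": ...}` structure that `--json` prints, a saved report can be post-processed directly; a minimal sketch:

```python
import json

# Load a report saved with: python utils/check_top_n.py --top 100 --output report.json
with open("report.json") as f:
    data = json.load(f)

summary = data["report"]["summary"]
print(f"working={summary['working']} broken={summary['broken']} anti_bot={summary['anti_bot']}")

# Broken-site fix suggestions, highest-traffic (lowest rank) first
for rec in sorted(data["report"]["recommendations"], key=lambda r: r["rank"]):
    print(rec["site"], "->", "; ".join(rec["recommendations"]))
```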
@@ -0,0 +1,223 @@

#!/usr/bin/env python3
"""
Probe likely false-positive sites among the top-N Alexa-ranked entries.

For each of K random *distinct* usernames taken from ``usernameClaimed`` fields in
the Maigret database, runs a clean ``maigret`` scan (``--top-sites N --json simple|ndjson``).
Sites that return CLAIMED in *every* run are reported: unrelated random claimed
handles are unlikely to all exist on the same third-party site, so such sites are
candidates for broken checks.
"""

from __future__ import annotations

import argparse
import json
import random
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def repo_root() -> Path:
    return Path(__file__).resolve().parent.parent


def load_username_claimed_pool(db_path: Path) -> list[str]:
    with db_path.open(encoding="utf-8") as f:
        data = json.load(f)
    sites = data.get("sites") or {}
    seen: set[str] = set()
    pool: list[str] = []
    for _name, site in sites.items():
        u = (site or {}).get("usernameClaimed")
        if not u or not isinstance(u, str):
            continue
        u = u.strip()
        if not u or u in seen:
            continue
        seen.add(u)
        pool.append(u)
    return pool


def run_maigret(
    *,
    username: str,
    db_path: Path,
    out_dir: Path,
    top_sites: int,
    json_format: str,
    quiet: bool,
) -> Path:
    """Run maigret subprocess; return path to the written JSON report."""
    safe = username.replace("/", "_")
    report_name = f"report_{safe}_{json_format}.json"
    report_path = out_dir / report_name

    cmd = [
        sys.executable,
        "-m",
        "maigret",
        username,
        "--db",
        str(db_path),
        "--top-sites",
        str(top_sites),
        "--json",
        json_format,
        "--folderoutput",
        str(out_dir),
        "--no-progressbar",
        "--no-color",
        "--no-recursion",
        "--no-extracting",
    ]
    sink = subprocess.DEVNULL if quiet else None
    proc = subprocess.run(
        cmd,
        cwd=str(repo_root()),
        text=True,
        stdout=sink,
        stderr=sink,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            f"maigret exited with {proc.returncode} for username {username!r}"
        )
    if not report_path.is_file():
        raise FileNotFoundError(f"Expected report missing: {report_path}")
    return report_path


def claimed_sites_from_report(path: Path, json_format: str) -> set[str]:
    if json_format == "simple":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, dict):
            return set()
        return set(data.keys())
    # ndjson: one object per line, each has "sitename"
    sites: set[str] = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            name = obj.get("sitename")
            if isinstance(name, str) and name:
                sites.add(name)
    return sites


def main() -> int:
    parser = argparse.ArgumentParser(
        description=(
            "Pick random distinct usernameClaimed values, run maigret --top-sites N "
            "with JSON reports, and list sites that claimed all of them (suspicious FP)."
        )
    )
    parser.add_argument(
        "--db",
        "-b",
        type=Path,
        default=repo_root() / "maigret" / "resources" / "data.json",
        help="Path to Maigret data.json (a temp copy is used for runs).",
    )
    parser.add_argument(
        "--top-sites",
        "-n",
        type=int,
        default=500,
        metavar="N",
        help="Value for maigret --top-sites (default: 500).",
    )
    parser.add_argument(
        "--samples",
        "-k",
        type=int,
        default=5,
        metavar="K",
        help="How many distinct random usernames to draw (default: 5).",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="RNG seed for reproducible username selection.",
    )
    parser.add_argument(
        "--json",
        dest="json_format",
        default="simple",
        choices=["simple", "ndjson"],
        help="JSON report type passed to maigret -J (default: simple).",
    )
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        default=False,
        help="Print maigret stdout/stderr (default: suppress child output).",
    )
    args = parser.parse_args()
    quiet = not args.verbose

    db_src = args.db.resolve()
    if not db_src.is_file():
        print(f"Database not found: {db_src}", file=sys.stderr)
        return 2

    pool = load_username_claimed_pool(db_src)
    if len(pool) < args.samples:
        print(
            f"Need at least {args.samples} distinct usernameClaimed entries, "
            f"found {len(pool)}.",
            file=sys.stderr,
        )
        return 2

    rng = random.Random(args.seed)
    picked = rng.sample(pool, args.samples)

    print(f"Database: {db_src}")
    print(f"--top-sites {args.top_sites}, {args.samples} random usernameClaimed:")
    for i, u in enumerate(picked, 1):
        print(f"  {i}. {u}")

    site_sets: list[set[str]] = []
    with tempfile.TemporaryDirectory(prefix="maigret_fp_probe_") as tmp:
        tmp_path = Path(tmp)
        db_work = tmp_path / "data.json"
        shutil.copyfile(db_src, db_work)

        for u in picked:
            print(f"\nRunning maigret for {u!r} ...", flush=True)
            report = run_maigret(
                username=u,
                db_path=db_work,
                out_dir=tmp_path,
                top_sites=args.top_sites,
                json_format=args.json_format,
                quiet=quiet,
            )
            sites = claimed_sites_from_report(report, args.json_format)
            site_sets.append(sites)
            print(f"  -> {len(sites)} positive site(s) in JSON", flush=True)

    always = set.intersection(*site_sets) if site_sets else set()
    print("\n--- Sites with CLAIMED in all runs (candidates for false positives) ---")
    if not always:
        print("(none)")
    else:
        for name in sorted(always):
            print(name)

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
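The intersection step is the heart of this script: a genuine profile hit depends on the username, so across K unrelated claimed handles only always-positive (and therefore likely broken) checks survive. A standalone toy illustration of that reduction:

```python
# Sites reported CLAIMED per run, keyed by three unrelated usernames.
runs = [
    {"SiteA", "SiteB", "GitHub"},   # username 1
    {"SiteA", "SiteB", "Reddit"},   # username 2
    {"SiteA", "SiteB"},             # username 3
]

# Only sites claimed in *every* run remain -- candidates for broken checks.
always = set.intersection(*runs)
print(sorted(always))  # ['SiteA', 'SiteB']
```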
@@ -0,0 +1,750 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Site check utility for Maigret development.
|
||||||
|
Quickly test site availability, find valid usernames, and diagnose check issues.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python utils/site_check.py --site "SiteName" --check-claimed
|
||||||
|
python utils/site_check.py --site "SiteName" --maigret # Test via Maigret
|
||||||
|
python utils/site_check.py --site "SiteName" --compare-methods # aiohttp vs Maigret
|
||||||
|
python utils/site_check.py --url "https://example.com/user/{username}" --test "john"
|
||||||
|
python utils/site_check.py --site "SiteName" --find-user
|
||||||
|
python utils/site_check.py --site "SiteName" --diagnose # Full diagnosis
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
# Add parent dir for imports
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||||
|
|
||||||
|
try:
|
||||||
|
import aiohttp
|
||||||
|
except ImportError:
|
||||||
|
print("aiohttp not installed. Run: pip install aiohttp")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Maigret imports (optional, for --maigret mode)
|
||||||
|
MAIGRET_AVAILABLE = False
|
||||||
|
try:
|
||||||
|
from maigret.sites import MaigretDatabase, MaigretSite
|
||||||
|
from maigret.checking import (
|
||||||
|
SimpleAiohttpChecker,
|
||||||
|
check_site_for_username,
|
||||||
|
process_site_result,
|
||||||
|
make_site_result,
|
||||||
|
)
|
||||||
|
from maigret.notify import QueryNotifyPrint
|
||||||
|
from maigret.result import QueryStatus
|
||||||
|
MAIGRET_AVAILABLE = True
|
||||||
|
except ImportError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
DEFAULT_HEADERS = {
|
||||||
|
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||||
|
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
||||||
|
"Accept-Language": "en-US,en;q=0.5",
|
||||||
|
}
|
||||||
|
|
||||||
|
COMMON_USERNAMES = ["blue", "test", "admin", "user", "john", "alex", "david", "mike", "chris", "dan"]
|
||||||
|
|
||||||
|
|
||||||
|
class Colors:
|
||||||
|
"""ANSI color codes for terminal output."""
|
||||||
|
RED = "\033[91m"
|
||||||
|
GREEN = "\033[92m"
|
||||||
|
YELLOW = "\033[93m"
|
||||||
|
BLUE = "\033[94m"
|
||||||
|
MAGENTA = "\033[95m"
|
||||||
|
CYAN = "\033[96m"
|
||||||
|
RESET = "\033[0m"
|
||||||
|
BOLD = "\033[1m"
|
||||||
|
|
||||||
|
|
||||||
|
def color(text: str, c: str) -> str:
|
||||||
|
"""Wrap text with color codes."""
|
||||||
|
return f"{c}{text}{Colors.RESET}"
|
||||||
|
|
||||||
|
|
||||||
|
async def check_url_aiohttp(url: str, headers: dict = None, follow_redirects: bool = True,
|
||||||
|
timeout: int = 15, ssl_verify: bool = False) -> dict:
|
||||||
|
"""Check a URL using aiohttp and return detailed response info."""
|
||||||
|
headers = headers or DEFAULT_HEADERS.copy()
|
||||||
|
result = {
|
||||||
|
"method": "aiohttp",
|
||||||
|
"url": url,
|
||||||
|
"status": None,
|
||||||
|
"final_url": None,
|
||||||
|
"redirects": [],
|
||||||
|
"content_length": 0,
|
||||||
|
"content": None,
|
||||||
|
"title": None,
|
||||||
|
"error": None,
|
||||||
|
"error_type": None,
|
||||||
|
"markers": {},
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
connector = aiohttp.TCPConnector(ssl=ssl_verify)
|
||||||
|
timeout_obj = aiohttp.ClientTimeout(total=timeout)
|
||||||
|
|
||||||
|
async with aiohttp.ClientSession(connector=connector, timeout=timeout_obj) as session:
|
||||||
|
async with session.get(url, headers=headers, allow_redirects=follow_redirects) as resp:
|
||||||
|
result["status"] = resp.status
|
||||||
|
result["final_url"] = str(resp.url)
|
||||||
|
|
||||||
|
# Get redirect history
|
||||||
|
if resp.history:
|
||||||
|
result["redirects"] = [str(r.url) for r in resp.history]
|
||||||
|
|
||||||
|
# Read content
|
||||||
|
try:
|
||||||
|
text = await resp.text()
|
||||||
|
result["content_length"] = len(text)
|
||||||
|
result["content"] = text
|
||||||
|
|
||||||
|
# Extract title
|
||||||
|
title_match = re.search(r'<title>([^<]*)</title>', text, re.IGNORECASE)
|
||||||
|
if title_match:
|
||||||
|
result["title"] = title_match.group(1).strip()[:100]
|
||||||
|
|
||||||
|
# Check common markers
|
||||||
|
text_lower = text.lower()
|
||||||
|
markers = {
|
||||||
|
"404_text": any(m in text_lower for m in ["not found", "404", "doesn't exist", "does not exist"]),
|
||||||
|
"profile_markers": any(m in text_lower for m in ["profile", "user", "member", "account"]),
|
||||||
|
"error_markers": any(m in text_lower for m in ["error", "banned", "suspended", "blocked"]),
|
||||||
|
"login_required": any(m in text_lower for m in ["log in", "login", "sign in", "signin"]),
|
||||||
|
"captcha": any(m in text_lower for m in ["captcha", "recaptcha", "challenge", "verify you"]),
|
||||||
|
"cloudflare": "cloudflare" in text_lower or "cf-ray" in text_lower,
|
||||||
|
"rate_limit": any(m in text_lower for m in ["rate limit", "too many requests", "429"]),
|
||||||
|
}
|
||||||
|
result["markers"] = markers
|
||||||
|
|
||||||
|
# First 500 chars of body for inspection
|
||||||
|
result["body_preview"] = text[:500].replace("\n", " ").strip()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
result["error"] = f"Content read error: {e}"
|
||||||
|
result["error_type"] = "content_error"
|
||||||
|
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
result["error"] = "Timeout"
|
||||||
|
result["error_type"] = "timeout"
|
||||||
|
except aiohttp.ClientError as e:
|
||||||
|
result["error"] = f"Client error: {e}"
|
||||||
|
result["error_type"] = "client_error"
|
||||||
|
except Exception as e:
|
||||||
|
result["error"] = f"Error: {e}"
|
||||||
|
result["error_type"] = "unknown"
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
async def check_url_maigret(site: 'MaigretSite', username: str, logger=None) -> dict:
|
||||||
|
"""Check a URL using Maigret's checking mechanism."""
|
||||||
|
if not MAIGRET_AVAILABLE:
|
||||||
|
return {"error": "Maigret not available", "method": "maigret"}
|
||||||
|
|
||||||
|
if logger is None:
|
||||||
|
logger = logging.getLogger("site_check")
|
||||||
|
logger.setLevel(logging.WARNING)
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"method": "maigret",
|
||||||
|
"url": None,
|
||||||
|
"status": None,
|
||||||
|
"status_str": None,
|
||||||
|
"http_status": None,
|
||||||
|
"final_url": None,
|
||||||
|
"error": None,
|
||||||
|
"error_type": None,
|
||||||
|
"ids_data": None,
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Create query options
|
||||||
|
options = {
|
||||||
|
"parsing": False,
|
||||||
|
"cookie_jar": None,
|
||||||
|
"timeout": 15,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Create a simple notifier
|
||||||
|
class SilentNotify:
|
||||||
|
def start(self, msg=None): pass
|
||||||
|
def update(self, status, similar=False): pass
|
||||||
|
def finish(self, msg=None, status=None): pass
|
||||||
|
|
||||||
|
notifier = SilentNotify()
|
||||||
|
|
||||||
|
# Run the check
|
||||||
|
site_name, site_result = await check_site_for_username(
|
||||||
|
site, username, options, logger, notifier
|
||||||
|
)
|
||||||
|
|
||||||
|
result["url"] = site_result.get("url_user")
|
||||||
|
result["status"] = site_result.get("status")
|
||||||
|
result["status_str"] = str(site_result.get("status"))
|
||||||
|
result["http_status"] = site_result.get("http_status")
|
||||||
|
result["ids_data"] = site_result.get("ids_data")
|
||||||
|
|
||||||
|
# Check for errors
|
||||||
|
status = site_result.get("status")
|
||||||
|
if status and hasattr(status, 'error') and status.error:
|
||||||
|
result["error"] = f"{status.error.type}: {status.error.desc}"
|
||||||
|
result["error_type"] = str(status.error.type)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
result["error"] = str(e)
|
||||||
|
result["error_type"] = "exception"
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
async def find_valid_username(url_template: str, usernames: list = None, headers: dict = None) -> Optional[str]:
|
||||||
|
"""Try common usernames to find one that works."""
|
||||||
|
usernames = usernames or COMMON_USERNAMES
|
||||||
|
headers = headers or DEFAULT_HEADERS.copy()
|
||||||
|
|
||||||
|
print(f"Testing {len(usernames)} usernames on {url_template}...")
|
||||||
|
|
||||||
|
for username in usernames:
|
||||||
|
url = url_template.replace("{username}", username)
|
||||||
|
result = await check_url_aiohttp(url, headers)
|
||||||
|
|
||||||
|
status = result["status"]
|
||||||
|
markers = result.get("markers", {})
|
||||||
|
|
||||||
|
# Good signs: 200 status, profile markers, no 404 text
|
||||||
|
if status == 200 and not markers.get("404_text") and markers.get("profile_markers"):
|
||||||
|
print(f" {color('[+]', Colors.GREEN)} {username}: status={status}, has profile markers")
|
||||||
|
return username
|
||||||
|
elif status == 200 and not markers.get("404_text"):
|
||||||
|
print(f" {color('[?]', Colors.YELLOW)} {username}: status={status}, might work")
|
||||||
|
else:
|
||||||
|
print(f" {color('[-]', Colors.RED)} {username}: status={status}")
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
async def compare_users_aiohttp(url_template: str, claimed: str, unclaimed: str = "noonewouldeverusethis7",
|
||||||
|
headers: dict = None) -> Tuple[dict, dict]:
|
||||||
|
"""Compare responses for claimed vs unclaimed usernames using aiohttp."""
|
||||||
|
headers = headers or DEFAULT_HEADERS.copy()
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"Comparing: {color(claimed, Colors.GREEN)} vs {color(unclaimed, Colors.RED)}")
|
||||||
|
print(f"URL template: {url_template}")
|
||||||
|
print(f"Method: aiohttp")
|
||||||
|
print(f"{'='*60}\n")
|
||||||
|
|
||||||
|
url_claimed = url_template.replace("{username}", claimed)
|
||||||
|
url_unclaimed = url_template.replace("{username}", unclaimed)
|
||||||
|
|
||||||
|
result_claimed, result_unclaimed = await asyncio.gather(
|
||||||
|
check_url_aiohttp(url_claimed, headers),
|
||||||
|
check_url_aiohttp(url_unclaimed, headers)
|
||||||
|
)
|
||||||
|
|
||||||
|
def print_result(name, r, c):
|
||||||
|
print(f"--- {color(name, c)} ---")
|
||||||
|
print(f" URL: {r['url']}")
|
||||||
|
print(f" Status: {color(str(r['status']), Colors.GREEN if r['status'] == 200 else Colors.RED)}")
|
||||||
|
if r["redirects"]:
|
||||||
|
print(f" Redirects: {' -> '.join(r['redirects'])} -> {r['final_url']}")
|
||||||
|
print(f" Final URL: {r['final_url']}")
|
||||||
|
print(f" Content length: {r['content_length']}")
|
||||||
|
print(f" Title: {r['title']}")
|
||||||
|
if r["error"]:
|
||||||
|
print(f" Error: {color(r['error'], Colors.RED)}")
|
||||||
|
print(f" Markers: {r['markers']}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print_result(f"CLAIMED ({claimed})", result_claimed, Colors.GREEN)
|
||||||
|
print_result(f"UNCLAIMED ({unclaimed})", result_unclaimed, Colors.RED)
|
||||||
|
|
||||||
|
# Analysis
|
||||||
|
print(f"--- {color('ANALYSIS', Colors.CYAN)} ---")
|
||||||
|
recommendations = []
|
||||||
|
|
||||||
|
if result_claimed["status"] != result_unclaimed["status"]:
|
||||||
|
print(f" [!] Status codes differ: {result_claimed['status']} vs {result_unclaimed['status']}")
|
||||||
|
recommendations.append(("status_code", f"Status codes: {result_claimed['status']} vs {result_unclaimed['status']}"))
|
||||||
|
|
||||||
|
if result_claimed["final_url"] != result_unclaimed["final_url"]:
|
||||||
|
print(f" [!] Final URLs differ")
|
||||||
|
recommendations.append(("response_url", "Final URLs differ"))
|
||||||
|
|
||||||
|
if result_claimed["content_length"] != result_unclaimed["content_length"]:
|
||||||
|
diff = abs(result_claimed["content_length"] - result_unclaimed["content_length"])
|
||||||
|
print(f" [!] Content length differs by {diff} bytes")
|
||||||
|
recommendations.append(("message", f"Content differs by {diff} bytes"))
|
||||||
|
|
||||||
|
if result_claimed["title"] != result_unclaimed["title"]:
|
||||||
|
print(f" [!] Titles differ:")
|
||||||
|
print(f" Claimed: {result_claimed['title']}")
|
||||||
|
print(f" Unclaimed: {result_unclaimed['title']}")
|
||||||
|
recommendations.append(("message", f"Titles differ: '{result_claimed['title']}' vs '{result_unclaimed['title']}'"))
|
||||||
|
|
||||||
|
# Check for problems
|
||||||
|
if result_claimed.get("markers", {}).get("captcha"):
|
||||||
|
print(f" {color('[WARN]', Colors.YELLOW)} Captcha detected on claimed page")
|
||||||
|
if result_claimed.get("markers", {}).get("cloudflare"):
|
||||||
|
print(f" {color('[WARN]', Colors.YELLOW)} Cloudflare protection detected")
|
||||||
|
if result_claimed.get("markers", {}).get("login_required"):
|
||||||
|
print(f" {color('[WARN]', Colors.YELLOW)} Login may be required")
|
||||||
|
|
||||||
|
if recommendations:
|
||||||
|
print(f"\n {color('Recommended checkType:', Colors.BOLD)} {recommendations[0][0]}")
|
||||||
|
else:
|
||||||
|
print(f" {color('[!]', Colors.RED)} No clear difference found - site may need special handling")
|
||||||
|
|
||||||
|
return result_claimed, result_unclaimed
|
||||||
|
|
||||||
|
|
||||||
|
async def compare_methods(site: 'MaigretSite', claimed: str, unclaimed: str) -> dict:
    """Compare aiohttp vs Maigret results for the same site."""
    if not MAIGRET_AVAILABLE:
        print(color("Maigret not available for comparison", Colors.RED))
        return {}

    print(f"\n{'='*60}")
    print(f"{color('METHOD COMPARISON', Colors.CYAN)}: aiohttp vs Maigret")
    print(f"Site: {site.name}")
    print(f"Claimed: {claimed}, Unclaimed: {unclaimed}")
    print(f"{'='*60}\n")

    # Build URL template
    url_template = site.url
    url_template = url_template.replace("{urlMain}", site.url_main or "")
    url_template = url_template.replace("{urlSubpath}", getattr(site, 'url_subpath', '') or "")

    headers = DEFAULT_HEADERS.copy()
    if hasattr(site, 'headers') and site.headers:
        headers.update(site.headers)

    # Run all checks in parallel
    url_claimed = url_template.replace("{username}", claimed)
    url_unclaimed = url_template.replace("{username}", unclaimed)

    aiohttp_claimed, aiohttp_unclaimed, maigret_claimed, maigret_unclaimed = await asyncio.gather(
        check_url_aiohttp(url_claimed, headers),
        check_url_aiohttp(url_unclaimed, headers),
        check_url_maigret(site, claimed),
        check_url_maigret(site, unclaimed),
    )

    def status_icon(status):
        if status == 200:
            return color("200", Colors.GREEN)
        elif status == 404:
            return color("404", Colors.YELLOW)
        elif status and status >= 400:
            return color(str(status), Colors.RED)
        return str(status)

    def maigret_status_icon(status_str):
        if "Claimed" in str(status_str):
            return color("Claimed", Colors.GREEN)
        elif "Available" in str(status_str):
            return color("Available", Colors.YELLOW)
        else:
            return color(str(status_str), Colors.RED)

    print(f"{'Method':<12} {'Username':<25} {'HTTP Status':<12} {'Result':<20}")
    print("-" * 70)
    print(f"{'aiohttp':<12} {claimed:<25} {status_icon(aiohttp_claimed['status']):<20} {'OK' if not aiohttp_claimed['error'] else aiohttp_claimed['error'][:20]}")
    print(f"{'aiohttp':<12} {unclaimed:<25} {status_icon(aiohttp_unclaimed['status']):<20} {'OK' if not aiohttp_unclaimed['error'] else aiohttp_unclaimed['error'][:20]}")
    print(f"{'Maigret':<12} {claimed:<25} {status_icon(maigret_claimed.get('http_status')):<20} {maigret_status_icon(maigret_claimed.get('status_str'))}")
    print(f"{'Maigret':<12} {unclaimed:<25} {status_icon(maigret_unclaimed.get('http_status')):<20} {maigret_status_icon(maigret_unclaimed.get('status_str'))}")

    # Check for discrepancies
    print(f"\n--- {color('DISCREPANCY ANALYSIS', Colors.CYAN)} ---")
    issues = []

    if aiohttp_claimed['status'] != maigret_claimed.get('http_status'):
        issues.append(f"HTTP status mismatch for claimed: aiohttp={aiohttp_claimed['status']}, Maigret={maigret_claimed.get('http_status')}")

    if aiohttp_unclaimed['status'] != maigret_unclaimed.get('http_status'):
        issues.append(f"HTTP status mismatch for unclaimed: aiohttp={aiohttp_unclaimed['status']}, Maigret={maigret_unclaimed.get('http_status')}")

    # Check Maigret detection correctness
    claimed_detected = "Claimed" in str(maigret_claimed.get('status_str', ''))
    unclaimed_detected = "Available" in str(maigret_unclaimed.get('status_str', ''))

    if not claimed_detected:
        issues.append(f"Maigret did NOT detect claimed user '{claimed}' as Claimed")
    if not unclaimed_detected:
        issues.append(f"Maigret did NOT detect unclaimed user '{unclaimed}' as Available")

    if issues:
        for issue in issues:
            print(f" {color('[!]', Colors.RED)} {issue}")
    else:
        print(f" {color('[OK]', Colors.GREEN)} Both methods agree on results")

    return {
        "aiohttp_claimed": aiohttp_claimed,
        "aiohttp_unclaimed": aiohttp_unclaimed,
        "maigret_claimed": maigret_claimed,
        "maigret_unclaimed": maigret_unclaimed,
        "issues": issues,
    }

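# Usage sketch (site name illustrative; assumes Maigret is importable and uses
# the helpers defined in this file):
#
#   config, site = load_site_from_db("GitHub")
#   if config and site:
#       outcome = asyncio.run(compare_methods(
#           site, config["usernameClaimed"], "noonewouldeverusethis7"))
#       # outcome["issues"] is empty when aiohttp and Maigret agree
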
async def diagnose_site(site_config: dict, site_name: str) -> dict:
    """Full diagnosis of a site configuration."""
    print(f"\n{'='*60}")
    print(f"{color('FULL SITE DIAGNOSIS', Colors.CYAN)}: {site_name}")
    print(f"{'='*60}\n")

    diagnosis = {
        "site_name": site_name,
        "issues": [],
        "warnings": [],
        "recommendations": [],
        "working": False,
    }

    # 1. Config analysis
    print(f"--- {color('1. CONFIGURATION', Colors.BOLD)} ---")
    check_type = site_config.get("checkType", "status_code")
    url = site_config.get("url", "")
    url_main = site_config.get("urlMain", "")
    claimed = site_config.get("usernameClaimed")
    unclaimed = site_config.get("usernameUnclaimed", "noonewouldeverusethis7")
    disabled = site_config.get("disabled", False)

    print(f" checkType: {check_type}")
    print(f" URL: {url}")
    print(f" urlMain: {url_main}")
    print(f" usernameClaimed: {claimed}")
    print(f" disabled: {disabled}")

    if disabled:
        diagnosis["issues"].append("Site is disabled")
        print(f" {color('[!]', Colors.YELLOW)} Site is disabled")

    if not claimed:
        diagnosis["issues"].append("No usernameClaimed defined")
        print(f" {color('[!]', Colors.RED)} No usernameClaimed defined")
        return diagnosis

    # Build full URL
    url_template = url.replace("{urlMain}", url_main).replace("{urlSubpath}", site_config.get("urlSubpath", ""))

    headers = DEFAULT_HEADERS.copy()
    if site_config.get("headers"):
        headers.update(site_config["headers"])

    # 2. Connectivity test
    print(f"\n--- {color('2. CONNECTIVITY TEST', Colors.BOLD)} ---")
    url_claimed = url_template.replace("{username}", claimed)
    url_unclaimed = url_template.replace("{username}", unclaimed)

    result_claimed, result_unclaimed = await asyncio.gather(
        check_url_aiohttp(url_claimed, headers),
        check_url_aiohttp(url_unclaimed, headers)
    )

    print(f" Claimed ({claimed}): status={result_claimed['status']}, error={result_claimed['error']}")
    print(f" Unclaimed ({unclaimed}): status={result_unclaimed['status']}, error={result_unclaimed['error']}")

    # Check for common problems
    if result_claimed["error_type"] == "timeout":
        diagnosis["issues"].append("Timeout on claimed username")
    if result_unclaimed["error_type"] == "timeout":
        diagnosis["issues"].append("Timeout on unclaimed username")

    if result_claimed.get("markers", {}).get("cloudflare"):
        diagnosis["warnings"].append("Cloudflare protection detected")
    if result_claimed.get("markers", {}).get("captcha"):
        diagnosis["warnings"].append("Captcha detected")
    if result_claimed["status"] == 403:
        diagnosis["issues"].append("403 Forbidden - possible anti-bot protection")
    if result_claimed["status"] == 429:
        diagnosis["issues"].append("429 Rate Limited")

    # 3. Check type validation
    print(f"\n--- {color('3. CHECK TYPE VALIDATION', Colors.BOLD)} ---")

    if check_type == "status_code":
        if result_claimed["status"] == result_unclaimed["status"]:
            diagnosis["issues"].append(f"status_code check but same status ({result_claimed['status']}) for both")
            print(f" {color('[FAIL]', Colors.RED)} Same status code for claimed and unclaimed: {result_claimed['status']}")
        else:
            print(f" {color('[OK]', Colors.GREEN)} Status codes differ: {result_claimed['status']} vs {result_unclaimed['status']}")
            diagnosis["working"] = True

    elif check_type == "response_url":
        if result_claimed["final_url"] == result_unclaimed["final_url"]:
            diagnosis["issues"].append("response_url check but same final URL for both")
            print(f" {color('[FAIL]', Colors.RED)} Same final URL for both")
        else:
            print(f" {color('[OK]', Colors.GREEN)} Final URLs differ")
            diagnosis["working"] = True

    elif check_type == "message":
        presense_strs = site_config.get("presenseStrs", [])
        absence_strs = site_config.get("absenceStrs", [])

        print(f" presenseStrs: {presense_strs}")
        print(f" absenceStrs: {absence_strs}")

        claimed_content = result_claimed.get("content", "") or ""
        unclaimed_content = result_unclaimed.get("content", "") or ""

        # Check presenseStrs
        presense_found_claimed = any(s in claimed_content for s in presense_strs) if presense_strs else True
        presense_found_unclaimed = any(s in unclaimed_content for s in presense_strs) if presense_strs else True

        # Check absenceStrs
        absence_found_claimed = any(s in claimed_content for s in absence_strs) if absence_strs else False
        absence_found_unclaimed = any(s in unclaimed_content for s in absence_strs) if absence_strs else False

        print(f" Claimed - presenseStrs found: {presense_found_claimed}, absenceStrs found: {absence_found_claimed}")
        print(f" Unclaimed - presenseStrs found: {presense_found_unclaimed}, absenceStrs found: {absence_found_unclaimed}")

        if presense_strs and not presense_found_claimed:
            diagnosis["issues"].append(f"presenseStrs {presense_strs} not found in claimed page")
            print(f" {color('[FAIL]', Colors.RED)} presenseStrs not found in claimed page")
        if absence_strs and absence_found_claimed:
            diagnosis["issues"].append(f"absenceStrs {absence_strs} found in claimed page (should not be)")
            print(f" {color('[FAIL]', Colors.RED)} absenceStrs found in claimed page")
        if absence_strs and not absence_found_unclaimed:
            diagnosis["warnings"].append("absenceStrs not found in unclaimed page")
            print(f" {color('[WARN]', Colors.YELLOW)} absenceStrs not found in unclaimed page")

        if presense_found_claimed and not absence_found_claimed and absence_found_unclaimed:
            print(f" {color('[OK]', Colors.GREEN)} Message check should work correctly")
            diagnosis["working"] = True

    # 4. Recommendations
    print(f"\n--- {color('4. RECOMMENDATIONS', Colors.BOLD)} ---")

    if not diagnosis["working"]:
        # Suggest alternatives
        if result_claimed["status"] != result_unclaimed["status"]:
            diagnosis["recommendations"].append(f"Switch to checkType: status_code (status {result_claimed['status']} vs {result_unclaimed['status']})")
        if result_claimed["final_url"] != result_unclaimed["final_url"]:
            diagnosis["recommendations"].append("Switch to checkType: response_url")
        if result_claimed["title"] != result_unclaimed["title"]:
            diagnosis["recommendations"].append(f"Use title as marker: presenseStrs=['{result_claimed['title']}'] or absenceStrs=['{result_unclaimed['title']}']")

    if diagnosis["recommendations"]:
        for rec in diagnosis["recommendations"]:
            print(f" -> {rec}")
    elif diagnosis["working"]:
        print(f" {color('Site appears to be working correctly', Colors.GREEN)}")
    else:
        print(f" {color('No clear fix found - site may need special handling or should be disabled', Colors.RED)}")

    # Summary
    print(f"\n--- {color('SUMMARY', Colors.BOLD)} ---")
    if diagnosis["issues"]:
        print(f" Issues: {len(diagnosis['issues'])}")
        for issue in diagnosis["issues"]:
            print(f" - {issue}")
    if diagnosis["warnings"]:
        print(f" Warnings: {len(diagnosis['warnings'])}")
        for warn in diagnosis["warnings"]:
            print(f" - {warn}")
    print(f" Working: {color('YES', Colors.GREEN) if diagnosis['working'] else color('NO', Colors.RED)}")

    return diagnosis

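# Usage sketch (illustrative; "VK" is one of the sites this change touches):
#
#   config, _ = load_site_from_db("VK")
#   if config:
#       report = asyncio.run(diagnose_site(config, "VK"))
#       if not report["working"]:
#           print("Candidate fixes:", report["recommendations"])
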
def load_site_from_db(site_name: str) -> Tuple[Optional[dict], Optional['MaigretSite']]:
    """Load site config from data.json. Returns (config_dict, MaigretSite or None)."""
    db_path = Path(__file__).parent.parent / "maigret" / "resources" / "data.json"

    with open(db_path) as f:
        data = json.load(f)

    config = None
    if site_name in data["sites"]:
        config = data["sites"][site_name]
    else:
        # Try case-insensitive search
        for name, cfg in data["sites"].items():
            if name.lower() == site_name.lower():
                config = cfg
                site_name = name
                break

    if not config:
        return None, None

    # Also load MaigretSite if available
    maigret_site = None
    if MAIGRET_AVAILABLE:
        try:
            db = MaigretDatabase().load_from_path(db_path)
            maigret_site = db.sites_dict.get(site_name)
        except Exception:
            pass

    return config, maigret_site

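# Example: load_site_from_db("vk") resolves to "VK" via the case-insensitive
# fallback and returns (config_dict, maigret_site); the second element is None
# whenever the maigret package cannot be imported or the site object fails to
# load, so callers must handle both shapes.
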
async def main():
    parser = argparse.ArgumentParser(
        description="Site check utility for Maigret development",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --site "VK" --check-claimed      # Test site with aiohttp
  %(prog)s --site "VK" --maigret            # Test site with Maigret
  %(prog)s --site "VK" --compare-methods    # Compare aiohttp vs Maigret
  %(prog)s --site "VK" --diagnose           # Full diagnosis
  %(prog)s --url "https://vk.com/{username}" --compare blue nobody123
  %(prog)s --site "VK" --find-user          # Find a valid username
"""
    )
    parser.add_argument("--site", "-s", help="Site name from data.json")
    parser.add_argument("--url", "-u", help="URL template with {username}")
    parser.add_argument("--test", "-t", help="Username to test")
    parser.add_argument("--compare", "-c", nargs=2, metavar=("CLAIMED", "UNCLAIMED"),
                        help="Compare two usernames")
    parser.add_argument("--find-user", "-f", action="store_true",
                        help="Find a valid username")
    parser.add_argument("--check-claimed", action="store_true",
                        help="Check if claimed username still works (aiohttp)")
    parser.add_argument("--maigret", "-m", action="store_true",
                        help="Test using Maigret's checker instead of aiohttp")
    parser.add_argument("--compare-methods", action="store_true",
                        help="Compare aiohttp vs Maigret results")
    parser.add_argument("--diagnose", "-d", action="store_true",
                        help="Full diagnosis of site configuration")
    parser.add_argument("--headers", help="Custom headers as JSON")
    parser.add_argument("--timeout", type=int, default=15, help="Request timeout in seconds")
    parser.add_argument("--json", action="store_true", help="Output results as JSON")

    args = parser.parse_args()

    url_template = None
    claimed = None
    unclaimed = "noonewouldeverusethis7"
    headers = DEFAULT_HEADERS.copy()
    site_config = None
    maigret_site = None

    # Load from site name
    if args.site:
        site_config, maigret_site = load_site_from_db(args.site)
        if not site_config:
            print(f"Site '{args.site}' not found in database")
            sys.exit(1)

        url_template = site_config.get("url", "")
        url_main = site_config.get("urlMain", "")
        url_subpath = site_config.get("urlSubpath", "")
        url_template = url_template.replace("{urlMain}", url_main).replace("{urlSubpath}", url_subpath)

        claimed = site_config.get("usernameClaimed")
        unclaimed = site_config.get("usernameUnclaimed", unclaimed)

        if site_config.get("headers"):
            headers.update(site_config["headers"])

        if not args.json:
            print(f"Loaded site: {args.site}")
            print(f" URL: {url_template}")
            print(f" Claimed: {claimed}")
            print(f" CheckType: {site_config.get('checkType', 'unknown')}")
            print(f" Disabled: {site_config.get('disabled', False)}")

    # Override with explicit URL
    if args.url:
        url_template = args.url

    # Custom headers
    if args.headers:
        headers.update(json.loads(args.headers))

    # Actions
    if args.diagnose:
        if not site_config:
            print("--diagnose requires --site")
            sys.exit(1)
        result = await diagnose_site(site_config, args.site)
        if args.json:
            print(json.dumps(result, indent=2, default=str))

    elif args.compare_methods:
        if not maigret_site:
            if not MAIGRET_AVAILABLE:
                print("Maigret imports not available")
            else:
                print("Could not load MaigretSite object")
            sys.exit(1)
        result = await compare_methods(maigret_site, claimed, unclaimed)
        if args.json:
            print(json.dumps(result, indent=2, default=str))

    elif args.maigret:
        if not maigret_site:
            if not MAIGRET_AVAILABLE:
                print("Maigret imports not available")
            else:
                print("Could not load MaigretSite object")
            sys.exit(1)

        print("\n--- Testing with Maigret ---")
        for username in [claimed, unclaimed]:
            result = await check_url_maigret(maigret_site, username)
            print(f" {username}: status={result.get('status_str')}, http={result.get('http_status')}, error={result.get('error')}")

    elif args.find_user:
        if not url_template:
            print("--find-user requires --site or --url")
            sys.exit(1)
        result = await find_valid_username(url_template, headers=headers)
        if result:
            print(f"\n{color('Found valid username:', Colors.GREEN)} {result}")
        else:
            print(f"\n{color('No valid username found', Colors.RED)}")

    elif args.compare:
        if not url_template:
            print("--compare requires --site or --url")
            sys.exit(1)
        result = await compare_users_aiohttp(url_template, args.compare[0], args.compare[1], headers)
        if args.json:
            # Remove content field for JSON output (too large)
            for r in result:
                if isinstance(r, dict) and "content" in r:
                    del r["content"]
            print(json.dumps(result, indent=2, default=str))

    elif args.check_claimed and claimed:
        result = await compare_users_aiohttp(url_template, claimed, unclaimed, headers)

    elif args.test:
        if not url_template:
            print("--test requires --site or --url")
            sys.exit(1)
        url = url_template.replace("{username}", args.test)
        result = await check_url_aiohttp(url, headers, timeout=args.timeout)
        if "content" in result:
            del result["content"]  # Too large for display
        print(json.dumps(result, indent=2, default=str))

    else:
        # Default: check claimed username if available
        if url_template and claimed:
            await compare_users_aiohttp(url_template, claimed, unclaimed, headers)
        else:
            parser.print_help()


if __name__ == "__main__":
    asyncio.run(main())
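# Example invocation (from the repository root; flags as defined in main() above):
#   python utils/site_check.py --site "VK" --diagnose --json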