mirror of
https://github.com/soxoj/maigret.git
synced 2026-05-07 06:24:35 +00:00
Tags and site names improvements (#2427)
- Added social tag to social networks (33 sites)
- Fixed wrong tags (8 sites)
- Filled empty tags for 213 sites in top-1000
- Country tag cleanup (~374 sites)
- Site naming normalization (75 sites)
- New tests (3)
- Documentation updates
@@ -97,6 +97,46 @@ Practical observations from fixing top-ranked sites. Full details: section **7**
 | **Non-standard anti-bot HTTP codes** | HTTP 468 (Tengine), 520–530 (Cloudflare) — not standard 403/429. Check with `curl -sIL`; if code is from intermediary → `disabled: true`. |
 | **`--diagnose` doesn't test POST** | `site_check.py --diagnose` uses GET only. For POST APIs (Discord, Holopin), verify with `curl -X POST` or `maigret --self-check`. |

-## 8. Documentation maintenance
+## 8. Site naming rules
+
+Site names in `data.json` are the **keys** of the `"sites"` object and appear in user-facing reports. Follow these rules:
+
+| Rule | Example | Counter-example |
+|------|---------|-----------------|
+| **Title Case** by default | `Hacker News`, `Product Hunt` | ~~`hackernews`~~, ~~`product hunt`~~ |
+| **Lowercase** if the brand is written that way | `kofi`, `note`, `hi5` | ~~`Kofi`~~, ~~`Note`~~ |
+| **No domain suffix** unless it is part of the recognized brand | `Flickr`, `Calendly`, `Upwork` | ~~`www.flickr.com`~~, ~~`calendly.com`~~ |
+| **Domain OK** when the brand is commonly written with it | `last.fm`, `VC.ru`, `Archive.org` | |
+| **No full UPPERCASE** unless the brand is an acronym/initialism | `VK`, `CNET`, `ICQ`, `IFTTT` | ~~`BOOTH`~~, ~~`VSCO`~~ → `Booth`, `VSCO` (brand) |
+| **`{username}` templates** in names are OK | `{username}.tilda.ws` | |
+| **Spaces** are allowed when the brand uses them | `Star Citizen`, `Google Maps` | |
+| **No `www.` or `https://`** prefix | `Flickr`, `Change.org` | ~~`www.flickr.com`~~, ~~`https:`~~ |
+
+When in doubt, check how the service refers to itself on its homepage or in its page title.
+
+## 9. Tagging rules
+
+### Country tags (ISO 3166-1 alpha-2)
+
+The goal of a country tag is to **attribute a person to their country of origin or residence**, not to be a perfect truth source.
+
+| Scenario | Action | Example |
+|----------|--------|---------|
+| Site is global, account says nothing about country | **No country tag** | GitHub, YouTube, Reddit, Medium, Udemy |
+| Account implies connection to a specific country | **Add country tag** | VK → `ru`, Naver → `kr`, Zhihu → `cn` |
+| Service used mostly in a few specific countries | **Multiple country tags OK** | Xing → `de`, `eu` |
+| Very local/regional site | **Must have country tag** | Nairaland → `ng`, 4pda → `ru` |
+
+**Do NOT** assign country tags based on traffic statistics (e.g. Alexa/SimilarWeb audience data). A site popular in India by traffic is not "Indian" if it is used globally. The `in` tag was previously over-applied this way.
+
+### Category tags
+
+- Every tag used in `data.json` must be registered in the `"tags"` array at the bottom of the file. The `test_tags_validity` test enforces this.
+- Do not use platform/software names as tags (`writefreely`, `pixelfed`). Use category names instead (`blog`, `photo`).
+- Avoid 2-letter category tags that collide with ISO country codes (e.g. `ai` = Anguilla). The `is_country_tag()` function treats any 2-letter tag as a country code.
+- Keep existing category tags when modifying country tags.
+- Top-50 sites by alexaRank must have at least one category tag (enforced by `test_top_sites_have_category_tag`).
+
+## 10. Documentation maintenance

 When you change Maigret, add search tools, or change check logic, keep **this playbook**, [`site-checks-guide.md`](site-checks-guide.md), and (when applicable) the template in [`socid_extractor_improvements.log`](socid_extractor_improvements.log) aligned. New log **entries** are append-only at the bottom of that file.
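The category-tag rules added in the hunk above (every tag registered, 2-letter tags treated as country codes) can be illustrated with a small check. This is a minimal sketch, not the project's code: the `is_country_tag` stand-in only mirrors the playbook's one-line description of the real helper, `find_unregistered_tags` is a hypothetical name, and the per-site `"tags"` field and the sample entries are assumed for illustration.

```python
def is_country_tag(tag: str) -> bool:
    # Stand-in for maigret.utils.is_country_tag, written only from the
    # playbook's description: any 2-letter tag counts as a country code.
    # The real implementation may differ.
    return len(tag) == 2


def find_unregistered_tags(sites: dict, registered: list) -> set:
    """Return category tags used by sites but missing from the registry."""
    known = set(registered)
    unknown = set()
    for site in sites.values():
        for tag in site.get("tags", []):
            # Country tags are skipped, so a 2-letter category tag such as
            # "ai" is never checked against the registry at all.
            if not is_country_tag(tag) and tag not in known:
                unknown.add(tag)
    return unknown


# Illustrative data only; the layout is an assumption, not real data.json content.
db = {
    "sites": {
        "Hacker News": {"tags": ["news", "coding"]},
        "VK": {"tags": ["ru", "social"]},
        "Example AI": {"tags": ["ai", "chatbot"]},
    },
    "tags": ["news", "coding", "social"],
}

unknown = find_unregistered_tags(db["sites"], db["tags"])
print(sorted(unknown))  # ['chatbot']
```

Note that `ai` passes silently even though it is not registered, which is the collision the rule warns about.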
@@ -69,6 +69,21 @@ Use the following commands to check Maigret:
 make speed


+Site naming conventions
+-----------------------------------------------
+
+Site names are the keys in ``data.json`` and appear in user-facing reports. Follow these rules:
+
+- **Title Case** by default: ``Product Hunt``, ``Hacker News``.
+- **Lowercase** only if the brand itself is written that way: ``kofi``, ``note``, ``hi5``.
+- **No domain suffix** (``calendly.com`` → ``Calendly``), unless the domain is part of the recognized brand name: ``last.fm``, ``VC.ru``, ``Archive.org``.
+- **No full UPPERCASE** unless the brand is an acronym: ``VK``, ``CNET``, ``ICQ``, ``IFTTT``.
+- **No** ``www.`` **or** ``https://`` **prefix** in the name.
+- **Spaces** are allowed when the brand uses them: ``Star Citizen``, ``Google Maps``.
+- **{username} templates** in names are acceptable: ``{username}.tilda.ws``.
+
+When in doubt, check how the service refers to itself on its homepage.
+
 How to fix false-positives
 -----------------------------------------------
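Two of the naming conventions above are mechanical enough to check automatically: the ban on ``www.``/``https://`` prefixes and the ban on fully uppercase names that are not acronyms. The sketch below is a hypothetical helper (`name_problems` and `KNOWN_ACRONYMS` are not part of Maigret), and the acronym allowlist simply reuses the examples from the rules. Title Case and brand-spelling decisions still need a human.

```python
# Allowlist seeded with the acronym examples from the naming rules (assumption:
# a real check would need a longer, maintained list).
KNOWN_ACRONYMS = {"VK", "CNET", "ICQ", "IFTTT"}


def name_problems(name: str, acronyms=KNOWN_ACRONYMS) -> list:
    """Return a list of rule violations found in a site name."""
    problems = []

    # Rule: no www. or https:// prefix in the name.
    if name.lower().startswith(("www.", "https://")):
        problems.append("has www. or https:// prefix")

    # Rule: no full UPPERCASE unless the brand is an acronym.
    letters = [c for c in name if c.isalpha()]
    if len(letters) > 1 and all(c.isupper() for c in letters) and name not in acronyms:
        problems.append("fully uppercase but not a known acronym")

    return problems


for candidate in ["Flickr", "www.flickr.com", "BOOTH", "VK", "{username}.tilda.ws"]:
    print(candidate, "->", name_problems(candidate))
```

Names such as ``last.fm`` or ``VC.ru`` pass, since a domain suffix is only flagged by human review of whether it belongs to the brand.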
@@ -10,7 +10,12 @@ The use of tags allows you to select a subset of the sites from big Maigret DB f
 There are several types of tags:

-1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). These tags reflect the site language and regional origin of its users and are then used to locate the owner of a username. If the regional origin is difficult to establish or a site is positioned as worldwide, `no country code is given`. There could be multiple country code tags for one site.
+1. **Country codes**: ``us``, ``jp``, ``br``... (`ISO 3166-1 alpha-2 <https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`_). A country tag means that having an account on the site implies a connection to that country — either origin or residence. The goal is attribution, not perfect accuracy.
+
+   - **Global sites** (GitHub, YouTube, Reddit, Medium, etc.) get **no country tag** — an account there says nothing about where a person is from.
+   - **Regional/local sites** where an account implies a specific country **must** have a country tag: ``VK`` → ``ru``, ``Naver`` → ``kr``, ``Zhihu`` → ``cn``.
+   - Multiple country tags are allowed when a service is used predominantly in a few countries (e.g. ``Xing`` → ``de``, ``eu``).
+   - Do **not** assign country tags based on traffic statistics alone — a site popular in India by traffic is not "Indian" if it is used globally.

 2. **Site engines**. Most of them are forum engines now: ``uCoz``, ``vBulletin``, ``XenForo`` et al. Full list of engines stored in the Maigret database.
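To see why global sites get no country tag, it helps to look at what a tag-based selection does with them. The following is a minimal sketch under stated assumptions: `select_sites` is a hypothetical helper, matching on "any of the wanted tags" is an assumption rather than Maigret's documented behavior, and the site entries are illustrative.

```python
def select_sites(sites: dict, wanted_tags: list) -> list:
    """Return names of sites that carry at least one of the wanted tags."""
    wanted = set(wanted_tags)
    return [
        name
        for name, info in sites.items()
        if wanted & set(info.get("tags", []))
    ]


# Illustrative entries following the country-tag rules above.
sites = {
    "VK": {"tags": ["ru", "social"]},
    "Naver": {"tags": ["kr"]},
    "GitHub": {"tags": ["coding"]},  # global site: no country tag
}

print(select_sites(sites, ["ru"]))        # ['VK']
print(select_sites(sites, ["kr", "ru"]))  # ['VK', 'Naver']
```

A global site never appears in a country-based selection, so a hit on a country-tagged site carries an attribution signal that a hit on GitHub would not.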
+1557
-1822
File diff suppressed because it is too large
@@ -4,6 +4,30 @@ import pytest
 from maigret.utils import is_country_tag


+TOP_SITES_ALEXA_RANK_LIMIT = 50
+
+KNOWN_SOCIAL_DOMAINS = [
+    "facebook.com",
+    "instagram.com",
+    "twitter.com",
+    "tiktok.com",
+    "vk.com",
+    "reddit.com",
+    "pinterest.com",
+    "snapchat.com",
+    "linkedin.com",
+    "tumblr.com",
+    "threads.net",
+    "bsky.app",
+    "myspace.com",
+    "weibo.com",
+    "mastodon.social",
+    "gab.com",
+    "minds.com",
+    "clubhouse.com",
+]
+
+
 @pytest.mark.slow
 def test_tags_validity(default_db):
     unknown_tags = set()
@@ -19,3 +43,62 @@ def test_tags_validity(default_db):
     # if you see "unchecked" tag error, please, do
     # maigret --db `pwd`/maigret/resources/data.json --self-check --tag unchecked --use-disabled-sites
     assert unknown_tags == set()
+
+
+@pytest.mark.slow
+def test_top_sites_have_category_tag(default_db):
+    """Top sites by alexaRank must have at least one category tag (not just country codes)."""
+    sites_ranked = sorted(
+        [s for s in default_db.sites if s.alexa_rank],
+        key=lambda s: s.alexa_rank,
+    )[:TOP_SITES_ALEXA_RANK_LIMIT]
+
+    missing_category = []
+    for site in sites_ranked:
+        category_tags = [t for t in site.tags if not is_country_tag(t)]
+        if not category_tags:
+            missing_category.append(f"{site.name} (rank {site.alexa_rank})")
+
+    assert missing_category == [], (
+        f"{len(missing_category)} top-{TOP_SITES_ALEXA_RANK_LIMIT} sites have no category tag: "
+        + ", ".join(missing_category[:20])
+    )
+
+
+@pytest.mark.slow
+def test_no_unused_tags_in_registry(default_db):
+    """Every tag in the registry should be used by at least one site."""
+    all_used_tags = set()
+    for site in default_db.sites:
+        for tag in site.tags:
+            if not is_country_tag(tag):
+                all_used_tags.add(tag)
+
+    registered_tags = set(default_db._tags)
+    unused = registered_tags - all_used_tags
+
+    assert unused == set(), f"Tags registered but not used by any site: {unused}"
+
+
+@pytest.mark.slow
+def test_social_networks_have_social_tag(default_db):
+    """Known social network domains must have the 'social' tag."""
+    from urllib.parse import urlparse
+
+    missing_social = []
+    for site in default_db.sites:
+        url = site.url_main or ""
+        try:
+            hostname = urlparse(url).hostname or ""
+        except Exception:
+            continue
+        for domain in KNOWN_SOCIAL_DOMAINS:
+            if hostname == domain or hostname.endswith("." + domain):
+                if "social" not in site.tags:
+                    missing_social.append(f"{site.name} ({domain})")
+                break
+
+    assert missing_social == [], (
+        f"{len(missing_social)} known social networks missing 'social' tag: "
+        + ", ".join(missing_social)
+    )
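The hostname comparison in `test_social_networks_have_social_tag` is worth isolating, because the leading dot in `"." + domain` is what keeps look-alike domains from matching. Below is a minimal standalone sketch of that comparison; `matches_domain` is a hypothetical helper name and the URLs are made up for illustration.

```python
from urllib.parse import urlparse


def matches_domain(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    hostname = urlparse(url).hostname or ""
    # Exact match, or a suffix match that starts at a label boundary.
    return hostname == domain or hostname.endswith("." + domain)


print(matches_domain("https://facebook.com/someuser", "facebook.com"))      # True
print(matches_domain("https://www.facebook.com/someuser", "facebook.com"))  # True
print(matches_domain("https://notfacebook.com/someuser", "facebook.com"))   # False
print(matches_domain("", "facebook.com"))                                   # False
```

A plain `hostname.endswith(domain)` would wrongly accept `notfacebook.com`; requiring the dot restricts matches to real subdomains.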