Fix crash on -a --self-check by adding exception handling to site check coroutines (#2466)

* Initial plan

* Fix crash on -a --self-check by adding exception handling in site_self_check and self_check

Wrap the body of site_self_check in try/except to catch unexpected errors
and always return a valid changes dict. Also add a safety-net try/except
in self_check around awaiting individual site check futures so that a
single site failure doesn't crash the entire self-check process.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Restore @pytest.mark.slow on test_maigret_results

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/5e27d620-5cbb-43d2-a9f9-ecb53a29904d

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

* Document --self-check error resilience, --auto-disable, and --diagnose in docs/

Update command-line-options.rst with expanded --self-check description
and new --auto-disable and --diagnose entries. Add a "Database self-check"
section to features.rst explaining error-resilient behaviour and usage
examples. Update usage-examples.rst to reference --auto-disable.

Agent-Logs-Url: https://github.com/soxoj/maigret/sessions/af1f0f09-9112-4902-8475-e81d235ff3ed

Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: soxoj <31013580+soxoj@users.noreply.github.com>
This commit is contained in:
Copilot
2026-04-07 19:44:09 +02:00
committed by GitHub
parent 6834483360
commit 23adc178ea
5 changed files with 222 additions and 119 deletions
+19 -6
View File
@@ -133,12 +133,25 @@ Other operations modes
``--version`` - Display version information and dependencies. ``--version`` - Display version information and dependencies.
``--self-check`` - Do self-checking for sites and database and disable ``--self-check`` - Do self-checking for sites and database. Each site is
non-working ones **for current search session** by default. Its useful tested by looking up its known-claimed and known-unclaimed usernames and
for testing new internet connection (it depends on provider/hosting on verifying that the results match expectations. Individual site failures
which sites there will be censorship stub or captcha display). After (network errors, unexpected exceptions, etc.) are caught and logged
checking Maigret asks if you want to save updates, answering y/Y will without stopping the overall process, so the check always runs to
rewrite the local database. completion. After checking, Maigret reports a summary of issues found.
If any sites were disabled (see ``--auto-disable``), Maigret asks if you
want to save updates; answering y/Y will rewrite the local database.
``--auto-disable`` - Used with ``--self-check``: automatically disable
sites that fail checks (incorrect detection of claimed/unclaimed
usernames, connection errors, or unexpected exceptions). Without this
flag, ``--self-check`` only **reports** issues without modifying the
database.
``--diagnose`` - Used with ``--self-check``: print detailed diagnosis
information for each failing site, including the check type, the list
of issues found, and recommendations (e.g. suggesting a different
``checkType``).
``--submit URL`` - Do an automatic analysis of the given account URL or ``--submit URL`` - Do an automatic analysis of the given account URL or
site main page URL to determine the site engine and methods to check site main page URL to determine the site engine and methods to check
+29
View File
@@ -170,6 +170,35 @@ Maigret will do retries of the requests with temporary errors got (connection fa
One attempt by default, can be changed with option ``--retries N``. One attempt by default, can be changed with option ``--retries N``.
Database self-check
-------------------
Maigret includes a self-check mode (``--self-check``) that validates every site
in the database by looking up its known-claimed and known-unclaimed usernames
and verifying that the detection results match expectations.
The self-check is **error-resilient**: if an individual site check raises an
unexpected exception (e.g. a network error or a parsing failure), the error is
caught, logged, and recorded as an issue — the remaining sites continue to be
checked without interruption. This means the process always runs to completion,
even when checking hundreds of sites with ``-a --self-check``.
Use ``--auto-disable`` together with ``--self-check`` to automatically disable
sites that fail checks. Without it, issues are only reported. Use ``--diagnose``
to print detailed per-site diagnosis including the check type, specific issues,
and recommendations.
.. code-block:: console
# Report-only mode (no changes to the database)
maigret --self-check
# Automatically disable failing sites and save updates
maigret -a --self-check --auto-disable
# Show detailed diagnosis for each failing site
maigret -a --self-check --diagnose
Archives and mirrors checking Archives and mirrors checking
----------------------------- -----------------------------
+1 -1
View File
@@ -33,7 +33,7 @@ Use Cases
If you experience many false positives, you can do the following: If you experience many false positives, you can do the following:
- Install the last development version of Maigret from GitHub - Install the last development version of Maigret from GitHub
- Run Maigret with ``--self-check`` flag and agree on disabling of problematic sites - Run Maigret with ``--self-check --auto-disable`` flag and agree on disabling of problematic sites
3. Search for accounts with username ``machine42`` and generate HTML and PDF reports. 3. Search for accounts with username ``machine42`` and generate HTML and PDF reports.
+136 -111
View File
@@ -967,129 +967,143 @@ async def site_self_check(
"recommendations": [], "recommendations": [],
} }
check_data = [ try:
(site.username_claimed, MaigretCheckStatus.CLAIMED), check_data = [
(site.username_unclaimed, MaigretCheckStatus.AVAILABLE), (site.username_claimed, MaigretCheckStatus.CLAIMED),
] (site.username_unclaimed, MaigretCheckStatus.AVAILABLE),
]
logger.info(f"Checking {site.name}...") logger.info(f"Checking {site.name}...")
results_cache = {} results_cache = {}
for username, status in check_data: for username, status in check_data:
async with semaphore: async with semaphore:
results_dict = await maigret( results_dict = await maigret(
username=username, username=username,
site_dict={site.name: site}, site_dict={site.name: site},
logger=logger, logger=logger,
timeout=30, timeout=30,
id_type=site.type, id_type=site.type,
forced=True, forced=True,
no_progressbar=True, no_progressbar=True,
retries=1, retries=1,
proxy=proxy, proxy=proxy,
tor_proxy=tor_proxy, tor_proxy=tor_proxy,
i2p_proxy=i2p_proxy, i2p_proxy=i2p_proxy,
cookies=cookies, cookies=cookies,
)
# don't disable entries with other ids types
# TODO: make normal checking
if site.name not in results_dict:
logger.info(results_dict)
changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
if auto_disable:
changes["disabled"] = True
continue
logger.debug(results_dict)
result = results_dict[site.name]["status"]
results_cache[username] = results_dict[site.name]
if result.error and 'Cannot connect to host' in result.error.desc:
changes["issues"].append("Cannot connect to host")
if auto_disable:
changes["disabled"] = True
site_status = result.status
if site_status != status:
if site_status == MaigretCheckStatus.UNKNOWN:
msgs = site.absence_strs
etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
) )
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check # don't disable entries with other ids types
# e.g. many sites protected by cloudflare and available in general # TODO: make normal checking
if skip_errors: if site.name not in results_dict:
pass logger.info(results_dict)
# don't disable in case of available username changes["issues"].append(f"Site {site.name} not in results (wrong id_type?)")
elif status == MaigretCheckStatus.CLAIMED and auto_disable: if auto_disable:
changes["disabled"] = True changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED: continue
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning( logger.debug(results_dict)
f"Not found `{username}` in {site.name}, must be claimed"
) result = results_dict[site.name]["status"]
logger.info(results_dict[site.name]) results_cache[username] = results_dict[site.name]
if auto_disable:
changes["disabled"] = True if result.error and 'Cannot connect to host' in result.error.desc:
else: changes["issues"].append("Cannot connect to host")
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
if auto_disable: if auto_disable:
changes["disabled"] = True changes["disabled"] = True
logger.info(f"Site {site.name} checking is finished") site_status = result.status
# Generate recommendations based on issues if site_status != status:
if changes["issues"] and len(results_cache) == 2: if site_status == MaigretCheckStatus.UNKNOWN:
claimed_result = results_cache.get(site.username_claimed, {}) msgs = site.absence_strs
unclaimed_result = results_cache.get(site.username_unclaimed, {}) etype = site.check_type
error_msg = f"Error checking {username}: {result.context}"
changes["issues"].append(error_msg)
logger.warning(
f"Error while searching {username} in {site.name}: {result.context}, {msgs}, type {etype}"
)
# don't disable sites after the error
# meaning that the site could be available, but returned error for the check
# e.g. many sites protected by cloudflare and available in general
if skip_errors:
pass
# don't disable in case of available username
elif status == MaigretCheckStatus.CLAIMED and auto_disable:
changes["disabled"] = True
elif status == MaigretCheckStatus.CLAIMED:
changes["issues"].append(f"Claimed user '{username}' not detected as claimed")
logger.warning(
f"Not found `{username}` in {site.name}, must be claimed"
)
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
else:
changes["issues"].append(f"Unclaimed user '{username}' detected as claimed")
logger.warning(f"Found `{username}` in {site.name}, must be available")
logger.info(results_dict[site.name])
if auto_disable:
changes["disabled"] = True
claimed_http = claimed_result.get("http_status") logger.info(f"Site {site.name} checking is finished")
unclaimed_http = unclaimed_result.get("http_status")
if claimed_http and unclaimed_http: # Generate recommendations based on issues
if claimed_http != unclaimed_http and site.check_type != "status_code": if changes["issues"] and len(results_cache) == 2:
changes["recommendations"].append( claimed_result = results_cache.get(site.username_claimed, {})
f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})" unclaimed_result = results_cache.get(site.username_unclaimed, {})
)
# Print diagnosis if requested claimed_http = claimed_result.get("http_status")
if diagnose and changes["issues"]: unclaimed_http = unclaimed_result.get("http_status")
print(f"\n--- {site.name} DIAGNOSIS ---")
print(f" Check type: {site.check_type}")
print(" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
# Only modify site if auto_disable is enabled if claimed_http and unclaimed_http:
if auto_disable and changes["disabled"] != site.disabled: if claimed_http != unclaimed_http and site.check_type != "status_code":
site.disabled = changes["disabled"] changes["recommendations"].append(
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}") f"Consider checkType: status_code (HTTP {claimed_http} vs {unclaimed_http})"
db.update_site(site) )
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
# remove service tag "unchecked" # Print diagnosis if requested
if "unchecked" in site.tags: if diagnose and changes["issues"]:
site.tags.remove("unchecked") print(f"\n--- {site.name} DIAGNOSIS ---")
db.update_site(site) print(f" Check type: {site.check_type}")
print(" Issues:")
for issue in changes["issues"]:
print(f" - {issue}")
if changes["recommendations"]:
print(" Recommendations:")
for rec in changes["recommendations"]:
print(f" -> {rec}")
# Only modify site if auto_disable is enabled
if auto_disable and changes["disabled"] != site.disabled:
site.disabled = changes["disabled"]
logger.info(f"Switching property 'disabled' for {site.name} to {site.disabled}")
db.update_site(site)
if not silent:
action = "Disabled" if site.disabled else "Enabled"
print(f"{action} site {site.name}...")
elif changes["issues"] and not silent and not diagnose:
# Report issues without disabling
print(f"Issues found in {site.name}: {len(changes['issues'])} (not auto-disabled)")
# remove service tag "unchecked"
if "unchecked" in site.tags:
site.tags.remove("unchecked")
db.update_site(site)
except Exception as e:
logger.warning(
f"Self-check of {site.name} failed with unexpected error: {e}",
exc_info=True,
)
changes["issues"].append(f"Unexpected error: {e}")
if auto_disable and not site.disabled:
changes["disabled"] = True
site.disabled = True
db.update_site(site)
if not silent:
print(f"Disabled site {site.name} (unexpected error)...")
return changes return changes
@@ -1142,7 +1156,18 @@ async def self_check(
if tasks: if tasks:
with alive_bar(len(tasks), title='Self-checking', force_tty=True, disable=no_progressbar) as progress: with alive_bar(len(tasks), title='Self-checking', force_tty=True, disable=no_progressbar) as progress:
for site_name, f in tasks: for site_name, f in tasks:
result = await f try:
result = await f
except Exception as e:
logger.warning(
f"Self-check task for {site_name} raised unexpected error: {e}",
exc_info=True,
)
result = {
"disabled": False,
"issues": [f"Unexpected error: {e}"],
"recommendations": [],
}
result['site_name'] = site_name result['site_name'] = site_name
all_results.append(result) all_results.append(result)
progress() # Update the progress bar progress() # Update the progress bar
+37 -1
View File
@@ -12,7 +12,8 @@ from maigret.maigret import (
extract_ids_from_page, extract_ids_from_page,
extract_ids_from_results, extract_ids_from_results,
) )
from maigret.sites import MaigretSite from maigret.checking import site_self_check
from maigret.sites import MaigretSite, MaigretDatabase
from maigret.result import MaigretCheckResult, MaigretCheckStatus from maigret.result import MaigretCheckResult, MaigretCheckStatus
from tests.conftest import RESULTS_EXAMPLE from tests.conftest import RESULTS_EXAMPLE
@@ -83,6 +84,41 @@ async def test_self_check_progressbar_enabled_by_default(test_db):
assert kwargs.get('disable') is False assert kwargs.get('disable') is False
@pytest.mark.asyncio
async def test_site_self_check_handles_exception(test_db):
"""Verify that site_self_check catches unexpected exceptions and returns a valid result."""
logger = Mock()
sem = asyncio.Semaphore(1)
site = test_db.sites_dict['ValidActive']
with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
result = await site_self_check(site, logger, sem, test_db)
assert isinstance(result, dict)
assert "issues" in result
assert len(result["issues"]) > 0
assert any("Unexpected error" in issue for issue in result["issues"])
@pytest.mark.asyncio
async def test_self_check_handles_task_exception(test_db):
"""Verify that self_check continues when individual site checks raise exceptions."""
logger = Mock()
with patch('maigret.checking.maigret', side_effect=RuntimeError("test crash")):
result = await self_check(
test_db, test_db.sites_dict, logger, silent=True,
no_progressbar=True,
)
assert isinstance(result, dict)
assert 'results' in result
assert len(result['results']) == len(test_db.sites_dict)
for r in result['results']:
assert 'site_name' in r
assert 'issues' in r
@pytest.mark.slow @pytest.mark.slow
@pytest.mark.skip(reason="broken, fixme") @pytest.mark.skip(reason="broken, fixme")
def test_maigret_results(test_db): def test_maigret_results(test_db):