feat: add multilingual batch scanner with parallel execution and LLM gap-fill#100
feat: add multilingual batch scanner with parallel execution and LLM gap-fill#100WhereIs38 wants to merge 15 commits into
Conversation
batch_scan.py main(): reconfigure stdout to UTF-8 on win32 so Rich terminal output with CJK characters renders correctly. Co-Authored-By: Claude <noreply@anthropic.com>
fix: add Windows Unicode stdout support for CJK output
a32aa67 to
22de8d6
Compare
rng1995
left a comment
There was a problem hiding this comment.
This is a substantial, thoughtfully-engineered contribution — and keeping it entirely under contrib/multilingual/ (no changes to core files) is the right call, since it can't affect the core scanner for normal users. Language detection, parallel ThreadPoolExecutor orchestration, and the additive gap-fill pass all look reasonable, and the annotation layer only labels findings (language_compatible) rather than suppressing them, so detection coverage isn't weakened. A few things should be addressed before merge, though.
1. The API key pool is built but never actually used in the scan path.
create_api_key_pool_from_env() is instantiated in batch_scan.main(), but pool.acquire() / pool.release() are only ever called inside PooledChatModel, and PooledChatModel is never instantiated anywhere in the flow. Gap-fill goes through GapFillAnalyzer (core LLMAnalyzerBase → get_chat_model) and the graph uses core directly — neither touches the pool. Net effect:
- the ~590-line multi-key rotation / 429 backoff logic is effectively dead code at runtime;
snapshot()['rate_limits_hit']stays 0, so the pool summary never prints;- the
batch_scanmodule docstring's claim that "API rate-limit protection is provided by theApiKeyPoolfor GapFill calls" is inaccurate; - a user who configures only
SKILLSPECTOR_API_KEYS(and not the env var core reads) gets a cosmetic pool but no key actually used. Please either wirePooledChatModelinto the gap-fill / graph LLM calls, or drop the pool and adjust the docs to match what's implemented.
2. Import-time global monkey-patching is invasive and fragile.
runner.py replaces asyncio.run process-wide and patches LLMAnalyzerBase, LLMMetaAnalyzer, and ChatOpenAI at import time. Importing the module silently mutates global behavior for the whole process, and several patches depend on internal details (Pydantic alias precedence, MRO instance-attribute injection) that can break on upstream updates. Please consider scoping these via an explicit setup function / context manager rather than import side-effects, and narrow the broad except (json.JSONDecodeError, Exception) handlers (the second makes the first redundant and can mask real bugs).
3. The riskiest code is untested.
Tests cover discovery, detection, report structure, and a --no-llm e2e — good — but the concurrency-heavy / failure-prone pieces (the ApiKeyPool scheduler, retry/backoff, the monkey-patches, and gap_fill parsing) have no coverage. Given this is where bugs are most likely, please add unit tests for the pool's acquire/release/backoff/recovery and for GapFillAnalyzer.parse_response.
Minor: record_retry_success() is incremented on each retry attempt, not on success; and the rm -rf subprocess fallback in cleanup_result is largely unreachable since shutil.rmtree(ignore_errors=True) won't raise.
None of this affects core, and the bones are good — happy to re-review once the pool is integrated (or removed) and the risky paths have tests.
…l test files - Add SPDX license header to 8 test files - Add from __future__ import annotations to 8 test files - Fix Unicode stdout crash in test_pool_wiring.py on Windows - Add conftest.py with pytest markers registration - 120 tests passing Co-Authored-By: Claude <noreply@anthropic.com>
a4a2a0c to
d1a0b1f
Compare
fix: add SPDX headers, from __future__ annotations, conftest.py to al…
set_api_pool previously only patched llm_utils.get_chat_model, but llm_analyzer_base uses a module-level from-import that created a local reference bypassing the pool. Graph analyzers (95% of LLM calls) were not using PooledChatModel. Now patches both llm_utils and llm_analyzer_base, plus adds LLMAnalyzerBase._llm verification to test_pool_wiring.py. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
Documentation (12 md, zero stale refs, cross-linked footers): - README: TOC, badges, all commands, reviewer index - REVIEW_RESPONSE: full 3-issue response, before/after tables - DESIGN: dual-patch mechanism, updated file layout - New CONTRIBUTING.md at module root (GitHub standard) - Archive 7->5: merged COMMAND_REFERENCE->README, RISK_TABLE->PITFALLS New thematic tests (44 tests, answering review concerns): - test_monkeypatch_invasiveness.py: 14 tests (thread isolation, import safety) - test_monkeypatch_fragility.py: 26 tests (per-patch guard, deep deps, atomicity) 164 tests total, all passing. Production code unchanged (runner.py fix 08f624c). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
d1a0b1f to
d1b157e
Compare
|
@rng1995 — thanks for the thorough review, and apologies for the delayed response. I've spent the last few days systematically addressing each concern with tests and documentation. This module has been merged with the upstream Quick note on diff size: the 9,100 lines break down as ~2,900 production ( All 7 monkey-patches survived the merge with zero conflicts, and all 164 tests continue to pass. The mutation suite holds steady at 21/30 caught — no regressions after the upstream merge. The 9 misses (least-loaded scheduling edge case, Patch 3/6/7 restore ordering, The documentation has been reorganized for contributor readability: a reviewer index with links to every changed file, cross-linked footers throughout, and a Note also the upstream To your three points: #1 Pool dead code → fully wired
$ python contrib/multilingual/tests/test_pool_wiring.py
✅ Pool created: 10 keys
✅ get_chat_model → PooledChatModel (llm_utils path)
✅ LLMAnalyzerBase._llm → PooledChatModel (graph path, 95% of calls)
✅ GapFillAnalyzer.chat_model → PooledChatModel (gap-fill path)
✅ set_api_pool(None) restores both modulesAnd under real load — 23 skills, 20 workers, 10 keys: The pool schedules, rate-limits are tracked, and the summary prints. No longer cosmetic. #2 Invasive + fragile → explicit + guarded + 44 testsPatches now fire only via
The broad $ python contrib/multilingual/tests/test_monkeypatch_invasiveness.py
Ran 14 tests in 8.859s — OK
Subprocess import isolation
50 concurrent instances, zero races (V1 regression)
Cross-thread independent contexts
Instance-attr vs class-attr proof
$ python contrib/multilingual/tests/test_monkeypatch_fragility.py
Ran 26 tests in 0.001s — OK
Each of 7 patches individually guarded
Deep dependency detection (model_validate, to_finding, file_path, etc.)
Guard fails → ZERO patches applied#3 Risky code untested → 164 tests120 unit tests across the 4 areas you flagged, plus 44 thematic tests directly answering #1 and #2:
$ python contrib/multilingual/tests/tests-pro/random_numbered.py
Ran 120 tests in 1.2s — OK (seed=42, random order)Real-world performance23 skills, 20 workers, 10 keys. Pool saturated at 50/50 slots, 157 requests served, zero deadlocks. LLM output errors (malformed JSON) are caught and Clean skills stay clean ( Upstream compatibilityMerged ab0431f (OSS 2.3.7, 130+ commits, 89 files). Minor items you noted
Full response with before/after tables: Reviewer index with links to all changed files: Signed-off-by: WhereIs38 CinderellaDoyle@icloud.com |
Closes #98
Summary
Adds
contrib/multilingual/— a multilingual batch scanner that scans directories of AI agent skills in parallel, with automatic language detection and targeted LLM gap-fill for non-English skills.Zero changes to
src/skillspector/. All integration is via import-time patches that wrap upstream constructors without modifying any source file.Why this module exists
The upstream project scans one skill at a time — great for depth, but serial execution means LLM latency stacks linearly. I needed to scan many skills quickly, so this module avoids serial bottlenecks by design.
Scale. Each skill runs in an isolated thread via
ThreadPoolExecutor. With enough API keys, adding workers cuts total scan time proportionally — 23 skills finish in ~2 minutes at 8 workers, roughly one human-agent conversation round. The ceiling is the user's key count, not the code: 100 keys scanning 2000 skills still finish in minutes.Cost. Parallel scanning means high token throughput, so I chose DeepSeek — the cheapest per-token option — for development and testing. The module itself is provider-agnostic: any OpenAI-compatible endpoint works. I couldn't test local models due to hardware constraints (Mac with limited RAM, a 4 GB VRAM Windows machine). That remains a known gap I hope someone with better hardware can fill.
Compatibility. The module is tested on macOS and Windows.
runner.pyapplies a small set of import-time patches so DeepSeek works out of the box; the patches follow standard OpenAI-compatible protocol, so Ollama and other endpoints should work as well. All patches are non-invasive and self-contained withincontrib/multilingual/.In short: upstream provides the detection algorithms; this contrib provides the reach. If accepted, I'm interested in continuing to improve scalability and external provider compatibility upstream.
What It Does
SKILL.mddirectories under input rootThreadPoolExecutorrunsgraph.invoke()per skill, configurable--workersEvidence (23 built-in fixtures, 8 workers)
--no-llmssd1_semantic_injectionssd3_nl_exfiltrationssd4_narrative_deceptionsdi4_divergencesafe_skillssd_cleanLLM semantic analyzers catch entire vulnerability categories invisible to static patterns. Clean skills remain clean — zero false-positive inflation.
How to verify
Prerequisites
Create
.envin the repo root with 10 different DeepSeek API keys (theApiKeyPoolrotates across keys to avoid rate-limiting):Edit
.envand fill in:Activation
source .venv/bin/activateUnit tests (no API keys needed, < 2s)
Test 1 — Static mode (no LLM required, ~0.7s, default 4 workers)
Expected: 23/23 skills, ~0.7 s, 8 CRITICAL / HIGH findings.
Test 2 — LLM parallel mode (requires API keys, ~2 min)
Expected: 23/23 skills, ~2 min, 15 CRITICAL / HIGH findings (LLM catches semantic injection, narrative deception, and other vulnerabilities that static patterns miss).
Test 3 — Single-worker mode (for free-tier API keys)
Testing
18 unit tests in
contrib/multilingual/tests/cover discovery, language detection, JSON / Markdown report formatting, and an end-to-end--no-llmscan. Deterministic components are fully covered. LLM-dependent output is inherently non-deterministic and requires live API keys — the static-vs-LLM comparison in README provides more meaningful evidence for those paths than any mock-based test could.make lintpasses on the upstream codebase.🤖 Generated with Claude Code
Signed-off-by: WhereIs38 CinderellaDoyle@icloud.com
README.md
DESIGN.md
CONTRIBUTING.md