diff --git a/CHANGELOG.md b/CHANGELOG.md index 18603744..7706f5c4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/). - **Nature workflow bridge skill** ([#107](https://github.com/ZimoLiao/scholaraio/issues/107)): Added a ScholarAIO `nature-workflow` bridge skill that routes Nature Portfolio writing and figure workflows to the upstream `nature-skills` repository when installed, keeps ScholarAIO-native fallbacks explicit, documents the install and quick-start path, and includes deterministic plus product-demo fixtures that generate reviewable manuscript, figure, slide, and QA artifacts. +### Fixed + +- **Webextract Markdown table-cell cleanup** ([#110](https://github.com/ZimoLiao/scholaraio/pull/110)): Sanitized malformed block-level code fences emitted inside `qt-web-extractor` table cells before HTTP/MCP extraction results reach `webextract` and ingest consumers, while preserving standalone fenced code blocks and pipe characters inside code-cell content. + ## [1.5.0] — 2026-05-24 ### Added diff --git a/docs/development/third-party-integration-audit.md b/docs/development/third-party-integration-audit.md new file mode 100644 index 00000000..9ddf38ea --- /dev/null +++ b/docs/development/third-party-integration-audit.md @@ -0,0 +1,80 @@ +# ScholarAIO Third-Party Integration Quality Audit + +This document records the quality, reachability, and output validation status of the third-party integrations, APIs, CLIs, and optional toolchains supported by ScholarAIO. + +Integrations are evaluated at the workflow boundary, checking CLI/skill entrypoints, provider implementations, setup diagnostics, output formatting, fallback behaviors, and failure handling. A config test or a broad unit-test filename is not enough evidence to mark an integration surface as Good. + +This audit is not a declaration that the full third-party toolchain is adapted or verified. 
Each row claims only the evidence listed in that row; everything else remains inventory until a focused live or workflow-boundary pass verifies it. + +Status is intentionally conservative: + +- **good**: workflow-boundary evidence exists, including commands, representative output, and failure handling. +- **partially-reviewed**: code-level or fixture evidence exists, but live workflow evidence is still missing. +- **not-yet-reviewed**: inventory only; no quality claim is made. + +--- + +## 1. Quality Matrix + +| Integration / Surface | Category | Status | Verification Path / Test Evidence | Observed Result / Config & Version Boundaries | +| :--- | :--- | :--- | :--- | :--- | +| **qt-web-extractor (HTTP & MCP)** | Web / Agent | **partially-reviewed** | `extract_web`, `_clean_table_code_fences`, `tests/test_webtools_source.py`, fixture pair under `tests/fixtures/` | Sanitizer regression is covered for malformed table-cell code fences and adjacent standalone code blocks. Live daemon canary evidence is still required before this surface is promoted to `good`. Boundaries: `webextract.transport` (HTTP/MCP), `webextract.base_url`, `webextract.mcp_url`, `webextract.api_key`. | +| **GUILessBingSearch** | Web / Agent | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **MinerU Local API** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **MinerU Cloud CLI** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Paper2Any MCP Sidecar** | Parsing/MCP | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Docling Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **PyMuPDF Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **arXiv Search (Atom API)** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. 
| +| **arXiv PDF Download** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **OpenAlex Explore** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Crossref / Semantic Scholar** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Zotero SQLite Import** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Zotero Web API** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **EndNote / RIS** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **USPTO ODP / PPubs** | Patents | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **OpenAI-compatible Chat API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Anthropic Messages API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Google Gemini API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Zhipu API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **vLLM / Ollama Local** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Sentence-transformers Embeddings** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **FAISS Vector / BERTopic** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **MarkItDown Office Ingest** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Office PPTX / DOCX Libraries** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Mermaid / DOT Rendering** | Diagram | **not-yet-reviewed** | N/A | Excluded from current triage phase. 
| +| **Scientific Toolref (Quantum ESPRESSO, etc.)** | Toolref | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **AmberTools / PyMOL** | Scientific | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **rsync / SSH Backup** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. | +| **Setup Diagnostics** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. | + +--- + +## 2. Current Reviewed Surface + +### 2.1 qt-web-extractor (HTTP & MCP) +* **CLI/Skill Entrypoint**: + * CLI: `scholaraio webextract ` (implemented in `cmd_webextract` inside [web.py](../../scholaraio/interfaces/cli/web.py)) + * Skill: `.claude/skills/webextract` +* **Provider/Service Implementation Path**: + * [webtools.py:extract_web](../../scholaraio/providers/webtools.py) +* **Setup Diagnostics**: + * A diagnostic path exists through `scholaraio setup check` (calls `_optional_webtool_detail` inside [setup.py](../../scholaraio/services/setup.py)), which executes `check_webextract_service` to verify that the HTTP/MCP endpoint responds. This PR does not include live daemon evidence from that path. +* **Output Quality & Validation**: + * Outputs parsed GFM Markdown. `_clean_table_code_fences` protects output quality by collapsing malformed block-level code fences inside Wikipedia/infobox table cells into inline code, which fixes broken table rendering. + * Verified via unit and fixture coverage: [wikipedia_infobox_bad.md](../../tests/fixtures/wikipedia_infobox_bad.md), [wikipedia_infobox_clean.md](../../tests/fixtures/wikipedia_infobox_clean.md), and regression tests for standalone fenced code blocks near table or pipe-prefixed lines. +* **Fallback Behavior**: + * Configured via `webextract.transport` (HTTP or MCP). When the transport is HTTP, a failed connection triggers a fallback hint that points to MCP or to the setup checks.
+* **Failure Handling**: + * Unreachable HTTP endpoints raise `WebExtractServiceUnavailableError`, returning a clean user-facing hint with exit code `1`. + * API/Server errors raise `WebExtractError`, showing warnings/errors instead of generic crashes. + +## 3. Not-Yet-Reviewed Inventory + +Rows marked `not-yet-reviewed` in the matrix are intentionally inventory-only. Promoting any of them to `partially-reviewed` or `good` should happen in a focused follow-up that includes: + +- exact CLI command or skill workflow exercised; +- relevant config/version boundaries; +- representative success output; +- failure-mode behavior; +- targeted tests or reproducible smoke evidence. diff --git a/scholaraio/providers/webtools.py b/scholaraio/providers/webtools.py index dc200685..22fcaf04 100644 --- a/scholaraio/providers/webtools.py +++ b/scholaraio/providers/webtools.py @@ -572,6 +572,159 @@ def _extract_web_mcp(url: str, *, cfg: Config | None, timeout: float) -> dict: } +def _split_table_row_cells(row_text: str) -> list[str]: + cells: list[str] = [] + current: list[str] = [] + in_code_fence = False + i = 0 + while i < len(row_text): + if row_text.startswith("```", i): + in_code_fence = not in_code_fence + current.append("```") + i += 3 + continue + char = row_text[i] + if char == "|" and not in_code_fence: + cells.append("".join(current)) + current = [] + else: + current.append(char) + i += 1 + cells.append("".join(current)) + return cells + + +def _clean_single_row(row_text: str) -> str: + cells = _split_table_row_cells(row_text) + cleaned_cells: list[str] = [] + + for i, cell in enumerate(cells): + if i == 0 and not cell.strip(): + cleaned_cells.append(cell) + continue + if i == len(cells) - 1 and not cell.strip() and row_text.endswith("|"): + cleaned_cells.append(cell) + continue + + if "```" in cell: + fence_count = cell.count("```") + cell_to_clean = cell + "\n```" if fence_count % 2 != 0 else cell + parts = cell_to_clean.split("```") + cleaned_parts = [] + for j, part 
in enumerate(parts): + if j % 2 == 0: + cleaned_parts.append(part.replace("\n", " ")) + else: + block = part + if block.startswith("\n"): + block = block[1:] + else: + block_lines = block.split("\n", 1) + if len(block_lines) > 1: + first_line = block_lines[0].strip() + if re.match(r"^[a-zA-Z0-9_-]+$", first_line): + block = block_lines[1] + block_clean = block.replace("\n", " ").strip() + if block_clean: + cleaned_parts.append(f"`{block_clean}`") + else: + cleaned_parts.append("") + cleaned_cell = "".join(cleaned_parts) + cleaned_cell = " " + cleaned_cell.strip() + " " + cleaned_cells.append(cleaned_cell) + else: + cleaned_cells.append(cell.replace("\n", " ")) + + res = "|".join(cleaned_cells) + if not res.endswith("|"): + res += "|" + return res + + +def _clean_table_code_fences(text: str) -> str: + """Sanitize Markdown table cells that contain block-level code blocks/fences. + + Transforms: + | Col | ```\nval\n``` | + Into: + | Col | `val` | + """ + if not text: + return "" + + lines = text.splitlines() + cleaned_lines: list[str] = [] + current_row_lines: list[str] = [] + in_multiline_row = False + in_code_block = False + + def flush_current_row(): + nonlocal in_multiline_row, current_row_lines, in_code_block + if current_row_lines: + row_text = "\n".join(current_row_lines) + cleaned_row = _clean_single_row(row_text) + cleaned_lines.append(cleaned_row) + current_row_lines = [] + in_multiline_row = False + in_code_block = False + + for line in lines: + stripped = line.strip() + + if in_multiline_row: + num_fences = stripped.count("```") + if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped): + flush_current_row() + # fall through to process as a new row start below + else: + if num_fences % 2 != 0: + in_code_block = not in_code_block + + if not in_code_block: + if stripped.endswith("|"): + current_row_lines.append(line) + flush_current_row() + continue + elif not stripped: + flush_current_row() + cleaned_lines.append(line) + continue + 
elif stripped.startswith("```"): + flush_current_row() + # fall through to process as normal + + if in_multiline_row: + current_row_lines.append(line) + continue + + if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped): + if "```" in stripped: + num_fences = stripped.count("```") + in_code = num_fences % 2 != 0 + if not in_code and stripped.endswith("|"): + cleaned_lines.append(_clean_single_row(line)) + else: + in_multiline_row = True + in_code_block = in_code + current_row_lines = [line] + else: + if stripped.endswith("|"): + cleaned_lines.append(line) + else: + in_multiline_row = True + in_code_block = False + current_row_lines = [line] + else: + cleaned_lines.append(line) + + flush_current_row() + + result = "\n".join(cleaned_lines) + if text.endswith("\n") and not result.endswith("\n"): + result += "\n" + return result + + def extract_web( url: str, *, @@ -600,33 +753,39 @@ def extract_web( """ transport = _get_webextract_transport(cfg) if transport == "mcp": - return _extract_web_mcp(url, cfg=cfg, timeout=timeout) - if transport != "http": - raise WebExtractError(f"未知 webextract transport: {transport}") + res = _extract_web_mcp(url, cfg=cfg, timeout=timeout) + else: + if transport != "http": + raise WebExtractError(f"未知 webextract transport: {transport}") - base_url = _get_webextract_base_url(cfg) - if not check_webextract_service(cfg, timeout=3.0): - raise WebExtractServiceUnavailableError( - f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行" + base_url = _get_webextract_base_url(cfg) + if not check_webextract_service(cfg, timeout=3.0): + raise WebExtractServiceUnavailableError( + f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行" + ) + + body: dict[str, object] = {"url": url} + if pdf is not None: + body["pdf"] = pdf + if include_html: + body["include_html"] = include_html + + api_key = _get_webextract_api_key(cfg) or "" + req = Request( + f"{base_url}/extract", + data=json.dumps(body).encode("utf-8"), + 
headers=_headers(api_key), + method="POST", ) + try: + res = _load_json_response(req, timeout=int(timeout), error_prefix="提取失败") + except RuntimeError as e: + raise WebExtractError(str(e)) from e - body: dict[str, object] = {"url": url} - if pdf is not None: - body["pdf"] = pdf - if include_html: - body["include_html"] = include_html + if isinstance(res, dict) and "text" in res and res["text"]: + res["text"] = _clean_table_code_fences(res["text"]) - api_key = _get_webextract_api_key(cfg) or "" - req = Request( - f"{base_url}/extract", - data=json.dumps(body).encode("utf-8"), - headers=_headers(api_key), - method="POST", - ) - try: - return _load_json_response(req, timeout=int(timeout), error_prefix="提取失败") - except RuntimeError as e: - raise WebExtractError(str(e)) from e + return res def extract_and_display( diff --git a/tests/fixtures/wikipedia_infobox_bad.md b/tests/fixtures/wikipedia_infobox_bad.md new file mode 100644 index 00000000..c568fd1a --- /dev/null +++ b/tests/fixtures/wikipedia_infobox_bad.md @@ -0,0 +1,10 @@ +| 性别 | 男 | +| 出生 | ``` +1902年8月28日 +``` | +| 逝世 | ``` +1993年11月24日 +``` | +| 国籍 | ``` +中华人民共和国 +``` | diff --git a/tests/fixtures/wikipedia_infobox_clean.md b/tests/fixtures/wikipedia_infobox_clean.md new file mode 100644 index 00000000..e718f65e --- /dev/null +++ b/tests/fixtures/wikipedia_infobox_clean.md @@ -0,0 +1,4 @@ +| 性别 | 男 | +| 出生 | `1902年8月28日` | +| 逝世 | `1993年11月24日` | +| 国籍 | `中华人民共和国` | diff --git a/tests/test_webtools_source.py b/tests/test_webtools_source.py index 8fa788aa..c0088e77 100644 --- a/tests/test_webtools_source.py +++ b/tests/test_webtools_source.py @@ -3,6 +3,7 @@ from __future__ import annotations import json +import pathlib import pytest @@ -781,3 +782,68 @@ def fake_urlopen(req, timeout=0): assert result["title"] == "Page" captured = capsys.readouterr() assert "markdown body" in captured.out + + def test_clean_table_code_fences_with_fixtures(self): + from scholaraio.providers.webtools import 
_clean_table_code_fences + + fixtures_dir = pathlib.Path(__file__).parent / "fixtures" + bad_path = fixtures_dir / "wikipedia_infobox_bad.md" + clean_path = fixtures_dir / "wikipedia_infobox_clean.md" + + assert bad_path.exists() + assert clean_path.exists() + + bad_text = bad_path.read_text(encoding="utf-8") + expected_clean_text = clean_path.read_text(encoding="utf-8") + + cleaned_text = _clean_table_code_fences(bad_text) + assert cleaned_text.strip() == expected_clean_text.strip() + + def test_clean_table_code_fences_ignores_normal_structures(self): + from scholaraio.providers.webtools import _clean_table_code_fences + + # Test normal code block outside table should not be changed + normal_code = "Here is a code snippet:\n```python\ndef test():\n return True\n```\nAnd here is normal text." + assert _clean_table_code_fences(normal_code) == normal_code + + # Test normal table with inline code should not be changed + normal_table = "| Column 1 | Column 2 |\n| --- | --- |\n| `inline code` | value |\n" + assert _clean_table_code_fences(normal_table) == normal_table + + # Test standalone code block between tables should not be changed + standalone_between_tables = ( + "| A | B |\n" + "| --- | --- |\n" + "| one | two |\n\n" + "```python\n" + "print(1)\n" + "```\n\n" + "| C | D |\n" + "| --- | --- |\n" + "| three | four |\n" + ) + assert _clean_table_code_fences(standalone_between_tables) == standalone_between_tables + + adjacent_standalone_code = ( + "| A | B |\n| one | two |\n```python\nprint(1)\n```\n| next paragraph starts with pipe |\n" + ) + assert _clean_table_code_fences(adjacent_standalone_code) == adjacent_standalone_code + + table_cell_code_with_pipe = "| A | B |\n| code | ```\na | b\n``` |\n" + assert _clean_table_code_fences(table_cell_code_with_pipe) == "| A | B |\n| code | `a | b` |\n" + + def test_extract_web_applies_cleanup_http(self, monkeypatch): + # Verify that HTTP path runs the clean helper + def fake_urlopen(req, timeout=0): + return 
_FakeResponse({"title": "Page", "text": "| 性别 |\n| 出生 | ```\n1902\n``` |"}) + + def fake_check_service(cfg, timeout=3.0): + return True + + monkeypatch.setattr("scholaraio.providers.webtools.urlopen", fake_urlopen) + monkeypatch.setattr("scholaraio.providers.webtools.check_webextract_service", fake_check_service) + + from scholaraio.providers.webtools import extract_web + + res = extract_web("https://example.com") + assert res["text"] == "| 性别 |\n| 出生 | `1902` |"
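Reviewer note: the `table_cell_code_with_pipe` regression above depends on the row splitter ignoring `|` characters that sit between a pair of triple-backtick fences. The sketch below reproduces that splitting logic outside the package so it can be run without ScholarAIO installed. The name `split_row_cells` is a local stand-in chosen for illustration, not a function the project exports, and the sample row is the one used in the test.

```python
def split_row_cells(row_text: str) -> list[str]:
    """Split a Markdown table row on '|' but keep pipes inside code fences."""
    cells: list[str] = []
    current: list[str] = []
    in_code_fence = False
    i = 0
    while i < len(row_text):
        # A triple backtick toggles fence state and is kept in the cell text.
        if row_text.startswith("```", i):
            in_code_fence = not in_code_fence
            current.append("```")
            i += 3
            continue
        char = row_text[i]
        if char == "|" and not in_code_fence:
            # Unfenced pipe: close the current cell.
            cells.append("".join(current))
            current = []
        else:
            current.append(char)
        i += 1
    # Whatever follows the last unfenced pipe is the final (often empty) cell.
    cells.append("".join(current))
    return cells


row = "| code | ```\na | b\n``` |"
cells = split_row_cells(row)
print(len(cells))
```

The leading and trailing empty strings correspond to the text before the first pipe and after the last one, which is why `_clean_single_row` in the diff passes those edge cells through untouched.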