ZimoLiao · Jun 8, 2026 · Jun 4, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 
 - **Nature workflow bridge skill** ([#107](https://github.com/ZimoLiao/scholaraio/issues/107)): Added a ScholarAIO `nature-workflow` bridge skill that routes Nature Portfolio writing and figure workflows to the upstream `nature-skills` repository when installed, keeps ScholarAIO-native fallbacks explicit, documents the install and quick-start path, and includes deterministic plus product-demo fixtures that generate reviewable manuscript, figure, slide, and QA artifacts.
 
+### Fixed
+
+- **Webextract Markdown table-cell cleanup** ([#110](https://github.com/ZimoLiao/scholaraio/pull/110)): Sanitized malformed block-level code fences emitted inside `qt-web-extractor` table cells before HTTP/MCP extraction results reach `webextract` and ingest consumers, while preserving standalone fenced code blocks and pipe characters inside code-cell content.
+
 ## [1.5.0] — 2026-05-24
 
 ### Added

diff --git a/docs/development/third-party-integration-audit.md b/docs/development/third-party-integration-audit.md
@@ -0,0 +1,80 @@
+# ScholarAIO Third-Party Integration Quality Audit
+
+This document records the quality, reachability, and output validation status of the third-party integrations, APIs, CLIs, and optional toolchains supported by ScholarAIO.
+
+Integrations are evaluated at the workflow boundary: CLI/skill entrypoints, provider implementations, setup diagnostics, output formatting, fallback behavior, and failure handling. A config test or a broad unit-test filename is not enough evidence to mark an integration surface as `good`.
+
+This audit is not a declaration that the full third-party toolchain is adapted or verified. Each row claims only the evidence listed in that row; everything else remains inventory until a focused live or workflow-boundary pass verifies it.
+
+Status is intentionally conservative:
+
+- **good**: workflow-boundary evidence exists, including commands, representative output, and failure handling.
+- **partially-reviewed**: code-level or fixture evidence exists, but live workflow evidence is still missing.
+- **not-yet-reviewed**: inventory only; no quality claim is made.
+
+---
+
+## 1. Quality Matrix
+
+| Integration / Surface | Category | Status | Verification Path / Test Evidence | Observed Result / Config & Version Boundaries |
+| :--- | :--- | :--- | :--- | :--- |
+| **qt-web-extractor (HTTP & MCP)** | Web / Agent | **partially-reviewed** | `extract_web`, `_clean_table_code_fences`, `tests/test_webtools_source.py`, fixture pair under `tests/fixtures/` | Sanitizer regression is covered for malformed table-cell code fences and adjacent standalone code blocks. Live daemon canary evidence is still required before this surface is promoted to `good`. Boundaries: `webextract.transport` (HTTP/MCP), `webextract.base_url`, `webextract.mcp_url`, `webextract.api_key`. |
+| **GUILessBingSearch** | Web / Agent | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **MinerU Local API** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **MinerU Cloud CLI** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Paper2Any MCP Sidecar** | Parsing/MCP | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Docling Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **PyMuPDF Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **arXiv Search (Atom API)** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **arXiv PDF Download** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **OpenAlex Explore** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Crossref / Semantic Scholar** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Zotero SQLite Import** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Zotero Web API** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **EndNote / RIS** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **USPTO ODP / PPubs** | Patents | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **OpenAI-compatible Chat API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Anthropic Messages API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Google Gemini API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Zhipu API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **vLLM / Ollama Local** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Sentence-transformers Embeddings** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **FAISS Vector / BERTopic** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **MarkItDown Office Ingest** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Office PPTX / DOCX Libraries** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Mermaid / DOT Rendering** | Diagram | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Scientific Toolref (Quantum ESPRESSO, etc.)** | Toolref | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **AmberTools / PyMOL** | Scientific | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **rsync / SSH Backup** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+| **Setup Diagnostics** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
+
+---
+
+## 2. Current Reviewed Surface
+
+### 2.1 qt-web-extractor (HTTP & MCP)
+* **CLI/Skill Entrypoint**:
+  * CLI: `scholaraio webextract <url>` (implemented in `cmd_webextract` inside [web.py](../../scholaraio/interfaces/cli/web.py))
+  * Skill: `.claude/skills/webextract`
+* **Provider/Service Implementation Path**:
+  * [webtools.py:extract_web](../../scholaraio/providers/webtools.py)
+* **Setup Diagnostics**:
+  * A diagnostic path exists through `scholaraio setup check`, which calls `_optional_webtool_detail` in [setup.py](../../scholaraio/services/setup.py); that function runs `check_webextract_service` to verify that the HTTP/MCP endpoint responds. This PR does not include live daemon evidence from that path.
+* **Output Quality & Validation**:
+  * Outputs GFM Markdown. `_clean_table_code_fences` collapses malformed block-level code fences inside table cells (for example, Wikipedia infobox cells) into inline code spans and joins each affected row onto a single line, which fixes the broken table rendering.
+  * Verified via unit and fixture coverage: [wikipedia_infobox_bad.md](../../tests/fixtures/wikipedia_infobox_bad.md), [wikipedia_infobox_clean.md](../../tests/fixtures/wikipedia_infobox_clean.md), and regression tests for standalone fenced code blocks near table or pipe-prefixed lines.
+* **Fallback Behavior**:
+  * The transport is selected via `webextract.transport` (HTTP or MCP). With the HTTP transport, a connection failure produces a hint that points the user to the MCP transport or to the setup checks.
+* **Failure Handling**:
+  * Unreachable HTTP endpoints raise `WebExtractServiceUnavailableError`, returning a clean user-facing hint with exit code `1`.
+  * API or server errors raise `WebExtractError` and are reported as warnings or errors rather than as unhandled crashes.
+
+## 3. Not-Yet-Reviewed Inventory
+
+Rows marked `not-yet-reviewed` in the matrix are intentionally inventory-only. Promoting any of them to `partially-reviewed` or `good` should happen in a focused follow-up that includes:
+
+- exact CLI command or skill workflow exercised;
+- relevant config/version boundaries;
+- representative success output;
+- failure-mode behavior;
+- targeted tests or reproducible smoke evidence.
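
The failure-handling contract described in section 2.1 of the audit document can be illustrated with a small sketch. The two exception classes below are local stand-ins that only mirror the names used in `webtools.py`; their base classes, the subclass relationship, the `run_webextract` wrapper, the fake extractors, and the exit code used for `WebExtractError` are all assumptions for illustration, not the actual `cmd_webextract` implementation.

```python
import sys
from typing import Callable


class WebExtractError(Exception):
    """Stand-in for the provider's API/server error (base class assumed)."""


class WebExtractServiceUnavailableError(WebExtractError):
    """Stand-in for the unreachable-endpoint error (subclassing assumed)."""


def run_webextract(extract: Callable[[str], dict], url: str) -> int:
    """Hypothetical CLI wrapper that maps provider errors to an exit code."""
    try:
        result = extract(url)
    except WebExtractServiceUnavailableError as exc:
        # Handled first because it is the more specific class in this sketch.
        print(f"webextract service unreachable: {exc}", file=sys.stderr)
        return 1
    except WebExtractError as exc:
        # Assumed exit code: the audit states `1` only for the unreachable case.
        print(f"webextract failed: {exc}", file=sys.stderr)
        return 1
    print(result.get("text", ""))
    return 0


def fake_down(url: str) -> dict:
    """Hypothetical extractor that simulates an unreachable service."""
    raise WebExtractServiceUnavailableError("service not running")


def fake_ok(url: str) -> dict:
    """Hypothetical extractor that simulates a successful extraction."""
    return {"text": "| a | `b` |"}
```

Calling `run_webextract(fake_down, "https://example.org")` returns `1` and writes a one-line hint to stderr instead of a traceback, which is the kind of behavior a live workflow-boundary pass would need to confirm against the real CLI.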
diff --git a/scholaraio/providers/webtools.py b/scholaraio/providers/webtools.py
@@ -572,6 +572,159 @@ def _extract_web_mcp(url: str, *, cfg: Config | None, timeout: float) -> dict:
     }
 
 
+def _split_table_row_cells(row_text: str) -> list[str]:
+    cells: list[str] = []
+    current: list[str] = []
+    in_code_fence = False
+    i = 0
+    while i < len(row_text):
+        if row_text.startswith("```", i):
+            in_code_fence = not in_code_fence
+            current.append("```")
+            i += 3
+            continue
+        char = row_text[i]
+        if char == "|" and not in_code_fence:
+            cells.append("".join(current))
+            current = []
+        else:
+            current.append(char)
+        i += 1
+    cells.append("".join(current))
+    return cells
+
+
+def _clean_single_row(row_text: str) -> str:
+    cells = _split_table_row_cells(row_text)
+    cleaned_cells: list[str] = []
+
+    for i, cell in enumerate(cells):
+        if i == 0 and not cell.strip():
+            cleaned_cells.append(cell)
+            continue
+        if i == len(cells) - 1 and not cell.strip() and row_text.endswith("|"):
+            cleaned_cells.append(cell)
+            continue
+
+        if "```" in cell:
+            fence_count = cell.count("```")
+            cell_to_clean = cell + "\n```" if fence_count % 2 != 0 else cell
+            parts = cell_to_clean.split("```")
+            cleaned_parts = []
+            for j, part in enumerate(parts):
+                if j % 2 == 0:
+                    cleaned_parts.append(part.replace("\n", " "))
+                else:
+                    block = part
+                    if block.startswith("\n"):
+                        block = block[1:]
+                    else:
+                        block_lines = block.split("\n", 1)
+                        if len(block_lines) > 1:
+                            first_line = block_lines[0].strip()
+                            if re.match(r"^[a-zA-Z0-9_-]+$", first_line):
+                                block = block_lines[1]
+                    block_clean = block.replace("\n", " ").strip()
+                    if block_clean:
+                        cleaned_parts.append(f"`{block_clean}`")
+                    else:
+                        cleaned_parts.append("")
+            cleaned_cell = "".join(cleaned_parts)
+            cleaned_cell = " " + cleaned_cell.strip() + " "
+            cleaned_cells.append(cleaned_cell)
+        else:
+            cleaned_cells.append(cell.replace("\n", " "))
+
+    res = "|".join(cleaned_cells)
+    if not res.endswith("|"):
+        res += "|"
+    return res
+
+
+def _clean_table_code_fences(text: str) -> str:
+    """Sanitize Markdown table cells that contain block-level code blocks/fences.
+
+    Transforms:
+        | Col | ```\nval\n``` |
+    Into:
+        | Col | `val` |
+    """
+    if not text:
+        return ""
+
+    lines = text.splitlines()
+    cleaned_lines: list[str] = []
+    current_row_lines: list[str] = []
+    in_multiline_row = False
+    in_code_block = False
+
+    def flush_current_row():
+        nonlocal in_multiline_row, current_row_lines, in_code_block
+        if current_row_lines:
+            row_text = "\n".join(current_row_lines)
+            cleaned_row = _clean_single_row(row_text)
+            cleaned_lines.append(cleaned_row)
+            current_row_lines = []
+        in_multiline_row = False
+        in_code_block = False
+
+    for line in lines:
+        stripped = line.strip()
+
+        if in_multiline_row:
+            num_fences = stripped.count("```")
+            if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped):
+                flush_current_row()
+                # fall through to process as a new row start below
+            else:
+                if num_fences % 2 != 0:
+                    in_code_block = not in_code_block
+
+                if not in_code_block:
+                    if stripped.endswith("|"):
+                        current_row_lines.append(line)
+                        flush_current_row()
+                        continue
+                    elif not stripped:
+                        flush_current_row()
+                        cleaned_lines.append(line)
+                        continue
+                    elif stripped.startswith("```"):
+                        flush_current_row()
+                        # fall through to process as normal
+
+                if in_multiline_row:
+                    current_row_lines.append(line)
+                    continue
+
+        if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped):
+            if "```" in stripped:
+                num_fences = stripped.count("```")
+                in_code = num_fences % 2 != 0
+                if not in_code and stripped.endswith("|"):
+                    cleaned_lines.append(_clean_single_row(line))
+                else:
+                    in_multiline_row = True
+                    in_code_block = in_code
+                    current_row_lines = [line]
+            else:
+                if stripped.endswith("|"):
+                    cleaned_lines.append(line)
+                else:
+                    in_multiline_row = True
+                    in_code_block = False
+                    current_row_lines = [line]
+        else:
+            cleaned_lines.append(line)
+
+    flush_current_row()
+
+    result = "\n".join(cleaned_lines)
+    if text.endswith("\n") and not result.endswith("\n"):
+        result += "\n"
+    return result
+
+
 def extract_web(
     url: str,
     *,
@@ -600,33 +753,39 @@ def extract_web(
     """
     transport = _get_webextract_transport(cfg)
     if transport == "mcp":
-        return _extract_web_mcp(url, cfg=cfg, timeout=timeout)
-    if transport != "http":
-        raise WebExtractError(f"未知 webextract transport: {transport}")
+        res = _extract_web_mcp(url, cfg=cfg, timeout=timeout)
+    else:
+        if transport != "http":
+            raise WebExtractError(f"未知 webextract transport: {transport}")
 
-    base_url = _get_webextract_base_url(cfg)
-    if not check_webextract_service(cfg, timeout=3.0):
-        raise WebExtractServiceUnavailableError(
-            f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+        base_url = _get_webextract_base_url(cfg)
+        if not check_webextract_service(cfg, timeout=3.0):
+            raise WebExtractServiceUnavailableError(
+                f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+            )
+
+        body: dict[str, object] = {"url": url}
+        if pdf is not None:
+            body["pdf"] = pdf
+        if include_html:
+            body["include_html"] = include_html
+
+        api_key = _get_webextract_api_key(cfg) or ""
+        req = Request(
+            f"{base_url}/extract",
+            data=json.dumps(body).encode("utf-8"),
+            headers=_headers(api_key),
+            method="POST",
         )
+        try:
+            res = _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
+        except RuntimeError as e:
+            raise WebExtractError(str(e)) from e
 
-    body: dict[str, object] = {"url": url}
-    if pdf is not None:
-        body["pdf"] = pdf
-    if include_html:
-        body["include_html"] = include_html
+    if isinstance(res, dict) and "text" in res and res["text"]:
+        res["text"] = _clean_table_code_fences(res["text"])
 
-    api_key = _get_webextract_api_key(cfg) or ""
-    req = Request(
-        f"{base_url}/extract",
-        data=json.dumps(body).encode("utf-8"),
-        headers=_headers(api_key),
-        method="POST",
-    )
-    try:
-        return _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
-    except RuntimeError as e:
-        raise WebExtractError(str(e)) from e
+    return res
 
 
 def extract_and_display(
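
To make the "preserve pipe characters inside code-cell content" behavior concrete, the snippet below restates the splitting loop from `_split_table_row_cells` as a standalone function and compares it with a naive `str.split("|")`. The name `split_cells` and the sample row are illustrative only; the fence marker is built programmatically so that the example does not embed a raw fence.

```python
FENCE = "`" * 3  # three backticks, built at runtime for readability


def split_cells(row: str) -> list[str]:
    """Split a table row on pipes, ignoring pipes between fence markers."""
    cells: list[str] = []
    current: list[str] = []
    in_fence = False
    i = 0
    while i < len(row):
        if row.startswith(FENCE, i):
            # Toggle fence state and keep the marker in the current cell.
            in_fence = not in_fence
            current.append(FENCE)
            i += 3
            continue
        if row[i] == "|" and not in_fence:
            # A pipe outside a fence ends the current cell.
            cells.append("".join(current))
            current = []
        else:
            current.append(row[i])
        i += 1
    cells.append("".join(current))
    return cells


row = f"| a | {FENCE}x | y{FENCE} |"
fence_aware = split_cells(row)  # 4 cells, the fenced pipe stays inside cell 3
naive = row.split("|")          # 5 cells, the fenced content is cut in two
```

The leading and trailing empty strings come from the row's outer pipes; `_clean_single_row` passes those two cells through unchanged so the row keeps its original delimiters.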

diff --git a/tests/fixtures/wikipedia_infobox_bad.md b/tests/fixtures/wikipedia_infobox_bad.md
@@ -0,0 +1,10 @@
+| 性别 | 男 |
+| 出生 | ```
+1902年8月28日
+``` |
+| 逝世 | ```
+1993年11月24日
+``` |
+| 国籍 | ```
+中华人民共和国
+``` |
diff --git a/tests/fixtures/wikipedia_infobox_clean.md b/tests/fixtures/wikipedia_infobox_clean.md
@@ -0,0 +1,4 @@
+| 性别 | 男 |
+| 出生 | `1902年8月28日` |
+| 逝世 | `1993年11月24日` |
+| 国籍 | `中华人民共和国` |
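
The fixture pair can be read as an input/output contract. The sketch below checks that contract with a deliberately simplified regex stand-in, not the line-by-line state machine added in this PR. `collapse_cell_fences` is a hypothetical helper: it assumes at most one fenced cell per row, a fence that opens on the row's first line, and no pipes inside the fence. The fixture contents are rebuilt as strings, and whether the real files end with a trailing newline is not assumed.

```python
import re

FENCE = "`" * 3  # built at runtime so no raw fence line appears in this example

# Strings mirroring wikipedia_infobox_bad.md and wikipedia_infobox_clean.md.
BAD = "\n".join([
    "| 性别 | 男 |",
    f"| 出生 | {FENCE}", "1902年8月28日", f"{FENCE} |",
    f"| 逝世 | {FENCE}", "1993年11月24日", f"{FENCE} |",
    f"| 国籍 | {FENCE}", "中华人民共和国", f"{FENCE} |",
])
CLEAN = "\n".join([
    "| 性别 | 男 |",
    "| 出生 | `1902年8月28日` |",
    "| 逝世 | `1993年11月24日` |",
    "| 国籍 | `中华人民共和国` |",
])

# A row prefix starting with "|", an opening fence (optional info string),
# the fenced body, and the closing fence on a later line.
_CELL_FENCE = re.compile(
    r"^(\|[^\n`]*)`{3}[^\n`]*\n(.*?)\n`{3}",
    re.MULTILINE | re.DOTALL,
)


def collapse_cell_fences(markdown: str) -> str:
    """Turn a fenced block inside a table row into an inline code span."""
    def _inline(match: re.Match) -> str:
        body = " ".join(match.group(2).split())  # join body lines with spaces
        return f"{match.group(1)}`{body}`"
    return _CELL_FENCE.sub(_inline, markdown)


# A standalone fenced block does not start with "|", so it is left untouched.
STANDALONE = "\n".join([FENCE, "print(1)", FENCE])
```

The regex anchors on a line that begins with `|`, which is why the standalone block passes through unchanged. The PR's implementation covers more cases (multiple fences per row, pipes inside fences, unclosed fences), so this sketch should be treated only as a reading aid for the fixtures.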