Fix/integrations quality audit#110
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc56160016
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
cc56160 to
8c9836f
Compare
ZimoLiao
left a comment
There was a problem hiding this comment.
Thanks for the PR. I reviewed the current head (80a584c) and I don't think this is merge-ready yet.
What I verified locally:
python -m pytest tests/test_webtools_source.py -q -p no:cacheproviderpasses (32 passed).python -m ruff check scholaraio/providers/webtools.py tests/test_webtools_source.py docs/development/third-party-integration-audit.mdfails.python -m ruff format --check scholaraio/providers/webtools.py tests/test_webtools_source.pyfails.python -m mkdocs build --strictpasses.- I also attempted a live
qt-web-extractorcanary rather than only relying on unit tests. I clonedwszqkzqk/qt-web-extractor, but this environment has no running daemon at127.0.0.1:8766,scholaraio setup checkreports the webextract MCP endpoint as unreachable, and the required PySide6/Qt WebEngine dependencies are not installed. Attempts to install the real dependency path via pip/uv stalled while downloading the large PySide6 wheels, so I could not honestly validate a live daemon extraction. Given the audit claims here, please include your own reproducible live canary evidence: the exactqt-web-extractor servesetup plus one or morescholaraio webextract <url> --fullruns showing raw/cleaned behavior.
Blocking issues:
-
The sanitizer is still too broad and can corrupt non-target Markdown.
scholaraio/providers/webtools.py:587-589uses a DOTALL regex bounded only by|. It does not require the opening and closing pipes to belong to the same table row. A normal table followed immediately by a standalone fenced code block and then a line starting with|is rewritten incorrectly.Minimal reproduction against this branch:
from scholaraio.providers.webtools import _clean_table_code_fences sample = "| A | B |\n| one | two |\n```python\nprint(1)\n```\n| next paragraph starts with pipe |\n" print(_clean_table_code_fences(sample))
Actual output:
| A | B | | one | two | `print(1)` | next paragraph starts with pipe |
That turns an ordinary standalone code block into a table cell before
webextract/ingest-linkconsumers see the text. The cleanup should be row-scoped, or it should parse/process table rows line-by-line rather than using a cross-line pipe search over arbitrary Markdown. -
Static checks currently fail.
Ruff reports whitespace-only blank lines in
scholaraio/providers/webtools.py:600andscholaraio/providers/webtools.py:603, plus an import-order issue intests/test_webtools_source.py:786.ruff format --checkwould also reformat both changed Python files. This is a basic CI blocker. -
The audit document overstates integrations that this PR did not actually audit.
In
docs/development/third-party-integration-audit.md:16-25, several non-webextract surfaces are markedgoodusing broad test filenames rather than direct workflow-boundary evidence. Issue #96 specifically asked not to mark broader surfaces as good unless the PR includes real CLI/provider smoke evidence. Please either downgrade those rows back tonot-yet-reviewedor add the exact command-level evidence for each surface.The Zotero SQLite row is especially problematic:
test_workspace.pyis not evidence that localzotero.sqliteimport works. -
The arXiv workflow commands in the audit doc are wrong.
docs/development/third-party-integration-audit.md:99listsscholaraio search --arxivandscholaraio paper fetch. The parser exposesscholaraio arxiv searchandscholaraio arxiv fetch;fsearch --scope arxivis the federated-search path. Since this document is supposed to verify workflow reachability, commands users cannot run make the audit unreliable. -
The audit document contains local Windows file URLs.
docs/development/third-party-integration-audit.md:49,:52,:57,:69,:87,:99,:116, and:119includefile:///c:/Users/hp/Desktop/Scholara_oss/...links. These only work on the author's machine and are broken in GitHub-rendered docs. Please replace them with repo-relative links or plain source paths. -
The live-value claim needs evidence.
The unit fixture demonstrates the intended Wikipedia/infobox cleanup shape, but this PR is framed as an integration quality audit and a webextract quality fix. Before merge, I would like to see reproducible live evidence from a real
qt-web-extractordaemon, including the exact URL(s), commands, and representative output before/after cleanup. Without that, the code may still be useful, but the audit document should not claim workflow-level validation.
Once the regex is constrained, lint/format passes, the audit statuses/links/commands are corrected, and live canary evidence is provided or the claims are narrowed, I can take another look.
|
Hi @ZimoLiao, Thanks for the detailed review and for catching those edge cases! I’ve refactored the changes to address all your comments:
Let me know if there's anything else you'd like me to change or iterate on! |
- Keep qt-web-extractor audit claims within available fixture evidence - Add regression coverage for adjacent standalone fenced code blocks - Fix mypy inference for row cleanup state
ZimoLiao
left a comment
There was a problem hiding this comment.
Thanks for iterating on this. I pushed owner follow-up 6d6bf2c to keep the audit claims conservative, add the adjacent standalone-fence regression, and clear the mypy inference issue.\n\nVerified locally on the latest head: python -m pytest tests/test_webtools_source.py -q -p no:cacheprovider, python -m pytest -q -p no:cacheprovider (1471 passed), python -m mypy scholaraio/, python -m ruff check scholaraio tests, python -m ruff format --check scholaraio tests, python -m mkdocs build --strict, and git diff --check. The audit doc now correctly marks qt-web-extractor as partially reviewed until live daemon canary evidence is added in a future pass. This is merge-ready from my side.
|
Thanks for the review and for the follow-up commit! I appreciate your help in getting this ready. |
This Pull Request addresses the centralized integration quality audit (Issue #96) and implements a fix for the
qt-web-extractortable cell markdown corruption.Key Changes
qt-web-extractor Table Cell Sanitization:
_clean_table_code_fencesinscholaraio/providers/webtools.pyto collapse block-level code fences inside table cells into inline backticks.extract_webso it covers both HTTP and MCP extraction paths uniformly before downstream ingest/CLI consumers see it.Durable Integration Audit Doc:
docs/development/third-party-integration-audit.mddetailing every surface in the inventory at the workflow boundary (CLI entrypoints, provider implementation paths, setup diagnostics, output formatting/validation, fallback behaviors, and failure handling).not-yet-reviewedto avoid overpromising.Deterministic Fixtures & Tests:
wikipedia_infobox_bad.md) and expected (wikipedia_infobox_clean.md) fixtures inside the nativetests/fixtures/directory.tests/test_webtools_source.pyto verify cell sanitization while asserting that standard tables and code blocks outside tables remain untouched.