Context
While validating #95 on a real page (https://zh.wikipedia.org/wiki/周培源), the image localization path worked, but the upstream qt-web-extractor Markdown contained malformed code fences inside Wikipedia infobox table cells, for example around fields like 性别, 出生, 逝世, and 国籍.
That specific case is not caused by ScholarAIO image localization. It is a signal that our third-party integration quality needs a centralized audit instead of one-off fixes.
Goal
Create a single quality-audit track for third-party projects, services, CLIs, APIs, and optional toolchains that ScholarAIO supports or documents. The audit should verify not only whether each integration is reachable, but whether its outputs are good enough for real user workflows.
Initial Inventory
This inventory is intentionally broad. It should be refined during the audit, but the first pass should not silently exclude a supported surface just because it is optional or skill-driven.
Web / Agent Services
- qt-web-extractor / fetch_url / scholaraio webextract / scholaraio ingest-link
- GUILessBingSearch / search_bing / scholaraio websearch
- MCP Streamable HTTP transport used by external services
- Project .mcp.json registrations: web-search, web-extractor, paper2any
Paper And Document Parsing
- MinerU local service
- MinerU cloud / mineru-open-api
- Docling fallback parser
- PyMuPDF fallback parser
- PDF parser matrix behavior: preferred parser, fallback order, chunking, image asset handling, table/formula quality
Paper2Any
- OpenDCAI/Paper2Any external checkout
- Paper2Any FastAPI backend
- ScholarAIO Paper2Any MCP sidecar
- Paper2Any workflows documented in ScholarAIO: paper-to-figure, PPT, poster, video, citation, rebuttal, DrawIO, mindmap, PDF-to-PPT, image-to-PPT, KB workflows
Literature / Metadata / Discovery APIs
- arXiv search and PDF fetch
- OpenAlex explore and metadata enrichment
- Crossref metadata/reference enrichment
- Semantic Scholar metadata/reference/citation enrichment
- DOI landing-page / publisher abstract fallback paths, where applicable
Import / Export Ecosystem
- Zotero Web API
- Zotero local SQLite import
- EndNote XML import
- RIS import/export compatibility
- Downstream RIS consumers mentioned in docs, such as Zotero, EndNote, and Mendeley
Patent Sources
- USPTO Open Data Portal API
- USPTO Patent Public Search / PPUBS PDF discovery and fetch
LLM / Model / Embedding Backends
- OpenAI-compatible chat APIs
- DeepSeek default configuration
- OpenAI API-compatible deployments
- Anthropic Messages API
- Google Gemini API
- Zhipu OpenAI-compatible handling
- vLLM / Ollama-style OpenAI-compatible local deployments
- OpenAI-compatible embedding APIs
- Local sentence-transformers embeddings
- Qwen/Qwen3 embedding model distribution via ModelScope and Hugging Face
- FAISS vector indexes
- BERTopic topic modeling stack, including UMAP/HDBSCAN behavior on small corpora
Office / Output / Diagram Tooling
- MarkItDown Office document ingest/inspection path
- python-docx
- python-pptx
- openpyxl
- Mermaid / mermaid-py
- Graphviz / DOT
- Inkscape / cli-anything-inkscape
- draw.io / diagrams.net XML outputs
Scientific Toolref / Runtime Support
Core toolref surfaces:
- Quantum ESPRESSO
- LAMMPS
- GROMACS
- OpenFOAM
- Bioinformatics toolchain docs
Bioinformatics tools and helper ecosystem mentioned in skills/toolref:
- BLAST / BLAST+
- minimap2
- BWA-MEM2
- samtools
- bcftools
- MAFFT
- IQ-TREE
- FastTree
- ESMFold / fair-esm
- BioPython
- NCBI Datasets CLI
- py3Dmol
- pycirclize
- toytree
Scientific workflow helper tools mentioned in skills:
- ACPYPE
- AmberTools
- gmx_MMPBSA
- PyMOL
- VMD
- OVITO
- ParaView
- VESTA
- XCrySDen
- FermiSurfer
- ifermi
- NIST Interatomic Potentials Repository
- PseudoDojo
- Materials Cloud SSSP
Backup / System Tools
- rsync
- ssh / remote backup targets
Known Seed Issue
qt-web-extractor currently produces poor Markdown for at least one real Wikipedia page:
- URL: https://zh.wikipedia.org/wiki/周培源
- Symptom: malformed fenced-code blocks inside infobox table cells
- Observed result: image localization succeeds, but extracted Markdown quality is not acceptable for direct ingest/readback
- Follow-up direction: decide whether to fix upstream conversion, add ScholarAIO post-cleanup, or mark limitations per site family
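
As a starting point for triaging this symptom, here is a minimal Python sketch of a heuristic check that flags fence markers appearing inside Markdown table rows. The function name `find_fences_in_table_rows` and the sample text are hypothetical: the sample is a hand-written stand-in, not the actual qt-web-extractor output for the page above, and the heuristic only looks at lines that begin with `|`.

```python
import re

# Heuristic: three or more backticks or tildes count as a fence marker.
FENCE_RE = re.compile(r"`{3,}|~{3,}")


def find_fences_in_table_rows(markdown: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for table rows that contain a fence marker."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        # Only lines that look like table rows are inspected; ordinary
        # fenced code blocks outside tables are left alone.
        if stripped.startswith("|") and FENCE_RE.search(stripped):
            hits.append((lineno, line))
    return hits


FENCE = "`" * 3  # built programmatically to keep this example easy to embed

# Hypothetical sample, NOT real extractor output.
sample = "\n".join([
    f"| 性别 | {FENCE}value{FENCE} |",
    "| 国籍 | value |",
    f"{FENCE}python",
    "print('ok')",
    FENCE,
])

for lineno, line in find_fences_in_table_rows(sample):
    print(f"line {lineno}: {line}")
```

A check like this could feed either a post-cleanup step or a per-site limitation note; which one is appropriate is exactly the follow-up decision listed above.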
Audit Questions
For every supported integration above:
- Is it actually reachable through the documented CLI/skill path?
- Does setup check report missing dependencies and credentials clearly?
- Does a real smoke test exist, preferably with small deterministic fixtures?
- Does output quality meet the user-facing promise, not just return non-empty data?
- Are fallback paths explicit and tested?
- Are rate limits, auth failures, network failures, and service-unavailable states handled cleanly?
- Does the corresponding skill/doc overpromise compared with the actual implementation?
- Is the third-party version or API surface pinned, discoverable, or at least reported in diagnostics?
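
To make the failure-handling question concrete, the following Python sketch shows one possible way to bucket HTTP outcomes into the failure classes named above. The `FailureKind` enum, the `classify` function, and the specific status-code mapping are assumptions for illustration; they are not taken from ScholarAIO, and individual services may signal these conditions differently.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    OK = "ok"
    AUTH = "auth-failure"
    RATE_LIMIT = "rate-limit"
    UNAVAILABLE = "service-unavailable"
    NETWORK = "network-failure"
    OTHER = "other-error"


def classify(status: Optional[int]) -> FailureKind:
    """Map an HTTP status code to a failure class.

    Pass None when no HTTP response arrived at all (DNS failure,
    refused connection, timeout), which is treated as a network failure.
    """
    if status is None:
        return FailureKind.NETWORK
    if 200 <= status < 300:
        return FailureKind.OK
    if status in (401, 403):
        return FailureKind.AUTH
    if status == 429:
        return FailureKind.RATE_LIMIT
    if status in (502, 503, 504):
        return FailureKind.UNAVAILABLE
    return FailureKind.OTHER


for code in (200, 401, 429, 503, None):
    print(code, "->", classify(code).value)
```

An audit smoke test could then assert that each integration surfaces a distinct, readable message per class instead of a generic error.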
Proposed Deliverables
- A confirmed inventory of third-party integration surfaces.
- A compact quality matrix with status: good, usable-with-caveats, needs-cleanup, blocked, or unsupported-despite-docs.
- One real smoke fixture or command per major integration class.
- Separate follow-up issues for high-risk failures, starting with qt-web-extractor Markdown quality on complex pages.
- Updates to docs/skills when a supported surface is only partial or requires fallback behavior.
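
One possible shape for the quality matrix is sketched below in Python. The five status values come from the deliverable above; the `Status` enum, the `MatrixRow` dataclass, the `render` helper, and the example row are hypothetical, and the status shown for qt-web-extractor is a placeholder, not an audit result.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GOOD = "good"
    USABLE_WITH_CAVEATS = "usable-with-caveats"
    NEEDS_CLEANUP = "needs-cleanup"
    BLOCKED = "blocked"
    UNSUPPORTED_DESPITE_DOCS = "unsupported-despite-docs"


@dataclass
class MatrixRow:
    integration: str
    status: Status
    note: str = ""


def render(rows: list[MatrixRow]) -> str:
    """Render the rows as a Markdown table, one line per integration."""
    lines = ["| Integration | Status | Note |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| {row.integration} | {row.status.value} | {row.note} |")
    return "\n".join(lines)


# Placeholder row for illustration only; the real status is for the audit to decide.
rows = [
    MatrixRow(
        "qt-web-extractor",
        Status.NEEDS_CLEANUP,
        "malformed fences in Wikipedia infobox cells",
    ),
]
print(render(rows))
```

Keeping the status as a closed enum means a typo in a status name fails loudly instead of silently creating a sixth category.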