Audit quality of ScholarAIO third-party integrations #96

@ZimoLiao

Context

While validating #95 on a real page (https://zh.wikipedia.org/wiki/周培源), the image localization path worked, but the upstream qt-web-extractor Markdown contained malformed code fences inside Wikipedia infobox table cells, for example around fields like 性别, 出生, 逝世, and 国籍.

That specific case is not caused by ScholarAIO image localization. It is a signal that our third-party integration quality needs a centralized audit instead of one-off fixes.

Goal

Create a single quality-audit track for third-party projects, services, CLIs, APIs, and optional toolchains that ScholarAIO supports or documents. The audit should verify not only whether each integration is reachable, but whether its outputs are good enough for real user workflows.

Initial Inventory

This inventory is intentionally broad. It should be refined during the audit, but the first pass should not silently exclude a supported surface just because it is optional or skill-driven.

Web / Agent Services

  • qt-web-extractor / fetch_url / scholaraio webextract / scholaraio ingest-link
  • GUILessBingSearch / search_bing / scholaraio websearch
  • MCP Streamable HTTP transport used by external services
  • Project .mcp.json registrations: web-search, web-extractor, paper2any

Paper And Document Parsing

  • MinerU local service
  • MinerU cloud / mineru-open-api
  • Docling fallback parser
  • PyMuPDF fallback parser
  • PDF parser matrix behavior: preferred parser, fallback order, chunking, image asset handling, table/formula quality

Paper2Any

  • OpenDCAI/Paper2Any external checkout
  • Paper2Any FastAPI backend
  • ScholarAIO Paper2Any MCP sidecar
  • Paper2Any workflows documented in ScholarAIO: paper-to-figure, PPT, poster, video, citation, rebuttal, DrawIO, mindmap, PDF-to-PPT, image-to-PPT, KB workflows

Literature / Metadata / Discovery APIs

  • arXiv search and PDF fetch
  • OpenAlex explore and metadata enrichment
  • Crossref metadata/reference enrichment
  • Semantic Scholar metadata/reference/citation enrichment
  • DOI landing-page / publisher abstract fallback paths, where applicable

Import / Export Ecosystem

  • Zotero Web API
  • Zotero local SQLite import
  • EndNote XML import
  • RIS import/export compatibility
  • Downstream RIS consumers mentioned in docs, such as Zotero, EndNote, and Mendeley

Patent Sources

  • USPTO Open Data Portal API
  • USPTO Patent Public Search / PPUBS PDF discovery and fetch

LLM / Model / Embedding Backends

  • OpenAI-compatible chat APIs
  • DeepSeek default configuration
  • OpenAI API-compatible deployments
  • Anthropic Messages API
  • Google Gemini API
  • Zhipu OpenAI-compatible handling
  • vLLM / Ollama-style OpenAI-compatible local deployments
  • OpenAI-compatible embedding APIs
  • Local sentence-transformers embeddings
  • Qwen/Qwen3 embedding model distribution via ModelScope and Hugging Face
  • FAISS vector indexes
  • BERTopic topic modeling stack, including UMAP/HDBSCAN behavior on small corpora

Office / Output / Diagram Tooling

  • MarkItDown Office document ingest/inspection path
  • python-docx
  • python-pptx
  • openpyxl
  • Mermaid / mermaid-py
  • Graphviz / DOT
  • Inkscape / cli-anything-inkscape
  • draw.io / diagrams.net XML outputs

Scientific Toolref / Runtime Support

Core toolref surfaces:

  • Quantum ESPRESSO
  • LAMMPS
  • GROMACS
  • OpenFOAM
  • Bioinformatics toolchain docs

Bioinformatics tools and helper ecosystem mentioned in skills/toolref:

  • BLAST / BLAST+
  • minimap2
  • BWA-MEM2
  • samtools
  • bcftools
  • MAFFT
  • IQ-TREE
  • FastTree
  • ESMFold / fair-esm
  • BioPython
  • NCBI Datasets CLI
  • py3Dmol
  • pycirclize
  • toytree

Scientific workflow helper tools mentioned in skills:

  • ACPYPE
  • AmberTools
  • gmx_MMPBSA
  • PyMOL
  • VMD
  • OVITO
  • ParaView
  • VESTA
  • XCrySDen
  • FermiSurfer
  • ifermi
  • NIST Interatomic Potentials Repository
  • PseudoDojo
  • Materials Cloud SSSP

Backup / System Tools

  • rsync
  • ssh / remote backup targets

Known Seed Issue

qt-web-extractor currently produces poor Markdown for at least one real Wikipedia page:

  • URL: https://zh.wikipedia.org/wiki/周培源
  • Symptom: malformed fenced-code blocks inside infobox table cells
  • Observed result: image localization succeeds, but extracted Markdown quality is not acceptable for direct ingest/readback
  • Follow-up direction: decide whether to fix upstream conversion, add ScholarAIO post-cleanup, or mark limitations per site family
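The symptom above could be caught mechanically before ingest. The following is a minimal sketch, assuming the extractor emits pipe-delimited Markdown tables; the function names are hypothetical and not part of ScholarAIO or qt-web-extractor, and the sample string is synthetic rather than the actual output for that page.

```python
import re

# A fence marker is a run of three or more backticks or tildes.
FENCE = re.compile(r"(`{3,}|~{3,})")
FENCE_LINE = re.compile(r"\s*(`{3,}|~{3,})")


def find_fences_in_table_rows(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for table rows that contain a fence marker."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        # Heuristic: a line starting with "|" is treated as a table row.
        if stripped.startswith("|") and FENCE.search(stripped):
            hits.append((lineno, stripped))
    return hits


def has_unbalanced_fences(markdown: str) -> bool:
    """True when an odd number of lines open or close a fenced block."""
    fence_lines = [ln for ln in markdown.splitlines() if FENCE_LINE.match(ln)]
    return len(fence_lines) % 2 == 1


# Synthetic sample (an assumption, not real extractor output).
sample = "| 性别 | ``` |\n| 国籍 | text |\n"
suspicious_rows = find_fences_in_table_rows(sample)
```

Both checks are heuristics: they do not parse Markdown, so they can only flag candidates for manual review or for a post-cleanup pass.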

Audit Questions

For every supported integration above:

  • Is it actually reachable through the documented CLI/skill path?
  • Does the setup check report missing dependencies and credentials clearly?
  • Does a real smoke test exist, preferably with small deterministic fixtures?
  • Does output quality meet the user-facing promise, rather than merely returning non-empty data?
  • Are fallback paths explicit and tested?
  • Are rate limits, auth failures, network failures, and service-unavailable states handled cleanly?
  • Does the corresponding skill/doc overpromise compared with the actual implementation?
  • Is the third-party version or API surface pinned, discoverable, or at least reported in diagnostics?
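Several of these questions can be separated in a smoke test by recording reachability and output quality as distinct fields. A minimal sketch follows, assuming each integration can be wrapped in a zero-argument callable that returns text; `SmokeResult` and `run_smoke` are hypothetical names, not existing ScholarAIO APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SmokeResult:
    name: str
    reachable: bool   # the call completed without raising
    quality_ok: bool  # the output also passed the quality predicate
    detail: str = ""


def run_smoke(
    name: str,
    call: Callable[[], str],
    quality: Callable[[str], bool],
) -> SmokeResult:
    """Run one integration call and judge its output, not just its liveness."""
    try:
        output = call()
    except Exception as exc:  # network, auth, and rate-limit errors land here
        return SmokeResult(name, False, False, f"{type(exc).__name__}: {exc}")
    ok = bool(output) and quality(output)
    return SmokeResult(name, True, ok, "" if ok else "output failed quality check")


def unavailable() -> str:
    # Stand-in for a service that is down (hypothetical).
    raise ConnectionError("down")


passing = run_smoke("fixture", lambda: "# Title\n", lambda s: s.startswith("#"))
failing = run_smoke("offline", unavailable, lambda s: True)
```

Keeping `reachable` and `quality_ok` apart lets a report distinguish "service is down" from "service answered with unusable output", which is the distinction the seed issue turns on.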

Proposed Deliverables

  • A confirmed inventory of third-party integration surfaces.
  • A compact quality matrix with status: good, usable-with-caveats, needs-cleanup, blocked, or unsupported-despite-docs.
  • One real smoke fixture or command per major integration class.
  • Separate follow-up issues for high-risk failures, starting with qt-web-extractor Markdown quality on complex pages.
  • Updates to docs/skills when a supported surface is only partial or requires fallback behavior.
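The quality matrix could be kept as structured data and rendered to Markdown for the issue or docs. The sketch below is one possible shape, using the five status values listed above; the class and function names are hypothetical, and the rows are placeholders rather than audit findings.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GOOD = "good"
    USABLE_WITH_CAVEATS = "usable-with-caveats"
    NEEDS_CLEANUP = "needs-cleanup"
    BLOCKED = "blocked"
    UNSUPPORTED_DESPITE_DOCS = "unsupported-despite-docs"


@dataclass
class MatrixRow:
    integration: str
    status: Status
    note: str = ""


def render_matrix(rows: list[MatrixRow]) -> str:
    """Render the rows as a Markdown table, one integration per line."""
    lines = ["| Integration | Status | Note |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| {row.integration} | {row.status.value} | {row.note} |")
    return "\n".join(lines)


# Placeholder rows for illustration only.
rows = [
    MatrixRow("example-a", Status.GOOD),
    MatrixRow("example-b", Status.NEEDS_CLEANUP, "see follow-up issue"),
]
table = render_matrix(rows)
```

Using an enum keeps the status vocabulary closed, so a typo in a status value fails at construction time instead of producing a sixth category in the report.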
