Audit quality of ScholarAIO third-party integrations #96

## Context

While validating #95 on a real page (`https://zh.wikipedia.org/wiki/周培源`), the image localization path worked, but the upstream `qt-web-extractor` Markdown contained malformed code fences inside Wikipedia infobox table cells, for example around fields like `性别`, `出生`, `逝世`, and `国籍`.

That specific case is not caused by ScholarAIO image localization. It is a signal that our third-party integration quality needs a centralized audit instead of one-off fixes.

## Goal

Create a single quality-audit track for third-party projects, services, CLIs, APIs, and optional toolchains that ScholarAIO supports or documents. The audit should verify not only whether each integration is reachable, but whether its outputs are good enough for real user workflows.

## Initial Inventory

This inventory is intentionally broad. It should be refined during the audit, but the first pass should not silently exclude a supported surface just because it is optional or skill-driven.

### Web / Agent Services

- `qt-web-extractor` / `fetch_url` / `scholaraio webextract` / `scholaraio ingest-link`
- `GUILessBingSearch` / `search_bing` / `scholaraio websearch`
- MCP Streamable HTTP transport used by external services
- Project `.mcp.json` registrations: `web-search`, `web-extractor`, `paper2any`

### Paper And Document Parsing

- MinerU local service
- MinerU cloud / `mineru-open-api`
- Docling fallback parser
- PyMuPDF fallback parser
- PDF parser matrix behavior: preferred parser, fallback order, chunking, image asset handling, table/formula quality

### Paper2Any

- OpenDCAI/Paper2Any external checkout
- Paper2Any FastAPI backend
- ScholarAIO Paper2Any MCP sidecar
- Paper2Any workflows documented in ScholarAIO: paper-to-figure, PPT, poster, video, citation, rebuttal, DrawIO, mindmap, PDF-to-PPT, image-to-PPT, KB workflows

### Literature / Metadata / Discovery APIs

- arXiv search and PDF fetch
- OpenAlex explore and metadata enrichment
- Crossref metadata/reference enrichment
- Semantic Scholar metadata/reference/citation enrichment
- DOI landing-page / publisher abstract fallback paths, where applicable

### Import / Export Ecosystem

- Zotero Web API
- Zotero local SQLite import
- EndNote XML import
- RIS import/export compatibility
- Downstream RIS consumers mentioned in docs, such as Zotero, EndNote, and Mendeley

### Patent Sources

- USPTO Open Data Portal API
- USPTO Patent Public Search / PPUBS PDF discovery and fetch

### LLM / Model / Embedding Backends

- OpenAI-compatible chat APIs
- DeepSeek default configuration
- OpenAI API-compatible deployments
- Anthropic Messages API
- Google Gemini API
- Zhipu OpenAI-compatible handling
- vLLM / Ollama-style OpenAI-compatible local deployments
- OpenAI-compatible embedding APIs
- Local sentence-transformers embeddings
- Qwen/Qwen3 embedding model distribution via ModelScope and Hugging Face
- FAISS vector indexes
- BERTopic topic modeling stack, including UMAP/HDBSCAN behavior on small corpora

### Office / Output / Diagram Tooling

- MarkItDown Office document ingest/inspection path
- python-docx
- python-pptx
- openpyxl
- Mermaid / mermaid-py
- Graphviz / DOT
- Inkscape / `cli-anything-inkscape`
- draw.io / diagrams.net XML outputs

### Scientific Toolref / Runtime Support

Core toolref surfaces:

- Quantum ESPRESSO
- LAMMPS
- GROMACS
- OpenFOAM
- Bioinformatics toolchain docs

Bioinformatics tools and helper ecosystem mentioned in skills/toolref:

- BLAST / BLAST+
- minimap2
- BWA-MEM2
- samtools
- bcftools
- MAFFT
- IQ-TREE
- FastTree
- ESMFold / `fair-esm`
- BioPython
- NCBI Datasets CLI
- py3Dmol
- pycirclize
- toytree

Scientific workflow helper tools mentioned in skills:

- ACPYPE
- AmberTools
- gmx_MMPBSA
- PyMOL
- VMD
- OVITO
- ParaView
- VESTA
- XCrySDen
- FermiSurfer
- ifermi
- NIST Interatomic Potentials Repository
- PseudoDojo
- Materials Cloud SSSP

### Backup / System Tools

- rsync
- ssh / remote backup targets

## Known Seed Issue

`qt-web-extractor` currently produces poor Markdown for at least one real Wikipedia page:

- URL: `https://zh.wikipedia.org/wiki/周培源`
- Symptom: malformed fenced-code blocks inside infobox table cells
- Observed result: image localization succeeds, but extracted Markdown quality is not acceptable for direct ingest/readback
- Follow-up direction: decide whether to fix upstream conversion, add ScholarAIO post-cleanup, or mark limitations per site family
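
To make the symptom checkable in a smoke test, a minimal sketch of a detector could look like the following. It is a heuristic under stated assumptions, not part of ScholarAIO or `qt-web-extractor`: the function names are hypothetical, a "table row" is taken to be any line starting with `|`, and the sample is synthetic rather than the real extractor output.

```python
import re

# Built programmatically so this snippet contains no literal fence marker.
BACKTICK_FENCE = "`" * 3
TILDE_FENCE = "~" * 3
FENCE_RE = re.compile(r"`{3,}|~{3,}")


def fenced_table_rows(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for table rows that contain a fence marker."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        # Assumption: a table row is any line whose first non-space char is "|".
        if line.lstrip().startswith("|") and FENCE_RE.search(line):
            hits.append((lineno, line))
    return hits


def has_unbalanced_fences(markdown: str) -> bool:
    """True when an odd number of lines start with a fence marker."""
    markers = [
        line for line in markdown.splitlines()
        if line.lstrip().startswith((BACKTICK_FENCE, TILDE_FENCE))
    ]
    return len(markers) % 2 == 1


# Synthetic sample, NOT the actual extractor output for the page above.
SAMPLE = "\n".join([
    "| field | value |",
    "| --- | --- |",
    "| name | " + BACKTICK_FENCE + " |",
    "| other | plain text |",
])

if __name__ == "__main__":
    for lineno, line in fenced_table_rows(SAMPLE):
        print(f"line {lineno}: fence marker inside table row: {line!r}")
```

A check like this only flags suspicious output; it does not decide between an upstream fix, post-cleanup, or a documented limitation.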

## Audit Questions

For every supported integration above:

- Is it actually reachable through the documented CLI/skill path?
- Does the setup check clearly report missing dependencies and credentials?
- Does a real smoke test exist, preferably with small deterministic fixtures?
- Does output quality meet the user-facing promise, rather than merely returning non-empty data?
- Are fallback paths explicit and tested?
- Are rate limits, auth failures, network failures, and service-unavailable states handled cleanly?
- Does the corresponding skill/doc overpromise compared with the actual implementation?
- Is the third-party version or API surface pinned, discoverable, or at least reported in diagnostics?
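
The dependency and credential question could be answered per integration with a small, side-effect-free check along these lines. This is a sketch only: `SetupReport` and `check_setup` are hypothetical names, and the binaries and environment variable in the usage example are placeholders rather than ScholarAIO's actual configuration keys.

```python
import os
import shutil
from dataclasses import dataclass, field


@dataclass
class SetupReport:
    """What is missing for one integration; empty lists mean it looks ready."""
    name: str
    missing_binaries: list[str] = field(default_factory=list)
    missing_env: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.missing_binaries and not self.missing_env


def check_setup(name, binaries=(), env_vars=(), environ=None) -> SetupReport:
    """Report binaries absent from PATH and unset or empty environment variables."""
    environ = os.environ if environ is None else environ
    return SetupReport(
        name=name,
        missing_binaries=[b for b in binaries if shutil.which(b) is None],
        missing_env=[v for v in env_vars if not environ.get(v)],
    )


if __name__ == "__main__":
    # Placeholder inputs: the env var name is an assumption for illustration.
    report = check_setup(
        "backup",
        binaries=["rsync", "ssh"],
        env_vars=["EXAMPLE_BACKUP_TARGET"],
    )
    print(report.name, "ok" if report.ok else "incomplete")
    print("  missing binaries:", report.missing_binaries)
    print("  missing env vars:", report.missing_env)
```

Passing `environ` explicitly keeps the check deterministic in tests. It covers reachability only; output quality still needs a real fixture.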

## Proposed Deliverables

- A confirmed inventory of third-party integration surfaces.
- A compact quality matrix with status: `good`, `usable-with-caveats`, `needs-cleanup`, `blocked`, or `unsupported-despite-docs`.
- One real smoke fixture or command per major integration class.
- Separate follow-up issues for high-risk failures, starting with `qt-web-extractor` Markdown quality on complex pages.
- Updates to docs/skills when a supported surface is only partial or requires fallback behavior.
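
One possible shape for the quality matrix, sketched here as an assumption rather than a decided format: the status values are the five listed above, while `MatrixRow`, `render_matrix`, and the example row (including its status) are illustrative placeholders, not audit results.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    """The five status values proposed for the matrix."""
    GOOD = "good"
    USABLE_WITH_CAVEATS = "usable-with-caveats"
    NEEDS_CLEANUP = "needs-cleanup"
    BLOCKED = "blocked"
    UNSUPPORTED_DESPITE_DOCS = "unsupported-despite-docs"


@dataclass(frozen=True)
class MatrixRow:
    integration: str
    category: str
    status: Status
    note: str = ""


def render_matrix(rows) -> str:
    """Render rows as a Markdown table suitable for pasting into an issue."""
    lines = [
        "| Integration | Category | Status | Note |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        lines.append(
            f"| `{row.integration}` | {row.category} "
            f"| `{row.status.value}` | {row.note} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder row: the status here is illustrative, not an audit finding.
    rows = [
        MatrixRow(
            "qt-web-extractor",
            "Web / Agent Services",
            Status.NEEDS_CLEANUP,
            "placeholder pending audit",
        ),
    ]
    print(render_matrix(rows))
```

Keeping the statuses in an enum means a typo in a status string fails at construction time instead of producing a sixth, unintended category.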

