Naming philosophy:
claw (sharp grasp) ≠ crawl (creep).
hfpclawer = HuggingFace Papers + claw + er = "A sharp tool that claws HF papers with precision".
Not a crawler: faster, sharper, more precise. Same series: OpenClaw, Hermes Agent ecosystem.
A multi-source academic paper clawler for PDE / neural operator / physics-informed ML. Built with a SQLite paper_store, Crossref cross-validation, anti-crawl Scrapy pipelines, and an MCP server.
pip install hfpclawer

- Core (auto-installed): pyyaml, requests, beautifulsoup4, typer, etc.
- LLM features (optional): pip install hfpclawer[llm] (for the sniff/analyze commands)
- PDF conversion (optional): pip install hfpclawer[pdf]
- Scrapy spiders (optional): pip install hfpclawer[scrapy]
- Dev (testing): pip install hfpclawer[dev]
- arXiv local search (optional): pip install hfpclawer[arxiv] documents the metadata dependency only (PyPI doesn't support git+https). See docs/kaggle-metadata.md for manual git clone + OAI-PMH or Kaggle setup.
- Citation audit (optional): pip install hfpclawer[audit] declares the namespace only. See hfpclawer/citation_audit.py for manual setup.

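To confirm that the core dependencies listed above are importable after installation, a small check along these lines may help. It is a minimal sketch: the mapping from PyPI names to import names (pyyaml to yaml, beautifulsoup4 to bs4) is standard, but CORE_MODULES and missing_modules are illustrative names and not part of hfpclawer.

```python
from importlib.util import find_spec

# PyPI name -> import name for the core dependencies named above.
# Two of them are imported under a different name than they are installed.
CORE_MODULES = {
    "pyyaml": "yaml",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "typer": "typer",
}


def missing_modules(modules):
    """Return the PyPI names whose top-level import module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_modules(CORE_MODULES)
    if missing:
        print(f"missing: {missing}")
    else:
        print("all core dependencies found")
```

The same function can be pointed at the modules of an optional extra once you know which packages that extra installs.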
git clone <your-repo>
cd hfpapers-clawler
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Verify
hfpclawer --help

First run hfpclawer init to generate the config and env template:
hfpclawer init --quick # Quick mode (defaults)
# or
hfpclawer init # Interactive wizard
cp .env.template .env # Fill in API keys
# Edit config.yaml to customize search queries

Or create the files manually (see docs/USAGE.md for the full reference).

# Search for new papers
hfpclawer search # Default 3 pages, threshold 30
hfpclawer search --max-pages 5 # More pages
hfpclawer search --dry-run # Show only, don't save
# Full pipeline: search → download → convert
hfpclawer full
# SQLite Paper Store operations
hfpclawer store stats # Storage statistics
hfpclawer store search # List all papers
hfpclawer store search --keyword "FNO"
hfpclawer store verify --aid 2301.11167
# Download & convert
hfpclawer download # Download top-20 PDFs
hfpclawer convert # PDF → Markdown
# MCP Server (for Hermes Agent / OpenCode)
hfpclawer mcp # Default port :8765

from hfpapers.paper_store import PaperStore, PaperRecord, ensure_paper
# Create a store
store = PaperStore(db_path="/tmp/papers.db")
# Add a paper
rec = PaperRecord(
    title="Fourier Neural Operator",
    abstract="Learning PDE solution operators with Fourier transforms",
    year=2023,
    source="my_app",
    relevance=90,
)
sf_id = store.upsert_paper(rec)
store.add_identifier(sf_id, "arxiv", "2010.08895")
# Search
papers = store.search_papers("neural operator")
for p in papers:
    print(f"[{p.relevance}] {p.title}")
# Hardware probe
from hfpapers.hardware import HardwareProbe
hw = HardwareProbe()
print(f"Hardware: {hw.summary()}")

hfpapers-clawler ships with a built-in MCP server for AI agent integration:
hfpclawer mcp

Register it in the Hermes Agent config file ~/.hermes/config.yaml:
mcp:
  servers:
    hfpapers:
      command: "hfpclawer"
      args: ["mcp", "--port", "8765"]

Available MCP tools: hfpclawer_search, hfpclawer_download, hfpclawer_convert, hfpclawer_info, hfpclawer_list, hfpclawer_stats, hfpclawer_full.
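MCP clients invoke tools through JSON-RPC 2.0 requests with the method tools/call. The sketch below builds such a request for one of the tools listed above. The helper build_tool_call is hypothetical, and the max_pages argument is an assumption modelled on the CLI's --max-pages flag; the actual argument schema of each tool is defined by the server and is not documented here.

```python
import json

# Tool names exactly as listed in this README.
MCP_TOOLS = {
    "hfpclawer_search",
    "hfpclawer_download",
    "hfpclawer_convert",
    "hfpclawer_info",
    "hfpclawer_list",
    "hfpclawer_stats",
    "hfpclawer_full",
}


def build_tool_call(name, arguments=None, request_id=1):
    """Build a JSON-RPC 2.0 tools/call request for one of the listed tools."""
    if name not in MCP_TOOLS:
        raise ValueError(f"unknown MCP tool: {name}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# "max_pages" is a hypothetical argument name, not a documented one.
request = build_tool_call("hfpclawer_search", {"max_pages": 5})
print(json.dumps(request, indent=2))
```

In practice an MCP client such as Hermes Agent constructs these requests itself; the payload is shown only to make the tool names above easier to place.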
┌─ CLI (Typer) ─┐     ┌─ MCP Server ─┐
└──────┬────────┘     └──────┬───────┘
       └──────────┬──────────┘
                  ▼
┌─ Scrapy Layer (Multi-source) ───────────┐
│ ArxivSearchSpider | OpenReviewSpider    │
│ HFPapersSpider    | MultiSourceSpider   │
│ Middleware: UA random, delay, proxy...  │
│ Pipeline: Store→Classify→Export→DL      │
└─────────────────┬───────────────────────┘
                  ▼
┌─ Paper Store (SQLite) ──────────────────┐
│ papers (Snowflake ID) | identifiers     │
│ crossref_cache        | CrossrefClient  │
└─────────────────────────────────────────┘

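To make the Paper Store layer in the diagram more concrete, here is a minimal sketch of a papers table keyed by a Snowflake-style ID plus a separate identifiers table, using only the standard library. The schema, the column names, and the helpers next_id and add_paper are assumptions for illustration and do not reflect the actual hfpapers.paper_store implementation. A production Snowflake ID usually also encodes a machine ID, which is omitted here.

```python
import itertools
import sqlite3
import time

_sequence = itertools.count()


def next_id():
    """Snowflake-style ID: millisecond timestamp in the high bits, 12-bit counter in the low bits."""
    return (int(time.time() * 1000) << 12) | (next(_sequence) & 0xFFF)


SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER,
    relevance INTEGER
);
CREATE TABLE identifiers (
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    scheme TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
"""


def add_paper(conn, title, year, relevance, identifiers):
    """Insert a paper unless one of its identifiers is already stored; return its ID."""
    for scheme, value in identifiers.items():
        row = conn.execute(
            "SELECT paper_id FROM identifiers WHERE scheme = ? AND value = ?",
            (scheme, value),
        ).fetchone()
        if row:
            return row[0]  # Already known under this identifier.
    paper_id = next_id()
    conn.execute(
        "INSERT INTO papers (id, title, year, relevance) VALUES (?, ?, ?, ?)",
        (paper_id, title, year, relevance),
    )
    conn.executemany(
        "INSERT INTO identifiers (paper_id, scheme, value) VALUES (?, ?, ?)",
        [(paper_id, s, v) for s, v in identifiers.items()],
    )
    return paper_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = add_paper(conn, "Fourier Neural Operator", 2023, 90, {"arxiv": "2010.08895"})
again = add_paper(conn, "Fourier Neural Operator", 2023, 90, {"arxiv": "2010.08895"})
count = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
print(first == again, count)  # prints: True 1
```

Keeping identifiers in their own table lets one paper carry several external IDs (arXiv, DOI, and so on), so a paper found through two sources resolves to a single row.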
pip install -e ".[dev]"
pytest tests/ -v # Run all tests
pytest tests/ --cov=hfpapers # With coverage

License: MIT

These skills automate common hfpclawer workflows inside Hermes Agent (or any AI coding assistant that supports the Hermes skill format):
| Skill | Purpose | Install |
|---|---|---|
| hfpclawer-paper-search | Daily paper discovery → download → wiki | hermes skills install https://raw.githubusercontent.com/diamond2nv/hfpapers-crawler/main/skills/hfpclawer-paper-search/SKILL.md |
| hfpclawer-citation-audit | Verify citations via S2 + OpenAlex | hermes skills install https://raw.githubusercontent.com/diamond2nv/hfpapers-crawler/main/skills/hfpclawer-citation-audit/SKILL.md |
| hfpclawer-academic-integrity | Paper draft integrity: extract → verify → flag FABRICATED | hermes skills install https://raw.githubusercontent.com/diamond2nv/hfpapers-crawler/main/skills/hfpclawer-academic-integrity/SKILL.md |

After installing, load with skill_view(name='hfpclawer-paper-search') in any
Hermes conversation.
This project incorporates code adapted from:
- academic-research-skills by Cheng-I Wu
(https://github.com/Imbad0202/academic-research-skills)
  - hfpclawer/_text_similarity.py: title normalization and similarity scoring
  - hfpclawer/citation_audit_s2.py: Semantic Scholar API client (architecture reference)
  - hfpclawer/citation_audit_oa.py: OpenAlex API client (architecture reference)

  Licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
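
For readers unfamiliar with the title normalization and similarity scoring mentioned above, the general idea can be sketched with the standard library alone. This is an independent illustration, not the code in hfpclawer/_text_similarity.py: the function names and the choice of difflib.SequenceMatcher as the scoring method are assumptions.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


cited = "Fourier Neural Operator for Parametric Partial Differential Equations"
found = "Fourier neural operator for parametric partial differential equations."
print(title_similarity(cited, found))  # prints: 1.0
```

Normalizing before scoring means differences in case, accents, and trailing punctuation do not lower the score, so only genuine wording differences separate a cited title from the one an API returns.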