zhangleiniu/clawdpaper

Clawdpaper

A pipeline for processing academic PDFs into structured literature cards, ready for survey writing with an LLM.

topics/<name>/
    pdfs/              your PDF collection for this topic
    papers.csv         manually maintained metadata (title, year, venue, bib_key)
    prompts/
        extract.txt    topic-specific extraction prompt
        survey.md      topic-specific Claude survey prompts
        ↓  01_convert.sh --topic <name>   Marker:  PDF → Markdown
        ↓  02_extract.py --topic <name>   Gemini via Vertex AI:  Markdown → card
        ↓  03_merge.py   --topic <name>   cards + papers.csv → all_cards.md
    markdowns/         (generated)
    cards/             (generated)
    all_cards.md       (generated)

LLM (e.g. Claude)      all_cards.md → survey draft
                         (uses bib_key for LaTeX \cite{} references)

papers.csv is a required input. It maps each PDF to its title, year, venue, and LaTeX bib key. Fill it in before running 03_merge.py.
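
Since 03_merge.py depends on this file, it can help to sanity-check it first. The following is a minimal sketch, not part of the repository: it assumes the header uses exactly the column names title, year, venue, and bib_key, and the helper name check_papers_csv is hypothetical. How each row is linked to its PDF is not covered here.

```python
import csv
import io

# Column names taken from the README; the exact header spelling is an assumption.
REQUIRED_COLUMNS = {"title", "year", "venue", "bib_key"}


def check_papers_csv(handle):
    """Return a list of human-readable problems found in a papers.csv file."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {', '.join(sorted(missing))}"]

    problems = []
    seen = set()
    # Row 1 is the header, so data rows start at 2.
    for row_no, row in enumerate(reader, start=2):
        key = (row["bib_key"] or "").strip()
        if not key:
            problems.append(f"row {row_no}: empty bib_key")
        elif key in seen:
            problems.append(f"row {row_no}: duplicate bib_key {key!r}")
        seen.add(key)
        if not (row["year"] or "").strip().isdigit():
            problems.append(f"row {row_no}: year is not a number")
    return problems


if __name__ == "__main__":
    sample = "title,year,venue,bib_key\nSome Paper,2021,ICSE,doe2021\n"
    print(check_papers_csv(io.StringIO(sample)))  # an empty list means no problems
```

To check a real file, open it with open(path, newline="", encoding="utf-8") and pass the handle in.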

Topic-agnostic. The extraction prompt in topics/<name>/prompts/extract.txt defines what to pull from each paper. See prompts/examples/ for worked examples.


Requirements

  • Python 3.10+
  • A GPU machine (optional but recommended for Marker)
  • A Google Cloud project with Vertex AI enabled
  • gcloud CLI installed and authenticated

Setup (one-time)

1. Clone the repo

git clone https://github.com/your_username/clawdpaper.git
cd clawdpaper

2. Create and activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

Ubuntu: if you get an ensurepip error, first install the venv package that matches your Python version, e.g. sudo apt install python3.12-venv

3. Install dependencies

pip install -r requirements.txt

4. Authenticate with Google Cloud

gcloud auth application-default login

5. Configure environment variables

cp .env.example .env

Edit .env and fill in your GCP project details.


Starting a New Topic

mkdir -p topics/<name>/pdfs topics/<name>/prompts

Put your PDFs into topics/<name>/pdfs/ (or use --pdfs-dir to point to an existing folder elsewhere). Then fill in topics/<name>/papers.csv and write the extraction prompt at topics/<name>/prompts/extract.txt.

See prompts/examples/ for a worked extraction prompt and survey prompts.


Full Workflow

Step 1 — Convert PDFs to Markdown

source .venv/bin/activate
bash 01_convert.sh --topic <name>
# or, if PDFs live outside the topic directory:
bash 01_convert.sh --topic <name> --pdfs-dir /path/to/pdfs

Output: one subdirectory per paper in topics/<name>/markdowns/.

Step 2 — Extract structured cards

python 02_extract.py --topic <name>
  • One .md card per paper saved to topics/<name>/cards/
  • Resume-safe: already-processed papers are skipped on re-run
  • Code fences in Gemini output are stripped automatically
  • Failed papers are saved as .error.txt
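
The resume and fence-stripping behaviour listed above can be sketched as follows. This is an illustration only, not the code in 02_extract.py: the function names are hypothetical, and it assumes each card is named after its paper's subdirectory in markdowns/ and that papers with only an .error.txt file should be retried.

```python
from pathlib import Path

# Built from single backticks so this sketch can itself sit inside a fenced block.
FENCE = "`" * 3


def strip_code_fences(text: str) -> str:
    """Remove one outer code fence wrapping the whole model response, if present."""
    stripped = text.strip()
    lines = stripped.splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        return "\n".join(lines[1:-1]).strip()
    return stripped


def pending_papers(markdown_dir: Path, cards_dir: Path) -> list[str]:
    """Names of papers that have Marker output but no finished card yet."""
    # "paper.error.txt" does not match "*.md", so failed papers stay pending.
    done = {card.stem for card in cards_dir.glob("*.md")}
    return sorted(
        entry.name
        for entry in markdown_dir.iterdir()
        if entry.is_dir() and entry.name not in done
    )
```

Skipping by file existence keeps re-runs cheap, because a paper is sent to the model again only if no card was written for it.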

Step 3 — Merge cards

python 03_merge.py --topic <name>

Output: topics/<name>/all_cards.md

The script injects metadata (year, venue, bib_key) from papers.csv into each card header and prints a token count estimate. If the file exceeds ~150k tokens, split it into batches before uploading to Claude.
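
One way to split an oversized file is to pack whole cards greedily into batches under the budget. The sketch below is an assumption-laden illustration, not what 03_merge.py does: it uses a rough heuristic of 4 characters per token, takes the cards as a list of strings you have already loaded, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return len(text) // 4


def split_into_batches(cards: list[str], max_tokens: int = 150_000) -> list[list[str]]:
    """Greedily pack whole cards into batches that stay under max_tokens."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for card in cards:
        cost = estimate_tokens(card)
        # Start a new batch when adding this card would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(card)
        used += cost
    if current:
        batches.append(current)
    return batches
```

A single card larger than the budget still gets a batch of its own, so no card is ever dropped or cut in half.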

Step 4 — Write the survey with Claude

  1. Upload topics/<name>/all_cards.md to claude.ai
  2. Follow the prompts in topics/<name>/prompts/survey.md in order
  3. Claude will write each section using \cite{bib_key} references that map directly to your .bib file
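
Before compiling, it may be worth confirming that every key cited in the draft exists in papers.csv. The following is a minimal sketch with hypothetical function names and made-up example keys. It only recognises \cite-style commands such as \cite, \citep, and \citet, with optional bracket arguments.

```python
import re

# Matches \cite, \citep, \citet, ... with optional [..] arguments before {keys}.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def cited_keys(latex: str) -> set[str]:
    """Collect every key that appears inside a \\cite-style command."""
    keys: set[str] = set()
    for group in CITE.findall(latex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def unknown_keys(latex: str, known: set[str]) -> list[str]:
    """Keys cited in the draft that are not in the known set (e.g. from papers.csv)."""
    return sorted(cited_keys(latex) - known)


if __name__ == "__main__":
    draft = r"Graph models help \cite{li2021, zhou2019} and \citep[see][]{wang2020}."
    print(unknown_keys(draft, {"li2021", "wang2020"}))
```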

File Structure

clawdpaper/
├── .env                        ← your config and secrets (gitignored)
├── .env.example                ← template
├── .gitignore
├── requirements.txt
├── 01_convert.sh               ← PDF → Markdown (Marker)
├── 02_extract.py               ← Markdown → structured card (Gemini/Vertex AI)
├── 03_merge.py                 ← merge cards + inject metadata → all_cards.md
├── prompts/
│   └── examples/               ← reference prompts for new topics
│       ├── extract_gnn_vuln.txt
│       └── claude_prompts_gnn_vuln.md
└── topics/
    └── <topic_name>/
        ├── pdfs/               ← input PDFs (gitignored)
        ├── papers.csv          ← paper metadata: title, year, venue, bib_key
        ├── prompts/
        │   ├── extract.txt     ← extraction prompt for this topic
        │   └── survey.md       ← Claude survey prompts for this topic
        ├── markdowns/          ← Marker output (gitignored)
        ├── cards/              ← extracted cards (gitignored)
        └── all_cards.md        ← final merged file for Claude (gitignored)

Advanced: Multi-GPU Conversion

# Split work across 2 GPUs in parallel
CUDA_VISIBLE_DEVICES=0 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
  --num_chunks 2 --chunk_idx 0 --workers 8 \
  --skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &

CUDA_VISIBLE_DEVICES=1 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
  --num_chunks 2 --chunk_idx 1 --workers 8 \
  --skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &

wait
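
The same pattern extends to more GPUs by giving each one its own chunk index. Below is a hedged Python sketch that reuses only the marker flags shown above. The helper names marker_command and run_on_gpus are hypothetical, and launching the processes requires marker to be on your PATH.

```python
import os
import subprocess


def marker_command(pdfs_dir, output_dir, num_chunks, chunk_idx, workers=8):
    """Build the marker invocation for one chunk, using the flags shown above."""
    return [
        "marker", pdfs_dir,
        "--output_dir", output_dir,
        "--num_chunks", str(num_chunks),
        "--chunk_idx", str(chunk_idx),
        "--workers", str(workers),
        "--skip_existing",
        "--disable_image_extraction",
        "--disable_ocr",
        "--disable_ocr_math",
    ]


def run_on_gpus(pdfs_dir, output_dir, gpu_ids):
    """Start one marker process per GPU and wait for all of them to finish."""
    procs = []
    for chunk_idx, gpu in enumerate(gpu_ids):
        # Pin each process to a single GPU, as CUDA_VISIBLE_DEVICES does in the shell.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        cmd = marker_command(pdfs_dir, output_dir, len(gpu_ids), chunk_idx)
        procs.append(subprocess.Popen(cmd, env=env))
    return [proc.wait() for proc in procs]
```

Calling run_on_gpus("./topics/<name>/pdfs", "./topics/<name>/markdowns", [0, 1]) would mirror the two shell commands above and return one exit code per process.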

License

MIT
