A pipeline for processing academic PDFs into structured literature cards, ready for survey writing with an LLM.
topics/<name>/
  pdfs/           your PDF collection for this topic
  papers.csv      manually maintained metadata (title, year, venue, bib_key)
  prompts/
    extract.txt   topic-specific extraction prompt
    survey.md     topic-specific Claude survey prompts
      ↓ 01_convert.sh --topic <name>   Marker: PDF → Markdown
      ↓ 02_extract.py --topic <name>   Gemini via Vertex AI: Markdown → card
      ↓ 03_merge.py --topic <name>     cards + papers.csv → all_cards.md
  markdowns/      (generated)
  cards/          (generated)
  all_cards.md    (generated)
LLM (e.g. Claude)   all_cards.md → survey draft
                    (uses bib_key for LaTeX \cite{} references)
papers.csv is a required input. It maps each PDF to its title, year, venue, and LaTeX bib key. Fill it before running
03_merge.py.
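Because a missing or duplicated bib_key only shows up later as a broken citation, it can help to check papers.csv before merging. The following is a minimal sketch, not part of the repository: it assumes the file has a header row with the four columns named above, and the function name `check_papers_csv` is hypothetical. Any additional column (for example one naming the PDF file) is simply ignored.

```python
import csv
import io

# Column names taken from the description of papers.csv above.
REQUIRED_COLUMNS = ("title", "year", "venue", "bib_key")


def check_papers_csv(handle):
    """Return a list of human-readable problems found in a papers.csv stream."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 (line 1 is the header); this numbering
    # assumes no field spans multiple lines.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        key = (row["bib_key"] or "").strip()
        if key and key in seen:
            problems.append(f"line {line_no}: duplicate bib_key {key}")
        seen.add(key)
    return problems


# Made-up sample row, for illustration only.
sample = io.StringIO("title,year,venue,bib_key\nExample Paper,2021,,ex2021\n")
print(check_papers_csv(sample))  # ['line 2: empty venue']
```

To check a real file, pass an open handle, e.g. `open("topics/<name>/papers.csv", newline="", encoding="utf-8")`.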
Topic-agnostic. The extraction prompt in topics/<name>/prompts/extract.txt defines what to pull from each paper. See prompts/examples/ for worked examples.
- Python 3.10+
- A GPU machine (optional but recommended for Marker)
- A Google Cloud project with Vertex AI enabled
- gcloud CLI installed and authenticated
git clone https://github.com/your_username/clawdpaper.git
cd clawdpaper
python3 -m venv .venv
source .venv/bin/activate
Ubuntu: if you get an ensurepip error, first run:
sudo apt install python3.12-venv
pip install -r requirements.txt
gcloud auth application-default login
cp .env.example .env
Edit .env and fill in your GCP project details.
mkdir -p topics/<name>/pdfs topics/<name>/prompts
Put your PDFs into topics/<name>/pdfs/ (or use --pdfs-dir to point to an
existing folder elsewhere). Then fill in topics/<name>/papers.csv and write
the extraction prompt at topics/<name>/prompts/extract.txt.
See prompts/examples/ for a worked extraction prompt and survey prompts.
source .venv/bin/activate
bash 01_convert.sh --topic <name>
# or, if PDFs live outside the topic directory:
bash 01_convert.sh --topic <name> --pdfs-dir /path/to/pdfs
Output: one subdirectory per paper in topics/<name>/markdowns/.
python 02_extract.py --topic <name>
- One .md card per paper saved to topics/<name>/cards/
- Resume-safe: already-processed papers are skipped on re-run
- Code fences in Gemini output are stripped automatically
- Failed papers are saved as .error.txt
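The resume and fence-stripping behaviour described above can be sketched as follows. This is an illustrative approximation, not the code in 02_extract.py: the helper names `strip_code_fences` and `pending_papers` are hypothetical, and the sketch assumes that a card is named after its paper's subdirectory in markdowns/ with a .md suffix.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence marker

# Matches a response that is wrapped, as a whole, in one outer code fence
# (optionally with a language tag such as "markdown" on the opening line).
_FENCE_RE = re.compile(
    r"\A\s*" + FENCE + r"[^\n]*\n(.*?)\n?" + FENCE + r"\s*\Z",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Remove one outer code fence around the whole response, if present."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def pending_papers(markdown_dir: Path, cards_dir: Path) -> list[str]:
    """Names of paper subdirectories that do not yet have a card.

    Assumption: the card for markdowns/<paper>/ is cards/<paper>.md.
    """
    return sorted(
        d.name
        for d in markdown_dir.iterdir()
        if d.is_dir() and not (cards_dir / f"{d.name}.md").exists()
    )
```

Skipping on the presence of the output file keeps a re-run cheap after an interruption: only papers without a card are sent to the model again.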
python 03_merge.py --topic <name>
Output: topics/<name>/all_cards.md
The script injects metadata (year, venue, bib_key) from papers.csv into
each card header, and prints a token count estimate. If the file exceeds
~150k tokens, split into batches before uploading to Claude.
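One way to split an oversized file is to group whole cards greedily under a token budget. The sketch below is an assumption-laden illustration: it uses the common rule of thumb of about 4 characters per token, which may differ from the estimate 03_merge.py prints, and it takes the cards as a list of strings because the separator used inside all_cards.md is not specified here. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (heuristic)."""
    return len(text) // 4


def split_into_batches(cards: list[str], max_tokens: int = 150_000) -> list[list[str]]:
    """Group cards in order so each batch stays under max_tokens.

    A single card larger than the budget still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for card in cards:
        cost = estimate_tokens(card)
        # Start a new batch when adding this card would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(card)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Keeping cards whole means no paper is split across two uploads, at the cost of batches that are somewhat uneven in size.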
- Upload topics/<name>/all_cards.md to claude.ai
- Follow the prompts in topics/<name>/prompts/survey.md in order
- Claude will write each section using \cite{bib_key} references that map directly to your .bib file
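Since the draft's citations are only useful if every key exists in your metadata, a quick cross-check of the generated LaTeX can catch keys the model mistyped or made up. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the regular expression covers \cite and common variants such as \citep and \citet with optional bracketed arguments, and the set of known keys is assumed to come from the bib_key column of papers.csv.

```python
import re

# \cite, \citep, \citet, starred forms, and optional [..] arguments,
# followed by a brace group holding one or more comma-separated keys.
_CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def cited_keys(latex: str) -> set[str]:
    """Collect every citation key used in a LaTeX string."""
    keys: set[str] = set()
    for group in _CITE_RE.findall(latex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def unknown_keys(latex: str, known: set[str]) -> set[str]:
    """Keys cited in the draft that are absent from the known bib keys."""
    return cited_keys(latex) - known
```

For example, with made-up keys, `unknown_keys(r"See \cite{a2020, b2021}.", {"a2020"})` returns `{"b2021"}`.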
clawdpaper/
├── .env ← your config and secrets (gitignored)
├── .env.example ← template
├── .gitignore
├── requirements.txt
├── 01_convert.sh ← PDF → Markdown (Marker)
├── 02_extract.py ← Markdown → structured card (Gemini/Vertex AI)
├── 03_merge.py ← merge cards + inject metadata → all_cards.md
├── prompts/
│   └── examples/ ← reference prompts for new topics
│       ├── extract_gnn_vuln.txt
│       └── claude_prompts_gnn_vuln.md
└── topics/
    └── <topic_name>/
        ├── pdfs/ ← input PDFs (gitignored)
        ├── papers.csv ← paper metadata: title, year, venue, bib_key
        ├── prompts/
        │   ├── extract.txt ← extraction prompt for this topic
        │   └── survey.md ← Claude survey prompts for this topic
        ├── markdowns/ ← Marker output (gitignored)
        ├── cards/ ← extracted cards (gitignored)
        └── all_cards.md ← final merged file for Claude (gitignored)
# Split work across 2 GPUs in parallel
CUDA_VISIBLE_DEVICES=0 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
--num_chunks 2 --chunk_idx 0 --workers 8 \
--skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &
CUDA_VISIBLE_DEVICES=1 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
--num_chunks 2 --chunk_idx 1 --workers 8 \
--skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &
wait

MIT