Clawdpaper

A pipeline for processing academic PDFs into structured literature cards, ready for survey writing with an LLM.

topics/<name>/
    pdfs/              your PDF collection for this topic
    papers.csv         manually maintained metadata (title, year, venue, bib_key)
    prompts/
        extract.txt    topic-specific extraction prompt
        survey.md      topic-specific Claude survey prompts
        ↓  01_convert.sh --topic <name>   Marker:  PDF → Markdown
        ↓  02_extract.py --topic <name>   Gemini via Vertex AI:  Markdown → card
        ↓  03_merge.py   --topic <name>   cards + papers.csv → all_cards.md
    markdowns/         (generated)
    cards/             (generated)
    all_cards.md       (generated)

LLM (e.g. Claude)      all_cards.md → survey draft
                         (uses bib_key for LaTeX \cite{} references)

papers.csv is a required input. It maps each PDF to its title, year, venue, and LaTeX bib key. Fill it before running 03_merge.py.
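
A quick sanity check on papers.csv before merging could look like the sketch below. The column names title, year, venue, and bib_key come from the description above; the pdf column, the helper name missing_fields, and the sample rows are hypothetical, and the header that 03_merge.py actually expects may differ.

```python
import csv
import io

# Columns named in this README; the "pdf" column in the sample is an assumed identifier.
REQUIRED = ["title", "year", "venue", "bib_key"]

def missing_fields(csv_text: str) -> list[tuple[int, str]]:
    """Return (row_number, column) pairs for empty or absent required fields."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append((row_number, column))
    return problems

# Made-up rows for illustration only.
sample = (
    "pdf,title,year,venue,bib_key\n"
    "smith2021.pdf,An Example Paper,2021,ICSE,smith2021example\n"
    "lee2022.pdf,Another Paper,2022,,\n"
)
print(missing_fields(sample))
```

Rows with an empty venue or bib_key are reported by row number, so they can be filled in before the merge step runs.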

The pipeline is topic-agnostic: the extraction prompt in topics/<name>/prompts/extract.txt defines what to pull from each paper. See prompts/examples/ for worked examples.

Requirements

Python 3.10+
A GPU machine (optional but recommended for Marker)
A Google Cloud project with Vertex AI enabled
gcloud CLI installed and authenticated

Setup (one-time)

1. Clone the repo

git clone https://github.com/your_username/clawdpaper.git
cd clawdpaper

2. Create and activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

Ubuntu: if you get an ensurepip error, first install the venv package that matches your Python version, e.g. sudo apt install python3.12-venv

3. Install dependencies

pip install -r requirements.txt

4. Authenticate with Google Cloud

gcloud auth application-default login

5. Configure environment variables

cp .env.example .env

Edit .env and fill in your GCP project details.

Starting a New Topic

mkdir -p topics/<name>/pdfs topics/<name>/prompts

Put your PDFs into topics/<name>/pdfs/ (or use --pdfs-dir to point to an existing folder elsewhere). Then fill in topics/<name>/papers.csv and write the extraction prompt at topics/<name>/prompts/extract.txt.

See prompts/examples/ for a worked extraction prompt and survey prompts.

Full Workflow

Step 1 — Convert PDFs to Markdown

source .venv/bin/activate
bash 01_convert.sh --topic <name>
# or, if PDFs live outside the topic directory:
bash 01_convert.sh --topic <name> --pdfs-dir /path/to/pdfs

Output: one subdirectory per paper in topics/<name>/markdowns/.

Step 2 — Extract structured cards

python 02_extract.py --topic <name>

One .md card per paper is saved to topics/<name>/cards/
Resume-safe: already-processed papers are skipped on re-run
Code fences in Gemini output are stripped automatically
For each paper that fails, an .error.txt file is saved instead of a card
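
The resume and fence-stripping behaviour described above can be sketched roughly as follows. This is not the code in 02_extract.py: the function names strip_code_fences and pending_papers are hypothetical, and the sketch assumes each card is named after its paper's subdirectory in markdowns/. Under that assumption a paper that only has an .error.txt file would be retried on the next run.

```python
import re
import tempfile
from pathlib import Path

# Matches one wrapping fence pair (three backticks, optional language tag).
WRAPPED = re.compile(r"^\s*`{3}[\w-]*\n(.*?)\n`{3}\s*$", re.DOTALL)

def strip_code_fences(text: str) -> str:
    """Return the body of a fully fenced response, or the stripped text if unfenced."""
    match = WRAPPED.match(text)
    return match.group(1) if match else text.strip()

def pending_papers(markdown_dir: Path, cards_dir: Path) -> list[str]:
    """Names of paper subdirectories that do not yet have a matching .md card."""
    done = {p.stem for p in cards_dir.glob("*.md")}
    return sorted(
        d.name for d in markdown_dir.iterdir() if d.is_dir() and d.name not in done
    )

fence = "`" * 3
wrapped = fence + "markdown\n# Card\nbody\n" + fence + "\n"
print(strip_code_fences(wrapped))

# Demonstrate the skip logic on a throwaway directory layout.
with tempfile.TemporaryDirectory() as tmp:
    markdowns, cards = Path(tmp, "markdowns"), Path(tmp, "cards")
    cards.mkdir()
    for name in ("paper_a", "paper_b"):
        (markdowns / name).mkdir(parents=True)
    (cards / "paper_a.md").write_text("# done")
    todo = pending_papers(markdowns, cards)
print(todo)
```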

Step 3 — Merge cards

python 03_merge.py --topic <name>

Output: topics/<name>/all_cards.md

The script injects metadata (year, venue, bib_key) from papers.csv into each card header and prints an estimated token count. If the file exceeds ~150k tokens, split it into batches before uploading to Claude.
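
If the merged file is too large, one way to batch the cards is sketched below. It assumes a rough heuristic of about 4 characters per token, which is not necessarily how 03_merge.py computes its estimate, and the names estimate_tokens and batch_cards are hypothetical. A single card larger than the budget still ends up in a batch of its own.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)

def batch_cards(cards: list[str], budget: int = 150_000) -> list[list[str]]:
    """Greedily group cards, in order, so each batch stays within the token budget."""
    batches, current, used = [], [], 0
    for card in cards:
        cost = estimate_tokens(card)
        # Start a new batch when adding this card would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(card)
        used += cost
    if current:
        batches.append(current)
    return batches

# Three dummy cards of about 100 estimated tokens each, with a small budget.
cards = ["a" * 400, "b" * 400, "c" * 400]
batches = batch_cards(cards, budget=250)
print([len(b) for b in batches])
```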

Step 4 — Write the survey with Claude

Upload topics/<name>/all_cards.md to claude.ai
Follow the prompts in topics/<name>/prompts/survey.md in order
Claude will write each section using \cite{bib_key} references that map directly to your .bib file
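
Before compiling a draft, it can help to confirm that every cited key exists in your metadata. The sketch below is one possible check and not part of the repository: cited_keys is a hypothetical helper, and the keys and draft sentence are made up for illustration.

```python
import re

# Matches \cite, \citep, \citet, etc., with optional [..] arguments before the keys.
CITE = re.compile(r"\\cite\w*\*?(?:\[[^\]]*\])*\{([^}]*)\}")

def cited_keys(latex: str) -> set[str]:
    """Collect every key used in a \\cite-style command, splitting comma lists."""
    keys = set()
    for group in CITE.findall(latex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys

# In practice this set would be the bib_key column of papers.csv.
known = {"smith2021example", "lee2022another"}
draft = (
    r"Graph models help \cite{smith2021example, lee2022another}, "
    r"see also \citep[p.~3]{doe2020missing}."
)
unknown = sorted(cited_keys(draft) - known)
print(unknown)
```

Any key printed here appears in the draft but not in the known set, which usually means it is misspelled or missing from papers.csv.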

File Structure

clawdpaper/
├── .env                        ← your config and secrets (gitignored)
├── .env.example                ← template
├── .gitignore
├── requirements.txt
├── 01_convert.sh               ← PDF → Markdown (Marker)
├── 02_extract.py               ← Markdown → structured card (Gemini/Vertex AI)
├── 03_merge.py                 ← merge cards + inject metadata → all_cards.md
├── prompts/
│   └── examples/               ← reference prompts for new topics
│       ├── extract_gnn_vuln.txt
│       └── claude_prompts_gnn_vuln.md
└── topics/
    └── <topic_name>/
        ├── pdfs/               ← input PDFs (gitignored)
        ├── papers.csv          ← paper metadata: title, year, venue, bib_key
        ├── prompts/
        │   ├── extract.txt     ← extraction prompt for this topic
        │   └── survey.md       ← Claude survey prompts for this topic
        ├── markdowns/          ← Marker output (gitignored)
        ├── cards/              ← extracted cards (gitignored)
        └── all_cards.md        ← final merged file for Claude (gitignored)

Advanced: Multi-GPU Conversion

# Split work across 2 GPUs in parallel
CUDA_VISIBLE_DEVICES=0 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
  --num_chunks 2 --chunk_idx 0 --workers 8 \
  --skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &

CUDA_VISIBLE_DEVICES=1 marker ./topics/<name>/pdfs --output_dir ./topics/<name>/markdowns \
  --num_chunks 2 --chunk_idx 1 --workers 8 \
  --skip_existing --disable_image_extraction --disable_ocr --disable_ocr_math &

wait

License

MIT
