arXiv Scraper | Research Paper Data Extraction API | Apify Actor

An arXiv scraper and research paper data extraction API, packaged as an Apify Actor. It extracts titles, authors, abstracts, and PDF links from arXiv. Free tier included.

Built for AI researchers, data scientists, and developers who need clean, structured academic papers without manual LaTeX parsing or GROBID.

Quick Start · API Reference · Pricing · Support

Quick Start

// Run as an ES module: the script uses top-level await.
import { ApifyClient } from 'apify-client';
import 'dotenv/config'; // loads APIFY_TOKEN from .env

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('getascraper/arxiv-rag-extractor').call({
  categoriesFilter: ['cs.LG', 'cs.AI'],
  dateFrom: '2024-01-01',
  dateTo: '2024-01-07',
  maxPapers: 10,
});

// Read the extracted papers from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Output (one dataset item; values are illustrative):

{
  "arxiv_id": "2401.12345",
  "title": "Attention Is All You Need",
  "abstract": "We propose a new simple network architecture...",
  "authors": ["Ashish Vaswani", "Noam Shazeer"],
  "categories": ["cs.CL", "cs.LG"],
  "published": "2024-01-15",
  "pdf_url": "https://arxiv.org/pdf/2401.12345",
  "chunks": [
    {
      "idx": 0,
      "text": "The dominant sequence transduction models...",
      "tokens": 512
    }
  ]
}

Features

LaTeX-stripped chunks for clean text extraction without GROBID
cl100k_base tokenization with 512 tokens per chunk and a 50-token overlap
Structured metadata including authors, categories, DOI, and PDF links
Drop-in output for vector-database pipelines, ready for LangChain, LlamaIndex, Qdrant, and Pinecone
Category and date filtering for targeted research collection

What this actor does

This Actor extracts arXiv papers by category and date range, returning structured JSON with metadata and token-aware chunks. Instead of dealing with raw LaTeX or manual parsing, you get clean text chunks with consistent metadata.

It supports filtering by arXiv category and date range, and caps the number of papers per run. Each paper is returned with its abstract, full-text chunks, and bibliographic metadata.

Installation

npm install

Copy the environment file and add your Apify API token:

cp .env.example .env

Open .env and replace your_apify_token_here with your actual Apify API token. Get one free at console.apify.com.

Input

Field	Type	Description	Default
`categoriesFilter`	array	arXiv categories (e.g. `cs.LG`, `cs.AI`)	none
`dateFrom`	string	Start date (YYYY-MM-DD)	none
`dateTo`	string	End date (YYYY-MM-DD)	none
`maxPapers`	integer	Max papers to extract	100
`searchQuery`	string	Optional full-text search query	none
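
The field types in the table can be checked on the client before a run is started. The following is a minimal sketch, not the Actor's own schema validation: `validateInput` is a hypothetical helper, and the specific rules (such as `maxPapers` being at least 1, or `dateFrom` not being later than `dateTo`) are assumptions inferred from the table.

```javascript
// Sketch: client-side checks mirroring the Input table.
// The rules below are assumptions, not the Actor's documented schema.
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function validateInput(input) {
  const errors = [];
  const { categoriesFilter, dateFrom, dateTo, maxPapers, searchQuery } = input;

  if (categoriesFilter !== undefined) {
    const ok =
      Array.isArray(categoriesFilter) &&
      categoriesFilter.every((c) => typeof c === 'string');
    if (!ok) errors.push('categoriesFilter must be an array of strings');
  }

  const isDate = (v) => typeof v === 'string' && DATE_RE.test(v);
  if (dateFrom !== undefined && !isDate(dateFrom)) errors.push('dateFrom must be YYYY-MM-DD');
  if (dateTo !== undefined && !isDate(dateTo)) errors.push('dateTo must be YYYY-MM-DD');
  // ISO dates sort lexicographically, so a string comparison is enough here.
  if (isDate(dateFrom) && isDate(dateTo) && dateFrom > dateTo) {
    errors.push('dateFrom must not be later than dateTo');
  }

  if (maxPapers !== undefined && (!Number.isInteger(maxPapers) || maxPapers < 1)) {
    errors.push('maxPapers must be a positive integer');
  }
  if (searchQuery !== undefined && typeof searchQuery !== 'string') {
    errors.push('searchQuery must be a string');
  }
  return errors;
}

// A reversed date range produces exactly one error message.
console.log(validateInput({ categoriesFilter: ['cs.LG'], dateFrom: '2024-01-07', dateTo: '2024-01-01' }));
```

An empty array means the input passed every check; the pattern only verifies the shape of a date string, not that the date exists on the calendar.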

Output

Each paper is a structured JSON record with token-aware chunks. Download as JSON, CSV, Excel, or HTML.

Field	Description
`arxiv_id`	arXiv identifier (e.g. `2401.12345`)
`title`	Paper title (whitespace-normalized)
`abstract`	Abstract as returned by arXiv
`authors`	Author display names, order preserved
`categories`	arXiv category tags
`published`	First submission datetime
`updated`	Latest update datetime
`doi`	DOI (when provided)
`pdf_url`	Direct PDF link
`source`	`latex` or `abstract` (text origin)
`chunks`	Fixed-token chunks ready for embedding
`chunks[].idx`	0-indexed position
`chunks[].text`	Chunk text
`chunks[].tokens`	Token count under cl100k_base (≤ 512)

See sample-output.json for a full example.
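
Because every chunk carries its own index and token count, a paper can be flattened into one record per chunk before embedding. The sketch below assumes a generic `{ id, text, metadata }` record shape; `toChunkRecords` and that shape are illustrative, not a format required by any particular vector database or produced by the Actor.

```javascript
// Sketch: turn Actor output items into one flat record per chunk.
// The { id, text, metadata } shape is an assumption for illustration.
function toChunkRecords(papers) {
  return papers.flatMap((paper) =>
    (paper.chunks ?? []).map((chunk) => ({
      id: `${paper.arxiv_id}#${chunk.idx}`, // stable per-chunk identifier
      text: chunk.text,
      metadata: {
        arxiv_id: paper.arxiv_id,
        title: paper.title,
        authors: paper.authors,
        categories: paper.categories,
        published: paper.published,
        pdf_url: paper.pdf_url,
        chunk_idx: chunk.idx,
        tokens: chunk.tokens,
      },
    }))
  );
}

// Sample item copied from the Quick Start output.
const samplePapers = [
  {
    arxiv_id: '2401.12345',
    title: 'Attention Is All You Need',
    authors: ['Ashish Vaswani', 'Noam Shazeer'],
    categories: ['cs.CL', 'cs.LG'],
    published: '2024-01-15',
    pdf_url: 'https://arxiv.org/pdf/2401.12345',
    chunks: [{ idx: 0, text: 'The dominant sequence transduction models...', tokens: 512 }],
  },
];

const records = toChunkRecords(samplePapers);
console.log(records[0].id); // "2401.12345#0"
```

Keeping `arxiv_id` and `chunk_idx` in the metadata lets a retrieval hit be traced back to its paper and position.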

Pricing

$0.015 per paper.

A run of 100 papers typically completes in 1 to 2 minutes. Pay only for what you extract.
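
For budgeting, the per-paper price can be turned into a quick estimate. This sketch assumes billing is strictly $0.015 per extracted paper with no other charges; `estimateCostUsd` is a hypothetical helper, not part of the Actor or the Apify client.

```javascript
// Sketch: estimate run cost from the listed price of $0.015 per paper.
// Assumes no additional platform or storage charges.
function estimateCostUsd(paperCount) {
  if (!Number.isInteger(paperCount) || paperCount < 0) {
    throw new RangeError('paperCount must be a non-negative integer');
  }
  // 15 thousandths of a dollar per paper; integer math first avoids float drift.
  return (paperCount * 15) / 1000;
}

console.log(estimateCostUsd(100)); // 1.5
```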

Use Cases

AI training data: Build RAG corpora from arXiv without manual LaTeX parsing
Research tracking: Monitor new papers in specific categories weekly
Literature review: Extract structured abstracts and metadata for systematic reviews
Embedding pipelines: Feed pre-chunked text directly into vector databases

FAQ

What categories can I filter by? Any valid arXiv category, for example cs.LG (Machine Learning), cs.AI (Artificial Intelligence), or cs.CL (Computation and Language), as well as categories from other archives such as physics and math.

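How does chunking work? Each paper is split into chunks of up to 512 tokens under the cl100k_base encoding, with a 50-token overlap between consecutive chunks. The overlap keeps text near a chunk boundary available in both neighboring chunks, which helps preserve context across chunks.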
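
The windowing arithmetic can be illustrated with a small sketch. It is not the Actor's implementation, which this README does not show: `chunkTokens` is a hypothetical function that operates on an already tokenized array, and placeholder integers stand in for real cl100k_base token IDs. With a chunk size of 512 and an overlap of 50, each window starts 462 tokens after the previous one.

```javascript
// Sketch: sliding-window chunking over an array of token IDs.
// Illustrative only; the Actor's real chunker is not shown in this README.
function chunkTokens(tokenIds, chunkSize = 512, overlap = 50) {
  if (overlap >= chunkSize) {
    throw new RangeError('overlap must be smaller than chunkSize');
  }
  const stride = chunkSize - overlap; // 462 with the defaults
  const chunks = [];
  for (let start = 0; start < tokenIds.length; start += stride) {
    const slice = tokenIds.slice(start, start + chunkSize);
    chunks.push({ idx: chunks.length, start, tokens: slice.length, tokenIds: slice });
    if (start + chunkSize >= tokenIds.length) break; // this window reached the end
  }
  return chunks;
}

// 1,000 placeholder token IDs stand in for a tokenized paper.
const tokenIds = Array.from({ length: 1000 }, (_, i) => i);
const chunks = chunkTokens(tokenIds);
console.log(chunks.map((c) => [c.start, c.tokens]));
// [ [ 0, 512 ], [ 462, 512 ], [ 924, 76 ] ]
```

The last chunk is shorter than 512 tokens, which is consistent with the `≤ 512` bound on `chunks[].tokens` in the Output table.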

Do I need LaTeX parsing on my side? No. The Actor handles LaTeX stripping internally. You receive clean plain text.

Support

Open an issue in the Apify Console.

Related Resources

Ready to start extracting?

Open the arXiv Scraper for RAG on Apify
