arXiv scraper and research paper data extraction API. Extract titles, authors, abstracts, and PDFs from arXiv with this Apify actor. Free tier included.
Built for AI researchers, data scientists, and developers who need clean, structured academic papers without manual LaTeX parsing or GROBID.
import { ApifyClient } from 'apify-client';
import 'dotenv/config';
// Authenticate with the token stored in .env
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// Start the Actor and wait for the run to finish
const run = await client.actor('getascraper/arxiv-rag-extractor').call({
  categoriesFilter: ['cs.LG', 'cs.AI'],
  dateFrom: '2024-01-01',
  dateTo: '2024-01-07',
  maxPapers: 10,
});
// Fetch the extracted papers from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
Output:
{
  "arxiv_id": "2401.12345",
  "title": "Attention Is All You Need",
  "abstract": "We propose a new simple network architecture...",
  "authors": ["Ashish Vaswani", "Noam Shazeer"],
  "categories": ["cs.CL", "cs.LG"],
  "published": "2024-01-15",
  "pdf_url": "https://arxiv.org/pdf/2401.12345",
  "chunks": [
    {
      "idx": 0,
      "text": "The dominant sequence transduction models...",
      "tokens": 512
    }
  ]
}
- LaTeX-stripped chunks for clean text extraction without GROBID
- cl100k_base tokenization with up to 512 tokens per chunk and a 50-token overlap
- Structured metadata including authors, categories, DOI, and PDF links
- Drop-in output for RAG stacks, ready to load into LangChain, LlamaIndex, Qdrant, or Pinecone
- Category and date filtering for targeted research collection
This Actor extracts arXiv papers by category and date range, returning structured JSON with metadata and token-aware chunks. Instead of dealing with raw LaTeX or manual parsing, you get clean text chunks with consistent metadata.
It supports filtering by arXiv categories, date ranges, and maximum paper limits. Each paper is returned with its abstract, full text chunks, and bibliographic metadata.
Install the dependencies:
npm install
Copy the environment file and add your Apify API token:
cp .env.example .env
Open .env and replace your_apify_token_here with your actual Apify API token. Get one free at console.apify.com.
| Field | Type | Description | Default |
|---|---|---|---|
| categoriesFilter | array | arXiv categories (e.g. cs.LG, cs.AI) | none |
| dateFrom | string | Start date (YYYY-MM-DD) | none |
| dateTo | string | End date (YYYY-MM-DD) | none |
| maxPapers | integer | Max papers to extract | 100 |
| searchQuery | string | Optional full-text search query | none |
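The input fields above can be checked on the client before a run is started. The following is a minimal sketch, not part of the Actor: `buildInput` is a hypothetical helper, the default of 100 for `maxPapers` is taken from the table, and the assumption that fields with no default may be omitted is inferred from the table rather than confirmed. The Actor may apply its own validation.

```javascript
// Hypothetical client-side helper; the Actor may validate input differently.
const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

function buildInput({ categoriesFilter, dateFrom, dateTo, maxPapers = 100, searchQuery } = {}) {
  if (!Number.isInteger(maxPapers) || maxPapers < 1) {
    throw new Error('maxPapers must be a positive integer');
  }
  const input = { maxPapers };

  // Format check only: this does not verify that the date exists on the calendar.
  for (const [name, value] of Object.entries({ dateFrom, dateTo })) {
    if (value === undefined) continue;
    if (typeof value !== 'string' || !DATE_PATTERN.test(value)) {
      throw new Error(`${name} must use the YYYY-MM-DD format`);
    }
    input[name] = value;
  }

  // YYYY-MM-DD strings sort chronologically, so a string comparison is enough.
  if (dateFrom !== undefined && dateTo !== undefined && dateFrom > dateTo) {
    throw new Error('dateFrom must not be after dateTo');
  }

  if (categoriesFilter !== undefined) {
    if (!Array.isArray(categoriesFilter)) {
      throw new Error('categoriesFilter must be an array of category strings');
    }
    input.categoriesFilter = categoriesFilter;
  }
  if (searchQuery !== undefined) input.searchQuery = searchQuery;
  return input;
}

console.log(buildInput({ categoriesFilter: ['cs.LG'], dateFrom: '2024-01-01', dateTo: '2024-01-07' }));
```

The returned object can be passed as the argument to `.call()` in place of a hand-written literal.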
Each paper is a structured JSON record with token-aware chunks. Download as JSON, CSV, Excel, or HTML.
| Field | Description |
|---|---|
| arxiv_id | arXiv identifier (e.g. 2401.12345) |
| title | Paper title (whitespace-normalized) |
| abstract | Abstract as returned by arXiv |
| authors | Author display names, order preserved |
| categories | arXiv category tags |
| published | First submission datetime |
| updated | Latest update datetime |
| doi | DOI (when provided) |
| pdf_url | Direct PDF link |
| source | latex or abstract (text origin) |
| chunks | Fixed-token chunks ready for embedding |
| chunks[].idx | 0-indexed position |
| chunks[].text | Chunk text |
| chunks[].tokens | Token count under cl100k_base (≤ 512) |
See sample-output.json for a full example.
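To feed these records into an embedding pipeline, each paper usually has to be flattened into one record per chunk. The sketch below shows one way to do that using only the output fields documented above. The `toVectorRecords` name and the `{ id, text, metadata }` record shape are illustrative assumptions, not the API of any particular vector database.

```javascript
// Hypothetical helper: flattens one paper into per-chunk records.
// The { id, text, metadata } shape is an assumption; adapt it to your vector store.
function toVectorRecords(paper) {
  return (paper.chunks ?? []).map((chunk) => ({
    // arxiv_id plus chunk index gives a stable, unique id per chunk
    id: `${paper.arxiv_id}:${chunk.idx}`,
    text: chunk.text,
    metadata: {
      arxiv_id: paper.arxiv_id,
      title: paper.title,
      authors: paper.authors,
      categories: paper.categories,
      published: paper.published,
      pdf_url: paper.pdf_url,
      tokens: chunk.tokens,
    },
  }));
}

// Usage with a record shaped like the sample output in this README
const samplePaper = {
  arxiv_id: '2401.12345',
  title: 'Attention Is All You Need',
  authors: ['Ashish Vaswani', 'Noam Shazeer'],
  categories: ['cs.CL', 'cs.LG'],
  published: '2024-01-15',
  pdf_url: 'https://arxiv.org/pdf/2401.12345',
  chunks: [{ idx: 0, text: 'The dominant sequence transduction models...', tokens: 512 }],
};
console.log(toVectorRecords(samplePaper));
```

Keeping the paper-level metadata on every chunk makes it possible to filter search results by category or date without a second lookup.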
$0.015 per paper.
A run of 100 papers typically completes in 1 to 2 minutes. Pay only for what you extract.
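As a rough budgeting aid, the per-paper price can be turned into a cost estimate. This sketch assumes the $0.015 rate stated above is the only charge; `estimateCostUsd` is a hypothetical helper, and any other platform fees are not covered here.

```javascript
// $0.015 per paper, stored as thousandths of a dollar to avoid floating-point drift
const PRICE_MILLS_PER_PAPER = 15;

function estimateCostUsd(paperCount) {
  if (!Number.isInteger(paperCount) || paperCount < 0) {
    throw new RangeError('paperCount must be a non-negative integer');
  }
  return (paperCount * PRICE_MILLS_PER_PAPER) / 1000;
}

console.log(estimateCostUsd(100)); // 1.5
```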
- AI training data: Build RAG corpora from arXiv without manual LaTeX parsing
- Research tracking: Monitor new papers in specific categories weekly
- Literature review: Extract structured abstracts and metadata for systematic reviews
- Embedding pipelines: Feed pre-chunked text directly into vector databases
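For the weekly research-tracking use case, the date window can be computed at run time instead of being hard-coded. The sketch below is one way to do it. `weeklyWindow` is a hypothetical helper, and it assumes that `dateFrom` and `dateTo` are both inclusive and interpreted as UTC dates, which this README does not confirm.

```javascript
// Hypothetical helper: returns a window of `days` calendar days ending on `now` (UTC).
// Assumes both bounds are inclusive; verify against the Actor's behavior.
function weeklyWindow(now = new Date(), days = 7) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const toIsoDate = (d) => d.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  const start = new Date(end.getTime() - (days - 1) * DAY_MS);
  return { dateFrom: toIsoDate(start), dateTo: toIsoDate(end) };
}

// Usage: build the input for a scheduled weekly run
const weeklyInput = { categoriesFilter: ['cs.LG', 'cs.AI'], ...weeklyWindow(), maxPapers: 100 };
console.log(weeklyInput);
```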
What categories can I filter by?
Any valid arXiv category, such as cs.LG (Machine Learning), cs.AI (Artificial Intelligence), or cs.CL (Computation and Language), as well as categories from other archives such as physics and math.
How does chunking work?
Each paper is split into chunks of up to 512 tokens under the cl100k_base encoding, with a 50-token overlap between consecutive chunks. The overlap helps preserve context across chunk boundaries.
Do I need LaTeX parsing on my side?
No. The Actor strips LaTeX internally, so you receive clean plain text.
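The chunking scheme described in the FAQ can be illustrated with a sliding window over a token array. This is a sketch of the general technique, not the Actor's code: `chunkTokens` is a hypothetical function, it takes tokens that are already encoded (it does not perform cl100k_base tokenization), and it assumes the window advances by size minus overlap, which may differ from the Actor's exact behavior.

```javascript
// Sliding-window chunking over an array of already-encoded tokens.
// Assumption: each window starts (size - overlap) tokens after the previous one.
function chunkTokens(tokens, size = 512, overlap = 50) {
  if (size < 1 || overlap < 0 || overlap >= size) {
    throw new RangeError('require size >= 1 and 0 <= overlap < size');
  }
  const stride = size - overlap;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += stride) {
    const slice = tokens.slice(start, start + size);
    chunks.push({ idx: chunks.length, tokens: slice.length, tokenIds: slice });
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Usage: 1000 placeholder token ids produce windows starting at 0, 462, and 924
const demoTokens = Array.from({ length: 1000 }, (_, i) => i);
console.log(chunkTokens(demoTokens).map((c) => [c.idx, c.tokens]));
```

With the default settings, the last 50 tokens of each chunk reappear as the first 50 tokens of the next one, and only the final chunk can be shorter than 512 tokens.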
To report a problem or request a feature, open an issue in the Apify Console.
