getascraper/how-to-scrape-arxiv

arXiv Scraper | Research Paper Data Extraction API | Apify Actor


An arXiv scraper and research paper data extraction API, packaged as an Apify Actor. It extracts titles, authors, abstracts, and PDF links from arXiv. A free tier is included.

Built for AI researchers, data scientists, and developers who need clean, structured academic papers without manual LaTeX parsing or GROBID.

Quick Start · API Reference · Pricing · Support



Quick Start

```javascript
import { ApifyClient } from 'apify-client';
import 'dotenv/config'; // loads APIFY_TOKEN from .env

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('getascraper/arxiv-rag-extractor').call({
  categoriesFilter: ['cs.LG', 'cs.AI'],
  dateFrom: '2024-01-01',
  dateTo: '2024-01-07',
  maxPapers: 10,
});

// Read the extracted papers from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

Example output (a single dataset item, abridged, with illustrative values):

```json
{
  "arxiv_id": "2401.12345",
  "title": "Attention Is All You Need",
  "abstract": "We propose a new simple network architecture...",
  "authors": ["Ashish Vaswani", "Noam Shazeer"],
  "categories": ["cs.CL", "cs.LG"],
  "published": "2024-01-15",
  "pdf_url": "https://arxiv.org/pdf/2401.12345",
  "chunks": [
    {
      "idx": 0,
      "text": "The dominant sequence transduction models...",
      "tokens": 512
    }
  ]
}
```

Features

  • LaTeX-stripped chunks for clean text extraction without GROBID
  • cl100k_base tokenization with 512 tokens per chunk and a 50-token overlap
  • Structured metadata including authors, categories, DOI, and PDF links
  • Drop-in chunks for embedding pipelines, ready for LangChain, LlamaIndex, Qdrant, and Pinecone
  • Category and date filtering for targeted research collection

What this actor does

This Actor extracts arXiv papers by category and date range, returning structured JSON with metadata and token-aware chunks. Instead of dealing with raw LaTeX or manual parsing, you get clean text chunks with consistent metadata.

It supports filtering by arXiv categories, date ranges, and maximum paper limits. Each paper is returned with its abstract, full text chunks, and bibliographic metadata.


Installation

```bash
npm install
```

Copy the environment file and add your Apify API token:

```bash
cp .env.example .env
```

Open .env and replace your_apify_token_here with your actual Apify API token. Get one free at console.apify.com.


Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| categoriesFilter | array | arXiv categories (e.g. cs.LG, cs.AI) | none |
| dateFrom | string | Start date (YYYY-MM-DD) | none |
| dateTo | string | End date (YYYY-MM-DD) | none |
| maxPapers | integer | Max papers to extract | 100 |
| searchQuery | string | Optional full-text search query | none |
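
The input fields can be sanity-checked locally before a run is started. The sketch below is a minimal, hypothetical helper: `buildInput` is not part of the Actor or of `apify-client`, and it assumes the Actor expects exactly the field names listed in the table. It applies the documented default of 100 for `maxPapers`; whether the Actor accepts a run without categories or dates is not stated here, so only supplied values are checked.

```javascript
// Hypothetical helper, not part of the Actor or apify-client.
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function buildInput({ categoriesFilter, dateFrom, dateTo, maxPapers = 100, searchQuery } = {}) {
  // Dates must use the YYYY-MM-DD format from the input table.
  for (const [name, value] of [['dateFrom', dateFrom], ['dateTo', dateTo]]) {
    if (value !== undefined && !DATE_RE.test(value)) {
      throw new Error(`${name} must be formatted as YYYY-MM-DD, got "${value}"`);
    }
  }
  // ISO-formatted dates sort correctly when compared as strings.
  if (dateFrom && dateTo && dateFrom > dateTo) {
    throw new Error('dateFrom must not be after dateTo');
  }
  if (!Number.isInteger(maxPapers) || maxPapers < 1) {
    throw new Error('maxPapers must be a positive integer');
  }
  if (categoriesFilter !== undefined && !Array.isArray(categoriesFilter)) {
    throw new Error('categoriesFilter must be an array of category strings');
  }

  // Only include the optional fields that were actually provided.
  const input = { maxPapers };
  if (categoriesFilter !== undefined) input.categoriesFilter = categoriesFilter;
  if (dateFrom !== undefined) input.dateFrom = dateFrom;
  if (dateTo !== undefined) input.dateTo = dateTo;
  if (searchQuery !== undefined) input.searchQuery = searchQuery;
  return input;
}

const exampleInput = buildInput({
  categoriesFilter: ['cs.LG', 'cs.AI'],
  dateFrom: '2024-01-01',
  dateTo: '2024-01-07',
});
console.log(exampleInput);
```

The returned object can be passed as the argument to `.call()` in the Quick Start.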

Output

Each paper is a structured JSON record with token-aware chunks. Download as JSON, CSV, Excel, or HTML.

| Field | Description |
| --- | --- |
| arxiv_id | arXiv identifier (e.g. 2401.12345) |
| title | Paper title (whitespace-normalized) |
| abstract | Abstract as returned by arXiv |
| authors | Author display names, order preserved |
| categories | arXiv category tags |
| published | First submission datetime |
| updated | Latest update datetime |
| doi | DOI (when provided) |
| pdf_url | Direct PDF link |
| source | latex or abstract (text origin) |
| chunks | Fixed-token chunks ready for embedding |
| chunks[].idx | 0-indexed position |
| chunks[].text | Chunk text |
| chunks[].tokens | Token count under cl100k_base (≤ 512) |

See sample-output.json for a full example.
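
For embedding pipelines, each paper usually needs to be flattened into one record per chunk. The sketch below shows one way to do that using only the output fields documented above. The helper name `toChunkRecords` and the `{ id, text, metadata }` record shape are assumptions for illustration, not the format of any particular vector database client.

```javascript
// Hypothetical helper: flattens one paper into one record per chunk.
function toChunkRecords(paper) {
  return (paper.chunks ?? []).map((chunk) => ({
    // Assumed ID scheme: arXiv ID plus chunk position.
    id: `${paper.arxiv_id}#${chunk.idx}`,
    text: chunk.text,
    metadata: {
      arxiv_id: paper.arxiv_id,
      title: paper.title,
      authors: paper.authors,
      categories: paper.categories,
      published: paper.published,
      pdf_url: paper.pdf_url,
      tokens: chunk.tokens,
    },
  }));
}

// Sample item taken from the Quick Start output.
const samplePaper = {
  arxiv_id: '2401.12345',
  title: 'Attention Is All You Need',
  authors: ['Ashish Vaswani', 'Noam Shazeer'],
  categories: ['cs.CL', 'cs.LG'],
  published: '2024-01-15',
  pdf_url: 'https://arxiv.org/pdf/2401.12345',
  chunks: [
    { idx: 0, text: 'The dominant sequence transduction models...', tokens: 512 },
  ],
};

const chunkRecords = toChunkRecords(samplePaper);
console.log(chunkRecords[0].id); // "2401.12345#0"
```

Keeping the paper-level metadata on every record makes it possible to filter search results by category or date without a second lookup.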


Pricing

$0.015 per paper.

A run of 100 papers typically completes in 1 to 2 minutes. Pay only for what you extract.
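
As a rough guide, the per-paper price can be turned into a run estimate. The sketch below assumes the flat $0.015 rate stated above and ignores any free-tier allowance or other platform charges, which are not specified here. The function name `estimateCostUsd` is hypothetical.

```javascript
// $0.015 per paper, held as thousandths of a dollar to avoid float rounding.
const PRICE_MILLS_PER_PAPER = 15;

// Hypothetical helper: estimated cost in USD for a given number of papers.
function estimateCostUsd(paperCount) {
  if (!Number.isInteger(paperCount) || paperCount < 0) {
    throw new RangeError('paperCount must be a non-negative integer');
  }
  return (paperCount * PRICE_MILLS_PER_PAPER) / 1000;
}

console.log(estimateCostUsd(100).toFixed(2)); // "1.50"
console.log(estimateCostUsd(10).toFixed(2)); // "0.15"
```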


Use Cases

  • AI training data: Build RAG corpora from arXiv without manual LaTeX parsing
  • Research tracking: Monitor new papers in specific categories weekly
  • Literature review: Extract structured abstracts and metadata for systematic reviews
  • Embedding pipelines: Feed pre-chunked text directly into vector databases
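
For the weekly research-tracking case, the date window can be computed rather than hard-coded. The sketch below is a minimal example that returns the seven days ending yesterday (UTC) as `dateFrom` and `dateTo`. The helper name `weeklyWindow` is hypothetical, and treating `dateTo` as inclusive is an assumption; scheduling the run itself is left to whichever scheduler you use.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: the 7-day window ending yesterday, in UTC.
function weeklyWindow(today = new Date()) {
  const toIsoDate = (date) => date.toISOString().slice(0, 10);
  // Midnight UTC of the current day, minus one day, gives yesterday.
  const todayUtc = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  const end = new Date(todayUtc - MS_PER_DAY);
  const start = new Date(end.getTime() - 6 * MS_PER_DAY);
  return { dateFrom: toIsoDate(start), dateTo: toIsoDate(end) };
}

console.log(weeklyWindow(new Date('2024-01-08T12:00:00Z')));
// { dateFrom: '2024-01-01', dateTo: '2024-01-07' }
```

The result can be spread into the Actor input together with `categoriesFilter` and `maxPapers`.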

FAQ

What categories can I filter by? Any valid arXiv category, for example cs.LG (Machine Learning), cs.AI (Artificial Intelligence), and cs.CL (Computation and Language), as well as categories from other archives such as physics and math.

How does chunking work? Each paper is split into chunks of up to 512 tokens under the cl100k_base encoding, with a 50-token overlap between consecutive chunks. The overlap helps preserve context across chunk boundaries.
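
The windowing arithmetic behind that description can be sketched as follows. This is a generic sliding window over an array of token IDs, not the Actor's actual implementation, which is not shown in this README. It assumes the tokens already come from a cl100k_base tokenizer (not included here), and how the Actor treats a short final chunk is also an assumption.

```javascript
// Generic sliding-window chunker over an array of token IDs.
// With the defaults, each window starts 512 - 50 = 462 tokens after the previous one.
function chunkTokens(tokens, chunkSize = 512, overlap = 50) {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('overlap must be non-negative and smaller than chunkSize');
  }
  const stride = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += stride) {
    const slice = tokens.slice(start, start + chunkSize);
    chunks.push({ idx: chunks.length, start, tokens: slice.length });
    // Stop once a window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Placeholder token IDs standing in for real tokenizer output.
const fakeTokens = Array.from({ length: 1000 }, (_, i) => i);
console.log(chunkTokens(fakeTokens));
// Three chunks starting at 0, 462, and 924, with 512, 512, and 76 tokens.
```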

Do I need LaTeX parsing on my side? No. The Actor handles LaTeX stripping internally. You receive clean plain text.


Support

Open an issue in the Apify Console.


Ready to start extracting?

Open the arXiv Scraper for RAG on Apify
