OpenAlex S3 → DuckDB loader powered by `rich` progress bars.

Reads OpenAlex NDJSON dumps directly from S3 via DuckDB's `httpfs` extension, with no downloading required.

- Direct S3 reads via DuckDB `httpfs`, no local downloads
- Zero-setup DuckDB loading via `read_json_auto(...)`
- Filter by date range (`YYYY-MM-DD`) and by part numbers
- Resume from a specific date and part after a failure
- Optional SQL-style `WHERE` predicate
- Optional `rich` progress bar showing batch progress

```bash
pip install pyalexs3
```

or with uv:

```bash
uv add pyalexs3
```

Python 3.10+ is required.

```python
from pyalexs3.core import OpenAlexS3Processor

p = OpenAlexS3Processor(n_workers=4)

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
):
    df = rel.df()
    print(df.head())
```

To filter rows, pass a SQL predicate through `where_clause`:

```python
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
    where_clause="title IS NOT NULL AND language='en'",
):
    df = rel.df()
```

If your pipeline fails midway, resume from a specific date and part number:

```python
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    resume_from="2025-01-15/5",  # skip everything before 2025-01-15 part 5
):
    df = rel.df()
```

To load only specific part files, pass `parts`:

```python
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-01-01",
    parts=[0, 1, 2],  # only load part_000.gz, part_001.gz, part_002.gz
):
    df = rel.df()
```

To show a `rich` progress bar, construct the processor with `show_progress=True`:

```python
p = OpenAlexS3Processor(n_workers=4, show_progress=True)

for file_batch, rel in p.lazy_load(obj_type="works"):
    df = rel.df()
```

Each `lazy_load` iteration yields both the file batch and the relation:

```python
for file_batch, rel in p.lazy_load(obj_type="works"):
    print(f"Processing: {file_batch}")  # list of S3 keys in this batch
    df = rel.df()
```

`OpenAlexS3Processor` accepts the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_workers` | `int` | `4` | DuckDB thread count |
| `show_progress` | `bool` | `False` | Show `rich` progress bar |
| `pragma_show_progress` | `bool` | `False` | Enable DuckDB internal progress bar |

`lazy_load` accepts the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `obj_type` | `str` | required | OpenAlex object type, e.g. `works`, `authors` |
| `columns` | `list[str] \| None` | `None` | Columns to select. `None` = all |
| `limit` | `int \| None` | `None` | Max records per batch |
| `start_date` | `str \| None` | `2016-06-24` | Start of date range, `YYYY-MM-DD` (inclusive) |
| `end_date` | `str \| None` | today | End of date range, `YYYY-MM-DD` (inclusive) |
| `parts` | `list[int] \| None` | `None` | Specific part numbers to load. `None` = all |
| `where_clause` | `str \| None` | `None` | SQL filter. Do not include the `WHERE` keyword |
| `resume_from` | `str \| None` | `None` | Resume from `YYYY-MM-DD/<part>`, e.g. `2025-01-15/5` |
| `batch_size` | `int` | `10` | Number of S3 files per batch |

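
The `start_date`, `end_date`, `parts`, and `resume_from` parameters together act as a filter over S3 keys. The following is a minimal sketch of how such a filter could work, not the package's actual implementation. The helper `select_keys`, the pattern `KEY_RE`, and the key layout `updated_date=YYYY-MM-DD/part_NNN.gz` are assumptions made for illustration.

```python
import re
from datetime import date

# Assumed key layout: data/<obj_type>/updated_date=YYYY-MM-DD/part_NNN.gz
KEY_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/part_(\d+)\.gz$")


def select_keys(keys, start_date, end_date, parts=None, resume_from=None):
    """Hypothetical helper: keep keys inside the inclusive date range,
    optionally restricted to `parts`, skipping anything before `resume_from`."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    resume = None
    if resume_from is not None:
        resume_day, resume_part = resume_from.split("/")
        resume = (date.fromisoformat(resume_day), int(resume_part))

    selected = []
    for key in keys:
        match = KEY_RE.search(key)
        if match is None:
            continue  # not a data part (e.g. a manifest file)
        day = date.fromisoformat(match.group(1))
        part = int(match.group(2))
        if not (start <= day <= end):
            continue
        if parts is not None and part not in parts:
            continue
        # Tuple comparison orders by date first, then by part number.
        if resume is not None and (day, part) < resume:
            continue
        selected.append(key)
    return selected


keys = [
    "data/works/updated_date=2025-01-14/part_007.gz",
    "data/works/updated_date=2025-01-15/part_004.gz",
    "data/works/updated_date=2025-01-15/part_005.gz",
    "data/works/updated_date=2025-03-02/part_000.gz",
]
kept = select_keys(keys, "2025-01-01", "2025-03-01", resume_from="2025-01-15/5")
print(kept)  # ['data/works/updated_date=2025-01-15/part_005.gz']
```

Comparing `(date, part)` tuples keeps the resume point itself and drops only what sorts strictly before it, which matches the "skip everything before" wording of `resume_from`.
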
Yields `tuple[list[str], duckdb.DuckDBPyRelation]`:

- `list[str]`: S3 keys in this batch (useful for progress tracking)
- `DuckDBPyRelation`: query the batch with `.df()`, `.arrow()`, `.fetchall()`

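
Each yielded `list[str]` holds at most `batch_size` keys. As a rough sketch of that grouping, assuming the loader simply slices an ordered key list (the helper name `iter_batches` is hypothetical and not part of the package):

```python
from typing import Iterator


def iter_batches(keys: list[str], batch_size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


# 25 illustrative keys in the assumed OpenAlex layout
keys = [f"data/works/updated_date=2025-01-01/part_{i:03d}.gz" for i in range(25)]
batches = list(iter_batches(keys, batch_size=10))
print([len(b) for b in batches])  # [10, 10, 5]
```

Under this assumption the last batch is simply shorter than the others, so code that tracks progress should count keys rather than batches.
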
Supported object types: `works`, `authors`, `sources`, `institutions`, `topics`, `keywords`, `publishers`, `funders`, `concepts`

- No downloads: data is read directly from S3 via DuckDB `httpfs`. No temp files, no cleanup needed.
- DuckDB: installs and loads `httpfs` automatically on init. Sets `PRAGMA threads` to `n_workers`.
- Object cache: `PRAGMA enable_object_cache=true` is set by default for repeated queries on the same files.
- S3 auth: OpenAlex S3 is public. No credentials needed.

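
The notes above describe a one-time DuckDB setup followed by `read_json_auto` queries over S3 files. The sketch below shows one plausible shape for both, built as plain SQL strings so it runs without DuckDB installed. The helpers `setup_statements` and `build_query` and the example `s3://openalex/...` URL are assumptions for illustration, not the package's internals.

```python
def setup_statements(n_workers: int = 4) -> list[str]:
    """SQL a loader might run once per connection (hypothetical helper)."""
    return [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"PRAGMA threads={int(n_workers)};",
        "PRAGMA enable_object_cache=true;",
    ]


def build_query(urls, columns=None, where_clause=None, limit=None) -> str:
    """Compose a read_json_auto query over one batch of S3 URLs."""
    select_list = ", ".join(columns) if columns else "*"
    # URLs become SQL string literals; embedded single quotes are doubled.
    files = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    query = f"SELECT {select_list} FROM read_json_auto([{files}])"
    if where_clause:
        query += f" WHERE {where_clause}"  # the caller omits the WHERE keyword
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Assumed example URL in the public OpenAlex bucket layout
urls = ["s3://openalex/data/works/updated_date=2025-01-01/part_000.gz"]
query = build_query(
    urls, columns=["id", "title"], where_clause="title IS NOT NULL", limit=5
)
print(query)
```

With DuckDB installed, the statements could be run through `duckdb.connect().execute(...)` and the query passed to the connection's `sql(...)` method to obtain a relation. Because `where_clause` is spliced into the SQL text as-is, it should only ever come from trusted code.
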
Dev dependencies include pytest.

```bash
uv sync --extra dev
uv run pytest -q
```

Tests mock the S3 client directly using `unittest.mock` to test the file listing and filtering logic without hitting real S3.

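
As an illustration of that testing approach, here is a small sketch of a listing function tested against a `MagicMock` client. The function `list_keys` is a hypothetical stand-in for the package's listing logic, and the sketch assumes a boto3-style client that exposes the `list_objects_v2` paginator.

```python
from unittest.mock import MagicMock


def list_keys(client, bucket: str, prefix: str) -> list[str]:
    """Hypothetical stand-in: collect every object key under `prefix`
    using a boto3-style paginator."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def test_list_keys_reads_all_pages():
    client = MagicMock()
    # Two fake result pages; the second one has no "Contents" entry.
    client.get_paginator.return_value.paginate.return_value = [
        {"Contents": [{"Key": "data/works/updated_date=2025-01-01/part_000.gz"}]},
        {},
    ]
    keys = list_keys(client, bucket="openalex", prefix="data/works/")
    assert keys == ["data/works/updated_date=2025-01-01/part_000.gz"]
    client.get_paginator.assert_called_once_with("list_objects_v2")
    client.get_paginator.return_value.paginate.assert_called_once_with(
        Bucket="openalex", Prefix="data/works/"
    )


test_list_keys_reads_all_pages()
```

Because the mock returns canned pages, the test exercises pagination and empty-page handling without any network access.
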
- Source layout: `src/pyalexs3/`
- Typed package marker: `src/pyalexs3/py.typed`

MIT © EurekAI

If you use this package for research, please cite it with the following BibTeX entry:

```bibtex
@misc{pyalexs32025,
  author = {Adityam Ghosh},
  title = {pyalexs3},
  howpublished = {\url{https://github.com/EurekAI-Org/pyalexs3}},
  year = {2025},
  note = {[Accessed 09-10-2025]},
}
```