ArchAIve

AI-powered research paper recommender system trained on the ArXiv dataset.
Helping researchers and students discover relevant papers based on their interests.

Overview

ArchAIve is a machine learning-based web application that recommends academic articles from the ArXiv repository.
Given a topic, question, or set of keywords, the system suggests the most relevant research papers with links to their DOIs.

ArXiv Dataset

Cleanly structured, large, and continuously updated
Includes rich metadata (title, abstract, categories, DOI, etc.)
Supports content-based and collaborative filtering recommendation systems

Format: JSON
Key fields used: title, abstract, categories

Repo Structure

ArchAIve
|
|-- code/
|   |-- load_data_faiss.py 
|       -> loads the raw ArXiv metadata JSON and selects a subset of rows 
|          (or streams them) to avoid memory explosions.
|
|   |-- preprocess_faiss.py 
|       -> cleans and normalizes text, writes cleaned dataset into multiple parquet chunks
|
|   |-- build_index_faiss.py 
|       -> fits a TF-IDF vectorizer over a large sampled subset
|       -> reduces dimensionality using TruncatedSVD
|       -> trains a FAISS IVF index on the reduced vectors 
|       -> adds all chunked vectors to the index
|       -> saves
|            tfidf_vectorizer.joblib  
|            svd_transformer.joblib
|            faiss_index_ivf.idx
|            meta.parquet
|
|   |-- recommend_faiss.py
|       -> takes user query, vectorizes & reduces it, 
|          runs similarity search using the FAISS index,
|          retrieves top-5 papers, and saves results in results/
|
|   |-- analyze_results.py
|       -> reads per-query CSV files generated by recommend_faiss.py
|          computes metrics like Precision@5, cosine distributions, etc.
|          generates all visuals saved in figures/
|
|
|-- data/
|   |-- arxiv-metadata-oai-snapshot.json 
|       -> raw dataset
|
|   |-- processed/
|       |-- arxiv_chunk_1.parquet
|       |-- arxiv_chunk_2.parquet
|       |-- ...
|       |-- arxiv_chunk_N.parquet
|       |-- tfidf_vectorizer.joblib
|       |-- svd_transformer.joblib
|       |-- faiss_index_ivf.idx
|       |-- meta.parquet
|
|
|-- figures/
|   |-- graphs generated by analyze_results.py
|
|
|-- results/
|   |-- per-query CSV outputs produced by recommend_faiss.py
|
|-- templates/
|   |-- index.html
|
|-- app.py
|-- ProjectProposal.pdf
|-- Preliminary Results.pdf
|-- Final Results.pdf
|-- README.md
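
As a rough illustration of the streaming approach described for load_data_faiss.py above, the sketch below reads records one line at a time and keeps only the key fields. It assumes the snapshot stores one JSON object per line; the function name `stream_records`, its `limit` parameter, and the sample records are hypothetical and are not taken from the repository.

```python
import io
import json


def stream_records(lines, fields=("title", "abstract", "categories"), limit=None):
    """Yield one dict per JSON line, keeping only the requested fields."""
    kept = 0
    for line in lines:
        if limit is not None and kept >= limit:
            break  # stop early so only a subset of rows is ever parsed
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {name: record.get(name, "") for name in fields}
        kept += 1


# Tiny in-memory stand-in for data/arxiv-metadata-oai-snapshot.json (made-up rows).
sample = io.StringIO(
    '{"id": "0001", "title": "Paper A", "abstract": "Text A", "categories": "cs.LG"}\n'
    '{"id": "0002", "title": "Paper B", "abstract": "Text B", "categories": "math.ST"}\n'
    '{"id": "0003", "title": "Paper C", "abstract": "Text C", "categories": "cs.IR"}\n'
)
subset = list(stream_records(sample, limit=2))
```

Because the generator never holds more than one line in memory, the same function could be pointed at an open file handle instead of the in-memory sample.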

Setup & How to use the web app

1. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate  # Mac/Linux

2. Install dependencies

pip install -U pandas numpy scikit-learn joblib pyarrow nltk matplotlib flask

FAISS installation

Mac, Windows, Linux:

pip install faiss-cpu

Apple Silicon:

pip install faiss-cpu==1.7.4

3. Download the ArXiv dataset and unzip it in ./data/

Download arxiv-metadata-oai-snapshot.json from Kaggle.

Place it at:

./data/arxiv-metadata-oai-snapshot.json

4. Preprocess & build index

Run the two scripts in order:
i) python code/preprocess_faiss.py
ii) python code/build_index_faiss.py

5. Run the web app

python app.py

Modeling Framework

TF-IDF

TF-IDF turns text into numbers based on how important each word is.
TF: how often a word appears in a document
IDF: how rare the word is across the entire corpus
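
To make the two factors concrete, here is a minimal pure-Python sketch. It assumes the textbook weighting tf * log(N / df) and whitespace tokenization; the vectorizer used in build_index_faiss.py may smooth or normalize differently, and the function name `tfidf` and the example documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "graph neural networks for molecules",
    "neural networks for image recognition",
    "bayesian inference for sparse regression",
]
weights = tfidf(docs)
```

In this toy corpus the word "for" appears in every document, so its weight is zero, while "graph" appears in only one document and therefore outweighs "neural", which appears in two.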

Dimensionality Reduction (Truncated SVD)

TF-IDF vectors can have tens of thousands of features. Truncated SVD compresses these large, sparse vectors into much smaller dense ones while keeping most of the important information. This makes search faster and lowers memory use.
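
The idea can be sketched with NumPy on a small dense matrix. This is an illustration under stated assumptions rather than the project's code: the helper name `truncated_svd`, the random stand-in matrix, and the choice of 3 components are all hypothetical, and a full SVD like this one is only practical for toy inputs.

```python
import numpy as np


def truncated_svd(matrix, n_components):
    """Project rows of `matrix` onto its top `n_components` right singular vectors."""
    # A full SVD is fine for a toy matrix; large sparse data needs an iterative solver.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    components = vt[:n_components]       # shape: (n_components, n_features)
    reduced = matrix @ components.T      # shape: (n_docs, n_components)
    return reduced, components


rng = np.random.default_rng(0)
tfidf_matrix = rng.random((6, 20))       # 6 documents, 20 made-up features
reduced, components = truncated_svd(tfidf_matrix, n_components=3)

# A new query vector is reduced with the same components before searching.
query_reduced = rng.random(20) @ components.T
```

Keeping `components` around matters because queries must be projected into the same reduced space as the indexed documents.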

Cosine Similarity

Measures how closely two vectors point in the same direction.
1.0 = very similar
0 = no similarity
Used to check how relevant the recommended papers actually were. We converted these values to rounded percentages for readability.
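
A minimal sketch of the computation, including the conversion to a rounded percentage. The helper names `cosine_similarity` and `as_percent` are hypothetical, and rounding to the nearest whole percent is an assumption about how the values were presented.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def as_percent(score):
    """Convert a similarity score to a rounded whole-number percentage."""
    return round(score * 100)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
```

The two example pairs sit at the ends of the scale described above: parallel vectors score close to 1.0 and perpendicular vectors score 0.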

FAISS

Library built for fast nearest-neighbor search on large datasets.

IVF Index

FAISS indexing method (inverted file) that clusters the vector space into lists and searches only the most relevant clusters for a query.
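
The clustering-then-probing idea can be shown with a toy NumPy version. This is not the FAISS API and not the project's code: the class `ToyIVF` is hypothetical, it uses plain k-means with squared L2 distance, and only the parameter names `nlist` (number of clusters) and `nprobe` (clusters searched per query) mirror FAISS terminology.

```python
import numpy as np


class ToyIVF:
    """Illustrative inverted-file index: k-means centroids plus one id list per centroid."""

    def __init__(self, nlist, n_iter=10, seed=0):
        self.nlist = nlist
        self.n_iter = n_iter
        self.rng = np.random.default_rng(seed)

    def _nearest_centroid(self, vectors):
        # Squared L2 distance from every vector to every centroid.
        dists = ((vectors[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    def train(self, vectors):
        # Plain k-means: start from random data points, then alternate assign/update.
        start = self.rng.choice(len(vectors), size=self.nlist, replace=False)
        self.centroids = vectors[start].copy()
        for _ in range(self.n_iter):
            assign = self._nearest_centroid(vectors)
            for c in range(self.nlist):
                members = vectors[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)

    def add(self, vectors):
        # Store each vector id in the list of its nearest centroid.
        self.vectors = vectors
        assign = self._nearest_centroid(vectors)
        self.lists = [np.flatnonzero(assign == c) for c in range(self.nlist)]

    def search(self, query, k, nprobe=1):
        # Probe only the nprobe clusters whose centroids are closest to the query.
        centroid_dists = ((self.centroids - query) ** 2).sum(axis=1)
        probed = np.argsort(centroid_dists)[:nprobe]
        candidates = np.concatenate([self.lists[c] for c in probed])
        dists = ((self.vectors[candidates] - query) ** 2).sum(axis=1)
        order = np.argsort(dists)[:k]
        return candidates[order], dists[order]


rng = np.random.default_rng(42)
data = rng.normal(size=(200, 8))         # 200 made-up vectors of dimension 8
index = ToyIVF(nlist=4)
index.train(data)
index.add(data)
ids, dists = index.search(data[0], k=5, nprobe=2)
```

Raising `nprobe` trades speed for recall: probing every cluster gives the same answer as a brute-force scan, while probing one cluster is fastest but can miss neighbors that fall just across a cluster boundary.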

Precision@5

Our main evaluation metric: the fraction of the top-5 recommendations that are actually relevant.
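
A short sketch of the metric for a single query. The function name `precision_at_k`, the paper ids, and the relevance labels are hypothetical; how relevance is actually judged in analyze_results.py is not shown here.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended ids that appear in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0  # no recommendations means nothing relevant was returned
    hits = sum(1 for paper_id in top_k if paper_id in relevant)
    return hits / k


recommended = ["p1", "p2", "p3", "p4", "p5", "p6"]  # ranked output for one query
relevant = {"p1", "p3", "p6"}                       # made-up relevance labels
score = precision_at_k(recommended, relevant)
```

Only the first five ids count, so "p6" does not contribute even though it is labeled relevant; two of the top five are hits, giving a score of 2/5.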
