AI-powered research paper recommender system trained on the ArXiv dataset.
Helping researchers and students discover relevant papers based on their interests.
ArchAIve is a machine learning–based web application that recommends academic articles from the ArXiv repository.
Given a topic, question, or set of keywords, the system suggests the most relevant research papers with links to their DOIs.
- Cleanly structured, large, and continuously updated
- Includes rich metadata (title, abstract, categories, DOI, etc.)
- Supports content-based and collaborative filtering recommendation systems
Format: JSON
Key fields used: title, abstract, categories
ArchAIve
|
|-- code/
| |-- load_data_faiss.py
| -> loads the raw ArXiv metadata JSON and selects a subset of rows
| (or streams them) to avoid memory explosions.
|
| |-- preprocess_faiss.py
| -> cleans and normalizes text, writes cleaned dataset into multiple parquet chunks
|
| |-- build_index_faiss.py
| -> fits a TF-IDF vectorizer over a large sampled subset
| -> reduces dimensionality using TruncatedSVD
| -> trains a FAISS IVF index on the reduced vectors
| -> adds all chunked vectors to the index
| -> saves
| tfidf_vectorizer.joblib
| svd_transformer.joblib
| faiss_index_ivf.idx
| meta.parquet
|
| |-- recommend_faiss.py
| -> takes user query, vectorizes & reduces it,
| runs similarity search using the FAISS index,
| retrieves top-5 papers, and saves results in results/
|
| |-- analyze_results.py
| -> reads per-query CSV files generated by recommend_faiss.py
| computes metrics like Precision@5, cosine distributions, etc.
| generates all visuals saved in figures/
|
|
|-- data/
| |-- arxiv-metadata-oai-snapshot.json
| -> raw dataset
|
| |-- processed/
| |-- arxiv_chunk_1.parquet
| |-- arxiv_chunk_2.parquet
| |-- ...
| |-- arxiv_chunk_N.parquet
| |-- tfidf_vectorizer.joblib
| |-- svd_transformer.joblib
| |-- faiss_index_ivf.idx
| |-- meta.parquet
|
|
|-- figures/
| |-- graphs generated by analyze_results.py
|
|
|-- results/
| |-- per-query CSV outputs produced by recommend_faiss.py
|
|-- templates/
| |-- index.html
|
|-- app.py
|-- Project Proposal.pdf
|-- Preliminary Results.pdf
|-- Final Results.pdf
|-- README.md
python -m venv .venv
source .venv/bin/activate  # Mac/Linux
pip install -U pandas numpy scikit-learn joblib pyarrow nltk matplotlib flask
FAISS installation
- Mac, Windows, Linux:
pip install faiss-cpu
- Apple Silicon:
pip install faiss-cpu==1.7.4
- Download arxiv-metadata-oai-snapshot.json from Kaggle
- Place it in
./data/arxiv-metadata-oai-snapshot.json
run:
i) python code/preprocess_faiss.py
ii) python code/build_index_faiss.py
python app.py
TF-IDF turns text into numbers based on how important each word is.
- TF: how often a word appears in a document
- IDF: how rare the word is across the entire corpus
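To make the TF and IDF terms concrete, here is a minimal pure-Python sketch of the textbook formula (term frequency times log inverse document frequency) on a made-up three-document corpus. The corpus and the `tfidf` helper are illustrative assumptions, not code from this repo. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its exact numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real pipeline works on ArXiv titles and abstracts.
docs = [
    "graph neural networks for molecules",
    "neural networks for image classification",
    "bayesian inference for cosmology",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Return {word: tf * idf} for one tokenized document (textbook formula)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In this toy corpus, "for" appears in every document, so its weight is zero, while "graph" and "molecules" appear only in the first document and get the highest weights.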
TF-IDF vectors can have tens of thousands of features. SVD compresses these large, sparse vectors into much smaller dense ones while keeping most of the important information, which makes search faster and lowers memory use.
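A small sketch of this reduction step with scikit-learn's `TfidfVectorizer` and `TruncatedSVD` might look like the following. The four example sentences and `n_components=2` are assumptions chosen so the example runs on a tiny corpus; the values used in `build_index_faiss.py` are not stated here and are presumably much larger.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for cleaned ArXiv text.
docs = [
    "graph neural networks for molecular property prediction",
    "convolutional neural networks for image classification",
    "bayesian inference methods for cosmology surveys",
    "markov chain monte carlo sampling for bayesian models",
]

# Sparse matrix of shape (n_docs, vocabulary_size).
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Project every document onto 2 latent dimensions (assumed value).
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)

print(tfidf_matrix.shape, "->", reduced.shape)
```

A query goes through the same two fitted objects (`vectorizer.transform` followed by `svd.transform`) so that it lands in the same reduced space as the indexed papers.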
Cosine similarity measures how closely two vectors point in the same direction.
- 1.0 = very similar
- 0 = no similarity
Used to check how relevant the recommended papers actually were. We converted these values to rounded percentages for readability.
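As a worked illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper names `cosine_similarity` and `as_percent` below are hypothetical, and the rounding is only an assumption about how the percentages were produced.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def as_percent(score):
    """Convert a similarity score to a rounded percentage."""
    return round(score * 100)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
partial = cosine_similarity([1, 0], [1, 1])

print(as_percent(same_direction), as_percent(orthogonal), as_percent(partial))
```

Vectors that differ only in length, such as `[1, 2, 3]` and `[2, 4, 6]`, still score as fully similar, because only direction matters.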
FAISS is a library built for fast nearest-neighbor search on large datasets.
IVF (inverted file) is a FAISS indexing method that clusters the vector space into lists and searches only the most relevant clusters for a query.
Precision@5 is our main evaluation metric: the fraction of the top-5 recommendations that are actually relevant.
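The metric reduces to a short function. The sketch below assumes relevance labels are available as a set of paper IDs; how `analyze_results.py` actually decides relevance is not described here, and the IDs are placeholders.

```python
def precision_at_k(recommended_ids, relevant_ids, k=5):
    """Fraction of the top-k recommendations that are in the relevant set."""
    top_k = recommended_ids[:k]
    hits = sum(1 for paper_id in top_k if paper_id in relevant_ids)
    return hits / k


# Hypothetical ranked recommendations and relevance judgments.
recommended = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p4", "p9"}

score = precision_at_k(recommended, relevant, k=5)
print(f"Precision@5 = {score:.2f}")
```

Here three of the five recommended papers are relevant, so Precision@5 is 3/5 = 0.60.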