salma-ysr/archAIve

ArchAIve

AI-powered research paper recommender system trained on the ArXiv dataset.
Helping researchers and students discover relevant papers based on their interests.


Overview

ArchAIve is a machine learning–based web application that recommends academic articles from the ArXiv repository.
Given a topic, question, or set of keywords, the system suggests the most relevant research papers, with links to their DOIs.


ArXiv Dataset

  • Cleanly structured, large, and continuously updated
  • Includes rich metadata (title, abstract, categories, DOI, etc.)
  • Supports content-based and collaborative filtering recommendation systems

Format: JSON
Key fields used: title, abstract, categories
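
As a rough illustration of reading only the key fields without loading the whole file, here is a minimal sketch. It assumes the snapshot stores one JSON object per line; `stream_papers` and the sample records are hypothetical and are not taken from load_data_faiss.py.

```python
import io
import json

def stream_papers(lines, limit=None):
    """Yield dicts holding only title, abstract, and categories, one per input line."""
    fields = ("title", "abstract", "categories")
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the fields the recommender uses; default to "" if one is missing.
        yield {key: record.get(key, "") for key in fields}
        count += 1
        if limit is not None and count >= limit:
            break

# Tiny in-memory stand-in for the real file (made-up records).
sample = io.StringIO(
    '{"id": "0001", "title": "Paper A", "abstract": "About graphs.", "categories": "cs.DS", "doi": null}\n'
    '{"id": "0002", "title": "Paper B", "abstract": "About stars.", "categories": "astro-ph"}\n'
)
papers = list(stream_papers(sample, limit=1))
print(papers)
```

With a real file, the same generator could be fed an open file handle instead of the in-memory sample, so only one line is held in memory at a time.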

Repo Structure

ArchAIve
|
|-- code/
|   |-- load_data_faiss.py 
|       -> loads the raw ArXiv metadata JSON and selects a subset of rows 
|          (or streams them) to avoid memory explosions.
|
|   |-- preprocess_faiss.py 
|       -> cleans and normalizes text, writes cleaned dataset into multiple parquet chunks
|
|   |-- build_index_faiss.py 
|       -> fits a TF-IDF vectorizer over a large sampled subset
|       -> reduces dimensionality using TruncatedSVD
|       -> trains a FAISS IVF index on the reduced vectors 
|       -> adds all chunked vectors to the index
|       -> saves
|            tfidf_vectorizer.joblib  
|            svd_transformer.joblib
|            faiss_index_ivf.idx
|            meta.parquet
|
|   |-- recommend_faiss.py
|       -> takes user query, vectorizes & reduces it, 
|          runs similarity search using the FAISS index,
|          retrieves top-5 papers, and saves results in results/
|
|   |-- analyze_results.py
|       -> reads per-query CSV files generated by recommend_faiss.py
|          computes metrics like Precision@5, cosine distributions, etc.
|          generates all visuals saved in figures/
|
|
|-- data/
|   |-- arxiv-metadata-oai-snapshot.json 
|       -> raw dataset
|
|   |-- processed/
|       |-- arxiv_chunk_1.parquet
|       |-- arxiv_chunk_2.parquet
|       |-- ...
|       |-- arxiv_chunk_N.parquet
|       |-- tfidf_vectorizer.joblib
|       |-- svd_transformer.joblib
|       |-- faiss_index_ivf.idx
|       |-- meta.parquet
|
|
|-- figures/
|   |-- graphs generated by analyze_results.py
|
|
|-- results/
|   |-- per-query CSV outputs produced by recommend_faiss.py
|
|-- templates/
|   |-- index.html
|
|-- app.py
|-- Project Proposal.pdf
|-- Preliminary Results.pdf
|-- Final Results.pdf
|-- README.md

Setup & How to use the web app

1. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate    # Mac/Linux

2. Install dependencies

pip install -U pandas numpy scikit-learn joblib pyarrow nltk matplotlib flask

FAISS installation

  • Mac, Windows, Linux:
    pip install faiss-cpu
  • Apple Silicon:
    pip install faiss-cpu==1.7.4

3. Download the ArXiv dataset and unzip it in ./data/

  • Download arxiv-metadata-oai-snapshot.json from Kaggle
  • Place it in
    ./data/arxiv-metadata-oai-snapshot.json

4. Preprocess & build index

Run, in order:
i) python code/preprocess_faiss.py
ii) python code/build_index_faiss.py

5. Run the web app

python app.py

Modeling Framework

TF-IDF

TF-IDF turns text into numbers based on how important each word is.

  • TF (term frequency): how often a word appears in a document
  • IDF (inverse document frequency): how rare the word is across the entire corpus
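
To make the weighting concrete, here is a small pure-Python sketch of one common textbook form, tf × log(N / df). It is only an illustration: the project fits a TF-IDF vectorizer in build_index_faiss.py, and library implementations typically use smoothed and normalized variants, so the exact numbers will differ. The `tfidf` helper and the three toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "graph neural networks",
    "neural machine translation",
    "graph coloring algorithms",
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "networks" appears in no other document, so it receives a higher weight than "graph" or "neural", which each appear in two of the three documents.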

Dimensionality Reduction (Truncated SVD)

TF-IDF vectors can have tens of thousands of features. Truncated SVD compresses these high-dimensional vectors into a much smaller representation while keeping most of the important information. This makes search faster and lowers memory use.
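
The idea can be sketched with NumPy on a toy matrix. This is an assumption-laden illustration, not the project's code: it computes an exact SVD and keeps the top k components, whereas library implementations designed for large sparse matrices usually rely on approximate solvers. The `truncated_svd` helper and the 4×5 matrix are hypothetical.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Project the rows of `matrix` onto its top-k right singular vectors."""
    # full_matrices=False gives the compact SVD: matrix = U @ diag(S) @ Vt.
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    components = vt[:k]               # shape (k, n_features)
    reduced = matrix @ components.T   # shape (n_docs, k)
    return reduced, components, singular_values[:k]

# Toy 4-document, 5-term matrix (made-up weights): two documents per topic.
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.1, 0.0, 0.9, 1.0],
])
reduced, components, top_singular_values = truncated_svd(X, k=2)
print(reduced.shape)  # each document is now described by 2 numbers instead of 5
```

A query has to be projected with the same `components` matrix as the documents, which is why the build step saves the fitted transformer alongside the index.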

Cosine Similarity

Measures how closely two vectors point in the same direction.

  • 1.0 = very similar
  • 0 = no similarity

Used to check how relevant the recommended papers actually were. We converted these values to rounded percentages for readability.
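
A minimal sketch of the computation, in plain Python. The helper names `cosine_similarity` and `as_percent` are hypothetical, and returning 0.0 for a zero vector is a convention chosen for this example rather than something the project documents.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def as_percent(score):
    """Convert a similarity score to a rounded percentage for display."""
    return round(score * 100)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(as_percent(same_direction), as_percent(orthogonal))
```

Because only direction matters, [1, 2] and [2, 4] score as fully similar even though their lengths differ.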

FAISS

Library built for fast nearest-neighbor search on large datasets.

IVF Index

A FAISS indexing method that clusters the vector space into lists and searches only the most relevant clusters for a query.
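
The toy sketch below shows the inverted-list idea in plain Python. It is not FAISS code and not the project's implementation: the centroids are fixed by hand here, whereas FAISS learns them when the index is trained, and all function names are hypothetical. The `nprobe` parameter mirrors the FAISS setting of the same name, which controls how many clusters are scanned per query.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroids(vector, centroids, count):
    """Indices of the `count` centroids closest to `vector`."""
    order = sorted(range(len(centroids)),
                   key=lambda i: squared_distance(vector, centroids[i]))
    return order[:count]

def build_inverted_lists(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        cluster = nearest_centroids(vec, centroids, 1)[0]
        lists[cluster].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the `nprobe` closest lists and return the k nearest vector ids."""
    candidates = [vec_id
                  for cluster in nearest_centroids(query, centroids, nprobe)
                  for vec_id in lists[cluster]]
    candidates.sort(key=lambda vec_id: squared_distance(query, vectors[vec_id]))
    return candidates[:k]

# Hand-picked centroids and made-up 2-D vectors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0.5, 0.2), (1.0, 1.0), (9.0, 9.5), (10.5, 10.0), (0.1, 0.9)]
lists = build_inverted_lists(vectors, centroids)
top = ivf_search((0.0, 1.0), vectors, centroids, lists, k=2, nprobe=1)
print(lists, top)
```

With `nprobe=1` the query is compared against only three of the five vectors. Raising `nprobe` scans more lists, trading speed for a lower chance of missing a true neighbor that landed in an adjacent cluster.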

Precision@5

Our main evaluation metric: the fraction of the top-5 recommendations that are actually relevant.
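
A minimal sketch of the metric, assuming relevance is available as a set of paper IDs judged relevant for the query. How those judgments are produced is not specified here, and the helper name and IDs below are made up.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the first k recommended IDs that appear in `relevant`."""
    top_k = recommended[:k]
    hits = sum(1 for paper_id in top_k if paper_id in relevant)
    return hits / k

recommended = ["p1", "p2", "p3", "p4", "p5", "p6"]  # ranked system output
relevant = {"p1", "p3", "p6", "p9"}                 # hypothetical relevance judgments
score = precision_at_k(recommended, relevant)
print(score)
```

Only the first five recommendations count, so "p6" does not contribute even though it is relevant: two hits out of five gives 0.4.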
