daandouwe/svd-doc2vec

Doc2vec with PPMI-SVD

Factor a document-word co-occurrence matrix, reweighted with positive pointwise mutual information (PPMI), using singular value decomposition (SVD).
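As a rough illustration of the idea, here is a minimal NumPy sketch of PPMI weighting followed by a truncated SVD. The function names `ppmi` and `doc_vectors` are hypothetical and not taken from this repository, and scaling the left singular vectors by the singular values is an assumption; `main.py` may weight them differently.

```python
import numpy as np


def ppmi(counts):
    """Positive PMI of a (documents x words) count matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    p_dw = counts / counts.sum()            # joint probability p(d, w)
    p_d = p_dw.sum(axis=1, keepdims=True)   # marginal p(d)
    p_w = p_dw.sum(axis=0, keepdims=True)   # marginal p(w)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_dw / (p_d * p_w))
    pmi[~np.isfinite(pmi)] = 0.0            # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)             # keep only positive associations


def doc_vectors(counts, dim):
    """Document vectors from the top `dim` singular directions (assumed U * S)."""
    u, s, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    return u[:, :dim] * s[:dim]
```

For a large vocabulary a sparse solver such as `scipy.sparse.linalg.svds` would probably be a better fit than the dense SVD used here.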

Setup

We use the WikiText dataset.

To extract the documents from WikiText and save them as a JSON file, run:

mkdir data
./parse-wikitext.py wikitext-2-raw/wiki.train.raw data/wikitext-2-raw.docs.json
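For orientation, the sketch below shows one way such an extraction could work. It assumes the raw WikiText layout in which an article starts at a line of the form ` = Title = ` and section headings look like ` = = Section = = `. The function names and the `title`/`text` JSON keys are hypothetical; the actual output format of `parse-wikitext.py` may differ.

```python
import json


def split_documents(lines):
    """Group raw WikiText lines into one record per article (hypothetical schema)."""
    docs, title, body = [], None, []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        is_heading = text.startswith("= ") and text.endswith(" =")
        if is_heading and not text.startswith("= ="):
            # Top-level heading: close the previous article and start a new one.
            if title is not None:
                docs.append({"title": title, "text": " ".join(body)})
            title, body = text[2:-2].strip(), []
        elif is_heading:
            continue  # section heading inside an article: skipped in this sketch
        elif title is not None:
            body.append(text)
    if title is not None:
        docs.append({"title": title, "text": " ".join(body)})
    return docs


def save_documents(docs, path):
    """Write the records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
```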

Usage

In the project terminal, run

mkdir vec
./main.py --data data/wikitext-2-raw.docs.json --outpath vec/wikitext-2-raw.vec.txt \
    --lower --num-words 1000 --dim 10

for a quick demo. Plots are saved in the folder plots.

To rank the documents based on the vectors, use:

./rank.py vec/wikitext-2-raw.vec.txt > wikitext-2-raw.ranking.txt
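The README does not say which criterion `rank.py` uses. As one plausible illustration only, the sketch below ranks documents by cosine similarity to a chosen query document; the function name `rank_by_cosine` and the choice of cosine similarity are assumptions, not the repository's confirmed method.

```python
import numpy as np


def rank_by_cosine(vectors, query_index):
    """Rank all other documents by cosine similarity to one query document."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)   # guard against zero vectors
    sims = unit @ unit[query_index]
    order = np.argsort(-sims, kind="stable")    # most similar first
    return [(int(i), float(sims[i])) for i in order if i != query_index]
```

Each returned pair holds a document index and its similarity to the query, so the first entry is the nearest neighbour in the vector space.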

Requirements

numpy
scipy
tqdm
matplotlib
scikit-learn (imported as sklearn)
bokeh

About

Turn documents into vectors by decomposing a PPMI co-occurrence matrix.
