You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Extract a database (article text, section/tags) from The Guardian API and get vocab to load data
Train an EmbeddingBag classifier with linear output layer (Pytorch)
TODO: output test data results to visualise in Tableau for evaluation
TODO: make a validation data set to use during training for hyperparameter tuning
Design choices made so far
If keeping stop tokens: only kept defined list of punctuation, deleted others
> Tokenised punctuation so that contraction words were separated into separate tokens (e.g. "weren ' t")
If deleting stop tokens: also delete all punctuation tokens
Delete any words that have vocab count of < 1 when processing train and test data
Replace words in test data unseen in training data with UNK token
Changed from FFNN to EmbeddingBag model to enhance the feature space, boosting accuracy
About
Classification of The Guardian article titles by tags using machine learning.