This repository contains the code for a university project on "Data Intensive Computing" and large-scale machine learning models. The report is available at https://mkleinegger.github.io/spark-svm-amazon-reviews/report.pdf. The repository contains the following files:
- notebook.ipynb: The Jupyter notebook containing the solution to the tasks of the exercise.
- output_rdd.txt: The output of the RDD-based solution.
- output_ds.txt: The output of the Dataset/DataFrame-based solution.
- report.pdf: The report of the exercise.
- evaluation.ipynb: The Jupyter notebook containing the evaluation of the solutions and the code to generate the plots.

The repository also includes further files, such as output.txt and grid_search_evaluation.csv, which are outputs of the notebook and/or inputs needed for the evaluation, as well as stopwords.txt, which contains the stopwords used for filtering in the exercise.