In-Memory Analytics - Spark MovieLens Recommendation System

This workload tests Spark's in-memory analytics capabilities using the MovieLens recommendation dataset. It implements the Alternating Least Squares (ALS) collaborative filtering algorithm for movie recommendations.

Note: The large MovieLens dataset is distributed as a GitHub release asset and must be downloaded manually. The workload runs in standalone mode: a single container holds both Spark and the recommendation application.

Manual File Setup (Required)

Download the MovieLens dataset from the GitHub release:

Download from release: https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip

Extract to the correct location:

cd /mydata/platform-workloads/cloud-workloads/inMemoryAnalytics/data

# Download and extract (or place downloaded file here)
wget https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip
unzip ml-latest.zip

# Verify the directory structure
ls -la ml-latest/  # Should contain: ratings.csv, movies.csv, etc.
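
The directory check above can also be scripted. The following is a minimal sketch, not part of the repository: `check_movielens_dir` is a hypothetical helper name, and it only tests for the two files this README names explicitly (`ratings.csv` and `movies.csv`). Whether the application reads further files is not stated here.

```shell
# Hypothetical helper (not shipped with the workload): verify that a
# MovieLens directory contains non-empty ratings.csv and movies.csv.
check_movielens_dir() {
  local dir="$1"
  local missing=0
  local f
  for f in ratings.csv movies.csv; do
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_movielens_dir ml-latest` returns a non-zero status and names the offending file on stderr if either file is absent.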

For Docker (runc/runsc)

Use the provided Dockerfile to build the standalone in-memory analytics image.

# Build the image
docker build -t myinmemoryanalytics .

# Start container
docker run -id --rm --name worker \
  --cpuset-cpus="0-3" \
  --memory="24g" \
  myinmemoryanalytics

Run Workload

Main workload (requires manual ml-latest setup above):

docker exec worker /root/run_benchmark.sh /data/ml-latest /data/myratings.csv --driver-memory "8g" --executor-memory "8g"

Development/testing with small dataset (no download needed):

docker exec worker /root/run_benchmark.sh /data/ml-latest-small /data/myratings.csv --driver-memory "8g" --executor-memory "8g"

For Firecracker (fc)

Building the Firecracker Image

Resize the base image to 16GB, then boot the VM and install Spark 3.3.2 following commons/spark/3.3.2/Dockerfile. Set SPARK_HOME=/opt/spark-3.3.2.

Copy the workload files into the VM:

mkdir -p /root/inMemoryAnalytics/data
cp fc/movielens-als-2.0.jar /root/inMemoryAnalytics/
cp fc/run_benchmark.sh /root/run_benchmark.sh
chmod +x /root/run_benchmark.sh

# ml-latest is available from the GitHub release (see Manual File Setup above)
cp -r <path-to>/ml-latest /root/inMemoryAnalytics/data/
cp data/myratings.csv /root/inMemoryAnalytics/data/

Set SPARK_LOCAL_IP=169.254.0.1 in /etc/environment or /etc/profile.d/spark.sh so it persists across SSH sessions.
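
One way to persist the variable is to write an export line into a profile script. This is a sketch under assumptions: `persist_spark_local_ip` is a hypothetical helper name, the default target is the `/etc/profile.d/spark.sh` path mentioned above, and writing there requires root inside the VM.

```shell
# Hypothetical helper: write the SPARK_LOCAL_IP export into a profile
# script so that login shells (including SSH sessions) pick it up.
persist_spark_local_ip() {
  local ip="$1"
  # The second argument is optional; the default path is the one named in this README.
  local target="${2:-/etc/profile.d/spark.sh}"
  printf 'export SPARK_LOCAL_IP=%s\n' "$ip" > "$target"
}
```

Calling `persist_spark_local_ip 169.254.0.1` overwrites the target file with a single export line; a new login shell should then see the variable.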

1. Environment Setup

Set required Spark environment variable:

export SPARK_LOCAL_IP=169.254.0.1

2. Run Workload

Main workload (requires manual ml-latest setup above):

export SPARK_LOCAL_IP=169.254.0.1

/opt/spark-3.3.2/bin/spark-submit \
  --class MovieLensALS \
  --driver-memory "8g" \
  --executor-memory "8g" \
  /root/inMemoryAnalytics/movielens-als-2.0.jar \
  /root/inMemoryAnalytics/data/ml-latest \
  /root/inMemoryAnalytics/data/myratings.csv

Development/testing with small dataset:

export SPARK_LOCAL_IP=169.254.0.1

/opt/spark-3.3.2/bin/spark-submit \
  --class MovieLensALS \
  --driver-memory "8g" \
  --executor-memory "8g" \
  /root/inMemoryAnalytics/movielens-als-2.0.jar \
  /root/inMemoryAnalytics/data/ml-latest-small \
  /root/inMemoryAnalytics/data/myratings.csv

Workload Parameters

| Parameter | Description | Example |
|---|---|---|
| Dataset Path | Path to MovieLens dataset directory | `/data/ml-latest` or `/data/ml-latest-small` |
| Ratings File | User ratings file for recommendations | `/data/myratings.csv` |
| `--driver-memory` | Spark driver memory allocation | `8g`, `16g` |
| `--executor-memory` | Spark executor memory allocation | `8g`, `16g` |
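
To show how these parameters fit together, here is a small sketch that assembles the Firecracker `spark-submit` command line from them and prints it without running anything. `build_submit_cmd` and the `ALS_JAR` variable are hypothetical names introduced for illustration; the default paths and the `8g` defaults are the ones used in the examples above.

```shell
# Hypothetical helper: print the spark-submit command for the given
# dataset path, ratings file, and optional memory settings (dry run only).
build_submit_cmd() {
  local dataset="$1"
  local ratings="$2"
  local driver_mem="${3:-8g}"
  local executor_mem="${4:-8g}"
  printf '%s --class MovieLensALS --driver-memory %s --executor-memory %s %s %s %s\n' \
    "${SPARK_HOME:-/opt/spark-3.3.2}/bin/spark-submit" \
    "$driver_mem" "$executor_mem" \
    "${ALS_JAR:-/root/inMemoryAnalytics/movielens-als-2.0.jar}" \
    "$dataset" "$ratings"
}
```

Printing the command first makes it easy to confirm the argument order (dataset directory, then ratings file) before launching a long run.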

Dataset Information

ml-latest (Large Dataset - GitHub Release)

Size: ~1.1GB uncompressed
Ratings: ~27M ratings
Movies: ~58K movies
Users: ~280K users
Files: ratings.csv (725MB), genome-scores.csv (396MB), tags.csv (38MB), movies.csv, links.csv, genome-tags.csv

ml-latest-small (Small Dataset - Included in Git)

Size: ~1MB compressed
Ratings: ~100K ratings
Movies: ~9K movies
Users: ~600 users
Files: ratings.csv (2.4MB), movies.csv, links.csv, tags.csv

Algorithm Details

Method: Alternating Least Squares (ALS) Collaborative Filtering
Use Case: Movie recommendation system
Input: User-movie ratings matrix
Output: Latent factor matrices for users and movies
Computation: Matrix factorization with iterative optimization
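
For reference, the matrix factorization above can be written in its textbook regularized form. The notation is an assumption, since none of these symbols appear in this README: $r_{ui}$ is the rating of user $u$ for movie $i$, $x_u, y_i \in \mathbb{R}^k$ are the latent factor vectors, $\Omega$ is the set of observed ratings, $\Omega_u$ is the set of movies rated by user $u$, and $\lambda$ is the regularization weight. The rank, iteration count, and regularization settings actually used by this workload are not stated here.

```latex
% Objective: squared error on observed ratings plus L2 regularization
\[
\min_{X,Y} \sum_{(u,i)\in\Omega} \left(r_{ui} - x_u^{\top} y_i\right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
\]

% With the movie factors Y held fixed, each user factor has a closed-form update
\[
x_u = \left( \sum_{i\in\Omega_u} y_i y_i^{\top} + \lambda I_k \right)^{-1}
      \sum_{i\in\Omega_u} r_{ui}\, y_i
\]
```

The update for each $y_i$ is symmetric, with the roles of users and movies swapped. Alternating between the two updates is what makes the method iterative, and each pass re-reads the ratings.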

Performance Notes

Memory requirements scale with dataset size and the number of latent factors
The ALS algorithm is iterative and benefits from Spark's in-memory caching
The large dataset requires a sufficient memory allocation (16GB+ recommended for driver/executor)
The small dataset is useful for development and testing
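
When raising the memory settings, it may help to check that they still fit within the container or VM limit. The sketch below is a conservative sanity check, not a sizing rule: `mem_fits` is a hypothetical helper, all values are whole gigabytes, and the 2 GB default headroom for non-heap JVM and OS overhead is an assumption. It also assumes the driver and executor memory add up, which this README does not confirm for this deployment.

```shell
# Hypothetical helper: succeed when driver + executor memory plus some
# headroom (all in whole GB) fit within the given memory limit.
mem_fits() {
  local limit_gb="$1"
  local driver_gb="$2"
  local executor_gb="$3"
  local headroom_gb="${4:-2}"
  [ "$((driver_gb + executor_gb + headroom_gb))" -le "$limit_gb" ]
}
```

Under these assumptions, `mem_fits 24 8 8` succeeds for the `--memory="24g"` container shown earlier, while `mem_fits 24 16 16` fails, which suggests raising the container limit before moving to 16g for both settings.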

Files

Dockerfile - Standalone container with Spark and MovieLens ALS
data/ml-latest/ - Large MovieLens dataset (download from releases)
data/ml-latest-small/ - Small MovieLens dataset (included)
data/myratings.csv - Sample user ratings for recommendations
movielens-als/ - Scala source code for ALS implementation
fc/ - Firecracker-specific files and configurations
