In-Memory Analytics - Spark MovieLens Recommendation System

This workload tests Spark's in-memory analytics capabilities using the MovieLens recommendation dataset. It implements the Alternating Least Squares (ALS) collaborative filtering algorithm for movie recommendations.

Note: The large MovieLens dataset is stored in GitHub Releases and must be downloaded manually. The workload runs in standalone mode: a single container holds both Spark and the recommendation application.

Manual File Setup (Required)

Download the MovieLens dataset from the GitHub release:

  1. Download from release: https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip

  2. Extract to correct location:

    cd /mydata/platform-workloads/cloud-workloads/inMemoryAnalytics/data
    
    # Download and extract (or place downloaded file here)
    wget https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip
    unzip ml-latest.zip
    
    # Verify the directory structure
    ls -la ml-latest/  # Should contain: ratings.csv, movies.csv, etc.
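The `ls` check above can be automated. The following is a minimal sketch, not part of the workload: `check_dataset` is a hypothetical helper, and the file list is the set of CSVs that both datasets are described as containing (ratings.csv, movies.csv, links.csv, tags.csv).

```shell
# Hypothetical helper: report any expected MovieLens file that is missing or empty.
# The file list is an assumption based on the files named in this README.
check_dataset() {
  dir="$1"
  missing=0
  for f in ratings.csv movies.csv links.csv tags.csv; do
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_dataset ml-latest && echo "dataset OK"` prints the confirmation only when all four files are present and non-empty.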

For Docker (runc/runsc)

Use the provided Dockerfile to build the standalone in-memory analytics image.

# Build the image
docker build -t myinmemoryanalytics .

# Start container
docker run -id --rm --name worker \
  --cpuset-cpus="0-3" \
  --memory="24g" \
  myinmemoryanalytics

Run Workload

Main workload (requires manual ml-latest setup above):

docker exec worker /root/run_benchmark.sh /data/ml-latest /data/myratings.csv --driver-memory "8g" --executor-memory "8g"

Development/testing with small dataset (no download needed):

docker exec worker /root/run_benchmark.sh /data/ml-latest-small /data/myratings.csv --driver-memory "8g" --executor-memory "8g"

For Firecracker (fc)

Building the Firecracker Image

Resize the base image to 16GB, then boot the VM and install Spark 3.3.2 following commons/spark/3.3.2/Dockerfile. Set SPARK_HOME=/opt/spark-3.3.2.

Copy the workload files into the VM:

mkdir -p /root/inMemoryAnalytics/data
cp fc/movielens-als-2.0.jar /root/inMemoryAnalytics/
cp fc/run_benchmark.sh /root/run_benchmark.sh
chmod +x /root/run_benchmark.sh

# ml-latest is available from the GitHub release (see Manual File Setup above)
cp -r <path-to>/ml-latest /root/inMemoryAnalytics/data/
cp data/myratings.csv /root/inMemoryAnalytics/data/

Set SPARK_LOCAL_IP=169.254.0.1 in /etc/environment or /etc/profile.d/spark.sh so it persists across SSH sessions.
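One way to persist the variable is sketched below. `persist_spark_ip` is a hypothetical helper, not part of the workload; it targets /etc/profile.d/spark.sh by default (which requires root) and accepts another path as an argument. If /etc/environment is used instead, that file takes plain `KEY=VALUE` lines without `export`.

```shell
# Hypothetical helper: append the export line to a profile script exactly once.
persist_spark_ip() {
  target="${1:-/etc/profile.d/spark.sh}"
  line='export SPARK_LOCAL_IP=169.254.0.1'
  # Append only if the exact line is not already present, so reruns are idempotent.
  if [ ! -f "$target" ] || ! grep -qxF "$line" "$target"; then
    printf '%s\n' "$line" >> "$target"
  fi
}
```

Running `persist_spark_ip` twice leaves a single entry in the target file; a new SSH session should then pick up the variable.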

1. Environment Setup

Set the required Spark environment variable:

export SPARK_LOCAL_IP=169.254.0.1

2. Run Workload

Main workload (requires manual ml-latest setup above):

export SPARK_LOCAL_IP=169.254.0.1

/opt/spark-3.3.2/bin/spark-submit \
  --class MovieLensALS \
  --driver-memory "8g" \
  --executor-memory "8g" \
  /root/inMemoryAnalytics/movielens-als-2.0.jar \
  /root/inMemoryAnalytics/data/ml-latest \
  /root/inMemoryAnalytics/data/myratings.csv

Development/testing with small dataset:

export SPARK_LOCAL_IP=169.254.0.1

/opt/spark-3.3.2/bin/spark-submit \
  --class MovieLensALS \
  --driver-memory "8g" \
  --executor-memory "8g" \
  /root/inMemoryAnalytics/movielens-als-2.0.jar \
  /root/inMemoryAnalytics/data/ml-latest-small \
  /root/inMemoryAnalytics/data/myratings.csv
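The two invocations above differ only in the dataset directory, so they can be folded into one wrapper. This is a sketch under assumptions: `run_als`, `SPARK_SUBMIT`, and `DATA_DIR` are illustrative names that do not exist in the workload, and the same memory size is used for driver and executor as in the commands above.

```shell
# Hypothetical wrapper around the spark-submit invocation shown above.
# SPARK_SUBMIT and DATA_DIR are illustrative overrides, not part of the workload.
run_als() {
  dataset="$1"      # ml-latest or ml-latest-small
  mem="${2:-8g}"    # applied to both driver and executor
  data_dir="${DATA_DIR:-/root/inMemoryAnalytics/data}"
  "${SPARK_SUBMIT:-/opt/spark-3.3.2/bin/spark-submit}" \
    --class MovieLensALS \
    --driver-memory "$mem" \
    --executor-memory "$mem" \
    /root/inMemoryAnalytics/movielens-als-2.0.jar \
    "$data_dir/$dataset" \
    "$data_dir/myratings.csv"
}
```

With `SPARK_LOCAL_IP` exported, `run_als ml-latest-small` would run the development dataset and `run_als ml-latest 16g` the large one with more memory. Setting `SPARK_SUBMIT=echo` prints the arguments instead of launching Spark, which is handy for a dry run.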

Workload Parameters

| Parameter | Description | Example |
|---|---|---|
| Dataset Path | Path to MovieLens dataset directory | /data/ml-latest or /data/ml-latest-small |
| Ratings File | User ratings file for recommendations | /data/myratings.csv |
| --driver-memory | Spark driver memory allocation | 8g, 16g |
| --executor-memory | Spark executor memory allocation | 8g, 16g |

Dataset Information

ml-latest (Large Dataset - GitHub Release)

  • Size: ~1.1GB uncompressed
  • Ratings: ~27M ratings
  • Movies: ~58K movies
  • Users: ~280K users
  • Files: ratings.csv (725MB), genome-scores.csv (396MB), tags.csv (38MB), movies.csv, links.csv, genome-tags.csv

ml-latest-small (Small Dataset - Included in Git)

  • Size: ~1MB compressed (ratings.csv alone is 2.4MB uncompressed)
  • Ratings: ~100K ratings
  • Movies: ~9K movies
  • Users: ~600 users
  • Files: ratings.csv (2.4MB), movies.csv, links.csv, tags.csv

Algorithm Details

  • Method: Alternating Least Squares (ALS) Collaborative Filtering
  • Use Case: Movie recommendation system
  • Input: User-movie ratings matrix
  • Output: Latent factor matrices for users and movies
  • Computation: Matrix factorization with iterative optimization
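For reference, the textbook form of the ALS objective is sketched below. The symbols are assumptions introduced here: $r_{ui}$ is the rating user $u$ gave movie $i$, $\Omega$ is the set of observed ratings, $x_u$ and $y_i$ are the latent factor vectors, and $\lambda$ is the regularization weight. The rank, $\lambda$, and iteration count used by the workload's jar are not stated in this README, and Spark's implementation may scale the regularization term differently.

```latex
% Regularized squared-error objective over observed ratings
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% With the movie factors Y held fixed, each user vector has a closed-form solution
% (\Omega_u is the set of movies rated by user u):
x_u = \left( \sum_{i \in \Omega_u} y_i y_i^{\top} + \lambda I \right)^{-1}
      \sum_{i \in \Omega_u} r_{ui}\, y_i
```

The update for each $y_i$ with $X$ fixed is symmetric. Alternating between the two solves is the iterative optimization named above; because every pass rereads the ratings, keeping them cached in memory avoids repeated disk reads.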

Performance Notes

  • Memory requirements scale with dataset size and the number of latent factors
  • The ALS algorithm is iterative and benefits from Spark's in-memory caching
  • The large dataset needs ample memory: 16GB or more each for the driver and executor is recommended
  • The small dataset is useful for development and testing
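When raising the memory flags, it helps to check them against the container limit (24g in the Docker example). The sketch below is a rough arithmetic check only. `to_mib` and `fits_in_container` are hypothetical helpers, and the check assumes the driver and executor allocations add up, which depends on how run_benchmark.sh launches Spark. It also ignores JVM overhead beyond the heap, so leave some headroom.

```shell
# Convert a size such as "8g" or "512m" to MiB (hypothetical helper).
to_mib() {
  case "$1" in
    *g|*G) echo $(( ${1%[gG]} * 1024 )) ;;
    *m|*M) echo $(( ${1%[mM]} )) ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Succeed if driver + executor memory fits under the container limit.
# Returns 2 if any size cannot be parsed.
fits_in_container() {
  driver=$(to_mib "$1") || return 2
  executor=$(to_mib "$2") || return 2
  limit=$(to_mib "$3") || return 2
  [ $(( driver + executor )) -le "$limit" ]
}
```

With these helpers, `fits_in_container 8g 8g 24g` succeeds, while `fits_in_container 16g 16g 24g` fails, which suggests raising the container's `--memory` value before using 16g for both driver and executor.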

Files