This workload tests Spark's in-memory analytics capabilities using the MovieLens recommendation dataset. It implements the Alternating Least Squares (ALS) collaborative filtering algorithm for movie recommendations.
Note: The large MovieLens dataset is stored in GitHub Releases and must be downloaded manually. The workload runs in standalone mode, where a single container holds both Spark and the recommendation application.
Download the MovieLens dataset from the GitHub release:
- Download from release: https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip
- Extract to the correct location:
cd /mydata/platform-workloads/cloud-workloads/inMemoryAnalytics/data
# Download and extract (or place downloaded file here)
wget https://github.com/Anjali05/platform-workloads/releases/download/inMemoryAnalytics-v1.0.0/ml-latest.zip
unzip ml-latest.zip
# Verify the directory structure
ls -la ml-latest/
# Should contain: ratings.csv, movies.csv, etc.
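The verification step above can also be scripted. The following is a minimal sketch, assuming that ratings.csv and movies.csv (the two files named above) are the ones that matter; `check_dataset` is a hypothetical helper and is not part of this repository.

```shell
# Hypothetical helper: succeed only if the dataset directory contains
# non-empty ratings.csv and movies.csv files.
check_dataset() {
  local dir="$1"
  local missing=0
  local f
  for f in ratings.csv movies.csv; do
    # -s is true when the file exists and has a size greater than zero
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After extraction, `check_dataset ml-latest && echo "dataset looks complete"` would report any missing file on stderr and return a non-zero status.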
Use the provided Dockerfile to build the standalone in-memory analytics image.
# Build the image
docker build -t myinmemoryanalytics .
# Start container
docker run -id --rm --name worker \
--cpuset-cpus="0-3" \
--memory="24g" \
myinmemoryanalytics
Main workload (requires manual ml-latest setup above):
docker exec worker /root/run_benchmark.sh /data/ml-latest /data/myratings.csv --driver-memory "8g" --executor-memory "8g"
Development/testing with small dataset (no download needed):
docker exec worker /root/run_benchmark.sh /data/ml-latest-small /data/myratings.csv --driver-memory "8g" --executor-memory "8g"

Resize the base image to 16GB, then boot the VM and install Spark 3.3.2 following commons/spark/3.3.2/Dockerfile. Set SPARK_HOME=/opt/spark-3.3.2.
Copy the workload files into the VM:
mkdir -p /root/inMemoryAnalytics/data
cp fc/movielens-als-2.0.jar /root/inMemoryAnalytics/
cp fc/run_benchmark.sh /root/run_benchmark.sh
chmod +x /root/run_benchmark.sh
# ml-latest is available from the GitHub release (see Manual File Setup above)
cp -r <path-to>/ml-latest /root/inMemoryAnalytics/data/
cp data/myratings.csv /root/inMemoryAnalytics/data/
Set SPARK_LOCAL_IP=169.254.0.1 in /etc/environment or /etc/profile.d/spark.sh so it persists across SSH sessions.
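One way to make the setting persistent is to append the export line to the profile script. This is a minimal sketch, not the repository's own tooling: `persist_spark_local_ip` is a hypothetical helper, and it assumes the profile script is sourced by login shells on the VM image.

```shell
# Hypothetical helper: append "export SPARK_LOCAL_IP=<ip>" to a profile
# script. The target defaults to /etc/profile.d/spark.sh.
persist_spark_local_ip() {
  local ip="$1"
  local target="${2:-/etc/profile.d/spark.sh}"
  local line="export SPARK_LOCAL_IP=$ip"
  # Append only when the exact line is not already present, so that
  # running the helper repeatedly leaves a single entry.
  if ! grep -qxF -- "$line" "$target" 2>/dev/null; then
    echo "$line" >> "$target"
  fi
}
```

Calling `persist_spark_local_ip 169.254.0.1` as root writes the line once; new SSH sessions should then see the variable without a manual export.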
Set required Spark environment variable:
export SPARK_LOCAL_IP=169.254.0.1
Main workload (requires manual ml-latest setup above):
export SPARK_LOCAL_IP=169.254.0.1
/opt/spark-3.3.2/bin/spark-submit \
--class MovieLensALS \
--driver-memory "8g" \
--executor-memory "8g" \
/root/inMemoryAnalytics/movielens-als-2.0.jar \
/root/inMemoryAnalytics/data/ml-latest \
/root/inMemoryAnalytics/data/myratings.csv
Development/testing with small dataset:
export SPARK_LOCAL_IP=169.254.0.1
/opt/spark-3.3.2/bin/spark-submit \
--class MovieLensALS \
--driver-memory "8g" \
--executor-memory "8g" \
/root/inMemoryAnalytics/movielens-als-2.0.jar \
/root/inMemoryAnalytics/data/ml-latest-small \
/root/inMemoryAnalytics/data/myratings.csv

| Parameter | Description | Example |
|---|---|---|
| Dataset Path | Path to MovieLens dataset directory | /data/ml-latest or /data/ml-latest-small |
| Ratings File | User ratings file for recommendations | /data/myratings.csv |
| `--driver-memory` | Spark driver memory allocation | 8g, 16g |
| `--executor-memory` | Spark executor memory allocation | 8g, 16g |
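To show how these parameters map onto the positional arguments and flags of the container command, here is a small sketch that only prints the command for review. `benchmark_cmd` is a hypothetical helper; the container name `worker` and the paths are taken from the commands above, and using one value for both memory flags is an assumption made for brevity.

```shell
# Hypothetical helper: print (not run) the docker exec command for a
# given dataset path and memory size. Defaults match the small dataset.
benchmark_cmd() {
  local dataset="${1:-/data/ml-latest-small}"
  local mem="${2:-8g}"
  printf 'docker exec worker /root/run_benchmark.sh %s /data/myratings.csv --driver-memory "%s" --executor-memory "%s"\n' \
    "$dataset" "$mem" "$mem"
}
```

For example, `benchmark_cmd /data/ml-latest 16g` prints the large-dataset command with both memory flags set to 16g.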

Large dataset (ml-latest):
- Size: ~1.1GB uncompressed
- Ratings: ~27M ratings
- Movies: ~58K movies
- Users: ~280K users
- Files: ratings.csv (725MB), genome-scores.csv (396MB), tags.csv (38MB), movies.csv, links.csv, genome-tags.csv

Small dataset (ml-latest-small):
- Size: ~1MB compressed
- Ratings: ~100K ratings
- Movies: ~9K movies
- Users: ~600 users
- Files: ratings.csv (2.4MB), movies.csv, links.csv, tags.csv

Algorithm:
- Method: Alternating Least Squares (ALS) Collaborative Filtering
- Use Case: Movie recommendation system
- Input: User-movie ratings matrix
- Output: Latent factor matrices for users and movies
- Computation: Matrix factorization with iterative optimization
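
The matrix factorization described above can be written out as follows. This is a standard textbook formulation of ALS, not necessarily the exact objective used by the Scala code in this repository; the symbols ($r_{ui}$ for the rating of movie $i$ by user $u$, $x_u$ and $y_i$ for the latent factor vectors, $\Omega$ for the set of observed ratings, $\lambda$ for the regularization weight) are notation assumed here, and Spark's own implementation may scale the regularization term differently.

```latex
% Objective: fit observed ratings with low-rank user and movie factors
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% With Y held fixed, each user vector has a closed-form least-squares update,
% where Y_u stacks the factors of the movies rated by user u and r_u holds
% the corresponding ratings:
x_u = \left( Y_u^{\top} Y_u + \lambda I \right)^{-1} Y_u^{\top} r_u
```

The update for each $y_i$ is symmetric with $X$ held fixed. Alternating between the two updates is the iterative optimization step, and each pass rereads the full ratings set.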

Performance notes:
- Memory requirements scale with dataset size and number of factors
- ALS algorithm is iterative and benefits from Spark's in-memory caching
- Large dataset requires sufficient memory allocation (recommend 16GB+ for driver/executor)
- Small dataset useful for development and testing
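
When raising the memory settings, it can help to check them against the container limit first. The sketch below is a conservative rule of thumb rather than a statement about how Spark allocates memory: it assumes the driver and executor heaps are additive, ignores JVM overhead beyond the heap, and uses hypothetical helper names (`to_mib`, `fits_in_container`).

```shell
# Hypothetical helper: convert a size such as "8g" or "512m" to MiB.
to_mib() {
  local n="${1%[gGmM]}"
  case "$1" in
    *[gG]) echo $(( n * 1024 )) ;;
    *[mM]) echo "$n" ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed when driver + executor memory is at most
# the container limit. Arguments: <container-limit> <driver> <executor>.
fits_in_container() {
  local limit driver executor
  limit=$(to_mib "$1") || return 2
  driver=$(to_mib "$2") || return 2
  executor=$(to_mib "$3") || return 2
  [ $(( driver + executor )) -le "$limit" ]
}
```

With the 24g container limit used in this README, `fits_in_container 24g 8g 8g` succeeds, while `fits_in_container 24g 16g 16g` fails, which suggests raising `--memory` before using 16g for both settings.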

Repository files:
- Dockerfile - Standalone container with Spark and MovieLens ALS
- data/ml-latest/ - Large MovieLens dataset (download from releases)
- data/ml-latest-small/ - Small MovieLens dataset (included)
- data/myratings.csv - Sample user ratings for recommendations
- movielens-als/ - Scala source code for ALS implementation
- fc/ - Firecracker-specific files and configurations