Skip to content

discovery-unicamp/faas-dask

Repository files navigation

FaaS-Dask

FaaS-Dask adapts the Dask parallel computing framework to run on serverless (FaaS) platforms. It is designed to reduce costs for interactive scientific workflows by eliminating charges for idle time between computations, a common issue with traditional VM-based clusters. This provides a scalable, cost-efficient alternative for sporadic workloads.

Quick Demo

Using Python 3.10 or supperior, setup your environment:

python -m venv venv
source venv/bin/activate # Supposing a linux enviorment, may vary
pip install -r requirements.txt
pip install . # Install local version of FaaS-Dask

Then, let's define some input data and our test function:

from typing import cast
import dask.array as da

state = da.random.RandomState(42)
x = cast(da.Array, state.random(size=(10000, 10000), chunks=(1000, 1000)))

There is a development version of FaaS-Dask that uses processes in the local machine with constraints simillar to the FaaS environment. You can use it to quickly test this tool, but it is only meant for simple tests, and should not be used for anything seriou:

from faascheduler.local import MultiprocessingCluster
from faascheduler.scheduler.dictionary import DictionaryScheduler

with MultiprocessingCluster(DictionaryScheduler()):
    print(x.mean().compute())

You should see: 0.4999391524251242 on your screen!

The next example uses AWS and requires AWS credentials to be setup. Check AWS Documentation to learn how to do that.

First, we need to generate a zip file. The following script will package a main.py for our Lambda and all dependencies into ./lambdabin/lambda_function_payload.zip:

./lambda/setup_package.sh

Then, we can run the following (it will take a few minutes):

from faascheduler.aws import AwsLambdaCluster

zip_path = "./lambdabin/lambda_function_payload.zip"

with AwsLambdaCluster(zip_path):
    print(x.mean().compute())

You'll see the same value: 0.4999391524251242

Using the same AwsLambdaCluster for multiple computations will achieve the best results, as it takes a while for all cloud resources to become available. The code instantiating the Cluster should be also close to the region where it will run to minimize network latency (in our experiments we used a VM for that).

What does AwsLambdaCluster do, and will I be billed?

When you create and start the context manager controlled by AwsLambdaCluster the following happens:

  • An AWS Lambda is defined using the provided zip
  • All needed resources are created. For the parameters above, the following will be created on us-east-1:
    • An S3 bucket to store intermediate state
    • Two SQS queues, one for sending tasks other for receiving results
    • Roles to allow all the resources to communicate
  • The default Dask scheduler is replaced by FaaS-Dask

When the context manager exits all AWS resources are deleted and the Dask scheduler is reset.

In our experience simple computations as the one presented above fall into the free quota, but depending on the AWS account you're using this may generate costs.

Reproducing the experiments

🚧 Under Construction 🚧 Our publications share experiments with a myriad of different configurations, and compares those results with FaaS-Dask. This section will present scripts that can be used to reproduce these experiments.

Under benchmarks there is a CLI that can be used to run on different configurations. First of all, you need to install pandas on your venv. Then, you can simply run python benchmarks/run.py -h and you should see some instructions. Before executing, make sure to generate input data and check you environemtn, as instructed below.

Generate Input Data Running the command below will generated input data and upload it to S3

python ./benchmarks/run.py gen_data

Check environment This command will create a Lambda cluster and execute simple computations on it

## the zip referenced below can be generated by running ./lambda/setup_package.sh
python ./benchmarks/run.py -l lambdabin/lambda_function_payload.zip sanity

Lambda concurrency

New AWS acounts may have concurrncy limited to 10. Experiments ran with 1000 limit, if you're interested in reproducing you should request the ceiling to be updated.

Running on EFS

To run on EFS a few additional steps are needed. To access EFS the Lambda needs to be in a VPC, and the first subnet in the first VPC is chosen by default. Chances are that your VPC does not have internet access, so access to SQS and S3 will be blocked. This will make your lambda instances timeout.

To solve that, make sure your VPC can access S3 (by creating an Endpoint, for example). FaaS-Dask already makes sure ACLs to all resources and networking to EFS is configured, so only the VPC needs to be changed.

Fargate and EC2

TODO - Hardcoded AMI and Subnets

Publications

DOI: 10.1109/UCC63386.2024.00057

Carlos Eduardo Millani, Carlos A. Astudillo, and Edson Borin. Towards a scalable and cost efficient serverless scheduler for dask. In 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC), page 366–371. IEEE, December 2024.

@inproceedings{Millani2024,
  title = {Towards a Scalable and Cost Efficient Serverless Scheduler for Dask},
  url = {http://dx.doi.org/10.1109/UCC63386.2024.00057},
  DOI = {10.1109/ucc63386.2024.00057},
  booktitle = {2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC)},
  publisher = {IEEE},
  author = {Millani,  Carlos Eduardo and Astudillo,  Carlos A. and Borin,  Edson},
  year = {2024},
  month = dec,
  pages = {366–371}
}

About

FaaS-Dask adapts the Dask parallel computing framework to run on serverless (FaaS) platforms. It is designed to reduce costs for interactive scientific workflows by eliminating charges for idle time between computations, a common issue with traditional VM-based clusters. This provides a scalable, cost-efficient alternative for sporadic workloads.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors