In this project we used AWS SageMaker, the Amazon Bin Image Dataset, and good machine learning engineering practices to fetch data from a database, preprocess it, and train a machine learning model. The model is a custom ResNet-50: the well-known ResNet50 CNN architecture (not pre-trained) combined with our own fully-connected (fc) network. ResNet-50 was chosen for its general performance on the ImageNet dataset. We also profile the model's CPU, GPU, I/O, and memory utilization during training. Finally, we run hyperparameter tuning over larger hyperparameter ranges and retrain the model with the best hyperparameters found.
- AWS SageMaker is an integrated service for training, hyperparameter tuning, debugging, and deploying machine learning models; it is used here to train the deep learning model
- S3 is used to store the data and the trained model
- SageMaker endpoints are used for model inference. Batch transform can also be used for inference on a batch test set.
- Data can be downloaded from the Amazon Open Data website: https://registry.opendata.aws/amazon-bin-imagery/
- The data was captured by Amazon in its fulfilment centres and has around 50000 images
- Images are located in the bin-images directory, and metadata for each image is located in the metadata directory. Images and their associated metadata share simple numerical unique identifiers.
There are two sets of inputs for model training:
- Images for the model, available in the source as JPEG files. Example: https://aft-vbi-pds.s3.amazonaws.com/bin-images/1005.jpg
- Metadata for each image, in JSON format. Example: https://aft-vbi-pds.s3.amazonaws.com/metadata/1005.json
From the JSON file, we extract the target label, which is the quantity of objects in the image.
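The label extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the quantity is stored under a top-level key named `EXPECTED_QUANTITY` (check a real metadata file to confirm), and the `sample` string is a hand-written stand-in rather than real dataset content.

```python
import json


def extract_quantity(metadata_text, key="EXPECTED_QUANTITY"):
    """Return the object count stored under `key` in one metadata JSON document.

    The key name is an assumption about the metadata schema.
    """
    metadata = json.loads(metadata_text)
    return int(metadata[key])


# Hand-written stand-in for the contents of one metadata file.
sample = '{"BIN_FCSKU_DATA": {}, "EXPECTED_QUANTITY": 3}'
label = extract_quantity(sample)
print(label)
```

The returned integer can then serve as the class label for the matching image, since images and metadata share the same numerical identifier.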
The data is uploaded to the S3 bucket through the AWS Gateway, using AWS CLI shell commands, so that SageMaker has access to it:
- !aws s3 cp train s3://bucket/train/ --recursive
- !aws s3 cp test s3://bucket/test/ --recursive
- hpo.py: used for hyperparameter tuning jobs, where the model is trained multiple times with different hyperparameters to find the best combination based on loss metrics.
- train_model.py: trains the model with the best hyperparameters obtained from the tuning jobs, and sets up debugger and profiler hooks for debugging.
- inference.py: contains the methods required to deploy the model (model_fn to load the model and input_fn to transform the input into something the model can understand).
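SageMaker passes each hyperparameter to the entry-point script as a command-line argument, so scripts such as hpo.py and train_model.py usually start by parsing them. Below is a minimal sketch, assuming argument names that match the keys `batch-size`, `lr`, and `epochs` used in the tuning configuration; the function name and the default values are hypothetical and not taken from the project.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters the way a training script might receive them."""
    parser = argparse.ArgumentParser()
    # Defaults are placeholders for local runs, not values from the project.
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=2)
    return parser.parse_args(argv)


# Hyperparameters arrive as strings and are converted by the `type` callables.
args = parse_hyperparameters(
    ["--batch-size", "512", "--lr", "0.026305482032806977", "--epochs", "4"]
)
print(args.batch_size, args.lr, args.epochs)
```

Note that argparse exposes `--batch-size` as `args.batch_size`, with the hyphen replaced by an underscore.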
Below are hyperparameter types and their respective ranges used in the training
- learning rate
- batch size
- epochs
hyperparameter_ranges = {
"batch-size": sagemaker.tuner.CategoricalParameter([32, 64, 128, 256, 512]),
"lr": sagemaker.tuner.ContinuousParameter(0.01, 0.1),
"epochs": sagemaker.tuner.IntegerParameter(2, 4)
}
The objective type is to maximize accuracy.
objective_metric_name = "average test accuracy"
objective_type = "Maximize"
metric_definitions = [{"Name": "average test accuracy", "Regex": "Test set: Average accuracy: ([0-9\\.]+)"}]
Best hyperparameter values
hyperparameters = {'batch-size': '512', 'lr': '0.026305482032806977', 'epochs': '4'}
Training Jobs:
A maximum of 4 jobs was run, with 2 jobs running concurrently.
All 4 jobs took 14 minutes to complete. Running all 4 jobs concurrently next time would save time.
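The tuner reads the objective metric from the training logs by applying the regex in metric_definitions, so the training script has to print a line in exactly that format. The sketch below shows how that regex captures the value; the helper name and the log line with its accuracy value are made up for illustration and are not results from this project.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
METRIC_REGEX = r"Test set: Average accuracy: ([0-9\.]+)"


def extract_accuracy(log_line):
    """Return the captured accuracy as a float, or None if the line does not match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


# Illustrative log line; the number is not a measured result.
log_line = "Test set: Average accuracy: 0.3125"
print(extract_accuracy(log_line))
print(extract_accuracy("Train Epoch: 1 Loss: 1.2345"))
```

If the script printed the metric with different wording or capitalization, the regex would not match and the tuning job would have no objective value to optimize.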
First, configure a debugger rule object that accepts a list of rules to evaluate against the output tensors. SageMaker Debugger runs the ProfilerReport rule by default; this rule autogenerates a profiling report. Second, configure debugger hook parameters to adjust the save intervals of the output tensors in the different training phases. Next, construct a PyTorch estimator object with the debugger rules and hook parameters. Finally, start the training job by fitting the estimator object to the training data.
The training job was very long (approximately 8 hours). Observing the peaks in CPU, GPU, memory, and I/O utilization helped select a more suitable instance type for training and improve resource efficiency. Training was also performed on an EC2 instance to compare performance, time, and cost.
The deployed model runs on one instance of a standard compute type ("ml.t2.medium"). These parameters are set through the PyTorch deploy function. Deploying the model creates an endpoint. To query the endpoint with a test sample, first apply resize, crop, ToTensor, and normalization transformations to the image, then pass the transformed image to the endpoint's predict function.
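To make the crop and normalization steps concrete, the arithmetic behind them can be sketched in plain Python. This is only an illustration under stated assumptions: the 224-pixel crop size and the ImageNet channel means and standard deviations are common defaults, not values confirmed by this project, and both function names are hypothetical. In practice these steps would normally be done with an image library rather than by hand.

```python
# Assumed values: common ImageNet channel statistics and a 224-pixel crop.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop-by-crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return (left, top, left + crop, top + crop)


def to_tensor_and_normalize(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


box = center_crop_box(256, 256)
pixel = to_tensor_and_normalize((255, 0, 128))
print(box)
print(pixel)
```

Whatever values are used at inference time should match the ones applied during training, otherwise the model sees inputs on a different scale than it was trained on.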
Execute the following lines of code, replacing IMAGE_PATH with the path where your image is stored and ENDPOINT with the name of your endpoint:
import io
import sagemaker
from PIL import Image
from sagemaker.serializers import IdentitySerializer
from sagemaker.pytorch.model import PyTorchPredictor
serializer = IdentitySerializer("image/jpeg")
predictor = PyTorchPredictor(ENDPOINT, serializer=serializer, sagemaker_session=sagemaker.Session())
buffer = io.BytesIO()
Image.open(IMAGE_PATH).save(buffer, format="JPEG")
response = predictor.predict(buffer.getvalue())
ACTIVE ENDPOINT