This repository was archived by the owner on May 3, 2024. It is now read-only.

Choosing the Batch Size for Single-GPU and Multi-GPU Training #21

@dhzhd1

Issue summary

I am training GoogleNet (v1) on the ImageNet dataset with Caffe. For single-GPU (MI25) training I use a batch size of 128. I then moved the same training to multiple MI25 GPUs with hipCaffe. Since the total GPU memory is four times larger (16GB x 4), the batch size should be able to reach 512 images per batch (128 images per batch per card). In my testing, however, the batch size cannot be increased at all: even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation'(1002)".
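The batch-size arithmetic in this summary can be sketched as below. This is only a sketch under one assumption: that hipCaffe follows the multi-GPU convention documented for upstream BVLC Caffe, where the `batch_size` in the prototxt applies to each GPU separately, so the effective batch is that value times the number of GPUs. I have not confirmed this for hipCaffe, and the function names are hypothetical helpers, not part of any Caffe API.

```python
# Sketch only. Assumes the prototxt batch_size is applied per GPU, as documented
# for upstream BVLC Caffe; this is NOT confirmed for hipCaffe.


def effective_batch_size(prototxt_batch_size: int, num_gpus: int) -> int:
    """Images processed per iteration across all GPUs (hypothetical helper)."""
    if prototxt_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch size and GPU count must be positive")
    return prototxt_batch_size * num_gpus


def prototxt_batch_for_target(target_effective: int, num_gpus: int) -> int:
    """Per-GPU batch_size to put in the prototxt for a desired total batch."""
    if target_effective <= 0 or num_gpus <= 0:
        raise ValueError("target batch and GPU count must be positive")
    if target_effective % num_gpus != 0:
        raise ValueError("target batch must divide evenly across the GPUs")
    return target_effective // num_gpus


if __name__ == "__main__":
    # Values taken from the issue: 4 cards, prototxt batch sizes 128 and 192.
    print(effective_batch_size(128, 4))       # total images per iteration
    print(effective_batch_size(192, 4))       # total images per iteration
    print(prototxt_batch_for_target(512, 4))  # per-GPU value for a 512 total
```

If that assumption holds, a prototxt value of 128 on four cards would already give 512 images per iteration, and a value of 192 would ask each 16GB card to hold 192 images, which would be consistent with the allocation error seen on a single card.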

Because the batch size is stuck at 128, my rough estimate is that training on the four MI25 cards will take about 3 to 3.5x longer than on a 4x P100 system (batch_size=512).

Are there any environment parameters I should set before training that would allow a larger batch size for multi-GPU training?

I cross-checked on one of my NVIDIA P100 x4 servers, where the batch size could be increased as I used more cards. The batch sizes mentioned above are based on my experience running training jobs on the same dataset and the same network with NVIDIA P100 (16GB GDDR) and V100 (16GB GDDR).

Steps to reproduce

Use the bvlc_googlenet training network under the hipCaffe installation path, with the ImageNet dataset from the official ImageNet website.

Your system configuration

Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB
