CPU SP performance imbalance (IM=0 path) linked to allocations and SP→DP copies in LEDIR #392

Description

@CarlosPenaDePedro

Context

  • Organisation: Barcelona Supercomputing Center (BSC)
  • Project: DestinE ClimateDT

Summary

We observe a significant OpenMP thread imbalance in the IM=0 code path of the single-precision CPU version of ectrans.

The issue was first detected on MN5 GPP and also appears on LUMI-C, but the reproducer below is for MN5.

The imbalance becomes visible when GEMM is fast enough, e.g. with:

export MKL_CBWR=AUTO

On MN5 this selects the AVX-512 code path and makes the SGEMM calls fast enough that the overhead from temporary allocation and single-to-double copies in ledir_mod.F90 becomes a significant fraction of the runtime.

Versions

ecbuild: 3.8.5
fiat:    1.4.1
ectrans: 1.8.0

Platform

System: MN5 GPP partition
Nodes: 100
Cores per node: 112
Compiler: Intel ifort
MPI: Intel MPI 2021.10.0
MKL: 2023.2.0
FFTW: 3.3.10

Modules:

module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10 cmake
export FC=ifort

Reproducer

Submission script:

#!/bin/bash
#SBATCH -J ectrans-tco2559
#SBATCH --qos=gp_bsces
#SBATCH --account=bsc32
#SBATCH -N 100
#SBATCH --ntasks-per-node=14
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
#SBATCH --time=00:30:00

export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close

module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10

export DR_HOOK=1
export DR_HOOK_OPT=prof

# Important: needed for GEMM to be fast enough for the imbalance to show clearly.
# On MN5 this selects AVX512 behavior.
export MKL_CBWR=AUTO

srun /gpfs/scratch/ehpc01/bsc032799/ectrans-1.8/ectrans/bin/ectrans-benchmark-cpu-sp \
  --truncation 1279 \
  --vordiv \
  --scders \
  --nlev 137 \
  --nfld 16 \
  --nproma 16 \
  --niter 50

Observed behavior

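With DR_HOOK profiling, rank 0 and other affected ranks show a strong imbalance in LEDIR_SGEMM_1 and LEDIR_SGEMM_2 across OpenMP threads: thread 0 (listed as @1 in the excerpt below) is much slower because it takes the DP path. In the excerpt, its self time in LEDIR_SGEMM_1 is 0.057 s, against 0.006 to 0.012 s on the other threads.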

Example excerpt:

#  % Time         Cumul         Self        Total     # of calls        Self       Total    Routine@<thread-id>
                                                                         ms/call     ms/call
63     0.99        5.747        0.057        0.057             53        1.08        1.08   *LEDIR_SGEMM_1@1
66     0.33        5.787        0.019        0.019             53        0.37        0.37   *LEDIR_SGEMM_2@1
72     0.20        5.787        0.012        0.012            106        0.11        0.11    LEDIR_SGEMM_1@8
73     0.19        5.787        0.011        0.011             53        0.21        0.21    LEDIR_SGEMM_1@2
74     0.17        5.787        0.010        0.010            106        0.09        0.09    LEDIR_SGEMM_2@8
75     0.17        5.787        0.010        0.010            106        0.09        0.09    LEDIR_SGEMM_1@7
76     0.16        5.787        0.009        0.009             53        0.17        0.17    LEDIR_SGEMM_2@2
77     0.14        5.787        0.008        0.008            106        0.07        0.07    LEDIR_SGEMM_2@7
79     0.12        5.787        0.007        0.007             53        0.14        0.14    LEDIR_SGEMM_1@3
80     0.12        5.787        0.007        0.007             85        0.08        0.08    LEDIR_SGEMM_1@6
81     0.11        5.787        0.006        0.006             53        0.12        0.12    LEDIR_SGEMM_1@4
82     0.11        5.787        0.006        0.006             53        0.12        0.12    LEDIR_SGEMM_2@3
83     0.11        5.787        0.006        0.006             85        0.07        0.07    LEDIR_SGEMM_2@6
84     0.11        5.787        0.006        0.006             74        0.08        0.08    LEDIR_SGEMM_1@5
86     0.10        5.793        0.006        0.006             53        0.11        0.11    LEDIR_SGEMM_2@4
87     0.09        5.793        0.005        0.005             74        0.07        0.07    LEDIR_SGEMM_2@5

Additional profiling / suspected source

We added extra DR_HOOK regions in:

src/trans/cpu/internal/ledir_mod.F90

With these regions in place, the imbalance is associated with the IM=0 path in the single-precision version. As we understand it, this path is required to preserve conservation properties and maintain scientific accuracy.

In large-scale runs the imbalance becomes more pronounced. In particular, there are cases where the single DGEMM-based path takes longer than executing multiple SGEMM paths.

Additional timers added around this region suggest that part of this extra cost may come from:

  • Allocation/deallocation of temporary double-precision arrays
  • Copy and conversion of the matrix slice from single to double precision before the GEMM call
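
To illustrate how these two contributions can be separated from the GEMM itself, here is a minimal standalone sketch. It is not the instrumentation we used in ledir_mod.F90 (that relied on DR_HOOK regions); it uses SYSTEM_CLOCK, the intrinsic MATMUL as a stand-in for the GEMM call, and hypothetical sizes (n, nfld, nrep) that do not correspond to any ectrans configuration.

```fortran
! Standalone timing sketch. Sizes and names are hypothetical, not ectrans values.
program time_dp_path
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64, int64
  implicit none
  integer, parameter :: n = 256, nfld = 32, nrep = 200
  real(sp), allocatable :: a_sp(:,:)
  real(dp), allocatable :: a_dp(:,:), b(:,:), c(:,:)
  real(dp) :: t_alloc, t_conv, t_gemm, checksum
  integer(int64) :: t0, t1, rate
  integer :: irep

  allocate(a_sp(n,n), b(n,nfld), c(n,nfld))
  call random_number(a_sp)
  call random_number(b)
  call system_clock(count_rate=rate)
  t_alloc = 0.0_dp; t_conv = 0.0_dp; t_gemm = 0.0_dp; checksum = 0.0_dp

  do irep = 1, nrep
     call system_clock(t0)
     allocate(a_dp(n,n))                 ! temporary DP array, as in the IM=0 path
     call system_clock(t1)
     t_alloc = t_alloc + seconds(t0, t1)

     call system_clock(t0)
     a_dp = real(a_sp, dp)               ! SP -> DP copy and conversion
     call system_clock(t1)
     t_conv = t_conv + seconds(t0, t1)

     call system_clock(t0)
     c = matmul(transpose(a_dp), b)      ! stands in for GEMM('T','N',...)
     call system_clock(t1)
     t_gemm = t_gemm + seconds(t0, t1)
     checksum = checksum + c(1,1)        ! keeps the product from being optimized away

     call system_clock(t0)
     deallocate(a_dp)
     call system_clock(t1)
     t_alloc = t_alloc + seconds(t0, t1)
  end do

  print '(a,es10.3,a)', 'alloc/dealloc : ', t_alloc, ' s'
  print '(a,es10.3,a)', 'SP->DP convert: ', t_conv,  ' s'
  print '(a,es10.3,a)', 'matrix product: ', t_gemm,  ' s'
  print '(a,es10.3)',   'checksum      : ', checksum

contains

  ! Converts two clock counts into elapsed seconds.
  function seconds(c0, c1) result(dt)
    integer(int64), intent(in) :: c0, c1
    real(dp) :: dt
    dt = real(c1 - c0, dp) / real(rate, dp)
  end function seconds

end program time_dp_path
```

Note that part of the allocation cost may show up under the conversion timer, since the pages of the temporary array are typically touched for the first time there. The split between the two should therefore be read together rather than in isolation.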

The intention of this issue is not to question the IM=0 logic itself, but to highlight that the allocation and copy overheads could potentially be reduced or avoided (e.g. via caching or reuse), which may help mitigate the observed imbalance.

Relevant code pattern:

ELSE
  BLOCK
     REAL(KIND=JPRD), allocatable :: ZB_D(:,:), ZCS_D(:,:), ZRPNMS(:,:)
     INTEGER(KIND=JPIM) :: I1, I2, I3, I4

     I1 = size(S%FA(KMLOC)%RPNMS(:,1))
     I2 = size(S%FA(KMLOC)%RPNMS(1,:))
     ALLOCATE(ZRPNMS(I1,I2))
     ALLOCATE(ZB_D(KDGLU,KIFC))
     ALLOCATE(ZCS_D((R%NTMAX-KM+3)/2,KIFC))

     IFLD=0
     DO JK=1,KFC,ISKIP
        IFLD=IFLD+1
        DO J=1,KDGLU
           ZB_D(J,IFLD)=PSIA(JK,ISL+J-1)*REAL(PW(ISL+J-1),JPRB)
        ENDDO
     ENDDO

     DO I3=1,I1
        DO I4=1,I2
           ZRPNMS(I3,I4) = S%FA(KMLOC)%RPNMS(I3,I4)
        END DO
     END DO

     CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,&
          &ZB_D,KDGLU,0._JPRD,ZCS_D,ILS)

     IFLD=0
     DO JK=1,KFC,ISKIP
        IFLD=IFLD+1
        DO J=1,ILS
           ZCS(J,IFLD) = ZCS_D(J,IFLD)
        ENDDO
     ENDDO

     DEALLOCATE(ZRPNMS)
     DEALLOCATE(ZB_D)
     DEALLOCATE(ZCS_D)
  END BLOCK
END IF

When the GEMM duration is very small, the allocation/deallocation and the repeated copy/conversion of S%FA(KMLOC)%RPNMS into ZRPNMS become a noticeable part of the cost and may contribute to the observed imbalance.

Expected behavior

More balanced OpenMP thread execution in the single-precision CPU path when IM=0.

Actual behavior

Clear thread imbalance in LEDIR_SGEMM_* regions, especially at scale and when GEMM is fast.

Possible improvement

A potential optimization would be to trade memory for performance by caching the double-precision version of the IM=0 matrix slice in the single-precision path.

For example:

  • Build the double-precision slice once (e.g. in SULEG)
  • Reuse it in LEDIR
  • Avoid repeated allocation, copy, and conversion

This would remove:

  • Repeated allocation/deallocation of ZRPNMS
  • Repeated single → double conversion of the matrix slice
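
A minimal sketch of this idea, under stated assumptions: slice_t, rpnms_dp and build_dp_cache are hypothetical names introduced for illustration and are not part of ectrans, and MATMUL stands in for the GEMM call. The double-precision copy is built once at setup time and then only read, which should also keep it safe to share across OpenMP threads.

```fortran
module dp_slice_cache
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  private
  public :: slice_t, build_dp_cache

  ! Hypothetical stand-in for the per-wavenumber data; not the ectrans type.
  type :: slice_t
     real(sp), allocatable :: rpnms(:,:)     ! single-precision matrix slice
     real(dp), allocatable :: rpnms_dp(:,:)  ! cached double-precision copy
  end type slice_t

contains

  ! Setup-time step: convert once and keep the result for later calls.
  subroutine build_dp_cache(s)
    type(slice_t), intent(inout) :: s
    if (.not. allocated(s%rpnms_dp)) then
       allocate(s%rpnms_dp(size(s%rpnms, 1), size(s%rpnms, 2)))
       s%rpnms_dp = real(s%rpnms, dp)
    end if
  end subroutine build_dp_cache

end module dp_slice_cache

program demo_dp_cache
  use, intrinsic :: iso_fortran_env, only: dp => real64
  use dp_slice_cache, only: slice_t, build_dp_cache
  implicit none
  type(slice_t) :: s
  real(dp), allocatable :: b(:,:), c(:,:)
  integer :: icall

  ! Hypothetical sizes, chosen only to make the example run quickly.
  allocate(s%rpnms(64,48), b(64,8), c(48,8))
  call random_number(s%rpnms)
  call random_number(b)

  call build_dp_cache(s)   ! done once, e.g. during setup

  ! The cached copy must match a fresh SP -> DP conversion exactly.
  if (any(s%rpnms_dp /= real(s%rpnms, dp))) error stop 'cache mismatch'

  ! Repeated "transform" calls reuse the cache: no allocation, no conversion.
  do icall = 1, 3
     c = matmul(transpose(s%rpnms_dp), b)   ! stands in for GEMM('T','N',...)
  end do

  print '(a,es12.5)', 'c(1,1) = ', c(1,1)
end program demo_dp_cache
```

The cost is the extra memory for the double-precision copy, which is twice the size of the single-precision slice it mirrors. Since only the IM=0 slice needs it, the overhead should stay limited to the ranks that own that wavenumber.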
