Context
- Organisation: Barcelona Supercomputing Center (BSC)
- Project: DestinE ClimateDT
Summary
We observe a significant OpenMP thread imbalance in the single-precision CPU version of ectrans when running with IM=0.
The issue was first detected on MN5 GPP and also appears on LUMI-C, but the reproducer below is for MN5.
The imbalance becomes visible when GEMM is fast enough, e.g. with:
export MKL_CBWR=AUTO
On MN5 this selects AVX512 behavior and makes the SGEMM calls sufficiently fast that the overhead from temporary allocation and single-to-double copies in ledir_mod.F90 becomes a relevant fraction of runtime.
Versions
ecbuild: 3.8.5
fiat: 1.4.1
ectrans: 1.8.0
Platform
System: MN5 GPP partition
Nodes: 100
Cores per node: 112
Compiler: Intel ifort
MPI: Intel MPI 2021.10.0
MKL: 2023.2.0
FFTW: 3.3.10
Modules:
module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10 cmake
export FC=ifort
Reproducer
Submission script:
#!/bin/bash
#SBATCH -J ectrans-tco2559
#SBATCH --qos=gp_bsces
#SBATCH --account=bsc32
#SBATCH -N 100
#SBATCH --ntasks-per-node=14
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
#SBATCH --time=00:30:00
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close
module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10
export DR_HOOK=1
export DR_HOOK_OPT=prof
# Important: needed for GEMM to be fast enough for the imbalance to show clearly.
# On MN5 this selects AVX512 behavior.
export MKL_CBWR=AUTO
srun /gpfs/scratch/ehpc01/bsc032799/ectrans-1.8/ectrans/bin/ectrans-benchmark-cpu-sp \
--truncation 1279 \
--vordiv \
--scders \
--nlev 137 \
--nfld 16 \
--nproma 16 \
--niter 50
Observed behavior
With DR_HOOK profiling, rank 0 and other affected ranks show a strong imbalance in LEDIR_SGEMM_1 and LEDIR_SGEMM_2 across OpenMP threads: thread 0 (listed as @1 in the excerpt) is much slower because it takes the double-precision (DP) path.
Example excerpt:
# % Time Cumul Self Total # of calls Self Total Routine@<thread-id>
ms/call ms/call
63 0.99 5.747 0.057 0.057 53 1.08 1.08 *LEDIR_SGEMM_1@1
66 0.33 5.787 0.019 0.019 53 0.37 0.37 *LEDIR_SGEMM_2@1
72 0.20 5.787 0.012 0.012 106 0.11 0.11 LEDIR_SGEMM_1@8
73 0.19 5.787 0.011 0.011 53 0.21 0.21 LEDIR_SGEMM_1@2
74 0.17 5.787 0.010 0.010 106 0.09 0.09 LEDIR_SGEMM_2@8
75 0.17 5.787 0.010 0.010 106 0.09 0.09 LEDIR_SGEMM_1@7
76 0.16 5.787 0.009 0.009 53 0.17 0.17 LEDIR_SGEMM_2@2
77 0.14 5.787 0.008 0.008 106 0.07 0.07 LEDIR_SGEMM_2@7
79 0.12 5.787 0.007 0.007 53 0.14 0.14 LEDIR_SGEMM_1@3
80 0.12 5.787 0.007 0.007 85 0.08 0.08 LEDIR_SGEMM_1@6
81 0.11 5.787 0.006 0.006 53 0.12 0.12 LEDIR_SGEMM_1@4
82 0.11 5.787 0.006 0.006 53 0.12 0.12 LEDIR_SGEMM_2@3
83 0.11 5.787 0.006 0.006 85 0.07 0.07 LEDIR_SGEMM_2@6
84 0.11 5.787 0.006 0.006 74 0.08 0.08 LEDIR_SGEMM_1@5
86 0.10 5.793 0.006 0.006 53 0.11 0.11 LEDIR_SGEMM_2@4
87 0.09 5.793 0.005 0.005 74 0.07 0.07 LEDIR_SGEMM_2@5
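As a rough reading of the excerpt above, the self time of the slow thread can be compared with the next busiest thread. This assumes the first Self column is in seconds, which is consistent with calls × ms/call (e.g. 53 × 1.08 ms ≈ 0.057 s for LEDIR_SGEMM_1@1). The values are rounded in the profile, so the ratios are approximate:

```latex
\frac{t_{\mathrm{self}}(\texttt{LEDIR\_SGEMM\_1@1})}{t_{\mathrm{self}}(\texttt{LEDIR\_SGEMM\_1@8})}
  = \frac{0.057}{0.012} \approx 4.8,
\qquad
\frac{t_{\mathrm{self}}(\texttt{LEDIR\_SGEMM\_2@1})}{t_{\mathrm{self}}(\texttt{LEDIR\_SGEMM\_2@8})}
  = \frac{0.019}{0.010} = 1.9
```

Thread @1 also makes fewer calls (53) than threads such as @7 and @8 (106), so the per-call gap is larger than the self-time gap: 1.08 ms/call against 0.11 ms/call for LEDIR_SGEMM_1.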
Additional profiling / suspected source
After adding extra DR_HOOK regions in
src/trans/cpu/internal/ledir_mod.F90
we see that the imbalance is associated with the IM=0 path in the single-precision version. As we understand it, this path is required to preserve conservation properties and maintain scientific accuracy.
However, in large-scale runs we observe that this imbalance becomes more pronounced. In particular, there are cases where a single DGEMM-based path becomes slower than executing multiple SGEMM paths.
Additional timers added around this region suggest that part of this extra cost may come from:
- Allocation/deallocation of temporary double-precision arrays
- Copy and conversion of the matrix slice from single to double precision before the GEMM call
The intention of this issue is not to question the IM=0 logic itself, but to highlight that the allocation and copy overheads could potentially be reduced or avoided (e.g. via caching or reuse), which may help mitigate the observed imbalance.
Relevant code pattern:
ELSE
  BLOCK
    REAL(KIND=JPRD), allocatable :: ZB_D(:,:), ZCS_D(:,:), ZRPNMS(:,:)
    INTEGER(KIND=JPIM) :: I1, I2, I3, I4
    I1 = size(S%FA(KMLOC)%RPNMS(:,1))
    I2 = size(S%FA(KMLOC)%RPNMS(1,:))
    ALLOCATE(ZRPNMS(I1,I2))
    ALLOCATE(ZB_D(KDGLU,KIFC))
    ALLOCATE(ZCS_D((R%NTMAX-KM+3)/2,KIFC))
    IFLD=0
    DO JK=1,KFC,ISKIP
      IFLD=IFLD+1
      DO J=1,KDGLU
        ZB_D(J,IFLD)=PSIA(JK,ISL+J-1)*REAL(PW(ISL+J-1),JPRB)
      ENDDO
    ENDDO
    DO I3=1,I1
      DO I4=1,I2
        ZRPNMS(I3,I4) = S%FA(KMLOC)%RPNMS(I3,I4)
      END DO
    END DO
    CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,&
     &ZB_D,KDGLU,0._JPRD,ZCS_D,ILS)
    IFLD=0
    DO JK=1,KFC,ISKIP
      IFLD=IFLD+1
      DO J=1,ILS
        ZCS(J,IFLD) = ZCS_D(J,IFLD)
      ENDDO
    ENDDO
    DEALLOCATE(ZRPNMS)
    DEALLOCATE(ZB_D)
    DEALLOCATE(ZCS_D)
  END BLOCK
END IF
When the GEMM duration is very small, the allocation/deallocation and the repeated copy/conversion of S%FA(KMLOC)%RPNMS into ZRPNMS become a noticeable part of the cost and may contribute to the observed imbalance.
Expected behavior
More balanced OpenMP thread execution in the single-precision CPU path when IM=0.
Actual behavior
Clear thread imbalance in LEDIR_SGEMM_* regions, especially at scale and when GEMM is fast.
Possible improvement
A potential optimization would be to trade memory for performance by caching the double-precision version of the IM=0 matrix slice in the single-precision path.
For example:
- Build the double-precision slice once (e.g. in SULEG)
- Reuse it in LEDIR
- Avoid repeated allocation, copy, and conversion
This would remove:
- Repeated allocation/deallocation of ZRPNMS
- Repeated single → double conversion of the matrix slice
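To make the idea concrete, here is a minimal standalone sketch of the caching pattern. It is not ectrans code: the module, type, and routine names (dp_cache_mod, dp_cache_t, build_cache, apply_cached) are hypothetical, the array sizes are arbitrary, and the intrinsic MATMUL stands in for the GEMM('T','N',...) call so that the example has no BLAS dependency.

```fortran
! Sketch only: cache the double-precision copy of a single-precision
! matrix once, then reuse it on every call. All names are hypothetical.
module dp_cache_mod
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  private
  public :: sp, dp, dp_cache_t, build_cache, apply_cached

  type :: dp_cache_t
    real(dp), allocatable :: a_d(:,:)   ! cached double-precision copy
  end type dp_cache_t

contains

  ! Paid once, e.g. from a setup routine (the issue suggests SULEG).
  subroutine build_cache(a_s, cache)
    real(sp), intent(in) :: a_s(:,:)
    type(dp_cache_t), intent(inout) :: cache
    if (allocated(cache%a_d)) deallocate(cache%a_d)
    allocate(cache%a_d(size(a_s,1), size(a_s,2)))
    cache%a_d = real(a_s, dp)
  end subroutine build_cache

  ! Called on every transform: the matrix is neither allocated nor
  ! converted here. MATMUL replaces GEMM('T','N',...) in this sketch.
  subroutine apply_cached(cache, b_d, c_s)
    type(dp_cache_t), intent(in) :: cache
    real(dp), intent(in) :: b_d(:,:)
    real(sp), intent(out) :: c_s(:,:)
    c_s = real(matmul(transpose(cache%a_d), b_d), sp)
  end subroutine apply_cached

end module dp_cache_mod

program demo
  use dp_cache_mod
  implicit none
  integer, parameter :: n = 4, m = 3, nfld = 2, niter = 50
  real(sp) :: a_s(n, m), c_s(m, nfld)
  real(dp) :: b_d(n, nfld)
  type(dp_cache_t) :: cache
  integer :: it

  call random_number(a_s)
  call random_number(b_d)

  call build_cache(a_s, cache)          ! conversion paid once
  do it = 1, niter
    call apply_cached(cache, b_d, c_s)  ! cached copy reused each iteration
  end do

  ! Compare against converting the matrix on the fly.
  print *, 'max |cached - reference| =', &
       maxval(abs(c_s - real(matmul(transpose(real(a_s, dp)), b_d), sp)))
end program demo
```

In this sketch the cached array is only read after build_cache, so sharing it between OpenMP threads would not need synchronisation. The cost is the extra memory for one double-precision copy of the slice; whether that is acceptable in ectrans, and where the cache should live, is for the maintainers to judge.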