CPU SP performance imbalance (IM=0 path) linked to allocations and SP→DP copies in LEDIR

## Context

- Organisation: Barcelona Supercomputing Center (BSC)
- Project: DestinE ClimateDT

## Summary

We observe a significant OpenMP thread imbalance in the single-precision CPU version of `ectrans` when running with `IM=0`.

The issue was first detected on MN5 GPP and also appears on LUMI-C, but the reproducer below is for MN5.

The imbalance becomes visible when GEMM is fast enough, e.g. with:

```bash
export MKL_CBWR=AUTO
```

On MN5 this selects AVX512 behavior and makes the SGEMM calls sufficiently fast that the overhead from temporary allocation and single-to-double copies in `ledir_mod.F90` becomes a relevant fraction of runtime.

## Versions

```text
ecbuild: 3.8.5
fiat:    1.4.1
ectrans: 1.8.0
```

## Platform

```text
System: MN5 GPP partition
Nodes: 100
Cores per node: 112
Compiler: Intel ifort
MPI: Intel MPI 2021.10.0
MKL: 2023.2.0
FFTW: 3.3.10
```

Modules:

```bash
module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10 cmake
export FC=ifort
```

## Reproducer

Submission script:

```bash
#!/bin/bash
#SBATCH -J ectrans-tco2559
#SBATCH --qos=gp_bsces
#SBATCH --account=bsc32
#SBATCH -N 100
#SBATCH --ntasks-per-node=14
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
#SBATCH --time=00:30:00

export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close

module load intel/2023.2.0 impi/2021.10.0 mkl/2023.2.0 ucx/1.16.0 fftw/3.3.10

export DR_HOOK=1
export DR_HOOK_OPT=prof

# Important: needed for GEMM to be fast enough for the imbalance to show clearly.
# On MN5 this selects AVX512 behavior.
export MKL_CBWR=AUTO

srun /gpfs/scratch/ehpc01/bsc032799/ectrans-1.8/ectrans/bin/ectrans-benchmark-cpu-sp \
  --truncation 1279 \
  --vordiv \
  --scders \
  --nlev 137 \
  --nfld 16 \
  --nproma 16 \
  --niter 50
```

## Observed behavior

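With DR_HOOK profiling, rank 0 and other affected ranks show a strong imbalance in `LEDIR_SGEMM_1` and `LEDIR_SGEMM_2` across OpenMP threads: thread 0 (reported as `@1` in the DR_HOOK output) is much slower than the others because it takes the DP path. In the excerpt below, `LEDIR_SGEMM_1` costs 1.08 ms/call on that thread versus 0.08 to 0.21 ms/call on the others.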

Example excerpt:

```text
#  % Time         Cumul         Self        Total     # of calls        Self       Total    Routine@<thread-id>
                                                                         ms/call     ms/call
63     0.99        5.747        0.057        0.057             53        1.08        1.08   *LEDIR_SGEMM_1@1
66     0.33        5.787        0.019        0.019             53        0.37        0.37   *LEDIR_SGEMM_2@1
72     0.20        5.787        0.012        0.012            106        0.11        0.11    LEDIR_SGEMM_1@8
73     0.19        5.787        0.011        0.011             53        0.21        0.21    LEDIR_SGEMM_1@2
74     0.17        5.787        0.010        0.010            106        0.09        0.09    LEDIR_SGEMM_2@8
75     0.17        5.787        0.010        0.010            106        0.09        0.09    LEDIR_SGEMM_1@7
76     0.16        5.787        0.009        0.009             53        0.17        0.17    LEDIR_SGEMM_2@2
77     0.14        5.787        0.008        0.008            106        0.07        0.07    LEDIR_SGEMM_2@7
79     0.12        5.787        0.007        0.007             53        0.14        0.14    LEDIR_SGEMM_1@3
80     0.12        5.787        0.007        0.007             85        0.08        0.08    LEDIR_SGEMM_1@6
81     0.11        5.787        0.006        0.006             53        0.12        0.12    LEDIR_SGEMM_1@4
82     0.11        5.787        0.006        0.006             53        0.12        0.12    LEDIR_SGEMM_2@3
83     0.11        5.787        0.006        0.006             85        0.07        0.07    LEDIR_SGEMM_2@6
84     0.11        5.787        0.006        0.006             74        0.08        0.08    LEDIR_SGEMM_1@5
86     0.10        5.793        0.006        0.006             53        0.11        0.11    LEDIR_SGEMM_2@4
87     0.09        5.793        0.005        0.005             74        0.07        0.07    LEDIR_SGEMM_2@5
```

## Additional profiling / suspected source

After adding extra DR_HOOK regions in `src/trans/cpu/internal/ledir_mod.F90`, we see that the imbalance is associated with the `IM=0` path in the single-precision version. As we understand it, the double-precision treatment of this path is required to preserve conservation properties and maintain scientific accuracy.

However, in large-scale runs we observe that this imbalance becomes more pronounced. In particular, there are cases where a single DGEMM-based path becomes slower than executing multiple SGEMM paths.

Additional timers added around this region suggest that part of this extra cost may come from:

- Allocation/deallocation of temporary double-precision arrays
- Copy and conversion of the matrix slice from single to double precision before the GEMM call

The intention of this issue is not to question the `IM=0` logic itself, but to highlight that the allocation and copy overheads could potentially be reduced or avoided (e.g. via caching or reuse), which may help mitigate the observed imbalance.

Relevant code pattern:

```fortran
ELSE
  BLOCK
     REAL(KIND=JPRD), allocatable :: ZB_D(:,:), ZCS_D(:,:), ZRPNMS(:,:)
     INTEGER(KIND=JPIM) :: I1, I2, I3, I4

     I1 = size(S%FA(KMLOC)%RPNMS(:,1))
     I2 = size(S%FA(KMLOC)%RPNMS(1,:))
     ALLOCATE(ZRPNMS(I1,I2))
     ALLOCATE(ZB_D(KDGLU,KIFC))
     ALLOCATE(ZCS_D((R%NTMAX-KM+3)/2,KIFC))

     IFLD=0
     DO JK=1,KFC,ISKIP
        IFLD=IFLD+1
        DO J=1,KDGLU
           ZB_D(J,IFLD)=PSIA(JK,ISL+J-1)*REAL(PW(ISL+J-1),JPRB)
        ENDDO
     ENDDO

     DO I3=1,I1
        DO I4=1,I2
           ZRPNMS(I3,I4) = S%FA(KMLOC)%RPNMS(I3,I4)
        END DO
     END DO

     CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,&
          &ZB_D,KDGLU,0._JPRD,ZCS_D,ILS)

     IFLD=0
     DO JK=1,KFC,ISKIP
        IFLD=IFLD+1
        DO J=1,ILS
           ZCS(J,IFLD) = ZCS_D(J,IFLD)
        ENDDO
     ENDDO

     DEALLOCATE(ZRPNMS)
     DEALLOCATE(ZB_D)
     DEALLOCATE(ZCS_D)
  END BLOCK
END IF
```

When the GEMM duration is very small, the allocation/deallocation and the repeated copy/conversion of `S%FA(KMLOC)%RPNMS` into `ZRPNMS` become a noticeable part of the cost and may contribute to the observed imbalance.
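
One way to remove the allocation/deallocation cost without touching the numerics could be a thread-private, grow-only workspace that is allocated on first use and reused on later calls. The following is a minimal standalone sketch under stated assumptions, not ectrans code: `WORKSPACE_SKETCH`, `ZB_WS`, `NALLOC`, and `ENSURE_WS` are hypothetical names, and `REAL64` stands in for `JPRD`.

```fortran
! Sketch only: every name in this example is hypothetical and not part of ectrans.
MODULE WORKSPACE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL64
  IMPLICIT NONE
  PRIVATE
  PUBLIC :: ZB_WS, NALLOC, ENSURE_WS

  ! Per-thread double-precision workspace (stand-in for ZB_D) and an
  ! allocation counter used only to demonstrate the reuse.
  REAL(KIND=REAL64), ALLOCATABLE, SAVE :: ZB_WS(:,:)
  INTEGER, SAVE :: NALLOC = 0
  !$OMP THREADPRIVATE(ZB_WS, NALLOC)

CONTAINS

  ! Make sure the workspace holds at least N1 x N2 elements.
  ! It only ever grows, so steady-state calls perform no allocation.
  SUBROUTINE ENSURE_WS(N1, N2)
    INTEGER, INTENT(IN) :: N1, N2
    INTEGER :: M1, M2

    M1 = N1
    M2 = N2
    IF (ALLOCATED(ZB_WS)) THEN
      IF (SIZE(ZB_WS,1) >= N1 .AND. SIZE(ZB_WS,2) >= N2) RETURN
      M1 = MAX(M1, SIZE(ZB_WS,1))
      M2 = MAX(M2, SIZE(ZB_WS,2))
      DEALLOCATE(ZB_WS)
    END IF
    ALLOCATE(ZB_WS(M1, M2))
    NALLOC = NALLOC + 1
  END SUBROUTINE ENSURE_WS

END MODULE WORKSPACE_SKETCH

PROGRAM WORKSPACE_DEMO
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL64
  USE WORKSPACE_SKETCH
  IMPLICIT NONE
  INTEGER :: JITER

  ! Mimic 50 calls that all need the same 8 x 4 block.
  DO JITER = 1, 50
    CALL ENSURE_WS(8, 4)
    ZB_WS(1:8,1:4) = REAL(JITER, KIND=REAL64)
  END DO

  IF (NALLOC /= 1) ERROR STOP 'workspace was reallocated'
  PRINT '(A,I0)', 'allocations performed: ', NALLOC
END PROGRAM WORKSPACE_DEMO
```

Because the buffer may be larger than the current `KDGLU x KIFC` block, a GEMM call using it whole would need to pass `SIZE(ZB_WS,1)` as the leading dimension rather than `KDGLU`.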

## Expected behavior

More balanced OpenMP thread execution in the single-precision CPU path when `IM=0`.

## Actual behavior

Clear thread imbalance in `LEDIR_SGEMM_*` regions, especially at scale and when GEMM is fast.

## Possible improvement

A potential optimization would be to trade memory for performance by caching the double-precision version of the `IM=0` matrix slice in the single-precision path.

For example:

- Build the double-precision slice once (e.g. in `SULEG`)
- Reuse it in `LEDIR`
- Avoid repeated allocation, copy, and conversion

This would remove:

- Repeated allocation/deallocation of `ZRPNMS`
- Repeated single → double conversion of the matrix slice
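
The caching idea above can be sketched as follows. This is a minimal standalone illustration under stated assumptions, not the ectrans implementation: the names `IM0_CACHE_SKETCH`, `BUILD_CACHE`, `LEDIR_IM0_PER_CALL`, `LEDIR_IM0_CACHED`, and `RPNMS_D` are hypothetical, `REAL32`/`REAL64` stand in for the single and double kinds, and the intrinsic `MATMUL` stands in for the `GEMM('T','N',...)` call so that the example has no BLAS dependency.

```fortran
! Sketch only: every name in this example is hypothetical and not part of ectrans.
MODULE IM0_CACHE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL32, REAL64
  IMPLICIT NONE
  PRIVATE
  PUBLIC :: BUILD_CACHE, LEDIR_IM0_PER_CALL, LEDIR_IM0_CACHED

  ! Double-precision copy of the IM=0 matrix slice, built once at setup
  ! and only read afterwards, so threads can share it.
  REAL(KIND=REAL64), ALLOCATABLE, SAVE :: RPNMS_D(:,:)

CONTAINS

  ! Setup step (the issue suggests somewhere like SULEG): convert once.
  SUBROUTINE BUILD_CACHE(RPNMS)
    REAL(KIND=REAL32), INTENT(IN) :: RPNMS(:,:)
    IF (ALLOCATED(RPNMS_D)) DEALLOCATE(RPNMS_D)
    ALLOCATE(RPNMS_D(SIZE(RPNMS,1), SIZE(RPNMS,2)))
    RPNMS_D = REAL(RPNMS, KIND=REAL64)
  END SUBROUTINE BUILD_CACHE

  ! Current pattern: allocate, convert, multiply, deallocate on every call.
  SUBROUTINE LEDIR_IM0_PER_CALL(RPNMS, ZB_D, ZCS)
    REAL(KIND=REAL32), INTENT(IN)  :: RPNMS(:,:)
    REAL(KIND=REAL64), INTENT(IN)  :: ZB_D(:,:)
    REAL(KIND=REAL32), INTENT(OUT) :: ZCS(:,:)
    REAL(KIND=REAL64), ALLOCATABLE :: ZRPNMS(:,:)
    ALLOCATE(ZRPNMS(SIZE(RPNMS,1), SIZE(RPNMS,2)))
    ZRPNMS = REAL(RPNMS, KIND=REAL64)
    ZCS = REAL(MATMUL(TRANSPOSE(ZRPNMS), ZB_D), KIND=REAL32)
    DEALLOCATE(ZRPNMS)
  END SUBROUTINE LEDIR_IM0_PER_CALL

  ! Cached pattern: reuse the double-precision slice built at setup.
  SUBROUTINE LEDIR_IM0_CACHED(ZB_D, ZCS)
    REAL(KIND=REAL64), INTENT(IN)  :: ZB_D(:,:)
    REAL(KIND=REAL32), INTENT(OUT) :: ZCS(:,:)
    ZCS = REAL(MATMUL(TRANSPOSE(RPNMS_D), ZB_D), KIND=REAL32)
  END SUBROUTINE LEDIR_IM0_CACHED

END MODULE IM0_CACHE_SKETCH

PROGRAM IM0_CACHE_DEMO
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL32, REAL64
  USE IM0_CACHE_SKETCH
  IMPLICIT NONE
  ! Toy sizes chosen for illustration only.
  INTEGER, PARAMETER :: KDGLU = 6, ILS = 4, KIFC = 3
  REAL(KIND=REAL32) :: RPNMS(KDGLU,ILS), ZCS_A(ILS,KIFC), ZCS_B(ILS,KIFC)
  REAL(KIND=REAL64) :: ZB_D(KDGLU,KIFC)

  CALL RANDOM_NUMBER(RPNMS)
  CALL RANDOM_NUMBER(ZB_D)

  CALL BUILD_CACHE(RPNMS)                      ! once, at setup
  CALL LEDIR_IM0_PER_CALL(RPNMS, ZB_D, ZCS_A)  ! current pattern
  CALL LEDIR_IM0_CACHED(ZB_D, ZCS_B)           ! cached pattern

  IF (MAXVAL(ABS(ZCS_A - ZCS_B)) > 1.0E-5_REAL32) &
    & ERROR STOP 'cached and per-call results differ'
  PRINT '(A)', 'cached and per-call results agree'
END PROGRAM IM0_CACHE_DEMO
```

The cost of this approach is one extra double-precision copy of the `IM=0` slice held in memory for the lifetime of the transform setup. The `ZB_D` and `ZCS_D` temporaries are not addressed by this sketch.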
