Update gadi config to reduce wasted compute #1

Open
Mitchob wants to merge 7 commits into main from SBP-357

@Mitchob Mitchob commented May 28, 2026

Collaborator

Summary

Resource optimisation for SBP_gadi.config, based on benchmarking of nf-core/proteinfold
processes (AlphaFold2, ColabFold, Boltz) on NCI Gadi and informed by 10 test runs using T1024 as the test case. Results are available on Google Drive. MSA generation was the process with the most room to adjust and reduce costs.

Following the TL's suggestion, I prioritised reducing SU cost based on NCI's charging model. It's possible that some steps would run faster with a lower CPU request, but we would still be charged at the same rate. Useful calculator from SIH.

Results

MMSEQS_COLABFOLDSEARCH (cpus 28 memory 256GB)

  • Previous CPU/memory settings below 20 CPUs caused OOM failures: the MMseqs2 prefilter allocates memory relative to
    available RAM and kills child workers when constrained
  • Benchmarking showed the full normalbw node (28 CPU / 252 GB, 1.25 SU/CPU-hr) is both cheaper and faster than the next smaller configuration (24 CPU / 96 GB): 77 SU / 2:10h vs 85 SU / 2:49h
  • scratch=false retained: MMseqs2 reads large databases in-place; scratch mode causes
    staging failures
  • MSA tasks run on a full normalbw node. For these samples the higher CPU count is needed only for the memory that comes with it, so 28 normalbw CPUs are cheaper than 24 normal CPUs charged for 48 CPUs' worth of memory
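
The SU comparison above can be roughly reproduced with a short calculation. This is a sketch, not the SIH calculator: it assumes jobs are charged as rate × max(CPUs requested, memory requested / memory per CPU) × walltime, that the 24 CPU / 96 GB run was also charged at the normalbw rate, and that the normal queue costs 2 SU/CPU-hr with 4 GB per CPU. The function and constant names are illustrative only.

```python
def hours(h, m):
    """Convert an h:mm walltime to decimal hours."""
    return h + m / 60


def service_units(ncpus, mem_gb, walltime_h, rate, mem_per_cpu_gb):
    """Estimate SU, charging for whichever is larger: CPUs or memory share."""
    charged_cpus = max(ncpus, mem_gb / mem_per_cpu_gb)
    return rate * charged_cpus * walltime_h


# normalbw figures from the PR: 28 CPU / 252 GB node at 1.25 SU/CPU-hr.
NORMALBW_RATE = 1.25
NORMALBW_MEM_PER_CPU = 252 / 28  # 9 GB per CPU

# Assumed normal-queue figures (not stated in the PR).
NORMAL_RATE = 2.0
NORMAL_MEM_PER_CPU = 4.0

full_node = service_units(28, 252, hours(2, 10), NORMALBW_RATE, NORMALBW_MEM_PER_CPU)
smaller = service_units(24, 96, hours(2, 49), NORMALBW_RATE, NORMALBW_MEM_PER_CPU)

# Hourly cost of 24 normal CPUs holding 48 CPUs' worth of memory (192 GB).
normal_per_hour = service_units(24, 192, 1.0, NORMAL_RATE, NORMAL_MEM_PER_CPU)
normalbw_per_hour = service_units(28, 252, 1.0, NORMALBW_RATE, NORMALBW_MEM_PER_CPU)

print(f"full normalbw node: {full_node:.1f} SU")
print(f"24 CPU / 96 GB:     {smaller:.1f} SU")
print(f"per hour: normal {normal_per_hour:.0f} SU vs normalbw {normalbw_per_hour:.0f} SU")
```

The estimates land close to the reported 77 SU and 85 SU; the small gap is expected because the walltimes here are rounded to the minute.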

RUN_ALPHAFOLD2_MSA (cpus 18 memory 72GB)

  • The previous 12 CPU / 48 GB allocation ran at 100% memory utilisation (borderline); 16 CPU / 64 GB also hit the ceiling (98.7% utilisation)
  • 18 CPU / 72 GB is the first configuration with headroom (70% memory utilisation, 50 GB used), at equivalent SU cost (~8.4 SU) and marginally faster walltime (0:22h vs 0:27h at 12 CPU)
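
The selection logic above can be sketched as picking the smallest benchmarked allocation whose peak memory utilisation stays under a threshold. The utilisation figures are the ones quoted in this PR; the 90% threshold and the function name are assumptions for illustration, not values from the benchmarking.

```python
# (cpus, memory_gb, peak memory utilisation) as reported in this PR.
BENCHMARKS = [
    (12, 48, 1.000),
    (16, 64, 0.987),
    (18, 72, 0.70),
]


def first_with_headroom(benchmarks, max_utilisation=0.90):
    """Return the smallest allocation whose utilisation is below the threshold.

    The 0.90 default is a hypothetical cut-off, not an NCI or nf-core value.
    """
    for cpus, mem_gb, utilisation in sorted(benchmarks):
        if utilisation < max_utilisation:
            return cpus, mem_gb, utilisation
    return None  # nothing benchmarked so far leaves enough headroom


choice = first_with_headroom(BENCHMARKS)
print(choice)
```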

Notes

  • All GPU processes set to gpuhopper as per TL suggestion based on prior benchmarking
  • Hugemem queue benchmarking for MMSEQS has not yet been performed. The full normalbw node is currently the optimal configuration, but hugemem may warrant testing because it allows a higher memory request with fewer CPUs. It's double the SU rate, so it would need to make up the difference through walltime savings. This would be particularly useful if a higher CPU count actually decreases efficiency or increases walltime.
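
To make the hugemem trade-off concrete, the break-even walltime can be estimated as follows. This is a sketch under stated assumptions: the hugemem rate is taken as double the normalbw rate (2 × 1.25 = 2.5 SU/CPU-hr) purely from the note above and should be checked against NCI's published rates, and the 12 charged CPUs are a hypothetical request, not a benchmarked one.

```python
def break_even_walltime(baseline_su, rate, charged_cpus):
    """Longest walltime (hours) at which a job costs no more than baseline_su."""
    return baseline_su / (rate * charged_cpus)


BASELINE_SU = 77             # full normalbw node run reported in this PR
HUGEMEM_RATE = 2 * 1.25      # assumption: "double the SU rate" of normalbw
HYPOTHETICAL_CPUS = 12       # assumption: CPUs charged for on hugemem

limit_h = break_even_walltime(BASELINE_SU, HUGEMEM_RATE, HYPOTHETICAL_CPUS)
print(f"hugemem must finish within {limit_h:.2f} h to match {BASELINE_SU} SU")
```

Note that the charged CPU count on hugemem would itself depend on the memory requested, so a real comparison needs the queue's memory-per-CPU figure as well.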

@Mitchob Mitchob requested a review from amandazhuyilan May 28, 2026 23:19
Comment thread proteinfold.config
Comment thread proteinfold.config Outdated

@vtnphan vtnphan left a comment

Collaborator


Thank you for making these so beautiful. I have some questions about defining parameters in params { } and assigning them in process later.

@vtnphan vtnphan self-requested a review June 11, 2026 00:31