Custom CUDA implementation of a Gaussian Process for the sweep by vyeoms · Pull Request #587 · PufferAI/PufferLib

vyeoms · 2026-06-14T21:47:09Z

Summary

A custom CUDA C implementation of a Gaussian Process to begin porting Protein to CUDA C:

src/gp_cuda.cu implements the core logic for GP regression. Operations are implemented in single precision, still agrees to ~1e-3 precision with gpytorch. Can change to double precision if desired, but that's roughly 2x-4x the computational cost.
src/gp_cuda_kernel.cu is for the GP covariance kernel. For the time being it only implements the current kernel used by Protein (Matern 3/2+linear), but keeping this separate could be useful if in the future we want to change the kernel.
Modified src/bindings.cu and pufferlib/sweep.py to adopt the custom GP, respectively. As it is right now, running sweeps on CPU would be broken since the GP is only implemented for CUDA, since I would think that sweeps make most sense with a GPU environment.
- If desired, I have a separate implementation in C in this repo, but it introduces more dependencies on either CBLAS or LAPACKE (can be chosen). Alternatively, could potentially just adjust the import calls and use gpytorch for CPU.
Added tests:
- tests/test_custom_gp_numerics.py to verify numerical evaluation of this implementation vs gpytorch. Single precision agrees to ~1e-3, double precision agrees to ~1e-9
- tests/test_custom_gp_optimpy to verify the GP implementations get updated more or less the same. Parameters agree to ~1e-3.
Updated tests/test_sweep.py to use Puffer 3.0 format

Numerical and qualitative results

Output of `tests/test_custom_gp_numerics.py`

Running in my laptop with an A1000 GPU

python3 tests/test_custom_gp_numerics.py
<GaussianProcess dim=1 n=0 cap=100 ell=1 sf=1 noise=0.01>

====================================================================================
  n=100  dim=1
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0138s  0.0538s  0.0676s       0.4207  1.21e-03  4.35e-03
  GPyTorch                0.1798s  0.0086s  0.1884s       0.4222  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=5 n=0 cap=100 ell=[1..1] sf=1 noise=0.01>

====================================================================================
  n=100  dim=5
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0007s  0.0012s  0.0019s      -1.3530  7.89e-05  7.27e-05
  GPyTorch                0.0014s  0.0044s  0.0058s      -1.3512  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=1 n=0 cap=200 ell=1 sf=1 noise=0.01>

====================================================================================
  n=200  dim=1
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0008s  0.0035s  0.0042s       0.5845  1.22e-03  4.30e-03
  GPyTorch                0.0013s  0.0114s  0.0126s       0.5852  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=5 n=0 cap=200 ell=[1..1] sf=1 noise=0.01>

====================================================================================
  n=200  dim=5
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0009s  0.0013s  0.0021s      -1.2414  7.64e-05  4.24e-05
  GPyTorch                0.0018s  0.0061s  0.0078s      -1.2405  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=1 n=0 cap=1000 ell=1 sf=1 noise=0.01>

====================================================================================
  n=1000  dim=1
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0013s  0.0017s  0.0030s       0.7854  1.57e-03  3.56e-03
  GPyTorch                0.0012s  0.0501s  0.0513s       0.7856  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=5 n=0 cap=1000 ell=[1..1] sf=1 noise=0.01>

====================================================================================
  n=1000  dim=5
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0014s  0.0017s  0.0031s      -1.1051  5.33e-05  1.69e-04
  GPyTorch                0.0014s  0.0444s  0.0458s      -1.1049  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=1 n=0 cap=2000 ell=1 sf=1 noise=0.01>

====================================================================================
  n=2000  dim=1
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0035s  0.0025s  0.0060s       0.8255  2.78e-03  3.45e-03
  GPyTorch                0.0017s  0.2396s  0.2413s       0.8256  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)
<GaussianProcess dim=5 n=0 cap=2000 ell=[1..1] sf=1 noise=0.01>

====================================================================================
  n=2000  dim=5
====================================================================================
                            train  predict    total          MLL |err_mean|  |err_var|
  CUDA                    0.0035s  0.0024s  0.0060s      -1.0348  5.27e-04  4.14e-04
  GPyTorch                0.0014s  0.1602s  0.1616s      -1.0347  0.00e+00  0.00e+00
                         (GPyTorch: exact Cholesky forced for n>800, errors vs GPyTorch)

Output of `tests/test_custom_gp_optimpy`

Running in my laptop with an A1000 GPU

python3 tests/test_custom_gp_optim.py 

step   MLL (ours)   MLL (gpyt)    |dMLL|   ell_ours   ell_gpyt   noise_ours   noise_gpyt
------------------------------------------------------------------------------------------------------
   0     -0.01350     -0.01350  2.77e-06     1.0000     1.0319     0.010000     0.009519
  10      0.08714      0.08716  2.49e-05     1.3407     1.3766     0.008839     0.008702
  20      0.09407      0.09409  1.80e-05     1.6936     1.7266     0.007376     0.007327
  30      0.28398      0.28396  1.99e-05     1.9821     2.0041     0.007577     0.007637
  40      0.26798      0.26799  4.01e-06     2.1305     2.1381     0.007893     0.007919
  50      0.20838      0.20839  1.16e-05     2.1709     2.1711     0.007845     0.007886
  59      0.13616      0.13615  1.48e-05     2.1970     2.2010     0.007777     0.007847

Final hyperparameters:
  lengthscale   ours=2.196962  gpyt=2.201036  |d|=4.07e-03
  outputscale   ours=0.506837  gpyt=0.512420  |d|=5.58e-03
  noise         ours=0.007777  gpyt=0.007847  |d|=6.99e-05
  offset        ours=0.113545  gpyt=0.109945  |d|=3.60e-03

Gradient agreement check (fresh models, same batch, same params):
  raw_ell_0         ours=-0.225116  gpyt=-0.225130  |d|=1.36e-05
  raw_outputscale   ours=+0.119949  gpyt=+0.119963  |d|=1.38e-05
  raw_noise         ours=+0.091917  gpyt=+0.091916  |d|=1.13e-06
  raw_offset        ours=+0.006617  gpyt=+0.006617  |d|=1.82e-07

Running `tests/test_sweep.py`

For more experiments see this thread on discord. Running 200 iters on my laptop with an A1000 GPU.

gpytorch

Last 4 iterations:

Predicted --  Score: 1998.153 Cost: 1998.238 Rating: 0.441
196th sweep.suggest() took 6.0027s, score_loss: -2.6922, cost_loss: -3.0726
score_lengthscale: (0.29905853460017584, 10.167150127777367), cost_lengthscale: (7.47679755462468, 9.74540144925938)
Score: 399.5986720385235 Cost: 400.0
Score: 799.197344077047 Cost: 800.0
Score: 1198.7960161155704 Cost: 1200.0
Score: 1598.394688154094 Cost: 1600.0
Score: 1997.9933601926175 Cost: 2000.0
Predicted --  Score: 1996.813 Cost: 1998.563 Rating: 0.456
197th sweep.suggest() took 5.9045s, score_loss: -2.7016, cost_loss: -3.0819
score_lengthscale: (0.2995976852471699, 10.218646419804578), cost_lengthscale: (7.526265068733572, 9.796569932002358)
Score: 399.47347948362307 Cost: 400.0
Score: 798.9469589672461 Cost: 800.0
Score: 1198.4204384508691 Cost: 1200.0
Score: 1597.8939179344923 Cost: 1600.0
Score: 1997.3673974181152 Cost: 2000.0
Predicted --  Score: 1989.399 Cost: 1994.990 Rating: 0.666
198th sweep.suggest() took 6.1254s, score_loss: -2.7136, cost_loss: -3.0921
score_lengthscale: (0.29935269245337226, 10.270866115980953), cost_lengthscale: (7.574612105626908, 9.847318390411866)
Score: 399.90292790988974 Cost: 400.0
Score: 799.8058558197795 Cost: 800.0
Score: 1199.7087837296692 Cost: 1200.0
Score: 1599.611711639559 Cost: 1600.0
Score: 1999.5146395494487 Cost: 2000.0
Resetting GP optimizers at suggestion 200
Predicted --  Score: 1995.713 Cost: 1995.764 Rating: 0.755
199th sweep.suggest() took 5.8800s, score_loss: -2.7234, cost_loss: -3.1010
score_lengthscale: (0.2938725521362531, 10.323430282364289), cost_lengthscale: (7.623147214651303, 9.897310807614435)
Score: 399.04383487610795 Cost: 400.0
Score: 798.0876697522159 Cost: 800.0
Score: 1197.131504628324 Cost: 1200.0
Score: 1596.1753395044318 Cost: 1600.0
Score: 1995.21917438054 Cost: 2000.0

This implementation

Last 4 iterations:

Predicted --  Score: 1966.847 Cost: 1941.924 Rating: 0.512
196th sweep.suggest() took 1.3605s, score_loss: -1.9132, cost_loss: -2.0066
score_lengthscale: (0.260856956243515, 9.255864143371582), cost_lengthscale: (4.239974498748779, 8.621585845947266)
Score: 399.3732635476037 Cost: 400.0
Score: 798.7465270952074 Cost: 800.0
Score: 1198.1197906428113 Cost: 1200.0
Score: 1597.4930541904148 Cost: 1600.0
Score: 1996.8663177380186 Cost: 2000.0
Predicted --  Score: 1971.421 Cost: 1973.673 Rating: 0.446
197th sweep.suggest() took 1.3495s, score_loss: -1.9137, cost_loss: -2.0063
score_lengthscale: (0.2555449604988098, 9.292190551757812), cost_lengthscale: (4.275691986083984, 8.656906127929688)
Score: 399.9670546182794 Cost: 400.0
Score: 799.9341092365588 Cost: 800.0
Score: 1199.9011638548382 Cost: 1200.0
Score: 1599.8682184731176 Cost: 1600.0
Score: 1999.835273091397 Cost: 2000.0
Predicted --  Score: 1946.673 Cost: 1854.647 Rating: 0.413
198th sweep.suggest() took 1.3514s, score_loss: -1.9151, cost_loss: -2.0067
score_lengthscale: (0.25784075260162354, 9.327220916748047), cost_lengthscale: (4.261244773864746, 8.691877365112305)
Score: 399.9138531460248 Cost: 400.0
Score: 799.8277062920496 Cost: 800.0
Score: 1199.7415594380743 Cost: 1200.0
Score: 1599.6554125840992 Cost: 1600.0
Score: 1999.5692657301238 Cost: 2000.0
Resetting GP optimizers at suggestion 200
Predicted --  Score: -114.115 Cost: 0.565 Rating: 0.057
199th sweep.suggest() took 1.3616s, score_loss: -1.9141, cost_loss: -2.0066
score_lengthscale: (0.2558309733867645, 9.36167049407959), cost_lengthscale: (4.284342288970947, 8.727143287658691)
Score: 0.0031129867384367457 Cost: 0.11999999999999955
Score: 0.006225973476873491 Cost: 0.2399999999999991
Score: 0.009338960215310237 Cost: 0.35999999999999865
Score: 0.012451946953746983 Cost: 0.4799999999999982
Score: 0.015564933692183728 Cost: 0.5999999999999978

Sweeping breakout

Constellation output with 60 iters:

Seems to play ok with the top score hyperparams:

This change defines __setstate__ and __getstate__ in Protein to handle CUDA graph capture in child processes spawned by multiprocessing. The child processes don't handle GP training or updates, so they don't need to be calling CudaMalloc. Stripping the CUDA-heavy parameters for the child processes reduces the graph capture load

vyeoms · 2026-06-14T21:48:41Z

Can test more envs if desired, let me know here. Also can tell me if other tests would be good to have

vyeoms added 5 commits June 11, 2026 12:00

(bugfix) Update test_sweep.py to use current version of pufferl

cfa9ff1

Add custom CUDA Gaussian Process implementation

87bdd03

Use custom CUDA GP implementation in sweep

ae21fd0

Test custom CUDA GP to verify numeric accuracy with gpytorch

92a2361

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Custom CUDA implementation of a Gaussian Process for the sweep#587

Custom CUDA implementation of a Gaussian Process for the sweep#587
vyeoms wants to merge 5 commits into
PufferAI:4.0from
vyeoms:port/c_gp

vyeoms commented Jun 14, 2026

Uh oh!

vyeoms commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vyeoms commented Jun 14, 2026

Summary

Numerical and qualitative results

Output of tests/test_custom_gp_numerics.py

Output of tests/test_custom_gp_optimpy

Running tests/test_sweep.py

gpytorch

This implementation

Sweeping breakout

Uh oh!

vyeoms commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Output of `tests/test_custom_gp_numerics.py`

Output of `tests/test_custom_gp_optimpy`

Running `tests/test_sweep.py`