Skip to content

fix(CSGOptiX): compile Release PTX with production definitions#380

Merged
plexoos merged 2 commits into
mainfrom
perf-fix-ptx-release
Jun 25, 2026
Merged

fix(CSGOptiX): compile Release PTX with production definitions#380
plexoos merged 2 commits into
mainfrom
perf-fix-ptx-release

Conversation

@plexoos

@plexoos plexoos commented Jun 24, 2026

Copy link
Copy Markdown
Member

Release PTX was compiled with Debug macros and without PRODUCTION. This updates the CSGOptiX7_ptx
compile definitions to be build-config aware.

Previously, the OptiX PTX target always received Debug-oriented definitions, even in Release builds:

  • CONFIG_Debug
  • DEBUG_TAG
  • DEBUG_PIDX
  • DEBUG_PIDXYZ

That meant Release PTX still compiled debug paths and did not define PRODUCTION. This PR switches
those definitions to CMake generator expressions so Release PTX gets:

  • CONFIG_Release
  • PRODUCTION

while Debug builds keep the debug markers.

CSGOptiX7.cu gates expensive simulation instrumentation behind #ifndef PRODUCTION and
debug-specific blocks. Because Release PTX was compiled as if it were Debug, the generated raygen
program carried substantially more code and local state than intended.

Measured on this PR branch.

Baseline:

  • build type: Release
  • benchmark config: config/dev.json
  • geometry: tests/geom/raindrop.gdml
  • macro: tests/run.mac with /run/beamOn controlled by EVENTS (default 1)
  • photons: controlled by PHOTONS in a temporary config derived from config/dev.json (default 100000)
  • metric: QSim::simulate tot_dt, the summed CSGOptiX::simulate_launch() time around optixLaunch + cudaDeviceSynchronize

Five-run results with EVENTS=1:

photons/event build mean tot_dt stddev relative
100,000 before 2.241 ms 0.067 ms 1.00x
100,000 after 0.256 ms 0.001 ms 8.77x faster
1,000,000 before 3.600 ms 0.107 ms 1.00x
1,000,000 after 1.226 ms 0.005 ms 2.94x faster

The 100,000 photon workload shows a 90% reduction in measured OptiX launch/sync time. The
1,000,000 photon scaling check shows a smaller 65% reduction because the removed debug overhead
is not purely proportional to photon count; as photon count increases, per-photon simulation work
becomes a larger share of tot_dt.

A multi-event run in one process also amortizes the large first-event difference. With EVENTS=5
and PHOTONS=100000, summing all five QSim::simulate timings gave 3.445 ms -> 1.193 ms
(2.89x faster), while the last event alone was 0.330 ms -> 0.231 ms (1.43x faster).

Wall time is not the primary metric here because it includes Geant4 startup and event plumbing. The
internal launch timing isolates the path affected by the PTX compile definitions.

Before, Release PTX still used Debug definitions:

-DCONFIG_Debug -DDEBUG_PIDX -DDEBUG_PIDXYZ -DDEBUG_TAG

After, Release PTX uses Release/production definitions:

-DCONFIG_Release -DPRODUCTION

Generated PTX changed accordingly:

artifact before after
PTX size 2,092,409 bytes 1,946,286 bytes
PTX line count 36,199 31,920
raygen local depot __local_depot1[928] __local_depot1[64]

The generated raygen PTX local-memory frame shrank from 928 bytes per launched thread to 64 bytes,
showing that Release PTX no longer carries the debug-only per-thread tracing/recording state.

BENCH_ROOT=/tmp/simphony-perf
BEFORE_SRC=$BENCH_ROOT/src-before
BEFORE_BUILD=$BENCH_ROOT/build-before-release
AFTER_SRC=/workspaces/simphony
AFTER_BUILD=$BENCH_ROOT/build-after-release

git worktree add --detach "$BEFORE_SRC" HEAD^

cmake -S "$BEFORE_SRC" -B "$BEFORE_BUILD" -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake -S "$AFTER_SRC"  -B "$AFTER_BUILD"  -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

cmake --build "$BEFORE_BUILD" --target simg4ox -j"$(nproc)"
cmake --build "$AFTER_BUILD"  --target simg4ox -j"$(nproc)"
ulimit -c 0

EVENTS=${EVENTS:-1}
PHOTONS=${PHOTONS:-100000}

run_one() {
  label=$1
  iter=$2
  src=$3
  build=$4
  work=$BENCH_ROOT/run-${label}-${iter}
  log=$BENCH_ROOT/run-${label}-${iter}.log
  macro=$work/run.mac
  config_dir=$work/config
  config_name=dev
  config=$config_dir/${config_name}.json

  mkdir -p "$work" "$config_dir"
  sed -E "s#^/run/beamOn .*#/run/beamOn ${EVENTS}#" "$src/tests/run.mac" > "$macro"
  sed -E "s#(\"numphoton\"[[:space:]]*:[[:space:]]*)[0-9]+#\1${PHOTONS}#" "$src/config/dev.json" > "$config"

  cd "$work" || return 99

  t0=$(date +%s.%N)
  env \
    CUDA_VISIBLE_DEVICES=0 \
    OPTICKS_HOME="$src" \
    GPHOX_CONFIG_DIR="$config_dir" \
    PYTHONPATH="$src" \
    SEvt__MINTIME=1 \
    SEvt__MINIMAL=1 \
    "$build/src/simg4ox" \
      -g "$src/tests/geom/raindrop.gdml" \
      -m "$macro" \
      -c "$config_name" \
      >"$log" 2>&1
  rc=$?
  t1=$(date +%s.%N)

  wall=$(awk -v a="$t0" -v b="$t1" 'BEGIN{printf "%.6f", b-a}')
  read -r dt idt < <(
    awk '
      /QSim::simulate@603/ {
        events++
        for (i = 1; i <= NF; i++) if ($i == "tot_dt") dt += $(i + 1)
      }
      /tot_idt\/M/ {
        for (i = 1; i <= NF; i++) if ($i == "tot_idt/M") idt += $(i + 1)
      }
      END {
        if (events == 0) print "NA NA"
        else printf "%.6f %.6f\n", dt, idt
      }
    ' "$log"
  )

  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$label" "$iter" "$rc" "${dt:-NA}" "${idt:-NA}" "$wall" "$log"
}

for i in 1 2 3 4 5; do
  run_one before "$i" "$BEFORE_SRC" "$BEFORE_BUILD"
  run_one after  "$i" "$AFTER_SRC"  "$AFTER_BUILD"
done

Measurement invocation:

EVENTS=1 \
PHOTONS=100000 \
bash perf.sh | tee /tmp/simphony-perf-dev-7a976b4e6/results.tsv

EVENTS=1 \
PHOTONS=1000000 \
bash perf.sh | tee /tmp/simphony-perf/results.tsv

PTX verification commands:

rg -n "CUDA_DEFINES" \
  "$BEFORE_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make" \
  "$AFTER_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make"

ls -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"

wc -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"

rg -n "__local_depot1" "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"

Copilot AI review requested due to automatic review settings June 24, 2026 02:12
Release PTX was compiled with Debug macros and without PRODUCTION. This updates the `CSGOptiX7_ptx`
compile definitions to be build-config aware.

Previously, the OptiX PTX target always received Debug-oriented definitions, even in `Release` builds:

- `CONFIG_Debug`
- `DEBUG_TAG`
- `DEBUG_PIDX`
- `DEBUG_PIDXYZ`

That meant Release PTX still compiled debug paths and did not define `PRODUCTION`. This PR switches
those definitions to CMake generator expressions so Release PTX gets:

- `CONFIG_Release`
- `PRODUCTION`

while Debug builds keep the debug markers.

`CSGOptiX7.cu` gates expensive simulation instrumentation behind `#ifndef PRODUCTION` and
debug-specific blocks. Because Release PTX was compiled as if it were Debug, the generated raygen
program carried substantially more code and local state than intended.

Measured on this PR branch.

Baseline:

- build type: `Release`
- benchmark config: `config/dev.json`
- geometry: `tests/geom/raindrop.gdml`
- macro: `tests/run.mac` with `/run/beamOn` controlled by `EVENTS` (default `1`)
- photons: controlled by `PHOTONS` in a temporary config derived from `config/dev.json` (default `100000`)
- metric: `QSim::simulate tot_dt`, the summed `CSGOptiX::simulate_launch()` time around `optixLaunch + cudaDeviceSynchronize`

Five-run results with `EVENTS=1`:

| photons/event | build | mean `tot_dt` | stddev | relative |
| ---: | --- | ---: | ---: | ---: |
| 100,000 | before | 2.241 ms | 0.067 ms | 1.00x |
| 100,000 | after | 0.256 ms | 0.001 ms | 8.77x faster |
| 1,000,000 | before | 3.600 ms | 0.107 ms | 1.00x |
| 1,000,000 | after | 1.226 ms | 0.005 ms | 2.94x faster |

The `100,000` photon workload shows a 90% reduction in measured OptiX launch/sync time. The
`1,000,000` photon scaling check shows a smaller 65% reduction because the removed debug overhead
is not purely proportional to photon count; as photon count increases, per-photon simulation work
becomes a larger share of `tot_dt`.

A multi-event run in one process also amortizes the large first-event difference. With `EVENTS=5`
and `PHOTONS=100000`, summing all five `QSim::simulate` timings gave `3.445 ms -> 1.193 ms`
(`2.89x` faster), while the last event alone was `0.330 ms -> 0.231 ms` (`1.43x` faster).

Wall time is not the primary metric here because it includes Geant4 startup and event plumbing. The
internal launch timing isolates the path affected by the PTX compile definitions.

Before, Release PTX still used Debug definitions:

```text
-DCONFIG_Debug -DDEBUG_PIDX -DDEBUG_PIDXYZ -DDEBUG_TAG
```

After, Release PTX uses Release/production definitions:

```text
-DCONFIG_Release -DPRODUCTION
```

Generated PTX changed accordingly:

| artifact           | before                | after                |
| ---                | ---:                  | ---:                 |
| PTX size           | 2,092,409 bytes       | 1,946,286 bytes      |
| PTX line count     | 36,199                | 31,920               |
| raygen local depot | `__local_depot1[928]` | `__local_depot1[64]` |

The generated raygen PTX local-memory frame shrank from 928 bytes per launched thread to 64 bytes,
showing that Release PTX no longer carries the debug-only per-thread tracing/recording state.

```bash
BENCH_ROOT=/tmp/simphony-perf
BEFORE_SRC=$BENCH_ROOT/src-before
BEFORE_BUILD=$BENCH_ROOT/build-before-release
AFTER_SRC=/workspaces/simphony
AFTER_BUILD=$BENCH_ROOT/build-after-release

git worktree add --detach "$BEFORE_SRC" HEAD^

cmake -S "$BEFORE_SRC" -B "$BEFORE_BUILD" -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake -S "$AFTER_SRC"  -B "$AFTER_BUILD"  -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

cmake --build "$BEFORE_BUILD" --target simg4ox -j"$(nproc)"
cmake --build "$AFTER_BUILD"  --target simg4ox -j"$(nproc)"
```

```bash
ulimit -c 0

EVENTS=${EVENTS:-1}
PHOTONS=${PHOTONS:-100000}

run_one() {
  label=$1
  iter=$2
  src=$3
  build=$4
  work=$BENCH_ROOT/run-${label}-${iter}
  log=$BENCH_ROOT/run-${label}-${iter}.log
  macro=$work/run.mac
  config_dir=$work/config
  config_name=dev
  config=$config_dir/${config_name}.json

  mkdir -p "$work" "$config_dir"
  sed -E "s#^/run/beamOn .*#/run/beamOn ${EVENTS}#" "$src/tests/run.mac" > "$macro"
  sed -E "s#(\"numphoton\"[[:space:]]*:[[:space:]]*)[0-9]+#\1${PHOTONS}#" "$src/config/dev.json" > "$config"

  cd "$work" || return 99

  t0=$(date +%s.%N)
  env \
    CUDA_VISIBLE_DEVICES=0 \
    OPTICKS_HOME="$src" \
    GPHOX_CONFIG_DIR="$config_dir" \
    PYTHONPATH="$src" \
    SEvt__MINTIME=1 \
    SEvt__MINIMAL=1 \
    "$build/src/simg4ox" \
      -g "$src/tests/geom/raindrop.gdml" \
      -m "$macro" \
      -c "$config_name" \
      >"$log" 2>&1
  rc=$?
  t1=$(date +%s.%N)

  wall=$(awk -v a="$t0" -v b="$t1" 'BEGIN{printf "%.6f", b-a}')
  read -r dt idt < <(
    awk '
      /QSim::simulate@603/ {
        events++
        for (i = 1; i <= NF; i++) if ($i == "tot_dt") dt += $(i + 1)
      }
      /tot_idt\/M/ {
        for (i = 1; i <= NF; i++) if ($i == "tot_idt/M") idt += $(i + 1)
      }
      END {
        if (events == 0) print "NA NA"
        else printf "%.6f %.6f\n", dt, idt
      }
    ' "$log"
  )

  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$label" "$iter" "$rc" "${dt:-NA}" "${idt:-NA}" "$wall" "$log"
}

for i in 1 2 3 4 5; do
  run_one before "$i" "$BEFORE_SRC" "$BEFORE_BUILD"
  run_one after  "$i" "$AFTER_SRC"  "$AFTER_BUILD"
done
```

Measurement invocation:

```bash
EVENTS=1 \
PHOTONS=100000 \
bash perf.sh | tee /tmp/simphony-perf-dev-7a976b4e6/results.tsv

EVENTS=1 \
PHOTONS=1000000 \
bash perf.sh | tee /tmp/simphony-perf/results.tsv
```

PTX verification commands:

```bash
rg -n "CUDA_DEFINES" \
  "$BEFORE_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make" \
  "$AFTER_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make"

ls -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"

wc -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"

rg -n "__local_depot1" "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"
```
@plexoos plexoos force-pushed the perf-fix-ptx-release branch from 62b166b to 1019c22 Compare June 24, 2026 02:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the OptiX PTX build configuration so CSGOptiX7_ptx is compiled with build-config-appropriate preprocessor definitions (notably defining PRODUCTION in Release and keeping debug-only markers in Debug). This aligns device PTX compilation with the host build configuration and avoids unintentionally compiling expensive debug/instrumentation paths into Release PTX.

Changes:

  • Replace hard-coded Debug compile definitions for CSGOptiX7_ptx with CMake generator expressions keyed on the active configuration.
  • Define PRODUCTION for Release PTX builds while keeping DEBUG_* markers gated to Debug.
  • Add CONFIG_* defines for all standard CMake build types to the PTX target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CSGOptiX/CMakeLists.txt
Comment on lines +46 to +56
$<$<CONFIG:Debug>:CONFIG_Debug>
$<$<CONFIG:RelWithDebInfo>:CONFIG_RelWithDebInfo>
$<$<CONFIG:Release>:CONFIG_Release>
$<$<CONFIG:MinSizeRel>:CONFIG_MinSizeRel>
OPTICKS_SYSRAP
PLOG_LOCAL
RNG_PHILOX
$<$<CONFIG:Debug>:DEBUG_TAG>
$<$<CONFIG:Debug>:DEBUG_PIDX>
$<$<CONFIG:Debug>:DEBUG_PIDXYZ>
$<$<CONFIG:Release>:PRODUCTION>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62b166b526

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread CSGOptiX/CMakeLists.txt
$<$<CONFIG:Debug>:DEBUG_TAG>
$<$<CONFIG:Debug>:DEBUG_PIDX>
$<$<CONFIG:Debug>:DEBUG_PIDXYZ>
$<$<CONFIG:Release>:PRODUCTION>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use config-specific PTX output paths

When using a multi-config generator, these config-specific definitions make the Debug and Release PTX semantically different, but csgoptix_ptx still copies every config to the single ${PROJECT_BINARY_DIR}/ptx/CSGOptiX7.ptx and sysrap/config_path.h.in:4 makes runtime lookup use that shared directory. After building one config and then running a binary from another, the binary loads whichever PTX was copied last, so Release can still run non-PRODUCTION PTX or Debug can silently lose the DEBUG_* instrumentation. Please include the config in the PTX output/search path or otherwise keep the artifacts separate.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue to be addressed later

@plexoos plexoos self-assigned this Jun 24, 2026
@plexoos plexoos added bug Something isn't working build Build-system or toolchain configuration problems labels Jun 24, 2026
@plexoos plexoos force-pushed the perf-fix-ptx-release branch from 533f722 to 8610cab Compare June 25, 2026 21:28
@plexoos plexoos merged commit 6868547 into main Jun 25, 2026
13 checks passed
@plexoos plexoos deleted the perf-fix-ptx-release branch June 25, 2026 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working build Build-system or toolchain configuration problems

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants