fix(CSGOptiX): compile Release PTX with production definitions#380
Conversation
Release PTX was compiled with Debug macros and without PRODUCTION. This updates the `CSGOptiX7_ptx`
compile definitions to be build-config aware.
Previously, the OptiX PTX target always received Debug-oriented definitions, even in `Release` builds:
- `CONFIG_Debug`
- `DEBUG_TAG`
- `DEBUG_PIDX`
- `DEBUG_PIDXYZ`
That meant Release PTX still compiled debug paths and did not define `PRODUCTION`. This PR switches
those definitions to CMake generator expressions so Release PTX gets:
- `CONFIG_Release`
- `PRODUCTION`
while Debug builds keep the debug markers.
`CSGOptiX7.cu` gates expensive simulation instrumentation behind `#ifndef PRODUCTION` and
debug-specific blocks. Because Release PTX was compiled as if it were Debug, the generated raygen
program carried substantially more code and local state than intended.
Measured on this PR branch.
Baseline:
- build type: `Release`
- benchmark config: `config/dev.json`
- geometry: `tests/geom/raindrop.gdml`
- macro: `tests/run.mac` with `/run/beamOn` controlled by `EVENTS` (default `1`)
- photons: controlled by `PHOTONS` in a temporary config derived from `config/dev.json` (default `100000`)
- metric: `QSim::simulate tot_dt`, the summed `CSGOptiX::simulate_launch()` time around `optixLaunch + cudaDeviceSynchronize`
Five-run results with `EVENTS=1`:
| photons/event | build | mean `tot_dt` | stddev | relative |
| ---: | --- | ---: | ---: | ---: |
| 100,000 | before | 2.241 ms | 0.067 ms | 1.00x |
| 100,000 | after | 0.256 ms | 0.001 ms | 8.77x faster |
| 1,000,000 | before | 3.600 ms | 0.107 ms | 1.00x |
| 1,000,000 | after | 1.226 ms | 0.005 ms | 2.94x faster |
The `100,000` photon workload shows a 90% reduction in measured OptiX launch/sync time. The
`1,000,000` photon scaling check shows a smaller 65% reduction because the removed debug overhead
is not purely proportional to photon count; as photon count increases, per-photon simulation work
becomes a larger share of `tot_dt`.
A multi-event run in one process also amortizes the large first-event difference. With `EVENTS=5`
and `PHOTONS=100000`, summing all five `QSim::simulate` timings gave `3.445 ms -> 1.193 ms`
(`2.89x` faster), while the last event alone was `0.330 ms -> 0.231 ms` (`1.43x` faster).
Wall time is not the primary metric here because it includes Geant4 startup and event plumbing. The
internal launch timing isolates the path affected by the PTX compile definitions.
Before, Release PTX still used Debug definitions:
```text
-DCONFIG_Debug -DDEBUG_PIDX -DDEBUG_PIDXYZ -DDEBUG_TAG
```
After, Release PTX uses Release/production definitions:
```text
-DCONFIG_Release -DPRODUCTION
```
Generated PTX changed accordingly:
| artifact | before | after |
| --- | ---: | ---: |
| PTX size | 2,092,409 bytes | 1,946,286 bytes |
| PTX line count | 36,199 | 31,920 |
| raygen local depot | `__local_depot1[928]` | `__local_depot1[64]` |
The generated raygen PTX local-memory frame shrank from 928 bytes per launched thread to 64 bytes,
showing that Release PTX no longer carries the debug-only per-thread tracing/recording state.
```bash
BENCH_ROOT=/tmp/simphony-perf
BEFORE_SRC=$BENCH_ROOT/src-before
BEFORE_BUILD=$BENCH_ROOT/build-before-release
AFTER_SRC=/workspaces/simphony
AFTER_BUILD=$BENCH_ROOT/build-after-release
git worktree add --detach "$BEFORE_SRC" HEAD^
cmake -S "$BEFORE_SRC" -B "$BEFORE_BUILD" -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake -S "$AFTER_SRC" -B "$AFTER_BUILD" -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake --build "$BEFORE_BUILD" --target simg4ox -j"$(nproc)"
cmake --build "$AFTER_BUILD" --target simg4ox -j"$(nproc)"
```
```bash
ulimit -c 0
EVENTS=${EVENTS:-1}
PHOTONS=${PHOTONS:-100000}
run_one() {
label=$1
iter=$2
src=$3
build=$4
work=$BENCH_ROOT/run-${label}-${iter}
log=$BENCH_ROOT/run-${label}-${iter}.log
macro=$work/run.mac
config_dir=$work/config
config_name=dev
config=$config_dir/${config_name}.json
mkdir -p "$work" "$config_dir"
sed -E "s#^/run/beamOn .*#/run/beamOn ${EVENTS}#" "$src/tests/run.mac" > "$macro"
sed -E "s#(\"numphoton\"[[:space:]]*:[[:space:]]*)[0-9]+#\1${PHOTONS}#" "$src/config/dev.json" > "$config"
cd "$work" || return 99
t0=$(date +%s.%N)
env \
CUDA_VISIBLE_DEVICES=0 \
OPTICKS_HOME="$src" \
GPHOX_CONFIG_DIR="$config_dir" \
PYTHONPATH="$src" \
SEvt__MINTIME=1 \
SEvt__MINIMAL=1 \
"$build/src/simg4ox" \
-g "$src/tests/geom/raindrop.gdml" \
-m "$macro" \
-c "$config_name" \
>"$log" 2>&1
rc=$?
t1=$(date +%s.%N)
wall=$(awk -v a="$t0" -v b="$t1" 'BEGIN{printf "%.6f", b-a}')
read -r dt idt < <(
awk '
/QSim::simulate@603/ {
events++
for (i = 1; i <= NF; i++) if ($i == "tot_dt") dt += $(i + 1)
}
/tot_idt\/M/ {
for (i = 1; i <= NF; i++) if ($i == "tot_idt/M") idt += $(i + 1)
}
END {
if (events == 0) print "NA NA"
else printf "%.6f %.6f\n", dt, idt
}
' "$log"
)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
"$label" "$iter" "$rc" "${dt:-NA}" "${idt:-NA}" "$wall" "$log"
}
for i in 1 2 3 4 5; do
run_one before "$i" "$BEFORE_SRC" "$BEFORE_BUILD"
run_one after "$i" "$AFTER_SRC" "$AFTER_BUILD"
done
```
Measurement invocation:
```bash
EVENTS=1 \
PHOTONS=100000 \
bash perf.sh | tee /tmp/simphony-perf-dev-7a976b4e6/results.tsv
EVENTS=1 \
PHOTONS=1000000 \
bash perf.sh | tee /tmp/simphony-perf/results.tsv
```
PTX verification commands:
```bash
rg -n "CUDA_DEFINES" \
"$BEFORE_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make" \
"$AFTER_BUILD/CSGOptiX/CMakeFiles/CSGOptiX7_ptx.dir/flags.make"
ls -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"
wc -l "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"
rg -n "__local_depot1" "$BEFORE_BUILD/ptx/CSGOptiX7.ptx" "$AFTER_BUILD/ptx/CSGOptiX7.ptx"
```
62b166b to
1019c22
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes the OptiX PTX build configuration so CSGOptiX7_ptx is compiled with build-config-appropriate preprocessor definitions (notably defining PRODUCTION in Release and keeping debug-only markers in Debug). This aligns device PTX compilation with the host build configuration and avoids unintentionally compiling expensive debug/instrumentation paths into Release PTX.
Changes:
- Replace hard-coded Debug compile definitions for
CSGOptiX7_ptxwith CMake generator expressions keyed on the active configuration. - Define
PRODUCTIONforReleasePTX builds while keepingDEBUG_*markers gated toDebug. - Add
CONFIG_*defines for all standard CMake build types to the PTX target.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| $<$<CONFIG:Debug>:CONFIG_Debug> | ||
| $<$<CONFIG:RelWithDebInfo>:CONFIG_RelWithDebInfo> | ||
| $<$<CONFIG:Release>:CONFIG_Release> | ||
| $<$<CONFIG:MinSizeRel>:CONFIG_MinSizeRel> | ||
| OPTICKS_SYSRAP | ||
| PLOG_LOCAL | ||
| RNG_PHILOX | ||
| $<$<CONFIG:Debug>:DEBUG_TAG> | ||
| $<$<CONFIG:Debug>:DEBUG_PIDX> | ||
| $<$<CONFIG:Debug>:DEBUG_PIDXYZ> | ||
| $<$<CONFIG:Release>:PRODUCTION> |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62b166b526
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| $<$<CONFIG:Debug>:DEBUG_TAG> | ||
| $<$<CONFIG:Debug>:DEBUG_PIDX> | ||
| $<$<CONFIG:Debug>:DEBUG_PIDXYZ> | ||
| $<$<CONFIG:Release>:PRODUCTION> |
There was a problem hiding this comment.
Use config-specific PTX output paths
When using a multi-config generator, these config-specific definitions make the Debug and Release PTX semantically different, but csgoptix_ptx still copies every config to the single ${PROJECT_BINARY_DIR}/ptx/CSGOptiX7.ptx and sysrap/config_path.h.in:4 makes runtime lookup use that shared directory. After building one config and then running a binary from another, the binary loads whichever PTX was copied last, so Release can still run non-PRODUCTION PTX or Debug can silently lose the DEBUG_* instrumentation. Please include the config in the PTX output/search path or otherwise keep the artifacts separate.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
created an issue to be addressed later
533f722 to
8610cab
Compare
Release PTX was compiled with Debug macros and without PRODUCTION. This updates the
CSGOptiX7_ptxcompile definitions to be build-config aware.
Previously, the OptiX PTX target always received Debug-oriented definitions, even in
Releasebuilds:CONFIG_DebugDEBUG_TAGDEBUG_PIDXDEBUG_PIDXYZThat meant Release PTX still compiled debug paths and did not define
PRODUCTION. This PR switchesthose definitions to CMake generator expressions so Release PTX gets:
CONFIG_ReleasePRODUCTIONwhile Debug builds keep the debug markers.
CSGOptiX7.cugates expensive simulation instrumentation behind#ifndef PRODUCTIONanddebug-specific blocks. Because Release PTX was compiled as if it were Debug, the generated raygen
program carried substantially more code and local state than intended.
Measured on this PR branch.
Baseline:
Releaseconfig/dev.jsontests/geom/raindrop.gdmltests/run.macwith/run/beamOncontrolled byEVENTS(default1)PHOTONSin a temporary config derived fromconfig/dev.json(default100000)QSim::simulate tot_dt, the summedCSGOptiX::simulate_launch()time aroundoptixLaunch + cudaDeviceSynchronizeFive-run results with
EVENTS=1:tot_dtThe
100,000photon workload shows a 90% reduction in measured OptiX launch/sync time. The1,000,000photon scaling check shows a smaller 65% reduction because the removed debug overheadis not purely proportional to photon count; as photon count increases, per-photon simulation work
becomes a larger share of
tot_dt.A multi-event run in one process also amortizes the large first-event difference. With
EVENTS=5and
PHOTONS=100000, summing all fiveQSim::simulatetimings gave3.445 ms -> 1.193 ms(
2.89xfaster), while the last event alone was0.330 ms -> 0.231 ms(1.43xfaster).Wall time is not the primary metric here because it includes Geant4 startup and event plumbing. The
internal launch timing isolates the path affected by the PTX compile definitions.
Before, Release PTX still used Debug definitions:
After, Release PTX uses Release/production definitions:
Generated PTX changed accordingly:
__local_depot1[928]__local_depot1[64]The generated raygen PTX local-memory frame shrank from 928 bytes per launched thread to 64 bytes,
showing that Release PTX no longer carries the debug-only per-thread tracing/recording state.
Measurement invocation:
PTX verification commands: