Craig clean #145

Open
craigwarner-ufastro wants to merge 7 commits into craig_factored_merge from craig_clean

@craigwarner-ufastro

Collaborator

Merge "craig clean" into craig_factored_merge

craig_clean differs from craig_factored_merge in two ways: 1) It adds new GPU logic for point source fitting, tryUpdates, and GPULsqrOptimizer. 2) It removes unused branches and options, along with print and timing statements.

Logs from craig_clean updates:
Commit with full debug info for the refactor that moves point source
fitting and tryUpdates to the GPU and improves GPULsqrOptimizer for
background fitting. The AI-generated commit message follows:

Subject: Optimize GPU fitting pipelines and improve memory/edge-case robustness
Description:

This commit implements a series of significant performance optimizations and stability fixes for the GPU-accelerated fitting pipelines, focusing on tryUpdates, GPULsqrOptimizer, and handling of complex image stacks.

Key Changes:

    GPU Point Source Fitter & Batching:

        Implemented a hybrid strategy for getSingleImageUpdateDirection where CPU-based model images (including Lanczos shifting) are batched into 3D arrays before GPU processing.

        Achieved a 2x speedup in update direction computation while maintaining full numerical correctness.

        Optimized getBatchModelImages to handle cases where allderivs and tractor.images counts differ.
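
As a rough illustration of the batching step above, the sketch below zero-pads per-image 2D model patches to a common shape and stacks them into one 3D structure. It uses nested lists so that it runs without dependencies. `stack_model_images` is a hypothetical name, zero padding is an assumption, and the actual getBatchModelImages presumably builds NumPy or CuPy arrays instead.

```python
def stack_model_images(patches):
    """Zero-pad 2D patches (lists of rows) to a common shape and stack them.

    Returns (stack, shapes): stack is an N x H x W nested list, and shapes
    records each patch's original (height, width) so padding can be ignored.
    """
    if not patches:
        return [], []
    shapes = [(len(p), len(p[0]) if p else 0) for p in patches]
    height = max(h for h, _ in shapes)
    width = max(w for _, w in shapes)
    stack = []
    for patch in patches:
        # Pad each row on the right, then add all-zero rows at the bottom.
        padded = [list(row) + [0.0] * (width - len(row)) for row in patch]
        padded += [[0.0] * width for _ in range(height - len(patch))]
        stack.append(padded)
    return stack, shapes


patches = [
    [[1.0, 2.0], [3.0, 4.0]],  # 2 x 2 patch
    [[5.0, 6.0, 7.0]],         # 1 x 3 patch
]
stack, shapes = stack_model_images(patches)
```

Keeping the original shapes alongside the stack lets later steps mask out the padded pixels rather than treat them as data.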

    GPU tryUpdates for Point Sources & Galaxies:

        Migrated galaxy tryUpdates to the GPU, achieving a 3x speedup compared to the CPU version.

        Transitioned to float64 precision for point source tryUpdates to eliminate occasional divergence from CPU results.

        Eliminated redundant CPU getLogProb() calls by performing batch calculations entirely on the GPU, yielding an additional 33% speedup.

    Memory-Efficient ("Less Mem") GPU Mode:

        Introduced a new memory-prediction helper to check available VRAM before execution.

        Implemented a sequential image-looping fallback within the GPU kernels for large blobs that exceed memory limits, preventing crashes while retaining GPU acceleration.
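
The memory check could look something like the following sketch. The function names, the workspace factor of 4, the safety margin, and the byte counts are all hypothetical, chosen only to show the decision between batched and sequential processing. With CuPy, the free VRAM figure would typically come from `cupy.cuda.Device().mem_info`.

```python
def predict_batch_bytes(n_images, height, width, itemsize=8, workspace_factor=4):
    """Estimate the bytes needed to process n_images patches in one batch.

    workspace_factor is a hypothetical multiplier covering the model,
    derivative, and FFT workspace arrays held alongside the pixel data.
    """
    return n_images * height * width * itemsize * workspace_factor


def choose_mode(n_images, height, width, free_bytes, safety=0.8):
    """Return 'batch' if the estimate fits in a fraction of free memory,
    otherwise 'less_mem' (process images one at a time)."""
    needed = predict_batch_bytes(n_images, height, width)
    return "batch" if needed <= safety * free_bytes else "less_mem"


free = 8 * 1024**3  # pretend 8 GiB of VRAM is free
small = choose_mode(n_images=20, height=256, width=256, free_bytes=free)
large = choose_mode(n_images=400, height=2048, width=2048, free_bytes=free)
```

Here the small stack needs about 40 MiB and stays batched, while the large one would need about 50 GiB and is routed to the sequential path.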

    GPULsqrOptimizer Enhancements:

        Refactored background and sky fitting to use the new batching logic developed for tryUpdates.

        Optimized sparse matrix accumulation, making the process 4x faster.

        Overall algorithm runtime reduced from 292s to 99s (excluding solver time).

    Handling None modelMasks & Edge Cases:

        Added logic to filter out images where modelMask is None (e.g., objects off the edge of a segment).

        The tractor context is now dynamically updated to only include valid images before proceeding with GPU fitting.

        Ensured full CPU fallback if all modelMasks in a stack are None.
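
The filtering and fallback rule above can be sketched as follows. This is an illustration only: `select_valid_images` and `fit_stack` are hypothetical names, and the `gpu_fit` and `cpu_fit` callables are stand-ins for the real fitting paths.

```python
def select_valid_images(images, model_masks):
    """Keep only the (image, mask) pairs whose mask is not None."""
    pairs = [(im, mm) for im, mm in zip(images, model_masks) if mm is not None]
    valid_images = [im for im, _ in pairs]
    valid_masks = [mm for _, mm in pairs]
    return valid_images, valid_masks


def fit_stack(images, model_masks, gpu_fit, cpu_fit):
    """Run gpu_fit on the valid subset, or cpu_fit if no image overlaps."""
    valid_images, valid_masks = select_valid_images(images, model_masks)
    if not valid_images:
        # Every modelMask was None: hand the whole stack to the CPU path.
        return cpu_fit(images, model_masks)
    return gpu_fit(valid_images, valid_masks)


# Stand-in fitters that just report which path ran and on which images.
gpu = lambda ims, mms: ("gpu", ims)
cpu = lambda ims, mms: ("cpu", ims)

mixed = fit_stack(["a", "b", "c"], [{"x": 0}, None, {"x": 1}], gpu, cpu)
empty = fit_stack(["a", "b"], [None, None], gpu, cpu)
```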

    Performance Impact:

        Total fitblobs runtime for large test blobs (e.g., Blob 1 of 0001p000) reduced from 5842s (CPU) to 1975s (GPU).

        Bricks with long runtimes run up to 2x faster than with previous GPU versions.


 Optimize GPU memory usage and improve robustness in engine and optimizer

 This commit introduces a memory-efficient GPU processing mode and improves
 the reliability of the GPU-accelerated fitting paths.

 tractor/engine.py:
 - Add 'use_less_mem' and 'ie_stack' support to log-likelihood batch methods.
 - Implement sequential image processing in getLogLikelihoodBatch to minimize
   VRAM footprint for large FFT workspaces.
 - Add manual memory management (CuPy block clearing) during batch operations.
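
A minimal sketch of why the sequential path can match the batched one: a chi-squared log-likelihood is a sum over images, so it can be accumulated one image at a time while holding only that image's arrays in memory, and the result agrees with the batched sum up to floating-point rounding. The function names are hypothetical, flat Python lists stand in for GPU arrays, and treating `ie_stack` as per-pixel inverse errors is an assumption.

```python
def image_log_likelihood(data, model, inverr):
    """-0.5 * chi^2 for one image, with chi = (data - model) * inverse error."""
    return -0.5 * sum(((d - m) * ie) ** 2 for d, m, ie in zip(data, model, inverr))


def log_likelihood_sequential(data_stack, model_stack, ie_stack):
    """Accumulate the total image by image instead of over the whole stack."""
    total = 0.0
    for data, model, inverr in zip(data_stack, model_stack, ie_stack):
        total += image_log_likelihood(data, model, inverr)
        # In a CuPy version, per-image temporaries could be released here.
    return total


data_stack = [[1.0, 2.0], [3.0, 5.0]]
model_stack = [[1.0, 1.0], [2.0, 3.0]]
ie_stack = [[1.0, 2.0], [1.0, 0.5]]
total = log_likelihood_sequential(data_stack, model_stack, ie_stack)
```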

 tractor/factored_optimizer.py:
 - Filter out non-overlapping images (None masks) before GPU tryUpdates.
 - Implement 'use_less_mem' logic in GPUFriendlyOptimizer to handle large
   image stacks by iterating through valid images when VRAM is low.
 - Add robust state restoration (images/masks) using try...finally blocks.
 - Improve error reporting with tracebacks and fall back to CPU on GPU failure.
 - Add debug logging for linear algebra internals and improve NaN handling.
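
The state restoration and CPU fallback described in the bullets above might be structured roughly like this. `FakeTractor` and `try_updates_with_fallback` are hypothetical names, the step functions are stand-ins, and catching every `Exception` is a simplification of whatever the real code does.

```python
import traceback


class FakeTractor:
    """Stand-in for the real tractor object; only 'images' matters here."""
    def __init__(self, images):
        self.images = images


def try_updates_with_fallback(tractor, valid_images, gpu_step, cpu_step, log=print):
    """Run gpu_step on valid_images, restoring tractor.images afterwards.

    If gpu_step raises, log the traceback and fall back to cpu_step on the
    restored (full) image list.
    """
    saved_images = tractor.images
    try:
        tractor.images = valid_images
        try:
            return gpu_step(tractor)
        finally:
            tractor.images = saved_images  # always restore, even on error
    except Exception:
        log("GPU path failed, falling back to CPU:\n" + traceback.format_exc())
        return cpu_step(tractor)


def failing_gpu(tr):
    raise RuntimeError("out of memory")


messages = []
tr = FakeTractor(["a", "b", "c"])
result = try_updates_with_fallback(
    tr, ["a", "c"], failing_gpu, lambda t: len(t.images), log=messages.append
)
```

Nesting the `finally` inside the `except` scope means the CPU fallback always sees the original image list, not the filtered one.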
