Skip to content

Examples: Support Arm builds, interactive debugging, and proxy configuration#127

Draft
cjac wants to merge 3 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:example-script-update-20260612
Draft

Examples: Support Arm builds, interactive debugging, and proxy configuration#127
cjac wants to merge 3 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:example-script-update-20260612

Conversation

@cjac

@cjac cjac commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Update the Secure Boot custom image builder to support Arm architectures, add an interactive debugging workflow, configure Secure Web Proxy (SWP), update path resolution, and sanitize the public .gitignore.

1. Arm Architecture Support

  • Targets: Added 2.2-ubuntu22-arm and 2.3-ubuntu22-arm image chains.
  • Machine Type: Added create_arm_instance to use t2a-standard-2 GCE instances for Arm builds.
  • Build Paths: Bypass GPU/ML stages (TensorFlow, PyTorch, RAPIDS, Spark) for Arm architectures.
  • Base Image: Fall back to secure-proxy instead of proxy-tf for optional components (Docker, Jupyter, Zeppelin, Pig) on Arm.

2. Interactive Debugging Workflow

  • Documentation: Documented the customize-in-screen.sh workflow for running customization on persistent debug VMs.
  • Disk Monitoring: Install screen on the VM before running the disk usage monitor in no-customization.sh.
  • State Resolution: Resolve the build directory on failure using sort | tail -n1 to handle multiple directories.

3. Proxy & Network Integration

  • SWP Variables: Pass SWP_IP, SWP_PORT, and PROXY_CERT_GCS_PATH through the Podman runner to the builder VM.
  • Environment: Parse and export SWP variables from env.json.
  • Integration: Configure the builder and debug VM to use the local startup_script/gce-proxy-setup.sh as the authoritative proxy configuration script.

4. Code Robustness & Sanitization

  • Path Resolution: Traverse parent directories in env.sh to resolve DATAPROC_EVOLUTION_DIR across nested git boundaries.
  • Error Handling: Update run_gcloud to capture and return exit codes under set -e.
  • Globbing: Prevent errors in build-current-images.sh when no logs match the timestamp.
  • Repository Sanitization: Sanitize .gitignore to remove development scratchpads, local draft scripts, and specific file patterns, replacing them with generic rules for keys, logs, and temporary directories.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b

Update the Secure Boot custom image builder to support Arm architectures,
add an interactive debugging workflow, configure Secure Web Proxy (SWP),
update path resolution, and sanitize the public .gitignore.

### 1. Arm Architecture Support
*   **Targets:** Added `2.2-ubuntu22-arm` and `2.3-ubuntu22-arm` image chains.
*   **Machine Type:** Added `create_arm_instance` to use `t2a-standard-2` GCE instances for Arm builds.
*   **Build Paths:** Bypass GPU/ML stages (TensorFlow, PyTorch, RAPIDS, Spark) for Arm architectures.
*   **Base Image:** Fall back to `secure-proxy` instead of `proxy-tf` for optional components (Docker, Jupyter, Zeppelin, Pig) on Arm.

### 2. Interactive Debugging Workflow
*   **Documentation:** Documented the `customize-in-screen.sh` workflow for running customization on persistent debug VMs.
*   **Disk Monitoring:** Install `screen` on the VM before running the disk usage monitor in `no-customization.sh`.
*   **State Resolution:** Resolve the build directory on failure using `sort | tail -n1` to handle multiple directories.

### 3. Proxy & Network Integration
*   **SWP Variables:** Pass `SWP_IP`, `SWP_PORT`, and `PROXY_CERT_GCS_PATH` through the Podman runner to the builder VM.
*   **Environment:** Parse and export SWP variables from `env.json`.
*   **Integration:** Configure the builder and debug VM to use the local `startup_script/gce-proxy-setup.sh` as the authoritative proxy configuration script.

### 4. Code Robustness & Sanitization
*   **Path Resolution:** Traverse parent directories in `env.sh` to resolve `DATAPROC_EVOLUTION_DIR` across nested git boundaries.
*   **Error Handling:** Update `run_gcloud` to capture and return exit codes under `set -e`.
*   **Globbing:** Prevent errors in `build-current-images.sh` when no logs match the timestamp.
*   **Repository Sanitization:** Sanitize `.gitignore` to remove development scratchpads, local draft scripts, and specific file patterns, replacing them with generic rules for keys, logs, and temporary directories.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of scripts and documentation for manual customization, interactive debugging, and diagnostics of Dataproc secure boot images, including support for ARM-based instances and updated CUDA versions. The review feedback highlights several critical areas for improvement: resolving a variable expansion bug in cleanup-builders.sh, removing hardcoded debug overrides in ssh-builder, avoiding redundant user prompts, and addressing security concerns with predictable temporary filenames. Additionally, portability issues on macOS (such as stat and readlink usage) should be resolved, and script robustness should be enhanced by adding error handling to remote SSH blocks, improving screen PID retrieval, and using proper argument propagation ("$@").

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +67 to +82
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=$((attempt + 1))
if [[ ${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code $?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Because SCREEN_CMD is defined using double quotes, variables like attempt and $? are eagerly expanded on the host workstation rather than inside the screen session. Since attempt is unset on the host, attempt=$((attempt + 1)) expands to attempt=$(( + 1)), which is a syntax error. Additionally, $? will report the exit code of the host's last command instead of the remote gcloud command. Escaping the $ character for these variables defers their expansion to execution time.

Suggested change
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=$((attempt + 1))
if [[ ${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code $?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=\$((attempt + 1))
if [[ \${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code \$?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"

Comment on lines +22 to +34
if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
# exit 1
fi

report_result "Found: ${INSTANCE_NAME}"

print_status "SSHing into ${INSTANCE_NAME}"...

INSTANCE_NAME=debug-deb12-build

gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This script contains leftover debug code that overrides the dynamically discovered INSTANCE_NAME with a hardcoded value (debug-deb12-build). Additionally, the exit 1 statement is commented out, which allows the script to proceed even if no running builder instance is found. Uncommenting the exit statement and removing the hardcoded override restores the script's intended dynamic behavior.

Suggested change
if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
# exit 1
fi
report_result "Found: ${INSTANCE_NAME}"
print_status "SSHing into ${INSTANCE_NAME}"...
INSTANCE_NAME=debug-deb12-build
gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"
if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
exit 1
fi
report_result "Found: ${INSTANCE_NAME}"
print_status "SSHing into ${INSTANCE_NAME}"...
gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"

Comment on lines +51 to +62
TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"

echo "The following builder VMs will be deleted in a new screen session:" >&2
echo "${INSTANCE_LIST}" >&2

read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script prompts the user for confirmation twice with the exact same message. Removing the second redundant confirmation prompt improves usability and avoids confusion.

Suggested change
TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"
echo "The following builder VMs will be deleted in a new screen session:" >&2
echo "${INSTANCE_LIST}" >&2
read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi
TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"

GSA="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

GCS_SOURCES_PATH="gs://${BUCKET}/${INSTANCE_NAME}/sources"
CACHE_FILE="/tmp/latest_dataproc_$(echo "${IMAGE_VERSION}" | tr './-' '_').txt"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a predictable filename directly in /tmp can lead to permission conflicts or symlink attacks on shared multi-user systems. Incorporating the current user's name into the filename prevents conflicts when multiple developers run the script on the same workstation or bastion host.

Suggested change
CACHE_FILE="/tmp/latest_dataproc_$(echo "${IMAGE_VERSION}" | tr './-' '_').txt"
CACHE_FILE="/tmp/latest_dataproc_${USER}_$(echo "${IMAGE_VERSION}" | tr './-' '_').txt"

# Determine Dataproc Image
if [[ -n "${DATAPROC_IMAGE:-}" ]]; then
echo "Using image from ENV: ${DATAPROC_IMAGE}"
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(stat -c %Y "${CACHE_FILE}"))) -lt "${CACHE_TTL_SECONDS}" ]]; then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The stat -c %Y command is specific to GNU stat and is not portable to macOS workstations, where it will fail with a syntax error. Using date -r is highly portable and works seamlessly on both macOS and Linux.

Suggested change
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(stat -c %Y "${CACHE_FILE}"))) -lt "${CACHE_TTL_SECONDS}" ]]; then
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(date -r "${CACHE_FILE}" +%s))) -lt "${CACHE_TTL_SECONDS}" ]]; then

Comment on lines +207 to +216
ssh -o ControlMaster=no -o BatchMode=yes -o ConnectTimeout=10 "${INSTANCE_NAME}" "
GCS_PATH=\$(curl -s -H 'Metadata-Flavor: Google' http://metadata.google.internal/computeMetadata/v1/instance/attributes/custom-sources-path)
rm -rf /tmp/sources
mkdir -p /tmp/sources
gsutil -m cp -r \"\${GCS_PATH}/*\" /tmp/sources/ >/dev/null
chmod +x /tmp/sources/*.sh
# Spawn the guest screen bootstrap wrapper in the background, passing FORCE_APPLY status
nohup env FORCE_APPLY=${FORCE_APPLY} bash /tmp/sources/install-in-screen.sh /tmp/sources/init_actions.sh >/tmp/launcher.log 2>&1 &
sleep 1
"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The remote command block executed via SSH does not run with set -e enabled. If gsutil or curl fails, the script will silently continue and attempt to run install-in-screen.sh with missing or corrupted files. Adding set -euo pipefail ensures the remote execution fails fast on any error.

Suggested change
ssh -o ControlMaster=no -o BatchMode=yes -o ConnectTimeout=10 "${INSTANCE_NAME}" "
GCS_PATH=\$(curl -s -H 'Metadata-Flavor: Google' http://metadata.google.internal/computeMetadata/v1/instance/attributes/custom-sources-path)
rm -rf /tmp/sources
mkdir -p /tmp/sources
gsutil -m cp -r \"\${GCS_PATH}/*\" /tmp/sources/ >/dev/null
chmod +x /tmp/sources/*.sh
# Spawn the guest screen bootstrap wrapper in the background, passing FORCE_APPLY status
nohup env FORCE_APPLY=${FORCE_APPLY} bash /tmp/sources/install-in-screen.sh /tmp/sources/init_actions.sh >/tmp/launcher.log 2>&1 &
sleep 1
"
ssh -o ControlMaster=no -o BatchMode=yes -o ConnectTimeout=10 "${INSTANCE_NAME}" "
set -euo pipefail
GCS_PATH=\$(curl -s -H 'Metadata-Flavor: Google' http://metadata.google.internal/computeMetadata/v1/instance/attributes/custom-sources-path)
rm -rf /tmp/sources
mkdir -p /tmp/sources
gsutil -m cp -r \"\${GCS_PATH}/*\" /tmp/sources/ >/dev/null
chmod +x /tmp/sources/*.sh
# Spawn the guest screen bootstrap wrapper in the background, passing FORCE_APPLY status
nohup env FORCE_APPLY=${FORCE_APPLY} bash /tmp/sources/install-in-screen.sh /tmp/sources/init_actions.sh >/tmp/launcher.log 2>&1 &
sleep 1
"

# --- Wait and Stream Logs ---
# We find the PID of the newly spawned screen session
sleep 2
SCREEN_PID=$(screen -ls | grep customization | awk '{print $1}' | cut -d. -f1 || echo "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If multiple screen sessions contain the word 'customization' (e.g., from other users or dead sessions), grep customization will return multiple lines, causing SCREEN_PID to contain multiple PIDs separated by newlines. This will cause the subsequent kill -0 loop to fail with a syntax error. Grepping for the exact .customization suffix and limiting the output to a single line using head -n 1 makes the PID retrieval robust.

Suggested change
SCREEN_PID=$(screen -ls | grep customization | awk '{print $1}' | cut -d. -f1 || echo "")
SCREEN_PID=$(screen -ls | grep "\.customization" | head -n 1 | awk '{print $1}' | cut -d. -f1 || echo "")

export DATAPROC_EVOLUTION_DIR="$(realpath "${SCRIPT_DIR}/../../../..")"
# Traverse upwards from env.sh's directory to resolve the repository root
# containing both custom-images and cloud-dataproc, bypassing nested git boundaries.
current_dir=$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The readlink -f command is not portable to macOS workstations, where it will fail with an error. Using a portable subshell with cd and pwd safely resolves the absolute path of the script's directory on both macOS and Linux.

Suggested change
current_dir=$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")
current_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

Comment on lines +100 to +104
function create_arm_instance() {
python3 generate_custom_image.py \
--machine-type "t2a-standard-2" \
$*
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $* inside the function splits arguments on spaces, which can break arguments containing spaces or special characters. Using "$@" is the standard best practice to preserve argument boundaries correctly.

Suggested change
function create_arm_instance() {
python3 generate_custom_image.py \
--machine-type "t2a-standard-2" \
$*
}
function create_arm_instance() {
python3 generate_custom_image.py \\
--machine-type "t2a-standard-2" \\
"$@"
}

Update the Secure Boot custom image builder to support Arm architectures,
add an interactive debugging workflow, configure Secure Web Proxy (SWP),
update path resolution, and sanitize the public .gitignore.

### 1. Arm Architecture Support
*   **Targets:** Added `2.2-ubuntu22-arm` and `2.3-ubuntu22-arm` image chains.
*   **Machine Type:** Added `create_arm_instance` to use `t2a-standard-2` GCE instances for Arm builds.
*   **Build Paths:** Bypass GPU/ML stages (TensorFlow, PyTorch, RAPIDS, Spark) for Arm architectures.
*   **Base Image:** Fall back to `secure-proxy` instead of `proxy-tf` for optional components (Docker, Jupyter, Zeppelin, Pig) on Arm.
*   **Prerelease Support:** Added temporary prerelease base image URI for `2.3-ubuntu22-arm` to enable testing.

### 2. Interactive Debugging Workflow
*   **Documentation:** Documented the `customize-in-screen.sh` workflow for running customization on persistent debug VMs.
*   **Orchestration:** Added `customize-in-screen.sh` to manage the idempotent VM lifecycle and screen attachment.
*   **Guest Bootstrap:** Added `install-in-screen.sh` to run customization inside a detached screen session on the VM.
*   **Diagnostics:** Added `audit-image-customizer.sh` to perform remote network and proxy connectivity audits.
*   **Helpers:** Added `create-debug-vm.sh`, `destroy-debug-vm.sh`, `ssh-debug-vm.sh`, and `scp-debug-vm.sh` to manage the debug VM.
*   **Cleanup:** Added `cleanup-builders.sh` to clean up lingering builder VMs in parallel using screen.

### 3. Proxy & Network Integration
*   **SWP Variables:** Pass `SWP_IP`, `SWP_PORT`, and `PROXY_CERT_GCS_PATH` through the Podman runner to the builder VM.
*   **Environment:** Parse and export SWP variables from `env.json`.
*   **Integration:** Configure the builder and debug VM to use the local `gce-proxy-setup.sh` as the authoritative proxy configuration script.

### 4. Code Robustness & Sanitization
*   **Path Resolution:** Traverse parent directories in `env.sh` to resolve `DATAPROC_EVOLUTION_DIR` across nested git boundaries.
*   **Error Handling:** Update `run_gcloud` to capture and return exit codes under `set -e`.
*   **Globbing:** Prevent errors in `build-current-images.sh` when no logs match the timestamp.
*   **Repository Sanitization:** Sanitize `.gitignore` to remove development scratchpads, local draft scripts, and specific file patterns, replacing them with generic rules for keys, logs, and temporary directories.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
@cjac

cjac commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an interactive debugging and manual customization workflow for GCE image customization, including scripts to create, SSH into, audit, and destroy debug VMs, alongside updated documentation and ARM support. The review feedback highlights several critical issues: leftover debugging code in ssh-builder that hardcodes the instance name, incorrect variable evaluation in cleanup-builders.sh due to unescaped variables in double-quoted strings, a bootstrap issue in install-in-screen.sh where screen is used before installation, redundant confirmation prompts in cleanup-builders.sh, and a portability issue with stat on macOS in create-debug-vm.sh.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +19 to +34
--sort-by=\"~creationTimestamp\" \
| head -n 1)

if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
# exit 1
fi

report_result "Found: ${INSTANCE_NAME}"

print_status "SSHing into ${INSTANCE_NAME}"...

INSTANCE_NAME=debug-deb12-build

gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script contains leftover debugging code that hardcodes INSTANCE_NAME=debug-deb12-build on line 32, completely overriding the dynamically discovered builder instance. Additionally, the exit 1 on line 25 is commented out, which would cause the script to proceed even if no running builder instance is found. Finally, the --sort-by argument on line 19 has unnecessarily escaped double quotes (\"~creationTimestamp\"), which can cause parsing issues in gcloud.

Suggested change
--sort-by=\"~creationTimestamp\" \
| head -n 1)
if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
# exit 1
fi
report_result "Found: ${INSTANCE_NAME}"
print_status "SSHing into ${INSTANCE_NAME}"...
INSTANCE_NAME=debug-deb12-build
gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"
--sort-by="~creationTimestamp" \
| head -n 1)
if [[ -z "${INSTANCE_NAME}" ]]; then
report_result "Not Found"
echo "No RUNNING builder instance found matching the prefix." >&2
exit 1
fi
report_result "Found: ${INSTANCE_NAME}"
print_status "SSHing into ${INSTANCE_NAME}"...
gcloud compute ssh "${INSTANCE_NAME}" --zone "${ZONE}" --project "${PROJECT_ID}"

Comment on lines +67 to +82
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=$((attempt + 1))
if [[ ${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code $?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The SCREEN_CMD string is defined using double quotes, which causes the host shell to evaluate several variables and expressions before writing them to TEMP_SCREENRC. Specifically:

  • $((attempt + 1)) on line 71 evaluates to 1 statically on the host, meaning attempt will never increment dynamically inside the screen session.
  • ${attempt} on line 72 expands to an empty string on the host, resulting in a syntax error ([[ -ge 3 ]]) inside the screen session.
  • $? on line 79 expands to the exit code of the last command run on the host before the loop, rather than the exit code of the command inside the screen session.

To fix this, escape these variables/expressions with a backslash so they are evaluated dynamically inside the screen session.

Suggested change
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=$((attempt + 1))
if [[ ${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code $?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"
SCREEN_CMD="
attempt=0
while true; do
gcloud beta compute instances delete ${instance} --zone=${ZONE} --project=${PROJECT_ID} --quiet --no-graceful-shutdown && break
attempt=\$((attempt + 1))
if [[ \${attempt} -ge 3 ]]; then
echo 'Max retries reached for ${instance}'
break
fi
echo 'Failed, retrying in 5 seconds...'
read -t 5
done
echo \"Delete command for ${instance} finished with exit code \$?\"
echo \"Window will close in 5 seconds...\"
read -t 5
"

Comment on lines +17 to +19

echo "INFO: Preparing to run customization: ${TARGET_SCRIPT}" >&2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a chicken-and-egg bootstrap issue with screen. The script install-in-screen.sh attempts to launch the customization script inside a detached screen session, but screen is only installed inside the customization script (no-customization.sh). If screen is not pre-installed on the base image, install-in-screen.sh will fail immediately, and the customization script will never run to install it. We should ensure screen is installed on the host VM before attempting to spawn the screen session.

Suggested change
echo "INFO: Preparing to run customization: ${TARGET_SCRIPT}" >&2
if ! command -v screen >/dev/null 2>&1; then
echo "INFO: Installing screen..." >&2
if command -v dnf >/dev/null 2>&1; then
sudo dnf -y -q install epel-release && sudo dnf -y -q install screen
elif command -v apt-get >/dev/null 2>&1; then
sudo apt-get update -y -qq >/dev/null 2>&1
sudo apt-get install -y -qq screen >/dev/null 2>&1
fi
fi
echo "INFO: Preparing to run customization: ${TARGET_SCRIPT}" >&2

Comment on lines +41 to +62
report_result "Found"
echo "The following builder VMs will be deleted in detached screen sessions:" >&2
echo "${INSTANCE_LIST}" >&2

read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi

TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"

echo "The following builder VMs will be deleted in a new screen session:" >&2
echo "${INSTANCE_LIST}" >&2

read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There are two redundant, identical confirmation prompts back-to-back (lines 45-49 and lines 58-62). The user is prompted twice to confirm the deletion of the same builder VMs. We should remove the first prompt and proceed directly to setting up the screen session and prompting once.

Suggested change
report_result "Found"
echo "The following builder VMs will be deleted in detached screen sessions:" >&2
echo "${INSTANCE_LIST}" >&2
read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi
TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"
echo "The following builder VMs will be deleted in a new screen session:" >&2
echo "${INSTANCE_LIST}" >&2
read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi
report_result "Found"
TEMP_SCREENRC="${tmpdir}/temp_cleanup.screenrc"
rm -f "${TEMP_SCREENRC}"
touch "${TEMP_SCREENRC}"
echo "The following builder VMs will be deleted in a new screen session:" >&2
echo "${INSTANCE_LIST}" >&2
read -p "Continue with deletion? (y/N): " confirm
if [[ "${confirm}" != [yY] ]]; then
echo "Deletion cancelled." >&2
exit 0
fi

# Determine Dataproc Image
if [[ -n "${DATAPROC_IMAGE:-}" ]]; then
echo "Using image from ENV: ${DATAPROC_IMAGE}"
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(stat -c %Y "${CACHE_FILE}"))) -lt "${CACHE_TTL_SECONDS}" ]]; then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of stat -c %Y is not portable to macOS (which uses BSD stat where -c is not supported). Since this script is executed on the developer's workstation, and many developers use macOS, this will cause the script to fail. Using date -r <file> +%s is fully portable and works on both GNU (Linux) and BSD (macOS) systems.

Suggested change
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(stat -c %Y "${CACHE_FILE}"))) -lt "${CACHE_TTL_SECONDS}" ]]; then
elif [[ -f "${CACHE_FILE}" ]] && [[ $(($(date +%s) - $(date -r "${CACHE_FILE}" +%s))) -lt "${CACHE_TTL_SECONDS}" ]]; then

…deferred script execution

This commit introduces several key improvements to the custom image builder
pipeline to enhance debugging, reliability, and extensibility:

1. **VM Retention on Failure for Debugging**:
   - Modified `startup_script/run.sh` to use an `EXIT` trap (`shutdown_vm`) instead of an inline shutdown at the end of `main`.
   - The trap checks the exit status and GCE metadata `retain-on-failure`. If the build fails and `retain-on-failure=true` is set, the VM remains running (skips shutdown) to allow live debugging via SSH.
   - Added `--retain-on-failure` command-line argument to `generate_custom_image.py` (via `args_parser.py` and `shell_script_generator.py`) and enabled it by default in `examples/secure-boot/pre-init.sh`.

2. **Prevent Storage Exhaustion in Monolithic Builds**:
   - Increased the build disk size for `2.3-debian12` from 50GB to 100GB in `examples/secure-boot/pre-init.sh` to accommodate the cumulative size of GPU/ML libraries and monolithic optional components (Zeppelin, Jupyter, Docker, Pig, Delta).

3. **Dynamic Deferred Config Script Execution**:
   - Added support in `startup_script/gce-proxy-setup.sh` to dynamically download and execute a custom user script from a GCS URI (specified via `deferred-config-script-uri` metadata) during the deferred boot phase.
   - Supports optional integrity verification using a SHA256 checksum (specified via `deferred-config-script-sha256` metadata).

4. **QoL and Portability Fixes**:
   - Updated ARM instance machine type to `c4a-standard-2` in `examples/secure-boot/pre-init.sh`.
   - Replaced Linux-specific `stat` command with portable `date -r` in `examples/secure-boot/bin/create-debug-vm.sh` for cache TTL checks.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
@cjac cjac force-pushed the example-script-update-20260612 branch from 472b57f to d7a7d3a Compare June 19, 2026 01:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant