Skip to content

# Inject proxy environment variables into systemd manager at boot#126

Merged
cjac merged 2 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:proxy-script-20260605
Jun 12, 2026
Merged

# Inject proxy environment variables into systemd manager at boot#126
cjac merged 2 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:proxy-script-20260605

Conversation

@cjac

@cjac cjac commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Refactor custom image builder, guest proxy setup, and add validation guide

This change set addresses file resolution in the custom image builder,
refactors the guest proxy setup script, and adds a verification guide:

  1. Path Resolution: Resolves a failure in the Python custom image builder
    (generate_custom_image.py) where it could not locate the secure-boot
    key generation script when executed as a .par file in a remote Borg
    runner environment.

  2. Guest Proxy Setup: Refactors the gce-proxy-setup.sh script into a
    decoupled three-phase architecture (Introspection, Planning/Reconciliation,
    and Mutation) backed by a guest-state JSON cache.

  3. Testing & Verification: Adds TESTING.md to document the post-build
    verification workflow for secure-boot custom images.

  4. Configuration Sample: Updates env.json.sample to include all required
    variables, defaulting the customization script to a baseline test.


Detailed Changes

1. Custom Image Builder (Python)

  • File: custom_image_utils/shell_script_generator.py
  • Path Resolution: Replaced the hardcoded relative path to
    create-key-pair.sh with a multi-candidate search algorithm.
    It now searches:
    1. Relative to the current working directory (CWD).
    2. Relative to the Python module file location.
    3. Relative to sys.argv[0] (the executing .par or script directory).
  • Bash Template Integration: Sourced and resolved the script path in
    Python and injected it into the generated Bash template via the
    {create_key_pair_script} placeholder.

2. Guest Proxy Setup (Bash)

  • File: startup_script/gce-proxy-setup.sh
  • Three-Phase Architecture:
    • Phase 1: Introspection: Audits the VM and GCE metadata.
      • Introduced initialize_guest_state which stores GCE metadata
        and system state in a JSON cache in /dev/shm/metadata_cache.
      • Restricted GCE metadata queries to prevent downloading project-wide
        SSH keys.
      • Enforces the cache with get_cached_state via jq queries.
    • Phase 2: Planning & Reconciliation:
      • Refactored set_proxy to compare the desired state (asserted by
        GCE metadata) against the actual state (audited in Phase 1).
      • Logs a planning report showing skipping or mutation actions for
        /etc/environment and systemd.
      • Implements idempotency: exits early if actual state matches desired
        state, unless forced via FORCE_APPLY=1.
      • Added a fallback to derive the Dataproc cluster name from the
        hostname if the dataproc-cluster-name metadata is missing.
    • Phase 3: Mutation: Applies the planned system modifications.
      • Added configure_systemd_proxy to inject proxy overrides into the
        systemd manager configuration and trigger systemctl daemon-reexec.
      • Refactored repair_boto to use cached state and updated sed
        regex for proxy key stripping.
      • Added setup_deferred_service to copy itself to /usr/local/sbin/
        and provision a oneshot systemd service (dataproc-proxy-init.service)
        that runs BEFORE google-dataproc-agent.service on boot.

3. Documentation & Samples (Markdown / JSON)

  • File: examples/secure-boot/TESTING.md
    • Added a verification guide detailing cluster provisioning, driver load
      checks (nvidia-smi), UEFI module signature auditing (modinfo),
      and kernel event logging (dmesg). Includes timing references for
      build phases.
  • File: examples/secure-boot/env.json.sample
    • Added missing required variables IMAGE_VERSION and CUSTOMIZATION_SCRIPT.
    • Defaulted CUSTOMIZATION_SCRIPT to examples/secure-boot/no-customization.sh.

TAG=agy

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new function configure_systemd_proxy in startup_script/gce-proxy-setup.sh to inject proxy environment variables into the systemd manager configuration. The review feedback points out an asymmetric variable check bug that could inject empty values, suggests checking if systemd is running to prevent error noise in non-systemd environments, and recommends trimming a trailing space from the formatted environment string. A refactored code block is provided to address these issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +321 to +341
if [[ -n "${HTTP_PROXY:-}" || -n "${HTTPS_PROXY:-}" ]]; then
echo "DEBUG: configure_systemd_proxy: Injecting proxy overrides into systemd manager..." >&2
mkdir -p "${systemd_conf_dir}"

local env_strings=()
if [[ -n "${http_proxy:-}" ]]; then env_strings+=( "http_proxy=${http_proxy}" "HTTP_PROXY=${HTTP_PROXY}" ); fi
if [[ -n "${https_proxy:-}" ]]; then env_strings+=( "https_proxy=${https_proxy}" "HTTPS_PROXY=${HTTPS_PROXY}" ); fi
if [[ -n "${no_proxy:-}" ]]; then env_strings+=( "no_proxy=${no_proxy}" "NO_PROXY=${NO_PROXY}" ); fi

# Format as systemd DefaultEnvironment space-separated list
local default_env
default_env=$(printf '"%s" ' "${env_strings[@]}")

cat <<EOF > "${systemd_conf_file}"
[Manager]
DefaultEnvironment=${default_env}
EOF

echo "DEBUG: configure_systemd_proxy: Executing systemd daemon-reexec..." >&2
systemctl daemon-reexec || true
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There are a few areas of improvement in this block:

  1. Asymmetric Variable Check Bug: Checking http_proxy and then unconditionally appending both http_proxy and HTTP_PROXY can lead to issues. If http_proxy is unset but HTTP_PROXY is set, neither is added. Conversely, if http_proxy is set but HTTP_PROXY is unset, it appends HTTP_PROXY=, injecting an empty value. Checking each variable independently is much more robust.
  2. Unnecessary systemctl Error Noise: In non-systemd environments (e.g., Docker container builds), running systemctl daemon-reexec will fail and print error messages to stderr. Checking if systemd is running ([[ -d /run/systemd/system ]]) before calling systemctl keeps the logs clean.
  3. Trailing Space: The printf command adds a trailing space to default_env. We can trim it using parameter expansion ${default_env% }.
  if [[ -n "${HTTP_PROXY:-}" || -n "${HTTPS_PROXY:-}" || -n "${http_proxy:-}" || -n "${https_proxy:-}" ]]; then
    echo "DEBUG: configure_systemd_proxy: Injecting proxy overrides into systemd manager..." >&2
    mkdir -p "${systemd_conf_dir}"

    local env_strings=()
    if [[ -n "${http_proxy:-}" ]];  then env_strings+=( "http_proxy=${http_proxy}" ); fi
    if [[ -n "${HTTP_PROXY:-}" ]];  then env_strings+=( "HTTP_PROXY=${HTTP_PROXY}" ); fi
    if [[ -n "${https_proxy:-}" ]]; then env_strings+=( "https_proxy=${https_proxy}" ); fi
    if [[ -n "${HTTPS_PROXY:-}" ]]; then env_strings+=( "HTTPS_PROXY=${HTTPS_PROXY}" ); fi
    if [[ -n "${no_proxy:-}" ]];    then env_strings+=( "no_proxy=${no_proxy}" ); fi
    if [[ -n "${NO_PROXY:-}" ]];    then env_strings+=( "NO_PROXY=${NO_PROXY}" ); fi

    # Format as systemd DefaultEnvironment space-separated list
    local default_env
    default_env=$(printf '"%s" ' "${env_strings[@]}")
    default_env=${default_env% }

    cat <<EOF > "${systemd_conf_file}"
[Manager]
DefaultEnvironment=${default_env}
EOF

    if [[ -d /run/systemd/system ]]; then
      echo "DEBUG: configure_systemd_proxy: Executing systemd daemon-reexec..." >&2
      systemctl daemon-reexec || true
    fi
  fi

@cjac

cjac commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@Aditya-Shah - could use a review

@Aditya-Shah

Copy link
Copy Markdown

We can handle the case issue suggested by AI review above, rest looks good

@cjac

cjac commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates gce-proxy-setup.sh to derive the cluster name from the hostname as a fallback, improves whitespace stripping for the no-proxy list, configures systemd proxy overrides, and registers a deferred systemd service (dataproc-proxy-init) during custom image builds. The review comments provide valuable feedback to ensure robustness: waiting for the network to be online before querying metadata, adding a fallback check for custom image builds since IS_CUSTOM_IMAGE_BUILD may not be set, using uname -n instead of hostname -s for better compatibility in minimal environments, and guarding systemctl calls to prevent failures when the command is unavailable.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +425 to +427
DefaultDependencies=no
After=systemd-udevd.service local-fs.target
Before=google-dataproc-agent.service

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since DefaultDependencies=no is set, this service runs extremely early in the boot process. At this stage, the network interface may not be fully initialized or online, causing the curl requests to the metadata server to fail. Because get_metadata_value does not retry on network failures, the script will fail to detect the proxy and skip configuration entirely.

Adding network-online.target to both After and Wants guarantees that the network is fully up before the script attempts to query the metadata server.

Suggested change
DefaultDependencies=no
After=systemd-udevd.service local-fs.target
Before=google-dataproc-agent.service
DefaultDependencies=no
After=systemd-udevd.service local-fs.target network-online.target
Wants=network-online.target
Before=google-dataproc-agent.service

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +454 to +456
if [[ "${IS_CUSTOM_IMAGE_BUILD:-}" == "true" ]]; then
setup_deferred_service
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The environment variable IS_CUSTOM_IMAGE_BUILD is not set or exported by run.sh (the entry point for the custom image build workflow). As a result, setup_deferred_service will never be executed during image builds, and the deferred service will not be registered.

Falling back to checking the custom-sources-path metadata attribute provides a reliable way to detect if the script is running within the Dataproc custom image build workflow.

Suggested change
if [[ "${IS_CUSTOM_IMAGE_BUILD:-}" == "true" ]]; then
setup_deferred_service
fi
if [[ "${IS_CUSTOM_IMAGE_BUILD:-}" == "true" || -n "$(get_metadata_attribute 'custom-sources-path' '')" ]]; then
setup_deferred_service
fi

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +137 to +138
local hostname
hostname=$(hostname -s)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using hostname -s spawns a subshell and relies on the external hostname command being installed. In minimal or containerized environments (such as minimal Docker images used during testing or custom builds), the hostname command may not be present, which will cause the script to fail under set -e.

Using uname -n (part of coreutils and guaranteed to be present) combined with bash parameter expansion is a more robust and efficient alternative.

Suggested change
local hostname
hostname=$(hostname -s)
local hostname
hostname=$(uname -n)
hostname="${hostname%%.*}"


# 3. Enable the service so it runs on every boot
echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the script is executed in an environment where systemctl is not installed (such as a minimal Docker container used for testing or building), this command will fail and terminate the script due to set -e.

Guarding the call with a check for the existence of the systemctl command prevents unexpected script failures.

Suggested change
systemctl enable "${service_name}.service"
if command -v systemctl >/dev/null 2>&1; then
systemctl enable "${service_name}.service"
else
echo "WARNING: systemctl command not found. Skipping enabling of ${service_name}.service" >&2
fi

@cjac cjac force-pushed the proxy-script-20260605 branch from a80879c to 21a6f87 Compare June 5, 2026 23:13
@cjac

cjac commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates gce-proxy-setup.sh to derive the cluster name from the hostname as a fallback, improves parsing of the no-proxy metadata, and introduces systemd proxy configuration overrides along with a deferred systemd service for custom image builds. The review feedback highlights critical issues where the script exits with exit 1 if systemd is not running as PID 1, which would break container-based or chroot-based custom image builds. The reviewer provided actionable suggestions to gracefully handle non-systemd environments by skipping daemon-reexec and manually enabling the systemd service if systemctl is unavailable.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +332 to +335
if [[ ! -d /run/systemd/system ]]; then
echo "ERROR: configure_systemd_proxy: systemd is not running as PID 1. This environment is unsupported." >&2
exit 1
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Exiting with exit 1 when systemd is not running as PID 1 will fail the script during container-based or chroot-based custom image builds (which the PR description explicitly notes should be supported).

Instead of failing, we should allow the configuration file to be written (so it is present when the image boots on a live VM) and only skip the systemctl daemon-reexec call.

Suggested change
if [[ ! -d /run/systemd/system ]]; then
echo "ERROR: configure_systemd_proxy: systemd is not running as PID 1. This environment is unsupported." >&2
exit 1
fi
if [[ ! -d /run/systemd/system ]]; then
echo "INFO: configure_systemd_proxy: systemd is not running as PID 1. Overrides will be written but daemon-reexec will be skipped." >&2
fi

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +358 to +359
echo "DEBUG: configure_systemd_proxy: Executing systemd daemon-reexec..." >&2
systemctl daemon-reexec || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Only execute systemctl daemon-reexec if systemd is actually running as PID 1 to avoid unnecessary errors or warnings in non-systemd environments.

Suggested change
echo "DEBUG: configure_systemd_proxy: Executing systemd daemon-reexec..." >&2
systemctl daemon-reexec || true
if [[ -d /run/systemd/system ]]; then
echo "DEBUG: configure_systemd_proxy: Executing systemd daemon-reexec..." >&2
systemctl daemon-reexec || true
fi

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +449 to +455
if ! command -v systemctl >/dev/null 2>&1 || [[ ! -d /run/systemd/system ]]; then
echo "ERROR: setup_deferred_service: systemd is not running as PID 1. Cannot enable deferred service." >&2
exit 1
fi

echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Exiting with exit 1 here will fail custom image builds in environments where systemd is not running as PID 1 (such as Docker containers or chroots).

systemctl enable can actually run successfully in chroots/containers even if systemd is not running as PID 1. If systemctl is not available at all, we can fallback to manually creating the symlink or logging a warning instead of failing the build.

Suggested change
if ! command -v systemctl >/dev/null 2>&1 || [[ ! -d /run/systemd/system ]]; then
echo "ERROR: setup_deferred_service: systemd is not running as PID 1. Cannot enable deferred service." >&2
exit 1
fi
echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"
if command -v systemctl >/dev/null 2>&1; then
echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service" || true
else
echo "WARNING: setup_deferred_service: systemctl not found. Manually enabling service..." >&2
mkdir -p /etc/systemd/system/multi-user.target.wants
ln -sf "${service_file}" "/etc/systemd/system/multi-user.target.wants/${service_name}.service"
fi

@cjac cjac self-assigned this Jun 5, 2026
@cjac

cjac commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

Here are drafted replies you can copy-paste directly into your code review tool (such as Critique or GitHub).

These replies politely accept the robustness improvements while correcting the bot's assumption about the build environment (clarifying that the primary Dataproc custom image
builder runs on a live GCE VM where systemd is indeed PID 1).
──────

1. Reply to the configure_systemd_proxy / systemctl daemon-reexec thread:

│ Acknowledged. While the Dataproc custom image builder ( generate_custom_image.py ) runs on a live GCE VM where systemd is guaranteed to be running as PID 1, guarding the
daemon-
│ reexec call with [[ -d /run/systemd/system ]] is a clean way to prevent log noise in chroot-based or containerized build environments. I have applied this guard.
──────

2. Reply to the IS_CUSTOM_IMAGE_BUILD context detection thread:

│ Correct. The run.sh entry point does not export IS_CUSTOM_IMAGE_BUILD to the environment during the customization phase. I have updated the check to fall back to the custom-
│ sources-path metadata attribute, which is uniquely populated on the builder VM by the orchestrator.
──────

3. Reply to the hostname -s vs uname -n portability thread:

│ Good catch. I have replaced the hostname -s call with uname -n and Bash parameter expansion ( ${hostname%%.*} ). This removes the dependency on the external hostname
│ binary
│ and avoids spawning an unnecessary subshell.
──────

4. Reply to the systemctl enable / chroot-safe fallback thread:

│ Excellent point. In the primary Dataproc custom image workflow, we customize the image on a live GCE VM where systemd is active. However, to support offline chroot-based
builders
│ or minimal environments lacking systemctl entirely, we have removed the hard exit and implemented a robust fallback:

│ We now attempt systemctl enable || true if the command is present (which works in chroots), and fall back to manually creating the multi-user.target.wants symlink on the
│ filesystem if systemctl is missing. This ensures the deferred service is successfully enabled in all build contexts.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the gce-proxy-setup.sh script to fallback to deriving the cluster name from the hostname, configure systemd proxy overrides, and register a deferred systemd service (dataproc-proxy-init) for custom image builds. The review comments identify critical issues with the current implementation: parsing space-separated values in no-proxy incorrectly concatenates them into a single invalid hostname, and hard-failing when systemd is not running as PID 1 breaks offline image customization workflows (such as Packer or Docker builds). The reviewer provides actionable suggestions to robustly parse the proxy list, write systemd configurations regardless of the active PID 1 status, and manually symlink the deferred service if systemctl is unavailable.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread startup_script/gce-proxy-setup.sh Outdated
local user_no_proxy_list=()
if [[ -n "${user_no_proxy}" ]]; then
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}"
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy//[[:space:]]/}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the no-proxy metadata attribute contains space-separated values (e.g., a b c), using ${user_no_proxy//[[:space:]]/} will strip all whitespace and concatenate them into a single invalid hostname (e.g., abc).

To robustly handle both comma-separated and space-separated lists (including mixed formats with arbitrary whitespace) without creating empty array elements, we can replace all commas with spaces and use read -r -a with the default whitespace IFS.

Suggested change
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy//[[:space:]]/}"
IFS=$' \t\n' read -r -a user_no_proxy_list <<< "${user_no_proxy//,/ }"

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +332 to +335
if [[ ! -d /run/systemd/system ]]; then
echo "ERROR: configure_systemd_proxy: systemd is not running as PID 1. This environment is unsupported." >&2
exit 1
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Exiting with exit 1 when systemd is not running as PID 1 prevents the script from running successfully during offline image builds (such as in chroots, Docker containers, or Packer environments).

Instead of failing fast, we should write the systemd configuration file anyway (so it is baked into the final image and applied on the next boot) and only skip the runtime systemctl daemon-reexec command.

Suggested change
if [[ ! -d /run/systemd/system ]]; then
echo "ERROR: configure_systemd_proxy: systemd is not running as PID 1. This environment is unsupported." >&2
exit 1
fi
# Note: We write the configuration even if systemd is not currently running as PID 1 (e.g. during offline image builds) so it applies on boot.

Comment thread startup_script/gce-proxy-setup.sh Outdated
Comment on lines +449 to +455
if ! command -v systemctl >/dev/null 2>&1 || [[ ! -d /run/systemd/system ]]; then
echo "ERROR: setup_deferred_service: systemd is not running as PID 1. Cannot enable deferred service." >&2
exit 1
fi

echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Requiring systemd to be running as PID 1 to enable the deferred service breaks offline image customization workflows. systemctl enable does not require systemd to be running as PID 1 to manage symlinks, and we can also provide a manual symlink fallback if systemctl is entirely missing from the build environment.

Suggested change
if ! command -v systemctl >/dev/null 2>&1 || [[ ! -d /run/systemd/system ]]; then
echo "ERROR: setup_deferred_service: systemd is not running as PID 1. Cannot enable deferred service." >&2
exit 1
fi
echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"
if command -v systemctl >/dev/null 2>&1; then
echo "INFO: setup_deferred_service: Enabling systemd service ${service_name}" >&2
systemctl enable "${service_name}.service"
else
echo "INFO: setup_deferred_service: systemctl not found. Manually enabling service..." >&2
mkdir -p /etc/systemd/system/multi-user.target.wants
ln -sf "${service_file}" "/etc/systemd/system/multi-user.target.wants/${service_name}.service"
fi

Comment thread startup_script/gce-proxy-setup.sh Outdated
…guide

This change set addresses file resolution in the custom image builder,
refactors the guest proxy setup script, and adds a verification guide:

1. Path Resolution: Resolves a failure in the Python custom image builder
   (`generate_custom_image.py`) where it could not locate the secure-boot
   key generation script when executed as a `.par` file in a remote Borg
   runner environment.

2. Guest Proxy Setup: Refactors the `gce-proxy-setup.sh` script into a
   decoupled three-phase architecture (Introspection, Planning/Reconciliation,
   and Mutation) backed by a guest-state JSON cache.

3. Testing & Verification: Adds `TESTING.md` to document the post-build
   verification workflow for secure-boot custom images.

4. Configuration Sample: Updates `env.json.sample` to include all required
   variables, defaulting the customization script to a baseline test.

---

### Detailed Changes

#### 1. Custom Image Builder (Python)
*   **File**: `custom_image_utils/shell_script_generator.py`
*   **Path Resolution**: Replaced the hardcoded relative path to
    `create-key-pair.sh` with a multi-candidate search algorithm.
    It now searches:
    1. Relative to the current working directory (CWD).
    2. Relative to the Python module file location.
    3. Relative to `sys.argv[0]` (the executing `.par` or script directory).
*   **Bash Template Integration**: Sourced and resolved the script path in
    Python and injected it into the generated Bash template via the
    `{create_key_pair_script}` placeholder.

#### 2. Guest Proxy Setup (Bash)
*   **File**: `startup_script/gce-proxy-setup.sh`
*   **Three-Phase Architecture**:
    *   **Phase 1: Introspection**: Audits the VM and GCE metadata.
        *   Introduced `initialize_guest_state` which stores GCE metadata
            and system state in a JSON cache in `/dev/shm/metadata_cache`.
        *   Restricted GCE metadata queries to prevent downloading project-wide
            SSH keys.
        *   Enforces the cache with `get_cached_state` via `jq` queries.
    *   **Phase 2: Planning & Reconciliation**:
        *   Refactored `set_proxy` to compare the desired state (asserted by
            GCE metadata) against the actual state (audited in Phase 1).
        *   Logs a planning report showing skipping or mutation actions for
            `/etc/environment` and systemd.
        *   Implements idempotency: exits early if actual state matches desired
            state, unless forced via `FORCE_APPLY=1`.
        *   Added a fallback to derive the Dataproc cluster name from the
            hostname if the `dataproc-cluster-name` metadata is missing.
    *   **Phase 3: Mutation**: Applies the planned system modifications.
        *   Added `configure_systemd_proxy` to inject proxy overrides into the
            systemd manager configuration and trigger `systemctl daemon-reexec`.
        *   Refactored `repair_boto` to use cached state and updated `sed`
            regex for proxy key stripping.
        *   Added `setup_deferred_service` to copy itself to `/usr/local/sbin/`
            and provision a oneshot systemd service (`dataproc-proxy-init.service`)
            that runs BEFORE `google-dataproc-agent.service` on boot.

#### 3. Documentation & Samples (Markdown / JSON)
*   **File**: `examples/secure-boot/TESTING.md`
    *   Added a verification guide detailing cluster provisioning, driver load
        checks (`nvidia-smi`), UEFI module signature auditing (`modinfo`),
        and kernel event logging (`dmesg`). Includes timing references for
        build phases.
*   **File**: `examples/secure-boot/env.json.sample`
    *   Added missing required variables `IMAGE_VERSION` and `CUSTOMIZATION_SCRIPT`.
    *   Defaulted `CUSTOMIZATION_SCRIPT` to `examples/secure-boot/no-customization.sh`.

TAG=agy
@cjac cjac force-pushed the proxy-script-20260605 branch from f822c9d to 1e44115 Compare June 10, 2026 22:16
Comment thread custom_image_utils/shell_script_generator.py
Comment thread startup_script/gce-proxy-setup.sh Outdated

@dilipgodhia dilipgodhia left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass done.

Address review feedback on PR GoogleCloudDataproc#126 by improving the robustness and modularity
of the systemd proxy configuration helper in `gce-proxy-setup.sh`:

1.  **Exit to Return Refactor:** Replace `exit 1` with `return 1` inside
    `configure_systemd_proxy` when systemd is not running as PID 1. This
    decouples the function from the global script execution, improving
    modularity and enabling easier unit testing. The script's active
    `set -euo pipefail` still guarantees a fail-fast exit at the call site.

2.  **Strict daemon-reexec Verification:** Remove the permissive `|| true`
    from `systemctl daemon-reexec` and replace it with an explicit error
    handling block. If the re-exec fails, the function now logs a distinct
    error and returns `1` to trigger a fail-fast abort. This replaces a
    technically infeasible suggestion to check `UserspaceTimestamp` (which
    is a boot-time constant and does not change on re-exec).

3.  **Explicit Success Return:** Add `return 0` at the end of the function
    to ensure it always returns a clean status on successful execution or
    noop bypass.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
@cjac cjac marked this pull request as ready for review June 12, 2026 21:46
@cjac cjac merged commit e0279ca into GoogleCloudDataproc:main Jun 12, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants