
feat(gcp-cvm): add GCP confidential image build kit #5

Open. Wilbert957 wants to merge 15 commits into main from feat/gcp-cvm-build
Wilbert957 (Collaborator) commented Jun 18, 2026
Build a hardened, measured, remotely-attestable GCP (Intel TDX) confidential image for 0g-tapp from a stock Ubuntu 24.04 cloud image, using cryptpilot for measured FDE.

What's here

  • gcp-cvm/build-gcp-tapp.sh — one-command pipeline: base provisioning (tapp-server + Docker + Intel SGX/libtdx-attest + config + DNS) → kernel swap (generic → gcp, for RTMR extend) → cryptpilot-convert → ESP grub sync → optional security hardening. Toggle with HARDEN=1|0.
  • gcp-cvm/prepare-gcp-tapp.sh / gcp-cvm/fix-esp-grub.sh — stage-B-only (kernel + convert + ESP) and ESP-only helpers, reused by the main pipeline.
  • gcp-cvm/cryptpilot-gcp-boot-fix.md — root-cause analysis + full reproducible SOP + integrity notes, covering the four problems we hit (boot crash from GCP's dual grub.cfg, read-only rootfs / RTMR-not-extended, runtime RTMR extend failure, broken DNS) and the convert-side issues to confirm with the cryptpilot maintainers.
  • gcp-cvm/README.md, gcp-cvm/config_dir/fde.toml, root README.md pointer, .gitignore for *.deb/*.qcow2.

Two image variants (same base, same pipeline)

| | hardened (HARDEN=1) | dev (HARDEN=0) |
|---|---|---|
| Purpose | production | debugging |
| ssh / cloud-init / google agents / snapd / … | removed | kept |
| google-guest-agent (GCP SSH key injection) | omitted by design → not SSH-reachable | installed → SSH works |
| kernel / gve / cryptpilot / tapp-server v0.1.0 | identical | identical |

google-guest-agent is a back-door-class component: it can push changes into the instance from outside the measured app. The hardened image therefore omits it and is intentionally not SSH-reachable. The dev image reinstalls it to restore SSH, and also adds a 169.254.169.254 metadata.google.internal entry to /etc/hosts, which is needed because the build pins resolv.conf to public DNS.
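
As an illustration of the /etc/hosts step in the dev variant, here is a minimal sketch of an idempotent append. The function name `add_metadata_host` and its file argument are hypothetical and not taken from build-gcp-tapp.sh; the actual script may do this differently (for example through virt-customize).

```shell
# Hypothetical helper, not the script's own code.
# add_metadata_host HOSTS_FILE
# Maps the metadata hostname to the link-local metadata IP, once.
add_metadata_host() {
    hosts_file=$1
    entry='169.254.169.254 metadata.google.internal'
    # Append only when no existing line mentions the hostname, so repeated
    # builds do not accumulate duplicate entries.
    if ! grep -qsF 'metadata.google.internal' "$hosts_file"; then
        printf '%s\n' "$entry" >> "$hosts_file"
    fi
}
```

Calling it twice on the same file leaves a single entry, which keeps the step safe to re-run.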

§0 reproducibility note

The regression was run from the truly-official Ubuntu cloud image, which revealed that the previously-used temp-fixed.qcow2 base was not a pristine "official + gVNIC" image — it also carried (1) the root partition resized to 20 GiB and (2) google-guest-agent. Both are now documented in §0 and handled by the build, so the image is reproducible from the official image alone.

Companion fix

The cryptpilot show-reference-value fix for never-booted images (empty grubenv / missing saved_entry) is openanolis/cryptpilot#128.

Binaries are gitignored; tapp-server is pulled from release v0.1.0.

🤖 Generated with Claude Code

Build a hardened, measured, attestable GCP (Intel TDX) confidential
image from a stock Ubuntu 24.04 cloud image.

- build-gcp-tapp.sh: one-command pipeline (base provisioning + kernel
  swap + cryptpilot-convert + ESP sync + security hardening)
- prepare-gcp-tapp.sh / fix-esp-grub.sh: stage-B-only and ESP-only helpers
- cryptpilot-gcp-boot-fix.md: root-cause analysis, full SOP, integrity
  notes, and the convert-side issues to confirm with the cryptpilot maintainers
- security hardening: purge ssh/cloud-init/google-guest-agent/osconfig/
  startup-scripts/snapd/etc., mask console getty, MAC-agnostic netplan
- README: point the confidential-instance section to gcp-cvm/ for the GCP variant

Binaries (*.deb, *.qcow2) are gitignored; tapp-server is pulled from release.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wilbert957 (Collaborator, Author) commented:

✅ Reviewed. Pure shell scripts + docs, no Rust code changes — no compilation impact.

  • build-gcp-tapp.sh / prepare-gcp-tapp.sh / fix-esp-grub.sh: well-structured, proper set -euo pipefail, env-var overrides for all paths.
  • Security hardening (§11): thorough purge of ssh/cloud-init/google-guest-agent/snapd/etc. Masking getty + MAC-agnostic netplan is correct.
  • .gitignore correctly excludes *.qcow2 / *.deb / disk.raw.
  • cryptpilot-gcp-boot-fix.md: detailed root-cause analysis + SOP.

One suggestion: cryptpilot-fde_0.7.0_amd64.deb is referenced but gitignored — consider adding a sha256sum check in the build script to verify the deb when present. Not blocking.
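
A minimal sketch of what such a check could look like, assuming GNU `sha256sum` is available on the build host. The function name `verify_sha256` and the variable `FDE_PACKAGE_SHA256` are hypothetical; the pinned hash value would have to come from the maintainers.

```shell
# Hypothetical helper, not part of the current build script.
# verify_sha256 FILE EXPECTED_HEX
# Returns non-zero (with a message on stderr) when the digest differs.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<hex>  <name>"; keep only the hex digest.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file: got $actual, want $expected" >&2
        return 1
    fi
}

# Possible use in the build script (FDE_PACKAGE_SHA256 is an assumed variable):
#   if [ -f "$FDE_PACKAGE" ]; then
#       verify_sha256 "$FDE_PACKAGE" "$FDE_PACKAGE_SHA256" || exit 1
#   fi
```

The same helper would also cover the tapp-server download.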

LGTM.

Wei and others added 9 commits June 24, 2026 11:20
…e grubenv saved_entry

The proper fix belongs in cryptpilot-fde load_kernel_artifacts
(src/cmd/fde/disk.rs:395-397), which currently errors when grubenv has no
saved_entry. Freshly built / never-booted images have an empty grubenv and
boot the grub.cfg default (set default="0" → first menuentry), so the
reference extractor should resolve that fallback instead of erroring.
Documents the convert-side saved_entry injection as a local workaround only.
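
The fallback described above can be illustrated in shell. This is only a sketch of the idea (use saved_entry when grubenv has one, otherwise the literal default from grub.cfg), not the cryptpilot-fde implementation, which is Rust. The function name and the simplified grub.cfg parsing are assumptions.

```shell
# Hypothetical illustration of the fallback, not cryptpilot-fde code.
# resolve_boot_entry GRUBENV GRUB_CFG
resolve_boot_entry() {
    grubenv=$1
    grub_cfg=$2
    # grubenv stores plain key=value lines; take saved_entry if present.
    entry=$(sed -n 's/^saved_entry=//p' "$grubenv" | head -n 1)
    if [ -z "$entry" ]; then
        # Never-booted image: use the first literal `set default="..."`,
        # skipping variable references such as "${next_entry}".
        entry=$(sed -n 's/^[[:space:]]*set default="\([^"$]*\)".*/\1/p' "$grub_cfg" | head -n 1)
    fi
    # If neither file yields a value, assume the first menuentry.
    printf '%s\n' "${entry:-0}"
}
```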

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nce-value

Document how to compute remote-attestation reference values from the built
image with `cryptpilot-fde show-reference-value --disk`, including:
- the prerequisite cryptpilot-fde fix (openanolis/cryptpilot#126) so it works
  on never-booted images (empty grubenv) without the convert workaround
- how to build the fixed cryptpilot-fde from the 0gfoundation fork branch
- flags (--disk, --hash-algo, --stage) and the AAEL reference-value outputs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- tapp-server: default release v0.0.5 -> v0.1.0 (build-gcp-tapp.sh URL, README, doc)
- cryptpilot saved_entry fix: reference #128 (against current master 0.8.0,
  cryptpilot-fde/src/disk/grub.rs) instead of the superseded #126 (stale 0.2.7)
- §12 build steps updated for 0.8.0: cargo build -p cryptpilot-fde produces
  cryptpilot-fde-host/-guest; add cryptsetup-devel build dep; branch fix/srv-default-entry
- build-gcp-tapp.sh: add HARDEN toggle (HARDEN=1 hardened / HARDEN=0 dev image)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reparation

- cryptpilot-gcp-boot-fix.md and README.md fully translated to English
- add "§0 Preparation": how temp-fixed.qcow2 is produced from the official
  Ubuntu noble cloud image + GCP gVNIC driver (gve-dkms), plus the prebuilt
  image download link

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-fde deb

The cryptpilot-fde deb is a build-time prerequisite for the conversion step
(§9), unrelated to producing the base image (temp-fixed.qcow2 = Ubuntu + gVNIC).
Drop it from the Preparation materials.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
§0 only lists the Ubuntu cloud image as the base-image material.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The stock Ubuntu noble cloud image ships a ~3.5 GiB disk, too small for
the kernel + app + Docker layers. Document resizing the root partition to
20 GiB via qemu-img resize + growpart + resize2fs, and warn against
virt-resize --expand which renumbers partitions (sda1 -> sda4) and breaks
the sda1=rootfs / sda16=/boot assumptions the build flow depends on.
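
A hedged sketch of the resize sequence named above. It assumes the growpart and resize2fs steps are run inside the image through virt-customize, that the disk appears there as /dev/sda, and that the image ships growpart; the documented §0.2 command may differ. The `run` wrapper, `DRY_RUN` variable, and function name are hypothetical, and the sketch only prints the commands unless `DRY_RUN=0` is set.

```shell
# Hypothetical sketch; prints commands by default instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# resize_base_image IMAGE SIZE
resize_base_image() {
    img=$1
    size=$2
    # 1. Grow the virtual disk itself.
    run qemu-img resize "$img" "$size"
    # 2. Grow partition 1 and its filesystem in place, so partition numbering
    #    (sda1 = rootfs, sda16 = /boot) stays unchanged, unlike virt-resize --expand.
    run virt-customize -a "$img" \
        --run-command 'growpart /dev/sda 1' \
        --run-command 'resize2fs /dev/sda1'
}

# Example: resize_base_image noble-server-cloudimg-amd64.img 20G
```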

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A freshly built image (from the truly-official Ubuntu cloud image) could
not be reached over SSH, while images built from the legacy temp-fixed.qcow2
base could. Root cause: temp-fixed shipped google-guest-agent, which injects
the instance SSH public key from the metadata server (via 169.254.169.254,
no DNS) into ~ubuntu/.ssh/authorized_keys; the official image does not.

The dev variant (HARDEN=0) now reinstalls google-guest-agent and adds the
169.254.169.254 metadata.google.internal mapping to /etc/hosts (needed
because the build pins resolv.conf to public DNS, so the agent cannot
resolve the metadata hostname otherwise). The hardened variant intentionally
omits it — google-guest-agent is a back-door-class component and the
hardened image is not SSH-reachable by design.

Also document in §0 that temp-fixed.qcow2 deviates from the official image
(20 GiB resize + google-guest-agent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The three build scripts (build-gcp-tapp.sh, prepare-gcp-tapp.sh,
fix-esp-grub.sh) still had Chinese comments and echo strings. Translate
them all to English; no logic changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wilbert957 added a commit that referenced this pull request Jun 25, 2026
…ocs)

- .gitmodules: SSH → HTTPS url so clone/submodule update works without an SSH key
  (review blocker #1).
- register-shared-as.sh: guard against placeholder (_TODO) reference files; trap-clean
  the temp file on any exit; and detect a no-op injection by checking leftover qrv() on
  ref_* lines only (the qrv(key) helper definition legitimately keeps its own — checking
  the whole policy would false-positive). (review #2, #5, #6)
- policy.rego: document the AR4SI/EAR claim tiers and the non-affirming defaults
  (executables 33 / hardware 97 / configuration 36). (review #3)
- docs: fix verify.rs path to tapp-common/src/verify.rs. (review #4)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wei and others added 4 commits June 25, 2026 21:14
0g-tapp had no quick local way to check that a converted GCP confidential
image actually boots before uploading it to GCP. Add boot-smoke-test.sh,
which boots the image under QEMU/OVMF (UEFI) in the qemux/qemu container
(KVM if available, else TCG) and scans the serial console for the full boot
chain: grub -> gcp kernel -> cryptpilot-fde (dm-verity + zram + dm-snapshot)
-> /sysroot mount -> switch-root -> multi-user / tapp-server.service.

Validates everything except the TDX-only bits (RTMR extend, remote
attestation), which require real hardware; it is a pre-flight check, not a
replacement for on-hardware testing. Documented in the gcp-cvm README.
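
To illustrate the console scan, here is a minimal sketch that checks a captured serial log for a list of markers in order. It is not boot-smoke-test.sh itself: the function name is hypothetical, the marker strings in the usage comment are placeholders, and the real test may match different text or ignore ordering.

```shell
# Hypothetical sketch of an ordered marker scan over a serial-console log.
# check_boot_chain LOGFILE MARKER...
check_boot_chain() {
    log=$1
    shift
    line=0
    for marker in "$@"; do
        # Find the first line after the previous match that contains the marker.
        next=$(awk -v start="$line" -v m="$marker" \
            'NR > start && index($0, m) { print NR; exit }' "$log")
        if [ -z "$next" ]; then
            echo "boot chain broken: '$marker' not found after line $line" >&2
            return 1
        fi
        line=$next
    done
}

# Example with placeholder markers:
#   check_boot_chain serial.log 'GRUB' 'Linux version' 'cryptpilot-fde' 'tapp-server.service'
```

Requiring the markers in order catches a boot that prints later-stage text without having passed the earlier stages.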

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m build scripts

The local boot smoke test is a verification tool, not part of the build
pipeline; keep it under gcp-cvm/test/ rather than alongside the build
scripts. Update the README references accordingly. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ixed base

Remove deployment-specific values that were baked in as defaults:
- OWNER_ADDRESS (a concrete wallet address) and KBS_URLS (concrete KBS
  node IPs) are now REQUIRED; the build aborts if unset, so no specific
  deployment value is ever committed or silently baked into config.toml.
- §0 of the doc no longer points at the opaque, non-reproducible
  temp-fixed.qcow2 download; the base is built reproducibly from the
  official Ubuntu cloud image (resize + gVNIC). The SSH/google-guest-agent
  note is reframed around the dev vs hardened build instead of temp-fixed.

README documents the now-required variables.
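
A minimal sketch of a required-variable guard of the kind described above. The helper name `require_vars` is hypothetical; the actual script may simply use the `${VAR:?message}` expansion instead.

```shell
# Hypothetical helper; aborts when any named variable is unset or empty.
# require_vars NAME...
require_vars() {
    for name in "$@"; do
        # Indirect lookup via eval keeps this compatible with plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name must be set (no default is provided)" >&2
            return 1
        fi
    done
}

# Near the top of the build script:
#   require_vars OWNER_ADDRESS KBS_URLS || exit 1
```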

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cible by others

A third party could not follow the docs end-to-end: the conversion host
requirement was implicit and the package sources were missing. Add an
explicit Prerequisites section (README + doc §0.1):
- conversion host must be Anolis / Alibaba Cloud Linux 3 (al8); cryptpilot-convert
  is not packaged for Ubuntu hosts
- install cryptpilot-convert via the al8 RPM from openanolis/cryptpilot v0.7.0;
  the target-image runtime is the matching .deb from the same release
- host tooling: libguestfs-tools, qemu-img, nbd module, LIBGUESTFS_BACKEND=direct,
  root; docker only for the smoke test
Also make the §0.2 resize step a copy-pasteable virt-customize command.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wilbert957 (Collaborator, Author) commented:

Read through all three scripts + fde.toml + the smoke test. Agree with the LGTM — well-structured defensive shell (consistent set -euo pipefail, env overrides, required-arg checks, trap cleanup, guestfish layout asserts), clean two-variant design, and the QEMU/OVMF smoke test is a genuinely useful pre-flight. A few notes:

Worth confirming before treating any output as a production image

  • config_dir/fde.toml hardcodes a literal passphrase:
    [data.encrypt.exec]
    command = "echo"
    args = ["-n", "test-passphrase"]
    Even if this data volume is inert here ([data] integrity = false + build uses --rootfs-no-encryption, so rootfs is verity-only), baking test-passphrase into the image-build config is a smell. Can we confirm it's never exercised, or swap it for a real key provider for prod?

Non-blocking / follow-up

  • No checksum on downloaded inputs: tapp-server (wget) and FDE_PACKAGE (deb). The measured image catches post-build tampering, but a sha256sum on both inputs is cheap defense-in-depth (extends the deb suggestion already raised).
  • build-gcp-tapp.sh mutates $IN in place (stage A operates on the input directly), whereas prepare-gcp-tapp.sh defaults to a protective copy. The trailing echo warns about it, but given the (in, out) signature this is a footgun — consider copying first or announcing the in-place destruction up front.
  • Partition layout is hardcoded (base sda1; post-convert sda15=ESP / sda16=boot). Fine for GCP Ubuntu images, and the is-file/is-dir asserts fail loudly on mismatch rather than writing the wrong partition. OK as-is.
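
For the in-place mutation note, a copy-first step could look like the sketch below. The function name is hypothetical, and `cp --sparse=always` assumes GNU coreutils; it is one possible shape, not what prepare-gcp-tapp.sh actually does.

```shell
# Hypothetical helper: work on a copy so the input image is never modified.
# make_work_copy SRC DST
make_work_copy() {
    src=$1
    dst=$2
    if [ "$src" = "$dst" ]; then
        echo "error: input and output must be different files" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "error: refusing to overwrite existing $dst" >&2
        return 1
    fi
    # Keep holes sparse so a mostly-empty disk image is not fully allocated.
    cp --sparse=always "$src" "$dst"
}

# In the pipeline, stage A would then operate on the copy:
#   make_work_copy "$IN" "$OUT" || exit 1
```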

Naming nit (in favor): gcp-cvm/ is correct here — this genuinely builds a GCP-specific image, distinct from the verification side which is platform-agnostic. Don't let anyone "unify" it away.

…#21, Phase 1)

Add ENABLE_SYSBOX=1 to build-gcp-tapp.sh: installs sysbox-ce and registers
sysbox-runc as a dockerd runtime so in-container root is user-namespace
remapped (a sandbox kernel CVE is no longer host-equivalent).

Because the cryptpilot rootfs writable overlay is RAM-backed (zram) and
ephemeral, docker/sysbox data must not live on it: the build pins docker
data-root to /data/docker and adds a docker.service RequiresMountsFor=/data
drop-in + an fstab LABEL=tapp-data entry, so docker fails loud if the
persistent /data disk is not mounted (never silently writes to the RAM root).
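
The three pieces of configuration described above might be written roughly as follows. This is a sketch under assumptions, not the script's actual output: the function name, the drop-in file name, the ext4 filesystem type, the `nofail` option, and the /usr/bin/sysbox-runc path are all assumed here. It writes into a staging root so it can be tried without touching the host.

```shell
# Hypothetical sketch: write docker/sysbox data-root config under ROOT.
# write_docker_data_config ROOT
write_docker_data_config() {
    root=$1
    mkdir -p "$root/etc/docker" "$root/etc/systemd/system/docker.service.d"

    # Pin docker's data-root to the persistent disk and register sysbox-runc.
    cat > "$root/etc/docker/daemon.json" <<'EOF'
{
  "data-root": "/data/docker",
  "runtimes": {
    "sysbox-runc": { "path": "/usr/bin/sysbox-runc" }
  }
}
EOF

    # docker.service will not start unless /data is mounted.
    cat > "$root/etc/systemd/system/docker.service.d/10-require-data.conf" <<'EOF'
[Unit]
RequiresMountsFor=/data
EOF

    # Mount the labelled data disk at /data (filesystem type and options assumed).
    printf 'LABEL=tapp-data /data ext4 defaults,nofail 0 2\n' >> "$root/etc/fstab"
}
```

With `nofail`, a missing disk does not block boot, but the `RequiresMountsFor=/data` dependency still stops docker from starting on the RAM-backed root.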

Phase 1 is isolation only; /data confidentiality (KBS-bound dm-crypt) is
deferred to Phase 2. Default build is unchanged (ENABLE_SYSBOX defaults to 0).
boot-smoke-test.sh gains an optional CHECK_SYSBOX=1 static image check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>