test: enforce restricted PSS for CI user namespace #3487

Open
danish9039 wants to merge 21 commits into kubeflow:master from danish9039:pss-restricted-user-namespace-ci

Conversation

@danish9039 danish9039 commented May 26, 2026

Member

Summary of Changes

This PR makes CI validate Kubeflow user workloads under Kubernetes Pod Security Standards restricted.

  • Relabel the CI Profile namespace kubeflow-user-example-com to pod-security.kubernetes.io/enforce=restricted after the Profile Controller creates it.
  • Keep the production/default Profile Controller behavior unchanged.
  • Update workflow path filters so shared PSS test-script and local overlay changes rerun affected component workflows.
  • Make CI workload examples compatible with restricted PSS across Katib, KServe, Notebooks, Workspaces, Trainer, Training Operator, Ray, and Istio validation.
  • Move Katib, Trainer, and Training Operator restricted-PSS test mutations into local overlays instead of live kubectl patch or generated temp-overlay logic.
  • Move PR-added script debug dumps into GitHub Actions failure-log artifacts while keeping test waits fail-fast.
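The relabel step in the first bullet can be sketched as a small Bash helper. This is a minimal sketch, not the PR's actual script: the function name `enforce_restricted_pss` is hypothetical, and the verification via `kubectl get ... -o jsonpath` is an assumption about how the label assertion could be written. Only the two label keys and `--overwrite` come from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: relabel a namespace to restricted PSS, then read the
# label back and fail if it did not take effect.
enforce_restricted_pss() {
  local namespace="$1"

  # --overwrite replaces the baseline label that the Profile Controller set.
  kubectl label namespace "$namespace" \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/enforce-version=latest \
    --overwrite

  # Dots inside a label key must be escaped in kubectl jsonpath expressions.
  local level
  level="$(kubectl get namespace "$namespace" \
    -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}')"

  if [[ "$level" != "restricted" ]]; then
    echo "expected enforce=restricted on ${namespace}, got '${level}'" >&2
    return 1
  fi
}

# Example invocation with the CI Profile namespace named above:
#   enforce_restricted_pss kubeflow-user-example-com
```

Reading the label back keeps the check fail-fast: a namespace that silently stays on baseline would otherwise let non-compliant test workloads pass.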

Notes

This is intentionally CI/test focused. It does not change the default customer Profile namespace policy.

The new application changes are local Kubeflow manifests overlays under applications/*/overlays. They do not edit synchronized applications/*/upstream folders or common/* manifests.

Follow-up notes are tracked separately for keeping the copied Katib CI config in sync during future Katib syncs and for removing the temporary Training Operator overlay after the final 26.03.1 path is available.

Related

Follow-up to #3444.

Validation

Latest head 7adb2d55 is green on all technical checks.

  • DCO passes.
  • Formatting checks pass for YAML, Bash, and Python.
  • Full Kubeflow integration passes.
  • Component workflows pass, including Katib, Trainer, Training Operator, notebooks, pipelines, workspaces, KServe, Ray, Istio validation, dashboard, and Dex/OAuth2.
  • Local validation included bash -n, shellcheck, YAML parsing, kustomize build for the new overlays, rendered field assertions, git diff --check, and a guard confirming no applications/*/upstream/** changes.
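The guard against `applications/*/upstream/**` changes mentioned in the last bullet could look roughly like the following. This is a sketch under assumptions: the function name `assert_no_upstream_changes` is hypothetical, and the base ref `origin/master` in the usage comment is a guess at how the changed-file list was produced.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads changed file paths on stdin (one per line) and fails if any path lies
# under a synchronized applications/<component>/upstream/ directory.
assert_no_upstream_changes() {
  local offending
  # grep exits non-zero when nothing matches, which is the good case here.
  offending="$(grep -E '^applications/[^/]+/upstream/' || true)"

  if [[ -n "$offending" ]]; then
    echo "synchronized upstream folders must not be edited:" >&2
    echo "$offending" >&2
    return 1
  fi
}

# Example invocation (the base ref origin/master is an assumption):
#   git diff --name-only origin/master...HEAD | assert_no_upstream_changes
```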

tide is still waiting for review labels.

Contributor Checklist

  • I have tested these changes with kustomize. See Installation Prerequisites.
  • All commits are signed-off to satisfy the DCO check.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue.

abdullahpathan22 and others added 3 commits May 27, 2026 03:26
Overwrites the default baseline namespace label created by the profile
controller exclusively during CI tests. This guarantees that test
workloads simulate strict pod security standard (PSS) restricted
enforcements without modifying production default baselines.

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
- Broaden katib triggers: tests/katib_install.sh -> tests/katib*
- Broaden pipeline triggers: individual files -> tests/pipeline*
- Add tests/pipeline* trigger to pipeline_run_from_notebook workflow
- Replace dead experimental/security/PSS/* path (directory no longer
  exists) with actual test files: tests/kubeflow_profile_install.sh
  and tests/PSS_enable.sh across all affected workflows

Note: Dashboard/profiles directory paths are intentionally kept as-is
since applications/profiles/ does not exist yet on master. Those paths
will be updated once the directory restructure lands.

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Copilot AI review requested due to automatic review settings May 26, 2026 21:57
@google-oss-prow google-oss-prow Bot requested a review from juliusvonkohout May 26, 2026 21:57
@google-oss-prow


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kimwnasptd for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot requested a review from tarekabouzeid May 26, 2026 21:57
@github-actions


Welcome to the Kubeflow Manifests Repository

Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.

Before making more PRs:
Please ensure your PR follows our Contributing Guide.
Please also be aware that many components are synchronized from upstream via the scripts in /scripts.
So in some cases you have to fix the problem in the upstream repositories first, but you can use a PR against kubeflow/manifests to test the platform integration.

Community Resources:

Thanks again for helping to improve Kubeflow.

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates CI to validate/enforce Kubernetes Pod Security Standards (PSS) settings for Kubeflow profile namespaces, and wires the related test scripts into multiple GitHub Actions path filters so relevant workflows run when these scripts change.

Changes:

  • Update kubeflow_profile_install.sh to label the Kubeflow user namespace with pod-security.kubernetes.io/enforce=restricted and enforce-version=latest, and assert the labels are applied.
  • Add tests/kubeflow_profile_install.sh and tests/PSS_enable.sh to workflow on.pull_request.paths filters across several test workflows.
  • Broaden some workflow path filters (e.g., tests/pipeline*, tests/katib*) to cover more related changes.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/kubeflow_profile_install.sh | Labels the profile namespace for PSS restricted + adds assertions for applied labels. |
| .github/workflows/workspaces_pipeline_run_test.yaml | Triggers workflow on profile/PSS test script changes. |
| .github/workflows/volumes_web_application_test.yaml | Triggers workflow on profile install script changes. |
| .github/workflows/training_operator_test.yaml | Triggers workflow on profile/PSS test script changes. |
| .github/workflows/trainer_test.yaml | Triggers workflow on profile/PSS test script changes. |
| .github/workflows/pipeline_test.yaml | Broadens pipeline path trigger + triggers on profile/PSS test script changes. |
| .github/workflows/pipeline_run_from_notebook.yaml | Adds pipeline wildcard + profile/PSS script triggers. |
| .github/workflows/kserve_test.yaml | Triggers workflow on profile/PSS test script changes. |
| .github/workflows/kserve_models_web_application_test.yaml | Triggers workflow on profile install script changes. |
| .github/workflows/katib_test.yaml | Broadens katib path trigger + triggers on profile/PSS test script changes. |
| .github/workflows/istio_validation.yaml | Triggers workflow on profile/PSS test script changes. |
| .github/workflows/dex_oauth2-proxy_test.yaml | Triggers workflow on profile/PSS test script changes. |

Comment thread tests/kubeflow_profile_install.sh
Comment thread tests/kubeflow_profile_install.sh
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@google-oss-prow google-oss-prow Bot added size/L and removed size/M labels May 27, 2026
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@google-oss-prow google-oss-prow Bot added size/XL and removed size/L labels May 27, 2026
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Comment thread tests/dashboard_install.sh Outdated
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039 danish9039 marked this pull request as ready for review May 30, 2026 14:15
@danish9039

Member Author

@juliusvonkohout

@juliusvonkohout juliusvonkohout requested review from Copilot and removed request for Copilot June 1, 2026 14:11

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Comment thread tests/katib_install.sh Outdated
Comment thread tests/katib_install.sh
Comment thread tests/katib_test.sh Outdated
Comment thread tests/notebooks_install.sh Outdated
Comment thread tests/trainer_test.sh Outdated
kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=trainer
kubectl get clustertrainingruntimes torch-distributed

kubectl patch clustertrainingruntime torch-distributed --type=json -p='[

Member

This must also be a proper overlay, and maybe upstreamed.

Member

There might already be a recent PR for this.

Member Author

Moved the torch-distributed runtime changes into applications/trainer/overlays/runtimes-restricted; upstream follow-up is noted separately.

@danish9039 danish9039 Jun 5, 2026

Member Author

Checked kubeflow/trainer#3066; it helps with override support, but this PR still needs the local runtime default overlay.

Member

Please use a proper overlay. This has to survive the next release, and then we can remove it. Please raise a follow-up PR to remove it so that I can merge it after the 26.03.1 release.

Member Author

Moved this into applications/training-operator/overlays/kubeflow-restricted-pss and noted the follow-up removal after 26.03.1.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039 danish9039 requested a review from Copilot June 5, 2026 08:07

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

tests/workspaces_pipeline_run_test.sh:1

  • This JSONPatch adds nested fields under /spec/podTemplate/securityContext/... and /spec/podTemplate/containerSecurityContext/.... JSONPatch add will fail if the parent object (e.g., /spec/podTemplate/securityContext) does not already exist. To make this robust across different base WorkspaceKind manifests, patch the parent objects in one operation (e.g., add/merge the full securityContext / containerSecurityContext objects) or use a strategic/merge patch (--type=merge) targeting the parent keys.

tests/kserve_test.sh:1

  • The polling logic treats any response other than 403 as success (twice). This can still false-positive on 404/503 while routes/backends are not ready, which is exactly the failure mode mentioned in the comment above the loop. Prefer checking for the expected success code (typically 200) or at least explicitly excluding 404/503 (and possibly 000) from counting toward STABLE_POLL_COUNT, to prevent passing the policy wait while the service is still unreachable.
#!/bin/bash
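The polling concern in the second suppressed comment can be illustrated with a stricter loop. This is only a sketch, not the code in tests/kserve_test.sh: the function names, the `probe_cmd` argument, and the choice to require consecutive 200 responses (resetting the counter on anything else) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Only HTTP 200 counts as ready. 403 (policy not yet effective), 404/503
# (route or backend not ready), and 000 (curl could not connect) do not.
is_ready_code() {
  [[ "$1" == "200" ]]
}

# Polls until probe_cmd prints 200 on `required` consecutive attempts.
# probe_cmd is a hypothetical command or function that prints a status code,
# for example a wrapper around: curl -s -o /dev/null -w '%{http_code}' "$URL"
wait_for_stable_ready() {
  local probe_cmd="$1"
  local required="${2:-2}" max_attempts="${3:-30}" delay_seconds="${4:-2}"
  local stable=0 attempt code

  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    code="$("$probe_cmd" || true)"
    if is_ready_code "$code"; then
      stable=$((stable + 1))
      if ((stable >= required)); then
        return 0
      fi
    else
      # Any non-200 response resets the streak.
      stable=0
    fi
    sleep "$delay_seconds"
  done
  return 1
}
```

Resetting the counter on every non-200 response means a single lucky response during rollout cannot satisfy the wait on its own.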

Comment thread .github/workflows/workspaces_pipeline_run_test.yaml

Member

Can you upstream this, or make sure to patch only the securityContext and not, for example, the image?
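One way to keep such a patch limited to security-context fields is a JSON merge patch, which also creates missing parent objects (the JSONPatch concern raised in the suppressed Copilot comment). The sketch below is hedged: the field paths `spec.podTemplate.securityContext` and `spec.podTemplate.containerSecurityContext` are taken from that comment, the values are the standard restricted-PSS settings, and both the helper name and the resource name `example-workspacekind` are hypothetical. Whether the WorkspaceKind schema accepts exactly these keys is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Prints a JSON merge patch that sets only security-context fields.
# A merge patch leaves sibling keys (for example the image) untouched.
restricted_security_merge_patch() {
  cat <<'EOF'
{
  "spec": {
    "podTemplate": {
      "securityContext": {
        "runAsNonRoot": true,
        "seccompProfile": {"type": "RuntimeDefault"}
      },
      "containerSecurityContext": {
        "allowPrivilegeEscalation": false,
        "capabilities": {"drop": ["ALL"]}
      }
    }
  }
}
EOF
}

# Example invocation; "example-workspacekind" is a hypothetical resource name:
#   kubectl patch workspacekind example-workspacekind --type=merge \
#     --patch "$(restricted_security_merge_patch)"
```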

serviceAccountName: default-editor
containers:
- name: ray-worker
  image: rayproject/ray:2.44.1-py311-cpu

@juliusvonkohout juliusvonkohout Jun 5, 2026

Member

Could you please update Ray and KubeRay to the latest available versions?

Member

So not just this file, but also the KubeRay operator.

Comment thread tests/katib_install.sh

cd applications/katib/upstream && kustomize build installs/katib-with-kubeflow | kubectl apply -f - && cd ../../../

kustomize build applications/katib/overlays/security | kubectl apply -f -

Member

Is it possible to have two steps for this? These scripts are also used for the single-step install and would reach customers.

@juliusvonkohout juliusvonkohout Jun 5, 2026

Member

Maybe you have to use kustomize components instead of overlays to have a cleaner separation. If components do not work, you can also stay with the overlay.

Member

You could put an if around it, as in our kind installation, so that the extra security configuration is installed on top only when you are in GHA. That goes for all installation scripts.
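The suggested guard could be sketched as follows. This is an illustration rather than the repository's kind script: the function names are hypothetical, and the check relies on the `GITHUB_ACTIONS` variable, which GitHub Actions sets to `true` in its runners. The Katib paths in the usage comment are the ones shown in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# True only inside a GitHub Actions run.
running_in_gha() {
  [[ "${GITHUB_ACTIONS:-false}" == "true" ]]
}

# Applies the base install for everyone and layers the restricted-PSS
# overlay on top only in CI, so single-step installs stay unchanged.
install_with_optional_overlay() {
  local base_dir="$1" ci_overlay_dir="$2"

  kustomize build "$base_dir" | kubectl apply -f -

  if running_in_gha; then
    kustomize build "$ci_overlay_dir" | kubectl apply -f -
  fi
}

# Example invocation with the Katib paths from this thread:
#   install_with_optional_overlay \
#     applications/katib/upstream/installs/katib-with-kubeflow \
#     applications/katib/overlays/security
```

Keeping the condition in one helper would let every install script share the same definition of "running in CI".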

Comment thread tests/kserve_test.sh
export KSERVE_M2M_TOKEN="$(kubectl -n ${NAMESPACE} create token default-editor)"
export KSERVE_TEST_NAMESPACE=${NAMESPACE}

kubectl patch clusterstoragecontainer default --type=merge --patch '{

Member

Can you add this directly to the KServe installation as a patch? Maybe even upstream it?

Comment on lines +9 to +12
kubectl label namespace "$KF_PROFILE" \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest \
  --overwrite

Member

Only when you are in the GHA environment, as in our kind script.

kubectl -n kubeflow rollout status deployment/tensorboard-controller-deployment --timeout=60s
kubectl -n kubeflow rollout status deployment/tensorboards-web-app-deployment --timeout=60s
kubectl -n kubeflow rollout status deployment/volumes-web-app-deployment --timeout=60s
kubectl -n kubeflow rollout status deployment/jupyter-web-app-deployment --timeout=180s

Member

Why do you increase the timeouts so much?

Comment thread tests/trainer_install.sh
kubectl wait --for=condition=Available deployment/jobset-controller-manager -n kubeflow-system --timeout=120s

kustomize build upstream/overlays/runtimes | kubectl apply --server-side --force-conflicts -f -
kustomize build overlays/runtimes-restricted | kubectl apply --server-side --force-conflicts -f -

Member

The same comments as above: keep it as before for regular users, but add the extra security if you are in GHA.


cd applications/training-operator/upstream
kustomize build overlays/kubeflow | kubectl apply --server-side --force-conflicts -f -
kustomize build applications/training-operator/overlays/kubeflow-restricted-pss | kubectl apply --server-side --force-conflicts -f -

Member

Same here: only when you are in the GHA environment.

kubectl apply -f /tmp/pytorch-job.yaml

kubectl wait --for=jsonpath='{.status.conditions[0].type}=Created' pytorchjob.kubeflow.org/pytorch-simple -n $KF_PROFILE --timeout=60s
kubectl wait --for=condition=Created pytorchjob.kubeflow.org/pytorch-simple -n "$KF_PROFILE" --timeout=180s

Member

Why do you increase the timeouts?

Member

If really needed, this has to be upstreamed. CC @christian-heusel

Comment thread tests/trainer_test.sh
Comment on lines 19 to +20

kubectl get clustertrainingruntime torch-distributed -o yaml

Member

What is the purpose of this line?

Comment thread tests/katib_test.sh

Member

Why do you change the timeouts?

4 participants