Skip to content

google_kubernetes_engine: add management plane operations#6747

Open
ashishsuneja wants to merge 9 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_gke
Open

google_kubernetes_engine: add management plane operations#6747
ashishsuneja wants to merge 9 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_gke

Conversation

@ashishsuneja

Copy link
Copy Markdown
Collaborator

Summary
Adds GKE-specific implementations of management plane async methods.

Main changes

  • _IssueAsync, CreateNodePoolAsync, DeleteNodePoolAsync,
    UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
    ResolveNodePoolVersions, _GetLatestOperationName, GetNodePoolNames

How tested

  • Existing GKE test suite passing
  • pyink + lint-diffs clean

Ashish Suneja added 3 commits June 9, 2026 12:47
- _IssueAsync: issues gcloud --async commands, returns op name
- CreateNodePoolAsync: creates nodepool, returns op handle
- DeleteNodePoolAsync: deletes nodepool, returns op handle
- UpgradeNodePoolAsync: upgrades nodepool, returns op handle
- UpdateClusterAsync: toggles label for non-destructive update
- WaitForOperation: polls until DONE/ABORTING
- ResolveNodePoolVersions: auto-detects initial/target versions
- _GetLatestOperationName: fallback op lookup by type+target
- GetNodePoolNames: lists current node pools

Tested:
- Existing GKE test suite passing
- pyink + lint-diffs clean
Operation name string, or empty string if none found.
"""
link_target = target_name or self.name
if op_start_time:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good logic; can we not just do it every time / require op_start_time? Additional conditions when not needed are just additional complexity.

@ashishsuneja ashishsuneja Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — collapsed it. _GetLatestOperationName now always uses the broadened filter (RUNNING/PENDING/DONE with a startTime>= guard) and always takes op_start_time. The guard already prevents matching stale completed ops, so the two-branch conditional was redundant.


def _IssueAsync(self, cmd: util.GcloudCommand) -> str:
"""Issues a gcloud command with --async, returns the operation name."""
cmd.args.append('--async')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was wondering what made these operations async. got it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged

f'targetLink ~ {link_target}'
f'{time_filter}'
)
for attempt in range(1, max_attempts + 1):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are some examples where you try to call the operation but it doesn't get seen immediately? does it just take a minute for results to show up?

A unit test with semi-real output examples would be helpful.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unit tests (GoogleKubernetesEngineAsyncOpsTestCase) with captured-style gcloud output: create returns the op name directly (no fallback), upgrade and update fall back and recover from the operations list, and the no-fallback-configured case raises. On the timing: the operation can take a moment to leave PENDING, and fast ops (the label-update) may already be DONE by the time we query — both handled by the broadened filter. In practice on the 100-pool run it resolved on the first query every time; the 5×3s retry is a safety margin for the PENDING-transition window.

raise errors.Resource.CreationError(
f'GKE async command returned no operation name; stderr={stderr}'
)
return op_name

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below you have _GetLatestOperationName command.. does this command here not output the operation name? Shouldn't it just be:

  • start operation, name is returned
  • wait for operation to finish. now done.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For create and delete, that's exactly the flow — _IssueAsync issues --async, gcloud prints the operation name, we wait on it. The exception is clusters upgrade --node-pool and clusters update: those reliably return success with empty stdout (gcloud just doesn't print the name for those two subcommands). Verified on a 100-pool GKE run — 99/99 upgrades and the cluster-update all came back with no operation name, so _GetLatestOperationName recovers it from the operations list. I didn't find a gcloud flag that makes upgrade/update print it; if one exists I'd happily drop the fallback.

# Fallback: gcloud succeeded but printed nothing. Query the operations
# list scoped to this specific nodepool to find the operation name.
logging.info(
'UpgradeNodePoolAsync: ops list fallback for %s: %s',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why/when/how frequently does this happen?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time, not intermittently — clusters upgrade --node-pool --async returned no operation name on 99/99 upgrades in the 100-pool run. So the fallback runs on every upgrade/update.

try:
return self._IssueAsync(cmd)
except errors.Resource.CreationError as e:
if 'returned no operation name' not in str(e):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is shared with UpgradeNodePoolAsync above. Refactor to only use in one location. Perhaps an optional parameter in _IssueAsync which can handle this "no operation name" case. If ofc this is necessary at all.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, exactly as suggested — moved the fallback into _IssueAsync behind optional fallback_op_type / fallback_target params, so the "no op name → query operations list" path lives in one place. UpgradeNodePoolAsync and UpdateClusterAsync are now one-liners

@ashishsuneja ashishsuneja force-pushed the mgmt_plane_gke branch 2 times, most recently from 2cd4d5b to d3eec15 Compare June 17, 2026 19:21
@hubatish

Copy link
Copy Markdown
Collaborator

/gcbrun

f'targetLink ~ {link_target} AND '
f'startTime>="{from_time}"'
)
for attempt in range(1, max_attempts + 1):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you looked at vm_util.Retry? Used with @vm_util.Retry(poll_interval,max_retries,timeout). I suspect you're rewriting code rather than utilizing PKB standards/logic.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — refactored to use vm_util.Retry instead of the manual for-loop/sleep. Since max_attempts/retry_delay are call-time parameters here (not fixed), I apply the decorator to a locally-defined inner function inside the method rather than statically on it — poll_interval/max_retries still fully respect the caller's values via closure. Scoped retryable_exceptions to errors.Resource.GetError so only the "not found yet" case retries, not unrelated failures.

fallback_op_type,
fallback_target,
)
return self._GetLatestOperationName(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is only for operations that don't return an op name? How common is that? Is it in the output but just harder to parse, or completely absent?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only UPGRADE_NODES and UPDATE_CLUSTER hit this path — create/delete always print the operation name directly (the fallback is disabled for those via fallback_op_type=None). For upgrade/update it's not intermittent — verified 99/99 upgrades plus the cluster update returned empty stdout in the 100-pool run. It's completely absent from gcloud's output for those two subcommands, not a parsing issue

Returns:
The operation name.
"""
cmd.args.append('--async')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method overall looks useful enough & broad enough we should consider putting it in providers/gcp/utils.py's GcloudCommand. The --async feature is supported across many gcloud commands & should be broadly applicable right? Even if the fallback "get lastest" behavior would vary - it could take like an optional "get_latest" lambda as an argument. (and given that said fallback is necessary).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added GcloudCommand.IssueAsync(get_latest_op_fn=...) to util.py. GKE's _GetLatestOperationName is now passed in as the callable, so the GKE-specific filter logic stays in google_kubernetes_engine.py while the async-issue/fallback shape is reusable by any GCP resource. Purely additive — GcloudCommand.Issue() itself is unchanged, so none of the existing 194 call sites across GCP providers are affected. Verified with the full tests/providers/gcp/ suite.

def UpgradeNodePoolAsync(self, name: str, target_version: str) -> str:
"""Initiates node pool upgrade; returns op handle. Does NOT wait.

`clusters upgrade --node-pool --async` returns success with empty stdout

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it ok I see where we need this now. Are we running multiple of these in parallel? That's where "get latest" would break.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — concurrent_node_pool_ops runs many of these in parallel (concurrent upgrades across node pools). It's safe: the filter includes targetLink ~ {link_target}, and each concurrent call passes its own specific fallback_target (the individual node-pool name) — so concurrent calls each resolve only operations matching their own target, not just "whatever's most recent." Combined with the op_start_time guard, this disambiguates correctly even when many ops start within the same few seconds. I'll add an explicit concurrent-calls test case to make this guarantee visible in the test suite.

cmd.flags['update-labels'] = f'k8s-mgmt-ts={int(time.time())}'
# GcloudCommand sets --quiet by default; the label update is
# non-interactive so it's safe to drop (matches how this was validated).
cmd.flags.pop('quiet', None)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this pop logic is very GcloudCommand specific. Seconding the "add async as option in utils.py" suggestion as a refactor. Might move some unit tests over as well.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added GcloudCommandIssueAsyncTestCase to util_test.py, covering direct op-name return, fallback invocation, raise-when-no-fallback, raise-on-command-failure, and the --async/format flag setup.

Ashish Suneja added 3 commits June 29, 2026 12:20
…sync

Moves the --async issue + 'no op name printed, fall back to listing
operations' pattern out of GKE-specific _IssueAsync into a reusable
GcloudCommand.IssueAsync method, with the resource-specific lookup
(_GetLatestOperationName for GKE) passed in as a callable. Per
hubatish's review on PR GoogleCloudPlatform#6747.
@hubatish hubatish self-assigned this Jun 29, 2026
@hubatish

Copy link
Copy Markdown
Collaborator

/gcbrun

@hubatish hubatish left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty reasonable! Left a few more comments on the move to util.py but generally lgtm. Will start trying to approve/submit after last round of comments & the base is submitted.

"""
self.args.append('--async')
self.flags['format'] = 'value(name)'
stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if I asked this already, but why raise_on_failure=False ? That's a big problem for general use - maybe it should be an argument passed from google_kubernetes_engine. Especially given the "if retcode" below it just looks like you're changing the exception type returned.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - removed the hardcoded override. IssueAsync now just calls self.Issue(**kwargs), so it inherits Issue()'s own default (raise_on_failure=True) unless a caller explicitly opts out via **kwargs. The manual if retcode: raise is gone too — it was just reimplementing what the default already does.

self.flags['format'] = 'value(name)'
stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs)
if retcode:
raise errors.Resource.CreationError(stderr)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Said exception type also doesn't really make sense in a more general location - this IssueAsync command doesn't have to be called in a _Create context.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — removed errors.Resource.CreationError from this path entirely. Since raise_on_failure now defaults to True, a failing command raises via Issue()/IssueCommand's own exception (errors.VmUtil.IssueCommandError) before we'd ever reach a retcode check here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — removed errors.Resource.CreationError from this path entirely. Since raise_on_failure now defaults to True, a failing command raises via Issue()/IssueCommand's own exception (errors.VmUtil.IssueCommandError) before we'd ever reach a retcode check here.

if op_name:
return op_name
if get_latest_op_fn is None:
raise errors.Resource.CreationError(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise probably errors.VmUtil.IssueCommandError instead.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — switched the "no op name and no fallback" raise to errors.VmUtil.IssueCommandError, matching what the command-failure path now raises by default too.

…ueCommandError

Per hubatish's review:
- Removed hardcoded raise_on_failure=False -- IssueAsync now inherits
  Issue()'s own default (True), so callers control this via **kwargs
  if needed, and the manual retcode check (which just reimplemented
  the default) is gone.
- Switched the remaining 'no op name, no fallback' exception from
  errors.Resource.CreationError to errors.VmUtil.IssueCommandError,
  since IssueAsync isn't always called in a _Create context.

Updated 3 tests to match: 2 needed the new exception type, and
testIssueAsyncRaisesOnCommandFailure now mocks Issue() raising
(side_effect) rather than returning a failure tuple, matching real
behavior now that retcode isn't checked manually.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants