
fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id #194

Open

dushulin wants to merge 1 commit into TencentCloud:master from dushulin:fix/template-artifact-build-race


Conversation

@dushulin
Contributor

Motivation

Quick-succession cubemastercli tpl create-from-image calls against the same OCI image spec race inside ensureRootfsArtifact: both goroutines resolve to the same artifact_id, share the same workDir/ext4Path, and one's defer cleanupIntermediateArtifacts wipes the ext4 file out from under the other. The surviving caller reaches distributeRootfsArtifact with a partial record (ext4_size_bytes=0, empty download_token, empty master_node_ip); buildDownloadURL falls back
to os.Hostname(), and cubelet rejects the pull with invalid size:0 — the template is marked FAILED with no clear signal of the underlying race.

Changes

  • artifactBuildLocks sync.Map (keyed by artifact_id) serializes the full find-or-build flow in ensureRootfsArtifact; see the sketch after this list. Granularity is artifact_id only — concurrent builds for different image specs still run in parallel.
  • Defense-in-depth guard at the top of distributeRootfsArtifact: refuse to push a CreateImage when the artifact record is obviously incomplete, returning a diagnostic error instead of relying on cubelet's opaque invalid size:0.
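
In outline, the locking works as follows. This is a minimal sketch, with the three lines factored into a hypothetical lockArtifactBuild helper purely for illustration (the PR inlines them at the top of ensureRootfsArtifact):

// artifactBuildLocks maps artifact_id -> *sync.Mutex. LoadOrStore
// guarantees that every concurrent caller for the same artifact_id
// shares one mutex; different artifact_ids get different mutexes.
var artifactBuildLocks sync.Map

// lockArtifactBuild serializes the find-or-build flow for one
// artifact_id and returns the matching unlock function.
func lockArtifactBuild(artifactID string) (unlock func()) {
    muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
    mu := muV.(*sync.Mutex)
    mu.Lock()
    return mu.Unlock
}

A caller would then wrap the whole flow with defer lockArtifactBuild(artifactID)(), so the second submitter blocks until the first build completes and then finds the ready artifact.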

Verification

  • go build ./CubeMaster/pkg/templatecenter/... clean
  • go vet ./CubeMaster/pkg/templatecenter/... clean
  • Reproduced on an internal deployment before the fix; after the fix, concurrent submits serialize as expected and no partial artifact reaches cubelet.

fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id

Symptom:
  Submitting `cubemastercli tpl create-from-image` twice in quick
  succession against the same OCI image (or the same writable_layer_size
  / instance_type combination) causes cubelets to receive CreateImage
  requests with `ext4_size_bytes=0`, `token=""`, and a download URL that
  falls back to `os.Hostname()` of the cubemaster host. Cubelet rejects
  the pull with "invalid size:0" and the template is marked FAILED; a
  manual retry has to wait ~90s for the prior build to finish.

Root cause:
  - `ensureRootfsArtifact` has no lock at the artifact_id level.
    artifact_id is derived deterministically from the image + spec
    fingerprint, so two concurrent submits for the same image spec both
    resolve to the same artifact_id.
  - Both goroutines fall through to `buildRootfsArtifact`, which shares
    the same workDir / storeDir / ext4Path. One goroutine's
    `defer cleanupIntermediateArtifacts` then wipes the ext4 file out
    from under the other goroutine.
  - The "winning" caller reaches `distributeRootfsArtifact` with a
    partial record (size=0, token="", master_node_ip=""). Because
    `buildDownloadURL` falls back to `os.Hostname()` when master_node_ip
    is empty, the cubelet receives a URL that isn't reachable from
    outside the cubemaster host.

Fix:
  - Add `var artifactBuildLocks sync.Map` keyed by artifact_id.
    `ensureRootfsArtifact` does LoadOrStore + Lock + defer Unlock so the
    full find-or-build flow is serialized per artifact. The lock
    granularity is artifact_id only; builds for different images still
    run in parallel.
  - Add a defense-in-depth guard at the top of
    `distributeRootfsArtifact`: if the artifact record is incomplete
    (nil / Status != Ready / Ext4SizeBytes == 0 / DownloadToken empty /
    MasterNodeIP empty), fail fast with a diagnostic error instead of
    pushing a bad CreateImage to cubelet. If a future code path lets a
    partial artifact reach distribute, operators will see a clear
    message instead of cubelet's opaque "invalid size:0".

Verification:
  - go build ./CubeMaster/pkg/templatecenter/... clean
  - go vet ./CubeMaster/pkg/templatecenter/... clean

Signed-off-by: dushulin <shulindoo@gmail.com>
@fslongjin
Member

/cubebot review

@github-actions

github-actions Bot commented May 12, 2026

Claude finished @fslongjin's task in 2m 23s


PR Review: fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id


Overall assessment: This is a well-designed fix for a real race condition. The root cause analysis is thorough, the locking strategy is sound, and the defense-in-depth guard is a good safety net. Below are a few observations.

What works well

  • Per-artifactID locking via sync.Map is idiomatic Go and correctly serializes only competing builds for the same image spec. Different images still build in parallel.
  • The LoadOrStore pattern at template_image.go:1244 correctly handles concurrent initialization of the mutex.
  • The defense-in-depth validation in distributeRootfsArtifact (lines 1702-1711) provides a clear diagnostic error that would be much easier to debug than cubelet's opaque invalid size:0.
  • The lock scope is correct — it covers the entire find-or-build flow in ensureRootfsArtifact, so the second caller will find the ready artifact after the first releases the lock.

Issues and suggestions

1. artifactBuildLocks entries are never cleaned up (minor)

artifactBuildLocks grows monotonically — once an artifactID key is stored, it's never deleted. For a long-running cubemaster, this accumulates orphaned *sync.Mutex entries. The number is bounded by distinct image specs, which is likely manageable, but it's worth noting. A simple fix would be to Delete the entry after the lock is released (the second caller would re-create it if needed):

muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
defer func() {
    mu.Unlock()
    artifactBuildLocks.Delete(artifactID)
}()

This is safe because: if another goroutine is already waiting on mu.Lock(), it holds a reference to the same mutex and will proceed. If a new goroutine arrives after the Delete, LoadOrStore creates a fresh mutex — which is fine since no one else holds it.


2. Missing Ext4SHA256 in the defense-in-depth check (minor)

The validation at line 1705 checks Ext4SizeBytes, DownloadToken, and MasterNodeIP, but not Ext4SHA256. If size > 0 but SHA256 is empty, cubelet would receive a corrupt image spec. Consider adding:

artifact.Status != ArtifactStatusReady ||
    artifact.Ext4SizeBytes == 0 ||
    artifact.Ext4SHA256 == "" ||
    strings.TrimSpace(artifact.DownloadToken) == "" ||
    strings.TrimSpace(artifact.MasterNodeIP) == ""
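
Wrapped into a full guard, that might read as follows. This is a sketch only: the nil check, error text, and surrounding return are illustrative, while ArtifactStatusReady and the field names follow the names used in this thread (imports assumed: fmt, strings):

if artifact == nil ||
    artifact.Status != ArtifactStatusReady ||
    artifact.Ext4SizeBytes == 0 ||
    artifact.Ext4SHA256 == "" ||
    strings.TrimSpace(artifact.DownloadToken) == "" ||
    strings.TrimSpace(artifact.MasterNodeIP) == "" {
    // Fail fast with a diagnostic instead of pushing a bad CreateImage
    // that cubelet would reject with the opaque "invalid size:0".
    return fmt.Errorf("refusing to distribute incomplete rootfs artifact %q: %+v", artifactID, artifact)
}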


3. No test for concurrent build serialization (suggestion)

The PR description mentions go build and go vet pass, but there's no unit test verifying that concurrent calls to ensureRootfsArtifact for the same artifactID are properly serialized. Given the subtlety of the race, a targeted test would prevent regressions:

func TestEnsureRootfsArtifact_ConcurrentSerialization(t *testing.T) {
    // Launch N goroutines calling ensureRootfsArtifact for the same artifactID
    // Verify only one build actually occurs and all callers succeed
}

This is a suggestion rather than a blocking issue — I understand the build infrastructure and test setup may make this non-trivial.
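
For what it's worth, the serialization property of the lock itself can be tested without the build infrastructure. A self-contained sketch, assuming test code can reach artifactBuildLocks (the test name, goroutine count, and sleep are illustrative; imports: sync, sync/atomic, testing, time):

func TestArtifactBuildLock_Serializes(t *testing.T) {
    var inBuild int32
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            muV, _ := artifactBuildLocks.LoadOrStore("artifact-x", &sync.Mutex{})
            mu := muV.(*sync.Mutex)
            mu.Lock()
            defer mu.Unlock()
            // Exactly one goroutine may be inside the critical section.
            if n := atomic.AddInt32(&inBuild, 1); n != 1 {
                t.Errorf("expected 1 builder in critical section, got %d", n)
            }
            time.Sleep(time.Millisecond) // stand-in for the real build work
            atomic.AddInt32(&inBuild, -1)
        }()
    }
    wg.Wait()
}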

4. Pre-existing concern: redo cleanup not under lock (observation, not a PR issue)

In runRedoTemplateImageJob (lines 1057-1066), the cleanup of the previous artifact happens before calling ensureRootfsArtifact, so it's not protected by the per-artifactID lock. If two redo jobs for the same artifactID run concurrently, both could attempt cleanup. This appears safe today since the cleanup operations are idempotent, but it's worth being aware of.

Summary

The fix correctly addresses the race condition. The two minor suggestions (cleaning up artifactBuildLocks entries and adding Ext4SHA256 to the validation) would make it more robust. Consider adding a concurrency test if feasible.


// full failure mode.
muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
Collaborator


The second request will block here. Can we return early instead?

Contributor Author


Requests that don't currently hold the lock will be stuck in the first state of the state machine (BUILDING_EXT4). Once the job holding the lock completes, the other tasks instantly advance to the next state. If I understand you correctly, you're suggesting that when the next job detects this situation, it should directly throw an exception at the API level to the CLI and end the job?

Contributor Author


I think the advantage of the current approach is that it isn't intrusive to the state machine or the CLI, but it can pile up too many goroutines in extreme cases.
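
For reference, the early-return variant under discussion could be built on (*sync.Mutex).TryLock from Go 1.18. This is not what the PR implements, and the return shape here is hypothetical:

muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
if !mu.TryLock() {
    // A build for this artifact_id is already in flight: surface the
    // conflict to the API/CLI instead of parking the goroutine.
    return nil, fmt.Errorf("rootfs build for artifact %s already in progress, retry later", artifactID)
}
defer mu.Unlock()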
