
fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id #194

Open

dushulin wants to merge 1 commit into TencentCloud:master from dushulin:fix/template-artifact-build-race


Conversation

@dushulin
Contributor

Motivation

Quick-succession cubemastercli tpl create-from-image calls against the same OCI image spec race inside ensureRootfsArtifact: both goroutines resolve to the same artifact_id, share the same workDir/ext4Path, and one's defer cleanupIntermediateArtifacts wipes the ext4 file out from under the other. The surviving caller reaches distributeRootfsArtifact with a partial record (ext4_size_bytes=0, empty download_token, empty master_node_ip); buildDownloadURL falls back
to os.Hostname(), and cubelet rejects the pull with invalid size:0 — the template is marked FAILED with no clear signal of the underlying race.

Changes

  • artifactBuildLocks sync.Map (keyed by artifact_id) serializes the full find-or-build flow in ensureRootfsArtifact; see the sketch after this list. Granularity is artifact_id only — concurrent builds for different image specs still run in parallel.
  • Defense-in-depth guard at the top of distributeRootfsArtifact: refuse to push a CreateImage when the artifact record is obviously incomplete, returning a diagnostic error instead of relying on cubelet's opaque invalid size:0.
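
In outline, the locking works as follows. This is a minimal sketch, with the three lines factored into a hypothetical lockArtifactBuild helper purely for illustration (the PR inlines them at the top of ensureRootfsArtifact):

// artifactBuildLocks maps artifact_id -> *sync.Mutex. LoadOrStore
// guarantees that every concurrent caller for the same artifact_id
// shares one mutex; different artifact_ids get different mutexes.
var artifactBuildLocks sync.Map

// lockArtifactBuild serializes the find-or-build flow for one
// artifact_id and returns the matching unlock function.
func lockArtifactBuild(artifactID string) (unlock func()) {
    muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
    mu := muV.(*sync.Mutex)
    mu.Lock()
    return mu.Unlock
}

A caller would then wrap the whole flow with defer lockArtifactBuild(artifactID)(), so the second submitter blocks until the first build completes and then finds the ready artifact.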

Verification

  • go build ./CubeMaster/pkg/templatecenter/... clean
  • go vet ./CubeMaster/pkg/templatecenter/... clean
  • Reproduced on an internal deployment before the fix; after the fix, concurrent submits serialize as expected and no partial artifact reaches cubelet.

fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id

Symptom:
  Submitting `cubemastercli tpl create-from-image` twice in quick
  succession against the same OCI image (or the same writable_layer_size
  / instance_type combination) causes cubelets to receive CreateImage
  requests with `ext4_size_bytes=0`, `token=""`, and a download URL that
  falls back to `os.Hostname()` of the cubemaster host. Cubelet rejects
  the pull with "invalid size:0" and the template is marked FAILED; a
  manual retry has to wait ~90s for the prior build to finish.

Root cause:
  - `ensureRootfsArtifact` has no lock at the artifact_id level.
    artifact_id is derived deterministically from the image + spec
    fingerprint, so two concurrent submits for the same image spec both
    resolve to the same artifact_id.
  - Both goroutines fall through to `buildRootfsArtifact`, which shares
    the same workDir / storeDir / ext4Path. One goroutine's
    `defer cleanupIntermediateArtifacts` then wipes the ext4 file out
    from under the other goroutine.
  - The "winning" caller reaches `distributeRootfsArtifact` with a
    partial record (size=0, token="", master_node_ip=""). Because
    `buildDownloadURL` falls back to `os.Hostname()` when master_node_ip
    is empty, the cubelet receives a URL that isn't reachable from
    outside the cubemaster host.

Fix:
  - Add `var artifactBuildLocks sync.Map` keyed by artifact_id.
    `ensureRootfsArtifact` does LoadOrStore + Lock + defer Unlock so the
    full find-or-build flow is serialized per artifact. The lock
    granularity is artifact_id only; builds for different images still
    run in parallel.
  - Add a defense-in-depth guard at the top of
    `distributeRootfsArtifact`: if the artifact record is incomplete
    (nil / Status != Ready / Ext4SizeBytes == 0 / DownloadToken empty /
    MasterNodeIP empty), fail fast with a diagnostic error instead of
    pushing a bad CreateImage to cubelet. If a future code path lets a
    partial artifact reach distribute, operators will see a clear
    message instead of cubelet's opaque "invalid size:0".

Verification:
  - go build ./CubeMaster/pkg/templatecenter/... clean
  - go vet ./CubeMaster/pkg/templatecenter/... clean

Signed-off-by: dushulin <shulindoo@gmail.com>
@fslongjin
Member

/cubebot review

@github-actions

github-actions Bot commented May 12, 2026

Claude finished @fslongjin's task in 2m 23s


PR Review: fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id


Overall assessment: This is a well-designed fix for a real race condition. The root cause analysis is thorough, the locking strategy is sound, and the defense-in-depth guard is a good safety net. Below are a few observations.

What works well

  • Per-artifactID locking via sync.Map is idiomatic Go and correctly serializes only competing builds for the same image spec. Different images still build in parallel.
  • The LoadOrStore pattern at template_image.go:1244 correctly handles concurrent initialization of the mutex.
  • The defense-in-depth validation in distributeRootfsArtifact (lines 1702-1711) provides a clear diagnostic error that would be much easier to debug than cubelet's opaque invalid size:0.
  • The lock scope is correct — it covers the entire find-or-build flow in ensureRootfsArtifact, so the second caller will find the ready artifact after the first releases the lock.

Issues and suggestions

1. artifactBuildLocks entries are never cleaned up (minor)

artifactBuildLocks grows monotonically — once an artifactID key is stored, it's never deleted. For a long-running cubemaster, this accumulates orphaned *sync.Mutex entries. The number is bounded by distinct image specs, which is likely manageable, but it's worth noting. A simple fix would be to Delete the entry after the lock is released (the second caller would re-create it if needed):

muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
defer func() {
    mu.Unlock()
    artifactBuildLocks.Delete(artifactID)
}()

This is safe because: if another goroutine is already waiting on mu.Lock(), it holds a reference to the same mutex and will proceed. If a new goroutine arrives after the Delete, LoadOrStore creates a fresh mutex — which is fine since no one else holds it.


2. Missing Ext4SHA256 in the defense-in-depth check (minor)

The validation at line 1705 checks Ext4SizeBytes, DownloadToken, and MasterNodeIP, but not Ext4SHA256. If size > 0 but SHA256 is empty, cubelet would receive a corrupt image spec. Consider adding:

artifact.Status != ArtifactStatusReady ||
    artifact.Ext4SizeBytes == 0 ||
    artifact.Ext4SHA256 == "" ||
    strings.TrimSpace(artifact.DownloadToken) == "" ||
    strings.TrimSpace(artifact.MasterNodeIP) == ""
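
Wrapped into a full guard, that might read as follows. This is a sketch only: the nil check, error text, and surrounding return are illustrative, while ArtifactStatusReady and the field names follow the names used in this thread (imports assumed: fmt, strings):

if artifact == nil ||
    artifact.Status != ArtifactStatusReady ||
    artifact.Ext4SizeBytes == 0 ||
    artifact.Ext4SHA256 == "" ||
    strings.TrimSpace(artifact.DownloadToken) == "" ||
    strings.TrimSpace(artifact.MasterNodeIP) == "" {
    // Fail fast with a diagnostic instead of pushing a bad CreateImage
    // that cubelet would reject with the opaque "invalid size:0".
    return fmt.Errorf("refusing to distribute incomplete rootfs artifact %q: %+v", artifactID, artifact)
}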


3. No test for concurrent build serialization (suggestion)

The PR description mentions go build and go vet pass, but there's no unit test verifying that concurrent calls to ensureRootfsArtifact for the same artifactID are properly serialized. Given the subtlety of the race, a targeted test would prevent regressions:

func TestEnsureRootfsArtifact_ConcurrentSerialization(t *testing.T) {
    // Launch N goroutines calling ensureRootfsArtifact for the same artifactID
    // Verify only one build actually occurs and all callers succeed
}

This is a suggestion rather than a blocking issue — I understand the build infrastructure and test setup may make this non-trivial.
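
For what it's worth, the serialization property of the lock itself can be tested without the build infrastructure. A self-contained sketch, assuming test code can reach artifactBuildLocks (the test name, goroutine count, and sleep are illustrative; imports: sync, sync/atomic, testing, time):

func TestArtifactBuildLock_Serializes(t *testing.T) {
    var inBuild int32
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            muV, _ := artifactBuildLocks.LoadOrStore("artifact-x", &sync.Mutex{})
            mu := muV.(*sync.Mutex)
            mu.Lock()
            defer mu.Unlock()
            // Exactly one goroutine may be inside the critical section.
            if n := atomic.AddInt32(&inBuild, 1); n != 1 {
                t.Errorf("expected 1 builder in critical section, got %d", n)
            }
            time.Sleep(time.Millisecond) // stand-in for the real build work
            atomic.AddInt32(&inBuild, -1)
        }()
    }
    wg.Wait()
}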

4. Pre-existing concern: redo cleanup not under lock (observation, not a PR issue)

In runRedoTemplateImageJob (lines 1057-1066), the cleanup of the previous artifact happens before calling ensureRootfsArtifact, so it's not protected by the per-artifactID lock. If two redo jobs for the same artifactID run concurrently, both could attempt cleanup. This appears safe today since the cleanup operations are idempotent, but it's worth being aware of.

Summary

The fix correctly addresses the race condition. The two minor suggestions (cleaning up artifactBuildLocks entries and adding Ext4SHA256 to the validation) would make it more robust. Consider adding a concurrency test if feasible.


// full failure mode.
muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
Collaborator


The second request will block here. Can we return early instead?

Contributor Author


Requests that don't currently hold the lock will be stuck in the first state of the state machine (BUILDING_EXT4). Once the job holding the lock completes, the other tasks instantly advance to the next state. If I understand you correctly, you're suggesting that when the next job detects this situation, it should directly throw an exception at the API level to the CLI and end the job?

Contributor Author


I think the advantage of the current approach is that it isn't intrusive to the state machine or the CLI, but it can pile up too many goroutines in extreme cases.
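
For reference, the early-return variant under discussion could be built on (*sync.Mutex).TryLock from Go 1.18. This is not what the PR implements, and the return shape here is hypothetical:

muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
if !mu.TryLock() {
    // A build for this artifact_id is already in flight: surface the
    // conflict to the API/CLI instead of parking the goroutine.
    return nil, fmt.Errorf("rootfs build for artifact %s already in progress, retry later", artifactID)
}
defer mu.Unlock()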
