fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id #194
Symptom:
Submitting `cubemastercli tpl create-from-image` twice in quick
succession against the same OCI image (or the same writable_layer_size
/ instance_type combination) causes cubelets to receive CreateImage
requests with `ext4_size_bytes=0`, `token=""`, and a download URL that
falls back to `os.Hostname()` of the cubemaster host. Cubelet rejects
the pull with "invalid size:0" and the template is marked FAILED; a
manual retry has to wait ~90s for the prior build to finish.
Root cause:
- `ensureRootfsArtifact` has no lock at the artifact_id level.
artifact_id is derived deterministically from the image + spec
fingerprint, so two concurrent submits for the same image spec both
resolve to the same artifact_id.
- Both goroutines fall through to `buildRootfsArtifact`, which shares
the same workDir / storeDir / ext4Path. One goroutine's
`defer cleanupIntermediateArtifacts` then wipes the ext4 file out
from under the other goroutine.
- The "winning" caller reaches `distributeRootfsArtifact` with a
partial record (size=0, token="", master_node_ip=""). Because
`buildDownloadURL` falls back to `os.Hostname()` when master_node_ip
is empty, the cubelet receives a URL that isn't reachable from
outside the cubemaster host.
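For illustration, the fallback described in the last bullet could take roughly the shape sketched below; the function body, port, and path are hypothetical placeholders, and only the `os.Hostname()` fallback on an empty `master_node_ip` comes from this description.

```go
package templatecenter

import (
	"fmt"
	"os"
)

// Hypothetical sketch of the fallback behaviour; the real buildDownloadURL
// in cubemaster is not shown here. With an empty masterNodeIP the URL host
// degrades to the local hostname, which is typically only resolvable on the
// cubemaster host itself, so cubelets on other machines cannot reach it.
func buildDownloadURL(masterNodeIP, token string) string {
	host := masterNodeIP
	if host == "" {
		host, _ = os.Hostname()
	}
	// Port and path are placeholders for illustration only.
	return fmt.Sprintf("http://%s:8080/artifacts?token=%s", host, token)
}
```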
Fix:
- Add `var artifactBuildLocks sync.Map` keyed by artifact_id.
`ensureRootfsArtifact` does LoadOrStore + Lock + defer Unlock so the
full find-or-build flow is serialized per artifact. The lock
granularity is artifact_id only; builds for different images still
run in parallel.
- Add a defense-in-depth guard at the top of
`distributeRootfsArtifact`: if the artifact record is incomplete
(nil / Status != Ready / Ext4SizeBytes == 0 / DownloadToken empty /
MasterNodeIP empty), fail fast with a diagnostic error instead of
pushing a bad CreateImage to cubelet. If a future code path lets a
partial artifact reach distribute, operators will see a clear
message instead of cubelet's opaque "invalid size:0".
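As a minimal sketch of the two pieces above, assuming simplified types: the field names in the guard follow this description, while the status constant, record shape, and error text are illustrative rather than the actual cubemaster code.

```go
package templatecenter

import (
	"fmt"
	"strings"
	"sync"
)

// One *sync.Mutex per artifact_id; builds for different artifact_ids
// are not serialized against each other.
var artifactBuildLocks sync.Map

// rootfsArtifact mirrors only the fields the guard checks; the real
// record type in cubemaster is richer.
type rootfsArtifact struct {
	Status        string
	Ext4SizeBytes int64
	Ext4SHA256    string
	DownloadToken string
	MasterNodeIP  string
}

const artifactStatusReady = "Ready"

func ensureRootfsArtifact(artifactID string) error {
	muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
	mu := muV.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()

	// Serialized per artifact_id: look up an existing Ready artifact,
	// otherwise build it. A second caller blocks here and then finds
	// the finished artifact instead of rebuilding it.
	// ... find-or-build flow elided ...
	return nil
}

func distributeRootfsArtifact(a *rootfsArtifact) error {
	// Defense-in-depth: never push a CreateImage for an incomplete record.
	if a == nil ||
		a.Status != artifactStatusReady ||
		a.Ext4SizeBytes == 0 ||
		a.Ext4SHA256 == "" ||
		strings.TrimSpace(a.DownloadToken) == "" ||
		strings.TrimSpace(a.MasterNodeIP) == "" {
		return fmt.Errorf("refusing to distribute incomplete rootfs artifact: %+v", a)
	}
	// ... push CreateImage to cubelets ...
	return nil
}
```

Because the second caller only acquires the lock after the first releases it, its find step sees the completed artifact and skips the rebuild.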
Verification:
- go build ./CubeMaster/pkg/templatecenter/... clean
- go vet ./CubeMaster/pkg/templatecenter/... clean
Signed-off-by: dushulin <shulindoo@gmail.com>
/cubebot review
Claude finished @fslongjin's task in 2m 23s

PR Review: fix(CubeMaster/templatecenter): serialize concurrent rootfs builds for the same artifact_id

Overall assessment: This is a well-designed fix for a real race condition. The root cause analysis is thorough, the locking strategy is sound, and the defense-in-depth guard is a good safety net. Below are a few observations.

What works well

[…]

Issues and suggestions

1. Lock entries in `artifactBuildLocks` are never cleaned up (minor suggestion). A pattern like the following would release the entry once the build completes:

```go
muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
defer func() {
	mu.Unlock()
	artifactBuildLocks.Delete(artifactID)
}()
```

This is safe because: if another goroutine is already waiting on […]

2. Missing […]. The validation at line 1705 checks:

```go
artifact.Status != ArtifactStatusReady ||
	artifact.Ext4SizeBytes == 0 ||
	artifact.Ext4SHA256 == "" ||
	strings.TrimSpace(artifact.DownloadToken) == "" ||
	strings.TrimSpace(artifact.MasterNodeIP) == ""
```

[…]

3. No test for concurrent build serialization (suggestion). The PR description mentions […]

```go
func TestEnsureRootfsArtifact_ConcurrentSerialization(t *testing.T) {
	// Launch N goroutines calling ensureRootfsArtifact for the same artifactID
	// Verify only one build actually occurs and all callers succeed
}
```

This is a suggestion rather than a blocking issue; I understand the build infrastructure and test setup may make this non-trivial.

4. Pre-existing concern: redo cleanup not under lock (observation, not a PR issue). In […]

Summary

The fix correctly addresses the race condition. The two minor suggestions (cleaning up […]
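Expanding on suggestion 3, a hedged sketch of such a test might look like the following. It assumes the `artifactBuildLocks` map from the PR and stubs the build step inline with an atomic counter, since calling the real `ensureRootfsArtifact` would need build infrastructure that isn't shown in this thread.

```go
package templatecenter

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestEnsureRootfsArtifact_ConcurrentSerialization(t *testing.T) {
	const callers = 8
	var builds int32 // how many times the stubbed "build" actually ran

	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Take the same per-artifact lock the production path takes.
			muV, _ := artifactBuildLocks.LoadOrStore("artifact-abc", &sync.Mutex{})
			mu := muV.(*sync.Mutex)
			mu.Lock()
			defer mu.Unlock()
			if atomic.LoadInt32(&builds) == 0 {
				// Stand-in for the real build; only the first caller should
				// get here, later callers find the artifact already built.
				atomic.AddInt32(&builds, 1)
				time.Sleep(50 * time.Millisecond) // widen the race window
			}
		}()
	}
	wg.Wait()

	if got := atomic.LoadInt32(&builds); got != 1 {
		t.Fatalf("expected exactly one build, got %d", got)
	}
}
```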
```go
// full failure mode.
muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
mu := muV.(*sync.Mutex)
mu.Lock()
```
The second request will be blocked here. Can we return early here?
Requests that don't hold the lock will sit in the first state of the state machine (BUILDING_EXT4); once the job holding the lock finishes, the waiting tasks immediately move to the next state. If I understand correctly, you're suggesting that when this situation is detected, the later job should throw an exception at the API level back to the CLI and end the job?
I think the advantage of the current approach is that it isn't intrusive to the state machine or the CLI, but it can pile up too many goroutines in extreme cases.
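To make the trade-off being discussed concrete, an early-return variant could look roughly like the sketch below. It is not what the PR implements; it assumes Go 1.18+ for `sync.Mutex.TryLock` and reuses the `artifactBuildLocks` map from the PR, with the error text and function name purely illustrative.

```go
package templatecenter

import (
	"errors"
	"sync"
)

// Hypothetical early-return variant: a caller that cannot take the
// per-artifact lock fails fast instead of blocking, and the error would
// be surfaced through the API back to the CLI.
var errBuildInProgress = errors.New("rootfs build already in progress for this artifact_id")

func tryEnsureRootfsArtifact(artifactID string) error {
	muV, _ := artifactBuildLocks.LoadOrStore(artifactID, &sync.Mutex{})
	mu := muV.(*sync.Mutex)
	if !mu.TryLock() {
		return errBuildInProgress // caller/CLI decides whether to retry
	}
	defer mu.Unlock()
	// ... find-or-build flow as in ensureRootfsArtifact ...
	return nil
}
```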
Motivation

Quick-succession `cubemastercli tpl create-from-image` calls against the same OCI image spec race inside `ensureRootfsArtifact`: both goroutines resolve to the same `artifact_id`, share the same workDir/ext4Path, and one's `defer cleanupIntermediateArtifacts` wipes the ext4 file out from under the other. The surviving caller reaches `distributeRootfsArtifact` with a partial record (`ext4_size_bytes=0`, empty `download_token`, empty `master_node_ip`); `buildDownloadURL` falls back to `os.Hostname()`, and cubelet rejects the pull with `invalid size:0`; the template is marked FAILED with no clear signal of the underlying race.

Changes

- `artifactBuildLocks sync.Map` (keyed by `artifact_id`) serializes the full find-or-build flow in `ensureRootfsArtifact`. Granularity is `artifact_id` only; concurrent builds for different image specs still run in parallel.
- `distributeRootfsArtifact`: refuse to push a `CreateImage` when the artifact record is obviously incomplete, returning a diagnostic error instead of relying on cubelet's opaque `invalid size:0`.

Verification

- `go build ./CubeMaster/pkg/templatecenter/...` clean
- `go vet ./CubeMaster/pkg/templatecenter/...` clean