fix: apply migration compaction and deep job pickups fixes by deepshekhardas · Pull Request #1 · deepshekhardas/rudder-server

deepshekhardas · 2026-05-04T11:19:10Z

This PR applies fixes from two key PRs:

PR rudderlabs#6844 - Migration Compaction V2:

Fix router proxy lock handling to ensure all requests are processed before switching
Increase partition migration test timeouts for stability
Add guard to prevent getMigrationList from producing datasets larger than maxDSSize
Optimize dataset writer to keep only last dataset instead of last two

PR rudderlabs#6850 - Deep Job Pickups:

Implement deep job pickups feature with retry loop when DS limits are reached

All changes have been applied with conflict resolution.

Summary by cubic

Stabilizes migration compaction and router partition switchovers, and adds deep job pickups to prevent job starvation when dataset limits block reads. This improves reliability during heavy migrations.

New Features
- Deep job pickups: processor, router/batchrouter, and the job iterator now retry when reads return no jobs due to DS limits, until work is available.
Bug Fixes
- Migration compaction hardens jobsdb.getMigrationList: enforces maxDSSize, skips oversize pairs, stops piggybacking before the limit; writers keep only the last dataset; default jobMinRowsLeftMigrateThreshold raised to 0.6.
- Router proxy mapping updates now wait for in-flight requests to finish before switching, preventing misrouted requests.

^{Written for commit 46f6a87. Summary will update on new commits.}

…er than maxDSSize

cubic-dev-ai

4 issues found across 10 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="router/internal/jobiterator/jobiterator.go">

<violation number="1" location="router/internal/jobiterator/jobiterator.go:127">
P1: Use `DSLimitsReached` here; `LimitsReached` is a different jobs/events/payload limit, so the deep-pickup retry won't run when dataset limits are hit.</violation>
</file>

<file name="jobsdb/integration_test.go">

<violation number="1" location="jobsdb/integration_test.go:785">
P3: Assertion message is inconsistent with the asserted length (`3` vs message saying `2`).</violation>
</file>

<file name="jobsdb/migration_test.go">

<violation number="1" location="jobsdb/migration_test.go:816">
P2: Avoid strict duration comparisons in tests; they are nondeterministic and can cause flaky CI failures.</violation>
</file>

<file name="router/batchrouter/handle.go">

<violation number="1" location="router/batchrouter/handle.go:205">
P1: The new deep-pickup loop can run forever when DS limits remain reached, because it is unbounded and uses a non-cancelable context.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-05-04T11:24:27Z

+		if err != nil {
+			panic(err)
+		}
+		query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.LimitsReached


P1: Use DSLimitsReached here; LimitsReached is a different jobs/events/payload limit, so the deep-pickup retry won't run when dataset limits are hit.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At router/internal/jobiterator/jobiterator.go, line 127: <comment>Use `DSLimitsReached` here; `LimitsReached` is a different jobs/events/payload limit, so the deep-pickup retry won't run when dataset limits are hit.</comment> <file context> @@ -112,21 +112,27 @@ func (ji *Iterator) HasNext() bool { + if err != nil { + panic(err) + } + query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.LimitsReached } - ji.state.jobs = allJobsResult.Jobs </file context>

Suggested change

query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.LimitsReached

query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.DSLimitsReached

cubic-dev-ai · 2026-05-04T11:24:27Z

-	if err != nil {
-		brt.logger.Errorn("BRT: Error while reading from DB", obskit.DestinationType(brt.destType), obskit.Error(err))
-		panic(err)
+	for query := true; query; { // keep trying to get jobs while no jobs are returned because ds limits are being reached


P1: The new deep-pickup loop can run forever when DS limits remain reached, because it is unbounded and uses a non-cancelable context.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At router/batchrouter/handle.go, line 205: <comment>The new deep-pickup loop can run forever when DS limits remain reached, because it is unbounded and uses a non-cancelable context.</comment> <file context> @@ -202,14 +202,17 @@ func (brt *Handle) getWorkerJobs(partition string) (workerJobs []*DestinationJob - if err != nil { - brt.logger.Errorn("BRT: Error while reading from DB", obskit.DestinationType(brt.destType), obskit.Error(err)) - panic(err) + for query := true; query; { // keep trying to get jobs while no jobs are returned because ds limits are being reached + toProcess, err := misc.QueryWithRetriesAndNotify(context.Background(), brt.jobdDBQueryRequestTimeout.Load(), brt.jobdDBMaxRetries.Load(), func(ctx context.Context) (jobsdb.JobsResult, error) { + return brt.jobsDB.GetJobs(ctx, []string{jobsdb.Failed.State, jobsdb.Unprocessed.State}, queryParams) </file context>

cubic-dev-ai · 2026-05-04T11:24:27Z

+		t.Logf("lock check duration (skip):  %v", lockCheckDuration)
+
+		// The second call with skipBefore should be significantly faster
+		require.Greater(t, checkDuration, lockCheckDuration,


P2: Avoid strict duration comparisons in tests; they are nondeterministic and can cause flaky CI failures.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At jobsdb/migration_test.go, line 816: <comment>Avoid strict duration comparisons in tests; they are nondeterministic and can cause flaky CI failures.</comment> <file context> @@ -651,3 +651,212 @@ func Test_GetColumnConversion(t *testing.T) { + t.Logf("lock check duration (skip): %v", lockCheckDuration) + + // The second call with skipBefore should be significantly faster + require.Greater(t, checkDuration, lockCheckDuration, + "getMigrationList with skipBefore should be faster than without", + ) </file context>

cubic-dev-ai · 2026-05-04T11:24:27Z

-		require.Lenf(t, dsList, 2, "dsList length is not 1, got %+v", dsList)
-		require.Equal(t, prefix+"_jobs_5_1", dsList[0].JobTable) // 12 jobs
-		require.Equal(t, prefix+"_jobs_6", dsList[1].JobTable)   // 0 jobs
+		require.Lenf(t, dsList, 3, "dsList length is not 2, got %+v", dsList)


P3: Assertion message is inconsistent with the asserted length (3 vs message saying 2).

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At jobsdb/integration_test.go, line 785: <comment>Assertion message is inconsistent with the asserted length (`3` vs message saying `2`).</comment> <file context> @@ -780,11 +780,12 @@ func TestJobsDB(t *testing.T) { - require.Lenf(t, dsList, 2, "dsList length is not 1, got %+v", dsList) - require.Equal(t, prefix+"_jobs_5_1", dsList[0].JobTable) // 12 jobs - require.Equal(t, prefix+"_jobs_6", dsList[1].JobTable) // 0 jobs + require.Lenf(t, dsList, 3, "dsList length is not 2, got %+v", dsList) + require.Equal(t, prefix+"_jobs_4_1", dsList[0].JobTable) // 8 jobs + require.Equal(t, prefix+"_jobs_5", dsList[1].JobTable) // 4 jobs </file context>

Suggested change

require.Lenf(t, dsList, 3, "dsList length is not 2, got %+v", dsList)

require.Lenf(t, dsList, 3, "dsList length is not 3, got %+v", dsList)

deepshekhardas added 5 commits May 4, 2026 12:21

chore: router proxy should switch all requests before acking

a2f3d47

chore: increase partition migration test timeouts

6937e4e

chore(jobsdb): guard getMigrationList against producing datasets larg…

d3430cf

…er than maxDSSize

chore(jobsdb): keep only last dataset for writers instead of last two

b2a8db1

feat: deep job pickups

46f6a87

cubic-dev-ai Bot reviewed May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: apply migration compaction and deep job pickups fixes#1

fix: apply migration compaction and deep job pickups fixes#1
deepshekhardas wants to merge 5 commits into
masterfrom
fix/migration-compaction-deep-job-pickups

deepshekhardas commented May 4, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot May 4, 2026

Uh oh!

cubic-dev-ai Bot May 4, 2026

Uh oh!

cubic-dev-ai Bot May 4, 2026

Uh oh!

cubic-dev-ai Bot May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.LimitsReached
	query = ji.state.stats.QueryCount == 1 && len(jobs.Jobs) == 0 && jobs.DSLimitsReached

	require.Lenf(t, dsList, 3, "dsList length is not 2, got %+v", dsList)
	require.Lenf(t, dsList, 3, "dsList length is not 3, got %+v", dsList)

Conversation

deepshekhardas commented May 4, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by cubic

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

deepshekhardas commented May 4, 2026 •

edited by cubic-dev-ai Bot

Loading