Skip to content

[test] Fix flaky tests: test-only changes (split from #2624)#2852

Open
sushantmane wants to merge 2 commits into
linkedin:mainfrom
sushantmane:sumane/flaky-fixes-test-only
Open

[test] Fix flaky tests: test-only changes (split from #2624)#2852
sushantmane wants to merge 2 commits into
linkedin:mainfrom
sushantmane:sumane/flaky-fixes-test-only

Conversation

@sushantmane

Copy link
Copy Markdown
Contributor

Summary

Test-only half of #2624, split out for reviewability. 86 files, no production code touched.

Fix categories (details and verification in #2624):

  • AsyncGauge shared executor contention: dedicated AsyncGaugeExecutor per stats test instead of the shared static 3-thread executor that other tests can destroy via close()
  • SIT assertion races: replace racy verify(timeout().times(N)) with waitForNonDeterministicAssertion; fix concurrent Mockito stubbing with AtomicReference
  • JMM visibility: LeaderErrorNotifier.doOne made volatile (plain boolean written by drainer thread, read by test thread — timeouts could never fix this)
  • Insufficient CI timeouts: participant store setup 30s→60s, client test timeouts raised
  • Test-logic concurrency bugs: latch ordering, advance+verify retries, strict timestamp ordering in BackupVersionOptimizationServiceTest

All fixes were verified with 10/10 consecutive local passes and two rounds of targeted CI validation (90 job executions, zero failures) on the original PR.

Companion PRs:

Test plan

Test-side fixes for 20+ intermittently failing tests: dedicated
AsyncGaugeExecutor per stats test, replace racy verify(timeout())
with waitForNonDeterministicAssertion, latch-ordering fixes,
volatile LeaderErrorNotifier flag, and CI-appropriate timeouts.

Production-code fixes from linkedin#2624 are split into a follow-up PR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR contains test-only changes aimed at reducing CI flakiness across unit/integration/E2E suites by hardening asynchronous assertions, fixing test-side concurrency races, and increasing time budgets where CI contention makes prior limits unrealistic.

Changes:

  • Replaced racy sleeps / single-shot assertions with TestUtils.waitForNonDeterministicAssertion(...) and added targeted synchronization to eliminate ordering/visibility races in tests.
  • Introduced/used test utilities to make common operations more robust (e.g., idempotent createNewStore with retries; SleepStallingMockTime.awaitSleeper() to avoid “advance-before-sleep” races).
  • Adjusted a number of per-test and fixture timeouts/tolerances to better match worst-case CI behavior.

Reviewed changes

Copilot reviewed 86 out of 86 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
services/venice-server/src/test/java/com/linkedin/venice/stats/DiskHealthStatsTest.java Retry Tehuti AsyncGauge reads to avoid cached/stale first samples under load.
services/venice-router/src/test/java/com/linkedin/venice/router/throttle/RouterRequestThrottlingTest.java Loosen tolerance threshold to absorb token-bucket refill jitter in CI.
services/venice-router/src/test/java/com/linkedin/venice/router/api/TestDictionaryRetrievalService.java Reorder Mockito stubbing to avoid background-thread stubbing race.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestProtocolVersionAutoDetectionService.java Prevent ms-granularity timing assertions from flaking on fast runners.
internal/venice-test-common/src/main/java/com/linkedin/venice/utils/TestUtils.java Add createNewStoreWithRetry helper to treat 409 “already exists” as success after retries.
internal/venice-test-common/src/main/java/com/linkedin/venice/utils/SleepStallingMockTime.java Add awaitSleeper() to synchronize tests with threads parked in sleep().
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/utils/StoreMigrationTestUtil.java Increase propagation waits around store migration completion.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/utils/IntegrationTestPushUtils.java Retry store creation and treat “already exists” as idempotent success.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/storagenode/ReadComputeValidationTest.java Increase per-test timeout to account for restart cost under CI load.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/server/VeniceServerTest.java Add polling for transient server-start state; increase time budgets for teardown/drop flows.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestRouter.java Accept alternate expected exception types (router vs discovery) in mock setup.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestRead.java Warm up both GET and compute paths after updateStore to avoid transient 502s.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestConnectionWarmingForApacheAsyncClient.java Wait for async connection warming before asserting pool metrics.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestBlobDiscovery.java Increase ingestion wait budget in multi-region fixture setup.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartServerDuringIngestion.java Increase wait for routing-table catch-up after server restart.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartServerAfterDeletingSstFiles.java Retry on throwable to tolerate transient “no version” router errors after swap.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartController.java Replace single retry with polling for stable leader acceptance of request_topic.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/pubsub/api/admin/PubSubAdminAdapterTest.java Increase op timeout for multi-threaded delete in contention-heavy scenario.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/VeniceServerWrapper.java Increase state-transition metadata wait; make metric verification best-effort when metrics not registered.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/IntegrationTestUtils.java Increase participant store setup wait to tolerate commonPool starvation in CI.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/ingestionHeartbeat/IngestionHeartBeatTest.java Increase overall test timeout and inner push completion budget.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/helix/TestHelixCustomizedViewOfflinePushRepository.java Replace fixed sleep with polling until CV/cache observes resource drop.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/fastclient/ServerOverloadTest.java Poll Tehuti metrics to avoid async sampling/windowing races.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestTargetedRegionPushWithNativeReplication.java Retry on-disk size checks during async backup retirement; tolerate transient listFiles() issues.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreMigrationMultiRegion.java Increase @BeforeClass timeout for multi-region fixture bring-up.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreMigration.java Retry store reads with throwable retries to handle router “no version” window.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestServerIngestionPauseResume.java Increase push waits; strengthen paused-gauge convergence to all hosting servers.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestRepushDiagnostics.java Add explicit A/A propagation wait before pushes.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestPushJobWithNativeReplication.java Increase system store push readiness wait.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestPartialUpdateWithActiveActiveReplication.java Increase time budgets; wait for schema propagation to child controllers before WC writes.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestIncrementalPush.java Wait for A/A flag propagation to child controllers before first push.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestGlobalRtDiv.java Poll for async metadata-partition DIV chunk availability per server.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithSequentialRolloutWithDvc.java Increase outer/inner timeouts for sequential deferred swap convergence.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithSequentialRollout.java Set swap wait time at create time; increase convergence wait.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithFailingRegions.java Increase wait for push-status system store readiness.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestBatch.java Increase dictionary readiness wait; increase async version cleanup wait.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestAdminOperationWithPreviousVersion.java Retry initial reads; use createNewStoreWithRetry for transient create failures.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestActiveActiveVersionSwapMessage.java Increase overall test cap and decouple push timeout from test timeout.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/PushStatusStoreTest.java Poll async API until it matches sync path, avoiding cache-lag flake.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/PartialUpdateTest.java Harden router readiness wait and store creation; fix test premise by removing list-field update from ignored record.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciP2PBlobTransferRecoveryTest.java Wait for blob peer discovery on all partitions before starting second client.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciLiveUpdateSuppressionTest.java Retry on throwable for transient router “no version” window.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciClientP2PBlobTransferTest.java Wait for blob peer discovery on all partitions (and apply to non-lagging variant).
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DataRecoveryTest.java Adjust topic cleanup behavior/timeouts; replace fast-fail push completion with retrying status polling.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/BlobP2PTransferAmongServersTest.java Add FULLY_REPLICATED gating before restarts; increase long-running test timeouts/waits.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/AbstractTestRepush.java Add reusable helper to wait for A/A flag propagation to all child regions.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/VeniceParentHelixAdminSchemaTest.java Wait for child controller to receive both flag and schema update before asserting.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestVeniceHelixAdminWithIsolatedEnvironment.java Ensure locally constructed admins are closed to avoid leaking init-routine retry chains.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestTopicRequestOnHybridDelete.java Use blocking empty push + retry-on-throwable read to avoid router propagation race.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestParentControllerWithMultiDataCenter.java Increase push wait; add propagation retry for child getStore; increase empty-push verification wait.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestHybridStoreRepartitioningWithMultiDataCenter.java Increase timeouts and empty-push wait for hybrid completion under CI load.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestHolisticSeverHealthCheck.java Increase wait for mocked CV visibility to controller handler thread.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/server/TestAdminSparkServer.java Retry for child visibility after async pushes; gate deletable topics on Helix EV purge.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/AbstractTestVeniceHelixAdmin.java Increase long-test timeout to avoid Kafka admin createTopic blocking beyond cap.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/VersionSpecificCDCShutdownTest.java Increase test timeout; disable persistently flaky test; stabilize checkpoint handling (per-partition latest).
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/TestVersionSpecificChangelogConsumerReplicationMetadata.java Increase heavy test timeout to avoid outer-cap flakes.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/TestStatelessChangelogConsumerWithCompression.java Increase offline job start timeout to avoid resource-assignment timeouts.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/StatefulVeniceChangelogConsumerTest.java Replace fixed sleep with per-partition peer discovery + eventual snapshot existence assertion.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/ChangelogConsumerTestUtils.java Increase meta system store readiness wait budget in fixtures.
internal/venice-test-common/src/integrationTest/java/com/linkedin/davinci/notifier/LeaderErrorNotifier.java Fix visibility and leader predicate race; defer ERROR write to avoid EV timing window.
internal/venice-common/src/test/java/com/linkedin/venice/servicediscovery/TestServiceDiscoveryAnnouncerRetryTask.java Replace fixed sleeps with converging assertion polling on retry-queue state.
internal/venice-common/src/test/java/com/linkedin/venice/memory/HeapSizeEstimatorTest.java Increase attempts and widen JDK8 empirical tolerance to reduce GC-noise flakes.
internal/venice-client-common/src/test/java/com/linkedin/venice/utils/RetryUtilsTest.java Increase retry maxDuration in tests to avoid CI scheduling prematurely exhausting retry budget.
internal/venice-avro-compatibility-test/src/test/java/com/linkedin/venice/VeniceClientCompatibilityTest.java Increase cluster bootstrap/readiness waits for forked initializer under CI load.
internal/alpini/common/alpini-common-base/src/test/java/com/linkedin/alpini/base/registry/TestResourceRegistry.java Add explicit latch barrier to remove shutdown-phase race in assertions.
clients/venice-client/src/test/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClientTest.java Increase outer cap; extend metrics convergence wait to account for OccurrenceRate window decay.
clients/venice-client/src/test/java/com/linkedin/venice/fastclient/meta/RequestBasedMetadataTest.java Replace Mockito spy verification with override-based counting for self-invocation limitation.
clients/venice-client/src/test/java/com/linkedin/venice/fastclient/meta/InstanceHealthMonitorTest.java Increase HB timeout and assertion windows to reduce scheduling jitter flakes.
clients/venice-client/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java Poll until per-route metrics are registered before validating.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/StoreBackendTest.java Increase eventual assertion budget for async metrics/AsyncGauge sampling.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/storage/DiskHealthCheckServiceTest.java Make health-check assertions wait on actual task/file creation rather than fixed sleep.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionStatsTest.java Ensure stats executor is shut down to prevent thread leakage across tests.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/DIVStatsReporterTest.java Add eventual assertions for AsyncGauge sampling/cached-value races.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/BlobTransferStatsTest.java Add eventual assertions helper for AsyncGauge metrics.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/AggHostLevelIngestionStatsTest.java Use dedicated metrics repo/executor and ensure close to avoid static executor contamination.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTaskTest.java Fix Mockito mock reuse; adjust reset/throttle/resubscribe assertions and time budgets; skip structurally flaky reset tests in subclasses.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithSAwarePWiseWithoutBufferAfterLeaderTest.java Skip structurally flaky testResetPartition for this subclass/config.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithSAwarePWiseAndBufferAfterLeaderTest.java Skip structurally flaky testResetPartition for this subclass/config.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithPWiseWithoutBufferAfterLeaderTest.java Skip structurally flaky testResetPartition for this subclass/config.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithPWiseAndBufferAfterLeaderTest.java Skip structurally flaky testResetPartition for this subclass/config.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/ParticipantStoreConsumptionTaskTest.java Synchronize mock time advancement with sleeper threads via awaitSleeper().
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/KafkaConsumerServiceDelegatorTest.java Make stress loop robust to documented transient exceptions; best-effort cleanup on interrupt.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/ActiveKeyCountScenarioTest.java Configure AsyncGauge executor to avoid slow-metric blacklisting/cached readings under CI.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/helix/LeaderFollowerPartitionStateModelTest.java Use doReturn().when() to avoid concurrent Mockito stubbing failures.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerImplTest.java Stub getAssignment() to prevent ClassCastException from Mockito smart defaults.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/blobtransfer/TestNettyP2PBlobTransferManager.java Poll for async decrement on channelInactive before asserting concurrent snapshot users.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +434 to +443
try (RequestBasedMetadata requestBasedMetadata = new RequestBasedMetadata(clientConfig, d2TransportClient) {
@Override
synchronized void updateCache(boolean onDemandRefresh) throws InterruptedException {
if (onDemandRefresh) {
onDemandRefreshCalls.incrementAndGet();
} else {
initialRefreshCalls.incrementAndGet();
}
super.updateCache(onDemandRefresh);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c58e0ad: the override now returns List<Runnable> and propagates the callbacks via return super.updateCache(onDemandRefresh). The signature drifted when main's RequestBasedMetadata changed after the original #2624 branch was cut.

RequestBasedMetadata.updateCache now returns List<Runnable>; adapt the
counting override accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants