[test] Fix flaky tests: test-only changes (split from #2624)#2852
Open
sushantmane wants to merge 2 commits into
Open
[test] Fix flaky tests: test-only changes (split from #2624)#2852sushantmane wants to merge 2 commits into
sushantmane wants to merge 2 commits into
Conversation
Test-side fixes for 20+ intermittently failing tests: dedicated AsyncGaugeExecutor per stats test, replace racy verify(timeout()) with waitForNonDeterministicAssertion, latch-ordering fixes, volatile LeaderErrorNotifier flag, and CI-appropriate timeouts. Production-code fixes from linkedin#2624 are split into a follow-up PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 task
There was a problem hiding this comment.
Pull request overview
This PR contains test-only changes aimed at reducing CI flakiness across unit/integration/E2E suites by hardening asynchronous assertions, fixing test-side concurrency races, and increasing time budgets where CI contention makes prior limits unrealistic.
Changes:
- Replaced racy sleeps / single-shot assertions with
TestUtils.waitForNonDeterministicAssertion(...)and added targeted synchronization to eliminate ordering/visibility races in tests. - Introduced/used test utilities to make common operations more robust (e.g., idempotent
createNewStorewith retries;SleepStallingMockTime.awaitSleeper()to avoid “advance-before-sleep” races). - Adjusted a number of per-test and fixture timeouts/tolerances to better match worst-case CI behavior.
Reviewed changes
Copilot reviewed 86 out of 86 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/test/java/com/linkedin/venice/stats/DiskHealthStatsTest.java | Retry Tehuti AsyncGauge reads to avoid cached/stale first samples under load. |
| services/venice-router/src/test/java/com/linkedin/venice/router/throttle/RouterRequestThrottlingTest.java | Loosen tolerance threshold to absorb token-bucket refill jitter in CI. |
| services/venice-router/src/test/java/com/linkedin/venice/router/api/TestDictionaryRetrievalService.java | Reorder Mockito stubbing to avoid background-thread stubbing race. |
| services/venice-controller/src/test/java/com/linkedin/venice/controller/TestProtocolVersionAutoDetectionService.java | Prevent ms-granularity timing assertions from flaking on fast runners. |
| internal/venice-test-common/src/main/java/com/linkedin/venice/utils/TestUtils.java | Add createNewStoreWithRetry helper to treat 409 “already exists” as success after retries. |
| internal/venice-test-common/src/main/java/com/linkedin/venice/utils/SleepStallingMockTime.java | Add awaitSleeper() to synchronize tests with threads parked in sleep(). |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/utils/StoreMigrationTestUtil.java | Increase propagation waits around store migration completion. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/utils/IntegrationTestPushUtils.java | Retry store creation and treat “already exists” as idempotent success. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/storagenode/ReadComputeValidationTest.java | Increase per-test timeout to account for restart cost under CI load. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/server/VeniceServerTest.java | Add polling for transient server-start state; increase time budgets for teardown/drop flows. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestRouter.java | Accept alternate expected exception types (router vs discovery) in mock setup. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestRead.java | Warm up both GET and compute paths after updateStore to avoid transient 502s. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestConnectionWarmingForApacheAsyncClient.java | Wait for async connection warming before asserting pool metrics. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/router/TestBlobDiscovery.java | Increase ingestion wait budget in multi-region fixture setup. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartServerDuringIngestion.java | Increase wait for routing-table catch-up after server restart. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartServerAfterDeletingSstFiles.java | Retry on throwable to tolerate transient “no version” router errors after swap. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/restart/TestRestartController.java | Replace single retry with polling for stable leader acceptance of request_topic. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/pubsub/api/admin/PubSubAdminAdapterTest.java | Increase op timeout for multi-threaded delete in contention-heavy scenario. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/VeniceServerWrapper.java | Increase state-transition metadata wait; make metric verification best-effort when metrics not registered. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/IntegrationTestUtils.java | Increase participant store setup wait to tolerate commonPool starvation in CI. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/ingestionHeartbeat/IngestionHeartBeatTest.java | Increase overall test timeout and inner push completion budget. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/helix/TestHelixCustomizedViewOfflinePushRepository.java | Replace fixed sleep with polling until CV/cache observes resource drop. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/fastclient/ServerOverloadTest.java | Poll Tehuti metrics to avoid async sampling/windowing races. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestTargetedRegionPushWithNativeReplication.java | Retry on-disk size checks during async backup retirement; tolerate transient listFiles() issues. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreMigrationMultiRegion.java | Increase @BeforeClass timeout for multi-region fixture bring-up. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreMigration.java | Retry store reads with throwable retries to handle router “no version” window. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestServerIngestionPauseResume.java | Increase push waits; strengthen paused-gauge convergence to all hosting servers. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestRepushDiagnostics.java | Add explicit A/A propagation wait before pushes. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestPushJobWithNativeReplication.java | Increase system store push readiness wait. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestPartialUpdateWithActiveActiveReplication.java | Increase time budgets; wait for schema propagation to child controllers before WC writes. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestIncrementalPush.java | Wait for A/A flag propagation to child controllers before first push. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestGlobalRtDiv.java | Poll for async metadata-partition DIV chunk availability per server. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithSequentialRolloutWithDvc.java | Increase outer/inner timeouts for sequential deferred swap convergence. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithSequentialRollout.java | Set swap wait time at create time; increase convergence wait. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithFailingRegions.java | Increase wait for push-status system store readiness. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestBatch.java | Increase dictionary readiness wait; increase async version cleanup wait. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestAdminOperationWithPreviousVersion.java | Retry initial reads; use createNewStoreWithRetry for transient create failures. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestActiveActiveVersionSwapMessage.java | Increase overall test cap and decouple push timeout from test timeout. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/PushStatusStoreTest.java | Poll async API until it matches sync path, avoiding cache-lag flake. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/PartialUpdateTest.java | Harden router readiness wait and store creation; fix test premise by removing list-field update from ignored record. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciP2PBlobTransferRecoveryTest.java | Wait for blob peer discovery on all partitions before starting second client. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciLiveUpdateSuppressionTest.java | Retry on throwable for transient router “no version” window. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DaVinciClientP2PBlobTransferTest.java | Wait for blob peer discovery on all partitions (and apply to non-lagging variant). |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/DataRecoveryTest.java | Adjust topic cleanup behavior/timeouts; replace fast-fail push completion with retrying status polling. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/BlobP2PTransferAmongServersTest.java | Add FULLY_REPLICATED gating before restarts; increase long-running test timeouts/waits. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/AbstractTestRepush.java | Add reusable helper to wait for A/A flag propagation to all child regions. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/VeniceParentHelixAdminSchemaTest.java | Wait for child controller to receive both flag and schema update before asserting. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestVeniceHelixAdminWithIsolatedEnvironment.java | Ensure locally constructed admins are closed to avoid leaking init-routine retry chains. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestTopicRequestOnHybridDelete.java | Use blocking empty push + retry-on-throwable read to avoid router propagation race. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestParentControllerWithMultiDataCenter.java | Increase push wait; add propagation retry for child getStore; increase empty-push verification wait. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestHybridStoreRepartitioningWithMultiDataCenter.java | Increase timeouts and empty-push wait for hybrid completion under CI load. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/TestHolisticSeverHealthCheck.java | Increase wait for mocked CV visibility to controller handler thread. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/server/TestAdminSparkServer.java | Retry for child visibility after async pushes; gate deletable topics on Helix EV purge. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/controller/AbstractTestVeniceHelixAdmin.java | Increase long-test timeout to avoid Kafka admin createTopic blocking beyond cap. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/VersionSpecificCDCShutdownTest.java | Increase test timeout; disable persistently flaky test; stabilize checkpoint handling (per-partition latest). |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/TestVersionSpecificChangelogConsumerReplicationMetadata.java | Increase heavy test timeout to avoid outer-cap flakes. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/TestStatelessChangelogConsumerWithCompression.java | Increase offline job start timeout to avoid resource-assignment timeouts. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/StatefulVeniceChangelogConsumerTest.java | Replace fixed sleep with per-partition peer discovery + eventual snapshot existence assertion. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/consumer/ChangelogConsumerTestUtils.java | Increase meta system store readiness wait budget in fixtures. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/davinci/notifier/LeaderErrorNotifier.java | Fix visibility and leader predicate race; defer ERROR write to avoid EV timing window. |
| internal/venice-common/src/test/java/com/linkedin/venice/servicediscovery/TestServiceDiscoveryAnnouncerRetryTask.java | Replace fixed sleeps with converging assertion polling on retry-queue state. |
| internal/venice-common/src/test/java/com/linkedin/venice/memory/HeapSizeEstimatorTest.java | Increase attempts and widen JDK8 empirical tolerance to reduce GC-noise flakes. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/utils/RetryUtilsTest.java | Increase retry maxDuration in tests to avoid CI scheduling prematurely exhausting retry budget. |
| internal/venice-avro-compatibility-test/src/test/java/com/linkedin/venice/VeniceClientCompatibilityTest.java | Increase cluster bootstrap/readiness waits for forked initializer under CI load. |
| internal/alpini/common/alpini-common-base/src/test/java/com/linkedin/alpini/base/registry/TestResourceRegistry.java | Add explicit latch barrier to remove shutdown-phase race in assertions. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClientTest.java | Increase outer cap; extend metrics convergence wait to account for OccurrenceRate window decay. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/meta/RequestBasedMetadataTest.java | Replace Mockito spy verification with override-based counting for self-invocation limitation. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/meta/InstanceHealthMonitorTest.java | Increase HB timeout and assertion windows to reduce scheduling jitter flakes. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java | Poll until per-route metrics are registered before validating. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/StoreBackendTest.java | Increase eventual assertion budget for async metrics/AsyncGauge sampling. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/storage/DiskHealthCheckServiceTest.java | Make health-check assertions wait on actual task/file creation rather than fixed sleep. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionStatsTest.java | Ensure stats executor is shut down to prevent thread leakage across tests. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/DIVStatsReporterTest.java | Add eventual assertions for AsyncGauge sampling/cached-value races. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/BlobTransferStatsTest.java | Add eventual assertions helper for AsyncGauge metrics. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/AggHostLevelIngestionStatsTest.java | Use dedicated metrics repo/executor and ensure close to avoid static executor contamination. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTaskTest.java | Fix Mockito mock reuse; adjust reset/throttle/resubscribe assertions and time budgets; skip structurally flaky reset tests in subclasses. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithSAwarePWiseWithoutBufferAfterLeaderTest.java | Skip structurally flaky testResetPartition for this subclass/config. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithSAwarePWiseAndBufferAfterLeaderTest.java | Skip structurally flaky testResetPartition for this subclass/config. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithPWiseWithoutBufferAfterLeaderTest.java | Skip structurally flaky testResetPartition for this subclass/config. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/SITWithPWiseAndBufferAfterLeaderTest.java | Skip structurally flaky testResetPartition for this subclass/config. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/ParticipantStoreConsumptionTaskTest.java | Synchronize mock time advancement with sleeper threads via awaitSleeper(). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/KafkaConsumerServiceDelegatorTest.java | Make stress loop robust to documented transient exceptions; best-effort cleanup on interrupt. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/ActiveKeyCountScenarioTest.java | Configure AsyncGauge executor to avoid slow-metric blacklisting/cached readings under CI. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/helix/LeaderFollowerPartitionStateModelTest.java | Use doReturn().when() to avoid concurrent Mockito stubbing failures. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerImplTest.java | Stub getAssignment() to prevent ClassCastException from Mockito smart defaults. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/blobtransfer/TestNettyP2PBlobTransferManager.java | Poll for async decrement on channelInactive before asserting concurrent snapshot users. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+434
to
+443
| try (RequestBasedMetadata requestBasedMetadata = new RequestBasedMetadata(clientConfig, d2TransportClient) { | ||
| @Override | ||
| synchronized void updateCache(boolean onDemandRefresh) throws InterruptedException { | ||
| if (onDemandRefresh) { | ||
| onDemandRefreshCalls.incrementAndGet(); | ||
| } else { | ||
| initialRefreshCalls.incrementAndGet(); | ||
| } | ||
| super.updateCache(onDemandRefresh); | ||
| } |
Contributor
Author
RequestBasedMetadata.updateCache now returns List<Runnable>; adapt the counting override accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Test-only half of #2624, split out for reviewability. 86 files, no production code touched.
Fix categories (details and verification in #2624):
AsyncGaugeExecutorper stats test instead of the shared static 3-thread executor that other tests can destroy viaclose()verify(timeout().times(N))withwaitForNonDeterministicAssertion; fix concurrent Mockito stubbing withAtomicReferenceLeaderErrorNotifier.doOnemadevolatile(plain boolean written by drainer thread, read by test thread — timeouts could never fix this)BackupVersionOptimizationServiceTestAll fixes were verified with 10/10 consecutive local passes and two rounds of targeted CI validation (90 job executions, zero failures) on the original PR.
Companion PRs:
build.gradletest-retry removal from [test] Fix 20+ flaky tests across multiple modules #2624 is intentionally excluded — it should land last, after flakies are fixedTest plan
🤖 Generated with Claude Code