Skip to content

perf(storage): speed up connection load and clarify in-memory wrapping#726

Merged
tnaum-ms merged 20 commits into
mainfrom
dev/tnaum/storeage-optimization
Jun 3, 2026
Merged

perf(storage): speed up connection load and clarify in-memory wrapping#726
tnaum-ms merged 20 commits into
mainfrom
dev/tnaum/storeage-optimization

Conversation

@tnaum-ms

@tnaum-ms tnaum-ms commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Why

Connection storage code showed "migration" running in traces on every load, and connection loading was slow on Remote-WSL where each SecretStorage read crosses the WSL2↔Windows boundary. Investigation found the cost came from (a) re-reading every storage zone multiple times during startup cleanup, (b) running that cleanup on every session, (c) reading item secrets one-at-a-time, and (d) independent activation-time consumers each re-reading the same zone back-to-back.

What changed (one concern per commit)

  1. refactor(storage): clarify read-time wrapping is in-memory, not migration — Rename migrateToV2/migrateToV3wrapV1AsV2/wrapV2AsCurrent and fix trace text. Wrapping an old stored item into the current shape is a pure in-memory transform recomputed on every read; it was never persisted, so "migration" in the trace was misleading.
  2. perf(storage): load each zone once for post-migration cleanup — The three cleaners each re-read every zone (~6 zone reads). Now each zone is loaded once and the in-memory array is threaded through all cleaners (2 reads).
  3. perf(storage): gate startup cleanup behind a one-time version marker — A STORAGE_CLEANUP_VERSION marker is persisted after a successful cleanup pass; subsequent loads skip the whole pass. The marker only bumps when a new one-time step is added.
  4. perf(storage): read item secrets concurrently in getItems — Replace the per-item sequential await secretStorage.get() loop with Promise.all, pipelining round trips over the shared RPC channel (~N round trips → ~1).
  5. chore(storage): drop one-time Azure Databases connection import — Remove the cross-extension import from ms-azuretools.vscode-cosmosdb. It's self-limiting, the remaining population is flat, and it added startup overhead on every load. Unlike on-disk format readers (kept), dropping it only affects users who never opened the extension, with a clear workaround (re-add the connection).
  6. chore(l10n): drop strings for removed Azure Databases import — Mechanical follow-up so the localization bundle doesn't carry dead keys.
  7. perf(storage): coalesce and briefly cache getItems during activation — Add a short-lived, per-workspace read-through cache to StorageImpl.getItems(). Concurrent callers share one in-flight read (request coalescing), and callers within a short TTL reuse the resolved snapshot instead of issuing a fresh read. Every caller gets a defensive copy; push/delete invalidate immediately and failed reads are evicted so they retry. Adds storageService.test.ts (6 tests).

Performance impact

In our test environments, these changes reduce redundant secret-store reads during activation by well over half (roughly 60–80% depending on the connection profile). Because each avoided read is a round trip to secure storage, this is a noticeable speed-up in connection load time — with the largest gains on remote setups (e.g. Remote-WSL), where every saved read also avoids a cross-boundary round trip.

Beyond the read count, the previous behavior re-read the same storage zones repeatedly spread across several seconds of startup (cleanup ran on every session and independent consumers each triggered their own reads). These changes collapse that into a single pass per zone, so storage is no longer being re-hammered well after activation begins.

Validation

  • npm run l10n, npm run prettier-fix, npm run lint — clean
  • npx jest --no-coverage — 2001 tests / 102 suites pass (added storageService.test.ts, 6 tests)
  • npm run build — clean

A detailed write-up of the reasoning is in docs/ai-and-plans/PRs/.

tnaum-ms added 6 commits June 2, 2026 22:43
…tion

Rename the on-read transform helpers from migrateToV2/migrateToV3 to
wrapV1AsV2/wrapV2AsCurrent and update trace messages to state explicitly
that wrapping an older stored item into the current StoredItem shape is a
pure in-memory operation that is recomputed on every read and never
persisted. This removes the confusion of seeing 'migration' in the trace
on every load when nothing is actually being migrated to disk.
Previously fixFolderConnectionStrings, cleanupDuplicateConnectionStringParameters
and cleanupInvalidConnectionStrings each re-read every storage zone via
getItems(), so startup cleanup performed ~6 full zone reads (3 cleaners x 2
zones). Each getItems() does a SecretStorage round-trip per item, which on
Remote-WSL crosses the WSL2 <-> Windows boundary and dominates load time.

resolvePostMigrationErrors now loads each zone exactly once and threads the
same in-memory array through all three cleaners, cutting zone reads from ~6 to
2. The cleaners operate on disjoint item sets (folders vs. connections vs.
invalid/unparseable connections) so a single read per zone is safe. Telemetry
counters are now aggregated by the orchestrator.
Add a STORAGE_CLEANUP_VERSION marker ('0.8.1', the patch this ships with)
persisted to globalState once resolvePostMigrationErrors completes. On every
subsequent load, if the stored marker matches the current cleanup-schema
version, the whole cleanup pass (folder fixups, duplicate-param cleanup,
invalid-connection removal, orphan sweep) is skipped instead of re-scanning
both zones each session.

The marker doubles as a guarantee: any install carrying '0.8.1' is known to
have completed every one-time format upgrade and corruption cleanup up to that
release. The constant is only bumped when a new one-time step is added, in
which case existing installs re-run the pass exactly once. The marker is
written only after a successful run, so interrupted runs safely retry.
Adds cleanupSkipped/cleanupVersion telemetry for visibility.
getItems awaited ext.secretStorage.get() once per item inside a for-loop, so
N items cost N sequential round trips to the main process. Under Remote-WSL
each round trip also crosses the WSL2 <-> Windows boundary, making this the
dominant component of connection load time.

Collect item metadata from globalState first (synchronous, in-memory), then
dispatch all secret reads together with Promise.all so the round trips
pipeline over the shared RPC channel. Total wait drops from ~N round trips to
~1. This is safe because VS Code serializes secret access per-key rather than
globally and caches the decrypted store after the first read, so reads for
distinct item keys do not block one another. Behavior and return shape are
unchanged.
Remove migrateFromAzureDatabases, getMongoMigrationApi, the
MongoConnectionMigrationApi interface, the attempts-counter globalState key,
and the gating call in getStorageService (plus now-unused apiUtils, vscode,
l10n and isVCoreAndRURolloutEnabled imports).

This path imported MongoDB cluster connections from the Azure Databases
(ms-azuretools.vscode-cosmosdb) extension on first storage access. It shipped
several releases ago, is self-limiting (only acts when that extension is
installed and still holds un-imported connections), and otherwise added pure
startup overhead on every load (extension lookup + cross-extension API
negotiation) behind an attempts<20 counter. The remaining population is
effectively flat. Unlike the on-disk format readers (wrapV1AsV2 etc., kept),
dropping this only affects a narrow set of users who never opened the
extension, and they have a clear workaround: re-add the connection. The
read-time format wrapping is unaffected.
Copilot AI review requested due to automatic review settings June 2, 2026 21:02
@tnaum-ms tnaum-ms requested a review from a team as a code owner June 2, 2026 21:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces connection-load latency (notably under Remote-WSL) by cutting down SecretStorage round-trips and by avoiding repeated post-migration cleanup work on every activation. It also clarifies that the legacy-record “migration” work is an in-memory wrap performed at read time (not a persisted migration), and removes the legacy Azure Databases cross-extension import path.

Changes:

  • Parallelize SecretStorage reads in Storage.getItems() with Promise.all() to avoid N sequential RPC round trips.
  • Refactor startup cleanup to load each storage zone once, and gate the cleanup pass behind a persisted “cleanup completed” marker.
  • Rename legacy “migration” helpers and traces to reflect in-memory wrapping; remove the Azure Databases import + related l10n keys; add a design write-up doc.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/services/storageService.ts Dispatches SecretStorage reads concurrently in getItems() to reduce latency.
src/services/connectionStorageService.ts Gates and refactors startup cleanup; clarifies in-memory wrapping; removes Azure Databases import path.
l10n/bundle.l10n.json Removes localization entries that were only used by deleted Azure Databases import code.
docs/ai-and-plans/PRs/726-storage-load-optimization.md Adds a rationale/design note documenting the performance work and tradeoffs.

Comment thread src/services/connectionStorageService.ts
Comment thread src/services/connectionStorageService.ts Outdated
Comment thread docs/ai-and-plans/PRs/726-storage-load-optimization.md Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread src/services/storageService.ts
Comment thread src/services/storageService.ts
tnaum-ms added 4 commits June 3, 2026 10:14
Previously the orphan cleanup ran fire-and-forget while the cleanup-completed
marker was written immediately. An interrupted or failed orphan pass would
leave orphans behind and the marker would suppress every future cleanup
attempt, hiding affected items from the tree until the user reinstalled
the extension or manually cleared globalState.

Awaiting the orphan pass ensures the marker is only written on a successful
run. callWithTelemetryAndErrorHandling still swallows exceptions, so a thrown
orphan-cleanup error simply prevents the marker write and the next activation
retries the whole pass.

Addresses Copilot review thread PRRT_kwDOODtcO86GkTae on PR #726.
… dirty reads

Concurrent activation-path callers (tree providers, URI handler, stats hook)
could race the storage-service bootstrap. The previous code assigned
_storageService before awaiting cleanup, so a second caller arriving in the
same tick would see the field truthy, return immediately, and read dirty
data (duplicate-param connections, invalid CS rows, orphans) while cleanup
was still running in parallel. The cache then served that snapshot to
later readers for up to TTL.

Now getStorageService caches a single in-flight bootstrap promise; all
external callers await it. _storageService is published only after
resolvePostMigrationErrors and collectStorageStats complete, so no consumer
ever observes a partially-bootstrapped service.

To avoid re-entrant deadlock, the cleanup helpers (resolvePostMigrationErrors,
cleanupOrphanedItems, collectStorageStats, and the inner cleaners) now
receive the Storage instance as a parameter instead of calling
getStorageService. A new private loadAllItemsFromService helper is shared
between getAllItems and the cleanup helpers.
…nErrors

The three cleaners in resolvePostMigrationErrors share a single pre-cleanup
items snapshot. They are currently safe because they operate on disjoint
categories (folders vs. connections vs. invalid/unparseable), but a future
cleaner that depends on the in-memory state of a previous cleaner would
silently see stale data.

Make the invariant explicit so the next person extending the cleanup pass
either preserves it or rebuilds the items array between cleaners (cheap
now thanks to the per-workspace cache).
… changes

Local push/delete invalidate the per-workspace cache, but a SecretStorage
mutation from a second VS Code window (or any other extension writing
through the same SecretStorage) was not observed. The first window would
keep serving the stale snapshot for up to GET_ITEMS_CACHE_TTL_MS.

Subscribe to SecretStorage.onDidChange and drop the cache entry for the
affected workspace when a secret in this storage namespace changes.
Unrelated secret churn is ignored.
@tnaum-ms

tnaum-ms commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Fix: serialize storage-service bootstrap behind a shared promiseea95be1

Concurrent activation-path callers (tree providers, URI handler, stats hook) could race the lazy getStorageService() init: the field was assigned before awaiting cleanup, so a second caller arriving in the same tick saw it truthy, returned immediately, and read dirty data (duplicate-param connections, invalid CS rows, orphans) while cleanup was still running. The new getItems cache then served that snapshot for up to TTL.

After this commit, getStorageService() caches a single in-flight bootstrap promise; all external callers await it. _storageService is published only after resolvePostMigrationErrors and collectStorageStats complete, so no consumer ever observes a partially-bootstrapped service.

To avoid re-entrant deadlock, the cleanup helpers (resolvePostMigrationErrors, cleanupOrphanedItems, collectStorageStats, and the inner cleaners) now receive the Storage instance as a parameter instead of calling getStorageService themselves. A new private loadAllItemsFromService helper is shared between getAllItems and the cleanup helpers.

@tnaum-ms

tnaum-ms commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Fix: invalidate getItems cache on cross-window SecretStorage changes2acace9

Local push/delete invalidate the per-workspace cache, but a SecretStorage mutation from a second VS Code window (or any other extension writing through the same SecretStorage namespace) was not observed. The first window kept serving the stale snapshot for up to GET_ITEMS_CACHE_TTL_MS (10 s) — realistic for users who keep the DocumentDB tree open in two windows on the same profile.

This commit subscribes StorageImpl to SecretStorage.onDidChange and drops the cache entry for the affected workspace when a changed key belongs to this storage namespace. Unrelated secret churn is ignored. The subscription is registered via ext.context.subscriptions so it's disposed with the extension.

@tnaum-ms

tnaum-ms commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Doc: document cleaner-pass invariant in resolvePostMigrationErrorsd9dd080

The three cleaners in resolvePostMigrationErrors share a single pre-cleanup items snapshot loaded once per zone. They're currently safe because they operate on disjoint categories (folders / connections / invalid CS), but a future cleaner that depends on the in-memory state mutated by a previous cleaner would silently see stale data — saves/deletes are persisted, but the in-memory items are still the pre-cleanup snapshot.

This commit adds an explicit invariant comment so the next person extending the cleanup pass either preserves the disjoint-category contract or rebuilds the items array between cleaners (cheap now thanks to the per-workspace cache).

tnaum-ms added 4 commits June 3, 2026 10:22
…ons doc

- Move 726-storage-load-optimization.md into 726-storage-load-optimization/ subfolder.
- Add review-and-resolutions.md capturing each review finding and what was done
  about it (commit SHAs, decisions, deferrals).
STORAGE_CLEANUP_VERSION was a semver string ('0.8.1') even though it has
nothing to do with the extension version — it's a cleanup-schema contract
marker. Using a semver here invited drift: future contributors could assume
it tracks package.json and either fail to bump it on a new cleanup step
(silent skip) or bump it on every release (unnecessary re-runs).

Switch to a plain integer counter starting at 1, and update the read-side
type to accept either number (current) or the legacy string ('0.8.1') so
existing installs simply fail the equality check and re-run the cleanup
once — harmless because every cleaner is idempotent.

Also surface the previous marker value via telemetry so we can observe how
many installs cross the legacy->integer boundary.

Addresses Copilot review thread PRRT_kwDOODtcO86GkTa4 on PR #726.
…-time

Two reviewer-flagged contract gaps in storageService.ts that don't need code
changes but do need accurate docs/comments so future maintainers understand
the trade-offs:

1. getItems defensive copy is shallow: nested 'properties' is shared with the
   cached snapshot. Audit of every current consumer shows no problematic
   mutator (wrapV2AsCurrent is idempotent, updateParentId immediately awaits
   save which invalidates the cache). Cloning properties via structuredClone
   on every read would cost CPU on the activation critical path this cache
   exists to optimize. Tighten the JSDoc to spell out the contract — item
   and secrets are fresh, properties is shared, do not mutate.

2. getOrLoadItems timestamps the cache entry at creation, not resolution.
   A slow SecretStorage load shrinks (or, pathologically, eliminates) the
   post-resolution reuse window. In practice concurrent in-flight callers
   are coalesced via the shared promise, so the duplicate-load risk only
   matters after first resolution. Add an inline comment documenting the
   trade-off and noting that bumping GET_ITEMS_CACHE_TTL_MS is the cheap
   escape hatch if telemetry ever shows the window collapsing.

Addresses Copilot review threads PRRT_kwDOODtcO86Gk_WK and
PRRT_kwDOODtcO86Gk_Vd on PR #726.
The 'Commits: 7 on top of main' line was already inaccurate and would
continue to drift on every push. The commit list is one click away on the
PR page; keeping a frozen count in a checked-in doc only invites further
drift.

Addresses Copilot review thread PRRT_kwDOODtcO86GkTbL on PR #726.
Add a format census to connectionStorage.stats so we can decide when the
read-time v1/v2 wrapping code in fromStorageItem is safe to remove.

Per zone (clusters, emulators):
  <zone>_v1Items, <zone>_v2Items, <zone>_v3Items, <zone>_unknownVersionItems

Aggregate across zones:
  v1Items, v2Items, v3Items, unknownVersionItems
  hasLegacyItems = 'true' | 'false'

The counts are taken from the raw StorageItem.version field before wrapping
(loadAllItemsFromService strips the version). 'unknown' covers anything
that doesn't match a known version, including future formats that the
service skips today.

Once hasLegacyItems drops to ~0% across an entire release window in
telemetry, wrapV1AsV2 and wrapV2AsCurrent become deletable along with the
v1/v2 cases in fromStorageItem.
@tnaum-ms

tnaum-ms commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Feat: per-version item census in stats telemetrye7f2169

Added a format census to the connectionStorage.stats activation event so we can decide when the read-time v1/v2 wrapping code in fromStorageItem (i.e. wrapV1AsV2 / wrapV2AsCurrent) becomes safe to delete.

Per zone (clusters, emulators):

  • <zone>_v1Items, <zone>_v2Items, <zone>_v3Items, <zone>_unknownVersionItems

Aggregate across zones:

  • v1Items, v2Items, v3Items, unknownVersionItems
  • hasLegacyItems = 'true' | 'false'

The counts are taken from the raw StorageItem.version field before wrapping (loadAllItemsFromService strips the version). unknownVersionItems covers anything that doesn't match a known version, including future formats that the service skips today.

Plus the previously-mentioned previousCleanupVersion property on resolvePostMigrationErrors (added in 12f5163) tracks how many installs cross the legacy '0.8.1' → integer-counter marker boundary.

Removal criteria: once hasLegacyItems = 'false' across an entire release window in telemetry, wrapV1AsV2, wrapV2AsCurrent, and the v1/v2 branches in fromStorageItem become safe to delete.

tnaum-ms added 2 commits June 3, 2026 11:09
…p refactor

- Reset _bootstrap alongside _storageService in beforeEach and mid-test
  resets so cleanup re-runs per test (was reusing the cached promise)
- Pass mockStorage explicitly to cleanupOrphanedItems() calls in orphans
  tests — signature now requires a storageService argument
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

✅ Code Quality Checks

Check Status How to fix
Localization (l10n) ✅ Passed
ESLint ✅ Passed
Prettier formatting ✅ Passed

This comment is updated automatically on each push.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

📦 Build Size Report

Metric Base (main) PR Delta
VSIX (vscode-documentdb-0.8.0.vsix) 7.60 MB 7.60 MB ⬇️ -0 KB (-0.0%)
Webview bundle (views.js) 5.88 MB 5.88 MB ✅ 0 KB (0.0%)

Download artifact · updated automatically on each push.

@tnaum-ms tnaum-ms merged commit 4233592 into main Jun 3, 2026
8 checks passed
@tnaum-ms tnaum-ms deleted the dev/tnaum/storeage-optimization branch June 3, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

3 participants