feat(agent): retry with backoff and resolution cache for GitHub skill resolver by ptone · Pull Request #326 · ptone/scion

ptone · 2026-06-28T09:38:30Z

Summary

When agent templates reference many skills via gh:// URIs, provisioning hits GitHub API rate limits (403/429), causing agents to fail. This PR adds two mitigations:

Retry with exponential backoff: All GitHub API calls in GitHubSkillResolver (resolveCommitSHA, listContents, downloadRawFile) now retry on 429, 403 rate-limit, and 5xx responses. Respects Retry-After and X-RateLimit-Reset headers. Max 4 retries with backoff capped at 30s.
Resolution cache with TTL: A file-backed cache (5 min TTL) stores the mapping from skill URI → resolved skill metadata (commit SHA, file listings, hashes). On cache hit, all GitHub API calls are skipped entirely. Works alongside the existing CachingSkillResolver content cache for two-layer caching: resolution (short TTL, tracks branch tip movement) and content (long-lived, content-addressed).

Files changed

pkg/agent/github_skill_resolver.go — Added doWithRetry(), isRetryableResponse(), retryDelay(), resolution cache integration
pkg/agent/github_resolution_cache.go (new) — TTL-based on-disk resolution cache
pkg/agent/github_skill_resolver_test.go — Tests for retry on 429/5xx, rate limit handling, resolution cache hit
pkg/agent/github_resolution_cache_test.go (new) — Tests for cache put/get, expiry, persistence, reload

Test plan

All existing TestGitHubSkillResolver_* tests pass (including updated rate limit test)
New TestGitHubSkillResolver_RetryOn429 — verifies recovery after transient 429
New TestGitHubSkillResolver_RetryOn5xx — verifies recovery after transient 503
New TestGitHubSkillResolver_ResolutionCacheHit — verifies zero API calls on cache hit
New TestIsRetryableResponse — unit tests for retry classification (9 cases)
New TestRetryDelay — unit tests for backoff calculation
New TestGitHubResolutionCache_* — cache CRUD, expiry, persistence, reload
Full go test ./pkg/agent/ passes
Full go build ./... passes

…kill resolver When agent templates reference many skills via gh:// URIs, provisioning hits GitHub API rate limits causing agents to fail. This adds two mitigations: 1. Retry with exponential backoff: All GitHub API calls in the skill resolver (resolveCommitSHA, listContents, downloadRawFile) now retry on 429, 403 rate-limit, and 5xx responses with exponential backoff. Respects Retry-After and X-RateLimit-Reset headers when present. 2. Resolution cache with TTL: A file-backed TTL cache (5 min default) stores the mapping from skill URI to resolved skill metadata (commit SHA, file listings, hashes). On cache hit, all GitHub API calls are skipped entirely. This works alongside the existing CachingSkillResolver which caches file content by hash for longer-term reuse. Together these form a two-layer cache: resolution cache (short TTL, tracks branch tip movement) and content cache (long-lived, content-addressed).

gemini-code-assist

Code Review

This pull request introduces a disk-backed GitHubResolutionCache to cache resolved skill URIs and avoid redundant GitHub API calls, and implements an exponential backoff retry mechanism (doWithRetry) in GitHubSkillResolver to handle rate limits and transient server errors. Feedback on these changes highlights several key improvements: optimizing the retry mechanism to avoid a redundant final request and correctly capping the rate-limit reset wait time; releasing the cache's write lock before performing disk I/O to prevent blocking concurrent operations; utilizing atomic file writes to prevent cache corruption; and adding a nil check during cache loading to prevent potential nil pointer dereference panics.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-28T09:39:58Z

+func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
+	var lastResp *http.Response
+	var lastErr error
+
+	for attempt := 0; attempt <= githubMaxRetries; attempt++ {
+		if attempt > 0 {
+			delay := retryDelay(lastResp, attempt)
+			util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s",
+				attempt, githubMaxRetries, delay, req.Method, req.URL.Path)
+
+			select {
+			case <-ctx.Done():
+				return nil, ctx.Err()
+			case <-time.After(delay):
+			}
+		}
+
+		// Clone the request for retries (body is nil for GET requests)
+		cloned := req.Clone(ctx)
+		resp, err := r.httpClient.Do(cloned)
+		if err != nil {
+			lastErr = err
+			// Network errors are retryable
+			continue
+		}
+
+		if !isRetryableResponse(resp) {
+			return resp, nil
+		}
+
+		// Drain and close body so the connection can be reused
+		_, _ = io.Copy(io.Discard, resp.Body)
+		resp.Body.Close()
+		lastResp = resp
+		lastErr = nil
+	}
+
+	if lastErr != nil {
+		return nil, lastErr
+	}
+	// All retries exhausted — re-execute to return the final response for
+	// the caller to handle (e.g. apiError extraction).
+	return r.httpClient.Do(req.Clone(ctx))
+}


Re-executing the HTTP request after all retries are exhausted is highly inefficient and counterproductive, especially when dealing with rate limits (429) or server errors (5xx). Instead, on the last attempt (attempt == githubMaxRetries), we should return the response directly to the caller with its body open, allowing them to read the error details without making an extra redundant request.

func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) { var lastResp *http.Response var lastErr error for attempt := 0; attempt <= githubMaxRetries; attempt++ { if attempt > 0 { delay := retryDelay(lastResp, attempt) util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s", attempt, githubMaxRetries, delay, req.Method, req.URL.Path) select { case <-ctx.Done(): return nil, ctx.Err() case <-time.After(delay): } } // Clone the request for retries (body is nil for GET requests) cloned := req.Clone(ctx) resp, err := r.httpClient.Do(cloned) if err != nil { lastErr = err // Network errors are retryable continue } if !isRetryableResponse(resp) { return resp, nil } if attempt == githubMaxRetries { return resp, nil } // Drain and close body so the connection can be reused _, _ = io.Copy(io.Discard, resp.Body) resp.Body.Close() lastResp = resp lastErr = nil } return nil, lastErr }

gemini-code-assist · 2026-06-28T09:39:59Z

+					wait := time.Until(time.Unix(resetUnix, 0))
+					if wait > 0 && wait <= githubMaxBackoff {
+						return wait
+					}


If the X-RateLimit-Reset time is greater than githubMaxBackoff (e.g., 45 seconds), the current logic completely ignores the reset header and falls back to a tiny exponential backoff (like 1s). This will quickly exhaust all retries. Instead, we should cap the wait time at githubMaxBackoff.

wait := time.Until(time.Unix(resetUnix, 0)) if wait > 0 { if wait > githubMaxBackoff { return githubMaxBackoff } return wait }

gemini-code-assist · 2026-06-28T09:39:59Z

+func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+
+	now := time.Now()
+	c.entries[uri] = &resolutionCacheEntry{
+		Skill:     skill,
+		CachedAt:  now,
+		ExpiresAt: now.Add(c.ttl),
+	}
+	c.evictExpired()
+	c.save()
+}


Holding the write lock c.mu.Lock() during disk I/O (os.WriteFile) blocks all concurrent readers (Get) and writers (Put). We should copy the entries under the lock, release the lock, and then perform the marshaling and file write operations outside the lock.

func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) { c.mu.Lock() now := time.Now() c.entries[uri] = &resolutionCacheEntry{ Skill: skill, CachedAt: now, ExpiresAt: now.Add(c.ttl), } c.evictExpired() // Copy entries to save outside the lock to avoid blocking readers with disk I/O. entriesCopy := make(map[string]*resolutionCacheEntry, len(c.entries)) for k, v := range c.entries { entriesCopy[k] = v } c.mu.Unlock() c.save(entriesCopy) }

gemini-code-assist · 2026-06-28T09:39:59Z

+func (c *GitHubResolutionCache) save() {
+	f := resolutionCacheFile{Entries: c.entries}
+	data, err := json.MarshalIndent(f, "", "  ")
+	if err != nil {
+		return
+	}
+	_ = os.WriteFile(c.filePath, data, 0644)
+}


Update save to accept the entries map and write to a temporary file before renaming it to ensure atomic writes and prevent cache file corruption.

func (c *GitHubResolutionCache) save(entries map[string]*resolutionCacheEntry) { f := resolutionCacheFile{Entries: entries} data, err := json.MarshalIndent(f, "", " ") if err != nil { return } tmpPath := c.filePath + ".tmp" if err := os.WriteFile(tmpPath, data, 0644); err != nil { return } _ = os.Rename(tmpPath, c.filePath) }

gemini-code-assist · 2026-06-28T09:39:59Z

+	for uri, entry := range f.Entries {
+		if now.Before(entry.ExpiresAt) {
+			c.entries[uri] = entry
+		}
+	}


If the JSON cache file contains a null entry (which can happen if corrupted or manually edited), entry.ExpiresAt will panic with a nil pointer dereference. We should add a nil check before accessing entry.ExpiresAt.

Suggested change

for uri, entry := range f.Entries {

if now.Before(entry.ExpiresAt) {

c.entries[uri] = entry

}

}

for uri, entry := range f.Entries {

if entry != nil && now.Before(entry.ExpiresAt) {

c.entries[uri] = entry

}

}

- Release lock before disk I/O in Put() by copying entries under lock - Fix extra HTTP request after retries exhausted in doWithRetry() - Deep copy Files slice in Get() to prevent data races - Use atomic write (tmp + rename) in save() - Cap rate limit reset wait at maxBackoff instead of falling back to shorter exponential backoff

Properly handle return values from resp.Body.Close(), w.Write(), and json.NewEncoder().Encode() calls.

gemini-code-assist Bot reviewed Jun 28, 2026

View reviewed changes

Scion Agent (skill-bank-rate-limit-fix) added 2 commits June 28, 2026 11:11

fix: address errcheck linter findings in github_skill_resolver

dd16feb

Properly handle return values from resp.Body.Close(), w.Write(), and json.NewEncoder().Encode() calls.

ptone closed this Jun 28, 2026

ptone deleted the scion/skill-bank-rate-limit branch June 28, 2026 12:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): retry with backoff and resolution cache for GitHub skill resolver#326

feat(agent): retry with backoff and resolution cache for GitHub skill resolver#326
ptone wants to merge 3 commits into
mainfrom
scion/skill-bank-rate-limit

ptone commented Jun 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ptone commented Jun 28, 2026

Summary

Files changed

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant