Skip to content

feat(agent): retry with backoff and resolution cache for GitHub skill resolver#326

Closed
ptone wants to merge 3 commits into
mainfrom
scion/skill-bank-rate-limit
Closed

feat(agent): retry with backoff and resolution cache for GitHub skill resolver#326
ptone wants to merge 3 commits into
mainfrom
scion/skill-bank-rate-limit

Conversation

@ptone

@ptone ptone commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Summary

When agent templates reference many skills via gh:// URIs, provisioning hits GitHub API rate limits (403/429), causing agents to fail. This PR adds two mitigations:

  • Retry with exponential backoff: All GitHub API calls in GitHubSkillResolver (resolveCommitSHA, listContents, downloadRawFile) now retry on 429, 403 rate-limit, and 5xx responses. Respects Retry-After and X-RateLimit-Reset headers. Max 4 retries with backoff capped at 30s.
  • Resolution cache with TTL: A file-backed cache (5 min TTL) stores the mapping from skill URI → resolved skill metadata (commit SHA, file listings, hashes). On cache hit, all GitHub API calls are skipped entirely. Works alongside the existing CachingSkillResolver content cache for two-layer caching: resolution (short TTL, tracks branch tip movement) and content (long-lived, content-addressed).

Files changed

  • pkg/agent/github_skill_resolver.go — Added doWithRetry(), isRetryableResponse(), retryDelay(), resolution cache integration
  • pkg/agent/github_resolution_cache.go (new) — TTL-based on-disk resolution cache
  • pkg/agent/github_skill_resolver_test.go — Tests for retry on 429/5xx, rate limit handling, resolution cache hit
  • pkg/agent/github_resolution_cache_test.go (new) — Tests for cache put/get, expiry, persistence, reload

Test plan

  • All existing TestGitHubSkillResolver_* tests pass (including updated rate limit test)
  • New TestGitHubSkillResolver_RetryOn429 — verifies recovery after transient 429
  • New TestGitHubSkillResolver_RetryOn5xx — verifies recovery after transient 503
  • New TestGitHubSkillResolver_ResolutionCacheHit — verifies zero API calls on cache hit
  • New TestIsRetryableResponse — unit tests for retry classification (9 cases)
  • New TestRetryDelay — unit tests for backoff calculation
  • New TestGitHubResolutionCache_* — cache CRUD, expiry, persistence, reload
  • Full go test ./pkg/agent/ passes
  • Full go build ./... passes

…kill resolver

When agent templates reference many skills via gh:// URIs, provisioning
hits GitHub API rate limits causing agents to fail. This adds two
mitigations:

1. Retry with exponential backoff: All GitHub API calls in the skill
   resolver (resolveCommitSHA, listContents, downloadRawFile) now retry
   on 429, 403 rate-limit, and 5xx responses with exponential backoff.
   Respects Retry-After and X-RateLimit-Reset headers when present.

2. Resolution cache with TTL: A file-backed TTL cache (5 min default)
   stores the mapping from skill URI to resolved skill metadata (commit
   SHA, file listings, hashes). On cache hit, all GitHub API calls are
   skipped entirely. This works alongside the existing CachingSkillResolver
   which caches file content by hash for longer-term reuse.

Together these form a two-layer cache: resolution cache (short TTL, tracks
branch tip movement) and content cache (long-lived, content-addressed).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a disk-backed GitHubResolutionCache to cache resolved skill URIs and avoid redundant GitHub API calls, and implements an exponential backoff retry mechanism (doWithRetry) in GitHubSkillResolver to handle rate limits and transient server errors. Feedback on these changes highlights several key improvements: optimizing the retry mechanism to avoid a redundant final request and correctly capping the rate-limit reset wait time; releasing the cache's write lock before performing disk I/O to prevent blocking concurrent operations; utilizing atomic file writes to prevent cache corruption; and adding a nil check during cache loading to prevent potential nil pointer dereference panics.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +316 to +359
func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
var lastResp *http.Response
var lastErr error

for attempt := 0; attempt <= githubMaxRetries; attempt++ {
if attempt > 0 {
delay := retryDelay(lastResp, attempt)
util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s",
attempt, githubMaxRetries, delay, req.Method, req.URL.Path)

select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(delay):
}
}

// Clone the request for retries (body is nil for GET requests)
cloned := req.Clone(ctx)
resp, err := r.httpClient.Do(cloned)
if err != nil {
lastErr = err
// Network errors are retryable
continue
}

if !isRetryableResponse(resp) {
return resp, nil
}

// Drain and close body so the connection can be reused
_, _ = io.Copy(io.Discard, resp.Body)
resp.Body.Close()
lastResp = resp
lastErr = nil
}

if lastErr != nil {
return nil, lastErr
}
// All retries exhausted — re-execute to return the final response for
// the caller to handle (e.g. apiError extraction).
return r.httpClient.Do(req.Clone(ctx))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Re-executing the HTTP request after all retries are exhausted is highly inefficient and counterproductive, especially when dealing with rate limits (429) or server errors (5xx). Instead, on the last attempt (attempt == githubMaxRetries), we should return the response directly to the caller with its body open, allowing them to read the error details without making an extra redundant request.

func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
	var lastResp *http.Response
	var lastErr error

	for attempt := 0; attempt <= githubMaxRetries; attempt++ {
		if attempt > 0 {
			delay := retryDelay(lastResp, attempt)
			util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s",
				attempt, githubMaxRetries, delay, req.Method, req.URL.Path)

			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(delay):
			}
		}

		// Clone the request for retries (body is nil for GET requests)
		cloned := req.Clone(ctx)
		resp, err := r.httpClient.Do(cloned)
		if err != nil {
			lastErr = err
			// Network errors are retryable
			continue
		}

		if !isRetryableResponse(resp) {
			return resp, nil
		}

		if attempt == githubMaxRetries {
			return resp, nil
		}

		// Drain and close body so the connection can be reused
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		lastResp = resp
		lastErr = nil
	}

	return nil, lastErr
}

Comment on lines +393 to +396
wait := time.Until(time.Unix(resetUnix, 0))
if wait > 0 && wait <= githubMaxBackoff {
return wait
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the X-RateLimit-Reset time is greater than githubMaxBackoff (e.g., 45 seconds), the current logic completely ignores the reset header and falls back to a tiny exponential backoff (like 1s). This will quickly exhaust all retries. Instead, we should cap the wait time at githubMaxBackoff.

					wait := time.Until(time.Unix(resetUnix, 0))
					if wait > 0 {
						if wait > githubMaxBackoff {
							return githubMaxBackoff
						}
						return wait
					}

Comment on lines +91 to +103
func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) {
c.mu.Lock()
defer c.mu.Unlock()

now := time.Now()
c.entries[uri] = &resolutionCacheEntry{
Skill: skill,
CachedAt: now,
ExpiresAt: now.Add(c.ttl),
}
c.evictExpired()
c.save()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Holding the write lock c.mu.Lock() during disk I/O (os.WriteFile) blocks all concurrent readers (Get) and writers (Put). We should copy the entries under the lock, release the lock, and then perform the marshaling and file write operations outside the lock.

func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) {
	c.mu.Lock()
	now := time.Now()
	c.entries[uri] = &resolutionCacheEntry{
		Skill:     skill,
		CachedAt:  now,
		ExpiresAt: now.Add(c.ttl),
	}
	c.evictExpired()

	// Copy entries to save outside the lock to avoid blocking readers with disk I/O.
	entriesCopy := make(map[string]*resolutionCacheEntry, len(c.entries))
	for k, v := range c.entries {
		entriesCopy[k] = v
	}
	c.mu.Unlock()

	c.save(entriesCopy)
}

Comment thread pkg/agent/github_resolution_cache.go Outdated
Comment on lines +128 to +135
func (c *GitHubResolutionCache) save() {
f := resolutionCacheFile{Entries: c.entries}
data, err := json.MarshalIndent(f, "", " ")
if err != nil {
return
}
_ = os.WriteFile(c.filePath, data, 0644)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update save to accept the entries map and write to a temporary file before renaming it to ensure atomic writes and prevent cache file corruption.

func (c *GitHubResolutionCache) save(entries map[string]*resolutionCacheEntry) {
	f := resolutionCacheFile{Entries: entries}
	data, err := json.MarshalIndent(f, "", "  ")
	if err != nil {
		return
	}
	tmpPath := c.filePath + ".tmp"
	if err := os.WriteFile(tmpPath, data, 0644); err != nil {
		return
	}
	_ = os.Rename(tmpPath, c.filePath)
}

Comment on lines +119 to +123
for uri, entry := range f.Entries {
if now.Before(entry.ExpiresAt) {
c.entries[uri] = entry
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the JSON cache file contains a null entry (which can happen if corrupted or manually edited), entry.ExpiresAt will panic with a nil pointer dereference. We should add a nil check before accessing entry.ExpiresAt.

Suggested change
for uri, entry := range f.Entries {
if now.Before(entry.ExpiresAt) {
c.entries[uri] = entry
}
}
for uri, entry := range f.Entries {
if entry != nil && now.Before(entry.ExpiresAt) {
c.entries[uri] = entry
}
}

Scion Agent (skill-bank-rate-limit-fix) added 2 commits June 28, 2026 11:11
- Release lock before disk I/O in Put() by copying entries under lock
- Fix extra HTTP request after retries exhausted in doWithRetry()
- Deep copy Files slice in Get() to prevent data races
- Use atomic write (tmp + rename) in save()
- Cap rate limit reset wait at maxBackoff instead of falling back to
  shorter exponential backoff
Properly handle return values from resp.Body.Close(), w.Write(),
and json.NewEncoder().Encode() calls.
@ptone ptone closed this Jun 28, 2026
@ptone ptone deleted the scion/skill-bank-rate-limit branch June 28, 2026 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant