feat(agent): retry with backoff and resolution cache for GitHub skill resolver #326
ptone wants to merge 3 commits
…kill resolver

When agent templates reference many skills via gh:// URIs, provisioning hits GitHub API rate limits, causing agents to fail. This adds two mitigations:

1. Retry with exponential backoff: All GitHub API calls in the skill resolver (resolveCommitSHA, listContents, downloadRawFile) now retry on 429, 403 rate-limit, and 5xx responses with exponential backoff. Respects Retry-After and X-RateLimit-Reset headers when present.
2. Resolution cache with TTL: A file-backed TTL cache (5 min default) stores the mapping from skill URI to resolved skill metadata (commit SHA, file listings, hashes). On a cache hit, all GitHub API calls are skipped entirely. This works alongside the existing CachingSkillResolver, which caches file content by hash for longer-term reuse.

Together these form a two-layer cache: resolution cache (short TTL, tracks branch tip movement) and content cache (long-lived, content-addressed).
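The retry classification described in item 1 could look roughly like the sketch below. The PR's actual isRetryableResponse is not shown in this conversation, so the function name isRetryable, the helper newResp, and the rule used to tell a rate-limited 403 from an ordinary permission error (a Retry-After header, or X-RateLimit-Remaining equal to 0) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable is a hypothetical stand-in for the PR's isRetryableResponse.
// It treats 429, any 5xx, and rate-limited 403s as retryable.
func isRetryable(resp *http.Response) bool {
	switch {
	case resp.StatusCode == http.StatusTooManyRequests:
		return true
	case resp.StatusCode >= 500 && resp.StatusCode <= 599:
		return true
	case resp.StatusCode == http.StatusForbidden:
		// Assumption: a 403 counts as a rate limit only when the response
		// carries Retry-After or reports an exhausted request budget.
		return resp.Header.Get("Retry-After") != "" ||
			resp.Header.Get("X-RateLimit-Remaining") == "0"
	default:
		return false
	}
}

// newResp builds a bare response for the examples (hypothetical helper).
func newResp(status int, headers map[string]string) *http.Response {
	h := http.Header{}
	for k, v := range headers {
		h.Set(k, v)
	}
	return &http.Response{StatusCode: status, Header: h}
}

func main() {
	fmt.Println(isRetryable(newResp(429, nil)))
	fmt.Println(isRetryable(newResp(503, nil)))
	fmt.Println(isRetryable(newResp(403, map[string]string{"X-RateLimit-Remaining": "0"})))
	fmt.Println(isRetryable(newResp(403, nil)))
	fmt.Println(isRetryable(newResp(404, nil)))
}
```

Keeping a plain 403 non-retryable matters here: retrying a genuine permission error only delays the failure and spends more of the request budget.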
Code Review
This pull request introduces a disk-backed GitHubResolutionCache to cache resolved skill URIs and avoid redundant GitHub API calls, and implements an exponential backoff retry mechanism (doWithRetry) in GitHubSkillResolver to handle rate limits and transient server errors. Feedback on these changes highlights several key improvements: optimizing the retry mechanism to avoid a redundant final request and correctly capping the rate-limit reset wait time; releasing the cache's write lock before performing disk I/O to prevent blocking concurrent operations; utilizing atomic file writes to prevent cache corruption; and adding a nil check during cache loading to prevent potential nil pointer dereference panics.
func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
	var lastResp *http.Response
	var lastErr error

	for attempt := 0; attempt <= githubMaxRetries; attempt++ {
		if attempt > 0 {
			delay := retryDelay(lastResp, attempt)
			util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s",
				attempt, githubMaxRetries, delay, req.Method, req.URL.Path)

			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(delay):
			}
		}

		// Clone the request for retries (body is nil for GET requests)
		cloned := req.Clone(ctx)
		resp, err := r.httpClient.Do(cloned)
		if err != nil {
			lastErr = err
			// Network errors are retryable
			continue
		}

		if !isRetryableResponse(resp) {
			return resp, nil
		}

		// Drain and close body so the connection can be reused
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		lastResp = resp
		lastErr = nil
	}

	if lastErr != nil {
		return nil, lastErr
	}
	// All retries exhausted: re-execute to return the final response for
	// the caller to handle (e.g. apiError extraction).
	return r.httpClient.Do(req.Clone(ctx))
}
Re-executing the HTTP request after all retries are exhausted is highly inefficient and counterproductive, especially when dealing with rate limits (429) or server errors (5xx). Instead, on the last attempt (attempt == githubMaxRetries), we should return the response directly to the caller with its body open, allowing them to read the error details without making an extra redundant request.
func (r *GitHubSkillResolver) doWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
	var lastResp *http.Response
	var lastErr error
	for attempt := 0; attempt <= githubMaxRetries; attempt++ {
		if attempt > 0 {
			delay := retryDelay(lastResp, attempt)
			util.Debugf("github: retrying request (attempt %d/%d) after %v: %s %s",
				attempt, githubMaxRetries, delay, req.Method, req.URL.Path)
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(delay):
			}
		}
		// Clone the request for retries (body is nil for GET requests)
		cloned := req.Clone(ctx)
		resp, err := r.httpClient.Do(cloned)
		if err != nil {
			lastErr = err
			// Network errors are retryable
			continue
		}
		if !isRetryableResponse(resp) {
			return resp, nil
		}
		if attempt == githubMaxRetries {
			return resp, nil
		}
		// Drain and close body so the connection can be reused
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		lastResp = resp
		lastErr = nil
	}
	return nil, lastErr
}

wait := time.Until(time.Unix(resetUnix, 0))
if wait > 0 && wait <= githubMaxBackoff {
	return wait
}
If the X-RateLimit-Reset time is greater than githubMaxBackoff (e.g., 45 seconds), the current logic completely ignores the reset header and falls back to a tiny exponential backoff (like 1s). This will quickly exhaust all retries. Instead, we should cap the wait time at githubMaxBackoff.
wait := time.Until(time.Unix(resetUnix, 0))
if wait > 0 {
	if wait > githubMaxBackoff {
		return githubMaxBackoff
	}
	return wait
}

func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	c.entries[uri] = &resolutionCacheEntry{
		Skill:     skill,
		CachedAt:  now,
		ExpiresAt: now.Add(c.ttl),
	}
	c.evictExpired()
	c.save()
}
Holding the write lock c.mu.Lock() during disk I/O (os.WriteFile) blocks all concurrent readers (Get) and writers (Put). We should copy the entries under the lock, release the lock, and then perform the marshaling and file write operations outside the lock.
func (c *GitHubResolutionCache) Put(uri string, skill ResolvedSkill) {
	c.mu.Lock()
	now := time.Now()
	c.entries[uri] = &resolutionCacheEntry{
		Skill:     skill,
		CachedAt:  now,
		ExpiresAt: now.Add(c.ttl),
	}
	c.evictExpired()
	// Copy entries to save outside the lock to avoid blocking readers with disk I/O.
	entriesCopy := make(map[string]*resolutionCacheEntry, len(c.entries))
	for k, v := range c.entries {
		entriesCopy[k] = v
	}
	c.mu.Unlock()
	c.save(entriesCopy)
}

func (c *GitHubResolutionCache) save() {
	f := resolutionCacheFile{Entries: c.entries}
	data, err := json.MarshalIndent(f, "", " ")
	if err != nil {
		return
	}
	_ = os.WriteFile(c.filePath, data, 0644)
}
Update save to accept the entries map and write to a temporary file before renaming it to ensure atomic writes and prevent cache file corruption.
func (c *GitHubResolutionCache) save(entries map[string]*resolutionCacheEntry) {
	f := resolutionCacheFile{Entries: entries}
	data, err := json.MarshalIndent(f, "", " ")
	if err != nil {
		return
	}
	tmpPath := c.filePath + ".tmp"
	if err := os.WriteFile(tmpPath, data, 0644); err != nil {
		return
	}
	_ = os.Rename(tmpPath, c.filePath)
}

for uri, entry := range f.Entries {
	if now.Before(entry.ExpiresAt) {
		c.entries[uri] = entry
	}
}
If the JSON cache file contains a null entry (which can happen if corrupted or manually edited), entry.ExpiresAt will panic with a nil pointer dereference. We should add a nil check before accessing entry.ExpiresAt.
- for uri, entry := range f.Entries {
- 	if now.Before(entry.ExpiresAt) {
- 		c.entries[uri] = entry
- 	}
- }
+ for uri, entry := range f.Entries {
+ 	if entry != nil && now.Before(entry.ExpiresAt) {
+ 		c.entries[uri] = entry
+ 	}
+ }
- Release lock before disk I/O in Put() by copying entries under lock
- Fix extra HTTP request after retries exhausted in doWithRetry()
- Deep copy Files slice in Get() to prevent data races
- Use atomic write (tmp + rename) in save()
- Cap rate limit reset wait at maxBackoff instead of falling back to shorter exponential backoff
Properly handle return values from resp.Body.Close(), w.Write(), and json.NewEncoder().Encode() calls.
Summary
When agent templates reference many skills via gh:// URIs, provisioning hits GitHub API rate limits (403/429), causing agents to fail. This PR adds two mitigations:

1. Retry with exponential backoff: All GitHub API calls in GitHubSkillResolver (resolveCommitSHA, listContents, downloadRawFile) now retry on 429, 403 rate-limit, and 5xx responses. Respects Retry-After and X-RateLimit-Reset headers. Max 4 retries with backoff capped at 30s.
2. Resolution cache with TTL: A file-backed TTL cache (5 min default) stores the mapping from skill URI to resolved skill metadata (commit SHA, file listings, hashes). On a cache hit, all GitHub API calls are skipped. Works alongside the existing CachingSkillResolver content cache for two-layer caching: resolution (short TTL, tracks branch tip movement) and content (long-lived, content-addressed).

Files changed
- pkg/agent/github_skill_resolver.go: Added doWithRetry(), isRetryableResponse(), retryDelay(), resolution cache integration
- pkg/agent/github_resolution_cache.go (new): TTL-based on-disk resolution cache
- pkg/agent/github_skill_resolver_test.go: Tests for retry on 429/5xx, rate limit handling, resolution cache hit
- pkg/agent/github_resolution_cache_test.go (new): Tests for cache put/get, expiry, persistence, reload

Test plan
- TestGitHubSkillResolver_* tests pass (including updated rate limit test)
- TestGitHubSkillResolver_RetryOn429: verifies recovery after transient 429
- TestGitHubSkillResolver_RetryOn5xx: verifies recovery after transient 503
- TestGitHubSkillResolver_ResolutionCacheHit: verifies zero API calls on cache hit
- TestIsRetryableResponse: unit tests for retry classification (9 cases)
- TestRetryDelay: unit tests for backoff calculation
- TestGitHubResolutionCache_*: cache CRUD, expiry, persistence, reload
- go test ./pkg/agent/ passes
- go build ./... passes