Enable Gemini-3.5-flash cua #2273
Gemini 3.x emits predefined function names and argument shapes that differ from the 2.5 computer-use vocabulary. Map the 3.x names onto the canonical 2.5 handlers, tolerate the new argument shapes (coordinate-less type, keys arrays, scroll magnitude_in_pixels, drag start/end pairs), treat take_screenshot as a recognized no-op, and always return a screenshot function response even when a turn produced no executable actions so the model is never left without an observation. Only the click/take_screenshot aliases and click/navigate argument shapes were confirmed from live gemini-3.5-flash traffic; the remaining aliases follow the same drop-the-qualifier pattern and fall through to the existing unknown-action warning if wrong. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
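A minimal sketch of the name-aliasing step described in this commit message. Only the click, left_click and type mappings appear elsewhere in this PR; the take_screenshot target, the CANONICAL_NAMES set, and the resolveFunctionName helper are assumptions made for illustration, and a Map is used here instead of the PR's Record so that lookups never hit inherited object properties.

```typescript
// Alias table for Gemini 3.x names. The "take_screenshot" -> "screenshot"
// entry is an assumption; the PR only says take_screenshot is a recognized no-op.
const NAME_ALIASES = new Map<string, string>([
  ["click", "click_at"],
  ["left_click", "click_at"],
  ["type", "type_text_at"],
  ["take_screenshot", "screenshot"],
]);

// Hypothetical subset of names the 2.5-style handlers understand.
const CANONICAL_NAMES = new Set<string>([
  "click_at",
  "type_text_at",
  "navigate",
  "screenshot",
]);

// Returns the canonical handler name, or null (after a warning) when the
// name is still unknown once aliases have been applied.
function resolveFunctionName(rawName: string): string | null {
  const name = NAME_ALIASES.get(rawName) ?? rawName;
  if (!CANONICAL_NAMES.has(name)) {
    console.warn(`Unsupported Google CUA function: ${rawName}`);
    return null;
  }
  return name;
}
```

A wrong guess in the alias table degrades to the same warning path as any other unknown name, which is the fallback behavior the commit message relies on.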
GoogleCUAClient read only promptTokenCount/candidatesTokenCount, dropping Gemini's cachedContentTokenCount and thoughtsTokenCount — so cached_input_tokens and reasoning_tokens were always 0 in agent metrics even though the CUA handler and updateMetrics already plumb them through. Surface both per step and in the aggregated usage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
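A rough sketch of the token mapping this commit describes. The four usageMetadata field names come from the commit message; the StepUsage shape and the toStepUsage/addUsage helpers are hypothetical stand-ins for whatever the client actually uses.

```typescript
// Subset of Gemini's usageMetadata fields named in the commit message.
interface UsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  cachedContentTokenCount?: number;
  thoughtsTokenCount?: number;
}

// Hypothetical per-step usage shape using the metric names from this PR.
interface StepUsage {
  input_tokens: number;
  output_tokens: number;
  reasoning_tokens: number;
  cached_input_tokens: number;
}

// Missing fields (or a missing usageMetadata object) count as zero.
function toStepUsage(meta: UsageMetadata | undefined): StepUsage {
  return {
    input_tokens: meta?.promptTokenCount ?? 0,
    output_tokens: meta?.candidatesTokenCount ?? 0,
    reasoning_tokens: meta?.thoughtsTokenCount ?? 0,
    cached_input_tokens: meta?.cachedContentTokenCount ?? 0,
  };
}

// Adds one step's usage onto a running total without mutating either input.
function addUsage(total: StepUsage, step: StepUsage): StepUsage {
  return {
    input_tokens: total.input_tokens + step.input_tokens,
    output_tokens: total.output_tokens + step.output_tokens,
    reasoning_tokens: total.reasoning_tokens + step.reasoning_tokens,
    cached_input_tokens: total.cached_input_tokens + step.cached_input_tokens,
  };
}
```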
🦋 Changeset detected
Latest commit: a00d95c
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
4 issues found across 6 files
Confidence score: 2/5
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, double_click/triple_click are collapsed to click_at without preserving click count, so intended multi-click interactions execute as single clicks and can break Gemini-3.x tasks that depend on double/triple click semantics. Pass through and honor the click count before merging.
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, right/middle click and mouse down/up actions are still unimplemented, so model-emitted click-family calls can silently no-op and leave automation flows stuck or incorrect. Implement these handlers (or explicitly gate/fail fast) before merging.
- In packages/core/lib/v3/agent/AgentProvider.ts, extending hardcoded model-to-provider mappings keeps model onboarding tied to code changes, increasing regression risk whenever new models are introduced. Switch to provider-derived/dynamic resolution instead of expanding allowlists.
- In packages/core/lib/v3/llm/LLMProvider.ts, adding support via deprecated unprefixed model IDs prolongs a legacy path and can create inconsistent model resolution behavior. Route new support through provider/model IDs and avoid expanding the deprecated mapping.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">
<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
</file>
Architecture diagram
sequenceDiagram
participant App as Agent Loop (run)
participant Client as GoogleCUAClient (executeStep)
participant Mapper as convertFunctionCallToAction
participant Exec as Action Executor
participant SS as Screenshot Capture
participant API as Google Gemini API
Note over App,API: Gemini-3.5-flash CUA turn (happy path)
App->>Client: executeStep(context, logger)
Client->>API: send request (history + screenshot)
API-->>Client: response (functionCalls, usageMetadata)
alt functionCall has predefined function
Client->>Mapper: for each part.functionCall
Note over Mapper: NAME_ALIASES maps 3.x names → 2.5 canonicals<br/>e.g. "click" → "click_at", "type" → "type_text_at"
Mapper->>Mapper: normalize args shape<br/>(keys: string|array, scroll: magnitudeInPixels,<br/>drag: start/end, type: optional coords)
Mapper-->>Client: normalized AgentAction (e.g. type, click, screenshot)
end
alt action.type === "screenshot"
Client->>Client: log "take_screenshot: capturing current page"<br/>no browser interaction
else action.type === "type" AND coordinates present
Client->>Exec: click (x,y left)
Client->>Exec: select all (if clearBeforeTyping)
Client->>Exec: type text
else action.type === "type" AND no coordinates
Client->>Exec: type text directly<br/>(element already focused)
else other executable actions (click_at, scroll_at, etc.)
Client->>Exec: execute action via browser
end
Note over Client: Always capture fresh screenshot after processing actions<br/>(even if no executable actions, e.g. only take_screenshot)
Client->>SS: captureScreenshot()
SS-->>Client: screenshot bytes
Client->>Client: build functionResponses: [screenshot part]
Client->>API: turn call with functionResponses
API-->>Client: next turn result (final or continue)
Client->>Client: aggregate usage (input_tokens, output_tokens,<br/>reasoning_tokens, cached_input_tokens, inference_time_ms)
Note over Client: reasoning_tokens = usageMetadata.thoughtsTokenCount<br/>cached_input_tokens = usageMetadata.cachedContentTokenCount
Client-->>App: StepResult with actions, message, usage
App->>App: accumulate totals for all steps
App-->>App: final response with full usage (including reasoning, cached)
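The "always capture" rule in the diagram above can be sketched as follows. All names here (FunctionCall, FunctionResponse, NO_OP_NAMES, runTurn) are hypothetical, the callbacks are synchronous stand-ins for what would be asynchronous browser calls, and returning one response per function call is an assumption rather than something the diagram states.

```typescript
// Hypothetical shapes for illustration; the real client and SDK types differ.
interface FunctionCall {
  name: string;
}

interface FunctionResponse {
  name: string;
  screenshotBase64: string;
}

// Names the diagram treats as "log only, no browser interaction".
const NO_OP_NAMES = new Set<string>(["take_screenshot", "screenshot", "open_web_browser"]);

function runTurn(
  calls: FunctionCall[],
  execute: (call: FunctionCall) => void,
  captureScreenshot: () => string,
): FunctionResponse[] {
  for (const call of calls) {
    if (NO_OP_NAMES.has(call.name)) continue; // recognized no-op
    execute(call);
  }
  // Capture unconditionally, so a turn with zero executable actions still
  // gives the model a fresh observation.
  const screenshotBase64 = captureScreenshot();
  return calls.map((call) => ({ name: call.name, screenshotBase64 }));
}
```

The point of the sketch is only that the capture sits outside any "did we execute something" condition.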
// NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
// traffic; the rest are inferred from the same drop-the-qualifier pattern
// and are safe aliases (any unmapped name still hits the warning below).
const NAME_ALIASES: Record<string, string> = {
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 845:
<comment>Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</comment>
<file context>
@@ -794,19 +824,62 @@ export class GoogleCUAClient extends AgentClient {
+ // NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
+ // traffic; the rest are inferred from the same drop-the-qualifier pattern
+ // and are safe aliases (any unmapped name still hits the warning below).
+ const NAME_ALIASES: Record<string, string> = {
+ click: "click_at",
+ left_click: "click_at",
</file context>
Per the Gemini 3.5 Flash computer-use spec, double_click/triple_click/right_click/middle_click/move are distinct predefined functions. The converter collapsed double_click and triple_click into a single left click, and it did not map right_click, middle_click, or move at all (silent no-op). Map them to the executor's native double_click/triple_click/move actions and click with the right button. gemini-2.5 emits none of these names, so its canonical handlers are unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">
<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>
@@ -241,6 +241,8 @@ export class GoogleCUAClient extends AgentClient {
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 926:
<comment>New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)</comment>
<file context>
@@ -893,6 +897,40 @@ export class GoogleCUAClient extends AgentClient {
+ };
+ }
+
+ case "move": {
+ const { x, y } = this.normalizeCoordinates(
+ args.x as number,
</file context>
Guard the gemini-3.x click-family cases (double/triple/right/middle click, move) so a payload missing x/y returns null instead of normalizing NaN into the executor, matching drag_and_drop. Add focused unit tests asserting the produced AgentAction type/button/coordinates for each, the missing-coord null path, and 2.5 click_at backcompat. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
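A sketch of the guarded click-family conversion, under stated assumptions: the 0-999 grid comes from the review diagram, but the linear scaling, the AgentAction shape, the convertClickFamily helper, and the mapping of middle_click to a click with the middle button are illustrative guesses rather than the PR's actual code.

```typescript
type Button = "left" | "right" | "middle";

// Hypothetical action shape; move carries no button.
interface AgentAction {
  type: "click" | "double_click" | "triple_click" | "move";
  x: number;
  y: number;
  button?: Button;
}

interface Viewport {
  width: number;
  height: number;
}

// Assumption: model coordinates are on a 0-999 grid and scale linearly.
function normalizeCoordinates(x: number, y: number, vp: Viewport): { x: number; y: number } {
  return {
    x: Math.floor((x / 1000) * vp.width),
    y: Math.floor((y / 1000) * vp.height),
  };
}

function convertClickFamily(
  name: string,
  args: Record<string, unknown>,
  vp: Viewport,
): AgentAction | null {
  const rawX = args["x"];
  const rawY = args["y"];
  // Guard: reject missing or non-finite coordinates before normalizing,
  // so NaN never reaches the executor.
  if (typeof rawX !== "number" || typeof rawY !== "number") return null;
  if (!Number.isFinite(rawX) || !Number.isFinite(rawY)) return null;

  const { x, y } = normalizeCoordinates(rawX, rawY, vp);
  switch (name) {
    case "double_click":
      return { type: "double_click", x, y, button: "left" };
    case "triple_click":
      return { type: "triple_click", x, y, button: "left" };
    case "right_click":
      return { type: "click", x, y, button: "right" };
    case "middle_click": // assumed mapping
      return { type: "click", x, y, button: "middle" };
    case "move":
      return { type: "move", x, y };
    default:
      return null;
  }
}
```

Unit tests of the kind the review asked for would assert on the type, button, and coordinates of each returned action, plus the null path for a payload with a missing coordinate.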
@miguelg719 I have started the AI code review. It will take a few minutes to complete.
2 issues found across 7 files
Confidence score: 3/5
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, alias normalization happening before custom-tool routing can misclassify overlapping custom tool names as computer-use actions, which can send the wrong action path at runtime. Check isCustomTool against rawName before applying NAME_ALIASES to de-risk routing correctness before merging.
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, defaulting missing functionCall.args to {} without validating required-arg functions can trigger crashes or emit invalid actions on malformed calls. Add a required-args guard that rejects arg-required function names when args are undefined before merging.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">
<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>
Architecture diagram
sequenceDiagram
participant UI as Client Application
participant Agent as AgentProvider
participant GoogleCUA as GoogleCUAClient
participant GeminiAPI as Gemini 3.5 Flash API
participant Executor as Action Executor
participant Screenshot as Screenshot Capture
Note over UI,Screenshot: Gemini 3.5 Flash Computer Use Flow
UI->>Agent: initialize agent with model "google/gemini-3.5-flash"
Agent->>GoogleCUA: create GoogleCUAClient instance
Note over GoogleCUA,GeminiAPI: Step Loop (per turn)
GoogleCUA->>GeminiAPI: executeStep() - send system prompt + screenshot
GeminiAPI-->>GoogleCUA: response with function calls + usage metadata
Note over GoogleCUA: Extract usage metrics<br/>including reasoning_tokens & cached_input_tokens
alt Gemini 3.x function call name received
GoogleCUA->>GoogleCUA: convertFunctionCallToAction()
Note over GoogleCUA: Apply NAME_ALIASES mapping<br/>e.g., "click" → "click_at", "type" → "type_text_at"
end
alt Click-family action (double_click, triple_click, right_click, middle_click, move)
GoogleCUA->>GoogleCUA: validate x/y coordinates exist
alt Coordinates missing
GoogleCUA->>GoogleCUA: return null (drop invalid action)
else Coordinates present
GoogleCUA->>GoogleCUA: normalizeCoordinates(0-999 grid to viewport)
GoogleCUA->>GoogleCUA: preserve click semantics (button type, click count)
end
end
alt Type action from Gemini 3.x
Note over GoogleCUA: action.type === "type"
alt Coordinates present (2.5 style type_text_at)
GoogleCUA->>GoogleCUA: prepend click action at coordinates
else No coordinates (3.x style)
Note over GoogleCUA: Skip click - type into focused element
end
end
alt Screenshot function call
GoogleCUA->>GoogleCUA: return { type: "screenshot" } (no-op)
end
Note over GoogleCUA: Process all actions (may be zero)
loop For each action
alt Action is "screenshot" or "open_web_browser"
GoogleCUA->>GoogleCUA: skip execution, just log
else Other action
GoogleCUA->>Executor: execute action (click, type, scroll, etc.)
Executor-->>GoogleCUA: action result
end
end
Note over GoogleCUA: After all actions processed
GoogleCUA->>Screenshot: capture fresh screenshot for function response
Screenshot-->>GoogleCUA: screenshot data
GoogleCUA->>GeminiAPI: return function responses (including screenshot)
GeminiAPI-->>GoogleCUA: next model response
Note over GoogleCUA: Track reasoning_tokens + cached_input_tokens across turns
UI->>GoogleCUA: getFinalResult()
GoogleCUA-->>UI: AgentResult with aggregated usage metrics
…mini CUA)
- navigate/type_text_at now return null on a malformed call (missing url/text) instead of producing goto(undefined)/type(undefined); empty type text is still allowed (clear field). Matches the click-family coordinate guards.
- When a custom tool is registered under a name that collides with a predefined Google CUA function, log at level 2 that the predefined tool takes precedence (predefined tools intentionally win; the custom tool isn't silently dropped without a trace).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
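The two behaviors in this commit could look roughly like the sketch below. The Action shape, convertNavigateOrType, and reportNameCollisions are hypothetical names, and the logger signature is an assumption; only the null-on-missing-argument rule, the empty-text allowance, and the level-2 precedence log come from the commit message.

```typescript
// Hypothetical action shapes for the two guarded functions.
type Action =
  | { type: "goto"; url: string }
  | { type: "type"; text: string };

// Returns null for a malformed call instead of emitting goto(undefined)
// or type(undefined). An empty string is valid text (it clears the field).
function convertNavigateOrType(name: string, args: Record<string, unknown>): Action | null {
  if (name === "navigate") {
    const url = args["url"];
    return typeof url === "string" ? { type: "goto", url } : null;
  }
  if (name === "type_text_at") {
    const text = args["text"];
    return typeof text === "string" ? { type: "type", text } : null;
  }
  return null;
}

// Logs (at level 2) each custom tool whose name collides with a predefined
// function, and returns the colliding names. Predefined tools still win.
function reportNameCollisions(
  customToolNames: string[],
  predefinedNames: ReadonlySet<string>,
  log: (message: string, level: number) => void,
): string[] {
  const collisions = customToolNames.filter((n) => predefinedNames.has(n));
  for (const n of collisions) {
    log(`Custom tool "${n}" collides with a predefined function; the predefined tool takes precedence`, 2);
  }
  return collisions;
}
```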
why
New model dropped
what changed
Added support for the Gemini 3.5 Flash Computer Use updated toolset in GoogleCUAClient.ts, with all new tool formats correctly mapped.
test plan
Summary by cubic
Adds support for the google/gemini-3.5-flash computer-use agent. Normalizes Gemini 3.x function names/args to 2.5 handlers, preserves click semantics, validates coords, always returns a fresh screenshot, and reports reasoning/cached tokens.
New Features
- google/gemini-3.5-flash in agent/LLM provider maps and public types; update tests.
- type (click first only if coords given), keys array/single key, magnitude_in_pixels for scroll, drag start/end pairs; recognize screenshot/take_screenshot.
Bug Fixes
- reasoning_tokens and cached_input_tokens in Google CUA usage and aggregate metrics.
- (double_click, triple_click, right_click, middle_click, move) and drop calls with missing coordinates.
- navigate without url and type/type_text_at without text (empty allowed); log when a custom tool name conflicts with a predefined function (predefined wins).
Written for commit a00d95c. Summary will update on new commits.
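As a rough illustration of the argument-shape tolerance summarized above, the sketch below accepts keys as an array or a single key string and prefers magnitude_in_pixels for scroll. The helper names, the fallback value passed by the caller, and the choice not to split combined key strings are all assumptions; the actual client may handle these cases differently.

```typescript
type Args = Record<string, unknown>;

// Accepts keys as a string array, or a single key under "keys" or "key".
// Returns null when no usable key is present.
function normalizeKeys(args: Args): string[] | null {
  const raw = args["keys"] ?? args["key"];
  if (typeof raw === "string") {
    return raw.length > 0 ? [raw] : null;
  }
  if (Array.isArray(raw) && raw.length > 0 && raw.every((k) => typeof k === "string")) {
    return raw as string[];
  }
  return null;
}

// Uses magnitude_in_pixels when it is a finite number, otherwise the
// caller-supplied fallback (a hypothetical default).
function scrollMagnitude(args: Args, fallbackPixels: number): number {
  const px = args["magnitude_in_pixels"];
  if (typeof px === "number" && Number.isFinite(px)) return px;
  return fallbackPixels;
}
```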