Enable Gemini-3.5-flash cua #2273

Open
miguelg719 wants to merge 6 commits into main from miguelgonzalez/gemini-3-5-flash-cua

Conversation

@miguelg719 miguelg719 commented Jun 24, 2026

Collaborator

why

New model dropped

what changed

Added support for the Gemini 3.5 Flash Computer Use updated toolset in GoogleCUAClient.ts, with all new tool formats correctly mapped.

test plan


Summary by cubic

Adds support for the google/gemini-3.5-flash computer-use agent. Normalizes Gemini 3.x function names/args to 2.5 handlers, preserves click semantics, validates coords, always returns a fresh screenshot, and reports reasoning/cached tokens.

  • New Features

    • Enable google/gemini-3.5-flash in agent/LLM provider maps and public types; update tests.
    • Map 3.x functions to 2.5 handlers and accept new arg shapes: coordinate‑less type (click first only if coords given), keys array/single key, magnitude_in_pixels for scroll, drag start/end pairs; recognize screenshot/take_screenshot.
    • Always return a screenshot function response even when no executable actions are produced.
  • Bug Fixes

    • Track reasoning_tokens and cached_input_tokens in Google CUA usage and aggregate metrics.
    • Preserve 3.x click-family semantics (double_click, triple_click, right_click, middle_click, move) and drop calls with missing coordinates.
    • Guard required args and log custom-tool collisions: reject navigate without url and type/type_text_at without text (empty allowed); log when a custom tool name conflicts with a predefined function (predefined wins).

Written for commit a00d95c. Summary will update on new commits.

miguelg719 and others added 3 commits June 23, 2026 10:15
Gemini 3.x emits predefined function names and argument shapes that
differ from the 2.5 computer-use vocabulary. Map the 3.x names onto the
canonical 2.5 handlers, tolerate the new argument shapes (coordinate-less
type, keys arrays, scroll magnitude_in_pixels, drag start/end pairs),
treat take_screenshot as a recognized no-op, and always return a
screenshot function response even when a turn produced no executable
actions so the model is never left without an observation.

Only the click/take_screenshot aliases and click/navigate argument
shapes were confirmed from live gemini-3.5-flash traffic; the remaining
aliases follow the same drop-the-qualifier pattern and fall through to
the existing unknown-action warning if wrong.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
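
The name and argument normalization described in the commit message above could be sketched roughly as follows. This is a minimal illustration, not the code in GoogleCUAClient.ts: `ALIAS_SKETCH`, `canonicalName`, and `normalizeKeys` are hypothetical names, and the table holds only the aliases quoted elsewhere in this PR (`click`, `left_click`, `type`).

```typescript
// Illustrative only: a subset of aliases assembled from names quoted in this
// PR, not the actual NAME_ALIASES table in GoogleCUAClient.ts.
const ALIAS_SKETCH = new Map<string, string>([
  ["click", "click_at"],
  ["left_click", "click_at"],
  ["type", "type_text_at"],
]);

// Resolve a 3.x name to its 2.5 canonical name. Unmapped names pass through
// unchanged so the caller's existing unknown-action warning can still fire.
function canonicalName(rawName: string): string {
  return ALIAS_SKETCH.get(rawName) ?? rawName;
}

// Accept either a single key string or an array of keys and always return an
// array. Dropping non-string entries is an assumption of this sketch.
function normalizeKeys(keys: unknown): string[] {
  if (typeof keys === "string") return [keys];
  if (Array.isArray(keys)) {
    return keys.filter((k): k is string => typeof k === "string");
  }
  return [];
}
```

A `Map` is used here rather than a plain object so that a function name such as `constructor` cannot accidentally resolve to an inherited property.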
GoogleCUAClient read only promptTokenCount/candidatesTokenCount, dropping
Gemini's cachedContentTokenCount and thoughtsTokenCount — so cached_input_tokens
and reasoning_tokens were always 0 in agent metrics even though the CUA
handler and updateMetrics already plumb them through. Surface both per step
and in the aggregated usage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
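
As a rough sketch of the usage fix described above, the four Gemini `usageMetadata` counters named in the commit message can be mapped onto per-step usage fields and then summed across steps. The interface and function names (`GeminiUsageSketch`, `StepUsageSketch`, `toStepUsage`, `addUsage`) are hypothetical; only the field names come from this PR.

```typescript
// The subset of Gemini usageMetadata fields named in the commit message.
interface GeminiUsageSketch {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  cachedContentTokenCount?: number;
  thoughtsTokenCount?: number;
}

// Per-step usage shape, using the metric names quoted in this PR.
interface StepUsageSketch {
  input_tokens: number;
  output_tokens: number;
  reasoning_tokens: number;
  cached_input_tokens: number;
}

// Missing metadata or missing counters default to 0 instead of NaN.
function toStepUsage(meta: GeminiUsageSketch | undefined): StepUsageSketch {
  return {
    input_tokens: meta?.promptTokenCount ?? 0,
    output_tokens: meta?.candidatesTokenCount ?? 0,
    reasoning_tokens: meta?.thoughtsTokenCount ?? 0,
    cached_input_tokens: meta?.cachedContentTokenCount ?? 0,
  };
}

// Aggregate one step into a running total without mutating either input.
function addUsage(total: StepUsageSketch, step: StepUsageSketch): StepUsageSketch {
  return {
    input_tokens: total.input_tokens + step.input_tokens,
    output_tokens: total.output_tokens + step.output_tokens,
    reasoning_tokens: total.reasoning_tokens + step.reasoning_tokens,
    cached_input_tokens: total.cached_input_tokens + step.cached_input_tokens,
  };
}
```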

changeset-bot Bot commented Jun 24, 2026

🦋 Changeset detected

Latest commit: a00d95c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Minor |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

4 issues found across 6 files

Confidence score: 2/5

  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, double_click/triple_click are collapsed to click_at without preserving click count, so intended multi-click interactions execute as single clicks and can break Gemini-3.x tasks that depend on double/triple click semantics — pass through and honor the click count before merging.
  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, right/middle click and mouse down/up actions are still unimplemented, so model-emitted click-family calls can silently no-op and leave automation flows stuck or incorrect — implement these handlers (or explicitly gate/fail fast) before merging.
  • In packages/core/lib/v3/agent/AgentProvider.ts, extending hardcoded model-to-provider mappings keeps model onboarding tied to code changes, increasing regression risk whenever new models are introduced — switch to provider-derived/dynamic resolution instead of expanding allowlists.
  • In packages/core/lib/v3/llm/LLMProvider.ts, adding support via deprecated unprefixed model IDs prolongs a legacy path and can create inconsistent model resolution behavior — route new support through provider/model IDs and avoid expanding the deprecated mapping.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">

<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant App as Agent Loop (run)
    participant Client as GoogleCUAClient (executeStep)
    participant Mapper as convertFunctionCallToAction
    participant Exec as Action Executor
    participant SS as Screenshot Capture
    participant API as Google Gemini API

    Note over App,API: Gemini-3.5-flash CUA turn (happy path)

    App->>Client: executeStep(context, logger)
    Client->>API: send request (history + screenshot)
    API-->>Client: response (functionCalls, usageMetadata)

    alt functionCall has predefined function
        Client->>Mapper: for each part.functionCall
        Note over Mapper: NAME_ALIASES maps 3.x names → 2.5 canonicals<br/>e.g. "click" → "click_at", "type" → "type_text_at"
        Mapper->>Mapper: normalize args shape<br/>(keys: string|array, scroll: magnitudeInPixels,<br/>drag: start/end, type: optional coords)
        Mapper-->>Client: normalized AgentAction (e.g. type, click, screenshot)
    end

    alt action.type === "screenshot"
        Client->>Client: log "take_screenshot: capturing current page"<br/>no browser interaction
    else action.type === "type" AND coordinates present
        Client->>Exec: click (x,y left)
        Client->>Exec: select all (if clearBeforeTyping)
        Client->>Exec: type text
    else action.type === "type" AND no coordinates
        Client->>Exec: type text directly<br/>(element already focused)
    else other executable actions (click_at, scroll_at, etc.)
        Client->>Exec: execute action via browser
    end

    Note over Client: Always capture fresh screenshot after processing actions<br/>(even if no executable actions, e.g. only take_screenshot)

    Client->>SS: captureScreenshot()
    SS-->>Client: screenshot bytes

    Client->>Client: build functionResponses: [screenshot part]

    Client->>API: turn call with functionResponses
    API-->>Client: next turn result (final or continue)

    Client->>Client: aggregate usage (input_tokens, output_tokens,<br/>reasoning_tokens, cached_input_tokens, inference_time_ms)
    Note over Client: reasoning_tokens = usageMetadata.thoughtsTokenCount<br/>cached_input_tokens = usageMetadata.cachedContentTokenCount

    Client-->>App: StepResult with actions, message, usage

    App->>App: accumulate totals for all steps
    App-->>App: final response with full usage (including reasoning, cached)

Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts Outdated
Comment thread packages/core/lib/v3/agent/AgentProvider.ts
// NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
// traffic; the rest are inferred from the same drop-the-qualifier pattern
// and are safe aliases (any unmapped name still hits the warning below).
const NAME_ALIASES: Record<string, string> = {

@cubic-dev-ai cubic-dev-ai Bot Jun 24, 2026

Contributor

P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 845:

<comment>Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</comment>

<file context>
@@ -794,19 +824,62 @@ export class GoogleCUAClient extends AgentClient {
+    // NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
+    // traffic; the rest are inferred from the same drop-the-qualifier pattern
+    // and are safe aliases (any unmapped name still hits the warning below).
+    const NAME_ALIASES: Record<string, string> = {
+      click: "click_at",
+      left_click: "click_at",
</file context>

Collaborator Author

addressed

Comment thread packages/core/lib/v3/llm/LLMProvider.ts
Per the Gemini 3.5 Flash computer-use spec, double_click, triple_click,
right_click, middle_click, and move are distinct predefined functions. The
converter collapsed double and triple clicks into a single left click and
left right_click, middle_click, and move unmapped, so those calls were
silent no-ops. Map them to the executor's native double_click, triple_click,
and move actions and to click with the right button. gemini-2.5 emits none of
these names, so its canonical handlers are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
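
One way to picture the click-family mapping from the commit above is a small converter that keeps each function's semantics instead of collapsing them to a left click. This is a sketch under assumptions: `ClickFamilyAction` and `convertClickFamily` are hypothetical names, and mapping `middle_click` to a click with the middle button is an assumption that the commit message does not spell out.

```typescript
type MouseButtonSketch = "left" | "right" | "middle";

// Hypothetical action shapes; the real AgentAction type may differ.
type ClickFamilyAction =
  | { type: "click"; x: number; y: number; button: MouseButtonSketch }
  | { type: "double_click"; x: number; y: number }
  | { type: "triple_click"; x: number; y: number }
  | { type: "move"; x: number; y: number };

// Convert a 3.x click-family function name into a distinct action.
// Names outside the family return null and are left to other handlers.
function convertClickFamily(name: string, x: number, y: number): ClickFamilyAction | null {
  switch (name) {
    case "double_click":
      return { type: "double_click", x, y };
    case "triple_click":
      return { type: "triple_click", x, y };
    case "move":
      return { type: "move", x, y };
    case "right_click":
      return { type: "click", x, y, button: "right" };
    case "middle_click":
      // Assumption: middle_click maps to a click with the middle button.
      return { type: "click", x, y, button: "middle" };
    default:
      return null;
  }
}
```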

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">

<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>

<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>

Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts
@@ -241,6 +241,8 @@ export class GoogleCUAClient extends AgentClient {

@cubic-dev-ai cubic-dev-ai Bot Jun 25, 2026

Contributor

P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 926:

<comment>New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)</comment>

<file context>
@@ -893,6 +897,40 @@ export class GoogleCUAClient extends AgentClient {
+        };
+      }
+
+      case "move": {
+        const { x, y } = this.normalizeCoordinates(
+          args.x as number,
</file context>

Collaborator Author

addressed in c6e675c

Guard the gemini-3.x click-family cases (double/triple/right/middle click,
move) so a payload missing x/y returns null instead of normalizing NaN
into the executor, matching drag_and_drop. Add focused unit tests asserting
the produced AgentAction type/button/coordinates for each, the missing-coord
null path, and 2.5 click_at backcompat.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
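
The missing-coordinate guard from the commit above might look something like the sketch below: validate x/y before normalizing, and return null rather than letting NaN reach the executor. The helper name `readPoint`, the `ViewportSketch` type, and the divide-by-1000 scaling are assumptions for illustration. The review diagram in this PR only says that coordinates on a 0-999 grid are normalized to the viewport.

```typescript
interface ViewportSketch {
  width: number;
  height: number;
}

// Return viewport pixel coordinates, or null when x or y is missing or not a
// finite number, so the caller can drop the action.
function readPoint(
  args: Record<string, unknown>,
  viewport: ViewportSketch,
): { x: number; y: number } | null {
  const rawX = args["x"];
  const rawY = args["y"];
  if (typeof rawX !== "number" || typeof rawY !== "number") return null;
  if (!Number.isFinite(rawX) || !Number.isFinite(rawY)) return null;
  // Assumed scaling: model coordinates are fractions of 1000 of the viewport.
  return {
    x: Math.floor((rawX / 1000) * viewport.width),
    y: Math.floor((rawY / 1000) * viewport.height),
  };
}
```

A caller would treat a null result as "drop this action", which matches the null path the commit's unit tests are described as covering.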
@miguelg719

Collaborator Author

@cubic-dev-ai

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 25, 2026

Contributor

@cubic-dev-ai

@miguelg719 I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 7 files

Confidence score: 3/5

  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, alias normalization happening before custom-tool routing can misclassify overlapping custom tool names as computer-use actions, which can send the wrong action path at runtime—check isCustomTool against rawName before applying NAME_ALIASES to de-risk routing correctness before merging.
  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, defaulting missing functionCall.args to {} without validating required-arg functions can trigger crashes or emit invalid actions on malformed calls—add a required-args guard that rejects arg-required function names when args are undefined before merging.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">

<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>

<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>
Architecture diagram
sequenceDiagram
    participant UI as Client Application
    participant Agent as AgentProvider
    participant GoogleCUA as GoogleCUAClient
    participant GeminiAPI as Gemini 3.5 Flash API
    participant Executor as Action Executor
    participant Screenshot as Screenshot Capture

    Note over UI,Screenshot: Gemini 3.5 Flash Computer Use Flow

    UI->>Agent: initialize agent with model "google/gemini-3.5-flash"
    Agent->>GoogleCUA: create GoogleCUAClient instance

    Note over GoogleCUA,GeminiAPI: Step Loop (per turn)

    GoogleCUA->>GeminiAPI: executeStep() - send system prompt + screenshot
    GeminiAPI-->>GoogleCUA: response with function calls + usage metadata

    Note over GoogleCUA: Extract usage metrics<br/>including reasoning_tokens & cached_input_tokens

    alt Gemini 3.x function call name received
        GoogleCUA->>GoogleCUA: convertFunctionCallToAction()
        Note over GoogleCUA: Apply NAME_ALIASES mapping<br/>e.g., "click" → "click_at", "type" → "type_text_at"
    end

    alt Click-family action (double_click, triple_click, right_click, middle_click, move)
        GoogleCUA->>GoogleCUA: validate x/y coordinates exist
        alt Coordinates missing
            GoogleCUA->>GoogleCUA: return null (drop invalid action)
        else Coordinates present
            GoogleCUA->>GoogleCUA: normalizeCoordinates(0-999 grid to viewport)
            GoogleCUA->>GoogleCUA: preserve click semantics (button type, click count)
        end
    end

    alt Type action from Gemini 3.x
        Note over GoogleCUA: action.type === "type"
        alt Coordinates present (2.5 style type_text_at)
            GoogleCUA->>GoogleCUA: prepend click action at coordinates
        else No coordinates (3.x style)
            Note over GoogleCUA: Skip click - type into focused element
        end
    end

    alt Screenshot function call
        GoogleCUA->>GoogleCUA: return { type: "screenshot" } (no-op)
    end

    Note over GoogleCUA: Process all actions (may be zero)

    loop For each action
        alt Action is "screenshot" or "open_web_browser"
            GoogleCUA->>GoogleCUA: skip execution, just log
        else Other action
            GoogleCUA->>Executor: execute action (click, type, scroll, etc.)
            Executor-->>GoogleCUA: action result
        end
    end

    Note over GoogleCUA: After all actions processed

    GoogleCUA->>Screenshot: capture fresh screenshot for function response
    Screenshot-->>GoogleCUA: screenshot data

    GoogleCUA->>GeminiAPI: return function responses (including screenshot)
    GeminiAPI-->>GoogleCUA: next model response

    Note over GoogleCUA: Track reasoning_tokens + cached_input_tokens across turns

    UI->>GoogleCUA: getFinalResult()
    GoogleCUA-->>UI: AgentResult with aggregated usage metrics

Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts
Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts
…mini CUA)

- navigate/type_text_at now return null on a malformed call (missing url/text)
  instead of producing goto(undefined)/type(undefined); empty type text is
  still allowed (clear field). Matches the click-family coordinate guards.
- When a custom tool is registered under a name that collides with a predefined
  Google CUA function, log at level 2 that the predefined tool takes precedence
  (predefined tools intentionally win; the custom tool isn't silently dropped
  without a trace).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
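
The two behaviours in the commit above can be sketched as a pair of small helpers: one rejects `navigate` without a url and `type_text_at` without text (while still allowing empty text), and the other logs custom tool names that collide with predefined functions. All names here (`GuardedAction`, `convertWithGuards`, `reportCollisions`) and the log callback signature are hypothetical; only the described behaviour and the level 2 log come from the commit message.

```typescript
// Hypothetical action shapes for the two guarded functions.
type GuardedAction =
  | { type: "goto"; url: string }
  | { type: "type"; text: string };

// Return null for malformed calls instead of emitting goto(undefined) or
// type(undefined). An empty string is valid text (it clears the field).
function convertWithGuards(
  name: string,
  args: Record<string, unknown> = {},
): GuardedAction | null {
  if (name === "navigate") {
    const url = args["url"];
    return typeof url === "string" ? { type: "goto", url } : null;
  }
  if (name === "type_text_at") {
    const text = args["text"];
    return typeof text === "string" ? { type: "type", text } : null;
  }
  return null;
}

// Log each custom tool whose name collides with a predefined function and
// return the colliding names. The predefined function still wins.
function reportCollisions(
  customToolNames: readonly string[],
  predefinedNames: ReadonlySet<string>,
  log: (message: string, level: number) => void,
): string[] {
  const collisions = customToolNames.filter((n) => predefinedNames.has(n));
  for (const n of collisions) {
    log(`Custom tool "${n}" collides with a predefined function; predefined wins`, 2);
  }
  return collisions;
}
```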