Feature request: Add hybrid DOM / accessibility-tree + screenshot input to Computer Use agent

### Description of the feature request:


Provide an opt-in hybrid environment mode for the Computer Use agent that sends a compact structural payload (filtered DOM / Playwright accessibility snapshot / list of interactive elements with roles, labels, bounding boxes and selectors) in addition to the existing screenshot. The structural payload must be intentionally compact (only interactive elements, truncated text, and optional short HTML snippets) to avoid token explosion and preserve privacy. Hybrid mode should be toggleable via config (env var or GenerateContentConfig option) and default to the current screenshot-only behavior.

### What problem are you trying to solve with this feature?


Coordinate-based interactions are brittle on responsive or dynamic pages; clicks by (x,y) break when layout shifts.

With visual-only reasoning, locating form fields, buttons, and links is more error-prone than with selectors or ARIA roles.

Single-page apps, dynamic content, and off-screen or shadow-DOM elements are difficult to handle visually.

Developers need a more robust way to express intent (e.g., “click the button labeled ‘Save’” or “fill the input named ‘email’”) without losing the generality of visual input for non-HTML contexts (canvas, VNC, native apps).

### Any other information you'd like to share?


Suggested minimal payload: list of interactive elements (tag, role, text/aria-label, id/class, stable selector if available, and bbox {x,y,width,height}), optional Playwright accessibility snapshot, and an optionally sampled or truncated HTML snippet.
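As a rough illustration of the element list, here is a minimal sketch of one possible schema. The class names, field names, and the 80-character truncation limit are assumptions for discussion, not an existing API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

MAX_TEXT_CHARS = 80  # assumed truncation limit, not specified in this proposal


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class InteractiveElement:
    # Hypothetical schema mirroring the fields listed above.
    tag: str
    bbox: BBox
    role: Optional[str] = None
    text: Optional[str] = None      # visible text or aria-label
    id: Optional[str] = None
    classes: Optional[str] = None
    selector: Optional[str] = None  # stable selector, if one can be derived


def truncate(text: Optional[str], limit: int = MAX_TEXT_CHARS) -> Optional[str]:
    """Collapse whitespace and cut text to at most `limit` characters."""
    if text is None:
        return None
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    return collapsed[: limit - 1] + "…"


def to_payload(elements: list[InteractiveElement]) -> str:
    """Serialize elements as compact JSON, dropping unset fields."""
    rows = []
    for el in elements:
        row = {k: v for k, v in asdict(el).items() if v is not None}
        if "text" in row:
            row["text"] = truncate(row["text"])
        rows.append(row)
    return json.dumps(rows, separators=(",", ":"), ensure_ascii=False)
```

Dropping unset fields and using compact separators keeps the per-element cost small, which matters once a page has hundreds of interactive elements.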

Implementation notes:

Extend EnvState with elements and/or accessibility_tree.
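One way the extended state could look, sketched with plain dataclasses. The `EnvState` below is a stand-in that assumes the existing state holds a screenshot and a URL; `HybridEnvState` and `is_hybrid` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EnvState:
    """Stand-in for the existing state (assumed: screenshot plus current URL)."""
    screenshot: bytes
    url: str


@dataclass
class HybridEnvState(EnvState):
    """Hypothetical extension carrying the optional structural payload."""
    elements: list[dict[str, Any]] = field(default_factory=list)
    accessibility_tree: Optional[dict[str, Any]] = None

    @property
    def is_hybrid(self) -> bool:
        # True only when some structural data was actually collected.
        return bool(self.elements) or self.accessibility_tree is not None
```

Because both new fields have defaults, code that constructs the state with only a screenshot and URL would keep working unchanged.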

Modify `PlaywrightComputer.current_state()` to collect the compact payload (use `page.evaluate()` to gather `a`, `button`, `input`, `textarea`, and `select`, plus elements with `role=button` or `role=link`), truncate long text, and mask/redact sensitive input values.
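A minimal sketch of the collection step, assuming Playwright's sync `page.evaluate()`. The returned field names, the sensitivity heuristics, and the truncation limit are illustrative assumptions; a real implementation would likely need a more careful redaction policy.

```python
from typing import Any

# JavaScript run inside the page. The selector list follows the proposal;
# the shape of the returned objects is an assumption.
COLLECT_JS = """
() => {
  const sel = 'a, button, input, textarea, select, [role="button"], [role="link"]';
  return Array.from(document.querySelectorAll(sel)).map((el) => {
    const r = el.getBoundingClientRect();
    return {
      tag: el.tagName.toLowerCase(),
      role: el.getAttribute('role'),
      type: el.getAttribute('type'),
      name: el.getAttribute('name'),
      text: el.getAttribute('aria-label') || el.innerText || '',
      value: 'value' in el ? String(el.value) : null,
      bbox: { x: r.x, y: r.y, width: r.width, height: r.height },
    };
  }).filter((e) => e.bbox.width > 0 && e.bbox.height > 0);
}
"""

SENSITIVE_TYPES = {"password"}                                # assumed
SENSITIVE_NAME_HINTS = ("pass", "token", "secret", "card")    # assumed
MAX_TEXT_CHARS = 80                                           # assumed


def sanitize(raw: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Truncate text and mask values of inputs that look sensitive."""
    out = []
    for el in raw:
        el = dict(el)  # copy so the caller's data is not mutated
        el["text"] = " ".join((el.get("text") or "").split())[:MAX_TEXT_CHARS]
        name = (el.get("name") or "").lower()
        sensitive = el.get("type") in SENSITIVE_TYPES or any(
            hint in name for hint in SENSITIVE_NAME_HINTS
        )
        if el.get("value"):
            el["value"] = "[REDACTED]" if sensitive else el["value"][:MAX_TEXT_CHARS]
        out.append(el)
    return out


def collect_elements(page: Any) -> list[dict[str, Any]]:
    """`page` is expected to be a Playwright Page; only evaluate() is used."""
    return sanitize(page.evaluate(COLLECT_JS))
```

Filtering out zero-size elements in the browser keeps hidden controls out of the payload before it is ever serialized.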

When building FunctionResponse parts, include the structural payload as a JSON/plain-text part before the screenshot blob so the model receives both modalities in one turn.
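To illustrate the ordering only, here is a sketch that uses plain dicts as placeholders for the response parts. The dict layout and the `build_response_parts` name are assumptions and do not reflect the SDK's actual part types.

```python
import base64
import json
from typing import Any


def build_response_parts(
    elements: list[dict[str, Any]], screenshot_png: bytes
) -> list[dict[str, Any]]:
    """Return placeholder parts: structural JSON first, screenshot second."""
    parts: list[dict[str, Any]] = []
    if elements:
        # Structural payload goes first so it precedes the image in the turn.
        parts.append({
            "mime_type": "application/json",
            "text": json.dumps({"elements": elements}, separators=(",", ":")),
        })
    # The screenshot is always included, matching today's behavior.
    parts.append({
        "mime_type": "image/png",
        "data": base64.b64encode(screenshot_png).decode("ascii"),
    })
    return parts
```

With an empty element list the function returns the screenshot alone, so the screenshot-only path stays identical to the current one.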

Add an opt-in feature flag (e.g., COMPUTER_USE_HYBRID=true) and privacy/redaction options (mask input values, redact sensitive attributes).
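The flag handling could be as small as the sketch below. `COMPUTER_USE_HYBRID` comes from the suggestion above; `COMPUTER_USE_MASK_INPUTS`, `HybridOptions`, and the set of accepted truthy strings are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted values


@dataclass(frozen=True)
class HybridOptions:
    enabled: bool = False           # screenshot-only remains the default
    mask_input_values: bool = True  # privacy options default to the safe side


def load_hybrid_options(env: Optional[Mapping[str, str]] = None) -> HybridOptions:
    """Read hybrid-mode options from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env

    def flag(name: str, default: bool) -> bool:
        raw = env.get(name)
        return default if raw is None else raw.strip().lower() in TRUTHY

    return HybridOptions(
        enabled=flag("COMPUTER_USE_HYBRID", False),
        mask_input_values=flag("COMPUTER_USE_MASK_INPUTS", True),
    )
```

Accepting an explicit mapping makes the defaults easy to unit-test without touching the real process environment.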

Provide a selector-first action strategy in the agent: prefer selector-based actions, and fall back to coordinates if the selector is missing or unreliable.
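The fallback could be sketched roughly as follows, assuming a Playwright `Page` (`page.locator(...).click()` and `page.mouse.click()`). The `click_element` helper, its return values, and the 2-second timeout are hypothetical; catching a broad `Exception` is a simplification to keep the sketch free of imports.

```python
from typing import Any, Optional


def click_element(
    page: Any,
    selector: Optional[str],
    x: float,
    y: float,
    timeout_ms: int = 2000,  # assumed short timeout so fallback is quick
) -> str:
    """Try a selector click first; fall back to a coordinate click."""
    if selector:
        try:
            page.locator(selector).click(timeout=timeout_ms)
            return "selector"
        except Exception:
            # Selector missing, stale, or not actionable: use coordinates.
            pass
    page.mouse.click(x, y)
    return "coordinates"
```

Returning which path was taken lets the agent log how often selectors fail, which would help judge whether the hybrid payload is paying off.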

Add unit/integration tests and a small example PR demonstrating the hybrid flow on forms and SPAs.

Risks & mitigations: token/size growth (mitigate via filtering & truncation), PII exposure (mitigate via masking and opt-in), compatibility (keep screenshot-only as default).
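For the size-growth risk, one possible guard is a hard byte budget on the serialized element list. The sketch below is an assumption about how that might work; `fit_to_budget` and the 8 KiB default are illustrative.

```python
import json
from typing import Any


def fit_to_budget(
    elements: list[dict[str, Any]], max_bytes: int = 8192
) -> list[dict[str, Any]]:
    """Keep leading elements until the compact JSON array would exceed max_bytes."""
    kept: list[dict[str, Any]] = []
    size = 2  # the enclosing "[" and "]"
    for el in elements:
        encoded = json.dumps(el, separators=(",", ":"))
        # One extra byte for the comma between array items.
        cost = len(encoded.encode("utf-8")) + (1 if kept else 0)
        if size + cost > max_bytes:
            break
        kept.append(el)
        size += cost
    return kept
```

Elements are kept in the order given, so the caller would want to sort by priority first (for example, in-viewport elements before off-screen ones).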
