Description of the feature request:
Provide an opt-in hybrid environment mode for the Computer Use agent that sends a compact structural payload (filtered DOM / Playwright accessibility snapshot / list of interactive elements with roles, labels, bounding boxes and selectors) in addition to the existing screenshot. The structural payload must be intentionally compact (only interactive elements, truncated text, and optional short HTML snippets) to avoid token explosion and preserve privacy. Hybrid mode should be toggleable via config (env var or GenerateContentConfig option) and default to the current screenshot-only behavior.
What problem are you trying to solve with this feature?
Coordinate-based interactions are brittle on responsive or dynamic pages; clicks by (x,y) break when layout shifts.
With visual-only reasoning, locating form fields, buttons, and links is more error-prone than with selectors or ARIA roles.
Single-page apps, dynamic content, and off-screen or shadow-DOM elements are difficult to handle visually.
Developers need a more robust way to express intent (e.g., “click the button labeled ‘Save’” or “fill the input named ‘email’”) without losing the generality of visual input for non-HTML contexts (canvas, VNC, native apps).
Any other information you'd like to share?
Suggested minimal payload: list of interactive elements (tag, role, text/aria-label, id/class, stable selector if available, and bbox {x,y,width,height}), optional Playwright accessibility snapshot, and an optionally sampled or truncated HTML snippet.
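A minimal sketch of what the element list could look like once serialized. The names InteractiveElement, BBox, MAX_TEXT_CHARS, truncate and to_payload are hypothetical and exist only for this illustration, and the 80-character limit is an assumed default, not a measured one.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

MAX_TEXT_CHARS = 80  # assumption: illustrative truncation limit


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class InteractiveElement:
    # Hypothetical shape for one entry of the structural payload.
    tag: str
    role: Optional[str]
    text: str
    selector: Optional[str]
    bbox: BBox


def truncate(text: str, limit: int = MAX_TEXT_CHARS) -> str:
    """Collapse whitespace and cut long text, marking the cut with '...'."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    return collapsed[: limit - 3] + "..."


def to_payload(elements: List[InteractiveElement]) -> str:
    """Serialize elements to compact JSON: truncated text, None fields dropped."""
    rows = []
    for el in elements:
        row = asdict(el)  # also converts the nested BBox to a dict
        row["text"] = truncate(row["text"])
        rows.append({k: v for k, v in row.items() if v is not None})
    # Tight separators avoid spending tokens on whitespace.
    return json.dumps(rows, separators=(",", ":"))
```

Dropping None fields and using tight separators keeps the per-element cost small, which matters because the payload is sent on every turn alongside the screenshot.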
Implementation notes:
Extend EnvState with elements and/or accessibility_tree.
Modify PlaywrightComputer.current_state() to collect the compact payload: use page.evaluate() to gather a, button, input, textarea, and select elements, plus any element with role=button or role=link; truncate long text; and mask/redact sensitive input values.
When building FunctionResponse parts, include the structural payload as a JSON/plain-text part before the screenshot blob so the model receives both modalities in one turn.
Add an opt-in feature flag (e.g., COMPUTER_USE_HYBRID=true) and privacy/redaction options (mask input values, redact sensitive attributes).
Provide a selector-first action strategy in the agent: prefer selector-based actions, fall back to coordinates if selector is missing or unreliable.
Add unit/integration tests and a small example PR demonstrating the hybrid flow on forms and SPAs.
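The opt-in flag and the selector-first fallback from the notes above could be sketched as follows. This is an illustration under assumptions: hybrid_enabled and click are hypothetical helpers, the set of accepted truthy strings is a guess, and the two callables stand in for whatever the agent actually uses to click.

```python
import os
from typing import Callable, Mapping, Optional


def hybrid_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when COMPUTER_USE_HYBRID is explicitly switched on.

    Unset or unrecognized values keep the screenshot-only default.
    """
    value = env.get("COMPUTER_USE_HYBRID", "")
    return value.strip().lower() in {"1", "true", "yes"}  # assumed spellings


def click(
    click_selector: Callable[[str], None],
    click_at: Callable[[int, int], None],
    selector: Optional[str],
    x: int,
    y: int,
) -> str:
    """Try the selector first, fall back to coordinates.

    Returns which path was taken so callers can log it.
    """
    if selector:
        try:
            click_selector(selector)
            return "selector"
        except Exception:
            # Selector was missing, stale, or ambiguous: use coordinates.
            pass
    click_at(x, y)
    return "coordinates"
```

With the Playwright sync API, click_selector could plausibly be a wrapper such as lambda s: page.click(s, timeout=2000) and click_at could be page.mouse.click; the short timeout is an assumption intended to keep a stale selector from stalling the turn before the fallback runs.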
Risks & mitigations: token/size growth (mitigate via filtering & truncation), PII exposure (mitigate via masking and opt-in), compatibility (keep screenshot-only as default).
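For the PII mitigation, one possible masking rule is sketched below. The function redact_value, the placeholder string, and the list of sensitive name fragments are all assumptions made for illustration; a real deployment would need its own policy.

```python
import re

REDACTED = "[REDACTED]"  # assumption: placeholder sent instead of the value
SENSITIVE_TYPES = {"password"}  # input types that are always masked
# Assumption: name/id fragments that suggest a sensitive field.
SENSITIVE_NAME = re.compile(r"pass|token|secret|card|cvv|ssn", re.IGNORECASE)


def redact_value(input_type: str, name: str, value: str, mask_all: bool = False) -> str:
    """Return the value to include in the payload for one input field.

    Empty values pass through unchanged so the model can still tell that a
    field is blank. With mask_all=True every non-empty value is masked.
    """
    if not value:
        return ""
    if mask_all:
        return REDACTED
    if input_type.lower() in SENSITIVE_TYPES or SENSITIVE_NAME.search(name):
        return REDACTED
    return value
```

Defaulting to mask_all=True in hybrid mode would be the more conservative choice, at the cost of hiding values the model might need in order to verify that a form was filled correctly.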