feat: Add screenshot functionality for multimodal AI input by aj47 · Pull Request #219 · aj47/SpeakMCP

aj47 · 2025-10-31T17:52:44Z

Closes #217

Summary

This PR adds screenshot functionality to SpeakMCP, allowing users to capture and send screenshots along with text input to multimodal AI models.

Changes

Core Features

Screenshot Capture: Implemented using Electron's desktopCapturer API
UI Checkbox: Added checkbox in text input panel to enable screenshot capture
Configuration: Added screenshot settings (quality, format, max dimensions)
Multimodal Support: Updated message types to support both text and images

Technical Implementation

Backend Changes

Added captureScreenshot TIPC handler in src/main/tipc.ts
- Captures primary display screenshot
- Resizes based on config (max width/height)
- Encodes to base64 with configurable format (PNG/JPEG) and quality
Updated createMcpTextInput to accept screenshotData parameter
Modified LLM API calls to handle multimodal content:
- makeOpenAICompatibleCall now accepts content: any (string or array)
- makeGeminiCall extracts text from multimodal content
- Updated token estimation to handle image content
Updated conversation service to handle MessageContent type

Frontend Changes

Enhanced TextInputPanel component:
- Added screenshot checkbox (only shown when screenshotEnabled in config)
- Added screenshot preview with remove button
- Captures screenshot when checkbox is checked
- Passes screenshot data to submit handler
Updated panel.tsx to pass screenshot data through mutation
Added helper functions to extract text from MessageContent in:
- conversation-display.tsx
- agent-progress.tsx
- conversation-context.tsx

Type System

Added multimodal content types in src/shared/types.ts:

export type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

export type MessageContent = string | MessageContentPart[]

Updated ConversationMessage to use MessageContent instead of string
Added screenshot configuration options to Config type
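
The PR names text-extraction helpers for `MessageContent` but does not show their bodies. A minimal sketch of what such a helper might look like, assuming the two types exactly as declared above (the function body here is illustrative, not the PR's actual code):

```typescript
// Types as declared in the PR description (src/shared/types.ts).
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

type TextPart = Extract<MessageContentPart, { type: "text" }>

// Illustrative body: plain strings pass through, arrays keep only their text parts.
function extractTextFromContent(content: MessageContent): string {
  if (typeof content === "string") return content
  return content
    .filter((part): part is TextPart => part.type === "text")
    .map((part) => part.text)
    .join(" ")
}
```

Because the string branch returns the input unchanged, existing string-only messages behave as before; image parts are simply skipped.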

Configuration

Added default screenshot settings in src/main/config.ts:
- screenshotEnabled: true
- screenshotQuality: 0.8
- screenshotFormat: "jpeg"
- screenshotMaxWidth: 1920
- screenshotMaxHeight: 1080
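
One way these defaults could be merged with stored settings is sketched below. The `ScreenshotConfig` interface, the `"png" | "jpeg"` union, and `resolveScreenshotConfig` are assumptions for illustration; only the field names and default values come from the PR. The sketch uses `??` so that an explicit quality of `0` is kept rather than replaced by the default.

```typescript
// Assumed shape: field names and defaults are from the PR, the types are a guess.
interface ScreenshotConfig {
  screenshotEnabled: boolean
  screenshotQuality: number // 0..1
  screenshotFormat: "png" | "jpeg"
  screenshotMaxWidth: number
  screenshotMaxHeight: number
}

const defaultScreenshotConfig: ScreenshotConfig = {
  screenshotEnabled: true,
  screenshotQuality: 0.8,
  screenshotFormat: "jpeg",
  screenshotMaxWidth: 1920,
  screenshotMaxHeight: 1080,
}

// Hypothetical helper: fill in missing fields and clamp quality into [0, 1].
function resolveScreenshotConfig(stored: Partial<ScreenshotConfig>): ScreenshotConfig {
  const d = defaultScreenshotConfig
  const quality = stored.screenshotQuality ?? d.screenshotQuality
  return {
    screenshotEnabled: stored.screenshotEnabled ?? d.screenshotEnabled,
    screenshotQuality: Math.min(1, Math.max(0, quality)),
    screenshotFormat: stored.screenshotFormat ?? d.screenshotFormat,
    screenshotMaxWidth: stored.screenshotMaxWidth ?? d.screenshotMaxWidth,
    screenshotMaxHeight: stored.screenshotMaxHeight ?? d.screenshotMaxHeight,
  }
}
```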

API Format

Screenshots are sent to AI models using the OpenAI-compatible multimodal format:

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ]
}
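
A small sketch of how the `content` field above might be assembled from a text prompt and an optional screenshot. `buildUserContent` is a hypothetical name; the part shapes follow the `MessageContentPart` type from this PR, and the screenshot is assumed to already be a complete data URI.

```typescript
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

// Hypothetical helper: keep the plain-string form when no screenshot is attached,
// so text-only requests look exactly as they did before the change.
function buildUserContent(text: string, screenshotDataUrl?: string): MessageContent {
  if (!screenshotDataUrl) return text
  return [
    { type: "text", text },
    { type: "image_url", image_url: { url: screenshotDataUrl } },
  ]
}

const requestBody = {
  messages: [
    {
      role: "user",
      content: buildUserContent("What's in this image?", "data:image/jpeg;base64,..."),
    },
  ],
}
console.log(JSON.stringify(requestBody, null, 2))
```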

Backward Compatibility

All changes are backward compatible
MessageContent is a union type (string | MessageContentPart[])
Existing string-based messages continue to work
Screenshot feature is opt-in via checkbox

Testing

TypeScript compilation passes (except pre-existing @fastify/cors issue)
All multimodal type errors resolved
Ready for end-to-end testing with multimodal models (GPT-4V, Claude, etc.)

Notes

The @fastify/cors type error is a pre-existing issue unrelated to this PR
Screenshot capture uses the primary display
Images are base64-encoded for transmission
Gemini API currently only uses text parts (image support can be added later)

Pull Request opened by Augment Code with guidance from the PR author

Summary by CodeRabbit

Release Notes

New Features
- Added screenshot capture and attachment functionality for messages
- Enabled multimodal message support combining text and images
- New screenshot configuration options for quality, format, and dimensions
Bug Fixes
- Improved tool execution stability with consistent ID generation
- Fixed UI state preservation across message updates
Chores
- Updated configuration schema to support screenshot settings

- Add screenshot capture using Electron's desktopCapturer API
- Add screenshot checkbox in text input UI
- Add screenshot configuration options (quality, format, max dimensions)
- Update message types to support multimodal content (text + images)
- Update LLM API calls to handle multimodal content
- Update conversation service to handle MessageContent type
- Add helper functions to extract text from multimodal content

Implements multimodal support for sending screenshots to AI models via OpenAI-compatible APIs using base64-encoded images.

coderabbitai · 2025-10-31T17:52:56Z

📥 Commits

Reviewing files that changed from the base of the PR and between de52278 and 7e77673.

📒 Files selected for processing (9)

src/main/context-budget.ts (2 hunks)
src/main/llm-fetch.ts (7 hunks)
src/main/llm.ts (14 hunks)
src/main/tipc.ts (12 hunks)
src/renderer/src/components/text-input-panel.tsx (5 hunks)
src/renderer/src/hooks/use-input-processing.ts (3 hunks)
src/renderer/src/lib/query-types.ts (1 hunks)
src/renderer/src/pages/panel.tsx (7 hunks)
src/renderer/src/pages/settings-general.tsx (2 hunks)

Walkthrough

This PR adds multimodal screenshot support and improves message content handling across the application. It introduces screenshot capture functionality with UI controls, updates message content types to support structured multimodal payloads (text and images), and implements stable content-based IDs for tool executions. The changes span backend (LLM, configuration, conversation service) and frontend (text input, agent progress rendering) layers.

Changes

Cohort / File(s)	Change Summary
Shared Type Definitions `src/shared/types.ts`	Added new `MessageContentPart` and `MessageContent` types to support multimodal content (text and image_url blocks). Updated `ConversationMessage.content` and `AgentProgressUpdate.conversationHistory[].content` from `string` to `MessageContent`. Extended `Config` interface with screenshot settings (`screenshotEnabled`, `screenshotQuality`, `screenshotFormat`, `screenshotMaxWidth`, `screenshotMaxHeight`).
Backend Configuration & Service `src/main/config.ts`, `src/main/conversation-service.ts`	Added screenshot configuration fields to `defaultConfig`. Updated `createConversation()` and `addMessageToConversation()` method signatures to accept `MessageContent` instead of `string`. Added runtime logic to extract text from structured content arrays and handle multimodal payloads.
Backend LLM & Fetch `src/main/llm.ts`, `src/main/llm-fetch.ts`	Broadened content typing in `makeLLMCall()`, `makeOpenAICompatibleCall()`, `makeGeminiCall()`, `makeLLMCallAttempt()`, and `makeLLMCallWithFetch()` to accept `any` content type. Enhanced token estimation to handle both string and structured content parts. Updated Gemini prompt construction to extract only text parts from multimodal content.
Backend IPC & Screenshot Integration `src/main/tipc.ts`	Added `captureScreenshot()` public method to capture base64-encoded screenshots. Extended `createMcpTextInput()` signature with optional `screenshotData` parameter. Implemented multimodal message content assembly when screenshot is provided. Updated agent-mode history normalization to handle structured content arrays.
Frontend Components `src/renderer/src/components/agent-progress.tsx`, `src/renderer/src/components/conversation-display.tsx`	Introduced `extractTextFromContent()` helper function to normalize `MessageContent` to plain text. Applied text extraction in message filtering and rendering paths (MarkdownRenderer, AudioPlayer, compact display) to ensure consistent text handling across component tree.
Frontend Input & State Management `src/renderer/src/components/text-input-panel.tsx`, `src/renderer/src/pages/panel.tsx`	Updated `TextInputPanelProps.onSubmit` signature to accept optional `screenshotData` parameter. Added configuration-driven screenshot UI checkbox, capture logic, and preview area with remove functionality. Extended `handleTextSubmit()` to accept and forward `screenshotData` through MCP mutation.
Frontend Context `src/renderer/src/contexts/conversation-context.tsx`	Updated `conversationHistory` entry content type from `string` to `any` to accommodate structured `MessageContent`.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant TextInput as Text Input Panel
    participant Main as Main Process (IPC)
    participant ConvService as Conversation Service
    participant LLM as LLM Handler
    participant Renderer as Agent Progress Renderer

    User->>TextInput: Enters text + enables screenshot
    TextInput->>Main: captureScreenshot()
    Main-->>TextInput: base64 screenshot data
    User->>TextInput: Submit
    TextInput->>Main: createMcpTextInput({ text, screenshotData })
    alt Screenshot provided
        Main->>ConvService: addMessageToConversation(content: [text, image_url])
    else No screenshot
        Main->>ConvService: addMessageToConversation(content: text)
    end
    ConvService->>LLM: makeLLMCall(messages with MessageContent)
    LLM->>LLM: generateToolExecutionId(toolCall)
    LLM-->>Renderer: AgentProgressUpdate with multimodal content
    Renderer->>Renderer: extractTextFromContent(content)
    Renderer-->>User: Display text + images

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Type signature changes across multiple layers: The shift from string to MessageContent affects public methods in conversation service, LLM functions, and component props. Each requires verification that callers and implementations handle both union types correctly.
Multimodal content extraction logic: The extractTextFromContent() pattern is repeated across components but not universally applied. Verify all rendering paths that consume message content have been updated.
Screenshot capture and IPC flow: The new captureScreenshot() method and screenshot data threading through MCP mutations introduces new async/error handling paths that need careful validation.
Stable ID generation: The content-hash-based ID generation in generateToolExecutionId() is critical for UI stability but the hashing approach should be reviewed for collision and performance implications.
Conversation history normalization: Agent-mode history handling performs string extraction from potentially structured content—ensure fallbacks and edge cases (empty arrays, non-text parts) are handled safely.
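
For reviewers weighing the collision and performance question, here is one possible shape for a content-derived ID. This is a sketch only: the real `generateToolExecutionId()` is not shown in this thread, and the `ToolCallLike` shape, the `tool-` prefix, and the FNV-1a-style hash over UTF-16 code units are all assumptions.

```typescript
// Assumed minimal shape of a tool call; the real type is not shown in this PR thread.
interface ToolCallLike {
  name: string
  arguments: unknown
}

// 32-bit FNV-1a-style hash, applied to UTF-16 code units rather than bytes.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, "0")
}

// Same name and arguments always give the same ID, so re-renders keep stable keys.
// JSON.stringify depends on key insertion order, so reordered keys give a new ID.
function generateToolExecutionId(toolCall: ToolCallLike): string {
  return `tool-${fnv1a(`${toolCall.name}:${JSON.stringify(toolCall.arguments)}`)}`
}
```

A 32-bit hash is cheap but not collision-free, and two identical calls in one conversation would share an ID under this scheme, so a real implementation may need an extra discriminator such as the call's position.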

Possibly related PRs

#209: Modifies agent-progress.tsx expansion state lifting and introduces generateToolExecutionId() for stable tool execution rendering
#202: Updates agent tool-execution flow and UI rendering in agent-progress.tsx and src/main/llm.ts

Suggested labels

augment_review

Poem

🐰 A screenshot captured, content now flows,
Multimodal messages—no more plain text prose!
With stable IDs and structured arrays so fine,
The agent sees images and text intertwine!
From backend to UI, the data takes flight,
Together they render a multimodal sight! 📸✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "feat: Add screenshot functionality for multimodal AI input" is clear, concise, and directly reflects the main purpose of the changeset. It accurately summarizes the primary feature being introduced—screenshot capture capability for multimodal AI model input—and uses conventional commit format conventions. The title is specific enough that reviewers scanning the history would immediately understand the key change without excessive detail.
Linked Issues Check	✅ Passed	The pull request successfully addresses all primary coding-related objectives from issue #217. The implementation includes screenshot capture via Electron's desktopCapturer with the new `captureScreenshot()` handler [tipc.ts], UI checkbox in the text input panel [text-input-panel.tsx] conditional on configuration, and screenshot configuration options [config.ts] for quality, format, and dimensions. Multimodal model integration is implemented through updated content typing in LLM call functions to accept structured content arrays [llm-fetch.ts, llm.ts], proper base64 encoding of images in data URIs [tipc.ts], and support for OpenAI-compatible multimodal message formats. The type system now supports multimodal messages through new `MessageContent` and `MessageContentPart` types [types.ts] and updated interfaces for conversations and agent progress. All acceptance criteria are addressed: users can capture and attach screenshots via UI checkbox, screenshot behavior is configurable, screenshot data is properly formatted for multimodal models, and the implementation maintains backward compatibility with string-based messages.
Out of Scope Changes Check	✅ Passed	All documented changes in the pull request are directly aligned with the requirements of issue #217. The modifications span configuration management [config.ts], type definitions [types.ts], backend screenshot capture and LLM integration [tipc.ts, llm-fetch.ts, llm.ts], frontend UI components [text-input-panel.tsx, panel.tsx], and supporting display/context updates to handle the new multimodal `MessageContent` type [conversation-display.tsx, agent-progress.tsx, conversation-context.tsx]. The text extraction helpers and type updates are necessary supporting changes for proper handling of multimodal content throughout the application. No changes appear to be unrelated to the screenshot functionality or multimodal message support objectives.

aj47 · 2025-10-31T17:54:41Z

augment review

augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2025-10-31T17:54:46Z

+
+      // Get the primary screen source
+      const primarySource = sources[0]
+      const thumbnail = primarySource.thumbnail


Choosing the primary screen with sources[0] is unreliable—desktopCapturer.getSources() does not guarantee ordering. Consider matching the source to screen.getPrimaryDisplay() (e.g., via display_id) to ensure the correct monitor is captured, especially on multi‑monitor setups.
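
A sketch of the matching step suggested here, written as a pure function so it can be checked without Electron. `pickPrimarySource` and `SourceLike` are hypothetical names; the sketch relies on Electron's documented `display_id` field on capture sources, which is a string and may be empty on some platforms, hence the fallback to the first source.

```typescript
// Structural stand-in for the fields of Electron's DesktopCapturerSource used here.
interface SourceLike {
  id: string
  display_id: string // may be "" when the platform cannot report it
}

// Prefer the source whose display_id matches the primary display's numeric id;
// fall back to the first source when no match can be made.
function pickPrimarySource<T extends SourceLike>(
  sources: T[],
  primaryDisplayId: number,
): T | undefined {
  const match = sources.find((source) => source.display_id === String(primaryDisplayId))
  return match ?? sources[0]
}
```

In the handler this would be called with the result of `desktopCapturer.getSources(...)` and `screen.getPrimaryDisplay().id`.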

augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2025-10-31T17:56:16Z

+          screenshotData,
        })
      } else {
        textInputMutation.mutate({ text })


Screenshot data is only passed through the MCP path; when MCP is disabled this else branch calls textInputMutation.mutate({ text }), dropping the screenshot entirely. Consider forwarding screenshotData (and updating the non-MCP backend path) so the feature works consistently without MCP (also applies to the fallback below).
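
One way to avoid the divergence is to build the mutation input once and hand the same object to whichever path is active. This is a sketch under assumptions: `TextInput`, `MutationLike`, `submitTextInput`, and the `mcpTextInputMutation` parameter name are hypothetical, and the non-MCP backend would still need to accept `screenshotData`.

```typescript
// Hypothetical input shape shared by both mutation paths.
interface TextInput {
  text: string
  screenshotData?: string
}

// Minimal stand-in for a mutation object exposing mutate().
interface MutationLike {
  mutate(input: TextInput): void
}

// Build the payload once so the screenshot cannot be dropped in one branch only.
function submitTextInput(
  text: string,
  screenshotData: string | undefined,
  mcpEnabled: boolean,
  mcpTextInputMutation: MutationLike,
  textInputMutation: MutationLike,
): void {
  const input: TextInput = screenshotData ? { text, screenshotData } : { text }
  if (mcpEnabled) {
    mcpTextInputMutation.mutate(input)
  } else {
    textInputMutation.mutate(input)
  }
}
```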

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/renderer/src/components/conversation-display.tsx (1)
92-104: Convert MessageContent to plain text before downstream use.

message.content is now MessageContent (string | MessageContentPart[]). When it’s an array (text + image), the current calls send an object/array to generateSpeech and the context menu IPC handler, which expect strings. In prod this blows up TTS requests (invalid payload / [object Object]) and the context-menu copy path. Let’s normalize once and reuse.
@@
-  const generateAudio = async (): Promise<ArrayBuffer> => {
+  const textContent = extractTextFromContent(message.content)
+
+  const generateAudio = async (): Promise<ArrayBuffer> => {
     if (!configQuery.data?.ttsEnabled) {
       throw new Error("TTS is not enabled")
     }
@@
-      const result = await tipcClient.generateSpeech({
-        text: message.content,
+      const result = await tipcClient.generateSpeech({
+        text: textContent,
       })
@@
-    tipcClient.showContextMenu({
+    tipcClient.showContextMenu({
       x: e.clientX,
       y: e.clientY,
       messageContext: {
-        content: message.content,
+        content: textContent,
         role: message.role,
         messageId: message.id,
       },
     })
   }
@@
-          <MarkdownRenderer content={extractTextFromContent(message.content)} />
+          <MarkdownRenderer content={textContent} />
@@
-              text={extractTextFromContent(message.content)}
+              text={textContent}
Also applies to: 169-181
src/main/conversation-service.ts (1)
91-98: Fix type error: generatePreview doesn't handle MessageContent arrays.

Line 95 uses msg.content.slice(0, 100) which assumes content is always a string. However, ConversationMessage.content is now MessageContent (string | array), so this will throw a runtime error when content is an array.

Apply this diff to extract text from MessageContent:
  private generatePreview(messages: ConversationMessage[]): string {
    // Generate a preview from the first few messages
    const previewMessages = messages.slice(0, 3)
    const preview = previewMessages
-     .map((msg) => `${msg.role}: ${msg.content.slice(0, 100)}`)
+     .map((msg) => {
+       const content = typeof msg.content === 'string'
+         ? msg.content
+         : msg.content.filter(p => p.type === 'text').map(p => p.text).join(' ')
+       return `${msg.role}: ${content.slice(0, 100)}`
+     })
      .join(" | ")
    return preview.length > 200 ? `${preview.slice(0, 200)}...` : preview
  }
src/main/tipc.ts (1)

634-697: Screenshot context is lost when passed to agent mode processing.

The review comment is correct. The code saves the multimodal messageContent with the screenshot to the conversation (lines 643-656), but processWithAgentMode only passes text to processTranscriptWithAgentMode (line 661). Additionally, when loading previousConversationHistory, the most recent message is excluded (slice(0, -1)), so the current user's screenshot never reaches the LLM for analysis.

To fix this, pass messageContent instead of text to processWithAgentMode, and modify processWithAgentMode to accept and forward the multimodal content to processTranscriptWithAgentMode.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 881512f and de52278.

📒 Files selected for processing (12)

PR209_FINAL_SUMMARY.md (1 hunks)
src/main/config.ts (1 hunks)
src/main/conversation-service.ts (5 hunks)
src/main/llm-fetch.ts (6 hunks)
src/main/llm.ts (2 hunks)
src/main/tipc.ts (5 hunks)
src/renderer/src/components/agent-progress.tsx (3 hunks)
src/renderer/src/components/conversation-display.tsx (4 hunks)
src/renderer/src/components/text-input-panel.tsx (5 hunks)
src/renderer/src/contexts/conversation-context.tsx (1 hunks)
src/renderer/src/pages/panel.tsx (3 hunks)
src/shared/types.ts (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (6)

src/renderer/src/components/agent-progress.tsx (1)

src/shared/types.ts (1)

MessageContent (143-143)

src/renderer/src/components/text-input-panel.tsx (2)

src/renderer/src/contexts/theme-context.tsx (1)

useTheme (173-179)

src/renderer/src/lib/tipc-client.ts (1)

tipcClient (12-14)

src/renderer/src/components/conversation-display.tsx (2)

src/shared/types.ts (1)

MessageContent (143-143)

src/renderer/src/components/markdown-renderer.tsx (1)

MarkdownRenderer (102-228)

src/renderer/src/pages/panel.tsx (1)

src/renderer/src/lib/tipc-client.ts (1)

tipcClient (12-14)

src/main/conversation-service.ts (1)

src/shared/types.ts (1)

MessageContent (143-143)

src/main/tipc.ts (2)

src/main/conversation-service.ts (1)

conversationService (250-250)

src/main/config.ts (1)

configStore (148-148)

🪛 LanguageTool

PR209_FINAL_SUMMARY.md

[style] ~84-~84: Consider a different adjective to strengthen your wording.
Context: ...oot cause analysis - Explanation of the deeper issue discovered - Detailed solution wi...

(DEEP_PROFOUND)

🪛 markdownlint-cli2 (0.18.1)

PR209_FINAL_SUMMARY.md

77-77: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (8)

src/renderer/src/pages/panel.tsx (1)

158-193: LGTM! Clean parameter threading for screenshot support.

The addition of optional screenshotData parameter is properly threaded through the mutation chain from the UI handler to the backend call. The implementation maintains backward compatibility by making the parameter optional.

Also applies to: 352-375

src/main/conversation-service.ts (3)

75-79: LGTM! Proper MessageContent handling.

The extraction logic correctly handles both string and array content types with appropriate type guards and fallbacks.

175-204: LGTM! Proper multimodal content support in conversation creation.

The function correctly accepts MessageContent and handles title generation for both string and array content types with appropriate fallbacks.

206-236: LGTM! Clean signature update.

The function now correctly accepts MessageContent for the content parameter, enabling multimodal message support.

src/shared/types.ts (2)

138-143: LGTM! Well-designed multimodal content types.

The types follow OpenAI's multimodal message format with proper discriminated unions and optional detail control for image processing.

372-377: LGTM! Screenshot configuration fields are well-defined.

The configuration options provide appropriate control over screenshot capture (quality, format, dimensions) with sensible types.

src/main/tipc.ts (2)

103-105: LGTM! Proper text extraction from multimodal content.

The code correctly extracts text parts from MessageContent arrays for agent mode processing, filtering out image content appropriately.

634-641: The original review comment is incorrect. No fixes needed.

The frontend (text-input-panel.tsx) already constructs the complete data URI format (data:image/${result.format};base64,${result.data}) before passing screenshotData to the backend. By the time the data reaches line 634 in tipc.ts, input.screenshotData is already a properly formatted data URI like data:image/jpeg;base64,.... The backend code at lines 634–641 is correct and requires no changes.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-10-31T17:57:52Z

+  // Calculate tokens - handle both string and array content
+  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
+    if (typeof msg.content === 'string') {
+      return sum + msg.content.length
+    } else if (Array.isArray(msg.content)) {
+      return sum + msg.content.reduce((s, part) => {
+        if (part.type === 'text') return s + part.text.length
+        return s + 100 // Rough estimate for image tokens
+      }, 0)
+    }
+    return sum
+  }, 0) / 4)


⚠️ Potential issue | 🟠 Major

Improve image token estimation to prevent context overflow.

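Line 580 uses a rough estimate of 100 per image, and because the sum is divided by 4 afterwards, each image effectively counts as only 25 tokens. That is far below actual costs for vision models. Under OpenAI's published accounting for GPT-4V-class models, a low-detail image costs a flat 85 tokens, while a high-detail image costs 85 base tokens plus 170 tokens per 512×512 tile, so the total grows with resolution. For example:

A 512×512 image uses 85 tokens at low detail, or 255 at high detail (1 tile)

A 1920×1080 image uses ~1,105 tokens at high detail (scaled to roughly 1365×768, 6 tiles)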

This underestimation could lead to context limit issues when multiple images or long conversations are involved.

Consider implementing a more accurate estimation:

-        return s + 100 // Rough estimate for image tokens
+        // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
+        // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
+        return s + 500 // Conservative estimate for a typical screenshot

For a more precise calculation, you could compute tile count based on the image dimensions if available in the metadata.
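
A sketch of the tile-based calculation, assuming OpenAI's published rule for high-detail images: scale to fit within 2048×2048, scale down so the shortest side is at most 768, then charge 85 base tokens plus 170 per 512×512 tile. `estimateImageTokens` is a hypothetical name, and other providers count image tokens differently.

```typescript
// Assumed constants from OpenAI's published vision pricing rule.
const BASE_IMAGE_TOKENS = 85
const TOKENS_PER_TILE = 170

function estimateImageTokens(
  width: number,
  height: number,
  detail: "low" | "high" = "high",
): number {
  if (detail === "low") return BASE_IMAGE_TOKENS

  // Step 1: shrink (never enlarge) so the image fits within 2048x2048.
  const fit = Math.min(1, 2048 / Math.max(width, height))
  let w = width * fit
  let h = height * fit

  // Step 2: shrink (never enlarge) so the shortest side is at most 768.
  const shrink = Math.min(1, 768 / Math.min(w, h))
  w *= shrink
  h *= shrink

  // Step 3: count the 512x512 tiles needed to cover the scaled image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512)
  return BASE_IMAGE_TOKENS + TOKENS_PER_TILE * tiles
}
```

Since the surrounding estimator sums character counts and divides by 4, a value like this would need to be added after that division rather than inside the character sum.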

Suggested change

  // Calculate tokens - handle both string and array content
  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
    if (typeof msg.content === 'string') {
      return sum + msg.content.length
    } else if (Array.isArray(msg.content)) {
      return sum + msg.content.reduce((s, part) => {
        if (part.type === 'text') return s + part.text.length
        return s + 100 // Rough estimate for image tokens
      }, 0)
    }
    return sum
  }, 0) / 4)

  // Calculate tokens - handle both string and array content
  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
    if (typeof msg.content === 'string') {
      return sum + msg.content.length
    } else if (Array.isArray(msg.content)) {
      return sum + msg.content.reduce((s, part) => {
        if (part.type === 'text') return s + part.text.length
        // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
        // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
        return s + 500 // Conservative estimate for a typical screenshot
      }, 0)
    }
    return sum
  }, 0) / 4)

🤖 Prompt for AI Agents

In src/main/llm-fetch.ts around lines 573 to 584, the image token estimate uses a fixed 100 tokens per image which underestimates vision model costs; update the logic to check for image metadata (width, height) when msg.content contains image parts, compute number of tiles based on a tile size (e.g., 512x512) then multiply by a per-tile token cost range (use a conservative value like 85–170 or a single safer constant like 765 for high-res) to produce token count per image, fall back to the conservative per-image estimate when dimensions are missing, and keep the rest of the message-length calculation unchanged.

coderabbitai · 2025-10-31T17:57:53Z

+  // For multimodal content, extract text parts only for now
+  const prompt = messages.map((m) => {
+    let content = m.content
+    if (Array.isArray(content)) {
+      content = content.filter(p => p.type === 'text').map(p => p.text).join(' ')
+    }
+    return `${m.role}: ${content}`
+  }).join("\n\n")


⚠️ Potential issue | 🟠 Major

Document or implement image support for Gemini.

The code filters out image content and only sends text parts to Gemini (line 711: content.filter(p => p.type === 'text')). This means users who select Gemini as their LLM provider cannot use the screenshot feature, even though the UI allows them to capture screenshots.

This creates an inconsistent user experience where the feature appears to work but silently drops image content.

Please either:

Implement Gemini's multimodal API support (Gemini 1.5+ supports images)

Disable the screenshot checkbox in the UI when Gemini is selected

Display a warning to users that screenshots aren't supported with Gemini

Would you like me to help implement Gemini multimodal support? The Gemini API supports inline images in the inlineData format.
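
A rough sketch of the conversion this would involve, assuming the Gemini REST API's `parts` array with `text` and `inlineData` (`mimeType`, `data`) entries, and assuming screenshots arrive as base64 data URIs as described in this PR. `toGeminiParts` and `GeminiPart` are hypothetical names.

```typescript
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

// Assumed subset of Gemini's part shapes: text, or inline base64 data with a MIME type.
type GeminiPart = { text: string } | { inlineData: { mimeType: string; data: string } }

function toGeminiParts(content: string | MessageContentPart[]): GeminiPart[] {
  if (typeof content === "string") return [{ text: content }]

  const parts: GeminiPart[] = []
  for (const part of content) {
    if (part.type === "text") {
      parts.push({ text: part.text })
      continue
    }
    // Split "data:<mime>;base64,<payload>" into MIME type and raw base64 payload.
    const match = /^data:([^;,]+);base64,(.+)$/.exec(part.image_url.url)
    if (match) {
      parts.push({ inlineData: { mimeType: match[1], data: match[2] } })
    }
    // URLs that are not base64 data URIs are skipped in this sketch.
  }
  return parts
}
```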

coderabbitai · 2025-10-31T17:57:53Z

+  captureScreenshot: t.procedure.action(async () => {
+    try {
+      const config = configStore.get()
+
+      // Get all displays
+      const displays = screen.getAllDisplays()
+      const primaryDisplay = screen.getPrimaryDisplay()
+
+      // Capture screenshot from primary display
+      const sources = await desktopCapturer.getSources({
+        types: ['screen'],
+        thumbnailSize: {
+          width: primaryDisplay.size.width * primaryDisplay.scaleFactor,
+          height: primaryDisplay.size.height * primaryDisplay.scaleFactor
+        }
+      })
+
+      if (sources.length === 0) {
+        throw new Error('No screen sources available')
+      }
+
+      // Get the primary screen source
+      const primarySource = sources[0]
+      const thumbnail = primarySource.thumbnail
+
+      // Get image format and quality from config
+      const format = config.screenshotFormat || 'jpeg'
+      const quality = config.screenshotQuality || 0.8
+      const maxWidth = config.screenshotMaxWidth || 1920
+      const maxHeight = config.screenshotMaxHeight || 1080
+
+      // Resize if needed
+      let finalImage = thumbnail
+      const size = thumbnail.getSize()
+      if (size.width > maxWidth || size.height > maxHeight) {
+        const scale = Math.min(maxWidth / size.width, maxHeight / size.height)
+        const newWidth = Math.floor(size.width * scale)
+        const newHeight = Math.floor(size.height * scale)
+        finalImage = thumbnail.resize({ width: newWidth, height: newHeight })
+      }
+
+      // Convert to base64
+      let base64Data: string
+      if (format === 'png') {
+        base64Data = finalImage.toPNG().toString('base64')
+      } else {
+        base64Data = finalImage.toJPEG(Math.floor(quality * 100)).toString('base64')
+      }
+
+      return {
+        data: base64Data,
+        format,
+        width: finalImage.getSize().width,
+        height: finalImage.getSize().height
+      }
+    } catch (error) {
+      throw new Error(`Failed to capture screenshot: ${error instanceof Error ? error.message : String(error)}`)
+    }
+  }),


🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

LGTM! Solid screenshot capture implementation with minor privacy consideration.

The implementation properly:

Captures from primary display with scale factor handling

Resizes images based on configuration to control size

Encodes to base64 with configurable format and quality

Includes comprehensive error handling
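
The resize step from the handler above can be isolated as a pure function, which makes the aspect-ratio logic easy to check on its own. `fitWithin` and `Size` are hypothetical names; the arithmetic mirrors the diff (uniform scale, `Math.floor` on both dimensions).

```typescript
interface Size {
  width: number
  height: number
}

// Scale down uniformly so the image fits inside maxWidth x maxHeight.
// Images that already fit are returned unchanged (no upscaling).
function fitWithin(size: Size, maxWidth: number, maxHeight: number): Size {
  if (size.width <= maxWidth && size.height <= maxHeight) return size
  const scale = Math.min(maxWidth / size.width, maxHeight / size.height)
  return {
    width: Math.floor(size.width * scale),
    height: Math.floor(size.height * scale),
  }
}
```

Because of `Math.floor` and floating-point rounding, a result can land one pixel below the configured maximum for some input sizes.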

Consider whether users should be notified when screenshots are captured, especially since this happens automatically when the checkbox is enabled. On macOS, screen recording permissions are required, but an in-app notification or indicator might improve transparency.

🌐 Web query:

What are best practices for user consent and notification when capturing screenshots in desktop applications?

💡 Result:

Short checklist — best practices for consent & notification when capturing screenshots/screen recordings in desktop apps:

Obtain explicit, informed consent before enabling capture (opt-in; describe what is captured, purpose, retention, sharing and how to revoke). [1][6][9]

Show a persistent, clear visual indicator while capture is active (icon/overlay/LED notification) so users know they’re being recorded. Use OS APIs when available to detect capture and respond. [2][6]

Provide an easy opt-out and a simple way to stop/pause capture from the UI (and honor revocation immediately). Document this in the app and privacy policy. [1][2][9]

Minimize what you capture: mask or exclude sensitive fields (passwords, payment info, personal data) and offer app‑/window‑level or region filters. Default to conservative masking. [9][10]

Localize consent flows and disclosures to applicable law/region (GDPR, CCPA/state laws, workplace/employee‑monitoring rules) and record consent events for auditability. [4][9]

Limit retention, restrict access, encrypt stored captures, and log access/exports; publish retention and data‑access policies. [9][10]

Avoid third‑party analytics that capture raw screens without explicit consent; if used, disclose third parties and contractual protections in the privacy policy. [1][9]

Provide transparency: in‑app settings, an accessible privacy policy, and a “view/delete your captures” control for users. [1][3][10]

Use contextual notices (first run, before a specific session, or when capture state changes) rather than hidden or buried disclosures; record the timestamped notice and user response. [1][6]

Test UX and edge cases (multi‑monitor, remote desktop, screen‑mirroring, headless sessions) and handle system capture events (pause capture, warn, or black out content) appropriately. [6][3]

Key references:

Apple App Store & developer guidance on explicit consent and visible indicators. [1][2]

Apple technical guidance (UIScreen.isCaptured / UIScreenCapturedDidChange) for responding to capture. [6]

Microsoft Recall / Windows guidance on snapshot filtering, user controls and local processing. [3][7]

GDPR/Session‑replay vendor guidance on masking, consent, retention and compliance. [9][10]

Employer/employee monitoring legality overview and state differences (U.S.). [4]

If you want, I can convert this into a short consent UI copy + privacy-policy snippet tailored to your app and target jurisdictions.

I'll verify the codebase to understand how the screenshot feature is integrated and whether any consent or notification mechanisms already exist. The scripts check where captureScreenshot is called, whether a UI component enables or disables screenshot capture, and whether any configuration or privacy UI exists.

Based on my research, I can now provide the rewritten review comment. The web search revealed comprehensive best practices, and the Electron documentation confirms that desktopCapturer on macOS requires user permissions, with Electron checking permissions and rejecting the promise if not granted. However, the code snippet itself doesn't show any user-facing consent UI, visual indicators, or notification mechanisms.

Add user notification and consent controls for screenshot capture.

The implementation correctly handles the technical capture, but it does not yet follow best practices for screenshot capture in desktop applications: webcam, microphone, and screen recording functionality presents a serious risk to users when approval is granted by default.

Consider adding:

Explicit opt-in consent dialog before first use or when the feature is enabled

Persistent visual indicator (icon/overlay) while capture is active

Easy opt-out control in settings with immediate revocation

Privacy policy disclosure explaining data retention, access, and deletion controls

Optional sensitive data masking (passwords, payment info)

Contextual notice on configuration changes

For macOS, desktopCapturer.getSources() triggers a system permission prompt once, which is stored in system preferences, providing OS-level protection. However, in-app transparency and user controls enhance trust.
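One way to surface the OS-level permission state inside the app is to map it to a capture decision and a user-facing hint. The sketch below is an assumption-laden illustration: the status values mirror those documented for Electron's `systemPreferences.getMediaAccessStatus("screen")`, but `evaluateScreenAccess` and `CaptureGate` are hypothetical names, and the status is passed in as a parameter so the logic stays independent of Electron.

```typescript
// Status values as documented for Electron's
// systemPreferences.getMediaAccessStatus("screen").
type ScreenAccessStatus =
  | "not-determined"
  | "granted"
  | "denied"
  | "restricted"
  | "unknown"

// Hypothetical result type: whether to attempt capture and what to tell the user.
interface CaptureGate {
  allowed: boolean
  message: string
}

// Hypothetical helper: decide whether capture should proceed on this platform.
function evaluateScreenAccess(platform: string, status: ScreenAccessStatus): CaptureGate {
  // This sketch only gates capture on macOS ("darwin").
  if (platform !== "darwin") {
    return { allowed: true, message: "" }
  }
  switch (status) {
    case "granted":
      return { allowed: true, message: "" }
    case "not-determined":
      // Let the capture attempt go ahead so the system prompt can appear.
      return { allowed: true, message: "macOS may ask for Screen Recording permission." }
    case "denied":
      return {
        allowed: false,
        message: "Enable Screen Recording in System Settings → Privacy & Security.",
      }
    default:
      // "restricted" and "unknown": do not capture, but explain why.
      return { allowed: false, message: "Screen capture is not available on this system." }
  }
}

console.log(evaluateScreenAccess("darwin", "denied").allowed) // false
```

In the main process, the `status` argument would presumably come from the Electron call named above, and the `message` could feed an in-app notice.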

🏁 Script executed:

#!/bin/bash
# Find where captureScreenshot is called
rg -n "captureScreenshot" --type ts --type js -C 3

Length of output: 1643

🏁 Script executed:

#!/bin/bash
# Search for config related to screenshot functionality
rg -n "screenshot" --type ts --type js -C 2 | head -100

Length of output: 7245

🏁 Script executed:

#!/bin/bash
# Search for privacy settings or consent dialogs
rg -n -i "consent\|notification\|permission\|privacy" --type ts --type js | grep -i "screenshot\|capture"

Length of output: 0

🏁 Script executed:

#!/bin/bash
# Find the main application structure
fd -type f -name "*.ts" -o -name "*.tsx" | head -20

Length of output: 170

🏁 Script executed:

#!/bin/bash
# Look for UI components related to settings or privacy
fd -type f $ -name "*setting*" -o -name "*privacy*" -o -name "*config*" $ -name "*.ts*"

Length of output: 170

Add explicit consent and visual feedback for automatic screenshot capture.

The implementation technically works, but automatic capture when the checkbox is enabled lacks the user notifications recommended by platform guidelines and privacy best practices.

Currently:

User checks "Include Screenshot" checkbox

Screenshot is immediately captured with no warning or visual indicator

Preview is shown, but user may not realize automatic capture occurred

Add:

First-use consent dialog explaining screenshot capture, data handling, and how to disable

Visual indicator (icon/spinner) during capture so users know activity is occurring

Brief in-app notice when the screenshot checkbox auto-captures: "Capturing screenshot…"

Privacy policy section disclosing: what is captured, retention period, and where data is sent

Optional: masking/redaction UI for sensitive fields (passwords, payment info)

macOS system permissions provide underlying protection, but in-app transparency and explicit consent improve user trust and regulatory compliance (GDPR, CCPA).
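The consent and feedback items above could be combined into a single wrapper around the capture call. The following is a rough sketch under stated assumptions, not the project's implementation: `ConsentStore`, `InMemoryConsentStore`, `captureWithConsent`, and the event names are all hypothetical, and the dialog, the persistence layer, and the capture function are injected so the flow can be read without any Electron or config-store details.

```typescript
// Hypothetical persistence interface for the user's consent decision.
interface ConsentStore {
  get(): boolean | undefined
  set(value: boolean): void
}

// Simple in-memory stand-in for a persisted config store (illustration only).
class InMemoryConsentStore implements ConsentStore {
  private value: boolean | undefined
  get(): boolean | undefined {
    return this.value
  }
  set(value: boolean): void {
    this.value = value
  }
}

// Hypothetical event names the UI could use to show and hide an indicator.
type CaptureEvent = "capture-started" | "capture-finished"

async function captureWithConsent(
  store: ConsentStore,
  askUser: () => Promise<boolean>, // e.g. a first-use consent dialog
  capture: () => Promise<string>, // returns the encoded screenshot
  notify: (event: CaptureEvent) => void,
): Promise<string> {
  let consent = store.get()
  if (consent === undefined) {
    // First use: ask once and persist the answer either way.
    consent = await askUser()
    store.set(consent)
  }
  if (!consent) {
    // Abort with a clear error when consent is denied.
    throw new Error("Screenshot capture declined by user")
  }
  notify("capture-started")
  try {
    return await capture()
  } finally {
    // Always clear the indicator, even if capture fails.
    notify("capture-finished")
  }
}
```

Injecting `askUser` and `notify` keeps UI work out of the capture procedure itself, which matches the suggestion that UI interactions go through IPC rather than happening in the main process.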

🤖 Prompt for AI Agents

In src/main/tipc.ts around lines 914-972, the captureScreenshot procedure triggers automatic screenshots without user consent or visible feedback; add first-use explicit consent, runtime visual feedback, and privacy disclosures: check a persisted consent flag in configStore before capturing and if missing open a modal dialog (blocking auto-capture) that explains what is captured, retention, where data is sent, and how to opt-out, then persist the user’s choice; emit IPC/UI events before/after capture to show a brief in-app notice and a visual indicator (icon/spinner) during capture; include a configurable privacyPolicy URL and retention metadata in the config returned with the screenshot and surface a link to the policy in the modal; optionally add a flag to enable masking/redaction options (expose API to request masking before capture). Ensure the procedure aborts with a clear error when consent is denied and that all new UI interactions are driven via existing IPC channels rather than doing UI work in this main process.

- Fixed screenshot data being stripped in agent mode pipeline
- Updated processTranscriptWithAgentMode to accept MessageContent (string or multimodal array)
- Updated context-budget.ts to handle multimodal content (text + images)
- Added extractTextFromContent helper to safely extract text from multimodal messages
- Enhanced debug logging to track multimodal content through the pipeline

Permission Detection & UI:
- Added getScreenCaptureStatus, requestScreenCaptureAccess, and openScreenCaptureInSystemPreferences to tipc.ts
- Added permission status display in Settings → General → Screenshot / Multimodal
- Shows green checkmark when permission granted, amber warning when missing
- Added 'Open System Settings' button to guide users to grant permission
- Auto-refreshes permission status every 2 seconds

Error Handling:
- Added helpful toast notifications when screenshot capture fails
- Detects macOS permission errors and shows targeted guidance
- Guides users to System Settings → Privacy & Security → Screen Recording

This ensures screenshots are properly sent to multimodal LLMs (GPT-4V, Claude with vision, Gemini 2.5 Flash, etc.) and users are guided through the permission setup process.
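For readers following the commit message, a helper like extractTextFromContent might look roughly as follows. The types are copied from the PR description; the function body is an assumption about the behavior (the actual implementation is not shown here), and joining text parts with a newline is an arbitrary choice for this sketch.

```typescript
// Types as given in the PR description (src/shared/types.ts).
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

// Assumed behavior: return plain strings unchanged, and for multimodal arrays
// keep only the text parts, dropping image parts entirely.
function extractTextFromContent(content: MessageContent): string {
  if (typeof content === "string") {
    return content
  }
  return content
    .filter((part): part is Extract<MessageContentPart, { type: "text" }> => part.type === "text")
    .map((part) => part.text)
    .join("\n")
}

const example: MessageContent = [
  { type: "text", text: "What is on my screen?" },
  { type: "image_url", image_url: { url: "data:image/png;base64,AAAA" } },
]
console.log(extractTextFromContent(example)) // "What is on my screen?"
```

A helper of this shape lets text-only code paths, such as display components or text-based token estimates, accept MessageContent without handling image parts themselves.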

- Add automatic screenshot capture for voice input in MCP mode when screenshot setting is enabled
- Implement dynamic window resizing when screenshot preview is shown/hidden in text input panel
- Update createMcpRecording to accept optional screenshotData parameter for multimodal content
- Add useEffect hook to automatically resize panel window based on screenshot preview state
- Ensure error-resilient screenshot capture that doesn't break voice input flow
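The "error-resilient" capture described in this commit could be sketched as a wrapper that swallows capture failures and falls back to text-only content. Everything here is hypothetical naming for illustration (`captureScreenshotSafely`, `buildMessageContent`); the message types come from the PR description, and the sketch assumes screenshotData is a data URL string.

```typescript
// Types as given in the PR description (src/shared/types.ts).
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

// Hypothetical wrapper: never throws, so a failed capture cannot break voice input.
async function captureScreenshotSafely(
  enabled: boolean,
  capture: () => Promise<string>,
  onError: (error: unknown) => void = () => {},
): Promise<string | undefined> {
  if (!enabled) {
    return undefined
  }
  try {
    return await capture()
  } catch (error) {
    // Report the failure (e.g. for a toast) but continue without a screenshot.
    onError(error)
    return undefined
  }
}

// Hypothetical helper: plain string when there is no screenshot,
// multimodal array when there is one.
function buildMessageContent(text: string, screenshotData?: string): MessageContent {
  if (!screenshotData) {
    return text
  }
  return [
    { type: "text", text },
    { type: "image_url", image_url: { url: screenshotData } },
  ]
}
```

With this shape, the voice flow can always call `buildMessageContent(transcript, screenshot)` and get valid content whether or not the capture succeeded.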

augmentcode Bot reviewed Oct 31, 2025


coderabbitai Bot reviewed Oct 31, 2025


aj47 added 2 commits November 1, 2025 10:18

aj47 wants to merge 3 commits into main from feature/screenshot-multimodal-217
