feat: Add screenshot functionality for multimodal AI input #219

Open
aj47 wants to merge 3 commits into main from feature/screenshot-multimodal-217

Conversation

aj47 commented Oct 31, 2025
Owner

Closes #217

Summary

This PR adds screenshot functionality to SpeakMCP, allowing users to capture and send screenshots along with text input to multimodal AI models.

Changes

Core Features

  • Screenshot Capture: Implemented using Electron's desktopCapturer API
  • UI Checkbox: Added checkbox in text input panel to enable screenshot capture
  • Configuration: Added screenshot settings (quality, format, max dimensions)
  • Multimodal Support: Updated message types to support both text and images

Technical Implementation

Backend Changes

  • Added captureScreenshot TIPC handler in src/main/tipc.ts
    • Captures primary display screenshot
    • Resizes based on config (max width/height)
    • Encodes to base64 with configurable format (PNG/JPEG) and quality
  • Updated createMcpTextInput to accept screenshotData parameter
  • Modified LLM API calls to handle multimodal content:
    • makeOpenAICompatibleCall now accepts content: any (string or array)
    • makeGeminiCall extracts text from multimodal content
    • Updated token estimation to handle image content
  • Updated conversation service to handle MessageContent type

Frontend Changes

  • Enhanced TextInputPanel component:
    • Added screenshot checkbox (only shown when screenshotEnabled in config)
    • Added screenshot preview with remove button
    • Captures screenshot when checkbox is checked
    • Passes screenshot data to submit handler
  • Updated panel.tsx to pass screenshot data through mutation
  • Added helper functions to extract text from MessageContent in:
    • conversation-display.tsx
    • agent-progress.tsx
    • conversation-context.tsx

Type System

  • Added multimodal content types in src/shared/types.ts:
    export type MessageContentPart =
      | { type: "text"; text: string }
      | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }
    
    export type MessageContent = string | MessageContentPart[]
  • Updated ConversationMessage to use MessageContent instead of string
  • Added screenshot configuration options to Config type
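
A minimal sketch of how a text-extraction helper over these types might look. The PR names such a helper (extractTextFromContent), but its body is not shown here, so the implementation below is an assumption rather than the code in the diff:

```typescript
// Types as listed in this PR description (src/shared/types.ts).
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

type TextPart = Extract<MessageContentPart, { type: "text" }>

// Assumed implementation: collapse MessageContent to plain text, ignoring image parts.
function extractTextFromContent(content: MessageContent): string {
  if (typeof content === "string") return content
  return content
    .filter((part): part is TextPart => part.type === "text")
    .map((part) => part.text)
    .join(" ")
}
```

Callers that expect a string (TTS, previews, clipboard) could normalize once with this kind of helper instead of branching on the union at every use site.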

Configuration

  • Added default screenshot settings in src/main/config.ts:
    • screenshotEnabled: true
    • screenshotQuality: 0.8
    • screenshotFormat: "jpeg"
    • screenshotMaxWidth: 1920
    • screenshotMaxHeight: 1080
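
As a rough illustration of how the max-dimension settings could be applied, the sketch below scales a capture down to fit within the configured bounds while keeping its aspect ratio. The fitWithin name and Size type are hypothetical and used only for illustration:

```typescript
interface Size {
  width: number
  height: number
}

// Hypothetical helper: scale `size` down (never up) so it fits within
// maxWidth x maxHeight, preserving the aspect ratio.
function fitWithin(size: Size, maxWidth: number, maxHeight: number): Size {
  if (size.width <= maxWidth && size.height <= maxHeight) return size
  const scale = Math.min(maxWidth / size.width, maxHeight / size.height)
  return {
    width: Math.floor(size.width * scale),
    height: Math.floor(size.height * scale),
  }
}

// A 4K capture with the default limits (1920 x 1080) is halved in both dimensions.
const resized = fitWithin({ width: 3840, height: 2160 }, 1920, 1080)
```

Captures that already fit within the limits are returned unchanged, so small displays are never upscaled.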

API Format

Screenshots are sent to AI models using the OpenAI-compatible multimodal format:

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ]
}
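
A hedged sketch of how the content field above might be assembled from text and an optional screenshot. buildUserContent is a hypothetical name, and the data URI in the usage line is a placeholder:

```typescript
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

// Hypothetical helper: keep plain strings for text-only input so existing
// string-based messages are unchanged, and switch to the array form only
// when a screenshot data URI is present.
function buildUserContent(text: string, screenshotDataUri?: string): MessageContent {
  if (!screenshotDataUri) return text
  return [
    { type: "text", text },
    { type: "image_url", image_url: { url: screenshotDataUri } },
  ]
}

// Usage: a request body in the shape shown above (placeholder image data).
const requestBody = {
  messages: [
    { role: "user", content: buildUserContent("What's in this image?", "data:image/jpeg;base64,AAAA") },
  ],
}
```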

Backward Compatibility

  • All changes are backward compatible
  • MessageContent is a union type (string | MessageContentPart[])
  • Existing string-based messages continue to work
  • Screenshot feature is opt-in via checkbox

Testing

  • TypeScript compilation passes (except pre-existing @fastify/cors issue)
  • All multimodal type errors resolved
  • Ready for end-to-end testing with multimodal models (GPT-4V, Claude, etc.)

Notes

  • The @fastify/cors type error is a pre-existing issue unrelated to this PR
  • Screenshot capture uses the primary display
  • Images are base64-encoded for transmission
  • Gemini API currently only uses text parts (image support can be added later)

Pull Request opened by Augment Code with guidance from the PR author

Summary by CodeRabbit

Release Notes

  • New Features

    • Added screenshot capture and attachment functionality for messages
    • Enabled multimodal message support combining text and images
    • New screenshot configuration options for quality, format, and dimensions
  • Bug Fixes

    • Improved tool execution stability with consistent ID generation
    • Fixed UI state preservation across message updates
  • Chores

    • Updated configuration schema to support screenshot settings

- Add screenshot capture using Electron's desktopCapturer API
- Add screenshot checkbox in text input UI
- Add screenshot configuration options (quality, format, max dimensions)
- Update message types to support multimodal content (text + images)
- Update LLM API calls to handle multimodal content
- Update conversation service to handle MessageContent type
- Add helper functions to extract text from multimodal content

Implements multimodal support for sending screenshots to AI models
via OpenAI-compatible APIs using base64-encoded images.
coderabbitai Bot commented Oct 31, 2025

Warning

Rate limit exceeded

@aj47 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 1 second before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between de52278 and 7e77673.

📒 Files selected for processing (9)
  • src/main/context-budget.ts (2 hunks)
  • src/main/llm-fetch.ts (7 hunks)
  • src/main/llm.ts (14 hunks)
  • src/main/tipc.ts (12 hunks)
  • src/renderer/src/components/text-input-panel.tsx (5 hunks)
  • src/renderer/src/hooks/use-input-processing.ts (3 hunks)
  • src/renderer/src/lib/query-types.ts (1 hunks)
  • src/renderer/src/pages/panel.tsx (7 hunks)
  • src/renderer/src/pages/settings-general.tsx (2 hunks)

Walkthrough

This PR adds multimodal screenshot support and improves message content handling across the application. It introduces screenshot capture functionality with UI controls, updates message content types to support structured multimodal payloads (text and images), and implements stable content-based IDs for tool executions. The changes span backend (LLM, configuration, conversation service) and frontend (text input, agent progress rendering) layers.

Changes

Cohort / File(s) Change Summary
Shared Type Definitions
src/shared/types.ts
Added new MessageContentPart and MessageContent types to support multimodal content (text and image_url blocks). Updated ConversationMessage.content and AgentProgressUpdate.conversationHistory[].content from string to MessageContent. Extended Config interface with screenshot settings (screenshotEnabled, screenshotQuality, screenshotFormat, screenshotMaxWidth, screenshotMaxHeight).
Backend Configuration & Service
src/main/config.ts, src/main/conversation-service.ts
Added screenshot configuration fields to defaultConfig. Updated createConversation() and addMessageToConversation() method signatures to accept MessageContent instead of string. Added runtime logic to extract text from structured content arrays and handle multimodal payloads.
Backend LLM & Fetch
src/main/llm.ts, src/main/llm-fetch.ts
Broadened content typing in makeLLMCall(), makeOpenAICompatibleCall(), makeGeminiCall(), makeLLMCallAttempt(), and makeLLMCallWithFetch() to accept any content type. Enhanced token estimation to handle both string and structured content parts. Updated Gemini prompt construction to extract only text parts from multimodal content.
Backend IPC & Screenshot Integration
src/main/tipc.ts
Added captureScreenshot() public method to capture base64-encoded screenshots. Extended createMcpTextInput() signature with optional screenshotData parameter. Implemented multimodal message content assembly when screenshot is provided. Updated agent-mode history normalization to handle structured content arrays.
Frontend Components
src/renderer/src/components/agent-progress.tsx, src/renderer/src/components/conversation-display.tsx
Introduced extractTextFromContent() helper function to normalize MessageContent to plain text. Applied text extraction in message filtering and rendering paths (MarkdownRenderer, AudioPlayer, compact display) to ensure consistent text handling across component tree.
Frontend Input & State Management
src/renderer/src/components/text-input-panel.tsx, src/renderer/src/pages/panel.tsx
Updated TextInputPanelProps.onSubmit signature to accept optional screenshotData parameter. Added configuration-driven screenshot UI checkbox, capture logic, and preview area with remove functionality. Extended handleTextSubmit() to accept and forward screenshotData through MCP mutation.
Frontend Context
src/renderer/src/contexts/conversation-context.tsx
Updated conversationHistory entry content type from string to any to accommodate structured MessageContent.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant TextInput as Text Input Panel
    participant Main as Main Process (IPC)
    participant ConvService as Conversation Service
    participant LLM as LLM Handler
    participant Renderer as Agent Progress Renderer

    User->>TextInput: Enters text + enables screenshot
    TextInput->>Main: captureScreenshot()
    Main-->>TextInput: base64 screenshot data
    User->>TextInput: Submit
    TextInput->>Main: createMcpTextInput({ text, screenshotData })
    alt Screenshot provided
        Main->>ConvService: addMessageToConversation(content: [text, image_url])
    else No screenshot
        Main->>ConvService: addMessageToConversation(content: text)
    end
    ConvService->>LLM: makeLLMCall(messages with MessageContent)
    LLM->>LLM: generateToolExecutionId(toolCall)
    LLM-->>Renderer: AgentProgressUpdate with multimodal content
    Renderer->>Renderer: extractTextFromContent(content)
    Renderer-->>User: Display text + images

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Type signature changes across multiple layers: The shift from string to MessageContent affects public methods in conversation service, LLM functions, and component props. Each requires verification that callers and implementations handle both union types correctly.
  • Multimodal content extraction logic: The extractTextFromContent() pattern is repeated across components but not universally applied. Verify all rendering paths that consume message content have been updated.
  • Screenshot capture and IPC flow: The new captureScreenshot() method and screenshot data threading through MCP mutations introduces new async/error handling paths that need careful validation.
  • Stable ID generation: The content-hash-based ID generation in generateToolExecutionId() is critical for UI stability but the hashing approach should be reviewed for collision and performance implications.
  • Conversation history normalization: Agent-mode history handling performs string extraction from potentially structured content—ensure fallbacks and edge cases (empty arrays, non-text parts) are handled safely.
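
For reviewers weighing the collision and performance point, here is one way a content-hash-based ID could be derived. This is a sketch only: the real generateToolExecutionId() is not shown in this thread, and the ToolCallLike shape, the stableToolExecutionId name, and the choice of 32-bit FNV-1a are all assumptions.

```typescript
// Assumed shape of a tool call; the actual type in the codebase may differ.
interface ToolCallLike {
  name: string
  arguments: Record<string, unknown>
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, "0")
}

// Hypothetical ID scheme: the same tool name and arguments always map to the same ID.
function stableToolExecutionId(toolCall: ToolCallLike): string {
  return `tool-${fnv1a32(`${toolCall.name}:${JSON.stringify(toolCall.arguments)}`)}`
}
```

Two caveats with a scheme like this: JSON.stringify is sensitive to key order, so logically equal arguments with different key order hash differently, and a 32-bit hash can collide once many tool calls are in play.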

Possibly related PRs

  • #209: Modifies agent-progress.tsx expansion state lifting and introduces generateToolExecutionId() for stable tool execution rendering
  • #202: Updates agent tool-execution flow and UI rendering in agent-progress.tsx and src/main/llm.ts

Suggested labels

augment_review

Poem

🐰 A screenshot captured, content now flows,
Multimodal messages—no more plain text prose!
With stable IDs and structured arrays so fine,
The agent sees images and text intertwine!
From backend to UI, the data takes flight,
Together they render a multimodal sight! 📸✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: Add screenshot functionality for multimodal AI input" is clear, concise, and directly reflects the main purpose of the changeset. It accurately summarizes the primary feature being introduced—screenshot capture capability for multimodal AI model input—and uses conventional commit format conventions. The title is specific enough that reviewers scanning the history would immediately understand the key change without excessive detail.
Linked Issues Check ✅ Passed The pull request successfully addresses all primary coding-related objectives from issue #217. The implementation includes screenshot capture via Electron's desktopCapturer with the new captureScreenshot() handler [tipc.ts], UI checkbox in the text input panel [text-input-panel.tsx] conditional on configuration, and screenshot configuration options [config.ts] for quality, format, and dimensions. Multimodal model integration is implemented through updated content typing in LLM call functions to accept structured content arrays [llm-fetch.ts, llm.ts], proper base64 encoding of images in data URIs [tipc.ts], and support for OpenAI-compatible multimodal message formats. The type system now supports multimodal messages through new MessageContent and MessageContentPart types [types.ts] and updated interfaces for conversations and agent progress. All acceptance criteria are addressed: users can capture and attach screenshots via UI checkbox, screenshot behavior is configurable, screenshot data is properly formatted for multimodal models, and the implementation maintains backward compatibility with string-based messages.
Out of Scope Changes Check ✅ Passed All documented changes in the pull request are directly aligned with the requirements of issue #217. The modifications span configuration management [config.ts], type definitions [types.ts], backend screenshot capture and LLM integration [tipc.ts, llm-fetch.ts, llm.ts], frontend UI components [text-input-panel.tsx, panel.tsx], and supporting display/context updates to handle the new multimodal MessageContent type [conversation-display.tsx, agent-progress.tsx, conversation-context.tsx]. The text extraction helpers and type updates are necessary supporting changes for proper handling of multimodal content throughout the application. No changes appear to be unrelated to the screenshot functionality or multimodal message support objectives.


aj47 commented Oct 31, 2025
Owner Author

augment review

augmentcode Bot left a comment

Review completed. 1 suggestion posted.

Comment thread src/main/tipc.ts

// Get the primary screen source
const primarySource = sources[0]
const thumbnail = primarySource.thumbnail

Choosing the primary screen with sources[0] is unreliable—desktopCapturer.getSources() does not guarantee ordering. Consider matching the source to screen.getPrimaryDisplay() (e.g., via display_id) to ensure the correct monitor is captured, especially on multi‑monitor setups.
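
A minimal sketch of the matching suggested above. The SourceLike and DisplayLike types are assumed to mirror the relevant fields of Electron's DesktopCapturerSource (display_id, a string that may be empty) and Display (id, a number); pickPrimarySource is a hypothetical helper name:

```typescript
// Structural stand-ins for the Electron types, so the logic is testable on its own.
interface SourceLike {
  id: string
  display_id: string
}

interface DisplayLike {
  id: number
}

// Prefer the source whose display_id matches the primary display. Fall back to
// the first source when no match exists, for example when display_id is empty.
function pickPrimarySource<T extends SourceLike>(sources: T[], primary: DisplayLike): T | undefined {
  const match = sources.find((source) => source.display_id === String(primary.id))
  return match ?? sources[0]
}
```

In the handler this would be called with the result of desktopCapturer.getSources() and screen.getPrimaryDisplay(); the fallback keeps the current sources[0] behavior on platforms where display_id is not populated.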

augmentcode Bot left a comment

Review completed. 1 suggestion posted.

screenshotData,
})
} else {
textInputMutation.mutate({ text })

Screenshot data is only passed through the MCP path; when MCP is disabled this else branch calls textInputMutation.mutate({ text }), dropping the screenshot entirely. Consider forwarding screenshotData (and updating the non-MCP backend path) so the feature works consistently without MCP (also applies to the fallback below).

coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/renderer/src/components/conversation-display.tsx (1)

92-104: Convert MessageContent to plain text before downstream use.

message.content is now MessageContent (string | MessageContentPart[]). When it’s an array (text + image), the current calls send an object/array to generateSpeech and the context menu IPC handler, which expect strings. In prod this blows up TTS requests (invalid payload / [object Object]) and the context-menu copy path. Let’s normalize once and reuse.

@@
-  const generateAudio = async (): Promise<ArrayBuffer> => {
+  const textContent = extractTextFromContent(message.content)
+
+  const generateAudio = async (): Promise<ArrayBuffer> => {
     if (!configQuery.data?.ttsEnabled) {
       throw new Error("TTS is not enabled")
     }
@@
-      const result = await tipcClient.generateSpeech({
-        text: message.content,
+      const result = await tipcClient.generateSpeech({
+        text: textContent,
       })
@@
-    tipcClient.showContextMenu({
+    tipcClient.showContextMenu({
       x: e.clientX,
       y: e.clientY,
       messageContext: {
-        content: message.content,
+        content: textContent,
         role: message.role,
         messageId: message.id,
       },
     })
   }
@@
-          <MarkdownRenderer content={extractTextFromContent(message.content)} />
+          <MarkdownRenderer content={textContent} />
@@
-              text={extractTextFromContent(message.content)}
+              text={textContent}

Also applies to: 169-181

src/main/conversation-service.ts (1)

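91-98: Fix preview bug: generatePreview doesn't handle MessageContent arrays.

Line 95 uses msg.content.slice(0, 100), which assumes content is always a string. However, ConversationMessage.content is now MessageContent (string | array). When content is an array, the call does not throw, because arrays also have slice(), but it returns message parts rather than characters, so the preview renders as "[object Object]" instead of the message text.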

Apply this diff to extract text from MessageContent:

  private generatePreview(messages: ConversationMessage[]): string {
    // Generate a preview from the first few messages
    const previewMessages = messages.slice(0, 3)
    const preview = previewMessages
-     .map((msg) => `${msg.role}: ${msg.content.slice(0, 100)}`)
+     .map((msg) => {
+       const content = typeof msg.content === 'string'
+         ? msg.content
+         : msg.content.filter(p => p.type === 'text').map(p => p.text).join(' ')
+       return `${msg.role}: ${content.slice(0, 100)}`
+     })
      .join(" | ")
    return preview.length > 200 ? `${preview.slice(0, 200)}...` : preview
  }
src/main/tipc.ts (1)

634-697: Screenshot context is lost when passed to agent mode processing.

The review comment is correct. The code saves the multimodal messageContent with the screenshot to the conversation (lines 643-656), but processWithAgentMode only passes text to processTranscriptWithAgentMode (line 661). Additionally, when loading previousConversationHistory, the most recent message is excluded (slice(0, -1)), so the current user's screenshot never reaches the LLM for analysis.

To fix this, pass messageContent instead of text to processWithAgentMode, and modify processWithAgentMode to accept and forward the multimodal content to processTranscriptWithAgentMode.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 881512f and de52278.

📒 Files selected for processing (12)
  • PR209_FINAL_SUMMARY.md (1 hunks)
  • src/main/config.ts (1 hunks)
  • src/main/conversation-service.ts (5 hunks)
  • src/main/llm-fetch.ts (6 hunks)
  • src/main/llm.ts (2 hunks)
  • src/main/tipc.ts (5 hunks)
  • src/renderer/src/components/agent-progress.tsx (3 hunks)
  • src/renderer/src/components/conversation-display.tsx (4 hunks)
  • src/renderer/src/components/text-input-panel.tsx (5 hunks)
  • src/renderer/src/contexts/conversation-context.tsx (1 hunks)
  • src/renderer/src/pages/panel.tsx (3 hunks)
  • src/shared/types.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/renderer/src/components/agent-progress.tsx (1)
src/shared/types.ts (1)
  • MessageContent (143-143)
src/renderer/src/components/text-input-panel.tsx (2)
src/renderer/src/contexts/theme-context.tsx (1)
  • useTheme (173-179)
src/renderer/src/lib/tipc-client.ts (1)
  • tipcClient (12-14)
src/renderer/src/components/conversation-display.tsx (2)
src/shared/types.ts (1)
  • MessageContent (143-143)
src/renderer/src/components/markdown-renderer.tsx (1)
  • MarkdownRenderer (102-228)
src/renderer/src/pages/panel.tsx (1)
src/renderer/src/lib/tipc-client.ts (1)
  • tipcClient (12-14)
src/main/conversation-service.ts (1)
src/shared/types.ts (1)
  • MessageContent (143-143)
src/main/tipc.ts (2)
src/main/conversation-service.ts (1)
  • conversationService (250-250)
src/main/config.ts (1)
  • configStore (148-148)
🪛 LanguageTool
PR209_FINAL_SUMMARY.md

[style] ~84-~84: Consider a different adjective to strengthen your wording.
Context: ...oot cause analysis - Explanation of the deeper issue discovered - Detailed solution wi...

(DEEP_PROFOUND)

🪛 markdownlint-cli2 (0.18.1)
PR209_FINAL_SUMMARY.md

77-77: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (8)
src/renderer/src/pages/panel.tsx (1)

158-193: LGTM! Clean parameter threading for screenshot support.

The addition of optional screenshotData parameter is properly threaded through the mutation chain from the UI handler to the backend call. The implementation maintains backward compatibility by making the parameter optional.

Also applies to: 352-375

src/main/conversation-service.ts (3)

75-79: LGTM! Proper MessageContent handling.

The extraction logic correctly handles both string and array content types with appropriate type guards and fallbacks.


175-204: LGTM! Proper multimodal content support in conversation creation.

The function correctly accepts MessageContent and handles title generation for both string and array content types with appropriate fallbacks.


206-236: LGTM! Clean signature update.

The function now correctly accepts MessageContent for the content parameter, enabling multimodal message support.

src/shared/types.ts (2)

138-143: LGTM! Well-designed multimodal content types.

The types follow OpenAI's multimodal message format with proper discriminated unions and optional detail control for image processing.


372-377: LGTM! Screenshot configuration fields are well-defined.

The configuration options provide appropriate control over screenshot capture (quality, format, dimensions) with sensible types.

src/main/tipc.ts (2)

103-105: LGTM! Proper text extraction from multimodal content.

The code correctly extracts text parts from MessageContent arrays for agent mode processing, filtering out image content appropriately.


634-641: The original review comment is incorrect. No fixes needed.

The frontend (text-input-panel.tsx) already constructs the complete data URI format (data:image/${result.format};base64,${result.data}) before passing screenshotData to the backend. By the time the data reaches line 634 in tipc.ts, input.screenshotData is already a properly formatted data URI like data:image/jpeg;base64,.... The backend code at lines 634–641 is correct and requires no changes.

Likely an incorrect or invalid review comment.

Comment thread src/main/llm-fetch.ts
Comment on lines +573 to +584
// Calculate tokens - handle both string and array content
const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
  if (typeof msg.content === 'string') {
    return sum + msg.content.length
  } else if (Array.isArray(msg.content)) {
    return sum + msg.content.reduce((s, part) => {
      if (part.type === 'text') return s + part.text.length
      return s + 100 // Rough estimate for image tokens
    }, 0)
  }
  return sum
}, 0) / 4)

⚠️ Potential issue | 🟠 Major

Improve image token estimation to prevent context overflow.

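Line 580 adds a flat 100 per image, and because that value is added to the character count before the division by 4, it contributes only about 25 estimated tokens. That is well below actual costs for vision models. Under OpenAI's published tiling rule for GPT-4 vision models, a low-detail image costs a flat 85 tokens, and a high-detail image costs 85 base tokens plus 170 tokens per 512×512 tile after rescaling. For example:

  • A 512×512 image costs 85 tokens at low detail, or 255 tokens (1 tile) at high detail
  • A 1920×1080 image costs about 1,105 tokens at high detail (rescaled to roughly 1365×768, 6 tiles)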

This underestimation could lead to context limit issues when multiple images or long conversations are involved.

Consider implementing a more accurate estimation:

-        return s + 100 // Rough estimate for image tokens
+        // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
+        // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
+        return s + 500 // Conservative estimate for a typical screenshot

For a more precise calculation, you could compute tile count based on the image dimensions if available in the metadata.
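
A sketch of that tile-based calculation, assuming OpenAI's published rule for GPT-4 vision models (fit within 2048×2048, shrink so the shortest side is at most 768 px, then 170 tokens per 512 px tile plus 85 base tokens). The estimateImageTokens name is hypothetical, and other providers may count image tokens differently:

```typescript
type Detail = "low" | "high"

// Constants taken from OpenAI's published GPT-4 vision rule (assumed still current).
const BASE_TOKENS = 85
const TOKENS_PER_TILE = 170
const TILE_SIZE = 512

// Hypothetical helper: estimate the token cost of one image from its pixel dimensions.
function estimateImageTokens(width: number, height: number, detail: Detail = "high"): number {
  if (detail === "low") return BASE_TOKENS

  // Step 1: scale down to fit within a 2048 x 2048 square, preserving aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(width, height))
  let w = width * fit
  let h = height * fit

  // Step 2: scale down so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h))
  w *= shrink
  h *= shrink

  // Step 3: count 512 px tiles and add the base cost.
  const tiles = Math.ceil(w / TILE_SIZE) * Math.ceil(h / TILE_SIZE)
  return BASE_TOKENS + TOKENS_PER_TILE * tiles
}
```

Under this rule a 1920×1080 screenshot comes out at 1,105 tokens. Because the reducer in llm-fetch.ts sums characters and divides the total by 4 afterwards, a token-denominated value like this would need to be added after that division, or multiplied by 4 before being added to the character sum.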

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  // Calculate tokens - handle both string and array content
  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
    if (typeof msg.content === 'string') {
      return sum + msg.content.length
    } else if (Array.isArray(msg.content)) {
      return sum + msg.content.reduce((s, part) => {
        if (part.type === 'text') return s + part.text.length
-       return s + 100 // Rough estimate for image tokens
+       // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
+       // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
+       return s + 500 // Conservative estimate for a typical screenshot
      }, 0)
    }
    return sum
  }, 0) / 4)
🤖 Prompt for AI Agents
In src/main/llm-fetch.ts around lines 573 to 584, the image token estimate uses
a fixed 100 tokens per image which underestimates vision model costs; update the
logic to check for image metadata (width, height) when msg.content contains
image parts, compute number of tiles based on a tile size (e.g., 512x512) then
multiply by a per-tile token cost range (use a conservative value like 85–170 or
a single safer constant like 765 for high-res) to produce token count per image,
fall back to the conservative per-image estimate when dimensions are missing,
and keep the rest of the message-length calculation unchanged.

Comment thread src/main/llm-fetch.ts
Comment on lines +707 to +714
// For multimodal content, extract text parts only for now
const prompt = messages.map((m) => {
  let content = m.content
  if (Array.isArray(content)) {
    content = content.filter(p => p.type === 'text').map(p => p.text).join(' ')
  }
  return `${m.role}: ${content}`
}).join("\n\n")

⚠️ Potential issue | 🟠 Major

Document or implement image support for Gemini.

The code filters out image content and only sends text parts to Gemini (line 711: content.filter(p => p.type === 'text')). This means users who select Gemini as their LLM provider cannot use the screenshot feature, even though the UI allows them to capture screenshots.

This creates an inconsistent user experience where the feature appears to work but silently drops image content.

Please either:

  1. Implement Gemini's multimodal API support (Gemini 1.5+ supports images)
  2. Disable the screenshot checkbox in the UI when Gemini is selected
  3. Display a warning to users that screenshots aren't supported with Gemini

Would you like me to help implement Gemini multimodal support? The Gemini API supports inline images in the inlineData format.
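
As a starting point, a conversion from the PR's MessageContentPart array to Gemini-style parts might look like the sketch below. The part shapes ({ text } and { inlineData: { mimeType, data } }) are assumed from the Gemini API's documented request format and should be checked against the API version in use; toGeminiParts is a hypothetical helper name:

```typescript
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

// Assumed Gemini part shapes: plain text, or base64 image data with a MIME type.
type GeminiPart = { text: string } | { inlineData: { mimeType: string; data: string } }

// Hypothetical helper: map message content to Gemini parts. Only base64 data
// URIs are converted; any other image URL is skipped in this sketch.
function toGeminiParts(content: string | MessageContentPart[]): GeminiPart[] {
  if (typeof content === "string") return [{ text: content }]
  const parts: GeminiPart[] = []
  for (const part of content) {
    if (part.type === "text") {
      parts.push({ text: part.text })
      continue
    }
    const match = /^data:([^;,]+);base64,(.+)$/.exec(part.image_url.url)
    if (!match) continue
    const [, mimeType, data] = match
    if (mimeType && data) parts.push({ inlineData: { mimeType, data } })
  }
  return parts
}
```

Using this would also mean sending structured parts per message rather than the single joined prompt string that makeGeminiCall builds today.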

Comment thread src/main/tipc.ts
Comment on lines +914 to +972
captureScreenshot: t.procedure.action(async () => {
  try {
    const config = configStore.get()

    // Get all displays
    const displays = screen.getAllDisplays()
    const primaryDisplay = screen.getPrimaryDisplay()

    // Capture screenshot from primary display
    const sources = await desktopCapturer.getSources({
      types: ['screen'],
      thumbnailSize: {
        width: primaryDisplay.size.width * primaryDisplay.scaleFactor,
        height: primaryDisplay.size.height * primaryDisplay.scaleFactor
      }
    })

    if (sources.length === 0) {
      throw new Error('No screen sources available')
    }

    // Get the primary screen source
    const primarySource = sources[0]
    const thumbnail = primarySource.thumbnail

    // Get image format and quality from config
    const format = config.screenshotFormat || 'jpeg'
    const quality = config.screenshotQuality || 0.8
    const maxWidth = config.screenshotMaxWidth || 1920
    const maxHeight = config.screenshotMaxHeight || 1080

    // Resize if needed
    let finalImage = thumbnail
    const size = thumbnail.getSize()
    if (size.width > maxWidth || size.height > maxHeight) {
      const scale = Math.min(maxWidth / size.width, maxHeight / size.height)
      const newWidth = Math.floor(size.width * scale)
      const newHeight = Math.floor(size.height * scale)
      finalImage = thumbnail.resize({ width: newWidth, height: newHeight })
    }

    // Convert to base64
    let base64Data: string
    if (format === 'png') {
      base64Data = finalImage.toPNG().toString('base64')
    } else {
      base64Data = finalImage.toJPEG(Math.floor(quality * 100)).toString('base64')
    }

    return {
      data: base64Data,
      format,
      width: finalImage.getSize().width,
      height: finalImage.getSize().height
    }
  } catch (error) {
    throw new Error(`Failed to capture screenshot: ${error instanceof Error ? error.message : String(error)}`)
  }
}),

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

LGTM! Solid screenshot capture implementation with minor privacy consideration.

The implementation properly:

  • Captures from primary display with scale factor handling
  • Resizes images based on configuration to control size
  • Encodes to base64 with configurable format and quality
  • Includes comprehensive error handling

Consider whether users should be notified when screenshots are captured, especially since this happens automatically when the checkbox is enabled. On macOS, screen recording permissions are required, but an in-app notification or indicator might improve transparency.


🌐 Web query:

What are best practices for user consent and notification when capturing screenshots in desktop applications?

💡 Result:

Short checklist — best practices for consent & notification when capturing screenshots/screen recordings in desktop apps:

  1. Obtain explicit, informed consent before enabling capture (opt-in; describe what is captured, purpose, retention, sharing and how to revoke). [1][6][9]

  2. Show a persistent, clear visual indicator while capture is active (icon/overlay/LED notification) so users know they’re being recorded. Use OS APIs when available to detect capture and respond. [2][6]

  3. Provide an easy opt-out and a simple way to stop/pause capture from the UI (and honor revocation immediately). Document this in the app and privacy policy. [1][2][9]

  4. Minimize what you capture: mask or exclude sensitive fields (passwords, payment info, personal data) and offer app‑/window‑level or region filters. Default to conservative masking. [9][10]

  5. Localize consent flows and disclosures to applicable law/region (GDPR, CCPA/state laws, workplace/employee‑monitoring rules) and record consent events for auditability. [4][9]

  6. Limit retention, restrict access, encrypt stored captures, and log access/exports; publish retention and data‑access policies. [9][10]

  7. Avoid third‑party analytics that capture raw screens without explicit consent; if used, disclose third parties and contractual protections in the privacy policy. [1][9]

  8. Provide transparency: in‑app settings, an accessible privacy policy, and a “view/delete your captures” control for users. [1][3][10]

  9. Use contextual notices (first run, before a specific session, or when capture state changes) rather than hidden or buried disclosures; record the timestamped notice and user response. [1][6]

  10. Test UX and edge cases (multi‑monitor, remote desktop, screen‑mirroring, headless sessions) and handle system capture events (pause capture, warn, or black out content) appropriately. [6][3]

Key references:

  • Apple App Store & developer guidance on explicit consent and visible indicators. [1][2]
  • Apple technical guidance (UIScreen.isCaptured / UIScreenCapturedDidChange) for responding to capture. [6]
  • Microsoft Recall / Windows guidance on snapshot filtering, user controls and local processing. [3][7]
  • GDPR/Session‑replay vendor guidance on masking, consent, retention and compliance. [9][10]
  • Employer/employee monitoring legality overview and state differences (U.S.). [4]



The web search surfaced a broad set of best practices, and the Electron documentation indicates that desktopCapturer on macOS requires screen recording permission, with Electron checking the permission and rejecting the promise if it has not been granted. However, the code snippet itself shows no user-facing consent UI, visual indicator, or notification mechanism.


Add user notification and consent controls for screenshot capture.

The implementation handles the technical capture correctly, but it lacks the consent and notification controls that best practices for desktop screenshot capture call for. Webcam, microphone, and screen recording functionality poses a serious risk to users when access is granted by default.

Consider adding:

  • Explicit opt-in consent dialog before first use or when the feature is enabled
  • Persistent visual indicator (icon/overlay) while capture is active
  • Easy opt-out control in settings with immediate revocation
  • Privacy policy disclosure explaining data retention, access, and deletion controls
  • Optional sensitive data masking (passwords, payment info)
  • Contextual notice on configuration changes

For macOS, desktopCapturer.getSources() triggers a system permission prompt once, which is stored in system preferences, providing OS-level protection. However, in-app transparency and user controls enhance trust.


🏁 Script executed:

#!/bin/bash
# Find where captureScreenshot is called
rg -n "captureScreenshot" --type ts --type js -C 3

Length of output: 1643


🏁 Script executed:

#!/bin/bash
# Search for config related to screenshot functionality
rg -n "screenshot" --type ts --type js -C 2 | head -100

Length of output: 7245


🏁 Script executed:

#!/bin/bash
# Search for privacy settings or consent dialogs
rg -n -i "consent\|notification\|permission\|privacy" --type ts --type js | grep -i "screenshot\|capture" 

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find the main application structure
fd -type f -name "*.ts" -o -name "*.tsx" | head -20

Length of output: 170


🏁 Script executed:

#!/bin/bash
# Look for UI components related to settings or privacy
fd -type f \( -name "*setting*" -o -name "*privacy*" -o -name "*config*" \) -name "*.ts*"

Length of output: 170


Add explicit consent and visual feedback for automatic screenshot capture.

The implementation technically works, but automatic capture when the checkbox is enabled lacks the user notifications recommended by platform guidelines and privacy best practices.

Currently:

  • User checks "Include Screenshot" checkbox
  • Screenshot is immediately captured with no warning or visual indicator
  • Preview is shown, but user may not realize automatic capture occurred

Add:

  • First-use consent dialog explaining screenshot capture, data handling, and how to disable
  • Visual indicator (icon/spinner) during capture so users know activity is occurring
  • Brief in-app notice when the screenshot checkbox auto-captures: "Capturing screenshot…"
  • Privacy policy section disclosing: what is captured, retention period, and where data is sent
  • Optional: masking/redaction UI for sensitive fields (passwords, payment info)

macOS system permissions provide underlying protection, but in-app transparency and explicit consent improve user trust and regulatory compliance (GDPR, CCPA).

🤖 Prompt for AI Agents
In src/main/tipc.ts around lines 914-972, the captureScreenshot procedure
triggers automatic screenshots without user consent or visible feedback; add
first-use explicit consent, runtime visual feedback, and privacy disclosures:
check a persisted consent flag in configStore before capturing and if missing
open a modal dialog (blocking auto-capture) that explains what is captured,
retention, where data is sent, and how to opt-out, then persist the user’s
choice; emit IPC/UI events before/after capture to show a brief in-app notice
and a visual indicator (icon/spinner) during capture; include a configurable
privacyPolicy URL and retention metadata in the config returned with the
screenshot and surface a link to the policy in the modal; optionally add a flag
to enable masking/redaction options (expose API to request masking before
capture). Ensure the procedure aborts with a clear error when consent is denied
and that all new UI interactions are driven via existing IPC channels rather
than doing UI work in this main process.
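
The consent gate described in the prompt above might be sketched as follows. Every name here (`ConsentStore`, `captureWithConsent`, `askUser`) is hypothetical; the PR defines none of them, and the real implementation would read the flag from `configStore` and show the dialog in the renderer over IPC.

```typescript
// All names in this sketch are hypothetical; none come from the PR.
type ConsentState = 'granted' | 'denied' | undefined

interface ConsentStore {
  get(): ConsentState
  set(value: 'granted' | 'denied'): void
}

// Asks for consent once, persists the answer, and refuses to capture without it.
async function captureWithConsent<T>(
  store: ConsentStore,
  askUser: () => Promise<boolean>, // e.g. a renderer-side modal reached over IPC
  capture: () => Promise<T>,
): Promise<T> {
  let state = store.get()
  if (state === undefined) {
    // First use: block the capture until the user has answered.
    state = (await askUser()) ? 'granted' : 'denied'
    store.set(state)
  }
  if (state !== 'granted') {
    throw new Error('Screenshot capture requires user consent')
  }
  return capture()
}
```

Passing `askUser` and `capture` as parameters keeps the gate free of UI work, in line with the prompt's requirement that dialogs be driven from the renderer.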

aj47 added 2 commits November 1, 2025 10:18
- Fixed screenshot data being stripped in agent mode pipeline
- Updated processTranscriptWithAgentMode to accept MessageContent (string or multimodal array)
- Updated context-budget.ts to handle multimodal content (text + images)
- Added extractTextFromContent helper to safely extract text from multimodal messages
- Enhanced debug logging to track multimodal content through the pipeline
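
A helper like extractTextFromContent could look roughly like the sketch below. The `ContentPart` shape is an assumption modeled on OpenAI-style content parts; the PR's actual `MessageContent` definition is not shown here and may differ.

```typescript
// Assumed shape (OpenAI-style content parts); the PR's real types may differ.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }

type MessageContent = string | ContentPart[]

type TextPart = Extract<ContentPart, { type: 'text' }>

// Returns only the textual portion of a message, ignoring image parts.
function extractTextFromContent(content: MessageContent): string {
  if (typeof content === 'string') {
    return content
  }
  return content
    .filter((part): part is TextPart => part.type === 'text')
    .map((part) => part.text)
    .join('\n')
}
```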

Permission Detection & UI:
- Added getScreenCaptureStatus, requestScreenCaptureAccess, and openScreenCaptureInSystemPreferences to tipc.ts
- Added permission status display in Settings → General → Screenshot / Multimodal
- Shows green checkmark when permission granted, amber warning when missing
- Added 'Open System Settings' button to guide users to grant permission
- Auto-refreshes permission status every 2 seconds
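
The mapping from permission status to the settings indicator might be sketched as below. The status strings are those documented for Electron's `systemPreferences.getMediaAccessStatus('screen')`; whether the PR uses that call is an assumption, and `PermissionBadge` and `toPermissionBadge` are hypothetical names.

```typescript
// Status values documented for systemPreferences.getMediaAccessStatus('screen').
type ScreenAccessStatus = 'not-determined' | 'granted' | 'denied' | 'restricted' | 'unknown'

// Hypothetical view model for the settings row; not taken from the PR.
interface PermissionBadge {
  tone: 'ok' | 'warning'
  label: string
  showOpenSettings: boolean
}

function toPermissionBadge(status: ScreenAccessStatus): PermissionBadge {
  if (status === 'granted') {
    // Green checkmark, no call to action needed.
    return { tone: 'ok', label: 'Screen recording permission granted', showOpenSettings: false }
  }
  // Any other status gets the amber warning and the button to System Settings.
  return { tone: 'warning', label: 'Screen recording permission missing', showOpenSettings: true }
}
```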

Error Handling:
- Added helpful toast notifications when screenshot capture fails
- Detects macOS permission errors and shows targeted guidance
- Guides users to System Settings → Privacy & Security → Screen Recording

This ensures screenshots are properly sent to multimodal LLMs (GPT-4V, Claude with vision, Gemini 2.5 Flash, etc.) and users are guided through the permission setup process.
- Add automatic screenshot capture for voice input in MCP mode when screenshot setting is enabled
- Implement dynamic window resizing when screenshot preview is shown/hidden in text input panel
- Update createMcpRecording to accept optional screenshotData parameter for multimodal content
- Add useEffect hook to automatically resize panel window based on screenshot preview state
- Ensure error-resilient screenshot capture that doesn't break voice input flow

Development

Successfully merging this pull request may close these issues.

Add screenshot as context option for SpeakMCP input