feat: Add screenshot functionality for multimodal AI input #219
Conversation
- Add screenshot capture using Electron's desktopCapturer API
- Add screenshot checkbox in text input UI
- Add screenshot configuration options (quality, format, max dimensions)
- Update message types to support multimodal content (text + images)
- Update LLM API calls to handle multimodal content
- Update conversation service to handle MessageContent type
- Add helper functions to extract text from multimodal content

Implements multimodal support for sending screenshots to AI models via OpenAI-compatible APIs using base64-encoded images.
Walkthrough
This PR adds multimodal screenshot support and improves message content handling across the application. It introduces screenshot capture functionality with UI controls, updates message content types to support structured multimodal payloads (text and images), and implements stable content-based IDs for tool executions. The changes span backend (LLM, configuration, conversation service) and frontend (text input, agent progress rendering) layers.
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant TextInput as Text Input Panel
participant Main as Main Process (IPC)
participant ConvService as Conversation Service
participant LLM as LLM Handler
participant Renderer as Agent Progress Renderer
User->>TextInput: Enters text + enables screenshot
TextInput->>Main: captureScreenshot()
Main-->>TextInput: base64 screenshot data
User->>TextInput: Submit
TextInput->>Main: createMcpTextInput({ text, screenshotData })
alt Screenshot provided
Main->>ConvService: addMessageToConversation(content: [text, image_url])
else No screenshot
Main->>ConvService: addMessageToConversation(content: text)
end
ConvService->>LLM: makeLLMCall(messages with MessageContent)
LLM->>LLM: generateToolExecutionId(toolCall)
LLM-->>Renderer: AgentProgressUpdate with multimodal content
Renderer->>Renderer: extractTextFromContent(content)
Renderer-->>User: Display text + images
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
augment review
// Get the primary screen source
const primarySource = sources[0]
const thumbnail = primarySource.thumbnail
Choosing the primary screen with sources[0] is unreliable—desktopCapturer.getSources() does not guarantee ordering. Consider matching the source to screen.getPrimaryDisplay() (e.g., via display_id) to ensure the correct monitor is captured, especially on multi‑monitor setups.
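The matching suggested above could look roughly like the sketch below. It is not code from the PR: `pickPrimarySource`, `CapturerSourceLike`, and `DisplayLike` are hypothetical names, and the interfaces are structural stand-ins for the relevant fields of Electron's `DesktopCapturerSource` and `Display` so the helper can be exercised without Electron. Electron documents `display_id` as possibly being an empty string, so the sketch falls back to the first source when nothing matches.

```typescript
// Structural stand-ins for the Electron types (only the fields used here).
interface CapturerSourceLike {
  id: string
  display_id: string // may be "" when the platform cannot provide it
}

interface DisplayLike {
  id: number
}

/**
 * Pick the capture source that belongs to the primary display.
 * Falls back to the first source when no display_id matches.
 */
function pickPrimarySource<T extends CapturerSourceLike>(
  sources: T[],
  primaryDisplay: DisplayLike,
): T {
  if (sources.length === 0) {
    throw new Error("No screen sources available")
  }
  const wanted = String(primaryDisplay.id)
  return sources.find((s) => s.display_id === wanted) ?? sources[0]
}
```

In the handler this would presumably be called as `pickPrimarySource(sources, screen.getPrimaryDisplay())` in place of `sources[0]`.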
    screenshotData,
  })
} else {
  textInputMutation.mutate({ text })
Screenshot data is only passed through the MCP path; when MCP is disabled this else branch calls textInputMutation.mutate({ text }), dropping the screenshot entirely. Consider forwarding screenshotData (and updating the non-MCP backend path) so the feature works consistently without MCP (also applies to the fallback below).
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/renderer/src/components/conversation-display.tsx (1)
92-104: Convert MessageContent to plain text before downstream use.
message.content is now MessageContent (string | MessageContentPart[]). When it's an array (text + image), the current calls send an object/array to generateSpeech and the context menu IPC handler, which expect strings. In prod this blows up TTS requests (invalid payload / [object Object]) and the context-menu copy path. Let's normalize once and reuse.
@@
-  const generateAudio = async (): Promise<ArrayBuffer> => {
+  const textContent = extractTextFromContent(message.content)
+
+  const generateAudio = async (): Promise<ArrayBuffer> => {
     if (!configQuery.data?.ttsEnabled) {
       throw new Error("TTS is not enabled")
     }
@@
-    const result = await tipcClient.generateSpeech({
-      text: message.content,
+    const result = await tipcClient.generateSpeech({
+      text: textContent,
     })
@@
-    tipcClient.showContextMenu({
+    tipcClient.showContextMenu({
       x: e.clientX,
       y: e.clientY,
       messageContext: {
-        content: message.content,
+        content: textContent,
         role: message.role,
         messageId: message.id,
       },
     })
   }
@@
-  <MarkdownRenderer content={extractTextFromContent(message.content)} />
+  <MarkdownRenderer content={textContent} />
@@
-  text={extractTextFromContent(message.content)}
+  text={textContent}
Also applies to: 169-181
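The review refers to `extractTextFromContent` without showing its body. A minimal sketch of what such a helper might look like follows; the `MessageContentPart` shape is an assumption modeled on the OpenAI-style format described elsewhere in this PR, not a copy of `src/shared/types.ts`.

```typescript
// Assumed shapes, modeled on the OpenAI-style multimodal message format.
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "auto" | "low" | "high" } }

type MessageContent = string | MessageContentPart[]

/**
 * Collapse multimodal content to plain text.
 * Strings pass through unchanged; arrays keep only their text parts,
 * joined with a single space. Image parts are dropped.
 */
function extractTextFromContent(content: MessageContent): string {
  if (typeof content === "string") return content
  return content
    .filter((part): part is Extract<MessageContentPart, { type: "text" }> => part.type === "text")
    .map((part) => part.text)
    .join(" ")
}
```

Normalizing once at the top of the component, as the diff above does, keeps every string-only consumer (TTS, context menu, markdown rendering) on the same value.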
src/main/conversation-service.ts (1)
91-98: Fix type error: generatePreview doesn't handle MessageContent arrays.
Line 95 uses msg.content.slice(0, 100), which assumes content is always a string. However, ConversationMessage.content is now MessageContent (string | array), so this will throw a runtime error when content is an array.
Apply this diff to extract text from MessageContent:
   private generatePreview(messages: ConversationMessage[]): string {
     // Generate a preview from the first few messages
     const previewMessages = messages.slice(0, 3)
     const preview = previewMessages
-      .map((msg) => `${msg.role}: ${msg.content.slice(0, 100)}`)
+      .map((msg) => {
+        const content = typeof msg.content === 'string'
+          ? msg.content
+          : msg.content.filter(p => p.type === 'text').map(p => p.text).join(' ')
+        return `${msg.role}: ${content.slice(0, 100)}`
+      })
       .join(" | ")
     return preview.length > 200 ? `${preview.slice(0, 200)}...` : preview
   }
src/main/tipc.ts (1)
634-697: Screenshot context is lost when passed to agent mode processing.
The review comment is correct. The code saves the multimodal messageContent with the screenshot to the conversation (lines 643-656), but processWithAgentMode only passes text to processTranscriptWithAgentMode (line 661). Additionally, when loading previousConversationHistory, the most recent message is excluded (slice(0, -1)), so the current user's screenshot never reaches the LLM for analysis.
To fix this, pass messageContent instead of text to processWithAgentMode, and modify processWithAgentMode to accept and forward the multimodal content to processTranscriptWithAgentMode.
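To make the shape of `messageContent` concrete, here is a small sketch of how the text and an optional screenshot could be combined before being handed to agent mode. `buildMessageContent` and `toDataUri` are hypothetical helper names, and the types are assumed from the OpenAI-compatible format this PR describes; the data-URI layout follows the `data:image/${format};base64,${data}` pattern quoted in the review.

```typescript
// Assumed shapes, modeled on the OpenAI-style multimodal message format.
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }

type MessageContent = string | MessageContentPart[]

/** Wrap raw base64 image bytes in a data URI. */
function toDataUri(format: "png" | "jpeg", base64: string): string {
  return `data:image/${format};base64,${base64}`
}

/**
 * Build the content for one user message.
 * Without a screenshot the content stays a plain string, so existing
 * string-only callers keep working; with one it becomes [text, image_url].
 */
function buildMessageContent(text: string, screenshotData?: string): MessageContent {
  if (!screenshotData) return text
  return [
    { type: "text", text },
    { type: "image_url", image_url: { url: screenshotData } },
  ]
}
```

Forwarding the value returned here, rather than `text` alone, is what would let the screenshot survive the agent-mode path.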
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
PR209_FINAL_SUMMARY.md (1 hunks)
src/main/config.ts (1 hunks)
src/main/conversation-service.ts (5 hunks)
src/main/llm-fetch.ts (6 hunks)
src/main/llm.ts (2 hunks)
src/main/tipc.ts (5 hunks)
src/renderer/src/components/agent-progress.tsx (3 hunks)
src/renderer/src/components/conversation-display.tsx (4 hunks)
src/renderer/src/components/text-input-panel.tsx (5 hunks)
src/renderer/src/contexts/conversation-context.tsx (1 hunks)
src/renderer/src/pages/panel.tsx (3 hunks)
src/shared/types.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/renderer/src/components/agent-progress.tsx (1)
src/shared/types.ts (1)
MessageContent (143-143)
src/renderer/src/components/text-input-panel.tsx (2)
src/renderer/src/contexts/theme-context.tsx (1)
useTheme (173-179)
src/renderer/src/lib/tipc-client.ts (1)
tipcClient (12-14)
src/renderer/src/components/conversation-display.tsx (2)
src/shared/types.ts (1)
MessageContent (143-143)
src/renderer/src/components/markdown-renderer.tsx (1)
MarkdownRenderer (102-228)
src/renderer/src/pages/panel.tsx (1)
src/renderer/src/lib/tipc-client.ts (1)
tipcClient (12-14)
src/main/conversation-service.ts (1)
src/shared/types.ts (1)
MessageContent (143-143)
src/main/tipc.ts (2)
src/main/conversation-service.ts (1)
conversationService (250-250)
src/main/config.ts (1)
configStore (148-148)
🪛 LanguageTool
PR209_FINAL_SUMMARY.md
[style] ~84-~84: Consider a different adjective to strengthen your wording.
Context: ...oot cause analysis - Explanation of the deeper issue discovered - Detailed solution wi...
(DEEP_PROFOUND)
🪛 markdownlint-cli2 (0.18.1)
PR209_FINAL_SUMMARY.md
77-77: Bare URL used
(MD034, no-bare-urls)
🔇 Additional comments (8)
src/renderer/src/pages/panel.tsx (1)
158-193: LGTM! Clean parameter threading for screenshot support.
The addition of optional screenshotData parameter is properly threaded through the mutation chain from the UI handler to the backend call. The implementation maintains backward compatibility by making the parameter optional.
Also applies to: 352-375
src/main/conversation-service.ts (3)
75-79: LGTM! Proper MessageContent handling.
The extraction logic correctly handles both string and array content types with appropriate type guards and fallbacks.
175-204: LGTM! Proper multimodal content support in conversation creation.
The function correctly accepts MessageContent and handles title generation for both string and array content types with appropriate fallbacks.
206-236: LGTM! Clean signature update.
The function now correctly accepts MessageContent for the content parameter, enabling multimodal message support.
src/shared/types.ts (2)
138-143: LGTM! Well-designed multimodal content types.
The types follow OpenAI's multimodal message format with proper discriminated unions and optional detail control for image processing.
372-377: LGTM! Screenshot configuration fields are well-defined.
The configuration options provide appropriate control over screenshot capture (quality, format, dimensions) with sensible types.
src/main/tipc.ts (2)
103-105: LGTM! Proper text extraction from multimodal content.
The code correctly extracts text parts from MessageContent arrays for agent mode processing, filtering out image content appropriately.
634-641: The original review comment is incorrect. No fixes needed.
The frontend (text-input-panel.tsx) already constructs the complete data URI format (data:image/${result.format};base64,${result.data}) before passing screenshotData to the backend. By the time the data reaches line 634 in tipc.ts, input.screenshotData is already a properly formatted data URI like data:image/jpeg;base64,.... The backend code at lines 634–641 is correct and requires no changes.
Likely an incorrect or invalid review comment.
// Calculate tokens - handle both string and array content
const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
  if (typeof msg.content === 'string') {
    return sum + msg.content.length
  } else if (Array.isArray(msg.content)) {
    return sum + msg.content.reduce((s, part) => {
      if (part.type === 'text') return s + part.text.length
      return s + 100 // Rough estimate for image tokens
    }, 0)
  }
  return sum
}, 0) / 4)
Improve image token estimation to prevent context overflow.
Line 580 uses a rough estimate of 100 tokens per image, which is significantly lower than actual costs for vision models. Under OpenAI's published rules for GPT-4V, a low-detail image costs a flat 85 tokens, while a high-detail image costs 85 base tokens plus 170 tokens per 512×512 tile after downscaling. For example:
- A 512×512 image uses ~255 tokens at high detail (1 tile), or 85 tokens at low detail
- A 1920×1080 image uses ~1,105 tokens at high detail (6 tiles after downscaling to roughly 1365×768)
This underestimation could lead to context limit issues when multiple images or long conversations are involved.
Consider implementing a more accurate estimation:
- return s + 100 // Rough estimate for image tokens
+ // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
+ // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
+ return s + 500 // Conservative estimate for a typical screenshot
For a more precise calculation, you could compute tile count based on the image dimensions if available in the metadata.
📝 Committable suggestion
// Calculate tokens - handle both string and array content
const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {
  if (typeof msg.content === 'string') {
    return sum + msg.content.length
  } else if (Array.isArray(msg.content)) {
    return sum + msg.content.reduce((s, part) => {
      if (part.type === 'text') return s + part.text.length
      // Estimate image tokens based on tile count (GPT-4V uses 85-170 tokens per tile)
      // Assume high detail mode: each 512x512 tile costs ~170 tokens, plus 85 base tokens
      return s + 500 // Conservative estimate for a typical screenshot
    }, 0)
  }
  return sum
}, 0) / 4)
🤖 Prompt for AI Agents
In src/main/llm-fetch.ts around lines 573 to 584, the image token estimate uses
a fixed 100 tokens per image which underestimates vision model costs; update the
logic to check for image metadata (width, height) when msg.content contains
image parts, compute number of tiles based on a tile size (e.g., 512x512) then
multiply by a per-tile token cost range (use a conservative value like 85–170 or
a single safer constant like 765 for high-res) to produce token count per image,
fall back to the conservative per-image estimate when dimensions are missing,
and keep the rest of the message-length calculation unchanged.
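A dimension-based estimate along the lines the prompt describes might look like the sketch below. It assumes OpenAI's published high-detail rule for GPT-4-class vision models (fit within 2048×2048, scale the shortest side down to 768, then charge 85 base tokens plus 170 per 512×512 tile); other providers count differently, and `estimateImageTokens` is a hypothetical name rather than something in the PR.

```typescript
const BASE_TOKENS = 85 // flat cost, also the whole cost in low-detail mode
const TOKENS_PER_TILE = 170
const TILE_SIZE = 512

/**
 * Estimate the token cost of one image from its pixel dimensions.
 * Assumption: OpenAI-style tiling; images are only ever scaled down.
 */
function estimateImageTokens(
  width: number,
  height: number,
  detail: "low" | "high" = "high",
): number {
  if (detail === "low") return BASE_TOKENS

  // Step 1: fit inside a 2048 x 2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height))
  let w = width * fit
  let h = height * fit

  // Step 2: shrink so the shortest side is at most 768.
  const shrink = Math.min(1, 768 / Math.min(w, h))
  w *= shrink
  h *= shrink

  // Step 3: count the 512 x 512 tiles needed to cover the result.
  const tiles = Math.ceil(w / TILE_SIZE) * Math.ceil(h / TILE_SIZE)
  return BASE_TOKENS + TOKENS_PER_TILE * tiles
}
```

One detail worth noting: the surrounding `reduce` sums characters and divides the total by 4, so a per-image token figure would need to be multiplied by 4 before being added there, or added after the division, to count as that many tokens.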
// For multimodal content, extract text parts only for now
const prompt = messages.map((m) => {
  let content = m.content
  if (Array.isArray(content)) {
    content = content.filter(p => p.type === 'text').map(p => p.text).join(' ')
  }
  return `${m.role}: ${content}`
}).join("\n\n")
Document or implement image support for Gemini.
The code filters out image content and only sends text parts to Gemini (line 711: content.filter(p => p.type === 'text')). This means users who select Gemini as their LLM provider cannot use the screenshot feature, even though the UI allows them to capture screenshots.
This creates an inconsistent user experience where the feature appears to work but silently drops image content.
Please either:
- Implement Gemini's multimodal API support (Gemini 1.5+ supports images)
- Disable the screenshot checkbox in the UI when Gemini is selected
- Display a warning to users that screenshots aren't supported with Gemini
Would you like me to help implement Gemini multimodal support? The Gemini API supports inline images in the inlineData format.
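As a rough illustration of the first option, the OpenAI-style parts could be mapped to Gemini-style parts as sketched below. The `GeminiPart` shape (`text` and `inlineData` with `mimeType` and `data`) reflects my understanding of the Gemini API's inline-image format and should be checked against the current Gemini documentation; `toGeminiParts` and the `MessageContent` types are assumptions, not code from this PR.

```typescript
// Assumed shapes, modeled on the OpenAI-style multimodal message format.
type MessageContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }

type MessageContent = string | MessageContentPart[]

// Assumed Gemini-style part shape; verify against the Gemini API reference.
type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } }

/**
 * Convert one message's content into Gemini-style parts.
 * Only base64 data URIs are converted; other image URLs are skipped,
 * because inline data requires the raw bytes.
 */
function toGeminiParts(content: MessageContent): GeminiPart[] {
  if (typeof content === "string") return [{ text: content }]

  const parts: GeminiPart[] = []
  for (const part of content) {
    if (part.type === "text") {
      parts.push({ text: part.text })
      continue
    }
    // Split "data:<mime>;base64,<payload>" into its two pieces.
    const match = /^data:([^;,]+);base64,(.+)$/.exec(part.image_url.url)
    if (match) {
      parts.push({ inlineData: { mimeType: match[1], data: match[2] } })
    }
  }
  return parts
}
```

With parts like these, the Gemini call would send a structured `contents` payload instead of the single flattened prompt string built above.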
captureScreenshot: t.procedure.action(async () => {
  try {
    const config = configStore.get()

    // Get all displays
    const displays = screen.getAllDisplays()
    const primaryDisplay = screen.getPrimaryDisplay()

    // Capture screenshot from primary display
    const sources = await desktopCapturer.getSources({
      types: ['screen'],
      thumbnailSize: {
        width: primaryDisplay.size.width * primaryDisplay.scaleFactor,
        height: primaryDisplay.size.height * primaryDisplay.scaleFactor
      }
    })

    if (sources.length === 0) {
      throw new Error('No screen sources available')
    }

    // Get the primary screen source
    const primarySource = sources[0]
    const thumbnail = primarySource.thumbnail

    // Get image format and quality from config
    const format = config.screenshotFormat || 'jpeg'
    const quality = config.screenshotQuality || 0.8
    const maxWidth = config.screenshotMaxWidth || 1920
    const maxHeight = config.screenshotMaxHeight || 1080

    // Resize if needed
    let finalImage = thumbnail
    const size = thumbnail.getSize()
    if (size.width > maxWidth || size.height > maxHeight) {
      const scale = Math.min(maxWidth / size.width, maxHeight / size.height)
      const newWidth = Math.floor(size.width * scale)
      const newHeight = Math.floor(size.height * scale)
      finalImage = thumbnail.resize({ width: newWidth, height: newHeight })
    }

    // Convert to base64
    let base64Data: string
    if (format === 'png') {
      base64Data = finalImage.toPNG().toString('base64')
    } else {
      base64Data = finalImage.toJPEG(Math.floor(quality * 100)).toString('base64')
    }

    return {
      data: base64Data,
      format,
      width: finalImage.getSize().width,
      height: finalImage.getSize().height
    }
  } catch (error) {
    throw new Error(`Failed to capture screenshot: ${error instanceof Error ? error.message : String(error)}`)
  }
}),
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
LGTM! Solid screenshot capture implementation with minor privacy consideration.
The implementation properly:
- Captures from primary display with scale factor handling
- Resizes images based on configuration to control size
- Encodes to base64 with configurable format and quality
- Includes comprehensive error handling
Consider whether users should be notified when screenshots are captured, especially since this happens automatically when the checkbox is enabled. On macOS, screen recording permissions are required, but an in-app notification or indicator might improve transparency.
🌐 Web query:
What are best practices for user consent and notification when capturing screenshots in desktop applications?
💡 Result:
Short checklist — best practices for consent & notification when capturing screenshots/screen recordings in desktop apps:
- Obtain explicit, informed consent before enabling capture (opt-in; describe what is captured, purpose, retention, sharing and how to revoke). [1][6][9]
- Show a persistent, clear visual indicator while capture is active (icon/overlay/LED notification) so users know they’re being recorded. Use OS APIs when available to detect capture and respond. [2][6]
- Provide an easy opt-out and a simple way to stop/pause capture from the UI (and honor revocation immediately). Document this in the app and privacy policy. [1][2][9]
- Minimize what you capture: mask or exclude sensitive fields (passwords, payment info, personal data) and offer app‑/window‑level or region filters. Default to conservative masking. [9][10]
- Localize consent flows and disclosures to applicable law/region (GDPR, CCPA/state laws, workplace/employee‑monitoring rules) and record consent events for auditability. [4][9]
- Limit retention, restrict access, encrypt stored captures, and log access/exports; publish retention and data‑access policies. [9][10]
- Avoid third‑party analytics that capture raw screens without explicit consent; if used, disclose third parties and contractual protections in the privacy policy. [1][9]
- Provide transparency: in‑app settings, an accessible privacy policy, and a “view/delete your captures” control for users. [1][3][10]
- Use contextual notices (first run, before a specific session, or when capture state changes) rather than hidden or buried disclosures; record the timestamped notice and user response. [1][6]
- Test UX and edge cases (multi‑monitor, remote desktop, screen‑mirroring, headless sessions) and handle system capture events (pause capture, warn, or black out content) appropriately. [6][3]
Key references:
- Apple App Store & developer guidance on explicit consent and visible indicators. [1][2]
- Apple technical guidance (UIScreen.isCaptured / UIScreenCapturedDidChange) for responding to capture. [6]
- Microsoft Recall / Windows guidance on snapshot filtering, user controls and local processing. [3][7]
- GDPR/Session‑replay vendor guidance on masking, consent, retention and compliance. [9][10]
- Employer/employee monitoring legality overview and state differences (U.S.). [4]
If you want, I can convert this into a short consent UI copy + privacy-policy snippet tailored to your app and target jurisdictions.
Based on my research, I can now provide the rewritten review comment. The web search revealed comprehensive best practices, and the Electron documentation confirms that desktopCapturer on macOS requires user permissions, with Electron checking permissions and rejecting the promise if not granted. However, the code snippet itself doesn't show any user-facing consent UI, visual indicators, or notification mechanisms.
Add user notification and consent controls for screenshot capture.
The implementation correctly handles the technical capture, but webcam, microphone, and screen recording functionality presents a serious risk to users when approval is granted by default, so best practice for desktop applications is to make capture explicit and visible.
Consider adding:
- Explicit opt-in consent dialog before first use or when the feature is enabled
- Persistent visual indicator (icon/overlay) while capture is active
- Easy opt-out control in settings with immediate revocation
- Privacy policy disclosure explaining data retention, access, and deletion controls
- Optional sensitive data masking (passwords, payment info)
- Contextual notice on configuration changes
For macOS, desktopCapturer.getSources() triggers a system permission prompt once, which is stored in system preferences, providing OS-level protection. However, in-app transparency and user controls enhance trust.
🏁 Script executed:
#!/bin/bash
# Find where captureScreenshot is called
rg -n "captureScreenshot" --type ts --type js -C 3
Length of output: 1643
🏁 Script executed:
#!/bin/bash
# Search for config related to screenshot functionality
rg -n "screenshot" --type ts --type js -C 2 | head -100
Length of output: 7245
🏁 Script executed:
#!/bin/bash
# Search for privacy settings or consent dialogs
rg -n -i "consent\|notification\|permission\|privacy" --type ts --type js | grep -i "screenshot\|capture"
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Find the main application structure
fd -type f -name "*.ts" -o -name "*.tsx" | head -20
Length of output: 170
🏁 Script executed:
#!/bin/bash
# Look for UI components related to settings or privacy
fd -type f \( -name "*setting*" -o -name "*privacy*" -o -name "*config*" \) -name "*.ts*"
Length of output: 170
Add explicit consent and visual feedback for automatic screenshot capture.
The implementation technically works, but automatic capture when the checkbox is enabled lacks the user notifications recommended by platform guidelines and privacy best practices.
Currently:
- User checks "Include Screenshot" checkbox
- Screenshot is immediately captured with no warning or visual indicator
- Preview is shown, but user may not realize automatic capture occurred
Add:
- First-use consent dialog explaining screenshot capture, data handling, and how to disable
- Visual indicator (icon/spinner) during capture so users know activity is occurring
- Brief in-app notice when the screenshot checkbox auto-captures: "Capturing screenshot…"
- Privacy policy section disclosing: what is captured, retention period, and where data is sent
- Optional: masking/redaction UI for sensitive fields (passwords, payment info)
macOS system permissions provide underlying protection, but in-app transparency and explicit consent improve user trust and regulatory compliance (GDPR, CCPA).
🤖 Prompt for AI Agents
In src/main/tipc.ts around lines 914-972, the captureScreenshot procedure
triggers automatic screenshots without user consent or visible feedback; add
first-use explicit consent, runtime visual feedback, and privacy disclosures:
check a persisted consent flag in configStore before capturing and if missing
open a modal dialog (blocking auto-capture) that explains what is captured,
retention, where data is sent, and how to opt-out, then persist the user’s
choice; emit IPC/UI events before/after capture to show a brief in-app notice
and a visual indicator (icon/spinner) during capture; include a configurable
privacyPolicy URL and retention metadata in the config returned with the
screenshot and surface a link to the policy in the modal; optionally add a flag
to enable masking/redaction options (expose API to request masking before
capture). Ensure the procedure aborts with a clear error when consent is denied
and that all new UI interactions are driven via existing IPC channels rather
than doing UI work in this main process.
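The consent check described in the prompt can be reduced to a small decision step that runs before any capture. The sketch below is one hypothetical way to express it: `screenshotEnabled` is a config field this PR adds, while `screenshotConsent` and `decideScreenshotAction` are invented here for illustration and do not exist in the codebase.

```typescript
// Hypothetical slice of the app config. Only screenshotEnabled comes from
// this PR; screenshotConsent is an assumed, persisted first-use answer.
interface ScreenshotConsentConfig {
  screenshotEnabled?: boolean
  screenshotConsent?: "granted" | "denied"
}

type ScreenshotAction = "capture" | "prompt" | "abort"

/**
 * Decide what the capture handler should do before touching the screen.
 * - feature disabled or consent denied: abort with a clear error
 * - no recorded answer yet: show the first-use consent dialog
 * - consent granted: proceed with capture
 */
function decideScreenshotAction(config: ScreenshotConsentConfig): ScreenshotAction {
  if (!config.screenshotEnabled) return "abort"
  if (config.screenshotConsent === "denied") return "abort"
  if (config.screenshotConsent === undefined) return "prompt"
  return "capture"
}
```

Keeping the decision pure leaves the dialog and the in-app notice to the renderer, which matches the prompt's request to drive UI through IPC rather than from the main process.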
- Fixed screenshot data being stripped in agent mode pipeline
- Updated processTranscriptWithAgentMode to accept MessageContent (string or multimodal array)
- Updated context-budget.ts to handle multimodal content (text + images)
- Added extractTextFromContent helper to safely extract text from multimodal messages
- Enhanced debug logging to track multimodal content through the pipeline

Permission Detection & UI:
- Added getScreenCaptureStatus, requestScreenCaptureAccess, and openScreenCaptureInSystemPreferences to tipc.ts
- Added permission status display in Settings → General → Screenshot / Multimodal
- Shows green checkmark when permission granted, amber warning when missing
- Added 'Open System Settings' button to guide users to grant permission
- Auto-refreshes permission status every 2 seconds

Error Handling:
- Added helpful toast notifications when screenshot capture fails
- Detects macOS permission errors and shows targeted guidance
- Guides users to System Settings → Privacy & Security → Screen Recording

This ensures screenshots are properly sent to multimodal LLMs (GPT-4V, Claude with vision, Gemini 2.5 Flash, etc.) and users are guided through the permission setup process.
- Add automatic screenshot capture for voice input in MCP mode when screenshot setting is enabled
- Implement dynamic window resizing when screenshot preview is shown/hidden in text input panel
- Update createMcpRecording to accept optional screenshotData parameter for multimodal content
- Add useEffect hook to automatically resize panel window based on screenshot preview state
- Ensure error-resilient screenshot capture that doesn't break voice input flow
Closes #217
Summary
This PR adds screenshot functionality to SpeakMCP, allowing users to capture and send screenshots along with text input to multimodal AI models.
Changes
Core Features
- desktopCapturer API
Technical Implementation
Backend Changes
- captureScreenshot TIPC handler in src/main/tipc.ts
- createMcpTextInput to accept screenshotData parameter
- makeOpenAICompatibleCall now accepts content: any (string or array)
- makeGeminiCall extracts text from multimodal content
- MessageContent type
Frontend Changes
- TextInputPanel component:
  - (screenshotEnabled in config)
- panel.tsx to pass screenshot data through mutation
- MessageContent in:
  - conversation-display.tsx
  - agent-progress.tsx
  - conversation-context.tsx
Type System
- src/shared/types.ts:
  - ConversationMessage to use MessageContent instead of string
  - Config type
Configuration
- src/main/config.ts:
  - screenshotEnabled: true
  - screenshotQuality: 0.8
  - screenshotFormat: "jpeg"
  - screenshotMaxWidth: 1920
  - screenshotMaxHeight: 1080
API Format
Screenshots are sent to AI models using the OpenAI-compatible multimodal format:
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ]
}
Backward Compatibility
- MessageContent is a union type (string | MessageContentPart[])
Testing
Notes
Pull Request opened by Augment Code with guidance from the PR author
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores