Conversation Reference: PimEyes Browser-Use Refinement
I developed this project through three distinct evolutionary stages to solve the challenge of automating PimEyes.com—a site protected by heavy anti-bot measures (Cloudflare, Prosopo CAPTCHA) and complex UI flows. This document details my journey, the technical hurdles I faced in each approach, how I solved them, and the final robust Hybrid Agent architecture.
For most users, Approach 3 (Hybrid) is the robust solution. It balances the flexibility of an LLM Agent with the reliability of hand-crafted code for critical security bypasses.
Quick Start:
`python solve_with_browser_use.py`
Approach 1 (Pure Playwright):
Philosophy: "Deterministic Control". I attempted to write a pure Playwright script that manually handled every interaction.
- Problem: The Unicode Crash
  - Symptom: Script crashed instantly on Windows with `UnicodeEncodeError`.
  - Root Cause: Python's default console encoding on Windows (often a legacy code page such as cp1252) cannot encode some of the emoji characters used in the logs.
  - Fix: I implemented `sys.stdout.reconfigure(encoding='utf-8')` to force UTF-8 output streams.
- Problem: The Invisible Checkbox (Shadow DOM)
  - Symptom: Playwright's `page.click('input[type=checkbox]')` failed because the Prosopo captcha hides elements inside an open Shadow Root.
  - Fix: I wrote custom JavaScript injection (`page.evaluate`) to explicitly traverse `document.querySelector('...').shadowRoot` to find and click buttons.
- Problem: "Access Denied" by Cloudflare
  - Symptom: `403 Forbidden` or Cloudflare challenge loops.
  - Fix: I integrated a residential proxy via `proxy.txt`.
- Problem: Captcha Modal Not Loading
  - Symptom: Automation was too fast/robotic; the site wouldn't trigger the challenge.
  - Fix: I implemented a `human_behavior()` function that adds random mouse movements, jitters, and scrolling to simulate a real user before interactions.
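The Unicode fix listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `force_utf8` is hypothetical, and the `hasattr` guard is an added assumption so the call is harmless when a stream has been replaced by an object without `reconfigure()` (for example, under some test runners).

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    io.TextIOWrapper.reconfigure() exists since Python 3.7. Streams that
    do not provide it are returned unchanged.
    """
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Apply once at startup, before any log line containing emoji is printed.
force_utf8(sys.stdout)
force_utf8(sys.stderr)
```

Calling this at the top of the entry script keeps emoji in log messages from raising `UnicodeEncodeError` on consoles that default to a legacy code page.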
Approach 2 (Pure LLM Agent):
Philosophy: "Let the LLM figure it out". I tried giving a generic task to `browser-use`: "Go to PimEyes and search."
- Problem: Hidden File Inputs
  - Observation: The LLM tried to click the visual "Upload" button, but it was a `<div>` masking a hidden `<input type="file">`. The Agent often clicked the wrong pixels or failed to invoke the OS file chooser.
  - Result: 50% failure rate on upload.
- Problem: Complex Captcha Grids
  - Observation: Standard vision models struggled to map the 3x3 grid precisely to click coordinates based on generic "click the images" instructions.
  - Result: It would miss one image or misclick, leading to infinite captcha loops.
- Problem: Reasoning Cost & Latency
  - Observation: The Agent would spend 30 seconds "thinking" about simple consent popups.
  - Result: Extremely slow execution compared to regex/selector-based handling.
Approach 3 (Hybrid Agent):
Philosophy: "Augmented Intelligence". I used `browser-use` for orchestration but injected Custom Tools (Python functions) for the hard parts. This is the Active Solution.
- Problem: Tool Integration Crashes
  - Symptom: `PydanticInvalidForJsonSchema` when passing the `browser` object to tool functions.
  - Fix: I refactored the architecture to define tools inside the `main()` function's closure. This allows tools to access the `browser` instance directly without needing it passed as a schema-validated argument.
- Problem: Reasoning Timeouts
  - Symptom: "LLM call timed out after 90 seconds".
  - Fix: I switched the model from `gemini-2.0-flash-exp` (experimental) to `gemini-2.5-flash-lite`, which is significantly faster and more stable for tool calling.
- Problem: API Rate Limiting
  - Symptom: `429 Too Many Requests` from Google Gemini API during heavy testing.
  - Fix: I implemented API Key Rotation. The script now loads multiple keys from `gemini_keys.txt` and randomly selects one for each execution session.
- Problem: Dynamic File Selection
  - Symptom: Hardcoding the filename in the tool meant the Agent couldn't choose which file to upload.
  - Fix: Dynamic Prompting. The `main()` function now scans the `photo/` directory and explicitly inserts the found filename (e.g., `ronaldo.webp`) into the Agent's natural language prompt.
- Problem: Prosopo Solver Reliability
  - Fix: I ported the entire logic from Approach 1 (Shadow DOM piercing, screenshotting, coordinate geometry) into a custom tool, `@controller.action("Solve Captcha Challenge")`.
  - Workflow:
    - Capture screenshot.
    - Ask Gemini Vision: "Return JSON list of target indices [1, 5, 9]".
    - Convert indices to X/Y coordinates using math.
    - Click.
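The closure-based fix for the tool integration crash can be illustrated with a small, library-free sketch. `ToolRegistry` and `FakeBrowser` are hypothetical stand-ins invented for this example; they are not the `browser-use` API, and the real controller may register tools differently. The point is only the pattern: tools defined inside an enclosing function read the browser from that scope, so the browser never appears in a tool's signature and never needs a JSON schema.

```python
class ToolRegistry:
    """Hypothetical stand-in for an agent's tool controller."""

    def __init__(self):
        self.tools = {}

    def action(self, description):
        # Decorator factory: registers the function under its description.
        def decorator(func):
            self.tools[description] = func
            return func
        return decorator


class FakeBrowser:
    """Hypothetical stand-in for a live browser object."""

    def __init__(self):
        self.visited = []

    def goto(self, url):
        self.visited.append(url)


def build_tools(browser):
    """Define tools in a closure so they can use `browser` directly."""
    registry = ToolRegistry()

    @registry.action("Open page")
    def open_page(url: str) -> str:
        # `browser` comes from the enclosing scope, not from the arguments,
        # so only the plain `url` string would need schema validation.
        browser.goto(url)
        return f"opened {url}"

    return registry
```

With this layout, each tool signature contains only simple, serializable parameters, which is what sidesteps the schema error described above.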
Future Improvements:
- Headless Mode Optimization:
  - Currently, I run `headless=False` (visible browser) because `browser-use` works best with visual context. Optimizing for headless execution would allow this to run on servers (CI/CD).
- Session Persistence:
  - Save cookies/local storage after a successful Captcha solve so subsequent runs don't need to re-prove humanity.
- Docker Containerization:
  - Package the Python environment, Playwright browsers, and proxy logic into a Docker container for easy deployment.
- Multi-Modal Fallback:
  - If Gemini Vision fails, fall back to an alternative vision provider (like OpenAI GPT-4o) specifically for the Captcha step to increase redundancy.
Requirements:
- Python 3.11+
- Playwright Browsers (`playwright install`)
- Google Gemini API Key(s)
Configuration:
- `gemini_keys.txt`: Add API keys (one per line).
- `proxy.txt`: `server:port:user:pass`
- `photo/`: Add your search images here.
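The Dynamic Prompting fix described earlier, which scans `photo/` and puts the filename into the Agent's prompt, might look roughly like the sketch below. The function names, the accepted suffix list, the choice of the first file in sorted order, and the prompt wording are all assumptions made for illustration; the actual script may select files and phrase the task differently.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the project accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_first_image(photo_dir):
    """Return the first image file (sorted by name) in photo_dir."""
    candidates = sorted(
        p for p in Path(photo_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not candidates:
        raise FileNotFoundError(f"no image files found in {photo_dir}")
    return candidates[0]


def build_task_prompt(photo_dir="photo"):
    """Insert the discovered filename into the natural language task."""
    image = find_first_image(photo_dir)
    # Hypothetical wording; the real prompt is not shown in this document.
    return f"Upload the file '{image.name}' and then run the search."
```

Raising `FileNotFoundError` early when the directory holds no images keeps the Agent from starting a session with an unusable prompt.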
Verified & Refined in conversation: PimEyes Browser-Use Refinement