PimEyes Automation Suite: A Technical Case Study

Conversation Reference: PimEyes Browser-Use Refinement

I developed this project in three stages to automate PimEyes.com, a site protected by heavy anti-bot measures (Cloudflare, Prosopo CAPTCHA) and complex UI flows. This document describes each approach, the technical hurdles I hit, how I solved them, and the final Hybrid Agent architecture.

Final Recommendation: The Hybrid AI Agent

For most users, Approach 3 (Hybrid) is the most robust option. It balances the flexibility of an LLM Agent with the reliability of hand-crafted code for critical security bypasses.

Quick Start:

python solve_with_browser_use.py

Architectural Evolution & Problem Solving

Approach 1: The Traditional Script (`script.py`)

Philosophy: "Deterministic Control". I attempted to write a pure Playwright script that manually handled every interaction.

The Problems & The Fixes

Problem: The Unicode Crash
- Symptom: Script crashed instantly on Windows with UnicodeEncodeError.
- Root Cause: On Windows, Python's console output often defaults to a legacy code page (such as cp1252) that cannot encode the emoji characters used in the logs.
- Fix: I implemented sys.stdout.reconfigure(encoding='utf-8') to force UTF-8 output streams.
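A minimal sketch of that fix, assuming it runs once at startup before any logging. The helper name `force_utf8` and the `errors="replace"` fallback are my own illustrative choices, not taken from script.py.

```python
import sys


def force_utf8(*streams):
    """Switch text streams to UTF-8; defaults to stdout and stderr."""
    for stream in streams or (sys.stdout, sys.stderr):
        # reconfigure() exists on io.TextIOWrapper (Python 3.7+). Streams that
        # were swapped for other objects (some IDEs, test runners) may lack it.
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            # errors="replace" is an assumption: it degrades gracefully
            # instead of raising if a character still cannot be encoded.
            reconfigure(encoding="utf-8", errors="replace")


force_utf8()
print("✅ console is ready for emoji log lines")
```

Setting the environment variable PYTHONUTF8=1 before launching Python is an alternative that needs no code change.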
Problem: The Invisible Checkbox (Shadow DOM)
- Symptom: Playwright's page.click('input[type=checkbox]') failed because the Prosopo captcha hides its elements inside an open shadow root.
- Fix: I wrote custom JavaScript injection (page.evaluate) to explicitly traverse document.querySelector('...').shadowRoot to find and click buttons.
Problem: "Access Denied" by Cloudflare
- Symptom: 403 Forbidden or Cloudflare challenge loops.
- Fix: I integrated a residential proxy, configured via proxy.txt.
Problem: Captcha Modal Not Loading
- Symptom: Automation was too fast/robotic; the site wouldn't trigger the challenge.
- Fix: I implemented a human_behavior() function that adds random mouse movements, jitter, and scrolling to simulate a real user before interactions.

Approach 2: The Pure AI Agent (Conceptual Experiment)

Philosophy: "Let the LLM figure it out". I tried giving a generic task to browser-use: "Go to PimEyes and search."

The Problems (Why I abandoned it)

Problem: Hidden File Inputs
- Observation: The LLM tried to click the visual "Upload" button, but it was a <div> masking a hidden <input type="file">. The Agent often clicked the wrong pixels or failed to invoke the OS file chooser.
- Result: 50% failure rate on upload.
Problem: Complex Captcha Grids
- Observation: Standard Vision models struggled to map the 3x3 grid perfectly to click coordinates based on generic "click the images" instructions.
- Result: It would miss one image or misclick, leading to infinite captcha loops.
Problem: Reasoning Cost & Latency
- Observation: The Agent would spend 30 seconds "thinking" about simple Consent popups.
- Result: Extremely slow execution compared to regex/selectors.

Approach 3: The Hybrid Agent (`solve_with_browser_use.py`)

Philosophy: "Augmented Intelligence". I used browser-use for orchestration but injected Custom Tools (Python functions) for the hard parts. This is the Active Solution.

The Problems & The Fixes

Problem: Tool Integration Crashes
- Symptom: PydanticInvalidForJsonSchema when passing the browser object to tool functions.
- Fix: I refactored the architecture to define tools inside the main() function's closure. This allows tools to access the browser instance directly without needing it passed as a schema-validated argument.
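The closure pattern can be illustrated without the real framework. In the sketch below, `ToolRegistry` and `FakeBrowser` are hypothetical stand-ins for the agent's tool registry and the browser object. The point is only that `browser` is captured from the enclosing scope, so it never appears in a tool's signature and nothing tries to build a JSON schema for it.

```python
class FakeBrowser:
    """Hypothetical stand-in for the real browser object."""

    def __init__(self):
        self.visited = []

    def goto(self, url):
        self.visited.append(url)


class ToolRegistry:
    """Hypothetical stand-in for an agent framework's tool registry."""

    def __init__(self):
        self.tools = {}

    def action(self, description):
        def decorator(fn):
            self.tools[description] = fn
            return fn
        return decorator


def main():
    browser = FakeBrowser()
    registry = ToolRegistry()

    @registry.action("Open page")
    def open_page(url: str) -> str:
        # Only `url` is a tool argument; `browser` comes from main()'s scope.
        browser.goto(url)
        return f"opened {url}"

    # The agent would normally choose and call the tool; it is called
    # directly here to keep the sketch self-contained.
    result = registry.tools["Open page"]("https://example.com")
    return browser, result


browser, result = main()
print(result)
```

The tool's visible parameters stay limited to simple, serializable types, which is what a schema generator expects.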
Problem: Reasoning Timeouts
- Symptom: "LLM call timed out after 90 seconds".
- Fix: I switched the model from gemini-2.0-flash-exp (experimental) to gemini-2.5-flash-lite, which is significantly faster and more stable for tool calling.
Problem: API Rate Limiting
- Symptom: 429 Too Many Requests from Google Gemini API during heavy testing.
- Fix: I implemented API Key Rotation. The script now loads multiple keys from gemini_keys.txt and randomly selects one for each execution session.
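A minimal sketch of per-session key selection, assuming one key per line and that blank lines should be skipped. The function names are illustrative, and the demo writes a throwaway file rather than reading a real gemini_keys.txt.

```python
import random
import tempfile
from pathlib import Path


def load_keys(path: Path) -> list[str]:
    """Read one API key per line, ignoring blank lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    keys = [line.strip() for line in lines if line.strip()]
    if not keys:
        raise ValueError(f"no API keys found in {path}")
    return keys


def pick_session_key(keys: list[str]) -> str:
    """Pick one key at random for the whole execution session."""
    return random.choice(keys)


# Demo with a temporary file standing in for gemini_keys.txt.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "gemini_keys.txt"
    key_file.write_text("key-aaa\n\nkey-bbb\nkey-ccc\n", encoding="utf-8")
    keys = load_keys(key_file)
    session_key = pick_session_key(keys)

print(f"loaded {len(keys)} keys")
```

Random selection spreads load across sessions but does not react to a 429 in the middle of a run. Retrying with a different key would be a separate extension.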
Problem: Dynamic File Selection
- Symptom: Hardcoding the filename in the tool meant the Agent couldn't choose which file to upload.
- Fix: Dynamic Prompting. The main() function now scans the photo/ directory and explicitly inserts the found filename (e.g., ronaldo.webp) into the Agent's natural language Prompt.
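The directory scan and prompt construction might look like the sketch below. The accepted suffixes, the choice of the first file in sorted order, and the prompt wording are all assumptions for illustration, not the text used in solve_with_browser_use.py.

```python
import tempfile
from pathlib import Path

# Assumed set of image types; adjust to whatever the site accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_photo(photo_dir: Path) -> Path:
    """Return the first image in photo_dir (sorted by name)."""
    candidates = sorted(
        p for p in photo_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not candidates:
        raise FileNotFoundError(f"no image files found in {photo_dir}")
    return candidates[0]


def build_task(photo: Path) -> str:
    """Insert the discovered filename into the agent's task prompt."""
    return (
        f"Upload the file '{photo.name}' with the upload tool, "
        "then run the search."
    )


# Demo with a temporary directory standing in for photo/.
with tempfile.TemporaryDirectory() as tmp:
    photo_dir = Path(tmp) / "photo"
    photo_dir.mkdir()
    (photo_dir / "ronaldo.webp").write_bytes(b"")
    (photo_dir / "notes.txt").write_text("not an image", encoding="utf-8")
    photo = find_photo(photo_dir)
    task = build_task(photo)

print(task)
```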
Problem: Prosopo Solver Reliability
- Fix: I ported the entire logic from Approach 1 (Shadow DOM piercing, Screenshotting, Coordinate Geometry) into a custom tool @controller.action("Solve Captcha Challenge").
- Workflow:
  1. Capture screenshot.
  2. Ask Gemini Vision: "Return JSON list of target indices [1, 5, 9]".
  3. Convert the indices to X/Y click coordinates with grid arithmetic.
  4. Click.
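Step 3 of the workflow above is plain grid arithmetic. The sketch below assumes the indices are 1-based and numbered row by row (as the [1, 5, 9] example suggests), and that the grid's bounding box is known in page pixels. The function names and box format are illustrative.

```python
import json


def cell_center(index, left, top, width, height, rows=3, cols=3):
    """Return the (x, y) center of a 1-based, row-major grid cell."""
    if not 1 <= index <= rows * cols:
        raise ValueError(f"index {index} is outside a {rows}x{cols} grid")
    row, col = divmod(index - 1, cols)
    cell_w = width / cols
    cell_h = height / rows
    # Aim for the middle of the cell rather than its corner.
    return (left + (col + 0.5) * cell_w, top + (row + 0.5) * cell_h)


def indices_to_points(reply, box):
    """Parse a JSON list of indices and map each one to a click point."""
    indices = json.loads(reply)
    return [cell_center(i, *box) for i in indices]


# Grid occupying a 300x300 px square whose top-left corner is at (100, 200).
points = indices_to_points("[1, 5, 9]", (100, 200, 300, 300))
print(points)  # [(150.0, 250.0), (250.0, 350.0), (350.0, 450.0)]
```

Rejecting out-of-range indices guards against a malformed model reply turning into a click outside the grid.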

Future Scope

Headless Mode Optimization:
- Currently, I run headless=False (visible browser) because browser-use works best with visual context. Optimizing for headless execution would allow this to run on servers (CI/CD).
Session Persistence:
- Save cookies/local storage after a successful Captcha solve so subsequent runs don't need to re-prove humanity.
Docker Containerization:
- Package the Python environment, Playwright browsers, and proxy logic into a Docker container for easy deployment.
Multi-Modal Fallback:
- If Gemini Vision fails, fall back to an alternative vision provider (like OpenAI GPT-4o) specifically for the Captcha step to increase redundancy.

Technical Setup

Prerequisites

Python 3.11+
Playwright Browsers (playwright install)
Google Gemini API Key(s)

Configuration

gemini_keys.txt: Add API keys (one per line).
proxy.txt: Proxy details in the format server:port:user:pass.
photo/: Add your search images here.
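As an illustration of the proxy.txt format, the sketch below parses one server:port:user:pass line into the dictionary shape that Playwright's `proxy` launch option accepts (`server`, `username`, `password`). The `http://` scheme and the function name are assumptions. The split is limited to three separators so a password containing colons stays intact.

```python
def parse_proxy_line(line: str) -> dict[str, str]:
    """Parse 'server:port:user:pass' into a Playwright-style proxy dict."""
    parts = line.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError("expected format server:port:user:pass")
    host, port, user, password = parts
    return {
        # The scheme is an assumption; use the one your provider specifies.
        "server": f"http://{host}:{port}",
        "username": user,
        "password": password,
    }


proxy = parse_proxy_line("proxy.example.com:8080:alice:pa:ss\n")
print(proxy["server"])  # http://proxy.example.com:8080
```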

Verified & Refined in conversation: PimEyes Browser-Use Refinement
