shanthanu47/Prim-Eyes-Automation

PimEyes Automation Suite: A Technical Case Study

Conversation Reference: PimEyes Browser-Use Refinement

I developed this project in three stages to automate PimEyes.com, a site protected by heavy anti-bot measures (Cloudflare, Prosopo CAPTCHA) and complex UI flows. This document covers the technical hurdles I faced in each approach, how I solved them, and the final Hybrid Agent architecture.


Final Recommendation: The Hybrid AI Agent

For most users, Approach 3 (Hybrid) is the most robust solution. It balances the flexibility of an LLM Agent with the reliability of hand-crafted code for critical security bypasses.

Quick Start:

python solve_with_browser_use.py

Architectural Evolution & Problem Solving

Approach 1: The Traditional Script (script.py)

Philosophy: "Deterministic Control". I attempted to write a pure Playwright script that manually handled every interaction.

The Problems & The Fixes

  1. Problem: The Unicode Crash

    • Symptom: Script crashed instantly on Windows with UnicodeEncodeError.
    • Root Cause: Python's default console encoding on Windows is often a legacy code page (such as cp1252) that cannot encode the emoji characters used in the log messages.
    • Fix: I implemented sys.stdout.reconfigure(encoding='utf-8') to force UTF-8 output streams.
  2. Problem: The Invisible Checkbox (Shadow DOM)

    • Symptom: Playwright's page.click('input[type=checkbox]') failed because the Prosopo captcha hides its elements inside an open Shadow Root.
    • Fix: I wrote custom JavaScript injection (page.evaluate) to explicitly traverse document.querySelector('...').shadowRoot to find and click buttons.
  3. Problem: "Access Denied" by Cloudflare

    • Symptom: 403 Forbidden or Cloudflare challenge loops.
    • Fix: I integrated a residential Proxy via proxy.txt.
  4. Problem: Captcha Modal Not Loading

    • Symptom: Automation was too fast/robotic; the site wouldn't trigger the challenge.
    • Fix: I implemented a human_behavior() function that adds random mouse movements, jitter, and scrolling to simulate a real user before interactions.
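
The Unicode fix from Problem 1 above can be sketched as follows. This is a minimal illustration, not the exact code in script.py: the helper name force_utf8_output and the errors="replace" choice are assumptions added here.

```python
import sys


def force_utf8_output() -> None:
    """Switch stdout and stderr to UTF-8 so emoji in log lines do not crash."""
    for stream in (sys.stdout, sys.stderr):
        # reconfigure() exists on io.TextIOWrapper (Python 3.7+). A stream that
        # was swapped for another object may not have it, so look it up safely.
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            # errors="replace" is an assumption: it degrades unencodable
            # characters instead of raising.
            reconfigure(encoding="utf-8", errors="replace")


force_utf8_output()
print("\u2705 logging is safe to use emoji now")
```

Calling the helper once at the top of the script, before the first log line, is enough. Setting the PYTHONUTF8=1 environment variable is another way to get the same effect without code changes.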

Approach 2: The Pure AI Agent (Conceptual Experiment)

Philosophy: "Let the LLM figure it out". I tried giving a generic task to browser-use: "Go to PimEyes and search."

The Problems (Why I abandoned it)

  1. Problem: Hidden File Inputs

    • Observation: The LLM tried to click the visual "Upload" button, but it was a <div> masking a hidden <input type="file">. The Agent often clicked the wrong pixels or failed to invoke the OS file chooser.
    • Result: 50% failure rate on upload.
  2. Problem: Complex Captcha Grids

    • Observation: Standard Vision models struggled to map the 3x3 grid perfectly to click coordinates based on generic "click the images" instructions.
    • Result: It would miss one image or misclick, leading to infinite captcha loops.
  3. Problem: Reasoning Cost & Latency

    • Observation: The Agent would spend 30 seconds "thinking" about simple Consent popups.
    • Result: Extremely slow execution compared to regex/selectors.

Approach 3: The Hybrid Agent (solve_with_browser_use.py)

Philosophy: "Augmented Intelligence". I used browser-use for orchestration but injected Custom Tools (Python functions) for the hard parts. This is the Active Solution.

The Problems & The Fixes

  1. Problem: Tool Integration Crashes

    • Symptom: PydanticInvalidForJsonSchema when passing the browser object to tool functions.
    • Fix: I refactored the architecture to define tools inside the main() function's closure. This allows tools to access the browser instance directly without needing it passed as a schema-validated argument.
  2. Problem: Reasoning Timeouts

    • Symptom: "LLM call timed out after 90 seconds".
    • Fix: I switched the model from gemini-2.0-flash-exp (experimental) to gemini-2.5-flash-lite, which is significantly faster and more stable for tool calling.
  3. Problem: API Rate Limiting

    • Symptom: 429 Too Many Requests from Google Gemini API during heavy testing.
    • Fix: I implemented API Key Rotation. The script now loads multiple keys from gemini_keys.txt and randomly selects one for each execution session.
  4. Problem: Dynamic File Selection

    • Symptom: Hardcoding the filename in the tool meant the Agent couldn't choose which file to upload.
    • Fix: Dynamic Prompting. The main() function now scans the photo/ directory and explicitly inserts the found filename (e.g., ronaldo.webp) into the Agent's natural language Prompt.
  5. Problem: Prosopo Solver Reliability

    • Fix: I ported the entire logic from Approach 1 (Shadow DOM piercing, Screenshotting, Coordinate Geometry) into a custom tool @controller.action("Solve Captcha Challenge").
    • Workflow:
      1. Capture screenshot.
      2. Ask Gemini Vision: "Return JSON list of target indices [1, 5, 9]".
      3. Convert indices to X/Y coordinates using Math.
      4. Click.
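
Fixes 1 and 4 above (closure-scoped tools and dynamic prompting) can be sketched without browser-use itself. Everything named here is hypothetical and only illustrates the pattern: FakeBrowser stands in for the real browser object, make_tools and build_task are invented helper names, and the prompt wording and the list of image suffixes are assumptions.

```python
from pathlib import Path

# Assumed set of accepted image formats; adjust to what the site accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_photo(photo_dir: Path) -> Path:
    """Return the first image in photo_dir (sorted by name)."""
    candidates = sorted(
        p for p in photo_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not candidates:
        raise FileNotFoundError(f"no image found in {photo_dir}")
    return candidates[0]


def build_task(photo: Path) -> str:
    """Insert the discovered filename into the natural-language task."""
    return f"Upload the file '{photo.name}' with the upload tool, then run the search."


class FakeBrowser:
    """Hypothetical stand-in for the real browser instance."""

    def __init__(self) -> None:
        self.uploaded: list[Path] = []

    def upload(self, path: Path) -> None:
        self.uploaded.append(path)


def make_tools(browser: FakeBrowser, photo_dir: Path) -> dict:
    """Define tools inside a closure so `browser` is captured, not passed in."""

    def upload_file(filename: str) -> str:
        # Only a plain string appears in the signature, so a schema generator
        # never has to serialize the browser object.
        path = photo_dir / filename
        browser.upload(path)
        return f"uploaded {path.name}"

    return {"upload_file": upload_file}
```

In the real script the inner function would be registered with the agent framework instead of being returned in a dict. The point of the design is that the tool's signature contains only JSON-friendly types, while the browser travels through the enclosing scope.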

Future Scope

  1. Headless Mode Optimization:
    • Currently, I run headless=False (visible browser) because browser-use works best with visual context. Optimizing for headless execution would allow this to run on servers (CI/CD).
  2. Session Persistence:
    • Save cookies/local storage after a successful Captcha solve so subsequent runs don't need to re-prove humanity.
  3. Docker Containerization:
    • Package the Python environment, Playwright browsers, and proxy logic into a Docker container for easy deployment.
  4. Multi-Modal Fallback:
    • If Gemini Vision fails, fall back to an alternative vision provider (like OpenAI GPT-4o) specifically for the Captcha step to increase redundancy.

Technical Setup

Prerequisites

  • Python 3.11+
  • Playwright Browsers (playwright install)
  • Google Gemini API Key(s)

Configuration

  1. gemini_keys.txt: Add API keys (one per line).
  2. proxy.txt: Add the proxy on a single line in the form server:port:user:pass.
  3. photo/: Add your search images here.
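
As a rough guide to how these files might be read, the sketch below parses both formats. The function names are hypothetical, and the http:// scheme on the proxy server is an assumption; the returned dictionary uses the server/username/password keys that Playwright's proxy setting expects.

```python
import random
from pathlib import Path


def load_keys(path: Path) -> list[str]:
    """Read one API key per line, skipping blank lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_key(keys: list[str]) -> str:
    """Pick one key at random for this execution session."""
    if not keys:
        raise ValueError("gemini_keys.txt contains no keys")
    return random.choice(keys)


def parse_proxy(line: str) -> dict[str, str]:
    """Turn 'server:port:user:pass' into a proxy settings dictionary."""
    # maxsplit=3 keeps any ':' inside the password intact; a line with fewer
    # than four fields raises ValueError on unpacking.
    server, port, user, password = line.strip().split(":", 3)
    return {
        "server": f"http://{server}:{port}",  # scheme is an assumption
        "username": user,
        "password": password,
    }
```

A typical call would be parse_proxy(Path("proxy.txt").read_text()), with the result passed as the proxy option when the browser is launched.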

Verified & Refined in conversation: PimEyes Browser-Use Refinement
