An advanced, highly resilient academic scout agent that tracks, crawls, and verifies scholarship application windows in real-time. It integrates a 4-engine search crawler, OpenAI-compatible Llama model verifications (via Cerebras), automated Google Sheets sync orchestration, and styled HTML digest reports emailed directly to your inbox.
Designed to handle real-time web search limitations, network blocks, API rate limits, and language barriers with maximum autonomy.
| Component | Technology | Description |
|---|---|---|
| Language | Python 3.x | Core programming language |
| API Framework | FastAPI + Uvicorn | High-performance asynchronous API endpoints and web server |
| LLM Inference | Cerebras Inference API | High-speed inference using Llama 3.x models via OpenAI-compatible HTTP requests |
| Search Crawler | DuckDuckGo, Yahoo, Bing, SearXNG | 4-engine fallback web search scheduler |
| Web Scraping | requests + beautifulsoup4 | HTTP client and HTML parsing for deep crawling target domains |
| Spreadsheet Integration | Google Sheets API (gspread) | Read and write access to the tracking sheet using a Google Service Account |
| Email Service | SMTP (smtplib) | Automated HTML digest reporting with SSL (465) / STARTTLS (587) support |
| Translation Engine | MyMemory API | Automated free translation for local/non-English scholarship websites |
| Configuration / Env | python-dotenv | Secure loading of API keys and server variables |
| Data Validation | pydantic | Typed schemas for configuration overrides and structured LLM outputs |
- 4-Engine Search Schedule: Queries DuckDuckGo ➔ Yahoo ➔ Bing ➔ SearXNG sequentially. The first engine to return results wins, avoiding reliance on a single provider.
- 3-Round Aggressive Retry: If all engines fail, the crawler retries the whole chain after increasing delays (0s ➔ 30s ➔ 60s), rotating the User-Agent (Chrome, Safari, Firefox, Edge) and spoofing headers to get past Web Application Firewalls (WAF).
- Targeted Deep Crawling: Crawls program index pages and automatically branches into child links (e.g. `/news`, `/announcements`, `/deadline`) matching structural keywords to locate text-based timeline notices.
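The fallback chain and retry rounds described above can be sketched as follows. This is a minimal illustration, not the project's crawler: the engine callables, the `search_with_fallback` name, and the User-Agent strings are all assumptions, and rotating the User-Agent once per round is one possible reading of "dynamic rotation".

```python
import itertools
import time
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical User-Agent strings, one per browser family named above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Edg/124.0",
]

# An engine takes (query, headers) and returns a list of result URLs.
SearchFn = Callable[[str, dict], List[str]]


def search_with_fallback(
    query: str,
    engines: List[Tuple[str, SearchFn]],
    round_delays: Iterable[float] = (0, 30, 60),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], List[str]]:
    """Try each engine in order; if all fail, wait and retry the whole chain."""
    agents = itertools.cycle(USER_AGENTS)
    for delay in round_delays:
        sleep(delay)  # 0s before the first round, then 30s, then 60s
        headers = {"User-Agent": next(agents)}  # rotate once per round
        for name, engine in engines:
            try:
                results = engine(query, headers)
            except Exception:
                continue  # blocked or timed out: fall through to the next engine
            if results:
                return name, results  # the first engine with results wins
    return None, []  # every round exhausted
```

Passing `sleep` as a parameter keeps the delays out of unit tests; a caller would normally leave it at `time.sleep`.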
Located in schreminder/src/config/scholarship_config.py, this engine overrides crawler behavior for complex sites:
- `preferred_query` / `preferred_urls`: Directs searches to the exact sub-pages.
- `locked_urls`: Skips search engines entirely to scrape only target portals (e.g. for GKS).
- `date_source_domain`: Enforces date authority, preventing LLM from pulling dates from third-party blogs or wrong embassy regions.
- `context_hint`: Injects operator-verified knowledge (e.g., "deadlines vary by school") to help the LLM draw accurate conclusions.
- `needs_translation`: Detects non-English sites (under 5% ASCII ratio) and automatically translates page texts via MyMemory API.
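As a rough sketch of how such overrides might be modeled and applied: the field names come from the list above, but the `ScholarshipOverride` class, both helper functions, and the example URLs are hypothetical. A standard-library dataclass stands in for the pydantic schema so the snippet has no dependencies.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class ScholarshipOverride:
    """Hypothetical per-scholarship override record."""
    preferred_query: Optional[str] = None
    preferred_urls: List[str] = field(default_factory=list)
    locked_urls: List[str] = field(default_factory=list)
    date_source_domain: Optional[str] = None
    context_hint: Optional[str] = None
    needs_translation: bool = False


def urls_to_scrape(cfg: ScholarshipOverride, search_results: List[str]) -> List[str]:
    """locked_urls bypass search entirely; otherwise preferred URLs go first."""
    if cfg.locked_urls:
        return list(cfg.locked_urls)
    extra = [u for u in search_results if u not in cfg.preferred_urls]
    return cfg.preferred_urls + extra


def is_date_authority(cfg: ScholarshipOverride, url: str) -> bool:
    """True if dates found at `url` may be trusted under date_source_domain."""
    if not cfg.date_source_domain:
        return True  # no restriction configured
    host = (urlparse(url).hostname or "").lower()
    domain = cfg.date_source_domain.lower()
    return host == domain or host.endswith("." + domain)


# Example with placeholder URLs (not the real GKS portal).
gks = ScholarshipOverride(
    locked_urls=["https://portal.example.org/gks"],
    date_source_domain="example.org",
    context_hint="deadlines vary by school",
)
```

Matching on the host suffix with a leading dot means `notexample.org` is not mistaken for a subdomain of `example.org`.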
- Handled in `schreminder/src/engine/name_parser.py`, it detects parenthesized tags like `(Scholarship Body) University Name` (e.g., `(MEXT Scholarship) Hokkaido University`) and automatically generates queries targeting specific recommendation guidelines.
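A minimal sketch of this kind of tag detection, assuming the tag is always a leading parenthesized prefix. The regular expression, function names, and query wording are illustrative and not taken from `name_parser.py`.

```python
import re
from typing import List, Optional, Tuple

# Leading "(Scholarship Body)" followed by the university name.
TAG_PATTERN = re.compile(r"^\s*\((?P<body>[^)]+)\)\s*(?P<university>.+?)\s*$")


def split_tagged_name(name: str) -> Tuple[Optional[str], str]:
    """Return (scholarship_body, university); body is None when no tag is present."""
    match = TAG_PATTERN.match(name)
    if not match:
        return None, name.strip()
    return match.group("body").strip(), match.group("university")


def build_queries(name: str) -> List[str]:
    """Build search queries; the phrasing here is a hypothetical example."""
    body, university = split_tagged_name(name)
    if body is None:
        return [f"{university} application deadline"]
    return [
        f"{university} {body} university recommendation application guidelines",
        f"{body} {university} deadline",
    ]
```

For instance, `split_tagged_name("(MEXT Scholarship) Hokkaido University")` separates the scholarship body from the university so that each can be weighted differently in the query.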
- Queue Congestion Retry: Automatically detects Cerebras `429 queue_exceeded` responses (server overload) and retries with backoff delays (10s ➔ 20s ➔ 30s).
- Per-Row Error Sentinels: If an LLM call exhausts all retries, the system logs a `QUOTA_EXCEEDED` status for that row and continues the batch instead of aborting the pipeline.
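The retry-then-sentinel behavior above might look roughly like this. `QueueExceeded`, `verify_row`, and `verify_batch` are hypothetical names, and `call_llm` stands in for whatever function wraps the Cerebras request; only the delays and the `QUOTA_EXCEEDED` status come from the description.

```python
import time
from typing import Callable, Dict, Iterable, List


class QueueExceeded(Exception):
    """Stand-in for an HTTP 429 queue_exceeded response from the LLM API."""


def verify_row(
    row: Dict[str, str],
    call_llm: Callable[[Dict[str, str]], Dict[str, str]],
    delays: Iterable[float] = (10, 20, 30),
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """One initial attempt plus one retry per delay; sentinel if all fail."""
    for delay in [0, *delays]:
        if delay:
            sleep(delay)  # back off before each retry
        try:
            return call_llm(row)
        except QueueExceeded:
            continue
    # Retries exhausted: record a sentinel instead of raising.
    return {
        "scholarship_name": row.get("scholarship_name", ""),
        "status": "QUOTA_EXCEEDED",
    }


def verify_batch(rows: List[Dict[str, str]], call_llm, **kwargs) -> List[Dict[str, str]]:
    """Process every row; one row's quota failure never aborts the batch."""
    return [verify_row(row, call_llm, **kwargs) for row in rows]
```

Returning a sentinel row rather than raising keeps the output aligned one-to-one with the input rows, which simplifies writing results back to the sheet.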
- Grid Pre-expansion: Automatically expands Google Sheet dimensions before writing new output columns, avoiding grid-out-of-bounds `APIError 400`.
- Manual Override (Bypass Mode): If a row has `active_status = "T"` and `verified = "F"`, the engine bypasses search/LLM entirely and preserves user-entered data as `VERIFIED (MANUAL)`.
- Dry-Run Mode: Setting `SCOUT_DRY_RUN=true` disables Google Sheet updates while still parsing web results, generating JSON outputs, and sending email digests.
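The three behaviors above can be sketched as small helpers. The function names and the `"skip"` / `"manual"` / `"scout"` labels are assumptions. `ensure_columns` relies only on the `col_count` attribute and `add_cols()` method that gspread worksheets provide, so any object with the same shape works.

```python
import os
from typing import Mapping, Optional


def dry_run_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when SCOUT_DRY_RUN is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("SCOUT_DRY_RUN", "").strip().lower() == "true"


def route_row(row: Mapping[str, str]) -> str:
    """Decide how a sheet row is handled: 'skip', 'manual', or 'scout'."""
    # Containment checks follow the column table: Status must contain T/t,
    # and Verified containing F/f triggers the manual bypass.
    if "t" not in str(row.get("active_status", "")).lower():
        return "skip"
    if "f" in str(row.get("verified", "")).lower():
        return "manual"  # keep user-entered data as VERIFIED (MANUAL)
    return "scout"  # run search + LLM verification


def ensure_columns(worksheet, needed_cols: int) -> int:
    """Grow the sheet before writing; returns how many columns were added."""
    missing = needed_cols - worksheet.col_count
    if missing > 0:
        worksheet.add_cols(missing)
        return missing
    return 0
```

A sync loop would call `ensure_columns` once before writing output columns and skip the write step entirely when `dry_run_enabled()` returns true.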
The script dynamically maps column indices by matching Row 1 headers against case-insensitive keywords and aliases.
| Column Type | Script Key | Preferred Header | Supported Aliases / Alternative Headings |
|---|---|---|---|
| Input | `scholarship_name` | Scholarship Name | Name, Scholarship |
| Input | `active_status` | Status | (Must contain T or t to process the row) |
| Input | `verified` | Verified | (If containing F or f while Status is T, triggers Manual Bypass) |
| Input | `historical_method` | Processing Method (Historical) | Method, Processing Method, Reg. Path |
| Input | `historical_info_link` | Info Link (Historical) | Info Link, Link Info, Historical Info Link |
| Input | `historical_reg_link` | Registration Link (Historical) | Reg. Link, Link Daftar, Historical Registration Link |
| Input | `estimated_timeline` | Estimated Timeline | Timeline, Est. Date |
| Input | `note` | Note | Optional operator remarks |

| Column Type | Script Key | Preferred Header | Supported Aliases / Alternative Headings |
|---|---|---|---|
| Output | `status` | Verified Status | Scout Status |
| Output | `start_date` | Verified Start Date | Start Date, Application Start Date |
| Output | `deadline` | Verified Deadline | Deadline, Application Deadline |
| Output | `verified_info_url` | Verified Info Link | Verified Source Link, Verified Info URL |
| Output | `supplementary_url` | Supplementary Link | Supplementary Source URL, Announcement Link |
| Output | `verified_reg_url` | Verified Reg Link | Verified Registration Link, Verified Reg URL |
| Output | `fallback_used` | Fallback Used | Url Verification Fallback Used |
| Output | `confidence` | Confidence Score | Confidence |
| Output | `detected_method` | Detected Method | Processing Method Detected |
| Output | `remarks` | Remarks | Notes, Summary |
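Header-based column mapping of this kind can be sketched as below. The alias table is a small excerpt of the tables above, `map_columns` is a hypothetical name, and exact case-insensitive matching is an assumption; the real script may match on keywords or substrings instead.

```python
from typing import Dict, List, Sequence

# Excerpt of the alias tables: preferred header first, then aliases.
HEADER_ALIASES: Dict[str, List[str]] = {
    "scholarship_name": ["Scholarship Name", "Name", "Scholarship"],
    "start_date": ["Verified Start Date", "Start Date", "Application Start Date"],
    "deadline": ["Verified Deadline", "Deadline", "Application Deadline"],
    "confidence": ["Confidence Score", "Confidence"],
}


def map_columns(
    header_row: Sequence[str],
    aliases: Dict[str, List[str]] = HEADER_ALIASES,
) -> Dict[str, int]:
    """Map script keys to 0-based column indices found in Row 1."""
    normalized = [str(cell).strip().lower() for cell in header_row]
    mapping: Dict[str, int] = {}
    for key, names in aliases.items():
        for name in names:  # the preferred header wins over later aliases
            candidate = name.lower()
            if candidate in normalized:
                mapping[key] = normalized.index(candidate)
                break
    return mapping
```

Keys whose headers are absent are simply left out of the mapping, so a caller can tell which output columns still need to be created.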
To avoid Google API 403 Sheet Not Found or authorization errors:
- Open your Google Cloud IAM Console and navigate to your Service Account dashboard.
- Locate and copy the generated Service Account Email (e.g. `scout-agent@project.iam.gserviceaccount.com`).
- Open your Tracking Google Spreadsheet, click Share at the top right, and invite the email address as an "Editor".
Clone the repository and install requirements:

    pip install -r requirements.txt

Copy the example environment configuration to `.env`:

    cp .env.example .env

Fill in the appropriate API keys, spreadsheet ID, service account JSON string, and SMTP credentials.
To launch the FastAPI server locally:

    uvicorn src.app:app --reload --host 127.0.0.1 --port 8000

Then navigate to 👉 http://127.0.0.1:8000/docs to test via the Swagger UI.
- `POST /verify`: Manually verify a single scholarship row by name without running the full sync.
- `POST /sync`: Manually trigger the full Google Sheets sync, updating rows and sending the report email.
python src/runner.py