- In leads_20260326_134940.json, the Serper website HTML validation logic is incorrect: the first lead, "Southern Cape Properties", has a website, but the pipeline reports it as not found.
- Test other sample inputs as well for basic website enrichment
Edge Cases:
- very large input files (risk of running out of memory; with no intermediate saves, a crash or an API outage loses all progress)
- rate-limit handling for Google Maps and other APIs
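One possible way to address the large-file edge case is to stream the input in batches and checkpoint after each one. This is a minimal sketch, not the project's implementation: `iter_batches`, `run_with_checkpoints`, the JSON Lines output, and the state file layout are all hypothetical, and resuming assumes the input file and batch size are unchanged between runs.

```python
import csv
import json
from itertools import islice
from pathlib import Path
from typing import Iterator


def iter_batches(csv_path: Path, batch_size: int = 500) -> Iterator[list]:
    """Stream the CSV in fixed-size batches instead of loading it whole."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch


def run_with_checkpoints(csv_path: Path, out_path: Path, state_path: Path,
                         batch_size: int = 500) -> int:
    """Process batches, saving results and progress after each one.

    Returns the number of batches processed in this run.
    """
    done = 0
    if state_path.exists():
        done = json.loads(state_path.read_text(encoding="utf-8"))["batches_done"]

    processed = 0
    for index, batch in enumerate(iter_batches(csv_path, batch_size)):
        if index < done:
            continue  # already saved by an earlier run
        with out_path.open("a", encoding="utf-8") as out:
            for row in batch:
                # Placeholder: real enrichment of `row` would happen here.
                out.write(json.dumps(row) + "\n")
        # Record progress only after the batch is safely written.
        state_path.write_text(json.dumps({"batches_done": index + 1}), encoding="utf-8")
        processed += 1
    return processed
```

Memory use stays bounded by `batch_size`, and a rerun after a crash skips batches that were already written.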
Update project on portfolio, LinkedIn and Upwork
Build a production-style lead intelligence system that takes structured real estate data (e.g. from PropFlux or CSV input), enriches it using multiple external sources, verifies contact information, and produces high-quality, scored leads ready for outreach or CRM systems.
This project will serve as a portfolio-quality reference for:
- lead generation systems
- data enrichment pipelines
- automation + API-based backend systems
- PropFlux output (native integration)
- CSV file (flexible schema)
Note: Input should contain at minimum:
- business / agency name OR listing context
- optional website or location
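The flexible schema and the minimum-input rule above could be handled roughly as follows. This is a sketch under assumptions: the alias table `COLUMN_ALIASES` and the function names are hypothetical and would need to match the headers of real input files.

```python
import csv
import io

# Hypothetical header aliases; extend to match real input files.
COLUMN_ALIASES = {
    "company_name": {"company", "company_name", "agency", "agency_name", "business_name"},
    "listing_context": {"listing", "listing_title", "listing_url", "listing_context"},
    "website": {"website", "url", "site"},
    "location": {"location", "city", "area", "suburb"},
}


def canonical_row(raw: dict) -> dict:
    """Map arbitrary CSV headers onto canonical field names."""
    row = {}
    for header, value in raw.items():
        key = (header or "").strip().lower().replace(" ", "_")
        for canonical, aliases in COLUMN_ALIASES.items():
            if key in aliases and value and value.strip():
                row.setdefault(canonical, value.strip())
    return row


def split_valid(rows):
    """Accept rows with a business/agency name OR listing context; reject the rest."""
    accepted, rejected = [], []
    for raw in rows:
        row = canonical_row(raw)
        if row.get("company_name") or row.get("listing_context"):
            accepted.append(row)
        else:
            rejected.append(raw)
    return accepted, rejected


sample = "Agency Name,URL,City\nSouthern Cape Properties,https://example.com,George\n"
accepted, rejected = split_valid(csv.DictReader(io.StringIO(sample)))
```

Rejected rows are kept in their raw form so they can be surfaced to the user instead of being silently dropped.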
The system should accept:
- CSV file upload
- JSON input (optional)
- PropFlux output (direct ingestion)
Example:
POST /upload
POST /jobs
The system must output enriched and structured lead data in:
- CSV file
- JSON file
- SQLite database (optional)
Each lead should include:
- company_name
- agent_name (if available)
- website
- email
- phone
- location
- source (origin of data)
- confidence_score
- has_chatbot (Yes/No)
- website_speed_score (optional basic)
- last_updated_signal (approximate)
- contact_quality (verified / likely / low)
- lead_score (0–100)
- lead_reason (explanation for score)
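The output record could be modelled as a dataclass so that the CSV, JSON, and SQLite writers share one definition. This is a sketch only: the field names come from the list above, but the types, defaults, and validation rules are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

CONTACT_QUALITIES = {"verified", "likely", "low"}


@dataclass
class Lead:
    company_name: str
    source: str                                  # origin of data
    agent_name: Optional[str] = None
    website: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    confidence_score: float = 0.0                # assumed to be a float
    has_chatbot: Optional[bool] = None           # None = not checked yet
    website_speed_score: Optional[float] = None
    last_updated_signal: Optional[str] = None
    contact_quality: str = "low"
    lead_score: int = 0
    lead_reason: str = ""

    def __post_init__(self) -> None:
        if self.contact_quality not in CONTACT_QUALITIES:
            raise ValueError(f"unknown contact_quality: {self.contact_quality!r}")
        if not 0 <= self.lead_score <= 100:
            raise ValueError("lead_score must be between 0 and 100")


lead = Lead(company_name="XYZ Realty", source="csv", lead_score=82,
            lead_reason="No chatbot, outdated site")
record = asdict(lead)  # plain dict, ready for csv.DictWriter or json.dumps
```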
For each lead:
- scrape website for emails/phones
- fetch Google Maps data (business info, phone, validation)
- optionally cross-check basic registry info
- match entities across sources
- merge data into a single profile
- remove inconsistencies
- extract emails from HTML
- extract phone numbers
- normalize formats
- email: regex + domain validation
- phone: format + region consistency
- assign confidence levels
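The extraction, normalization, and verification steps could look roughly like this. It is a sketch under stated assumptions: the regexes are deliberately simple, the default country code of 27 and the rule "email domain matches website domain" are illustrative choices, and no DNS or mailbox checks are performed.

```python
import re
from html import unescape
from typing import Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_emails(html: str) -> list:
    """Find email addresses in raw HTML (covers mailto: links and plain text)."""
    return sorted({match.lower() for match in EMAIL_RE.findall(unescape(html))})


def normalize_phone(raw: str, default_country_code: str = "27") -> Optional[str]:
    """Normalize to +<digits>; local numbers starting with 0 get the default code."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        normalized = "+" + digits
    elif digits.startswith("0"):
        normalized = "+" + default_country_code + digits[1:]
    else:
        return None  # cannot tell which region the number belongs to
    return normalized if 10 <= len(normalized) <= 16 else None


def extract_phones(html: str) -> list:
    candidates = (normalize_phone(match) for match in PHONE_RE.findall(unescape(html)))
    return sorted({phone for phone in candidates if phone})


def contact_quality(email: Optional[str], website_domain: Optional[str],
                    phone: Optional[str], expected_prefix: str = "+27") -> str:
    """Assign verified / likely / low from domain and region consistency."""
    email_ok = bool(email and website_domain
                    and email.rsplit("@", 1)[-1] == website_domain)
    phone_ok = bool(phone and phone.startswith(expected_prefix))
    if email_ok and phone_ok:
        return "verified"
    if email_ok or phone_ok:
        return "likely"
    return "low"
```

In the real pipeline, `expected_prefix` would presumably be derived from the lead's location rather than hard-coded.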
Score based on:
- missing chatbot / automation
- outdated website signals
- missing or weak contact info
- business activity indicators
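A simple additive scorer is one way to turn these signals into `lead_score` and `lead_reason`. The weights, the `"outdated"` value for `last_updated_signal`, the `is_active` field, and the decision to reward reachable contact info are all assumptions made for illustration; the spec does not fix them.

```python
# Hypothetical weights; they sum to 100 so the score stays in the 0-100 range.
WEIGHTS = {"no_chatbot": 30, "outdated_site": 25, "contact": 20, "active": 25}
CONTACT_FACTOR = {"verified": 1.0, "likely": 0.5, "low": 0.0}


def score_lead(lead: dict) -> tuple:
    """Return (lead_score, lead_reason) for one enriched lead."""
    score = 0
    reasons = []

    if lead.get("has_chatbot") is False:
        score += WEIGHTS["no_chatbot"]
        reasons.append("No chatbot")

    if lead.get("last_updated_signal") == "outdated":
        score += WEIGHTS["outdated_site"]
        reasons.append("outdated site")

    quality = lead.get("contact_quality", "low")
    factor = CONTACT_FACTOR.get(quality, 0.0)
    if factor:
        score += int(WEIGHTS["contact"] * factor)
        reasons.append(f"{quality} contact info")

    if lead.get("is_active"):
        score += WEIGHTS["active"]
        reasons.append("active business")

    return min(score, 100), ", ".join(reasons) or "No scoring signals found"
```

Keeping the reasons alongside the score makes each result explainable in the dashboard and in exports.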
- remove duplicates based on:
- website
- company name
- phone/email
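Deduplication on these keys could be sketched as below. The normalization rules are assumptions, the first occurrence wins, and merging the fields of dropped duplicates is left out. Matching on company name alone can over-merge distinct businesses, so a real implementation may want to require a second matching key.

```python
import re
from urllib.parse import urlparse


def dedup_keys(lead: dict) -> set:
    """Build normalized identity keys from website, company name, phone, and email."""
    keys = set()

    website = (lead.get("website") or "").strip().lower()
    if website:
        # Add a scheme-less prefix so urlparse treats bare domains as hosts.
        host = urlparse(website if "://" in website else "//" + website).netloc
        if host:
            keys.add("web:" + host.removeprefix("www."))

    name = re.sub(r"[^a-z0-9]+", " ", (lead.get("company_name") or "").lower()).strip()
    if name:
        keys.add("name:" + name)

    phone = re.sub(r"\D", "", lead.get("phone") or "")
    if phone:
        keys.add("phone:" + phone)

    email = (lead.get("email") or "").strip().lower()
    if email:
        keys.add("email:" + email)

    return keys


def deduplicate(leads: list) -> list:
    """Keep the first lead seen for any shared key."""
    seen = set()
    unique = []
    for lead in leads:
        keys = dedup_keys(lead)
        if keys & seen:
            continue
        seen |= keys
        unique.append(lead)
    return unique
```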
Track:
- enrichment progress
- sources used per lead
- failures / missing data
- total leads processed
- retry failed requests
- skip invalid sources
- continue processing pipeline safely
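Retrying failed requests while keeping the pipeline running could be sketched like this. The helper names, the exponential backoff schedule, and the default exception types are assumptions; with Requests, the caller would likely pass `requests.RequestException` in `retry_on`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(fetch: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                    retry_on: tuple = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fetch(), retrying with exponential backoff; return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                return None  # give up on this source
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None


def enrich_all(leads: list, enrich_one: Callable[[dict], dict],
               sleep: Callable[[float], None] = time.sleep) -> tuple:
    """Enrich every lead; a failing lead is recorded and the loop continues."""
    results, failures = [], []
    for lead in leads:
        enriched = call_with_retry(lambda: enrich_one(lead), sleep=sleep)
        if enriched is None:
            failures.append(lead)
        else:
            results.append(enriched)
    return results, failures
```

The `failures` list feeds directly into the "failures / missing data" tracking described above.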
/backend
    api/
        routes.py
        jobs.py
    services/
        enrichment.py
        scraper.py
        verifier.py
        scorer.py
    core/
        parser.py
        normalizer.py
        deduplicator.py
/frontend
    dashboard/
/config
    sources.yaml
/output
    leads.csv
    leads.json
runner.py
requirements.txt
README.md
- Python 3.11+
- FastAPI (API layer)
- Requests / BeautifulSoup (scraping)
- React + Vite (dashboard frontend)
- SQLite (jobs/results/settings storage)
- Fly.io (deployment target)
Create a config file:
/config/sources.yaml
Example:
sources:
  website:
    email_selectors:
      - "mailto:"
      - ".contact-email"
    phone_patterns:
      - "+27"
      - "+1"
  google_maps:
    enabled: true

This allows easy extension to new enrichment sources.
Core endpoints:
POST /jobs
GET /jobs
GET /jobs/{id}
POST /jobs/{id}/terminate
POST /jobs/{id}/resume
GET /jobs/{id}/results
GET /jobs/{id}/rejected
GET /jobs/{id}/batches
GET /jobs/{id}/export?format=csv|json
GET /settings
POST /settings/validate
PUT /settings
POST /settings/activate
DELETE /settings/{name}
Responsibilities:
- trigger enrichment jobs
- fetch results
- manage input data
The tool should be executable via CLI from runner.py with explicit arguments:
python runner.py api --host 127.0.0.1 --port 8000 --reload
python runner.py run --input data/leads.csv --input-format csv --config config/sources.yaml --output output/run_summary.txt
- single CLI entrypoint (runner.py)
- subcommands for API mode and pipeline mode
- support configurable host, port, log level, and reload flag for API
- support configurable input path, input format, config path, and output path for pipeline
- CLI arguments should be documented in README.md
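The CLI requirements above map naturally onto `argparse` subcommands. This is a sketch of how `runner.py` might define its parser: the `--log-level` flag name, the defaults, and the accepted `--input-format` values are assumptions, and the handlers that would start the API or the pipeline are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="runner.py", description="Lead intelligence system entrypoint")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # API mode: python runner.py api --host 127.0.0.1 --port 8000 --reload
    api = subcommands.add_parser("api", help="start the API server")
    api.add_argument("--host", default="127.0.0.1")
    api.add_argument("--port", type=int, default=8000)
    api.add_argument("--log-level", default="info")
    api.add_argument("--reload", action="store_true")

    # Pipeline mode: python runner.py run --input ... --output ...
    run = subcommands.add_parser("run", help="run the enrichment pipeline once")
    run.add_argument("--input", required=True)
    run.add_argument("--input-format", choices=["csv", "json", "propflux"], default="csv")
    run.add_argument("--config", default="config/sources.yaml")
    run.add_argument("--output", required=True)

    return parser


args = build_parser().parse_args(
    ["run", "--input", "data/leads.csv", "--input-format", "csv",
     "--config", "config/sources.yaml", "--output", "output/run_summary.txt"])
```

After parsing, `args.command` selects the mode, and hyphenated flags become attributes with underscores, such as `args.input_format`.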
Features:
- upload dataset
- run enrichment job
- view enriched leads
- filter by lead score
- export results
The final project must include:
- working enrichment system
- API endpoints
- dashboard UI
- clean structured output
- sample dataset
- README with usage instructions
- deployed version (Fly.io: frontend + backend)
| company | email | phone | website | score | reason |
|---|---|---|---|---|---|
| XYZ Realty | info@xyz.com | +27… | xyz.com | 82 | No chatbot, outdated site |
- input system (CSV + schema handling)
- basic website enrichment
- Google Maps integration
- contact extraction + normalization
- verification logic
- lead scoring system
- API endpoints
- dashboard
- deployment strategy (Fly.io topology + configs)
- polish + documentation
- Build a single-page dashboard that supports:
- upload dataset (CSV/JSON/PropFlux)
- create and monitor enrichment jobs
- inspect enriched leads and rejected rows
- filter/sort by lead score and quality signals
- export results (CSV/JSON)
- Keep auth out of scope for MVP; run in trusted environment first.
- Framework: React + Vite.
- UI: custom modern dashboard styling and components.
- Data fetching: native fetch + polling from dashboard state.
- Tabs:
- Control Panel (upload/create/stop/resume)
- Analytics
- Job History
- Data Explorer
- Engine Settings
- Use existing endpoints first:
- POST /jobs (multipart upload)
- GET /jobs (paginated listing)
- GET /jobs/{id} (status + counts + batch progress + error)
- GET /jobs/{id}/results (partial rows while processing, full set on completion)
- Add endpoint enhancements only if needed by UI:
- export route (/jobs/{id}/export?format=csv|json)
- POST /jobs/{id}/terminate, POST /jobs/{id}/resume, GET /jobs/{id}/batches
- Upload + create job
- choose format + file, submit, optimistic row in jobs list.
- Job monitoring
- poll every 2-3s while status is processing, then stop.
- show clear states: uploaded, processing, completed, failed.
- Result exploration
- table columns: company, website, email, phone, location, contact_quality, lead_score, lead_reason.
- filters: min score, contact_quality, has_chatbot, freshness signal.
- sorting: lead_score desc by default.
- Client-side model mirrors API payload:
- JobListItem: job_id, status, timestamps, counts, error
- LeadRow: canonical lead fields + enrichment/scoring fields
- Keep types centralized under frontend/src/types/api.ts to reduce drift.
- Empty/error states for all pages (no blank screens).
- Progress indicators for uploads and processing states.
- Failure handling with retry actions (re-poll, download input, re-run later).
- URL state for filters/pagination so views are shareable.
- Deploy on Fly.io as two services:
- Frontend service exposed publicly
- Backend API exposed publicly
- Configure frontend build with VITE_API_BASE_URL set to the backend Fly URL.
- Configure strict CORS for frontend origin only.
- Persist SQLite volume on backend Fly app.
- Frontend:
- component tests for upload form, status badge, lead table filters
- integration tests for job lifecycle (mock API)
- E2E:
- upload sample file -> job completes -> results visible -> export works
- Acceptance checklist:
- a non-technical user can run one full job and download results without CLI.
- Day 1-2: scaffold frontend, API client, jobs list page.
- Day 3: upload/create job flow + polling.
- Day 4: results table + filters + rejected rows.
- Day 5: export wiring, UX polish, docs, deployment.
- email validation APIs
- LinkedIn enrichment
- CRM integrations (HubSpot, Pipedrive)
- scheduling / recurring jobs
- multi-region scaling
This project will be referenced in job proposals as:
“A multi-source lead intelligence platform that enriches, verifies, and scores real estate leads using automated data pipelines and APIs.”
This demonstrates:
- real-world data enrichment capability
- backend/API system design
- automation workflows
- business-focused engineering
The MVP is complete when:
- input data is processed correctly
- leads are enriched from multiple sources
- contact info is extracted + validated
- lead scoring works
- API + dashboard function correctly
- system runs end-to-end without manual fixes
- Focus on business value, not just data
- Keep architecture simple but scalable
- Build reusable enrichment components
- Design for extension (new sources later)
A production-grade lead intelligence system you can confidently show to clients to win jobs in:
- lead generation
- data enrichment
- web scraping + automation
- backend/API development
- real estate data pipelines