This project is a web scraper built with Python and Selenium that extracts detailed doctor reviews and ratings from a medical website for research purposes. The scraper collects doctor names, specialties, ratings, and user comments, and saves the results to a CSV file.
- Dynamic Web Scraping: Extracts multiple pages of data using Selenium.
- Multithreading: Accelerates scraping by processing multiple pages concurrently with ThreadPoolExecutor.
- Comprehensive Data: Collects doctor details, reviews, tags, and ratings.
- Custom Handling: Handles ads, cookies, and dynamically loaded content.
- Install Python (3.7 or higher).
- Install Google Chrome and download the appropriate version of ChromeDriver.
- Install required Python libraries:
pip install selenium tqdm
project/
├── chromedriver   # ChromeDriver executable
├── scraper.py     # Main Python script
└── doctor.csv     # Output CSV file
Update the PATH in the get_driver() function with the location of your ChromeDriver:
PATH = "/path/to/your/chromedriver"
Execute the script in the terminal:
python scraper.py
The extracted data will be saved in doctor.csv with the following columns:
- d_name: Doctor's name
- d_speciality: Doctor's specialty
- total_score: Overall rating score
- total_survey_count: Total number of surveys
- five_star, four_star, three_star, two_star, one_star: Number and percentage for each rating category
- positive_tags, negative_tags: Lists of positive and negative tags
- comment_text: User comments
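The column layout above can be sketched with the standard library's csv.DictWriter. This is a minimal illustration, not the script's actual writer: the function name write_rows, the COLUMNS constant, and the choice to join tag lists with "; " are all assumptions made for the example.

```python
import csv

# Column order taken from the list above.
COLUMNS = [
    "d_name", "d_speciality", "total_score", "total_survey_count",
    "five_star", "four_star", "three_star", "two_star", "one_star",
    "positive_tags", "negative_tags", "comment_text",
]


def write_rows(rows, fileobj):
    """Write an iterable of dicts to fileobj as CSV (hypothetical helper)."""
    # restval="" leaves a column empty when a profile lacks that field.
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        row = dict(row)  # copy so the caller's dict is not modified
        for key in ("positive_tags", "negative_tags"):
            # Assumption: tag lists are flattened into one "; "-separated cell.
            if isinstance(row.get(key), (list, tuple)):
                row[key] = "; ".join(row[key])
        writer.writerow(row)
```

When writing to a real file, open it with newline="" (for example open("doctor.csv", "w", newline="", encoding="utf-8")) so the csv module controls line endings itself.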
- iselement(browser, cssselector): Checks if an element exists on the webpage.
- get_driver(): Sets up and returns a Selenium WebDriver instance.
- get_doc_linklist(url): Scrapes doctor profile links from multiple pages.
- get_doctor_details(link): Extracts detailed information from each doctor's profile page.
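As an illustration of the existence check that iselement performs, here is one possible sketch. It may differ from the version in scraper.py: it relies on WebDriver's find_elements, which returns an empty list when nothing matches, so no exception handling is needed.

```python
def iselement(browser, cssselector):
    """Return True if at least one element matches the CSS selector."""
    # find_elements returns an empty list instead of raising when nothing
    # matches. "css selector" is the string value of Selenium's
    # By.CSS_SELECTOR, used here so the sketch needs no imports.
    return len(browser.find_elements("css selector", cssselector)) > 0
```

A check like this is useful before reading optional parts of a profile page, such as a tag section that not every doctor has.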
Uses ThreadPoolExecutor to scrape multiple doctor profiles simultaneously:
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(get_doctor_details, link_list)
Automatically closes popups that obstruct scraping.
Handles cookie popups that block interaction.
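Ad popups and cookie banners can often be dismissed with the same pattern: look for a close or accept button and click it only if it is present. The sketch below is an assumption about how that might look, not the script's actual code. The helper name dismiss_popups and the selectors in the usage note are hypothetical.

```python
def dismiss_popups(browser, selectors):
    """Click the first match for each selector, if any.

    Returns the number of popups that were successfully clicked.
    """
    clicked = 0
    for selector in selectors:
        # An empty list means the popup is not on the page right now.
        matches = browser.find_elements("css selector", selector)
        if not matches:
            continue
        try:
            matches[0].click()
            clicked += 1
        except Exception:
            # Kept broad so the sketch needs no imports. Real code would
            # more likely catch Selenium's specific exceptions, such as
            # ElementClickInterceptedException.
            pass
    return clicked
```

A call might look like dismiss_popups(browser, ["#cookie-accept", ".ad-close"]), where both selectors are placeholders that would need to be replaced with the ones used by the target site.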
Navigates through multiple pages until the specified limit (99 pages).
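One caveat with the ThreadPoolExecutor usage shown earlier: executor.map only raises a worker's exception when its result is read, so discarding the return value hides failed profiles. A possible variant, sketched here with the hypothetical helper scrape_all, collects results and records which links failed. The fetch argument stands in for get_doctor_details.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(links, fetch, max_workers=10):
    """Run fetch(link) concurrently; return (results, failures)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its link so failures can be reported.
        futures = {executor.submit(fetch, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Record the failure and keep going with the other links.
                failures.append((link, exc))
    return results, failures
```

Results arrive in completion order rather than input order, which is usually fine when each row is written to the CSV independently. The failures list can be logged or retried after the main run.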