OhMyScrapper scrapes texts and URLs, looking for links and job data, to create a final report with general information about job positions.
- Read texts;
- Extract and load URLs;
- Scrape the URLs looking for og:tags and titles;
- Export a list of links with relevant information.
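To illustrate what "looking for og:tags and titles" can mean in practice, here is a minimal sketch using only Python's standard library. It is not OhMyScrapper's own code: the class `OgTagParser`, the function `extract_page_info`, and the sample HTML are hypothetical names and data chosen for this example.

```python
from html.parser import HTMLParser


class OgTagParser(HTMLParser):
    """Collect the <title> text and every <meta property="og:..."> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.og_tags = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute values can be None, so fall back to an empty string.
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.og_tags[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_info(html):
    """Return the page title and a dict of og: properties (hypothetical helper)."""
    parser = OgTagParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "og": parser.og_tags}


# Hypothetical sample page, used only to exercise the parser.
sample = """
<html><head>
<title>Backend Engineer at Example</title>
<meta property="og:title" content="Backend Engineer">
<meta property="og:description" content="Remote position">
</head><body></body></html>
"""
info = extract_page_info(sample)
print(info["title"])
print(info["og"])
```

The sketch parses a string that is already in memory, so fetching the page is left out on purpose.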
You can install it directly with pip:
pip install ohmyscrapper
I recommend using uv, so you can just run the commands below and everything is installed:
uv add ohmyscrapper
uv run ohmyscrapper --version
But you can also run everything as a tool, for example:
uvx ohmyscrapper --version
OhMyScrapper works in 3 stages:
- It collects and loads URLs from a text into a database;
- It scrapes the collected URLs and reads what is relevant. If it finds new URLs, they are collected as well;
- It exports a list of URLs to CSV files.
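The first stage can be pictured with a small sketch that finds URLs in a text and stores them in a database. Everything here is an assumption made for illustration: the regular expression, the SQLite backend, the `urls` table and its columns, and the helper names `collect_urls` and `load_urls` are hypothetical and may differ from what OhMyScrapper does internally.

```python
import re
import sqlite3

# Assumption: a simple pattern that stops at whitespace and common closing characters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def collect_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    found = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop sentence punctuation stuck to the URL
        if url not in found:
            found.append(url)
    return found


def load_urls(conn, urls):
    """Store URLs in a hypothetical `urls` table, skipping ones already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, scraped INTEGER DEFAULT 0)"
    )
    conn.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


# Hypothetical input text, standing in for the contents of a .txt file.
text = "New role: https://lnkd.in/abc123. Details at https://www.linkedin.com/jobs/view/42, apply soon!"
conn = sqlite3.connect(":memory:")
load_urls(conn, collect_urls(text))
load_urls(conn, ["https://lnkd.in/abc123"])  # loading the same URL again adds nothing
stored = [row[0] for row in conn.execute("SELECT url FROM urls ORDER BY rowid")]
print(stored)
```

Using the URL as the primary key is one simple way to keep repeated loads from creating duplicate rows.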
You can run all 3 stages with the command:
ohmyscrapper start
Remember to add your text file to the folder /input, with a name that ends in .txt!
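If you prefer to drive this from a script, the sketch below writes a text file into an `input` folder and then calls the `ohmyscrapper start` command. It assumes the input folder is resolved relative to the working directory, which the text above does not state explicitly; the helper names `prepare_input` and `run_start` and the default file name `jobs.txt` are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def prepare_input(text, workdir, name="jobs.txt"):
    """Write text to <workdir>/input/<name>; the name must end with .txt."""
    if not name.endswith(".txt"):
        raise ValueError("input file name must end with .txt")
    input_dir = Path(workdir) / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    path = input_dir / name
    path.write_text(text, encoding="utf-8")
    return path


def run_start(workdir):
    """Run `ohmyscrapper start` inside workdir, or return None if it is not installed."""
    if shutil.which("ohmyscrapper") is None:
        return None
    return subprocess.run(["ohmyscrapper", "start"], cwd=workdir, check=False)


# Demonstration in a throwaway directory; run_start is intentionally not called here.
with tempfile.TemporaryDirectory() as workdir:
    created = prepare_input("Hiring: https://lnkd.in/abc123", workdir)
    print(created.name, created.parent.name)
```

Calling `run_start(".")` from your project folder would then start the three stages, provided the command is on your PATH.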
You will find the exported files in the folder /output like this:
/output/report.csv
/output/report.csv-preview.html
/output/urls-simplified.csv
/output/urls-simplified.csv-preview.html
/output/urls.csv
/output/urls.csv-preview.html
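The exported CSV files can be inspected with Python's standard `csv` module. This is a minimal sketch that assumes comma-separated, UTF-8 files with a header row; the column names are not documented here, so the code reads whatever header it finds. The helper name `read_report` is hypothetical.

```python
import csv
from pathlib import Path


def read_report(path):
    """Return (column names, rows as dicts) from an exported CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


# Only read the export if it already exists in the working directory.
report_path = Path("output/urls.csv")
if report_path.exists():
    columns, rows = read_report(report_path)
    print(columns)
    print(len(rows), "rows")
```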
First, load a text file in which you would like to look for URLs. It works with any .txt file.
The default folder is /input. Put one or more text files (ending in .txt)
in this folder and use the command load:
ohmyscrapper load
Or, if you have a file in a different folder, just use the argument -input like this:
ohmyscrapper load -input=my-text-file.txt
You can also add a URL directly to the database, like this:
ohmyscrapper load -input=https://cesarcardoso.cc/
That will append the URL to the database to be scraped.
The load command creates a database if it doesn't exist and stores every URL that OhMyScrapper
finds. After that, let's scrape the URLs with the command scrap-urls:
ohmyscrapper scrap-urls --recursive --ignore-type
That will scrape only the LinkedIn URLs we are interested in. For now they are:
- linkedin_post: https://%.linkedin.com/posts/%
- linkedin_redirect: https://lnkd.in/%
- linkedin_job: https://%.linkedin.com/jobs/view/%
- linkedin_feed: https://%.linkedin.com/feed/%
- linkedin_company: https://%.linkedin.com/company/%
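The patterns above look like SQL LIKE patterns, where `%` stands for any run of characters. Under that assumption, the following sketch shows how a URL could be matched to one of the listed types. It is an illustration only: `like_to_regex` and `classify_url` are hypothetical helpers, and the way OhMyScrapper actually matches URLs may differ.

```python
import re

# The URL types listed above; % is assumed to mean "any text", as in SQL LIKE.
URL_TYPES = {
    "linkedin_post": "https://%.linkedin.com/posts/%",
    "linkedin_redirect": "https://lnkd.in/%",
    "linkedin_job": "https://%.linkedin.com/jobs/view/%",
    "linkedin_feed": "https://%.linkedin.com/feed/%",
    "linkedin_company": "https://%.linkedin.com/company/%",
}


def like_to_regex(pattern):
    """Translate a LIKE-style pattern into a compiled regex (only % is handled)."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts), re.DOTALL)


COMPILED = {name: like_to_regex(pattern) for name, pattern in URL_TYPES.items()}


def classify_url(url):
    """Return the first matching type name, or None for a URL of no listed type."""
    for name, regex in COMPILED.items():
        if regex.fullmatch(url):
            return name
    return None


for candidate in (
    "https://www.linkedin.com/jobs/view/42",
    "https://lnkd.in/abc123",
    "https://example.com/careers",
):
    print(candidate, "->", classify_url(candidate))
```

A URL that returns `None` here corresponds to the kind of link that is only scraped when `--ignore-type` is used.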
But we can scrape any other URL generically using the argument --ignore-type:
ohmyscrapper scrap-urls --ignore-type
And we can make it run recursively by adding the argument --recursive:
ohmyscrapper scrap-urls --recursive
!!! important: we are not sure what blocks sites may impose for an excess of requests
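To make the idea of recursive scraping concrete, here is a sketch of a breadth-first traversal that queues newly found links and pauses between requests. It does not describe OhMyScrapper's actual traversal order, delays, or limits, which are not documented here. The function `scrape_recursively`, its parameters `delay_seconds` and `max_urls`, and the in-memory link graph are all hypothetical.

```python
import time
from collections import deque


def scrape_recursively(start_urls, fetch_links, delay_seconds=1.0, max_urls=50):
    """Visit URLs breadth-first, queueing each newly found link exactly once."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_urls:
        url = queue.popleft()
        if visited:
            time.sleep(delay_seconds)  # pause between requests to reduce load
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A fake link graph stands in for real pages, so no network access is needed.
fake_pages = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://c.example/", "https://d.example/"],
    "https://c.example/": [],
    "https://d.example/": ["https://a.example/"],
}
order = scrape_recursively(
    ["https://a.example/"],
    lambda url: fake_pages.get(url, []),
    delay_seconds=0,
)
print(order)
```

The cap on visited URLs and the pause between requests are two simple guards against sending an excess of requests.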
And we can finally export with the command:
ohmyscrapper export
ohmyscrapper export --file=output/urls-simplified.csv --simplify
ohmyscrapper report
To monitor recent scraping jobs locally, start the dashboard:
ohmyscrapper dashboard
Then open http://127.0.0.1:8765. Use --host and --port to bind a different local address.
That's the basic usage! But you can learn more using the help:
ohmyscrapper --help
This package is distributed under the MIT license.