Skip to content

bouli/ohmyscrapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

190 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🐶 OhMyScrapper - v0.10.2

OhMyScrapper scrapes texts and urls looking for links and jobs-data to create a final report with general information about job positions.

Scope

  • Read texts;
  • Extract and load urls;
  • Scrapes the urls looking for og:tags and titles;
  • Export a list of links with relevant information;

Installation

You can install directly in your pip:

pip install ohmyscrapper

I recomend to use the uv, so you can just use the command bellow and everything is installed:

uv add ohmyscrapper
uv run ohmyscrapper --version

But you can use everything as a tool, for example:

uvx ohmyscrapper --version

How to use and test (development only)

OhMyScrapper works in 3 stages:

  1. It collects and loads urls from a text in a database;
  2. It scraps/access the collected urls and read what is relevant. If it finds new urls, they are collected as well;
  3. Export a list of urls in CSV files;

You can do 3 stages with the command:

ohmyscrapper start

Remember to add your text file in the folder /input with the name that finishes with .txt!

You will find the exported files in the folder /output like this:

  • /output/report.csv
  • /output/report.csv-preview.html
  • /output/urls-simplified.csv
  • /output/urls-simplified.csv-preview.html
  • /output/urls.csv
  • /output/urls.csv-preview.html

BUT: if you want to do step by step, here it is:

First we load a text file you would like to look for urls. It it works with any txt file.

The default folder is /input. Put one or more text (finished with .txt) files in this folder and use the command load:

ohmyscrapper load

or, if you have another file in a different folder, just use the argument -input like this:

ohmyscrapper load -input=my-text-file.txt

In this case, you can add an url directly to the database, like this:

ohmyscrapper load -input=https://cesarcardoso.cc/

That will append the last url in the database to be scraped.

That will create a database if it doesn't exist and store every url the oh-my-scrapper find. After that, let's scrap the urls with the command scrap-urls:

ohmyscrapper scrap-urls --recursive --ignore-type

That will scrap only the linkedin urls we are interested in. For now they are:

  • linkedin_post: https://%.linkedin.com/posts/%
  • linkedin_redirect: https://lnkd.in/%
  • linkedin_job: https://%.linkedin.com/jobs/view/%
  • linkedin_feed" https://%.linkedin.com/feed/%
  • linkedin_company: https://%.linkedin.com/company/%

But we can use every other one generically using the argument --ignore-type:

ohmyscrapper scrap-urls --ignore-type

And we can ask to make it recursively adding the argument --recursive:

ohmyscrapper scrap-urls --recursive

!!! important: we are not sure about blocks we can have for excess of requests

And we can finally export with the command:

ohmyscrapper export
ohmyscrapper export --file=output/urls-simplified.csv --simplify
ohmyscrapper report

To monitor recent scraping jobs locally, start the dashboard:

ohmyscrapper dashboard

Then open http://127.0.0.1:8765. Use --host and --port to bind a different local address.

That's the basic usage! But you can understand more using the help:

ohmyscrapper --help

See Also

License

This package is distributed under the MIT license.

About

Another scrapper :)

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors