lkarlslund/topdomains


Top 17M website host URLs

This repository contains the most popular 17M website host URLs according to the Google CRUX dataset.

I exported this dataset using the BigQuery interface and downloaded it. The dataset is also freely available, and free to use by anyone, directly from Google. It is here for your convenience.

The CSV file contains the top 17M domains and their popularity as an order-of-magnitude bucket starting at 1000. Rows are sorted by popularity ascending (most popular first), then by URL.

The format is "url,popularity" - example:

http(s)://domain.com,1000
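
As a quick sanity check on the format, the number of rows per popularity bucket can be counted with standard tools. This is a minimal sketch: the `count_buckets` function name and the sample rows are made up for illustration, and it assumes the file has no header row.

```shell
# Count how many rows fall into each popularity bucket.
# Assumes the "url,popularity" layout described above, with no header row.
count_buckets() {
  cut -d, -f2 "$1" | sort -n | uniq -c | awk '{ print $2 "," $1 }'
}

# Made-up sample rows, only to illustrate the format.
printf '%s\n' \
  'https://a.example,1000' \
  'https://b.example,1000' \
  'http://c.example,5000' > sample.csv

# Prints one "bucket,count" line per bucket.
count_buckets sample.csv
```

On the real data, the same function would be called with the decompressed CSV file as its argument.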

Working with the dataset

Due to GitHub limitations, the file is compressed with 7z, so start by decompressing it:

7z x crux-2023-12-top17m.7z

Then you can use ripgrep to get the information you need:

Strip the order of magnitude and just keep the URL

rg "^(.*),[0-9]+$" -r '$1' crux-2023-12-top17m.csv > top17m-urls.txt

Keep just the host name

rg "^http(s)?://(.*),[0-9]+$" -r '$2' crux-2023-12-top17m.csv > top17m-hosts.txt
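
If ripgrep is not installed, the same host extraction can be approximated with `sed`. This is a sketch under assumptions rather than the author's method: the `extract_hosts` function name and the sample rows are hypothetical, and any port present in a URL is kept as part of the host.

```shell
# Strip the scheme and the trailing ",popularity" field, leaving the host
# (plus a port, if the URL carries one).
extract_hosts() {
  sed -E 's#^https?://##; s#,[0-9]+$##' "$1"
}

# Made-up sample rows, only to illustrate the transformation.
printf '%s\n' \
  'https://www.example.com,1000' \
  'http://example.org,5000' > sample-hosts.csv

# De-duplicate, since the http and https variants of a site share a host.
extract_hosts sample-hosts.csv | sort -u
```

The `sort -u` step is optional. It helps when both the http and https URL of the same host appear in the list.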

Keep just the domain name

rg "^http(s)?://.*?(?<domain>[^.]*[.]?.{6}),[0-9]+$" -r '$domain' crux-2023-12-top17m.csv > top17m-domains.txt

Limit to fewer than 17M entries

You can extract the top 1000 urls/hosts/domains using the following command:

head -n 1000 crux-2023-12-top17m.csv > top-1000.csv

Useful cutoffs are 1000, 5000, 10000, 50000, 100000, 500000, 1000000, 5000000, 10000000 and 50000000 (the last bucket only covers the entries from 10M to 17M).
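
An alternative to `head` is to filter on the popularity column itself, which avoids assuming that each bucket holds exactly as many rows as its label suggests. A minimal sketch, where the `top_bucket` function name and the sample rows are made up for illustration:

```shell
# Print rows whose popularity bucket is at most the given cutoff.
# Unlike head, this does not depend on the sort order of the file.
top_bucket() {
  awk -F, -v max="$1" '$NF + 0 <= max + 0' "$2"
}

# Made-up sample rows, only to illustrate the filter.
printf '%s\n' \
  'https://a.example,1000' \
  'https://b.example,5000' \
  'https://c.example,10000' > sample-ranks.csv

# Keeps the 1000 and 5000 rows and drops the 10000 row.
top_bucket 5000 sample-ranks.csv
```

The first argument is the cutoff and the second is the CSV file, so the real data would be filtered by passing the decompressed file name instead of the sample.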

Mass download stuff

You can feed host names into turbograb if you want to quickly download the primary page from each site.

Happy hunting

Feedback is welcome -> Mastodon @lkarlslund / Twitter @lkarlslund / LinkedIn Lars Karlslund

