rurl

rurl is a lightweight, vectorized toolkit for URL parsing, normalization, extraction, and matching in R.

Current package capabilities include:

Robust parsing via safe_parse_url() and safe_parse_urls()
URL normalization with fine-grained controls for protocol, www, case, trailing slashes, index pages, path normalization, scheme-relative URLs, host encoding, and path encoding
URL component extractors (get_* helpers)
URL-based joins with canonical_join()
Built-in memoization caches with introspection and configuration (rurl_cache_info(), rurl_cache_config(), rurl_clear_caches())

Installation

# From CRAN
install.packages("rurl")

# Development version from GitHub
# install.packages("remotes")
remotes::install_github("bart-turczynski/rurl")

Function Overview

Parsing and normalization: safe_parse_url(), safe_parse_urls(), get_clean_url()
Accessors: get_scheme(), get_host(), get_subdomain(), get_domain(), get_tld(), get_path(), get_query(), get_fragment(), get_port(), get_user(), get_password(), get_userinfo(), get_parse_status()
Matching/joining: canonical_join() for deterministic canonical-key joins
Cache control: rurl_cache_info(), rurl_cache_config(), rurl_clear_caches()

Quick Start

safe_parse_url() is the core workhorse. It returns parsed components and a normalized clean_url.

library(rurl)

parsed <- safe_parse_url(
  "HTTP://www.Example.com/a//b/../index.html?x=1#frag",
  protocol_handling = "https",
  www_handling = "strip",
  case_handling = "lower_host",
  trailing_slash_handling = "strip",
  index_page_handling = "strip",
  path_normalization = "both",
  host_encoding = "idna",
  path_encoding = "encode"
)

parsed$clean_url
#> [1] "https://example.com/a"
parsed$parse_status
#> [1] "ok"

clean_url is a normalized canonical key built from scheme, host, and path only. Port, query, fragment, and userinfo are intentionally excluded; read them from the dedicated accessors (get_port(), get_query(), get_fragment(), get_userinfo()) instead. With path_encoding = "decode" the path is shown decoded, so clean_url is human-readable but not guaranteed to be URL-safe.
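
The split between the canonical key and the excluded components can be sketched as follows. The URL is purely illustrative, only functions named in this README are used, and outputs are omitted because they depend on the installed version and its defaults:

```r
library(rurl)

# Illustrative URL carrying every part that clean_url leaves out.
u <- "https://user:pass@www.example.com:8443/path?a=1#frag"

# Canonical key: built from scheme, host, and path only.
key <- get_clean_url(u)

# Read the excluded parts from their dedicated accessors.
parts <- list(
  port     = get_port(u),
  query    = get_query(u),
  fragment = get_fragment(u),
  userinfo = get_userinfo(u)
)

key
str(parts)
```

Keeping the key and the extra components apart like this lets the key serve for deduplication or joins while the remaining parts stay available for inspection.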

Scheme-relative URL handling is configurable:

safe_parse_url("//example.com/path", scheme_relative_handling = "keep")$parse_status
#> [1] "ok-scheme-relative"
safe_parse_url("//example.com/path", scheme_relative_handling = "https")$clean_url
#> [1] "https://example.com/path"

For vectors, use safe_parse_urls():

parsed <- safe_parse_urls(c("example.com", "https://www.example.com/path"))
parsed[, c("original_url", "clean_url", "parse_status")]
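
A possible follow-up step is to filter a parsed vector by parse_status before using clean_url. The sketch below assumes that every status beginning with "ok" (such as "ok" and "ok-scheme-relative" shown above) marks a usable parse; that rule is an assumption made for illustration, not a documented contract:

```r
library(rurl)

urls <- c("example.com", "https://www.example.com/path", "//example.com/path")
parsed <- safe_parse_urls(urls)

# Count how many inputs fall into each status.
table(parsed$parse_status, useNA = "ifany")

# Assumption: a status that starts with "ok" marks a usable parse.
# grepl() returns FALSE for NA, so rows with a missing status are dropped too.
usable <- parsed[grepl("^ok", parsed$parse_status),
                 c("original_url", "clean_url")]
usable
```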

Normalization Controls

safe_parse_url() and get_clean_url() support these controls:

protocol_handling: keep, none, strip, http, https
www_handling: none, strip, keep, if_no_subdomain
case_handling: lower_host (default), keep, lower, upper
trailing_slash_handling: none, keep, strip
index_page_handling: keep, strip
path_normalization: none, collapse_slashes, dot_segments, both
scheme_relative_handling: keep, http, https, error
host_encoding: keep, idna, unicode
path_encoding: keep, encode, decode
subdomain_levels_to_keep: NULL, 0, or N > 0

Subdomain retention is applied after www_handling:

get_host("http://www.three.two.one.example.com", www_handling = "strip", subdomain_levels_to_keep = 1)
#> [1] "one.example.com"
get_clean_url("http://www.deep.sub.example.com/path", subdomain_levels_to_keep = 0)
#> [1] "http://www.example.com/path"

Host and path encoding controls:

get_clean_url("http://münich.com/a%20b",
              host_encoding = "idna",
              path_encoding = "encode",
              case_handling = "lower_host")
#> [1] "http://xn--mnich-kva.com/a%20b"

get_clean_url("http://xn--mnich-kva.com/a%20b",
              host_encoding = "unicode",
              path_encoding = "decode",
              case_handling = "keep")
#> [1] "http://münich.com/a b"

Accessors

u <- "https://user:pass@www.blog.example.co.uk/path/to/page?a=1&b=2#frag"

get_scheme(u)
get_host(u)
get_subdomain(u)
get_domain(u)
get_tld(u)
get_path(u)
get_query(u)
get_query(u, format = "list")
get_fragment(u)
get_port(u)
get_user(u)
get_password(u)
get_userinfo(u)
get_parse_status(c(u, "mailto:test@example.com"))

URL Joins

canonical_join() matches on one canonicalized key per URL and is the preferred option for large datasets:

A <- data.frame(URL = c("http://Example.com/Page", "http://example.com/Other"),
                ValA = 1:2, stringsAsFactors = FALSE)
B <- data.frame(URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
                ValB = c("x", "y"), stringsAsFactors = FALSE)

canonical_join(
  A, B,
  protocol_handling = "strip",
  www_handling = "strip",
  case_handling = "lower_host",
  trailing_slash_handling = "strip"
)
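
The idea behind a canonical-key join can be illustrated in base R alone. This is a simplified sketch, not rurl's implementation: toy_key() is a hypothetical helper that strips the scheme, a leading www., and trailing slashes, then lowercases the host, roughly mirroring the options passed above. It ignores ports, queries, fragments, and userinfo.

```r
# Hypothetical helper: a toy canonical key (NOT rurl's normalization).
toy_key <- function(url) {
  key <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # strip the scheme
  key <- sub("^www\\.", "", key, ignore.case = TRUE)  # strip a leading www.
  key <- sub("/+$", "", key)                          # strip trailing slashes
  sub("^([^/]*)", "\\L\\1", key, perl = TRUE)         # lowercase the host only
}

A <- data.frame(URL = c("http://Example.com/Page", "http://example.com/Other"),
                ValA = 1:2, stringsAsFactors = FALSE)
B <- data.frame(URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
                ValB = c("x", "y"), stringsAsFactors = FALSE)

# Compute one key per row, then do an ordinary inner join on that key.
A$key <- toy_key(A$URL)
B$key <- toy_key(B$URL)
joined <- merge(A, B, by = "key")

nrow(joined)
#> [1] 1
```

Because each URL is reduced to a single key once, the join itself is a plain equality match, which is why this approach scales to large tables.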

Caching

rurl memoizes URL parsing and punycode round-trips to speed repeated operations over large URL vectors; PSL query caching lives in pslr. Inspect, clear, and configure the caches:

rurl_cache_info()                          # entries / enabled / max per cache
rurl_clear_caches()                        # free memory in a long-running session
rurl_cache_config(max_full_parse = 1e5)    # bound the full-parse cache
rurl_cache_config(puny_encode = FALSE)     # disable a cache entirely

rurl_cache_config() covers three caches: full_parse, puny_encode, and puny_decode. The full_parse cache is unbounded by default (max_full_parse = Inf); set a bound to cap its peak memory. The puny_encode and puny_decode caches are unbounded by design and can be disabled for workloads with very many unique hosts.
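
To make the trade-off between an unbounded and a bounded cache concrete, here is a minimal base R memoization sketch. It is not how rurl implements its caches: make_memoised() and its max_entries argument are hypothetical names, and the bound is enforced by simply not storing new entries once the cache is full (no eviction).

```r
# Hypothetical helper: wrap fn with an environment-backed cache.
make_memoised <- function(fn, max_entries = Inf) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    # Cache hit: return the stored value without calling fn again.
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache, inherits = FALSE))
    }
    value <- fn(key)
    # Store the result only while the cache is below its bound.
    if (length(ls(cache, all.names = TRUE)) < max_entries) {
      assign(key, value, envir = cache)
    }
    value
  }
}

# Count how often the underlying function really runs.
calls <- 0
slow_upper <- function(x) {
  calls <<- calls + 1
  toupper(x)
}

fast_upper <- make_memoised(slow_upper, max_entries = 2)
hosts <- c("a.com", "b.com", "a.com", "a.com")
result <- vapply(hosts, fast_upper, character(1), USE.NAMES = FALSE)

calls
#> [1] 2
```

With many repeated inputs the wrapped function runs once per unique value; with very many unique values the cache grows with the input, which is the case where a bound or disabling the cache becomes attractive.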

Public Suffix List

Domain and TLD extraction is delegated to the pslr package, which owns the Public Suffix List and its refresh cycle. rurl ships no embedded copy of the list. To update the PSL, call pslr::psl_refresh() (see the pslr documentation for details).
