Crates.io | tldextract-rs |
lib.rs | tldextract-rs |
version | 0.1.1 |
source | src |
created_at | 2023-12-14 11:51:11.512279 |
updated_at | 2023-12-19 13:16:35.434081 |
description | extract domain info from a url |
homepage | https://github.com/emo-crab/tldextract-rs |
repository | https://github.com/emo-crab/tldextract-rs |
max_upload_size | |
id | 1069362 |
size | 257,813 |
tldextract-rs is a high-performance effective top-level domain (eTLD) extraction library that splits a domain name into its subcomponents: subdomain, domain, and suffix.
tldextract-rs = { git = "https://github.com/emo-crab/tldextract-rs" }
use tldextract_rs::TLDExtract;

fn main() {
    let source = tldextract_rs::Source::Hardcode;
    let suffix = tldextract_rs::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    // Split the host into subdomain, domain, and suffix.
    let e = extract.extract(" mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Serialize the result as pretty-printed JSON.
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}
{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}
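To make the relationship between the four output fields concrete, here is a minimal sketch that derives them from a host once the suffix is already known. `split_host` is a hypothetical helper written for illustration only; it is not part of the tldextract-rs API and uses nothing beyond the standard library.

```rust
/// Hypothetical helper (not part of tldextract-rs): given a host and its
/// already-known suffix, returns (subdomain, domain, registered_domain).
fn split_host<'a>(host: &'a str, suffix: &str) -> Option<(&'a str, &'a str, &'a str)> {
    // Drop the suffix and the dot that precedes it.
    let rest = host.strip_suffix(suffix)?.strip_suffix('.')?;
    match rest.rsplit_once('.') {
        // The last remaining label is the domain; everything before it is
        // the subdomain. The registered domain is domain + "." + suffix.
        Some((subdomain, domain)) => Some((subdomain, domain, &host[subdomain.len() + 1..])),
        // No subdomain present: the whole host is the registered domain.
        None => Some(("", rest, host)),
    }
}

fn main() {
    let parts = split_host("mirrors.tuna.tsinghua.edu.cn", "edu.cn");
    println!("{:?}", parts);
}
```

The hard part, which this sketch deliberately skips, is finding the suffix in the first place.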
Splitting on "." and taking the last element only works for simple eTLDs like com, but not for more complex ones like oseto.nagasaki.jp.
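The limitation is easy to reproduce. The sketch below implements the naive approach with the standard library only; `naive_suffix` is an illustrative name, not something provided by tldextract-rs.

```rust
/// Naive approach: treat the last dot-separated label as the suffix.
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Fine when the eTLD is a single label such as `com`.
    println!("{}", naive_suffix("example.com"));
    // Falls short here: the eTLD is `oseto.nagasaki.jp`, but only the
    // last label is returned.
    println!("{}", naive_suffix("example.oseto.nagasaki.jp"));
}
```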
tldextract-rs stores eTLDs in compressed tries.
Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label, in reverse order (rightmost label first).
Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac
and the example URL host `example.nsw.edu.au`, the compressed trie is structured as follows:
START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩
=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`
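The reverse-order insertion can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the crate's implementation: `Node`, `insert`, `render`, and `build_example` are hypothetical names, and the trie is a plain one-label-per-node trie rather than a compressed one.

```rust
use std::collections::BTreeMap;

/// One trie node per domain label (illustrative type, not from tldextract-rs).
#[derive(Default)]
struct Node {
    /// True when the path from the root to this node spells a valid eTLD (🚩).
    is_suffix: bool,
    /// Child labels; a BTreeMap keeps the printed order deterministic.
    children: BTreeMap<String, Node>,
}

impl Node {
    /// Inserts an eTLD label by label, rightmost label first.
    fn insert(&mut self, etld: &str) {
        let mut node = self;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }

    /// Renders the subtree as indented text, flagging eTLD nodes.
    fn render(&self, depth: usize, out: &mut String) {
        for (label, child) in &self.children {
            out.push_str(&"  ".repeat(depth));
            out.push_str(label);
            if child.is_suffix {
                out.push_str(" 🚩");
            }
            out.push('\n');
            child.render(depth + 1, out);
        }
    }
}

/// Builds the trie for the five example eTLDs listed above.
fn build_example() -> Node {
    let mut root = Node::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        root.insert(etld);
    }
    root
}

fn main() {
    let mut out = String::new();
    build_example().render(0, &mut out);
    print!("{out}");
}
```

Note that `edu` under `au` and `ac` itself carry no flag: they exist only as intermediate nodes on the way to a longer eTLD.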
The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.
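The right-to-left walk can be sketched as follows. This is a simplified illustration, not the crate's code: `Node`, `insert`, `build_example`, and `longest_suffix` are hypothetical names, the trie is uncompressed, and the wildcard and exception rules of the real Public Suffix List are ignored. Instead of collecting and reversing node labels, the sketch tracks a byte offset into the host and returns a slice, which yields the same string.

```rust
use std::collections::HashMap;

/// One trie node per domain label (illustrative, not from tldextract-rs).
#[derive(Default)]
struct Node {
    is_suffix: bool,
    children: HashMap<String, Node>,
}

/// Inserts an eTLD label by label, rightmost label first.
fn insert(root: &mut Node, etld: &str) {
    let mut node = root;
    for label in etld.rsplit('.') {
        node = node.children.entry(label.to_string()).or_default();
    }
    node.is_suffix = true;
}

/// Builds the trie for the five example eTLDs.
fn build_example() -> Node {
    let mut root = Node::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        insert(&mut root, etld);
    }
    root
}

/// Walks the host's labels right to left and returns the longest
/// flagged path, or None when no eTLD matches.
fn longest_suffix<'a>(root: &Node, host: &'a str) -> Option<&'a str> {
    let mut node = root;
    let mut best = None;
    // Byte offset where the matched labels start; the extra 1 cancels the
    // dot that the rightmost label does not have after it.
    let mut start = host.len() + 1;
    for label in host.rsplit('.') {
        let Some(child) = node.children.get(label) else {
            break; // no more matching nodes
        };
        start -= label.len() + 1;
        node = child;
        if node.is_suffix {
            // Only flagged nodes count as valid eTLDs.
            best = Some(&host[start..]);
        }
    }
    best
}

fn main() {
    let root = build_example();
    println!("{:?}", longest_suffix(&root, "example.nsw.edu.au"));
    println!("{:?}", longest_suffix(&root, "example.edu.au"));
}
```

The second call shows why the flag matters: `edu` under `au` matches as a node but is not itself an eTLD, so the result falls back to the last flagged node, `au`.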