Crates.io | tld_extract |
lib.rs | tld_extract |
version | 0.1.0 |
source | src |
created_at | 2023-12-14 11:39:53.098168 |
updated_at | 2023-12-14 11:39:53.098168 |
description | extract domain info from a url |
repository | https://github.com/emo-crab/tldextract-rs |
id | 1069354 |
size | 257,816 |
tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that splits a domain name into its subcomponents: subdomain, domain, and suffix.
tld_extract = { git = "https://github.com/emo-cat/tldextract-rs" }
use tld_extract::TLDExtract;

fn main() {
    // Build the suffix list, then the extractor that uses it.
    let source = tld_extract::Source::Hardcode;
    let suffix = tld_extract::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    // Split the host into subdomain, domain, and suffix.
    let e = extract.extract(" mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Print the result as pretty-printed JSON.
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}
{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}
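The relationship between the four fields above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `Parts` and `split_with_suffix` are hypothetical names, and the sketch assumes the suffix (`edu.cn`) has already been identified.

```rust
/// Hypothetical result type mirroring the JSON fields shown above.
#[derive(Debug, PartialEq)]
struct Parts {
    subdomain: String,
    domain: String,
    suffix: String,
    registered_domain: String,
}

/// Split `host` into its parts, assuming `suffix` is already known to be its eTLD.
/// Returns `None` if `host` does not end in `.{suffix}` or has no label before it.
fn split_with_suffix(host: &str, suffix: &str) -> Option<Parts> {
    // Strip the suffix and the dot in front of it, leaving e.g. "mirrors.tuna.tsinghua".
    let rest = host.strip_suffix(suffix)?.strip_suffix('.')?;
    // The label directly left of the suffix is the domain;
    // everything before that label is the subdomain.
    let (subdomain, domain) = match rest.rsplit_once('.') {
        Some((sub, dom)) => (sub, dom),
        None => ("", rest),
    };
    if domain.is_empty() {
        return None;
    }
    Some(Parts {
        subdomain: subdomain.to_string(),
        domain: domain.to_string(),
        suffix: suffix.to_string(),
        registered_domain: format!("{domain}.{suffix}"),
    })
}

fn main() {
    let parts = split_with_suffix("mirrors.tuna.tsinghua.edu.cn", "edu.cn").unwrap();
    println!("{parts:?}");
}
```

The registered domain is simply the domain label joined to the suffix, which is why finding the correct suffix is the hard part of the problem.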
Splitting on "." and taking the last element only works for simple eTLDs like `com`, but not for more complex ones like `oseto.nagasaki.jp`.
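A short sketch of that naive approach shows where it breaks. `naive_suffix` is a hypothetical helper written for this illustration, and `example.oseto.nagasaki.jp` is a made-up host.

```rust
/// Naive approach: treat the last dot-separated label as the suffix.
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Correct for a single-label eTLD.
    assert_eq!(naive_suffix("example.com"), "com");
    // Wrong for a multi-label eTLD: the actual suffix is "oseto.nagasaki.jp",
    // but the naive split only sees "jp".
    assert_eq!(naive_suffix("example.oseto.nagasaki.jp"), "jp");
}
```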
tldextract-rs stores eTLDs in compressed tries.
Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label in reverse order (rightmost label first).
Given the following eTLDs:
au
nsw.edu.au
com.ac
edu.ac
gov.ac
and the example URL host `example.nsw.edu.au`, the compressed trie is structured as follows:
START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩
=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`
The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD `nsw.edu.au`.
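The insertion and lookup steps described above can be sketched with a small label-keyed trie. This is an illustration under stated assumptions, not the crate's code: it uses a plain (uncompressed) `HashMap`-based trie, and `Node`, `SuffixTrie`, `insert`, and `find_etld` are hypothetical names. It also assumes the lookup should return the deepest flagged node on the matched path, so that a host such as `example.edu.au` resolves to `au`, since `edu.au` is not flagged in the example list.

```rust
use std::collections::HashMap;

/// One trie node per domain label (uncompressed, for illustration only).
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    /// True when the path from the root to this node is a valid eTLD
    /// (the flag marker in the diagram).
    is_etld: bool,
}

#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Insert an eTLD in reverse label order, e.g. "nsw.edu.au" as au -> edu -> nsw.
    fn insert(&mut self, etld: &str) {
        let mut node = &mut self.root;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_etld = true;
    }

    /// Walk the host's labels right to left and return the longest matching eTLD, if any.
    fn find_etld(&self, host: &str) -> Option<String> {
        let mut node = &self.root;
        let mut path: Vec<&str> = Vec::new();
        let mut longest = 0; // label count of the longest flagged path seen so far
        for label in host.rsplit('.') {
            match node.children.get(label) {
                Some(child) => {
                    path.push(label);
                    if child.is_etld {
                        longest = path.len();
                    }
                    node = child;
                }
                None => break, // no more matching nodes
            }
        }
        if longest == 0 {
            return None;
        }
        // Labels were collected right to left, so reverse them to rebuild the suffix.
        let labels: Vec<&str> = path[..longest].iter().rev().copied().collect();
        Some(labels.join("."))
    }
}

fn main() {
    let mut trie = SuffixTrie::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        trie.insert(etld);
    }
    // Matches au -> edu -> nsw, then reverses the labels.
    println!("{:?}", trie.find_etld("example.nsw.edu.au")); // prints Some("nsw.edu.au")
}
```

Keying each node by a whole label keeps the walk proportional to the number of labels in the host. A compressed trie would additionally merge chains of single-child nodes to save memory.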