## Summary **tldextract-rs** is a high performance [effective top level domains (eTLD)](https://wiki.mozilla.org/Public_Suffix_List) extraction module that extracts subcomponents from Domain. ### Hostname - Cargo.toml: ```toml tld_extract = { git = "https://github.com/emo-cat/tldextract-rs" } ``` - example code ```rust use tld_extract::TLDExtract; fn main() { let source = tld_extract::Source::Hardcode; let suffix = tld_extract::SuffixList::new(source, false, None); let mut extract = TLDExtract::new(suffix, true).unwrap(); let e = extract.extract(" mirrors.tuna.tsinghua.edu.cn").unwrap(); let s = serde_json::to_string_pretty(&e).unwrap(); println!("{:}", s); } ``` - ExtractResult ```json { "subdomain": "mirrors.tuna", "domain": "tsinghua", "suffix": "edu.cn", "registered_domain": "tsinghua.edu.cn" } ``` ## Implementation details ### Why not split on "." and take the last element instead? Splitting on "." and taking the last element only works for simple eTLDs like `com`, but not more complex ones like `oseto.nagasaki.jp`. ### eTLD tries **tldextract-rs** stores eTLDs in [compressed tries](https://en.wikipedia.org/wiki/Trie). Valid eTLDs from the [Mozilla Public Suffix List](http://www.publicsuffix.org) are appended to the compressed trie in reverse-order. ```sh Given the following eTLDs au nsw.edu.au com.ac edu.ac gov.ac and the example URL host `example.nsw.edu.au` The compressed trie will be structured as follows: START ╠═ au 🚩 ✅ ║ ╚═ edu ✅ ║ ╚═ nsw 🚩 ✅ ╚═ ac ╠═ com 🚩 ╠═ edu 🚩 ╚═ gov 🚩 === Symbol meanings === 🚩 : path to this node is a valid eTLD ✅ : path to this node found in example URL host `example.nsw.edu.au` ``` The URL host subcomponents are parsed from right-to-left until no more matching nodes can be found. In this example, the path of matching nodes are `au -> edu -> nsw`. Reversing the nodes gives the extracted eTLD `nsw.edu.au`. ## Acknowledgements - [go-fasttld (Go)](https://github.com/elliotwutingfeng/go-fasttld) - [tldextract (Python)](https://github.com/john-kurkowski/tldextract)