| field | value |
|---|---|
| Crates.io | robots_txt |
| lib.rs | robots_txt |
| version | 0.7.0 |
| source | src |
| created_at | 2017-04-05 21:47:07.39935 |
| updated_at | 2020-03-31 09:53:33.050803 |
| description | A lightweight parser and generator for robots.txt. |
| homepage | https://github.com/alexander-irbis/robots_txt |
| repository | https://github.com/alexander-irbis/robots_txt |
| max_upload_size | |
| id | 9744 |
| size | 41,371 |
robots_txt is a lightweight robots.txt parser and generator written in Rust. Nothing extra.

The implementation is a work in progress.

robots_txt is available on crates.io and can be included in your Cargo-enabled project like this:
Cargo.toml:

```toml
[dependencies]
robots_txt = "0.7"
```
```rust
use robots_txt::matcher::SimpleMatcher;
use robots_txt::Robots;

static ROBOTS: &'static str = r#"
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
"#;

fn main() {
    let robots = Robots::from_str(ROBOTS);

    // No section matches "NoName Bot", so the `*` rules apply
    let matcher = SimpleMatcher::new(&robots.choose_section("NoName Bot").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(!matcher.check_path("/cyberworld/map/object.html"));

    // The cybermapper section has an empty Disallow, so everything is allowed
    let matcher = SimpleMatcher::new(&robots.choose_section("Mozilla/5.0; CyberMapper v. 3.14").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(matcher.check_path("/cyberworld/map/object.html"));
}
```
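The checks above follow the classic robots.txt rule: a path is blocked when it starts with a non-empty `Disallow` prefix, and an empty `Disallow` permits everything. A minimal std-only sketch of that semantics (an illustration only, not the crate's actual matcher):

```rust
// Illustration of robots.txt Disallow semantics, independent of robots_txt.
// A path is allowed unless it starts with some non-empty Disallow prefix;
// an empty `Disallow:` line blocks nothing.
fn allowed(disallow_rules: &[&str], path: &str) -> bool {
    !disallow_rules
        .iter()
        .any(|rule| !rule.is_empty() && path.starts_with(rule))
}

fn main() {
    // Rules for `User-Agent: *`
    let star_rules = ["/cyberworld/map/"];
    assert!(allowed(&star_rules, "/some/page"));
    assert!(allowed(&star_rules, "/cyberworld/welcome.html"));
    assert!(!allowed(&star_rules, "/cyberworld/map/object.html"));

    // Rules for `User-Agent: cybermapper` (`Disallow:` with no value)
    let cybermapper_rules = [""];
    assert!(allowed(&cybermapper_rules, "/cyberworld/map/object.html"));
}
```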
main.rs:

```rust
extern crate robots_txt;
extern crate url;

use robots_txt::Robots;
use url::Url; // the `url` crate is also needed in Cargo.toml

fn main() {
    // Allow cybermapper everything, everyone else everything but /cyberworld/map/
    let robots1 = Robots::builder()
        .start_section("cybermapper")
        .disallow("")
        .end_section()
        .start_section("*")
        .disallow("/cyberworld/map/")
        .end_section()
        .build();

    let conf_base_url: Url = "https://example.com/".parse().expect("parse domain");
    let robots2 = Robots::builder()
        .host(conf_base_url.domain().expect("domain"))
        .start_section("*")
        .disallow("/private")
        .disallow("")
        .crawl_delay(4.5)
        .request_rate(9, 20)
        .sitemap("http://example.com/sitemap.xml".parse().unwrap())
        .end_section()
        .build();

    println!("# robots.txt for http://cyber.example.com/\n\n{}", robots1);
    println!("# robots.txt for http://example.com/\n\n{}", robots2);
}
```
As a result, we get:

```
# robots.txt for http://cyber.example.com/

User-agent: cybermapper
Disallow:

User-agent: *
Disallow: /cyberworld/map/

# robots.txt for http://example.com/

User-agent: *
Disallow: /private
Disallow:
Crawl-delay: 4.5
Request-rate: 9/20
Sitemap: http://example.com/sitemap.xml
Host: example.com
```
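The `Request-rate: 9/20` line above declares at most 9 requests per 20 seconds, so a polite crawler would space requests roughly 2.2 s apart. A small sketch of turning that pair into a per-request delay (an illustration, not part of the crate's API):

```rust
use std::time::Duration;

// Convert a Request-rate pair (requests, seconds) into the minimum
// spacing between consecutive requests to stay under the declared rate.
fn min_delay(requests: u32, seconds: u32) -> Duration {
    Duration::from_secs_f64(seconds as f64 / requests as f64)
}

fn main() {
    // Request-rate: 9/20  ->  one request every 20/9 ≈ 2.22 seconds
    let d = min_delay(9, 20);
    assert!((d.as_secs_f64() - 20.0 / 9.0).abs() < 1e-9);
    println!("{:.2}s between requests", d.as_secs_f64());
}
```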
Alternatives: messense/robotparser-rs, a robots.txt parser for Rust.
Licensed under either of:

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.