| field | value |
|---|---|
| Crates.io | robots_txt |
| lib.rs | robots_txt |
| version | 0.7.0 |
| source | src |
| created_at | 2017-04-05 21:47:07.39935 |
| updated_at | 2020-03-31 09:53:33.050803 |
| description | A lightweight parser and generator for robots.txt. |
| homepage | https://github.com/alexander-irbis/robots_txt |
| repository | https://github.com/alexander-irbis/robots_txt |
| max_upload_size | |
| id | 9744 |
| size | 41,371 |
robots_txt is a lightweight robots.txt parser and generator written in Rust. Nothing extra.

The implementation is a work in progress.

robots_txt is available on crates.io and can be included in your Cargo-enabled project like this:
Cargo.toml:

```toml
[dependencies]
robots_txt = "0.7"
```
```rust
use robots_txt::matcher::SimpleMatcher;
use robots_txt::Robots;

static ROBOTS: &'static str = r#"
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
"#;

fn main() {
    let robots = Robots::from_str(ROBOTS);

    // No section matches "NoName Bot", so the `*` rules apply
    let matcher = SimpleMatcher::new(&robots.choose_section("NoName Bot").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(!matcher.check_path("/cyberworld/map/object.html"));

    // The cybermapper section has an empty Disallow, so everything is allowed
    let matcher = SimpleMatcher::new(&robots.choose_section("Mozilla/5.0; CyberMapper v. 3.14").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(matcher.check_path("/cyberworld/map/object.html"));
}
```
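The checks above follow the classic robots.txt rule: a path is blocked when it starts with a non-empty `Disallow` prefix, and an empty `Disallow` permits everything. A minimal std-only sketch of that semantics (an illustration only, not the crate's actual matcher):

```rust
// Illustration of robots.txt Disallow semantics, independent of robots_txt.
// A path is allowed unless it starts with some non-empty Disallow prefix;
// an empty `Disallow:` line blocks nothing.
fn allowed(disallow_rules: &[&str], path: &str) -> bool {
    !disallow_rules
        .iter()
        .any(|rule| !rule.is_empty() && path.starts_with(rule))
}

fn main() {
    // Rules for `User-Agent: *`
    let star_rules = ["/cyberworld/map/"];
    assert!(allowed(&star_rules, "/some/page"));
    assert!(allowed(&star_rules, "/cyberworld/welcome.html"));
    assert!(!allowed(&star_rules, "/cyberworld/map/object.html"));

    // Rules for `User-Agent: cybermapper` (`Disallow:` with no value)
    let cybermapper_rules = [""];
    assert!(allowed(&cybermapper_rules, "/cyberworld/map/object.html"));
}
```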
main.rs:

```rust
extern crate robots_txt;
extern crate url;

use robots_txt::Robots;
use url::Url; // the `url` crate is also needed in Cargo.toml

fn main() {
    // Allow cybermapper everything, everyone else everything but /cyberworld/map/
    let robots1 = Robots::builder()
        .start_section("cybermapper")
        .disallow("")
        .end_section()
        .start_section("*")
        .disallow("/cyberworld/map/")
        .end_section()
        .build();

    let conf_base_url: Url = "https://example.com/".parse().expect("parse domain");
    let robots2 = Robots::builder()
        .host(conf_base_url.domain().expect("domain"))
        .start_section("*")
        .disallow("/private")
        .disallow("")
        .crawl_delay(4.5)
        .request_rate(9, 20)
        .sitemap("http://example.com/sitemap.xml".parse().unwrap())
        .end_section()
        .build();

    println!("# robots.txt for http://cyber.example.com/\n\n{}", robots1);
    println!("# robots.txt for http://example.com/\n\n{}", robots2);
}
```
As a result, we get:

```
# robots.txt for http://cyber.example.com/

User-agent: cybermapper
Disallow:

User-agent: *
Disallow: /cyberworld/map/

# robots.txt for http://example.com/

User-agent: *
Disallow: /private
Disallow:
Crawl-delay: 4.5
Request-rate: 9/20
Sitemap: http://example.com/sitemap.xml
Host: example.com
```
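The `Request-rate: 9/20` line above declares at most 9 requests per 20 seconds, so a polite crawler would space requests roughly 2.2 s apart. A small sketch of turning that pair into a per-request delay (an illustration, not part of the crate's API):

```rust
use std::time::Duration;

// Convert a Request-rate pair (requests, seconds) into the minimum
// spacing between consecutive requests to stay under the declared rate.
fn min_delay(requests: u32, seconds: u32) -> Duration {
    Duration::from_secs_f64(seconds as f64 / requests as f64)
}

fn main() {
    // Request-rate: 9/20  ->  one request every 20/9 ≈ 2.22 seconds
    let d = min_delay(9, 20);
    assert!((d.as_secs_f64() - 20.0 / 9.0).abs() < 1e-9);
    println!("{:.2}s between requests", d.as_secs_f64());
}
```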
Alternatives: messense/robotparser-rs, a robots.txt parser for Rust.
Licensed under either of:

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.