## robotxt

[![Build Status][action-badge]][action-url]
[![Crate Docs][docs-badge]][docs-url]
[![Crate Version][crates-badge]][crates-url]
[![Crate Coverage][coverage-badge]][coverage-url]

**Also check out other `spire-rs` projects [here](https://github.com/spire-rs).**

[action-badge]: https://img.shields.io/github/actions/workflow/status/spire-rs/kit/build.yaml?branch=main&label=build&logo=github&style=flat-square
[action-url]: https://github.com/spire-rs/kit/actions/workflows/build.yaml
[crates-badge]: https://img.shields.io/crates/v/robotxt.svg?logo=rust&style=flat-square
[crates-url]: https://crates.io/crates/robotxt
[docs-badge]: https://img.shields.io/docsrs/robotxt?logo=Docs.rs&style=flat-square
[docs-url]: http://docs.rs/robotxt
[coverage-badge]: https://img.shields.io/codecov/c/github/spire-rs/kit?logo=codecov&logoColor=white&style=flat-square
[coverage-url]: https://app.codecov.io/gh/spire-rs/kit

An implementation of the robots.txt (or URL exclusion) protocol for the Rust
programming language, with support for the `crawl-delay`, `sitemap`, and
universal `*` match extensions (per RFC 9309).

### Features

- `parser` to enable `robotxt::{Robots}`. **Enabled by default**.
- `builder` to enable `robotxt::{RobotsBuilder, GroupBuilder}`. **Enabled by default**.
- `optimal` to optimize overlapping and global rules, potentially improving
  matching speed at the cost of longer parsing times.
- `serde` to enable `serde::{Deserialize, Serialize}` implementations,
  allowing the caching of related rules.

### Examples

- parse the most specific `user-agent` in the provided `robots.txt` file:

```rust
use robotxt::Robots;

fn main() {
    let txt = r#"
        User-Agent: foobot
        Disallow: *
        Allow: /example/
        Disallow: /example/nope.txt
    "#;

    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
```

- build a new `robots.txt` file in a declarative manner:

```rust
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_2.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt);
    Ok(())
}
```

### Links

- [Request for Comments: 9309](https://www.rfc-editor.org/rfc/rfc9309.txt) on RFC-Editor.org
- [Introduction to Robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) on Google.com
- [How Google interprets Robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt) on Google.com
- [What is a Robots.txt file](https://moz.com/learn/seo/robotstxt) on Moz.com

### Notes

- The parser is based on [Smerity/texting_robots](https://github.com/Smerity/texting_robots).
- The `Host` directive is not supported.
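
The `serde` feature listed above is intended for caching parsed rules. The sketch
below assumes `serde_json` as the serialization format; it is not a dependency of
this crate, and any serde-compatible format should work the same way:

```rust
use robotxt::Robots;

fn main() -> serde_json::Result<()> {
    let txt = "User-Agent: foobot\nDisallow: /private/";
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");

    // Serialize the parsed rules so they can be stored in a cache
    // (e.g. on disk or in Redis) instead of re-fetching robots.txt.
    let cached = serde_json::to_string(&r)?;

    // Later: restore the rules without re-downloading or re-parsing.
    let r: Robots = serde_json::from_str(&cached)?;
    assert!(!r.is_relative_allowed("/private/secret.txt"));
    Ok(())
}
```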
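
For a rough end-to-end sketch, the parser can be fed bytes fetched from a live
host. This example assumes the `reqwest` crate with its `blocking` feature; the
host, path, and `foobot` user-agent are placeholders:

```rust
use robotxt::Robots;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the robots.txt file for the target host.
    let body = reqwest::blocking::get("https://example.com/robots.txt")?.bytes()?;

    // Parse the rules that apply to this crawler's user-agent.
    let robots = Robots::from_bytes(&body, "foobot");

    // Check a relative path before requesting it.
    if robots.is_relative_allowed("/some/page.html") {
        println!("allowed to fetch /some/page.html");
    }

    Ok(())
}
```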