| Crates.io | tree-sitter-robots-txt |
| lib.rs | tree-sitter-robots-txt |
| version | 1.0.1 |
| created_at | 2026-01-02 02:48:32.299626+00 |
| updated_at | 2026-01-02 02:48:32.299626+00 |
| description | Grammar for robots.txt |
| homepage | |
| repository | https://github.com/opa-oz/tree-sitter-robots-txt |
| max_upload_size | |
| id | 2017863 |
| size | 108,778 |
This is a general tree-sitter parser grammar for robots.txt files.
A robots.txt file is a text file used to instruct web robots (often called crawlers or spiders) how to interact with pages on a website. The basic syntax and rules are:
- `User-agent` line: specifies the robot(s) to which the rules apply.
  - `User-agent: *` applies to all robots.
  - `User-agent: Googlebot` applies specifically to Google's crawler.
- `Disallow` line: specifies the files or directories that the specified robot(s) should not crawl.
  - `Disallow: /directory/` disallows crawling of the specified directory.
  - `Disallow: /file.html` disallows crawling of the specific file.
  - `Disallow: /` disallows crawling of the entire site.
- `Allow` line (optional): overrides a disallow rule for a specific file or directory.
  - `Allow: /directory/file.html` allows crawling of a specific file within a disallowed directory.
- `Crawl-delay` line (optional): specifies the delay in seconds between successive requests to the site.
  - `Crawl-delay: 10` sets a 10-second delay between requests.
- `Sitemap` line (optional): directs robots to the location of the XML sitemap(s) for the website.
  - `Sitemap: https://www.example.com/sitemap.xml` specifies the location of the XML sitemap.
- Comments: lines beginning with `#` are ignored by robots and can be used to annotate the file for humans.
Example:

```txt
User-agent: *
Disallow: /admin/
Disallow: /private.html
Allow: /public.html
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
# This is a comment explaining the robots.txt file.
```
Wildcards (`*`) can be used in `Disallow` directives, e.g., `Disallow: /*.pdf` to block all PDF files. Rules are organized into groups, each consisting of a `User-agent` line and subsequent directives. It's important to note that while robots.txt files provide guidance to well-behaved crawlers, malicious or poorly programmed crawlers may ignore these instructions. They are therefore primarily used for managing how legitimate search engines and web crawlers interact with a website.
The grammar recognizes:

- directives (`User-agent`, `Disallow`, `Allow`, `Crawl-delay`, `Sitemap`, `Host`)
- comments (`# comment`)
- `X-Robots-Tag`

How to run & test:
```sh
npm install
npm run test
```
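
To use the published crate from Rust, the minimal sketch below parses a small robots.txt and prints the resulting syntax tree. It assumes the crate exposes a `LANGUAGE` constant, as bindings generated by recent tree-sitter CLIs do; older generators emit a `language()` function instead, so check the crate docs for the exact name. Dependency versions are likewise assumptions.

```rust
// Cargo.toml (assumed versions, adjust to match the crate):
//   tree-sitter = "0.24"
//   tree-sitter-robots-txt = "1.0"

fn main() {
    let source = "\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
";

    let mut parser = tree_sitter::Parser::new();

    // Assumption: the crate exposes a `LANGUAGE` constant; older grammar
    // crates expose a `language()` function instead.
    parser
        .set_language(&tree_sitter_robots_txt::LANGUAGE.into())
        .expect("grammar/tree-sitter version mismatch");

    // Parse the robots.txt source and print the tree as an S-expression,
    // a quick way to see which node names the grammar produces.
    let tree = parser.parse(source, None).expect("parse failed");
    println!("{}", tree.root_node().to_sexp());
}
```

Printing the S-expression is a convenient way to discover node names before writing tree-sitter queries or highlight rules against this grammar.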