Crates.io | isbot |
lib.rs | isbot |
version | 0.1.3 |
source | src |
created_at | 2022-03-27 21:01:30.012189 |
updated_at | 2022-06-18 20:33:04.448577 |
description | Detect if a user-agent is a known bot |
homepage | |
repository | https://github.com/BryanMorgan/isbot |
max_upload_size | |
id | 557440 |
size | 28,026,031 |
Rust library to detect bots using a user-agent string.
Add this to your Cargo.toml
:
[dependencies]
isbot = "0.1.2"
The example below uses the default bot patterns to correctly identify the Googlebot-Image
user-agent as a bot and the Opera
user-agent as a browser.
use isbot::Bots;
let bots = Bots::default();
assert_eq!(bots.is_bot("Googlebot-Image/1.0"), true);
assert_eq!(bots.is_bot("Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1"), false);
isbot
can be added as middleware to enable global or per-handler rejections of known bots.
There are multiple ways to use isbot
in Actix. One way is to pass a Bots
instance into the app data as state. For example:
use isbot::Bots;
struct AppState {
bots: Bots,
}
let state = AppState {
bots: Bots::default(),
};
let app = App::new()
.app_data(web::Data::new(state))
.route("/", web::get().to(index));
Request handlers can use the Bots
data to filter out bots:
async fn index(req: HttpRequest, data: web::Data<AppState>) -> HttpResponse {
if let Some(user_agent) = get_user_agent(req.headers()) {
if data.bots.is_bot(user_agent) {
return HttpResponse::Forbidden().body("Bots not allowed");
}
}
HttpResponse::Ok().body("Home")
}
Another option is to use add the middleware using the Actix wrap_fn
function and globally reject all requests from bots. These 2 examples plus other options and details can be seen in the test examples:
Bot user-agent patterns can be customized by adding or removing patterns, using the append
and remove
methods.
To add new bot patterns, use append
to specify an array of regular expression patterns. For example:
let mut bots = isbot::Bots::default();
assert_eq!(bots.is_bot("Mozilla/5.0 (CustomNewTestB0T /1.2)"), false);
bots.append(&[r"CustomNewTestB0T\s/\d\.\d"]);
assert_eq!(bots.is_bot("Mozilla/5.0 (CustomNewTestB0T /1.2)"), true);
To remove bot patterns, use remove
and specify an array of existing patterns to remove. For example, to remove the Chrome Lighthouse user-agent pattern to indicate it is not a bot:
let mut bots = isbot::Bots::default();
bots.remove(&["Chrome-Lighthouse"]);
assert_eq!(bots.is_bot("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36 Chrome-Lighthouse"), false);
The default user-agent regular expression patterns are managed in the bot_regex_patterns.txt file.
If you don't want to use the default bot patterns you can supply your own list. Since the default bot patterns are automatically added to the library at compile time you should first disable the default feature. The include-default-bots
feature is enabled by default so the patterns defined in bot_regex_patterns.txt
are included in the library at compile time.
You can exclude the patterns by disabling the default features and then including your own bot regular expressions. To do that set default-features
to false in your Cargo.toml
dependency definition. For example:
[dependencies]
isbot = { version = "0.1.1", default-features = false }
And then use Bots::new()
to supply a newline delimited list of regular expressions. For example:
use isbot::Bots;
let custom_user_agent_patterns = r#"
^Googlebot-Image/
bingpreview/"#;
let bots = Bots::new(custom_user_agent_patterns);
assert_eq!(bots.is_bot("Googlebot-Image/1.0"), true);
Some of the test fixture data is download from multiple sources to ensure the latest user-agents are validated.
To download the latest test data fixures, run the download_fixture_data.rs
executable:
cargo run --bin download_fixture_data --features="download-fixture-data"
This will update files in the fixtures directory.
To run all unit and integration tests:
cargo test
To validate changes to the Actix examples run the following:
cargo test --example actix_example
To validate changes to the Rocket examples run the following:
cargo test --example rocket_example
Bot detection is a gray area since there are no clear lines on what defines a bot user-agent and a real browser user-agent. Some libraries focus on broadly classifying bots and trying to identify as many as possible, with the risk that real user browser may be caught and falsely flagged as bots.
This library's focus is on identifying known bots while primarily ensuring no real users or browsers are falsely flagged. All of the bot user-agent patterns are validated against a large number of real browsers and bot patterns to eliminate false positives.
For example, the user-agent string below is identified as both a bot and a real browser by various libraries and data sources:
Mozilla/5.0 (Linux; Android 4.2.1; CUBOT GT99 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19
There are many excellent bot detection libraries available for other languages and awesome developers maintaining bot and user-agent identification data. This library draws inspiration from many of them, especially:
Library | Language |
---|---|
https://github.com/omrilotan/isbot | JavaScript |
https://github.com/JayBizzle/Crawler-Detect/ | PHP |
https://github.com/matomo-org/device-detector | PHP |
https://github.com/fnando/browser | Ruby |
https://github.com/biola/Voight-Kampff | Ruby |
The following data sources are used directly or as inspiration for the static test data and downloaded user-agent identification:
Data Source | Notes |
---|---|
user-agents.net | User-Agents Database |
myip.ms | List of known web bots & spiders |
monperrus | Collection of user-agents used by robots, crawlers, and spiders |
ua-core | Regex file and data to build language ports of Browserscope's user-agent parser |
See the Contributing guide.
isbot
is distributed under the terms of the MIT license. See LICENSE for details.