# rustrict
[![Documentation](https://docs.rs/rustrict/badge.svg)](https://docs.rs/rustrict)
[![crates.io](https://img.shields.io/crates/v/rustrict.svg)](https://crates.io/crates/rustrict)
[![Build](https://github.com/finnbear/rustrict/actions/workflows/build.yml/badge.svg)](https://github.com/finnbear/rustrict/actions/workflows/build.yml)
[![Test Page](https://img.shields.io/badge/Test-page-green)](https://finnbear.github.io/rustrict/)
`rustrict` is a profanity filter for Rust.
Disclaimer: Multiple source files (`.txt`, `.csv`, `.rs` test cases) contain profanity. Viewer discretion is advised.
## Features
- Multiple types (profane, offensive, sexual, mean, spam)
- Multiple levels (mild, moderate, severe)
- Resistant to evasion
  - Alternative spellings (like "fck")
  - Repeated characters (like "craaaap")
  - Confusable characters (like 'ᑭ', '𝕡', and '🅿')
  - Spacing (like "c r_a-p")
  - Accents (like "pÓöp")
  - Bidirectional Unicode ([related reading](https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html))
  - Self-censoring (like "f*ck")
  - Safe phrase list for known bad actors
  - Censors invalid Unicode characters
  - Battle-tested in [Mk48.io](https://mk48.io)
- Resistant to false positives
  - One word (like "**ass**assin")
  - Two words (like "pu**sh it**")
- Flexible
  - Censor and/or analyze
  - Input `&str` or `Iterator<Item = char>`
  - Can track per-user state with the `context` feature
  - Can add words with the `customize` feature
  - Accurately reports the width of Unicode via the `width` feature
  - Plenty of options
- Performant
  - O(n) analysis and censoring
  - No `regex` (uses custom trie)
  - 3 MB/s in `release` mode
  - 100 KB/s in `debug` mode
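
To illustrate why a trie can stand in for `regex`, here is a minimal, standard-library-only sketch of trie-based censoring. It is not `rustrict`'s actual implementation: `Node`, `WordTrie`, `insert`, and `censor` are hypothetical names made up for this example, and the sketch ignores confusable characters, spacing, and false positives entirely.

```rust
use std::collections::HashMap;

/// One node of a character trie. `terminal` marks the end of a listed word.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    terminal: bool,
}

/// A toy word list stored as a trie (hypothetical; not rustrict's internal type).
#[derive(Default)]
struct WordTrie {
    root: Node,
}

impl WordTrie {
    /// Adds a word to the list. Words are expected to be lowercase.
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.terminal = true;
    }

    /// Keeps the first character of every listed word and replaces the rest with `*`.
    fn censor(&self, text: &str) -> String {
        let chars: Vec<char> = text.chars().collect();
        let mut out = String::with_capacity(text.len());
        let mut i = 0;
        while i < chars.len() {
            // Walk the trie from position `i`, remembering the longest match.
            let mut node = &self.root;
            let mut longest = 0;
            for (offset, c) in chars[i..].iter().enumerate() {
                // ASCII-only case folding keeps the sketch short.
                match node.children.get(&c.to_ascii_lowercase()) {
                    Some(next) => {
                        node = next;
                        if node.terminal {
                            longest = offset + 1;
                        }
                    }
                    None => break,
                }
            }
            out.push(chars[i]);
            if longest > 0 {
                out.extend(std::iter::repeat('*').take(longest - 1));
                i += longest;
            } else {
                i += 1;
            }
        }
        out
    }
}
```

Each starting position walks at most as many trie edges as the longest listed word, so the scan stays linear in the input length for a fixed word list. Because this toy matches anywhere in the text, it would also censor part of "scrap", which is exactly the kind of false positive the real crate is designed to avoid.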
## Limitations
- Mostly English/emoji
- Censoring removes most diacritics (accents)
- Does not detect right-to-left profanity while analyzing, so...
- Censoring forces Unicode to be left-to-right
- Doesn't understand context
- Not resistant to false positives affecting profanities added at runtime
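
As a rough illustration of what forcing text to be left-to-right can involve, the sketch below strips Unicode bidirectional control characters from a string. This is an assumption about the general technique, not a description of how `rustrict` does it; `is_bidi_control` and `strip_bidi_controls` are hypothetical helpers.

```rust
/// Returns true for Unicode bidirectional control characters:
/// embeddings and overrides (U+202A..=U+202E), isolates (U+2066..=U+2069),
/// and the directional marks U+200E, U+200F, and U+061C.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{202A}'..='\u{202E}'
            | '\u{2066}'..='\u{2069}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{061C}'
    )
}

/// Removes every bidirectional control character, leaving other text untouched.
fn strip_bidi_controls(text: &str) -> String {
    text.chars().filter(|c| !is_bidi_control(*c)).collect()
}
```

Removing these characters prevents an override such as U+202E from visually reordering the text that follows it, at the cost of breaking legitimate right-to-left formatting.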
## Usage
### Strings (`&str`)
```rust
use rustrict::CensorStr;
let censored: String = "hello crap".censor();
let inappropriate: bool = "f u c k".is_inappropriate();
assert_eq!(censored, "hello c***");
assert!(inappropriate);
```
### Iterators (`Iterator<Item = char>`)
```rust
use rustrict::CensorIter;
let censored: String = "hello crap".chars().censor().collect();
assert_eq!(censored, "hello c***");
```
### Advanced
By constructing a `Censor`, one can avoid scanning text multiple times to get a censored `String` and/or
answer multiple `is` queries. This also opens up more customization options (defaults are below).
```rust
use rustrict::{Censor, Type};
let (censored, analysis) = Censor::from_str("123 Crap")
    .with_censor_threshold(Type::INAPPROPRIATE)
    .with_censor_first_character_threshold(Type::OFFENSIVE & Type::SEVERE)
    .with_ignore_false_positives(false)
    .with_ignore_self_censoring(false)
    .with_censor_replacement('*')
    .censor_and_analyze();
assert_eq!(censored, "123 C***");
assert!(analysis.is(Type::INAPPROPRIATE));
assert!(analysis.isnt(Type::PROFANE & Type::SEVERE | Type::SEXUAL));
```
If you cannot afford to let anything slip through, or have reason to believe a particular user
is trying to evade the filter, you can check if their input matches a [short list of safe strings](src/safe.txt):
```rust
use rustrict::{CensorStr, Type};
// Figure out if a user is trying to evade the filter.
assert!("pron".is(Type::EVASIVE));
assert!("porn".isnt(Type::EVASIVE));
// Only let safe messages through.
assert!("Hello there!".is(Type::SAFE));
assert!("nice work.".is(Type::SAFE));
assert!("yes".is(Type::SAFE));
assert!("NVM".is(Type::SAFE));
assert!("gtg".is(Type::SAFE));
assert!("not a common phrase".isnt(Type::SAFE));
```
If you want to add custom profanities or safe words, enable the `customize` feature.
```rust
#[cfg(feature = "customize")]
{
    use rustrict::{add_word, CensorStr, Type};

    // You must take care not to call these when the crate is being
    // used in any other way (to avoid concurrent mutation).
    unsafe {
        add_word("reallyreallybadword", (Type::PROFANE & Type::SEVERE) | Type::MEAN);
        add_word("mybrandname", Type::SAFE);
    }

    assert!("Reallllllyreallllllybaaaadword".is(Type::PROFANE));
    assert!("MyBrandName".is(Type::SAFE));
}
```
If your use-case is chat moderation, and you store data on a per-user basis, you can use `rustrict::Context` as a reference implementation:
```rust
#[cfg(feature = "context")]
{
    use rustrict::{BlockReason, Context};
    use std::time::Duration;

    pub struct User {
        context: Context,
    }

    let mut bob = User {
        context: Context::default()
    };

    // Ok messages go right through.
    assert_eq!(bob.context.process(String::from("hello")), Ok(String::from("hello")));

    // Bad words are censored.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("c***")));

    // Can take user reports (After many reports or inappropriate messages,
    // will only let known safe messages through.)
    for _ in 0..5 {
        bob.context.report();
    }

    // If many bad words are used or reports are made, the first letter of
    // future bad words starts getting censored too.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("****")));

    // Can manually mute.
    bob.context.mute_for(Duration::from_secs(2));
    assert!(matches!(bob.context.process(String::from("anything")), Err(BlockReason::Muted(_))));
}
```
## Comparison
To compare filters, the first 100,000 items of [this list](https://raw.githubusercontent.com/vzhou842/profanity-check/master/profanity_check/data/clean_data.csv)
are used as a dataset. Positive accuracy is the percentage of profanity detected as profanity. Negative accuracy is the percentage of clean text detected as clean.
| Crate | Accuracy | Positive Accuracy | Negative Accuracy | Time |
|-------|----------|-------------------|-------------------|------|
| [rustrict](https://crates.io/crates/rustrict) | 79.82% | 94.00% | 76.29% | 9s |
| [censor](https://crates.io/crates/censor) | 76.16% | 72.76% | 77.01% | 23s |
| [stfu](https://crates.io/crates/stfu) | 91.74% | 77.69% | 95.25% | 45s |
| [profane-rs](https://crates.io/crates/profane-rs) | 80.47% | 73.79% | 82.14% | 52s |
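
The three accuracy columns can be computed from a filter's confusion counts, as in the sketch below. `Counts` and its methods are hypothetical names for illustration only, and any numbers fed to it here are made up rather than taken from the benchmark above.

```rust
/// Confusion counts for a binary profanity filter (hypothetical helper type).
struct Counts {
    true_positives: u64,  // profane text detected as profane
    false_negatives: u64, // profane text detected as clean
    true_negatives: u64,  // clean text detected as clean
    false_positives: u64, // clean text detected as profane
}

impl Counts {
    /// Fraction of profane items detected as profane.
    fn positive_accuracy(&self) -> f64 {
        self.true_positives as f64 / (self.true_positives + self.false_negatives) as f64
    }

    /// Fraction of clean items detected as clean.
    fn negative_accuracy(&self) -> f64 {
        self.true_negatives as f64 / (self.true_negatives + self.false_positives) as f64
    }

    /// Fraction of all items classified correctly.
    fn accuracy(&self) -> f64 {
        let correct = self.true_positives + self.true_negatives;
        let total = correct + self.false_negatives + self.false_positives;
        correct as f64 / total as f64
    }
}
```

Overall accuracy is the average of the positive and negative accuracies weighted by how many profane and clean items the dataset contains, so a filter with high negative accuracy can lead the Accuracy column on a mostly clean dataset even when it misses more profanity.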
## Development
[![Build](https://github.com/finnbear/rustrict/actions/workflows/build.yml/badge.svg?branch=master)](https://github.com/finnbear/rustrict/actions/workflows/build.yml)
If you make an adjustment that would affect false positives, such as adding profanity,
you will need to run `false_positive_finder`:
1. Run `make downloads` to download the required word lists and dictionaries
2. Run `make false_positives` to automatically find false positives
If you modify `replacements_extra.csv`, run `make replacements` to rebuild `replacements.csv`.
Finally, run `make test` for a full test or `make test_debug` for a fast test.
## License
Licensed under either of
* Apache License, Version 2.0
([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license
([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.