| Crates.io | gibberish-or-not |
| lib.rs | gibberish-or-not |
| version | 5.0.7 |
| created_at | 2025-02-22 17:41:45.364914+00 |
| updated_at | 2025-03-12 08:24:42.629303+00 |
| description | Figure out if text is gibberish or not |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1565596 |
| size | 4,153,019 |
Instantly detect if text is English or nonsense with 99% accuracy
```sh
# As a CLI tool
cargo install gibberish-or-not
```

```toml
# As a library in Cargo.toml
[dependencies]
gibberish-or-not = "5.0.7"
```
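For the simplest case, the `is_gibberish` function is all you need (the same API used throughout the examples below):

```rust
use gibberish_or_not::{is_gibberish, Sensitivity};

fn main() {
    // true => gibberish, false => valid English
    println!("{}", is_gibberish("flibber zxqv wplk", Sensitivity::Medium));
}
```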
The library offers enhanced detection using a BERT model for more accurate results on borderline cases. To use enhanced detection:
1. Set up HuggingFace authentication (one of two methods):

   **Method 1: Environment Variable**

   ```sh
   # Set the token in your environment
   export HUGGING_FACE_HUB_TOKEN=your_token_here
   ```

   **Method 2: Direct Token**

   ```rust
   use gibberish_or_not::{download_model_with_progress_bar, default_model_path};

   // Pass the token directly to the download function
   download_model_with_progress_bar(default_model_path(), Some("your_token_here"))?;
   ```
   Get your token from your HuggingFace account settings (https://huggingface.co/settings/tokens).
2. Download the model (choose one method):

   ```sh
   # Using the CLI (uses environment variable)
   cargo run --bin download_model
   ```

   ```rust
   // Or in your code (using direct token)
   use gibberish_or_not::{download_model_with_progress_bar, default_model_path};

   download_model_with_progress_bar(default_model_path(), Some("your_token_here"))?;
   ```
3. Use enhanced detection in your code:

   ```rust
   // Basic usage - automatically uses enhanced detection if the model exists
   use gibberish_or_not::{is_gibberish, Sensitivity};

   // The function automatically checks for the model at default_model_path()
   let result = is_gibberish("Your text here", Sensitivity::Medium);
   ```

   ```rust
   // Optional: explicit model control
   use gibberish_or_not::{GibberishDetector, Sensitivity, default_model_path};

   // Create a detector with a model - useful if you want to:
   // - use a custom model path
   // - check whether enhanced detection is available
   // - reuse the same model instance
   let detector = GibberishDetector::with_model(default_model_path());
   if detector.has_enhanced_detection() {
       let result = detector.is_gibberish("Your text here", Sensitivity::Medium);
   }
   ```

   Note: The basic detection algorithm is used as a fallback if the model is not available. The model is automatically loaded from the default path (`default_model_path()`) when using the simple `is_gibberish` function.
You can also check the token status programmatically:
```rust
use gibberish_or_not::{check_token_status, TokenStatus, default_model_path};

match check_token_status(default_model_path()) {
    TokenStatus::Required => println!("HuggingFace token needed"),
    TokenStatus::Available => println!("Token found, ready to download"),
    TokenStatus::NotRequired => println!("Model exists, no token needed"),
}
```
```rust
use gibberish_or_not::{is_gibberish, is_password, Sensitivity};

// Password detection
assert!(is_password("123456")); // Detects common passwords

// Valid English
assert!(!is_gibberish("The quick brown fox jumps over the lazy dog", Sensitivity::Medium));
assert!(!is_gibberish("Hello, world!", Sensitivity::Medium));

// Gibberish
assert!(is_gibberish("asdf jkl qwerty", Sensitivity::Medium));
assert!(is_gibberish("xkcd vwpq mntb", Sensitivity::Medium));
assert!(is_gibberish("println!({});", Sensitivity::Medium)); // Code snippets are classified as gibberish
```
Our advanced detection algorithm combines multiple scoring components: Shannon entropy, character-transition analysis, trigram and quadgram frequencies, and a vowel score, each described in detail below.
The library provides three sensitivity levels: Low, Medium, and High, as shown in the example below.
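As a quick illustration, the same borderline string can be checked under each level (a sketch using the `is_gibberish` API above; actual classifications depend on the input):

```rust
use gibberish_or_not::{is_gibberish, Sensitivity};

let text = "lol idk tbh";

// A borderline input may classify differently at each sensitivity level
println!("Low:    {}", is_gibberish(text, Sensitivity::Low));
println!("Medium: {}", is_gibberish(text, Sensitivity::Medium));
println!("High:   {}", is_gibberish(text, Sensitivity::High));
```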
Built-in detection of common passwords:
```rust
use gibberish_or_not::is_password;

assert!(is_password("123456")); // Common password
assert!(is_password("password")); // Common password
assert!(!is_password("unique_and_secure_passphrase")); // Not in common list
```
The library handles various special cases, such as code snippets (classified as gibberish) and common passwords.
The gibberish detection algorithm combines multiple scoring components into a weighted composite score. Here's a detailed look at each component:
The final classification uses a weighted sum:
$S = 0.4E + 0.25T + 0.15G_3 + 0.1G_4 + 0.1V$
Where:

- $E$ is the entropy score
- $T$ is the character-transition score
- $G_3$ is the trigram score
- $G_4$ is the quadgram score
- $V$ is the vowel score
The threshold is dynamically adjusted based on text length:
```rust
let threshold = match text_length {
    0..=20 => 0.7,    // Very short text needs a higher threshold
    21..=50 => 0.8,   // Short text
    51..=100 => 0.9,  // Medium text
    101..=200 => 1.0, // Standard threshold
    _ => 1.1,         // Long text can be more lenient
} * sensitivity_factor;
```
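To make the weighting concrete, here is a small worked example with made-up component scores (illustrative values only, not output from the library):

```rust
// Hypothetical component scores for a sample text
let (e, t, g3, g4, v) = (0.8_f64, 0.7, 0.6, 0.5, 0.9);

// S = 0.4E + 0.25T + 0.15G3 + 0.1G4 + 0.1V
let s = 0.4 * e + 0.25 * t + 0.15 * g3 + 0.1 * g4 + 0.1 * v;
// = 0.32 + 0.175 + 0.09 + 0.05 + 0.09 = 0.725

// Compared against the length- and sensitivity-adjusted threshold
// (assuming higher scores indicate more English-like text)
let threshold = 0.8; // e.g. 21-50 characters at a neutral sensitivity factor
let looks_like_english = s >= threshold; // false here: 0.725 < 0.8
```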
We calculate Shannon entropy to measure randomness:
$H = -\sum_{i} p_i \log_2(p_i)$
Where $p_i$ is the probability of character $i$ occurring in the text.
```rust
// char_frequencies holds p_i, the probability of each distinct character
let entropy = char_frequencies.iter()
    .map(|p| -p * p.log2())
    .sum::<f64>();
```
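For reference, a self-contained version of the same computation (a sketch, not the library's internal code):

```rust
use std::collections::HashMap;

/// Shannon entropy of a string's character distribution: H = -sum(p_i * log2(p_i))
fn shannon_entropy(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = text.chars().count() as f64;
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```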
Trigrams and quadgrams are scored using frequency analysis:
$G_n = \frac{\text{valid n-grams}}{\text{total n-grams}}$
```rust
let trigram_score = valid_trigrams.len() as f64 / total_trigrams.len() as f64;
let quadgram_score = valid_quadgrams.len() as f64 / total_quadgrams.len() as f64;
```
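A sketch of how such a score can be computed against a reference set (the library's actual n-gram tables and extraction rules may differ):

```rust
use std::collections::HashSet;

/// Fraction of the text's trigrams that appear in a reference set of
/// valid English trigrams.
fn trigram_score(text: &str, valid: &HashSet<String>) -> f64 {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    if chars.len() < 3 {
        return 0.0;
    }
    let grams: Vec<String> = chars.windows(3).map(|w| w.iter().collect()).collect();
    let hits = grams.iter().filter(|g| valid.contains(*g)).count();
    hits as f64 / grams.len() as f64
}
```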
We analyze character pair frequencies against known English patterns:
$T = \frac{\text{valid transitions}}{\text{total transitions}}$
The transition matrix is pre-computed from a large English corpus and stored as a perfect hash table.
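Conceptually, the transition score can be sketched like this (using a plain `HashSet` in place of the pre-computed perfect hash table):

```rust
use std::collections::HashSet;

/// Fraction of adjacent letter pairs that appear in a set of common
/// English character transitions.
fn transition_score(text: &str, common_pairs: &HashSet<(char, char)>) -> f64 {
    let letters: Vec<char> = text
        .to_lowercase()
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .collect();
    if letters.len() < 2 {
        return 0.0;
    }
    let valid = letters
        .windows(2)
        .filter(|w| common_pairs.contains(&(w[0], w[1])))
        .count();
    valid as f64 / (letters.len() - 1) as f64
}
```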
The final threshold is scaled per sensitivity level via the `sensitivity_factor` shown above.
The algorithm also includes fast-path decisions that short-circuit full scoring for clear-cut inputs.
This weighting balances accuracy with computational efficiency, prioritizing stronger indicators while still considering multiple aspects of language structure.
The library is optimized for speed, with benchmarks showing excellent performance across different text types:
Core algorithm (no model):

| Text Length | Processing Time |
|---|---|
| Short (10-20 chars) | 2.3-2.7 μs |
| Medium (20-50 chars) | 4-7 μs |
| Long (50-100 chars) | 7-15 μs |
| Very Long (200+ chars) | ~50 μs |
Enhanced detection (with BERT model):

| Text Length | First Run* | Subsequent Runs |
|---|---|---|
| Short (10-20 chars) | ~100ms | 5-10ms |
| Medium (20-50 chars) | ~100ms | 5-15ms |
| Long (50-100 chars) | ~100ms | 10-20ms |
| Very Long (200+ chars) | ~100ms | 15-30ms |
*First run includes model loading time. The model is cached after first use.
Processing time by sensitivity level:

| Sensitivity | Processing Time |
|---|---|
| Low | ~7.3 μs |
| Medium | ~6.7 μs |
| High | ~7.9 μs |
These benchmarks were run on a modern CPU using the Criterion benchmarking framework. The library achieves this performance through the pre-computed transition table, fast-path decisions, and the other lightweight scoring components described above.
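To reproduce numbers on your own hardware, a minimal Criterion benchmark along these lines works (a sketch; the crate's own bench harness may differ):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use gibberish_or_not::{is_gibberish, Sensitivity};

fn bench_detection(c: &mut Criterion) {
    c.bench_function("medium_sentence", |b| {
        b.iter(|| {
            is_gibberish(
                black_box("The quick brown fox jumps over the lazy dog"),
                Sensitivity::Medium,
            )
        })
    });
}

criterion_group!(benches, bench_detection);
criterion_main!(benches);
```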
Contributions are welcome! Please feel free to open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.