Crates.io | ferret |
lib.rs | ferret |
version | 1.1.1 |
source | src |
created_at | 2020-10-30 17:33:20.919917 |
updated_at | 2023-11-24 12:45:20.152426 |
description | A trigram-based tool for detecting similarity in groups of text documents or program code. |
homepage | https://peterlane.codeberg.page/ferret/ |
repository | https://codeberg.org/peterlane/ferret/ |
Ferret is a copy-detection tool that locates duplicate text or code across multiple text documents or source files. It is designed to detect copying (collusion) within a given set of files.
As a library, Ferret can break program code or natural-language text into trigrams and compare pairs of documents for similarity.
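To make the idea of trigram-based comparison concrete, here is a minimal, self-contained sketch. It does not use the Ferret crate. The helper names `tokenize`, `trigrams` and `similarity` are hypothetical, the tokenizer is a simplification, and the score assumes a Jaccard-style measure (shared trigrams divided by all distinct trigrams). Ferret's own tokenization and scoring may differ in detail.

```rust
use std::collections::HashSet;

/// Split text into lowercase word tokens.
/// Hypothetical tokenizer for illustration; not Ferret's own rules.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Collect the set of distinct word trigrams (three consecutive tokens).
fn trigrams(text: &str) -> HashSet<(String, String, String)> {
    let tokens = tokenize(text);
    tokens
        .windows(3)
        .map(|w| (w[0].clone(), w[1].clone(), w[2].clone()))
        .collect()
}

/// Assumed Jaccard-style score: shared trigrams / all distinct trigrams.
/// Returns 0.0 when neither text contains a trigram.
fn similarity(a: &str, b: &str) -> f64 {
    let ta = trigrams(a);
    let tb = trigrams(b);
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

fn main() {
    let a = "the quick brown fox jumps over the lazy dog";
    let b = "the quick brown fox leaps over the lazy dog";
    println!("similarity = {:.3}", similarity(a, b));
}
```

Changing a single word alters every trigram that contains it, so even small edits lower the score noticeably, while long copied passages keep most of their trigrams intact.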
Features:
$ ferret --help
Usage: ferret [-ghluvx] filename [filenames...]
    -g, --group         Use subdirectory names to group files
    -h, --help          Show help information
    -l, --list-trigrams
                        Output list of trigrams found
    -u, --unique-counts
                        Output counts of unique trigrams
    -v, --version       Version number
    -x, --xml-report    filename1 filename2 outfile : Create XML report
Take some files and find the two most similar:
use ferret::documents::Documents;

fn main() {
    let files = ["txt1.txt".to_string(), "txt2.txt".to_string(), "txt3.txt".to_string()];
    let docs = Documents::new(&files[..]);
    let results = docs.sorted_results(false);
    println!("Most similar pair: {}", results[0]);
}
Take a file, and read it trigram-by-trigram:
use ferret::trigram_reader::TrigramReader;
use std::path::PathBuf;

fn main() {
    let path = PathBuf::from(r"test.rb");
    let mut reader = TrigramReader::new(&path);
    while reader.read_trigram() {
        println!("Trigram {}", reader.last_trigram());
    }
}