# Ferret: Copy-Detection in Text and Code

Ferret is a copy-detection tool, locating duplicate text or code in multiple text documents or source files. Ferret is designed to detect copying (_collusion_) within a given set of files.

As a library, Ferret can be used to analyse program code or natural language texts into trigrams, and compare pairs of documents for similarity.

**Features:**

* compares text documents containing natural language or computer language
* computes a similarity measure based on the trigrams found within pairs of documents
* many major programming languages are recognised and tokenised appropriately
* outputs for analysis include:
  * pairwise comparisons ordered by similarity, including trigram counts
  * counts of unique trigrams within each file / group
  * reverse index from trigrams to list of documents they are found in
  * XML detailed comparison of a pair of documents

## Command line use

```console
$ ferret --help
Usage: ferret [-ghluvx] filename [filenames...]
  -g, --group           Use subdirectory names to group files
  -h, --help            Show help information
  -l, --list-trigrams   Output list of trigrams found
  -u, --unique-counts   Output counts of unique trigrams
  -v, --version         Version number
  -x, --xml-report filename1 filename2 outfile : Create XML report
```

## Library use

Take some files and find the two most similar:

```rust
use ferret::documents::Documents;

fn main() {
    let files = [
        "txt1.txt".to_string(),
        "txt2.txt".to_string(),
        "txt3.txt".to_string(),
    ];
    let docs = Documents::new(&files[..]);
    let results = docs.sorted_results(false);

    println!("Most similar pair: {}", results[0]);
}
```

Take a file, and read it trigram-by-trigram:

```rust
use ferret::trigram_reader::TrigramReader;
use std::path::PathBuf;

fn main() {
    let path = PathBuf::from(r"test.rb");
    let mut reader = TrigramReader::new(&path);

    while reader.read_trigram() {
        println!("Trigram {}", reader.last_trigram());
    }
}
```
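
To illustrate the idea behind a trigram-based similarity measure, the following standalone sketch compares two strings without using the Ferret crate. It rests on two assumptions that this README does not confirm: tokens are lowercase whitespace-separated words, and similarity is the Jaccard ratio (shared distinct trigrams divided by all distinct trigrams). The function names `trigram_set` and `similarity` are hypothetical and are not part of Ferret's API; Ferret's own tokenisers and formula may differ.

```rust
use std::collections::HashSet;

/// Collect the distinct word trigrams in `text`.
/// ASSUMPTION: naive lowercase whitespace tokenisation, not Ferret's tokeniser.
fn trigram_set(text: &str) -> HashSet<(String, String, String)> {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();

    // Slide a window of three consecutive words over the token list.
    words
        .windows(3)
        .map(|w| (w[0].clone(), w[1].clone(), w[2].clone()))
        .collect()
}

/// ASSUMPTION: Jaccard-style ratio of shared to total distinct trigrams.
/// Returns 0.0 when neither text contains a trigram.
fn similarity(a: &str, b: &str) -> f64 {
    let ta = trigram_set(a);
    let tb = trigram_set(b);

    let shared = ta.intersection(&tb).count();
    let total = ta.union(&tb).count();

    if total == 0 {
        0.0
    } else {
        shared as f64 / total as f64
    }
}

fn main() {
    let a = "the quick brown fox jumps over the lazy dog";
    let b = "the quick brown fox leaps over the lazy dog";

    // Each text has 7 trigrams; 4 are shared and 10 are distinct overall.
    println!("similarity = {:.3}", similarity(a, b)); // prints "similarity = 0.400"
}
```

Changing a single word alters three trigrams in this example, which is why a one-word edit in a nine-word sentence lowers the score to 0.4. Using sets rather than counts means repeated trigrams within one document do not inflate the score.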