# WhichLicense detection

This is a library to facilitate the detection of licenses in source code.

## Usage

### License Detection

#### Gaoya detection

```rust
// `num_bands`, `band_width` and `shingle_text_size` are chosen by the caller.
let mut gaoya = GaoyaDetection {
    index: MinHashIndex::new(num_bands, band_width, 0.5),
    min_hasher: MinHasher32::new(num_bands * band_width),
    shingle_text_size,
    normalization_fn: DEFAULT_NORMALIZATION_FN,
};

// Load a previously generated license database from a file.
gaoya.load_from_file("licenses");

// OR: build the index from raw license texts.
// for l in load_licenses_from_folder("./licenses/RAW") {
//     gaoya.add_plain(&l.name, &strip_spdx_heading(&l.text));
// }
```

#### Fuzzyhash-rs Detection

```rust
let mut fuzzy = FuzzyDetection {
    licenses: vec![],
    min_confidence: 50,
    exit_on_exact_match: false,
    normalization_fn: DEFAULT_NORMALIZATION_FN,
};

// Load a previously generated license database from a file.
fuzzy.load_from_file("licenses");

// OR: add raw license texts one by one.
// for l in load_licenses_from_folder("./licenses/RAW") {
//     fuzzy.add_plain(&l.name, &strip_spdx_heading(&l.text));
// }
```

### Normalization function

The normalization function normalizes the license text before the algorithm processes it. This lets the algorithm focus on the license text itself rather than on its formatting, which ultimately improves the accuracy of the algorithm (higher confidence).

### Pipeline System

The pipeline system was developed to automatically improve the results of license detection by allowing further processing when, for example, the confidence is too low.

A pipeline executes each segment on the running license and checks the result against the algorithm every time a segment is executed. The pipeline stops running as soon as the confidence of the top (highest confidence) license is above the desired confidence.

The steps are as follows:

1. The pipeline is created with the given segments.
2. An initial sample is fetched from the algorithm directly, without executing any pipeline segment.
3. The system checks if the confidence of the top (highest confidence) license is above the desired confidence.
   * If it is, the pipeline stops running and returns the results.
   * If it is not, the pipeline continues to step 4.
4. The next segment is executed on the running license (starting at the first segment).
5. The system checks if the confidence of the top (highest confidence) license is above the desired confidence.
   * If it is, the pipeline stops running and returns the results.
   * If it is not, the pipeline moves back to step 4 and runs the next segment.

> Batched segments allow you to run multiple segments one after the other without checking against (i.e., testing) the algorithm after each segment. The algorithm is tested once all batched segments have executed.

#### Example

```rust
let pipeline = Pipeline::new(vec![
    // Remove everything that matches a regex.
    Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
    // Remove a literal piece of text.
    Segment::Remove(Using::Text("...".to_string())),
    // Replace a literal piece of text with "***".
    Segment::Replace(Using::Text("...".to_string()), "***".to_string()),
    // Run both segments before testing against the algorithm again.
    Segment::Batch(vec![
        Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
        Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
    ]),
]);

let results = pipeline.run(&algorithm, "", 100.0);
```

# Attributions

## ScanCode License data

> The initial database was generated by making use of the license data from the ScanCode toolkit. You do not need to make use of this copyright notice in your project if you choose not to use the ScanCode license database. However, if you do make use of the ScanCode license database, you must include this copyright notice in your project.

Copyright (c) nexB Inc. and others. All rights reserved.
ScanCode is a trademark of nexB Inc.
SPDX-License-Identifier: CC-BY-4.0
See https://creativecommons.org/licenses/by/4.0/legalcode for the license text.
See https://github.com/nexB/scancode-toolkit for support or download.
See https://aboutcode.org for more information about nexB OSS projects.