# FOSSlim

![](https://img.shields.io/crates/v/fosslim.svg)

FOSSlim stands for **F**ree **O**pen **S**ource **S**oftware **LI**cense **M**atcher.
It matches the text of an OSS license to its SPDX id, and users can easily change and update the training data
with additional EULAs and license texts.

It is designed to be modular and to provide many low-level, high-speed utilities that libraries written in high-level
languages such as Ruby and JavaScript can benefit from.
You can take advantage of the various models implemented here, but on their own they are not enough to give a
high-confidence response. That task is left to the RubyGem and NPM packages, which clean up the raw text
and combine results from multiple models to increase the confidence of the match.

It is still under **active development**, but it will be released as:

1. ~~Rust library ( *milestone.1*, *milestone.3* )~~
2. ~~RoR gem with example API ( *milestone.2* )~~ - [LicenseMatcher gem](https://rubygems.org/gems/license_matcher)
3. Sample RoR application using the gem - Fosslim.com

The remaining items are marked TBD: the release time is unknown, and priority depends on interest from the community.

4. NodeJS library with example AWS Lambda function, TBD
5. Rust microservice, TBD
6. Command-line tool to scan files, TBD

#### Models

* **NaiveTF** - uses a simple WordBag model and ranks results by [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index)
* **FingerNgram** - splits text into overlapping [N-grams](https://en.wikipedia.org/wiki/N-gram) and hashes selected N-grams to build a fingerprint
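The two ideas above can be illustrated with a minimal, standard-library-only sketch. This is not the crate's implementation: the function names `tokenize`, `jaccard` and `fingerprint` are hypothetical, the word bag is simplified to a set of distinct words, and the rule for selecting N-grams (keep a hash only when it is divisible by `keep_every`) is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Splits text into lowercased alphanumeric words (hypothetical helper).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Jaccard index |A ∩ B| / |A ∪ B| over the sets of distinct words.
/// Returns 0.0 when both texts contain no words.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<String> = tokenize(a).into_iter().collect();
    let sb: HashSet<String> = tokenize(b).into_iter().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Hashes every overlapping word n-gram and keeps only the hashes that are
/// divisible by `keep_every` (assumed selection rule, not the crate's).
fn fingerprint(text: &str, n: usize, keep_every: u64) -> HashSet<u64> {
    let words = tokenize(text);
    if n == 0 || keep_every == 0 || words.len() < n {
        return HashSet::new();
    }
    words
        .windows(n) // overlapping windows of n consecutive words
        .map(|gram| {
            let mut hasher = DefaultHasher::new();
            gram.hash(&mut hasher);
            hasher.finish()
        })
        .filter(|hash| *hash % keep_every == 0)
        .collect()
}
```

With these helpers, `jaccard("a b c", "b c d")` gives `0.5` (two shared words out of four distinct ones), and two fingerprints can be compared with the same set-overlap idea. `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, so a real index that persists fingerprints would need a hash function with a fixed specification.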

Planned for the near future:

* TF/IDF models with Cosine similarity
* Okapi BM25 model
* Winnowing model
* Simple probabilistic ML models ~ Naive Bayes, HMM, ...?

#### Usage

```rust
use fosslim::index;
use fosslim::document::Document;
use fosslim::naive_tf; // Simple wordbag model with Jaccard similarity
// ...
let idx_file_path = "data/index.msgpack"; // pre-built index from SPDX data; includes ~300 licenses

// Raw string literal: backslashes are not line continuations here, so none are used
let mit_txt = r#"
Permission is hereby granted, free of charge, to any person obtaining a copy of this software
and associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
"#;

let doc1 = Document::new(0, "mit".to_string(), mit_txt.to_string());

// Match the document against the SPDX labels in the index
if let Ok(idx) = index::load(idx_file_path) {
    let mdl = naive_tf::from_index(&idx);

    // `mdl` is a value, so the method is called with `.`, not `::`
    mdl.match_document(&doc1);
}
// ...
```

Check the `tests` folder for more usage examples.

You can also build your own index with the `index::build_from_path()` function; you just have to follow the same file structure
as the JSON files in the `data/licenses` folder.

#### Current alternatives

Here are some alternatives you can already use today:

* SPDX lookup - https://github.com/bbqsrc/spdx-lookup-python
* LibrariesIO license normalizer - https://github.com/librariesio/spdx
* **Google's license classifier** - https://github.com/google/licenseclassifier
* **Fossology** - https://github.com/fossology/fossology
* LicenseFinder - https://github.com/pivotal/LicenseFinder