![project-banner](https://github.com/shubham0204/tfidf-summarizer-rs/assets/41076823/f2855bcb-9573-4b70-9c38-7d4120511896)

# Text Summarization With TF-IDF In Rust

[![tfidf-text-summarizer crate](https://img.shields.io/crates/v/tfidf-text-summarizer.svg)](https://crates.io/crates/tfidf-text-summarizer)
[![tfidf-text-summarizer documentation](https://docs.rs/tfidf-text-summarizer/badge.svg)](https://docs.rs/tfidf-text-summarizer)

> Implementation of an extractive text summarization system which uses TF-IDF scores of words present in the text to rank sentences and generate a summary

> [!NOTE]
> Do read the [blog on TowardsDataScience](https://medium.com/towards-data-science/building-a-cross-platform-tfidf-text-summarizer-in-rust-7b05938f4507)

**Contents**

1. [Usage](#usage)
    1. [Usage in Rust](#usage-in-rust)
    2. [Usage in C/C++ and with a Debian package](#usage-with-cc-codebases-and-with-a-debian-package)
    3. [Usage in Android](#usage-in-android)
2. [Contributing](#contributing)
3. [Useful External Resources](#useful-external-resources)

## Usage

### Usage in Rust

Compiling this project requires Rust's nightly build (needed by `punkt`, a dependency of this project), which can be added with [`rustup`](https://rust-lang.github.io/rustup/index.html):

```
$> rustup toolchain install nightly
$> cargo new <crate-name>
$> cd <crate-name>
$> crate-name> rustup override set nightly
```

You may check the official guides for [Nightly builds](https://rust-lang.github.io/rustup/concepts/channels.html#working-with-nightly-rust) and [overrides](https://rust-lang.github.io/rustup/overrides.html).

Add the dependency `tfidf-text-summarizer = "0.0.1"` to your project's `Cargo.toml`,

```toml
[package]
...

[dependencies]
tfidf-text-summarizer = "0.0.1"
```

and then execute `cargo build` to download the crate (and its dependencies `punkt` and `rayon`).

The crate provides two functions to extract summaries from a given text.
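
For intuition about what an extractive TF-IDF summarizer does, the sketch below ranks sentences by a length-normalized TF-IDF score and keeps the top fraction of them. It is not the crate's implementation: the names `tokenize`, `score_sentences` and `sketch_summarize` are hypothetical, sentences are passed in already split (the crate relies on `punkt` for that step), each sentence is treated as one document for the IDF term, and the number of kept sentences is rounded. All of these are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical helper: lowercase a sentence and split it into alphanumeric tokens.
fn tokenize(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score each sentence as the sum of TF-IDF over its distinct words,
/// divided by the number of tokens in the sentence.
/// Assumption of this sketch: every sentence counts as one "document" for IDF.
fn score_sentences(sentences: &[&str]) -> Vec<f32> {
    // Per-sentence word counts; BTreeMap keeps the summation order deterministic.
    let counts: Vec<BTreeMap<String, f32>> = sentences
        .iter()
        .map(|s| {
            let mut m: BTreeMap<String, f32> = BTreeMap::new();
            for tok in tokenize(s) {
                *m.entry(tok).or_insert(0.0) += 1.0;
            }
            m
        })
        .collect();

    // Document frequency: in how many sentences does each word occur?
    let mut df: HashMap<&str, f32> = HashMap::new();
    for m in &counts {
        for word in m.keys() {
            *df.entry(word.as_str()).or_insert(0.0) += 1.0;
        }
    }
    let n = sentences.len() as f32;

    counts
        .iter()
        .map(|m| {
            let len: f32 = m.values().sum();
            if len == 0.0 {
                return 0.0;
            }
            let total: f32 = m
                .iter()
                .map(|(word, count)| (count / len) * (n / df[word.as_str()]).ln())
                .sum();
            total / len
        })
        .collect()
}

/// Keep the best-scoring fraction of sentences, returned in their original order.
/// Rounding the count and restoring the order are assumptions of this sketch.
fn sketch_summarize<'a>(sentences: &[&'a str], reduction_factor: f32) -> Vec<&'a str> {
    let scores = score_sentences(sentences);
    let keep = ((sentences.len() as f32) * reduction_factor).round() as usize;
    let mut order: Vec<usize> = (0..sentences.len()).collect();
    // Highest score first; the sort is stable, so ties keep document order.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order.truncate(keep);
    order.sort_unstable();
    order.into_iter().map(|i| sentences[i]).collect()
}
```

With the toy input `["the cat sat", "the cat ran", "the dog sat", "zebra"]`, calling `sketch_summarize(&sentences, 0.25)` keeps only `"zebra"`, because its single word is rare and the sentence is short. That bias towards short sentences with rare words is a known weakness of plain TF-IDF scoring.
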
Both `summarize` and `par_summarize` take two parameters as input, `text: &str` and `reduction_factor: f32`, where `text` is the document to summarize and `reduction_factor` is the proportion of sentences to include in the generated summary. For instance, if `reduction_factor = 0.4` and `text` contains 20 sentences, the extracted summary will contain the 8 (40% of 20) most informative sentences from `text`.

* `tfidf-text-summarizer::summarize`: Computes the TF-IDF score of each word in `text` and then ranks the sentences by the normalized sum of the TF-IDF scores of the words in each sentence. The normalization factor is the number of tokens in the sentence. It returns a `String` containing the extracted summary.

* `tfidf-text-summarizer::par_summarize`: Similar to `summarize`, but uses [Rayon](https://github.com/rayon-rs/rayon) to parallelize some operations in the summarization pipeline. For larger texts, `par_summarize` outperforms `summarize` on a multi-core system.

```rust
use summarizer::{summarize,par_summarize} ;
use std::fs as fs ;

fn main() {
    // Read the document to summarize from disk
    let text: String = fs::read_to_string( "wiki.txt" )
        .expect( "Could not read wiki.txt" ) ;
    // Keep 40% of the sentences
    let reduction_factor: f32 = 0.4 ;
    // Use summarize or par_summarize here
    let summary: String = summarize( text.as_str() , reduction_factor ) ;
    println!( "Summary is {}" , summary ) ;
}
```

### Usage with C/C++ codebases and with a Debian package

#### Building an executable with GCC

Static libraries can be generated by setting `crate_type = [ "staticlib" ]` in `Cargo.toml`. The libraries (`.a` archives), along with C header files (generated with `cbindgen`), let us use the `summarize` and `par_summarize` functions in C/C++ projects.

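
To illustrate how a Rust function can be exposed through a C ABI in general, here is a minimal sketch. It is not the code in this repository: `summarize_ffi`, `free_summary` and `placeholder_summarize` are hypothetical names, the placeholder merely returns the first sentence, and the `(pointer, length, factor)` parameter layout is an assumption modelled loosely on how the C example calls `summarize`.

```rust
use std::ffi::CString;
use std::os::raw::c_char;
use std::slice;

/// Stand-in for the real summarizer (hypothetical): keeps the text up to the first period.
fn placeholder_summarize(text: &str, _reduction_factor: f32) -> String {
    match text.find('.') {
        Some(end) => text[..=end].to_string(),
        None => text.to_string(),
    }
}

/// Hypothetical C-ABI wrapper. A real export would also carry `#[no_mangle]`
/// (written `#[unsafe(no_mangle)]` in the 2024 edition) so that the C linker
/// can find the symbol; it is left out so the snippet compiles on any edition.
///
/// Safety: `text` must point to at least `length` readable bytes.
pub unsafe extern "C" fn summarize_ffi(
    text: *const c_char,
    length: usize,
    reduction_factor: f32,
) -> *mut c_char {
    if text.is_null() {
        return std::ptr::null_mut();
    }
    // View the caller's buffer as bytes; no NUL terminator is required.
    let bytes = unsafe { slice::from_raw_parts(text as *const u8, length) };
    let input = String::from_utf8_lossy(bytes);
    let summary = placeholder_summarize(&input, reduction_factor);
    // Hand ownership of a NUL-terminated copy to the caller.
    match CString::new(summary) {
        Ok(c_string) => c_string.into_raw(),
        Err(_) => std::ptr::null_mut(), // summary contained an interior NUL byte
    }
}

/// Hypothetical companion that returns the string to Rust so it can be freed.
///
/// Safety: `ptr` must be null or a pointer previously returned by `summarize_ffi`.
pub unsafe extern "C" fn free_summary(ptr: *mut c_char) {
    if !ptr.is_null() {
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

Passing the length explicitly means the C caller does not need to NUL-terminate its buffer. A string created by `CString::into_raw` has to be released on the Rust side, which is why the sketch pairs the export with a `free_summary` function rather than relying on C's `free`.
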

* [Using the Rust-generated static library with C](https://github.com/shubham0204/tfidf-summarizer-rs/tree/main/examples/c/README.md)

Using the `summarize` function in C code (see [`examples/c`](https://github.com/shubham0204/tfidf-summarizer-rs/tree/main/examples/c) for a complete example):

```c
#include "summarizer.h"
#include <stdio.h>
#include <stdlib.h>

int main( int argc , char** argv ) {
    if ( argc < 2 ) {
        fprintf( stderr , "Usage: %s <text-file>\n" , argv[ 0 ] ) ;
        return 1 ;
    }
    char* filename = argv[ 1 ] ;
    FILE* file_ptr = fopen( filename , "r" ) ;
    if ( file_ptr == NULL ) {
        perror( "fopen" ) ;
        return 1 ;
    }

    // Determine the file size, then read the whole file into a buffer
    fseek( file_ptr , 0 , SEEK_END ) ;
    long size = ftell( file_ptr ) ;
    fseek( file_ptr , 0 , SEEK_SET ) ;
    char* buffer = (char*) calloc( size , sizeof(char) );
    fread( buffer , sizeof( char ) , size , file_ptr ) ;
    fclose( file_ptr ) ;

    // Summarize the text, keeping 50% of the sentences
    const char* summarized_text = (char*) summarize( buffer , size , 0.5f ) ;
    printf( "%s \n" , summarized_text ) ;

    free( buffer ) ;
    return 0 ;
}
```

#### Building the Debian package

Following the steps mentioned in [Using the static library with C/C++](https://github.com/shubham0204/tfidf-summarizer-rs/tree/main/examples/c/README.md), we can copy the C header file `summarizer.h` and the static library into the `debian` directory,

```
$> cp target/x86_64-unknown-linux-gnu/release/libsummarizer.a debian/summarizer/
$> cp examples/c/summarizer.h debian/summarizer/
```

##### A. Packaging the header and library

We can now build a Debian package which performs the following tasks after it is installed on the user's system:

1. Copy `libsummarizer.a` to `/usr/local/lib/`
2. Copy `summarizer.h` to `/usr/include/`

These two steps are accomplished with the `postinst` script in `debian/summarizer/DEBIAN/`,

```
#!/bin/bash
cp ../libsummarizer.a /usr/local/lib/
cp ../summarizer.h /usr/include/
```

The `control` file in the same directory provides information about the package,

```
Package: Summarizer
Version: 0.0.1
Maintainer: Shubham Panchal
Architecture: amd64
Description: A text summarizer based on TF-IDF
```

To build the package with the `dpkg-deb` utility and then rename it, we can write a simple Bash script `build_package.sh`,

```
#!/bin/bash
dpkg-deb --build summarizer
mkdir -p packages
mv summarizer.deb packages/summarizer-v0.0.1-amd64.deb
```

To build the package, execute `build_package.sh`,

```
$> cd debian
$ debian> bash build_package.sh
```

The package `summarizer-v0.0.1-amd64.deb` will be generated in the `debian/packages` directory.

##### B. Installing the Debian package

To install the Debian package, use the `dpkg` utility,

```
$> sudo dpkg -i summarizer-v0.0.1-amd64.deb
```

### Usage in Android

We can compile the Rust code to shared libraries targeting the `armeabi-v7a` and `arm64` architectures. After installing the Android NDK package and the necessary toolchains with `rustup`, we can compile the `.so` libraries. See the `android` module in `src/lib.rs` for the JNI functions.

See [`examples/android/README.md`](https://github.com/shubham0204/tfidf-summarizer-rs/tree/main/examples/android#compiling-for-android-targets) for more details.

## Contributing

The project can be improved on the following points (taken from the blog):

- [ ] The current implementation requires the nightly build of Rust, only because of a single dependency, `punkt`. `punkt` is a sentence tokenizer which is required to determine sentence boundaries in the text, following which other computations are made. If `punkt` can be built with stable Rust, the current implementation will no longer require nightly Rust.
- [ ] Adding newer metrics to rank sentences, especially ones which capture inter-sentence dependencies. TF-IDF is not the most accurate scoring function and has its own limitations. Building sentence graphs and using them to score sentences can greatly enhance the overall quality of the extracted summary.
- [ ] The summarizer has not been benchmarked against a known dataset. ROUGE scores R1, R2 and RL are frequently used to assess the quality of a generated summary against standard datasets like the New York Times dataset or the CNN/Daily Mail dataset. Measuring performance against standard benchmarks will give developers more clarity about, and confidence in, the implementation.
- [ ] Completing the Python implementation in `examples/python`.

## Useful External Resources

* [NLP — Text Summarization using NLTK: TF-IDF Algorithm](https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3)
* [Text Summarization by Ashin Shakya](https://medium.com/@ashins1997/text-summarization-f2542bc6a167)
* [Automatic Extractive Text Summarization using TF-IDF by ASHNA JAIN on Voice Tech Podcast](https://medium.com/voice-tech-podcast/automatic-extractive-text-summarization-using-tfidf-3fc9a7b26f5)
* [Text2Summary API for Android](https://github.com/shubham0204/Text2Summary-Android)