Crates.io | tfidf-text-summarizer |
lib.rs | tfidf-text-summarizer |
version | 0.0.3 |
source | src |
created_at | 2023-11-13 01:25:48.190314 |
updated_at | 2023-12-15 02:23:53.59441 |
description | Implementation of an extractive text summarization system which uses TF-IDF scores of words present in the text to rank sentences and generate a summary |
homepage | https://github.com/shubham0204/tfidf-summarizer-rs |
repository | https://github.com/shubham0204/tfidf-summarizer-rs |
max_upload_size | |
id | 1033181 |
size | 59,839 |
Implementation of an extractive text summarization system which uses TF-IDF scores of words present in the text to rank sentences and generate a summary
[!NOTE] Do read the blog on TowardsDataScience
Compiling this project requires Rust's nightly toolchain (required by punkt, a dependency of this project), which can be installed and enabled with rustup:
$> rustup toolchain install nightly
$> cargo new <crate-name>
$> cd <crate-name>
$ crate-name> rustup override set nightly
You may check the official guides for Nightly builds and overrides
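To confirm that the override is active, rustup show prints the toolchain in use for the current directory:

$ crate-name> rustup show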
Add the dependency tfidf-text-summarizer = "0.0.3" to your project's Cargo.toml:
[package]
...
[dependencies]
tfidf-text-summarizer = "0.0.3"
and then execute cargo build to download the crate (and its dependencies punkt and rayon).
The crate provides two functions to extract summaries from a given text. Both functions take two parameters as input, text: &str and reduction_factor: f32, where text is the document whose summary has to be generated and reduction_factor is the relative proportion of sentences that will be included in the generated summary. For instance, if reduction_factor = 0.4 and the number of sentences in text is 20, then the extracted summary will contain the top 8 (40% of 20) most informative sentences from text.
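The arithmetic, spelled out as a small illustrative snippet (not the crate's exact code):

fn main() {
    // 40% of a 20-sentence document -> the 8 highest-scoring sentences
    let num_sentences: usize = 20;
    let reduction_factor: f32 = 0.4;
    let kept = (num_sentences as f32 * reduction_factor) as usize;
    println!("summary keeps {} of {} sentences", kept, num_sentences);
}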
tfidf-text-summarizer::summarize: Computes the TF-IDF score of each word in text and then ranks each sentence by the sum of the TF-IDF scores of the words it contains, normalized by the number of tokens present in the sentence. It returns a String representing the extracted summary.
tfidf-text-summarizer::par_summarize: Similar to summarize, but uses Rayon to parallelize some operations in the summarization pipeline. For larger texts, par_summarize outperforms summarize on a multi-core system.
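Conceptually, the per-sentence ranking described above boils down to something like the sketch below; score_sentence is a hypothetical helper shown for illustration, not part of the crate's public API:

use std::collections::HashMap;

// Score a sentence as the sum of its tokens' TF-IDF scores,
// normalized by the number of tokens in the sentence.
fn score_sentence(tokens: &[&str], tfidf: &HashMap<String, f32>) -> f32 {
    if tokens.is_empty() {
        return 0.0;
    }
    let sum: f32 = tokens
        .iter()
        .map(|token| tfidf.get(*token).copied().unwrap_or(0.0))
        .sum();
    sum / tokens.len() as f32
}

A complete usage example: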
use summarizer::{summarize, par_summarize};
use std::fs;

fn main() {
    // Read the document whose summary has to be generated
    let text: String = fs::read_to_string("wiki.txt")
        .expect("Could not read wiki.txt");
    let reduction_factor: f32 = 0.4;
    // Use summarize or par_summarize here
    let summary: String = summarize(text.as_str(), reduction_factor);
    println!("Summary is {}", summary);
}
Static libraries can be generated by setting crate-type = [ "staticlib" ] in the [lib] section of Cargo.toml. The resulting .a archive, along with a C header file generated with cbindgen, lets us use the summarize and par_summarize functions in C/C++ projects.
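For reference, the relevant Cargo.toml section might look like the snippet below (the library name summarizer is an assumption, taken from the examples that follow):

[lib]
name = "summarizer"
crate-type = [ "staticlib" ]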
Using the summarize method in C code (see examples/c for a complete example):
#include "summarizer.h"
#include <stdlib.h>
#include <stdio.h>
int main( int argc , char** argv ) {
char* filename = argv[ 1 ] ;
FILE* file_ptr = fopen( filename , "r" ) ;
fseek( file_ptr , 0 , SEEK_END ) ;
long size = ftell( file_ptr ) ;
fseek( file_ptr , 0 , SEEK_SET ) ;
char* buffer = (char*) calloc( size , sizeof(char) );
fread( buffer , sizeof( char ) , size , file_ptr ) ;
fclose( file_ptr ) ;
const char* summarized_text = (char*) summarize( buffer , size , 0.5f ) ;
printf( "%s \n" , summarized_text ) ;
return 0 ;
}
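Assuming the example is saved as main.c and that libsummarizer.a and summarizer.h sit alongside it, a build command might look like this (the extra system libraries are usually needed when linking a Rust static library on Linux; adjust to your setup):

$> gcc main.c libsummarizer.a -lpthread -ldl -lm -o summarizer_demo
$> ./summarizer_demo wiki.txt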
Following the steps mentioned in Using the static library with C/C++, we can copy the C header file summarizer.h and the static library into the debian directory:
$> cp target/x86_64-unknown-linux-gnu/release/libsummarizer.a debian/summarizer/
$> cp examples/c/summarizer.h debian/summarizer/
We can now build a Debian package which will perform the following tasks after its installation on the user's system:
- copy libsummarizer.a to /usr/local/lib/
- copy summarizer.h to /usr/include/
These two steps are accomplished with the postinst script in debian/summarizer/DEBIAN/:
#!/bin/bash
cp ../libsummarizer.a /usr/local/lib/
cp ../summarizer.h /usr/include/
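Note that dpkg-deb expects maintainer scripts such as postinst to be executable, so it may be necessary to mark it as such:

$> chmod 0755 debian/summarizer/DEBIAN/postinst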
The control file in the same directory provides information about the package:
Package: summarizer
Version: 0.0.1
Maintainer: Shubham Panchal
Architecture: amd64
Description: A text summarizer based on TF-IDF
To build the package with the dpkg-deb utility and then rename it, we can write a simple Bash script build_package.sh:
#!/bin/bash
dpkg-deb --build summarizer
mkdir -p packages
mv summarizer.deb packages/summarizer-v0.0.1-amd64.deb
To build the package, execute build_package.sh:
$> cd debian
$ debian> bash build_package.sh
The package summarizer-v0.0.1-amd64.deb will be generated in the debian/packages directory.
To install the Debian package, use the dpkg utility:
$> sudo dpkg -i summarizer-v0.0.1-amd64.deb
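Once installed, dpkg -L lists the files the package placed on the system, which is a quick way to confirm that the library and header landed in the expected locations:

$> dpkg -L summarizer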
We can compile the Rust code to shared libraries targeting the armeabi-v7a and arm64-v8a architectures. After installing the Android NDK package and the necessary toolchains with rustup, we can compile the .so libraries. See the android module in src/lib.rs for the JNI functions, and examples/android/README.md for more details.
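For orientation, bindings of this kind are usually written with the jni crate; the sketch below is illustrative only, and the Java class name, method name, and exact signatures are assumptions rather than the project's actual code (see src/lib.rs for the real functions):

use jni::objects::{JClass, JString};
use jni::sys::{jfloat, jstring};
use jni::JNIEnv;

// Hypothetical JNI export: receives the text and reduction factor from
// Java/Kotlin, runs the summarizer, and returns the summary as a Java string.
#[no_mangle]
pub extern "system" fn Java_com_example_Summarizer_summarize(
    env: JNIEnv,
    _class: JClass,
    text: JString,
    reduction_factor: jfloat,
) -> jstring {
    let text: String = env
        .get_string(text)
        .expect("Could not read Java string")
        .into();
    let summary = summarizer::summarize(text.as_str(), reduction_factor);
    env.new_string(summary)
        .expect("Could not create Java string")
        .into_inner()
}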
The project can be improved on the following points (taken from the blog):
- The current implementation requires the nightly build of Rust only because of a single dependency, punkt. punkt is a sentence tokenizer which is required to determine sentence boundaries in the text, following which other computations are made. If punkt can be built with stable Rust, the current implementation will no longer require nightly Rust.
- Adding newer metrics to rank sentences, especially ones which capture inter-sentence dependencies. TF-IDF is not the most accurate scoring function and has its own limitations. Building sentence graphs and using them to score sentences can greatly enhance the overall quality of the extracted summary.
- The summarizer has not been benchmarked against a known dataset. ROUGE scores (R1, R2 and RL) are frequently used to assess the quality of a generated summary against standard datasets like the New York Times dataset or the CNN/Daily Mail dataset. Measuring performance against standard benchmarks would give developers more clarity about, and confidence in, the implementation.
- Completing the Python implementation in examples/python.