| Field | Value |
|---|---|
| Crates.io / lib.rs | keyword_extraction |
| version | 1.5.0 |
| created_at | 2023-05-20 |
| updated_at | 2024-10-12 |
| description | Collection of algorithms for keyword extraction from text |
| homepage | https://github.com/tugascript/keyword-extraction-rs |
| repository | https://github.com/tugascript/keyword-extraction-rs |
| size | 172,482 |
This is a simple NLP library providing a collection of unsupervised keyword extraction algorithms: TF-IDF, RAKE, TextRank, YAKE, and Co-occurrence.
Add the library to your `Cargo.toml`:

```toml
[dependencies]
keyword_extraction = "1.5.0"
```

Or use `cargo add`:

```shell
cargo add keyword_extraction
```
It is possible to enable or disable features:

- `tf_idf`: TF-IDF algorithm;
- `rake`: RAKE algorithm;
- `text_rank`: TextRank algorithm;
- `yake`: YAKE algorithm;
- `co_occurrence`: Co-occurrence algorithm;
- `parallel`: parallelization of the algorithms with Rayon;
- `all`: all algorithms and helpers.

Default features: `["tf_idf", "rake", "text_rank"]`. By default all algorithms apart from `co_occurrence` and `yake` are enabled.

NOTE: the `parallel` feature is only recommended for large documents; it trades memory for computation resources.
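To opt in to a non-default subset, the usual Cargo feature syntax applies. A sketch using the feature names listed above (enabling only YAKE plus Rayon parallelism):

```toml
[dependencies]
keyword_extraction = { version = "1.5.0", default-features = false, features = ["yake", "parallel"] }
```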
For the stop words, you can use the `stop-words` crate:

```toml
[dependencies]
stop-words = "0.8.0"
```

For example, for English:

```rust
use stop_words::{get, LANGUAGE};

fn main() {
    let stop_words = get(LANGUAGE::English);
    let punctuation: Vec<String> = [
        ".", ",", ":", ";", "!", "?", "(", ")", "[", "]", "{", "}", "\"", "'",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect();
    // ...
}
```
Create a `TfIdfParams` enum, which can be one of the following:

- `TfIdfParams::UnprocessedDocuments`;
- `TfIdfParams::ProcessedDocuments`;
- `TfIdfParams::TextBlock`;

```rust
use keyword_extraction::tf_idf::{TfIdf, TfIdfParams};

fn main() {
    // ... stop_words & punctuation
    let documents: Vec<String> = vec![
        "This is a test document.".to_string(),
        "This is another test document.".to_string(),
        "This is a third test document.".to_string(),
    ];

    let params = TfIdfParams::UnprocessedDocuments(&documents, &stop_words, Some(&punctuation));
    let tf_idf = TfIdf::new(params);
    let ranked_keywords: Vec<String> = tf_idf.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = tf_idf.get_ranked_word_scores(10);
    // ...
}
```
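For context on what the scores mean: TF-IDF weights a term's frequency in one document by the inverse of how many documents contain it, so corpus-wide words score near zero. A minimal stdlib-only sketch of that idea (not the crate's implementation; the helper name `tf_idf` is illustrative):

```rust
use std::collections::HashMap;

/// TF-IDF for one document against a corpus:
/// tf(t) = count(t in doc) / len(doc)
/// idf(t) = ln(num_docs / num_docs_containing(t))
fn tf_idf(doc: &[&str], corpus: &[Vec<&str>]) -> HashMap<String, f64> {
    let mut counts: HashMap<&str, f64> = HashMap::new();
    for &word in doc {
        *counts.entry(word).or_insert(0.0) += 1.0;
    }
    let doc_len = doc.len() as f64;
    let num_docs = corpus.len() as f64;
    counts
        .into_iter()
        .map(|(word, count)| {
            // document frequency: how many documents contain the word
            let df = corpus.iter().filter(|d| d.contains(&word)).count() as f64;
            (word.to_string(), (count / doc_len) * (num_docs / df).ln())
        })
        .collect()
}

fn main() {
    let corpus = vec![
        vec!["this", "is", "a", "test", "document"],
        vec!["this", "is", "another", "test", "document"],
    ];
    let scores = tf_idf(&corpus[0], &corpus);
    // "a" appears only in the first document, so it outscores
    // "test", which appears in both (idf = ln(1) = 0).
    assert!(scores["a"] > scores["test"]);
    println!("{:?}", scores);
}
```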
Create a `RakeParams` enum, which can be one of the following:

- `RakeParams::WithDefaults`;
- `RakeParams::WithDefaultsAndPhraseLength`;
- `RakeParams::All`;

```rust
use keyword_extraction::rake::{Rake, RakeParams};

fn main() {
    // ... stop_words
    let text = r#"
This is a test document.
This is another test document.
This is a third test document.
"#;

    let rake = Rake::new(RakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = rake.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = rake.get_ranked_word_scores(10);
    // ...
}
```
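For background, RAKE splits the text into candidate phrases at stop words and punctuation, then scores each word as degree/frequency, where a word's degree grows with the length of the phrases it appears in. A stdlib-only sketch of that core scoring step (not the crate's implementation; `rake_word_scores` is an illustrative name):

```rust
use std::collections::HashMap;

/// Score words as degree / frequency over candidate phrases,
/// the core word-scoring rule of RAKE.
fn rake_word_scores(phrases: &[Vec<&str>]) -> HashMap<String, f64> {
    let mut freq: HashMap<&str, f64> = HashMap::new();
    let mut degree: HashMap<&str, f64> = HashMap::new();
    for phrase in phrases {
        for &word in phrase {
            *freq.entry(word).or_insert(0.0) += 1.0;
            // degree counts co-occurrences within the phrase,
            // the word itself included
            *degree.entry(word).or_insert(0.0) += phrase.len() as f64;
        }
    }
    freq.iter()
        .map(|(&word, &f)| (word.to_string(), degree[word] / f))
        .collect()
}

fn main() {
    // candidate phrases already split on stop words, e.g. from
    // "keyword extraction works; rust is fast"
    let phrases = vec![
        vec!["keyword", "extraction", "works"],
        vec!["rust"],
        vec!["fast"],
    ];
    let scores = rake_word_scores(&phrases);
    // words in longer phrases get higher degree, hence higher scores
    assert_eq!(scores["keyword"], 3.0); // degree 3 / frequency 1
    assert_eq!(scores["rust"], 1.0);    // degree 1 / frequency 1
}
```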
Create a `TextRankParams` enum, which can be one of the following:

- `TextRankParams::WithDefaults`;
- `TextRankParams::WithDefaultsAndPhraseLength`;
- `TextRankParams::All`;

```rust
use keyword_extraction::text_rank::{TextRank, TextRankParams};

fn main() {
    // ... stop_words
    let text = r#"
This is a test document.
This is another test document.
This is a third test document.
"#;

    let text_rank = TextRank::new(TextRankParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = text_rank.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = text_rank.get_ranked_word_scores(10);
}
```
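For intuition, TextRank builds a word co-occurrence graph and runs a PageRank-style power iteration over it, so words connected to many well-connected words rank highest. A stdlib-only sketch of that iteration under simplified assumptions (unweighted edges, adjacency as neighbor lists; not the crate's implementation):

```rust
/// A few power-iteration steps of the PageRank-style update
/// used by TextRank over a word co-occurrence graph.
fn text_rank(neighbors: &[Vec<usize>], iters: usize, damping: f64) -> Vec<f64> {
    let n = neighbors.len();
    let mut scores = vec![1.0; n];
    for _ in 0..iters {
        let prev = scores.clone();
        for (i, score) in scores.iter_mut().enumerate() {
            // sum each in-neighbor's score divided by its out-degree
            let sum: f64 = neighbors
                .iter()
                .enumerate()
                .filter(|(_, ns)| ns.contains(&i))
                .map(|(j, ns)| prev[j] / ns.len() as f64)
                .sum();
            *score = (1.0 - damping) + damping * sum;
        }
    }
    scores
}

fn main() {
    // word 0 co-occurs with words 1 and 2; 1 and 2 only with 0
    let graph = vec![vec![1, 2], vec![0], vec![0]];
    let scores = text_rank(&graph, 30, 0.85);
    // the hub word accumulates the highest score
    assert!(scores[0] > scores[1]);
    println!("{:?}", scores);
}
```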
Create a `YakeParams` enum, which can be one of the following:

- `YakeParams::WithDefaults`;
- `YakeParams::All`;

```rust
use keyword_extraction::yake::{Yake, YakeParams};

fn main() {
    // ... stop_words
    let text = r#"
This is a test document.
This is another test document.
This is a third test document.
"#;

    let yake = Yake::new(YakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = yake.get_ranked_keywords(10);
    let ranked_keywords_scores: Vec<(String, f32)> = yake.get_ranked_keyword_scores(10);
    // ...
}
```
I would love your input! I want to make contributing to this project as easy and transparent as possible; please read the CONTRIBUTING.md file for details.
This project is licensed under the GNU Lesser General Public License v3.0. See the COPYING and COPYING.LESSER files for details.