Crates.io | newslookout |
lib.rs | newslookout |
version | 0.4.8 |
source | src |
created_at | 2024-11-08 08:48:34.253239 |
updated_at | 2024-12-06 03:09:12.422019 |
description | A web scraping platform built for news scanning, using LLMs for text processing, powered by Rust |
homepage | https://github.com/sandeep-sandhu/sandeep-sandhu |
repository | https://github.com/sandeep-sandhu/newslookout_rs |
max_upload_size | |
id | 1440994 |
size | 260,092 |
A light-weight web scraping platform built for scanning and processing news and data. It is a Rust port of the Python application of the same name.
This library sets up a multi-threaded web scraping pipeline and executes it as follows: retriever plugins fetch documents from their configured sources in parallel worker threads, and data processing plugins then process each retrieved document in the serial order of priority defined in the config file.
This package enables building a full-fledged multi-threaded web scraping solution that runs in batch mode on very modest resources (e.g. a single-core CPU with less than 4 GB of RAM).
Add this to your Cargo.toml:
[dependencies]
newslookout = "0.4.8"
Get started with just a few lines of code, for example:
use std::env;
use config;
use newslookout;

fn main() {
    if env::args().len() < 2 {
        println!("Usage: newslookout_app <config_file>");
        panic!("Provide the config file as a command-line parameter (expected 2 arguments, but got {})",
               env::args().len()
        );
    }

    let config_file: String = env::args().nth(1).unwrap();
    println!("Loading configuration from file: {}", config_file);

    let app_config: config::Config = newslookout::utils::read_config(config_file);

    let docs_retrieved: Vec<newslookout::document::DocInfo> = newslookout::run_app(app_config);
    // use this collection of retrieved document-information structs for any further custom processing
}
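The returned collection can be fed straight into your own post-processing. As a minimal sketch continuing the example above (only standard Vec methods are used here, since the fields of DocInfo are not covered in this README):

// continuing inside main(), after run_app() returns:
println!("Retrieved {} documents in this batch.", docs_retrieved.len());
for _doc in &docs_retrieved {
    // inspect or export each document-information struct here,
    // e.g. write it to a database or hand it to a downstream job
}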
Declare custom retriever plugins and add them to the pipeline to fetch data using your own customised logic, for example:
fn run_pipeline(config: &config::Config) -> Vec<Document> {
    newslookout::init_logging(config);
    newslookout::init_pid_file(config);
    log::info!("Starting the custom pipeline");

    let mut retriever_plugins = newslookout::pipeline::load_retriever_plugins(config);
    let mut data_proc_plugins = newslookout::pipeline::load_dataproc_plugins(config);

    // add a custom data retriever plugin:
    retriever_plugins.push(my_plugin);

    let docs_retrieved = newslookout::pipeline::start_data_pipeline(
        retriever_plugins,
        data_proc_plugins,
        config
    );
    log::info!("Data pipeline completed processing {} documents.", docs_retrieved.len());
    // use docs_retrieved for any further custom processing.

    newslookout::cleanup_pid_file(config);

    docs_retrieved
}
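To wire this custom pipeline into a binary, call it from main() after reading the configuration, in place of run_app(). A minimal sketch, reusing read_config from the earlier example:

fn main() {
    // read the config file path from the command line, as in the earlier example
    let config_file: String = std::env::args().nth(1)
        .expect("Provide the config file as a command-line parameter");
    let app_config: config::Config = newslookout::utils::read_config(config_file);

    let docs = run_pipeline(&app_config);
    println!("Custom pipeline retrieved {} documents.", docs.len());
}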
Similarly, you can also declare and use custom data processing plugins, e.g.:
data_proc_plugins.push(my_own_data_processing);
Note that data processing plugins are run serially, in the order of priority defined in the config file.
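The exact signature expected of a data processing plugin is defined by the crate (see the working plugins in the plugins folder); purely as an illustrative sketch of the pattern, and not the crate's actual plugin API, a processing step could be an ordinary function that transforms a document's text before any LLM-based steps:

// hypothetical sketch only -- adapt the signature to what
// newslookout::pipeline::load_dataproc_plugins / start_data_pipeline actually expect:
fn my_own_data_processing(text: String) -> String {
    // example transformation: collapse runs of whitespace before further processing
    text.split_whitespace().collect::<Vec<&str>>().join(" ")
}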
Pre-built modules are provided for a few websites; these can be readily extended to other websites as required. Refer to their source code in the plugins folder when rolling out your own plugins.
The entire application is driven by its config file. Refer to the example config file in the repository.