| Crates.io | web-parser |
| lib.rs | web-parser |
| version | 0.1.3 |
| created_at | 2025-06-11 22:01:01.454705+00 |
| updated_at | 2025-11-29 16:17:12.427264+00 |
| description | This website parser library allows asynchronous search, fetching and extracting data from web-pages in multiple formats. |
| homepage | |
| repository | https://github.com/fuderis/rs-web-parser |
| max_upload_size | |
| id | 1709163 |
| size | 20,733,040 bytes |
This web page parser library supports asynchronous search, fetching, and extraction of data from web pages in multiple formats.

Web search is provided through the `SearchEngine` trait (feature `search`). The library is well-suited for web scraping and data extraction tasks, offering flexible parsing of HTML, plain text, and JSON to enable comprehensive data gathering from various web sources.

The web search functionality requires the chromedriver tool to be installed!
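To use the search API, the `search` feature has to be enabled in `Cargo.toml`. A minimal sketch of the manifest, assuming the feature name matches the `search` feature mentioned above (check crates.io for current versions):

```toml
[dependencies]
# "search" enables the SearchEngine API (feature name taken from the text above)
web-parser = { version = "0.1", features = ["search"] }
# async runtime used by the examples below
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
# provides the `path!` macro used in the search example; version not pinned here
macron = "*"
```

With the feature enabled and chromedriver in place, a search session looks like this: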
```rust
use web_parser::prelude::*;
use macron::path;

#[tokio::main]
async fn main() -> Result<()> {
    // WEB SEARCH:
    let chrome_path = path!("bin/chromedriver/chromedriver.exe");
    let session_path = path!("%/ChromeDriver/WebSearch");

    // start search engine:
    let mut engine = SearchEngine::<Duck>::new(
        chrome_path,
        Some(session_path),
        false,
    ).await?;

    println!("Searching results..");

    // send search query:
    let results = engine.search(
        "Rust (programming language)",           // query
        &["support.google.com", "youtube.com"],  // black list
        1000                                     // sleep in millis
    ).await;

    // handle search results:
    match results {
        Ok(cites) => {
            println!("Result cites list: {:#?}", cites.get_urls());

            /*
            println!("Reading result pages..");

            let contents = cites.read(
                5,    // cites count to read
                &[    // tag name black list
                    "header", "footer", "style", "script", "noscript",
                    "iframe", "button", "img", "svg"
                ]
            ).await?;

            println!("Results: {contents:#?}");
            */
        }
        Err(e) => eprintln!("Search error: {e}")
    }

    // stop search engine:
    engine.stop().await?;

    Ok(())
}
```
```rust
use web_parser::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // READ PAGE AS HTML DOCUMENT:

    // read website page:
    let mut doc = Document::read("https://example.com/", User::random()).await?;

    // select title:
    let title = doc.select("h1")?.expect("No elements found");
    println!("Title: '{}'", title.text());

    // select descriptions:
    let mut descrs = doc.select_all("p")?.expect("No elements found");
    while let Some(descr) = descrs.next() {
        println!("Description: '{}'", descr.text())
    }

    // READ PAGE AS PLAIN TEXT:
    let text: String = Document::text("https://example.com/", User::random()).await?;
    println!("Text: {text}");

    // READ PAGE AS JSON:
    let json: serde_json::Value = Document::json("https://example.com/", User::random())
        .await?
        .expect("Failed to parse JSON");
    println!("Json: {json}");

    Ok(())
}
```
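Since `Document::json` returns a plain `serde_json::Value`, the standard serde_json accessors can be used to drill into the parsed result. A minimal sketch, assuming a hypothetical page whose JSON body carries a top-level `title` field:

```rust
use serde_json::Value;

// Pull a string field out of a parsed JSON value, if it exists.
// The "title" key is a hypothetical example; real pages will differ.
fn extract_title(json: &Value) -> Option<&str> {
    json.get("title")?.as_str()
}

fn main() {
    // stand-in for a value returned by Document::json
    let json: Value = serde_json::json!({ "title": "Example Domain" });

    match extract_title(&json) {
        Some(title) => println!("Title: {title}"),
        None => println!("No 'title' field found"),
    }
}
```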
Distributed under the MIT license.
You can find me here, and also check out my channel. I welcome your suggestions and feedback!
Copyright (c) 2025 Bulat Sh. (fuderis)