spider_transformations

crates.io: spider_transformations
lib.rs: spider_transformations
version: 2.13.7
source: src
created_at: 2024-09-21 11:37:35.50831
updated_at: 2024-11-09 18:29:09.039979
description: Transformation utils to use for Spider Web Crawler.
repository: https://github.com/spider-rs/spider-transformations
id: 1382144
size: 204,701
Jeff Mendez (j-mendez)

documentation

https://docs.rs/spider_transformations

README

spider_transformations

A Rust transformation library for Spider Cloud, built for performance, AI, and multiple locales. Spider Cloud uses it for data cleaning.

Usage

[dependencies]
spider_transformations = "0"

use spider_transformations::transformation::content;

fn main() {
    // `page` is the spider page object, e.g. one received while streaming a crawl.
    let conf = content::TransformConfig::default();
    // Transform the page with the default configuration; the last two arguments are left as `None`.
    let content = content::transform_content(&page, &conf, &None, &None);
}

Transform types

  1. Markdown
  2. Commonmark
  3. Text
  4. Markdown (Text Map) or HTML2Text
  5. WIP: HTML2XML

Enhancements

  1. Readability
  2. Encoding

Chunking

The transformation module includes several chunking utilities.
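
To illustrate what chunking means here, the following is a minimal, self-contained sketch that splits text into chunks of at most a fixed number of words. The function `chunk_by_words` and its `max_words` parameter are hypothetical names chosen for this example; they are not taken from the crate, whose own chunking helpers may use different names and strategies.

```rust
/// Split `text` into chunks of at most `max_words` whitespace-separated words.
/// Hypothetical helper for illustration only; not part of spider_transformations.
fn chunk_by_words(text: &str, max_words: usize) -> Vec<String> {
    // `slice::chunks` panics on a chunk size of zero, so handle that case first.
    if max_words == 0 {
        return Vec::new();
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words)
        .map(|group| group.join(" "))
        .collect()
}

fn main() {
    let text = "Spider crawls pages and transforms them into clean text";
    for (i, chunk) in chunk_by_words(text, 4).iter().enumerate() {
        println!("chunk {i}: {chunk}");
    }
}
```

Splitting on word boundaries keeps each chunk readable and avoids cutting inside a multi-byte character, which can matter when chunks are passed to a model with a size limit.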

This project includes rewrites and forks of html2md and html2text, made for performance and bug fixes.

License

MIT
