oscar-io

Crates.io: oscar-io
lib.rs: oscar-io
version: 0.4.0
source: src
created_at: 2022-05-13 09:08:59.665991
updated_at: 2023-08-08 10:08:21.152194
description: Readers/Writers for OSCAR Corpora.
homepage: https://oscar-corpus.com
repository: https://github.com/oscar-corpus/oscar-io
max_upload_size:
id: 585723
size: 3,756,114
Julien "uj" Abadji (Uinelj)

documentation

https://docs.rs/oscar-io

README

oscar-io

Types and IO (Reader/Writer) for OSCAR Corpus processing and generation.

The crate provides basic abstractions around Corpus items, along with generic readers/writers usable with OSCAR Corpus files. Eventually, it should replace the reader implementations in both Ungoliant and oscar-tools.
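To illustrate what "generic readers/writers" can mean in practice, here is a minimal sketch built only on the standard library. It is not the crate's actual API: `Record`, `RecordReader`, and `RecordWriter` are hypothetical names, and the one-record-per-line layout is an assumption made for the example. The point is that a reader generic over `BufRead` and a writer generic over `Write` work unchanged over files, in-memory buffers, or compressing adapters.

```rust
use std::io::{self, BufRead, Cursor, Write};

/// Hypothetical corpus item; one record per line is an assumption of this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    content: String,
}

/// Hypothetical reader, generic over any buffered source.
struct RecordReader<R: BufRead> {
    lines: io::Lines<R>,
}

impl<R: BufRead> RecordReader<R> {
    fn new(src: R) -> Self {
        Self { lines: src.lines() }
    }
}

impl<R: BufRead> Iterator for RecordReader<R> {
    type Item = io::Result<Record>;

    // Yields one record per line, forwarding any I/O error to the caller.
    fn next(&mut self) -> Option<Self::Item> {
        self.lines
            .next()
            .map(|line| line.map(|content| Record { content }))
    }
}

/// Hypothetical writer, generic over any byte sink.
struct RecordWriter<W: Write> {
    dst: W,
}

impl<W: Write> RecordWriter<W> {
    fn new(dst: W) -> Self {
        Self { dst }
    }

    // Writes the record followed by a newline separator.
    fn write(&mut self, record: &Record) -> io::Result<()> {
        self.dst.write_all(record.content.as_bytes())?;
        self.dst.write_all(b"\n")
    }

    // Hands back the underlying sink once writing is done.
    fn into_inner(self) -> W {
        self.dst
    }
}

fn main() -> io::Result<()> {
    // Write two records into an in-memory buffer...
    let mut writer = RecordWriter::new(Vec::new());
    writer.write(&Record { content: "first document".to_string() })?;
    writer.write(&Record { content: "second document".to_string() })?;
    let bytes = writer.into_inner();

    // ...then read them back through the generic reader.
    let reader = RecordReader::new(Cursor::new(bytes));
    let records: Vec<Record> = reader.collect::<io::Result<_>>()?;
    assert_eq!(records.len(), 2);
    assert_eq!(records[0].content, "first document");
    Ok(())
}
```

Because both types only depend on the `BufRead` and `Write` traits, swapping the in-memory buffer for a file or a compression wrapper requires no change to the reader or writer themselves.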

Features

oscar-io aims to provide readers/writers for numerous types of OSCAR Corpora.

OSCAR v2

  • Reader
    • Uncompressed [oscar_doc::Reader::new]
    • GZipped [oscar_doc::Reader::from_gzip]
    • Parquet
  • Writer
    • Uncompressed [oscar_doc::Writer::new]
    • GZipped [oscar_doc::Writer::new] (using a [GzEncoder] as the underlying writer; from_gzip not yet implemented)
    • Parquet
  • SplitReader (should be unified with Reader via a split_size: Option<u64> parameter)
    • Uncompressed
    • GZipped
  • SplitWriter (Same)
    • Uncompressed
    • GZipped
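The `split_size: Option<u64>` idea mentioned above can be sketched as follows. This is a hypothetical illustration, not the crate's implementation: `SplittingWriter` and `write_record` are invented names, and in-memory buffers stand in for output files. With `None` the writer never splits; with `Some(limit)` it starts a new part whenever the next record would push the current part past the limit.

```rust
use std::io::{self, Write};

/// Hypothetical writer where splitting is optional:
/// `split_size == None` means "write everything to a single part".
struct SplittingWriter {
    split_size: Option<u64>,
    /// In-memory stand-ins for output files (an assumption of this sketch).
    parts: Vec<Vec<u8>>,
}

impl SplittingWriter {
    fn new(split_size: Option<u64>) -> Self {
        Self {
            split_size,
            parts: vec![Vec::new()],
        }
    }

    /// Writes one record plus a newline. A new part is started first when a
    /// limit is set, the current part is non-empty, and the record would
    /// push it past that limit.
    fn write_record(&mut self, record: &str) -> io::Result<()> {
        let needed = record.len() as u64 + 1; // +1 for the newline
        let current_len = self.parts.last().map_or(0, |p| p.len() as u64);

        if let Some(limit) = self.split_size {
            if current_len > 0 && current_len + needed > limit {
                self.parts.push(Vec::new());
            }
        }

        let part = self.parts.last_mut().expect("at least one part exists");
        part.write_all(record.as_bytes())?;
        part.write_all(b"\n")
    }
}

fn main() -> io::Result<()> {
    let records = ["aaaa", "bbbb", "cccc"]; // 5 bytes each with the newline

    let mut unsplit = SplittingWriter::new(None);
    let mut split = SplittingWriter::new(Some(10));
    for r in records.iter() {
        unsplit.write_record(r)?;
        split.write_record(r)?;
    }

    // No limit: a single part. 10-byte limit: two records fit, the third rolls over.
    assert_eq!(unsplit.parts.len(), 1);
    assert_eq!(split.parts.len(), 2);
    Ok(())
}
```

Folding the split behaviour into an `Option` keeps a single code path for both cases, which is one way separate split and non-split types could be merged.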

OSCAR v1.1

  • Reader
  • Writer
  • SplitReader (should be unified with Reader via a split_size: Option<u64> parameter)
  • SplitWriter (Same)

OSCAR v1

  • Reader
  • Writer
  • SplitReader
  • SplitWriter
Commit count: 64
