# UTF-8 Buffered Reader

This crate provides functions to read UTF-8 text from any type implementing [`io::BufRead`][io::BufRead] through a trait, [`BufRead`][BufRead], without waiting for newline delimiters. These functions take advantage of buffering and return either `&`[`str`][str] or [`char`][char]s. Each has an associated iterator, and some also have an equivalent of a [`Map`][Map] iterator that avoids allocation and cloning.

[![crates.io](http://img.shields.io/crates/v/utf8-bufread.svg)](https://crates.io/crates/utf8_bufread)
[![docs.rs](https://docs.rs/utf8-bufread/badge.svg)](https://docs.rs/utf8-bufread/latest/utf8-bufread)
[![build status](https://gitlab.com/Austreelis/utf8-bufread/badges/main/pipeline.svg)](https://gitlab.com/Austreelis/utf8-bufread/-/commits/main)

# Usage

Add this crate as a dependency in your `Cargo.toml`:

```toml
[dependencies]
utf8-bufread = "1.0.0"
```

The simplest way to read a file using this crate may be something along the following lines:

```rust
use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}
```

## Reading arbitrary-length string slices

The [`read_str`][read_str] function returns a `&`[`str`][str] of arbitrary length (up to the reader's buffer capacity) read from the inner reader, without cloning data, unless a valid codepoint ends up cut at the end of the reader's buffer. Its associated iterator can be obtained by calling [`str_iter`][str_iter]; since that iterator clones the data at each iteration, [`str_map`][str_map] is also provided.

## Reading codepoints

The [`read_char`][read_char] function returns a [`char`][char] read from the inner reader.
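
To make the idea of reading a single codepoint from a buffered reader concrete, here is a minimal sketch that uses only the standard library. It is not this crate's implementation: `read_one_char` is a hypothetical helper written for illustration, and the 2-byte buffer capacity is an assumption chosen only to force multi-byte codepoints to be cut across buffer refills.

```rust
use std::io::{self, BufRead, BufReader, Cursor};

/// Hypothetical helper (not part of utf8-bufread): reads one codepoint from
/// any `io::BufRead`, returning `Ok(None)` at EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    // Peek at the leading byte without consuming it.
    let first = match reader.fill_buf()?.first() {
        Some(&b) => b,
        None => return Ok(None), // EOF
    };
    // The leading byte tells us how many bytes the codepoint should span.
    let width = match first {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid leading byte",
            ))
        }
    };
    // `read_exact` keeps reading across buffer refills, so a codepoint cut at
    // the end of the buffer is still assembled in full.
    let mut bytes = [0u8; 4];
    io::Read::read_exact(reader, &mut bytes[..width])?;
    // `from_utf8` rejects bad continuation bytes, overlong forms and surrogates.
    let s = std::str::from_utf8(&bytes[..width])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}

fn main() -> io::Result<()> {
    // A 2-byte buffer means 3-byte codepoints such as '老' never fit in one refill.
    let mut reader = BufReader::with_capacity(2, Cursor::new("Löwe 老虎"));
    let mut out = String::new();
    while let Some(c) = read_one_char(&mut reader)? {
        out.push(c);
    }
    assert_eq!(out, "Löwe 老虎");
    Ok(())
}
```

The sketch copies each codepoint into a small stack array, which is cheap for a single `char` but is exactly the kind of copy a `&str`-returning function can skip whenever the data already sits contiguously in the reader's buffer.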
The iterator associated with [`read_char`][read_char] can be obtained by calling [`char_iter`][char_iter].

## Iterator types

This crate provides several structs for several ways of iterating over the inner reader's data:

- [`StrIter`][StrIter] and [`CodepointIter`][CodepointIter] clone the data on each iteration, but use an [`Rc`][Rc] to check whether the returned [`String`][String] buffer is still in use. If it is not, the buffer is reused to avoid reallocating.
  ```rust
  use std::io::Cursor;
  use utf8_bufread::BufRead;

  let mut reader = Cursor::new("Löwe 老虎 Léopard");
  for s in reader.str_iter().filter_map(|r| r.ok()) {
      // Do something with s ...
      print!("{}", s);
  }
  ```
- [`StrMap`][StrMap] and [`CodepointMap`][CodepointMap] give access to the read data without allocating or copying, but the data cannot then be passed to further iterator adapters.
  ```rust
  use std::io::Cursor;
  use utf8_bufread::BufRead;

  let s = "Löwe 老虎 Léopard";
  let mut reader = Cursor::new(s);
  let count: usize = reader
      .str_map(|s| s.len())
      .filter_map(Result::ok)
      .sum();
  println!("There are {} valid UTF-8 bytes in {}", count, s);
  ```
- [`CharIter`][CharIter] is similar to [`StrIter`][StrIter] and the others, except that it relies on [`char`][char]s implementing [`Copy`][Copy] and thus needs neither a buffer nor the "`Rc` trick".
  ```rust
  use std::io::Cursor;
  use utf8_bufread::BufRead;

  let s = "Löwe 老虎 Léopard";
  let mut reader = Cursor::new(s);
  let count = reader
      .char_iter()
      .filter_map(Result::ok)
      .filter(|c| c.is_lowercase())
      .count();
  assert_eq!(count, 9);
  ```

All these iterators may read data until EOF or an invalid codepoint is found. If valid codepoints are read from the inner reader, they *will* be returned before reporting an error. After encountering an error or EOF, they always return [`None`][option]. They always ignore any [`Interrupted`][Interrupted] error.

# Work in progress

This crate is still a work in progress. Part of its API can be considered stable:

- [`read_str`][read_str], [`read_codepoint`][read_codepoint] and [`read_char`][read_char]'s behavior and signature.
- [`str_iter`][str_iter], [`str_map`][str_map], [`codepoints_iter`][codepoints_iter], [`codepoints_map`][codepoints_map] and [`char_iter`][char_iter]'s behavior and signature.
- [`StrIter`][StrIter], [`StrMap`][StrMap], [`CodepointIter`][CodepointIter], [`CodepointMap`][CodepointMap] and [`CharIter`][CharIter]'s API.

However, some features are still considered unstable:

- `Error`'s behavior, particularly regarding its `kind` and how it avoids data loss (see `leftovers`).

And some features still have to be added:

- A lossy and an unchecked version of `read_*` (see [`from_utf8_lossy`][from_utf8_lossy] & [`from_utf8_unchecked`][from_utf8_unchecked]).
- (Optional) Support for grapheme clusters using the [`unicode-segmentation`][unicode-segmentation] crate, in the same fashion as [`read_codepoint`][read_codepoint].
- I'm open to suggestions, if you have ideas 😉

Given that I'm far from the most experienced developer, you are very welcome to submit issues and merge requests [here](https://gitlab.com/Austreelis/utf8-bufread).

# License

Utf8-BufRead is distributed under the terms of the Apache License 2.0, see the [LICENSE](https://gitlab.com/Austreelis/utf8-bufread/-/blob/main/LICENSE) file in the root directory of this repository.


[io::BufRead]: https://doc.rust-lang.org/std/io/trait.BufRead.html
[str]: https://doc.rust-lang.org/std/primitive.str.html
[char]: https://doc.rust-lang.org/std/primitive.char.html
[Map]: https://doc.rust-lang.org/std/iter/struct.Map.html
[Rc]: https://doc.rust-lang.org/std/rc/struct.Rc.html
[String]: https://doc.rust-lang.org/std/string/struct.String.html
[Copy]: https://doc.rust-lang.org/std/marker/trait.Copy.html
[option]: https://doc.rust-lang.org/std/option/index.html
[Interrupted]: https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.Interrupted
[from_utf8_lossy]: https://doc.rust-lang.org/nightly/alloc/string/struct.String.html#method.from_utf8_lossy
[from_utf8_unchecked]: https://doc.rust-lang.org/nightly/alloc/string/struct.String.html#method.from_utf8_unchecked
[unicode-segmentation]: https://docs.rs/unicode-segmentation/latest/unicode_segmentation/index.html
[BufRead]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html
[read_str]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_str
[str_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.str_iter
[str_map]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.str_map
[read_codepoint]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_codepoint
[codepoints_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.codepoints_iter
[codepoints_map]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.codepoints_map
[read_char]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_char
[char_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.char_iter
[StrIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.StrIter.html
[StrMap]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.StrMap.html
[CodepointIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CodepointIter.html
[CodepointMap]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CodepointMap.html
[CharIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CharIter.html