| Crates.io | utf8-bufread |
| lib.rs | utf8-bufread |
| version | 1.0.0 |
| created_at | 2021-03-06 16:06:21.053601+00 |
| updated_at | 2021-04-05 09:54:09.465251+00 |
| description | Provides alternatives to BufRead's read_line & lines that do not stop on newlines |
| homepage | |
| repository | https://gitlab.com/Austreelis/utf8-bufread |
| max_upload_size | |
| id | 364829 |
| size | 4,405,473 |
This crate provides functions to read UTF-8 text from any
type implementing io::BufRead through a
trait, BufRead, without waiting for newline
delimiters. These functions take advantage of buffering and
return either &str or chars. Each has
an associated iterator, and some also have an equivalent of a
Map iterator that avoids allocation and cloning.
Add this crate as a dependency in your Cargo.toml:
[dependencies]
utf8-bufread = "1.0.0"
The simplest way to read a file using this crate may be something along the lines of the following:
use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}
The read_str function returns a
&str of arbitrary length (up to the reader's
buffer capacity) read from the inner reader. It does not clone
data unless a valid codepoint ends up cut at the end of the
reader's buffer. Its associated iterator can be obtained by
calling str_iter. Since that iterator clones
the data at each iteration, str_map is
also provided.
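To make the buffer-boundary case concrete, here is a minimal sketch using only the standard library. It is not the crate's implementation: read_valid_prefix is a hypothetical helper, and unlike read_str it returns an owned String to keep the borrow handling simple. It hands out the longest valid UTF-8 prefix of the buffer, and only assembles bytes one at a time when a codepoint is split across two buffer fills.

```rust
use std::io::{self, BufRead, ErrorKind};
use std::str;

/// Hypothetical helper (not part of utf8-bufread): returns the longest valid
/// UTF-8 prefix of the reader's buffer, or an empty string at EOF.
fn read_valid_prefix<R: BufRead>(reader: &mut R) -> io::Result<String> {
    let buf = reader.fill_buf()?;
    if buf.is_empty() {
        return Ok(String::new()); // EOF
    }
    let valid_len = match str::from_utf8(buf) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    };
    if valid_len > 0 {
        // Common case: everything up to `valid_len` decodes cleanly.
        let text = str::from_utf8(&buf[..valid_len])
            .expect("prefix was validated")
            .to_owned();
        reader.consume(valid_len);
        return Ok(text);
    }
    // The buffer starts with a split or invalid sequence: gather its bytes
    // one at a time, letting the reader refill its buffer in between.
    let mut pending = Vec::with_capacity(4);
    loop {
        let byte = match reader.fill_buf()? {
            [] => return Err(io::Error::new(ErrorKind::UnexpectedEof, "truncated codepoint")),
            [first, ..] => *first,
        };
        reader.consume(1);
        pending.push(byte);
        match str::from_utf8(&pending) {
            Ok(s) => return Ok(s.to_owned()),
            Err(e) if e.error_len().is_some() => {
                return Err(io::Error::new(ErrorKind::InvalidData, e));
            }
            Err(_) => {} // still incomplete, fetch the next byte
        }
    }
}
```

Wrapping a byte slice in a std::io::BufReader with a very small capacity (for instance 4 bytes) is an easy way to exercise the split-codepoint path, since multi-byte characters will regularly straddle two fills.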
The read_char function returns a
char read from the inner reader. Its associated
iterator can be obtained by calling
char_iter.
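As an illustration of what reading a single char from a buffered reader involves, here is a small sketch built on the standard library only. It is not the crate's code: utf8_width and read_one_char are hypothetical names, and the Option return value is an assumption made here to signal EOF. The sketch peeks at the leading byte to learn the sequence length, then reads exactly that many bytes and validates them.

```rust
use std::io::{self, BufRead, ErrorKind, Read};
use std::str;

/// Hypothetical helper: length of a UTF-8 sequence, judged by its first byte.
fn utf8_width(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation byte or invalid leading byte
    }
}

/// Hypothetical helper (not part of utf8-bufread): reads one char,
/// or returns `Ok(None)` at EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    // Peek at the first byte without consuming it.
    let first = match reader.fill_buf()? {
        [] => return Ok(None),
        [first, ..] => *first,
    };
    let width = utf8_width(first)
        .ok_or_else(|| io::Error::new(ErrorKind::InvalidData, "invalid leading byte"))?;
    // read_exact goes through the buffer and refills it when needed.
    let mut bytes = [0u8; 4];
    reader.read_exact(&mut bytes[..width])?;
    // Full validation catches bad continuation bytes and overlong forms.
    let s = str::from_utf8(&bytes[..width])
        .map_err(|e| io::Error::new(ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

Because char is Copy, the decoded value can be returned directly without keeping any buffer alive.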
This crate provides several structs for several ways of iterating over the inner reader's data:
StrIter and
CodepointIter clone the data on each
iteration, but use an Rc to check if the returned
String buffer is still used. If not, it is
re-used to avoid re-allocating.
let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    // Do something with s ...
    print!("{}", s);
}
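The buffer re-use described above can be sketched with Rc::get_mut, which only succeeds when no other Rc points to the same allocation. This is an assumption about how such a check might look, not the crate's actual code: RcChunks and its fresh_buffers counter are hypothetical and exist purely for demonstration.

```rust
use std::rc::Rc;

/// Hypothetical iterator (not the crate's StrIter): yields each chunk as an
/// Rc<String>, overwriting the previous buffer when the caller has dropped it.
struct RcChunks<I> {
    source: I,
    buf: Rc<String>,
    /// Number of times a new buffer had to be created (for demonstration).
    fresh_buffers: usize,
}

impl<I> RcChunks<I> {
    fn new(source: I) -> Self {
        RcChunks { source, buf: Rc::new(String::new()), fresh_buffers: 0 }
    }
}

impl<'a, I: Iterator<Item = &'a str>> Iterator for RcChunks<I> {
    type Item = Rc<String>;

    fn next(&mut self) -> Option<Rc<String>> {
        let chunk = self.source.next()?;
        match Rc::get_mut(&mut self.buf) {
            // Nobody else holds the previous item: overwrite it in place.
            Some(s) => {
                s.clear();
                s.push_str(chunk);
            }
            // The caller kept the previous item: leave it alone, start afresh.
            None => {
                self.buf = Rc::new(chunk.to_owned());
                self.fresh_buffers += 1;
            }
        }
        Some(Rc::clone(&self.buf))
    }
}
```

In a plain for loop each item is dropped before the next call, so the same String is overwritten every time. Collecting the items into a Vec keeps them alive, and the iterator then falls back to creating a new buffer per item.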
StrMap and CodepointMap
give access to the read data without allocating or
copying, but that data cannot then be passed to further iterator
adapters.
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid UTF-8 bytes in {}", count, s);
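The reason a map-style adapter can avoid copying is that the closure runs while the reader's buffer is still borrowed, and only the closure's result leaves the iterator. The sketch below shows that shape with the standard library alone. MapChunks is a hypothetical type, not the crate's StrMap, and it makes a simplifying assumption: a buffer that starts with a split or invalid codepoint is reported as an error instead of being reassembled.

```rust
use std::io::{self, BufRead, ErrorKind};
use std::str;

/// Hypothetical adapter (not the crate's StrMap): applies `f` to each valid
/// chunk of the reader's buffer without copying the chunk.
struct MapChunks<R, F> {
    reader: R,
    f: F,
    done: bool,
}

impl<R, F> MapChunks<R, F> {
    fn new<T>(reader: R, f: F) -> Self
    where
        R: BufRead,
        F: FnMut(&str) -> T,
    {
        MapChunks { reader, f, done: false }
    }
}

impl<R, T, F> Iterator for MapChunks<R, F>
where
    R: BufRead,
    F: FnMut(&str) -> T,
{
    type Item = io::Result<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let buf = match self.reader.fill_buf() {
            Ok(buf) if !buf.is_empty() => buf,
            Ok(_) => {
                self.done = true; // EOF
                return None;
            }
            Err(e) => {
                self.done = true;
                return Some(Err(e));
            }
        };
        let valid_len = match str::from_utf8(buf) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        if valid_len == 0 {
            // Simplification: split codepoints are not reassembled here.
            self.done = true;
            return Some(Err(io::Error::new(ErrorKind::InvalidData, "invalid or split codepoint")));
        }
        let text = str::from_utf8(&buf[..valid_len]).expect("prefix was validated");
        // The closure sees the borrowed chunk; only its result is kept.
        let out = (self.f)(text);
        self.reader.consume(valid_len);
        Some(Ok(out))
    }
}
```

Since the borrowed &str never escapes next, the iterator's items are plain values of the closure's return type, which is why the chunk itself cannot flow into later adapters.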
CharIter is similar to StrIter
and others, except it relies on chars
implementing Copy and thus doesn't need a buffer
nor the "Rc trick".
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);
All these iterators read data until EOF is reached or an invalid
codepoint is found. If valid codepoints are read from the
inner reader, they will be returned before reporting an
error. After encountering an error or EOF, they always
return None. They always ignore any
Interrupted error.
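The termination rules above (retry on Interrupted, report one real error, then yield None forever) can be captured in a small generic wrapper. This is only a sketch under assumptions: Fused is a hypothetical type, and the closure returning io::Result<Option<T>> stands in for whatever read function an iterator is built on.

```rust
use std::io::{self, ErrorKind};

/// Hypothetical wrapper: drives a "read next item" closure, skipping
/// Interrupted errors and yielding None after the first real error or EOF.
struct Fused<F> {
    read: F,
    done: bool,
}

impl<F> Fused<F> {
    fn new(read: F) -> Self {
        Fused { read, done: false }
    }
}

impl<T, F> Iterator for Fused<F>
where
    F: FnMut() -> io::Result<Option<T>>,
{
    type Item = io::Result<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        loop {
            match (self.read)() {
                Ok(Some(item)) => return Some(Ok(item)),
                Ok(None) => {
                    self.done = true; // EOF
                    return None;
                }
                // Interrupted reads are transient: simply try again.
                Err(e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => {
                    self.done = true; // report once, then stay exhausted
                    return Some(Err(e));
                }
            }
        }
    }
}
```

Keeping the done flag inside the iterator means callers can chain filter_map(Result::ok) without risking an endless stream of repeated errors.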
This crate is still a work in progress. Part of its API can be considered stable:
read_str, read_codepoint and read_char's behavior and signature.
str_iter, str_map, codepoints_iter, codepoints_map and char_iter's behavior and signature.
StrIter, StrMap, CodepointIter, CodepointMap and CharIter's API.
However some features are still considered unstable:
And some features still have to be added:
read_* (see from_utf8_lossy & from_utf8_unchecked).
unicode-segmentation crate, in the same fashion as read_codepoint.
Given I'm not the most experienced developer, you are very welcome to submit issues and merge requests on the repository.
Utf8-BufRead is distributed under the terms of the Apache License 2.0, see the LICENSE file in the root directory of this repository.