Crates.io | utf8-bufread |
lib.rs | utf8-bufread |
version | 1.0.0 |
source | src |
created_at | 2021-03-06 16:06:21.053601 |
updated_at | 2021-04-05 09:54:09.465251 |
description | Provides alternatives to BufRead's read_line & lines that stop not on newlines |
homepage | |
repository | https://gitlab.com/Austreelis/utf8-bufread |
max_upload_size | |
id | 364829 |
size | 4,405,473 |
This crate provides functions to read UTF-8 text from any type implementing io::BufRead through a trait, BufRead, without waiting for newline delimiters. These functions take advantage of buffering and return either &str or chars. Each has an associated iterator, and some also have an equivalent to a Map iterator that avoids allocation and cloning.
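To make the idea of reading without waiting for newlines concrete, here is a minimal sketch that uses only the standard library. The helper name read_available_utf8 is hypothetical and this is not the crate's implementation: it simply takes whatever bytes the reader currently has buffered, keeps the valid UTF-8 prefix, and consumes exactly that many bytes. It deliberately does not handle a code point split across two buffer refills.

```rust
use std::io::{self, BufRead};
use std::str;

/// Hypothetical helper (not part of utf8-bufread): appends the valid UTF-8
/// currently buffered by `reader` to `out` and returns how many bytes were
/// consumed. Returns `Ok(0)` once the reader is at EOF.
fn read_available_utf8<R: BufRead>(reader: &mut R, out: &mut String) -> io::Result<usize> {
    // Borrow the reader's internal buffer without copying it.
    let buf = reader.fill_buf()?;
    let valid_len = match str::from_utf8(buf) {
        Ok(s) => s.len(),
        // Some valid text precedes the problem: hand that part out first.
        Err(e) if e.valid_up_to() > 0 => e.valid_up_to(),
        // Nothing valid at the front of the buffer: report the error.
        Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
    };
    // The prefix was validated just above, so this conversion cannot fail.
    let valid = str::from_utf8(&buf[..valid_len]).expect("validated prefix");
    out.push_str(valid);
    // Tell the reader how many bytes we actually used.
    reader.consume(valid_len);
    Ok(valid_len)
}
```

Calling this helper in a loop until it returns Ok(0) drains a reader chunk by chunk, with chunk boundaries decided by the buffer rather than by newline characters.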
Add this crate as a dependency in your Cargo.toml:
[dependencies]
utf8-bufread = "1.0.0"
The simplest way to read a file using this crate may be something along the lines of the following:
use std::io::{Cursor, ErrorKind};
// The crate's own BufRead trait provides read_str
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}
The read_str function returns a &str of arbitrary length (up to the reader's buffer capacity) read from the inner reader, without cloning data, unless a valid codepoint ends up cut at the end of the reader's buffer. Its associated iterator can be obtained by calling str_iter, and since it involves cloning the data at each iteration, str_map is also provided.
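The case of a codepoint cut at the end of the buffer can be recognised with the standard library alone. The sketch below is only an illustration of that detection step, not the crate's code; the Tail enum and the classify function are hypothetical names. It relies on Utf8Error::error_len returning None when the input ends in the middle of a code point.

```rust
use std::str;

/// Hypothetical classification of how a byte buffer ends, UTF-8 wise.
#[derive(Debug, PartialEq)]
enum Tail {
    /// The whole buffer is valid UTF-8.
    Complete,
    /// Valid up to the given byte offset; the rest is a code point cut short.
    Incomplete(usize),
    /// Valid up to the given byte offset; followed by bytes that can never
    /// form valid UTF-8.
    Invalid(usize),
}

/// Hypothetical helper: inspects a buffer the way a buffered UTF-8 reader
/// might before deciding whether it can borrow or must copy.
fn classify(buf: &[u8]) -> Tail {
    match str::from_utf8(buf) {
        Ok(_) => Tail::Complete,
        Err(e) => match e.error_len() {
            // `None` means the input ended unexpectedly: more bytes may fix it.
            None => Tail::Incomplete(e.valid_up_to()),
            // `Some(_)` means the bytes are invalid regardless of what follows.
            Some(_) => Tail::Invalid(e.valid_up_to()),
        },
    }
}
```

A reader that sees the Incomplete case has to carry the leftover bytes over to the next buffer refill, which is presumably where a copy becomes unavoidable.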
The read_char function returns a char read from the inner reader. Its associated iterator can be obtained by calling char_iter.
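For comparison, reading a single char from a buffered reader can be sketched with the standard library as follows. The function read_one_char is a hypothetical stand-in written for this illustration; its signature and error handling are assumptions and do not reflect how read_char is implemented or what it returns.

```rust
use std::io::{self, BufRead};
use std::str;

/// Hypothetical helper (not the crate's read_char): reads one UTF-8 encoded
/// char from `reader`, or returns `Ok(None)` at EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    // Peek at the first buffered byte without consuming it.
    let first = match reader.fill_buf()?.first() {
        Some(&b) => b,
        None => return Ok(None), // EOF
    };
    // The leading byte tells how many bytes the encoded char occupies.
    let len = match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        // A continuation byte or an impossible byte cannot start a char.
        _ => return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid leading byte")),
    };
    let mut bytes = [0u8; 4];
    // read_exact refills the buffer as needed, so a char split across two
    // refills is still read whole.
    reader.read_exact(&mut bytes[..len])?;
    // Full validation (continuation bytes, overlong forms) is left to std.
    let s = str::from_utf8(&bytes[..len])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

Because a char is Copy and at most four bytes long, a small stack array is enough here and no heap buffer is needed.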
This crate provides several structs for several ways of iterating over the inner reader's data:
StrIter and CodepointIter clone the data on each iteration, but use an Rc to check if the returned String buffer is still used. If not, it is re-used to avoid re-allocating.
let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    // Do something with s ...
    print!("{}", s);
}
StrMap and CodepointMap give access to the read data without allocating or copying, but that data cannot then be passed on to further iterator adapters.
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid utf-8 bytes in {}", count, s);
CharIter is similar to StrIter and others, except it relies on chars implementing Copy and thus doesn't need a buffer nor the "Rc trick".
let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);
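The "Rc trick" mentioned for StrIter and CodepointIter can be sketched roughly as follows. This is one possible way to implement such a check using Rc::get_mut, under the assumption that items are handed out as Rc<String>; the Chunks type is hypothetical, iterates over whitespace-separated words of an in-memory string purely for illustration, and is not taken from the crate.

```rust
use std::rc::Rc;
use std::str::SplitWhitespace;

/// Hypothetical iterator that hands out each word as an `Rc<String>` and
/// re-uses the previous allocation when the caller no longer holds it.
struct Chunks<'a> {
    parts: SplitWhitespace<'a>,
    buf: Rc<String>,
}

impl<'a> Chunks<'a> {
    fn new(text: &'a str) -> Self {
        Chunks { parts: text.split_whitespace(), buf: Rc::new(String::new()) }
    }
}

impl<'a> Iterator for Chunks<'a> {
    type Item = Rc<String>;

    fn next(&mut self) -> Option<Rc<String>> {
        let part = self.parts.next()?;
        match Rc::get_mut(&mut self.buf) {
            // We hold the only reference: the previous item was dropped,
            // so its allocation can be overwritten in place.
            Some(s) => {
                s.clear();
                s.push_str(part);
            }
            // The previous item is still alive somewhere: allocate anew.
            None => self.buf = Rc::new(part.to_owned()),
        }
        Some(Rc::clone(&self.buf))
    }
}
```

In a plain for loop each item is dropped before the next call to next, so the same allocation is re-used throughout, while a caller that collects the items into a Vec still gets independent, unmodified strings.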
All these iterators may read data until EOF or an invalid codepoint is found. If valid codepoints are read from the inner reader, they will be returned before reporting an error. After encountering an error or EOF, they always return None. They always ignore any Interrupted error.
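The "valid data first, then the error, then None" behaviour described above can be illustrated with a small standard-library sketch. The ValidThenError type is hypothetical and works on an in-memory byte slice rather than a reader, so it does not model Interrupted errors; it only shows the ordering of the yielded items.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical iterator over a byte slice: yields the valid UTF-8 found
/// before an error, then the error itself, then `None` forever.
struct ValidThenError<'a> {
    rest: &'a [u8],
    done: bool,
}

impl<'a> ValidThenError<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        ValidThenError { rest: bytes, done: false }
    }
}

impl<'a> Iterator for ValidThenError<'a> {
    type Item = Result<&'a str, Utf8Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let bytes: &'a [u8] = self.rest;
        if self.done || bytes.is_empty() {
            // EOF or a previous error: stay exhausted from now on.
            self.done = true;
            return None;
        }
        match str::from_utf8(bytes) {
            Ok(s) => {
                self.rest = &[];
                Some(Ok(s))
            }
            // Valid text precedes the invalid bytes: return it first and
            // keep the invalid part for the next call.
            Err(e) if e.valid_up_to() > 0 => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                self.rest = rest;
                Some(Ok(str::from_utf8(valid).expect("validated prefix")))
            }
            // Nothing valid is left in front of the error: report it once.
            Err(e) => {
                self.done = true;
                Some(Err(e))
            }
        }
    }
}
```

With such an ordering, a caller using filter_map(Result::ok) still receives everything that was readable before the invalid bytes.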
This crate is still a work in progress. Part of its API can be considered stable:
read_str, read_codepoint and read_char's behavior and signature.
str_iter, str_map, codepoints_iter, codepoints_map and char_iter's behavior and signature.
StrIter, StrMap, CodepointIter, CodepointMap and CharIter's API.
However some features are still considered unstable:
And some features still have to be added:
read_* (see from_utf8_lossy & from_utf8_unchecked).
unicode-segmentation crate, in the same fashion as read_codepoint.
Given that I'm far from the most experienced developer, you are very welcome to submit issues and pull requests here.
Utf8-BufRead is distributed under the terms of the Apache License 2.0, see the LICENSE file in the root directory of this repository.