transcoding_rs

Crates.io: transcoding_rs
lib.rs: transcoding_rs
version: 0.1.1
source: src
created_at: 2021-10-24 12:28:42.874849
updated_at: 2021-10-24 12:39:16.08423
description: Converts text encoding the easy and efficient way
repository: https://github.com/kena0ki/aconv/tree/main/transcoding_rs
id: 470333
size: 70,198
Ken Aoki (kena0ki)


transcoding_rs

This is a transcoding library. Transcoding here means converting text from one encoding to another.

There are two excellent crates, chardetng and encoding_rs. chardetng was created for encoding detection, and encoding_rs can be used for transcoding. This library combines the two to make transcoding easy and efficient.

Note: Supported encodings are the ones defined in the Encoding Standard.

Note: UTF-16 files must have a BOM to be detected as UTF-16.
This is because chardetng, on which this library depends, does not support UTF-16, and this library only adds BOM sniffing to detect it.
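
As an illustration of what BOM sniffing for UTF-16 involves, here is a minimal, standard-library-only sketch. The names `Utf16Bom` and `sniff_utf16_bom` are hypothetical and are not part of this crate's API; the crate's actual implementation may differ (encoding_rs, for instance, offers `Encoding::for_bom` for a similar purpose).

```rust
/// Byte order indicated by a UTF-16 byte order mark.
/// (Hypothetical type for illustration, not this crate's API.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Utf16Bom {
    LittleEndian,
    BigEndian,
}

/// Returns the byte order indicated by a leading UTF-16 BOM together with
/// the BOM length in bytes, or `None` if the input does not start with one.
fn sniff_utf16_bom(bytes: &[u8]) -> Option<(Utf16Bom, usize)> {
    if bytes.starts_with(&[0xFF, 0xFE]) {
        // U+FEFF serialized with the low byte first.
        Some((Utf16Bom::LittleEndian, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        // U+FEFF serialized with the high byte first.
        Some((Utf16Bom::BigEndian, 2))
    } else {
        None
    }
}
```

Without a BOM, the function returns `None`, which corresponds to the case where the input is handed to chardetng for guessing.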

Usage

See the documentation.

How encoding detection works

Since text is internally just a sequence of bytes, there is no way to detect the right encoding with 100% accuracy.
So the encoding has to be guessed.
The flow below is roughly what this library follows.

  1. Sniff for a BOM to detect UTF-16.
    If a BOM is found, skip guessing the encoding.
  2. Otherwise, guess the encoding using chardetng.
  3. Decode the text using encoding_rs.
  4. Check the decoded text for non-text characters, which are described below.
    If the number of non-text characters does not exceed the threshold, output the decoded text.
    Otherwise, emit an error message and output the input text as is.

Non-text characters

The characters treated as non-text in this library are the same ones that the file command treats as non-text, plus the REPLACEMENT CHARACTER.
Namely, U+0000 to U+0006, U+000E to U+001A, U+001C to U+001F, U+007F, and U+FFFD are treated as non-text characters.
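
The ranges above can be expressed directly as a character predicate. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the threshold is assumed here to be a plain count of non-text characters; the crate's actual threshold and its units are not stated in this README.

```rust
/// Returns true for the characters listed above as non-text:
/// U+0000 to U+0006, U+000E to U+001A, U+001C to U+001F, U+007F, and U+FFFD.
fn is_non_text(c: char) -> bool {
    matches!(
        c,
        '\u{0000}'..='\u{0006}'
            | '\u{000E}'..='\u{001A}'
            | '\u{001C}'..='\u{001F}'
            | '\u{007F}'
            | '\u{FFFD}'
    )
}

/// Counts the non-text characters in decoded text.
fn count_non_text(text: &str) -> usize {
    text.chars().filter(|&c| is_non_text(c)).count()
}

/// Accepts the decoded text when the non-text count does not exceed
/// `max_non_text` (an assumed, count-based threshold for illustration).
fn looks_like_text(text: &str, max_non_text: usize) -> bool {
    count_non_text(text) <= max_non_text
}
```

Note that common whitespace and control characters such as tab (U+0009), line feed (U+000A), carriage return (U+000D), and escape (U+001B) fall outside these ranges and are therefore treated as text.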

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
