Crates.io | shift_or_euc |
lib.rs | shift_or_euc |
version | 0.1.0 |
source | src |
created_at | 2019-04-16 08:21:11.1577 |
updated_at | 2019-04-16 08:21:11.1577 |
description | Detects among the Japanese legacy encodings |
homepage | https://docs.rs/shift_or_euc/ |
repository | https://github.com/hsivonen/shift_or_euc |
id | 128337 |
size | 31,606 |
A Japanese legacy encoding detector that distinguishes between Shift_JIS, EUC-JP, and, optionally, ISO-2022-JP, on the assumption that the encoding is one of those.
This detector is generally more accurate (but see below about the failure mode on half-width katakana) and decides much sooner than machine learning-based detectors. To decide EUC-JP, machine learning-based detectors try to gain confidence that the input looks like EUC-JP. To decide EUC-JP, this detector instead looks for two simple rule-based signs of the input not being Shift_JIS.
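The idea of deciding EUC-JP from signs that the input is not Shift_JIS can be sketched roughly as follows. This is a simplified illustration, not the crate's implementation: the `Guess` type and the `guess_from_shift_jis_view` function are hypothetical names, the byte ranges only model the structure of Shift_JIS rather than its full mapping table, and treating half-width katakana as the second sign is an assumption inferred from the failure mode described further down.

```rust
/// Outcome of the simplified check. Hypothetical type, not the crate's API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Guess {
    EucJp,
    Undecided,
}

/// Scans `bytes` as if they were Shift_JIS and reports EUC-JP as soon as
/// a sign of "not Shift_JIS" appears. Simplified: only byte structure is
/// checked, not whether each two-byte pair maps to a real character.
fn guess_from_shift_jis_view(bytes: &[u8]) -> Guess {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // ASCII range: carries no information either way.
            0x00..=0x7F => i += 1,
            // Assumed sign: a single-byte half-width katakana in Shift_JIS.
            0xA1..=0xDF => return Guess::EucJp,
            // Shift_JIS lead byte: a valid trail byte must follow.
            0x81..=0x9F | 0xE0..=0xFC => match bytes.get(i + 1) {
                Some(&t) if matches!(t, 0x40..=0x7E | 0x80..=0xFC) => i += 2,
                // Sign: the sequence is an error when read as Shift_JIS.
                Some(_) => return Guess::EucJp,
                // Input ends mid-character: nothing can be concluded.
                None => return Guess::Undecided,
            },
            // 0x80, 0xA0, 0xFD..=0xFF: treated as errors in this model.
            _ => return Guess::EucJp,
        }
    }
    Guess::Undecided
}
```

In this sketch, the EUC-JP bytes for 日本語 (`C6 FC CB DC B8 EC`) are decided at the first byte, because `C6` would be a half-width katakana in Shift_JIS, while the Shift_JIS bytes for the same text (`93 FA 96 7B 8C EA`) never trigger either sign and stay undecided.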
As a consequence of not containing machine learning tables, the binary size footprint that this crate adds on top of encoding_rs is tiny.
See the file named COPYRIGHT.
git clone https://github.com/hsivonen/shift_or_euc
cd shift_or_euc
cargo run --example detect PATH_TO_FILE
The program prints one of:
The detector is based on two observations:
The detector gives the wrong answer if the text has a half-width katakana character before normal kana or common kanji. Some uncommon kanji are undecidable. (All JIS X 0208 Level 1 kanji are decidable.)
The half-width katakana issue is mainly relevant for old 8-bit JIS X 0201-only text files that would decode correctly as Shift_JIS but that the detector detects as EUC-JP.
The undecidable kanji issue does not realistically show up when a full document is fed to the detector, because a full document realistically contains at least one kana or common kanji. It can occur, though, if the detector is run only on a prefix of a document and the prefix contains only the title of the document. It is possible for a document title to consist entirely of undecidable kanji. (Indeed, Japanese Wikipedia has articles with such titles.) If the detector is undecided, falling back to Shift_JIS is typically the better guess for Web-oriented use.
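The fallback policy for an undecided result can be sketched in a few lines. The `JapaneseEncoding` type and the `resolve_with_fallback` function are hypothetical names used only for illustration, not the crate's API; `None` stands in for "undecided" here as an assumption of the sketch.

```rust
/// Hypothetical label type for this sketch, not the crate's API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum JapaneseEncoding {
    ShiftJis,
    EucJp,
    Iso2022Jp,
}

/// Applies the fallback policy: an undecided result (`None`) becomes
/// Shift_JIS, and any actual decision is passed through unchanged.
fn resolve_with_fallback(detected: Option<JapaneseEncoding>) -> JapaneseEncoding {
    detected.unwrap_or(JapaneseEncoding::ShiftJis)
}
```

Keeping the fallback outside the detection step lets a caller that only sees a document prefix choose a different default if its content is not Web oriented.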