Crates.io | deunicode |
lib.rs | deunicode |
version | 1.6.0 |
source | src |
created_at | 2018-05-05 19:10:46.167858 |
updated_at | 2024-05-13 12:35:59.459289 |
description | Convert Unicode strings to pure ASCII by intelligently transliterating them. Supports Emoji and Chinese. |
homepage | https://lib.rs/crates/deunicode |
repository | https://github.com/kornelski/deunicode/ |
max_upload_size | |
id | 63908 |
size | 495,891 |
The deunicode library transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid". It includes support for emoji. It's compatible with no-std Rust environments.
Deunicode is quite fast and supports on-the-fly conversion without allocations. It uses a compact representation of Unicode data to minimize memory overhead and executable size (about 75K code points mapped to 245K ASCII characters, using 450KB of memory, 160KB gzipped).
use deunicode::deunicode;
assert_eq!(deunicode("Æneid"), "AEneid");
assert_eq!(deunicode("étude"), "etude");
assert_eq!(deunicode("北亰"), "Bei Jing");
assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
assert_eq!(deunicode("げんまい茶"), "genmaiCha");
assert_eq!(deunicode("🦄☣"), "unicorn biohazard");
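To illustrate what "on-the-fly conversion without allocations" can look like, here is a minimal std-only sketch. It is not deunicode's implementation: `toy_lookup`, `write_ascii`, and `StackBuf` are hypothetical names, and the two-entry table stands in for the real data. The idea it shows is that each character maps to a `&'static str`, which can be streamed straight into any `fmt::Write` sink, including a fixed-size buffer on the stack.

```rust
use std::fmt::{self, Write};

/// Hypothetical stand-in for a transliteration table. It returns a
/// `&'static str`, so looking a character up never allocates.
fn toy_lookup(ch: char) -> Option<&'static str> {
    match ch {
        'Æ' => Some("AE"),
        'ß' => Some("ss"),
        _ => None,
    }
}

/// Streams the ASCII form of `input` into any `fmt::Write` sink,
/// one character at a time, without building an intermediate `String`.
fn write_ascii<W: Write>(input: &str, out: &mut W) -> fmt::Result {
    for ch in input.chars() {
        if ch.is_ascii() {
            out.write_char(ch)?;
        } else {
            // "[?]" mirrors the placeholder this README mentions for unknown characters.
            out.write_str(toy_lookup(ch).unwrap_or("[?]"))?;
        }
    }
    Ok(())
}

/// Fixed-size sink living on the stack (hypothetical helper for this sketch).
struct StackBuf {
    bytes: [u8; 32],
    len: usize,
}

impl StackBuf {
    fn new() -> Self {
        StackBuf { bytes: [0; 32], len: 0 }
    }

    fn as_str(&self) -> &str {
        // Only whole `&str` values are ever copied in, so the prefix is valid UTF-8.
        std::str::from_utf8(&self.bytes[..self.len]).unwrap()
    }
}

impl Write for StackBuf {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let end = self.len + s.len();
        if end > self.bytes.len() {
            return Err(fmt::Error); // buffer full: report an error instead of growing
        }
        self.bytes[self.len..end].copy_from_slice(s.as_bytes());
        self.len = end;
        Ok(())
    }
}
```

Writing "Æneid Straße" into a `StackBuf` with `write_ascii` leaves "AEneid Strasse" in the buffer. A `String` works as a sink too, since it also implements `fmt::Write`.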
It's a better alternative to just stripping all non-ASCII characters or letting them get mangled by some encoding-ignorant system. It's okay for one-way conversions for things like search indexes and tokenization, as a stronger version of Unicode NFKD. It may also be used to generate nice identifiers for file names and URLs that aren't too user-facing.
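For the identifier use case, transliteration is usually only the first step. The sketch below shows one possible follow-up step, assuming its input has already been converted to ASCII (for example by `deunicode()`). `slugify_ascii` is a hypothetical helper, not part of the crate, and the rules it applies (lowercase, runs of other characters collapsed to a single dash) are just one reasonable convention.

```rust
/// Turns already-ASCII text into a lowercase, dash-separated identifier.
/// Hypothetical helper for illustration; not part of deunicode.
fn slugify_ascii(ascii: &str) -> String {
    let mut slug = String::with_capacity(ascii.len());
    let mut pending_dash = false;
    for ch in ascii.chars() {
        if ch.is_ascii_alphanumeric() {
            // Emit one dash for any run of separators, but never at the start.
            if pending_dash && !slug.is_empty() {
                slug.push('-');
            }
            pending_dash = false;
            slug.push(ch.to_ascii_lowercase());
        } else {
            pending_dash = true;
        }
    }
    slug
}
```

With this helper, "Bei Jing" becomes "bei-jing" and "  unicorn  biohazard! " becomes "unicorn-biohazard".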
However, like most "universal" libraries of this kind, it has a one-size-fits-all 1:1 mapping of Unicode code points, which can't handle language-specific exceptions or context-dependent romanization rules. These limitations are only slightly suboptimal for European languages and Korean Hangul, but make a mess of Japanese Kanji.
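The limitation is easy to see in a toy model. The sketch below is a hypothetical three-entry table, not deunicode's data or code. It stores exactly one ASCII reading per code point, so the same Han character is romanized identically whether it appears in Chinese or Japanese text.

```rust
/// Hypothetical miniature table: one fixed reading per code point.
/// The readings here are Mandarin, regardless of the surrounding language.
fn toy_reading(ch: char) -> Option<&'static str> {
    match ch {
        '北' => Some("Bei"),
        '東' => Some("Dong"),
        '京' => Some("Jing"),
        _ => None,
    }
}

/// Context-free transliteration: each character is looked up on its own,
/// with no knowledge of its neighbours or of the text's language.
fn toy_transliterate(input: &str) -> String {
    let mut out = String::new();
    for ch in input.chars() {
        match toy_reading(ch) {
            Some(ascii) => out.push_str(ascii),
            None if ch.is_ascii() => out.push(ch),
            None => out.push_str("[?]"),
        }
    }
    out
}
```

In this toy, 北京 (Beijing) comes out as "BeiJing", which is fine, but 東京 (Tokyo) comes out as "DongJing", because the table has no way to know that the text is Japanese.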
Here are some guarantees you have when calling deunicode():
* The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
* All Unicode characters will translate to printable ASCII characters (\n or characters in the range 0x20 - 0x7E).
There are, however, some things you should keep in mind:
* Some transliterations do produce \n characters.
* Some Unicode characters transliterate to an empty string, either on purpose or because deunicode does not know about the character.
* Some Unicode characters are unknown and transliterate to "[?]" (or a custom placeholder, or None if you use a chars iterator).
Unicode data: Text::Unidecode by Sean M. Burke.
For a detailed explanation on the rationale behind the original dataset, refer to the article written by Burke in 2001.
This is a maintained alternative to the unidecode crate, which started as a Rust port of the Text::Unidecode Perl module.
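The output-range guarantee described earlier (\n or characters in the range 0x20 - 0x7E) can be turned into a small check for downstream tests. This is a std-only sketch. `is_newline_or_printable_ascii` is a hypothetical helper name, and it makes no claim about how the crate treats control characters that are already present in the input.

```rust
/// Returns true if every byte of `s` is `\n` or printable ASCII (0x20..=0x7E).
/// Hypothetical helper for illustration; not part of deunicode.
fn is_newline_or_printable_ascii(s: &str) -> bool {
    s.bytes().all(|b| b == b'\n' || (0x20..=0x7E).contains(&b))
}
```

A test in calling code could apply this predicate to a transliterated string before using it as a file name or URL component.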