pinyin-parser

version: 0.1.9
source: src
created_at: 2021-06-28 04:44:20.239522
updated_at: 2024-01-06 23:00:51.528577
description: Parses a string of pinyin syllables. Covers marginal cases such as `ẑ`, `ŋ` and `ê`.
repository: https://github.com/sozysozbot/pinyin-parser-rs
id: 415637
size: 65,629
sozysozbot / hsjoihs (sozysozbot)

README

pinyin-parser-rs

Parses a string of pinyin syllables. Covers marginal cases such as ẑ, ŋ and ê.

Since pinyin strings in the wild do not necessarily conform to the standard, this parser offers two modes: strict and loose.

Strict mode:

  • forbids the use of breve instead of hacek to represent the third tone
  • forbids the use of IPA ɡ (U+0261) instead of g, and other such lookalike characters
  • allows apostrophes only before an a, an e or an o
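
The three rules above can be pictured as a single pass over the input. The following is only a sketch under stated assumptions: `StrictViolation`, `is_a_e_o` and `check_strict` are hypothetical names invented for illustration and are not part of the pinyin-parser API, the input is assumed to be lowercase with precomposed tone marks, and ɡ (U+0261) is the only lookalike checked.

```rust
/// A way in which a string breaks the strict-mode rules listed above.
/// Hypothetical type for illustration; not part of the pinyin-parser API.
#[derive(Debug, PartialEq)]
enum StrictViolation {
    /// A breve vowel (e.g. U+0103) was used where a hacek vowel is expected.
    Breve(char),
    /// A lookalike character such as IPA U+0261 was used instead of `g`.
    Lookalike(char),
    /// An apostrophe was followed by something other than a, e or o.
    Apostrophe,
}

/// True for lowercase a, e, o, with or without one of the four tone marks.
/// Assumes precomposed (NFC) input.
fn is_a_e_o(c: char) -> bool {
    matches!(
        c,
        'a' | 'ā' | 'á' | 'ǎ' | 'à'
            | 'e' | 'ē' | 'é' | 'ě' | 'è'
            | 'o' | 'ō' | 'ó' | 'ǒ' | 'ò'
    )
}

/// Collects every violation of the three rules found in `input`.
fn check_strict(input: &str) -> Vec<StrictViolation> {
    let mut violations = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Breve vowels a, e, i, o, u; written as escapes because they are
            // easy to confuse with the hacek vowels on screen.
            '\u{0103}' | '\u{0115}' | '\u{012D}' | '\u{014F}' | '\u{016D}' => {
                violations.push(StrictViolation::Breve(c))
            }
            // IPA script g, which looks like an ordinary `g`.
            '\u{0261}' => violations.push(StrictViolation::Lookalike(c)),
            // The apostrophe must be followed by a, e or o.
            '\'' => {
                if !chars.peek().copied().map_or(false, is_a_e_o) {
                    violations.push(StrictViolation::Apostrophe);
                }
            }
            _ => {}
        }
    }
    violations
}
```

With these definitions, `check_strict("shuāng'ěr")` returns an empty vector, while `check_strict("yīng'guó")` reports one `Apostrophe` violation because the apostrophe precedes a `g`.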

Examples

use pinyin_parser::PinyinParser;
assert_eq!(
    PinyinParser::strict("jīntiān")
        .into_iter()
        .collect::<Vec<_>>(),
    vec!["jīn", "tiān"]
);

The resulting strings are NFC-normalized: the sample above yields the single precomposed character ī (U+012B) rather than i followed by a combining macron.
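
To illustrate what the precomposed form means in practice, here is a small stdlib-only sketch that merges a base vowel and a following combining tone mark into one character. It is an assumption-laden toy, not how the crate normalizes: `compose_tone` and `compose_tones` are hypothetical names, only lowercase a, e, i, o, u with the four tone marks are handled, and real NFC covers far more.

```rust
/// Combines a lowercase vowel with a combining tone mark into the
/// precomposed character, if this sketch knows the pair.
/// Hypothetical helper; full NFC handles many more combinations.
fn compose_tone(base: char, mark: char) -> Option<char> {
    // Columns: macron, acute, hacek (caron), grave.
    let row: [char; 4] = match base {
        'a' => ['ā', 'á', 'ǎ', 'à'],
        'e' => ['ē', 'é', 'ě', 'è'],
        'i' => ['ī', 'í', 'ǐ', 'ì'],
        'o' => ['ō', 'ó', 'ǒ', 'ò'],
        'u' => ['ū', 'ú', 'ǔ', 'ù'],
        _ => return None,
    };
    let col = match mark {
        '\u{0304}' => 0, // combining macron
        '\u{0301}' => 1, // combining acute accent
        '\u{030C}' => 2, // combining caron
        '\u{0300}' => 3, // combining grave accent
        _ => return None,
    };
    Some(row[col])
}

/// Rewrites `input` so that every vowel + combining tone mark pair known to
/// `compose_tone` becomes a single precomposed character.
fn compose_tones(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        // Try to merge the current character into the previous one.
        if let Some(prev) = out.chars().next_back() {
            if let Some(composed) = compose_tone(prev, c) {
                out.pop();
                out.push(composed);
                continue;
            }
        }
        out.push(c);
    }
    out
}
```

Feeding it `"ji\u{0304}n"` (four code points) gives `"jīn"` with three code points, and input that is already precomposed passes through unchanged.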

Erhua is supported.

use pinyin_parser::PinyinParser;
assert_eq!(
    PinyinParser::strict("yīdiǎnr")
        .into_iter()
        .collect::<Vec<_>>(),
    vec!["yī", "diǎnr"]
);

If you want r to be separated from the main syllable, use .split_erhua().
Note that syllables "er", "ēr", "ér", "ěr", and "èr" are exempt from this splitting.

use pinyin_parser::PinyinParser;
assert_eq!(
    PinyinParser::strict("yīdiǎnr chànggēr shuāng'ěr língtīng")
        .split_erhua()
        .collect::<Vec<_>>(),
    vec![
        "yī", "diǎn", "r",
        "chàng", "gē", "r",
        "shuāng", "ěr",
        "líng", "tīng"
    ]
);
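
The exemption for "er" and its toned forms can be sketched on a single, already-parsed syllable. This is a guess at the shape of the rule, not the crate's implementation: `split_erhua_sketch` is a hypothetical function, and it assumes lowercase, NFC-normalized syllables.

```rust
/// Splits a trailing erhua `r` off one syllable, leaving the standalone
/// syllables "er", "ēr", "ér", "ěr" and "èr" intact.
/// Hypothetical helper for illustration; not part of the pinyin-parser API.
fn split_erhua_sketch(syllable: &str) -> Vec<String> {
    const EXEMPT: [&str; 5] = ["er", "ēr", "ér", "ěr", "èr"];
    if EXEMPT.contains(&syllable) {
        return vec![syllable.to_string()];
    }
    match syllable.strip_suffix('r') {
        // A non-empty stem followed by `r`: emit the stem and `r` separately.
        Some(stem) if !stem.is_empty() => vec![stem.to_string(), "r".to_string()],
        // No trailing `r` (or the syllable is just "r"): leave it alone.
        _ => vec![syllable.to_string()],
    }
}
```

Under these assumptions, `split_erhua_sketch("diǎnr")` yields `["diǎn", "r"]`, while `split_erhua_sketch("ěr")` yields `["ěr"]`.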

This parser supports the use of ẑ, ĉ, ŝ and ŋ (shorthand for zh, ch, sh and ng), though I have never seen anyone use them.

use pinyin_parser::PinyinParser;
assert_eq!(
    PinyinParser::strict("Ẑāŋ").into_iter().collect::<Vec<_>>(),
    vec!["zhāng"]
);
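
The mapping from the shorthand letters to their digraphs can be sketched with the standard library alone. `expand_shorthand` is a hypothetical name, and lowercasing the whole input first is an assumption made here to mirror the lowercase output in the example above; the crate may handle this differently.

```rust
/// Lowercases `input` and expands the shorthand letters into digraphs.
/// Hypothetical helper for illustration; not part of the pinyin-parser API.
fn expand_shorthand(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.to_lowercase().chars() {
        match c {
            '\u{1E91}' => out.push_str("zh"), // z with circumflex
            '\u{0109}' => out.push_str("ch"), // c with circumflex
            '\u{015D}' => out.push_str("sh"), // s with circumflex
            '\u{014B}' => out.push_str("ng"), // eng
            other => out.push(other),
        }
    }
    out
}
```

With this sketch, `expand_shorthand("Ẑāŋ")` gives `"zhāng"`, and tone-marked vowels pass through untouched.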

Loose mode accepts input that strict mode rejects:

use pinyin_parser::PinyinParser;
assert_eq!(
    // In strict mode an apostrophe may come only before an `a`, an `e` or an `o`;
    // it is accepted here because the parser is in loose mode.
    PinyinParser::loose("Yīng'guó")
        .into_iter()
        .collect::<Vec<_>>(),
    vec!["yīng", "guó"]
);