# LCID-rs: A Rust library for Windows Language Code Identifiers and other language/culture information [![crates.io](https://img.shields.io/crates/v/lcid.svg)](https://crates.io/crates/lcid) [![docs.rs](https://docs.rs/lcid/badge.svg)](https://docs.rs/lcid/) [![GitHub CI](https://github.com/tobywf/lcid-rs/actions/workflows/check.yaml/badge.svg)](https://github.com/tobywf/lcid-rs/) [[Repository](https://github.com/tobywf/lcid-rs/)] [[Documentation](https://docs.rs/lcid/)] [[Crate Registry (crates.io)](https://crates.io/crates/lcid)] --- This crate provides language code identifier parsing and information according to the [[MS-LCID] Windows Language Code Identifier (LCID) Reference](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f) and [`System.Globalization.CultureInfo` API](https://docs.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo). The following information is provided: * Language Code Identifier/LCID (`lcid`), and lookup by LCID * Name/IETF language tag (`name`), and lookup by name * A non-localised, English readable language name (`english_name`) * ISO 639-1 two-letter code (`iso639_two_letter`) - note this is not always two letters * ISO 639-2/639-3 three-letter code (`iso639_three_letter`) * The Windows API three-letter language code (`windows_three_letter`) * ANSI code page (`ansi_code_page`), if available To use this crate, add the following to your `Cargo.toml`: ```toml [dependencies] lcid = "0.3" ``` Language identifiers/information can be queried by Language Code Identifier (LCID, a 32-bit unsigned integer), name (a string, i.e. supported [IETF BCP 47 language tags](https://tools.ietf.org/rfc/bcp/bcp47.txt)), or by directly referring to the language identifier constant: ```rust use lcid::LanguageId; use std::convert::TryInto; fn main() { let lang: &LanguageId = 1033.try_into().unwrap(); println!("Lang is '{}'/{}/'{}'", lang.name, lang.lcid, lang.english_name); let lang: &LanguageId = "en-US".try_into().unwrap(); println!("Lang is '{}'/{}/'{}'", lang.name, lang.lcid, lang.english_name); let lang: &LanguageId = lcid::constants::LANG_EN_US; println!("Lang is '{}'/{}/'{}'", lang.name, lang.lcid, lang.english_name); } ``` This prints the following for each: ``` Lang is 'en-US'/1033/'English (United States)' ``` ## Project name and status I struggle to find a good name for this. "locale-info" might be misleading (might imply some kind of POSIX locale support), or "culture-info" implying more than the project offers (like calendar information). In the end, I chose "lcid-rs", because "lcid" is ambiguous/hard to search for, although I named the crate itself "lcid" because in the context of Rust, "lcid" is not ambiguous. It'd be nice if this project was referred to as "lcid-rs" in ambiguous contexts (linking to the repo, blog posts, etc), and "lcid" only in Rust code/configuration. The maintenance status is "as-is". I'm happy to accept pull requests for corrections (as long as they align with MS-LCID and the Windows API), pull requests for new features, and pull requests for new MS-LCID protocol revisions in the future. ## MS-LCID protocol revision This library currently tracks the `15.0`/2021-06-25 protocol revision. Future protocol revisions will may only trigger a minor version bump, so if you need lookup behaviour of a specific revision, pin this crate accordingly. ## Changelog ### [0.3.0] - 2023-06-15 * Tracks MS-LCID `15.0`/2021-05-25 protocol revision * Breaking change: As the spec no longer enumerates "Locale Names without LCIDs", these are no longer supported * Codegen: Sort order is now as specified in the MS-LCID spec ### [0.2.1] - 2023-06-10 * Remove `thiserror` dependency * MSRV is Rust 1.56 * Edition is 2021 * Add `PartialEq`, `Eq`, and `Hash` traits to `AnsiCodePage` and `LanguageId` ### [0.2.0] - 2021-06-08 * Tracks MS-LCID `14.1`/2021-04-07 protocol revision * Provide ANSI code page information * Move `LanguageId` constants to a module, to avoid cluttering the crate namespace (breaking change) * Codegen: Sort languages by LCID and name, so the generated code is stable for languages that share an LCID (`0x1000` ones) ### [0.1.0] - 2021-06-06 * Initial release ## How the information was generated First, information was extracted from the [MS-LCID PDF](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f) corresponding to the tracked protocol revision, and from the HTML table of the [associated LCIDs](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/63d3d639-7fd2-4afb-abbe-0d5b5551eef8). This was then manually cleaned, converted to JSON, and compared. The `GetCultureInfo.ps1` script was run on a Windows Server 2022 machine (Build 20348, locale "en-us") and a Windows 10 (Build 19045, locale "en-us") to gather further information from the `System.Globalization.CultureInfo` API, based on the language IDs in MS-LCID. The values returned by the API do not always match the information in MS-LCID, so some fix-up were applied. For details, please see [`lcid_gen`](lcid_gen/src/). Since there were differences between the output on Windows Server 2022 and Windows 10, additional fix-ups were applied so that the information matches. Many of these are listed in the errata section. Finally, the `lcid_gen` crate was invoked to generate code for the `lcid` crate. The generated code is committed to the repository. This is done to avoid having a build-time dependency on the JSON files. ## MS-LCID/CultureInfo errata ### Protocol revision `15.0`/2021-06-25 * The download link for the diff file is [incorrect](https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-LCID/%5bMS-LCID%5d-210625-diff.pdf) and points to `[MS-LCID]-210625-diff.pdf`; the [correct link](https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-LCID/%5bMS-LCID%5d-diff.pdf) points to `[MS-LCID]-diff.pdf`. * The language ID for "quz-PE" is misprinted as `0x0C6b`. It should be `0x0C6B`, as all other language IDs are upper-cased hexadecimal. This does no affect lcid-rs. * On some versions (Windows 10 only?), the culture information's name for "zh-Hans"/`0x0004` is returned as "zh-CHS", and the name for "zh-Hant"/`0x7C04` is returned as "zh-CHT". These are legacy names. This is [a known problem](https://social.msdn.microsoft.com/Forums/en-US/8b93c07b-93bd-465f-b48f-0fff544c06d8/), which [Microsoft acknowledges](https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo): > There are two culture names that contradict this rule. The cultures Chinese (Simplified), named `zh-Hans`, and Chinese (Traditional), named `zh-Hant`, are neutral cultures. The culture names represent the current standard and should be used unless you have a reason for using the older names `zh-CHS` and `zh-CHT`. lcid-rs uses the names "zh-Hans"/"zh-Hant", and the English Names "Chinese (Simplified)"/"Chinese (Traditional)" (without the suffix "Legacy"). However, lcid-rs uses the Windows API three letter language code "CHT" instead of the sometimes used "ZHH" for "zh-Hant". * The culture information for "qut"/`0x0086` is quite broken. On Windows Server 2022, the LCID, ISO 639, and English Name are wrong or incomplete. On Windows 10, the culture information returned seems to be for "quc"/`0x0093`, which is reserved. This also means the culture information name does not match the MS-LCID name. lcid-rs v0.2 used to change this, but lcid-rs v0.3 uses the culture information as returned on Windows 10 when it was built, even though this seems to violate MS-LCID. * The MS-LCID spec specified "ff-NG, ff-Latn-NG" for `0x0467`. The culture information returned has the name "ff-Latn-NG". lcid-rs uses "ff-Latn-NG". * The culture information for "la-VA"/`0x0476` is a mess. When queried by LCID, the name is "la-001", and the English Name is "Latin (World)" (instead of "Latin (Vatican City)"). When queried by name, the LCID is incorrect (`0x1000`), and sometimes the English Name also. lcid-rs uses "la-VA" and "Latin (Vatican City)", as this is what is returned when queried by name. This also matches MS-LCID, which does not specify "la-001". * The culture information name for "es-ES_tradnl"/`0x040A` is "es-ES". However, the LCID, English Name, and Windows API three letter language code will be different from "es-ES"/`0x0C0A`. lcid-rs does not change this. * The ISO 639 two and three letter language codes for "no"/`0x0014` are confusing. On Windows Server 2022, they are "no"/"nor". On Windows 10, they seem to be "nb"/"nob" for "Bokmål". If you are Norwegian, please weigh in. lcid-rs uses "nb"/"nob". * Further small fix-ups to some English Names are documented in [`lcid_gen/src/fixup.rs`](lcid_gen/src/fixup.rs). Generally, a preference was given to the values returned by Windows 10. ### Protocol revision `14.1`/2021-04-07 * "es-CU" is listed twice. Once as `0x5C0A` in the "Language ID" table, and once in the "Locale Names without LCIDs" table as `0x1000`. The former LCID was used. * "ff-Latn-GM" is misprinted as "ff-latn-GM" (lower-case "l"). This was corrected. * Many more culture information errata/fix-ups. ## License Licensed under either of * Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0) * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT) at your option. ## Contribution Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.