# Basic Text The *Basic Text* format is a subset of the [Unicode] format and meant to fulfill common notions of "plain text". It is not yet standardized anywhere, and may evolve, but it is usable for many purposes. Basic Text permits homoglyphs and other visual ambiguities; see [Restricted Text] for an alternative which might provide some mitigations. For rationale and background information, see [Background]. For a prototype implementation, see [the Github repo]. [the Github repo]: https://github.com/sunfishcode/basic-text/ [Background]: Background.md ## Definitions A string is in Basic Text form iff: - it is a [Unicode] string in [Stream-Safe] [NFC] form, and - it doesn't start with a [Basic Text non-starter], and - it doesn't end with a [Basic Text non-ender], and - all scalar values with a `General_Category` of `Unassigned` are preceded and followed by U+34F. - it doesn't contain any of the sequences listed in the [Sequence Table]. A stream is in Basic Text form iff: - it consists entirely of a string in Basic Text form, and - it is empty or ends with U+A. ### Supplementary definitions #### Basic Text non-starter [Basic Text non-starter]: #basic-text-non-starter A Unicode scalar value is a Basic Text non-starter iff: - it is a [normalization-form non-starter], or - its [`Grapheme_Cluster_Break`] is `ZWJ`, `SpacingMark` or `Extend` and it isn't U+34F. #### Basic Text non-ender [Basic Text non-ender]: #basic-text-non-ender A Unicode scalar value is a Basic Text non-ender iff: - its `Grapheme_Cluster_Break` is `ZWJ` or `Prepend`. [normalization-form non-starter]: https://unicode.org/reports/tr15/#Description_Norm [`Grapheme_Cluster_Break`]: https://unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values ## Sequence Table | Sequence | aka | Replacement | Error | | ------------------- | ---- | ----------------- | ----------------------------------------- | | U+D U+A | CRLF | U+A | "Use U+A to terminate a line" | | U+D | CR | U+A | "Use U+A to terminate a line" | | \[U+C\]+ U+D U+A | | U+A | "Control code not valid in text" | | \[U+C\]+ U+A | | U+A | "Control code not valid in text" | | \[U+C\]+ U+D | | U+A | "Control code not valid in text" | | \[U+C\]+ | FF | U+20 | "Control code not valid in text" | | U+1B U+5B \[U+20–U+3F\]\* U+6D | SGR | | "Color escape sequences are not enabled" | | \[U+1B\]+ U+5B U+5B \[U+–U+7F\]? | | | "Unrecognized escape sequence" | | \[U+1B\]+ U+5B \[U+20–U+3F\]\* \[U+40–U+7E\]? | CSI | | "Unrecognized escape sequence" | | \[U+1B\]+ U+5D \[\^U+7,U+18,U+1B\]\* \[U+7,U+18\]? | OSC | | "Unrecognized escape sequence" | | \[U+1B\]+ \[U+40–U+7E\] | ESC | | "Unrecognized escape sequence" | | \[U+1B\]+ | ESC | | "Escape code not valid in text" | | \[U+0–U+8,U+B,U+E–U+1F\] | C0 | U+FFFD | "Control code not valid in text" | | U+7F | DEL | U+FFFD | "Control code not valid in text" | | U+85 | NEL | U+20 | "Control code not valid in text" | | \[U+80–U+84,U+86–U+9F\] | C1 | U+FFFD | "Control code not valid in text" | | U+149 | `ʼn` | U+2BC U+6E | "Use U+2BC U+6E instead of U+149" | | U+673 | `ا ٟ` | U+627 U+65F | "Use U+627 U+65F instead of U+673" | | U+9E4 | `।` | U+FFFD | "Use U+964 instead of U+9E4" | | U+9E5 | `॥` | U+FFFD | "Use U+965 instead of U+9E5" | | U+A64 | `।` | U+FFFD | "Use U+964 instead of U+A64" | | U+A65 | `॥` | U+FFFD | "Use U+965 instead of U+A65" | | U+AE4 | `।` | U+FFFD | "Use U+964 instead of U+AE4" | | U+AE5 | `॥` | U+FFFD | "Use U+965 instead of U+AE5" | | U+B64 | `।` | U+FFFD | "Use U+964 instead of U+B64" | | U+B65 | `॥` | U+FFFD | "Use U+965 instead of U+B65" | | U+BE4 | `।` | U+FFFD | "Use U+964 instead of U+BE4" | | U+BE5 | `॥` | U+FFFD | "Use U+965 instead of U+BE5" | | U+C64 | `।` | U+FFFD | "Use U+964 instead of U+C64" | | U+C65 | `॥` | U+FFFD | "Use U+965 instead of U+C65" | | U+CE4 | `।` | U+FFFD | "Use U+964 instead of U+CE4" | | U+CE5 | `॥` | U+FFFD | "Use U+965 instead of U+CE5" | | U+D64 | `।` | U+FFFD | "Use U+964 instead of U+D64" | | U+D65 | `॥` | U+FFFD | "Use U+965 instead of U+D65" | | U+F77 | `◌ྲ◌ཱྀ` | U+FB2 U+F71 U+F80 | "Use U+FB2 U+F71 U+F80 instead of U+F77" | | U+F79 | `◌ླ◌ཱྀ` | U+FB3 U+F71 U+F80 | "Use U+FB3 U+F71 U+F80 instead of U+F79" | | U+17A3 | `អ` | U+17A2 | "Use U+17A2 instead of U+17A3" | | U+17A4 | `អា` | U+17A2 U+17B6 | "Use U+17A2 U+17B6 instead of U+17A4" | | U+17B4 | | U+FFFD | "Omit U+17B4" | | U+17B5 | | U+FFFD | "Omit U+17B5" | | U+17D8 | | U+FFFD | "Spell beyyal with normal letters" | | U+2028 | LS | U+20 | "Line separation is a rich-text function" | | U+2029 | PS | U+20 | "Paragraph separation is a rich-text function" | | U+202A | LRE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+202B | RLE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+202C | PDF | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+202D | LRO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+202E | RLO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+2066 | LRI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+2067 | RLI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+2068 | FSI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | U+2069 | PDI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" | | \[U+206A–U+206F\] | | U+FFFD | "Deprecated Format Characters are deprecated" | | U+2072 | `²` | U+FFFD | "Use U+B2 instead of U+2072" | | U+2073 | `³` | U+FFFD | "Use U+B3 instead of U+2073" | | U+2126 | `Ω` | U+3A9 | "Use U+3A9 instead of U+2126" | | U+212A | `K` | U+4B | "Use U+4B instead of U+212A" | | U+212B | `Å` | U+C5 | "Use U+C5 instead of U+212B" | | U+2329 | `⟨` | U+FFFD | "Use U+27E8 instead of U+2329" | | U+232A | `⟩` | U+FFFD | "Use U+27E9 instead of U+232A" | | U+2DF5 | ` ⷭⷮ` | U+2DED U+2DEE | "Use U+2DED U+2DEE instead of U+2DF5" | | \[U+F900–U+FA0D,U+FA10,U+FA12,U+FA15–U+FA1E,U+FA20,U+FA22,U+FA25–U+FA26,U+FA2A–U+FA6D,U+FA70–U+FAD9\] | | CJK compatibility ideograph [Standardized Variant] | "Use Standardized Variants instead of CJK Compatibility Ideographs" | | U+FB00 | `ff` | U+66 U+66 | "Use U+66 U+66 instead of U+FB00" | | U+FB01 | `fi` | U+66 U+69 | "Use U+66 U+69 instead of U+FB01" | | U+FB02 | `fl` | U+66 U+6C | "Use U+66 U+6C instead of U+FB02" | | U+FB03 | `ffi`| U+66 U+66 U+66 | "Use U+66 U+66 U+69 instead of U+FB03" | | U+FB04 | `ffl`| U+66 U+66 U+6C | "Use U+66 U+66 U+6C instead of U+FB04" | | U+FB05 | `ſt` | U+17F U+74 | "Use U+17F U+74 instead of U+FB05" | | U+FB06 | `st` | U+73 U+74 | "Use U+73 U+74 instead of U+FB06" | | \[U+FDD0–U+FDEF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | U+FEFF | BOM | U+2060 | "U+FEFF is not necessary in Basic Text" | | \[U+FFF9–U+FFFB\] | | U+FFFD | "Interlinear Annotations depend on out-of-band information" | | U+FFFC | ORC | U+FFFD | "U+FFFC depends on out-of-band information" | | \[U+FFFE,U+FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | U+1D455 | `ℎ` | U+FFFD | "Use U+210E instead of U+1D455" | | U+1D49D | `ℬ` | U+FFFD | "Use U+212C instead of U+1D49D" | | U+1D4A0 | `ℰ` | U+FFFD | "Use U+2130 instead of U+1D4A0" | | U+1D4A1 | `ℱ` | U+FFFD | "Use U+2131 instead of U+1D4A1" | | U+1D4A3 | `ℋ` | U+FFFD | "Use U+210B instead of U+1D4A3" | | U+1D4A4 | `ℐ` | U+FFFD | "Use U+2110 instead of U+1D4A4" | | U+1D4A7 | `ℒ` | U+FFFD | "Use U+2112 instead of U+1D4A7" | | U+1D4A8 | `ℳ` | U+FFFD | "Use U+2133 instead of U+1D4A8" | | U+1D4AD | `ℛ` | U+FFFD | "Use U+211B instead of U+1D4AD" | | U+1D4BA | `ℯ` | U+FFFD | "Use U+212F instead of U+1D4BA" | | U+1D4BC | `ℊ` | U+FFFD | "Use U+210A instead of U+1D4BC" | | U+1D4C4 | `ℴ` | U+FFFD | "Use U+2134 instead of U+1D4C4" | | U+1D506 | `ℭ` | U+FFFD | "Use U+212D instead of U+1D506" | | U+1D50B | `ℌ` | U+FFFD | "Use U+210C instead of U+1D50B" | | U+1D50C | `ℑ` | U+FFFD | "Use U+2111 instead of U+1D50C" | | U+1D515 | `ℜ` | U+FFFD | "Use U+211C instead of U+1D515" | | U+1D51D | `ℨ` | U+FFFD | "Use U+2128 instead of U+1D51D" | | U+1D53A | `ℂ` | U+FFFD | "Use U+2102 instead of U+1D53A" | | U+1D53F | `ℍ` | U+FFFD | "Use U+210D instead of U+1D53F" | | U+1D545 | `ℕ` | U+FFFD | "Use U+2115 instead of U+1D545" | | U+1D547 | `ℙ` | U+FFFD | "Use U+2119 instead of U+1D547" | | U+1D548 | `ℚ` | U+FFFD | "Use U+211A instead of U+1D548" | | U+1D549 | `ℝ` | U+FFFD | "Use U+211D instead of U+1D549" | | U+1D551 | `ℤ` | U+FFFD | "Use U+2124 instead of U+1D551" | | \[U+1FFFE,U+1FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | U+111C4 | `𑆏𑆀` | U+1118F U+11180 | "Use U+1118F U+11180 instead of U+111C4" | | \[U+2F800–U+2FA1D\] | | CJK compatibility ideograph [Standardized Variant] | "Use Standardized Variants instead of CJK Compatibility Ideographs" | | \[U+2FFFE,U+2FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+3FFFE,U+3FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+4FFFE,U+4FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+5FFFE,U+5FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+6FFFE,U+6FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+7FFFE,U+7FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+8FFFE,U+8FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+9FFFE,U+9FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+AFFFE,U+AFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+BFFFE,U+BFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+CFFFE,U+CFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+DFFFE,U+DFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | U+E0001 | | U+FFFD | "Language tagging is a deprecated mechanism" | | \[U+EFFFE,U+EFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+FFFFE,U+FFFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | | \[U+10FFFE,U+10FFFF\] | | U+FFFD | "Noncharacters are intended for internal use only" | ## Conversion ### From Unicode string to Basic Text string To convert a [Unicode] string into a Basic Text string in a manner that always succeeds, discarding information not usually considered meaningful or valid in plain text: - If the string starts with a [Basic Text non-starter], prepend U+34F. - If the string ends with a [Basic Text non-ender], append U+34F. - When [*NEL Compatibility*] is enabled, replace any U+85 with U+A. - When [*LSPS Compatibility*] is enabled, replace any U+2028 or U+2029 with U+A. - Perform the Replacement actions from the [Sequence Table]. - For any scalar value with `General_Category` of `Unassigned` that isn't already preceded by U+34F, insert U+34F before it. - For any scalar value with `General_Category` of `Unassigned` that isn't already followed by U+34F, insert U+34F after it. - Perform the [Stream-Safe Text Process (UAX15-D4)]. - Perform `toNFC` with the [Normalization Process]. [*NEL Compatibility*]: #options [*LSPS Compatibility*]: #options #### Options The following options may be enabled: | Name | Type | Default | | ------------------ | ------- | ------- | | NEL Compatibility | Boolean | `false` | | LSPS Compatibility | Boolean | `false` | ### From Unicode string to Basic Text string, strict To convert a [Unicode] string into a Basic Text string in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text: - If the string starts with a [Basic Text non-starter], error with "Basic Text string must not begin with Basic Text non-starter". - If the string ends with a [Basic Text non-ender], error with "Basic Text string must not end with Basic Text non-ender". - Perform the Error actions from the [Sequence Table]. - When [*CRLF Compatibility*] is enabled, replace any U+A with U+D U+A. - For any scalar value with `General_Category` of `Unassigned` that isn't already preceded by U+34F, insert U+34F before it. - For any scalar value with `General_Category` of `Unassigned` that isn't already followed by U+34F, insert U+34F after it. - Perform the [Stream-Safe Text Process (UAX15-D4)]. - Perform `toNFC` with the [Normalization Process]. [*CRLF Compatibility*]: #options #### Options The following options may be enabled: | Name | Type | Default | | ------------------ | ------- | ------- | | CRLF Compatibility | Boolean | `false` | ### From Unicode stream to Basic Text stream To convert a [Unicode] stream into a Basic Text stream in a manner than always succeeds, discarding information not usually considered meaningful or valid in plain text: - If the stream starts with U+FEFF, remove it. - If the stream is non-empty and doesn't end with U+A or U+D, append a U+A. - Perform [From Unicode string to Basic Text string]. ### From Unicode stream to Basic Text stream, strict To convert a [Unicode] stream into a Basic Text stream in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text: - When [*BOM Compatibility*] is enabled, insert a U+FEFF at the beginning of the stream. - If the stream is non-empty and doesn't end with U+A or U+D, error with "Basic Text stream must be empty or end with newline". - Perform [From Unicode string to Basic Text string, strict]. [Sequence Table]: #sequence-table [From Unicode string to Basic Text string]: #from-unicode-string-to-basic-text-string [From Unicode string to Basic Text string, strict]: #from-unicode-string-to-basic-text-string-strict [From Unicode stream to Basic Text stream]: #from-unicode-stream-to-basic-text-stream [From Unicode stream to Basic Text stream, strict]: #from-unicode-stream-to-basic-text-stream-strict [*BOM Compatibility*]: #options #### Options The following options may be enabled: | Name | Type | Default | | ------------------ | ------- | ------- | | BOM Compatibility | Boolean | `false` | [NFC]: https://unicode.org/reports/tr15/#Norm_Forms [Stream-Safe]: https://unicode.org/reports/tr15/#Stream_Safe_Text_Format [Stream-Safe Text Process (UAX15-D4)]: https://unicode.org/reports/tr15/#UAX15-D4 [Standardized Variant]: https://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt [Normalization Process]: https://unicode.org/reports/tr15/#Description_Norm [Restricted Text]: RestrictedText.md [Unicode]: Unicode.md