mdbook-fix-cjk-spacing

Crates.io	mdbook-fix-cjk-spacing
lib.rs	mdbook-fix-cjk-spacing
version	0.1.1
created_at	2020-07-20 07:49:54.316427+00
updated_at	2020-07-20 09:38:31.821385+00
description	mdbook preprocess that fixes CJK line breaks
homepage	https://github.com/lotabout/mdbook-fix-cjk-spacing
repository	https://github.com/lotabout/mdbook-fix-cjk-spacing
id	267137
size	101,488

Jinzhou Zhang (lotabout)

documentation

https://github.com/lotabout/mdbook-fix-cjk-spacing

README

mdbook renders an extra space where a line break falls between two CJK characters. For example:

.....中文结尾
中文顶格...

will result in

.....中文结尾 中文顶格...
             `- note the space here

This preprocessor will fix that.

Usage

Download the binary from the release page and put it in your PATH.
- Alternatively, build from source: cargo install mdbook-fix-cjk-spacing

Add the following config to your book.toml:

[preprocessor.fix-cjk-spacing]
command = "mdbook-fix-cjk-spacing"

Done

How does it work?

This preprocessor works on the AST of the markdown file:

It uses pulldown-cmark to parse the markdown file.
When it encounters a SoftBreak token, it searches before and after it for a Text token.
The SoftBreak is omitted when the previous text ends with a CJK character and the next text starts with one.
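
The rule above can be sketched with the standard library alone. This is a simplified model, not the crate's actual code: the `Event` enum is a hypothetical stand-in for pulldown-cmark's much larger event type, `is_cjk` covers only a few assumed Unicode blocks, and the sketch only inspects the events immediately next to the SoftBreak instead of searching further.

```rust
/// Hypothetical stand-in for parser events; pulldown-cmark's real
/// `Event` type has many more variants than these two.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Text(String),
    SoftBreak,
}

/// Assumed CJK test covering a few common Unicode blocks.
/// The ranges the crate really uses are not stated in this README.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{3000}'..='\u{303F}'   // CJK symbols and punctuation
        | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        | '\u{FF00}'..='\u{FFEF}' // Halfwidth and fullwidth forms
    )
}

/// Drop every SoftBreak whose previous event is text ending in a CJK
/// character and whose next event is text starting with one.
fn drop_cjk_soft_breaks(events: &[Event]) -> Vec<Event> {
    let mut out = Vec::with_capacity(events.len());
    for (i, ev) in events.iter().enumerate() {
        if *ev == Event::SoftBreak {
            let prev_cjk = matches!(
                i.checked_sub(1).and_then(|p| events.get(p)),
                Some(Event::Text(t)) if t.chars().last().map_or(false, is_cjk)
            );
            let next_cjk = matches!(
                events.get(i + 1),
                Some(Event::Text(t)) if t.chars().next().map_or(false, is_cjk)
            );
            if prev_cjk && next_cjk {
                continue; // omit the break so no space is rendered
            }
        }
        out.push(ev.clone());
    }
    out
}
```

Passing `[Text("中文结尾"), SoftBreak, Text("中文顶格")]` to `drop_cjk_soft_breaks` returns the two text events with the break removed, while a SoftBreak between Latin text such as "line 1" and "line 2" is kept.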

The binary has a "raw" mode for showing the processed output:

cat markdown.md | mdbook-fix-cjk-spacing raw

The problem

In markdown, several consecutive lines are parsed as a single block:

line 1
line 2
line 3

// will be parsed as

<p>line 1
line 2
line 3</p>

That means the line breaks are kept and all three lines are treated as one paragraph.

However, the browser converts a line break inside a <p> into a single space, so in a browser the content above looks like:

line 1 line 2 line 3

That is fine except for Chinese. Chinese text does not put spaces between words, so when we write:

中文第一行
中文接上行

// will show as

中文第一行 中文接上行
//        `- note the space here

It is really frustrating! There are two main solutions:

Fix the markdown parsing code to handle this case correctly.
Write the whole paragraph on one long line.

The first option is not very practical: this 'bug' has existed for a long time and is still not fixed. The second is tedious and unfriendly to write.

So here comes our solution with mdbook: Write a preprocessor to merge Chinese lines automatically before parsing!

The use case

Only the following situation is handled:

...<chinese character>[should contain no spaces]
[zero or more spaces|tab]<chinese character>

.....中文结尾
中文顶格...

// are modified to
.....中文结尾中文顶格...
//           `- note no space here

Note that content inside code blocks is not changed.
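
The pattern above can also be read as a line-based rule, which the following standard-library sketch illustrates. It is not how the preprocessor is implemented (that works on parser events); `merge_cjk_lines` and `is_cjk` are hypothetical names, the Unicode ranges are an assumption, and code blocks are recognized only by a naive check for lines starting with three backticks.

```rust
/// Assumed CJK test covering a few common Unicode blocks.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{3000}'..='\u{303F}'   // CJK symbols and punctuation
        | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        | '\u{FF00}'..='\u{FFEF}' // Halfwidth and fullwidth forms
    )
}

/// Join a line onto the previous one when the previous line ends with a
/// CJK character (no trailing spaces) and this line starts with one after
/// optional leading spaces or tabs. Lines inside fenced code are untouched.
fn merge_cjk_lines(input: &str) -> String {
    let mut out = String::new();
    let mut in_fence = false;
    // True when the last emitted line ends with CJK outside a fence.
    let mut prev_joinable = false;

    for (i, line) in input.lines().enumerate() {
        let is_fence = line.trim_start().starts_with("```");
        let trimmed = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
        let starts_cjk = trimmed.chars().next().map_or(false, is_cjk);

        if !in_fence && !is_fence && prev_joinable && starts_cjk {
            // Merge: no newline, and the leading indentation is dropped.
            out.push_str(trimmed);
        } else {
            if i > 0 {
                out.push('\n');
            }
            out.push_str(line);
        }

        if is_fence {
            in_fence = !in_fence;
        }
        prev_joinable =
            !in_fence && !is_fence && line.chars().last().map_or(false, is_cjk);
    }

    // `lines()` drops a final newline, so restore it if the input had one.
    if input.ends_with('\n') {
        out.push('\n');
    }
    out
}
```

Under these assumptions, `merge_cjk_lines(".....中文结尾\n中文顶格...")` yields `.....中文结尾中文顶格...`, while Latin lines and anything between backtick fences come back unchanged.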
