wiktionary-zim-trimmer

Crates.io: wiktionary-zim-trimmer
lib.rs: wiktionary-zim-trimmer
version: 1.0.2
created_at: 2025-11-19 01:21:25.315533+00
updated_at: 2025-11-20 19:49:29.565789+00
description: A tool for reducing sizes of Wiktionary ZIM archives by filtering languages and removing specified parts of the content
repository: https://codeberg.org/tomekb234/wiktionary-zim-trimmer
id: 1939260
size: 211,447
author: Tomasz Buczyński (tomekb234)

README

wiktionary-zim-trimmer

A tool for reducing sizes of Wiktionary ZIM archives by filtering languages and removing specified parts of the content

  • Currently works only with the English-language edition of Wiktionary.
  • Example: Filtering out all languages except English reduces the size of the ZIM archive from about 8 GiB to about 500 MiB.
  • Allows customizable filtering of languages, referenced languages (translations and descendants) and sections.
  • Not a Wiktionary parser — processes HTML and not Wikitext, with minimal assumptions about the document structure, but just enough to find the content that should be removed. Informs the user about any broken assumptions and any otherwise suspicious conditions in the HTML structure.
  • Fairly efficient — written in Rust and in some parts in C++, uses multiple threads for processing, uses a fast HTML parsing library (tl). The processing can take from a few minutes to a few hours, depending on the CPU speed and the number of CPU cores.
  • Uses libzim — the official, standard implementation of the ZIM format. (Rust-C++ interop is done with cxx.)
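The idea of language filtering at the HTML level can be illustrated with a minimal, standard-library-only sketch. It is not the program's actual implementation (which uses tl and libzim). It assumes, purely for illustration, that every language section starts with a plain `<h2>Language</h2>` heading with no attributes; the function name `keep_languages` is hypothetical.

```rust
/// Keeps only the sections whose `<h2>` heading text is listed in `keep`.
/// Assumption (for illustration only): each language section starts with a
/// plain `<h2>Language</h2>` heading without attributes or nested tags.
fn keep_languages(html: &str, keep: &[&str]) -> String {
    const OPEN: &str = "<h2>";
    const CLOSE: &str = "</h2>";

    // Everything before the first heading is a preamble and is always kept.
    let first = html.find(OPEN).unwrap_or(html.len());
    let mut out = String::from(&html[..first]);
    let mut rest = &html[first..];

    while !rest.is_empty() {
        // `rest` starts with OPEN here; the section runs up to the next heading.
        let end = rest[OPEN.len()..]
            .find(OPEN)
            .map(|i| i + OPEN.len())
            .unwrap_or(rest.len());
        let section = &rest[..end];

        // The heading text sits between OPEN and CLOSE.
        let name = section[OPEN.len()..]
            .split(CLOSE)
            .next()
            .unwrap_or("")
            .trim();

        // Copy the section only if its language is on the keep list.
        if keep.contains(&name) {
            out.push_str(section);
        }
        rest = &rest[end..];
    }
    out
}

fn main() {
    let page = "<p>intro</p><h2>English</h2><p>a</p><h2>Polish</h2><p>b</p>";
    let trimmed = keep_languages(page, &["English"]);
    println!("{trimmed}");
}
```

Real Wiktionary pages are considerably more complex than this, which is why the program relies on an actual HTML parser and reports any broken assumptions instead of silently matching strings.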

See the manual for more information.

ZIM files

Wiktionary ZIM files can be downloaded from the Wikimedia downloads page or from the Kiwix Library.

Note that this program only supports the newest Wiktionary ZIM archive and does not guarantee backward compatibility with older ones.

ZIM files can be read with Kiwix, a free and open-source offline web browser.

Downloading

First, check if this program can be installed with your preferred package manager (if any). If not, you can either build it from source or download a prebuilt version from the releases page.

To use a prebuilt version on Windows, make sure that the latest Visual C++ Redistributable is also installed.

Note: Prebuilt versions are bundled with libzim releases obtained from the libzim downloads page.

Building from source

To build this program from source, first install the following:

  • a Rust toolchain (with Cargo),
  • a C++ compiler with C++17 support,
  • libzim, version 9.4 or any compatible version.

The source code of this program is distributed on crates.io, and the easiest method of building and installing it is by simply entering the following in the terminal:

cargo install wiktionary-zim-trimmer

You can also obtain the source code from the project's repository. Refer to the Cargo manual for build and install instructions.

Usage

This program can be used with a command-line interface. See the manual for instructions.

If you have downloaded a prebuilt version from the releases page, remember to first enter the program's directory in the terminal. On GNU/Linux, you should then type ./wiktionary-zim-trimmer instead of wiktionary-zim-trimmer. You can also add the directory to your PATH environment variable to allow running wiktionary-zim-trimmer from any directory (which effectively installs the program).

Related projects and rationale for this one

Wiktextract is a project aiming to parse the whole of Wiktionary and provide its content in a formal, machine-readable format. This is very valuable for linguistic research, and it can also be used to generate alternative presentation formats of Wiktionary. Ebook dictionary creator is one such project, using Wiktextract data to generate a presentation format of Wiktionary suitable for ebook readers. You may want to use it instead of wiktionary-zim-trimmer if you prefer to see only word definitions, in a plain and concise format without the additional details provided by Wiktionary.

Before writing wiktionary-zim-trimmer, I also used Wiktextract to generate a "trimmed Wiktionary" for personal use, but the result was not perfect: Wiktextract does not capture (as of writing this file) all the details that I find interesting (e.g. usage notes), and it seems rather awkward to have to reinvent the HTML presentation of fully detailed Wiktionary data when Wiktionary itself is presented in HTML in the first place. I considered working with Wikitext, but rendering it properly to HTML is too burdensome (it requires setting up MediaWiki) and takes too long. Working directly with HTML thus seemed to be the best solution (despite all the risks of this approach), and hence wiktionary-zim-trimmer was born.

License

This program is released under the GNU General Public License, version 3 or later. See LICENSE for more details.
