# subtile-ocr

`subtile-ocr` is a blazingly fast and accurate DVD `VobSub` to SRT subtitle conversion tool.
It started as a fork of [vobsubocr](https://github.com/elizagamedev/vobsubocr).

## Background

DVD subtitles are unfortunately encoded essentially as a series of images. This presents problems when a text representation of the subtitles is needed, e.g. for language learning. `subtile-ocr` can alleviate this problem by generating SRT subtitles from an input `VobSub` file, leveraging the power of [Tesseract](https://github.com/tesseract-ocr/tesseract).

## Installation

Install the latest release with cargo:

```sh
cargo install subtile-ocr
```

Or alternatively, install the development version from git:

```sh
cargo install --git https://github.com/gwen-lg/subtile-ocr
```

You will need to have Tesseract's development libraries installed; see the [leptess readme](https://github.com/houqp/leptess) for more details. If you use Nix, the included `shell.nix` provides an environment with all of the necessary dependencies.

## Usage

```sh
# Convert simplified Chinese vobsub subtitles and print them to stdout.
subtile-ocr -l chi_sim shrek_chi.idx

# Convert English vobsub subtitles and write them to a file named "shrek_eng.srt".
subtile-ocr -l eng -o shrek_eng.srt shrek_eng.idx
```

We can also specify more advanced configuration options for Tesseract with `-c`.

```sh
# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
subtile-ocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx
```

## How does it work/compare to similar tools?

The most comparable tool to `subtile-ocr` is [VobSub2SRT](https://github.com/ruediger/VobSub2SRT), but `subtile-ocr` has significantly better output, especially for non-English languages, mainly because `VobSub2SRT` does not do much preprocessing of the image at all before sending it to Tesseract.
For example, Tesseract 4.0 expects black text on a white background, which `VobSub2SRT` does not guarantee, but `subtile-ocr` does. Additionally, `subtile-ocr` splits each subtitle line into its own image to take advantage of page segmentation mode 7 (treat the image as a single text line), which greatly improves accuracy, for non-English languages in particular. Official documentation on how to improve the accuracy of Tesseract output can be viewed [here](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).

## Miscellaneous Notes

From my understanding, the `chi_sim` and `chi_tra` Tesseract models work on both simplified and traditional Chinese text, but automatically convert said text to their respective forms.
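
As a further note on the two preprocessing steps described in the comparison section (black text on a white background, and one image per text line), here is a minimal Rust sketch of the general idea. It is not taken from the `subtile-ocr` source: the `GrayImage` type, the function names `to_black_on_white` and `split_lines`, the threshold of 128, and the "cut at rows without text pixels" rule are all assumptions made for illustration.

```rust
/// Hypothetical 8-bit grayscale image stored row by row.
/// This is an illustration type, not one used by subtile-ocr.
#[derive(Clone, Debug, PartialEq)]
struct GrayImage {
    width: usize,
    height: usize,
    pixels: Vec<u8>,
}

impl GrayImage {
    /// Borrow the pixels of row `y`.
    fn row(&self, y: usize) -> &[u8] {
        &self.pixels[y * self.width..(y + 1) * self.width]
    }
}

/// Turn light text on a dark background into black text on white.
/// Assumed rule: any pixel at or above `threshold` belongs to the text.
fn to_black_on_white(img: &GrayImage, threshold: u8) -> GrayImage {
    let pixels = img
        .pixels
        .iter()
        .map(|&p| if p >= threshold { 0 } else { 255 })
        .collect();
    GrayImage { width: img.width, height: img.height, pixels }
}

/// Split a black-on-white image into one image per text line,
/// cutting wherever a row contains no black pixel.
fn split_lines(img: &GrayImage) -> Vec<GrayImage> {
    let mut lines = Vec::new();
    let mut start: Option<usize> = None;
    // Iterate one step past the last row so a trailing line is flushed.
    for y in 0..=img.height {
        let has_text = y < img.height && img.row(y).iter().any(|&p| p == 0);
        match (start, has_text) {
            (None, true) => start = Some(y),
            (Some(s), false) => {
                lines.push(GrayImage {
                    width: img.width,
                    height: y - s,
                    pixels: img.pixels[s * img.width..y * img.width].to_vec(),
                });
                start = None;
            }
            _ => {}
        }
    }
    lines
}

fn main() {
    // 3x5 image: white text (255) on black (0), two text lines
    // separated by one blank row.
    let img = GrayImage {
        width: 3,
        height: 5,
        pixels: vec![
            0, 0, 0, //
            255, 0, 255, //
            0, 0, 0, //
            0, 255, 0, //
            0, 255, 0, //
        ],
    };
    let lines = split_lines(&to_black_on_white(&img, 128));
    println!("{} line images", lines.len());
}
```

Each resulting line image could then be handed to Tesseract in single-line mode, for instance with `--psm 7` when using the `tesseract` command-line tool. Real subtitle bitmaps may need a more careful split than this (e.g. to avoid separating diacritics from the characters they belong to), so treat the sketch as a starting point only.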