# Data

Optional (but included by default) corpus data.

## License

The corpus data included in this library was downloaded and is redistributed under the [CC BY 2.0 FR DEED (Attribution 2.0 France) license][3] ([English version][4]). No changes have been made.

## Data format

The format of the sequence number seems to have changed from `#ID=nnnnnn` to `#ID=nnnnnn_nnnnnn`. We treat the ID as a string, so the older format should be handled as well.

*Copied from a section on [this page][1].*

The file is in text format, with the Japanese in either EUC-JP or UTF8 encoding. If you wish to have it in any other format or coding, you will have to convert it yourself.

The format is as follows:

- the file consists of pairs of lines, beginning with "A: " and "B: " respectively. There are also comment lines which begin with a "#". In many cases these are A:/B: lines that have been removed from the file as far as WWWJDIC is concerned.
- the "A:" lines contain the Japanese sentence and the English translation, separated by a TAB character. At the end of the English translation is a sequence number identifying the sentence pair. It is in the format: #ID=nnnnnn. This sequence number is used to identify the pair uniquely across several projects using the file.
- the "B:" lines contain a space-delimited list of Japanese words found in the preceding sentence.
- the Japanese words in the "B:" lines can have the following appended:
  - a reading in hiragana. This is to resolve cases where the word can be read different ways. WWWJDIC uses this to ensure that only the appropriate sentences are linked. The reading is in "round" parentheses.
  - a sense number. This occurs when the word has multiple senses in the EDICT file, and indicates which sense applies in the sentence. WWWJDIC displays these numbers. The sense number is in "square" parentheses.
  - the form in which the word appears in the sentence. This will differ from the indexing word if it has been inflected, for example. This field is in "curly" parentheses.
  - a "~" character to indicate that the sentence pair is a good and checked example of the usage of the word. Words are marked to enable appropriate sentences to be selected by dictionary software. Typically only one instance per sense of a word will be marked. The WWWJDIC server displays these sentences below the display of the related dictionary entry.

The following example pair illustrates the format:

```text
A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
```

## Problems with the public domain version

In theory there is an older public domain version of the corpus available (see [here][2]), but I could not convert it to UTF-8. I could not find its encoding listed anywhere, and `file` says it's ISO-8859 text:

```bash
$ file examples_pd
examples_pd: ISO-8859 text, with very long lines (330), with CRLF line terminators
```

`iconv` recognises a few versions of ISO-8859, but none of them produced sensible output. A guess that the file might be EUC-JP encoded (as later versions are) failed for me too.

The public domain version also uses a different format for the ID. Because we treat the ID as a string, it should in theory be handled, but this has not been tested.

[1]: https://dict.longdo.com/about/hintcontents/tanakacorpus.html
[2]: https://www.edrdg.org/wiki/index.php/Tanaka_Corpus#Downloads
[3]: https://creativecommons.org/licenses/by/2.0/fr/
[4]: https://creativecommons.org/licenses/by/2.0/fr/deed.en
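
## Parsing sketch

The A:/B: layout described under "Data format" can be read with a few lines of code. The following is a minimal Python sketch, not this library's actual implementation: the names `Word`, `Pair`, `parse_word` and `parse_pair` are hypothetical and chosen only for illustration. It assumes the input is already decoded to a Unicode string, that the `[TAB]` placeholder in the example stands for a real tab character, and that annotations on a "B:" word may appear in any order. The ID is kept as a string, so both `nnnnnn` and `nnnnnn_nnnnnn` pass through unchanged.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:  # hypothetical type, not part of the library
    headword: str
    reading: Optional[str] = None   # from (...)
    sense: Optional[int] = None     # from [...]
    surface: Optional[str] = None   # from {...}
    checked: bool = False           # trailing "~"


@dataclass
class Pair:  # hypothetical type, not part of the library
    japanese: str
    english: str
    sentence_id: str                # kept as a string on purpose
    words: List[Word]


# One alternative per annotation kind described in the format notes.
_ANNOTATION = re.compile(
    r"\((?P<reading>[^)]*)\)|\[(?P<sense>\d+)\]|\{(?P<surface>[^}]*)\}|(?P<checked>~)"
)


def parse_word(token: str) -> Word:
    # The headword is everything before the first annotation character.
    first = re.search(r"[(\[{~]", token)
    head = token if first is None else token[: first.start()]
    word = Word(headword=head)
    for m in _ANNOTATION.finditer(token[len(head):]):
        if m.lastgroup == "reading":
            word.reading = m.group("reading")
        elif m.lastgroup == "sense":
            word.sense = int(m.group("sense"))
        elif m.lastgroup == "surface":
            word.surface = m.group("surface")
        else:
            word.checked = True
    return word


def parse_pair(a_line: str, b_line: str) -> Pair:
    if not a_line.startswith("A: ") or not b_line.startswith("B: "):
        raise ValueError("expected an 'A: ' line followed by a 'B: ' line")
    japanese, sep, rest = a_line[3:].rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError("no TAB between Japanese and English text")
    english, sep, sentence_id = rest.rpartition("#ID=")
    if not sep:
        raise ValueError("no #ID= marker on the 'A: ' line")
    tokens = [t for t in b_line[3:].strip().split(" ") if t]
    return Pair(japanese, english, sentence_id, [parse_word(t) for t in tokens])


a = "A: その家はかなりぼろ屋になっている。\tThe house is quite run down.#ID=25507"
b = "B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}"
pair = parse_pair(a, b)
print(pair.sentence_id, len(pair.words))  # prints: 25507 6
```

Comment lines beginning with "#" are not handled here; a caller would need to skip them before pairing up "A:" and "B:" lines.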