# Data

Optional (but included by default) corpus data.


## License

The corpus data included in this library was downloaded and is
redistributed under the [CC BY 2.0 FR DEED (Attribution 2.0 France)
license][3] ([English version][4]).

No changes have been made.


## Data format

<id class="warning">The format of the sequence number appears to have
changed from `#ID=nnnnnn` to `#ID=nnnnnn_nnnnnn`. The ID is treated as
an opaque string, so both formats should be handled.</id>

*Copied from a section on [this page][1].*

The file is in text format, with the Japanese in either EUC-JP or UTF8
encoding. If you wish to have it in any other format or coding, you
will have to convert it yourself.

The format is as follows:

- the file consists of pairs of lines, beginning with "A: " and "B: "
  respectively. There are also comment lines which begin with a
  "#". In many cases these are A:/B: lines that have been removed from
  the file as far as WWWJDIC is concerned.
- the "A:" lines contain the Japanese sentence and the English
  translation, separated by a TAB character. At the end of the English
  translation is a sequence number identifying the sentence pair. It
  is in the format: #ID=nnnnnn. This sequence number is used to
  identify the pair uniquely across several projects using the file.
- the "B:" lines contain a space-delimited list of Japanese words
  found in the preceding sentence.
- the Japanese words in the "B:" lines can have the following
  appended:
  - a reading in hiragana. This is to resolve cases where the word can
    be read different ways. WWWJDIC uses this to ensure that only the
    appropriate sentences are linked. The reading is in "round"
    parentheses.
  - a sense number. This occurs when the word has multiple senses in
    the EDICT file, and indicates which sense applies in the
    sentence. WWWJDIC displays these numbers. The sense number is in
    "square" parentheses.
  - the form in which the word appears in the sentence. This will
    differ from the indexing word if it has been inflected, for
    example. This field is in "curly" parentheses.
  - a "~" character to indicate that the sentence pair is a good and
    checked example of the usage of the word. Words are marked to
    enable appropriate sentences to be selected by dictionary
    software. Typically only one instance per sense of a word will be
    marked. The WWWJDIC server displays these sentences below the
    display of the related dictionary entry.

The following example pair illustrates the format:

```text
A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
```
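
The format described above can be read with a few lines of Python. This
is only a minimal sketch, not the parser this library uses: the names
`Token`, `parse_a_line` and `parse_b_line` are hypothetical, and it
assumes the optional fields on a "B:" word always appear in the order
reading, sense number, sentence form, `~`, as they do in the example
pair. The sequence number is kept as a string, so either ID format
passes through unchanged.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One "B:" word: headword, then optional (reading), [sense], {form}, ~.
# The field order is an assumption based on the example pair above.
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<checked>~)?$"
)


@dataclass
class Token:
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    form: Optional[str]
    checked: bool


def parse_a_line(line: str) -> Tuple[str, str, Optional[str]]:
    """Split an "A:" line into (japanese, english, sequence id)."""
    if not line.startswith("A: "):
        raise ValueError("not an A: line")
    body = line[len("A: "):].rstrip("\r\n")
    japanese, _, rest = body.partition("\t")
    english, sep, seq_id = rest.rpartition("#ID=")
    if not sep:  # no sequence number on this line
        return japanese, rest, None
    return japanese, english, seq_id  # the ID stays a string


def parse_b_line(line: str) -> List[Token]:
    """Split a "B:" line into its annotated words."""
    if not line.startswith("B: "):
        raise ValueError("not a B: line")
    tokens = []
    for raw in line[len("B: "):].split():
        m = TOKEN_RE.match(raw)
        if m is None:
            raise ValueError(f"unrecognised word: {raw!r}")
        sense = m.group("sense")
        tokens.append(Token(
            headword=m.group("headword"),
            reading=m.group("reading"),
            sense=int(sense) if sense is not None else None,
            form=m.group("form"),
            checked=m.group("checked") is not None,
        ))
    return tokens
```

Comment lines beginning with "#" are not handled here; a caller would
need to skip them before pairing each "A:" line with the "B:" line that
follows it.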


## Problems with the public domain version

In theory there is an older public domain version of the corpus
available (see [here][2]), but I could not convert this to UTF-8. I
could not find its encoding listed anywhere, and `file` says it's
ISO-8859 text:

```bash
$ file examples_pd
examples_pd: ISO-8859 text, with very long lines (330), with CRLF line terminators
```

`iconv` recognises several ISO-8859 variants, but none of them
produced sensible output, and a guess that the file might be EUC-JP
encoded (as later versions are) also failed for me. The public domain
version also uses a different format for the ID. Since the ID is
treated as a string this should in theory be handled, but it has not
been tested.
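
For anyone who wants to retry the conversion, the trial-and-error step
can be scripted. The sketch below is hedged: `try_decodings` is a
hypothetical helper, and the candidate codec list is a guess at common
Japanese encodings, not something documented for this file. A codec
that decodes without error is not necessarily the right one, so the
output still has to be checked by eye.

```python
from typing import Dict, Optional, Sequence

# Candidate codecs are an assumption: common Japanese encodings that
# Python's standard library supports.
CANDIDATES = ("utf-8", "euc_jp", "cp932", "iso2022_jp")


def try_decodings(
    data: bytes, candidates: Sequence[str] = CANDIDATES
) -> Dict[str, Optional[str]]:
    """Decode data with each codec; None marks a codec that failed."""
    results: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results
```

A possible use is to read the first few kilobytes of `examples_pd` in
binary mode, pass them to `try_decodings`, and print each non-`None`
result to see whether any of them looks like Japanese text.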


[1]: https://dict.longdo.com/about/hintcontents/tanakacorpus.html
[2]: https://www.edrdg.org/wiki/index.php/Tanaka_Corpus#Downloads
[3]: https://creativecommons.org/licenses/by/2.0/fr/
[4]: https://creativecommons.org/licenses/by/2.0/fr/deed.en