Crates.io | parse_wiktionary_cs |
lib.rs | parse_wiktionary_cs |
version | 0.1.0 |
source | src |
created_at | 2018-11-02 13:42:14.989333 |
updated_at | 2018-11-02 13:42:14.989333 |
description | Parse dictionary pages from the Czech language edition of Wiktionary into structured data |
homepage | |
repository | https://github.com/portstrom/parse_wiktionary_de |
max_upload_size | |
id | 94312 |
size | 159,010 |
Parse dictionary pages from the Czech language edition of Wiktionary into structured data.
The following information applies to all language editions of Parse Wiktionary. For information specific to each language edition, see its documentation.
Wiktionary is a dictionary with millions of entries containing a wide variety of data about words and phrases in many languages. The dictionary data is distributed under a free license, allowing it to be reused in other applications. Unfortunately, it is written in a format that makes it very hard to use in other applications. The format is designed only to be transformed into the exact HTML displayed in Wiktionary itself, not to be parsed into semantically meaningful data that can be used for other purposes or displayed in other formats.
The format does, however, contain enough structurally meaningful data that most of it can be parsed into structured data, albeit with great difficulty. Parse Wiktionary takes on this challenging task: it parses entries from Wiktionary into a structured format that makes it easy to query details about entries, use them for different purposes, and present them in different formats. Because each language edition of Wiktionary unfortunately has a completely different format, there is a separate edition of Parse Wiktionary for each edition of Wiktionary, and each has its own output format. Currently Parse Wiktionary exists for the English (en.wiktionary.org), German (de.wiktionary.org) and Czech (cs.wiktionary.org) editions of Wiktionary.
Different parts of the information in Wiktionary are written in different formats that vary in regularity and complexity.
In all editions of Wiktionary, the headings follow a regular format, so headings are parsed semantically. The content of a section, on the other hand, may or may not be in a regular format, depending on the section. Many sections are therefore parsed as a free-form document, but that document is stored in a field with a semantic meaning. This means that even though the content of these documents is not semantic, the documents themselves are organized semantically, allowing applications to choose which sections to take and what to do with each one.
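The idea of free-form documents stored in semantically named fields can be sketched roughly as follows. This is only an illustration: the types `Inline` and `Entry`, the field names, the helper functions and the sample content are all hypothetical and are not the crate's actual output types. See the crate's API documentation for the real ones.

```rust
/// Hypothetical inline element of a free-form document.
/// Illustrative only; not a type exported by the crate.
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    /// Plain text.
    Text(String),
    /// A link to another entry, with its display text.
    Link { target: String, text: String },
    /// Markup that could not be interpreted, kept verbatim.
    Unknown(String),
}

/// Hypothetical entry: each field has a semantic meaning (it comes from a
/// regular heading), while its content is a free-form document.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub language: String,
    pub etymology: Vec<Inline>,
    pub pronunciation: Vec<Inline>,
}

/// Flattens a free-form document to plain text, skipping unknown markup.
pub fn plain_text(nodes: &[Inline]) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Inline::Text(text) => out.push_str(text),
            Inline::Link { text, .. } => out.push_str(text),
            Inline::Unknown(_) => {}
        }
    }
    out
}

/// Collects the targets of all links in a free-form document.
pub fn link_targets(nodes: &[Inline]) -> Vec<&str> {
    nodes
        .iter()
        .filter_map(|node| match node {
            Inline::Link { target, .. } => Some(target.as_str()),
            _ => None,
        })
        .collect()
}

/// Builds a made-up entry with placeholder content for demonstration.
pub fn sample_entry() -> Entry {
    Entry {
        language: "Czech".to_string(),
        etymology: vec![
            Inline::Text("Derived from ".to_string()),
            Inline::Link {
                target: "example".to_string(),
                text: "an example word".to_string(),
            },
            Inline::Text(".".to_string()),
            Inline::Unknown("{{placeholder}}".to_string()),
        ],
        pronunciation: Vec::new(),
    }
}
```

With a layout like this, an application that only cares about etymology can read `entry.etymology` directly and decide for itself how to render the free-form content, for example by calling `plain_text(&entry.etymology)`, without having to understand every other section.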
The long-term goal is to eliminate these limitations and parse all information in Wiktionary as structured semantic data. This will, however, require cooperation from Wiktionary editors. A standard format could be created for each section, and authors encouraged to follow it. Parse Wiktionary could be integrated into Wiktionary to validate entries as they are being edited, showing warnings about anything that doesn't conform to the standard format. More data could also be transferred to Wikidata, which was designed from the beginning to store semantic data.