| | |
|---|---|
| Crates.io | annatto |
| lib.rs | annatto |
| version | 0.19.0 |
| source | src |
| created_at | 2023-04-12 09:15:44.632691 |
| updated_at | 2024-11-20 15:50:57.788608 |
| description | Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests. |
| homepage | https://github.com/korpling/annatto/ |
| repository | https://github.com/korpling/annatto/ |
| id | 836700 |
| size | 4,809,706 |
This software aims to test and convert data within the RUEG research group at Humboldt-Universität zu Berlin. The tests continuously evaluate the state of the RUEG corpus data to identify issues regarding compatibility, consistency, and integrity early, and to facilitate data handling with regard to annotation, releases, and integration.
For efficiency, annatto relies on the graphANNIS representation and already provides a basic set of data handling modules.
Annatto is a command line program, which is available pre-compiled for Linux, Windows and macOS. Download and extract the latest release file for your platform.
After extracting the binary to a directory of your choice, you can run it by opening a terminal and executing
<path-to-directory>/annatto
on Linux and macOS and
<path-to-directory>\annatto.exe
on Windows.
If the annatto binary is located in the current working directory, you can also just execute ./annatto
on Linux and macOS and annatto.exe
on Windows.
In the following examples, the prefix to the path is omitted.
The main usage of annatto is through the command line interface. Run
annatto --help
to get more help on the sub-commands.
The most important command is annatto run <workflow-file>, which runs all the modules as defined in the given workflow file.
Annatto comes with a number of modules, which have different types:
Importer modules allow importing files from different formats. More than one importer can be used in a workflow, but then the corpus data needs to be merged using one of the merger manipulators. When running a workflow, the importers are executed first and in parallel.
Graph operation modules change the imported corpus data. They are executed one after another (non-parallel) and in the order they have been defined in the workflow.
Exporter modules export the data into different formats. More than one exporter can be used in a workflow. When running a workflow, the exporters are executed last and in parallel.
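The execution order described above can be sketched as a minimal workflow file. The paths are placeholders, and the "graphml" format and "check" operation used here are just examples taken from later sections of this document:

```toml
# Importers are listed first and run in parallel.
[[import]]
path = "corpus/"       # example path, relative to the workflow file
format = "graphml"

[import.config]

# Graph operations run next, one after another, in the order given here.
[[graph_op]]
action = "check"

[graph_op.config]
tests = []

# Exporters are listed last and run in parallel.
[[export]]
path = "out/"          # example output path
format = "graphml"

[export.config]
```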
To list all available formats (importers, exporters) and graph operations, run
annatto list
To show information about modules for the given format or graph operation use
annatto info <name>
The documentation for the modules is also included here.
Annatto workflow files list which importers, graph operations and exporters to execute.
We use a TOML file with the file extension .toml to configure the workflow.
TOML files can be as simple as key-value pairs, like config-key = "config-value".
But they allow representing more complex structures, such as lists.
The TOML website has a great "Quick Tour" section which explains the basic concepts of TOML with examples.
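For illustration, here is a small sketch of the TOML constructs that annatto workflows rely on; the keys themselves are made up for this example:

```toml
# a simple key-value pair
name = "exampleCorpus"

# an array (list) of values
layers = [ "pos", "lemma" ]

# a table, introduced by a [header]; keys below it belong to that table
[settings]
verbose = true

# an array of tables; each [[step]] header appends a new entry to the list
[[step]]
id = 1

[[step]]
id = 2
```

The [[import]], [[graph_op]] and [[export]] headers used below are exactly such arrays of tables, which is why the same header can appear several times in one workflow file.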
An import step starts with the header [[import]] and needs two configuration values: the key path, which gives the location to read the corpus from, and the key format, which declares the format the corpus is encoded in.
The file path is relative to the workflow file.
Importers also have an additional configuration section that follows the [[import]] header and is marked with the [import.config] header.
[[import]]
path = "textgrid/exampleCorpus/"
format = "textgrid"
[import.config]
tier_groups = { tok = [ "pos", "lemma", "Inf-Struct" ] }
skip_timeline_generation = true
skip_audio = true
skip_time_annotations = true
audio_extension = "wav"
You can have more than one importer, and you can simply list all the different importers at the beginning of the workflow file. An importer always needs to have a configuration header, even if it does not set any specific configuration option.
[[import]]
path = "a/mycorpus/"
format = "format-a"
[import.config]
[[import]]
path = "b/mycorpus/"
format = "format-b"
[import.config]
[[import]]
path = "c/mycorpus/"
format = "format-c"
[import.config]
# ...
Graph operations use the header [[graph_op]] and the key action to describe which action to execute.
Since there are no files to import or export, they don't have a path configuration.
[[graph_op]]
action = "check"
[graph_op.config]
# Empty list of tests
tests = []
Exporters work similarly to importers, but use the header [[export]] instead.
[[export]]
path = "output/exampleCorpus"
format = "graphml"
[export.config]
add_vis = "# no vis"
guess_vis = true
You cannot mix import, graph operation, and export headers: first list all the import steps, then the graph operations, and then the export steps.
[[import]]
path = "conll/ExampleCorpus"
format = "conllu"
config = {}
[[graph_op]]
action = "check"
[graph_op.config]
report = "list"
[[graph_op.config.tests]]
query = "tok"
expected = [ 1, inf ]
description = "There is at least one token."
[[graph_op.config.tests]]
query = "node ->dep node"
expected = [ 1, inf ]
description = "There is at least one dependency relation."
[[export]]
path = "graphml/"
format = "graphml"
[export.config]
add_vis = "# no vis"
guess_vis = true
You need to install Rust to compile the project. We also recommend installing the Cargo subcommands described below for developing annatto.
You can run the tests with the default cargo test command.
To calculate the code coverage, you can use cargo-llvm-cov:
cargo llvm-cov --open --all-features --ignore-filename-regex 'tests?\.rs'
You need to have cargo-release and cargo-about installed to perform a release. Execute the following cargo command once to install them.
cargo install cargo-release cargo-about
To perform a release, switch to the main branch and execute:
cargo release [LEVEL] --execute
The level should be patch, minor or major, depending on the changes made in the release.
Running the release command will also trigger a CI workflow to create release binaries on GitHub.
This research was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) – SFB 1412, 416591334 and FOR 2537, 313607803, GZ LU 856/16-1.