[![Crate](https://img.shields.io/crates/v/stam-tools.svg)](https://crates.io/crates/stam-tools)
[![Docs](https://docs.rs/stam-tools/badge.svg)](https://docs.rs/stam-tools/)
[![GitHub release](https://img.shields.io/github/release/annotation/stam-tools.svg)](https://GitHub.com/annotation/stam-tools/releases/)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
![Technology Readiness Level 7/9 - Release Candidate - Technology ready enough and in initial use by end-users in intended scholarly environments. Further validation in progress.](https://w3id.org/research-technology-readiness-levels/Level7ReleaseCandidate.svg)

# STAM Tools

A collection of command-line tools for working with [STAM](https://github.com/annotation/stam), a data-model for stand-off annotations on text.

Various tools are grouped under the `stam` tool, and invoked with a subcommand:

* ``stam align`` - Align two similar texts, mapping their coordinate spaces.
* ``stam annotate`` or ``stam add`` - Add annotations or datasets or resources (from file or by query).
* ``stam batch`` or ``stam shell`` - Process multiple subcommands in sequence, or run interactively.
* ``stam info`` - Return information regarding a STAM model.
* ``stam init`` - Initialize a new STAM annotationstore (either from scratch or as a copy/merge of others).
* ``stam import`` - Import STAM data in tabular form from a simple TSV (Tab Separated Values) format, allows custom columns.
* ``stam fromxml`` - Import data from XML-based formats (like xHTML, TEI) to STAM. Effectively 'untangling' text and annotations.
* ``stam print`` - Output the text of any resources in the model.
* ``stam query`` or ``stam export`` - Query the annotation store and export the output in tabular form to a simple TSV (Tab Separated Values) format. This is not lossless but provides a decent view on the data.
It provides a lot of flexibility by allowing you to configure the output columns as you see fit.
* ``stam validate`` - Validate a STAM model.
* ``stam tag`` - Regular-expression based tagger on plain text.
* ``stam view`` - View annotations as queried by outputting to HTML (or ANSI coloured text).

For many of these, you can set `--verbose` for extra details in the output.

Alternatively, the functionality provided by the tools is also exposed as a library via a [Rust API](https://docs.rs/stam-tools/).

## Installation

### From source

```
$ cargo install stam-tools
```

## Usage

Add the ``--help`` flag after the subcommand for extensive usage instructions.

Most tools take as input a STAM JSON or CSV file containing an annotation store. You may also specify multiple stores which will be merged into one. Any files mentioned via the `@include` mechanism are loaded automatically. When output is written, the first store file used as input is also used as output. You can prevent writing output files by setting `--dry-run` or prevent reusing the first input file by setting an explicit output using `--output`.

Instead of passing STAM JSON files, you can read from stdin and/or output to stdout by setting the filename to ``-``, this works in many places.

These tools also support reading and writing [STAM CSV](https://github.com/annotation/stam/tree/master/extensions/stam-csv).

## Tools

### stam init & stam annotate

The `stam init` command is used to initialize a new STAM annotationstore with resources (`--resource`, plain text or STAM JSON), annotation data sets (`--annotationset`, STAM JSON) and/or annotations (`--annotations`, JSON list of annotations in STAM JSON).

In the following example, the positional parameter (the last one) is the annotation store to output; it may be STAM JSON or STAM CSV, as determined by the file extension:

```
$ stam init --resource document.txt new.store.stam.json
```

The `stam annotate` command is almost identical to `stam init`, except it reads and modifies an existing annotation store, rather than starting a new one from scratch:

```
$ stam annotate --resource document.txt existing.store.stam.json
```

Whenever you load annotations and annotation data sets using these commands, they need to already be in STAM JSON format. To import data from other formats, use `stam import` instead.

The `stam init` and `stam annotate` commands are also capable of merging multiple annotation stores into one. If you want to load a STAM annotationstore (or multiple) and save it under another name and/or other format, you can use `stam init` (or `stam annotate`) as well; the key is to then use an explicit `--output` filename that differs from the input. This serves to merge stores and/or convert between STAM JSON and STAM CSV. Example:

```
$ stam init --output merged.store.stam.csv mystore1.store.stam.json mystore2.store.stam.json
```

You can also pass [STAMQL](https://github.com/annotation/stam/blob/master/extensions/stam-query/README.md) queries to `stam annotate` to add (or delete) annotations:

```
$ stam annotate --query 'ADD ANNOTATION WITH DATA "my-vocab" "type" "sentence"; TARGET ?x { SELECT TEXT ?x WHERE RESOURCE "smallquote.txt" OFFSET 0 25; }' demo.store.stam.json
```

### stam info

The `stam info` command provides either some high-level details on the annotation store (number of resources, annotations, etc.) or, with the `--verbose` flag, presents all the data it holds in a fairly raw format.

Example:

```
$ stam info my.store.stam.json
```

### stam query

The `stam query` tool is used to consult the annotation store and export selected STAM data into a simple tabular data format (TSV, tab separated values).
You can configure precisely what columns you want to export using the ``--columns`` parameter, or simply rely on the defaults that are autodetected. See ``stam query --help`` for a list of supported columns.

A full query is done using the ``--query`` parameter, followed by a query statement in [the STAM Query Language (STAMQL)](https://github.com/annotation/stam/blob/master/extensions/stam-query/README.md):

*Example 1) a query in STAMQL:*

```
$ stam query --query 'SELECT ANNOTATION ?a WHERE DATA "myset" "pos" = "noun";'
```

However, if you simply want all annotations, resources or data, and don't want to formulate a query, a shortcut is available: set the ``--type`` parameter to `annotation`, `key`, `data`, `resource` or `dataset`.

*Example 2) get all annotations (also default behaviour if you omit `--type` and `--query`):*

```
$ stam query --type annotation my.store.stam.json
```

For certain types, you can set ``--verbose`` to output more information, e.g. when querying for annotations it will also output *all* annotation data pertaining to the annotations. Do note that `stam import` can not import annotations back when you use this.

*Example 3) get all annotations verbosely with all data:*

```
$ stam query --verbose --type annotation my.store.stam.json
```

*Example 4) get all keys:*

```
$ stam query --type key my.store.stam.json
```

One of the more powerful functions is that you can specify custom columns by specifying a set ID, a delimiter and a key ID (the delimiter by default is a slash), for instance: `my_set/part_of_speech`. Such columns are automatically added for you if you have `DATA` or `KEY` constraints in your query (like in example 1); if that is not what you want, set `--strict-columns`. This custom column will hold the corresponding value if the key exists for the annotation.
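
To make the effect of `--strict-columns` concrete, here is a minimal sketch that runs the query from example 1 twice, once with the automatically added column and once without. The store filename and the two output filenames are placeholders chosen for illustration, and the sketch assumes the TSV is written to standard output. The guard only skips the commands when `stam` or the store is unavailable.

```shell
# Query from example 1; its DATA constraint refers to set "myset" and key "pos".
query='SELECT ANNOTATION ?a WHERE DATA "myset" "pos" = "noun";'
store=my.store.stam.json   # placeholder store filename

# Skip quietly when stam or the store is not available.
if command -v stam >/dev/null 2>&1 && [ -f "$store" ]; then
    # Default behaviour: a custom column for the DATA constraint is added automatically.
    stam query --query "$query" "$store" > auto_columns.tsv
    # With --strict-columns, no such column is added.
    stam query --strict-columns --query "$query" "$store" > strict_columns.tsv
fi
```

Comparing the header lines of the two files should show which column was added on your behalf.
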

*Example 5) explicitly specified columns including a custom one:*

```
$ stam query --columns Id,Text,TextResource,BeginOffset,EndOffset,my_set/part_of_speech my.store.stam.json
```

*Example 6) subqueries and multiple result variables:*

```
$ stam query --query 'SELECT ANNOTATION ?sentence WHERE DATA "myset" "type" = "sentence"; { SELECT ANNOTATION ?word WHERE RELATION ?sentence EMBEDS; DATA "myset" "type" = "word"; }'
```

This will result in a TSV file where the sentence is repeated for each word found in it. A result number is returned in a column, as well as the variable name.

The TSV output produced by this tool is not lossless, that is, it can not encode everything that STAM supports, unlike STAM JSON and STAM CSV. It does, however, give you a great deal of flexibility to quickly output only the data relevant for whatever your specific purpose is.

For queries that modify the annotation store, use `stam annotate` rather than `stam query`.

### stam export

`stam export` is just an alias for `stam query`; their functionality is identical.

### stam import

The `stam import` tool is used to import tabular data from a TSV (Tab Separated Values) file into STAM. Like `stam query`, you can configure precisely what columns you want to import, using the ``--columns`` parameter. By default, the import function will attempt to parse the first line of your TSV file as the header and use that to figure out the column configuration.

You will often want to set ``--annotationset`` to set a default annotation set to use for custom columns. If you set ``--annotationset my_set`` then a column like `part_of_speech` will be interpreted in that set (same as if you wrote `my_set/part_of_speech` explicitly).

Here is a simple example of a possible import TSV file (with ``--annotationset my_set``):

```tsv
Text	TextResource	BeginOffset	EndOffset	part_of_speech
Hello	hello.txt	0	5	interjection
world	hello.txt	6	11	noun
```

The import function has some special abilities.
If your TSV data does not mention specific offsets in the text resource(s), they will be looked up automatically during the import procedure. If the text resources don't even exist in the first place, they can be reconstructed (within certain constraints; the output text will likely be in tokenised form only). If your data does not explicitly reference a resource, use the ``--resource`` parameter to point to an existing resource that will act as a default, or ``--new-resource`` for the reconstruction behaviour.

By setting ``--resource hello.txt`` or ``--new-resource hello.txt`` you can import the following much more minimal TSV:

```tsv
Text	part_of_speech
Hello	interjection
world	noun
```

The importer supports empty lines within the TSV file. When reconstructing text, these will (typically) map to a newline in the to-be-constructed text (this is configurable with ``--outputdelimiter2``). Likewise, the delimiter between rows is configurable with `--outputdelimiter`, and defaults to a space.

Note that `stam import` can not import everything `stam query` can export. It can only import rows exported with ``--type Annotation`` (the default), in which each row corresponds with one annotation.

### stam grep

The `stam grep` tool can be used for matching regular expressions in text; it returns the resource identifiers, offsets and exact texts of all matching occurrences.

Example:

```
$ stam grep -e "[hzwHZW]ij" frogdeep.store.stam.json
example.deep	690:693	Hij	1/1
example.deep	799:802	hij	1/1
```

The tab-separated columns are as follows:

1. Resource ID
2. Begin offset and end offset (non-inclusive) in unicode points
3. The matching text
4. The current capture group and total number of capture groups (if any)

### stam tag

The `stam tag` tool can be used for matching regular expressions in text and subsequently associating annotations with the found results. It is a tool for tasks such as tokenization or other tagging.

The `stam tag` command takes a TSV file ([example](https://github.com/knaw-huc/stam-experiments/blob/main/config/stam-tag/simpletagger.tsv)) containing regular expression rules for the tagger. The file contains the following columns:

1. The regular expression, following [this syntax](https://docs.rs/regex/latest/regex/#syntax). The expression may contain one or more capture groups containing the items that will be tagged; in that case anything else is considered context and will not be tagged.
2. The ID of the annotation data set
3. The ID of the data key
4. The value to set. If this follows the syntax `$1`, `$2`, etc., it will assign the value of that capture group (1-indexed).

Example of the rules:

```tsv
#EXPRESSION	#ANNOTATIONSET	#DATAKEY	#DATAVALUE
\w+(?:[-_]\w+)*	simpletokens	type	word
[\.\?,/]+	simpletokens	type	punctuation
[0-9]+(?:[,\.][0-9]+)	simpletokens	type	number
```

Example of applying this to a text resource:

```
# first we create a store and add a text resource
$ stam init --resource sometext.txt my.store.stam.json

# then we start the tagging
$ stam tag --rules rules.tsv my.store.stam.json
```

### stam view

The `stam view` tool is used to visualize annotations. The default visualisation is HTML. This will output a self-contained static HTML document to standard output (the document does not reference any external assets). An alternative visualisation is text with ANSI escape codes for colours (`--format ansi`), which is suited for display in a terminal rather than a browser.

The annotations you want to visualise are requested via queries in [STAMQL](https://github.com/annotation/stam/tree/master/extensions/stam-query), using the `--query` parameter. The `--query` parameter can be specified multiple times. The first query is always the *selection query*; it determines what the main selection is and can be anything you can query that has text (i.e. resources, annotations, text selections).
Any subsequent queries are *highlight queries*; they determine what parts of the selections produced by the selection query you want to highlight. Highlighting is done by drawing a line underneath the text and optionally by a *tag* that shows extra information.

![STAM view example](https://github.com/annotation/stam-tools/raw/master/stamvis1.png)

Example with tags:

![STAM view example with tags](https://github.com/annotation/stam-tools/raw/master/stamvis2.png)

Tags can be enabled by prepending the query with one of the following *attributes*:

* `@KEYTAG` - Outputs a tag with the key, pertaining to the first DATA constraint in the query
* `@KEYVALUETAG` - Outputs a tag with the key and the value, pertaining to the first DATA constraint in the query
* `@VALUETAG` - Outputs a tag with the value only, pertaining to the first DATA constraint in the query
* `@IDTAG` - Outputs a tag with the public identifier of the ANNOTATION that has been selected

If you don't want to match the first DATA constraint, but the n-th, then specify a number to refer to the DATA constraint (1-indexed) in the order specified. Note that only DATA constraints are counted:

* `@KEYTAG=`*n* - Outputs a tag with the key, pertaining to the *n*-th DATA constraint in the query
* `@KEYVALUETAG=`*n* - Outputs a tag with the key and the value, pertaining to the *n*-th DATA constraint in the query
* `@VALUETAG=`*n* - Outputs a tag with the value only, pertaining to the *n*-th DATA constraint in the query

Attributes may also be provided for styling HTML output:

* `@STYLE=`*class* - Will associate the mentioned CSS class (it's up to you to associate a proper stylesheet). The default one predefines only a few simple classes: `italic`, `bold`, `red`, `green`, `blue`, `super`.
* `@HIDE` - Do not add the highlight underline and do not add an entry to the legend. This may be useful if you only want to apply `@STYLE`.

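
Putting a selection query, a highlight query and an attribute together, an invocation might look like the sketch below. The set `myset`, its `type` key and the store filename are hypothetical, the query syntax is borrowed from the `stam query` examples, and a space between the attribute and the query is assumed. The guard only skips the command when `stam` or the store is unavailable.

```shell
# Selection query: the sentences to render (set and key names are hypothetical).
selection='SELECT ANNOTATION ?sentence WHERE DATA "myset" "type" = "sentence";'

# Highlight query: words embedded in each selected sentence. It reuses the
# ?sentence variable from the selection query in a constraint, and @KEYVALUETAG
# requests a tag showing the key and value of the first DATA constraint.
highlight='@KEYVALUETAG SELECT ANNOTATION ?word WHERE RELATION ?sentence EMBEDS; DATA "myset" "type" = "word";'

store=my.store.stam.json   # placeholder store filename

# Skip quietly when stam or the store is not available.
if command -v stam >/dev/null 2>&1 && [ -f "$store" ]; then
    # The HTML document is written to standard output, so redirect it to a file.
    stam view --query "$selection" --query "$highlight" "$store" > view.html
fi
```

Adding `--format ansi` and dropping the redirect would show the same selection in a terminal instead.
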

If no attribute is provided, there will be no tags or styling shown for that query, only a highlight underline.

In the highlight queries, the variable from the main selection query is available and you *should* always use it in a constraint, otherwise performance will be sub-optimal!

All your queries *should* have variable names and these will appear in the legend (unless you pass `--no-legend`).

Various real examples of visualisation and queries are shown here: