# agldt

**author:** Caio Geraldes

Tools for parsing treebanks from AGLDT

## Basic usage

```rust
use serde_xml_rs::from_str;
use std::fs::read_to_string;
use agldt::parser::*;

fn main() {
    let src = read_to_string("/path/to/agldt/tlg0007.tlg004.perseus-grc1.tb.xml").unwrap();
    let doc = from_str::(&preprocess(&src)).unwrap();
    assert_eq!(doc.count_words(), 9451);
    assert_eq!(doc.count_tokens(), 10709);
}
```

## Description of parsing stages

### Preprocessing

Preprocesses the source `.xml` file so that the treebank can be deserialized. The scheme used in AGLDT's XML header and body has some oddities that would otherwise make deserializing it into a `struct` quite messy. This is something of a bodge, but it should do the trick.

#### Oddities

The main oddity in AGLDT's use of XML occurs where a tag may contain either a single string value or a series of nested tags:

```xml
Bridget Almas responsible for the annotation environment and cts:urn technology
Tufts University
Vanessa Gorman Vanessa Gorman
vbgorman@gmail.com
http://data.perseus.org/sosol/users/Vanessa%20Gorman
annotator of the text
```

To work around this, we apply two regex replacements that move two of these tags inside a third. A handful of other oddities concern three further tags nested inside another tag; the current version also removes these with a regex.

Finally, the `head` value is sometimes an empty string, which I cannot yet deserialize directly. As `0` is not used anywhere else, I replace empty strings with `"0"`.

### Serialization

Uses `serde` to deserialize the preprocessed XML into Rust structs. I did my best to keep the metadata accessible, but some fields are still missing and will be added later.
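
The empty-`head` workaround described under Preprocessing can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: `fill_empty_heads` is a hypothetical helper, the sample `word` element is made up, and a plain substring replacement stands in for the regex so that the example needs only the standard library. It also assumes `head` appears as a double-quoted XML attribute.

```rust
/// Hypothetical helper (not part of the `agldt` API): rewrites empty `head`
/// attributes to `head="0"` so that every `head` value can be parsed as a number.
fn fill_empty_heads(src: &str) -> String {
    // Plain substring replacement stands in for a regex here.
    // The leading space avoids matching attributes whose name merely ends in `head`.
    src.replace(" head=\"\"", " head=\"0\"")
}

fn main() {
    // Made-up sample element, assuming `head` is a double-quoted attribute.
    let word = r#"<word id="1" form="example" head=""/>"#;
    let fixed = fill_empty_heads(word);
    println!("{}", fixed);
}
```

Running this step before handing the text to `serde` means the `head` field can be typed as an integer rather than an optional string, at the cost of treating `0` as a sentinel value.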