`cstree` is a generic library for creating and working with concrete syntax trees (CSTs).
"Traditional" abstract syntax trees (ASTs) usually contain different types of nodes which
represent different syntactical elements of the source text of a document and reduce its
information to the minimal amount necessary to correctly interpret it. In contrast, CSTs are
lossless representations of the entire input where all tree nodes are represented homogeneously
(i.e., the nodes are _untyped_), but are tagged with a `RawSyntaxKind` to determine the kind
of grammatical element they represent.
One big advantage of this representation is that it cannot only recreate the original source
exactly, but also lends itself very well to the representation of _incomplete or erroneous_
trees and is thus highly suited for usage in contexts such as IDEs or any other application
where a user is _editing_ the source text.
The concept of and the data structures for `cstree`'s syntax trees are inspired in part by
Swift's [libsyntax](https://github.com/apple/swift/tree/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax).
Trees consist of two layers: the inner tree (called _green_ tree) contains the actual source
text as position independent green nodes. Tokens and nodes that appear identically at multiple
places in the source are deduplicated in this representation in order to store the tree
efficiently. This means that a green tree may not actually structurally be a tree. To remedy
this, the real syntax tree is constructed on top of the green tree as a secondary tree (called
the _red_ tree), which models the exact source structure.
As a possible third layer, a strongly typed AST [can be built] on top of the red tree.
[can be built]: #ast-layer
The `cstree` implementation is a fork of the excellent [`rowan`](https://github.com/rust-analyzer/rowan/),
developed by the authors of [rust-analyzer](https://github.com/rust-analyzer/rust-analyzer/) who
wrote up a conceptual overview of their implementation in
[their repository](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md#trees).
Notable differences of `cstree` compared to `rowan`:
- Syntax trees (red trees) are created lazily, but are persistent. Once a red node has been
created by `cstree`, it will remain allocated. In contrast, `rowan` re-creates the red layer on
the fly on each traversal of the tree. Apart from the trade-off discussed
[here](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md#memoized-rednodes),
this helps to achieve good tree traversal speed while helping to provide the following:
- Syntax (red) nodes are `Send` and `Sync`, allowing to share realized trees across threads. This is achieved by
atomically reference counting syntax trees as a whole, which also gets rid of the need to reference count
individual nodes.
- `SyntaxNode`s can hold custom data.
- `cstree` trees are trees over interned strings. This means `cstree` will deduplicate the text of tokens with the
same source string, such as identifiers with the same name. In this position, `rowan` stores each token's text
together with its metadata as a custom DST (dynamically-sized type).
- `cstree` includes some performance optimizations for tree creation: it only allocates space for new nodes on the
heap if they are not in cache and avoids recursively hashing subtrees by pre-hashing them.
- `cstree` includes some performance optimizations for tree traversal: persisting red nodes allows tree traversal
methods to return references instead of cloning nodes, which involves updating the tree's reference count. You can
still `clone` the reference to obtain an owned node, but you only pay that cost when you need to.
- The downside of offering thread safe syntax trees is that `cstree` cannot offer any mutability API for its CSTs.
Trees can still be updated into new trees through replacing nodes, but cannot be mutated in place.
## Getting Started
If you're looking at `cstree`, you're probably looking at or already writing a parser and are considering using
concrete syntax trees as its output. We'll talk more about parsing below -- first, let's have a look at what needs
to happen to go from input text to a `cstree` syntax tree:
1. Define an enumeration of the types of tokens (like keywords) and nodes (like "an expression")
that you want to have in your syntax and implement `Syntax`
2. Create a `GreenNodeBuilder` and call `start_node`, `token` and `finish_node` from your parser
3. Call `SyntaxNode::new_root` or `SyntaxNode::new_root_with_resolver` with the resulting
`GreenNode` to obtain a syntax tree that you can traverse
Let's walk through the motions of parsing a (very) simple language into `cstree` syntax trees.
We'll just support addition and subtraction on integers, from which the user is allowed to construct a single,
compound expression. They will, however, be allowed to write nested expressions in parentheses, like `1 - (2 + 5)`.
### Defining the language
First, we need to list the different part of our language's grammar.
We can do that using an `enum` with a unit variant for any terminal and non-terminal.
The `enum` needs to be convertible to a `u32`, so we use the `repr` attribute to ensure it uses the correct
representation.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
enum SyntaxKind {
/* Tokens */
Int, // 42
Plus, // +
Minus, // -
LParen, // (
RParen, // )
/* Nodes */
Expr,
Root,
}
```
For convenience when we're working with generic `cstree` types like `SyntaxNode`, we'll also give a
name to our syntax as a whole and add a type alias for it.
That way, we can match against `SyntaxKind`s using the original name, but use the more informative
`Node