# Notes on a full fidelity parser
Nickel does not have a formatter, which it should ideally have.
The current parser and ast are not suited for this purpose since they perform
desugaring during decoding and drop all information on whitespace and
comments. For example, the following program:
```nickel
{
  // This bar is very important
  foo.bar = "bar",
  foo.baz = "baz",
}
```

Parses to:
```nickel
{
  foo = { bar = "bar"} & { baz = "baz" },
}
```

This note explores the options we have within the rust ecosystem to construct a
full fidelity (or lossless) syntax tree.

# Glossary
- **Concrete Syntax Tree (CST)/Parse Tree**: A tree that results from parsing, and
represents the syntactic structure of a string/file. May be lossless/full
fidelity.
- **Abstract Syntax Tree (AST)**: A tree that represents the abstract syntactic
structure of a string/file. Used for further processing of the file. Not
lossless/full fidelity. 
- **Lossless/full fidelity**: A lossless CST is a tree that contains all information
to recreate the original string/file exactly.
- **Typed**: A Typed Syntax Tree is a tree where the type of the parent implies the
type of the children. Every node has a specific type, with specific fields,
nodes are not interchangeable. Most ASTs are typed.
- **Untyped**: An Untyped Syntax Tree is a tree where the type/kind of a parent node
does not force the type/kind of the children nodes. Lossless syntax trees are
typically untyped.

# Keeping the existing lalrpop grammar
The current lexing/parsing stack of Nickel is based on a lexer generated using
[logos][logos], a parser generated by
[lalrpop][lalrpop], and a native (abstract) syntax tree.

Our initial idea was to keep the existing lalrpop grammar in place, but
replacing the syntax tree with a [rowan][rowan] based
alternative. Unfortunately, while it is possible to integrate logos, rowan, and
lalrpop, comments and whitespace would still be lost. This is an inherent
problem with lalrpop in which grammar rules match directly on the token input.

# The problem with comments/whitespace
Above, it was stated that the way lalrpop consumes the token input stream is
not well suited for a formatter (or other programming tools). This is a general
problem with parsers that attempt to construct a typed syntax tree directly
(there is no place for comments in a typed tree). There are ways around this,
that fall broadly into three categories:
1. Don't have a typed syntax tree. For these solutions, an untyped syntax tree
   is used instead, that is optionally converted to a typed tree when needed.
2. Lose some information by "intelligently" assigning every comment to a node
   in the typed tree. The loss in information here means, for instance, that we
   have to assign a comment surrounded by empty lines to a expression somewhere
   near it.
3. Lex the whitespace and comments, but don't pass this information to the
   lalrpop parser. From here, have lalrpop create either an untyped tree, or a
   stream of event representing an untyped ast and reinject the comments and
   whitespace into said tree or stream.

# Options
## Custom
Like rustc, nickel could employ a custom lexer and parser.

### Advantages
- Full control

### Disadvantages
- By far the most work
- Rapid development is harder when implementing a parser from scratch

## [Rowan][rowan] + [Ungrammar][ungrammar] (optional) + [Logos][logos]
Rowan and Ungrammar are two libraries created for the
[rust-analyzer][rust-analyzer] project.

Of course, rust-analyzer is greatly benefited by a lossless concrete syntax tree.
To this extent they created two libraries specifically for
this purpose. [rowan][rowan] and [ungrammar][ungrammar].

Rowan is a library specifically created to represent (lossless) concrete syntax
trees. It requires a parser to be written. For a higher level view, one can write
functions that convert these nodes in the CST to AST nodes. This is done using
the `AstNode` trait. In rust-analyzer, this layer is actually generated using a
tool called Ungrammar. An overview of Rowan can be found
[in the rust-analyzer repository](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md).

Ungrammar is "a ... formalism for describing concrete syntax trees". Concretely,
ungrammar takes a
[file written in its grammar](https://github.com/rust-analyzer/ungrammar/blob/master/rust.ungram)
and produces
[a file](https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/syntax/src/ast/generated/nodes.rs)
that contains functions used to interact with the GreenNodes/RedNodes/lossess
syntax tree (typically produced by rowan). Ungrammar has seen limited/no use
outside of rust-analyzer.
The main advantage of ungrammar is that it allows the syntax of a language to
be changed easily without requiring manual changes to the `AstNode`
implementations. This can be a great benefit in the early stages of development
when changes to the CST are frequent. More on ungrammar can be read in a
[relevant blogpost](https://rust-analyzer.github.io/blog/2020/10/24/introducing-ungrammar.html).

Rowan is a library that has been adopted in
[several](https://crates.io/crates/rowan/reverse_dependencies) language servers
and other programming tools.
Additionally, there exists at least
[one language (slint)](https://github.com/slint-ui/slint/) that (aside from a
language server and formatter) also uses rowan for their
[compiler](https://github.com/slint-ui/slint/tree/master/internal/compiler).
Hence we can safely say rowan is likely suitable for nickel.

Since we can no longer use lalrpop, the parser would have to be written by
hand. Parser combinators are not very suitable to this approach, as many
are based around a parser succeeding or failing, negating rowan's advantage of
being able to produce a CST even when there is an error.

There is of course nothing fundamentally impossible about using some parser
combinator library.

### Advantages
- Rowan has seen use for compilers, language servers, and formatters.
- Logos and Rowan can work nicely together, because the created parser links
  them together.

### Disadvantages
- We would have to manually write an additional layer for the `AstNode`
  implementations.
- We might need two `AstNode` implementations, one used for the formatter, one
  used to construct the current AST. Alternatively, it might be possible to
  build a formatter directly on the CST.

## Comment map/assigning comments
Another approach is used by [ucg](https://github.com/zaphar/ucg) and
[nixfmt](https://github.com/serokell/nixfmt) that both associate a comment with
nodes in the typed syntax tree. Either by looking them up during pretty
printing (thus saving having them in the tree for other purposes), or having
every node in the tree have an optional associated comment.

This method is obviously not lossless in the sense that whitespace might still
removed, but it would suit our purpose (formatter). There are some challenges
related to how to format associated comments, especially comments that were not
intended to be associated with anything.

### Advantages
- This is a safe option, ucg is a language much like nickel
- Allows us to use any parser combinator library

### Disadvantages
- Not lossless

## Reinjecting whitespace/comments
Xavier proposed the idea of keeping lalrpop, but having it generate a
stream/list of what he called "events". These events represent an untyped ast
in a flat manner. From here, the comments and whitespace can be reinjected into
this event list before or during the construction of the lossless untyped ast.
Of course, this method isn't restricted to lalrpop, any parser library can be
used in this approach. Rust-analyzer employs this approach as is discussed in
[event.rs](https://github.com/rust-lang/rust-analyzer/blob/ac736dc667b079e629ad629654956f87c2e08c9a/crates/parser/src/event.rs).

### Advantages
- Is a method that is agnostic to the libraries used. We could keep lalrpop in
  place for instance.

### Disadvantages
- Perceived complexity. This adds another layer to the parser. Xavier has
  already applied this approach to a toy language of themselves, so it is at
  least a valid approach.
- Doesn't reduce the number of grammars.

## Tree-sitter?
In issue [#656](https://github.com/tweag/nickel/issues/656), Yann remarked that
there are multiple grammar definitions for the nickel language. It would be
nice to share a grammar for editors and nickel itself. Tree-sitter's concept is
not too far removed from rowan (an untyped tree that is easy to update) with
similar purposes (to ease the creation of development tools). However,
tree-sitter does not provide the nice interface that rowan provides. For
instances, there are no SyntaxNodes/RedNodes and no included trait to convert
the CST to a suitable AST.
Aside from
[toy-examples](https://github.com/vanhtuan0409/toon-lang) there are, no
existing projects that utilize tree-sitter as a parser for a rts/compiler/etc.
This is due to the fact that
[tree-sitter's error nodes do not provide substantial information about the error that occurred](https://github.com/tree-sitter/tree-sitter/issues/1631).
Some information can be derived from the surrounding nodes, but this is
obviously suboptimal.

## Don't use a uniform parser
While the concept of using the same parser/grammar for the interpreter and
formatter/other tools is nice, we ought to also consider the fact that these
tools have distinctly varying purposes. In short, a lossless parser just
doesn't serve any purpose to the nickel binary. Based on current experience
with the lalrpop parser, it might actually introduce more overhead if we try to
use the same parser for both.

# Proposed approach
I can't recommend any solution without input from those who have worked with 
the current lalrpop grammar. Depending on how hard it was to implement that, we
should determine if integrating a lossless ast with the nickel binary is
desirable. If so, I propose we contribute to
[the existing tree-sitter parser](https://github.com/milahu/tree-sitter-nickel)
. With that experience, we can attempt to write a simple formatter, which could
give us the insight required to determine if tree-sitter would be sufficient
for use in the nickel binary.

[rowan]: https://lib.rs/crates/rowan
[logos]: https://lib.rs/crates/logos
[lalrpop]: https://lib.rs/crates/lalrpop
[ungrammar]: https://lib.rs/crates/ungrammar
[rust-analyzer]: https://rust-analyzer.github.io/