25 July, 2007
Lexical analyzer (fa_lex)
Introduction
The lexical analyzer (the lexer) takes a sequence of characters and returns a sequence of tokens, where every token is a meaningful unit identified by its type and its boundaries.
Everything that is not meaningful is normally discarded, such as spaces or new-line symbols for C++. The tokens cannot overlap or include each other; in other words, each character belongs to at most one token. Depending on the language, the definition of the tokens can differ, as the examples below show:
For C++:
Input: if(++i==0) { j = 0; }
Output: if/OP (/LBR ++/OP i/VAR ==/OP 0/NUM )/RBR {/LCBR j/VAR =/OP 0/NUM ;/OP }/RCBR
Where {OP, LBR, RBR, VAR, NUM, LCBR, RCBR} is a possible set of token types for C++.
For English:
Input: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Output: Pierre/WORD Vinken/WORD ,/PUNKT 61/CD years/WORD old/WORD ,/PUNKT will/WORD join/WORD the/WORD board/WORD as/WORD a/WORD nonexecutive/WORD director/WORD Nov./WORD 29/CD ./EOS
Where {WORD, PUNKT, CD, EOS} is a possible set of token types for English.
Grammar of fa_lex rules
The lexer uses rules to identify the boundaries and types of the tokens. Each rule describes one token in a context. The rules are based on character regular expressions.
Each rule consists of an optional left context description, the token description, an optional right context description, and the token type. The token description is enclosed in angle brackets. The left context, right context, and token descriptions are character regular expressions. However, the context descriptions must not be cyclic (i.e., they must not accept strings of infinite length) and the token description must not be empty (i.e., it must not accept the zero-length string). All rules are combined together by an "or" operator. The following grammar in Backus-Naur form formally describes the syntax of the lexer rules.
GRAMMAR ::= RULES
GRAMMAR ::= FUNCTIONS
RULES ::= RULE
RULES ::= RULE\n RULES
RULE ::= CONDITION --> ACTION
CONDITION ::= Regexp* < Regexp > Regexp*
CONDITION ::= < Regexp > Regexp*
CONDITION ::= Regexp* < Regexp >
CONDITION ::= < Regexp >
ACTION ::= Tag
ACTION ::= _call FUNCTION_NAMES
ACTION ::= Tag _call FUNCTION_NAMES
FUNCTION_NAMES ::= _main
FUNCTION_NAMES ::= FnName
FUNCTION_NAMES ::= FnName FUNCTION_NAMES
FUNCTIONS ::= FUNCTION
FUNCTIONS ::= FUNCTION\n FUNCTIONS
FUNCTION ::= _function FnName\n RULES\n _end
Regexp -- non-empty character-based regular expression
Regexp* -- acyclic character-based regular expression
Tag -- tag name (token type name)
FnName -- function name; can be one of the tags or a new name
_function -- a keyword indicating the beginning of a function
_end -- a keyword indicating the end of a function
_call -- a keyword indicating a function call
_main -- a special function name referring to the main rule set
Example 1:
< [0-9]+ > --> NUM
[0-9] < [-+*/] > [-]?[0-9] --> OP
< [[:alpha:]]+ > --> VAR
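For instance, applied to the input below, these rules would be expected to produce the following output (a sketch of the behavior; note that the "*" before "x" is not tagged, because the OP rule requires an optional minus followed by a digit in the right context, and unmatched characters are discarded):
Input: 3+4*x
Output: 3/NUM +/OP 4/NUM x/VAR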
Example 2:
< ([A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153])+[+-] > [0-9] --> WORD
< ([0]?[0-9]|(1)[0-9]|(2)[0-4])[:]([0-5][0-9])([:]([0-5][0-9]))? > [^0-9] --> HHMM
< ([\x0024\x00A2-\x00A5\x09F2\x09F3\x0E3F\x20A0\x20A2\x20A3\x20A4\x20A6-\x20AF])[\x0020\t]*
((0)|[1-9][0-9]*)[\x0020\t]*((
(AED)|(ARP)|(ATS)|(AUD)|(BBD)|(BEF)|(BGL)|(BHD)|(BMD)|(BRR)|(BRL)|(BSD)
| (CAD)|(CHF)|(CLP)|(CNY)|(CSK)|(CYP)|(DEM)|(DKK)|(DJF)|(DZD)|(EGP)|(ESP)
| (EUR)|(FIM)|(FJD)|(FRF)|(GBP)|(GRD)|(HKD)|(HUF)|(IDR)|(IEP)|(ILS)|(INR)
| (IQD)|(ISK)|(ITL)|(JMD)|(JOD)|(JPY)|(KRW)|(KWD)|(LBP)|(LUF)|(LYD)|(MAD)
| (MRO)|(MXP)|(MYR)|(NLG)|(NOK)|(NZD)|(OMR)|(PHP)|(PKR)|(PLN)|(PTE)|(QAR)
| (ROL)|(RUR)|(SAR)|(SDD)|(SEK)|(SGD)|(SKK)|(SOS)|(SYP)|(SUR)|(THB)|(TND)
| (TRL)|(TRY)|(TTD)|(TWD)|(USD)|(VEB)|(XEC)|(YER)|(ZAR)|(ZMK)|(DM)|(FF)
| (\x20AC((uro)|(URO))[s]?)
)) > ([^.,0-9A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153]) --> CURR
The following rules are incorrect:
< [-+*/] > [-]?[0-9]+ --> OP ; the context must be acyclic
< [-+*/]* > [-]?[0-9] --> OP ; the token description must not allow empty tokens
[-]?[0-9]+ --> CD ; the token description must be enclosed in angle brackets
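For comparison, corrected versions of these rules might look as follows (a sketch; the cyclic part of the context is dropped, the token is made non-empty, and the token is bracketed):
< [-+*/] > [-]?[0-9] --> OP
< [-+*/]+ > [-]?[0-9] --> OP
< [-]?[0-9]+ > --> CD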
The following are equivalent rule-sets:
1. The rule
< [-+*/] > [-]?[0-9] --> OP
is equivalent to the pair
< [-+*/] > [-][0-9] --> OP
< [-+*/] > [0-9] --> OP
2. The rule
< [-+*/] > [[:alpha:]]|[[:digit:]] --> OP
is equivalent to the pair
< [-+*/] > [[:alpha:]] --> OP
< [-+*/] > [[:digit:]] --> OP
3. The rule
< [-+*/] > [-]? --> OP
is equivalent to
< [-+*/] > --> OP
4. The rule
< [-+*/] > [^a] --> OP
is equivalent to the pair
< [-+*/] > [-] --> OP
< [-+*/] > [^a] --> OP
5. The rule
< [-+*/] > . --> OP
is equivalent to the pair
< [-+*/] > [^a] --> OP
< [-+*/] > [^b] --> OP
Description of Functions:
Functions are isolated named sets of rules in fa_lex syntax.
If the action of a rule R contains the _call keyword followed by a function name FnName or by the _main keyword, then each time R extracts a token, the rule set of FnName or _main is applied to the token span. If _call is followed by one function name, then the function's rule set extracts all possible non-overlapping tokens from the span. If _call is followed by more than one function name, then each corresponding rule set extracts just one token, one after another in a sequence, with the exception of the main rule set _main (it always extracts all possible tokens).
As the formal grammar defines, there are three types of actions: a) tag assignment, b) function call, c) tag assignment and function call. If the action is a function call without a tag assignment, then no token corresponding to the span itself is extracted. In this case fa_lex returns whatever the called function outputs. It is possible that the called function returns nothing.
Functions are optional. They may be used for hierarchical token extraction (such as a date as a whole with day, month, and year as its parts); they may also be used for wide context description and conflict resolution.
Examples of Functions:
The following function HY_WORD is called to split a hyphenated word into segments. Input: “out-of-date”, output: “out/WORD -/WORD of/WORD -/WORD date/WORD”. No nested tokens are created.
_function HY_WORD
< [A-Za-z]+ > --> WORD
< [-] > --> WORD
_end
< [A-Za-z]+([-][A-Za-z]+)+ > --> _call HY_WORD
In the following example the tag ACR is assigned and the function ACR is called (it is fine for a function and a tag to have the same name). Input: “A.B.C.”, output: “A.B.C./ACR A./WORD B./WORD C./WORD”.
_function ACR
< [A-Z][.] > --> WORD
_end
< ([A-Z][.])+ > --> ACR _call ACR
The following functions DAY, MONTH, YEAR are called in a sequence one after another; each of them extracts just one token within the span selected by the calling rule. Input: “12/13/2006 13/12/2006”, output: “12/13/2006/DATE_US 12/MONTH 13/DAY 2006/YEAR 13/12/2006/DATE_EU 13/DAY 12/MONTH 2006/YEAR”. For the cases when the input token matches both the DATE_US and DATE_EU rules, fa_lex prefers the tag which has the smaller value, so depending on the tagset definition DATE_US may be preferred to DATE_EU and vice versa (see “Conflict resolution for matching rules”).
_function MONTH
< [0-9][0-9] > --> MONTH
_end
_function DAY
< [0-9][0-9] > --> DAY
_end
_function YEAR
< [0-9][0-9][0-9]?[0-9]? > --> YEAR
_end
< [01][0-9][/][0123][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_US _call MONTH DAY YEAR
< [0123][0-9][/][01][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_EU _call DAY MONTH YEAR
Extra syntax notes:
1. Blank characters do not mean anything to fa_lex; they are simply ignored. In order to match any of these characters, the following constructions can be used: \t, \n, \r, \f, \v, [ ], [\t], [\n], [\r], [\f], [\v], \x20, \x09, \x0D, \x0A, [\x20], [\x09], [\x0D], [\x0A], [[:blank:]], [[:space:]] and so on (see the example after these notes).
2. The left (^) and right ($) anchors are ordinary symbols for the lexer; they can be included in both contexts as well as in the token definition. If possible, including them in the token definition is preferable. The “any” symbol (i.e. .) matches both of the anchors, and the negation of a character (e.g. [^a]) also matches either of the anchors.
3. Character classes:
[:alnum:] [:alpha:] [:lower:] [:xdigit:]
[:digit:] [:space:] [:upper:] [:print:]
[:punct:] [:blank:] [:cntrl:] [:graph:]
are defined as in the POSIX "C" locale and have to be extended for the Unicode range, if necessary.
4. See the POSIX 1003.2 standard for more details on the regular expression syntax (http://www.unusualresearch.com/regex/regexmanpage.htm).
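For illustration, here are two hypothetical rules using these constructions (ACR and BOS_WORD are assumed tag names, not part of any shipped tagset). The first matches “U. S.” with a literal space written as [\x20], since bare blanks in a rule are ignored; the second anchors a capitalized word to the beginning of the text by placing ^ inside the token definition, as recommended above:
< U[.][\x20]S[.] > --> ACR
< ^[A-Z][a-z]+ > --> BOS_WORD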
Compilation:
The lexer compiler fa_build_lex takes two input files: a rule-set and a tagset. The tagset is a list of symbolic names of token types, each of which has a numerical value associated with it. The tagset can be shared with other modules, such as POS tagging, in the case of NL analysis.
Suppose the file lex_rules.utf8 contains the following rule-set:
[0-9] < [-+*/] > [-]?[0-9] --> PUNKT
< [-]?[0-9]+ > --> CD
< [-+*/]+ > --> WORD
The file tagset.txt contains the following tagset:
CD 1
PUNKT 2
WORD 3
The following command will compile the rule-set:
fa_build_lex --in=lex_rules.utf8 --out=lex_rules.dump \
--build-dump --tagset=tagset.txt
The --build-dump parameter produces a memory-dump representation of the compiled rule-set; without this parameter the compiled rule-set is stored in the textual representation. See the description of all switches by typing: fa_build_lex --help
If there were no compilation errors the output file lex_rules.dump will be created.
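For example, a textual (non-dump) compilation might look as follows (lex_rules.txt is a hypothetical output file name):
fa_build_lex --in=lex_rules.utf8 --out=lex_rules.txt --tagset=tagset.txt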
Lexical analysis
As stated above, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, where each token is a meaningful unit identified by its type and its boundaries. Everything that is not a token is ignored. The output tokens cannot overlap or include each other; in other words, each character of the input text belongs to at most one token. In order to guarantee this condition, it is necessary to be able to prefer one match over another when more than one rule matches at a given character of the input text. This is addressed by the conflict resolution rules (see below).
Conflict resolution for matching rules:
The following is the order in which fa_lex selects the rule to execute if more than one matches the text:
1. The leftmost rule,
2. The rule with the longest span,
3. The rule with the smallest left context,
4. The rule with the smallest right context,
5. The rule with no tag assignment (just a function call),
6. The rule with the smaller tag value,
7. The rule with the lexicographically smaller list of function names (based on their values).
Conflict resolution rules #1 and #2 are common to many lexical analyzer implementations (including lex and flex). Rules #3 -- #7 are specific to fa_lex. Unlike lex/flex, fa_lex gives the rule order no role in conflict resolution; thus it absolutely does not matter in which order the rules are specified.
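For example, to see rule #6 at work, recall the DATE_US/DATE_EU rules above: the input “12/12/2006” matches both rules with the same span and contexts, so the tag values decide. Assuming a hypothetical tagset such as:
DATE_US 1
DATE_EU 2
fa_lex would tag the token DATE_US, because 1 < 2.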
Runtime Execution:
The lexical analysis of a text can be performed by the stand-alone program fa_lex. It takes two obligatory parameters, a compiled rule-set and a tagset; reads the raw text from stdin or from an input file; and prints the extracted tokens to stdout or to an output file in the tagged-text format. The output can be redirected to programs like fa_ts2ps, fa_gcd, fa_ts2stat or any other program that understands this format.
The following commands perform a lexical analysis of the input text with respect to the grammar defined and compiled in the previous section:
$ echo 23452345+34534 | fa_lex --tagset=tagset.txt --stage=lex_rules.dump
> 23452345/CD +/PUNKT 34534/CD
$ echo 23452345+34534+ | fa_lex --tagset=tagset.txt --stage=lex_rules.dump
> 23452345/CD +/PUNKT 34534/CD +/WORD
In the second run the trailing "+" is tagged WORD rather than PUNKT, because the PUNKT rule requires a digit in its right context and there is none.
FAQ
1. Why use fa_lex?
The fa_lex lexical analyzer does not require rule authors to write C/C++ code or even to have a C/C++ compiler installed. In fa_lex, the rule-sets are purely declarative. This allows authors (usually linguists) to focus on the linguistic aspects of the problem and be isolated from the actual implementation. By design, the rules cannot contain hard-to-understand logic; they are more independent of each other and thus easier to maintain than the rules of other lexer programs.
2. How is fa_lex different from flex?
Efficiency aspects:
Authoring aspects:
Other aspects:
3. How would you use a function for context description and conflict resolution?
As for context description: you can detect that a span contains something (for example, quotes at the beginning and the end of the span) and then apply a specific rule-set to this span (for example, separate rules for the left quote and the right quote, and the same rule-set for the rest of the span). Needless to say, this is impossible to do without functions; a sketch follows.
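For instance, a sketch of the quote case (QUOTE and WORD are assumed tag names; the outer rule selects a whole quoted span and the called function re-tokenizes it):
_function QUOTED
< ["] > --> QUOTE
< [^"\x20]+ > --> WORD
_end
< ["][^"]+["] > --> _call QUOTED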
As for conflict resolution, the key is that a function call is always preferred to a tag assignment. So if some particular span needs special or exceptional treatment and tag priorities do not work, we can simply call a function for that span and override the behavior.
For example, consider ruleset 1:
< [^\s\t\r\n]+ > --> WORD
< [A-Za-z][.]([A-Za-z][.])+ > --> ACR
< e[.]g[.] > --> WORD
And the tagset:
ACR 1
WORD 2
The intention of the author is to mark sequences of non-space characters with the tag WORD, sequences of two or more letter-dot pairs with the tag ACR, but the word "e.g." as a WORD. The author wants to choose ACR whenever possible and to use the WORD tag only for the remaining cases; that is why the WORD tag has the biggest value in the tagset. The word "e.g.", however, belongs both to the /[A-Za-z][.]([A-Za-z][.])+/ language and to the /e[.]g[.]/ language, so there is a conflict. This conflict cannot be resolved by changing the tag values; it should be resolved either by adding more context to the /e[.]g[.]/ rule or by using a function call as follows (see ruleset 2).
Ruleset 2:
< [^\s\t\r\n]+ > --> WORD
< [A-Za-z][.]([A-Za-z][.])+ > --> ACR
< e[.]g[.] > --> _call WORD
_function WORD
< ^ .+ $ > --> WORD
_end
The function WORD assigns the tag WORD to the entire span. According to the priority rules, the "e.g." rule has the highest priority (a function call is preferred to a tag assignment). See the “Conflict resolution for matching rules” section.
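For instance, with ruleset 2 one would expect the following behavior (a sketch, assuming space-separated input):
Input: e.g. A.B. word
Output: e.g./WORD A.B./ACR word/WORD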
4. How to obtain fa_lex?
The fa_lex lexical analyzer is available in the NLG-MAIN enlistment. The program fa_build_lex builds rule-sets, and fa_lex is a stand-alone program that performs lexical analysis; see the “Compilation” and “Runtime Execution” sections.