25 July, 2007

 

Lexical analyzer (fa_lex)

 

 

Introduction

 

 

The lexical analyzer (the lexer) takes a sequence of characters and returns a sequence of tokens, where each token is a meaningful unit identified by its type and its boundaries.

 

Everything that is not meaningful is normally discarded, such as spaces or new-line symbols in C++. The tokens cannot overlap or include each other; in other words, each character belongs to at most one token. Depending on the language, the definition of the tokens can be different, see the examples below:

 

For C++:

 

Input: if(++i==0) { j = 0; }

 

Output: if/OP (/LBR ++/OP i/VAR ==/OP 0/NUM )/RBR {/LCBR j/VAR =/OP 0/NUM ;/OP }/RCBR

 

Where {OP, LBR, RBR, VAR, NUM, LCBR, RCBR} is a possible set of token types for C++.

 

For English:

 

Input: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

 

Output: Pierre/WORD Vinken/WORD ,/PUNKT 61/CD years/WORD old/WORD ,/PUNKT will/WORD join/WORD the/WORD board/WORD as/WORD a/WORD nonexecutive/WORD director/WORD Nov./WORD 29/CD ./EOS

 

Where {WORD, PUNKT, CD, EOS} is a possible set of token types for English.
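The process sketched above (scan left to right, discard non-meaningful characters, emit typed tokens) can be illustrated with a few lines of Python. The patterns and the longest-match strategy below are hypothetical simplifications chosen for this illustration only; fa_lex itself works from declarative rules with contexts, compiled into automata, not hand-written code.

```python
import re

# Hypothetical token types loosely mirroring the English example above.
RULES = [
    (re.compile(r"[0-9]+"), "CD"),          # numbers
    (re.compile(r"[A-Za-z]+\.?"), "WORD"),  # words, optionally with a dot (Nov.)
    (re.compile(r","), "PUNKT"),            # punctuation
    (re.compile(r"\."), "EOS"),             # sentence-final dot
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():        # whitespace is not meaningful: discard
            pos += 1
            continue
        matches = [(rx.match(text, pos), tag) for rx, tag in RULES]
        matches = [(m, tag) for m, tag in matches if m]
        if not matches:
            pos += 1                   # no rule matched: skip the character
            continue
        m, tag = max(matches, key=lambda mt: mt[0].end())  # longest match wins
        tokens.append(f"{m.group()}/{tag}")
        pos = m.end()
    return " ".join(tokens)
```

For instance, tokenize("Pierre Vinken, 61 years") yields the token/TYPE stream in the same format as the example above.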

 

 

 

Grammar of fa_lex rules

 

 

The lexer uses rules to identify the boundaries and types of the tokens. Each rule describes one token in a context. The rules are based on character regular expressions.

 

Each rule consists of an optional left context description, a token description, an optional right context description and the token type. A token description is enclosed in angle brackets. The left context, right context and token descriptions are character regular expressions. However, context descriptions must not be cyclic (i.e. they must not accept strings of unbounded length) and the token description must not be empty (i.e. it must not accept a string of zero length). All rules are combined together by an "or" operator. The following grammar in Backus-Naur form formally describes the syntax of the lexer rules.

 

    GRAMMAR ::= RULES

    GRAMMAR ::= FUNCTIONS

 

    RULES ::= RULE

    RULES ::= RULE\n RULES

 

    RULE ::= CONDITION --> ACTION

 

    CONDITION ::= Regexp* < Regexp > Regexp*

    CONDITION ::= < Regexp > Regexp*

    CONDITION ::= Regexp* < Regexp >

    CONDITION ::= < Regexp >

 

    ACTION ::= Tag

    ACTION ::= _call FUNCTION_NAMES

    ACTION ::= Tag _call FUNCTION_NAMES

 

    FUNCTION_NAMES ::= _main

    FUNCTION_NAMES ::= FnName

    FUNCTION_NAMES ::= FnName FUNCTION_NAMES

 

    FUNCTIONS ::= FUNCTION

    FUNCTIONS ::= FUNCTION\n FUNCTIONS

 

    FUNCTION ::= _function FnName\n RULES\n _end

 

 

    Regexp  -- non-empty character-based regular expression

    Regexp* -- acyclic character-based regular expression

    Tag       -- tag name (token type name)

    FnName    -- function name; can be one of the tags or a new name

    _function -- a keyword indicating the beginning of a function

    _end      -- a keyword indicating the end of a function

    _call     -- a keyword indicating a function call

    _main     -- a special function name referring to the main rule set

 

 

 

Example 1:

 

  < [0-9]+ > --> NUM

  [0-9] < [-+*/] > [-]?[0-9] --> OP

  < [[:alpha:]]+ > --> VAR

 

Example 2:

 

  < ([A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153])+[+-] > [0-9] --> WORD

 

  < ([0]?[0-9]|(1)[0-9]|(2)[0-4])[:]([0-5][0-9])([:]([0-5][0-9]))? > [^0-9] --> HHMM

 

  < ([\x0024\x00A2-\x00A5\x09F2\x09F3\x0E3F\x20A0\x20A2\x20A3\x20A4\x20A6-\x20AF])[\x0020\t]*

    ((0)|[1-9][0-9]*)[\x0020\t]*((

     (AED)|(ARP)|(ATS)|(AUD)|(BBD)|(BEF)|(BGL)|(BHD)|(BMD)|(BRR)|(BRL)|(BSD)

   | (CAD)|(CHF)|(CLP)|(CNY)|(CSK)|(CYP)|(DEM)|(DKK)|(DJF)|(DZD)|(EGP)|(ESP)

   | (EUR)|(FIM)|(FJD)|(FRF)|(GBP)|(GRD)|(HKD)|(HUF)|(IDR)|(IEP)|(ILS)|(INR)

   | (IQD)|(ISK)|(ITL)|(JMD)|(JOD)|(JPY)|(KRW)|(KWD)|(LBP)|(LUF)|(LYD)|(MAD)

   | (MRO)|(MXP)|(MYR)|(NLG)|(NOK)|(NZD)|(OMR)|(PHP)|(PKR)|(PLN)|(PTE)|(QAR)

   | (ROL)|(RUR)|(SAR)|(SDD)|(SEK)|(SGD)|(SKK)|(SOS)|(SYP)|(SUR)|(THB)|(TND)

   | (TRL)|(TRY)|(TTD)|(TWD)|(USD)|(VEB)|(XEC)|(YER)|(ZAR)|(ZMK)|(DM)|(FF)

   | (\x20AC((uro)|(URO))[s]?)

  )) > ([^.,0-9A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153]) --> CURR

 

 

 

The following rules are incorrect:

 

  < [-+*/] > [-]?[0-9]+ --> OP   ; the right context must be acyclic

  < [-+*/]* > [-]?[0-9] --> OP   ; the token description must not allow empty tokens

  [-]?[0-9]+ --> CD       ; the token description must be enclosed in angle brackets

 

 

The following are equivalent rule-sets:

 

  1. The

    < [-+*/] > [-]?[0-9] --> OP

  and

    < [-+*/] > [-][0-9] --> OP

    < [-+*/] > [0-9] --> OP

 

  2. The

    < [-+*/] > [[:alpha:]]|[[:digit:]] --> OP

  and

    < [-+*/] > [[:alpha:]] --> OP

    < [-+*/] > [[:digit:]] --> OP

 

  3. The

    < [-+*/] > [-]? --> OP

  and

    < [-+*/] > --> OP

 

  4. The

    < [-+*/] > [^a] --> OP

  and

    < [-+*/] > [-] --> OP

    < [-+*/] > [^a] --> OP

 

  5. The

    < [-+*/] > . --> OP

  and

    < [-+*/] > [^a] --> OP

    < [-+*/] > [^b] --> OP
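These equivalences follow from the fact that a regular expression with an optional or alternated part accepts the union of the corresponding languages. For pair 1 this can be checked mechanically; the sketch below uses Python's re.fullmatch as a stand-in for the fa_lex regular expression engine (which differs in details) and compares the accepted languages exhaustively on short strings.

```python
import re
import itertools

def accepts(pattern, s):
    """True if the regular expression accepts the whole string."""
    return re.fullmatch(pattern, s) is not None

# Pair 1: a right context with an optional part versus two separate rules.
combined = r"[-]?[0-9]"
split = [r"[-][0-9]", r"[0-9]"]

def equivalent(max_len=2, alphabet="-09a"):
    """Compare the two descriptions on every string up to max_len over a
    small alphabet covering the interesting characters."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if accepts(combined, s) != any(accepts(p, s) for p in split):
                return False
    return True
```

An exhaustive check over bounded-length strings is sufficient here only because the languages involved are finite; it is an illustration, not a general equivalence procedure.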

 

 

Description of Functions:

 

 

Functions are isolated named sets of rules in fa_lex syntax.

 

If the action of a rule R contains the _call keyword followed by a function name FnName or by the _main keyword, then each time R extracts a token, the rule set FnName or _main is applied to the token span. If _call is followed by one function name, then that function's rule set extracts all possible non-overlapping tokens out of the span. If _call is followed by more than one function name, then each corresponding rule set extracts just one token, one after another in sequence, with the exception of the main rule set _main (it always extracts all possible tokens).

 

As the formal grammar defines, there are three types of actions: a) tag assignment, b) function call, c) tag assignment and function call. If the action is a function call without tag assignment, then no token corresponding to the span itself is extracted. In this case fa_lex returns whatever the called function outputs. It is possible that the called function will return nothing.

 

Functions are optional. They may be used for hierarchical token extraction (such as a date as a whole and its day, month and year as parts), for wide context description and for conflict resolution.

 

 

Examples of Functions:

 

 

The following function HY_WORD is called to split a hyphenated word into segments. Input: “out-of-date”, output: “out/WORD -/WORD of/WORD -/WORD date/WORD”. No nested tokens are created.

 

_function HY_WORD

< [A-Za-z]+ > --> WORD

< [-] > --> WORD

_end

 

< [A-Za-z]+([-][A-Za-z]+)+ > --> _call HY_WORD
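Because the action above is a bare _call with no tag assignment, only the inner WORD tokens are emitted, not an outer token for the whole span. A rough Python sketch of this behavior, with the rules rendered as ordinary regular expressions purely for illustration:

```python
import re

def hy_word(span):
    """Inner rule set: a run of letters or a single hyphen, each tagged WORD."""
    return [f"{m.group()}/WORD" for m in re.finditer(r"[A-Za-z]+|-", span)]

def lex_hyphenated(text):
    out = []
    # Outer rule: a hyphenated word selects the span to re-tokenize.
    # A bare _call assigns no tag, so no outer token is appended here.
    for m in re.finditer(r"[A-Za-z]+(?:-[A-Za-z]+)+", text):
        out.extend(hy_word(m.group()))
    return " ".join(out)
```

Running lex_hyphenated("out-of-date") reproduces the output quoted above.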

 

 

In the following example the tag ACR is assigned and the function ACR is called (it is fine for a function and a tag to have the same name). Input: “A.B.C.”, output: “A.B.C./ACR A./WORD B./WORD C./WORD”.

 

_function ACR

< [A-Z][.] > --> WORD

_end

 

< ([A-Z][.])+ > --> ACR _call ACR

 

 

The following functions DAY, MONTH, YEAR are called in sequence one after another; each of them extracts just one token within the span selected by the calling rule. Input: “12/13/2006 13/12/2006”, output: “12/13/2006/DATE_US 12/MONTH 13/DAY 2006/YEAR 13/12/2006/DATE_EU 13/DAY 12/MONTH 2006/YEAR”. For cases when the input token matches both the DATE_US and DATE_EU rules, fa_lex prefers the tag with the smaller numerical value, so depending on the tagset definition DATE_US may be preferred to DATE_EU or vice versa (see “Conflict resolution for matching rules”).

 

_function MONTH

< [0-9][0-9] > --> MONTH

_end

 

_function DAY

< [0-9][0-9] > --> DAY

_end

 

_function YEAR

< [0-9][0-9][0-9]?[0-9]? > --> YEAR

_end

 

< [01][0-9][/][0123][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_US _call MONTH DAY YEAR

 

< [0123][0-9][/][01][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_EU _call DAY MONTH YEAR
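The sequential behavior of a multi-function _call can be sketched in Python as follows. The helper and its patterns are illustrative stand-ins for the DATE_US rule and its MONTH, DAY, YEAR functions, not the fa_lex implementation:

```python
import re

def call_in_sequence(span, funcs):
    """Sketch of a multi-function _call: each function extracts exactly
    one token, continuing from where the previous function stopped."""
    tokens, pos = [], 0
    for pattern, tag in funcs:
        m = re.compile(pattern).search(span, pos)
        if not m:
            break
        tokens.append(f"{m.group()}/{tag}")
        pos = m.end()
    return tokens

# The DATE_US action: tag the span, then call MONTH, DAY, YEAR in order.
US_SEQUENCE = [(r"[0-9][0-9]", "MONTH"),
               (r"[0-9][0-9]", "DAY"),
               (r"[0-9][0-9][0-9]?[0-9]?", "YEAR")]

def lex_date_us(text):
    out = []
    for m in re.finditer(r"[01][0-9]/[0123][0-9]/[0-9][0-9][0-9]?[0-9]?", text):
        out.append(f"{m.group()}/DATE_US")           # outer token (tag assigned)
        out.extend(call_in_sequence(m.group(), US_SEQUENCE))  # inner tokens
    return " ".join(out)
```

Here the outer rule both assigns a tag and calls functions, so the whole-span DATE_US token is emitted alongside the nested MONTH/DAY/YEAR tokens, matching the hierarchical output quoted above.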

 

 

Extra syntax notes:

 

1. Blank characters do not mean anything to fa_lex and are simply ignored. In order to match any of these characters, the following constructions can be used: \t, \n, \r, \f, \v, [ ], [\t], [\n], [\r], [\f], [\v], \x20, \x09, \x0D, \x0A, [\x20], [\x09], [\x0D], [\x0A], [[:blank:]], [[:space:]] and so on.

 

2. The left (^) and right ($) anchors are ordinary symbols for the lexer; they can be included in both contexts as well as in the token definition. If possible, including them in the token definition is preferable. The “any” symbol (i.e. . ) matches both of the anchors, and the negation of a character (e.g. [^a]) also matches either anchor.

 

3. Character classes:

        [:alnum:] [:alpha:] [:lower:] [:xdigit:]

        [:digit:] [:space:] [:upper:] [:print:]

        [:punct:] [:blank:] [:cntrl:] [:graph:]

are defined as in the POSIX "C" locale and have to be extended for the Unicode range, if necessary.

 

4. See the POSIX 1003.2 standard for more details on the regular expression syntax (http://www.unusualresearch.com/regex/regexmanpage.htm).

 

 

Compilation:

 

The lexer compiler fa_build_lex takes two input files: a rule-set and a tagset. The tagset is a list of symbolic names of token types, each of which has a numerical value associated with it. The tagset can be shared with other modules, such as POS tagging, in the case of NL analysis.

 

Suppose the file lex_rules.utf8 contains the following rule-set:

 

    [0-9] < [-+*/] > [-]?[0-9] --> PUNKT

    < [-]?[0-9]+ > --> CD

    < [-+*/]+ > --> WORD

 

The tagset.txt contains the following tagset:

 

    CD 1

    PUNKT 2

    WORD 3

 

The following command will compile the rule-set:

 

    fa_build_lex --in=lex_rules.utf8 --out=lex_rules.dump \

      --build-dump --tagset=tagset.txt

 

The --build-dump parameter produces a memory-dump representation of the compiled rule-set; without this parameter the compiled rule-set is stored in a textual representation. See the description of all switches by typing: fa_build_lex --help

 

If there were no compilation errors the output file lex_rules.dump will be created.

 

 

 

Lexical analysis

 

 

As noted above, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, where each token is a meaningful unit identified by its type and its boundaries. Everything that is not part of a token is ignored. The output tokens cannot overlap or include each other; in other words, each character of the input text belongs to at most one token. In order to guarantee this condition, it is necessary to be able to prefer one match over another when more than one rule matches at a given character of the input text. This is addressed by the conflict resolution rules (see below).

 

Conflict resolution for matching rules:

 

The following is the order in which fa_lex selects which rule to execute if more than one matched the text:

 

1. The leftmost rule,

2. The rule with the longest span,

3. The rule with the smallest left context,

4. The rule with the smallest right context,

5. The rule with no tag assignment (just a function call),

6. The rule with the smaller tag value,

7. The rule with the lexicographically smaller list of function names (based on their tag values).

 

 

Conflict resolution rules #1 and #2 are common to many lexical analyzer implementations (including lex and flex). Rules #3 -- #7 are specific to fa_lex. Unlike lex/flex, fa_lex gives the rule order no role in conflict resolution, so it does not matter in which order the rules are specified.
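Part of this ordering can be expressed as a simple sort key over candidate matches. The following Python sketch covers rules #1, #2 and #6 only; context sizes (#3, #4) and function-call preferences (#5, #7) are omitted for brevity, and the tagset is a hypothetical one:

```python
def best_match(candidates, tag_values):
    """candidates: (start, end, tag) spans that matched somewhere in the
    text.  Returns the winner under rules #1, #2 and #6: leftmost start,
    then longest span, then smallest tag value."""
    return min(candidates,
               key=lambda c: (c[0], -(c[1] - c[0]), tag_values[c[2]]))

# A hypothetical tagset in the style of the "Compilation" section.
TAG_VALUES = {"CD": 1, "PUNKT": 2, "WORD": 3}
```

For example, among candidates starting at the same position the longest span wins, and among equal spans the tag with the smaller numerical value wins.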

 

 

Runtime Execution:

 

Lexical analysis of text can be performed by the stand-alone program fa_lex. It takes two obligatory parameters, a compiled rule-set and a tagset, reads raw text from stdin or an input file, and prints the extracted tokens to stdout or an output file in the tagged-text format. The output can be redirected to programs like fa_ts2ps, fa_gcd, fa_ts2stat or any other program that understands this format.

 

The following commands perform lexical analysis of the input text with respect to the grammar defined and compiled in the previous section:

 

   $ echo 23452345+34534 | fa_lex --tagset=tagset.txt --stage=lex_rules.dump

   > 23452345/CD +/PUNKT 34534/CD

 

   $ echo 23452345+34534+ | fa_lex --tagset=tagset.txt --stage=lex_rules.dump

   > 23452345/CD +/PUNKT 34534/CD +/WORD
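The behavior of this rule-set can be mimicked in Python by rendering the left and right contexts as look-behind/look-ahead assertions and applying the longest-match and smallest-tag-value preferences. This re-based sketch is illustrative only; the compiled fa_lex rule-set is an automaton, not a list of regular expressions:

```python
import re

# The rules from the "Compilation" section, contexts as look-around,
# with the tag values from tagset.txt.
RULESET = [
    (re.compile(r"(?<=[0-9])[-+*/](?=-?[0-9])"), "PUNKT", 2),
    (re.compile(r"-?[0-9]+"), "CD", 1),
    (re.compile(r"[-+*/]+"), "WORD", 3),
]

def run_lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for rx, tag, value in RULESET:
            m = rx.match(text, pos)
            if m and m.end() > m.start():
                # prefer the longest span, then the smaller tag value
                key = (-(m.end() - m.start()), value)
                if best is None or key < best[0]:
                    best = (key, m, tag)
        if best is None:
            pos += 1
            continue
        _, m, tag = best
        tokens.append(f"{m.group()}/{tag}")
        pos = m.end()
    return " ".join(tokens)
```

With this sketch, run_lex reproduces both transcripts above: the "+" between digits matches the PUNKT rule (its contexts are satisfied), while the trailing "+" has no digit to its right and falls back to the WORD rule.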

 

 

 

FAQ

 

 

1. Why use fa_lex?

 

The fa_lex lexical analyzer does not require rule authors to write C/C++ code or even have a C/C++ compiler installed. In fa_lex, the rule-sets are purely declarative. This allows authors (usually linguists) to focus on the linguistic aspects of the problem and stay isolated from the actual implementation. The rules by design cannot contain hard-to-understand logic, they are more independent of each other, and, thus, easier to maintain than in other lexer programs.

 

 

2. How is fa_lex different from flex?

 

Efficiency aspects:

 

 

 

 

 

 

Authoring aspects:

 

 

 

 

 

Other aspects:

 

 

 

3. How would you use a function for context description and conflict resolution?

 

As for context description: if you can detect that a span contains something (for example, quotes at the beginning and the end of the span), then you can apply a specific rule-set to this span (for example, separate rules for the left quote and the right quote, and the same rule-set for the rest of the span). Needless to say, this is impossible to do without functions.

 

As for conflict resolution, the key here is that a function call is always preferred to a tag assignment. So if we need some special/exceptional treatment for particular spans and tag priorities do not work, then we just call a function for those spans and override the behavior.

 

For example, consider ruleset 1:

 

< [^\s\t\r\n]+ > --> WORD

< [A-Za-z][.]([A-Za-z][.])+ > --> ACR

< e[.]g[.] > --> WORD

 

And the tagset:

 

ACR 1

WORD 2

 

The intention of the author is to mark sequences of non-space characters with the tag WORD, sequences of two or more letter-plus-dot pairs with the tag ACR, but the word "e.g." as a WORD. The author wants to choose ACR whenever possible and use the WORD tag only in the remaining cases; that is why the WORD tag has the largest value in the tagset. The word "e.g.", however, belongs to both the /[A-Za-z][.]([A-Za-z][.])+/ language and the /e[.]g[.]/ language, so there is a conflict. This conflict cannot be resolved by changing tag values; it should be resolved either by adding more context to the /e[.]g[.]/ rule or by using a function call as follows (see ruleset 2).

 

Ruleset 2:

 

< [^\s\t\r\n]+ > --> WORD

< [A-Za-z][.]([A-Za-z][.])+ > --> ACR

< e[.]g[.] > --> _call WORD

 

_function WORD

< ^ .+ $ > --> WORD

_end

 

The function WORD assigns the tag WORD to the entire span. According to the priority rules, the "e.g." rule has the highest priority. See the “Conflict resolution for matching rules” section.

 

 

4. How to obtain fa_lex?

 

The fa_lex lexical analyzer is available in the NLG-MAIN enlistment. The program fa_build_lex builds rule-sets, and the fa_lex program is a stand-alone program that performs lexical analysis; see the “Compilation” and “Runtime Execution” sections.