LEX/FLEX AND YACC/BISON OVERVIEW Department of Computer Science Mekelle University
General Lex/Flex Information • lex is a tool to generate lexical analyzers. It was written by Mike Lesk and Eric Schmidt (the Google guy). divides a stream of input characters into meaningful units (lexemes), identifies them (token) and may the token to a parser generator, yacc lex specifications are regular expressions
• flex (fast lexical analyzer generator) –Free and open source alternative. –You’ll be using this.
General Yacc/Bison Information • Yacc (yet another compiler compiler) –Is a tool to generate parsers (syntactic analyzers). –Generated parsers require a lexical analyzer. –It isn’t used anymore.
• Bison –Free and open source alternative. –You’ll be using this.
Lex/Flex and Yacc/Bison relation to a compiler tool chain
FLEX IN DETAIL
How Flex Works • Flex uses a .l spec file to generate a tokenizer/scanner.
• The tokenizer reads an input file and chunks it into a series of tokens which are ed to the parser. • Flex is a program that automatically creates a scanner in C, using rules for tokens as regular expressions.
Internal Structure of Lex/Flex
Regular expressions
NFA
DFA
Minimal DFA
The final states of the DFA are associated with actions
Flex/Lex Structure • Format of the input file is like …
Definitions section (1) • There are three things that can go in the definitions section: • C code Any indented code between %{ and %} is copied to the C file. This is typically used for defining file variables, and for prototypes of routines that are defined in the code segment. • definitions A definition is very much like a #define p directive. For example letter [a-zA-Z] digit [0-9] punct [,.:;!?] nonblank [ˆ \t] These definitions can be used in the rules section: one could start a rule {letter}+ {...
Definitions section (2) • State definitions If a rule depends on context, it’s possible to introduce states and incorporate those in the rules. A state definition looks like %s STATE, and by default a state INITIAL is already given.
Rules section • The rules section has a number of pattern-action pairs. • The patterns are regular expressions and the actions are either a single C command, or a sequence enclosed in braces. • Example: RE Action \n [0-9]+ [a-zA-Z]
linenum++; printf(“integer”); printf(“letter”);
Lex/Flex Regular Expression (1) • Regular Expression contains: – text characters (which match the corresponding characters in the strings being compared) and – operator characters (which specify repetitions, choices, and other features).
• Text characters: the letters of the alphabet and the digits are always text characters. • Operator Characters: the operator characters are: “\[]^-?.*+|()$/{}%<> • and, if they are to be used as text characters, an escape (\) should be used. • The quotation mark operator (“) indicates that whatever is contained between a pair of quotes is to be taken as text character.
Lex/Flex Regular Expression (2) • Character classes: • Classes of characters can be specified using the operator pair []. • The construction [abc] matches a single character, which may be a, b, c. • Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^ • The – character indicates ranges, for example, [a-z0-9] (i.e., it indicates the character class containing all the lower case letters and digits) • If it is desired to include the character - in a character class, it should be first or last; Ex: [-+0-9] matches all digits and two signs.
Lex/Flex Regular Expression (3) • In character classes, the ^ operator must appear as the first character after the left bracket; • It indicates that the resulting string is to be complemented with respect to the computer character set. • Example: • [^abc] matches all characters except a, b, or c • [^a-zA-Z] matches any character which is not a letter. • The \ character provides the usual escapes within character class brackets.
Lex/Flex Regular Expression (4) • Arbitrary character: • To match almost any character, the operator character. (dot) is the class of all characters except newline.
• Optional Expression: • The operator ? Indicates an optional element of an expression. Ex: ab?c matches either ac or abc
• Repeated Expressions: • Repetitions of classes are indicated by the operators * and + • Ex: a* is any number of consecutive a characters including zero; while a+ is one or more instance of a.
Lex/Flex Regular Expression (5) • Example: • [a-z]+ is all strings of lower case letters • [A-Z][a-z]+ indicates strings with a first upper case letter followed by any number of lower case letters. • Alternation and Grouping: • The operator | indicates alternation: • Ex: (ab|cd) matches either ab or cd. • Ex: (ab | cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef
Lex/Flex Regular Expression (6) • Repetition and Definitions: • The operators { } specify either repetitions (if they enclose number) or definition expression (if they enclose a name) • Ex: {digit} looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. • In contrast, a{1, 5} looks for 1 to 5 occurrences of a.
Example Pattern c “c” \c [cd] [a-z] [^c] . ^x x$ x?
Meaning The char “c” The char “c” even if it is a special char in this table Same as “c”, used to quote a single char The char c or the char d Any single char in the range a through z Any char but c Any char but newline The pattern x if it occurs at the beginning of a line The pattern x at the end of a line An optional x
Example x* x+ xy x|y (x) x/y <S>x {name} x{m} x{m,n}
Zero or more occurrences of the pattern x One or more occurrences of the pattern x The pattern x concatenated with the pattern y An x or a y An x An x only if followed by y The pattern x when lex is in start condition S The value of a macro from definitions section m occurrences of the pattern x m through n occurrences of x (takes precedence over concatenation)
Flex/Lex Predefined Variables Name int yylex(void) char *yytext yyleng yylval int yywrap(void) FILE *yyout FILE *yyin INITIAL BEGIN ECHO
function call to invoke lexer, returns token pointer to matched string length of matched string value associated with token wrapup, return 1 if done, 0 if not done output file input file initial start condition condition switch start condition write matched string
Flex/Lex Action • When an expression written matched, Lex/Flex executes the corresponding actions. • Example: %% [a-z]+ printf (“alpha\n”); [0-9]+ printf (“numeric\n”); [a-z0-9]+ printf (“alphanumeric\n”); [ \t]+ printf (“white space\n”); . printf (“special char\n”); \n ; %%
Disambiguation Rules • If there are several patterns which match the current input, yylex() chooses one of them according to these rules: 1. The longest match is preferred. 2. Among rules that match the same number of characters, the rule that occurs earliest in the list is preferred.
Example • Show the output if the input to yylex() generated by the lex program above is abc123 abc 123?x • Solution: alphanumeric white space alpha white space numeric special char alpha