About Grammars
Start with an unstructured file (list) of characters.
"program var a : integer; const max = 10; ..."
Describe lexical structure with finite state machine
state 0:
if char is digit, goto state 1
if char is alphabetic, goto state 2
if char is symbol, goto state 3
if char is white space, goto state 0
else error
state 1:
if char is digit, goto state 1
else goto state 0
...
Or, we can use "linear" patterns (regular expressions)
ident = alpha ( alpha | digit ) *
number = digit digit*
...
Either way, we can end up with list of tokens.
[program, var, ide(a), :, integer, ';', const, ide(max), =, int(10),
';', ... ]
-- advantages of separate lexical pass
: less irrelevant detail (white space, comments)
: less complexity (reserved words embedded in strings)
: easier to change (add reserved words)
Remaining program structure is non-linear (recursive). Use context-free
syntax (e.g. forget about subtle type errors)
Expression:
E ::= E + E
| E - E
| E * E
| E / E
| Ident
| Number
| ( E )
Structure needs to be unambiguous; e.g.
7 - 5 - 1
should always be grouped as
(7 - 5) - 1
not
7 - (5 - 1)
Try again:
E ::= E + T
| E - T
| T
T ::= T * F
| T / F
| F
F ::= Ident
| Number
| ( E )
Now, the problem (for Prolog) is that a naive translation will be
left-recursive (i.e. Prolog's search strategy tries for the
infinitely-long expression alternative first!)
Hack:
E ::= T E'
E' ::= + E
| - E
| <nothing>
T ::= F T'
T' ::= * T
| / T
| <nothing>
F ::= <as before>
Actually, not quite right for minus. Really need
E' ::= ...
| - T E'
| ...