lex


NAME

       lex - Generates programs for lexical tasks


SYNOPSIS

       lex [-ct] [-n | -v] [file ...]

       [Digital]  The  following  syntax applies when the CMD_ENV
       environment variable is set to svr4:

       lex [-crt] [-n | -v] [-V] [-Qy | -Qn] [file ...]


STANDARDS

       Interfaces documented on this reference  page  conform  to
       industry standards as follows:

       lex:  XPG4, XPG4-UNIX

       Refer to the standards(5) reference page for more informa-
       tion about industry standards and associated tags.


FLAGS

       -c  Writes C code to the file lex.yy.c.  This is the default.

       -n  Suppresses the statistics summary.  When you set your own
           table sizes for the finite state machine, lex
           automatically produces this summary if you do not select
           this flag.

       -r  [Digital]  Writes RATFOR code to the file lex.yy.r.
           (There is no RATFOR compiler for DIGITAL UNIX.)

       -t  Writes to standard output instead of writing to a file.

       -v  Provides a summary of the generated finite state machine
           statistics.

       -V  [Digital]  Outputs the lex version number to standard
           error.  Requires the environment variable CMD_ENV to be
           set to svr4.

       -Qy | -Qn
           [Digital]  Determines whether the lex version number is
           written to the output file.  -Qn does not do so and is
           the default.  Requires the environment variable CMD_ENV
           to be set to svr4.


DESCRIPTION

       The  lex  command  uses the rules and actions contained in
       file to generate a program, lex.yy.c, which  can  be  com-
       piled  with the cc command.  That program can then receive
       input, break the input into the logical pieces defined  by
       the  rules in file, and run program fragments contained in
       the actions in file.

       The generated program is  a  C  Language  function  called
       yylex().   The  lex command stores yylex() in a file named
       lex.yy.c.  You can use yylex() alone to recognize  simple,
       1-word input, or you can use it with other C Language pro-
       grams to perform more difficult input analysis  functions.
       For  example,  you  can use lex to generate a program that
       tokenizes an input stream before sending it  to  a  parser
       program generated by the yacc command.

       The yylex() function analyzes input using a program
       structure called a finite state machine.  This structure
       allows the program to exist in only one state (or condition)
       at a time.  A finite number of states is allowed.  The rules
       in file determine how the program moves from one state to
       another in response to the input that the program receives.

       The  lex  command  reads its skeleton finite state machine
       from the file /usr/ccs/lib/ncpform or /usr/ccs/lib/ncform.
       Use  the  environment  variable  LEXER  to specify another
       location for lex to read from.

       If you do not specify a file, lex  reads  standard  input.
       It treats multiple files as a single file.

   Input File Format
       The input file can contain three sections: definitions,
       rules, and user subroutines.  Each section must be separated
       from the others by a line containing only the delimiter, %%.
       The format is as follows:

       definitions
       %%
       rules
       %%
       user_subroutines

       The purpose and format of each of these sections are
       described under the headings that follow.

   Definitions Section
       If you want to use variables in rules, you must define them
       in the definitions section.  The variables make up the left
       column, and their definitions make up the right column.  For
       example, to define D as a numerical digit, enter:

       D    [0-9]

       You can use a defined variable in the rules section by
       enclosing the variable name in braces, {D}.

       In the definitions section, you can set either of the
       following two mutually exclusive declarations:

       %array    Declare the type of yytext to be a null-terminated
                 character array.

       %pointer  Declare the type of yytext to be a pointer to a
                 null-terminated character string.  Use of the
                 %pointer definition selects the
                 /usr/ccs/lib/ncpform skeleton.

       In the definitions section, you can also set table sizes for
       the resulting finite state machine.  The default sizes are
       large enough for small programs.  You may want to set larger
       sizes for more complex programs:

       %p number  Number of positions is number (default 5000)
       %n number  Number of states is number (default 2500)
       %e number  Number of parse tree nodes is number (default 2000)
       %a number  Number of transitions is number (default 5000)
       %k number  Number of packed character classes is number
                  (default 2000)
       %o number  Number of output slots is number (default 5000)

       If extended characters appear in regular expression strings,
       you may need to reset the output array size with %o to
       reflect the larger number of extended characters relative to
       the number of ASCII characters.

   Rules Section
       The rules section is required, and it must be preceded  by
       the  %%  delimiter,  even if you do not have a definitions
       section.  The lex command does not recognize rules without
       the delimiter.

       In  this  section, the left column contains the pattern to
       be recognized in an input file to yylex().  The right col-
       umn  contains  the  C  program fragment executed when that
       pattern is recognized.

       Patterns can include extended characters with  one  excep-
       tion: extended characters may not appear in range specifi-
       cations within character class expressions  surrounded  by
       brackets.

       The columns are separated by a tab.  For example, to search
       files for the word LEAD and replace it with GOLD, perform
       the following steps:

       1.  Create a file called transmute.l containing the lines:

           %%
           (LEAD)   printf("GOLD");

       2.  Issue the following commands to the shell:

           lex transmute.l
           cc -o transmute lex.yy.c -ll

       3.  Test the resulting program with the command:

           transmute <transmute.l

       This command echoes the contents of transmute.l, with the
       occurrences of LEAD changed to GOLD.

       Each  pattern  may have a corresponding action, that is, a
       fragment of C source code to execute when the  pattern  is
       matched.   Each  statement  must end with a ; (semicolon).
       If you use more than one statement in an action, you  must
       enclose  all  of them in {} (braces).  A second delimiter,
       %%, must follow the rules section if you have a user  sub-
       routine section.

       When  yylex()  matches  a  string  in the input stream, it
       copies the matched text to an  external  character  array,
       yytext,  before  it executes any actions in the rules sec-
       tion.

       You can use the following operators to form patterns that
       you want to match:

       x, y    Matches the characters written.

       [ ]     Matches any one character in the enclosed range
               ([.-.]) or the enclosed list ([...]).  [abcx-z]
               matches a, b, c, x, y, or z.

       " "     Matches the enclosed character or string even if it
               is an operator.  "$" prevents lex from interpreting
               the $ character as an operator.

       \       Acts the same as double quotes.  \$ prevents lex from
               interpreting the $ character as an operator.

       *       Matches zero or more occurrences of the
               single-character regular expression immediately
               preceding it.  x* matches zero or more repeated
               literal characters x.

       +       Matches one or more occurrences of the
               single-character regular expression immediately
               preceding it.

       ?       Matches either zero or one occurrence of the
               single-character regular expression immediately
               preceding it.

       ^       Matches the character only at the beginning of a
               line.  ^x matches an x at the beginning of a line.

       [^ ]    Matches any character except for the characters
               following the ^.  [^xyz] matches any character but
               x, y, or z.

       .       Matches any character except the newline character.

       $       Matches the end of a line.

       |       Matches either of two characters.  x|y matches
               either x or y.

       /       Matches one extended regular expression (ERE) only
               when followed by a second ERE.  It reads only the
               first token into yytext.  Given the regular
               expression a*b/cc and the input aaabcc, yytext would
               contain the string aaab on this match.

       ( )     Matches the pattern in the ( ) (parentheses).  This
               is used for grouping.  It reads the whole pattern
               into yytext.  A group in parentheses can be used in
               place of any single character in any other pattern.
               (xyz123) matches the pattern xyz123 and reads the
               whole string into yytext.

       { }     Matches the character as defined in the definitions
               section.  If D is defined as numeric digits, {D}
               matches all numeric digits.

       {m,n}   Matches m-to-n occurrences of the specified
               character.  x{2,4} matches 2, 3, or 4 occurrences
               of x.
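
       The trailing-context operator (/) is the least obvious of
       these.  The following is a minimal sketch in C of what the
       pattern a*b/cc does at the current input position; the
       function name match_trailing is hypothetical and is not
       something lex generates.  The returned length covers only
       the text that would be placed in yytext; the trailing cc is
       checked but left in the input.

```c
#include <stddef.h>

/* Sketch of a*b/cc: returns the length of the text that would land in
   yytext, or 0 if the pattern does not match at the start of s. */
size_t match_trailing(const char *s)
{
    size_t n = 0;

    while (s[n] == 'a')             /* a* : zero or more a characters */
        n++;
    if (s[n] != 'b')                /* b  : exactly one b */
        return 0;
    n++;
    /* /cc : the next two characters must be cc, but they are not
       counted, so they remain available for the next rule. */
    if (s[n] == 'c' && s[n + 1] == 'c')
        return n;
    return 0;
}
```

       For the input aaabcc, this returns 4, the length of aaab.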

       If a line begins with only a space, lex copies it  to  the
       lex.yy.c  output  file.  If the line is in the definitions
       section of file, lex copies it to the declarations section
       of  lex.yy.c.   If  the  line is in the rules section, lex
       copies it to the program code section of lex.yy.c.

   User Subroutines Section
       The lex library has three subroutines defined as macros that
       you can use in the rules:

       input()    Reads a character from yyin.
       unput()    Replaces a character after it is read.
       output()   Writes a character to yyout.

       You can override these three macros by writing your own code
       for these routines in the user subroutines section.  If you
       write your own routines, you must undefine these macros in
       the definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}

       When you are using lex as a simple transformer/recognizer
       for stdin to stdout piping, you can avoid writing the
       framework by using libl.a (the lex library).  It has a main
       routine that calls yylex() for you.

       External  names generated by lex all begin with the prefix
       yy, as in yyin, yyout, yylex, and yytext.

   Putting Spaces in an Expression
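       Normally, spaces or tab characters end a rule and,
       therefore, the expression that defines the rule.  However,
       you can enclose the spaces or tab characters in "" (double
       quotes) to include them in the expression.  Use quotes
       around all spaces in expressions that are not already within
       sets of [ ] (brackets).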

   Other Special Characters
       The lex program recognizes many of the normal C language
       special characters.  These character sequences are as
       follows:

       Sequence       Meaning

       \n             Newline
       \t             Tab
       \b             Backspace
       \\             Backslash
       \digits        The character whose encoding is represented
                      by the three-digit octal number
       \xdigits       The character whose encoding is represented
                      by the hexadecimal integer

       Do not use the actual newline character in an expression.

       When  using these special characters in an expression, you
       do not need to enclose them in quotes.   Every  character,
       except   these   special  characters  and  the  previously
       described operator symbols, is always a text character.

   Matching Rules
       When more than one expression can match the current input,
       lex chooses the longest match first.  Among rules that match
       the same number of characters, the rule that occurs first is
       chosen.  For example:

       integer   keyword action...;
       [a-z]+    identifier action...;

       If the preceding rules are given in that order  and  inte-
       gers  is the input word, lex matches the input as an iden-
       tifier because  [a-z]+  matches  eight  characters,  while
       integer  matches  only  seven.   However,  if the input is
       integer, both rules match seven characters.   The  keyword
       rule  is  selected  because  it  occurs  first.  A shorter
       input, such as int, does not  match  the  expression  rule
       integer and causes lex to select the rule identifier.
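
       The selection between these two rules can be sketched in C
       as follows.  This is only an illustration of the
       longest-match and first-rule-wins behavior, not the code
       that lex generates; the names kind, match_keyword,
       match_identifier, and classify are hypothetical, and the
       range test assumes an ASCII-compatible character set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not part of lex itself. */
enum kind { KEYWORD, IDENTIFIER, NO_MATCH };

/* Length of the prefix of s matched by the literal rule "integer". */
static size_t match_keyword(const char *s)
{
    static const char kw[] = "integer";
    size_t n = sizeof kw - 1;

    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the prefix of s matched by [a-z]+ (0 if none). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Choose the rule with the longest match; on a tie, the rule listed
   first (the keyword rule) wins. The match length is stored in *len. */
enum kind classify(const char *s, size_t *len)
{
    size_t k = match_keyword(s);
    size_t i = match_identifier(s);

    if (k == 0 && i == 0) {
        *len = 0;
        return NO_MATCH;
    }
    if (k >= i) {                   /* tie goes to the earlier rule */
        *len = k;
        return KEYWORD;
    }
    *len = i;
    return IDENTIFIER;
}
```

       With this sketch, integers is classified as an identifier of
       length 8, integer as a keyword of length 7, and int as an
       identifier of length 3.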

   Matching a String with Wildcard Characters
       Because lex chooses the longest match first, do not use
       rules containing expressions like .* in a pattern such as
       the following:

       '.*'

       The preceding rule might seem like a good way to recognize a
       string in single quotes.  However, the lexical analyzer
       reads far ahead, looking for a distant single quote to
       complete the long match.  If a lexical analyzer with such a
       rule gets the following input, it matches everything from
       the first single quote through the last one on the line:

       'first' quoted string here, 'second' here

       To find the smaller strings, first and second, use the
       following rule:

       '[^'\n]*'

       This rule stops after matching 'first'.

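       Errors of this type are not far reaching, because the .
       (dot) operator does not match a newline character.
       Therefore, expressions like .* stop on the current line.  Do
       not try to defeat this with expressions like [.\n]+.  The
       lexical analyzer tries to read the entire input file, and an
       internal buffer overflow occurs.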
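
       The difference between the patterns '.*' and '[^'\n]*' can
       be sketched in C as follows.  The functions match_greedy
       and match_bounded are hypothetical names used only for this
       illustration; each returns the length of the longest match
       starting at s, or 0 if there is none.

```c
#include <stddef.h>

/* Sketch of '.*' : from an opening quote to the LAST quote on the
   same line, because .* takes as much of the line as it can. */
size_t match_greedy(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;               /* remember the latest closing quote */
    return last ? last + 1 : 0;
}

/* Sketch of '[^'\n]*' : from an opening quote to the NEXT quote,
   because the bracket expression cannot cross a quote or a newline. */
size_t match_bounded(const char *s)
{
    size_t i = 1;

    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

       Applied to the line 'first' quoted string here, 'second'
       here, match_greedy covers the text through the quote that
       closes 'second', while match_bounded covers only 'first'.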

   Finding Strings within Strings
       The lex program partitions the input stream and does not
       search for all possible matches of each expression.  Each
       character is accounted for once and only once.  For example,
       to count occurrences of both she and he in an input text,
       try the following rules:

       she  s++;
       he   h++;
       \n   |
       .    ;

       The last two rules ignore everything besides he and she.
       However, because she includes he, lex does not recognize the
       instances of he that are included in she.

       To override this choice, use the REJECT action.  This
       directive tells lex to go to the next rule.  lex then
       adjusts the position of the input pointer to where it was
       before the first rule was executed, and executes the second
       choice rule.  For example, to count the included instances
       of he, use the following rules:

       she  {s++; REJECT;}
       he   {h++; REJECT;}
       \n   |
       .    ;

       After counting the occurrences of she, lex rejects the input
       stream and then counts the occurrences of he.  In this case,
       you can omit the REJECT action on he because she includes he
       but not vice versa.  In other cases, it may be difficult to
       determine which input characters are in both classes.

       In general, REJECT is useful whenever the purpose  of  lex
       is  not  to  partition  the input stream but to detect all
       examples of some items in the input, and the instances  of
       these items may overlap or include each other.
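
       The effect of REJECT on the she and he counts can be
       sketched in C as follows.  This is a hand-written
       illustration under the assumption that every rejected match
       eventually falls through to the single-character rule; the
       struct counts type and both function names are hypothetical
       and are not produced by lex.

```c
#include <string.h>

/* Hypothetical result type for this sketch. */
struct counts { int she; int he; };

/* Without REJECT: the input is partitioned, so the he inside a
   matched she is consumed and never counted on its own. */
struct counts count_partitioned(const char *s)
{
    struct counts c = {0, 0};

    while (*s != '\0') {
        if (strncmp(s, "she", 3) == 0) {
            c.she++;
            s += 3;
        } else if (strncmp(s, "he", 2) == 0) {
            c.he++;
            s += 2;
        } else {
            s++;                    /* the \n and . rules: ignore */
        }
    }
    return c;
}

/* With REJECT on both rules: each match is counted, then rejected,
   and the scanner ends up consuming one character via the . rule,
   so every input position is tested against both patterns. */
struct counts count_with_reject(const char *s)
{
    struct counts c = {0, 0};

    for (; *s != '\0'; s++) {
        if (strncmp(s, "she", 3) == 0)
            c.she++;
        if (strncmp(s, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

       For the input "she he", count_partitioned finds one she and
       one he, while count_with_reject finds one she and two
       instances of he.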

   Environment Variables
       The following environment variables affect the behavior of
       lex:

       LANG         Provides a default value for the locale
                    category variables that are not set or null.

       LC_ALL       If set, overrides the values of all other
                    locale variables.

       LC_COLLATE   Determines the order in which output is sorted
                    for the -x option.

       LC_CTYPE     Determines the locale for the interpretation of
                    byte sequences as characters (single-byte or
                    multi-byte) in input parameters and files.

       LC_MESSAGES  Determines the locale used to affect the format
                    and contents of diagnostic messages displayed
                    by the command.

       NLSPATH      Determines the location of message catalogs for
                    the processing of LC_MESSAGES.


NOTES

       Because lex uses fixed names for intermediate and output
       files, you can have only one lex-generated program in a
       given directory.

       If the -t option is not specified, informational, error, and
       warning messages are written to stdout.  If the -t option is
       specified, informational, error, and warning messages are
       written to stderr.

       The yytext array has a default size of 200, controlled by
       the constant YYLMAX.  If the programmer needs to allow a
       larger array, the YYLMAX constant may be redefined as
       follows from within the lex command file:

       %{
       #undef YYLMAX
       #define YYLMAX 8192
       %}

       Two other arrays, yysbuf and yylstate, also use YYLMAX.


EXAMPLES

       The following command draws lex instructions from the file
       lexcommands and places the output in lex.yy.c:

       lex lexcommands

       The file lexcommands contains an example of a lex program
       that would be put into a lex command file.  The following
       program converts uppercase to lowercase, removes spaces at
       the end of a line, and replaces multiple spaces with single
       spaces:

       %%
       [A-Z]   putchar(tolower(yytext[0]));
       [ ]+$   ;
       [ ]+    putchar(' ');
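
       To make the behavior of these three rules concrete, the
       following is a hand-written C sketch of the same
       transformation applied to a string.  It is not the scanner
       that lex generates; the function name filter is
       hypothetical, and the sketch assumes that $ matches only
       immediately before a newline character and that the C
       locale is in effect, so isupper() corresponds to [A-Z].

```c
#include <ctype.h>

/* Writes the filtered copy of in to out. The caller must provide an
   out buffer of at least strlen(in) + 1 bytes. */
void filter(const char *in, char *out)
{
    while (*in != '\0') {
        if (*in == ' ') {
            while (*in == ' ')      /* [ ]+ : consume the whole run */
                in++;
            if (*in != '\n')        /* not [ ]+$ : keep one space */
                *out++ = ' ';
        } else if (isupper((unsigned char)*in)) {
            /* [A-Z] : emit the lowercase form */
            *out++ = (char)tolower((unsigned char)*in++);
        } else {
            *out++ = *in++;         /* unmatched text is copied as is */
        }
    }
    *out = '\0';
}
```

       Given the two-line input "Hello   World  " followed by
       "A  B", the sketch produces "hello world" followed by "a b".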


FILES

       libl.a
              Run-time library.

       /usr/ccs/lib/ncform
              Default C language skeleton finite state machine for
              lex.

       /usr/ccs/lib/ncpform
              Default C language skeleton finite state machine for
              lex, implemented with the pointer definition of
              yytext.

       Default RATFOR language skeleton finite state machine for
       lex.


RELATED INFORMATION

       Commands:  yacc(1)

       Guides: Programming Support Tools

       Standards: standards(5)