<!-- X-URL: http://www.cs.princeton.edu/~appel/modern/java/JLex/manual.html -->
<BR> <P>
<H1 ALIGN=CENTER>JLex:<BR> A lexical analyzer generator for Java<sup><small>(TM)</small></sup><BR>
</H1>
<P ALIGN=CENTER>
<STRONG>Elliot Berk<BR>
Department of Computer Science, Princeton University
</STRONG></P>
<P ALIGN=CENTER>Version 1.2, May 5, 1997</P>
<P ALIGN=CENTER>Manual revision October 29, 1997</P>
<P ALIGN=CENTER>Last updated September 6, 2000 for JLex 1.2.5</P>
<P ALIGN=CENTER>(latest version can be obtained from
<A HREF="http://www.cs.princeton.edu/~appel/modern/java/JLex/">http://www.cs.princeton.edu/~appel/modern/java/JLex/</A> )</p>
<P>
<HR>
<P><H2><A NAME="SECTION00010000000000000000">Contents</A></H2>
<UL>
<LI> <A NAME="tex2html54"
HREF="#SECTION1">1. Introduction</A>
<LI> <A NAME="tex2html55"
HREF="#SECTION2">2. JLex Specifications</A>
<UL>
<LI> <A NAME="tex2html56"
HREF="#SECTION2.1">2.1 User Code</A>
<LI> <A NAME="tex2html57"
HREF="#SECTION2.2">2.2 JLex Directives</A>
<UL>
<LI> <A NAME="tex2html58"
HREF="#SECTION2.2.1">2.2.1 Internal Code to Lexical Analyzer Class</A>
<LI> <A NAME="tex2html59"
HREF="#SECTION2.2.2">2.2.2 Initialization Code for Lexical Analyzer Class</A>
<LI> <A NAME="tex2html60"
HREF="#SECTION2.2.3">2.2.3 End-of-File Code for Lexical Analyzer Class</A>
<LI> <A NAME="tex2html61"
HREF="#SECTION2.2.4">2.2.4 Macro Definitions</A>
<LI> <A NAME="tex2html62"
HREF="#SECTION2.2.5">2.2.5 State Declarations</A>
<LI> <A NAME="tex2html63"
HREF="#SECTION2.2.6">2.2.6 Character Counting</A>
<LI> <A NAME="tex2html64"
HREF="#SECTION2.2.7">2.2.7 Line Counting</A>
<LI> <A NAME="tex2html65"
HREF="#SECTION2.2.8">2.2.8 Java CUP Compatibility </A>
<LI> <A NAME="tex2html66"
HREF="#SECTION2.2.9">2.2.9 Lexical Analyzer Component Titles</A>
<LI> <A NAME="tex2html67"
HREF="#SECTION2.2.10">2.2.10 Default Token Type</A>
<LI> <A NAME="tex2html68"
HREF="#SECTION2.2.11">2.2.11 Default Token Type II: Wrapped Integer</A>
<LI> <A NAME="tex2html69"
HREF="#SECTION2.2.12">2.2.12 YYEOF on End-of-File</A>
<LI> <A NAME="tex2html70"
HREF="#SECTION2.2.13">2.2.13 Newlines and Operating System Compatibility</A>
<LI> <A NAME="tex2html71"
HREF="#SECTION2.2.14">2.2.14 Character Sets</A>
<LI> <A NAME="tex2html72"
HREF="#SECTION2.2.15">2.2.15 Character Format To and From File</A>
<LI> <A NAME="tex2html73"
HREF="#SECTION2.2.16">2.2.16 Exceptions Generated by Lexical Actions</A>
<LI> <A NAME="tex2html74"
HREF="#SECTION2.2.17">2.2.17 Specifying the Return Value on End-of-File</A>
<LI> <A NAME="tex2html74a"
HREF="#SECTION2.2.18">2.2.18 Specifying an interface to implement</A>
<LI> <A NAME="tex2html75"
HREF="#SECTION2.2.19">2.2.19 Making the Generated Class Public</A>
</UL>
<LI> <A NAME="tex2html75"
HREF="#SECTION2.3">2.3 Regular Expression Rules</A>
<UL>
<LI> <A NAME="tex2html76"
HREF="#SECTION2.3.1">2.3.1 Lexical States</A>
<LI> <A NAME="tex2html77"
HREF="#SECTION2.3.2">2.3.2 Regular Expressions</A>
<LI> <A NAME="tex2html78"
HREF="#SECTION2.3.3">2.3.3 Associated Actions</A>
<UL>
<LI> <A NAME="tex2html79"
HREF="#SECTION2.3.3.1">2.3.3.1 Actions and Recursion:</A>
<LI> <A NAME="tex2html80"
HREF="#SECTION2.3.3.2">2.3.3.2 State Transitions:</A>
<LI> <A NAME="tex2html81"
HREF="#SECTION2.3.3.3">2.3.3.3 Available Lexical Values:</A>
</UL>
</UL>
</UL>
<LI> <A NAME="tex2html82"
HREF="#SECTION3">3. Generated Lexical Analyzers</A>
<LI> <A NAME="tex2html83"
HREF="#SECTION4">4. Performance</A>
<LI> <A NAME="tex2html84"
HREF="#SECTION5">5. Implementation Issues</A>
<UL>
<LI> <A NAME="tex2html85"
HREF="#SECTION5.1">5.1 Unimplemented Features</A>
<LI> <A NAME="tex2html86"
HREF="#SECTION5.2">5.2 Unicode vs Ascii</A>
<LI> <A NAME="tex2html87"
HREF="#SECTION5.3">5.3 Commas in State Lists</A>
<LI> <A NAME="tex2html88"
HREF="#SECTION5.4">5.4 Wish List of Unimplemented Features</A>
</UL>
<LI> <A NAME="tex2html89"
HREF="#SECTION6">6. Credits and Copyrights</A>
<UL>
<LI> <A NAME="tex2html90"
HREF="#SECTION6.1">6.1 Credits</A>
<LI> <A NAME="tex2html91"
HREF="#SECTION6.2">6.2 Copyright</A>
</UL>
</UL>
<P>
<BR> <HR>
<BR> <P>
<H1><A NAME="SECTION1">1. Introduction</A></H1>
<P>
A lexical analyzer breaks an input stream of characters
into tokens.
Writing lexical analyzers by hand can be a tedious
process, so software tools have been developed to ease
this task.
<P>
Perhaps the best known such utility is Lex.
Lex is a lexical analyzer generator for the UNIX
operating system, targeted to the C programming language.
Lex takes a specially-formatted specification file
containing the details of a lexical analyzer.
This tool then creates a C source file for the
associated table-driven lexer.
<P>
The JLex utility is based upon the Lex lexical
analyzer generator model. JLex takes a specification
file similar to that accepted by Lex, then
creates a Java source file for the corresponding lexical
analyzer.
<P>
<BR> <HR>
<BR> <P>
<H1><A NAME="SECTION2">2. JLex Specifications</A></H1>
<P>
A JLex input file is organized into three sections,
separated by double-percent directives (``%%'').
A proper JLex specification has the following format.<BR>
<I>user code</I><BR>
%%<BR>
<I>JLex directives</I><BR>
%%<BR>
<I>regular expression rules</I><BR>
The ``%%'' directives distinguish sections of the input
file and must be placed at the beginning of their line.
The remainder of the line containing the ``%%'' directives
may be discarded and should not be used to house
additional declarations or code.
<P>
The user code section - the first section of the specification
file - is copied directly into the resulting output file.
This area of the specification provides space for the
implementation of utility classes or return types.
<P>
The JLex directives section is the second part of the input
file. Here, macro definitions are given and state names
are declared.
<P>
The third section contains the rules of lexical analysis,
each of which consists of three parts: an optional state list,
a regular expression, and an action.
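<P>
The three-part layout described above can be illustrated with a short Java sketch that splits a specification at lines beginning with ``%%''. This is only an illustration of the file format, not how JLex itself reads its input; the class name <I>SpecSections</I> and the sample specification are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SpecSections {
    /** Splits a specification into sections at lines that begin with "%%". */
    static List<String> split(String spec) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : spec.split("\n", -1)) {
            if (line.startsWith("%%")) {
                // The remainder of a "%%" line is discarded.
                sections.add(current.toString());
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        sections.add(current.toString());
        return sections;
    }

    public static void main(String[] args) {
        // Hypothetical specification: user code, directives, one rule.
        String spec = String.join("\n",
            "import java.io.IOException;",
            "%%",
            "%line",
            "DIGIT=[0-9]",
            "%%",
            "{DIGIT}+ { return null; }");
        List<String> parts = split(spec);
        System.out.println("user code:\n" + parts.get(0));
        System.out.println("directives:\n" + parts.get(1));
        System.out.println("rules:\n" + parts.get(2));
    }
}
```

A well-formed specification yields exactly three sections: user code, JLex directives, and regular expression rules, in that order.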
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.1">2.1 User Code</A></H2>
<P>
User code precedes the first double-percent directive (``%%'').
This code is copied verbatim into the lexical analyzer source
file that JLex outputs, at the top of the file.
Therefore, if the lexer source file needs to begin
with a package declaration or with
the importation of an external class,
the user code section should begin with
the corresponding declaration.
This declaration will then be copied onto
the top of the generated source file.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.2">2.2 JLex Directives</A></H2>
<P>
The JLex directive section begins after the first ``%%''
and continues until the second ``%%'' delimiter.
Each JLex directive should be contained on a single line
and should begin that line.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.1">2.2.1 Internal Code to Lexical Analyzer Class</A></H3>
<P>
The <I>%{...%}</I> directive allows the user to write
Java code to be copied into the lexical analyzer class.
This directive is used as follows.<BR>
<I>%{ </I><BR>
<I>&lt;code&gt; </I><BR>
<I>%} </I><BR>
To be properly recognized, the <I>%{ </I> and <I>%} </I>
should each be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I> will then be copied into
the lexical analyzer class created by JLex.<BR>
<I>class Yylex { </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
This permits the declaration of variables and functions
internal to the generated lexical analyzer class.
Variable names beginning with <I>yy</I> should be
avoided, as these are reserved for use by the generated
lexical analyzer class.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.2">2.2.2 Initialization Code for Lexical Analyzer Class</A></H3>
<P>
The <I>%init{ ... %init}</I> directive allows the user to write
Java code to be copied into the constructor for the
lexical analyzer class.<BR>
<I>%init{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%init} </I><BR>
The <I>%init{</I> and <I>%init}</I> directives
should be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I> will then be copied into
the lexical analyzer class constructor.<BR>
<I>class Yylex { </I><BR>
<I>Yylex () { </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
<I>} </I><BR>
This directive permits one-time initializations
of the lexical analyzer class from inside its constructor.
Variable names beginning with <I>yy</I> should be
avoided, as these are reserved for use by the generated
lexical analyzer class.
<P>
The code given in the <I>%init{ ... %init}</I> directive
may potentially throw an exception, or propagate it from
another function. To declare this exception, use
the <I>%initthrow{ ... %initthrow}</I> directive.<BR>
<I>%initthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%initthrow} </I><BR>
The Java code specified here will be copied
into the declaration of the lexical analyzer
constructor.<BR>
<I>Yylex () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>{ </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
If the Java code given in the <I>%init{ ... %init}</I>
directive throws an exception that is not declared,
the resulting lexical analyzer source file may not compile
successfully.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.3">2.2.3 End-of-File Code for Lexical Analyzer Class</A></H3>
<P>
The <I>%eof{ ... %eof}</I> directive
allows the user to write Java code to be
copied into the lexical analyzer class
for execution after the end-of-file is reached.<BR>
<I>%eof{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%eof} </I><BR>
The <I>%eof{</I> and <I>%eof}</I> directives
should be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I>
will be executed at most once, and immediately
after the end-of-file is reached for the input file
the lexical analyzer class is processing.
<P>
The code given in the <I>%eof{ ... %eof}</I> directive
may potentially throw an exception, or propagate it from
another function. To declare this exception, use
the <I>%eofthrow{ ... %eofthrow}</I> directive.<BR>
<I>%eofthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%eofthrow} </I><BR>
The Java code specified here will be copied
into the declaration of the lexical analyzer function
called to clean up upon reaching end-of-file.<BR>
<I>private void yy_do_eof () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>{ </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
The Java code in <I>&lt;code&gt;</I> that makes up
the body of this function will, in part,
come from the code given in the
<I>%eof{ ... %eof}</I> directive.
If this code throws an exception that is not declared
using the <I>%eofthrow{ ... %eofthrow}</I> directive,
the resulting lexer may not compile successfully.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.4">2.2.4 Macro Definitions</A></H3>
<P>
Macro definitions are given in the JLex directives section
of the specification.
Each macro definition is contained on a single line and
consists of a macro name followed by an equal sign (=),
then by its associated definition.
The format can therefore be summarized as follows.<BR>
<I>&lt;name&gt;</I> = <I>&lt;definition&gt;</I><BR>
Non-newline white space, e.g. blanks and tabs,
is optional between the macro name and the equal sign
and between the equal sign and the macro definition.
<P>
Macro names should be valid identifiers,
i.e. sequences of letters, digits, and underscores
beginning with a letter or underscore.
<P>
Macro definitions should be valid regular expressions,
the details of which are described in <A HREF="#SECTION2.3.2">section 2.3.2</A> below.
<P>
Macro definitions can contain other macro expansions,
in the standard<BR><I>{&lt;name&gt;} </I> format for macros
within regular expressions.
However, the user should note that these expressions
are macros - not functions or nonterminals - so
mutually recursive constructs using macros are illegal.
Therefore, cycles in macro definitions will have
unpredictable results.
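<P>
As an illustration of macro definitions and nested expansion, the following Java sketch parses lines of the form name = definition and expands macro uses recursively, rejecting cycles instead of leaving the result unpredictable. It assumes plain textual substitution and ignores quoting and character classes; the class <I>MacroExpander</I> is hypothetical and is not part of JLex.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MacroExpander {
    // A macro use: an identifier wrapped in braces.
    private static final Pattern USE =
        Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    private final Map<String, String> macros = new LinkedHashMap<>();

    /** Parses one definition line; white space around '=' is optional. */
    void define(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) throw new IllegalArgumentException("missing '=': " + line);
        macros.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
    }

    /** Expands every macro use in the given regular expression text. */
    String expand(String regex) {
        return expand(regex, new HashSet<>());
    }

    private String expand(String regex, Set<String> inProgress) {
        Matcher m = USE.matcher(regex);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            out.append(regex, last, m.start());
            String body = macros.get(name);
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + name);
            }
            // A name that is already being expanded indicates a cycle.
            if (!inProgress.add(name)) {
                throw new IllegalStateException("cyclic macro: " + name);
            }
            out.append(expand(body, inProgress));
            inProgress.remove(name);
            last = m.end();
        }
        out.append(regex, last, regex.length());
        return out.toString();
    }
}
```

With the hypothetical definitions DIGIT = [0-9] and NUMBER={DIGIT}+, expanding {NUMBER} in this sketch produces [0-9]+.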
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.5">2.2.5 State Declarations</A></H3>
<P>
Lexical states are used to control when certain
regular expressions are matched.
These are declared in the JLex directives
in the following way.<BR>
<I>%state </I>state[0][<I>, state[1], state[2], ...</I>]<BR>
Each declaration of a series of lexical states should
be contained on a single line.
Multiple declarations can be included in the same
JLex specification, so the declaration of many states
can be broken into many declarations over multiple lines.
<P>
State names should be valid identifiers,
i.e. sequences of letters, digits, and underscores
beginning with a letter or underscore.
<P>
A single lexical state is implicitly declared by JLex.
This state is called <I>YYINITIAL</I>, and the generated
lexer begins lexical analysis in this state.
<P>
Rules of lexical analysis begin with an optional state list.
If a state list is given, the lexical rule is matched only when
the lexical analyzer is in one of the specified states.
If a state list is not given, the lexical rule is matched when
the lexical analyzer is in any state.
<P>
If a JLex specification does not make use of states,
by neither declaring states nor preceding lexical rules
with state lists,
the resulting lexer will remain in state <I>YYINITIAL</I>
throughout execution.
Since lexical rules are not prefaced by state lists,
these rules are matched in all existing states,
including the implicitly declared state <I>YYINITIAL</I>.
Therefore, everything works as expected if states are
not used at all.
<P>
States are declared as constant integers within the generated
lexical analyzer class.
The constant integer declared for a declared state
has the same name as that state.
The user should be careful to avoid name conflict
between state names and variables declared in the
action portion of rules or elsewhere within
the lexical analyzer class.
A convenient convention would be to declare state
names in all capitals, as a reminder that these
identifiers effectively become constants.
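<P>
The relationship between state constants and state lists can be sketched in Java as follows. The numeric values, the state <I>COMMENT</I>, and the class <I>StateFilterDemo</I> are assumptions made for illustration; the values JLex actually assigns may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class StateFilterDemo {
    // YYINITIAL is declared implicitly; COMMENT stands in for "%state COMMENT".
    // The numeric values are assumptions of this sketch.
    static final int YYINITIAL = 0;
    static final int COMMENT = 1;

    /** A rule name with its state list; no states means no state list was given. */
    static final class Rule {
        final String name;
        final int[] states;

        Rule(String name, int... states) {
            this.name = name;
            this.states = states;
        }

        boolean activeIn(int current) {
            if (states.length == 0) return true; // matched in any state
            for (int s : states) {
                if (s == current) return true;
            }
            return false;
        }
    }

    /** Names of the rules that may be matched in the given state, in order. */
    static List<String> activeRules(List<Rule> rules, int current) {
        List<String> names = new ArrayList<>();
        for (Rule r : rules) {
            if (r.activeIn(current)) names.add(r.name);
        }
        return names;
    }
}
```

A rule constructed without states is active everywhere, which is why a specification that never mentions states behaves as expected in <I>YYINITIAL</I>.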
<P>
<BR> <P> <hr>
<H3><A NAME="SECTION2.2.6">2.2.6 Character Counting</A></H3>
<P>
Character counting is turned off by default, but can be activated
with the <I>%char</I> directive.<BR>
<I>%char</I><BR>
The zero-based character index of the first character in
the matched region of text is then placed in the
integer variable <I>yychar</I>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.7">2.2.7 Line Counting</A></H3>
<P>
Line counting is turned off by default, but can be activated
with the <I>%line</I> directive.<BR>
<I>%line</I><BR>
The zero-based line index at the beginning of the
matched region of text is then placed in the
integer variable <I>yyline</I>.
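<P>
To make the zero-based conventions of <I>yychar</I> and <I>yyline</I> concrete, the Java sketch below recomputes both values for a match that starts at a given offset. It is an illustration only: the class <I>PositionDemo</I> is hypothetical, it counts only the newline character as a line break, and a generated lexer presumably tracks these values incrementally rather than rescanning.

```java
final class PositionDemo {
    /**
     * Returns {yychar, yyline} for a match beginning at offset start.
     * Both values are zero-based.
     */
    static int[] positionOf(String input, int start) {
        int line = 0;
        for (int i = 0; i < start; i++) {
            if (input.charAt(i) == '\n') line++;
        }
        return new int[] { start, line };
    }

    public static void main(String[] args) {
        int[] pos = positionOf("ab\ncd", 3); // the match starts at 'c'
        System.out.println("yychar=" + pos[0] + " yyline=" + pos[1]);
    }
}
```

For the input shown in <I>main</I>, the token starting at the letter c has character index 3 and line index 1, since the first line is line 0.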
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.8">2.2.8 Java CUP Compatibility </A></H3>
<P>
Java CUP is a parser generator for Java originally written
by Scott Hudson of Georgia Tech, and maintained and
extended by Frank Flannery, Dan Wang, and C. Scott Ananian.
Details of this software tool are on the World Wide Web
at<BR>
<a href="http://www.cs.princeton.edu/~appel/modern/java/CUP/">http://www.cs.princeton.edu/~appel/modern/java/CUP/</a>.<BR>
Java CUP compatibility is turned off by default, but can
be activated with the following JLex directive.<BR>
<I>%cup</I><BR>
When given, this directive makes the generated scanner conform to the
<code>java_cup.runtime.Scanner</code> interface. It has the same
effect as the following three directives:<BR>
<i>%implements java_cup.runtime.Scanner</i><BR>
<i>%function next_token</i><BR>
<i>%type java_cup.runtime.Symbol</i><BR>
See <a href="#SECTION2.2.9">the next section</a> for more details on
these three directives, and the CUP manual for more details on using
CUP and JLex together.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.9">2.2.9 Lexical Analyzer Component Titles</A></H3>
<P>
The following directives can be used to change the name of
the generated lexical analyzer class, the tokenizing function, and
the token return type. To change the name of the lexical
analyzer class from <I>Yylex</I>, use the
<I>%class</I> directive.<BR>
<I>%class &lt;name&gt;</I><BR>
To change the name of the tokenizing function from <I>yylex</I>,
use the <I>%function</I> directive.<BR>
<I>%function &lt;name&gt;</I><BR>
To change the return type of the tokenizing
function from <I>Yytoken</I>, use the <I>%type</I>
directive.<BR>
<I>%type &lt;name&gt;</I><BR>
If the default names are not altered using these directives,
the tokenizing function is invoked with a call to
<I>Yylex.yylex()</I>, which returns the <I>Yytoken</I> type.
<P>
To avoid scoping conflicts, names beginning with <I>yy</I>
are normally reserved for lexical analyzer internal functions
and variables.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.10">2.2.10 Default Token Type</A></H3>
<P>
To make the 32-bit primitive integer type <I>int</I>
the return type for the tokenizing function
(and therefore the token type),
use the <I>%integer</I> directive.<BR>
<I>%integer</I><BR>
Under default settings, <I>Yytoken</I> is the return
type of the tokenizing function<BR><I>Yylex.yylex()</I>,
as in the following code fragment.<BR>
<I>class Yylex { ... </I><BR>
<I>public Yytoken yylex () {</I><BR>
<I>... } </I><BR>
The <I>%integer</I> directive replaces the previous code
with a revised declaration, in which the token type
has been changed to <I>int</I>.<BR>
<I>class Yylex { ... </I><BR>
<I>public int yylex () {</I><BR>
<I>... } </I><BR>
This declaration allows lexical actions to return
integer codes, as in the following code fragment
from a hypothetical lexical action.<BR>
<I>{ ...</I><BR>
<I>return 7; </I><BR>
<I>... } </I>
<P>
The integer return type changes the behavior
at end of file.
Under default settings, objects - subclasses of the
java.lang.Object class - are returned by <I>Yylex.yylex()</I>.
During execution of the generated lexer <I>Yylex</I>,
a special object value must be reserved for end-of-file.
Therefore, when the end-of-file is reached
for the processed input file (and from then onward),
<I>Yylex.yylex()</I> returns <I>null</I>.
<P>
When <I>int</I> is the return type of <I>Yylex.yylex()</I>,
<I>null</I> can no longer be returned. Instead,
<I>Yylex.yylex()</I> returns the value -1, corresponding
to constant integer<BR><I>Yylex.YYEOF</I>.
The <I>%integer</I> directive implies <I>%yyeof</I>; see below.
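<P>
A caller of a lexer generated with <I>%integer</I> would typically loop until the end-of-file code is returned. The Java sketch below shows that loop against a hand-written stand-in; <I>IntLexer</I> is hypothetical and merely mimics the documented behavior of returning -1 at end-of-file and on every later call.

```java
final class EofLoopDemo {
    // Mirrors the documented value of Yylex.YYEOF.
    static final int YYEOF = -1;

    /** Stand-in for a %integer lexer: one token code per input character. */
    static final class IntLexer {
        private final String input;
        private int pos = 0;

        IntLexer(String input) {
            this.input = input;
        }

        int yylex() {
            if (pos >= input.length()) {
                return YYEOF; // also returned on every later call
            }
            return input.charAt(pos++);
        }
    }

    /** Consumes tokens until YYEOF and reports how many were seen. */
    static int countTokens(IntLexer lexer) {
        int count = 0;
        while (lexer.yylex() != YYEOF) {
            count++;
        }
        return count;
    }
}
```

With the default object return type, the same loop would compare the result against <I>null</I> instead of <I>YYEOF</I>.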
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.11">2.2.11 Default Token Type II: Wrapped Integer</A></H3>
<P>
To make java.lang.Integer the return type for the
tokenizing function (and therefore the token type),
use the <I>%intwrap</I> directive.<BR>
<I>%intwrap</I><BR>
Under default settings, <I>Yytoken</I> is the return
type of the tokenizing function<BR><I>Yylex.yylex()</I>,
as in the following code fragment.<BR>
<I>class Yylex { ... </I><BR>
<I>public Yytoken yylex () {</I><BR>
<I>... } </I><BR>
The <I>%intwrap</I> directive replaces the previous code
with a revised declaration, in which the token type
has been changed to java.lang.Integer.<BR>
<I>class Yylex { ... </I><BR>
<I>public java.lang.Integer yylex () {</I><BR>
<I>... } </I><BR>
This declaration allows lexical actions to return
wrapped integer codes, as in the following code fragment
from a hypothetical lexical action.<BR>
<I>{ ...</I><BR>
<I>return new java.lang.Integer(0); </I><BR>
<I>... } </I>
<P>
Notice that the effect of <I>%intwrap</I> directive can be
equivalently accomplished using the <I>%type</I>
directive, as follows.<BR>
<I>%type java.lang.Integer</I><BR>
This manually changes the name of the return type
from <I>Yylex.yylex()</I> to<BR><I>java.lang.Integer</I>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.12">2.2.12 YYEOF on End-of-File</A></H3>
<P>
The <I>%yyeof</I> directive causes the constant
integer <I>Yylex.YYEOF</I> to be declared. If
the <I>%integer</i> directive is present, <i>Yylex.YYEOF</i>
is returned upon end-of-file.<BR>
<I>%yyeof</I><BR>
This directive causes <i>Yylex.YYEOF</i> to be declared as
follows:<BR>
<I>public final int YYEOF = -1;</I><BR>
The <i>%integer</i> directive implies <i>%yyeof</i>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.13">2.2.13 Newlines and Operating System Compatibility</A></H3>
<P>
In UNIX operating systems, the character code sequence
representing a newline is the single character ``\n''.
Conversely, in DOS-based operating systems, the newline is
the two-character sequence ``\r\n''
consisting of the carriage return followed by the newline.
The <I>%notunix</I> directive results in either the carriage
return or the newline being recognized as a newline.<BR>
<I>%notunix</I><BR>
This issue of recognizing the proper sequence of characters
as a newline is important in ensuring Java platform independence.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.14">2.2.14 Character Sets</A></H3>
<P>
The default settings support an alphabet
of character codes between 0 and 127 inclusive.
If the generated lexical analyzer receives
an input character code that falls outside
of these bounds, the lexer may fail.
<P>
The <I>%full</I> directive can be used
to extend this alphabet to include
all 8-bit values.<BR>
<I>%full</I><BR>
If the <I>%full</I> directive is given,
JLex will generate a lexical analyzer
that supports an alphabet of character codes
between 0 and 255 inclusive.
<P>
The <I>%unicode</I> directive can be used
to extend the alphabet to include the
full 16-bit Unicode alphabet.<BR>
<I>%unicode</I><BR>
If the <I>%unicode</I> directive is given,
JLex will generate a lexical analyzer
that supports an alphabet of character codes
between 0 and 2<sup>16</sup>-1 inclusive.
<p>
The <i>%ignorecase</i> directive can be given to generate
case-insensitive lexers.<br>
<i>%ignorecase</i><br>
If the <i>%ignorecase</i> directive is given, JLex will expand all
character classes in a Unicode-friendly way to match upper-,
lower-, and title-case letters.
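<P>
The three alphabet sizes can be compared with a small Java sketch that reports the first input character falling outside the selected range. The enum <I>Mode</I> and the class <I>AlphabetDemo</I> are hypothetical; the sketch only checks character codes and does not model how a generated lexer fails on such input.

```java
final class AlphabetDemo {
    /** Largest supported character code for each setting described above. */
    enum Mode {
        DEFAULT(127),     // no directive
        FULL(255),        // %full
        UNICODE(0xFFFF);  // %unicode

        final int maxCode;

        Mode(int maxCode) {
            this.maxCode = maxCode;
        }
    }

    /** Index of the first character outside the alphabet, or -1 if none. */
    static int firstUnsupported(String input, Mode mode) {
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) > mode.maxCode) {
                return i;
            }
        }
        return -1;
    }
}
```

A check of this kind can be useful before feeding input containing accented or non-Latin characters to a lexer generated under the default settings.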
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.15">2.2.15 Character Format To and From File</A></H3>
<P>
Under the status quo, JLex and the lexical
analyzer it generates read from and write to
Ascii text files, with byte sized characters.
However, to support further extensions on the JLex tool,
all internal processing of characters is done
using the 16-bit Java character type,
although the full range of 16-bit values is not supported.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.16">2.2.16 Exceptions Generated by Lexical Actions</A></H3>
<P>
The code given in the action portion of
the regular expression rules,
in section three of the JLex specification,
may potentially throw an exception, or propagate
it from another function.
To declare these exceptions, use
the <I>%yylexthrow{ ... %yylexthrow}</I> directive.<BR>
<I>%yylexthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%yylexthrow} </I><BR>
The Java code specified here will be copied
into the declaration of the lexical analyzer
tokenizing function <I>Yylex.yylex()</I>, as follows.<BR>
<I>public Yytoken yylex () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]
<BR>
<I>{ </I><BR>
<I>... </I><BR>
<I>} </I><BR>
If the code given in the action portion of
the regular expression rules
throws an exception that is not declared
using the <I>%yylexthrow{ ... %yylexthrow}</I> directive,
the resulting lexer may not compile successfully.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.17">2.2.17 Specifying the Return Value on End-of-File</A></H3>
<P>
The <I>%eofval{ ... %eofval}</I> directive
specifies the return value on end-of-file.
This directive allows the user to write Java code to be
copied into the lexical analyzer tokenizing
function <I>Yylex.yylex()</I>
for execution when the end-of-file is reached.
This code must return a value compatible with
the type of the tokenizing function <I>Yylex.yylex()</I>.<BR>
<I>%eofval{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%eofval} </I><BR>
The specified Java code in <I>&lt;code&gt;</I>
determines the return value of <I>Yylex.yylex()</I>
when the end-of-file is reached for the input file
the lexical analyzer class is processing.
This will also be the value returned by <I>Yylex.yylex()</I>
each additional time this function is called
after end-of-file is initially reached,
so <I>&lt;code&gt;</I> may be executed more than once.
Finally, the <I>%eofval{</I> and <I>%eofval}</I> directives
should be situated at the beginning of a line.
<P>
An example of usage is given below.
Suppose the return value desired on end-of-file is
<I>(new token(sym.EOF))</I> rather than
the default value <I>null</I>.
The user adds the following declaration to the
specification file.<BR>
<I>%eofval{ </I><BR>
<I>return (new token(sym.EOF)); </I><BR>
<I>%eofval} </I><BR>
The code is then copied into <I>Yylex.yylex()</I>
at the appropriate place.<BR>
<I>public Yytoken yylex () { ... </I><BR>
<I>return (new token(sym.EOF)); </I><BR>
<I>... } </I><BR>
The value returned by <I>Yylex.yylex()</I> upon
end-of-file and from that point onward is now
<I>(new token(sym.EOF))</I>.
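<P>
The point that the end-of-file code may run more than once can be seen in the following Java sketch. The classes <I>WordLexer</I> and <I>Token</I> are hypothetical stand-ins written by hand; the supplier plays the role of the code given in <I>%eofval{ ... %eofval}</I>.

```java
import java.util.function.Supplier;

final class EofValueDemo {
    /** Minimal token type for the sketch. */
    static final class Token {
        final String kind;

        Token(String kind) {
            this.kind = kind;
        }
    }

    /** Stand-in lexer: one token per white-space-separated word. */
    static final class WordLexer {
        private final String[] words;
        private final Supplier<Token> eofValue;
        private int next = 0;
        int eofRuns = 0; // how many times the end-of-file code has run

        WordLexer(String input, Supplier<Token> eofValue) {
            String trimmed = input.trim();
            this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            this.eofValue = eofValue;
        }

        Token yylex() {
            if (next >= words.length) {
                // Runs on the first end-of-file and on every later call.
                eofRuns++;
                return eofValue.get();
            }
            return new Token(words[next++]);
        }
    }
}
```

Passing a supplier that returns <I>null</I> reproduces the default behavior, while a supplier that builds an explicit end-of-file token corresponds to the example above. Because the code can run repeatedly, it is safest to keep it free of side effects that must happen only once.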
<P>
<BR> <HR>
<H3><A NAME="SECTION2.2.18">2.2.18 Specifying an interface to implement</A></H3>
<P>
JLex allows the user to specify an interface which the <i>Yylex</i>
class will implement. By adding the following declaration to the input
file:<br>
<i>%implements &lt;classname&gt;</i><br>
the user specifies that <i>Yylex</i> will implement <i>classname</i>. The
generated lexer class declaration will look like:<br>
<tt>
class Yylex implements <i>classname</i> { ...
</tt>
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.19">2.2.19 Making the Generated Class Public</A></H3>
<P>
The <I>%public</I> directive causes the lexical analyzer class
generated by JLex to be a public class.<br>
<I>%public</I><BR>
The default behavior adds no access specifier to the generated
class, resulting in the class being visible only from the current
package.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.3">2.3 Regular Expression Rules</A></H2>
<P>
The third part of the JLex specification consists
of a series of rules for breaking the input stream into tokens.
These rules specify regular expressions, then associate
these expressions with actions consisting of Java source code.
<P>
The rules have three distinct parts:
the optional state list, the regular expression,
and the associated action.
This format is represented as follows.<BR>
[<I>&lt;states&gt;</I>] <I>&lt;expression&gt; { &lt;action&gt; }</I><BR>
Each part of the rule is discussed in a section below.
<P>
If more than one rule matches strings from its input,
the generated lexer resolves conflicts between rules
by greedily choosing the rule that matches the longest string.
If more than one rule matches strings of the same length,
the lexer will choose the rule that is given first in
the JLex specification.
Therefore, rules appearing earlier in the specification
are given a higher priority by the generated lexer.
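<P>
The two conflict-resolution rules, longest match first and earliest rule on a tie, can be demonstrated with a Java sketch built on <I>java.util.regex</I>. This is not how JLex works internally: JLex generates a table-driven lexer, whereas this hypothetical <I>LongestMatchDemo</I> class simply tries each pattern in order, and for the simple patterns used here <I>lookingAt()</I> yields the longest prefix each pattern can match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchDemo {
    /**
     * Tokenizes input with the given rules. The map's iteration order is
     * taken to be the order of the rules in the specification.
     */
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalStateException("unmatched input at " + pos);
            }
            out.add(bestName + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return out;
    }
}
```

With the rules IF = if, ID = [a-z]+, and WS = one or more blanks, supplied in that order through a <I>LinkedHashMap</I>, the input ``if iffy'' is split into IF:if, a blank matched by WS, and ID:iffy. The keyword rule wins the tie on ``if'' because it is listed first, while the identifier rule wins on ``iffy'' because its match is longer.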
<P>
The rules given in a JLex specification should
match all possible input.
If the generated lexical analyzer receives input that
does not match any of its rules,
an error will be raised.
<P>
Therefore, all input should be matched by at least one rule.
This can be guaranteed by placing the following rule
at the bottom of a JLex specification:<BR>
<I>. { java.lang.System.out.println("Unmatched input: " + yytext());
}</I><BR>
The dot (.), as described below, will match any input
except for the newline.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.3.1">2.3.1 Lexical States</A></H3>
<P>
An optional lexical state list precedes each rule.
This list should be in the following form:<BR>
<I>&lt;</I>state[0][<I>, state[1], state[2], ...</I>]<I>&gt;</I><BR>
The outer set of brackets ([]) indicates that additional states are optional.
The less than (&lt;) and greater than (&gt;) symbols
represent themselves and should surround the state
list, preceding the regular expression.
The state list specifies under which initial states
the rule can be matched.
<P>
For instance, if <I>yylex()</I> is called with
the lexer at state <I>A</I>,
the lexer will attempt to match the input only
against those rules that have
<I>A</I> in their state list.
<P>
If no state list is specified for a given rule,
the rule is matched in all lexical states.
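<P>
The effect of state lists on rule selection can be sketched as a simple filter. The Java below is a hedged illustration only; the <I>Rule</I> class, the rule names, and the <I>COMMENT</I> state are hypothetical and do not correspond to JLex internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StateListDemo {
    // Hypothetical rule record: a name plus the states listed in front of
    // the rule. An empty list stands for "no state list given".
    static final class Rule {
        final String name;
        final List<String> states;

        Rule(String name, String... states) {
            this.name = name;
            this.states = Arrays.asList(states);
        }
    }

    static final Rule[] RULES = {
        new Rule("ident", "YYINITIAL"),
        new Rule("commentBody", "COMMENT"),
        new Rule("newline", "YYINITIAL", "COMMENT"),
        new Rule("catchAll") // no state list: active in every state
    };

    // Returns the names of the rules the lexer would consider in the given state.
    static List<String> candidates(String currentState) {
        List<String> out = new ArrayList<>();
        for (Rule r : RULES) {
            if (r.states.isEmpty() || r.states.contains(currentState)) {
                out.add(r.name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(candidates("YYINITIAL"));
        System.out.println(candidates("COMMENT"));
    }
}
```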
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.3.2">2.3.2 Regular Expressions</A></H3>
<P>
Regular expressions should not contain any white space,
as white space is interpreted as the end of
the current regular expression.
There is one exception; if (non-newline) white
space characters appear from within double quotes,
these characters are taken to represent themselves.
For instance, <code>" "</code> is interpreted as a blank space.
<P>
The alphabet for JLex is the ASCII character set,
meaning character codes between 0 and 127 inclusive.
<P>
The following characters are metacharacters, with
special meanings in JLex regular expressions.<BR>
<pre><h4>? * + | ( ) ^ $ . [ ] { } " \</h4></pre><br>
Otherwise, individual characters stand for themselves.
<P>
<i>ef</i> Consecutive regular expressions represent
their concatenation.
<P>
<i>e</i>|<i>f</i> The vertical bar (|) represents an option between
the regular expressions that surround it, so
matches either expression <i>e</i> or <i>f</i>.
<P>
The following escape sequences are recognized and expanded:
<TABLE>
<TR>
<TD>\b</td>
<TD>Backspace</td>
</tr>
<tr>
<TD>\n</td>
<TD>newline</td>
</tr>
<tr>
<TD>\t</td>
<TD>Tab</td>
</tr>
<tr>
<TD>\f</td>
<TD>Formfeed</td>
</tr>
<tr>
<TD>\r</td>
<TD>Carriage return</td>
</tr>
<tr>
<TD>\<i>ddd</i></td>
<TD>The character code corresponding to the number formed by three octal
digits <i>ddd</i></td>
</tr>
<tr>
<TD>\x<i>dd</i></td>
<TD>The character code corresponding to the number formed by two
hexadecimal digits <i>dd</i></td>
</tr>
<tr>
<TD>\u<i>dddd</i></td>
<TD>The Unicode character code corresponding to the number formed by
four hexadecimal digits <i>dddd</i>.</td>
</tr>
<tr>
<TD>\^<i>C</i></td>
<TD>Control character</td>
</tr>
<tr>
<TD>\<i>c</i></td>
<TD>A backslash followed by any other character <i>c</i> matches itself</td>
</tr>
</table>
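<P>
As a rough illustration of the table above, the following Java sketch decodes a single escape sequence into the character it denotes. It is not taken from the JLex sources. The class and method names are hypothetical, and the handling of the control-character form (letter code minus the code of '@') is an assumption, since the table does not spell out that mapping.

```java
public class EscapeDemo {
    // Decodes one escape sequence, given as a string that starts with a
    // backslash, into the single character it stands for.
    static char decode(String esc) {
        char c = esc.charAt(1);
        switch (c) {
            case 'b': return '\b';
            case 'n': return '\n';
            case 't': return '\t';
            case 'f': return '\f';
            case 'r': return '\r';
            case 'x': // two hexadecimal digits
                return (char) Integer.parseInt(esc.substring(2, 4), 16);
            case 'u': // four hexadecimal digits
                return (char) Integer.parseInt(esc.substring(2, 6), 16);
            case '^': // ASSUMPTION: control character = letter code minus '@'
                return (char) (Character.toUpperCase(esc.charAt(2)) - '@');
            default:
                if (c >= '0' && c <= '7') { // three octal digits
                    return (char) Integer.parseInt(esc.substring(1, 4), 8);
                }
                return c; // any other character stands for itself
        }
    }

    public static void main(String[] args) {
        // Print character codes rather than the characters themselves.
        System.out.println((int) decode("\\101")); // octal 101
        System.out.println((int) decode("\\x41")); // hexadecimal 41
        System.out.println((int) decode("\\^C"));  // control form, under the assumption above
    }
}
```

<P>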
$ The dollar sign ($) denotes the end of a line.
If the dollar sign ends a regular expression, the expression
is matched only at the end of a line.
<P>
. The dot (.) matches any character except the newline,
so this expression is equivalent to [^\n].
<P>
"..." Metacharacters lose their meaning within
double quotes and represent themselves.
The sequence <code>\"</code> (which represents the
single character <code>"</code>) is the only exception.
<P>
<I>{name}</I> Curly braces denote a macro expansion,
with <I>name</I> the declared name of the associated macro.
<P>
* The star (*) represents Kleene closure and matches
zero or more repetitions of the preceding regular expression.
<P>
+ The plus (+) matches one or more repetitions of the
preceding regular expression, so <I>e</I>+ is equivalent to <I>ee</I>*.
<P>
? The question mark (?) matches zero or one repetitions
of the preceding regular expression.
<P>
(...) Parentheses are used for grouping within regular
expressions.
<P>
[...] Square brackets denote a class of characters
and match any one character enclosed in the brackets. If the
first character following the left bracket ([) is
the up arrow (^),
the set is negated and the expression matches any character
except those enclosed in the brackets. Different
metacharacter rules hold inside the brackets, with the
following expressions having special meanings:
<TABLE>
<tr>
<td><i>{name}</i></td>
<td>Macro expansion</td>
</tr>
<tr>
<td><i>a</i> - <i>b</i></td>
<td>Range of character codes from <i>a</i> to <i>b</i> to be included in
character set</td>
</tr>
<tr>
<td>"..."</td>
<td>All metacharacters within double quotes lose
their special meanings. The sequence <code>\"</code> (which represents the
single character <code>"</code>) is the only exception.</td>
</tr>
<tr>
<td>\</td>
<td>Metacharacter following backslash(\) loses its special meaning</td>
</tr>
</table>
<P>
For example, [a-z] matches any lower-case letter, [^0-9]
matches anything except a digit, and [0-9a-fA-F] matches any hexadecimal
digit.
Inside character class brackets,
a metacharacter following a backslash loses its special meaning.
Therefore, [\-\\] matches a dash or a backslash.
Likewise ["A-Z"] matches one of the three characters A, dash, or Z.
Leading and trailing dashes in a character class also lose their
special meanings, so [+-] and [-+] do what you would expect them to
(i.e., match only '+' and '-').
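<P>
Most of the character classes in the examples above behave the same way in <I>java.util.regex</I>, so they can be tried out in plain Java. The sketch below is an illustration under that assumption, with a hypothetical helper <I>matches()</I>. The quoted form ["A-Z"] is left out because Java's regular expression syntax does not treat double quotes specially.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True if the single character c is matched by the given character class.
    static boolean matches(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-z]", 'q'));       // lower-case letter
        System.out.println(matches("[^0-9]", '7'));      // negated class rejects digits
        System.out.println(matches("[0-9a-fA-F]", 'C')); // hexadecimal digit
        // The Java string "[\\-\\\\]" is the class [\-\\]: a dash or a backslash.
        System.out.println(matches("[\\-\\\\]", '\\'));
        // ',' lies between '+' and '-' in ASCII, so it would match if
        // [+-] were treated as a range instead of two literal characters.
        System.out.println(matches("[+-]", ','));
    }
}
```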
<P>
<BR> <P>
<H3><A NAME="SECTION2.3.3">2.3.3 Associated Actions</A></H3>
<P>
The action associated with a lexical rule consists
of Java code enclosed inside block-delimiting curly braces.<BR>
<I>{ action } </I><BR>
The Java code <I>action</I> is copied, as given, into
the state-driven lexical analyzer produced by JLex.
<P>
All curly braces contained in <I>action</I> not part of strings or comments
should be balanced.
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.1">2.3.3.1 Actions and Recursion:</A></H4>
<P>
If an action does not return a value, the
lexical analyzer will loop, searching for the next match
from the input stream and returning the value
associated with that match.
<P>
The lexical analyzer can be made to recur explicitly
with a call to <I>yylex()</I>, as in the following
code fragment.<BR>
<I>{ ...</I> <BR>
<I>return yylex();</I> <BR>
<I>... } </I> <BR>
This code fragment causes the lexical analyzer to recur,
searching for the next match in the input
and returning the value associated with that match.
The same effect can be had, however, by simply
not returning from a given action.
This results in the lexer searching for the next match,
without the additional overhead of recursion.
<P>
The preceding code fragment is an example of tail recursion,
since the recursive call comes at the end of
the calling function's execution.
The following code fragment is an example of a recursive
call that is not tail recursive.<BR>
<I>{ ...</I> <BR>
<I>next = yylex();</I> <BR>
<I>... } </I> <BR>
Recursive actions that are not tail-recursive work
in the expected way,
except that variables such as
<i>yyline</i> and <i>yychar</i>
may be changed during recursion.
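<P>
The difference between falling through an action and recurring explicitly can be sketched with a small hand-written scanner. This is not generated JLex code; the class, its fields, and the <I>scanAll()</I> helper are hypothetical, and the scanner only splits words on blanks. The "whitespace action" either loops to the next match or makes a tail-recursive call to <I>yylex()</I>, and both variants yield the same tokens.

```java
public class LoopVsRecursionDemo {
    private final String input;
    private final boolean recurse; // true: whitespace action calls yylex() itself
    private int pos = 0;

    LoopVsRecursionDemo(String input, boolean recurse) {
        this.input = input;
        this.recurse = recurse;
    }

    // Returns the next word, or null at the end of the input.
    String yylex() {
        while (pos < input.length()) {
            int start = pos;
            if (input.charAt(pos) == ' ') {
                // Whitespace "rule": consume the run of blanks.
                while (pos < input.length() && input.charAt(pos) == ' ') {
                    pos++;
                }
                if (recurse) {
                    return yylex(); // explicit tail recursion
                }
                continue; // no return: loop on to the next match
            }
            // Word "rule": consume up to the next blank and return the token.
            while (pos < input.length() && input.charAt(pos) != ' ') {
                pos++;
            }
            return input.substring(start, pos);
        }
        return null;
    }

    // Collects all tokens into a comma-separated string.
    static String scanAll(String input, boolean recurse) {
        LoopVsRecursionDemo lexer = new LoopVsRecursionDemo(input, recurse);
        StringBuilder sb = new StringBuilder();
        for (String t = lexer.yylex(); t != null; t = lexer.yylex()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(scanAll("a  bb c", false)); // looping variant
        System.out.println(scanAll("a  bb c", true));  // tail-recursive variant
    }
}
```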
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.2">2.3.3.2 State Transitions:</A></H4>
<P>
If lexical states are declared in the JLex
directives section, transitions on these states
can be declared within the regular expression actions.
State transitions are made by the following
function call.<BR>
<I>yybegin(state);</I><BR>
The void function <I>yybegin()</I> is passed the state
name <I>state</I> and effects a transition to
this lexical state.
<P>
The state <I>state</I> must be declared within the JLex
directives section, or this call will result in a
compiler error in the generated source file.
The one exception to this declaration requirement is
state <I>YYINITIAL</I>, the lexical state
implicitly declared by JLex.
The generated lexer begins lexical analysis in state
<I>YYINITIAL</I> and remains in this state until
a transition is made.
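<P>
A minimal sketch of how state transitions drive a scanner is given below. It is written by hand and makes several assumptions: the <I>COMMENT</I> state, the integer state constants, and the <I>stripComments()</I> method are hypothetical, and the local <I>yybegin()</I> only imitates the call described above. The scanner starts in <I>YYINITIAL</I>, switches to <I>COMMENT</I> on "/*", and switches back on "*/".

```java
public class StateTransitionDemo {
    static final int YYINITIAL = 0; // the implicitly declared start state
    static final int COMMENT = 1;   // hypothetical user-declared state

    private int state = YYINITIAL;

    // Stand-in for the yybegin() call made from within an action.
    private void yybegin(int newState) {
        state = newState;
    }

    // Copies the input to the output, dropping everything inside comments.
    String stripComments(String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            if (state == YYINITIAL && input.startsWith("/*", pos)) {
                yybegin(COMMENT);   // rule active only in YYINITIAL
                pos += 2;
            } else if (state == COMMENT && input.startsWith("*/", pos)) {
                yybegin(YYINITIAL); // rule active only in COMMENT
                pos += 2;
            } else {
                if (state == YYINITIAL) {
                    out.append(input.charAt(pos)); // keep text outside comments
                }
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new StateTransitionDemo().stripComments("a/*b*/c"));
    }
}
```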
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.3">2.3.3.3 Available Lexical Values:</A></H4>
<P>
The following values, internal to the <I>Yylex</I> class,
are available within the action portion of the lexical rules.
<table>
<tr>
<th align=left>Variable or Method</th>
<th align=left>Activation Directive</th>
<th align=left>Description</th>
</tr>
<tr>
<td><i>java.lang.String yytext();</i></td>