Writing a Perl Core Feature - part 5: Lexer

Index | < Prev | Next >

Now that we have a controllable feature flag that conditionally recognises our new keywords, and a new opcode that we can use to implement some behaviour for them, we can begin to tie them together. The previous post mentioned that the Perl interpreter converts the source code of a program into an optree, stored in memory. This is done by a collection of code loosely described as the compiler. Exactly what the compiler will do with these new keywords depends on its two main parts - the lexer, and the parser.

If you're unfamiliar with these general concepts of compiler technology, allow me a brief explanation. A lexer takes the source code, in the form of a stream of characters, and begins analysing it by grouping those characters up into the basic elements of the syntax, called tokens (sometimes called lexemes). This sequence of tokens is then passed into the parser, whose job is to build up the syntax tree representing the program from those analysed tokens. (The lexer is sometimes also called a tokenizer; the two words are interchangeable.)

Tokens may be as small as a single character (for example a + or - operator), or could be an entire string or numerical constant. It is the job of the lexer to skip over things like comments and ignorable whitespace. In compilers, tokens are typically represented by some sort of type system, where each kind of token has a specific type, often with an associated value. For example, any numerical constant in the source code would be represented by a token of "NUMBER" type, whose associated value is the specific number. In this manner the parser can then consider the types of tokens it has received (for example it may have recently received a number, a + operator, and another number), and emit some form of syntax tree to represent the numerical addition of those two numbers.

For example, in a simple expression language, the source first gets tokenized into a stream of tokens: any sequence of digits becomes a NUMBER token with its associated numerical value, and operators become their own token types representing the symbol itself.
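To make this concrete, here is a toy tokenizer sketched in Python (purely illustrative; Perl's real lexer lives in toke.c, is written in C, and works quite differently). It turns the expression 1 + 2 * 3 into NUMBER, PLUS and STAR tokens:

```python
import re

# Toy tokenizer for a tiny expression language.  Digit runs become
# NUMBER tokens carrying their numeric value; '+' and '*' become PLUS
# and STAR tokens representing the symbols themselves.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<PLUS>\+)|(?P<STAR>\*))")

def tokenize(source):
    tokens = []
    pos = 0
    source = source.strip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError("unexpected character at position %d" % pos)
        kind = m.lastgroup
        value = int(m.group(kind)) if kind == "NUMBER" else m.group(kind)
        tokens.append((kind, value))
        pos = m.end()
    return tokens
```

Running tokenize("1 + 2 * 3") yields the flat stream NUMBER(1), PLUS, NUMBER(2), STAR, NUMBER(3) - the whitespace has been skipped and each digit run collapsed into a single typed token, which is exactly the form a parser wants to consume.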

It then gets parsed by recursively applying an ordered list of rules (to implement operator precedence) to form some sort of syntax tree. We're looking ultimately for an expr (short for "expression"). At high priority, a sequence of expr-STAR-expr can be considered as an expr (by combining the two numbers by a MULTIPLY operation). At lesser priority, a sequence expr-PLUS-expr can be considered as such (by using ADD). Finally, a NUMBER token can stand alone as an expr.
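Those ordered rules can be sketched as a small recursive-descent parser - again a Python toy, not how Perl's parser is actually implemented. Each precedence level gets its own function, with the tighter-binding STAR rule handled below the PLUS rule:

```python
# Toy recursive-descent parser over (type, value) tokens.  MUL binds
# tighter than ADD, mirroring the ordered list of rules described
# above.  The output is a nested tuple representing the syntax tree.

def parse_expr(tokens, pos=0):
    # Lowest priority: expr-PLUS-expr combined by an ADD node
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        right, pos = parse_term(tokens, pos + 1)
        node = ("ADD", node, right)
    return node, pos

def parse_term(tokens, pos):
    # Higher priority: expr-STAR-expr combined by a MUL node
    node, pos = parse_number(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == "STAR":
        right, pos = parse_number(tokens, pos + 1)
        node = ("MUL", node, right)
    return node, pos

def parse_number(tokens, pos):
    # A NUMBER token stands alone as an expr
    kind, value = tokens[pos]
    if kind != "NUMBER":
        raise SyntaxError("expected a number at token %d" % pos)
    return ("NUMBER", value), pos + 1
```

Parsing the token stream for 1 + 2 * 3 produces ADD(NUMBER(1), MUL(NUMBER(2), NUMBER(3))) - because the STAR rule was tried at higher priority, the multiplication nests inside the addition, giving the conventional precedence.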

Specifically in Perl's case, the lexer is rather more complex than that of most typical languages. It has a number of features which may surprise you if you are familiar with the overall concept of token-based parsing. Whereas some much simpler languages can be tokenized with a statically-defined set of rules, Perl's lexer is much more stateful and dynamically controlled. The recent history of tokens it has already seen can change its interpretation of things to come. The parser can influence what the lexer will expect to see next. Additionally, existing code that has already been seen and parsed will also affect its decisions.

To give a few examples here, consider the way that braces are used both to delimit blocks of code, and anonymous hash references. The lexer resolves which case is which by examining what "expect" state it is in - whether it should be expecting an expression term, or a statement. Consider also the way that the names of existing functions already in scope (and what prototypes, if any, they may have) influences the way that calls to those functions are parsed. This is, in part, performed by the lexer.

my $hashref = { one => 1, two => 2 };
# These braces are a hashref constructor

if($cond) { say "Cond is true"; }
# These braces are a code block
sub listy_func { ... }
sub unary_func($) { ... }

say listy_func 1, 2, 3;
# parsed as say(listy_func(1, 2, 3));

say unary_func 4, 5, 6;
# parsed as say(unary_func(4), 5, 6);
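The brace examples above can be caricatured in a few lines of Python - a vastly simplified toy bearing no resemblance to the real code in toke.c, but it shows the shape of the idea: the classification of { depends entirely on a single piece of lexer state recording what is expected next.

```python
# Toy sketch of "expect"-state lexing.  One piece of state records
# whether the lexer expects an expression term or a new statement,
# and that decides how an opening brace is classified.

def lex_braces(symbols):
    expect = "STATEMENT"            # at the start, a statement may begin
    tokens = []
    for sym in symbols:
        if sym == "{":
            tokens.append("HASH_OPEN" if expect == "TERM" else "BLOCK_OPEN")
            expect = "STATEMENT"    # a statement may start inside either
        elif sym == "=":
            tokens.append("ASSIGN")
            expect = "TERM"         # an expression term must follow '='
        else:
            tokens.append("WORD")
            expect = "STATEMENT"
    return tokens
```

With this toy, the brace in my $hashref = { is classified as a hash constructor (the lexer had just seen an assignment, so it expects a term), while the brace in if ($cond) { becomes a block opener - the same distinction Perl's lexer makes, driven by the same kind of state.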

Due to its central role in parsing the source code of a program, it is important that the lexer knows about every keyword and combination of symbols used in the syntax. Not every new feature or keyword needs changes to the parser, however, so for now we'll leave that for the next post in this series and concentrate on the lexer.

The lexer is contained in the file toke.c. When the isa feature was added, the change here was rather small: (github.com/Perl/perl5).

--- a/toke.c
+++ b/toke.c
@@ -7800,6 +7800,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_ioctl:
 
+    case KEY_isa:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__ISA), "isa is experimental");
+        Rop(OP_ISA);
+
     case KEY_join:

Here we have extended the main function that recognises barewords vs keywords: yyl_word_or_keyword. This function relies, in part, on the keyword lookup in keywords.c that we saw modified back in part 3. (Remember: that added the new keywords, to be conditionally recognised depending on whether our feature is enabled.) If the word was recognised as the isa keyword (meaning the feature had been enabled), then the lexer emits a token in the category of "relational operator", called Rop. We additionally report the value of the opcode that will implement it: OP_ISA, which we saw added in part 4. Since the feature is experimental, this is also the point at which we emit the "is experimental" warning, using the warning category we saw added in part 2.

Because of this neat convenience, the change adding the isa operator didn't need to touch the parser at all. In order for us to have something interesting to talk about when we move on to the parser, let's imagine a slightly weirder grammar shape for our new banana feature. We have two keywords to play with, so let's now imagine that they are used as a pair, surrounding some other expression, as in the syntax:

use feature 'banana';

my $something = ban "Some other stuff goes here" ana;

Because of this rather weird structure, we won't be able to make use of any of the convenience token types, so we'll instead just emit these as plain TOKENs and let the parser deal with it. This will necessitate some changes to the parser as well, to add some new token values for it to recognise, so we'll do that in the next part too.
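To see what that means in practice, here is a hypothetical sketch (in Python, with invented token names - BAN and ANA are not real Perl tokens) of the flat token stream we would want the lexer to hand the parser for that statement:

```python
# Hypothetical sketch of lexing the banana syntax.  The token names
# BAN and ANA are invented for illustration only; they are not real
# tokens in the Perl core.

KEYWORDS = {"ban": "BAN", "ana": "ANA"}

def tokenize_statement(words):
    """Classify a pre-split statement into (type, value) tokens."""
    tokens = []
    for word in words:
        if word in KEYWORDS:
            tokens.append((KEYWORDS[word], None))   # plain keyword token
        elif word.startswith('"') and word.endswith('"'):
            tokens.append(("STRING", word[1:-1]))   # string constant
        else:
            tokens.append(("WORD", word))
    return tokens
```

Each keyword becomes a bare token with no associated value - it is then entirely the parser's job to insist that a BAN is eventually closed by a matching ANA, with an expression in between.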

Before we leave the topic of the lexer, let's just take a look at another recent Perl core change - the one that first introduced the try/catch syntax, via the try named feature: (github.com/Perl/perl5).

@@ -7704,6 +7706,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_break:
 
+    case KEY_catch:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__TRY), "try/catch is experimental");
+        PREBLOCK(CATCH);
+
     case KEY_chop:

@@ -8435,6 +8442,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_truncate:
 
+    case KEY_try:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__TRY), "try/catch is experimental");
+        PREBLOCK(TRY);
+
     case KEY_uc:

This was a very similar change - again just two new case labels to handle the two newly-added keywords. Each one emits a token of the PREBLOCK type. This is a hint to the parser that following the keyword it should expect to find a block of code surrounded by braces ({ ... }). In general when adding new syntax, there will likely be some existing token types that can be used for it, because it is likely following a similar shape to things already there.

Each of these changes adds a new warning - a call to Perl_ck_warner_d. There's a porting test file that checks to see that every one of these has been mentioned somewhere in pod/perldiag.pod. In order to keep that test happy, each commit had to add a new section there too; for example for isa: (github.com/Perl/perl5).

--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3262,6 +3262,12 @@ an anonymous subroutine, or a reference to a subroutine.
 (W overload) You tried to overload a constant type the overload package is
 unaware of.
 
+=item isa is experimental
+
+(S experimental::isa) This warning is emitted if you use the (C<isa>)
+operator. This operator is currently experimental and its behaviour may
+change in future releases of Perl.
+
 =item -i used with no filenames on the command line, reading from STDIN
 
 (S inplace) The C<-i> option was passed on the command line, indicating

In the next part, we'll take a look at the other half of the compiler, the parser. It is there where we'll make our next modifications to add the banana feature.


Source: leonerd's code
