
ethan (3163)

ethan
  tassilo.von.pars ... AMrwth-aachen.de

Being a 25-year-old chap living in the westernmost town of Germany. Studying communication and information science and being a huge fan of XS-related things.

Journal of ethan (3163)

Monday April 18, 2005
02:55 AM

toke.c

[ #24250 ]

I never quite understood why Perl offered no hooks into its lexer and parser. They're contained in the interpreter, the very same program that runs my Perl scripts.

So I snuck a peek at the dreaded toke.c. My initial thought was that it was merely a matter of calling yylex() after initializing a few of the global PL_* variables appropriately. On closer inspection, however, there turned out to be exactly 99 of these global variables involved in the lexing process, including those dealing with the various perl stacks, control OPs and symbol tables.

So what I did was create a C++ class with 99 member variables. Each function in toke.c became a method that works on this->pl_variable instead of PL_variable. Some functions unrelated to the lexer had to be modified the same way, such as Perl_init_stacks() and a handful of those Perl_save_*() functions in scope.c. The whole purpose of that was to make the lexer re-entrant.
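To illustrate the idea of moving global lexer state into an object, here is a minimal sketch. It is a toy and not the actual toke.c conversion: the class ToyLexer, its members buf_ and pos_, and the token codes are all made up for this example. The point is only that two lexer objects can be scanned in an interleaved fashion because neither touches any global.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical toy lexer (NOT perl's toke.c). State that would otherwise
// live in file-level globals (think PL_bufptr and friends) is kept in the
// object, so several lexers can be active at the same time.
class ToyLexer {
public:
    enum Token { END = 0, WORD, NUMBER, PUNCT };

    // Loosely analogous to lex_start(): point the lexer at a new buffer.
    void set_string(const std::string &src) {
        buf_ = src;
        pos_ = 0;
    }

    // Loosely analogous to yylex(): return the next token type, END when done.
    Token next_token() {
        assert(pos_ <= buf_.size());
        while (pos_ < buf_.size() && is_space(buf_[pos_]))
            ++pos_;
        if (pos_ >= buf_.size())
            return END;

        if (is_word_start(buf_[pos_])) {
            while (pos_ < buf_.size() && is_word_char(buf_[pos_]))
                ++pos_;
            return WORD;
        }
        if (is_digit(buf_[pos_])) {
            while (pos_ < buf_.size() && is_digit(buf_[pos_]))
                ++pos_;
            return NUMBER;
        }
        ++pos_;  // any other single character counts as punctuation
        return PUNCT;
    }

private:
    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool is_word_start(char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
    }
    static bool is_word_char(char c) { return is_word_start(c) || is_digit(c); }

    std::string buf_;      // per-object input buffer instead of a global
    std::size_t pos_ = 0;  // per-object scan position instead of a global
};
```

Creating two ToyLexer objects, giving each its own string and calling next_token() on them alternately yields independent token streams, which is the property the 99 member variables are meant to buy for the real lexer.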

With these adjustments (and a few hundred #undefs/#defines), the actual XS code is very tiny:

MODULE = Perl::Lexer        PACKAGE = Perl::Lexer
 
Lexer *
Lexer::new ()
    CODE:
    {
        RETVAL = new Lexer();
        RETVAL->Pinit_stacks(aTHX);
    }
    OUTPUT:
        RETVAL
    CLEANUP:
        RETVAL->ME = newSVsv(ST(0));
 
void
Lexer::set_string (SV *line)
    CODE:
    {
        THIS->lex_start(aTHX_ line);
    }
 
void
Lexer::next_token ()
    CODE:
    {
        int tok = THIS->yylex(aTHX);
 
        /* skip empty lines */
        if (tok && THIS->bufptr)
            while (*THIS->bufptr == '\n') THIS->bufptr++;
 
        if (tok == 0)
            XSRETURN_EMPTY;
 
        EXTEND(SP, 2);
        ST(0) = sv_2mortal(newSViv(tok));
        ST(1) = sv_2mortal(newSVpv(TOKENNAME(tok), 0));
        XSRETURN(2);
    }
 
void
Lexer::DESTROY ()

And a sample script along with its output looks like this:

use blib;
use Perl::Lexer;
 
my $string = <<'EOS';
$a{1} = 1;
print keys %a;
EOS
 
my $lexer = Perl::Lexer->new;
$lexer->set_string($string);
while (my $l = $lexer->next_token) {
    print $l, " ";
}
print "\n";
 
__END__
$ WORD { THING ; } ASSIGNOP THING ; LSTOP UNIOP % WORD ; ;

A couple of problems still exist: Once the lexer sees a comment, an empty line or a shebang line, it seems to gobble up all characters up to the end of the string and thus finishes scanning. The shebang-line stuff is done in S_find_beginning() in perl.c before parsing even starts. As for empty lines, I suppose they are handled by perl's parser and not its lexer.

The last thing that needs to be done is making the actual attributes belonging to a token available. Ideally, this is just a matter of exposing yylval to the outside world.
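As a rough picture of what "a token plus its attributes" could look like, here is a small sketch. It is purely illustrative and makes no claim about how perl's yylval is laid out or how Perl::Lexer would expose it: the struct TokenValue, the function scan_token() and the numeric token codes are hypothetical. The scanner hands back the token type together with the text it matched, instead of leaving the value in a global.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for "token type plus yylval": made-up codes,
// 0 = end of input, 1 = WORD, 2 = NUMBER.
struct TokenValue {
    int type;
    std::string text;  // the characters that made up the token
};

// Scan one token starting at pos and advance pos past it.
// The value travels with the return value rather than through a global.
inline TokenValue scan_token(const std::string &buf, std::size_t &pos) {
    assert(pos <= buf.size());
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    auto is_digit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };

    while (pos < buf.size() && is_space(buf[pos]))
        ++pos;
    if (pos >= buf.size())
        return TokenValue{0, ""};

    const std::size_t start = pos;
    if (is_digit(buf[pos])) {
        // NUMBER: a run of digits
        while (pos < buf.size() && is_digit(buf[pos]))
            ++pos;
        return TokenValue{2, buf.substr(start, pos - start)};
    }
    // WORD: anything else, up to the next whitespace
    while (pos < buf.size() && !is_space(buf[pos]))
        ++pos;
    return TokenValue{1, buf.substr(start, pos - start)};
}
```

Calling scan_token() repeatedly on the string "keys 42" with a position that starts at 0 would give a WORD with text "keys", then a NUMBER with text "42", then the end-of-input token.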
