Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

Journal of cogent (987)

Monday October 28, 2002
02:33 AM

utf8

[ #8633 ]

This is make testdb using Devel::DProf on JavaScript::Tokenizer to tokenize a largish bit of real-world JS code when regular expressions include such Unicodey things as \p{Nd} and \x{000A}:

01:20:18 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 167.9217 Seconds
  User+System Time = 151.5851 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
88.6   134.3 263.93  32590   0.0041 0.0081  utf8::SWASHNEW
7.75   11.74 150.49  22400   0.0005 0.0067  JavaScript::Token::try
3.91   5.930  7.911  32590   0.0002 0.0002  utf8::SWASHGET
3.01   4.562 155.29   1602   0.0028 0.0969  JavaScript::Tokenizer::pop
0.40   0.609  0.235  32579   0.0000 0.0000  utf8::DESTROY
0.31   0.469  0.231  20787   0.0000 0.0000  JavaScript::Token::__ANON__
0.12   0.189  0.091   1600   0.0001 0.0001  JavaScript::Token::newlines
0.11   0.170  0.114   4892   0.0000 0.0000  JavaScript::Token::length
0.09   0.140  0.128   3258   0.0000 0.0000  JavaScript::Token::bool
0.07   0.110  0.073   3200   0.0000 0.0000  JavaScript::Tokenizer::state
0.07   0.110  0.072   3209   0.0000 0.0000  JavaScript::Token::BEGIN
0.07   0.110  0.128   1600   0.0001 0.0001  JavaScript::Tokenizer::token_types
0.07   0.100  0.062   3261   0.0000 0.0000  UNIVERSAL::isa
0.05   0.080  0.025   4800   0.0000 0.0000  JavaScript::Token::lexeme
0.05   0.080  0.062   1600   0.0000 0.0000  JavaScript::Token::line

And this is the same code, when the unicode is replaced with similar (though obviously not identical) entities such as \d and \n

02:14:27 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 3.985910 Seconds
  User+System Time = 3.932676 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
111.   4.389  4.051  20230   0.0002 0.0002  JavaScript::Token::try
31.5   1.239  5.505   1447   0.0009 0.0038  JavaScript::Tokenizer::pop
12.1   0.479  0.198  21661   0.0000 0.0000  JavaScript::Token::__ANON__
6.84   0.269  0.259   1445   0.0002 0.0002  JavaScript::Token::newlines
4.55   0.179  0.136   1445   0.0001 0.0001  JavaScript::Tokenizer::token_types
3.81   0.150  0.109   3181   0.0000 0.0000  UNIVERSAL::isa
3.05   0.120  0.146   3178   0.0000 0.0000  JavaScript::Token::bool
2.54   0.100  0.044   4335   0.0000 0.0000  JavaScript::Token::lexeme
2.03   0.080  0.019   4658   0.0000 0.0000  JavaScript::Token::length
1.53   0.060  0.041   1445   0.0000 0.0000  JavaScript::Tokenizer::line
1.27   0.050  0.012   2890   0.0000 0.0000  JavaScript::Tokenizer::state
1.27   0.050  0.031   1445   0.0000 0.0000  JavaScript::Token::line
1.02   0.040  0.147      4   0.0100 0.0367  main::BEGIN
0.76   0.030 -0.006   1445   0.0000      -  JavaScript::Token::string
0.76   0.030  0.029      8   0.0037 0.0037  JavaScript::Token::BEGIN

autrijus says this is a well-known problem. <sigh.> Out comes Unicode support from JavaScript::Token::Regex .

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.