Holy cow!
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail headers.msg
just started: 1360 28328
after require File::Slurp : 2228 28704
after slurping: 12192 38656
after require Email::Simple: 12248 38656
after construction : 129368 159816
If you try to build an Email::Simple from a message with 10,000 unique headers, be prepared to give up some RAM. It's not quite as bad if you just have 10,000 values for one header:
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail oneheader.msg
just started: 1360 28328
after require File::Slurp : 2228 28704
after slurping: 11040 37504
after require Email::Simple: 11096 37504
after construction: 87380 114852
Still, this is nuts...
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ ls -l *head*msg
-rw-r--r-- 1 rjbs rjbs 5092896 Nov 28 08:29 headers.msg
-rw-r--r-- 1 rjbs rjbs 4504001 Nov 28 08:56 oneheader.msg
It should not take 100 MB to store a five-meg message in an object!
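The readmail script itself isn't shown here. A minimal sketch of how checkpoints like the ones above could be printed, assuming the two columns are resident and virtual size in KB as reported by ps. The helper names mem_usage and checkpoint are hypothetical, not taken from the real script:

```perl
use strict;
use warnings;

# Hypothetical helper: resident and virtual size (KB) of this process,
# or '?' placeholders if ps is unavailable or prints something unexpected.
sub mem_usage {
    my $out = qx{ps -o rss= -o vsz= -p $$ 2>/dev/null} // '';
    my ($rss, $vsz) = $out =~ /(\d+)\s+(\d+)/;
    return ($rss // '?', $vsz // '?');
}

# Hypothetical helper: print one "label: rss vsz" line and return it.
sub checkpoint {
    my ($label) = @_;
    my $line = sprintf "%s: %s %s", $label, mem_usage();
    print "$line\n";
    return $line;
}

checkpoint('just started');
my $text = 'x' x 5_000_000;    # stand-in for slurping a five-meg message
checkpoint('after allocating');
```

Calling checkpoint before and after each require and after construction would give a transcript in the same shape as the ones shown.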
The latest trunk of Email::Simple makes this just a bit better:
knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 12196 38656
after require Email::Simple: 12260 38656
after construction : 114444 144892
knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 11044 37504
after require Email::Simple: 11112 37504
after construction: 74184 101656
That's hardly a big improvement, though. Reading the message in as a reference, a new feature in the trunk, doesn't help much either: the memory usage isn't in the body storage or copying, it's in the data structures used to store the header. Despite the massive data structure, I decided that my first "simple" fix would be to replace this code:
for (split /$mycrlf/, $$head) {
    #... parse the field ...
}
With this "better" code:
my $head_txt = $$head;
while ($head_txt =~ m/\G(.+?)$mycrlf/g) {
    my $line = $1;
    #... parse the field ...
}
What made it better? Well, position-marking regexes are less commonly used, so
surely they're an efficiency trick, right? Well, no. Actually, I thought this
would help me approximate a lazy split, moving from one record separator to the
next as I process each line. The problem is that I apparently can't use a
pattern with /g on a dereferenced scalar. I need to make a copy in memory,
which takes up a lot of space. In fact, if I'm going to copy it, I might as
well just split! So, back to looking for useful improvements.
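The follow-up discussion below suggests that /g can in fact be applied to a dereferenced scalar, since pos() is tracked on the underlying scalar rather than on the reference. A minimal sketch comparing the eager split with the incremental scan, assuming $mycrlf is a plain CRLF string and using a three-line header invented for illustration:

```perl
use strict;
use warnings;

my $mycrlf = "\x0d\x0a";    # assumed line ending for this sketch
my $text   = "Foo: Alfa${mycrlf}Bar: Charlie${mycrlf}Baz: Foxtrot${mycrlf}";
my $head   = \$text;

# Eager: split builds the whole list of lines before the loop body runs.
my @via_split = split /$mycrlf/, $$head;

# Incremental: /g in scalar context resumes from pos() of the referenced
# scalar, so each pass through the loop consumes exactly one line.
my @via_scan;
pos($$head) = 0;
while ($$head =~ m/\G(.+?)$mycrlf/g) {
    push @via_scan, $1;
}

print "$_\n" for @via_scan;
```

Both loops see the same lines here; the difference is only in whether the full list of lines exists in memory at once.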
Email::Simple headers are stored as three data structures, two hashes and an array, which are called the head, the header names, and the order. The head has one entry for each header-name; the value is an arrayref of every value for that header, in the order they should appear in the header. The order is a list of header names, including duplicates, in the order in which they appear. The header names, bizarrely, relate the lowercased version of a header to the latest casing of it to appear in the input... this is almost certainly a bug introduced when fixing other problems with the header. Still, it means that this message:
Foo: Alfa
foo: Bravo
Bar: Charlie
FOO: Delta
Baz: Foxtrot
Baz: Gulf
bar: Hotel
is stored like this:
$head = {
    Bar => [ qw(Charlie) ],
    bar => [ qw(Hotel) ],
    Baz => [ qw(Foxtrot Gulf) ],
    Foo => [ qw(Alfa) ],
    FOO => [ qw(Delta) ],
    foo => [ qw(Bravo) ],
};
$order = [ qw(Foo foo Bar FOO Baz Baz bar) ];
$header_names = {
    bar => 'bar',
    baz => 'Baz',
    foo => 'FOO',
};
That's a lot of repeated data! Given all those unique headers in our first
test message, we're going to have to store every header name four times: once
in the head, once in the order, and twice in the header names.
Given all those repeated headers, we're still going to store the header name
10,003 times -- once for the entry in the head, twice in the header names (once
lowercased and once verbatim), and ten thousand times in the order.
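To make the duplication concrete, here is a minimal sketch that builds the three structures from the sample message by following the description above. It is not the module's actual code, and the variable names are chosen for illustration:

```perl
use strict;
use warnings;

# The sample message as (name, value) pairs, in input order.
my @fields = (
    [ Foo => 'Alfa' ],    [ foo => 'Bravo' ], [ Bar => 'Charlie' ],
    [ FOO => 'Delta' ],   [ Baz => 'Foxtrot' ], [ Baz => 'Gulf' ],
    [ bar => 'Hotel' ],
);

my (%head, @order, %header_names);
for my $field (@fields) {
    my ($name, $value) = @$field;
    push @{ $head{$name} }, $value;       # values grouped under the verbatim name
    push @order, $name;                   # one entry per header line
    $header_names{ lc $name } = $name;    # latest casing overwrites earlier ones
}

# Count the strings that hold a header name across all three structures:
# head keys, order entries, and both key and value in header_names.
my $copies = keys(%head) + @order + 2 * keys(%header_names);
printf "%d header lines, %d stored name strings\n", scalar(@fields), $copies;
```

With 10,000 header lines instead of seven, the order array alone holds 10,000 name strings no matter how few distinct names there are.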
I guess it's nice to have a super-fast lookup, but how many headers are there really going to be, in a normal message, that you want to have such heavy duplication for optimization? Is hash lookup really going to be faster in real time than a linear search by name across a hundred headers? If you have ten thousand headers (for... some reason), would you rather that searching take a few milliseconds longer, if it costs you fifty megs of RAM? My wager, here, is that you'd rather optimize for the common case and save plenty of memory for no noticeable change.
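One way to check that wager is to time both lookups over a hundred headers. A minimal sketch using the core Benchmark module; the header data, the iteration count, and the by_hash and by_scan helpers are all made up for illustration, and the timings will vary by machine:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical data: 100 distinct headers stored as a flat name/value list.
my @pairs = map { ("Header-$_", "value $_") } 0 .. 99;

# Hash index keyed by lowercased name, built once up front.
my %index;
for (my $i = 0; $i < @pairs; $i += 2) {
    push @{ $index{ lc $pairs[$i] } }, $pairs[ $i + 1 ];
}

sub by_hash {
    my ($name) = @_;
    return @{ $index{ lc $name } || [] };
}

# Linear scan over the flat list, comparing lowercased names.
sub by_scan {
    my ($name) = @_;
    my $want = lc $name;
    my @found;
    for (my $i = 0; $i < @pairs; $i += 2) {
        push @found, $pairs[ $i + 1 ] if lc( $pairs[$i] ) eq $want;
    }
    return @found;
}

# Look up the last header, which is the worst case for the scan.
timethese(5_000, {
    hash => sub { my @v = by_hash('header-99') },
    scan => sub { my @v = by_scan('header-99') },
});
```

The hash will win per lookup; the question the numbers answer is whether the per-lookup difference is large enough to matter for a message of ordinary size.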
So, I replaced the head, the order, and the header names with "headers." It's just a reference to an array of pairs. The above message is now:
my $headers = [
    Foo => 'Alfa',
    foo => 'Bravo',
    Bar => 'Charlie',
    FOO => 'Delta',
    Baz => 'Foxtrot',
    Baz => 'Gulf',
    bar => 'Hotel',
];
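Lookups against an array of pairs become a linear scan. A minimal sketch of what such accessors might look like, assuming names are matched case-insensitively; header_values and header_names are hypothetical names, not the module's interface:

```perl
use strict;
use warnings;

my $headers = [
    Foo => 'Alfa',    foo => 'Bravo', Bar => 'Charlie', FOO => 'Delta',
    Baz => 'Foxtrot', Baz => 'Gulf',  bar => 'Hotel',
];

# Hypothetical accessor: every value for a name, ignoring case, in input order.
sub header_values {
    my ($headers, $name) = @_;
    my $want = lc $name;
    my @values;
    for (my $i = 0; $i < @$headers; $i += 2) {
        push @values, $headers->[ $i + 1 ] if lc( $headers->[$i] ) eq $want;
    }
    return @values;
}

# Hypothetical accessor: names in their original order and casing
# (the even-indexed elements of the pairs array).
sub header_names {
    my ($headers) = @_;
    return map { $headers->[$_] } grep { $_ % 2 == 0 } 0 .. $#$headers;
}

print join(', ', header_values($headers, 'Foo')), "\n";    # Alfa, Bravo, Delta
```

Each name is stored once per header line and nowhere else, which is where the memory saving comes from.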
Nice and simple. How does it compare memory-wise?
~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 12196 38656
after require Email::Simple: 12260 38656
after construction: 57804 82732
~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 11044 37504
after require Email::Simple: 11108 37504
after construction: 52756 78232
Not great; a five meg message probably shouldn't take forty megs. Still, saving twenty to sixty megs isn't a bad start. I think I'll cut a development release (which will secretly require the newest Email::MIME, for forward compat reasons) and see what bugs are shaken out.
how this affects normal messages (Score:1)
rjbs
/o (Score:2)
while ($head_txt =~ m/\G(.+?)$mycrlf/g) {
to
while ($head_txt =~ m/\G(.+?)$mycrlf/go) {
I'm assuming the $mycrlf never changes, right?
--
xoa
Re: (Score:1)
If I use $crlf, I should be safe, and that is constant. Switching to use that and enabling
Here's the better thing, though. It seems that I was wrong in my belief that I couldn't use a
rjbs
Will that really help? (Score:1)
Re: (Score:1)
MJD once gave me and/or the internet an explanation of when it helped, and it was a much smaller case than I had previously thought, so now I usually don't even think about it.
rjbs
continued regexps on scalar references (Score:1)
You certainly can. I use this technique extensively in CAM::PDF to incrementally parse a PDF document into a DOM. I pass a scalar reference to the content from one sub to the next.
Without delving too deep into your code, I believe the important bit that you are missing is the "c" flag on the regexp.
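A minimal sketch of that technique, assuming a toy header format; next_name and next_value are hypothetical helpers, not CAM::PDF or Email::Simple code. The /c flag keeps pos() where it was when a match fails, so another sub can pick up from the same spot:

```perl
use strict;
use warnings;

# Each sub continues from wherever the previous match left pos() on the
# referenced scalar. With /c, a failed match leaves pos() untouched.
sub next_name {
    my ($ref) = @_;
    return $$ref =~ m/\G([^:\s]+):[ \t]*/gc ? "$1" : undef;
}

sub next_value {
    my ($ref) = @_;
    return $$ref =~ m/\G([^\x0d\x0a]*)\x0d\x0a/gc ? "$1" : undef;
}

my $text = "Foo: Alfa\x0d\x0aBar: Charlie\x0d\x0a\x0d\x0abody";
my @pairs;
while (defined(my $name = next_name(\$text))) {
    push @pairs, $name, next_value(\$text);
}

# next_name failed at the blank line, but pos() still marks the end of the
# header, so the body could be read from there.
print "header ends at offset ", pos($text), "\n";
```

Without /c, the failed match at the blank line would reset pos() and the position of the header's end would be lost.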
Re: (Score:1)
I didn't know about
I'm glad to know about it, but it should not be an issue here: at no point should the regex fail and then need to match again.
Thanks!
rjbs
uniting the array and hash in the symbol table (Score:1)
Re: (Score:1)
rjbs
Re: (Score:1)
Re: self referencing data structure (Score:1)
As you construct the second and third parts, you can reference the first copy:
Then, dumping it out, you see:
Re: (Score:1)
Alternatively, perhaps you could keep just one copy of the structure, and have methods to look up values different ways as needed, without having whole other structures pre-built, if they might not be used.
That's what I did, effectively. There is one structure, an array of pairs, and methods that let you do the normal things. You can say "give me the values with name Foo" and it does. It just uses a linear search.
Since Email::Simple 2 will have a Header object with a known interface, a more memory-hungry but faster impleme
rjbs