Perrin is a contributor to various Perl-related projects like mod_perl, Template Toolkit, and Class::DBI. He is a frequent speaker at OSCON, YAPC, and ApacheCon, and a contributor to several perl-related books.

Journal of perrin (4270)

Tuesday August 07, 2007
04:38 PM

My OSCON slides are up

The slides from my talk, "Care and Feeding of Large Web Applications", are available for download. These have a couple of corrections from the YAPC version, but nothing major.

Tuesday July 31, 2007
02:27 PM

don't be so dismissive

I don't know if this is an actual trend or if I'm just noticing it more, but I feel like people are being more dismissive of others' work these days. Nat Torkington actually touched on this a little in his OSCON keynote (which is available online, incidentally). There's a tendency to just write off entire popular projects with some kind of sweeping generalization. We all make jokes about the "competition" now and then, but lately it feels more vicious.

I'll give you a couple of examples. First, MySQL. I heard lots of snide remarks about MySQL at OSCON. Some people went as far as to say that if everyone would use Postgres instead, none of the scaling techniques we hear about (like splitting your db up into shards on multiple servers) would be necessary.

Think about who uses MySQL: Yahoo, Google, etc. These companies have enough money to try Postgres, and a huge financial incentive to look for anything that would make their database scaling easier. Don't you think they might have tried it? Maybe Postgres didn't meet their needs. Maybe MySQL is better at certain things.

Another popular target is PHP. People who have never used it slam it left and right as being a tool for idiots. The fact is, some of the smartest people I know do a lot of work in PHP. It's not a toy language anymore. It has nice OO support. It has a profiler that works more reliably than Devel::DProf.

The problem with this attitude is that what goes around comes around. I recall being at an open source content management conference and having a Java fan derisively ask me, "People still use Perl?" So I've seen how this looks from the other side: that guy came off as an arrogant fool.

Tools that become very popular, like MySQL, PHP, and Perl, got that way for good reasons. Even if they aren't your chosen tools, it's worth keeping enough of an open mind to learn from what they do right. I know I learn a great deal from articles and talks aimed at Java programmers, even though I haven't used Java as a primary language since I stopped working at Scholastic.

Monday May 14, 2007
05:35 PM

MySQL bulk loading techniques

I'm working on the next generation of the data warehouse described here by Sam Tregar. This time, I'm trying to keep it up-to-date in near real-time. Because of this, I have several new constraints on how I load data:

  • The tables are all InnoDB. This is necessary because there will be long-running queries running on these tables while data is being loaded, and that requires the MVCC support in InnoDB. MyISAM tables would block updates while anyone is reading. Incidentally, contrary to what people often claim on Slashdot, converting these tables to InnoDB improved query performance quite a bit. Only the loading speed suffered compared to MyISAM tables.
  • We can't disable the indexes while loading. This is a huge help in the current system where we load into an off-line database. The ALTER TABLE DISABLE KEYS and ALTER TABLE ENABLE KEYS commands allow the indexes to be rebuilt in bulk. If we did this to an on-line table though, anyone using it would suddenly have no indexes available. Also, InnoDB doesn't have the same bulk index creation optimization, although this is supposed to be coming soon.
  • The incoming data will be a mix of new rows and updates to existing rows.
  • Some of the loads will be partial data for a table, i.e. not all data loads cover all columns in the target table.

So, I had a few ideas of ways to load the data and wanted to see what would give me the best results. I made a quick and dirty benchmark script and tried them out on a relatively small table (~50K rows): I pre-loaded the target table with 20K of the rows and then tested ways of copying the full data set in, meaning a combination of new rows and updates to existing ones. Here are the results.

The fastest approach is an INSERT...SELECT with an ON DUPLICATE KEY UPDATE clause. That looks a bit like this:

INSERT INTO foo_test SELECT * FROM foo ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz),...

This was pretty fast, coming in at 29 seconds. Some people have trouble with INSERT...SELECT because it takes shared locks on the rows it reads from the source table (much like SELECT...LOCK IN SHARE MODE) while it runs. This is apparently fixed in MySQL 5.1 by using row-based replication. It's also not really an issue for us because we're doing this work on a replicated database, so the worst case is that replication falls behind a bit while the statement runs.

Although it won't work for us, I tried REPLACE as well, just to see how it compared. It was quite a bit slower, coming in at 54 seconds, or almost twice as long.
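
For reference, the REPLACE version is just a one-statement variation on the same idea (same foo/foo_test tables as in the example above):

REPLACE INTO foo_test SELECT * FROM foo;

The slowdown makes sense when you remember that REPLACE handles a duplicate key by deleting the old row and inserting a new one, rather than updating it in place.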

I considered trying a combination of INSERT IGNORE...SELECT and a bulk UPDATE (using a join), but figured this would do poorly if the SELECT had any real work in it, since it would be running twice.
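
Just to make the idea concrete, that combination would look roughly like this (the id join key and the column list are made up for the example):

INSERT IGNORE INTO foo_test SELECT * FROM foo;

UPDATE foo_test t JOIN foo f ON t.id = f.id
SET t.bar = f.bar, t.baz = f.baz;

Each statement re-runs the source query, which is the double work mentioned above.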

The most common workaround for people who have trouble with the INSERT...SELECT locking is to use a temporary file with SELECT INTO OUTFILE and LOAD DATA INFILE. I tried that next. The dump is really fast, taking only 1 second. Loading is complicated by the fact that you can't do updates with LOAD DATA INFILE, so I decided the best thing would be to load the data into a temporary table and then do an INSERT...SELECT from that.

I got that load to go very quickly by making my temp table a MyISAM one and running an ALTER TABLE DISABLE KEYS on it before loading. It loaded in 3 seconds. Then I did the same INSERT...SELECT from the temp table which took the same 29 seconds (and I never built the indexes because I didn't need them). In total, the temp file only added 4 seconds or 14% overhead. This seems like a good solution for people who run into locking issues.
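
Putting that whole path together, a sketch of the sequence looks like this (file name, scratch table name, and key/column names are placeholders):

SELECT * INTO OUTFILE '/tmp/foo.txt' FROM foo;

CREATE TABLE foo_load LIKE foo;
ALTER TABLE foo_load ENGINE=MyISAM;
ALTER TABLE foo_load DISABLE KEYS;
LOAD DATA INFILE '/tmp/foo.txt' INTO TABLE foo_load;

INSERT INTO foo_test SELECT * FROM foo_load
ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz);

The defaults line up nicely here: SELECT INTO OUTFILE and LOAD DATA INFILE both use tab-separated, newline-terminated records unless you tell them otherwise, so no FIELDS/LINES clauses are needed.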

Then I tested using two database handles, one to SELECT and one to INSERT/UPDATE, pumping the data from one to the other. I was pretty sure I couldn't beat the INSERT...SELECT with this approach, but we have some situations where we need to process every row in perl during the load, such as geocoding addresses or applying logic that gets too ugly when done in SQL.

I played around with the frequency of commits and with MySQL's multi-row INSERT extension, and got this reasonably fast. It ran in 43 seconds, or a bit less than 50% slower than the INSERT...SELECT.
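
A stripped-down sketch of that kind of pump, minus the application-specific row processing, might look like this (the DSNs, table and column names, and chunk sizes are all placeholders):

use strict;
use warnings;
use DBI;

my $src = DBI->connect('DBI:mysql:source_db', 'user', 'password',
    { RaiseError => 1 });
my $dst = DBI->connect('DBI:mysql:warehouse_db', 'user', 'password',
    { RaiseError => 1, AutoCommit => 0 });

my $sth = $src->prepare('SELECT id, bar, baz FROM foo',
    { mysql_use_result => 1 });   # stream rows; see the note below
$sth->execute;

my (@values, $chunks);
while (my $row = $sth->fetchrow_arrayref) {
    # per-row processing (geocoding, etc.) would happen here
    push @values, '(' . join(',', map { $dst->quote($_) } @$row) . ')';
    next if @values < 1000;               # build up a multi-row INSERT
    flush($dst, \@values);
    $dst->commit if ++$chunks % 10 == 0;  # commit every 10 chunks
}
flush($dst, \@values) if @values;
$dst->commit;

sub flush {
    my ($dbh, $values) = @_;
    $dbh->do('INSERT INTO foo_test (id, bar, baz) VALUES '
        . join(',', @$values)
        . ' ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz)');
    @$values = ();
}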

Looking at how fast the LOAD DATA INFILE was, I tried a different approach for processing every row, doing a SELECT and writing the rows out with Text::CSV_XS. Then I loaded that file into a temp table with LOAD DATA INFILE and did an INSERT...SELECT from the temp table as before.

This was much better. Dumping the SELECT with Text::CSV_XS only took 3 seconds and, combined with the 4 second load, it only adds 24% overhead and gives me a chance to work on every row in perl. It's also much simpler to code than the multi-row INSERT stuff.
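
The dump side of that is only a few lines. Here's a sketch, again with made-up table and column names; note that Text::CSV_XS's comma-separated output means the LOAD DATA INFILE side needs explicit FIELDS clauses instead of the tab-separated defaults:

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

my $dbh = DBI->connect('DBI:mysql:source_db', 'user', 'password',
    { RaiseError => 1 });
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

my $sth = $dbh->prepare('SELECT id, bar, baz FROM foo',
    { mysql_use_result => 1 });   # stream rows; see the note below
$sth->execute;

open my $fh, '>', '/tmp/foo.csv' or die "can't write /tmp/foo.csv: $!";
while (my $row = $sth->fetchrow_arrayref) {
    # per-row perl processing goes here
    $csv->print($fh, $row);
}
close $fh or die $!;

The matching load is something like:

LOAD DATA INFILE '/tmp/foo.csv' INTO TABLE foo_load
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';

followed by the same INSERT...SELECT from the scratch table shown earlier. One caveat with this route: undef/NULL values come through as empty strings rather than NULLs.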

I should point out that working with these large data sets row by row requires the "mysql_use_result" option, which makes the client stream rows from the server as they are fetched instead of buffering the entire result set in memory first. I activate it for specific statement handles like this:

my $sth = $dbh->prepare($sql, {mysql_use_result => 1});

If anyone has additional ideas, I'd be interested in hearing them. For now, I'm happy with the first approach for updates that can all be done in SQL and the last one for updates that require perl processing on every row.

Tuesday April 17, 2007
10:07 PM

Bruce Perens seems uninformed about Apache HTTPD

There's an article here about Bruce Perens manipulating the headers on some Lighttpd servers to look like Apache for Netcraft stats purposes. At the end of this article, he says this:

"Apache is desirable if you want mod_perl, or mod_some-interpretive-language. That's an old-fashioned way of programming. Most newly-architected Web sites decouple the dispatcher running the interpretive language code from the Web server. These days, something like lighttpd works better for most sites. Open Source is about evolution."

This statement is confused on many levels. To begin with, I don't think many people choose Apache HTTPD because they want to run mod_perl, but that's subjective so I'll skip it.

Look at the part about separating the interpreter from the web server. This has been the recommended mod_perl configuration for busy sites for at least the past 10 years. The mod_perl docs have long recommended running a light front-end server (Apache with mod_proxy, or Squid, or whatever) and using a mod_perl-enabled Apache as an "application server" behind the scenes. Architecturally, it's essentially identical to FastCGI or the Java servlet daemons that came along later.

And how about that evolution? Lighttpd quickly became popular with the Ruby crowd, but then some of them became annoyed with its FastCGI implementation. So what did they do? They wrote their own HTTPD that runs the Ruby code and used Lighttpd as a front-end, proxying the dynamic requests to it. Hmmm, where have I heard that before?

The larger point is that the Apache 2 HTTPD is a lot more than a web server. It's really a modular framework for writing networked servers. Everything, right down to the HTTP protocol, can be replaced with a module, and it can be done in multiple languages, including Perl.

Lighttpd might be the best server for Perens' application, and it's a high-quality open source application. I just wish he would get his facts straight about Apache and mod_perl.

Wednesday January 17, 2007
04:31 PM

myspace.com tech lessons

Reading this article about myspace.com and their technology led me to some interesting tidbits.

"Chau developed the initial version of the MySpace Web site in Perl, running on the Apache Web server, with a MySQL database back end. That didn't make it past the test phase, however, because other Intermix developers had more experience with ColdFusion, the Web application environment originally developed by Allaire and now owned by Adobe. So, the production Web site went live on ColdFusion, running on Windows, and Microsoft SQL Server as the database."

I think that explains why they had so many performance problems early on. This is one case where "go with what your team knows" may have been bad advice. They eventually ditched ColdFusion for C# and saw a big improvement.

Whenever a particular database was hit with a disproportionate load, for whatever reason, the cluster of disk storage devices in the SAN dedicated to that database would be overloaded. "We would have disks that could handle significantly more I/O, only they were attached to the wrong database," Benedetto says.

They solved this by going to a storage technology that pooled their resources instead of partitioning them. I think this supports my theory that partitioning is usually a bad idea and you should share resources as much as possible. Partitioning used to be a big sell for expensive EJB tools and IBM hardware, but the end result is that some of your hardware is under-utilized while parts of your application are starving for resources.

The cache is also a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site—a lesson that Benedetto admits he had to learn the hard way. "I'm a database and storage guy, so my answer tended to be, let's put everything in the database," he says, but putting inappropriate items such as session tracking data in the database only bogged down the Web site.

Storing sessions in your lossy cache storage is a mistake, in my opinion. If your session suddenly disappears for no reason when you're browsing myspace.com, this is why -- they put it in the same unreliable storage that they use for caching. But then he goes on to say that he really doesn't care if your data gets lost:

In other words, on MySpace the occasional glitch might mean the Web site loses track of someone's latest profile update, but it doesn't mean the site has lost track of that person's money. "That's one of the keys to the Web site's performance, knowing that we can accept some loss of data," Benedetto says. So, MySpace has configured SQL Server to extend the time between the "checkpoints" operations it uses to permanently record updates to disk storage—even at the risk of losing anywhere between 2 minutes and 2 hours of data—because this tweak makes the database run faster.

Classic.

Tuesday January 09, 2007
06:46 PM

POD + perltidy?

Does anyone have a handy way to run perltidy on code embedded in POD? It feels very backwards to be indenting that by hand when I run perltidy for everything else. Maybe an extension to Pod::Tidy?

Friday September 01, 2006
03:22 PM

Are we done with Joel Spolsky now?

I've never been a fan of Joel's writing, but he kind of clinched the deal with this one. To quote:

...the bottom line is that there are three and a half platforms (C#, Java, PHP, and a half Python) that are all equally likely to make you successful, an infinity of platforms where you're pretty much guaranteed to fail spectacularly when it's too late to change anything (Lisp, ISAPI DLLs written in C, Perl)...

I just hope I can fail as spectacularly as Amazon, Yahoo, and TicketMaster have with their use of Perl.

And incidentally, eBay was originally an ISAPI DLL.

Thursday August 31, 2006
11:06 PM

HTTP Server Fever!

Is it just me, or is everyone writing an HTTP server these days? After Apache 2 became solid, it looked like there wasn't much of interest left to do in the world of HTTP servers, and the field had been fully commodified. CPAN had a half dozen or so Perl HTTP servers, all of which were fine for entertainment but not useful for real sites. You'd hear some crank on Slashdot shouting about thttpd (I swear that wasn't me), but it didn't set the world on fire.

Then the single-threaded servers started showing up in earnest. A non-blocking I/O approach to networking is well-known to scale better than threads or processes, and it appealed to developers in a very primal way -- it's fast! Well, not so much fast, since you'd run out of bandwidth long before that mattered, but you could handle lots of open connections to slow clients without any trouble.

Lighttpd quickly became a star, especially in the PHP and Ruby worlds. (Why were Rails developers looking for a faster web server rather than trying to fix Ruby's performance problems? Probably because it's a much easier problem.)

Somewhere in there, Perlbal made the scene. It's a bit of a hodgepodge of features, having been developed to suit some particular in-house project needs, but an interesting sort of glue project to fill gaps in the Perl web app deployment story.

Some of the Rails guys then decided they didn't like FastCGI and would write their own HTTP server to replace it, called Mongrel. So far, the benchmarks I've seen make it look like performance has gotten worse compared to what they had with FastCGI, but it's still early so maybe they will improve that. They say they were doing it because the FastCGI implementations all had bugs, so maybe they don't care if it's slower anyway.

Meanwhile, people started popping up on the mod_perl list saying that they had built their own single-threaded servers. I usually ask people two things when they say this:

  • What will you do about DBI, and all of the other blocking network and file I/O calls that are the bread and butter of the average web app? Stalling your entire site while someone waits for a query is not going to work.
  • How is this better than running Perl on Lighttpd + FastCGI?

The only good answer I've heard to the first question so far is to ship the blocking stuff off to some separate persistent processes (e.g. mod_perl, PPerl, etc.) that you talk to over non-blocking I/O, and pick up the results when it's done. This is what Stas Bekman did with the single-threaded server he works on at MailChannels (for blocking spam). It's also what Matt Sergeant seems to be planning for his new single-threaded AxKit2 HTTP server.

Meanwhile, back at the Apache 2 camp, mod_proxy has picked up useful new features like basic load balancing and people are experimenting with hybrid threaded/non-blocking I/O process models.

It's good to see innovation happening. Sometimes I do wonder if people are chasing the right things. I find it pretty easy to make a screamingly fast web app with basic Apache and mod_perl these days, so maybe pushing things in a direction that makes development harder (as I think single-threaded programming will be for most people) is not the best move for all of us. High-performance has an undeniable allure though, especially for people like us who still have to convince managers that Perl is fast enough for a web site. (Duh. Maybe you've heard of Amazon?) I'll certainly be paying attention though, to see what Matt and everyone else cooks up.

Sunday July 16, 2006
06:07 PM

job stats show Perl still leads the P languages

With the help of this job trends app, you can see that job postings for Perl continue to be much higher than those for PHP, Python, or Ruby (honorary P language). It's nowhere close to Java, but it's about twice PHP and leaves the others in the dust. Of course this says nothing about the actual quality of the jobs -- only that Perl skills are in demand.

Friday July 14, 2006
02:30 PM

Do they know they're learning Perl?

In this article on IBM's DeveloperWorks site, Bruce Tate (the guy who has been pimping Ruby in his book "Beyond Java") teaches Java developers some basics of Ruby text generation. He talks about strings (when you use double quotes, they interpolate variables!) and eval (where have I seen that before?) and then shows a templating system that appears to be a Ruby port of Text::Template (or any of the similar embedded Perl code modules, but nowhere close to a more powerful system like TT). Funny thing -- it all looks exactly like perl code, and demonstrates basic features of Perl.

I wonder if these Java programmers who are getting excited about learning Ruby realize that they're actually being taught Perl. Our evil plot to change the name to Ruby has succeeded beyond our wildest dreams. Welcome to the fold, DeveloperWorks readers.