
All the Perl that's Practical to Extract and Report


Mark Leighton Fisher

I am a Systems Engineer at Regenstrief Institute. I also own Fisher's Creek Consulting.
Friday May 18, 2007
12:05 PM

Concurrency and Text File Processing

[ #33307 ]

Concurrency will continue to become more important, since it is getting cheaper to add more CPUs to a chip than to speed up the chip. (After all, we don't have one big, fast neuron in our brains.) Much of programming is driven by some form of text file, as even when the data is binary, the source files for the program are still text. So when can we take advantage of concurrency during our text file processing?

The question is one of structure: how much and what kind of structure exists in the text file? Among programming languages, structure ranges from the free-form layout of C, Perl, Ruby, and that ilk, to the line-structured Python and Fortran (any time you force a certain indentation, you have implicitly imposed a line-oriented structure), and some of us remember RPG II.

Ordering is the other question. Parsing the characters into lines is inherently a sequential operation. (Parsing characters into any structured form is inherently a sequential operation.) Once you have the primary structured form, only then can you process the text in a parallel fashion. Logfiles are one example of a text file format amenable to parallel processing once they have been reduced into lines ("see how many U.S. government users your Apache server saw in the past month" is a parallelizable operation on your Apache logfile). As a partial counter-example, "how many of your routines implicitly returned an Int greater than 30" will likely require knowledge of your program's structure at more than just the line level (except maybe if you are programming in APL).
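The Apache-logfile case above can be sketched in a few lines of Python. This is a hypothetical illustration (the post names no code): once the sequential work of splitting into lines is done, each chunk of lines can be scanned for .gov hosts independently, and the partial counts summed.

```python
import re
from multiprocessing import Pool

# Matches a ".gov" suffix in the remote-host field of a common-format
# Apache log line. (Hypothetical example; not from the original post.)
GOV_RE = re.compile(r"\.gov\b")

def count_gov(lines):
    """Count lines whose first (remote host) field contains .gov."""
    return sum(1 for line in lines if GOV_RE.search(line.split(" ", 1)[0]))

def parallel_gov_count(lines, workers=4):
    # Splitting the file into lines was the inherently sequential step;
    # after that, each chunk is an independent, parallelizable scan.
    chunk = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    with Pool(workers) as pool:
        return sum(pool.map(count_gov, chunks))

if __name__ == "__main__":
    sample = [
        'host1.nasa.gov - - [18/May/2007] "GET / HTTP/1.0" 200 1024',
        'example.com - - [18/May/2007] "GET / HTTP/1.0" 200 512',
        'www.irs.gov - - [18/May/2007] "GET /forms HTTP/1.0" 200 2048',
    ]
    print(parallel_gov_count(sample, workers=2))
```

Note that the by-routine "returned an Int greater than 30" question would not fit this shape: the unit of work there is a parsed routine, not a line, so the sequential parsing step has to do much more before any parallelism is available.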

Parallel processing by definition requires 2+ things to process (a thing can't be parallel to itself). If what you are processing is one big interconnected thing, then even when it is divisible into smaller sub-things, you can't process it in parallel, because the sub-things depend on each other. Google Language Tools (IIRC) uses statistical text processing to derive the translated phrases (statistical analysis of that sort can be parallelized). A hypothetical True And Correct Natural Language Translator(tm) would require some understanding of the whole text to create translations in all cases, as material later in the text can require understanding of material earlier in the text to translate it correctly. (Fortunately, that usually isn't the case with the web pages I've had Google Language Tools translate for me.)
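The "statistical analysis of that sort can be parallelized" point can be made concrete with a small sketch (my own illustration, not anything Google publishes): word-frequency statistics are embarrassingly parallel, because each chunk of text can be counted independently and the partial counts merged afterward.

```python
from collections import Counter
from multiprocessing import Pool

def chunk_counts(text):
    """Word-frequency Counter for one independent chunk of text."""
    return Counter(text.lower().split())

def merged_counts(chunks, workers=2):
    # Each chunk is counted with no knowledge of the others (the
    # parallel part); merging the Counters is cheap and sequential.
    with Pool(workers) as pool:
        partials = pool.map(chunk_counts, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

A whole-text translator has no such decomposition: the meaning of a chunk can depend on chunks that came before it, so the chunks are not independent and the merge step cannot repair that.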

I'm wondering if the Unix/Linux model of separate coordinating processes (an MIMD model) would be more scalable over the long term than the vector-processing/SIMD model I keep hearing about from today's concurrency proponents. It may be no accident that some of the biggest concurrency successes in current software have been printing and webpage loading, as those are sequential processes that can be executed separately from the main locus of control.
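The Unix-style MIMD model above can be sketched as a two-stage pipeline of separate processes, each with its own control flow, coordinating only through queues, much as shell commands coordinate only through pipes. This is my own minimal sketch of the idea, not code from the post; the stage names are hypothetical.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # end-of-stream marker passed down the pipeline

def is_error(line):
    """Filter predicate for the grep-like stage."""
    return "ERROR" in line

def grep_stage(inq, outq):
    # Pass through only matching lines, like `grep ERROR` in a pipe.
    while True:
        line = inq.get()
        if line is SENTINEL:
            outq.put(SENTINEL)
            break
        if is_error(line):
            outq.put(line)

def count_stage(inq, result):
    # Tally the lines that reach the end of the pipe, like `wc -l`.
    n = 0
    while True:
        if inq.get() is SENTINEL:
            break
        n += 1
    result.put(n)

if __name__ == "__main__":
    q1, q2, result = Queue(), Queue(), Queue()
    stages = [Process(target=grep_stage, args=(q1, q2)),
              Process(target=count_stage, args=(q2, result))]
    for p in stages:
        p.start()
    for line in ["ok", "ERROR disk", "ok", "ERROR net"]:
        q1.put(line)
    q1.put(SENTINEL)
    print(result.get())  # 2
    for p in stages:
        p.join()
```

Each stage runs different code at its own pace (MIMD), in contrast to a SIMD model where every lane applies the same operation in lockstep.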

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • If you avoid glomming all of those lines together in a single text file, you can avoid having to scan that file sequentially before you can parallelize. This can be handy if you need to process logfiles in parallel.

  • Concurrency will continue to become more important, since it is getting cheaper to add more CPUs to a chip than to speed up the chip. (After all, we don't have one big, fast neuron in our brains.)
    Are you trying to say that our brains work the way they do because God is a cheapskate? Given the erratic behavior I observe in people, I'd think of Him more as an overclocker.