All the Perl that's Practical to Extract and Report

Mark Leighton Fisher (4252)

http://mark-fisher.home.mindspring.com/

I am a Systems Engineer at Regenstrief Institute [regenstrief.org]. I also own Fisher's Creek Consulting [fisherscreek.com].
Wednesday June 18, 2008
06:58 AM

The Important Numbers of Testing: 0, 1, and Many

Although there are an infinity of numbers to use in software testing, the 3 important numbers are 0, 1, and Many.

0, the number of nothingness, comes into play when you don't have anything. C enshrined 0 as the null pointer, though other languages and systems had represented nothing by a memory address of 0 before C. (There were other representations of null; they make for interesting reading.) Customers without orders, Webpages without links, forests without oak trees: all of these are most easily represented by a 0 inside a computer. Even inside your computer, your programs are not a closed system. You can run out of memory (although Perl eliminates the silly cases of this), you might forget and make a directory unreadable (0 files), a compiler error could skip an allocation statement (0 elves); the list can go on and on. If you don't consider the case of 0, eventually your software will fail. (Conversely, I once wrote a server in Perl 4 that ran for months at a time because I did extensively consider and test the case of 0.)
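As a minimal sketch of the "customers without orders" case, here is a small Python function that averages order totals. The function name and the choice to return None for zero orders are assumptions made for illustration, not anything from the server mentioned above.

```python
from typing import Optional, Sequence


def average_order_total(order_totals: Sequence[float]) -> Optional[float]:
    """Return the mean order total, or None when there are no orders."""
    if len(order_totals) == 0:
        # The case of 0: there is nothing to average, so say so explicitly
        # instead of letting the division below raise ZeroDivisionError.
        return None
    return sum(order_totals) / len(order_totals)
```

Calling it with an empty list, a one-element list, and a longer list exercises 0, 1, and many in three lines of test code.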

1, the first number, is seen when you only have one of something, an idea so common that it becomes the Singleton pattern in languages that need a special representation of one and only one instantiation of a class. With 1 of something, everything has to be instantiated, but you don't have the problems of multiple copies of the item in question. If the item is part of a collection of items, I have occasionally seen defects where the collection is not allocated if there is only 1 item. It is probably an artifact of my coding style, but I don't see many defects in my code specific to 1 and only 1 item. When the common cases are 0 and many, I have seen code that fails to work on all of the edge cases of 1 and only 1 item.

"Many" often just means "more than 1". Usually, your code does nothing different for the 472nd item than it does for the 2nd item. There are times, though, when code(2nd) != code(472nd) (3-column display code comes to mind here). Handling many items involves sizing their containers appropriately. Almost any allocation algorithm can give you space for 1 item; only correctly constructed allocation algorithms will always yield the right number of places to contain your items. The familiar fence-post error of array management is but one example of a failed allocation algorithm (and failure to properly test for many items).
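The 3-column display case can be sketched in a few lines of Python. The helper name rows_of is hypothetical; the point is only that 0, 1, and many items each land differently, and that the 472nd item ends up alone on a partial last row while the 2nd sits in the middle of a full one.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def rows_of(items: Sequence[T], columns: int = 3) -> List[List[T]]:
    """Split items into rows holding at most `columns` entries each."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # Stepping through the items by `columns` sizes the result from the item
    # count itself, so there is no separately computed row count to get
    # wrong by one (the fence-post error).
    return [list(items[i:i + columns]) for i in range(0, len(items), columns)]
```

With 472 items this yields 158 rows, the last of which holds a single item, which is exactly the kind of case that only a "many" test will reach.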

defined() covers the special case of accessing an item before it is initialized. A real-world case is a restaurant without any customers. Attempts to access any customer data will only find undefined values. Undefined values can also occur when you grab large blocks of data for performance reasons (the example restaurant and all its customers from before), where you grab so much data in one fell swoop that not all of it is initialized. Incomplete data is not defined. A customer without a cellphone (or without a landline phone) would have an undefined value for that phone field. A broken tire pressure sensor could yield an undefined value when read. Sometimes you can just ignore undefined values, but other times you have to explicitly handle them (think running sums or some statistical operations).
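Here is a minimal sketch of the running-sum case, written in Python with None standing in for Perl's undef. The function name and the policy of skipping undefined readings (rather than, say, raising an error) are assumptions for illustration; your data may call for a different policy.

```python
from typing import Iterable, List, Optional


def running_sum(readings: Iterable[Optional[float]]) -> List[float]:
    """Running total of sensor readings that skips undefined (None) values."""
    total = 0.0
    totals: List[float] = []
    for reading in readings:
        if reading is not None:  # explicit handling of the undefined case
            total += reading
        totals.append(total)
    return totals
```

A broken tire pressure sensor in the middle of the series then leaves the total unchanged for that step instead of crashing the sum.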

Although you may have other special numbers to test, you will likely have to test at least 0, 1, and many. The multiples of many, the somethingness of 1, and the nothingness of 0 will need to be tested to ensure adequate test coverage of your code.

06:57 AM

The Golden Rule of Data Manipulation

The Golden Rule of Data Manipulation can be summed up as "Concatenation is Easy, But Parsing Is Hard". But we are talking really, really hard here: not just lifting a dining room hutch hard, but lifting the Empire State Building hard (in the end game). That hardness has been a large barrier in natural language communication for computers, as parsing an arbitrary sentence is ludicrously hard. AIML et al. have worked around the problem by restricting both the domain of discourse and the variety of sentences recognized, but they have only worked around the problem, not solved it. If you start from a point of concatenating simple, nearly atomic data, your programming task will be much easier (and much more likely to lend itself to later parsing, rather than starting at arbitrary parsing of your data). Anyway, read the article!
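A small Python sketch shows the asymmetry. Both function names are hypothetical, and the personal-name example is my own illustration rather than one taken from the article: joining two atomic fields is one line, while splitting the joined string back apart has to guess where the boundary was.

```python
from typing import Tuple


def full_name(given: str, family: str) -> str:
    """Concatenation: trivially correct for any pair of fields."""
    return f"{given} {family}"


def naive_split(name: str) -> Tuple[str, str]:
    """Parsing: guesses that the first space separates given from family name."""
    given, _, family = name.partition(" ")
    return given, family
```

Joining "Mary Ann" and "Van Buren" gives "Mary Ann Van Buren", but naive_split on that string returns ("Mary", "Ann Van Buren"). The information about the field boundary was lost at concatenation time, which is why storing the atomic fields is the easier starting point.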

Friday May 30, 2008
01:04 PM

PageRank is Precomputed Relevancy Ranking

Google's PageRank is precomputed relevancy ranking, where the heavy lifting of actual relevancy ranking is done by us humans. Why is this important? I was re-reading "A new comparison between conventional indexing (MEDLARS) and automatic text processing (SMART)", which lays out how computerized indexing can beat the best manual indexing by:

  • Using a stop-word list;
  • Using a thesaurus (synonyms); and
  • Relevancy ranking.

(It's more complicated than that, but you get the idea.) Relevancy ranking is the hardest part of the indexing job, as there are no clear-cut algorithms for relevancy ranking with both excellent precision and excellent recall (getting none of the documents you don't want and all of the documents you do want). Google's PageRank works around the difficulty of relevancy ranking by handing the hardest part (the ranking of individual documents) to us humans. You can get good results from proper metadata, but metadata is useful only in environments where no one has interest in gaming the metadata. (I wonder if it should be called "The Semantic Intranet"? That's where Semantic Web technologies really make sense to me.)

The original paper is worth a read, especially if you work on software that incorporates search, and these days I suspect that almost any non-embedded program could grow to a point where it incorporates a search mechanism (and an email client, and a web browser; you get the point).
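Precision and recall are easy to state in code, even though achieving both is not. The following Python sketch uses the standard definitions; the function name and the choice to return None when a ratio is undefined are assumptions for illustration.

```python
from typing import Optional, Set, Tuple


def precision_recall(
    retrieved: Set[str], relevant: Set[str]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall) for one query's results.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    A ratio with an empty denominator is reported as None.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall
```

For example, retrieving documents a, b, c, and d when a, b, and e are the relevant ones gives a precision of 0.5 (two of four results are wanted) and a recall of two thirds (two of three wanted documents were found).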

Friday May 23, 2008
01:04 PM

Good Inheritance and Bad Inheritance

"Inheritance is evil, and must be destroyed" is the slightly overwrought title of an article by BernieCode that, nonetheless, expresses an idea that I've long held: that most use of inheritance is better represented by either composition (HAS-A rather than IS-A) or by interface implementation/Perl 6 roles (ACT-AS rather than IS-A).

Inheritance works well for classes that are actually closely related (the canonical example of classes that represent the relationship of various species springs to mind here). What you often want (in my experience) are classes that can act in a certain way; for example, a horse and a dog that can act like a pet. The EventManager example in the article above is a particularly good example of where a Perl 6 role/Java interface/etc. solves a problem much more neatly and clearly than inheritance does.

By the way, "Solving compositional problems with Perl 6 roles" (which I just discovered) also looks like a pretty good resource on this topic, especially for us Perl users.
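The horse-and-dog case can be sketched in Python, with typing.Protocol (Python 3.8 or later) standing in for a Perl 6 role or Java interface. All class and method names here are hypothetical, and this is not the EventManager example from the article; it only illustrates ACT-AS and HAS-A without any shared base class.

```python
from typing import Iterable, List, Protocol


class Pet(Protocol):
    """The role (ACT-AS): anything with a name that can greet its owner."""

    name: str

    def greet(self) -> str: ...


class Dog:
    def __init__(self, name: str) -> None:
        self.name = name

    def greet(self) -> str:
        return f"{self.name} wags its tail"


class Horse:
    def __init__(self, name: str) -> None:
        self.name = name

    def greet(self) -> str:
        return f"{self.name} nickers"


class Household:
    """Composition (HAS-A): a household has pets; it is not itself a pet."""

    def __init__(self, pets: Iterable[Pet]) -> None:
        self.pets: List[Pet] = list(pets)

    def welcome_home(self) -> List[str]:
        return [pet.greet() for pet in self.pets]
```

Dog and Horse never inherit from Pet or from each other; they satisfy the role simply by having the right shape, which keeps the species hierarchy (if you need one) separate from the "acts like a pet" behavior.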

Friday March 07, 2008
02:09 PM

pmtools-1.10 Release

Now at a CPAN mirror site near you: pmtools-1.10. Tom "spot" Callaway of Fedora Core let me know that the Fedora folks were concerned about the fact that pmtools was only licensed under the Perl 5 Artistic License (they were concerned about how well the Artistic License 1.0 would stand up in court). So, pmtools (starting with v1.10) is now dual-licensed like Perl (Artistic and GPL). (My other public Perl stuff is also dual-licensed.) I also added my copyright to pmtools, as I had not added my name to the copyright when I took it over.

Off-hand, I don't recall why Tom Christiansen used only the Artistic License for pmtools. Anyone with a clue, please drop me a line. (That of course includes you, Tom.)

Friday January 25, 2008
01:17 PM

Navigational Spaghetti -- What are your thoughts?

"Navigational Spaghetti -- What are your thoughts?" presents the dilemma of making program navigation both easy and flexible. Somehow MVC/MVP come to mind here...

Friday January 18, 2008
01:44 PM

Top Ten Software Engineering Ideas

Friday January 11, 2008
01:06 PM

Regular Expression Matching Can Be Simple And Fast

"Regular Expression Matching Can Be Simple And Fast" is worth a look. re::engine::Plan9 should let you test this approach.
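To give a feel for the approach, here is a rough Python sketch of matching by tracking a set of possible states instead of backtracking. It is my own simplification, not the article's code and not the re::engine::Plan9 API: it only handles literal characters, ".", and the postfix operators "*", "+", and "?" applied to a single character, and it only does whole-string matches. All names are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (character or ".", quantifier: "", "*", "+", or "?")


def _tokenize(pattern: str) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "*+?":
            raise ValueError(f"quantifier {ch!r} has nothing to repeat")
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*+?":
            quant = pattern[i + 1]
            i += 1
        tokens.append((ch, quant))
        i += 1
    return tokens


def _closure(states: Set[int], tokens: List[Token]) -> Set[int]:
    """Add every state reachable by skipping optional ("*" or "?") tokens."""
    closed = set(states)
    stack = list(states)
    while stack:
        i = stack.pop()
        if i < len(tokens) and tokens[i][1] in ("*", "?") and i + 1 not in closed:
            closed.add(i + 1)
            stack.append(i + 1)
    return closed


def full_match(pattern: str, text: str) -> bool:
    """True if the whole of `text` matches `pattern` (restricted syntax)."""
    tokens = _tokenize(pattern)
    # State i means "token i is the next one to match"; len(tokens) is accept.
    states = _closure({0}, tokens)
    for c in text:
        nxt: Set[int] = set()
        for i in states:
            if i < len(tokens):
                ch, quant = tokens[i]
                if ch == "." or ch == c:
                    nxt.add(i + 1)          # move past this token
                    if quant in ("*", "+"):
                        nxt.add(i)          # or stay and repeat it
        states = _closure(nxt, tokens)
        if not states:
            return False
    return len(tokens) in states
```

Because each input character is processed once against at most len(tokens) + 1 states, a pattern such as "a?" repeated 30 times followed by "a" repeated 30 times matches a string of 30 a's without the exponential blowup a backtracking matcher can hit on the same input.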

Friday January 04, 2008
01:16 PM

Publishing vs. Storing Documents

"One Abstraction, Two Uses" is a nice essay that covers software requirements (and user requirements) for storing documents and for publishing documents (which are different actions, as explained in the essay).

01:09 PM

The World is Collaborative and Loosely Coupled

"Note to Software Vendors, the World is Collaborative and Loosely Coupled." 'Nuff said.

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport