Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

Ovid (2709)

Ovid
  (email not shown publicly)
http://publius-ovidius.livejournal.com/
AOL IM: ovidperl (Add Buddy, Send Message)

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Sunday October 02, 2005
08:24 PM

Yahoo! Does the Dumb

[ #26969 ]

Guess how you validate Yahoo!'s IDIF format? There is no grammar available. In fact, there's no specification at all, just a loose explanatory document. Now, after a couple of phone calls with Yahoo!'s rep, I finally figured out how to validate IDIF documents. I email them to the rep. The rep is out of town for a week. He gets back, let's me know of some errors in my output and then I email my corrections to him and never hear back.

If you read about IDIF, and if you're familiar with XML, you will understand why this format is not only fundamentally flawed, but it's a great example of why you should not produce XML variants without careful forethought. The reason it's one of the best examples of how to do something wrong is because of Yahoo!'s target market: everybody.

Here's how Yahoo! describes their "Search Submit Pro" technology which uses IDIF to format documents.

The Yahoo! Search Submit Pro program enables commerce sites and content providers to drive more traffic to their web sites. Search Submit Pro helps customers directly deliver their content to the Yahoo! Search index, which provides search results to many of the largest and most popular web portals. Designed for content providers with more than 1,000 URLs, the Search Submit Pro program offers many benefits, including ...

So this is for large Web sites all over the world. These Web sites create IDIF documents and build IDIF "pointer" files which Yahoo! reads to find the IDIF documents so the Web site can have more relevant search results in Yahoo!'s index. Can you think of a better example of something which requires a standard, well-tested data interchange format? (It's worth noting creating a format which screams "junior programmer" does nothing to help a company's reputation.)

Let's take a look at a typical IDIF document, pulled straight from their examples:

<?xml version="1.0"?>
<IDIF>
<DOC url="http://www.babystuffandmore.com">
<CONTENT type="text/html">
  <html>
  <head>
  <title>BabyStuffAndMore.com: everything a baby could need</title>
  <meta name="keywords" content= "Baby clothing, baby furniture, infant
  accessories, baby store, baby toys, kids clothes, cribs, baby
  strollers">
  <meta name="description" content = "Shop for a variety of products for
  newborns and toddlers. Outfit your whole nursery. BabyStuffAndMore.com
  has car seats, cribs, bedding, toys and more.">
  </head>
      <body>
    Find great buys on new-baby must-haves, from strollers and car seats
    to items for the nursery.  Personalized gifts for baby: We have a
    great selection of personalized gifts, including: engraved silver
    spoons, nursery d&#233;cor and clothing.
    </body>
  </html>
</CONTENT>
<PROP name="trackurl">http://track.babystuffandmore.com/6093/ ->
www.babystuffandmore.com&sourceid=55944005/507a845124"</PROP>
</DOC>
<DOC url="http://www.babystuffandmore.com/catalog/ ->
store.asp?product_id=55944005">
<CONTENT type="text/html">
  ...
</CONTENT>
</DOC>
</IDIF>

How do I hate thee? Let me count the ways.

First, get rid of that XML declaration. It doesn't belong there. This is not XML.

Second, look at the tag. It says type="type/html". Then it has an HTML document. A nice, unstructured HTML document. An unescaped HTML document. That document can come from anywhere. It allows bad HTML, so IDIF will frequently not validate as XML. I talked to a Yahoo! rep on the phone. I explained my predicament and the man explained that no, I was not allowed to escape the HTML. Can I use a CDATA section? "No." What happens if my clients decide to use include a <DOC> tag somewhere in a document? "Don't do that."

In fact, he explained that if clients, for some reason, used DOC, CONTENT, PROP tags or others in IDIF (not that they would, but I can't stop them if they do), that there are places where it can go in the HTML and others where it can't. He seemed a bit unclear on the specifics and I certainly can't remember them.

In other words, thousands, possibly tens of thousands of Web sites all over the world must all remember not to include any tags like that in their content. Ever.

The only saving grace I see is that the tags used in IDIF are not valid HTML tags, but there are sites which chunk out some pretty weird stuff and a variety of software tools that generate "custom" HTML tags. And in the unlikely event that any of these tags get introduced into "official" XHTML, Yahoo! has a problem.

I can understand a small shop goofing up and doing something like this, but why would a huge software company whose business is dependant getting things done correctly do something this sloppy? Let's say you get calls from 3 irate customers out of 10,000 because those customers are churning out bad HTML. Seems like nothing, right? Well, had Yahoo! stuck with XML in the first place, those 3 irate customers wouldn't be so irate. Further, because you can't use standard XML generating tools, you drive up the costs for the end developers who have to write custom tools.

I do feel kind of bad for Yahoo!, though. This is technology they acquired from Inktomi. It's obviously very beneficial and was likely in widespread use when Yahoo! acquired it so just telling all of their customers "you must change your systems" is rather problematic. Instead, I would recommend that they slap a version number on their IDIF documents and encourage their customers to update their IDIF generators. The new version of IDIF can be valid XML (or other format) and Yahoo! could look for the version number to figure out how to parse it.

To summarize: it's not XML, despite the declaration. There are possible tag collisions, there is no clear specification and there is no way to automatically validate it (yes, I can write a validator for it by pulling it apart at the top level, but I shouldn't have to). Saying "these problems are unlikely to be an issue" is a poor excuse. Good programmers look both ways before crossing a one-way street.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • Guuhhh.

    That looks like someone thought “I know, I’ll use angle brackets and I’ll be like all the cool pros.”

    Contrast Google Sitemaps. Sure, a non-standard format cooked up by Google and for Google, but it’s actual XML, it has a spec, and it makes sense. (Okay, they refuse to accept UTF-16, which a compliant parser must; and if you’re Asian, then to you that’s probably more than the minor slip the rest of the world considers it.)

  • ...they say "It is a structured format, similar to XML", not that it is XML. Of course, they do contradict that when they require an XML declaration. ;-)
    • I just hacked my way through programming the IDIF files for a site and I was wandering, forelorn and sad, around the web wondering "how the heck do I validate this mess?"

      And it really is a mess. I have gone, line by line, through their example and compared their output to mine and the two do seem to correspond, but their example has a half a dozen obvious errors, so who knows what I missed?

      And, by the way, what with Google sitemaps and A9's siteinfo and RSS and ATOM, why does the world need another way for
  • I just hacked my way through programming the IDIF files for a site and I was wandering, forelorn and sad, around the web wondering "how the heck do I validate this mess?"

    And it really is a mess. I have gone, line by line, through their example and compared their output to mine and the two do seem to correspond, but their example has a half a dozen obvious errors, so who knows what I missed?

    And, by the way, what with Google sitemaps and A9's siteinfo and RSS and ATOM, why does the world need another way for