Guess how you validate Yahoo!'s IDIF format? There is no grammar available. In fact, there's no specification at all, just a loose explanatory document. Now, after a couple of phone calls with Yahoo!'s rep, I finally figured out how to validate IDIF documents. I email them to the rep. The rep is out of town for a week. He gets back, lets me know of some errors in my output, and then I email my corrections to him and never hear back.
If you read about IDIF, and if you're familiar with XML, you will understand why this format is not only fundamentally flawed, but also a great example of why you should not produce XML variants without careful forethought. What makes it one of the best examples of how to do something wrong is Yahoo!'s target market: everybody.
Here's how Yahoo! describes their "Search Submit Pro" technology, which uses IDIF to format documents:
The Yahoo! Search Submit Pro program enables commerce sites and content providers to drive more traffic to their web sites. Search Submit Pro helps customers directly deliver their content to the Yahoo! Search index, which provides search results to many of the largest and most popular web portals. Designed for content providers with more than 1,000 URLs, the Search Submit Pro program offers many benefits, including
So this is for large Web sites all over the world. These Web sites create IDIF documents and build IDIF "pointer" files, which Yahoo! reads to find the IDIF documents so the Web site can have more relevant search results in Yahoo!'s index. Can you think of a better example of something which requires a standard, well-tested data interchange format? (It's worth noting that creating a format which screams "junior programmer" does nothing to help a company's reputation.)
Let's take a look at a typical IDIF document, pulled straight from their examples:
<title>BabyStuffAndMore.com: everything a baby could need</title>
<meta name="keywords" content= "Baby clothing, baby furniture, infant
accessories, baby store, baby toys, kids clothes, cribs, baby
<meta name="description" content = "Shop for a variety of products for
newborns and toddlers. Outfit your whole nursery. BabyStuffAndMore.com
has car seats, cribs, bedding, toys and more.">
Find great buys on new-baby must-haves, from strollers and car seats
to items for the nursery. Personalized gifts for baby: We have a
great selection of personalized gifts, including: engraved silver
spoons, nursery décor and clothing.
<PROP name="trackurl">http://track.babystuffandmore.com/6093/ ->
<DOC url="http://www.babystuffandmore.com/catalog/ ->
How do I hate thee? Let me count the ways.
First, get rid of that XML declaration. It doesn't belong there. This is not XML.
Second, look at the tag. It says type="type/html". Then it has an HTML document. A nice, unstructured HTML document. An unescaped HTML document. That document can come from anywhere. It allows bad HTML, so IDIF will frequently not validate as XML. I talked to a Yahoo! rep on the phone. I explained my predicament and the man explained that no, I was not allowed to escape the HTML. Can I use a CDATA section? "No." What happens if my clients decide to include a <DOC> tag somewhere in a document? "Don't do that."
In fact, he explained that if clients, for some reason, used DOC, CONTENT, PROP tags or others in IDIF (not that they would, but I can't stop them if they do), that there are places where it can go in the HTML and others where it can't. He seemed a bit unclear on the specifics and I certainly can't remember them.
In other words, thousands, possibly tens of thousands of Web sites all over the world must all remember not to include any tags like that in their content. Ever.
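Since nobody can rely on every content author remembering this, the practical defence is to scan the HTML before wrapping it. Here is a minimal sketch in Python. The tag list contains only the three names mentioned above; the real reserved set is undocumented as far as I can tell, so treat RESERVED_TAGS and the case-insensitive matching as my assumptions, and find_reserved_tags as a hypothetical helper name.

```python
import re

# Only the names mentioned in this article. The full reserved set is an
# assumption: there is no specification to check it against.
RESERVED_TAGS = ("DOC", "CONTENT", "PROP")

# Match an opening or closing tag whose name is exactly one of the reserved
# names (so <content-box> or <document> are not flagged).
_RESERVED_RE = re.compile(
    r"</?(" + "|".join(RESERVED_TAGS) + r")(?=[\s/>])",
    re.IGNORECASE,  # assumption: case may not matter to the IDIF reader
)

def find_reserved_tags(html: str) -> list[str]:
    """Return the reserved tag names that appear as tags in the HTML."""
    return sorted({m.group(1).upper() for m in _RESERVED_RE.finditer(html)})

print(find_reserved_tags('<p>See the <doc id="1">manual</doc></p>'))  # ['DOC']
print(find_reserved_tags('<p>Read the docs</p>'))                     # []
```

A regular expression is a blunt instrument for HTML, but it is roughly the level of tooling this format leaves you with.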
The only saving grace I see is that the tags used in IDIF are not valid HTML tags, but there are sites which churn out some pretty weird stuff and a variety of software tools that generate "custom" HTML tags. And in the unlikely event that any of these tags get introduced into "official" XHTML, Yahoo! has a problem.
I can understand a small shop goofing up and doing something like this, but why would a huge software company whose business depends on getting things done correctly do something this sloppy? Let's say you get calls from 3 irate customers out of 10,000 because those customers are churning out bad HTML. Seems like nothing, right? Well, had Yahoo! stuck with XML in the first place, those 3 irate customers wouldn't be so irate. Further, because you can't use standard XML generating tools, you drive up the costs for the end developers who have to write custom tools.
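To make the "standard XML tools" point concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The wrapper string is hypothetical: only the DOC and CONTENT element names and the type attribute come from the IDIF material, and the HTML snippet is made up. It shows a stock parser rejecting raw HTML inside the wrapper and accepting the same HTML once it is escaped or placed in a CDATA section, the two options I was told not to use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Ordinary real-world HTML that is not well-formed XML: a bare ampersand,
# an unclosed <br>, and an unquoted attribute value.
html = '<p>Cribs & bedding<br><img src=crib.png></p>'

def is_well_formed(document: str) -> bool:
    """Return True if a stock XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Hypothetical wrappers for illustration only.
raw = f'<DOC><CONTENT type="type/html">{html}</CONTENT></DOC>'
escaped = f'<DOC><CONTENT type="type/html">{escape(html)}</CONTENT></DOC>'
cdata = f'<DOC><CONTENT type="type/html"><![CDATA[{html}]]></CONTENT></DOC>'

print(is_well_formed(raw))      # False
print(is_well_formed(escaped))  # True
print(is_well_formed(cdata))    # True
```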
I do feel kind of bad for Yahoo!, though. This is technology they acquired from Inktomi. It's obviously very beneficial and was likely in widespread use when Yahoo! acquired it, so just telling all of their customers "you must change your systems" is rather problematic. Instead, I would recommend that they slap a version number on their IDIF documents and encourage their customers to update their IDIF generators. The new version of IDIF can be valid XML (or some other format) and Yahoo! could look for the version number to figure out how to parse it.
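A rough sketch of what that dispatch could look like, in Python. Everything here is hypothetical: IDIF has no version attribute today, so the version="2" attribute on DOC, the function names, and the rule that an unversioned document means the legacy format are all my assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Look only at the first <DOC ...> opening tag. The sniffing has to work on
# raw text because legacy documents may not be parseable as XML at all.
_DOC_OPEN_RE = re.compile(r"<DOC\b([^>]*)>")
_VERSION_RE = re.compile(r'\bversion="(\d+)"')

def detect_version(text: str) -> int:
    """Return the declared version, or 1 when none is declared (assumption)."""
    doc = _DOC_OPEN_RE.search(text)
    if doc is None:
        return 1
    version = _VERSION_RE.search(doc.group(1))
    return int(version.group(1)) if version else 1

def parse_idif(text: str):
    """Route versioned documents to a real XML parser, the rest to legacy code."""
    if detect_version(text) >= 2:
        return ("xml", ET.fromstring(text))
    return ("legacy", text)  # hand off to whatever handles today's format

print(detect_version('<DOC url="http://example.com/">'))              # 1
print(detect_version('<DOC version="2" url="http://example.com/">'))  # 2
```

Existing customers keep working untouched, and anyone who opts in to the new version gets to use off-the-shelf XML tools.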
To summarize: it's not XML, despite the declaration. There are possible tag collisions, there is no clear specification and there is no way to automatically validate it (yes, I can write a validator for it by pulling it apart at the top level, but I shouldn't have to). Saying "these problems are unlikely to be an issue" is a poor excuse. Good programmers look both ways before crossing a one-way street.
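For the curious, "pulling it apart at the top level" might look something like the Python sketch below. It is not based on any specification, because there isn't one. It assumes one DOC per file with a single CONTENT element, blanks out the HTML body, and checks that what remains parses as XML. The function name validate_envelope is my own.

```python
import re
import xml.etree.ElementTree as ET

# Greedy on purpose: match from the first <CONTENT ...> to the LAST
# </CONTENT>, so a stray </CONTENT> inside the HTML does not end the body
# early. Assumes a single DOC with a single CONTENT element per file.
_CONTENT_RE = re.compile(r"(<CONTENT\b[^>]*>)(.*)(</CONTENT>)", re.DOTALL)

def validate_envelope(idif: str) -> list[str]:
    """Return a list of problems; an empty list means the wrapper parsed."""
    match = _CONTENT_RE.search(idif)
    if match is None:
        return ["no CONTENT element found"]
    # Drop the HTML body so only the wrapper markup is left to check.
    envelope = idif[:match.start(2)] + idif[match.end(2):]
    try:
        ET.fromstring(envelope)
    except ET.ParseError as exc:
        return [f"envelope is not well-formed: {exc}"]
    return []

good = ('<DOC url="http://example.com/">'
        '<CONTENT type="type/html"><p>a & b<br></p></CONTENT></DOC>')
print(validate_envelope(good))  # []
```

Note what this does not check: the HTML itself, tag collisions inside it, or whether Yahoo! would actually accept the result. It only tells you the wrapper is sane.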