All the Perl that's Practical to Extract and Report

Beatnik (493)

http://www.ldl48.org/

A 29-year-old Belgian who likes Mountain Dew, Girl Scout Cookies, Tim Hortons French Vanilla Flavoured Cappuccino, Belgian beer, Belgian chocolate, Belgian women, Magners Cider, chocolate chip cookies and Perl. Likes snowboarding, snorkeling, sailing and silence. Bach can really cheer him up! He still misses his dog.

Project Daddy of Spine [sf.net], a mod_perl-based CMS.

In his superhero time (8.30 AM to 5.30 PM), he works on world peace.

Journal of Beatnik (493)

Monday May 30, 2005
10:39 AM

Google woes

[ #24946 ]
Our all-time favorite search engine has a nice feature that lets you read PDF files (among others) as HTML. However, that feature could use a little tweaking here and there.
  • Not Google’s fault. PostScript and PDF are designed so that the only guarantee you have is that rendering them will yield the exact same result, wherever it happens. Any other operation, such as trying to extract text, is not reliably possible.

    It works in a usefully large number of cases because documents are generally machine-generated by low-complexity processes. Think of how easy it is to scrape information out of HTML pages using regexes if they were all generated by a script which populat