
All the Perl that's Practical to Extract and Report

  • Can we assume that it is capable of doing a join like this?

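    -- Placeholders, in order: the word to exclude, then the word to require.
    -- The with1.word_id test belongs in WHERE: in a LEFT JOIN, an ON condition
    -- on the left table does not filter out left-hand rows.
    SELECT with1.page_id
        , with1.number_of_occurrences
    FROM reverse_index with1
        LEFT JOIN reverse_index without1
            ON without1.word_id = ?
                AND with1.page_id = without1.page_id
    WHERE with1.word_id = ?
        AND without1.page_id IS NULL
    ORDER BY with1.number_of_occurrences DESC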

    If we are forced to assume that this can't work, the

    • Yes, there is a Google paper on the MapReduce algorithm, which might be usable in this situation.

      The people I asked usually forgot that, even to display only the top 10 links, you cannot simply run a query with LIMIT 10 (or 20) to select the pages with the most occurrences of word1 and then grep the result to exclude the pages containing word2: the grep may discard most of those rows, so you can end up having to run multiple queries.
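The pitfall described above can be sketched with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the three-column layout of reverse_index, the sample rows, and the function names are invented for the example, and the anti-join shown is a variant of the query earlier in the thread, with the with1.word_id test placed in the WHERE clause so that it filters the left-hand rows.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory reverse_index table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reverse_index ("
        " word_id INTEGER, page_id INTEGER, number_of_occurrences INTEGER)"
    )
    conn.executemany("INSERT INTO reverse_index VALUES (?, ?, ?)", rows)
    return conn


def top_pages_limit_then_filter(conn, word1, word2, n):
    """Naive approach: take the top n pages for word1, then drop pages with word2."""
    top = conn.execute(
        "SELECT page_id FROM reverse_index WHERE word_id = ?"
        " ORDER BY number_of_occurrences DESC LIMIT ?",
        (word1, n),
    ).fetchall()
    excluded = {
        page_id
        for (page_id,) in conn.execute(
            "SELECT page_id FROM reverse_index WHERE word_id = ?", (word2,)
        )
    }
    return [page_id for (page_id,) in top if page_id not in excluded]


def top_pages_anti_join(conn, word1, word2, n):
    """Anti-join: exclude pages containing word2 before LIMIT is applied."""
    sql = """
        SELECT with1.page_id
        FROM reverse_index with1
            LEFT JOIN reverse_index without1
                ON without1.word_id = ?
                    AND with1.page_id = without1.page_id
        WHERE with1.word_id = ?
            AND without1.page_id IS NULL
        ORDER BY with1.number_of_occurrences DESC
        LIMIT ?
    """
    # Bind order follows the placeholders: excluded word, required word, limit.
    return [page_id for (page_id,) in conn.execute(sql, (word2, word1, n))]


# Made-up sample data: five pages contain word 1; the two best also contain word 2.
WORD1, WORD2 = 1, 2
rows = [
    (WORD1, 101, 50), (WORD1, 102, 40), (WORD1, 103, 30),
    (WORD1, 104, 20), (WORD1, 105, 10),
    (WORD2, 101, 5), (WORD2, 102, 7),
]
conn = build_index(rows)
naive = top_pages_limit_then_filter(conn, WORD1, WORD2, 3)
joined = top_pages_anti_join(conn, WORD1, WORD2, 3)
print(naive)   # [103]
print(joined)  # [103, 104, 105]
```

With a limit of 3, the naive version returns a single page because two of its three candidates are thrown away afterwards, while the anti-join still fills all three slots in one query.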

      • Heh, that reminds me of an interesting challenge I faced once.

        I was creating a basic desktop query tool which would be used to summarize how much money had been spent in, say, a particular area on a particular kind of product. Products were categorized at various levels. For example, a specific glass beaker would be classified as the exact item, as a glass beaker, as a laboratory supply made out of glass, and under laboratory supplies.

        I had two interesting requirements. The first was that the database shipped to people in a particular region should not contain specific order data for other regions. The second was that I needed to support various reports of the form, "Show how much of X was purchased locally and across the whole company over time period Y," where X could be anything from a specific product to, say, any laboratory supply.

        How did I do it? (Note: I did not just pre-can a set of reports. There were simply far too many possible reports to pre-can them all.)