
Alias (5735)

http://ali.as/

Journal of Alias (5735)

Friday March 24, 2006
04:57 PM

Lingua::EN::VarCon - Starting a war against America(n)

[ #29093 ]

... and hi to all the hardworking people at the NSA.

Sorry for distracting you, please continue on to the next intercept.

As far as I am concerned, English is ultimately defined as the language spoken in England. The rest of the British Isles and the other parts of the British Empire currently over here competing at the Commonwealth Games have decided to use it as well (except for the special case of Canadian, but more on that later).

But (to drop briefly into software analogy) 13 colonies in the new Americas, in their wisdom, got sick of the English project lead for unrelated reasons, mainly because they saw him as not quite a benevolent-enough benevolent dictator for life.

So they formed a break-away group, and then forked the language to create American. In order to differentiate their project from the original, they hired an expensive interface usability consultant to help make the user interface simpler and easier to learn.

A number of years later, in an unfortunate turn of events for the 30% of the world still using English 1.2.48 (build 1970 or so) the project team for American managed to implement the new fields of productised shrink wrap software with American 1.0 as a compulsory dependency.

This has resulted in a ridiculous situation in which "English" in almost any software product actually means the American fork of English.

For a long time we in the majority of English users have put up with this, because we were just happy that we weren't in the situation of Traditional Chinese or something really esoteric.

And to be fair, in some situations it's better to use American. When it comes to APIs, such as module, method and global variable naming, I think on balance it's better to give in to using American just to gain the consistency of a single API, rather than the long term edge cases and scaling problems you would get in an OO system having both WWW::Mechanize and WWW::Mechanise, or ->color and ->colour.

But we've had Unicode for long enough for it to be standard practice now, and the situation is now almost worse for English than before.

Highly popular software packages now come with your choice of 15 different languages, except English.

English English has fallen into a hole, where we all get annoyed by applications being in American, even if it's just different for 3 or 4 words in the interface, and despite it being a TINY change, nobody really could be bothered to put in the effort to create and maintain an entire language pack just to change 4 words. And in any case, the application will bundle American as well, just in case there are any words missing in the English translation of American.

To add insult to injury, the website of my favourite editor, UltraEdit, even lists the English language version of the application with specifically BOTH American and British flags on it, but provides only "English (American)", along with "Spanish (International Sort)" and 6 other languages.

On emailing them about the situation I was informed "We have no plans to implement a British translation of the program", and yet it's probably only 7 or 8 words making the difference.

If the news presenters in Spain were all forced to speak in Mexican or Portuguese, or all French cinema switched to Québécois, the uproar would be huge.

And while here in Australia we are so close to British we don't really need our own translation, I pity more the poor users of Canadian, a language with only 20 million odd writers, that is a weird half-breed caught between English and American. They have an even smaller payoff to create translations than for the billions of people that use English instead of American.

But one thing is clear in all this.

There are too many applications needing to be translated from American to English, not enough people to do them, and something of a lack of care-factor due to the small changes that would need to be maintained over time.

So perhaps it's time to look at something different.

Perhaps it's time to look at implementing automated American to English translation (and vice versa) and then integrating that into our internationalisation systems, so that every American program with internationalisation support gets an English and a Canadian version.

It's been one of those little things I keep meaning to look at properly, but never really had a good plan on how to do it.

The biggest problem I always faced in a general conceptual design was the data. Doing simple parsers or regex replacing things wouldn't be hard, but what do you use for data? It's often under restrictive copyright, or otherwise difficult to get hold of, install, locate, and manage. There always seems to be some problem with it.
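As a rough illustration of the "regex replacing" half of the problem, here is a minimal sketch in Python (used purely for illustration, not the language of the module). The word mapping is a tiny hand-written sample and the function names are hypothetical; it only shows that the replacing is easy once the data exists.

```python
import re

# Hypothetical, hand-written sample mapping (American -> English spelling).
# A real converter would load this from a word list, not hard-code it.
AMERICAN_TO_ENGLISH = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
}

# One pattern that matches any known word, on word boundaries only,
# so "colorful" is left alone unless it is listed itself.
_WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, AMERICAN_TO_ENGLISH)) + r")\b",
    re.IGNORECASE,
)


def _match_case(source: str, target: str) -> str:
    """Copy simple capitalisation (ALL CAPS, Title, lower) onto target."""
    if source.isupper():
        return target.upper()
    if source[0].isupper():
        return target.capitalize()
    return target


def to_english(text: str) -> str:
    """Replace each known American spelling with its English counterpart."""
    def swap(match):
        word = match.group(0)
        return _match_case(word, AMERICAN_TO_ENGLISH[word.lower()])

    return _WORD_RE.sub(swap, text)
```

Calling `to_english("Pick a Color")` would give "Pick a Colour". Whole-word matching keeps the sketch conservative, at the cost of needing every inflected form in the data.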

And so I took another look at the problem today, and finally found a sane path to fix it.

To celebrate the 2006 Melbourne Commonwealth Games, currently coming to a close, and as a token of respect and a gift to our head of state Elizabeth the Second, by the Grace of God, Queen of Australia and Her other Realms and Territories, Head of the Commonwealth (visiting here for probably the last time) I thought it appropriate to take the first step in the battle to reclaim English.

Lingua::EN::VarCon is a Data Package (a distribution that logically ties a specific data product to a Perl namespace) that provides access to the VarCon database, originating from a wide variety of sources (see the largish copyright documentation section) and compiled and released by the Word List SourceForge project.

It provides a set of 5 tables in tab-separated-columns format, which contain a number of lists of conversion data. Most notable is the ABBC dataset which contains the list of words that differ between the various written dialects of English, and how that word is spelled in English, American, and what I think is two different Canadian dialects.
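To make the shape of such a table concrete, here is a small sketch, again in Python for illustration only, that reads tab-separated rows into an in-memory lookup. The column order, the column names and the two sample rows are assumptions made for the example; they are not copied from VarCon, and none of this is the Lingua::EN::VarCon API.

```python
import csv
import io

# Hypothetical sample in an ASSUMED layout: one row per word, four
# tab-separated columns (American, English, and two Canadian variants).
# Column order and rows are illustrative, not taken from the real files.
SAMPLE = "color\tcolour\tcolour\tcolour\ncenter\tcentre\tcentre\tcentre\n"

COLUMNS = ("american", "english", "canadian_1", "canadian_2")


def load_table(handle):
    """Read tab-separated rows into a list of dicts keyed by dialect."""
    reader = csv.reader(handle, delimiter="\t")
    # Rows with an unexpected number of columns are skipped.
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]


def build_lookup(rows, source, target):
    """Build a source -> target mapping, skipping identical spellings."""
    return {r[source]: r[target] for r in rows if r[source] != r[target]}


rows = load_table(io.StringIO(SAMPLE))
american_to_english = build_lookup(rows, "american", "english")
```

The same `build_lookup` call with different column names gives a mapping between any pair of dialects, which is roughly what an "in-memory hash" view of the data would look like.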

The other tables contain various other bits and pieces such as "footpath" vs "sidewalk" type translations that might not be entirely safe to apply all the time.

Although currently it contains just a simple set of methods to locate the files, as part of my commitment to the Battle for English I pledge to implement in the Lingua::EN::VarCon module new methods to help others get access to the dataset in whatever is the most optimal format for their use, be it BDB or in-memory hash or whatever else is needed.

To HRH Elizabeth II,
on the occasion of her visit

Madam

It seems the rebel colonials in America have gotten a little out of control, and are creating spelling confusion and inconvenience for all of the 1.8 billion peoples in the 53 countries that make up the Commonwealth of Nations.

For my part, I hope that this gift goes some way towards rectifying this situation.

And thank you for your continuing support, and your continuing willingness to stay entirely out of our way.

I have the honour to be, Madam, Your Majesty's humble and obedient servant.

Adam Kennedy
Lismore, Australia.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Ha ha ! great post, I like your sense of humour.

    As a side note, I'll add that this approach wouldn't be that useful for French : every French-speaking country (in Europe, America or Africa) uses the One True Spelling defined by the Académie. However, there are differences in vocabulary between fr_FR, fr_BE and fr_CA. But vocabulary is notoriously difficult to translate without context...

    • You'll have to forgive me some poetic licence :)

      You are correct of course. French, and also German, have active systems in place to protect and standardise the language, something that we unfortunately don't have.

      But much of this problem DOES stem from this problem of having English being American. 2/3rds of the programs on my desktop right now, including Thunderbird, have American dictionaries, and don't come out with British versions with anything like the rapidity of, say, the French version.

      And it reall
  • As I understand things, American wasn't originally a fork of English. The 'simpler' spellings (eg: harbor instead of harbour) you referred to were actually the common usage in England at the time.

    What happened was the English language continued to develop and more elegant spelling conventions were imported from other languages (esp. French). But the Americans no longer had real-time network access to the development branch. They were stuck on the last stable release which quickly became outdated.

    Ultimate
  • While you're about it, perhaps you could undertake overall spelling reform. English could do with some rationalization of the byzantine relationship between sound and script, as compared to German and French. http://pages.infinit.net/mtalbot/humour/4.html [infinit.net] http://www.rocksforkids.com/FabFours/spelling.htm [rocksforkids.com] I hope the Queen will see her way to appointing some computer chappies to undertake this important project. But perhaps the US is also willing to participate, seeing how the ISO 8601 Date Format is gaini
    • Well, that is a MUCH bigger problem, and one completely beyond my powers or responsibility.

      Someone else can deal with language unification, I'd just like WHATEVER the correct spelling is deemed to be to at least be available.

      Also, I believe someone has already done what you are talking about. That's partly how we got American in the first place.
  • "[R]ebel colonials in America have gotten a
    little out of control, and are creating spelling
    confusion and inconvenience "
    That 'gotten' is surely an Americanism, isn't it?

    In British English it would be 'got.' No?
    • While I may aim for British English, I write in Australian :)

      I suspect gotten has seeded a bit too hard here to be removed.

      Certainly in my town anyway.
      • Got, got, gotten are lazy words. They are used when the speaker/writer can't think of a better word. Or so said my grade two primary school teacher.

        She said that you can always find a better formulation that doesn't use them.

        "[R]ebel colonials in America are a
        little out of control"

        "[R]ebel colonials in America have gone/become a
        little out of control"

        Something like that.
  • As far as I am concerned, English is ultimately defined as the language spoken in England.

    You basically lost me here.. There are many languages spoken in England.... If you find that answer too smart-alecky, then which dialect are you referring to specifically? Why do you or anybody get to choose which of those represent "real" English? Saying that BBC English is the real one is arbitrary, and most people don't use it. (I'd like to see a table of how many people speak various dialects of English.)

    I've

    • It's not smart-alecky, but it is pedantic, and reading more into what I'm saying than I am. You've also missed the main point in that I'm talking about written English, and not spoken English.

      As to your specific points, I get to choose because it's MY opinion, and nobody else's. As far as _I_ am concerned, and as I related it to my personal universe, that is the case.

      As to which is better, I really don't care and made no implications that either spelling was inherently better. Dictating what the official spe