I'm looking for reasonable quantities of text in as many languages as I can get my hands on (note: I mean "text in English", "text in French", etc. I do not mean "text with as many languages as possible inside it").
Basically, I'm looking for better training text for my Lingua::Identify project.
If anyone has a couple of pointers (or even a corpus itself, even if it's just for one language), I'd really appreciate that.
Oh, one other thing: by "reasonable", I think I'm aiming for something like 10M per language... but right now I'd just like to get my hands on any corpus at all (hey, 1M today, 1M tomorrow...)
Google? (Score:1)
Google's advanced search allows you to limit the results to just one particular language; from Arabic to Turkish, you have 35 choices in all.
For example...
Google for Perl in French [google.com]
Google for Perl in German [google.com]
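If you want to script those searches rather than click through the form, here is a minimal sketch in Python. It assumes Google still honours the `lr=lang_xx` query parameter for the language restriction; `google_language_url` is just a name I made up for illustration, and fetching and cleaning the result pages is left out entirely.

```python
from urllib.parse import urlencode


def google_language_url(query, lang_code):
    """Build a Google search URL restricted to one language.

    Assumption: Google accepts lr=lang_<code> (e.g. lang_fr) as the
    language restriction; this helper is hypothetical, not an official API.
    """
    params = {"q": query, "lr": f"lang_{lang_code}"}
    return "https://www.google.com/search?" + urlencode(params)


# One URL per language of interest.
for code in ("fr", "de"):
    print(google_language_url("Perl", code))
```

Keep in mind that search result pages are mostly snippets and markup, so they would probably need a fair amount of cleaning before being any good as training text.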
Re: Got corpus? (Score:1)
Here's his website: http://www.petamem.com/ [petamem.com]
Here's his talk: http://www2.perl-workshop.de/2003/contriblist.epl#106 [perl-workshop.de]
Maybe he can help you (at least if you don't plan to take over his business).
11 for you to work with (Score:1)
More! They are parallel! (you know I like them this way)
http://people.csail.mit.edu/people/koehn/publications/europarl/ [mit.edu]
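Since the goal is a fixed amount of text per language, something like the following might help cut each language's side down to size. It is only a sketch: I'm assuming one sentence per line in plain-text files and counting the budget in characters (the original "10M" doesn't say which unit), and `take_sample` is a hypothetical helper, not part of any library.

```python
def take_sample(lines, max_chars):
    """Collect non-empty lines until adding one more would exceed max_chars.

    Assumptions: one sentence per line, budget measured in characters.
    """
    sample, total = [], 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if total + len(text) > max_chars:
            break  # budget reached, stop reading
        sample.append(text)
        total += len(text)
    return sample


# Tiny made-up input, just to show the call.
print(take_sample(["Bonjour le monde.\n", "\n", "Au revoir.\n"], 20))
```

With a real file you would pass the open file handle as `lines` and run it once per language.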
Association des Bibliophiles Universels (Score:2)
ABU [abu.cnam.fr] make French literary texts [abu.cnam.fr] (no copyright strings attached) available for free. These are classic works, so maybe the French is a little too classic for your needs. There's a lot of poetry in there as well.