As part of my master's thesis, I've been working on a Somali morphological analyzer and a syntactic disambiguator. A short introduction for anyone who doesn't know what these are: software that can tell you what function a word has in a sentence and, when multiple possible functions exist, chooses the correct one from context. In English, for instance, the word 'can' can be either an auxiliary verb or a noun, but we English speakers know which is which when we hear the word in context.
In the case of Somali (and many other languages), some forms are ambiguous in text that would not be ambiguous in speech, thanks to intonation and stress. For Somali, this means that the number of nouns (éy 'dog' vs. eý 'dogs') and sometimes their gender (masculine vs. feminine) is marked via tone. It is easy to imagine, then, that producing better-sounding (and grammatically sound) Somali speech from text would require knowing where the tones fall in that text. This is where these analytical tools come in handy... And conveniently, tonal patterns in Somali are mostly rule-based.
Naagta laybreeriga wax ku qoraysa ayaa soo socota.
'The woman who is writing in the library will come.'
After the morphological analyzer runs, we end up with input like the following:
naagta naag+N+Fem+Sg+Def+Abs+Prox
laybreeriga laybreeri+N+Masc+Sg+Def+Abs+Prox
wax wax+N+Masc+Sg+Indef+Nom
wax wax+N+Masc+Sg+Indef+Gen
wax wax+N+Masc+Sg+Indef+Abs
wax wax+Pron+Indef+Abs
ku +Nom+Prox
ku ku+Adp
ku ku+Pron+Pers+2Sg+Obj
dhex dhex+N+Fem+Sg+Indef+Gen
dhex dhex+N+Fem+Sg+Indef+Abs
qoraysa qor+V+Prog+3SgF+Ind+Pres+Red+Abs
ayaa ayaa+CS+Foc/L+Subj+Null
soo soo+PP+Deic
socota soco+V+3SgF+Ind+Pres+Red+Abs
There are a couple items that need to be removed here, and disambiguation is carried out by constraint grammar. Casting out the ambiguous possibilities in context rewards us with the following analysis:
"<naagta>"
"naag" N Fem Sg Def Abs Prox
"<laybreeriga>"
"laybreeri" N Masc Sg Def Abs Prox
"<wax>"
"wax" Pron Indef Abs
"<ku>"
"ku" Adp
"<dhex>"
"dhex" N Fem Sg Indef Abs
"<qoraysa>"
"qor" V Prog 3SgF Ind Pres Red Abs
"<ayaa>"
"ayaa" CS Foc/L Subj Null
"<soo>"
"soo" PP Deic
"<socota>"
"soco" V 3SgF Ind Pres Red Abs
... And these disambiguated forms can then be fed back into the morphological analyzer/generator to get the proper tone marking.
naágta laybreériga wax ku dhéx qóraysá ayaa soo socotá
I am a little unsure of the tone marking on dhéx (and in fact, ayaa should probably have a stress-tone on it too, as well as soo), but in any case, this was all carried out automatically, and these things can be fixed. Being able to provide input like this to a text-to-speech program would result in something a little less monotonous and more pleasing to the ear.
As the analysis progresses, it would even be possible to assign places where pauses are necessary, or where the ends of certain clauses are accompanied by boundary tones. ... There are also some other relevant phonological phenomena that could be processed in this manner and included in text-to-speech input.
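To make the ambiguity in the analyzer output above a little more concrete, here is a minimal Python sketch that groups `surface lemma+Tag+Tag` lines by surface form and lists which words still have more than one reading. The function names and the `SAMPLE` text are my own, for illustration only; this is not part of the analyzer or of constraint grammar, just a way of looking at the data.

```python
from collections import defaultdict

# A few lines copied from the analyzer output shown above.
SAMPLE = """\
wax wax+N+Masc+Sg+Indef+Nom
wax wax+N+Masc+Sg+Indef+Gen
wax wax+N+Masc+Sg+Indef+Abs
wax wax+Pron+Indef+Abs
ku ku+Adp
ku ku+Pron+Pers+2Sg+Obj
soo soo+PP+Deic
"""

def group_readings(text):
    """Map each surface form to a list of (lemma, [tags]) readings."""
    readings = defaultdict(list)
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        surface, analysis = parts
        lemma, *tags = analysis.split("+")
        readings[surface].append((lemma, tags))
    return dict(readings)

def ambiguous(readings):
    """Keep only the surface forms that have more than one reading."""
    return {form: r for form, r in readings.items() if len(r) > 1}

if __name__ == "__main__":
    for form, r in ambiguous(group_readings(SAMPLE)).items():
        print(form, len(r))
```

Anything this reports as ambiguous is what the disambiguation step has to whittle down to a single reading.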
Now that that's out of the way, does anyone know of some nice, open-source text-to-speech software that is open for use with any language and not just the largest ones?
Over the past several months, I've been working on a Somali morphological analyzer. It's rule-based and built with HFST, so it takes a little bit of work to extend, but it runs quite quickly and smoothly. Following are some examples for various forms of the word baabuur 'truck'.
baabuur
baabuur baabuur+N+Masc+Indef+Sg+Nom
baabuur baabuur+N+Masc+Indef+Sg+Abs
baabuur baabuur+N+Masc+Indef+Sg+Gen
baabuurro
baabuurro baabuur+N+Fem+Pl+Indef*
baabuurradii
baabuurradii baabuur+N+Fem+Def+Pl+Nom+Dist
baabuurradii baabuur+N+Fem+Def+Pl+Abs+Dist
*Note: Somali has gender polarity for some words, which alternate between Fem. and Masc. in Sg. and Pl.
It can of course be turned around to generate word forms too, if you just input the analysis. This is one of the first stages of rule-based machine translation or text tagging, or what have you, for Somali. Once I'm far enough along with the analyzer, and have gone through and worked out the kinks, I'll probably work on disambiguating multiple analyses.
Along the way, I've been compiling a corpus of news articles with which to test the coverage of my analyzer and help extend it... One of the ways I'm working to extend the analyzer with new words is by providing a means to automatically guess which inflectional type a word belongs to. I'm happy to say this is on the way too, but not quite there yet. Either way, the plan is to dump a list of words into the (Python) program, and extend the analyzer with those that pass with flying colors. Of course, there aren't many word categories in the program yet, but I'm fairly confident that I can get decent results.
Following is an example. Each word goes through a list of simple tests, and tests are assessed by count of forms fitting into some phonetic category contained in the word categories.
aalad, aalado, aaladda, aaladdu, aaladdii, aaladaha, aaladuhu, aaladihii
D1F: 8/8 <--
D1M: 5/8
D2M: 4/8
D2F: 4/8
--
geed, geedka, geedku, geedkii, geedo, geedaha, geedihii, geeduhu
D1F: 5/8
D1M: 8/8 <--
D2M: 7/8
D2F: 1/8
--
baabuur, baabuurka, baabuurku, baabuurro, baabuurrada, baabuurradii
D1F: 2/6
D1M: 4/6
D2M: 6/6 <--
D2F: 4/6
--
magac, magaca, magucu, magicii, magacyo, magacyada, magacyadii
D1F: 2/7
D1M: 5/7
D2M: 7/7 <--
D2F: 4/7
--
subax, subaxda, subaxdu, subaxdii, subaxyo, subaxyada, subaxyadii
D1F: 5/7
D1M: 2/7
D2M: 4/7
D2F: 7/7 <--
--
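For the curious, the scoring idea above can be sketched roughly as follows. This is only an illustration: the suffix sets and the names `CLASSES`, `score` and `guess_class` are hypothetical stand-ins that I made up for this example, and the real tests look at phonetic categories rather than plain suffix lookups. Only two classes are included to keep it short.

```python
# Hypothetical suffix inventories, for illustration only.
CLASSES = {
    "D1F": {"", "o", "da", "du", "dii", "aha", "uhu", "ihii"},
    "D1M": {"", "o", "ka", "ku", "kii", "aha", "uhu", "ihii"},
}

def score(stem, forms, suffixes):
    """Count the forms that are the stem plus a suffix allowed by the class."""
    hits = 0
    for form in forms:
        if form.startswith(stem) and form[len(stem):] in suffixes:
            hits += 1
    return hits

def guess_class(stem, forms):
    """Return the best-scoring class name together with all the scores."""
    scores = {name: score(stem, forms, sfx) for name, sfx in CLASSES.items()}
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    forms = ["aalad", "aalado", "aaladda", "aaladdu",
             "aaladdii", "aaladaha", "aaladuhu", "aaladihii"]
    best, scores = guess_class("aalad", forms)
    for name, hits in scores.items():
        marker = " <--" if name == best else ""
        print(f"{name}: {hits}/{len(forms)}{marker}")
```

A word would then only be added to the analyzer when one class matches every form, which is what "passing with flying colors" amounts to here.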
And of course, I'm planning on making the source available for these programs as I clean up the source, remove my notes, and provide more useful documentation...
Every once in a while I get a Qglic bug in my system, and I start tweeting or Facebook-statusing in it. Maybe you know the feeling, maybe you don't.
Qglic (as previously blogged about) is an alternative orthography for English, which has a goal of providing a more phonemic writing system (for non-linguists, a more one-to-one correspondence of sounds to letters); but the trick to it is to do this without using any "special" characters, and sticking to A-Z.
For its part, Qglic is quite good at this, despite the many vowel and consonant sounds in English. General American English distinguishes between 24 separate consonants (maybe 25, if you include the 'wh' in which) and 14 vowels. Totaling those numbers, you can see that English has more contrastive sounds than it has letters to write them. Some languages, on the other hand, are easily able to get away with having one letter for each contrastive sound, without really running out of letters in the alphabet.
One of the problems, raised to me on Twitter, was that Qglic works great, but only if you have the caught-cot merger. As it turns out, this is probably something that the creator of Qglic wasn't quite concerned about, perhaps because he doesn't have this distinction-- I think he might even be from western Canada, maybe that explains it? In any case, some words listed by Wikipedia as having separate vowels are the following:
bobble bauble bqbl
bock balk bqk
body bawdy bqdi
bot bought bqt
collar caller kqlr
chock chalk tcqk
hottie haughty hqti
odd awed qd
stock stalk stqk
tock talk tqk
wok walk wqk
For a good number of Americans, these sound the same, and in Qglic they would be written the same as well. Qglic instead uses <o> for the sound in 'smoke', /oʊ/. Following are the Qglic monophthongs, and some sample words.
Eng. Qgl.
tack tak
tech tek
tick tik
took tjk
toke tok
talk tqk
tuck tuk
teak tyk
Turk trk
tuque twk
For the more visual linguist, here's a crappy ASCII IPA vowel chart containing Qglic monophthongs and an attempt at IPA approximation of the sounds that they represent:
Qglic        IPA
y i j w      i ɪ ʊ u
e u o        e/ɛ ʌ/ə oʊ/ɔ
r            ɜ˞/ɚ
a q          æ ɑ/ɔ
Qglic also provides some diphthongs:
take teyk
tyke tuyk
boy boy
sound saond
But, how might this work if we had to throw in one more distinction, that is, talk/tock? As it turns out, the solution is already hiding in Qglic itself with <e> and <ey>. Merely making sure to represent the vowel sound in 'toke' as a diphthong always would free up one symbol for use with the word tock, and in addition it makes the vowel system a bit more consistent.
take teyk
tech tek
toke touk
tock tok
And again, a table:
Qglic        IPA
y w          i u
i u j        ɪ ə/ʌ ʊ
ey r ou      eɪ ɜ˞/ɚ oʊ
e o          ɛ ɔ
a q          æ ɑ
This, of course, covers 12 of the vowels of the 14-vowel system of General American; however, it merges the proposed distinction between /ǝ/ and /ʌ/, and the rhotic equivalents of these.
One of the additional things that has kind of bothered me about Qglic is the use of the apostrophe for the sound /ŋ/ in 'sing'. One option I've come across, aside from writing it with is to use a number-- it might give Qglic a cool flavor, but of course, I see how it sort of goes beyond the goal of using A-Z: sin, si9.
If you know of some other potential distinctions in English that this should cover, feel free to leave a comment. In any case, I've tested this kind of vowel system using RP and my own U.S.-ish English and found it to be quite suitable. For these texts I just found an RP transcription and retranscribed it, and then found the text which was apparently more acceptable for Americans (you'll notice some lexical differences) and transcribed my own speech.
RP
Xu nox wind un xu sun wu dispywti9 witc wuz xu strqngu, wen u travlu keim ulq9 rapt in u wom klujk. Xei ugryd xut xu wun hw fws suksydid in meiki9 xu travlu teik iz klujk qf cjb by kunsidud strqngu xun xy uxu. Xen xu nox wind blw uz hqd uz y kjd but xu mor y blw xu mo klujsly did xu travlu fujld hiz klujk urajnd him un ut lqs xu nox wind geiv up xy utemp xen xu sun cqn ajt womly un imydcutly xu travlu tjk qf iz klujk, un suj xu nox wind wuz ublqidc tu kunfes xut xu sun wuz xu strqngr uv xu tw.
'mercan
Xu norx wind an xu sun wr argywi9 wun dey ubeut witc uv xem wuz strqngr, wen u travlr keym ulong rapt up in an euvrkeut. Xey ugryd xat xu wun hw kjd meyk xu travlr teyk hiz keut qf wjd by kunsidrd strqngr xan xu uxr wun. Xen xu norx wind bliw az hqrd az hy kjd, but xu hqrdr hy bliw, xu tuydr xu travlr rapt hiz keut uraond him; an at last xu norx wind geyv up truying. Xen xu sun bigan tu cayn hqt, an ruyt uwey xu travlr tjk hiz keut qf. And seu xu norx wind had tw admit xat xu sun wuz strqngr xan hy wuz.
I realize people like to spell things with a mind to correctness, because having standards (or things near enough to standards) helps communication. However, one of the things I hear often is that things like they're/their/there need to be spelled correctly because otherwise it changes the meaning of sentences. There may be a couple of cases where you can insert one of these items into the place of another and change the meaning, but they seem to be fairly rare.
One of the points I like to press often, when people say that spelling things wrong makes the writer seem stupid, is that readers are more flexible than we seem to expect. One of the hallmarks of a functionally literate person is being able to interpret written language through spelling errors, typos, or bad typesetting, or even when sentences have been cut and respelled severely to fit within the constraints of an SMS. Might an inability to interpret variation be just as big a problem as an inability (or lack of desire) to spell "properly"?
To illustrate the point with these words, following are 15 sentences. I've replaced all instances of they're, their and there with ðer, since in my dialect these are all pronounced the same. If I (and others like me) manage to understand each other in speech, we obviously should be able to understand each other in writing with these forms spelled the same.
The sentences are randomly selected from COCA. As there are fifteen, you should find at least 5 of each of the three forms, though some sentences have several. The idea was to collect at least 5 search results containing each of the separate forms. Some are fragments, some are full sentences, and some have been trimmed to remove things that were generally tough to read. The hope with the sentence fragments is that they will illustrate even more clearly that we can still succeed in understanding even in compromised situations.
Enjoy!
... many people will still order a cheeseburger, soda and fries, ðer going to focus on that as well, Heidi, really important ...
... the professional services and social interaction that come with cubicle life, ðer crying out for support, not to mention a little chitchat.
... ... which could make them feel ashamed of ðer bodies. But I'm not sure what would be helpful to say or do ...
Real kids don't dress like bankers and fly around in ðer daddy's private jet.
ðer not the only people borrowing. ðer's a stimulus plan in Europe. ðer's a stimulus plan in China.
O'REILLY: Right. ðer going to take from the Social Security fund- ...
(Voiceover) Pruitt Rainey never saw the home gym his dad built for him just below ðer poker room. But his dad kept the promise he made to his son in ...
... a fashionable spa, and maybe even the occasional trip to Niagara, but ðer would certainly have been no Atlantic crossing and most certainly none of the sophisticated social ...
... and he would ask them about school and do his best to stay interested in ðer long-winded, half-baked descriptions. They were lucky when it came to ðer kids,
... although the last two days have been a couple good days. ðer's a sense out ðer that...
... Brooklyn, Detroit and London don't have to share the microphone here with ðer male counterparts. Nor does Peled upstage her subjects. She only adds a few ...
... my God! ðer going to show my hips, ðer going to...' And we brought three examples along to show. The one ...
... moderate jolt. Then ðer's a bit of a sway once you're hanging ðer. When you're actually on the water, ðer's a firmness that did ...
... ain't true. Remember I told you that. ðer only stories. ðer was more: fruits that we don't have no more because they came from
... who is coming for her very first reunion, just as Dorothy is. Though ðer the similarity ends, thank you very much.
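If you'd like to run the same experiment on your own text, the substitution used for the sentences above can be sketched in a few lines of Python. The function name `merge_homophones` is my own invention, and this is just one reasonable way to do it with a regular expression (it handles straight and curly apostrophes, and leaves words like theirs and therefore alone).

```python
import re

# they're / their / there as whole words; the apostrophe may be straight or curly.
PATTERN = re.compile(r"\b(?:they['’]re|their|there)\b", re.IGNORECASE)

def merge_homophones(text, replacement="ðer"):
    """Replace they're, their and there with a single spelling."""
    return PATTERN.sub(replacement, text)

if __name__ == "__main__":
    print(merge_homophones("They're over there with their dog."))
```

Note that there's comes out as ðer's, since only the there part matches, which is how it appears in the sentences above.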
As a test to see how easy it would be to implement Oahpa for a new language, I decided to do so with already existing morphological and syntactic analysis tools and data for Finnish. I hope to spend a little time improving it in the next few weeks, 'cause it'll certainly benefit some Finnish learners out there. :)
Try it here: http://finoahpa.donchaknow.com/oahpa/
Oahpa is a collection of language learning games, which range from inflecting words to vocabulary building, to using word inflections in the context of sentences. For languages with rich inflection this is particularly important, because words are inflected in specific ways for certain types of sentences, which takes learners a while to grasp. For instance, the case of the object of a verb may vary depending on what the verb is: pitää plus partitive means 'hold', while pitää plus elative means 'like'. Similarly, if you say "I feel happy", the form that 'happy' takes differs from the one you use in a sentence like "I am happy."
The original Oahpa games are available in Northern Sámi and (coming soon) Southern Sámi, and they, like the Finnish test version, are based on morphological analysis tools which can generate all word forms given a specific word; as well as syntactic analysis tools which (generally) will mark words as subjects or objects, but also provide much more detailed information such as what verbs agree with. These analysis tools are then used to analyze sentences that learners type to give feedback on common issues, such as verb agreement, case usage and so on.
Although the Finnish Oahpa has only two exercises, I'll post some updates if I carry over more games from Northern Sámi Oahpa. In the meantime, happy inflecting!
Qglic (pronounced Anglish) is a near-phonemic alternative writing system for English. Being near-phonemic, the goal is to have as close to a one-to-one correspondence as possible between sounds in English and the letters used to represent them. One of the benefits of Qglic is that it attempts to do this using only the letters A through Z. You can see a small sample of it following, which is this paragraph but written in Qglic.
Qglic iz ey funymik qltrnutiv ruyti'g sistum for I'glic. Byi'g funymik (or nirly so), xu gol iz tu hav ez klos tw ey wun-tu-wun koruspqnduns bitwyn saondz in I'glic and xu letrz ywzd tu reprizent xu saondz. Wun uv xu benufits ti Qglic iz xat it utemps ti dw xis ywzi'g only xu letrz A xrw Z. Yw kan sy u smol sampul uv it fqloi'g, witc iz xis perugraf but dcist ritun in Qglic.
I discovered Qglic a year or so ago, but recently remembered it and became all excited about it again. Using my newly acquired skills in various language technology applications, I spent some time putting together a simple finite-state machine based on the phonemic rules of Qglic and the CMU Pronouncing Dictionary, which is vast and contains a huge number of words (approximately 133,000). The CMU Pronouncing Dictionary contains pronunciation guides written in Arpabet, which means it's fairly easy to translate it into IPA or, in this case, Qglic.
ABSCOND AE0 B S K AA1 N D
ABSCONDED AE0 B S K AA1 N D AH0 D
ABSCONDING AE0 B S K AA1 N D IH0 NG
ABSCONDS AE0 B S K AA1 N D Z
ABSECON AE1 B S AH0 K AO0 N
ABSENCE AE1 B S AH0 N S
ABSENCES AE1 B S AH0 N S IH0 Z
ABSENT AE1 B S AH0 N T
ABSENTEE AE2 B S AH0 N T IY1
ABSENTEEISM AE2 B S AH0 N T IY1 IH0 Z AH0 M
ABSENTEES AE2 B S AH0 N T IY1 Z
Taking this data, I wrote a short Python script (I'll upload it somewhere at some point soon) to translate the pronunciation guides into Qglic, and then convert the results to a format that can be compiled with the Helsinki Finite State Transducer Technology (HFST):
abscond:abskqnd ennd ;
absconded:abskqndud ennd ;
absconding:abskqndi'g ennd ;
absconds:abskqndz ennd ;
absecon:absukon ennd ;
absence:absuns ennd ;
absences:absunsiz ennd ;
It's a very simple finite-state machine, as far as the amount of effort put into producing it goes. It consists of just a huge list of words in the format english:qglic, which represents the beginning and the end of a path in the machine. The result is very fast: a 385-word article from CNN on Naomi Campbell testifying before a war-crimes tribunal is converted to Qglic in just 0.143 seconds, and the whole of The Importance of Being Earnest translates in about 1.3 seconds.
There are still some issues to work out, such as how I tokenize text: punctuation isn't handled perfectly, which results in more words not being translated... However, since I'm using the CMU database, there are very few words that don't make it through, and if they don't, it's most likely the result of a tokenization error.
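Until the real script is uploaded, here is a rough sketch of what the conversion step might look like. It is not the original script: the mapping table is partial and inferred only from the example entries above (the real Qglic rules cover far more symbols and are probably context-dependent), and the names `ARPABET_TO_QGLIC`, `to_qglic` and `to_lexc_line` are made up for illustration.

```python
# Partial Arpabet-to-Qglic table, inferred only from the example entries above.
ARPABET_TO_QGLIC = {
    "AE": "a", "AA": "q", "AH": "u", "IH": "i", "AO": "o",
    "B": "b", "D": "d", "K": "k", "N": "n", "S": "s", "Z": "z",
    "NG": "'g",
}

def to_qglic(phones):
    """Convert a list of Arpabet symbols (with stress digits) to a Qglic string."""
    out = []
    for phone in phones:
        base = phone.rstrip("012")  # drop the stress digit on vowels
        out.append(ARPABET_TO_QGLIC[base])  # KeyError for symbols not in the table
    return "".join(out)

def to_lexc_line(entry, continuation="ennd"):
    """Turn one dictionary line into an english:qglic entry like those above."""
    word, *phones = entry.split()
    return f"{word.lower()}:{to_qglic(phones)} {continuation} ;"

if __name__ == "__main__":
    print(to_lexc_line("ABSCONDING AE0 B S K AA1 N D IH0 NG"))
```

The continuation name is taken straight from the entries above; everything else about the surrounding file is left out here.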
One of the other problems is that words which are homonymous are not handled ideally now (the first homonym is used always), which results in funny spellings when a word is both a noun and a verb ('The farmers prodúce próduce') but used as the other ('*The farmers próduce prodúce.'). Problems like these could be solved with a few more hours of work implementing already existing technologies to disambiguate between the two words based on sentence-sized contexts. If I get a little more time to work on this, maybe I'll iron those problems out and put some of the larger texts up online that are "translated".
Instead, enjoy a couple paragraphs of Naomi Campbell's court case, which has been cleaned up for punctuation issues that I need to fix. Looking through it otherwise, I see there is at least one other issue. See if you can spot it, or find more! ;)
(cnn) -- Ey dcudc in xu wor kruymz truyl uv formr Luybiryun prezidunt Tcqrlz Teylr haz disuydid xat swprmqdul Neyomy Kambulz testumony in xu keys wil go uhed xrzdey.
Xu specul kort uv Syeru Lyon kunfrmd ti syenen wenzdey xat kambul wil teyk xu stand at xu trubywnul, dispuyt an imrdcunsy mocun xu difens fuyld mundey ti diley hr testumony.
Prqsikywtrz sey Teylr geyv Kambul ey duymund dri'g xu wor in Syeru Lyon, kqntrudikti'g Teylrz testumony xat hy nevr handuld xu precus stonz xat fywuld xu kunflikt.
A lot of noise has been made recently due to a seemingly innocuous tweet from Sarah Palin, in which she defends her neologisms by comparing herself to Shakespeare, one of the English language's most known creators of words. Responses have varied but may be summed up as the following:
Roger Ebert later tweeted and in a tongue-in-cheek manner used Sarah's own neologism to tease her. The humor for me in this is actually that, in criticizing Palin, Ebert has unwittingly admitted that her neologism is acceptable— after all, he managed to use it in a sentence, where the word could be understood.
Either way, these criticisms all seem to stem from people's low regard for Palin's intelligence, or from the idea that language change is bad or wrong. English is a living language, and although it makes me feel funny that I am defending Palin on one issue, this is something she is completely right about, and something I would gladly stake my linguistics degree (and degree in progress) on. Language changes in numerous ways. Words can:
As a brief aside before getting on to the point: I find it useful citing historical roots of things like ask and ain't, because there are people out there who will strongly refudiate the usage of any word if it doesn't seem to be old enough. As if history means everything... But, back to the point.
One of the major problems with the backlash is that if people will happily adopt new words (such as tweet, to use an apt example), why won't they accept refudiate? If the claim is that refudiate is derived from the wrong usage of refute; then why is it okay to take the word tweet meaning roughly 'a noise birds make' and use it to describe the sending of a short 140-character message on twitter.com; and further derive new words from it: tweeps 'a person who uses the aforementioned site'. This said, I know there are people who reject even these neologisms, but the main thing I find fault in is that those who accept that Palin made this message on Twitter do not accept the words she is using despite that they are essentially the same.
What this all boils down to is a language prejudice, or for short: a prejudice. In this case, it is Palin's perceived intelligence which guides decisions on whether the new words are right or wrong. If people accept neologisms in one place, and they do not accept them in other places, there must be a reason for this; and right now, this is the only one I can come up with that seems valid to me as a linguist. Since I know that words change in meaning over time, I see no problem with 'refute' taking on a new meaning. Since I know that new words are derived from existing words with the use of suffixes and prefixes (or even parts of other words, as with facestalker), I do not see 'misrefute' as a problem either; after all, it has a clear meaning which is separate from whichever usage of 'refute' you adopt.
The problem with prejudices like these is that they are often used to fuel even worse situations, such as institutionalized racism. As I have observed in the U.S. (and many other places) some people often treat people with different dialects (regional, social) or speech defects differently merely on the basis of language. This is because we not only use language to communicate ideas, but to communicate who we are— sometimes we have a choice in the latter, and sometimes we simply can't help it. What we can help though is how we react to people based on their use of language. On paper, every language and dialect should be worth the same as every other, yet in reality, the difference between the worth of dialects and languages is all in our heads. We see people use language and dialect not only to communicate and self-identify, but to discriminate.
The situation with Sarah Palin illustrates the latter more than can any indeterminate truths about the usage of words and whether they are 'correct' or not. Perhaps one of the great ironies in this whole situation is that the people who are loudest in criticizing Sarah for this recent language issue are both liberal (and supposedly forward-thinking on issues of equality) and language elitists. Ah, did I just implicate a 'liberal elite'? I guess Sarah can be credited with yet one more neologism...
Some of this is sort of adapted from a comment left elsewhere on the internets for someone asking about imperatives in languages. While musing over the data in Finnish and Northern Sámi, I noticed what appears to be an interesting puzzle: 2nd person imperatives are different from the imperatives formed for all other persons, in that non-2nd person imperatives all appear to be descended from an optative mood, while 2nd person imperatives are morphologically distinct. Perhaps this is analogous to the English imperative strategy, in which the 2nd person imperative is a bare verb stem: Go!, Sleep!; while other persons are formed periphrastically: May he go, Let him sleep.
In Finnish, and closely related languages the second person imperative is formed with a bare verb stem, while other persons and numbers have additional morphemes, most of which include -k- (said by some to be a historical present tense marker).
(1) mennä 'go'; mene-n 'I go'
sg. pl.
1. -- menkäämme
2. mene menkää
3. menköön menkööt
The negative imperative is formed with the help of an auxiliary negative verb, älä (2), which has similar morphology.
(2) sg. pl.
1. -- älkäämme
2. älä älkää
3. älköön älkööt
According to Maija Länsimäki, these ko/kö morphemes are originally from the optative. While this doesn't directly say anything about the plural 1st and 2nd persons, it seems like there's a chance that they are either related by way of optative, or connected to the present marker theory (2nd person imperative of tulla was originally *tulek).
What is just as interesting about this pattern is when the negative verb occurs with other verbs, e.g., don't go:
(3) sg. pl.
1. -- älkäämme menkö
2. älä mene älkää menkö
3. älköön menkö älkööt menkö
The same -ko/-kö appears on the verb. Is this a form of optative agreement, or something else? If these forms are connected, is the -ko/-kö marker found in questions (Nauroiko Mikko? 'Did Mikko laugh?') also related, or is this just a coincidence brought on by the small phoneme inventory in Finnish?
A similar pattern is to be found in Northern Sámi as well, but slightly extended because NS allows for singular, dual and plural number (4). This paradigm is exactly the same for the negative auxiliary (5); however, NS does not have anything similar to the -ko/-kö which occurs on the main verb in negative imperatives (these all occur in one form for all persons and numbers).
(4) mannat 'to go'
sg. du. pl.
1. mann-on mann-u mann-ot
2. mana mann-i mann-et
3. mann-os mann-os-ka mann-os-et
(5) ale 'Neg'
sg. du. pl.
1. allon allu allot
2. ale alli allet
3. allos alloska alloset
Here we see that 2nd person singular offers a bare stem, and that all other non-2nd person imperatives have a round vowel (o/u often alternate in NS, and in precisely this situation) which is specific to these situations only. The availability of dual in the paradigm allows us to see that there is something about 2nd person here that separates it from the other persons: and perhaps this is a difference of mood.
Estonian, as best as I can find, also has a similar pattern in the negative imperative auxiliary; but I can't find out how the main verbs go for non-2nd person imperatives. Anyone...?
(6) minema 'go'
sg. pl.
1 -- --
2 mine minge
3 -- --
(7) ära 'Neg'
sg. pl.
1 -- ärgem
2 ära ärge
3 ärgu ärgu
This at least establishes that this pattern is similar in Finnish, Northern Sámi and Estonian (and apparently English), but what does it mean? One could assume from all of this that 'true' imperatives are restricted only to 2nd person, and other persons may be expressed with other moods for semantic reasons... 2nd person imperatives are only applied directly to the listener from the speaker and are commands, while 1st and 3rd person imperatives may refer to someone perhaps outside of the conversation and as such speakers may only wish for things that these persons may do.
Since I haven't Googled around yet, these are only my musings. May someone reading this come forward with more knowledge!
School's out! Woohoo! Now it's time to get working.
Since finishing exams, I've spent the last week or so working with Constraint Grammar as part of my Google Summer of Code project in machine translation from Finnish to Northern Sámi. It's enlightening and interesting and there's much to learn, but it seems to give me precisely the kind of puzzles that I like to solve. Constraint Grammar is a syntactic formalism developed by Fred Karlsson (the author of the first Finnish grammar book I studied, which quite possibly changed my life) whose essential goal is disambiguating words which are homonymous: they look the same but have separate morphological uses or separate meanings.
An example:
minä lu-i-n kaksi kirja-a
1pSg.Nom READ-Prt-Sg1 TWO BOOK-Part
'I read two books.'
This all makes perfect sense to us, because we know what words are meant; however luin could mean "I read", or "with/by bones". Since the latter meaning is obviously not the one that we want for the sentence, Constraint Grammar provides a rule-based formalism for selecting the intended meaning based on the surrounding context. This isn't easy of course, because one actually needs quite a few rules to produce a fully disambiguated sentence, and natural sentences aren't always as simple as the one given above. Following is the full analysis of each word:
"<minä>"
"minä" Pron Pers Sg Nom
"mikä" Pron Interr Sg Ess
"<luin>"
"lukea" V Act Ind Prt Sg1
"luu" N Pl Ins
"<kaksi>"
"kaksi" Num Card Sg Nom
"<kirjaa>"
"kirja" N Sg Par
"kirjata" V Act Ind Prs Sg3
"kirjata" V Ind Prs ConNeg
"kirjata" V Act Imprt Sg2
As we can see, there are quite a few items that need to be removed (the rules are listed in CG formalism below): the word minä can have its personal pronoun reading chosen because it precedes a verb with 1st person singular marking (line 1237); luin gets its verbal reading selected (as opposed to the 'bone' reading) because it follows a pronoun (line 1645); and finally the reading of kirjaa as 'book+Part' is selected because it follows a numeral (line 1187).
1187: SELECT (Par) (-1C Num) (-1 Nom)
1237: SELECT (Pron "minä") (*1 Sg1 LINK NOT *-1 CLB?) (NOT 1 CLB?)
1645: SELECT (Sg1) (-1C MINA) (-1 Nom)
2094: MAP (@SUBJ>) TARGET Nom (0 WORD LINK *1 (Act))
2109: MAP (@<OBJ) TARGET Par IF (0 WORD LINK *-1 V BARRIER S-BOUNDARY2) ;
2115: MAP (@+FMAINV) TARGET VFIN IF (NEGATE *0 VERB BARRIER S-BOUNDARY2 OR CC) ;
Following this disambiguation, several tags are added for later convenience... One tag, @SUBJ>, tells us that the word is the subject of the sentence, preceding the verb; @+FMAINV tells us that the word is the main verb; @X tells us there is more work to be done yet; and @<OBJ says that the word is an object following its verb. The tags are shortcuts for passing along information to the generation part of the translation, in which words are produced based on the analysis. The full disambiguation is next, but note that the tags and analysis may not be correct yet; I'm just pulling this from the project as-is. Lines beginning with a semicolon (;) are the readings which are dropped from the analysis.
"<minä>"
"minä" Pron Pers Sg Nom @SUBJ> SELECT:1237 MAP:2094
; "mikä" Pron Interr Sg Ess SELECT:1237
"<luin>"
"lukea" V Act Ind Prt Sg1 @+FMAINV SELECT:1645 MAP:2115
; "luu" N Pl Ins SELECT:1645
"<kaksi>"
"kaksi" Num Card Sg Nom @X MAP:2348
"<kirjaa>"
"kirja" N Sg Par @<OBJ SELECT:1187 MAP:2109
; "kirjata" V Act Ind Prs Sg3 SELECT:1187
; "kirjata" V Ind Prs ConNeg SELECT:1187
; "kirjata" V Act Imprt Sg2 SELECT:1187
So, there's more work to be done. As I dig further in, I may post a few recipes if there are tricky problems that arise. I'll be running some newspaper sentences through the grammar to see what additional things need to be worked out; the rules work fine for short sentences, but it may be that they'll not hold up when applied to much more complex sentences. As you can see in the lines of code produced above, there are BARRIERs involved, which delimit the ability of the rule to search its surroundings. More of these will likely pop up as weirder sentences are tested.
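For readers who haven't seen Constraint Grammar before, here is a toy Python rendering of what a rule like `SELECT (Par) (-1C Num) (-1 Nom)` does. This is only my own sketch of the idea, not how a CG compiler actually works: the data layout and the name `select_par_after_num` are assumptions made for this example, and a real CG engine applies many rules, handles BARRIERs and scanning contexts, and keeps track of which rule touched which reading.

```python
# Each cohort is (wordform, [reading, ...]); each reading is (lemma, set_of_tags).
SENTENCE = [
    ("kaksi", [("kaksi", {"Num", "Card", "Sg", "Nom"})]),
    ("kirjaa", [
        ("kirja", {"N", "Sg", "Par"}),
        ("kirjata", {"V", "Act", "Ind", "Prs", "Sg3"}),
        ("kirjata", {"V", "Ind", "Prs", "ConNeg"}),
        ("kirjata", {"V", "Act", "Imprt", "Sg2"}),
    ]),
]

def select_par_after_num(cohorts):
    """Toy version of: SELECT (Par) (-1C Num) (-1 Nom)"""
    result = []
    for i, (form, readings) in enumerate(cohorts):
        kept = readings
        if i > 0:
            prev = cohorts[i - 1][1]
            # -1C Num: every reading of the previous word is a numeral
            careful_num = all("Num" in tags for _, tags in prev)
            # -1 Nom: at least one reading of the previous word is nominative
            any_nom = any("Nom" in tags for _, tags in prev)
            selected = [r for r in readings if "Par" in r[1]]
            # Only apply the rule if it leaves at least one reading behind
            if careful_num and any_nom and selected:
                kept = selected
        result.append((form, kept))
    return result

if __name__ == "__main__":
    for form, readings in select_par_after_num(SENTENCE):
        print(form, [lemma for lemma, _ in readings])
```

Run on the two-word fragment kaksi kirjaa, this keeps only the kirja reading of kirjaa, which matches what rule 1187 does in the output above.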
As it turns out though, the above analysis for this sentence is actually enough to produce a good translation. Once all the words are disambiguated, they're sent off to a generator, which produces the following (with slashes representing dialectal variation):
$ echo "minä luin kaksi kirjaa" | fin-sme
mun/mon lohken guokte girjji/girjje
The sentence also shows the connection between the two languages that the project concerns: if you squint, you can see their relatedness.
I now know some of my summer plans: Google Summer of Code! I just found out I was accepted. The project is to start a machine translation project with Apertium to translate from Finnish to Northern Sámi. The Apertium project is also getting several other GSoC participants in other areas, as well, featuring translation projects from Polish to Czech, and French to Portuguese. In addition, there are other projects to improve and expand Apertium in various ways. If you want to see my proposal, that's available online. If you want to know a little more about machine translation, read on...
There are two major methods of machine translation (Apertium uses a combination of both): statistical and linguistic translation. Google Translate is a well-known example of a statistical machine translation program, which is aided by Google's wealth of texts in various languages. Google Translate works by lining up sentences that are known to match up in translation, and then translates chunk by chunk. In a way, it's like speaking a language by phrasebook: you may say mostly the right thing most of the time, but then some other times you may tell your tobacconist that your hovercraft is full of eels.
The linguistic translation method analyzes words in the source language morpheme by morpheme (the smallest unit of meaning within a word), and then analyzes the word order of the sentences in order to disambiguate and handle what roles the words play in the sentences. After this, a bilingual dictionary is consulted, and the analysis of the sentence in the source language is used to construct the sentence in the target language. This approach is much closer to what it is like to speak a language, in a way; because words are inflected based on linguistic rules, and the grammar is thoroughly consulted to produce the output.
Although linguistic translation seems to be more akin to learning a language and speaking it, that does not mean that it is 100% perfect in the sense of what a bilingual human may provide in translating a novel, for instance; but this is not necessarily one of the immediate goals of machine translation. In order for a machine to flawlessly translate any sentence, it would have to have a more thorough understanding of all of the semantic data associated with words, of when to use which word, and of all the various shades of meaning between words.
We're not there yet, but making progress... So maybe some day, once machines gain enough knowledge to translate flawlessly, they will be able to address us politely as they take over the world.