As a test to see how easy it would be to implement Oahpa for a new language, I decided to do so with already existing morphological and syntactic analysis tools and data for Finnish. I hope to spend a little time improving it in the next few weeks, 'cause it'll certainly benefit some Finnish learners out there. :)
Try it here: http://finoahpa.donchaknow.com/oahpa/
Oahpa is a collection of language learning games, which range from inflecting words, to vocabulary building, to using word inflections in the context of sentences. For heavily inflected languages this is particularly important, because words are inflected in specific ways for certain types of sentences, which takes learners a while to grasp. For instance, the case of the object of a verb may vary depending on what the verb is: pitää plus partitive means 'hold', while pitää plus elative means 'like'. Similarly, if you say "I feel happy", the form that 'happy' takes differs from the one you use in a sentence like "I am happy."
The original Oahpa games are available in Northern Sámi and (coming soon) Southern Sámi, and they, like the Finnish test version, are based on morphological analysis tools which can generate all word forms given a specific word; as well as syntactic analysis tools which (generally) will mark words as subjects or objects, but also provide much more detailed information such as what verbs agree with. These analysis tools are then used to analyze sentences that learners type to give feedback on common issues, such as verb agreement, case usage and so on.
Although the Finnish Oahpa has only two exercises, I'll post some updates if I carry over more games from Northern Sámi Oahpa. In the meantime, happy inflecting!
Some of this is sort of adapted from a comment left elsewhere on the internets for someone asking about imperatives in languages. While musing over the data in Finnish and Northern Sámi, I noticed an interesting puzzle: 2nd person imperatives are different from the imperatives formed for all other persons, in that non-2nd person imperatives all appear to be descended from an optative mood, while 2nd person imperatives are morphologically distinct. Perhaps this is analogous to the English imperative strategy, in which the 2nd person imperative is a bare verb stem: Go!, Sleep!; while other persons are formed periphrastically: May he go, Let him sleep.
In Finnish and closely related languages, the second person imperative is formed with a bare verb stem, while other persons and numbers have additional morphemes, most of which include -k- (said by some to be a historical present tense marker).
(1) mennä 'go'; mene-n 'I go'
sg. pl.
1. -- menkäämme
2. mene menkää
3. menköön menkööt
The negative imperative is formed with the help of an auxiliary negative verb, älä (2), which has similar morphology.
(2) sg. pl.
1. -- älkäämme
2. älä älkää
3. älköön älkööt
According to Maija Länsimäki, these ko/kö morphemes are originally from the optative. While this doesn't directly say anything about the plural 1st and 2nd persons, it seems like there's a chance that they are either related by way of optative, or connected to the present marker theory (2nd person imperative of tulla was originally *tulek).
What is just as interesting about this pattern is when the negative verb occurs with other verbs, e.g., don't go:
(3) sg. pl.
1. -- älkäämme menkö
2. älä mene älkää menkö
3. älköön menkö älkööt menkö
The same -ko/-kö appears on the verb. Is this a form of optative agreement, or something else? If these forms are connected, is the -ko/-kö marker found in questions (Nauroiko Mikko? 'Did Mikko laugh?') also related, or is this just a coincidence brought on by the small phoneme inventory in Finnish?
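For anyone who thinks better in code, the pattern in (2) and (3) can be written as a tiny lookup. This is only a sketch of the tables above: the dictionary and function names are made up for illustration, and the only forms in it are the ones listed in the paradigms.

```python
# Negative auxiliary forms from paradigm (2); there is no 1st person singular.
NEG_AUX = {
    (1, "pl"): "älkäämme",
    (2, "sg"): "älä",
    (2, "pl"): "älkää",
    (3, "sg"): "älköön",
    (3, "pl"): "älkööt",
}


def negative_imperative(person, number, bare_stem="mene", ko_form="menkö"):
    """Return the negative imperative of mennä as laid out in paradigm (3)."""
    aux = NEG_AUX.get((person, number))
    if aux is None:
        raise ValueError("no imperative form for this person/number")
    # Only 2nd person singular keeps the bare stem; everything else takes -kö.
    main = bare_stem if (person, number) == (2, "sg") else ko_form
    return f"{aux} {main}"


for key in NEG_AUX:
    print(key, negative_imperative(*key))
```

Written this way, the asymmetry is easy to see: the 2nd person singular is the one special case, and every other cell is auxiliary plus the -kö form.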
A similar pattern is to be found in Northern Sámi as well, but slightly extended because NS has singular, dual and plural number (4). This paradigm is exactly the same for the negative auxiliary (5); however, NS does not have anything similar to the -ko/-kö which occurs on the main verb in negative imperatives (the main verb has a single form for all persons and numbers).
(4) mannat 'to go'
sg. du. pl.
1. mann-on mann-u mann-ot
2. mana mann-i mann-et
3. mann-os mann-os-ka mann-os-et
(5) ale 'Neg'
sg. du. pl.
1. allon allu allot
2. ale alli allet
3. allos alloska alloset
Here we see that 2nd person singular offers a bare stem, and that all other non-2nd person imperatives have a round vowel (o/u often alternate in NS, and in precisely this situation) which is specific to these situations only. The availability of dual in the paradigm allows us to see that there is something about 2nd person here that separates it from the other persons: and perhaps this is a difference of mood.
Estonian, as best as I can find, also has a similar pattern in the negative imperative auxiliary, but I can't find out how the main verbs go for non-2nd person imperatives. Anyone...?
(6) minema 'go'
sg. pl.
1 -- --
2 mine minge
3 -- --
(7) ära 'Neg'
sg. pl.
1 -- ärgem
2 ära ärge
3 ärgu ärgu
This at least establishes that this pattern is similar in Finnish, Northern Sámi and Estonian (and apparently English), but what does it mean? One could assume from all of this that 'true' imperatives are restricted only to 2nd person, and other persons may be expressed with other moods for semantic reasons... 2nd person imperatives are only applied directly to the listener from the speaker and are commands, while 1st and 3rd person imperatives may refer to someone perhaps outside of the conversation and as such speakers may only wish for things that these persons may do.
Since I haven't Googled around yet, these are only my musings. May someone reading this come forward with more knowledge!
School's out! Woohoo! Now it's time to get working.
Since finishing exams, I've spent the last week or so working with Constraint Grammar as part of my Google Summer of Code project in machine translation from Finnish to Northern Sámi. It's enlightening and interesting and there's much to learn, but it seems to give me precisely the kind of puzzles that I like to solve. Constraint Grammar is a syntactic formalism developed by Fred Karlsson (the author of the first Finnish grammar book I studied, which quite possibly changed my life). Its essential goal is disambiguating words which are homonymous, that is, words which look the same but have separate morphological uses or separate meanings.
An example:
minä lu-i-n kaksi kirja-a
1pSg.Nom READ-Prt-Sg1 TWO BOOK-Part
'I read two books.'
This all makes perfect sense to us, because we know what words are meant; however luin could mean "I read", or "with/by bones". Since the latter meaning is obviously not the one that we want for the sentence, Constraint Grammar provides a rule-based formalism for selecting the intended meaning based on the surrounding context. This isn't easy of course, because one actually needs quite a few rules to produce a fully disambiguated sentence, and natural sentences aren't always as simple as the one given above. Following is the full analysis of each word:
"<minä>"
"minä" Pron Pers Sg Nom
"mikä" Pron Interr Sg Ess
"<luin>"
"lukea" V Act Ind Prt Sg1
"luu" N Pl Ins
"<kaksi>"
"kaksi" Num Card Sg Nom
"<kirjaa>"
"kirja" N Sg Par
"kirjata" V Act Ind Prs Sg3
"kirjata" V Ind Prs ConNeg
"kirjata" V Act Imprt Sg2
As we can see, there are quite a few readings that need to be removed (the rules that do this are listed in CG formalism below): the word minä can have its personal pronoun reading chosen because it precedes a verb with 1st person singular marking (line 1237); luin gets its verbal reading selected (as opposed to the 'bone' reading) because it follows a pronoun (line 1645); and finally kirjaa 'book+Part' is selected because it follows a numeral (line 1187).
1187: SELECT (Par) (-1C Num) (-1 Nom)
1237: SELECT (Pron "minä") (*1 Sg1 LINK NOT *-1 CLB?) (NOT 1 CLB?)
1645: SELECT (Sg1) (-1C MINA) (-1 Nom)
2094: MAP (@SUBJ>) TARGET Nom (0 WORD LINK *1 (Act))
2109: MAP (@<OBJ) TARGET Par IF (0 WORD LINK *-1 V BARRIER S-BOUNDARY2) ;
2115: MAP (@+FMAINV) TARGET VFIN IF (NEGATE *0 VERB BARRIER S-BOUNDARY2 OR CC) ;
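To make the SELECT rules a bit more concrete, here is a rough Python sketch of what rule 1187 does to the cohorts above. It is not how a real Constraint Grammar parser is implemented: the Reading and Cohort classes and the select function are my own hypothetical stand-ins, and the sketch only handles the two kinds of context test used in that rule (-1 and -1C).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    tags: tuple


@dataclass
class Cohort:
    wordform: str
    readings: list


def context_holds(cohorts, i, offset, tag, careful):
    """Check one context test relative to position i.

    careful=True mimics the C in -1C: every reading must carry the tag.
    careful=False mimics plain -1: at least one reading must carry it.
    """
    j = i + offset
    if j < 0 or j >= len(cohorts):
        return False
    hits = [tag in r.tags for r in cohorts[j].readings]
    return bool(hits) and (all(hits) if careful else any(hits))


def select(cohorts, target, contexts):
    """Keep only readings tagged `target` wherever all context tests hold."""
    for i, cohort in enumerate(cohorts):
        keep = [r for r in cohort.readings if target in r.tags]
        if not keep or len(keep) == len(cohort.readings):
            continue  # nothing to select, or nothing to discard
        if all(context_holds(cohorts, i, *ctx) for ctx in contexts):
            cohort.readings = keep


# The ambiguous analysis of "minä luin kaksi kirjaa" from above.
sentence = [
    Cohort("minä", [Reading("minä", ("Pron", "Pers", "Sg", "Nom")),
                    Reading("mikä", ("Pron", "Interr", "Sg", "Ess"))]),
    Cohort("luin", [Reading("lukea", ("V", "Act", "Ind", "Prt", "Sg1")),
                    Reading("luu", ("N", "Pl", "Ins"))]),
    Cohort("kaksi", [Reading("kaksi", ("Num", "Card", "Sg", "Nom"))]),
    Cohort("kirjaa", [Reading("kirja", ("N", "Sg", "Par")),
                      Reading("kirjata", ("V", "Act", "Ind", "Prs", "Sg3")),
                      Reading("kirjata", ("V", "Ind", "Prs", "ConNeg")),
                      Reading("kirjata", ("V", "Act", "Imprt", "Sg2"))]),
]

# Rule 1187: SELECT (Par) (-1C Num) (-1 Nom)
select(sentence, "Par", [(-1, "Num", True), (-1, "Nom", False)])

for cohort in sentence:
    print(cohort.wordform, [r.lemma for r in cohort.readings])
```

Only kirjaa is affected here: the word before it is unambiguously a numeral and has a nominative reading, so the three kirjata readings are dropped. The other cohorts stay ambiguous until their own rules apply.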
Then following this disambiguation, several tags are added for later convenience... One tag, @SUBJ>, tells us that the word is the subject of the sentence, preceding the verb; @+FMAINV tells us that the word is the main verb; @X tells us there is more work to be done yet; and @<OBJ says that the word is an object following its verb. The tags are shortcuts for passing along information for the generation part of the translation, in which words are produced based on the analysis. The full disambiguation is next, but note that the tags and analysis may not be correct yet; I'm just pulling this from the project as-is. Lines beginning with a semicolon (;) are those which are dropped from the analysis.
"<minä>"
"minä" Pron Pers Sg Nom @SUBJ> SELECT:1237 MAP:2094
; "mikä" Pron Interr Sg Ess SELECT:1237
"<luin>"
"lukea" V Act Ind Prt Sg1 @+FMAINV SELECT:1645 MAP:2115
; "luu" N Pl Ins SELECT:1645
"<kaksi>"
"kaksi" Num Card Sg Nom @X MAP:2348
"<kirjaa>"
"kirja" N Sg Par @<OBJ SELECT:1187 MAP:2109
; "kirjata" V Act Ind Prs Sg3 SELECT:1187
; "kirjata" V Ind Prs ConNeg SELECT:1187
; "kirjata" V Act Imprt Sg2 SELECT:1187
So, there's more work to be done. As I dig further in, I may post a few recipes if there are tricky problems that arise. I'll be running some newspaper sentences through the grammar to see what additional things need to be worked out; the rules work fine for short sentences, but it may be that they'll not hold up when applied to much more complex sentences. As you can see in the lines of code produced above, there are BARRIERs involved, which delimit the ability of the rule to search its surroundings. More of these will likely pop up as weirder sentences are tested.
As it turns out though, the above analysis for this sentence is actually enough to produce a good translation. Once all the words are disambiguated, they're sent off to a generator, which produces the following (with slashes representing dialectal variation):
$ echo "minä luin kaksi kirjaa" | fin-sme
mun/mon lohken guokte girjji/girjje
The sentence also shows the connection between the two languages that the project concerns: if you squint, you can see their relatedness.
In almost every discussion of language variation in the world, Finnish comes up due to its wealth of case endings. There are languages that have even more, such as Hungarian, and most languages have fewer. Of the languages I've learned that mark cases with suffixes, I'd say that on average they have somewhere around 6 cases (excluding, of course, languages very closely related to Finnish, which have a similar number). I've learned some with slightly more, and some with slightly fewer. This number of cases is what usually gets Finnish labeled as "difficult", but this difficulty doesn't make Finnish impossible to learn.
Here I've attempted to collect some of my own internal reasonings based on the various things I've read over the course of time I've learned Finnish. As such, a lot of the material I don't remember my source for... Except for this source, a brief grammatical reference, which I used to check the latinate case names which I often forget. I'll list some additional resources below, as I remember or come across them, for those interested. One of the great resources is Hakulinen's Suomen kielen rakenne ja kehitys 'Structure and Development of the Finnish Language', but sadly I don't have access to it right now to check some of my memories and theories.
The symposium was quite a success, I thought. It was the first I attended, and it was nice to see what's going on and what others are researching. The subject matter of the talks ranged widely, touching on graphic design, the successes and failures behind the development of writing systems, the varying web presence of Inari Sámi and Skolt Sámi, the syntax and semantics of reflexive pronouns, tone and intonation, and issues involving machine translation and lexical databases.
One of the things that I felt was an important theme was that when doing academic research, it is important to return something to the communities you are working with. Otherwise, if you're working with endangered languages, how can you help them improve their situation? Some research produces results that are usable by these communities, and some research does not. Of course, it always means extra work to do so, but isn't it good to have an effect on things?
Anyway, brief update, but, here are some projects and resources to check out for those interested. I'll add some more things from time to time, as they come up.
I've been on a Hungarian music kick lately, listening more to hits from the 30s to the 50s, and so naturally I had to learn some. I dusted off a book I'd picked up on Hungarian for Finnish language learners, thinking that the Uralic connection might mean that things were organized in a way I can learn more efficiently. It seems to be true so far, and I'm finding some interesting things about Hungarian syntax, particularly the verbal prefixes which are discussed here. All following examples are taken from Unkarin kielioppi (Csepregi, 1991). Note that in the following glosses, only definiteness is marked in verb conjugations, otherwise assume that verbs not marked with +Def are indefinite.
Back in one of my phonology classes while I was working on my undergraduate degree in Linguistics, I wrote a paper that was an Optimality Theoretic account of stress-based coda strengthening in Northern Sámi. Although the OT model worked perfectly for the data set I had collected, I was somewhat unhappy with such a small data set and wished to prove my point on a larger scale. In the couple of years since then, I've gained some better programming skills, and the internet has changed so that more data is available. As such, what follows is an extension of this previous paper. At some point I might track that down, revise it and post it, but since that will likely be a little while, what follows will be a more complete discussion of the phenomenon that assumes less prior knowledge.
In order to process large amounts of data, I created a syllable parser based on some rules I had made for a programmatic account of Standard Finnish. A syllable parser for Northern Sámi could play a couple of roles. As an analytical tool, it can handle larger amounts of data in a shorter amount of time than a human working at this task. It could also be used predictively, and could aid language/spell checking, translation or localization. In localization, the syllable parser could make sure that the right suffix is applied to the right kind of word.
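To give an idea of what such a parser looks like, here is a minimal sketch in Python using two textbook rules for Standard Finnish: a syllable boundary falls before any consonant + vowel sequence, and between two vowels that form neither a long vowel nor a diphthong. This is not the parser described here, and it says nothing about Northern Sámi, which would need its own inventory and rules. The function name and the simplifications (no loanword clusters, no compound boundaries, diphthongs allowed in any syllable) are my own assumptions.

```python
VOWELS = set("aeiouyäö")
DIPHTHONGS = {
    "ai", "ei", "oi", "ui", "yi", "äi", "öi",
    "au", "eu", "iu", "ou", "ey", "iy", "äy", "öy",
    "ie", "uo", "yö",
}


def syllabify(word):
    """Split a Finnish word into syllables, joined with dots (simplified)."""
    w = word.lower()
    boundaries = []
    seen_vowel = False
    nucleus_len = 0  # vowels collected so far in the current syllable nucleus
    for i, ch in enumerate(w):
        if ch in VOWELS:
            pair = w[i - 1] + ch if i > 0 else ""
            if nucleus_len == 1 and (w[i - 1] == ch or pair in DIPHTHONGS):
                nucleus_len = 2  # long vowel or diphthong stays together
            elif nucleus_len >= 1:
                boundaries.append(i)  # hiatus: this vowel starts a new syllable
                nucleus_len = 1
            else:
                nucleus_len = 1
            seen_vowel = True
        else:
            # A consonant directly before a vowel opens a new syllable,
            # unless it belongs to the word-initial onset.
            if seen_vowel and i + 1 < len(w) and w[i + 1] in VOWELS:
                boundaries.append(i)
            nucleus_len = 0
    parts, start = [], 0
    for b in boundaries:
        parts.append(w[start:b])
        start = b
    parts.append(w[start:])
    return ".".join(parts)


for word in ["kadulla", "odotetaan", "maailma", "lukea"]:
    print(word, "->", syllabify(word))
```

With syllables in hand, a localization tool could count them or inspect the final one before choosing a suffix, which is the kind of check mentioned above.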
I was reading a book on Estonian noun and verb inflection (Mürk, 1997), which is filled with some interesting stuff, but one mention of Estonian overlength and its interaction with allomorphy caught my eye, so here are some notes. The following post is going to use an altered orthography (to be explained below).
Estonian has a system of three contrasts in quantity: short, long and overlong. Finnish, Italian and numerous other languages by comparison have two, short and long (usually marked in the orthographies by doubling, e.g. <kk>). Estonian is surprising in this regard; however, the third quantity, as it turns out, may be more a question of stress or intonation, and it's usually marked by some sort of pitch accent (unsure on specifics). As far as I know, some studies have shown that Estonian speakers have more difficulty distinguishing words when the only difference is the duration of the consonants or vowels, and require pitch cues. Also, overlong segments or syllables are usually accompanied by stress. There are also other questions of lenis and fortis, but I don't want to get into that since it's not really important to the problem at hand.
Regardless of this, I'll be marking duration in the following way: short consonants and vowels will be marked by one character: <t>, <p>, <k>, <a>; long with two: <tt>, <pp>, <oo>; over-long with three: <ttt>, <rkk>, <mmp>, <uii>. For stops, this means I won't be using the traditional orthographic method of <b>, <d>, <g> to mark the short stops, rather <p>, <t>, <k>.
The interesting morphophonological variation alluded to above is with partitive plural morphemes and how they behave when overlong syllables come into question. The book claimed that they count for an extra syllable (it may thus be an issue of moras), and I was skeptical at first, but then I collected some more data from the book, and it was rather surprising. So, I want to see what anyone else here with phonological tendencies thinks.
First, some words in which length isn't the question, but mere syllable count. In the following examples (1-6), you can see that words with two syllables in the genitive singular end up with the suffix -sit, while words with three syllables end up with -it.
| | English | Gen Sg. | Part. Pl. |
|---|---|---|---|
| (1) | 'airplane' | .len.nu.ki. | .len.nu.keit. |
| (2) | 'horse' | .ho.pu.se. | .ho.pu.seit. |
| (3) | 'beard' | .ha.pe.me. | .ha.pe.meit. |
| (4) | 'disease' | .tõ.ve. | .tõ.pe.sit. |
| (5) | 'storehouse' | .ai.ta. | .ait.ta.sit. |
| (6) | 'lip' | .mo.ka. | .mo.ka.sit. |
There are obviously a couple other things going on in the data such as consonant gradation (a system of lenition formerly triggered by morphological situations in which open syllables become closed). Otherwise, it looks pretty clear that the genitive singular forms that have three syllables correlate to the application of the partitive plural suffix -it, while genitive singular forms with two syllables correspond with partitive plural suffix -sit.
The following data shows situations in which the only way to explain the application of the suffix -it implies something is up with the syllable count. According to the book, an overlong segment implies the presence of two syllables, so I'll mark that in the examples below.
| | English | Gen Sg. | Part. Pl. |
|---|---|---|---|
| (7) | 'bush' | .põõ.õ.sa. | .põõ.õ.sait. |
| (8) | 'speck' | .tap.p.pe. | .tap.p.peit. |
| (9) | 'cabbage' | .kap.p.sa. | .kap.p.sait. |
| (10) | 'alert' | .erk.k.sa. | .erk.k.sait. |
| (11) | 'window' | .ak.k.na. | .ak.k.nait. |
| (12) | 'edifice' | .ho.o.ne. | .ho.o.neit. |
| (13) | 'tooth' | .ham.m.pa. | .ham.m.pait. |
The book also implies that this only works when the overlong segments in question are in the last or second to last syllable, which would correspond with main stress. On the other hand, the assignment of stress in the partitive plural forms in the examples here is not different, so the issue cannot necessarily be 100% stress. Also, if the above examples were to be treated as words with two syllables, they would receive the suffix -sit.
Anyway, what else could it be? I guess I'd suspect a moraic analysis of this would be clearer, and would thus remove the need for this weird syllable analysis. On the other hand, the syllable analysis shows that all the partitive plural forms 'like' to be three syllables at most. Interesting problem. Now if only Estonian marked its length/whatever contrast more clearly in the standard orthography, I think I'd be set. Until then, I just need to learn a crap-load of words.
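The generalization so far is simple enough to write down as a couple of lines of Python. This is just a sketch of the pattern in examples (1-13), using the dotted syllable notation from the tables; the function names are mine, the `.tapp.pe.` form is my own hypothetical "plain" syllabification for comparison, and treating everything with three or more syllables as taking -it is an assumption, since the data only contains two- and three-syllable forms.

```python
def syllable_count(form):
    """Count syllables in a form written with dots, e.g. '.len.nu.ki.'."""
    return len([s for s in form.split(".") if s])


def partitive_plural_suffix(gen_sg):
    """Pick the suffix from the syllable count of the genitive singular."""
    # Assumption: anything longer than three syllables also takes -it.
    return "-it" if syllable_count(gen_sg) >= 3 else "-sit"


examples = {
    "airplane": ".len.nu.ki.",               # three syllables
    "disease": ".tõ.ve.",                    # two syllables
    "speck (overlong counted)": ".tap.p.pe.",
    "speck (plain count)": ".tapp.pe.",      # hypothetical, for comparison
}

for gloss, form in examples.items():
    print(gloss, form, partitive_plural_suffix(form))
```

The last two rows are the interesting ones: the same word only gets the attested -it if the overlong segment is allowed to count as a syllable of its own.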
I've been doing a little studying/reading some of the Kven language, so I thought I'd share some samples.
The Kven are a recognized minority in northern Norway and their language has some official status in some places. They arrived in Northern Norway in the beginning of the 1700s, coming from Northern Finland. The language that they speak is most closely related to Finnish varieties found in northern Finland and Sweden (such as meänkieli). A sample of written Kven is available here in a document on switching to Digital TV.
The most noticeable feature of the language when you take a gander at the above-linked document is the fact that they use the letter đ (as in Northern Sámi), which represents an interdental fricative. This is one of the features that survived in Kven that didn't make it into Standard Finnish (the Rauma dialect of Finnish still has it, unless that too has now completely gone away). Thus, Kađula ođotethaan 'People wait in the street/there is waited in the street' (cf. Finnish: kadulla odotetaan).
Kven is also a variety of Balto-Finnic which has retained intervocalic -h-, although unlike in Karelian dialects of Finnish where the -h- is still intervocalic, -h- in Kven either metathesizes with previous voiced consonants, or follows them and works as the onset of the syllable:
kirkhoon 'into the church' (Kar. kirikköh, Fin. kirkkoon < *kirkkohon)
miehleen 'into mind' (Kar. mieleh, Fin. mieleen < *mielehen)
Norhjaan/Norjhaan 'into Norway' (Kar. Norjah, Fin. Norjaan < *Norjahan)
I have been unable to find samples of this pronounced yet, so I can't say if this is purely a resyllabification or if there is some assimilation with -h- and surrounding consonants involving voicing, which would be quite cool. How else are words like tukholhmaan treated, [.tuk.hol.@h.maan.], [.tuk.holh.maan.], [.tuk.hol.hmaan.] or [.tuk.hol̥.maan]? I'd really be curious how the syllable template in Kven handles this.
Another feature is that sometimes infinitives are overtly marked with a final consonant. This is somewhat preserved in speech in Finnish (but not in the orthography), where one sometimes finds an assimilating glottal stop (haluan mennäk kotiin/ostaat tuolin 'I want to go home/buy a chair.'), but it is preserved in Kven as a final -t (at least according to the orthography): mie haluun kattoot TV:tä 'I want to watch TV.'
Kven also seems to like 'strengthening' short stressed CV syllables, much like Sámi and apparently some dialects of Swedish and Norwegian that have been influenced by the same thing as Sámi, so apparently this is an areal phenomenon up there (although it does happen elsewhere in Finnish): pittäät 'to hold/like' (cf. Finn.: pitää). This seems to be governed by more than just syllable weight, though, so I'll not comment more than to point it out. If I can find some recordings some time, or just some speakers, it might be fun to figure it out.
A month or so ago, I talked to a friend of mine who works at Facebook, and as a result, a new localization option was opened in Facebook's Translations application: Northern Sámi. Some of you might ask what Northern Sámi is, so before I talk about the project, here's a quick introduction to the vital details in a format that is less intensive with regards to linguistic terminology.
Northern Sámi is spoken in Northern Scandinavia by an estimated 15,000 - 35,000 people (depending on who you ask). It is a Finno-Ugrian language, so its better-known relatives are Finnish and Hungarian, though neither is especially closely related. If you were to compare the relation of Finnish and Northern Sámi to that of Indo-European languages, you might say that Finnish is to Portuguese as Northern Sámi is to Russian. Northern Sámi is most closely related to about 8 to 10 other Sámi languages which are also spoken around Northern Scandinavia and the Kola Peninsula of Russia. Of these languages, Northern Sámi has the most speakers.
If forced to pick a few interesting points about the language, I would have to go with the following:
Dual numbers — Northern Sámi contains verb conjugations and pronouns that describe 'we two', 'you two' and 'they two', in addition to the singular and plural. The following examples show this, but also show that English only differentiates between singular and plural.
Márit lea gávpis.
'Márit is at the store.'
Márit ja Máhtte leaba gávpis.
'Márit and Máhtte are at the store.'
Márit, Máhtte ja Elle leat gávpis.
'Márit, Máhtte and Elle are at the store.'
Detailed terminology for reindeer and snow. A good summary is available in this PDF.
A three-way contrast between consonant and vowel length.
Interdentals! Ththththththththtthththththththth. There aren't a lot of Finno-Ugrian languages that have these sounds. In fact, the only other language variety I can think of right now outside of Northern Scandinavia with interdentals is a version of the Rauma dialect of Finnish (Southwest Finland) as spoken by now elderly speakers. Interdental consonants (like in 'think') used to be more prominent in Finnic languages about a thousand years ago, but have since become less common.
The Facebook internationalization project in Northern Sámi has grown to 25 translators since it began. Some of them are highly active in providing translations, and some of them are highly active in voting on translations to make sure that the best translation "wins". Recently, the project reached a new phase (translating phrases), which has been going much faster than even the first phase (establishing a glossary of terminology), even though this second phase contains much more work. While I cannot predict how long this second phase will take, I can say that (copy/pasting) there are 23,796 phrases left to translate as of this date.
The reason I feel that a Northern Sámi-localized version of Facebook is important is because Facebook is about keeping people in touch with each other. What better a way to accomplish this, than to do it in Facebookers' own languages? Not only that, but this goal becomes immediately more awesome when it is also improving the usefulness of a minority language to its speakers. This is important for the survival of a language, because in order for a language to survive a language must continue to be useful to its speakers, and they must want to speak it. Part of this is maintaining prestige, and part of this is making sure that the language can continue to be used in a changing and globalizing environment.
In this case, Facebook is just a piece of the puzzle, and part of a more general point: since it is a prominent social networking site (which is constantly gaining users, and has an active population larger than Russia), it is naturally an important part of some people's methods of keeping in touch. If this one resource is available to users in their own language, this service has increased the usefulness of that language and reduced the need to interact with that service in another, non-native language. With more services and media (books, news, TV, etc.) becoming available, a language has an even better chance at surviving.
Now, Northern Sámi isn't as endangered as some of its closest relatives (or just plain isn't endangered), but the availability of Facebook in Northern Sámi can serve as a sign that something like this is just as possible for other minority languages too.
If you're interested in participating, check out Facebook's Translations application.
Update (3/9/9): Since the original time of posting, the number of translators working on this project has just about doubled. w00t!