Over the past several months, I've been working on a Somali morphological analyzer. It's rule-based and built with HFST, so it takes a little bit of work to extend, but it runs quickly and smoothly. Here are some example analyses for various forms of the word baabuur 'truck'.
baabuur
baabuur baabuur+N+Masc+Indef+Sg+Nom
baabuur baabuur+N+Masc+Indef+Sg+Abs
baabuur baabuur+N+Masc+Indef+Sg+Gen
baabuurro
baabuurro baabuur+N+Fem+Pl+Indef*
baabuurradii
baabuurradii baabuur+N+Fem+Def+Pl+Nom+Dist
baabuurradii baabuur+N+Fem+Def+Pl+Abs+Dist
*Note: Somali has gender polarity for some nouns, which switch between Fem. and Masc. from Sg. to Pl. (here baabuur is Masc. in the Sg. and Fem. in the Pl.).
It can of course be turned around to generate word forms too, if you just input the analysis. This is one of the first stages of rule-based machine translation or text tagging, or what have you, for Somali. Once I'm far enough along with the analyzer, and have gone through and worked out the kinks, I'll probably work on disambiguating multiple analyses.
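As a toy illustration of that two-way lookup, the pairs listed above can be stored in a plain Python dictionary and indexed in both directions. This only mimics the idea and is not how the HFST transducer works internally; the names `PAIRS`, `build_tables`, `analyze`, `generate` and `split_analysis` are made up for this sketch.

```python
from collections import defaultdict

# Surface form / analysis pairs copied from the examples above
# (the footnote asterisk is dropped).
PAIRS = [
    ("baabuur", "baabuur+N+Masc+Indef+Sg+Nom"),
    ("baabuur", "baabuur+N+Masc+Indef+Sg+Abs"),
    ("baabuur", "baabuur+N+Masc+Indef+Sg+Gen"),
    ("baabuurro", "baabuur+N+Fem+Pl+Indef"),
    ("baabuurradii", "baabuur+N+Fem+Def+Pl+Nom+Dist"),
    ("baabuurradii", "baabuur+N+Fem+Def+Pl+Abs+Dist"),
]


def build_tables(pairs):
    """Index the pairs in both directions."""
    analyses = defaultdict(list)  # surface form -> analyses
    surfaces = defaultdict(list)  # analysis -> surface forms
    for surface, analysis in pairs:
        analyses[surface].append(analysis)
        surfaces[analysis].append(surface)
    return analyses, surfaces


ANALYSES, SURFACES = build_tables(PAIRS)


def analyze(surface):
    """Return every analysis known for a surface form (possibly none)."""
    return list(ANALYSES.get(surface, []))


def generate(analysis):
    """Return every surface form known for an analysis (possibly none)."""
    return list(SURFACES.get(analysis, []))


def split_analysis(analysis):
    """Split 'lemma+Tag+Tag' into the lemma and its list of tags."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


if __name__ == "__main__":
    print(analyze("baabuurradii"))
    print(generate("baabuur+N+Fem+Pl+Indef"))
    print(split_analysis("baabuur+N+Fem+Def+Pl+Nom+Dist"))
```

Note that `analyze("baabuur")` returns three analyses that differ only in case, which is exactly the kind of ambiguity a later disambiguation step would have to resolve.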
Along the way, I've been compiling a corpus of news articles with which to test the coverage of my analyzer and help extend it... One of the ways I'm working to extend the analyzer with new words is by providing a means to automatically guess which inflectional type a word belongs to. I'm happy to say this is on the way too, but not quite there yet. Either way, the plan is to dump a list of words into the (Python) program, and extend the analyzer with those that pass with flying colors. Of course, there aren't many word categories in the program yet, but I'm fairly confident that I can get decent results.
Following is an example. Each word goes through a list of simple tests, and each test is scored by counting how many of the word's forms fit a phonetic pattern defined for that word category. The arrow marks the best-scoring category.
aalad, aalado, aaladda, aaladdu, aaladdii, aaladaha, aaladuhu, aaladihii
D1F: 8/8 <--
D1M: 5/8
D2M: 4/8
D2F: 4/8
--
geed, geedka, geedku, geedkii, geedo, geedaha, geedihii, geeduhu
D1F: 5/8
D1M: 8/8 <--
D2M: 7/8
D2F: 1/8
--
baabuur, baabuurka, baabuurku, baabuurro, baabuurrada, baabuurradii
D1F: 2/6
D1M: 4/6
D2M: 6/6 <--
D2F: 4/6
--
magac, magaca, magucu, magicii, magacyo, magacyada, magacyadii
D1F: 2/7
D1M: 5/7
D2M: 7/7 <--
D2F: 4/7
--
subax, subaxda, subaxdu, subaxdii, subaxyo, subaxyada, subaxyadii
D1F: 5/7
D1M: 2/7
D2M: 4/7
D2F: 7/7 <--
--
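A minimal sketch of this kind of scoring, under loud assumptions: the real tests are phonetic, whereas this version only checks whether a form is the bare stem plus a suffix from a fixed list. The suffix inventories in `CLASS_SUFFIXES` are hypothetical, read off the aalad and geed forms above, and only two of the categories are covered. "Passing with flying colors" is taken here to mean that exactly one category matches every form.

```python
# Hypothetical suffix inventories for two categories, inferred from the
# aalad and geed examples. The real tests are phonetic, not suffix lists.
CLASS_SUFFIXES = {
    "D1F": {"", "o", "da", "du", "dii", "aha", "uhu", "ihii"},
    "D1M": {"", "o", "ka", "ku", "kii", "aha", "uhu", "ihii"},
}


def score_forms(stem, forms, suffixes):
    """Count the forms that are the stem plus one of the given suffixes."""
    hits = 0
    for form in forms:
        if form.startswith(stem) and form[len(stem):] in suffixes:
            hits += 1
    return hits


def guess_class(forms):
    """Score every category and return (scores, winner or None).

    Assumes the first form in the list is the bare stem. A winner is
    reported only when exactly one category matches all of the forms.
    """
    stem = forms[0]
    scores = {
        name: score_forms(stem, forms, suffixes)
        for name, suffixes in CLASS_SUFFIXES.items()
    }
    perfect = [name for name, hits in scores.items() if hits == len(forms)]
    winner = perfect[0] if len(perfect) == 1 else None
    return scores, winner


if __name__ == "__main__":
    words = [
        ["aalad", "aalado", "aaladda", "aaladdu", "aaladdii",
         "aaladaha", "aaladuhu", "aaladihii"],
        ["geed", "geedka", "geedku", "geedkii", "geedo",
         "geedaha", "geedihii", "geeduhu"],
    ]
    for forms in words:
        scores, winner = guess_class(forms)
        print(", ".join(forms))
        for name, hits in scores.items():
            marker = " <--" if name == winner else ""
            print(f"{name}: {hits}/{len(forms)}{marker}")
        print("--")
```

Words with stem changes (the doubled consonant in baabuurro, the vowel shifts in magucu and magicii) fall outside plain suffix matching, so they get no winner here. That is the conservative outcome for a filter that should only let clear cases into the analyzer.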
And of course, I'm planning on making the source for these programs available once I've cleaned up the code, removed my notes, and written more useful documentation...
#1: Kevin B. Unhammer (14:07) - 9 Jul 2011
Source would be great :-) we've used similar methods, but on a much smaller scale, for Maltese (the mt->he pair in Apertium).
#2: Ryan (19:07) - 10 Jul 2011
Source will be had once I'm good and done. ;) I'm not sure how portable it will be to other languages, but perhaps the general idea will be reusable... Well, let's hope!