Recent blog entries from Nov, 2011

Searching through morphologically analyzed texts — 8 Nov 2011

I just finished the first draft of a tool that lets me search through a text that is morphologically analyzed on the fly (via XFST). It's rough for now, but quite awesome. Eventually I'll extend it to include morphologically disambiguated analyses such as those that vislcg would provide. For now, it will help me a lot in figuring some things out. It was also another chance to learn something, so it was well worth it even if something like this already exists.

Here's a quick example, though you'll probably have to take my word for it (although the genitive is visibly marked with -ood). I searched for the pattern Num N+Fem+Sg+Indef+Gen, that is, a numeral followed by a feminine noun in the genitive:

(waayo, Yoo'aab iyo reer binu Israa'iil oo dhammu halkaasay iska joogeen intii lix bilood ah, ilaa uu wada jaray wixii lab ahaa ee Edom joogay oo dhan;)

(For six months did Joab remain there with all Israel, until he had cut off every male in Edom:)

--

Markaasaa Axiiyaah qabsaday dharkii cusbaa oo uu qabay, oo wuxuu u kala jeexjeexay laba iyo toban meelood.

And Ahijah caught the new garment that {was} on him, and rent it {in} twelve pieces:

--

Kolkaasuu Yaaraabcaam ku yidhi, Toban meelood qaado, waayo, Rabbiga Ilaaha reer binu Israa'iil ah wuxuu leeyahay, Boqortooyada waan ka xoogayaa Sulaymaan, oo toban qabiil ayaan ku siinayaa,

And he said to Jeroboam, Take thee ten pieces: for thus saith the LORD, the God of Israel, Behold, I will rend the kingdom from the hand of Solomon, and will give ten tribes to thee:

--

Oo wakhtigii Sulaymaan reer binu Israa'iil oo dhan Yeruusaalem boqorka ugu ahaa wuxuu ahaa afartan sannadood.

And the time that Solomon reigned in Jerusalem over all Israel {was} forty years.

The results all come from the Old Testament, which has been pretty useful as a huge text (and as a source of some humorous out-of-context sentences). The translations come from another tool I wrote for searching through aligned texts.
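For the curious, here is a minimal sketch of how a tag-pattern search like the one above might work. It is not the actual tool: the function names are made up, and the two analyses in the toy lexicon are hand-written stand-ins rather than real analyzer output. It assumes every token has already been mapped to a set of `lemma+Tag+Tag` strings.

```python
from typing import Dict, List, Sequence, Set


def tags_of(analysis: str) -> Set[str]:
    """Split 'lemma+Tag1+Tag2' into its set of tags (the lemma is dropped)."""
    return set(analysis.split("+")[1:])


def matches(readings: Set[str], wanted: str) -> bool:
    """True if at least one reading carries every tag named in `wanted`."""
    need = set(wanted.split("+"))
    return any(need <= tags_of(reading) for reading in readings)


def find_pattern(
    tokens: Sequence[str],
    analyses: Dict[str, Set[str]],
    pattern: Sequence[str],
) -> List[int]:
    """Return the start index of every token window that fits `pattern`.

    `analyses` is keyed by lowercased surface form. Tokens without an
    entry simply never match.
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            matches(analyses.get(token.lower(), set()), wanted)
            for token, wanted in zip(window, pattern)
        ):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hand-written stand-ins for analyzer output (assumption, not real data).
    toy_analyses = {
        "lix": {"lix+Num"},
        "bilood": {"bil+N+Fem+Sg+Indef+Gen"},
    }
    tokens = "intii lix bilood ah".split()
    print(find_pattern(tokens, toy_analyses, ["Num", "N+Fem+Sg+Indef+Gen"]))
```

Because the text is not disambiguated yet, a token counts as a match if any one of its readings fits, which is why a search like this can over-match.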

In any case, there are some things left to do before this is "complete", but it's a good start!


Inserting tones into a toneless text — 4 Nov 2011

As part of my master's thesis, I've been working on a Somali morphological analyzer and a syntactic disambiguator. A short introduction for anyone reading who doesn't know what these things are: software that can tell you what the function of a word is in the sentence and, when multiple possible functions exist, chooses the one that is correct from context. In English, for instance, the word 'can' can be both an auxiliary verb and a noun, but we English speakers know which is which when we hear the word in context.

In Somali (as in many languages), some forms that are ambiguous in writing would not be ambiguous in speech, thanks to intonation and stress. In Somali in particular, the number of nouns (éy 'dog' vs. eý 'dogs') and sometimes their gender (masculine vs. feminine) is marked by tone. It is easy to imagine, then, that generating better-sounding (and grammatically sound) Somali speech from text requires knowing where the tones fall in that text. This is where these analytical tools come in handy... And conveniently, tonal patterns in Somali are mostly rule-based. Take the following sentence:

Naagta laybreeriga wax ku dhex qoraysa ayaa soo socota.
'The woman who is writing in the library will come.'

After the morphological analyzer runs, we end up with analyses like the following, which become the input to the disambiguator:

naagta  naag+N+Fem+Sg+Def+Abs+Prox

laybreeriga laybreeri+N+Masc+Sg+Def+Abs+Prox

wax wax+N+Masc+Sg+Indef+Nom
wax wax+N+Masc+Sg+Indef+Gen
wax wax+N+Masc+Sg+Indef+Abs
wax wax+Pron+Indef+Abs

ku  +Nom+Prox
ku  ku+Adp
ku  ku+Pron+Pers+2Sg+Obj

dhex    dhex+N+Fem+Sg+Indef+Gen
dhex    dhex+N+Fem+Sg+Indef+Abs

qoraysa qor+V+Prog+3SgF+Ind+Pres+Red+Abs

ayaa    ayaa+CS+Foc/L+Subj+Null

soo soo+PP+Deic

socota  soco+V+3SgF+Ind+Pres+Red+Abs
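As a rough illustration, a listing like the one above can be read into a table of readings per word form in order to see which words still need disambiguating. This is only a sketch: it assumes each line holds a surface form and one analysis separated by whitespace, and the function names are invented for the example.

```python
from typing import Dict, List


def parse_lookup_output(text: str) -> Dict[str, List[str]]:
    """Group analyses by surface form.

    Assumes one 'surface<whitespace>analysis' pair per line; blank or
    malformed lines are skipped. A form that occurs twice in the text
    gets its readings merged into one entry.
    """
    readings: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        surface, analysis = parts
        readings.setdefault(surface, []).append(analysis.strip())
    return readings


def ambiguous_forms(readings: Dict[str, List[str]]) -> List[str]:
    """Surface forms with more than one reading, in order of appearance."""
    return [form for form, found in readings.items() if len(found) > 1]


if __name__ == "__main__":
    sample = (
        "wax\twax+N+Masc+Sg+Indef+Nom\n"
        "wax\twax+Pron+Indef+Abs\n"
        "\n"
        "ku\t+Nom+Prox\n"
        "ku\tku+Adp\n"
        "\n"
        "soo\tsoo+PP+Deic\n"
    )
    print(ambiguous_forms(parse_lookup_output(sample)))
```

For the full listing, the ambiguous forms would be wax, ku and dhex, since those are the only words with more than one line.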

Several of the readings in the analyzer output above need to be removed, and this disambiguation is carried out with a constraint grammar. Casting out the readings that don't fit the context rewards us with the following analysis:

"<naagta>"
    "naag" N Fem Sg Def Abs Prox 
"<laybreeriga>"
    "laybreeri" N Masc Sg Def Abs Prox 
"<wax>"
    "wax" Pron Indef Abs 
"<ku>"
    "ku" Adp 
"<dhex>"
    "dhex" N Fem Sg Indef Abs 
"<qoraysa>"
    "qor" V Prog 3SgF Ind Pres Red Abs 
"<ayaa>"
    "ayaa" CS Foc/L Subj Null
"<soo>"
    "soo" PP Deic 
"<socota>"
    "soco" V 3SgF Ind Pres Red Abs

These disambiguated forms can then be fed back into the morphological analyzer/generator to get the proper tone marking.
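One way this hand-off might look is sketched below: the constraint grammar readings are turned back into `lemma+Tag+Tag` strings of the kind the analyzer printed earlier. The function name and the exact string format expected by the generator are assumptions made for the example, and the sketch presumes one surviving reading per word form.

```python
import re
from typing import List, Tuple

# A word form line looks like  "<naagta>"
WORDFORM = re.compile(r'^"<(.+)>"$')
# A reading line is indented:      "naag" N Fem Sg Def Abs Prox
READING = re.compile(r'^\s+"([^"]*)"\s*(.*)$')


def cg_to_generator_strings(cg_text: str) -> List[Tuple[str, str]]:
    """Turn CG-style cohorts into (surface, 'lemma+Tag+Tag') pairs."""
    pairs: List[Tuple[str, str]] = []
    surface = None
    for line in cg_text.splitlines():
        wordform = WORDFORM.match(line.strip())
        if wordform:
            surface = wordform.group(1)
            continue
        reading = READING.match(line)
        if reading and surface is not None:
            lemma, tags = reading.groups()
            # Tags are space-separated in the CG listing; rejoin with '+'.
            pairs.append((surface, "+".join([lemma] + tags.split())))
    return pairs


if __name__ == "__main__":
    sample = (
        '"<naagta>"\n'
        '    "naag" N Fem Sg Def Abs Prox \n'
        '"<ku>"\n'
        '    "ku" Adp \n'
    )
    for surface, analysis in cg_to_generator_strings(sample):
        print(surface, analysis)
```

Running every word of the sentence through the generator in this way is what gives the tone-marked line that follows.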

naágta laybreériga wax ku dhéx qóraysá ayaa soo socotá

I am a little unsure of the tone marking on dhéx (and in fact, ayaa should probably have a stress-tone on it too, as should soo), but in any case, this was all carried out automatically, and these things can be fixed. Being able to provide input like this to a text-to-speech program would result in something a little less monotonous and more pleasing to the ear.

As the analysis progresses, it would even be possible to mark where pauses are necessary, or where the ends of certain clauses are accompanied by boundary tones. Other relevant phonological phenomena could also be processed in this manner and included in text-to-speech input.

Now that that's out of the way, does anyone know of some nice open-source text-to-speech software that can be used with any language, not just the largest ones?
