Entries tagged with “Programming”

Inserting tones into a toneless text — 4 Nov 2011

As part of my master's thesis, I've been working on a Somali morphological analyzer and a syntactic disambiguator. For anyone reading who doesn't know what these are: software that can tell you what function a word has in a sentence and, when multiple possible functions exist, chooses the correct one from context. In English, for instance, the word 'can' can be either an auxiliary verb or a noun, but we English speakers know which is which when we hear the word in context.

In Somali (as in many languages), some forms that are ambiguous in text would not be ambiguous in speech, thanks to intonation and stress. In Somali specifically, number of nouns (éy 'dog' vs. eý 'dogs') and sometimes gender of nouns (masculine vs. feminine) are marked via tone. It is easy to imagine, then, that generating better-sounding (and grammatically sound) Somali speech from text requires knowing where the tones fall in that text. This is where these analytical tools come in handy. Conveniently, tonal patterns in Somali are mostly rule-based.

Naagta laybreeriga wax ku dhex qoraysa ayaa soo socota.
'The woman who is writing in the library will come.'

After the morphological analyzer runs, we end up with output like the following:

naagta  naag+N+Fem+Sg+Def+Abs+Prox

laybreeriga laybreeri+N+Masc+Sg+Def+Abs+Prox

wax wax+N+Masc+Sg+Indef+Nom
wax wax+N+Masc+Sg+Indef+Gen
wax wax+N+Masc+Sg+Indef+Abs
wax wax+Pron+Indef+Abs

ku  +Nom+Prox
ku  ku+Adp
ku  ku+Pron+Pers+2Sg+Obj

dhex    dhex+N+Fem+Sg+Indef+Gen
dhex    dhex+N+Fem+Sg+Indef+Abs

qoraysa qor+V+Prog+3SgF+Ind+Pres+Red+Abs

ayaa    ayaa+CS+Foc/L+Subj+Null

soo soo+PP+Deic

socota  soco+V+3SgF+Ind+Pres+Red+Abs

A couple of the readings here need to be removed, and this disambiguation is carried out with constraint grammar. Casting out the readings that do not fit the context rewards us with the following analysis:

"<naagta>"
    "naag" N Fem Sg Def Abs Prox 
"<laybreeriga>"
    "laybreeri" N Masc Sg Def Abs Prox 
"<wax>"
    "wax" Pron Indef Abs 
"<ku>"
    "ku" Adp 
"<dhex>"
    "dhex" N Fem Sg Indef Abs 
"<qoraysa>"
    "qor" V Prog 3SgF Ind Pres Red Abs 
"<ayaa>"
    "ayaa" CS Foc/L Subj Null
"<soo>"
    "soo" PP Deic 
"<socota>"
    "soco" V 3SgF Ind Pres Red Abs

... And these disambiguated forms can then be fed back into the morphological analyzer/generator to get the proper tone marking.

naágta laybreériga wax ku dhéx qóraysá ayaa soo socotá

I am a little unsure of the tone marking on dhéx (and in fact, ayaa should probably have a stress-tone on it too, as well as soo), but in any case, this was all carried out automatically, and these things can be fixed. Being able to provide input like this to a text-to-speech program would result in something a little less monotonous and more pleasing to the ear.

As the analysis progresses, it would even be possible to assign places where pauses are necessary, or where the ends of certain clauses are accompanied by boundary tones. ... There are also some other relevant phonological phenomena that could be processed in this manner and included in text-to-speech input.

Now that that's out of the way, does anyone know of some nice, open-source text-to-speech software that is open for use with any language and not just the largest ones?


Somali morphological analysis progress report — 9 Jul 2011

Over the past several months, I've been working on a Somali morphological analyzer. It's rule based, and built with HFST, so it takes a little bit of work to extend, but it runs quite quickly and smoothly. Following are some examples for varying forms of the word baabuur 'truck'.

baabuur
baabuur baabuur+N+Masc+Indef+Sg+Nom
baabuur baabuur+N+Masc+Indef+Sg+Abs
baabuur baabuur+N+Masc+Indef+Sg+Gen

baabuurro
baabuurro       baabuur+N+Fem+Pl+Indef*

baabuurradii
baabuurradii    baabuur+N+Fem+Def+Pl+Nom+Dist
baabuurradii    baabuur+N+Fem+Def+Pl+Abs+Dist

*Note: Somali has gender polarity for some words, which alternate between Fem. and Masc. in Sg. and Pl.

It can of course be turned around to generate word forms too, if you just input the analysis. This is one of the first stages of rule-based machine translation or text tagging, or what have you, for Somali. Once I'm far enough along with the analyzer, and have gone through and worked out the kinks, I'll probably work on disambiguating multiple analyses.

Along the way, I've been compiling a corpus of news articles with which to test the coverage of my analyzer and help extend it. One of the ways I'm now working to extend the analyzer with new words is by providing a means to automatically guess which inflectional type a word belongs to. I'm happy to say this is on the way too, but not quite there yet. Either way, the plan is to dump a list of words into the (Python) program, and extend the analyzer with those that pass with flying colors. Of course, there aren't many word categories in the program yet, but I'm fairly confident that I can get decent results.

Following is an example. The forms of each word go through a list of simple tests, one per word category, and each category is scored by counting how many of the forms fit the phonetic patterns that category expects. The arrow marks the best-scoring category.

aalad, aalado, aaladda, aaladdu, aaladdii, aaladaha, aaladuhu, aaladihii
  D1F: 8/8  <--
  D1M: 5/8 
  D2M: 4/8 
  D2F: 4/8 
--
geed, geedka, geedku, geedkii, geedo, geedaha, geedihii, geeduhu
  D1F: 5/8 
  D1M: 8/8  <--
  D2M: 7/8 
  D2F: 1/8 
--
baabuur, baabuurka, baabuurku, baabuurro, baabuurrada, baabuurradii
  D1F: 2/6 
  D1M: 4/6 
  D2M: 6/6  <--
  D2F: 4/6 
--
magac, magaca, magucu, magicii, magacyo, magacyada, magacyadii
  D1F: 2/7 
  D1M: 5/7 
  D2M: 7/7  <--
  D2F: 4/7 
--
subax, subaxda, subaxdu, subaxdii, subaxyo, subaxyada, subaxyadii
  D1F: 5/7 
  D1M: 2/7 
  D2M: 4/7 
  D2F: 7/7  <--
--
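The scoring idea can be sketched in a few lines of Python. This is only an illustration and not the program described above: the suffix tables are hypothetical, cover just two categories, and treat the first form in the list as the stem, whereas the real tests are phonetic and more involved.

```python
# Hypothetical suffix tables for two categories (illustration only).
SUFFIXES = {
    "D1F": ["", "o", "da", "du", "dii", "aha", "uhu", "ihii"],
    "D1M": ["", "ka", "ku", "kii", "o", "aha", "uhu", "ihii"],
}


def score_forms(forms, suffixes=SUFFIXES):
    """Count, per category, how many forms look like stem + an expected suffix."""
    stem = forms[0]  # assumption: the first form listed is the bare stem
    scores = {}
    for category, endings in suffixes.items():
        expected = {stem + ending for ending in endings}
        scores[category] = sum(1 for form in forms if form in expected)
    return scores


def guess_category(forms):
    """Return the best-scoring category together with all scores."""
    scores = score_forms(forms)
    return max(scores, key=scores.get), scores


aalad = ["aalad", "aalado", "aaladda", "aaladdu", "aaladdii",
         "aaladaha", "aaladuhu", "aaladihii"]
geed = ["geed", "geedka", "geedku", "geedkii", "geedo",
        "geedaha", "geedihii", "geeduhu"]
best_aalad, scores_aalad = guess_category(aalad)
best_geed, scores_geed = guess_category(geed)
```

A word would only be added to the analyzer when one category accounts for every attested form, which is the "flying colors" condition mentioned earlier.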

And of course, I'm planning on making the source available for these programs as I clean up the source, remove my notes, and provide more useful documentation...


Finnish Grammar Exercizes — 22 Feb 2011

As a test to see how easy it would be to implement Oahpa for a new language, I decided to do so with already existing morphological and syntactic analysis tools and data for Finnish. I hope to spend a little time improving it in the next few weeks, 'cause it'll certainly benefit some Finnish learners out there. :)

Try it here: http://finoahpa.donchaknow.com/oahpa/

Oahpa is a collection of language learning games, which range from inflecting words to vocabulary building to using word inflections in the context of sentences. For a language like Finnish this is particularly important, because words are inflected in specific ways in certain types of sentences, and this takes learners a while to grasp. For instance, the case of the object may vary depending on the verb and its meaning: pitää plus partitive means 'hold', while pitää plus elative means 'like'. Similarly, the form that 'happy' takes in "I feel happy" differs from the form it takes in "I am happy."

The original Oahpa games are available in Northern Sámi and (coming soon) Southern Sámi, and they, like the Finnish test version, are based on morphological analysis tools which can generate all word forms given a specific word; as well as syntactic analysis tools which (generally) will mark words as subjects or objects, but also provide much more detailed information such as what verbs agree with. These analysis tools are then used to analyze sentences that learners type to give feedback on common issues, such as verb agreement, case usage and so on.

Although the Finnish Oahpa has only two exercises, I'll post some updates if I carry over more games from Northern Sámi Oahpa. In the meantime, happy inflecting!


Translation to Qglic with Finite-State Technology — 4 Aug 2010

Qglic (pronounced Anglish) is a near-phonemic alternative writing system for English. Being near-phonemic, its goal is to have as close to a one-to-one correspondence as possible between the sounds of English and the letters used to represent them. One of the benefits of Qglic is that it attempts to do this using only the letters A through Z. A small sample follows: it is this paragraph, just written in Qglic.

Qglic iz ey funymik qltrnutiv ruyti'g sistum for I'glic. Byi'g funymik (or nirly so), xu gol iz tu hav ez klos tw ey wun-tu-wun koruspqnduns bitwyn saondz in I'glic and xu letrz ywzd tu reprizent xu saondz. Wun uv xu benufits ti Qglic iz xat it utemps ti dw xis ywzi'g only xu letrz A xrw Z. Yw kan sy u smol sampul uv it fqloi'g, witc iz xis perugraf but dcist ritun in Qglic.

I discovered Qglic a year or so ago, but recently remembered it and became all excited about it again. Using my newly acquired skills in various language technological applications, I spent some time putting together a simple finite-state machine based on the phonemic rules of Qglic and the CMU Pronouncing Dictionary, which is vast (approximately 133,000 words). The CMU Pronouncing Dictionary contains pronunciation guides written in Arpabet, which makes it fairly easy to translate them into IPA or, in this case, Qglic.

ABSCOND  AE0 B S K AA1 N D
ABSCONDED  AE0 B S K AA1 N D AH0 D
ABSCONDING  AE0 B S K AA1 N D IH0 NG 
ABSCONDS  AE0 B S K AA1 N D Z
ABSECON  AE1 B S AH0 K AO0 N
ABSENCE  AE1 B S AH0 N S
ABSENCES  AE1 B S AH0 N S IH0 Z
ABSENT  AE1 B S AH0 N T
ABSENTEE  AE2 B S AH0 N T IY1
ABSENTEEISM  AE2 B S AH0 N T IY1 IH0 Z AH0 M
ABSENTEES  AE2 B S AH0 N T IY1 Z

Taking this data, I wrote a short Python script (I'll upload it somewhere at some point soon) to translate the pronunciation guides into Qglic and then write them out in a lexicon format that can be compiled with the Helsinki Finite-State Transducer Technology (HFST) tools:

    abscond:abskqnd         ennd ;
    absconded:abskqndud             ennd ;
    absconding:abskqndi'g               ennd ;
    absconds:abskqndz               ennd ;
    absecon:absukon         ennd ;
    absence:absuns          ennd ;
    absences:absunsiz               ennd ;
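The conversion step might look roughly like the following Python sketch. It is not the script mentioned above: the mapping table is partial and inferred only from the sample entries shown here, stress digits are simply dropped, and the function names are invented for illustration. The continuation name `ennd` is copied from the listing.

```python
# Partial Arpabet-to-Qglic table, inferred from the sample entries only.
ARPABET_TO_QGLIC = {
    "AE": "a", "AA": "q", "AH": "u", "AO": "o", "IH": "i",
    "B": "b", "D": "d", "K": "k", "N": "n", "NG": "'g",
    "S": "s", "Z": "z",
}


def to_qglic(phones):
    """Map a list of Arpabet symbols to a Qglic spelling, ignoring stress."""
    letters = []
    for phone in phones:
        base = phone.rstrip("012")  # AE0, AE1, AE2 -> AE
        letters.append(ARPABET_TO_QGLIC[base])
    return "".join(letters)


def lexicon_entry(cmu_line, continuation="ennd"):
    """Build one 'english:qglic<TAB>continuation ;' line from a CMU entry."""
    word, *phones = cmu_line.split()
    return f"{word.lower()}:{to_qglic(phones)}\t{continuation} ;"


entry = lexicon_entry("ABSCONDING  AE0 B S K AA1 N D IH0 NG")
```

A phone missing from the table raises a `KeyError`, which is a convenient way to find out which symbols still need a Qglic letter.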

In terms of the effort put into producing it, it's a very simple finite-state machine: just a huge list of word pairs in the format english:qglic, where each pair is a path through the machine from the English spelling on one side to the Qglic spelling on the other. The result is very fast: a 385-word CNN article on Naomi Campbell testifying before a war-crimes tribunal is converted to Qglic in just 0.143 seconds, and the whole of The Importance of Being Earnest translates in about 1.3 seconds.

There are still some issues to work out, such as how I tokenize text: punctuation isn't handled perfectly, which results in more words not being translated. However, since I'm using the CMU dictionary, very few words fail to make it through, and when one does, it's most likely the result of a tokenization error.
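One way to keep punctuation from blocking lookups is to split the text into word, punctuation and whitespace tokens and only look up the words. The sketch below uses a plain Python dictionary as a stand-in for the compiled transducer. The regular expression and the capitalization handling are assumptions for illustration, not what the original script does.

```python
import re

# Words (with internal apostrophes), runs of other characters, runs of whitespace.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^A-Za-z\s]+|\s+")


def translate(text, lexicon):
    """Replace every known word with its Qglic spelling; keep everything else."""
    pieces = []
    for token in TOKEN_RE.findall(text):
        qglic = lexicon.get(token.lower())
        if qglic is None:
            pieces.append(token)  # punctuation, spaces, unknown words
        elif token[0].isupper():
            pieces.append(qglic[0].upper() + qglic[1:])  # keep initial capital
        else:
            pieces.append(qglic)
    return "".join(pieces)


toy_lexicon = {"absence": "absuns", "absconds": "abskqndz"}
result = translate("Absence, absconds!", toy_lexicon)
```

Because unknown tokens pass through unchanged, any English word left in the output points at either a dictionary gap or a tokenization slip.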

One of the other problems is that homographs are not handled ideally at the moment: the first entry is always used. This results in funny spellings when a word that is both a noun and a verb, like the two in 'The farmers prodúce próduce', comes out as the wrong one ('*The farmers próduce prodúce.'). Problems like these could be solved with a few more hours of work implementing already existing technologies to disambiguate between the two words based on sentence-sized contexts. If I get a little more time to work on this, maybe I'll iron those problems out and put some of the larger "translated" texts up online.

Instead, enjoy a couple paragraphs of Naomi Campbell's court case, which has been cleaned up for punctuation issues that I need to fix. Looking through it otherwise, I see there is at least one other issue. See if you can spot it, or find more! ;)

Neyomy Kambul wil testufuy in wor kruymz truyl xrzdey

(cnn) -- Ey dcudc in xu wor kruymz truyl uv formr Luybiryun prezidunt Tcqrlz Teylr haz disuydid xat swprmqdul Neyomy Kambulz testumony in xu keys wil go uhed xrzdey.

Xu specul kort uv Syeru Lyon kunfrmd ti syenen wenzdey xat kambul wil teyk xu stand at xu trubywnul, dispuyt an imrdcunsy mocun xu difens fuyld mundey ti diley hr testumony.

Prqsikywtrz sey Teylr geyv Kambul ey duymund dri'g xu wor in Syeru Lyon, kqntrudikti'g Teylrz testumony xat hy nevr handuld xu precus stonz xat fywuld xu kunflikt.


Constraint Grammar — 7 Jun 2010

School's out! Woohoo! Now it's time to get working.

Since finishing exams, I've been spending the last week or so working with Constraint Grammar as part of my Google Summer of Code project in machine translation from Finnish to Northern Sámi. It's enlightening and interesting and there's much to learn, but it seems to give me precisely the kind of puzzles that I like to solve. Constraint Grammar is a syntactic formalism developed by Fred Karlsson (the author of the first Finnish grammar book I studied, which quite possibly changed my life). Its essential goal is disambiguating word forms that look identical but have separate morphological uses or separate meanings.

An example:

minä lu-i-n kaksi kirja-a

1pSg.Nom READ-Prt-Sg1 TWO BOOK-Part

'I read two books.'

This all makes perfect sense to us, because we know what words are meant; however luin could mean "I read", or "with/by bones". Since the latter meaning is obviously not the one that we want for the sentence, Constraint Grammar provides a rule-based formalism for selecting the intended meaning based on the surrounding context. This isn't easy of course, because one actually needs quite a few rules to produce a fully disambiguated sentence, and natural sentences aren't always as simple as the one given above. Following is the full analysis of each word:

"<minä>"
    "minä" Pron Pers Sg Nom
    "mikä" Pron Interr Sg Ess
"<luin>"
    "lukea" V Act Ind Prt Sg1
    "luu" N Pl Ins
"<kaksi>"
    "kaksi" Num Card Sg Nom
"<kirjaa>"
    "kirja" N Sg Par
    "kirjata" V Act Ind Prs Sg3
    "kirjata" V Ind Prs ConNeg 
    "kirjata" V Act Imprt Sg2

As we can see, quite a few readings need to be removed; the rules that do so are listed in CG formalism below. The word minä has its personal pronoun reading chosen because it precedes a verb with 1st person singular marking (line 1237); luin gets its verbal reading selected (as opposed to the 'bone' reading) because it follows a pronoun (line 1645); and finally the kirjaa 'book+Part' reading is selected because it follows a numeral (line 1187).

1187: SELECT (Par) (-1C Num) (-1 Nom)
1237: SELECT (Pron "minä") (*1 Sg1 LINK NOT *-1 CLB?) (NOT 1 CLB?)
1645: SELECT (Sg1) (-1C MINA) (-1 Nom)

2094: MAP (@SUBJ>) TARGET Nom (0 WORD LINK *1 (Act))
2109: MAP (@<OBJ) TARGET Par IF (0 WORD LINK *-1 V BARRIER S-BOUNDARY2) ;
2115: MAP (@+FMAINV) TARGET VFIN IF (NEGATE *0 VERB BARRIER S-BOUNDARY2 OR CC) ;
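To make the mechanics concrete, here is a toy Python rendering of what rule 1187, SELECT (Par) (-1C Num) (-1 Nom), does to the sentence. It is a sketch of the idea only, not how a CG compiler works: the data structure and function name are invented, and only this one rule is hard-coded.

```python
# Each cohort: (surface form, list of readings); a reading is lemma + tags.
SENTENCE = [
    ("minä", [["minä", "Pron", "Pers", "Sg", "Nom"],
              ["mikä", "Pron", "Interr", "Sg", "Ess"]]),
    ("luin", [["lukea", "V", "Act", "Ind", "Prt", "Sg1"],
              ["luu", "N", "Pl", "Ins"]]),
    ("kaksi", [["kaksi", "Num", "Card", "Sg", "Nom"]]),
    ("kirjaa", [["kirja", "N", "Sg", "Par"],
                ["kirjata", "V", "Act", "Ind", "Prs", "Sg3"],
                ["kirjata", "V", "Ind", "Prs", "ConNeg"],
                ["kirjata", "V", "Act", "Imprt", "Sg2"]]),
]


def select_par_after_numeral(cohorts):
    """Toy version of: SELECT (Par) (-1C Num) (-1 Nom)."""
    result = []
    for i, (form, readings) in enumerate(cohorts):
        previous = cohorts[i - 1][1] if i > 0 else []
        # -1C Num: every reading of the previous word is a numeral.
        careful_num = bool(previous) and all("Num" in r for r in previous)
        # -1 Nom: at least one reading of the previous word is nominative.
        any_nom = any("Nom" in r for r in previous)
        selected = [r for r in readings if "Par" in r]
        if careful_num and any_nom and selected:
            readings = selected  # SELECT keeps the target readings only
        result.append((form, readings))
    return result


disambiguated = dict(select_par_after_numeral(SENTENCE))
```

Only kirjaa changes here, since it is the only word preceded by an unambiguous numeral; minä and luin stay ambiguous until their own rules apply.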

Following this disambiguation, several tags are added for later convenience. @SUBJ> tells us that the word is the subject of the sentence, preceding the verb; @+FMAINV tells us that the word is the main verb; @X tells us there is more work to be done yet; and @<OBJ says that the word is an object following its verb. The tags are shortcuts for passing along information for the generation part of the translation, in which words are produced based on the analysis. The full disambiguation is next, but note that the tags and analysis may not be correct yet; I'm just pulling this from the project as-is. Lines beginning with a semicolon (;) are readings which are dropped from the analysis.

"<minä>"
    "minä" Pron Pers Sg Nom @SUBJ> SELECT:1237 MAP:2094 
;   "mikä" Pron Interr Sg Ess SELECT:1237 
"<luin>"
    "lukea" V Act Ind Prt Sg1 @+FMAINV SELECT:1645 MAP:2115 
;   "luu" N Pl Ins SELECT:1645 
"<kaksi>"
    "kaksi" Num Card Sg Nom @X MAP:2348 
"<kirjaa>"
    "kirja" N Sg Par @<OBJ SELECT:1187 MAP:2109 
;   "kirjata" V Act Ind Prs Sg3 SELECT:1187 
;   "kirjata" V Ind Prs ConNeg SELECT:1187 
;   "kirjata" V Act Imprt Sg2 SELECT:1187

So, there's more work to be done. As I dig further in, I may post a few recipes if tricky problems arise. I'll be running some newspaper sentences through the grammar to see what additional things need to be worked out; the rules work fine for short sentences, but they may not hold up when applied to much more complex ones. As you can see in the rules shown above, there are BARRIERs involved, which limit how far a rule may search its surroundings. More of these will likely pop up as weirder sentences are tested.

As it turns out though, the above analysis for this sentence is actually enough to produce a good translation. Once all the words are disambiguated, they're sent off to a generator, which produces the following (with slashes representing dialectal variation):

$ echo "minä luin kaksi kirjaa" | fin-sme
        mun/mon lohken guokte girjji/girjje

The sentence also shows the connection between the two languages that the project concerns: if you squint, you can see their relatedness.


Google Summer of Code / Apertium - Finnish and Northern Sámi — 27 Apr 2010

I now know some of my summer plans: Google Summer of Code! I just found out I was accepted. The project is to start a machine translation project with Apertium to translate from Finnish to Northern Sámi. The Apertium project is also getting several other GSoC participants in other areas, with translation projects from Polish to Czech and French to Portuguese. In addition, there are other projects to improve and expand Apertium in various ways. If you want to see my proposal, that's available online. If you want to know a little more about machine translation, read on...

There are two major methods of machine translation (Apertium uses a combination of both): statistical and linguistic translation. Google Translate is a well-known example of a statistical machine translation program, which is aided by Google's wealth of texts in various languages. Google Translate works by lining up sentences that are known to match up in translation, and then translating chunk by chunk. In a way, it's like speaking a language by phrasebook: you may say mostly the right thing most of the time, but at other times you may tell your tobacconist that your hovercraft is full of eels.

The linguistic translation method analyzes words in the source language morpheme by morpheme (a morpheme being the smallest unit of meaning within a word), and then analyzes the word order of the sentences in order to disambiguate the words and work out what roles they play in the sentences. After this, a bilingual dictionary is consulted, and the analysis of the sentence in the source language is used to construct the sentence in the target language. This approach is, in a way, much closer to what it is like to speak a language, because words are inflected based on linguistic rules, and the grammar is thoroughly consulted to produce the output.
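The analyze, look up, generate sequence can be sketched with three lookup tables in Python. Everything here is a toy: the tables hold a single Finnish sentence, the target lemmas are assumptions, the source tags are passed through unchanged where a real system would apply transfer rules, and none of this reflects Apertium's actual data formats.

```python
# 1. Analysis: surface form -> (lemma, tags), already disambiguated.
ANALYSES = {
    "minä": ("minä", "Pron Pers Sg Nom"),
    "luin": ("lukea", "V Act Ind Prt Sg1"),
    "kaksi": ("kaksi", "Num Card Sg Nom"),
    "kirjaa": ("kirja", "N Sg Par"),
}

# 2. Bilingual dictionary: Finnish lemma -> Northern Sámi lemma (assumed).
BILINGUAL = {"minä": "mun", "lukea": "lohkat", "kaksi": "guokte", "kirja": "girji"}

# 3. Generation: (target lemma, tags) -> surface form.
GENERATE = {
    ("mun", "Pron Pers Sg Nom"): "mun",
    ("lohkat", "V Act Ind Prt Sg1"): "lohken",
    ("guokte", "Num Card Sg Nom"): "guokte",
    ("girji", "N Sg Par"): "girjji",  # toy: source tags kept as-is
}


def translate(sentence):
    """Analyze each word, swap the lemma, then generate the target form."""
    output = []
    for word in sentence.split():
        lemma, tags = ANALYSES[word]
        target_lemma = BILINGUAL[lemma]
        output.append(GENERATE[(target_lemma, tags)])
    return " ".join(output)


translated = translate("minä luin kaksi kirjaa")
```

The point of the split is that inflection is handled by the analysis and generation tables, so the bilingual dictionary only needs one entry per lemma rather than one per word form.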

Although linguistic translation seems more akin to learning a language and speaking it, that does not mean that it matches what a bilingual human may provide in translating a novel, for instance; but this is not necessarily one of the immediate goals of machine translation. In order for a machine to flawlessly translate any sentence, it would need a more thorough understanding of all of the semantic data associated with words, of when to use which word, and of all the various shades of meaning between words.

We're not there yet, but making progress... So maybe some day, once machines gain enough knowledge to translate flawlessly, they will be able to address us politely as they take over the world.


Minor updates — 29 Dec 2009

Since I'm a poor graduate student, I've added a basic CV, which is accessible either by that link or, if you're not reading this in an RSS reader, in the navigation bar at the upper right. More detailed information is of course available on request.

Also, there were some outstanding things that needed doing with this blog, such as adding pagination, trackbacks, switching over to Mercurial/Hg instead of SVN (Hg seems to just have a better workflow for me). I disabled some entertaining, yet out of date things that I hadn't had time to do the upkeep on, or indeed finish up the starting content (moving country takes work!). Maybe I'll get back to that soon, and reenable it. Blogging about random Finnish words in detail is fun, anyway. :)

Of course, I always wonder why I don't just run some WordPress instance, or just direct my domain at a Blogger account, but it's always more fun to program things on your own when you've got the time, and you always learn a few new and useful things. On the to do list for the break from school is to try out nginx. I've heard it's great for high load things, so I'm curious to do some load-testing with some projects I've been working on that are more computationally intensive. More on one of those later... :)


OpenMinneapolis announced — 16 Dec 2009

If you don't know much about Minneapolis and what government data is available to you, that's no surprise; it's somewhat difficult to find out, and requires reading through state statutes and city ordinances in order to know what you can request. Even then, the process of actually getting the data is no easier unless you have experience with governmental processes.

Having become annoyed with a lack of transparency and open government data in Minneapolis, some friends and I are launching a project, OpenMinneapolis.org, to make data on government meetings, officials, elections and various processes more available to the public and private individuals. You'll even be able to check out attendance records for city council members, and know whether they're doing the job you elected them to do, as well as have access to meeting minutes in an open and usable format (in this case XML). Take a look at the announcement on OpenMinneapolis.org for more details...

We're already getting some press too, and have submitted a grant application to the Knight Foundation, which you can go and vote on. :)


Referrer-based conditional redirects in lighttpd — 16 Nov 2009

I noticed someone was hot-linking to an image of Bonnie Tyler stored on my Bonnie Tyler tribute website: Total Eclipse of the World, so I thought I'd circumvent this somehow. I don't want my precious bandwidth going to unknown hotlinkers.

The solution was to use lighttpd's virtual host definitions to insert a condition saying that if the HTTP referrer is not my domain, then the user should be redirected somewhere else. Now, referrers can of course be spoofed easily, but this is not the point. I may still want people to be able to share a link to an image, but I don't want someone embedding an <img> tag somewhere that gets a ton of hits... Here's what the configuration looks like:

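$HTTP["host"] == "mydomain.com" {
    # Placeholder paths; substitute the real locations for this virtual host.
    server.document-root = "/foo/bar"
    server.errorlog = "/foo/bar"
    accesslog.filename = "/foo/bar"
    server.error-handler-404 = "/foo/bar"

    # Applies when the referer is neither empty nor on (a subdomain of) mydomain.com.
    $HTTP["referer"] !~ "^($|https?://(.*\.)?mydomain\.com)" {
        # Redirect image, CSS and JS requests elsewhere (needs mod_redirect loaded).
        url.redirect = ("^(.*)\.(jpg|gif|png|css|js)" => "http://www.goldenplec.com/wp-content/uploads/2008/10/rick_astley.jpg")
    }
}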
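Before deploying a pattern like this, it can help to try it against a few referers. The Python sketch below reuses the same regular expression; Python's `re` module stands in for lighttpd's regex engine here, which should behave the same for this particular pattern, and the example URLs and the function name are made up.

```python
import re

# Same pattern as in the referer condition above.
ALLOWED_REFERER = re.compile(r"^($|https?://(.*\.)?mydomain\.com)")


def is_redirected(referer):
    """True when the referer fails the pattern, i.e. the redirect applies."""
    return ALLOWED_REFERER.search(referer) is None


checks = {
    "": is_redirected(""),
    "own": is_redirected("http://www.mydomain.com/gallery"),
    "hotlink": is_redirected("http://forum.example.org/thread/1"),
    "lookalike": is_redirected("http://mydomain.com.example.org/"),
}
```

The last case is worth noticing: because nothing anchors the pattern after `mydomain\.com`, a host that merely begins with that name also passes. Requiring a `/` or the end of the string after the domain would tighten it.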


X11 Font Cache, FontExplorer X and Inkscape — 19 Jun 2009

I use FontExplorer X to manage fonts, and by default it maintains an additional directory of fonts outside of the other standard font directories. This is no problem for most OS X applications; however, X11 applications such as Inkscape use X11's font cache, which does not check the FontExplorer X directory for fonts.

When running the font cache updating tool (font_cache), I noticed that it was searching in a directory that didn't exist, ~/.fonts; so, not wanting to figure out how to define which directories font_cache should be looking in, I created a symbolic link from FontExplorer X's library to ~/.fonts:

ln -s FontExplorer\ X/Font\ Library/ .fonts

And then forced font_cache to run and recompile all fonts from the X11 Terminal application:

font_cache -v -f

This did the trick, and now all my fonts show up in Inkscape. I had also attempted previously to create a symlink in a directory that font_cache was checking already, but with no success: it looks like font_cache will not traverse symlinked directories that occur inside a directory it checks (e.g., ~/Library/Fonts/fontexplorer_x_symlink), but it will look in a directory that is itself a symlink (e.g., ~/.fonts/).
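The same pattern, where a symlinked top-level directory is scanned but symlinked subdirectories are skipped, is what Python's `os.walk` does by default. The sketch below demonstrates it in a temporary directory. It says nothing about how font_cache is actually implemented; the directory and file names are made up, and it assumes a system where creating symlinks is permitted.

```python
import os
import tempfile


def visible_files(top):
    """List files found under `top` without following symlinked subdirectories."""
    found = []
    for _dirpath, _dirnames, filenames in os.walk(top):  # followlinks=False
        found.extend(filenames)
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the FontExplorer X library, holding one fake font file.
    library = os.path.join(tmp, "Font Library")
    os.mkdir(library)
    open(os.path.join(library, "Example.ttf"), "w").close()

    # Case 1: a symlink placed inside a directory that gets scanned.
    scanned = os.path.join(tmp, "scanned")
    os.mkdir(scanned)
    os.symlink(library, os.path.join(scanned, "fontexplorer_x_symlink"))

    # Case 2: the scanned directory is itself a symlink.
    dot_fonts = os.path.join(tmp, ".fonts")
    os.symlink(library, dot_fonts)

    inside = visible_files(scanned)
    direct = visible_files(dot_fonts)
```

Here `inside` comes back empty while `direct` lists the font file, mirroring what was observed with ~/Library/Fonts versus ~/.fonts.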

Hopefully anyone googling for a similar issue will find this, but what other ways could one get font_cache to look in other directories? I couldn't find much, but then again I didn't really search all that much either. ;)
