A data format for Somali language tools...? — 18 Jul 2011

As previously mentioned, I've been working on a morphological analyzer for Somali in my own time as a means to learn how such things are done. As part of my job, however, I've been working with several applications that are a result of this kind of work, one of which is a language learning website for Southern Sámi, a minority language spoken in Norway.

The website takes lexical data, stored in XML format, and combines it with a morphological analyzer/generator to produce learning exercises where students can practice inflecting words in the various forms necessary to speak South Sámi properly. Some exercises are a more rote form of learning, such as being presented with dictionary forms of words and being told to inflect them into a certain form; others work in context, where the user is presented with a sentence and told to fill in the blank. Since South Sámi has several cases, this is a really necessary exercise. One could compare it to English, where exercises might involve filling in the necessary preposition or pronoun form (I/me/my).

The use of a morphological analyzer in an external application provides a useful opportunity to improve and extend the tools. Quite often, when working on this South Sámi web-app, we find places where the morphological analyzer has a bug preventing words from being generated, and the bug might never have been found if it weren't for the need to generate tons of word forms for a learning application. In essence, every use of the tools leads to improvements for all applications.

Now for Somali

While working on the morphological analyzer for Somali, I had been recording word meanings in the hope that something more would come of it later, but just storing them in the HFST LEXC source files was a fairly messy way of collecting lexical data. Taking a page from Giellatekno, I've decided to collect information in a more easily parseable file format, but the twist here is that I'm also going to use some of these files as a means to compile parts of the morphological analyzer. Thus, I hope to store data in one place and use it for multiple things.

Giellatekno uses XML as their file format of choice for storing lexical data. XML is great and well supported, but I thought I might go with YAML instead, as it is much more human readable: all you have to do is mind your indents and colons, much like Python (my language of choice). One of the other nice things about YAML is that it lets you refer to other parts of the data through anchors and aliases, which saves a lot of time if you have data that needs to be reproduced for each entry. Also, if it turns out in the long run that YAML is really a bad idea for what I want to do, it will be fairly easy to just convert it to XML; I'll just have to find a better way of storing notes on word entries and other repeated information. If it turns out not to work so well, maybe I'll at least be able to provide some good arguments for why XML or YAML is ideal for the applications I've got, and what people working on other languages may want to consider.

I'm still working on my YAML format, but it's mainly based off of the data structure of Giellatekno's XML files, so much so that it would be quite easy for me to use existing Giellatekno applications with my data after a little scripting to convert.
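As a rough idea of what "a little scripting to convert" could look like, here is a minimal sketch that turns one parsed entry (a plain Python dict, shaped like the example entry in this post) into XML using only the standard library. The element and attribute names (entry, lemma, meaning, t, lang, deic) are placeholders made up for illustration and are not Giellatekno's actual schema.

```python
import xml.etree.ElementTree as ET

# A parsed entry as a plain dict, i.e. roughly what a YAML loader would return.
entry = {
    "lemma": "iibso",
    "syntax": {"pos": "V", "val": "TV"},
    "translations": [
        {"eng": "buy", "fin": "ostaa", "syntax": {"deic": "soo"}},
        {"eng": "sell", "fin": "myydä", "syntax": {"deic": "sii"}},
    ],
}


def entry_to_xml(entry):
    """Build an <entry> element. All tag and attribute names are hypothetical."""
    e = ET.Element("entry")
    lemma = ET.SubElement(e, "lemma", pos=entry["syntax"]["pos"])
    lemma.text = entry["lemma"]
    for group in entry["translations"]:
        meaning = ET.SubElement(e, "meaning")
        # Carry the directional particle along if this meaning group has one.
        deic = group.get("syntax", {}).get("deic")
        if deic:
            meaning.set("deic", deic)
        for lang, word in group.items():
            if lang == "syntax":
                continue  # not a translation, handled above
            t = ET.SubElement(meaning, "t", lang=lang)
            t.text = word
    return e


xml_text = ET.tostring(entry_to_xml(entry), encoding="unicode")
print(xml_text)
```

A real converter would map onto whatever structure the target application expects; the point is only that the nesting of lemma, meaning groups and translations carries over directly.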

Here's an example entry to consider:

- lemma: "iibso"
  deriv: "iibs{o}"
  <<: *HFST_V3B
  syntax:
    pos: V
    val: TV
  translations:
    - eng: "buy"
      fin: "ostaa"
      syntax: 
        deic: "soo"
    - eng: "sell"
      fin: "myydä"
      syntax:
        deic: "sii"

- lemma: "joogso"
  ... etc

The dash marks this off as part of a list. Note that quotation marks aren't necessarily required in these instances in YAML, but I've taken to using them just so that I can mark strings as separate from other types of data. The line containing <<: *HFST_V3B refers back to some preset values that are required to mark this word as belonging to a specific inflectional class, specifically for use with the morphological analyzer.
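For anyone unfamiliar with the merge key, its effect can be sketched with plain Python dicts: the keys of the referenced mapping are copied into the entry, and anything the entry defines itself takes priority. The contents of the preset below are invented for illustration, since the actual values behind HFST_V3B aren't shown here.

```python
# Hypothetical preset: the real contents of *HFST_V3B are not shown in this post.
HFST_V3B = {"hfst": {"lexicon": "Verbs", "contlex": "V3B"}}

# The entry's own keys, as written in the YAML.
raw_entry = {
    "lemma": "iibso",
    "deriv": "iibs{o}",
    "syntax": {"pos": "V", "val": "TV"},
}


def merge_presets(presets, entry):
    """Mimic YAML's `<<` merge: preset keys fill in, the entry's own keys win."""
    return {**presets, **entry}


merged = merge_presets(HFST_V3B, raw_entry)
print(sorted(merged))
```

Like the YAML merge key, this is a shallow merge: a nested mapping in the entry replaces the preset's mapping of the same name rather than being combined with it.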

Translations are broken up into meaning groups (a list, with dashes again), and this example provides a sort of puzzle. We have essentially two separate words: sii iibso 'sell' and soo iibso 'buy'. The differentiating factor is a directional particle that says whether the action involved a transfer from the speaker or to the speaker. The lemma is the same, so are these one entry? Or is it the meaning that differs, and that is what should decide?

For now, I'm taking the viewpoint of Somali, which seems to say that soo and sii are separate items that may vary to affect meaning in predictable ways. I expect that they will not always be predictable*, and since their presence is crucial to a specific meaning, they must be included. English and Finnish, on the other hand, just use completely separate words for these concepts, so it's easier for us to decide that these items are really two separate words. Maybe I should do it a different way, and until I figure something out I'm open to suggestions. In any case, it should be simple to find all of the words that I need to update if I switch to a different solution for the problem.

* The predictable meanings of the words are: soo 'motion towards', sii 'motion away', which makes complete sense with iibso
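Finding the affected words later could be as simple as the sketch below, which assumes the entries have already been loaded into a list of dicts with the same keys as the example entry. The function name is made up for illustration.

```python
# Entries as they might look after loading; the data is copied from this post.
entries = [
    {"lemma": "iibso", "translations": [
        {"eng": "buy", "syntax": {"deic": "soo"}},
        {"eng": "sell", "syntax": {"deic": "sii"}},
    ]},
    {"lemma": "boosto", "translations": [{"eng": "post office"}]},
]


def lemmas_with_deictic(entries):
    """Return lemmas that have at least one meaning group tied to a particle."""
    found = []
    for entry in entries:
        groups = entry.get("translations", [])
        if any("deic" in group.get("syntax", {}) for group in groups):
            found.append(entry["lemma"])
    return found


print(lemmas_with_deictic(entries))
```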

To dictionaries

The goal though is that all of this information is stored in one place. I want to produce separate applications, and potentially more down the line, but if I change one word translation or spelling, I want to make sure that I have the opportunity to reflect these changes in all places.
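To make the "one place, several uses" idea concrete, here is a small sketch that derives two different outputs from the same entry: a LEXC-style line (upper:lower followed by a continuation class) and a plain-text dictionary line. The contlex key and its value are stand-ins for whatever the inflectional class presets supply, and pairing lemma with deriv on the LEXC line is an assumption about how those fields would be used.

```python
entry = {
    "lemma": "iibso",
    "deriv": "iibs{o}",
    "contlex": "V3B",  # hypothetical: stands in for what *HFST_V3B supplies
    "syntax": {"pos": "V"},
    "translations": [
        {"eng": "buy", "syntax": {"deic": "soo"}},
        {"eng": "sell", "syntax": {"deic": "sii"}},
    ],
}


def lexc_line(entry):
    """One LEXC-style line: upper:lower ContinuationClass ;"""
    return f'{entry["lemma"]}:{entry["deriv"]} {entry["contlex"]} ;'


def dictionary_line(entry, lang="eng"):
    """One plain-text dictionary line for the given target language."""
    senses = []
    for group in entry["translations"]:
        if lang not in group:
            continue  # no translation into this language for this meaning
        particle = group.get("syntax", {}).get("deic")
        head = f'{particle} {entry["lemma"]}' if particle else entry["lemma"]
        senses.append(f"{head} '{group[lang]}'")
    return f'{entry["lemma"]} ({entry["syntax"]["pos"]}): ' + "; ".join(senses)


print(lexc_line(entry))
print(dictionary_line(entry))
```

Changing a spelling or a translation in the entry changes both outputs the next time they are generated, which is the whole point of keeping a single source.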

One of the main human-interest goals here is to produce a dictionary, available in Somali and several other languages, which contains as much information as is useful to learners, as well as the bare minimum necessary for native speakers. However, another way to look at it is that multilingual dictionaries are really always for learners, whether you're using the Somali->English side or the English->Somali side. I may well decide that I want to find ways to include example sentences from corpora for English words as well... I tend to find these things useful in languages I learn too, and am constantly googling words to see how they work in context.

Further down the line

- lemma: "boosto"
  deriv: "boost{o}"
  <<: *HFST_D6_F
  translations:
    - eng: "post office"
  semantics:
    - "PLACE"
    - "BUILDING"

One of the goals with documenting words like this is to include semantic information about the words in question, because down the line this will help produce other tools, such as machine translators, or perhaps even more learning applications such as Oahpa. Oahpa uses categories like these to construct fill-in-the-blank sentences, and machine translation applications may need these semantic categories for real grammatical uses: for example, verbs expressing states and emotions may have different syntactic patterns from other words (they certainly do in Finnish and Guaraní, and many other languages). I may attempt to absorb some semantic classes from existing word banks to get myself started, but if anyone has suggestions for this, I'd be happy to hear them.
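As a small illustration of how an application might use these categories, the sketch below picks out the lemmas that belong to a given semantic class, for instance to choose candidate words for a blank that needs a place. It assumes entries loaded as dicts with an optional semantics list, as in the boosto example; the function name is hypothetical.

```python
entries = [
    {"lemma": "boosto", "translations": [{"eng": "post office"}],
     "semantics": ["PLACE", "BUILDING"]},
    # An entry without a semantics key is simply never selected.
    {"lemma": "iibso", "translations": [{"eng": "buy"}, {"eng": "sell"}]},
]


def lemmas_in_class(entries, sem_class):
    """Return the lemmas tagged with the given semantic class."""
    return [
        entry["lemma"]
        for entry in entries
        if sem_class in entry.get("semantics", [])
    ]


print(lemmas_in_class(entries, "PLACE"))
```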

Having a fairly well-working morphological analyzer also means that there will be several possible analyses for words. If I want to move down the line to syntactic analysis, the next stage in preparing for machine translation or even grammar correction (think of a word processor that corrects you if your verbs don't agree with the subject) will be syntactic disambiguation. As I'm familiar with Constraint Grammar, this is a likely path for me to take.

Big undertaking

Things like this aren't easy, but the hope is that with a little additional planning from the beginning, the difficult part of the work will at least be collecting the data, not entering it or including it in all of the likely applications. It should also be easy to extend the existing data if I find new pieces of information I want to collect for each word, hopefully without disrupting existing applications.

I always welcome ideas, so drop a comment or an email. Or if you have texts or information you want to contribute, I'm quite happy with that too! Eventually I'll try to find a good way to get more direct contributions from others. I'd like to open the source, but only once I've got it fairly cleaned up, and there is some small amount of data that I'm using for research on my master's project that I want to make sure stays where it needs to be, undisturbed. Opening things up may be a little while down the road, but until then, I'm working on making some of the fruits of this labor available to the public in the form of an online dictionary. It's already up, but I want to do a little more cleanup before I broadcast its location and claim it to be a fully functioning product. So, watch this space...

Posted by:
Ryan
