Brief note to anyone else who's looking (and to me, in case I forget). While learning to use nginx, I either misread something or borrowed the wrong information, but I was attempting to set up static content directories within my server definition like so:
server {
    ...
    location /media {
        root /site/media/dir;
        access_log off;
    }
    ...
}
Unfortunately, it wasn't working out for me that way. I tried all manner of things to rule out something simple, such as making sure there were slashes on the end of the directory (nope), or a permissions issue with nginx's user not being able to access the directory (nope!). Finally, I stumbled upon a mailing list posting that led me in the right direction: I needed to use alias instead of root. With root (according to the email thread), nginx appends the request URI to the path you give, so the location you specify (e.g., /media) has to actually exist as a directory inside that root: a request for /media/foo.png is looked up at /site/media/dir/media/foo.png. Oops!
location /media {
    alias /site/media/dir;
    access_log off;
}
Now with an alias instead, everything works fine, and nginx is running like a charm.
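To make the difference concrete, here is a rough sketch in Python of how the two directives map a request onto the filesystem. This is only a simplified model of prefix locations, not nginx's actual implementation; the function names and the example request are made up for illustration.

```python
def resolve_root(root, uri):
    """Simplified model of `root`: the whole request URI is appended."""
    return root + uri


def resolve_alias(location, alias, uri):
    """Simplified model of `alias`: the location prefix is swapped for the alias path."""
    return alias + uri[len(location):]


uri = "/media/css/site.css"  # hypothetical request

print(resolve_root("/site/media/dir", uri))
# /site/media/dir/media/css/site.css  (an extra "media" directory is expected)

print(resolve_alias("/media", "/site/media/dir", uri))
# /site/media/dir/css/site.css
```

So with root, the files would have had to live in /site/media/dir/media/, which is not where I had put them.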
Since I'm a poor graduate student, I've added a basic CV, which is accessible either by that link, or if you're not reading this in an RSS reader, to the upper right in the navigation bar. More detailed information is of course available on request.
Also, there were some outstanding things that needed doing with this blog, such as adding pagination and trackbacks, and switching over to Mercurial (Hg) instead of SVN (Hg just seems to have a better workflow for me). I disabled some entertaining, yet out-of-date things that I hadn't had time to keep up, or indeed to finish the initial content for (moving country takes work!). Maybe I'll get back to that soon and re-enable it. Blogging about random Finnish words in detail is fun, anyway. :)
Of course, I always wonder why I don't just run some WordPress instance, or just direct my domain at a Blogger account, but it's always more fun to program things on your own when you've got the time, and you always learn a few new and useful things. On the to do list for the break from school is to try out nginx. I've heard it's great for high load things, so I'm curious to do some load-testing with some projects I've been working on that are more computationally intensive. More on one of those later... :)
I recently learned a bit more about regular expressions, which allowed me to vastly improve the speed and simplicity of the Northern Sámi syllable parser I wrote about previously. The thing I learned was 'lookaround', which is a form of matching that doesn't actually consume any material. One of the examples in the book (Friedl, 1997) was a way to insert commas into numbers using lookaround, following the standard pattern of splitting every three digits. What follows is a really simple description of regular expression matching with lookaround, and then an explanation of how it applies to syllables and Northern Sámi.
What lookaround does, effectively, is find a position in the text instead of finding text. A simple regular expression substitution without lookaround might look through a text, find every instance of something, and replace it.
    import re
    text = "I'm Lisa. My pet is a cat. I like cats."
    re.sub(r'cat', 'dog', text)
    # Returns: "I'm Lisa. My pet is a dog. I like dogs."
But what if we wanted to insert "fat" in front of every instance of "cat"? We could certainly replace every instance of 'cat' with 'fat cat', or use group matching and replace every instance of 'cat' with 'fat \g<animal>'.
    # The replacement is a raw string so that \g is not treated as a string escape.
    re.sub(r'(?P<animal>cat)', r'fat \g<animal>', text)
    # Returns: "I'm Lisa. My pet is a fat cat. I like fat cats."
In the above example, it's still necessary to put back what you have matched, because otherwise it will be gobbled up by the replacement. For the next example, I'll use a slightly bigger text just to show how lookaround can save breath. Since (?=lookaround) matches a position in the text rather than any characters, there is no need to restate the full match in order to make sure nothing is deleted. In the following example, I'm using lookahead, (?=...), which matches the position just before whatever we tell it to look for. There is also lookbehind, (?<=...), which matches the position just after.
    text = """
    I'm Lisa. My pet is a cat. I like cats.
    I'm Suzie. My pet is a dog. I like dogs.
    """
    re.sub(r'(?=cat|dog)', 'fat ', text)
    # Returns (whitespace aside):
    # "I'm Lisa. My pet is a fat cat. I like fat cats. I'm Suzie. My pet is a fat dog. I like fat dogs."
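For completeness, here is a small lookbehind sketch of my own along the same lines. The pattern matches the position just after "is a ", so only the first cat gets fattened; the variable names are just for illustration.

```python
import re

sentence = "I'm Lisa. My pet is a cat. I like cats."

# (?<=is a ) matches the empty position right after the text "is a ".
result = re.sub(r'(?<=is a )', 'fat ', sentence)
print(result)
# I'm Lisa. My pet is a fat cat. I like cats.
```

Note that Python's lookbehind needs a fixed-width pattern, so something like (?<=is an? ) would be rejected.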
As you can see, being able to match an environment and insert things into it, instead of matching text and re-inserting that text along with the addition, is a lot more useful in some situations. For instance, the example I happened upon involved inserting commas before every group of three digits counted from the right of the number, but only if those three digits were preceded by yet another digit (so commas wouldn't be inserted at the beginning of the number: ,123,456,789.). The moment I saw this example, my eyes lit up, because this is effectively how syllables are parsed.
Syllable parsing is approached by assigning a syllable boundary wherever a certain criterion is met. In the case of Finnish and Northern Sámi, the simplest approach for a large chunk of words is to split them up at every Consonant Vowel (CV) pair from right to left. So, a word like CVCCVCV would be split like the following:
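Here is a sketch of that comma idea in Python. It is my own reconstruction of the technique, not necessarily the exact pattern from the book, and add_commas is just a name picked for illustration.

```python
import re


def add_commas(digits):
    # (?<=\d)         there must be a digit to the left (no leading comma)
    # (?=(?:\d{3})+$) followed by a multiple of three digits up to the end
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)


print(add_commas('123456789'))  # 123,456,789
print(add_commas('1234'))       # 1,234
print(add_commas('12'))         # 12
```

Both halves of the pattern are lookaround, so nothing is consumed and the replacement only has to supply the comma.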
    CVCCVCV
    CVCCV.CV
    CVC.CV.CV
It would be pretty simple, thus, to do this with lookaround, because one could define what counts as a consonant and what counts as a vowel pretty simply in a regular expression-friendly format. For the following examples, I'll keep using C and V, but note that C could be a stand-in for a regular expression set [tpklmn], and V could be a stand-in for [aieou]. It's convenient to use C and V here, but the needs of syllable parsing are sometimes more dependent on what the actual consonants and vowels are, and not the fact that they are simply a consonant or a vowel.
The first syllable splitting rule should then be to match CV and insert a period before it to represent a syllable break:
    >>> import re
    >>> word = 'CVCCVCV'
    >>> print(re.sub(r'(?=CV)', '.', word))
    .CVC.CV.CV
For the sake of neatness, that initial dot can be removed by matching CV only if it is preceded by something else (using lookbehind):
    >>> print(re.sub(r'(?<=[CV])(?=CV)', '.', word))
    CVC.CV.CV
This rule can also handle numerous types of words, including vowel-initial words (VCVCV -> V.CV.CV) and words with codas (CVCCVCVC -> CVC.CV.CVC). Of course, things get a little more complex when you wish to match specific contexts like lists of vowels and lists of consonants, but it can be done. It also shortened the code I had written drastically. Where I was using a ridiculous number of if/then statements, the updated parser just runs every word through the same regular expression and lets a significantly faster and more efficient process handle the splitting.
As a result, the code I had to write for a syllable parser was reduced from nearly 130 lines to 10. Also, much larger sets of data can be handled in a shorter amount of time, so it could easily be used in a language-checking/spell-check tool where it is important to provide quick feedback. For a detailed explanation of one such example, read my previous post on syllable parsing.
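As a rough sketch of what this looks like with real letters instead of C and V, here is the same rule using the small stand-in sets mentioned earlier ([tpklmn] and [aieou]). This is not the actual parser: the names are made up, the test words are just Finnish words I picked that fit those sets, and long vowels and diphthongs are not handled.

```python
import re

CONSONANTS = 'tpklmn'  # stand-in set from the text, far from complete
VOWELS = 'aieou'       # likewise

# A boundary goes before every consonant+vowel pair that is preceded
# by some other letter (so no dot at the start of the word).
BOUNDARY = re.compile(
    r'(?<=[{c}{v}])(?=[{c}][{v}])'.format(c=CONSONANTS, v=VOWELS)
)


def syllabify(word):
    return BOUNDARY.sub('.', word)


for word in ['talo', 'kukka', 'omena', 'lintu']:
    print(syllabify(word))
# ta.lo
# kuk.ka
# o.me.na
# lin.tu
```

The whole parser boils down to one compiled pattern and one substitution per word, which is where the speed-up comes from.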
Note: Syntax highlighting may be non-existent or ugly while I work on adding some syntax highlighting rules to my CSS.
Moving in... So, prepare for little bugs. If anything explodes and gives an error, drop a comment with the URL that was problematic. If other inconsistencies occur, mention those too. I'm working on ironing those out but I only have one set of eyes! Blog posts may be a bit sparse to start with, but check out the Selection of Truly Exciting Finnish Words... I'm populating that with more words than there will be blog entries for a while.
The content of this blog is not necessarily meant to be Northern Sámi-centric, but it just happens to be what I'm working on more lately, as will be explained in future posts. The reason for this is not that I am culturally Northern Sámi myself, but rather that I am a student of linguistics who has taken an interest in this language and its respective culture and language family. I'm basically a big nerd for Finno-Ugric languages, and not ashamed to admit it.
Things may slowly end up getting tweaked through use. For instance, the Selection of Truly Exciting Finnish Words is currently in its infancy, but I expect it to grow. Some words are not as thoroughly populated with interesting tidbits, or are there as a placeholder for more information. Word tags also contain a decent amount of information, for example the consonant gradation and ghost consonant tags. Drop comments where comments are welcome; they'll only help improve things.
The sanasto itself does not store all word forms individually, and instead they are generated by a series of rules. I will be tweaking the underlying code that handles this over the course of time, so for any of you Finnish speakers out there, please tell me if you notice odd inflections, or are aware of additional variation that is available in certain words (e.g., tunturia/tuntureita). Be advised that Standard Finnish may accept one thing, but this may not be true of the wealth of Finnish dialects.
Happy reading and word-sleuthing!