Tuesday, August 28, 2012

N-gram features for text classification

Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering whether n-gram counts could make for a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, I got about 20 million features for a trigram model. I checked the literature and found this paper, which reports that n-gram features don't help much:


A Study Using n-gram Features for Text Categorization, Johannes Fürnkranz


Bigram and trigram features may give modest gains, but feature selection is clearly required. Selecting features by document frequency or term frequency would be a simple approach.
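
A minimal sketch of the document-frequency idea, assuming whitespace tokenization and lowercasing. The function names and the `min_df` threshold are hypothetical choices for illustration, not anything taken from the paper: an n-gram is kept only if it occurs in at least `min_df` documents.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Yield every n-gram (as a tuple) of length 1..n_max from a token list."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_by_document_frequency(docs, n_max=3, min_df=2):
    """Return the set of n-grams that occur in at least min_df documents.

    Tokenization is a plain lowercase whitespace split (an assumption
    made to keep the sketch short).
    """
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # set() so that each n-gram is counted once per document
        df.update(set(ngrams(tokens, n_max)))
    return {gram for gram, count in df.items() if count >= min_df}


def featurize(doc, vocabulary, n_max=3):
    """Count the n-grams of one document, keeping only selected features."""
    counts = Counter(ngrams(doc.lower().split(), n_max))
    return {gram: c for gram, c in counts.items() if gram in vocabulary}


docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab = select_by_document_frequency(docs, n_max=3, min_df=2)
features = featurize("the cat sat on the cat", vocab)
```

On a real corpus the threshold would need tuning; raising `min_df` shrinks the feature set quickly because most trigrams occur in only one document.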


Thursday, August 23, 2012

Origins of the Brahmi Script

This post is motivated by chapter 2 of James Gleick's book '', which discusses the evolution of writing. Brahmi is the mother script from which the scripts of all modern Indian and South-East Asian languages have evolved. It was first seen in Emperor Ashoka's rock edicts, dating to the 3rd century B.C. It is thus one of the ancient world's "alphabets" - along with Greek, Phoenician and Aramaic. An alphabet is based on the idea that symbols represent phonemes, in contrast to logographic writing systems (e.g. Chinese, which employs symbols for words) or syllabic ones (e.g. Japanese, where symbols represent syllables).

All the alphabetic scripts are said to be derived from a single script, the Phoenician. In fact, the very word 'alphabet' comes from the first two symbols in the Greek script, 'Alpha' and 'Beta'. The origin of the Brahmi script is unclear, and there are two primary categories of theories. One holds that Brahmi evolved from the Aramaic script (itself derived from the Phoenician). This is based on proposed orthographic similarities between symbols in the two scripts. (See Figure).

The other theory proposes an indigenous development of the Brahmi script, based on the wide differences in how the writing systems work. I tend to favour this theory, though I must admit that my knowledge of this area is limited to reading a few articles and knowing some of the modern-day descendants of these scripts. The alphabets of modern Indian scripts are organized phonetically, and there is little phonetic ambiguity - as opposed to the Roman script. The earliest Semitic scripts (Phoenician, Aramaic) and even modern Arabic do not have vowels, whereas the so-called "true" alphabets, Greek and its modern Latin-derived scripts, still have room for ambiguity. Even if some symbols were borrowed from the Aramaic script, the design seems novel enough to call it a new style of writing. Is there an alternative line of evolution for the script? The Indus Valley script is still undeciphered - could Brahmi have evolved from there?