Tuesday, August 28, 2012

N-gram features for text classification

Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering whether n-gram counts could make a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, I got about 20 million features for a trigram model. I then checked the literature and found this paper, which reports that n-gram features don't help much:


A Study Using n-gram Features for Text Categorization, Johannes Furnkranz


Bigram and trigram features may give modest gains, but feature selection is clearly required. Selecting features by document frequency or term frequency would be a simple approach.
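
To make the idea concrete, here is a minimal sketch of n-gram count features pruned by document frequency. It assumes the documents are already tokenized; the function names and the min_df threshold are my own choices for illustration, not something taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=3):
    """Count every 1-gram up to max_n-gram in a single document."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def select_by_document_frequency(docs, max_n=3, min_df=2):
    """Keep only the n-grams that occur in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        # set() so that each document contributes at most 1 per n-gram
        df.update(set(ngram_counts(tokens, max_n)))
    return {gram for gram, freq in df.items() if freq >= min_df}


def featurize(tokens, vocabulary, max_n=3):
    """Map a document to counts over the selected n-grams only."""
    counts = ngram_counts(tokens, max_n)
    return {gram: c for gram, c in counts.items() if gram in vocabulary}


if __name__ == "__main__":
    docs = [
        ["stocks", "fell", "sharply"],
        ["stocks", "fell", "again"],
        ["bonds", "rose"],
    ]
    vocab = select_by_document_frequency(docs, max_n=3, min_df=2)
    print(sorted(vocab))
    print(featurize(["stocks", "fell", "fell"], vocab))
```

Even on this toy corpus, the trigrams each occur in only one document and are dropped, which is the effect one would want at WSJ scale.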


Thursday, August 23, 2012

Origins of the Brahmi Script

This post is motivated by chapter 2 of James Gleick's book '', which discusses the evolution of writing. Brahmi is the mother script from which the scripts of all modern Indian and South-East Asian languages have evolved. It was first seen in Emperor Ashoka's rock edicts dating to the 3rd century B.C. It is thus one of the ancient world's "alphabets", along with Greek, Phoenician and Aramaic. An alphabet is based on the idea that symbols represent phonemes, in contrast to other writing systems such as logographic ones (e.g. Chinese, which employs symbols for words) or syllabic ones (e.g. Japanese, where symbols represent syllables).

All the alphabetic scripts are said to be derived from a single script, the Phoenician. In fact, the very word 'alphabet' comes from the first two symbols in the Greek script, 'Alpha' and 'Beta'. The origin of the Brahmi script is unclear, and the theories fall into two main categories. One holds that Brahmi evolved from the Aramaic script (itself derived from the Phoenician). This is based on proposed orthographic similarities between symbols in the two scripts. (See Figure).

The other theory proposes an indigenous development of the Brahmi script, based on the wide differences in how the writing systems work. I tend to favour this theory, though I must admit that my knowledge of this area is limited to reading a few articles and knowing some of the modern-day descendants of these scripts. The alphabets of modern Indian scripts are organized phonetically, and there is little phonetic ambiguity in them, as opposed to the Roman scripts. The earliest Semitic scripts (Phoenician, Aramaic) and even modern Arabic do not have vowels, whereas the so-called "true" alphabets, Greek and its modern Latin derivatives, still have room for ambiguity. Even if some symbols were borrowed from the Aramaic script, the design seems novel enough to call it a new style of writing. Is there an alternative line of evolution of the script? The Indus Valley script is still undeciphered - could the Brahmi have evolved from there?


Sunday, February 12, 2012

Indian English

From Chandan Mitra's weekly column in the Pioneer, some hilarious examples of English usage:


In a newspaper, describing a case of chain-snatching in which criminals shot dead the man who tried to resist and pursue the chain-snatchers, the reporter stated: “The deceased gave chase to the criminals who, however, managed to escape”!

Police notice: “Take care of belongings. You may be theft”


The article is interesting reading too.
http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html



Saturday, January 14, 2012

Yet Another Moses Installation Guide

Though Moses is a versatile MT system, its installation is still from the stone age. Let me document here some of the key points for navigating the installation of Moses. The intent is not to present a complete installation guide, but to highlight key issues that may crop up (as they cropped up for me). For a complete installation, this is probably the best guide. Another useful installation guide can be found here.

To install the Moses system, the following tools need to be installed. 
  • Language modelling toolkit (SRILM, IRSTLM, etc.)
  • GIZA++ package which contains GIZA++ and mkcls
  • Moses decoder (version 1.0 and above)

SRILM installation
  • The primary installation reference is the INSTALL document that ships with the tool.
  • Install all pre-requisites mentioned in the SRILM installation guide. On Ubuntu I had to install the following packages: csh, g++-multilib, tcl-dev
  • Set the environment variable SRILM to point to the base directory of the install package before building SRILM.
  • Following the instruction manual with the SRILM download should be enough once the pre-requisites are installed.    
  • The problems you may yet face are
    • Problems in identifying the architecture, especially if it is a 64-bit machine. To make sure that the install script correctly identifies the architecture, set the variable MACHINE_TYPE in sbin/machine-type.
    • Problems with TCL compilation. You may not need the TCL user interfaces at all, so it may be fine to simply disable their compilation. Set the variable NO_TCL = X in the file common/your_architecture_specific_makefile.
  • Make sure you have added the $SRILM/bin and $SRILM/bin/$MACHINE_TYPE to the PATH variable
  • Note: SRILM 1.7.1 and above are not compatible with Moses
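
A quick way to confirm the PATH step above is to check that the SRILM binaries can actually be found. This is only a sketch: missing_tools is a helper name I made up, and the tool names ngram-count and ngram are what I would expect an SRILM build to produce, so adjust the list to whatever your build actually installs.

```python
import shutil


def missing_tools(tools, search_path=None):
    """Return the tools that are not found as executables.

    search_path=None means "use the current PATH"; pass a string of
    directories (separated like PATH) to check somewhere else.
    """
    return [t for t in tools if shutil.which(t, path=search_path) is None]


if __name__ == "__main__":
    # Assumed names of SRILM binaries; edit to match your build.
    missing = missing_tools(["ngram-count", "ngram"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All SRILM tools found on PATH")
```

If anything is reported as missing, recheck that both $SRILM/bin and $SRILM/bin/$MACHINE_TYPE are in PATH in the shell you are running from.
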
IRSTLM installation
  • Ubuntu packages required: libtool make autoconf autotools-dev automake
  • The installation is pretty simple, just have to follow the installation guide
  • One caveat: Sometimes you may need to create a directory named 'm4' manually if the first step fails

GIZA++ and mkcls installation
  • You get both if you download the giza-pp tool. 
  • Most straightforward installation. Download and 'make'.
  • Copy the binaries - GIZA++, mkcls, snt2cooc.out to a new directory. 
XMLRPC Server
  • XML RPC Server is required if you want to run a webservice providing translations. If you just want to get Moses running, you can skip this step.
  • Install the following packages: libxmlrpc-core-c3 libxmlrpc-core-c3-dev libxmlrpc-c3-dev libxmlrpc-c++4 libxmlrpc-c++4-dev 
Boost Library
The C++ Boost library is required for installation of Moses. Boost 1.48 has a serious bug which breaks Moses compilation. Unfortunately, some Linux distributions (e.g. Ubuntu 12.04) ship broken versions of the Boost library. To fix this situation you can:
  • For Ubuntu 12.04: Remove Boost 1.48 from your distribution and install Boost 1.46, which is available in the distribution. This works most of the time. If not, build Boost from source as described below.
  • To install Boost manually and make it work with Moses, follow the instructions in the section titled "Manually Installing Boost" on this page: http://www.statmt.org/moses/?n=Development.GetStarted
Moses installation
  • The primary installation reference is the INSTALL document that ships with the tool.
  • SRILM or IRSTLM needs to be installed before Moses is installed
  • Make sure you have installed the packages automake and libtool
  • Boost has to be installed
  • It is then a matter of just following the instructions. The command to be run is
  • /usr/bin/bjam --with-srilm=  --with-xmlrpc-c= --with-boost=
    • If the xml RPC is installed in /usr/bin, then the parameter would simply be '/usr'
    • --with-boost is required only when Boost is installed in a non-standard directory. The path should contain both lib/lib64 and include directories
Now Moses is ready to cross the Red Sea.


Alternative ways of installing Moses

If you fail to install from the source as mentioned above, then there are a couple of simpler alternatives you can try:

One, use the pre-compiled binaries provided by the Moses team: 
The pre-compiled version comes with IRSTLM and does not support XML-RPC to the best of my knowledge. However, it is handy to get started. 

If that too runs into trouble, then you can try using the virtual machine provided by the Moses team. 


If you are using VirtualBox, you can import the OVA images into it.
This guide may be useful for importing OVA images into VirtualBox:
http://www.maketecheasier.com/import-export-ova-files-in-virtualbox/

I have not tried the virtual machine images, so let me know if they work.