Published Work | Under Progress | Downloads |
I was recently exposed to research during my undergraduate thesis project, which I did at the Language Technologies Research Center in early 2007. Prior to this, I was curious about many things (and still am) and did a lot of technical work, but did not get opportunities to foray into serious science.
My current published work focuses on (or derives from) various aspects of Information Retrieval, Linguistics, Machine Learning, Information Extraction, Machine Translation and Information Theory. A common thread in much of this work is the development of techniques and methods that can function despite limited language resources. Since most Indian languages fall into this resource-scarce category, these techniques have been tested and optimized for some of them as well. To see the full list of published work, click here.
One of the more interesting pieces of work was to develop purely corpus-based techniques to identify language relatedness. Most previous research in historical and phylogenetic computational linguistics uses small, manually developed and semantically aligned word lists to estimate language relatedness. However, such a method can have some drawbacks. We defined several corpus-based measures which only require a sufficiently large corpus of each language. We tested our measures on 10 Indian languages. The results agreed well with linguistic knowledge. We were also able to generate graphs of language closeness, which were close to the real language-wise map of India (see below).
As the popular maxim goes, a picture in this case is worth (at least) 81 numbers.
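To give a rough feel for what a purely corpus-based measure can look like, here is a minimal sketch. It is not one of the measures from the published work: it simply compares character n-gram distributions of two corpora using the Jensen-Shannon divergence, and the function names (`ngram_distribution`, `js_divergence`, `distance_matrix`) are hypothetical. It also assumes the corpora have already been brought into a common script or transliteration, since raw text in different scripts would share almost no n-grams.

```python
import math
from collections import Counter


def ngram_distribution(text, n=3):
    """Relative frequencies of character n-grams in a raw text corpus."""
    text = " ".join(text.split())  # collapse runs of whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2): 0 = identical, 1 = disjoint."""
    div = 0.0
    for g in set(p) | set(q):
        pg, qg = p.get(g, 0.0), q.get(g, 0.0)
        m = (pg + qg) / 2  # mixture probability of this n-gram
        if pg:
            div += 0.5 * pg * math.log2(pg / m)
        if qg:
            div += 0.5 * qg * math.log2(qg / m)
    return div


def distance_matrix(corpora, n=3):
    """Pairwise distances for a dict mapping language name -> corpus text."""
    names = sorted(corpora)
    dists = {name: ngram_distribution(corpora[name], n) for name in names}
    return {(a, b): js_divergence(dists[a], dists[b])
            for a in names for b in names}
```

Calling `distance_matrix` on a dictionary of corpora yields one number per ordered language pair, which is the kind of table that a closeness graph would then summarise visually.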
I am also involved with several other research activities. One of them is to develop AI for the stand-alone version of a proprietary board game. I am also working on applying data mining techniques for a more intuitive and informed user interface. The transliteration algorithm is also being extended to develop a tolerant and broad (user segment wise) input mechanism for Indian languages.
Though it may not count as conventional research, I am also working on modeling web games to generate resources for various applications in Natural Language Processing. More specifically, I want to exploit state-of-the-art Information Extraction systems to model the games as a factoid / trivia based quiz. A game design is in progress. This work is inspired by the GWAP initiative at CMU.
Grammar files, Test and Training sets for the Transliteration algorithm (DATM)
A More Discerning and Adaptable Multilingual Transliteration Mechanism for Indian Languages (IJCNLP 08)
[ZIP, TAR]