Scientific report: "Use of Mutual Information Based Character Clusters in Dictionary-less Morphological Analysis of Japanese"

For languages such as Japanese, whose character set is very large and whose orthography does not require spacing between words, tokenizing and part-of-speech tagging are often the difficult parts of morphological analysis. Practical systems tackle this problem primarily with uncontrolled heuristics. Using information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into Decision-Tree Dictionary-less morphological analysis. We have confirmed that using these natural classes significantly improves our morphological analyzer in both tokenizing and tagging Japanese text.
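To make the idea of mutual-information-based character classes more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each character has already been assigned to a class and computes the average mutual information between the classes of adjacent characters, a quantity that clustering procedures of this kind commonly try to keep high while merging classes. The function name `average_mutual_information` and the `cluster_of` mapping are hypothetical names introduced here for illustration.

```python
import math
from collections import Counter


def average_mutual_information(text, cluster_of):
    """Average mutual information (in bits) between the classes of
    adjacent characters in `text`.

    `cluster_of` is a hypothetical mapping from character to class label;
    the abstract does not specify how classes are represented.
    """
    classes = [cluster_of[ch] for ch in text]
    # Count adjacent class pairs (class bigrams).
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts for the left and right positions of a bigram.
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n

    ami = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total
        p_a = left[a] / total
        p_b = right[b] / total
        ami += p_ab * math.log2(p_ab / (p_a * p_b))
    return ami


# Illustrative use: one class per character versus a single shared class.
fine = average_mutual_information("abab", {"a": "A", "b": "B"})
coarse = average_mutual_information("abab", {"a": "X", "b": "X"})
```

Putting every character into one class leaves no information between neighbouring classes, so `coarse` is zero, while keeping the two characters apart gives a positive value. A clustering procedure could compare such scores when deciding which classes to merge; how the paper actually performs the merging is not stated in the abstract.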

