For Chinese, the package includes two simple word segmenters. One is a
lexicon-based maximum match segmenter, and the other uses the parser to
do Hidden Markov Model-based word segmentation. These segmentation
methods are serviceable, but if you want high-quality segmentation of
Chinese text, you will have to segment it yourself as a
preprocessing step. The supplied grammars assume that
Chinese input has already been word-segmented according to Penn
Chinese Treebank conventions. Choosing Chinese with
-tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
makes space-separated words the default tokenization.
To do word segmentation within the parser, give one of the options
-segmentMarkov or -segmentMaxMatch.
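
To illustrate what lexicon-based maximum matching means, and what
space-separated input looks like, here is a minimal sketch in Python. It is
not the parser's own implementation: the function name max_match, the
max_len parameter, and the tiny lexicon are all made up for illustration, and
the output follows this toy lexicon rather than Penn Chinese Treebank
conventions.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative, not the parser's code).

    At each position, take the longest substring (up to max_len characters)
    that appears in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real segmenter would use a large word list.
lexicon = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}

# Join with spaces to get the pre-segmented form the grammars expect.
segmented = " ".join(max_match("我们喜欢自然语言处理", lexicon))
print(segmented)
```

Because the matcher is greedy, it prefers 自然语言 over the shorter entries
自然 and 语言. That kind of choice is where a simple segmenter can diverge
from treebank conventions, which is why external segmentation is recommended
for high-quality results.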