
With the rapid development of information technology, the volume of digital documents is growing quickly, and the need to find useful information in masses of unstructured data is increasingly clear. Current search techniques are based on keyword search, so the accuracy of the keywords directly affects the accuracy of the search results. Keywords are mostly obtained through a language-specific morphological analyzer (MA). Among the many languages, Chinese, Japanese and Korean receive particular attention: not only are these three languages increasingly important, but they also share the characteristic of having no clear divisions between words and phrases (English words, by contrast, are separated by spaces). There is therefore an urgent need for an MA that can handle these three languages accurately.
Advanced and efficient
Provide state-of-the-art language segmentation technologies
Rule and configuration customization helps improve accuracy
The segmentation precision is over 93%.
Segment more than 20,000,000 words per minute.
Flexibility and Stability
Provide well-organized system dictionaries
Support multiple platforms, including 32-bit and 64-bit Windows and Linux
Support multithreading
Support large volumes of data
Strong Scalability
Support user-defined phrases, such as industry-specific dictionaries
Support various encodings such as GB18030, GB2312, UTF8, EUC-JP, EUC-KR, etc.
Support development APIs for external programs
Information Retrieval
Full text retrieval
Natural Language Processing
Machine translation
Provide the APIs for other language processing systems
Content identification and analysis
Information extraction
Automatic text classification
Sophisticated spam filtering
Data mining
N-Best Results
All the MAs can return N-Best segmentation results for a sentence
Each result has a probability for further processing
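
As an illustration of what N-Best output with per-result probabilities can look like, here is a minimal sketch. It is not the product's API or algorithm: the function name `n_best_segmentations`, the toy dictionary `WORD_PROB` and its probabilities are all hypothetical, and the score is simply the unnormalized product of unigram word probabilities.

```python
import heapq
import math

# Hypothetical toy unigram probabilities (assumption for illustration only);
# a real analyzer would rely on a large system dictionary and a richer model.
WORD_PROB = {
    "研究": 0.3,
    "研究生": 0.2,
    "生命": 0.2,
    "生": 0.05,
    "命": 0.05,
    "起源": 0.2,
}


def n_best_segmentations(text, word_prob, n=3):
    """Return up to n (words, score) pairs, highest score first."""
    # best[i] keeps up to n (log_score, words) candidates that cover text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(end):
            word = text[start:end]
            if word not in word_prob:
                continue
            # Extend every partial segmentation that ends where this word starts.
            for log_score, words in best[start]:
                candidates.append(
                    (log_score + math.log(word_prob[word]), words + [word])
                )
        best[end] = heapq.nlargest(n, candidates, key=lambda c: c[0])
    return [(words, math.exp(log_score)) for log_score, words in best[len(text)]]


results = n_best_segmentations("研究生命起源", WORD_PROB, n=3)
for words, score in results:
    print(" / ".join(words), f"{score:.5f}")
```

A downstream component can then pick the top result or re-rank all N candidates using its own evidence, which is the kind of further processing the per-result probability makes possible.
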
Part of Speech (POS) Tagging
Provide advanced context-based POS tagging algorithms
Appropriate POS tag sets for each language
Ambiguity Identification
Provide advanced recognition algorithms that effectively reduce word ambiguities
Provide configuration customization to reduce some ambiguities
User Dictionary
Users can define their own new words
User dictionaries are easy to add and remove
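
To show why user-defined words matter, the sketch below segments the same text with and without a user dictionary. It uses greedy forward maximum matching, a simple textbook technique chosen here for brevity and not necessarily what the MAs use; the `segment` function and both dictionaries are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word (up to max_len
    characters); characters not covered by any word become single tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


system_dict = {"机器", "学习", "模型"}  # hypothetical system dictionary
user_dict = {"机器学习"}                # hypothetical industry term added by a user

before = segment("机器学习模型", system_dict)
after = segment("机器学习模型", system_dict | user_dict)
print(before)
print(after)
```

Adding the user dictionary keeps the domain term as a single word, and removing it again restores the original segmentation.
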
Support Multiple Encodings
Support GB18030, GB2312, UTF8, EUC-JP and EUC-KR encodings
Additional encodings can be supported if necessary
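
An external program that feeds text in several of these encodings typically decodes it to Unicode first. The following is a minimal sketch using only Python's standard `codecs` module; the helper `to_unicode` and the `SUPPORTED` set are assumptions for illustration and are not part of the product's API.

```python
import codecs

# Canonical Python codec names for the encodings listed above.
SUPPORTED = {
    codecs.lookup(name).name
    for name in ("GB18030", "GB2312", "UTF8", "EUC-JP", "EUC-KR")
}


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode raw bytes to str if the encoding is one of the listed ones."""
    codec = codecs.lookup(encoding)  # raises LookupError for unknown names
    if codec.name not in SUPPORTED:
        raise ValueError(f"unsupported encoding: {encoding}")
    return raw.decode(codec.name)


samples = [
    ("中文", "GB2312"),
    ("日本語", "EUC-JP"),
    ("한국어", "EUC-KR"),
]
for text, encoding in samples:
    raw = text.encode(encoding)
    print(encoding, to_unicode(raw, encoding))
```

Looking the name up through `codecs.lookup` means spellings such as `UTF8`, `utf-8` and `EUC-JP` are all accepted without a hand-written alias table.
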




