Background
With the rapid development of information technology, the volume of digital documents is growing just as quickly. The need to find useful information in masses of unstructured data is becoming increasingly clear. Current search techniques are based on keyword search, so the accuracy of the keywords directly affects the accuracy of the search results. Keywords are mostly obtained through a language morphological analyzer (MA). Among the many languages, Chinese, Japanese and Korean are receiving particular attention. Not only are these three languages increasingly important, but they are also all written without clear boundaries between words, unlike English, where words are separated by spaces. There is therefore an urgent need for an MA that can handle these three languages accurately.

Advanced and Efficient

Provide state-of-the-art language segmentation technologies
Rule and configuration customization helps improve accuracy
The segmentation precision is over 93%.
Segment more than 20,000,000 words per minute.

Flexibility and Stability

Provide well-organized system dictionaries
Support multiple platforms, including 32-bit and 64-bit Windows and Linux
Support multithreading
Support large data volumes

Strong Scalability

Support user-added phrases, such as industry-specific dictionaries
Support various encodings such as GB18030, GB2312, UTF-8, EUC-JP, EUC-KR, etc.
Provide development APIs for external programs

Information Retrieval

Full text retrieval

Natural Language Processing

Machine translation
Provide APIs for other language processing systems

Content identification and analysis

Information extraction
Automatic text classification
Sophisticated spam filtering
Data mining


N-Best Results

All the MAs return N-best segmentation results for a sentence
Each result carries a probability for use in further processing
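
To make the idea of N-best results concrete, here is a minimal sketch in Python. It is not the product's algorithm or API: the word list, the scores in WORD_PROB, and the function names are all hypothetical, and the score of a segmentation is simply the product of toy per-word probabilities.

```python
import heapq
from math import prod

# Hypothetical toy word probabilities, chosen only for illustration.
WORD_PROB = {
    "研究": 0.3,
    "研究生": 0.2,
    "生命": 0.2,
    "生": 0.05,
    "命": 0.05,
    "起源": 0.2,
}


def segmentations(text):
    """Yield every way to split text into words found in WORD_PROB."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in WORD_PROB:
            for rest in segmentations(text[end:]):
                yield [word] + rest


def n_best(text, n):
    """Return up to n (score, words) pairs, highest score first."""
    scored = (
        (prod(WORD_PROB[w] for w in words), words)
        for words in segmentations(text)
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    for score, words in n_best("研究生命起源", 2):
        print("/".join(words), f"{score:.5f}")
```

With these toy scores, the top candidate for 研究生命起源 is 研究/生命/起源 and the runner-up is 研究生/命/起源. A downstream component can use the attached scores to pick between the two readings instead of committing to a single split.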

Part of Speech (POS) Tagging

Provide advanced context-based POS tagging algorithms
Provide appropriate POS tag sets for different languages

Ambiguity Identification

Provide advanced recognition algorithms that effectively reduce word ambiguities
Provide configuration customization to reduce some ambiguities

User Dictionary

Users can define their own new words
User dictionaries are easy to add and remove
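
The effect of a user dictionary can be sketched as follows. This is an illustrative toy, not the product's interface: the Segmenter class, its method names, and the sample words are assumptions, and forward maximum matching stands in for whatever segmentation method the MAs actually use.

```python
class Segmenter:
    """Toy forward-maximum-matching segmenter with a removable user dictionary."""

    def __init__(self, system_words):
        self.system_words = set(system_words)
        self.user_words = set()

    def add_user_word(self, word):
        self.user_words.add(word)

    def remove_user_word(self, word):
        self.user_words.discard(word)

    def segment(self, text):
        vocab = self.system_words | self.user_words
        max_len = max((len(w) for w in vocab), default=1)
        words, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    words.append(piece)
                    i += size
                    break
        return words


if __name__ == "__main__":
    seg = Segmenter(["计算", "平台"])
    print(seg.segment("云计算平台"))   # system dictionary only
    seg.add_user_word("云计算")        # user defines a new word
    print(seg.segment("云计算平台"))   # the new word is kept whole
    seg.remove_user_word("云计算")     # removing it restores the old split
    print(seg.segment("云计算平台"))
```

Keeping user words in a separate set from the system dictionary is what makes them cheap to add and remove without touching the system data.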

Support Multiple Encodings

Support GB18030, GB2312, UTF-8, EUC-JP and EUC-KR encodings
Support additional encodings if necessary
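
As a rough illustration of why encoding support matters, the same text is stored as different bytes under each of the listed encodings, so input must be decoded with the right one before analysis. The sketch below uses Python's standard codecs only; the to_utf8 helper and the sample strings are hypothetical and say nothing about how the product handles encodings internally.

```python
# Python codec names for the encodings listed above, with a sample text each.
SAMPLES = {
    "gb18030": "中文分词",
    "gb2312": "中文分词",
    "utf-8": "中文分词",
    "euc_jp": "日本語",
    "euc_kr": "한국어",
}


def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode bytes from the given encoding and re-encode them as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


if __name__ == "__main__":
    for encoding, text in SAMPLES.items():
        raw = text.encode(encoding)
        # Byte lengths differ by encoding even when the text is identical.
        print(encoding, len(raw), to_utf8(raw, encoding).decode("utf-8"))
```

Normalizing everything to one internal encoding at the boundary is a common design choice, since it lets the rest of a pipeline ignore where the bytes came from.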

Copyright © iZENEsoft.com. All Rights Reserved