N-Gram Extraction Tools

Introduction

This is a set of tools for manipulating N-grams from a raw corpus. The programs implement the arbitrary N-gram extraction algorithm of [Nagao94] and can extract both word N-grams (for Western languages such as English) and character N-grams (for Eastern languages such as Chinese), up to 255-grams.
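The idea of counting N-grams by sorting can be sketched in a few lines of Python. This is only a simplified illustration, not the code used by these tools: it sorts the start positions of all N-grams so that identical N-grams become adjacent, then counts runs. As far as the description in [Nagao94] goes, the full algorithm sorts whole suffixes once and uses a table of common prefix lengths to obtain counts for every N at the same time; the sketch handles a single N. The function name `ngram_counts_by_suffix_sort` is made up for this example.

```python
def ngram_counts_by_suffix_sort(tokens, n):
    """Count all n-grams of length n in a sequence (list of words or a string)."""
    # Only positions with at least n items remaining start a full n-gram.
    starts = list(range(len(tokens) - n + 1))
    # Sorting by the first n items places identical n-grams next to each other.
    starts.sort(key=lambda i: tokens[i:i + n])

    counts = {}
    previous = None
    for i in starts:
        gram = tuple(tokens[i:i + n])
        if gram == previous:
            counts[gram] += 1      # same n-gram as the neighbour: extend the run
        else:
            counts[gram] = 1       # a new n-gram begins
            previous = gram
    return counts


# Word N-grams: split on whitespace first.
word_bigrams = ngram_counts_by_suffix_sort("to be or not to be".split(), 2)
# Character N-grams: pass the string itself.
char_bigrams = ngram_counts_by_suffix_sort("中文文本中文", 2)
print(word_bigrams)
print(char_bigrams)
```

The only difference between the word and character cases here is what counts as one item of the sequence.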

Text is represented internally as Unicode (UCS-2), so in theory the programs can handle any text that can be converted to and from UCS-2. The default input/output encoding is UTF-8. The current implementation focuses on processing Chinese (GBK) and English (ISO-8859-1) text.
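To make the encoding pipeline concrete, here is a small Python sketch of the same decode, process, encode pattern. It uses Python's built-in codecs as a stand-in and says nothing about how the tools themselves perform the conversion; `convert_encoding` is a hypothetical helper. For characters in the Basic Multilingual Plane, Python's `utf-16-le` codec produces the same two bytes per character as UCS-2.

```python
def convert_encoding(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from one encoding to Unicode, then encode to another."""
    text = data.decode(src)      # external encoding -> Unicode string
    return text.encode(dst)      # Unicode string -> external encoding


gbk_bytes = "中文".encode("gbk")                      # GBK input, as from a Chinese corpus
utf8_bytes = convert_encoding(gbk_bytes, "gbk", "utf-8")

# Two bytes per character for text within the Basic Multilingual Plane.
ucs2_like = "中文".encode("utf-16-le")
print(len(gbk_bytes), len(utf8_bytes), len(ucs2_like))
```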

The tools provided here distinguish themselves from other N-Gram extraction programs in that:

Unicode code base

Both word N-grams and character N-grams can be extracted

Statistical Substring Reduction (SSR) algorithms (see the paper listed under Reference)

License

The programs are all open source and are distributed under the MIT license.

Download

Source code: ngramtool-20040530.tar.gz
Linux binary: ngramtool-20040527-linux-static.tar.gz
FreeBSD binary: ngramtool-20040527-freebsd-static.tar.gz
WIN32 binary (built with Mingw32): ngramtool-20040527-mingw32-static.zip

Usage

Three programs are provided: ``text2ngram'', ``extractngram'' and ``strreduction''. Run any of them with the ``-h'' option to see a brief usage summary.

Extracting N-gram statistics from a raw corpus

text2ngram extracts word-level or character-level N-gram statistics from a raw corpus. Here are some examples:

```
text2ngram -n3 file
```

```
text2ngram -n3 -m10 -f5 file
```

```
text2ngram -c -n3 -m10 -f5 -F gbk -T gbk file
```

Performing Statistical Substring Reduction on acquired N-Gram statistics

strreduction implements four Statistical Substring Reduction algorithms. Here are some examples:

```
strreduction -a2 < input > output
```

```
strreduction -a2 -c -F gbk -T gbk < input > output
```

```
strreduction -a2 -c -F gbk -T gbk -s -t -f 3 < input > output
```
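For readers unfamiliar with Statistical Substring Reduction, the following Python sketch shows the basic idea in its simplest form: an N-gram is dropped when a longer N-gram contains it and has the same frequency, since the shorter one then carries no extra information. This is a naive quadratic illustration written for this page, not one of the four algorithms implemented by strreduction, and it does not model any of the command-line options above. The names `is_proper_substring` and `reduce_substrings` are hypothetical.

```python
def is_proper_substring(x, y):
    """True if sequence x occurs contiguously inside the longer sequence y."""
    if len(x) >= len(y):
        return False
    return any(y[i:i + len(x)] == x for i in range(len(y) - len(x) + 1))


def reduce_substrings(freqs):
    """Drop every n-gram that has an equal-frequency superstring in freqs.

    freqs maps an n-gram (tuple of words, or a string of characters)
    to its frequency.
    """
    kept = {}
    for x, fx in freqs.items():
        redundant = any(
            fx == fy and is_proper_substring(x, y)
            for y, fy in freqs.items()
        )
        if not redundant:
            kept[x] = fx
    return kept


freqs = {
    ("new",): 5,
    ("york",): 3,
    ("new", "york"): 3,
    ("new", "york", "city"): 2,
}
# ("york",) always appears as part of ("new", "york"), so it is removed.
print(reduce_substrings(freqs))
```

The paper listed under Reference describes how the same reduction can be done in linear time.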

Reference

Statistical Substring Reduction in Linear Time
Xueqiang Lv, Le Zhang and Junfeng Hu. IJCNLP-04, Hainan Island, P.R. China. [ps.gz] [pdf]
LV Xue-qiang. Research of E-Chunk Acquisition and Application in Machine Translation. Ph.D. dissertation (in Chinese), Northeastern University, Shenyang, China, Jan. 2003.
A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora
Zhang Le, LV Xue-qiang, SHEN Yan-na, YAO Tian-shun. In Proceedings of the 20th International Conference on Computer Processing of Oriental Languages (ICCPOL03), Shenyang, P.R. China. [ps.gz] [pdf] [slide]
A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese
Makoto Nagao and Shinsuke Mori. The 15th International Conference on Computational Linguistics (COLING 1994). [pdf]

Contact

The author can be reached at: ejoy@xinhuanet.com

Last change: 01-Feb-2005. Please send any questions to Zhang Le.