Adding white space to Chinese text : )

One of the key challenges in learning Chinese is that Chinese text has no white space between words. You have to identify word boundaries visually, which is particularly difficult when you don’t know many words yet. This is what Chinese text looks like:


As you can see, for those of us coming from languages that separate words with white space, it is difficult to identify words in a Chinese paragraph.

While researching how to add white space to Chinese text so I could better identify words, I came across the following article from Stanford University:

In this article I learned that the technical term for adding white space to Chinese text is “tokenization of raw text”, and that because Chinese requires “extensive token pre-processing”, the more specific technical term is “segmentation”. Translated into technical lingo, what I was looking for was a Chinese Word Segmenter. The article provides ample information on the science behind the algorithms that make it possible to write programs that segment Chinese text.
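To get a feel for what “segmentation” means in practice, here is a toy sketch in Python. It uses greedy forward maximum matching against a tiny dictionary, which is a much simpler technique than what the Stanford segmenter does (judging by its output, it relies on a CRF classifier). The function name `max_match` and the mini-dictionary `VOCAB` are made up for this illustration and are not part of any library.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, just large enough for the sample sentence.
VOCAB = {"今天", "我们", "继续", "介绍", "北京", "房价"}

sentence = "今天我们来继续介绍北京的房价"
print(" ".join(max_match(sentence, VOCAB)))
```

A real segmenter has to cope with unknown words and ambiguous boundaries, which a dictionary lookup like this cannot do. That is why a trained model is worth the download.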

In this blog post I am only concerned with showing how I used the Chinese Word Segmenter to transform the above text into a version that is friendlier to Chinese language students.

  1. I downloaded and unzipped the segmenter.
  2. I put the text into a plain text file (.txt).
  3. I executed the following command:

./ ctb file.txt UTF-8 0 > file.segmented.txt

The segmenter provides detailed output regarding the segmentation process. I will only mention the last line of that output here:

CRFClassifier tagged 189 words in 1 documents at 1524.19 words per second.

The contents of file.segmented.txt look like this:

今天 , 我们 来 继续 介绍 北京 的 房价 。 在 上次 的 文章 中 , 我们 谈到 炒 房 现象 , 也就是 人们 买卖 房产 , 赚取 差价 的 行为 。 为了 限制 炒 房 , 2010年 , 政府 出台 了 “ 限购 ” 政策 , 这个 政策 被 人们 叫做 “ 限购令 ” 。 “ 限 购 令 ” 规定 , 只有 北京 户口 的 人才 可以 买房 ; 已婚 的 人 , 每个 家庭 只 能 购买 两 套 商品 住房 ; 单身 的 人 只 能 购买 一 套 商品 住房 。 这样 , 在 北京 可以 买房 的 人 更 少 了 。 但 北京 的 房价 并 没有 因此 停止 上涨 。 原因 也 是 多方面 的 。

As you can see, the text above has been segmented into a version that is much friendlier to Chinese language students.
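Once the words are separated by spaces, the text is also easy to process with a few lines of code, for example to build a vocabulary list to study. The following is a minimal Python sketch, assuming the segmented output is available as a string. The helper name `word_counts` is my own invention, and the sample is just the last two sentences of the output above.

```python
from collections import Counter
import unicodedata


def is_punctuation(token):
    """True if every character in the token is a Unicode punctuation mark."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def word_counts(segmented_text):
    """Count words in space-separated (already segmented) text, skipping punctuation."""
    tokens = segmented_text.split()
    return Counter(t for t in tokens if not is_punctuation(t))


# Sample taken from the end of the segmented output shown above.
sample = "但 北京 的 房价 并 没有 因此 停止 上涨 。 原因 也 是 多方面 的 。"

counts = word_counts(sample)
print(counts["的"])           # how often 的 appears in the sample
print(sum(counts.values()))  # total number of words, punctuation excluded
```

To run this on the whole file, you could read `file.segmented.txt` with `open(..., encoding="utf-8")` and pass its contents to `word_counts`.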

Hope this helps all my fellow Chinese language students out there : )



