Adding white space to Chinese text : )

One of the key challenges in learning to read Chinese is that Chinese text has no white space between words. You have to identify the words visually, which is particularly difficult when you don’t know many words yet. This is what Chinese text looks like:

“今天,我们来继续介绍北京的房价。在上次的文章中,我们谈到炒房现象,也就是人们买卖房产,赚取差价的行为。为了限制炒房,2010年,政府出台了“限购”政策,这个政策被人们叫做“限购令”。“限购令”规定,只有北京户口的人才可以买房;已婚的人,每个家庭只能购买两套商品住房;单身的人只能购买一套商品住房。这样,在北京可以买房的人更少了。但北京的房价并没有因此停止上涨。原因也是多方面的。”

http://www.slow-chinese.com/podcast/177-bei-jing-de-fang-jia-er/

As you can see, for those of us coming from languages that separate words with white space, it is difficult to identify the words in a Chinese paragraph.

While doing research into how I could add white space to Chinese text in order to better identify words, I ran across the following article from Stanford University:

https://nlp.stanford.edu/software/segmenter.shtml

From this article I learned that the technical term for splitting text into words is “tokenization of raw text”, and that because Chinese requires more “extensive token pre-processing”, the more specific technical term is “segmentation”. So, translated into technical lingo, what I was looking for was a Chinese Word Segmenter. The article also provides ample information about the science behind the algorithms that make it possible to write programs that segment Chinese text.
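To make the idea of segmentation concrete, here is a minimal Python sketch of a much simpler technique than the one the Stanford segmenter uses: greedy forward maximum matching against a word list. The function name `forward_max_match`, the `max_word_len` parameter and the tiny vocabulary are all my own assumptions for illustration. The Stanford tool relies on a trained statistical model, so treat this only as a toy that shows what "inserting word boundaries" means.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink towards one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted, so the loop always advances.
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary, chosen only for this example.
toy_vocabulary = {"今天", "我们", "继续", "介绍", "北京", "房价"}
sentence = "今天我们来继续介绍北京的房价"

# Join the tokens with spaces to get a "white space" version of the sentence.
print(" ".join(forward_max_match(sentence, toy_vocabulary)))
```

A word list like this breaks down quickly on real text (unknown words, ambiguous boundaries), which is why a trained segmenter is the better choice for actual study material.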

In this blog post I am only concerned with showing how I used the Chinese Word Segmenter to transform the above text into a version that is friendlier to Chinese language students.

  1. I downloaded and unzipped the segmenter.
  2. I put the text into a plain text file (file.txt).
  3. From the unzipped segmenter directory, I executed the following command:

./segment.sh ctb file.txt UTF-8 0 > file.segmented.txt

The segmenter provides detailed output regarding the segmentation process. I will only mention the last line of that output here:

CRFClassifier tagged 189 words in 1 documents at 1524.19 words per second.
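If you would rather run the segmenter from a script than type the command by hand, the same call can be sketched in Python. This is only a wrapper around the exact command shown above. The function names `build_segment_command` and `segment_file` are hypothetical, and the meaning I give to the arguments in the comments (`ctb` as the model, `0` as "only the best segmentation") is my reading of the segmenter's documentation, so please check it against the README that ships with the download.

```python
import subprocess
from pathlib import Path


def build_segment_command(input_file, model="ctb", encoding="UTF-8",
                          nbest="0", script="./segment.sh"):
    """Build the argument list for the command used in this post."""
    # Assumed meaning: model name, input file, file encoding, n-best size.
    return [script, model, str(input_file), encoding, nbest]


def segment_file(input_file, output_file, **options):
    """Run the segmenter and save its standard output, like `> output_file`."""
    command = build_segment_command(input_file, **options)
    # Only standard output is captured, exactly as the shell redirection does;
    # anything else the segmenter prints still shows up in the terminal.
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    Path(output_file).write_bytes(result.stdout)


# Show the command that would be executed for the file from this post.
print(" ".join(build_segment_command("file.txt")))
```

Calling `segment_file("file.txt", "file.segmented.txt")` from the unzipped segmenter directory should then behave like the command above.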

The contents of file.segmented.txt look like this:

今天 , 我们 来 继续 介绍 北京 的 房价 。 在 上次 的 文章 中 , 我们 谈到 炒 房 现象 , 也就是 人们 买卖 房产 , 赚取 差价 的 行为 。 为了 限制 炒 房 , 2010年 , 政府 出台 了 “ 限购 ” 政策 , 这个 政策 被 人们 叫做 “ 限购令 ” 。 “ 限 购 令 ” 规定 , 只有 北京 户口 的 人才 可以 买房 ; 已婚 的 人 , 每个 家庭 只 能 购买 两 套 商品 住房 ; 单身 的 人 只 能 购买 一 套 商品 住房 。 这样 , 在 北京 可以 买房 的 人 更 少 了 。 但 北京 的 房价 并 没有 因此 停止 上涨 。 原因 也 是 多方面 的 。

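As you can see, the text is now split into words separated by white space, which makes it much friendlier to Chinese language students. The segmentation is not perfect: “限购令”, for example, comes out as one word the first time it appears and as 限 购 令 the second time.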
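Once the words are separated by white space, they are also easy to process with a few lines of code. As a small sketch, the following Python snippet counts how often each word occurs in segmented text, which could serve as a starting point for a vocabulary list. The name `word_frequencies` and the punctuation set are my own choices for this example, and the sample is just the beginning of the segmented text above.

```python
from collections import Counter

# Punctuation marks to ignore (ASCII and full-width variants); extend as needed.
PUNCTUATION = set(",,;;。“”")


def word_frequencies(segmented_text):
    """Count the words in white-space-separated text, skipping punctuation."""
    tokens = segmented_text.split()
    words = [token for token in tokens if token not in PUNCTUATION]
    return Counter(words)


sample = "今天 , 我们 来 继续 介绍 北京 的 房价 。 在 上次 的 文章 中 , 我们 谈到 炒 房 现象"

# Print the words that occur more than once in the sample.
for word, count in word_frequencies(sample).items():
    if count > 1:
        print(word, count)
```

To run it on the whole text, read file.segmented.txt with `open("file.segmented.txt", encoding="utf-8").read()` and pass the result to `word_frequencies`.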

Hope this helps all my fellow Chinese language students out there : )

约瑟。

 


2 Responses to Adding white space to Chinese text : )

  1. Haven’t got round to learning Chinese yet but it is definitely on the wish list. Seems like this might help make the challenge a little easier though 👍🏻

  2. Pingback: Leo Chino! | 约瑟
