Leo Chino!

As mentioned in one of my earlier blog posts, one of the challenges you face when learning Chinese is that there is no white space in Chinese text. Fortunately, technology comes in handy: you can use segmentation to add white space to Chinese text.

Once the text has whitespace, the next two time-consuming tasks (for those of us who want to learn the language rather than translate the entire chunk of text at once) are:

  1. Find out the meaning or meanings of the word.
  2. Find out the sound of the word.

Usually the above two tasks are accomplished by copying and pasting the text into a translator. But I thought there could be a better and faster way of doing this, without having to go back and forth between the text you are reading and the translator.

After searching for an answer online, the closest thing to what I was looking for was Convert Chinese to Pinyin (Mand). This is a Chrome Extension that takes Chinese text and converts it all into Pinyin. So this solved task #2, but it still did not fully accomplish what I wanted.

So I decided to go ahead and write my own Chrome Extension and I called it Leo Chino, which means “I read Chinese” in Spanish : )

This is how Leo Chino will help you with the two laborious tasks above:

First, let’s assume the below is a picture of the text you want to read.

[Image: Screen Shot 2017-06-18 at 9.58.35 PM.png]

After you have installed Leo Chino you can go ahead and click it.

[Image: Screen Shot 2017-06-18 at 9.58.17 PM.png]

Leo Chino will do its work and will show you the below transformed page.

[Image: Screen Shot 2017-06-18 at 9.59.04 PM.png]

So task #2 has been solved, as you can now read all the words in this text (notice the nice whitespace between words 😀 ). Leo Chino will also help you learn the meaning or meanings of each word, as well as the way you should pronounce it (just because Pinyin is written in English letters does not mean you are saying it right). All you need to do is click on the Pinyin for the word you want to learn, and a popup will show you the information you need.

[Image: Screen Shot 2017-06-19 at 9.58.15 PM.png]

To hear the pronunciation just click on the speaker icon : )

The only limitation at the moment is that Leo Chino only works on web pages served over HTTP.

Have fun reading and learning Chinese : )

Posted in Uncategorized

Adding white space to Chinese text : )

One of the key challenges in learning Chinese is that there is no white space in Chinese text. You have to identify words visually, and when you don’t know many words this is particularly difficult. This is what Chinese text looks like:

“今天,我们来继续介绍北京的房价。在上次的文章中,我们谈到炒房现象,也就是人们买卖房产,赚取差价的行为。为了限制炒房,2010年,政府出台了“限购”政策,这个政策被人们叫做“限购令”。“限购令”规定,只有北京户口的人才可以买房;已婚的人,每个家庭只能购买两套商品住房;单身的人只能购买一套商品住房。这样,在北京可以买房的人更少了。但北京的房价并没有因此停止上涨。原因也是多方面的。”

http://www.slow-chinese.com/podcast/177-bei-jing-de-fang-jia-er/

As you can see, for those of us coming from languages that separate words with white space, it is difficult to identify words in a Chinese paragraph.

While doing research into how I could add white space to Chinese text in order to better identify words, I ran across the following article from Stanford University:

https://nlp.stanford.edu/software/segmenter.shtml

From this article I learned that the technical term for adding white space to Chinese text is “tokenization of raw text”, and that because Chinese requires “extensive token pre-processing”, the more specific technical term is “segmentation”. So, translated into technical lingo, what I was looking for was a Chinese Word Segmenter. The article provides ample information on the science behind the algorithms that allow us to write programs to segment Chinese text.

In this blog post I am only concerned with showing how I used the Chinese Word Segmenter to transform the above text into a version that is friendlier to students of Chinese.

  1. I downloaded and unzipped the segmenter.
  2. I put the text into a plain text file (.txt)
  3. I executed the following command:

./segment.sh ctb file.txt UTF-8 0 > file.segmented.txt

The segmenter provides detailed output regarding the segmentation process. I will only mention the last line of that output here:

CRFClassifier tagged 189 words in 1 documents at 1524.19 words per second.

The contents of file.segmented.txt look like this:

今天 , 我们 来 继续 介绍 北京 的 房价 。 在 上次 的 文章 中 , 我们 谈到 炒 房 现象 , 也就是 人们 买卖 房产 , 赚取 差价 的 行为 。 为了 限制 炒 房 , 2010年 , 政府 出台 了 “ 限购 ” 政策 , 这个 政策 被 人们 叫做 “ 限购令 ” 。 “ 限 购 令 ” 规定 , 只有 北京 户口 的 人才 可以 买房 ; 已婚 的 人 , 每个 家庭 只 能 购买 两 套 商品 住房 ; 单身 的 人 只 能 购买 一 套 商品 住房 。 这样 , 在 北京 可以 买房 的 人 更 少 了 。 但 北京 的 房价 并 没有 因此 停止 上涨 。 原因 也 是 多方面 的 。

As you can see, the above text has been segmented into a version that is much friendlier to students of Chinese.
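Once the words are separated by spaces, ordinary command line tools can work with them. The following is only a minimal sketch, not part of the segmenter: it assumes the tokens in file.segmented.txt are separated by spaces as in the output above, and the function name word_frequencies is made up for illustration. It lists every distinct token with the number of times it appears, most frequent first. Punctuation marks are counted as tokens too.

```shell
#!/bin/sh
# word_frequencies FILE
# Print "count<TAB>token" for every distinct whitespace-separated token
# in FILE, most frequent first (ties are ordered by byte value).
# The function name is hypothetical; it is not provided by the segmenter.
word_frequencies() {
  awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
       END { for (w in count) printf "%d\t%s\n", count[w], w }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example usage (shows the 20 most frequent tokens):
# word_frequencies file.segmented.txt | head -n 20
```

A list like this can be a starting point for deciding which words to look up first.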

Hope this helps all my fellow Chinese language students out there : )

约瑟。

 


CHM to plain text macOS : )

I recently discovered several of the books I want to reformat are in CHM format (Compiled HTML). While doing some research online I found the following blog post which gave me the general idea and guidance on how to go about converting my CHM books to plain text:

http://www.jaredlog.com/?p=1146

Since some of the steps didn’t quite work on my MacBook Pro with macOS Sierra, I had to use different tools to accomplish the same result. These are the steps I followed to get my CHM in plain text:

1) brew install chmlib
2) extract_chmLib file.chm folder
3) cd folder
4) ls | sort -n > list
5) clean up/sanitize the list file so it only includes the files you want to convert.
6) for i in `cat list`; do textutil -stdout -convert txt $i >> out.txt ; done

Needless to say, I did all the above steps from the command line.
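One caveat with step 6: a loop over `cat list` splits on any whitespace, so it breaks if an extracted file name contains a space. The following is a hedged variation, assuming the list file holds one file name per line. The function name convert_listed_files is made up for illustration, and the textutil options are the same ones used in step 6.

```shell
#!/bin/sh
# convert_listed_files LIST OUT
# Convert every file named in LIST (one name per line) to plain text
# with textutil and collect the results, in list order, in OUT.
# The function name is hypothetical; textutil is the macOS tool from step 6.
convert_listed_files() {
  list_file=$1
  out_file=$2
  : > "$out_file"                      # start from an empty output file
  while IFS= read -r name; do
    [ -n "$name" ] || continue         # skip blank lines in the list
    # Quoting "$name" keeps file names with spaces intact.
    textutil -stdout -convert txt "$name" < /dev/null >> "$out_file"
  done < "$list_file"
}

# Example usage, run from inside the extracted folder:
# convert_listed_files list out.txt
```

Unlike step 6, this sketch empties out.txt before it starts, so running it twice does not duplicate the text.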

Hope this helps someone else looking for an answer : )

约瑟。


Hola Mundo! Hello World! 世界你好!

I have finally decided to start my own blog to write about all the things I am interested in. The range of topics is very wide and it includes technology, learning languages, politics, philosophy, history, and theology.

The range of topics is a direct reflection of my personal interests and the kind of person I am. I think the world is a fascinating place that offers new learning opportunities every day. Those learning opportunities come from new technologies, new experiences, new world events, and the people with whom we interact every day.

The knowledge we acquire on a quotidian basis together with our imagination is what can truly make the world a better place.

I would like to end my first blog post with a quote from Einstein:

“Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.”

Hope you guys enjoy reading my blog posts!

约瑟 : )
