How to Annotate Language Data for Asian Languages: Challenges for Chinese, Korean, and Japanese

As artificial intelligence races to power everything from smart speakers to refrigerators, the world is taking note of above-average adoption rates for AI in Asia. The so-called “Asian tigers” have leapfrogged past older technology. Asian consumers are demonstrating an impressive willingness to try (and fund) new ideas.

If you are building technology for Asian consumers, you have probably already experienced the challenge of annotating and analyzing Asian language data. In fact, Asian languages present unique challenges to artificial intelligence and machine learning.

For example, how do you annotate Chinese language data?

Well, “Chinese” actually comprises several different dialects. Most linguists would call these completely different languages!

These languages are pronounced in many different ways, despite the fact that they can be written using the same writing system.

Before proceeding with a Chinese data annotation project, it is important to have a strong, nuanced understanding of the purposes of your Asian language data. You should also have a clear understanding of your user avatar: where they live, which “dialect” they prefer, and which writing system they use.

Orthographic Concerns for Chinese Language Data Annotation

One of the biggest challenges when annotating Chinese data is that Chinese does not use spaces to denote word boundaries.

This makes it harder for machines to detect word boundaries. Instead, they must rely on different cues.
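
To make the problem concrete, here is a minimal Python sketch of one classic dictionary-based cue, greedy longest-match segmentation. The function name, the tiny lexicon, and the sample sentence are all invented for illustration; a production annotation pipeline would more likely rely on a trained segmenter and a far larger vocabulary.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a known word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        # A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed for this example only.
LEXICON = {"我", "喜欢", "学习", "中文"}

sentence = "我喜欢学习中文"  # "I like studying Chinese", written without spaces
tokens = forward_max_match(sentence, LEXICON)
print(tokens)
```

The greedy strategy is easy to follow but fails on ambiguous strings, which is one reason human annotators who read Chinese are still needed to review segmentation output.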

Also, Chinese can be written with a few different writing systems. All Chinese writing systems encode meaning more robustly than pronunciation, unlike the alphabets used in most of the Western world.

Orthographic Concerns for Korean Language Data Annotation

Korean may look like it is written with characters. However, it is actually written with one of the easiest alphabets in the world to learn.

In fact, we have even trained many of our non-Korean linguists to learn it!
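
The alphabetic nature of Korean is visible in Unicode itself. The sketch below, using only Python's standard `unicodedata` module, splits each syllable block into its individual letters through canonical decomposition (NFD). The helper name `to_jamo` and the sample word are assumptions made for this example.

```python
import unicodedata


def to_jamo(text):
    """Split precomposed Hangul syllable blocks into their letters (jamo)."""
    # NFD canonically decomposes each syllable block into conjoining jamo.
    return list(unicodedata.normalize("NFD", text))


word = "한글"  # "Hangul", the name of the Korean alphabet
jamo = to_jamo(word)

# Two syllable blocks decompose into six letters.
for ch in jamo:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Normalizing back with NFC restores the original blocks, so annotation tools can work at either level without losing information.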

However, like Chinese, Korean does not always use spacing to denote word boundaries.

Even more confusingly, Korean does make use of spacing to denote word boundaries in some cases, but not in others. This tricks most Western-developed programs into thinking word boundaries have been detected.

Therefore, traditional methods of recognizing word boundaries do not always work when using AI or machine learning strategies for Korean data annotation.
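
As a rough illustration, splitting a Korean sentence on whitespace returns chunks in which grammatical particles are still attached to the nouns they mark. The Python sketch below strips a few common particles with a naive suffix check. The particle list and the helper `split_particle` are hypothetical simplifications, and a real project would typically use a proper morphological analyzer instead.

```python
# A few common particles, listed here only for illustration.
PARTICLES = ("은", "는", "이", "가", "을", "를", "에")


def split_particle(chunk):
    """Naively separate a trailing particle from a whitespace-delimited chunk."""
    for particle in PARTICLES:
        if chunk.endswith(particle) and len(chunk) > len(particle):
            return chunk[:-len(particle)], particle
    return chunk, ""


sentence = "저는 학교에 갑니다"  # "I go to school"
chunks = sentence.split()        # whitespace gives three chunks, not three plain words
analysis = [split_particle(chunk) for chunk in chunks]
print(chunks)
print(analysis)
```

A suffix check like this will wrongly split any word that merely happens to end in one of these syllables, which is exactly the kind of error that whitespace-based tooling introduces into Korean data.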

Orthographic Concerns for Japanese Language Data Annotation

Did the Korean and Chinese writing systems overwhelm you? Well, Japanese uses three different writing systems all at once!

In Japanese, most content words use a variant of Chinese characters, while grammar words usually use a syllabary (which is sound-based, and similar to an alphabet).

Then, foreign loanwords use a different syllabary altogether!

These three writing systems are usually used in pretty standardized ways – but not always!
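
One simple way to see the three systems side by side is to label each character by its Unicode block. The Python sketch below groups a sentence into runs of the same script. The function names and the sample sentence are assumptions for illustration, and the ranges cover only the main Hiragana, Katakana, and CJK Unified Ideographs blocks, not every extension block.

```python
from itertools import groupby


def script_of(ch):
    """Label a character by its (main) Unicode block."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share a script."""
    return [("".join(group), script) for script, group in groupby(text, key=script_of)]


sentence = "私はコーヒーを飲みます"  # "I drink coffee"
runs = script_runs(sentence)
for chunk, script in runs:
    print(script, chunk)
```

Script changes often line up with word or morpheme boundaries, but not reliably, so they work best as one hint among several rather than as a segmentation rule.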

Context is very important. Therefore, you will want Japanese-proficient linguists and/or data scientists to help you design an effective Japanese data annotation project.

Orthographic Concerns for Other Asian Languages

How about data annotation for Thai, Vietnamese, Indonesian, or other Asian languages?

Each of these languages uses a unique writing system and poses unique challenges to any NLP or AI project.

Feel free to get in touch with us if you are looking for more information on this topic.

Grammatical and Syntactic Concerns for Asian Language Data Annotation

As you might have guessed, Asian languages also pose grammatical and syntactic challenges for data scientists who are accustomed to working with English or Indo-European languages.

Korean and Japanese have particularly complex verb inflections. If you think you understand the importance of context for English, you haven’t seen anything yet!

You may find that a single English command has more than a dozen Korean equivalents. Even more confusingly, each equivalent can only be used in a specific situation.

It is no surprise that Siri’s team decided to use English as their wake-up signal for the Japanese version of their app – deciding on a single universal Japanese greeting or command would have been a tall order.

Finally, since important information is often stored in different parts of the sentence when compared to English, it can be complicated to segment utterances in a useful way unless you have a solid understanding of the syntax of the Asian language.

If you are looking for Chinese, Korean, or Japanese data annotation services, or data annotation services for any other Asian language, make sure that you don’t rely on just any multi-language data resource provider.

Any multi-language vendor can outsource or crowdsource Asian language data, but that doesn’t mean that they will know how to tailor the results to your purposes.

To speak to one of our Asian linguists about your multilingual data annotation needs, feel free to contact us!