How Corpus Linguistics Builds Tomorrow's Technology

These days, it can feel like technology advances at the speed of light. New models and new capabilities spring up every day!

One of the finest examples of this is Corpus Linguistics.

This branch of linguistics has not only made linguistic analysis more efficient; it has also made possible technologies such as machine translation, voice recognition, and forensic linguistic analysis.

What is corpus linguistics?

The main objective of Corpus Linguistics is to study natural language use. It does this by means of a large set of electronically-stored “real life” language samples.

Essentially, linguists collect hundreds or even thousands of hours of speech, or millions or even billions of words of text, to create a corpus (plural: corpora).

These corpora can vary according to size, purpose, accessibility, mode (text, speech, or both) or subject matter.

For speech (voice) corpora, the audio files might be collected from existing media such as movies and news programs. They might also be elicited from scratch in order to meet the specific needs of the client.

For example, if voice samples in a specific subject domain are needed, corpus linguists might collect these samples from specialists in this domain.

These days, many corpora include more than one million words of text, or a comparable volume of recorded speech. They might be developed by research institutions for academic purposes, or commissioned by businesses for the purposes of building new technologies.

When businesses need text or voice corpora to be vetted or developed from scratch, they often turn to teams of linguists like us, here at Meridian Linguistics.

What is in a corpus?

Many corpora contain more than just words. They can also include:

Metadata: information about the text itself, such as the author name, year of publication, language, region, and more;
Textual markup: information within the text, such as utterance boundaries in a spoken corpus, or the sentence breaks for written corpora;
Linguistic annotation: encoded information such as part-of-speech (POS) tagging, semantic categories, syntactic parsing, and many others.
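
As a rough illustration of how these three layers can sit together, here is a minimal Python sketch. The class names (Token, Utterance, Document), the sample sentences, and the tag labels are all hypothetical, chosen only for this example; real corpora use many different storage formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    text: str
    pos: str  # linguistic annotation: a part-of-speech tag


@dataclass
class Utterance:
    tokens: List[Token]  # textual markup: one utterance or sentence


@dataclass
class Document:
    metadata: Dict[str, str]  # metadata: information about the text itself
    utterances: List[Utterance] = field(default_factory=list)


# A made-up two-sentence document, for illustration only.
doc = Document(
    metadata={"author": "Anonymous", "year": "2020", "language": "en"},
    utterances=[
        Utterance([Token("Corpora", "NOUN"), Token("grow", "VERB"), Token("quickly", "ADV")]),
        Utterance([Token("Linguists", "NOUN"), Token("annotate", "VERB"), Token("them", "PRON")]),
    ],
)


def pos_counts(document: Document) -> Dict[str, int]:
    """Count how often each part-of-speech tag occurs in a document."""
    counts: Dict[str, int] = {}
    for utterance in document.utterances:
        for token in utterance.tokens:
            counts[token.pos] = counts.get(token.pos, 0) + 1
    return counts


print(pos_counts(doc))
```

Keeping the annotation on each token, the boundaries on each utterance, and the metadata on the document makes it easy to ask questions at any of the three levels.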

Corpora might also be monolingual or multilingual. For example, bilingual corpora are often used for building machine translation technology.

How do you analyze a corpus?

Corpora are analyzed by trained linguists who employ many different kinds of statistical analyses and specialized software.

The results produced by this software are usually presented as lists of words, collocations (words that frequently occur together), or other useful information about context.
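
To give a feel for what a word list and a collocation list look like, here is a small Python sketch using only the standard library. The three sentences are invented for the example, and counting adjacent word pairs (bigrams) is just one simple way to approximate collocations; dedicated corpus software typically applies more refined statistical measures.

```python
from collections import Counter

# Toy corpus: hypothetical sentences, already lowercased and split on spaces.
sentences = [
    "he drinks strong tea",
    "she likes strong tea",
    "they sell strong coffee",
]

word_freq = Counter()
bigram_freq = Counter()
for sentence in sentences:
    tokens = sentence.split()
    word_freq.update(tokens)
    # Pair each word with its right-hand neighbour, staying inside the sentence.
    bigram_freq.update(zip(tokens, tokens[1:]))

print(word_freq.most_common(3))    # most frequent words
print(bigram_freq.most_common(3))  # most frequent adjacent word pairs
```

In this toy data, "strong" is the most frequent word and "strong tea" the most frequent pair, which is the kind of pattern a collocation list is meant to surface.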

For example, to create a machine translation engine, corpus software might not only search for a single word in a corpus, but might also search for context such as a “clause type”.

It can also be used to calculate and compare the frequencies of given words and collocations. Then, statistical probability can be used to calculate the likelihood of a correct translation.
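
The idea of estimating the likelihood of a translation from frequencies can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a description of how any particular translation engine works: the word-aligned pairs below are invented, and the function names translation_probability and best_translation are hypothetical.

```python
from collections import Counter

# Hypothetical word-aligned pairs (English, Spanish) drawn from a bilingual corpus.
aligned_pairs = [
    ("bank", "banco"),
    ("bank", "banco"),
    ("bank", "orilla"),
    ("river", "río"),
]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(source for source, _ in aligned_pairs)


def translation_probability(source: str, target: str) -> float:
    """Relative frequency of `target` among all observed translations of `source`."""
    if source_counts[source] == 0:
        return 0.0  # the source word was never seen in the corpus
    return pair_counts[(source, target)] / source_counts[source]


def best_translation(source: str):
    """Return the most frequently observed translation, or None if unseen."""
    candidates = {t: c for (s, t), c in pair_counts.items() if s == source}
    return max(candidates, key=candidates.get) if candidates else None


print(translation_probability("bank", "banco"))
print(best_translation("bank"))
```

With more data and with context taken into account (for instance, whether "river" appears nearby), the same counting approach can help distinguish between competing translations.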

We can also classify corpora into various categories depending on specific criteria.

For example, we can classify corpora by the number of languages represented (monolingual versus multilingual), the purpose of the analysis (general versus domain-specific), the medium (speech versus text), and more.

There are even corpora that consist of non-native speech. This way, we can better train our speech recognition devices to understand accented speakers.

How else can multilingual corpora be used?

You might be surprised at the variety of uses for corpora. They are used for e-commerce, marketing, advertising, politics, and many other fields.

Techniques such as text mining, topic modeling, sentiment analysis, search engine optimization, translation memory tools and even authorship analyses have helped clients as varied as business owners, law firms, and even law enforcement.

CASE STUDY 1:

A business hires a linguistic consulting firm to perform sentiment analysis on a corpus consisting of millions of words of Chinese-language customer service chats.

This analysis helps the client to categorize the attitude of their Chinese customers as positive, negative or neutral and make decisions on how to improve their customer service in China.
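
For readers curious about what such a categorization might look like in its simplest form, here is a toy Python sketch of a word-list (lexicon) approach. It is an assumption-laden illustration only: the word lists and sample messages are invented, English is used for readability, and a real project on Chinese-language chats would need proper word segmentation and a far richer model.

```python
import re

# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"great", "helpful", "thanks", "fast"}
NEGATIVE = {"slow", "broken", "rude", "late"}


def classify(message: str) -> str:
    """Label a message by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", message.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


chats = [
    "Thanks, the agent was helpful",
    "Delivery was slow and the box was broken",
    "Where is my order",
]
labels = [classify(chat) for chat in chats]
print(labels)
```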

CASE STUDY 2:

A law firm hires a linguist to perform a forensic analysis on thousands of words of e-mails, in order to identify the author of said e-mails as part of building their case.

CASE STUDY 3:

A technology company building a multilingual voice assistant hires a linguistic consulting firm to prepare speech corpora in several languages, which are then used to help build their technology.

How Do You Prepare or Purchase a Corpus or Corpus Analysis Services?

If you need a corpus or corpus analysis, the first step is finding a team of linguists that specialize in the language at hand. If you are building a Chinese-language voice assistant, make sure your linguists specialize in Chinese.
Then, make sure you have a clear understanding of your goals. Who will be the end user of your technology? How much data will be needed to achieve your goals?
Finally, prepare your budget and your expectations. There are many ways to prepare a corpus, ranging from affordable web-scraping tools to more high-end natural speech elicitation. Your linguist team should help you understand the realistic capabilities of the corpus for which your budget allows.

If you are looking for help designing or analyzing your corpus in any language, don't hesitate to contact us!