Language. It’s the shared medium through which a community communicates meaning. We take it for granted that our family, friends and colleagues will understand us when we communicate. Should we expect the same from machines?
Real-time translation has been part of our lives for several years now, thanks to Google Translate. Forward-looking companies such as Unbabel deploy translation-as-a-service to solve translation challenges for business applications. But how do we approach true machine-level understanding of natural language?
We’re often asked about this at DigitalGenius, since we support our customers across over 15 languages. Customers and prospects are rightly curious as to how our AI can make predictions and resolutions in languages as varied as German and Chinese - indeed, so was I, and going down the Machine Learning rabbit hole recently, I was fascinated by what I learned. This is the topic of today’s blog.
It Pays to be Language Agnostic
As a starting point, as an AI-powered organization, we want to be language agnostic in our overall approach to Machine Learning. There is a litany of Natural Language Processing (NLP) techniques available, and unfortunately many of them are language specific. For example, a process known as Stemming strips affixes from words to reduce them to a common root, without changing their core meaning. In practice, Stemming relies on affix patterns specific to one language, so if you apply a stemmer built for English to a French data set, the results will be intrinsically poor. Check out what I mean below: an English stemmer would target and remove the suffixes in the first group, leaving the shared stem “slow”, but it has no rules for the French equivalents:
English: Slow - Slowly - Slowest
French: Lent - Lentement - le plus lent
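To make this concrete, here is a minimal sketch of a stemmer with hand-written, English-only suffix rules (a toy illustration, not any production stemmer - real ones like Porter/Snowball have far richer rule sets per language):

```python
def naive_english_stem(word: str) -> str:
    """Toy stemmer: strips a few common English suffixes (illustrative only)."""
    for suffix in ("est", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# English: the shared stem "slow" is recovered
print([naive_english_stem(w) for w in ["slow", "slowly", "slowest"]])
# → ['slow', 'slow', 'slow']

# French: the same rules miss the "-ement" adverb suffix entirely
print([naive_english_stem(w) for w in ["lent", "lentement"]])
# → ['lent', 'lentement']
```

The English words collapse to one stem, while the French forms stay distinct - exactly the kind of language-specific behaviour we want to avoid baking into the core pipeline.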
We tend to avoid NLP techniques that do not scale across languages, opting instead for generalized methods that let us understand the data holistically. We separate monolithic linguistic data sets (i.e. breaking apart a data set containing Dutch, Swedish and Spanish into one set per language), and then drill down into the relevant data points using techniques like Stop Words Removal. This technique strips “filler” words that do not impact the meaning of the overall intent:
[I] want [a] refund [immediately]
Removing the bracketed words does not change the intent, but it does improve the accuracy of the predictions. Cutting out the noise helps DigitalGenius understand that the intent of the message is a refund, and it does so in a consistent, scalable way.
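A minimal sketch of how this filtering works in practice - the stop-word list here is purely illustrative (real systems use curated per-language lists, e.g. those shipped with NLTK or spaCy):

```python
# Hypothetical per-language stop-word lists; one list per linguistic data set.
STOP_WORDS = {
    "en": {"i", "a", "the", "immediately", "please"},
}

def remove_stop_words(text: str, lang: str) -> list:
    """Keep only the tokens that carry the intent of the message."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS[lang]]

print(remove_stop_words("I want a refund immediately", "en"))
# → ['want', 'refund']
```

Because the stop-word list is just data keyed by language, the same code path serves every language we support - only the list changes.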
Now we have data sets segmented by language and focused on the key data impacting the meaning. Drilling down, we extract only the most recurring vocabulary in the data - meaning we’re not wasting time on a “long tail” of infrequently occurring queries.
Less is More
We’re here to geek out, so hey - let’s go the distance on why. Natural language generally follows Zipf’s Law, which states that in a given corpus of natural language utterances, a word’s frequency is inversely proportional to its rank in the frequency table. In practical terms, the use of a word in any language falls very fast as we move down the scale from more commonly used words to less frequently used ones:
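You can see this fall-off even on a tiny toy corpus (the messages below are made up for illustration), and it motivates keeping only the frequent “head” of the vocabulary:

```python
from collections import Counter

# A tiny toy corpus of customer messages (illustrative only)
messages = [
    "where is my refund",
    "i want a refund",
    "my order is late",
    "where is my order",
]
counts = Counter(word for msg in messages for word in msg.split())

# Frequencies fall off quickly as rank increases (Zipf-like behaviour),
# so a small head of the vocabulary covers most of the corpus.
print(counts.most_common())

# Keep only the frequent head, dropping the long tail of rare words
vocab = {word for word, freq in counts.items() if freq >= 2}
print(sorted(vocab))
```

Even here, a handful of frequent words ("is", "my", "refund", "order", "where") account for most of the tokens, while one-off words like "late" fall into the long tail we discard.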
Isolating only the recurring, and therefore the relevant, data is critical to building a performant AI model.
Language then needs to be mapped to numerical representations - numbers that machine learning algorithms can understand. This is achieved by creating a vocabulary of indices - a mapping from words to numbers, and in turn from numbers to word vectors.
At this point we get closer to mapping natural language sentences to sequences of numbers that a machine learning model can interpret:
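A minimal sketch of that mapping, with a hypothetical vocabulary built from the frequent words in the training data (index 0 reserved for out-of-vocabulary words, and small random vectors standing in for learned embeddings):

```python
import numpy as np

# Hypothetical vocabulary of indices; 0 is reserved for unknown words.
word_to_index = {"<unk>": 0, "where": 1, "is": 2, "my": 3, "refund": 4, "order": 5}

def encode(text: str) -> list:
    """Map a natural-language sentence to a sequence of integer indices."""
    return [word_to_index.get(w, 0) for w in text.lower().split()]

print(encode("Where is my refund"))   # → [1, 2, 3, 4]
print(encode("Where is my package"))  # → [1, 2, 3, 0]  ("package" is unknown)

# In turn, each index looks up a word vector in an embedding matrix
# (random here; learned during training in a real model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(word_to_index), 8))   # one 8-dim vector per index
sentence_vectors = embeddings[encode("Where is my refund")]  # shape (4, 8)
```

Nothing in this pipeline knows it is handling English - swap in a vocabulary built from German or Chinese data and the mechanics are identical.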
…and without giving away too much of the secret sauce, a neural network is ready to learn directly from natural language mapped to intents.
On the other side of the neural network, we train it to predict probabilities over a group of intents. Those probabilities, in effect, are how confident the neural network is in its predictions. Therefore, the more distinct the patterns in the input natural language data, the greater the output confidence.
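The standard way a network turns raw scores into such probabilities is a softmax over the intent classes - sketched below with made-up intent names and scores (the real model’s architecture and intents are, of course, the secret sauce):

```python
import numpy as np

def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    shifted = logits - logits.max()   # shift for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical intents and raw scores from the network's final layer
intents = ["refund_status", "order_status", "cancel_order"]
probs = softmax(np.array([2.8, 0.4, -1.1]))
for intent, p in zip(intents, probs):
    print(f"{intent}: {p:.2f}")
```

The probabilities always sum to 1, and the gap between the top score and the rest is exactly the “confidence” described above.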
After training, when a customer comes back with some variant of “Where is my refund?”, our model now has a full pipeline through which to convert the natural language into numerical probabilities, or predictions. And we have reached this point with minimal assumptions about the language the customer used - the same model could be built to predict on just about any language in the world.
To bring that all together, there is first a separation and curation of data that allows us to analyze each unique linguistic data set individually. This allows us to focus on repetition in the data.
When the ML is doing the “thinking”, it works in a purely numerical format: the basis of the computation is assigning probabilities to outcomes, rather than looking for inherent meaning in words, which can vary by language, dialect and region.
With this approach, we don’t rely on a specific understanding of a given language, but instead remain language-agnostic. This is how we’re able to support over 15 languages for our customers - with many more on the way.