Natural Language Processing (NLP)
Data Science Digest, Vol. 6
This article is part of our “Data Science Digest” series. With this series, we keep you up to date with developments in Data Science, show you the potential of data science techniques, and give you a sneak peek into some of the exciting things we’ve been working on at Anchormen. In this article, we will talk about Natural Language Processing (NLP) and some of the most interesting applications of the technology.
We use language on a daily basis to communicate, argue, celebrate, negotiate, and learn. Everywhere around us we are bombarded by language, both consciously and subliminally. Communication is essential for every human being to be able to understand and navigate society. For machines, however, natural language is very problematic.
When you are listening to or talking with someone, you don’t really appreciate how many things are happening on your metaphorical “back-end” that help you understand speech and convey your thoughts in a logical manner. We are the only species that communicates with sentences composed of distinct words that take the role of nouns, verbs, adjectives, and so on. For example, your dog can tell you that it’s angry, but it can’t tell the story of its life. Natural Language Processing (NLP) is an area of computer science concerned with leveraging computational power to process and understand language.
What makes NLP such a tough field is the fact that language is persistently ambiguous. A single sentence can have different meanings or it can use slang, jargon, sarcasm, or any number of diverse human communicational constructs. It is little wonder that forcing computers to try and interpret meaning without cultural context is so difficult.
The history of NLP can be traced back to the 1950s, but what really kick-started the field was the arrival of the Web and advancements in computing power. The Web provided computers with an ever-growing stream of written language data to learn from, while faster and more powerful computers made it possible to both produce and analyze that data at scale.
Although the field is huge, there are four main sub-categories that are of particular interest because of the number of business and societal applications they (can) have.
Sentiment Analysis

Also known as opinion mining, sentiment analysis is used to determine public opinion on a subject by analyzing texts that reference that subject. In most cases, opinions are classified into three segments: positive, negative, and neutral. Thanks to the advent of social media platforms, there are heaps of data that can be analyzed to that end, and the potential insight that businesses can gain is mind-boggling.
A common use case for organizations is to study consumer attitude towards their brand or product, but that’s just scratching the surface. Deeper analysis can reveal what people like or dislike and even track changes in their opinion over time, all of which is useful when trying to improve consumers’ attitude towards you.
Sentiment analysis is also very popular during elections and campaigns. Political candidates have used it to monitor voters’ opinions on different topics, policy proposals, announcements, and advertising, and to reach out to the right target groups. This allows them to fine-tune their overall campaign message and to understand how they appear in the eyes of the general public.
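To make the positive/negative/neutral idea concrete, here is a minimal lexicon-based sketch. The word lists are illustrative assumptions; real sentiment systems use trained models and far richer lexicons.

```python
# Toy lexicon-based sentiment classifier: count positive and negative
# words and compare. The two word sets below are illustrative only.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "hate", "terrible", "awful", "poor"}

def classify_sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral by lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great product"))    # -> positive
print(classify_sentiment("terrible service, really bad")) # -> negative
```

A real system would also handle negation (“not good”), sarcasm, and context, which is exactly where the ambiguity of language makes things hard.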
Topic Detection

Like sentiment analysis, topic detection also mines text to discover insights, but it focuses on the abstract topics that occur within text data. A body of text usually concerns several different topics. Topic detection is an unsupervised method: the algorithm groups texts into clusters based on common words or phrases, figuring out the topics “by itself”, and the resulting clusters are then labeled. This makes it possible, for example, to separate articles about sport from articles about politics, which is simple for a human but rather hard for a computer. After clustering, the labels can be linked with other machine learning models to classify new incoming texts into the right category. A data scientist adds value by correctly fine-tuning the topics and the way of clustering, after which these can be used for classification and/or recommendation purposes.
Going back to the election example from the sentiment analysis section, topic detection has a useful application here as well. You can use it to determine which political party an article or post panders to; in other words, it can tell you whether what you just read is, for example, right-wing or left-wing focused, or even which side of the political spectrum the writer is on.
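The clustering step can be sketched very simply: group documents whose vocabularies overlap enough, with no labels given up front. This is a toy stand-in for real topic models such as LDA; the similarity threshold and example texts are assumptions.

```python
# Minimal sketch of unsupervised topic grouping: greedily cluster
# documents by word overlap (Jaccard similarity). Real topic models
# (e.g. LDA) are far more sophisticated; the threshold is arbitrary.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def cluster_by_topic(docs, threshold=0.2):
    """Greedily group documents whose word sets overlap enough."""
    clusters = []  # each cluster: (vocabulary set, list of doc indices)
    for i, doc in enumerate(docs):
        words = set(doc.lower().split())
        for vocab, members in clusters:
            if jaccard(words, vocab) >= threshold:
                vocab |= words      # grow the cluster's vocabulary
                members.append(i)
                break
        else:
            clusters.append((words, [i]))  # start a new cluster
    return [members for _, members in clusters]

docs = [
    "the match ended with a late goal by the striker",
    "the striker scored a goal in the final match",
    "parliament passed the new election law today",
    "the election law divided parliament",
]
print(cluster_by_topic(docs))  # -> [[0, 1], [2, 3]]
```

The sport articles end up in one cluster and the politics articles in another; a human (or a downstream model) then attaches the labels “sport” and “politics” to the clusters.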
Text Classification
Text classification focuses on putting a text into the right class. Determining whether or not an email is spam, for example, is a task for text classification. Here, the classification algorithm has seen examples of spam and non-spam emails, after which it is able to determine (classify) whether a new incoming email is spam or not. A seemingly simple technique that can get massively difficult, quickly.
The applications of text classification are vast. Any type of text can be classified based on the pre-set criteria that are of interest. So, what can you do with this model? Let’s continue with the political examples for consistency’s sake. By analyzing previous patterns (texts, posts, tweets) you can determine who wrote a particular piece of content: was this speech written by Obama or by Trump? You can discern which political party is most active and what topics they talk about. On the voters’ side, you can predict how people are going to vote based on their online behavior.
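The spam example above is classically solved with a Naive Bayes classifier, which can be sketched from scratch in a few lines. The tiny training set and Laplace smoothing below are illustrative assumptions; real filters train on millions of emails.

```python
# From-scratch Naive Bayes text classifier, the classic spam-filter
# approach: pick the label maximizing P(label) * product P(word|label).
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns per-label word counts."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in examples:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def classify(text, counts, labels):
    """Score each label in log space and return the best one."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, -math.inf
    for label in labels:
        total = sum(counts[label].values())
        score = math.log(labels[label] / sum(labels.values()))
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a label
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team monday", "ham"),
]
counts, labels = train(examples)
print(classify("claim your free money", counts, labels))  # -> spam
```

The same machinery works for the authorship example: train on speeches labeled “Obama” and “Trump” instead of “spam” and “ham”.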
Another interesting application is detecting human emotions. This can be done by classifying texts into the six basic emotions (anger, disgust, fear, happiness, sadness, surprise) and linking certain expressions or words with them. With a big enough data set, the accuracy levels are surprisingly high, and the applications are remarkable. Similar to sentiment analysis, you can determine how people feel about specific topics. Theoretically, you could even use it to detect whether a person suffers from depression or other mental health issues and help them proactively; there is more in-depth research on this topic.
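The “linking words with emotions” idea can be sketched as a six-way keyword lookup. The tiny lexicon below is an illustrative assumption; real systems learn these associations from large annotated corpora rather than hand-written lists.

```python
# Sketch of keyword-based emotion detection over the six basic emotions.
# The lexicon is illustrative only; real systems learn it from data.
from collections import Counter

EMOTION_LEXICON = {
    "anger":     {"furious", "angry", "outraged"},
    "disgust":   {"gross", "disgusting", "revolting"},
    "fear":      {"afraid", "scared", "terrified"},
    "happiness": {"joy", "delighted", "happy"},
    "sadness":   {"sad", "mourning", "heartbroken"},
    "surprise":  {"astonished", "unexpected", "shocked"},
}

def detect_emotion(text: str) -> str:
    """Return the emotion whose keywords appear most often, or 'none'."""
    words = text.lower().split()
    hits = Counter()
    for emotion, keywords in EMOTION_LEXICON.items():
        hits[emotion] = sum(w in keywords for w in words)
    emotion, count = hits.most_common(1)[0]
    return emotion if count > 0 else "none"

print(detect_emotion("I was so scared and terrified"))  # -> fear
```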
Speech Recognition

Finally, we have speech recognition. Nowadays this technology is very common: you find it in smart IoT devices, mobile phones, laptops and computers, cars, and more. It’s easy to forget how incredibly complicated this technology is.
Speech recognition works by converting the sound you input into a digital signal and then mapping patterns in that signal to phonemes, letters, and words in order to build sentences. Two of the most sophisticated voice assistants, Google Assistant (Google) and Siri (Apple), although similar, work in two different ways. Apple’s Siri waits for a person to input a complete sentence and then returns an answer from its databases. Google Assistant, on the other hand, has the benefit of accessing all the data that Google has collected through Google Search and doesn’t need to wait for a full sentence before it answers you. It processes every word step by step and tries to predict your sentence, each time learning from what you wanted to say and improving further queries.
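The very first step, turning sound into a digital signal, can be sketched as sampling and quantizing a waveform. The pure tone, sample rate, and bit depth below are illustrative assumptions; real recognizers then extract features from such samples and decode them into words with acoustic and language models.

```python
# Sketch of digitizing audio: sample an "analog" waveform at a fixed
# rate and quantize each sample to a signed integer. This is the raw
# input a speech recognizer works from.
import math

def sample_and_quantize(freq_hz, sample_rate=8000, duration_s=0.001, bits=8):
    """Sample a pure tone and quantize each sample to `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # max signed amplitude, 127 for 8-bit
    samples = []
    n = int(sample_rate * duration_s)
    for i in range(n):
        t = i / sample_rate
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # analog value in [-1, 1]
        samples.append(round(amplitude * levels))        # digital value
    return samples

digital = sample_and_quantize(440)  # the note A4 at an 8 kHz sample rate
print(digital)
```

Everything downstream, matching those numbers to phonemes and phonemes to words, is where the two assistants' strategies (wait for the full sentence versus predict word by word) diverge.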
It seems like the future of the NLP field is bright. Exponential growth in technology and usable data has spurred fast developments in all of the above-mentioned areas. And we are still discovering new applications, especially with so much work done in the fields of Deep Learning and Artificial Intelligence. The only thing slowing this progress is the ambiguous nature of natural language and how difficult it is to teach a machine to understand it. Natural Language Processing is definitely a field to follow closely.