Natural Language Processing- NLP 101

May 21, 2014

NLP stands for natural language processing, a branch of computer science and artificial intelligence that deals with human languages. NLP also stands for neuro-linguistic programming, an approach to communication, personal development and psychotherapy. This blog, however, is about the former, i.e. natural language processing. Although both are spelt the same way, they represent entirely different things; fittingly, the core of natural language processing is to understand human language in an automated way, i.e. to infer which 'NLP' is meant in a text that mentions NLP. From here onwards, all references to NLP are to natural language processing.

Our success, i.e. the success of the human race, is due to our ability to communicate, i.e. to share information. Using this ability, we have marched ahead of other animals and become the most sophisticated of creatures.

We began to look for ways to preserve our thoughts, feelings, messages and other information. We started with oral communication like other animals, but because of its ephemeral nature, we began painting the walls of the caves where we lived. These paintings are a great source of information about our ancestors, and the game of Pictionary is a tribute to this art. However, not everyone can be good at drawing ;)

There was a need to standardize the drawings so that everyone could understand them, and that's where the concept of developing a language comes in. However, many such standards came up, resulting in many languages, with each language having its own set of basic shapes called alphabets, combinations of alphabets called words, and combinations of words arranged meaningfully called sentences. Each language also has a set of rules governing how words are combined to form sentences. This set of rules is termed grammar.


Let’s fast forward to the 21st century. Language has evolved from the papyrus to the Kindle and from the mammoth Mahabharata to the tiny tweet. The birth of Web 2.0 and social media has led to an explosion of data, as evident from the numbers below:-
1. Facebook:- Adds 0.5 petabyte of data every 24 hours
2. Twitter:- Adds 340 million tweets per day
3. YouTube:- Adds 100 hours of new videos every minute

These numbers are still growing at a rapid pace. To put them into perspective, consider the following: 1 petabyte of MP3 songs would require 2000 years to play. Imagine reading this humongous amount of data to figure out what it is trying to say. By the time we finish reading one article or post, a million more are ready. It is humanly impossible to do so. This gave rise to the need to perform such tasks in an automated way. Historically, whenever we were burdened by huge repetitive tasks, we invented machines to perform them for us. We already have computers, intelligent machines that can perform any task which can be taught in the form of a ‘program’. Thus a new class of programs was born, which came to be known as ‘NLP’ or ‘natural language processing’. These programs enable computers to infer meaning from human language.
With NLP, we have a potent tool to read large volumes of textual data and come up with insights. However, this is not an easy task. Let’s start from the beginning.

Any language construct can be divided into three forms:-
1. Definition:- A piece of text that explains the meaning of a word, phrase or set of symbols.
2. Fact:- Something that is true or is the actual case as per our current knowledge and understanding.
3. Opinion:- A judgement or viewpoint.

For example:-
1. Definition:- A girl is a female human being.
2. Fact:- Jenny is a girl with blue eyes and blonde hair.
3. Opinion:- Jenny is pretty / Jenny is ugly.

The beauty of languages lies in the fact that the above three forms can be mixed together in such a way that it becomes difficult to identify them separately, and herein lies the difficulty. Also, languages are still evolving, which makes the task of automation harder. We will use text in the English language to explain this. Consider the two sentences below:-

1. The mouse on the desk is broken.
2. The mouse on the desk is eating cheese.

Grammatically, both sentences have the same form, and both state a fact. However, ‘the mouse’ refers to two different things. In the first sentence, it refers to the pointing device used with a computer; in the second, to a living organism, a small mammal. Our brain is capable of reading backwards, i.e. it infers the correct meaning of ‘the mouse’ only after it encounters the words ‘broken’ or ‘eating cheese’. Language is also dynamic: if the same two sentences had been used a hundred years ago, ‘the mouse’ would have referred to the animal in both, since computers and the mouse as a pointing device did not exist at that time. In NLP terms, this problem is called ‘word sense disambiguation’.
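To make this concrete, here is a minimal word sense disambiguation sketch in the spirit of the classic Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the sentence. The two glosses below are hand-written for illustration, not taken from any real lexicon:-

```python
# Toy word sense disambiguation: choose the sense of 'mouse' whose
# illustrative gloss overlaps most with the words of the sentence.
SENSES = {
    "device": "a pointing device used with a computer that can be broken or repaired",
    "animal": "a small mammal a rodent known for eating cheese and grain",
}

def disambiguate(sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    return max(SENSES, key=lambda s: len(context & set(SENSES[s].split())))

print(disambiguate("The mouse on the desk is broken."))        # -> device
print(disambiguate("The mouse on the desk is eating cheese.")) # -> animal
```

Real systems use far richer context and statistics, but the principle of letting surrounding words like ‘broken’ or ‘eating cheese’ vote for a sense is the same.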

Now consider the following sentence:-
John drove Mary from Austin to Texas.

In the above sentence, ‘John’ and ‘Mary’ represent people while ‘Austin’ and ‘Texas’ represent locations. Inferring such details from a given text is termed ‘named entity recognition’, or simply NER.
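A toy NER sketch can be built from simple lookup lists (gazetteers); the names below are just the ones from the example sentence, whereas real NER systems learn such patterns from annotated data:-

```python
# Toy named entity recognition: tag each token by looking it up in
# small hand-made gazetteers; everything else gets the 'O' (outside) tag.
PEOPLE = {"John", "Mary"}
PLACES = {"Austin", "Texas"}

def tag_entities(sentence):
    """Return (token, label) pairs with PERSON, LOCATION or O labels."""
    tags = []
    for token in sentence.rstrip(".").split():
        if token in PEOPLE:
            tags.append((token, "PERSON"))
        elif token in PLACES:
            tags.append((token, "LOCATION"))
        else:
            tags.append((token, "O"))
    return tags

print(tag_entities("John drove Mary from Austin to Texas."))
```

This is of course a sketch: a real recognizer must also handle names it has never seen, which is exactly what makes NER a hard problem.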

Consider the following two sentences:-

1. The movie has an unpredictable plot.
2. The car has an unpredictable steering.

Grammatically, both sentences have the same form as in the example above. Now suppose we are asked to take decisions based on them. A movie with an unpredictable plot would be exciting to watch; we could happily spend our money on it. The same cannot be said about a car with unpredictable steering. Finding such insights is termed ‘sentiment analysis’.
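The key point, that the same word carries opposite sentiment in different domains, can be sketched with a tiny domain-aware lexicon. The scores below are hand-made assumptions for illustration, not a real sentiment resource:-

```python
# Toy sentiment analysis: the same adjective is scored differently
# depending on the subject it describes.
DOMAIN_LEXICON = {
    ("movie", "unpredictable"): +1,  # a surprising plot is a plus
    ("car", "unpredictable"): -1,    # surprising steering is a hazard
}

def sentiment(sentence):
    """Sum the scores of all (subject, adjective) pairs found in the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    return sum(score for (subject, adj), score in DOMAIN_LEXICON.items()
               if subject in words and adj in words)

print(sentiment("The movie has an unpredictable plot."))    # positive
print(sentiment("The car has an unpredictable steering."))  # negative
```

A single global word-to-score dictionary would get one of the two sentences wrong, which is why practical sentiment systems must model context and domain.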

Now consider the following two sentences:-

1. Some people love the sea and some see the love.
2. Some people love the see and some sea the love.

From the above examples, it is clear that language has many ambiguous forms, which makes the task difficult for an automated system. I have shared examples from the English language only. Considering the number of languages we have, and the fact that languages are still evolving, herein lies the challenge in developing 100% accurate NLP algorithms for automated systems.

The second law of thermodynamics states that the entropy of an isolated system tends to stay the same or increase. In simple words, chaos tends to stay the same or grow. This is evident from the volumes of data mentioned above. We, as team Germin8, envision ourselves as creators of tools based on NLP algorithms that demystify this chaos and thus simplify the decision-making process. These tools allow our clients to:-

1. Accumul8, aggreg8 and assimil8 (accumulate, aggregate and assimilate) – trends, views, reviews and opinions of customers and prospective customers.
2. Autom8 (automate) – the accumulation, aggregation and assimilation process.
3. Anticip8 (anticipate) – customer needs.
4. Ab8 (abate) – crises.
5. Elimin8 (eliminate) – inefficiency and inconsistency.
6. Acceler8 (accelerate) – productivity and the resolution of complaints.
7. Innov8 (innovate) – new products and services based on customer needs, views, reviews, opinions and feedback.

I will leave you with the following sentences to interpret :-)
1. She sells sea shells at the sea shore.
2. See shells sea shells at the sea shore.
3. I saw the man with the binoculars.
4. Police help dog bite victim.
5. He saw that gas can explode.
6. I once saw a deer riding my bicycle.
7. We saw her duck.
8. Wanted: a nurse for a baby about twenty years old.
9. Flying planes can be dangerous.
10. ‘Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo’.