In this blog, we will understand the concept of NLP and its uses. So, stay with me till the end.
Let's get started! 🙂
Natural Language Processing (NLP) refers to all the systems that work together to handle end-to-end interactions between machines and humans in the human's preferred language. In other words, NLP lets people and machines talk to each other "naturally".
Natural language processing (NLP) is the broad umbrella term for any machine’s ability to recognize what is said to it, understand its meaning, determine the proper action, and respond in language the user will understand. Common real-world examples of such tasks are online chatbots, text summarizers, auto-generated keyword tabs and sentiment analysis.
Natural language understanding (NLU), on the other hand, is a subset of NLP. NLU goes beyond basic sentence structure and attempts to understand the intended meaning of language. Human speech is full of nuances, mispronunciations, and colloquialisms, and the purpose of NLU is to tackle these complexities.
Natural Language Generation (NLG) is a technology that simply turns data into plain English. In other words, it means our software can look at your data and write a story from it, just like a human analyst would today.
Put mathematically, you could say that combining NLU and NLG results in a working NLP engine.
NLP = NLU + NLG
NLU is about working out what the user's input means: understanding the given text and classifying it into the proper intents.
Let us take an example here: "Can I play cricket today?"
What should your NLP engine do?
Here the user's intention is to play cricket, but there are several possibilities that should be taken into account. One dependency would be checking the weather outside.
If it is raining outside, we cannot recommend playing, since cricket is an outdoor game. As you can see, we need to turn the request into structured data. This is where intents and entities come in.
Intents are essentially verbs (activities the user wants to do). If we want to capture a request or perform an action, we use an intent. In the example, play is the intent.
Entities are the nouns, or the content for the action that needs to be performed. In this case, cricket is the entity.
It is possible to have multiple intents (like checking the weather, checking ground availability, checking friends' availability) for a single entity, and also multiple entities for one intent, or multiple intents for multiple entities.
Intents and entities can also change based on the previous messages. Check out the conversation below:
- User : Can I play cricket?
- Bot: The weather is bad I would not suggest playing now.
- User: What about football?
- Bot: The same weather is bad though.
- User: Let me watch at least.
- Bot: Sure you can watch football.
Step 1: “Entity” — “Cricket” and “Intent” — “Play”
Step 2: “Entity” — “Football” and “Intent” — “Play”
Step 3: “Entity” — “Football” and “Intent” — “Watch”
See how the entities and intents vary based on the previous messages.
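As a toy illustration of intent and entity extraction (the word lists below are invented for this example; a real NLU engine uses trained classifiers, not keyword lookup), the steps above could be sketched as:

```python
# Toy intent/entity extraction: match words against small hand-made lists.
# A real NLU engine would use trained models, not keyword lookup.
INTENTS = {"play", "watch", "check"}
ENTITIES = {"cricket", "football", "weather"}

def extract(utterance: str):
    words = utterance.lower().rstrip("?!.").split()
    intent = next((w for w in words if w in INTENTS), None)
    entity = next((w for w in words if w in ENTITIES), None)
    return {"intent": intent, "entity": entity}

print(extract("Can I play cricket?"))    # {'intent': 'play', 'entity': 'cricket'}
print(extract("Let me watch football"))  # {'intent': 'watch', 'entity': 'football'}
```

Note how the same entity ("football") can pair with different intents ("play", "watch") depending on the utterance, just like in the conversation above.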
How the three of them work hand in hand:
- NLU handles understanding the input, based on grammar and the context in which it was said, and decides on the intent and entities.
- NLP will convert the text into structured data.
- NLG generates text based on the structured data.
Now, let's also talk about the challenges in NLP.
The development of NLP is meaningful because of some specific problems and phenomena that arise when we study natural language. Most of the time these problems are unique compared to the problems that emerge in other fields of computer science or engineering, and that is in part what makes NLP such an interesting and different area.
The main challenge of NLP is the understanding and modelling of elements within a variable context. In language, words are unique but can have different meanings depending on the context in which they are evaluated. We can have words (or even sentences) with different meanings in the same sentence depending on the way we interpret those words. This happens because of the difference between the signifier (the way we represent the information: the word) and the signified (the meaning of that information: the concept).
The other key phenomenon of natural language is that we can express the same idea with different terms. This occurs because of synonymy, which is also dependent on the specific context: fine is a synonym of correct.
Syntax refers to the arrangement of words in a sentence such that they make grammatical sense. In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules.
Computer algorithms are used to apply grammatical rules to a group of words and derive meaning from them.
Have you ever had a conversation with a chatbot? If so, I guess you may have encountered a situation like the one below:
Why can't your chatbot understand what you said? It is obvious that it does not know what "his wife" means. It is easy for us humans to use the context and understand that "his wife" is the wife of Barack Obama. However, our chatbot cannot, because it lacks the capability to understand the connection between those two phrases in context. Coreference resolution is here to solve this problem.
Do we actually need coreference resolution for chatbots?
Well, it depends. The chatbot I worked on is a typical Q&A chatbot. Its job is to answer day-to-day questions that come up in our company, such as information about employees and clients, and basic information about our offices. The conversation is led by the users rather than the chatbot, and other application scenarios may well need this type of chatbot. The best experience for users is getting answers almost right away when they ask questions. Once the chatbot cannot understand a question because of some ambiguity, the conversation cannot continue.
However, if your chatbot is not a Q&A chatbot led by users, you will have fewer problems. For example, a customer service chatbot will ask you which item you want and which size fits you. The conversation is limited to a specific question set and topic, and this kind of conversation goes well as long as you follow its questions.
A coreference exists when two or more phrases refer to the same real-world entity in a document. As the above conversation shows, "Barack Obama" and "his" both refer to Barack Obama in the real world. Coreference resolution aims at finding all expressions that refer to the same real-world entities. With its help, clusters of phrases which refer to the same real-world entity are found. This is extremely useful for our chatbot to understand the real meaning of speakers.
Not only for chatbots: coreference resolution can be applied in a meaningful way to multiple Natural Language Processing (NLP) tasks, such as named entity recognition (NER), translation, question answering and text summarization.
The process of linking together mentions that relate to the same real-world entity is called coreference resolution.
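To make the idea concrete, here is a deliberately simplistic sketch that links a possessive pronoun to the most recently mentioned person (the heuristic and the token list are invented for this example; real coreference resolvers use trained neural models):

```python
# Toy coreference sketch: link a pronoun to the most recent person mention.
# Real resolvers use trained models and handle far more mention types.
PRONOUNS = {"his", "her", "he", "she"}

def resolve(tokens, people):
    """Map each pronoun token to the most recently mentioned person."""
    clusters, last_person = {}, None
    for tok in tokens:
        if tok in people:
            last_person = tok
        elif tok.lower() in PRONOUNS and last_person:
            clusters[tok] = last_person
    return clusters

tokens = ["Barack_Obama", "was", "president", ";", "his", "wife", "is", "Michelle"]
print(resolve(tokens, {"Barack_Obama"}))  # {'his': 'Barack_Obama'}
```

The output cluster ("his" → "Barack_Obama") is exactly the kind of link the chatbot above was missing.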
- Normalization vs Information:
Normalization is a process that converts a list of words to a more uniform sequence. When we process natural language, in order to manage it in a more general way, we need to normalize it. Depending on the task, we might want all the words lowercased, or plural terms converted into singular ones if we don't want to consider dog and dogs as two different entities. Other times we might encounter different forms of the same verb in a document and want to consider just that verb instead of distinguishing between each form. All these processes normalize natural language in some way, and we will learn the techniques used to achieve it, but the key idea here is that when we normalize, we lose part of the information in exchange for being able to generalize better. This normalization/information trade-off is common in the study of data, and it is also very important in the study of natural language.
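A minimal sketch of this trade-off, using a deliberately crude plural rule (invented for this example; real pipelines use proper lemmatizers):

```python
# Minimal normalization sketch: lowercase plus a naive plural -> singular rule.
# The rule is deliberately crude (it would mangle words like "glass"),
# illustrating that normalization trades information for generality.
def normalize(word: str) -> str:
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

print(normalize("Dogs"), normalize("dog"))  # dog dog
```

After normalization, "Dogs" and "dog" collapse to the same token: we can generalize better, but the singular/plural distinction is lost.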
- Word Embedding
Word embeddings are basically a form of word representation that bridges the human understanding of language to that of a machine. They are distributed representations of text in an n-dimensional space, and they are essential for solving most NLP problems. It is word vectors that make technologies such as speech recognition and machine translation possible.
Similar to the way a painting might be a representation of a person, a word embedding is a representation of a word, using real-valued numbers. They are an arrangement of numbers representing the semantic (The purpose of semantic is to propose exact meanings of words and phrases, and remove confusion) and syntactic (The term syntax refers to grammatical structure) information of words and their context, in a format that computers can understand.
For the most part, computers can't understand natural language. Our programs are still line-by-line instructions telling a computer what to do – they often miss nuance (a subtle difference in or shade of meaning, expression, or sound) and context. How can you explain sarcasm to a machine?
There's good news though. There have been some important breakthroughs in natural language processing (NLP), the domain where researchers try to teach computers human language.
Famously, in 2013 Google researchers (Mikolov 2013) found a method that enabled a computer to learn relations between words such as:
king − man + woman ≈ queen. This method is called word embeddings.
The goal of a word-embedding algorithm is to embed words with meaning based on their similarity or relationship with other words.
In practice, words are embedded into a real vector space (a vector space is a set of objects that can be multiplied by regular numbers and added together according to certain rules), which comes with notions of distance and angle. We hope that these notions extend to the embedded words in meaningful ways, quantifying relations or similarity between different words.
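Here is a toy demonstration of the idea with hand-made 3-dimensional vectors (the dimensions "royalty/maleness/femaleness" and the values are invented for illustration; real embeddings are learned from data and have hundreds of dimensions):

```python
import math

# Hand-made 3-d "embeddings" (dims: royalty, maleness, femaleness).
# Real embeddings are learned from large corpora, not written by hand.
vecs = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: the 'angle' notion of closeness in vector space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# king - man + woman should land near queen
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
nearest = max(vecs, key=lambda word: cosine(vecs[word], target))
print(nearest)  # queen
```

With these toy vectors the analogy works exactly; with learned embeddings it only works approximately, which is why the relation is written with ≈ rather than =.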
For example, the Google algorithm I mentioned above discovered that certain nouns are singular/plural or have gender (Mikolov 2013): apple relates to apples the way car relates to cars, and king relates to queen the way man relates to woman.
They also found a country-capital relationship: Paris relates to France the way Rome relates to Italy.
And as further evidence that a word's meaning can be implied from its relationships with other words, they found that the learned structure for one language often correlated with that of another language.
Today, many companies and data scientists have found different ways to incorporate word2vec (a family of models for generating word embeddings) into their businesses and research. Spotify uses it to help provide music recommendations. Stitch Fix uses it to recommend clothing. Google is thought to use word2vec in RankBrain as part of its search algorithm.
- Personality, intention and style:
There are also different styles of expressing the same idea, depending on the personality or the intention in a specific scenario. Some of them (such as irony or sarcasm) may convey the opposite of the idea that might initially be understood, due to the context. We can say "Oh, great" referring to a feeling of joy, but also to the completely opposite feeling if we are being sarcastic.
Some basic NLP techniques:
Now that we know what is and is not NLP, and what problems it faces, let's look at some basic techniques. For the tooling we will use a Python library called spaCy, though this blog focuses on the concepts.
a. Tokenization:
Tokenization is one of the most common tasks when it comes to working with text data. But what does the term 'tokenization' actually mean?
"Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token."
Why is Tokenization required in NLP?
I want you to think about the English language here. Pick up any sentence you can think of and hold that in your mind as you read this section. This will help you understand the importance of tokenization in a much easier manner.
Before processing a natural language, we need to identify the words that constitute a string of characters. That’s why tokenization is the most basic step to proceed with NLP (text data). This is important because the meaning of the text could easily be interpreted by analyzing the words present in the text.
Let’s take an example. Consider the below string:
“This is a cat.”
What do you think will happen after we perform tokenization on this string?
We get ['This', 'is', 'a', 'cat'].
There are numerous uses of doing this. We can use this tokenized form to:
- Count the number of words in the text
- Count the frequency of each word, that is, the number of times a particular word appears.
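Both uses can be sketched in a few lines of plain Python (the regex below is only a crude stand-in for a real tokenizer like spaCy's, which handles contractions, URLs, punctuation and much more):

```python
import re
from collections import Counter

# Crude regex tokenization: grab runs of word characters.
def tokenize(text: str):
    return re.findall(r"\w+", text)

tokens = tokenize("This is a cat. This cat is small.")
print(tokens)                  # ['This', 'is', 'a', 'cat', 'This', 'cat', 'is', 'small']
print(len(tokens))             # word count: 8
print(Counter(tokens)["cat"])  # frequency of 'cat': 2
```

Once the text is tokenized, counting words and word frequencies is just list arithmetic.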
Let’s do an example using spaCy library.
spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports 49+ languages and provides state-of-the-art computation speed.
So, let’s see how we can utilize the awesomeness of spaCy to perform tokenization. We will use spacy.lang.en which supports the English language.
spaCy is quite fast compared to other libraries when performing NLP tasks (yes, even NLTK).
b. Stemming and Lemmatizing:
Stemming and Lemmatization are commonly used techniques in the development of search engines, keyword extractions, grouping similar words together and NLP.
Different forms of a word often communicate essentially the same meaning. For example, there's probably no difference in intent between a search for shoe and a search for shoes.
These differences between word forms are called inflections (in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender and mood), and they create challenges for query understanding. In the example of shoe and shoes, we probably want to treat the two forms identically. But we wouldn't want to do the same for the words logistic (in mathematics) and logistics (in business), which mean different things despite their apparent similarity. Nor would we want to equate the words universe and university, even though both derive from the same Latin root.
Stemming and Lemmatization are the two approaches to handle inflections in search queries.
The aim of both processes is the same: reducing the word into a common base or root. However, these two approaches follow a very different procedure.
To stem words is to remove word endings like -s and -ing.
When we stem a mushroom, we chop off its stem and keep the cap that most people think of as the edible portion. Similarly, when we stem a word, we chop off its inflections and keep what hopefully represents the main essence of the word. Technically, it depends on the type of mushroom, and we’re throwing away the mushroom stems while keeping the word stems. Nonetheless, I hope the metaphor is useful.
The best-known and most popular stemming approach for English is the Porter stemming algorithm, also known as the Porter stemmer. It is a collection of rules designed to reflect how English handles inflections. For example, the Porter stemmer chops both apple and apples down to appl, and it stems berry and berries to berri.
If we apply a stemmer to queries and indexed documents, we can increase recall by matching words against their other inflected forms. It is critical that we apply the same stemmer to both queries and documents.
You can find an implementation of the Porter stemmer in any major natural language processing library, such as NLTK.
Just as using a knife to chop a mushroom stem may leave a bit of the stem or cut into the cap, stemming algorithms sometimes remove too little or too much. For example, Porter stems both meanness and meaning to mean, creating a false equivalence. On the other hand, Porter stems goose to goos and geese to gees, when those two words should be equivalent.
This slicing can be successful on most occasions, but not always.
The PorterStemmer is a very popular stemmer. A typical NLTK example works like this:
- There is a stem module in NLTK from which we import. Importing the complete module would make the program heavy, as it contains thousands of lines of code, so from the entire stem module we import only PorterStemmer.
- We prepare a dummy list of variations of the same word.
- An object is created which belongs to the class nltk.stem.porter.PorterStemmer.
- We pass each word to the PorterStemmer one by one using a for loop. Finally, we get the root word of each word in the list.
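The NLTK snippet those steps describe boils down to `from nltk.stem import PorterStemmer` plus a loop calling `.stem()`. As a dependency-free sketch of the same idea, here is a crude suffix-stripper (the suffix list is invented for this example; the real Porter algorithm applies several ordered phases of rules):

```python
# Crude suffix-stripping stemmer: a toy stand-in for nltk.stem.PorterStemmer.
# It just chops a few common inflectional endings off the word.
SUFFIXES = ("ing", "ed", "s")

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["wait", "waiting", "waited", "waits"]:
    print(w, "->", stem(w))  # all four stem to 'wait'
```

As the over/under-stemming examples above (meanness/meaning, goose/geese) show, even the real Porter stemmer is a blunt knife; this toy version is blunter still.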
Lemmatization is the process of converting the words of a sentence into their dictionary form.
E.g. Word: Feet, Lemma: Foot.
Here foot is the dictionary form of feet.
Stemming, by contrast, reduces the words of a sentence to their unchanging portion.
Keep a keen note here: applying stemming may or may not give a meaningful word, but lemmatization always returns a meaningful dictionary word.
You can further define lemmatization as grouping together the inflected forms of words so that they can be analysed as a single term. A lemma (plural: lemmata) is the canonical form of a word (its dictionary form, or citation form).
It uses vocabulary analysis of words, aiming to remove inflectional endings and return the dictionary form of a word.
Lemmatization is accurate but computationally expensive, while stemming is less accurate but faster.
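Because lemmatization depends on vocabulary, a toy version is essentially a dictionary lookup (the tiny mapping below is invented for illustration; real lemmatizers such as NLTK's WordNetLemmatizer combine a large vocabulary with morphological rules):

```python
# Toy dictionary-based lemmatizer: irregular forms resolved by lookup.
# This is why lemmatization needs vocabulary knowledge a stemmer lacks:
# no suffix rule turns "feet" into "foot".
LEMMAS = {"feet": "foot", "geese": "goose", "better": "good", "ran": "run"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word.lower(), word)

print(lemmatize("Feet"))   # foot
print(lemmatize("geese"))  # goose
```

Contrast this with stemming: a stemmer would leave "feet" untouched (or mangle it), while the lemma lookup returns the proper dictionary form.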
c. Coreference resolution:
It consists of resolving the coreferences present in our corpus (a collection of written texts). This can also be thought of as a normalization or preprocessing task.
d. Part-of-speech (POS) Tagging:
Part-of-Speech (PoS) tagging is a useful technique used in NLP projects.
Each language is made up of a number of parts of speech such as verbs, nouns, adverbs, adjectives and so on.
PoS tagging is all about assigning language-specific parts of speech to a text.
NLTK is a fantastic library to support your NLP project. It provides a number of tagging models. The default tagging model is the maxent_treebank_pos_tagger.
The part of speech explains how a word is used in a sentence. There are eight main parts of speech – nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.
- Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
- Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
- Adjective(ADJ)- big, happy, green, young, fun, crazy, three
- Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
- Preposition (P)- at, on, in, from, with, near, between, about, under
- Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
- Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
- Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!
Most POS are divided into sub-classes. POS Tagging simply means labeling words with their appropriate Part-Of-Speech.
NLTK has a function to get POS tags, and it works after the tokenization process:
- Tokenize the text (word_tokenize)
- Apply pos_tag to the result of the above step, i.e. nltk.pos_tag(tokenized_text)
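As a dependency-free illustration of what those two steps produce (the tiny lexicon below is invented for this example; NLTK's real tagger is statistical and uses context, not just word lookup):

```python
# Toy lexicon-based POS tagger: look each token up in a tiny hand-made table.
# Unknown words default to NN (noun), a common fallback heuristic.
LEXICON = {"the": "DT", "a": "DT", "cat": "NN", "dog": "NN",
           "sits": "VBZ", "on": "IN", "mat": "NN"}

def pos_tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(pos_tag("The cat sits on the mat".split()))
# [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
```

The output format – a list of (word, tag) pairs – is the same shape nltk.pos_tag returns.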
Let's go over some of the tagging abbreviations:
- CC – coordinating conjunction
- DT – determiner (used before a noun to clarify it), e.g. "My pen", where pen is the noun and My is the determiner.
- PRP – personal pronoun (him, himself, herself)
- VBZ – verb, present tense with 3rd person singular (bases)
- NNS – noun plural (desks)
- NN – noun, singular (cat, tree)
- IN – preposition/subordinating conjunction
- JJ – adjective (large)
To view the complete abbreviation list, follow this link.
e. Dependency Parsing:
Sometimes, instead of the category (PoS tag) of a word, we want to know the role of that word in a specific sentence of our corpus; this is the task of dependency parsers. The objective is to obtain the dependencies or relations of words in the format of a dependency tree.
The considered dependencies are in general terms subject, object, complement and modifier relations.
As an example, given the sentence “I want to play the piano” a dependency parser would produce the following tree:
Here we can see that the dependency parser that I use (SpaCy’s dependency parser) also outputs the POS tags. If you think about it, it makes sense because we first need to know the category of each word to extract dependencies.
We will see in detail the types of dependencies, but in this case we have:
- want — I: nominal subject
- want — play: open clausal complement
- play — to: auxiliary verb
- play — the piano: direct object
where a — b: R means “b is R of a”. For example, “the piano is direct object of play” (which is play — the piano: direct object from above).
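The tree above can be represented directly as a list of (head, dependent, relation) edges; with spaCy you would read the same information off each token's dep_ and head attributes, but here it is written out by hand for clarity:

```python
# The dependency tree for "I want to play the piano" as (head, dependent,
# relation) edges, following the "b is R of a" convention from the text.
edges = [
    ("want", "I", "nominal subject"),
    ("want", "play", "open clausal complement"),
    ("play", "to", "auxiliary verb"),
    ("play", "piano", "direct object"),
]

for head, dep, rel in edges:
    print(f"{dep} is the {rel} of {head}")
```

Reading the edges this way makes the "b is R of a" convention explicit: for example, "piano is the direct object of play".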
f. Named Entity Recognition (NER):
In NLP, Named Entity Recognition is an important method for extracting relevant information. In the real world, in our daily conversation, we don't work directly with the categories of words. Instead, for example, if we want to build a Netflix chatbot, we want it to recognize both 'Batman' and 'Avatar' as instances of the same group, which we call 'films', but 'Steven Spielberg' as a 'director'. This concept of a semantic field (semantics is the study of the relationship between words and how we draw meaning from them) dependent on a context is what we define as an entity (a thing with distinct and independent existence). The role of a named entity recognizer is to detect relevant entities in our corpus (a collection of written texts).
For example, if our NER knows the entities ‘film’, ‘location’ and ‘director’, given the sentence “James Cameron filmed part of Avatar in New Zealand”, it will output:
- James Cameron: DIRECTOR
- Avatar: FILM
- New Zealand: LOCATION
Note that in the example instances of entities can be just a single word (‘Avatar’) or several ones (‘New Zealand’ or ‘James Cameron’).
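A toy gazetteer-based NER sketch can reproduce exactly this output, including the multi-word entities (the lookup table is invented for this example; real NER systems are statistical and can recognize names they have never seen):

```python
# Toy gazetteer-based NER: scan the sentence for known entity names.
# A lookup approach only finds names it already knows, unlike a trained NER.
GAZETTEER = {"James Cameron": "DIRECTOR", "Avatar": "FILM", "New Zealand": "LOCATION"}

def ner(sentence: str):
    return [(name, label) for name, label in GAZETTEER.items() if name in sentence]

print(ner("James Cameron filmed part of Avatar in New Zealand"))
# [('James Cameron', 'DIRECTOR'), ('Avatar', 'FILM'), ('New Zealand', 'LOCATION')]
```

Because the gazetteer stores full strings, multi-word entities like 'New Zealand' are matched as easily as single-word ones like 'Avatar'.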
Computational linguistics has become a critical area of interest in recent years, as companies work to build systems capable of effortless, unsupervised, and socially acceptable direct interaction with customers. Everyone from small tech startups to major technology companies like Amazon (Alexa), Apple (Siri) and Google (Duplex) is investing effort to make their systems feel more human. New and exciting things are happening in this field every day.
Hurray!! 🎉 You now know the very basics of NLP.
Thank you & Keep Learning.. ✌🏻