Natural Language Processing (NLP) is probably the hottest topic in Artificial Intelligence (AI) right now. After the breakthrough of GPT-3 with its ability to write essays, code and also create images from text, Google announced its new trillion-parameter AI language model that’s almost 6 times bigger than GPT-3. These are massive advances in the discipline that keep pushing the boundaries to new limits.

How is this possible? How can machines interact with human language? There are dozens of subfields in NLP, but we must start with the basics. In another post I went through some tips on how to begin the NLP journey. Now it’s time to talk about normalizing text.

Why do we need text normalization?

When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard”. This helps us to reduce the amount of different information that the computer has to deal with, and therefore improves efficiency. The goal of normalization techniques like stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

An example

Jaron Lanier said:

“It would be unfair to demand that people cease pirating files when those same people aren’t paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I’d like to see everyone eventually pay for music and the like, I’d not ask for it until there’s reciprocity.”

Let’s start by saving the phrase as a variable called “sentence”:

sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In another post I went through some techniques to perform Exploratory Data Analysis over text, so now I’ll focus on the different normalization methods.

Expanding contractions

In our sentence, we have contractions such as “aren’t”, “I’d”, and “there’s”, which should be handled before further normalization.

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. Expanding them contributes to text standardization.

There are different ways to expand contractions, but one of the most straightforward is to create a dictionary of contractions with their corresponding expansions:

contractions_dict = {
    "ain't": "are not", "'s": " is", "aren't": "are not", "can't": "cannot",
    "can't've": "cannot have", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not",
    "he'd": "he would", "he'd've": "he would have", "he'll": "he will",
    "he'll've": "he will have", "how'd": "how did", "how'd'y": "how do you",
    "how'll": "how will", "I'd": "I would", "I'd've": "I would have",
    "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
    "isn't": "is not", "it'd": "it would", "it'd've": "it would have",
    "it'll": "it will", "it'll've": "it will have", "let's": "let us",
    "ma'am": "madam", "mayn't": "may not", "might've": "might have",
    "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not",
    "needn't've": "need not have", "o'clock": "of the clock", "oughtn't": "ought not",
    "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
    "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
    "that'd": "that would", "that'd've": "that would have", "there'd": "there would",
    "there'd've": "there would have", "they'd": "they would",
    "they'd've": "they would have", "they'll": "they will",
    "they'll've": "they will have", "they're": "they are", "they've": "they have",
    "to've": "to have", "wasn't": "was not", "we'd": "we would",
    "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
    "we're": "we are", "we've": "we have", "weren't": "were not",
    "what'll": "what will", "what'll've": "what will have", "what're": "what are",
    "what've": "what have", "when've": "when have", "where'd": "where did",
    "where've": "where have", "who'll": "who will", "who'll've": "who will have",
    "who've": "who have", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have",
    "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
    "y'all'd": "you all would", "y'all'd've": "you all would have",
    "y'all're": "you all are", "y'all've": "you all have", "you'd": "you would",
    "you'd've": "you would have", "you'll": "you will",
    "you'll've": "you will have", "you're": "you are", "you've": "you have",
}

Then, we can use regular expressions to update the text:

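import re

# Longest keys first, so "can't've" is tried before "can't"; escape keys to match them literally.
keys = sorted(contractions_dict, key=len, reverse=True)
contractions_re = re.compile("(%s)" % "|".join(map(re.escape, keys)))

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

sentence = expand_contractions(sentence)
print(sentence)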
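
The order in which the regular expression tries the dictionary keys matters, because alternation stops at the first alternative that matches. The sketch below uses a tiny two-entry dictionary (a hypothetical subset, chosen only for illustration) to show what can go wrong when a short contraction is tried before a longer one that starts the same way:

```python
import re

# Tiny hypothetical subset of the contraction dictionary, for illustration only.
contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
}

def build_pattern(keys):
    # Escape each key so that every character is matched literally.
    return re.compile("|".join(re.escape(k) for k in keys))

def expand(text, pattern):
    return pattern.sub(lambda m: contractions[m.group(0)], text)

# Dictionary order: "can't" is tried first and wins, leaving a stray "'ve".
naive = build_pattern(contractions.keys())
# Longest key first: "can't've" gets a chance to match before "can't".
longest_first = build_pattern(sorted(contractions, key=len, reverse=True))

text = "I can't've seen it."
naive_result = expand(text, naive)
fixed_result = expand(text, longest_first)
print(naive_result)  # I cannot've seen it.
print(fixed_result)  # I cannot have seen it.
```

Sorting the keys by length is one simple way to avoid the problem; it is a design choice for this sketch rather than the only option.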

Tokenize

Tokenization is the process of segmenting running text into sentences and words. In essence, it’s the task of cutting a text into pieces called tokens.
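
To make the idea concrete, here is a minimal sketch of a word tokenizer built on a single regular expression. The function name simple_tokenize and the pattern are assumptions made for illustration; a library tokenizer such as NLTK's handles many more cases (abbreviations, numbers, special symbols) than this does.

```python
import re

def simple_tokenize(text):
    # \w+(?:'\w+)? keeps a word and a simple contraction such as "don't" together;
    # [^\w\s] emits every punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("Ordinary people are relentlessly spied on, and not compensated.")
print(tokens)
# ['Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', '.']
```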

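import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt models; newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")

sent = word_tokenize(sentence)
print(sent)

Next, we should remove punctuation.

Remove punctuation

def remove_punct(tokens):
    # Keep only purely alphabetic tokens; note that this also drops numbers.
    return [word for word in tokens if word.isalpha()]

sent = remove_punct(sent)
print(sent)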
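
One thing to keep in mind is that an isalpha() filter removes more than punctuation: any token containing a digit or a hyphen is dropped too. The sketch below compares it with a gentler alternative that only removes tokens made up entirely of punctuation characters. The token list and the name remove_punct_only are hypothetical, chosen for illustration.

```python
import string

def remove_punct(tokens):
    # Same filter as in the article: keep only purely alphabetic tokens.
    return [word for word in tokens if word.isalpha()]

def remove_punct_only(tokens):
    # Alternative: drop a token only when every character is punctuation.
    return [
        word for word in tokens
        if not all(ch in string.punctuation for ch in word)
    ]

# Hypothetical, already tokenized input.
tokens = ["GPT-3", "has", "175", "billion", "parameters", ",", "state-of-the-art", "!"]

strict = remove_punct(tokens)
gentle = remove_punct_only(tokens)
print(strict)  # ['has', 'billion', 'parameters']
print(gentle)  # ['GPT-3', 'has', '175', 'billion', 'parameters', 'state-of-the-art']
```

Which of the two is preferable depends on whether numbers and hyphenated terms carry meaning for the task at hand.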

Now we can perform stemming and lemmatization over the sentence.

Stemming

Stemming is the process of reducing words to their word stem or root form. The objective of stemming is to reduce related words to the same stem even if the stem is not a dictionary word. For example, “connection”, “connected”, and “connecting” all reduce to the common stem “connect”.

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm.

from nltk.stem import PorterStemmer
ps = PorterStemmer()

ps_stem_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_stem_sent)

What happened here? The words have been reduced, but some of the results are not real English words. Stemming is a crude heuristic process that chops off the ends of words in the hope of reaching the base form correctly most of the time, so its output is not guaranteed to be an actual word.

On top of that, stemming is prone to two main types of error:

Over-stemming: a larger part of the word is chopped off than required, so words with different meanings are wrongly reduced to the same stem when they should have kept distinct stems. For example, “university” and “universe” are both reduced to “univers”.
Under-stemming: two or more words that should share a stem are reduced to different stems. For example, a stemmer may reduce “data” and “datum” to “dat” and “datu” respectively, instead of the same stem “dat”.
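
Both kinds of error are easy to reproduce with a toy suffix stripper. The sketch below is not Porter's algorithm: the suffix list, the minimum stem length and the function names are assumptions made up for illustration. It groups words by the stem they end up with, so conflated words land in the same group.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; the real Porter algorithm applies
# several ordered rule phases with conditions on the remaining stem.
SUFFIXES = ("ing", "ion", "ity", "ed", "e", "s")

def toy_stem(word, min_stem=3):
    word = word.lower()
    # Try longer suffixes first and keep at least min_stem characters.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)

words = ["connection", "connected", "connecting",
         "university", "universe", "data", "datum"]
groups = group_by_stem(words)
for stem, members in groups.items():
    print(stem, members)
# connect ['connection', 'connected', 'connecting']
# univers ['university', 'universe']
# data ['data']
# datum ['datum']
```

The first group is the intended behaviour, the second is over-stemming (two unrelated words share a stem), and the last two lines show under-stemming (two related words keep different stems).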

An improvement to the Porter Stemmer is the Snowball Stemmer, which stems words to a more accurate stem.

from nltk.stem import SnowballStemmer
sb = SnowballStemmer("english")

sb_stem_sent = [sb.stem(words_sent) for words_sent in sent]
print(sb_stem_sent)

Still not good, right? Let’s follow a different approach.

Lemmatization

Unlike stemming, lemmatization reduces words to their base word, reducing the inflected words properly and ensuring that the root word belongs to the language.

It’s usually more sophisticated than stemming, since stemmers work on an individual word without knowledge of the context. In lemmatization, the root word is called a lemma. A lemma is the canonical form, dictionary form, or citation form of a set of words.
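
The core idea can be sketched as a dictionary lookup: instead of chopping suffixes, each inflected form is mapped to a known dictionary entry. The lemma table below is hand-written and hypothetical; a real lemmatizer consults a full lexical database rather than a handful of entries.

```python
# Hypothetical, hand-written lemma table for illustration only.
LEMMA_TABLE = {
    "are": "be",
    "paid": "pay",
    "files": "file",
    "schemes": "scheme",
    "taken": "take",
}

def lookup_lemmatize(word, table=LEMMA_TABLE):
    # Unknown words are returned unchanged rather than being chopped.
    return table.get(word.lower(), word)

tokens = ["people", "are", "paid", "files", "schemes", "taken"]
lemmas = [lookup_lemmatize(t) for t in tokens]
print(lemmas)  # ['people', 'be', 'pay', 'file', 'scheme', 'take']
```

Every output is either a listed dictionary form or the original word, which is why lemmatization does not produce truncated stems such as “univers”.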

Just like for stemming, there are different lemmatizers. For this example, we’ll use the WordNet lemmatizer.

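import nltk
from nltk.stem.wordnet import WordNetLemmatizer

# The lemmatizer looks words up in the WordNet corpus, which has to be downloaded once.
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

lem_sent = [lemmatizer.lemmatize(words_sent) for words_sent in sent]
print(lem_sent)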

It’s possible to improve the results of lemmatization even further if you provide the context in which you want to lemmatize, which you can do through part-of-speech (POS) tagging.

POS tagging is the task of assigning each word in a sentence the part of speech that it assumes in that sentence. The primary target of POS tagging is to identify the grammatical group of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on the context.

POS tagging improves accuracy

You can pass the POS to the lemmatizer through its pos parameter, which in NLTK defaults to noun. For example, the word ‘leaves’ without a POS tag would get lemmatized to the word ‘leaf’, but with a verb tag, its lemma would become ‘leave’.

To get the best results, you’ll have to feed the POS tags to the lemmatizer, or otherwise it might not reduce all the words to the lemmas you desire.
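
One practical detail: NLTK's default tagger (nltk.pos_tag) returns Penn Treebank tags such as “NNS” or “VBZ”, while the WordNet lemmatizer expects single-letter codes. A small mapping function bridges the two. The sketch below is one common way to write it; the function name penn_to_wordnet and the hand-written tagged list are assumptions for illustration, and the code runs without NLTK installed.

```python
# WordNet's single-letter POS codes (the same values as the ADJ, VERB, NOUN
# and ADV constants in nltk.corpus.wordnet).
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("V"):
        return VERB
    if tag.startswith("RB"):
        return ADV
    # Fall back to noun, the same default the lemmatizer uses.
    return NOUN

# Hypothetical tagger output, written by hand for illustration.
tagged = [("people", "NNS"), ("are", "VBP"), ("relentlessly", "RB"),
          ("lucrative", "JJ"), ("leaves", "VBZ")]

wordnet_tagged = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tagged)
# [('people', 'n'), ('are', 'v'), ('relentlessly', 'r'), ('lucrative', 'a'), ('leaves', 'v')]

# With NLTK and its data available, each pair could then be passed on:
# lemmas = [lemmatizer.lemmatize(word, pos=pos) for word, pos in wordnet_tagged]
```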

Conclusion

While lemmatization helps a lot for some queries, it equally hurts performance for others. Stemming, on the other hand, increases recall while harming precision. Getting better value from text normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.

Even though text normalization is considered pretty solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and unstable orthography.

So, should we always normalize?

It depends on the nature of the problem. Topic modeling, for example, relies on the distribution of content words, the identification of which is dependent on a string match between words, which is achieved by lemmatizing their forms so that all variants are consistent across documents.
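
The effect on word counts is easy to see with a toy example. In the sketch below, the documents and the lemma map are hypothetical and hand-written; the point is only that variants of the same word are counted separately until they are mapped to one form.

```python
from collections import Counter

# Hypothetical lemma map and toy documents, for illustration only.
LEMMAS = {"models": "model", "topics": "topic"}

docs = [
    ["topic", "models", "group", "words"],
    ["a", "model", "finds", "topics"],
]

def count_terms(documents, normalize=lambda w: w):
    # Count every token across all documents after applying normalize().
    return Counter(normalize(w) for doc in documents for w in doc)

raw_counts = count_terms(docs)
lemma_counts = count_terms(docs, normalize=lambda w: LEMMAS.get(w, w))

print(raw_counts["model"], raw_counts["models"])  # 1 1
print(lemma_counts["model"])                      # 2
```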

Lemmatization is also important for training word vectors, since accurate counts within the window of a word would be disrupted by an irrelevant inflection like a simple plural or present tense inflection. On the other hand, some sentiment analysis methods (e.g. VADER) assign different ratings depending on the form of the word, and therefore the input should not be stemmed or lemmatized.

Not every model or application needs normalization in its preprocessing to be effective, and normalization may actually impede its success or accuracy. We need to ask ourselves: is important information being lost by normalizing? Or is irrelevant information being removed?