
About Text Vectorization

The magic of converting text to numbers



Gharibi M

3 years ago | 5 min read

This post walks you through the basics of text vectorization, that is, converting text to vectors (lists of numbers). We present the Bag of Words (BOW) model and its flavors: frequency vectors, One Hot Encoding (OHE), and Term Frequency/Inverse Document Frequency (TF/IDF).

Why text vectorization?

Representing text with numbers has many advantages, mainly:

  1. Computers do not understand text or the relations between words and sentences, so you need a way to represent these words with numbers, which is what computers understand.
  2. Such vectors can be used in many applications, such as question answering systems, recommendation systems, sentiment analysis, and text classification, and they also make tasks such as search and finding synonyms easier.

Bag of Words (BOW)

BOW is a technique for extracting the features of a document. Features are the characteristics and properties you use to make a decision (to buy a house, you look at a few features such as the number of rooms and the location). The features of a text include how many unique words appear in the corpus, how often each word occurs, and so on.

BOW is a feature extraction technique whose output is a vector space that represents each document in the corpus. The length of each vector (its number of dimensions) corresponds to the number of unique words in the corpus (no repetition; each word is counted once).

The BOW model has different flavors, each of which extends or modifies the base BOW. Next, we will discuss three of them: frequency vectors (count vectors), One Hot Encoding, and Term Frequency/Inverse Document Frequency.

Frequency Vectors

This is the simplest encoding technique, yet it is still effective in some use cases. We simply fill the document vector with the count of how many times each word appears in the document. As an example, let us say our corpus has two documents.

The first one contains “Alice loves pasta”, while the second contains “Alice loves fish. Alice and Bob are friends”. To represent the counts, we can use either a table or JavaScript Object Notation (JSON), as below:

Table representation:

+------+-------+-------+-------+------+-----+-----+-----+---------+
|      | Alice | loves | pasta | fish | and | Bob | are | friends |
+------+-------+-------+-------+------+-----+-----+-----+---------+
| doc1 |   1   |   1   |   1   |  0   |  0  |  0  |  0  |    0    |
+------+-------+-------+-------+------+-----+-----+-----+---------+
| doc2 |   2   |   1   |   0   |  1   |  1  |  1  |  1  |    1    |
+------+-------+-------+-------+------+-----+-----+-----+---------+

JSON representation:

doc1: {"Alice": 1, "loves": 1, "pasta": 1}
doc2: {"Alice": 2, "loves": 1, "fish": 1, "and": 1, "Bob": 1, "are": 1, "friends": 1}

You can also combine them: {"Alice": 3, "loves": 2, "pasta": 1, "fish": 1, "and": 1, "Bob": 1, "are": 1, "friends": 1}

As you can see, we have 8 unique words in our corpus. Therefore, our vectors will have a size of 8. To represent document 1, we simply take the first row of our table: [1, 1, 1, 0, 0, 0, 0, 0]. Such vectors make it easy to compare documents.
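
To make this concrete, here is a minimal Python sketch that builds these count vectors for the two example documents. The `tokenize` and `count_vector` helpers and the punctuation-stripping tokenization are our own illustrative choices, not part of the BOW definition:

```python
from collections import Counter

# Example corpus from the post.
docs = [
    "Alice loves pasta",
    "Alice loves fish. Alice and Bob are friends",
]

def tokenize(text):
    # Simplistic tokenization: drop periods and split on whitespace.
    return text.replace(".", "").split()

# Build the vocabulary: one dimension per unique word in the corpus.
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

def count_vector(doc):
    # Frequency vector: how many times each vocabulary word appears in doc.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

for i, doc in enumerate(docs, start=1):
    print(f"doc{i}:", count_vector(doc))
# doc1: [1, 1, 1, 0, 0, 0, 0, 0]
# doc2: [2, 1, 0, 1, 1, 1, 1, 1]
```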

While this technique is helpful in some use cases, it has some limitations: it does not keep the document structure (it discards the order of the words and just counts them), it suffers from sparsity (most of the values in the vector are zeros, which increases the time complexity and adds bias to the model), and the stop words (such as ‘and’, ‘or’, ‘is’, ‘the’, etc.) appear far more often than the other words.

Therefore, we use techniques such as stemming and lemmatization. We also remove the stop words and the rare words that appear only a few times in the entire corpus.
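
As an illustration, the sketch below removes stop words and applies a very naive suffix-stripping stemmer. The tiny `STOP_WORDS` set and the `naive_stem` function are toy stand-ins for what libraries such as NLTK or spaCy provide:

```python
# Toy stop-word list for demonstration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an", "are"}

def naive_stem(word):
    # Very rough stemming: strip a couple of common English suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().replace(".", "").split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Alice loves fish. Alice and Bob are friends"))
# ['alice', 'love', 'fish', 'alice', 'bob', 'friend']
```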

One Hot Encoding

As discussed for frequency vectors, tokens that appear frequently have a larger magnitude than tokens that appear less often. As a solution, OHE provides a boolean vector filled with only 1’s and 0’s: we place a 1 if the word appears in the document (1 instead of the count) and 0 otherwise. Document 2 can be represented as [1, 1, 0, 1, 1, 1, 1, 1].

One Hot Encoding can also be used to represent individual words: 1 for the word we want to represent and 0 for the rest. The word “Alice” can be represented as [1, 0, 0, 0, 0, 0, 0, 0], or we can add the count as well, so “Alice” can be represented as [3, 0, 0, 0, 0, 0, 0, 0] (we will discuss this in detail in part 2 of this blog).
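
Here is a small sketch of both uses of OHE over the same 8-word vocabulary. The `VOCAB` list and the helper names are ours, and we reuse the same simple tokenization assumption as before:

```python
# Vocabulary in the same order as the table above.
VOCAB = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]

def tokens_of(text):
    # Same tokenization assumption as before: drop periods, split on spaces.
    return text.replace(".", "").split()

def one_hot_document(doc):
    # 1 if the vocabulary word appears in the document, 0 otherwise.
    present = set(tokens_of(doc))
    return [1 if word in present else 0 for word in VOCAB]

def one_hot_word(word):
    # 1 only at the position of the word itself.
    return [1 if word == w else 0 for w in VOCAB]

print(one_hot_document("Alice loves fish. Alice and Bob are friends"))
# [1, 1, 0, 1, 1, 1, 1, 1]
print(one_hot_word("Alice"))
# [1, 0, 0, 0, 0, 0, 0, 0]
```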

Term Frequency/Inverse Document Frequency

So far we have been treating each document as a standalone entity without looking at the context of the corpus. TF/IDF is one of the common techniques to normalize the frequency of tokens in a document with respect to the corpus. TF/IDF combines two things:

1. Term frequency tf(t, d): how frequently a term t occurs in a document d. If we denote the raw count by f(t, d), then the simplest tf scheme is tf(t, d) = f(t, d) (other schemes are discussed below). Let us also denote the total number of words in document d by len(d).

For example, to rank documents that are most related to the query “the blue sky”, we count the number of times each word occurs in each document. However, since documents differ in size, it is not fair to compare how many times a word occurs in a document with 10 words and in a document with 1M words. Therefore, we scale tf to prevent the bias toward long documents as follows:

tf(t, d) = f(t, d) / len(d)

Other tf schemes that adjust or reduce the count of the most repeated words in a document (see the sketch after this list):

  • Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise
  • Term Frequency adjusted for document length: tf(t, d) = f(t, d)/len(d)
  • Logarithmically scaled frequency: tf(t, d) = log(1 + f(t, d))
  • Augmented frequency: tf(t, d) = 0.5 + 0.5 · f(t, d) / m, where m is the raw count of the most frequent word in d
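
Here is a small sketch that computes these tf variants for one document; the function name `tf_variants` and the dictionary keys are our own labels:

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    # Raw count f(t, d), document length len(d), and the count m of the
    # most frequent word in d, as used in the formulas above.
    counts = Counter(doc_tokens)
    f = counts[term]
    length = len(doc_tokens)
    m = max(counts.values())
    return {
        "raw":        f,
        "boolean":    1 if f > 0 else 0,
        "length_adj": f / length,
        "log_scaled": math.log(1 + f),
        "augmented":  0.5 + 0.5 * f / m,
    }

doc2 = "Alice loves fish Alice and Bob are friends".split()
print(tf_variants("Alice", doc2))
# {'raw': 2, 'boolean': 1, 'length_adj': 0.25, 'log_scaled': 1.0986..., 'augmented': 1.0}
```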

2. Inverse document frequency: it measures how important a term is. IDF reduces the weight of common words that appear across many documents. Given our previous example “the blue sky”, the word “the” is a common word, so term frequency alone tends to incorrectly emphasize documents that repeat low-information words such as “the”.

As a solution, we calculate the log of the total number of documents D divided by n, the number of documents in which t appears:

idf(t, D) = log(D / n)

and finally, TF/IDF can be calculated as:

tf-idf(t, d, D) = tf(t, d) · idf(t, D)

To build the final vectors, we simply fill them with TF-IDF scores instead of raw frequency counts or OHE values.
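
Putting the two parts together, here is a from-scratch sketch that follows the formulas above (tf(t, d) = f(t, d) / len(d) and idf(t, D) = log(D / n)). The helper names are ours; off-the-shelf implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing and normalization:

```python
import math

# Example corpus, already tokenized.
docs = [
    "Alice loves pasta".split(),
    "Alice loves fish Alice and Bob are friends".split(),
]

def tf(term, doc_tokens):
    # Length-adjusted term frequency: f(t, d) / len(d).
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(D / n), where n is the number of documents containing the term.
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

vocab = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]
for i, doc in enumerate(docs, start=1):
    vector = [round(tf_idf(word, doc, docs), 3) for word in vocab]
    print(f"doc{i}:", vector)
# "Alice" and "loves" appear in both documents, so idf = log(2/2) = 0 and their
# TF-IDF weight is 0; words unique to one document get a positive weight.
```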
