
NLP Tutorial📚: Gensim Word2Vec[With Codes]🧑‍💻

Ravi kumar

a year ago | 4 min read

In this post, we are going to talk about the Gensim Word2Vec model and walk through an end-to-end implementation of it.

Let’s start with our usual drill by listing all the topics that we are going to cover in this post:

  • What is Gensim?
  • What is the Word2Vec model and how does it work?
  • End-to-end implementation using an example
  • Conclusion

What is Gensim?

Gensim is a software library for Python that is used to analyze and understand text data. It is designed to work well with large collections of text and has efficient algorithms for a variety of natural language processing (NLP) tasks, including topic modeling, document similarity analysis, and pre-processing text data. The library is open-source and commonly used in NLP and text-mining projects.

  • It is designed to handle large, sparse text collections and provides efficient implementations of popular algorithms such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)
  • It can also be used for pre-processing text data, such as tokenization and lemmatization (a quick tokenization example follows this list)
  • It provides tools for post-processing topic models too, such as calculating topic coherence scores
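For instance, Gensim’s simple_preprocess utility (which we will also use later in this post) handles basic tokenization in a single call. A minimal illustration, with a sample sentence made up for demonstration:

from gensim.utils import simple_preprocess

# lowercases, tokenizes, and drops tokens shorter than 2 or longer than 15 characters
print(simple_preprocess("The Room was VERY clean, and the staff were friendly!"))
# -> ['the', 'room', 'was', 'very', 'clean', 'and', 'the', 'staff', 'were', 'friendly']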

What is the Word2Vec model and how does it work?

  • Gensim’s word2vec is an implementation of the word2vec algorithm for learning vector representations of words (each word is mapped to a vector, i.e., a list of numbers)
  • The word2vec algorithm is a neural network-based approach for natural language processing (NLP) that learns to represent words in a high-dimensional vector space, where semantically similar words are located near one another (see the toy sketch after this list)
  • These vector representations can be used in various NLP tasks such as text classification, named entity recognition, and machine translation
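As a toy sketch of what these vector representations look like, here is a Word2Vec model trained on a made-up three-sentence corpus (invented purely for illustration; vectors learned from so little text are not meaningful):

from gensim.models import Word2Vec

# toy corpus, invented for illustration only
sentences = [
    ["the", "room", "was", "clean"],
    ["the", "room", "was", "dirty"],
    ["the", "staff", "were", "polite"],
]
toy_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=50)

print(toy_model.wv["room"])                        # a 10-dimensional vector for "room"
print(toy_model.wv.most_similar("clean", topn=2))  # nearest neighbors in this tiny space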

End-to-end implementation using an example

  • Find the full code here: Link
  • Find the Dataset here: Link
  1. Let’s first start by importing all the important libraries

# imports needed and set up logging
import gzip
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Logging will help us see which operations are going on; for more details, read this.

2. Unzip and read the dataset (This is a car and hotel reviews dataset)

data_file = "reviews_data.txt.gz"

with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

This is what the dataset looks like.

3. Convert the Dataset into a list

def read_input_file(input_file):
    """This method reads the input file, which is in gzip format"""
    logging.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input_file(data_file))
logging.info("Done reading data file")

Output for step 3

4. Training a Word2Vec model

model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)

# note: passing `documents` to the constructor already builds the vocabulary and
# trains the model; this call trains for a further 10 epochs on the same data
model.train(documents, total_examples=len(documents), epochs=10)

vector_size: The size of the dense vector that represents each token or word. If you have very limited data, the size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100–150 has worked well for me.

window: The maximum distance between the target word and its neighboring words. If a neighbor sits farther than the window width to the left or right of the target word, it is not considered related to it. In theory, a smaller window should give you terms that are more related. If you have lots of data, the window size should not matter too much, as long as it's a decent-sized window.

min_count: The minimum frequency count of words. The model ignores words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.

workers: The number of worker threads to use behind the scenes.

Output for step 4
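Once training finishes, a quick sanity check (using the model variable from step 4) confirms that these settings took effect; note that looking up a word dropped by min_count raises a KeyError:

# number of words retained after the min_count filter
print(len(model.wv))

# each retained word maps to a 150-dimensional vector, matching vector_size=150
print(model.wv["dirty"].shape)   # -> (150,)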

5. Let’s see some outputs 

Example number 1:

  • Looking up words similar to the word dirty. All we need to do here is to call the most_similar function and provide the word dirty as the positive example. This returns the top 10 similar words.

w1 = "dirty"
model.wv.most_similar(positive=w1)

Output of word ‘dirty’

As an assignment, find similar words for polite, france, and shocked.
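One way to run this assignment in a loop (this assumes all three words survived the min_count filter; if one was dropped, most_similar raises a KeyError):

for w in ["polite", "france", "shocked"]:
    print(w, "->", model.wv.most_similar(positive=w, topn=5))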

Example number 2:

  • You can even specify several positive examples to get things that are related in the provided context and provide negative examples to say what should not be considered related

# get everything related to stuff on the bed
w1 = ["bed", "sheet", "pillow"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)

Output of example 2

Example number 3:

  • You can find the similarity between two words in the vocabulary using “similarity”

# similarity between two different words
model.wv.similarity(w1="dirty", w2="smelly")

Output of example 3

It uses cosine similarity to compare the two word vectors; to read more about this, click here.
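To see that it really is cosine similarity, here is a quick cross-check against a manual computation (assuming numpy is installed and the model from step 4):

import numpy as np

a = model.wv["dirty"]
b = model.wv["smelly"]

# cosine similarity computed by hand should match model.wv.similarity
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(model.wv.similarity(w1="dirty", w2="smelly"))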

Example number 4:

  • Find the odd one out in a given list of words by using “doesnt_match”

# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat", "dog", "france"])

Output of example 4

Conclusion

In this post, we learned what Gensim is and how to use the Gensim Word2Vec model to find similar words, pick the odd one out, and compute similarity scores.

More about me:

I am a data science enthusiast🌺, learning and exploring how Math, Business, and Technology can help us make better decisions in the field of data science.

If this article helped you, don’t forget to follow, like, and share it with your friends👍 Happy Learning!!

Recent Top NLP articles:

NLP 🗣: Guide to Sentiment Analysis🔬😃😣😭
This post is the continuation of the NLP series and here we are going to learn about Sentiment Analysis in very simple… (blog.devgenius.io)

NLP📜Topic Modeling📳- LDA (Latent Dirichlet Allocation) 💬💻🧠[with codes]
In this post, we are going to discuss what is NLP, what is Topic Modeling in NLP, and how to use one of the techniques… (blog.devgenius.io)

Is Parquet Faster than CSV for sentiment analysis?
If you see the below Graph it directly shows that Parquet consumes very less memory than others, but why? (medium.com)
