How To Visualize Word Embeddings with Google Embedding Projector in Tensorboard

September 12, 2023

When working with data, visualization is critical. Data tends to grow large and complex over time, which makes it difficult to grasp its true meaning or detect patterns by simply inspecting its structure or content. Doing so can still surface useful information, but it rarely gives a complete picture of what the data is about and demands considerably more effort to extract solid insights in a reasonable time.

Visualization reduces that complexity: it lays the data out in plain view, with graphs of your choice, and lets you spot patterns and insights that would otherwise remain buried in a crowd of data.

In this article, you will learn how to use the Google Embedding Projector to visualize word embeddings. This will allow you to view the embeddings in three dimensions and detect similarities and contrasts between them. To follow along with this post, you should have a decent working knowledge of Python and the Pandas library.

Overview of the Dataset

The dataset used in this article is the TED Talks Transcripts dataset. TED (an acronym for Technology, Entertainment, and Design) is a global platform for ideas worth spreading. It started as a conference in 1984 and has since grown into a vast online library of thought-provoking talks, known as TED Talks.

TED Talks cover a wide range of topics, including science, technology, art, education, psychology, and more. Renowned speakers, experts, and innovators from diverse fields share their insights and experiences in engaging and concise presentations that are typically limited to 18 minutes or less.

With this data, the aim is to convert the transcripts of various talks into embeddings and find out which talks are similar and which are starkly different by looking at their positions in 3D space. You can get the dataset for this tutorial here: https://www.kaggle.com/datasets/miguelcorraljr/ted-ultimate-dataset. The dataset contains transcripts in several languages, but this tutorial uses the English transcripts.

Text Preprocessing

Preprocessing of text is crucial for effective natural language processing (NLP) tasks, such as visualizing word embeddings. It involves cleaning the data by removing noise, standardizing the text, and tokenizing it into individual words. Stop words are often removed, and words may be stemmed or lemmatized to reduce dimensionality and enhance semantic analysis. Additional techniques such as part-of-speech tagging, named entity recognition, and sentiment analysis can also be applied, but you won’t get to do that here.

To begin, import Pandas and read the dataset to see what it's all about.

import pandas as pd

df = pd.read_csv('ted_talks_en.csv') 

df.head(5)

When you run this, you will see that the dataset has about 19 columns, but this tutorial only concerns itself with two of them: the title and transcript columns.

Therefore, go ahead and select these two columns:

df = df[['title', 'transcript']]

Now, moving on to the preprocessing of the transcript text. To process this text, you will first convert it to lowercase and then tokenize it. This divides the text into individual words, or tokens, which allows for better analysis and processing at the word level. After tokenizing the text, you will remove stopwords, which are common words like 'the', 'is', and 'and' that appear frequently but do not contribute much to the overall meaning of the text. Lastly, you will reduce the remaining words to their base form in a process called lemmatization, which helps to normalize variations of words and reduces the dimensionality of the data.

To do all of this, you will need to import the nltk and spacy packages and download some NLTK resources.

import nltk

from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer

import spacy

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Download spaCy English model
spacy.cli.download('en_core_web_md')

If you don't have these packages installed, you can run pip install nltk and pip install spacy to install them.

Next, create a function to preprocess the text at once instead of having to run each process individually.

def preprocess_text(text):

    # Tokenization
    tokens = nltk.word_tokenize(text.lower())

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join tokens back into a string
    processed_text = ' '.join(tokens)
    
    return processed_text
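Before applying the function to the whole dataframe, you can try it on a short sample sentence (the sentence below is purely illustrative) to get a feel for what the output looks like:

# Quick check on an illustrative sample sentence
sample = "The speakers were sharing their ideas about the future of education."
print(preprocess_text(sample))
# prints something along the lines of: speaker sharing idea future education .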

To apply this function on the transcript text, simply run:

df['processed_transcript'] = df['transcript'].apply(lambda x: preprocess_text(x))

If you go ahead and check your dataframe, you will see the difference between the original transcript and the processed transcript.
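For example, a quick side-by-side look at the two columns shows the effect of the cleaning:

# Compare the raw and cleaned transcripts for the first few talks
print(df[['transcript', 'processed_transcript']].head(3))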

Creating Word Embeddings

With the text preprocessed, you can now go ahead and create the word embedding. Creating a word embedding refers to the process of representing words as numerical vectors in a high-dimensional space. Word embeddings capture the semantic and contextual relationships between words, allowing machines to understand and work with words in a more meaningful way.

Traditional approaches for representing words used methods like one-hot encoding, where each word is represented as a sparse binary vector. However, one-hot encoding fails to capture the semantic relationships between words or provide any notion of similarity or context.
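To see why, consider a toy vocabulary of three words. With one-hot encoding, every pair of distinct words has a similarity of zero, so related words look no closer than unrelated ones. This small sketch uses NumPy, which is not otherwise needed in this tutorial:

import numpy as np

# Toy one-hot vectors for a three-word vocabulary (illustrative only)
talk = np.array([1, 0, 0])
speech = np.array([0, 1, 0])
banana = np.array([0, 0, 1])

# The dot product between any two distinct one-hot vectors is 0,
# so one-hot encoding carries no notion of relatedness
print(np.dot(talk, speech))  # 0
print(np.dot(talk, banana))  # 0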

Word embeddings, on the other hand, encode words as dense vectors, where similar words have vectors that are closer together in the embedding space. These vectors are learned through techniques like Word2Vec, GloVe, or FastText using large amounts of text data. These models consider the context in which words appear and aim to capture the meaning and semantic relationships between them.

To convert the processed text to numerical vectors, you will use the Google Universal Sentence Encoder. This encoder is a pre-trained model developed by Google that provides an encoding for sentences or short texts. It is designed to capture the semantic meaning and similarity between sentences, allowing for a better understanding and analysis of textual data.

To use this encoder, you will need to install TensorFlow and TensorFlow Hub using pip install tensorflow and pip install tensorflow-hub before importing the packages.

import tensorflow as tf

import tensorflow_hub as hub

To load the encoder simply run:

use_model_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(use_model_url)
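As a quick sanity check, you can embed a single sentence and inspect the result; the Universal Sentence Encoder produces a 512-dimensional vector for each input string.

# Embed one test sentence and check the output shape
test_embedding = embed(["ideas worth spreading"])
print(test_embedding.shape)  # (1, 512)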

And finally to create the word embeddings run:

processed_transcript = df['processed_transcript']
embeddings = embed(processed_transcript).numpy()

Note: The process of creating word embeddings is very memory-intensive, so you might want to embed a subset of your dataset if you don’t have enough memory. For this tutorial, the first 1000 records were used, but if you have enough memory, you can create embeddings for all 4005 records.
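If you do want to work with a subset, a minimal sketch (assuming you keep the first 1000 rows, as done here) could look like this; just remember to use the same subset later when writing the labels file:

# Embed only the first 1000 transcripts to keep memory usage manageable
processed_transcript = df['processed_transcript'][:1000].tolist()
embeddings = embed(processed_transcript).numpy()
print(embeddings.shape)  # (1000, 512)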

When you run the embeddings, you will see that all the text has been converted into numerical vectors.

array([[ 0.04606897, -0.04607391,  0.0057315 , ..., -0.04607334,
        -0.04607391,  0.04599435],
       [ 0.03809019, -0.04527386, -0.04282996, ..., -0.00621341,
        -0.04527386, -0.04527386],
       [ 0.03843987, -0.04537549, -0.04537549, ..., -0.04537498,
        -0.04537549,  0.04537549],
       ...,
       [ 0.04153137, -0.04572546, -0.03428574, ..., -0.04530822,
        -0.04572546,  0.04110127],
       [ 0.04519934, -0.04520092, -0.04520062, ..., -0.04520092,
        -0.04520092, -0.03715887],
       [ 0.04518392, -0.0451839 , -0.04514507, ..., -0.04518392,
        -0.04518392,  0.04511109]], dtype=float32)
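Even before opening TensorBoard, you can check that the vectors behave as expected: talks whose embeddings have a high cosine similarity should be topically related. A minimal sketch using scikit-learn (an extra dependency, installable with pip install scikit-learn) could be:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all embedded talks
sims = cosine_similarity(embeddings)

# For the first talk, find the most similar other talk
# (the highest-scoring index is the talk itself, so take the second highest)
nearest = np.argsort(sims[0])[-2]
print(df['title'].iloc[0], '->', df['title'].iloc[nearest])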

Setting Up TensorBoard

To set up TensorBoard, you first need to import the projector module:

from tensorboard.plugins import projector

Next, you need to create a directory to log the results and files from your experiments:

import os

# Set up a logs directory, so Tensorboard knows where to look for files. 
log_dir = '/content/logs/experiment/'
if not os.path.exists(log_dir):
  os.makedirs(log_dir)

Once that is done, initialize a TensorFlow session, and within that session, create a variable that represents the word embeddings. Set the variable to non-trainable to preserve the integrity of the pre-trained embeddings.

sess = tf.compat.v1.InteractiveSession()

with tf.device("/cpu:0"):
  weights = tf.Variable(embeddings, trainable=False, name='embedding')

Thereafter, associate the embeddings with a writer. This writer generates logs in the ‘/content/logs/experiment’ directory, forming the foundation for visualization. Also, create a saver to save the Tensorflow model for future use.

tf.compat.v1.disable_eager_execution()
sess.run(tf.compat.v1.global_variables_initializer())

writer = tf.compat.v1.summary.FileWriter('/content/logs/experiment', sess.graph)
saver = tf.compat.v1.train.Saver()

Next, configure the projector and add the tensor name and the metadata path to it. The metadata path refers to the labels of the word embeddings, which in this case are the titles of each individual talk.

To create this labels file, you will use the title column of the dataframe. To do this, run:

with open("labels.tsv", "w") as f:
  for label in df['title']:
    f.write(label + "\n")
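Note that the labels file must contain exactly one line per embedded row. If you embedded only a subset of the talks (for example the first 1000, as mentioned earlier), slice the titles to the same length, along the lines of:

# Write one label per embedded row; slice the titles if only a subset was embedded
with open("labels.tsv", "w") as f:
  for label in df['title'][:len(embeddings)]:
    f.write(label + "\n")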

The tensor name is set to “embedding”, linking it to the previously created weights variable. The path to the metadata file, ‘/content/labels.tsv’, which contains additional information about the embeddings, is also specified.

config = projector.ProjectorConfig()
embedding = config.embeddings.add()

embedding.tensor_name = "embedding"
embedding.metadata_path = '/content/labels.tsv'

The last step here is to visualize the word embeddings and save the model checkpoint. The writer and the embedding configuration are passed to projector.visualize_embeddings(), which writes the metadata needed for the visualization and so enables the word embeddings to be displayed.

The trained model and its variables are also saved using saver.save(). The session's variables are persisted as a checkpoint file named model.ckpt in the '/content/logs/experiment/' directory, ensuring that the trained embeddings can be reused or further trained in the future.

projector.visualize_embeddings(writer, config)

# No training steps are involved, so the checkpoint step is simply set to 0
saver.save(sess, '/content/logs/experiment/model.ckpt', global_step=0)
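At this point, the log directory should contain the checkpoint files, the event file written by the writer, and a projector_config.pbtxt file. A quick way to confirm everything was written is to list the directory:

# List the contents of the log directory to confirm the files were written
import os
print(os.listdir('/content/logs/experiment/'))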

Visualizing Embeddings

To load up TensorBoard and view the word embeddings, run:

%reload_ext tensorboard

%tensorboard --logdir /content/logs/experiment/
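If you are not working in a notebook, you can launch the same dashboard from a terminal and open the printed URL in your browser:

tensorboard --logdir /content/logs/experiment/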

You will be presented with the following interface:

To view the projector, click the dropdown in the top-right corner, scroll until you see the Projector option, and select it.

Once selected, you’ll see the visualization projector with the data points of all the embeddings.

Since the visualization is in 3D space, you can rotate it to see which TED talks are closely related based on their transcripts and which ones are well separated. If you look closely at the graph above, you'll notice that the data points cluster towards the left and right, while the center appears sparse at the top and converges at the bottom.

You can also search for a particular title and see the other titles that are closely related to it using the search bar on the right side.

Conclusion

By using TensorBoard, you successfully visualized the relationships between different TED talks, using the titles as labels to identify which talks are similar and which share little to nothing in common. You saw how to preprocess text, create word embeddings, and set up TensorBoard to visualize these embeddings.

With the techniques and understanding gained from this article, you can now go on to explore other datasets with diverse subject areas. By adapting the preprocessing steps to suit the characteristics of the new dataset, you can generate word embeddings that encapsulate the semantic relationships specific to that domain.

If you liked this article, kindly give it a thumbs up and subscribe so you won’t miss out on our upcoming articles.