Embeddings are fascinating but notoriously difficult to comprehend. When you upload data to a system like Vantage, it is transformed into giant vectors of floating-point numbers that don't mean much to humans. So how do we make sense of these embeddings and understand their similarities?
In this article, we'll unpack the concept of embeddings, use a sample dataset called "data barn delights," and explore how to visualize high-dimensional data using a technique called t-SNE. And yes, the entire notebook and example are available on GitHub for those who want to dive deeper.
Embeddings are like coordinates in a massively high-dimensional space. Imagine a space with thousands of dimensions where each data point is a vector of floating-point numbers. These vectors capture the semantic meaning of the items they represent. For instance, the words "king" and "queen" might be close together in this space, while "king" and "carrot" would be far apart.
While we can calculate cosine similarity scores between these vectors to measure how similar they are, understanding and visualizing these similarities is a whole different ballgame. Cosine similarity scores give us a numerical value, but how do we visualize these massive dimensional coordinates?
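As a quick illustration, here's a minimal sketch of cosine similarity using NumPy on toy four-dimensional vectors (real embeddings have thousands of dimensions); this is just the underlying math, not Vantage's API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction
    (semantically close), near 0.0 means orthogonal (unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only.
king = np.array([0.9, 0.1, 0.3, 0.7])
queen = np.array([0.8, 0.2, 0.35, 0.65])
carrot = np.array([0.1, 0.9, 0.8, 0.05])

print(cosine_similarity(king, queen))   # ~0.99: close in embedding space
print(cosine_similarity(king, carrot))  # ~0.32: far apart
```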
Let's introduce our dataset: "data barn delights," a sample dataset containing 12 items. Each item has an ID, a color, a department, and a description.
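The rows below are hypothetical stand-ins (the actual 12 items are in the notebook on GitHub), but they show the shape of the data:

```python
# Hypothetical records in the shape of the "data barn delights" schema;
# the real items live in the notebook on GitHub.
items = [
    {"id": 1, "color": "orange", "department": "grocery",
     "description": "Fresh farm carrots, bundled by the dozen"},
    {"id": 2, "color": "green", "department": "garden",
     "description": "Heavy-duty garden hose, 50 feet"},
    {"id": 3, "color": "blue", "department": "cleaning",
     "description": "All-purpose surface cleaner, citrus scent"},
]
```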
When these items are turned into embeddings, like when uploaded to Vantage, they become these incomprehensible vectors in high-dimensional space. So, how do we make sense of their similarities?
Enter t-SNE (t-distributed Stochastic Neighbor Embedding), a technique that approximates high-dimensional coordinates in a lower-dimensional space, such as 2D or 3D, that we mere mortals can understand.
t-SNE works by converting the similarities between data points into joint probabilities and trying to minimize the Kullback-Leibler divergence between these joint probabilities in the high-dimensional and lower-dimensional spaces. In simpler terms, it maps high-dimensional data to two or three dimensions while preserving the relative distances between points as much as possible.
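For the mathematically inclined, the cost function t-SNE minimizes is:

$$
\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

where $p_{ij}$ measures how likely points $i$ and $j$ are to be neighbors in the original high-dimensional space, and $q_{ij}$ measures the same thing in the 2D or 3D map.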
While not perfect or precise, t-SNE helps us visualize and understand how our data points relate to each other. Think of it as trying to visualize the inside of a balloon animal without popping it.
Let's see t-SNE in action with our data barn delights sample data. We start with embeddings that have 2048 dimensions for each item and use t-SNE to reduce these to just 2 dimensions.
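Here's roughly what that step looks like with scikit-learn. Assume `embeddings` is a 12 × 2048 NumPy array, one row per item (the file name below is a placeholder; the real loading code is in the notebook):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical file holding the 12 x 2048 embedding matrix.
embeddings = np.load("embeddings.npy")

# With only 12 points, perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=42)
coords_2d = tsne.fit_transform(embeddings)  # shape (12, 2)
```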
Here's what our t-SNE visualization looks like:
In this graph, each point represents an item from our dataset, color-coded by department: grocery (yellow), garden (purple), and cleaning (red). As we can see, items in similar categories tend to cluster together.
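For reference, a color-coded scatter like this can be produced with Matplotlib along these lines, reusing `coords_2d` from the previous snippet and assuming a parallel list of department labels:

```python
import matplotlib.pyplot as plt

# One department label per item, parallel to the rows of coords_2d
# (hypothetical ordering; the real labels come from the dataset).
departments = ["grocery", "garden", "cleaning"] * 4
colors = {"grocery": "yellow", "garden": "purple", "cleaning": "red"}

fig, ax = plt.subplots()
for dept, color in colors.items():
    # Plot the points belonging to this department in its color.
    idx = [i for i, d in enumerate(departments) if d == dept]
    ax.scatter(coords_2d[idx, 0], coords_2d[idx, 1], c=color, label=dept)

ax.legend(title="department")
ax.set_title("data barn delights: t-SNE of 2048-d embeddings")
plt.show()
```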
This clustering is both intuitively and mathematically expected. When we use semantic similarity, we rely on the distance between points in our space to reflect their similarity. t-SNE helps us visualize this concept.
Now, let's delve deeper into interpreting our t-SNE visualization. The key is to look at which points end up near each other: similar items should sit close together, while dissimilar items should land farther apart. One caveat worth knowing: t-SNE preserves local neighborhoods much more faithfully than global distances, so read the map in terms of which points cluster together rather than the exact size of the gaps between clusters.
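One way to make "near each other" concrete is to list each point's nearest neighbor in the 2D map. A sketch reusing `coords_2d` and `departments` from above; remember that these 2D distances are only a rough guide:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all points in the 2D t-SNE map.
dist = cdist(coords_2d, coords_2d)
np.fill_diagonal(dist, np.inf)  # exclude each point from being its own neighbor

for i, j in enumerate(dist.argmin(axis=1)):
    print(f"item {i} ({departments[i]}) sits closest to item {j} ({departments[j]})")
```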
By visualizing these clusters, we can better understand the relationships between our data points. This is essential when we need to explain the results of our embeddings to stakeholders or use them to drive business decisions.
In this article, we've explored the concept of embeddings and their complexities. We introduced the data barn delights sample dataset and used t-SNE to visualize the high-dimensional embeddings in a way that's comprehensible to humans. By interpreting the t-SNE visualization, we can better understand the relationships between our data points and make more informed decisions.
Remember, while t-SNE is not perfect, it's a powerful tool for visualizing and interpreting high-dimensional data. And don't forget, the entire notebook and example are available on GitHub for those who want to dive deeper into the technical details.