Visualization of vector spaces

Hi there everyone!

I want to visualize the vector representations of different objects in 2 dimensions in such a way that two points are close if they have similar meaning. I know that PCA offers a way to reduce dimensions, but it is not meant to capture similarity. There is also t-SNE, which preserves the distribution of the data, but it is slow and its parameters are quite hard to choose.

Does the following approach make sense?

  • Calculate a pairwise matrix of cosine similarities among the vectors
  • Apply PCA of 2 dimensions to the cosine similarity matrix
  • Plot the 2 PCA dimensions using scatter

Thank you

Yes, I think it makes sense. Even after dimensionality reduction, the results will probably still show similarity, as long as similar points end up close enough together in the plot!

Mauricio, I tried your idea on some 3D data and it kind of works. But there is a high chance that a 2D projection of higher-dimensional data will not be adequate for finding the similarities you are looking for: too many points will likely overlap.

My suggestion: use interactive 3D, or even interactive 3D plus time (as a 4th dimension), to keep as much information as possible in the plot. Also use DBSCAN (it is quick, identifies outliers, and does not need the number of clusters up front) to identify clusters based on two simple, domain-specific parameters (eps and min_points, the latter called min_samples in scikit-learn).

Reason: essentially, what you are looking for are points close to each other, and that means clusters. So clustering your data would be an essential step in whatever visualization you use, whether 2D, 3D, or 3D plus time (movie-like). Colors will then express the mathematical closeness of points according to whatever distance measure you define.

Here is some sample code in a toy example:

import numpy as np
import plotly.graph_objects as go
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def generate_points_on_torus(num_points, major_radius, minor_radius):
    # Generate random toroidal coordinates
    theta = 2 * np.pi * np.random.rand(num_points)
    phi = 2 * np.pi * np.random.rand(num_points)

    # Calculate torus coordinates
    x = (major_radius + minor_radius * np.cos(phi)) * np.cos(theta)
    y = (major_radius + minor_radius * np.cos(phi)) * np.sin(theta)
    z = minor_radius * np.sin(phi)

    # Create a numpy array with the Cartesian coordinates
    points_on_torus = np.column_stack((x, y, z))

    return points_on_torus

def visualize_points_on_surface(points, title):
    fig = go.Figure()
    unique_labels = np.unique(points[:, 3])
    colors = ['rgb({}, {}, {})'.format(np.random.randint(0, 255), np.random.randint(0, 255), np.random.randint(0, 255)) for _ in range(len(unique_labels))]

    for label, color in zip(unique_labels, colors):
        cluster_points = points[points[:, 3] == label]
        trace = go.Scatter3d(
            x=cluster_points[:, 0],
            y=cluster_points[:, 1],
            z=cluster_points[:, 2],
            mode='markers',
            marker=dict(size=5, color=color),
            # DBSCAN labels outliers as -1, so show them as noise in the legend
            name='Noise' if label == -1 else f'Cluster {int(label)}'
        )
        fig.add_trace(trace)

    fig.update_layout(scene=dict(aspectmode='data'))
    fig.update_layout(title=title)
    fig.show()

# Set the number of random points
num_points_torus = 700

# Generate random points on the surface of a torus (donut-style)
torus_major_radius = 5  # Major radius
torus_minor_radius = 2  # Minor radius (tube radius)
torus_points = generate_points_on_torus(num_points_torus, torus_major_radius, torus_minor_radius)

# Standardize the data before applying DBSCAN
scaler = StandardScaler()
torus_points_scaled = scaler.fit_transform(torus_points[:, :3])

# Perform DBSCAN clustering (watch out: eps is measured on the standardized data, not the original scale)
epsilon = 0.25
min_pts = 4

dbscan = DBSCAN(eps=epsilon, min_samples=min_pts)
labels = dbscan.fit_predict(torus_points_scaled)

# Visualize the points on the torus after clustering
torus_points_with_labels = np.column_stack((torus_points, labels))
visualize_points_on_surface(torus_points_with_labels, 'Points on the Surface of a Torus (After Clustering)')

While your idea would be something like this:

# Imports for the similarity matrix, PCA, and plotting
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Initialize the data set (reuses torus_points from the example above)
A = torus_points
# Calculate a pairwise matrix of cosine similarities among the vectors
A_cos_sim = cosine_similarity(A)
# Apply PCA with 2 components to the cosine similarity matrix (fit only once)
pca = PCA(n_components=2)
A_res = pca.fit_transform(A_cos_sim)
# Share of the total variance captured by each of the 2 components
print(pca.explained_variance_ratio_)
# Plot the 2 PCA dimensions using scatter
plt.scatter(A_res[:, 0], A_res[:, 1])
plt.show()

Naturally, one of the most important things to look at is the proportion of variance explained when reducing to 2 dimensions with PCA. If the 2 principal components capture only a small share of the total variance, then you need more dimensions.
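A rough sketch of that check, in case it helps: fit PCA with more components than you plan to plot and look at the cumulative explained variance. The random data, the 10 components, and the 80% threshold are all made-up assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in data: 200 random vectors with 50 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Pairwise cosine similarity matrix, as in the approach discussed above
sim = cosine_similarity(X)

# Fit more components than will be plotted, then accumulate their variance shares
pca = PCA(n_components=10)
pca.fit(sim)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k, share in enumerate(cumulative, start=1):
    print(f"{k} components: {share:.1%} of the variance")

# Smallest number of components reaching an (arbitrary) threshold, if any does
threshold = 0.80
reached = np.nonzero(cumulative >= threshold)[0]
n_needed = int(reached[0]) + 1 if reached.size else None
print("components needed for the threshold:", n_needed)
```

If the first two entries already add up to most of the variance, a 2D scatter is probably a fair picture; if not, the plot is hiding a lot of the structure.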

Two more remarks: in case t-SNE (which would be a great visualization method) is too slow, try it on a subset of your samples, such as a certain region of the space or a random subset, just to get an idea.
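The random-subset idea could look roughly like this. The data here is random noise standing in for real vectors, and the sample size, perplexity, and cosine metric are assumptions picked for the sketch, so treat the numbers as placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for a large set of vectors: 5000 points, 50 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))

# Draw a random subset without replacement so t-SNE stays fast
sample_size = 300
idx = rng.choice(X.shape[0], size=sample_size, replace=False)
X_sample = X[idx]

# Run t-SNE on the subset only; perplexity must be smaller than the sample size
tsne = TSNE(n_components=2, perplexity=30, metric="cosine",
            init="pca", random_state=0)
X_2d = tsne.fit_transform(X_sample)

print(X_2d.shape)
```

Keeping idx around lets you look up the original text or category for each plotted point afterwards.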

Also, look into UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction). It will likely serve the purpose better than PCA.

Here is a nice comparison of different methods for dimension reduction:
https://umap-learn.readthedocs.io/en/latest/benchmarking.html

There is also a range of methods that are quicker than t-SNE but still much better suited to this than PCA.

Hi Ralf and Gent,

Thanks a lot for your replies. I am going to try UMAP and DBSCAN too.

To give you more details: what I have is 10 million sentences, with a universal sentence encoding (720 dimensions) calculated for each of them. For visualisation, I am choosing a random sample of 500 and plotting them as a 2D scatter in Plotly, so I can see the point, the text, and a color for the category. I have a category for each of them, with a total of 55 unique categories. So what I really want to see is how well the embedding groups together texts that are in the same category, before running any ML model at all.
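One possible way to put a number on "how well the embedding groups the categories", alongside the plot, is a silhouette score computed with the known categories as labels and cosine distance as the metric. This is only a sketch on synthetic data: the 5 categories, 64 dimensions, and noise level are invented, and comparing against shuffled labels is just one simple baseline.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for sentence embeddings with known categories:
# each category is a random center plus some noise
rng = np.random.default_rng(0)
n_categories, per_category, dim = 5, 40, 64
centers = rng.normal(size=(n_categories, dim))
labels = np.repeat(np.arange(n_categories), per_category)
embeddings = centers[labels] + 0.3 * rng.normal(size=(labels.size, dim))

# Silhouette close to 1: categories are tight and well separated in the embedding;
# around 0 or below: categories overlap heavily
score = silhouette_score(embeddings, labels, metric="cosine")

# Baseline: the same embeddings with randomly shuffled category labels
shuffled_score = silhouette_score(embeddings, rng.permutation(labels), metric="cosine")

print(f"silhouette with true categories:     {score:.3f}")
print(f"silhouette with shuffled categories: {shuffled_score:.3f}")
```

On real data the score can be computed on the same random sample that is plotted, which keeps the pairwise distance computation small.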

Is it still valid to use DBSCAN on data with such high dimensionality (720 dimensions)? I have another, similar dataset for which I don't have the categories, and I would like to find groups that can be defined based on the similarity of the embeddings.
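For reference, this is roughly what DBSCAN with cosine distance instead of the default Euclidean metric would look like, since cosine similarity is what the rest of the thread uses. It is a sketch on synthetic, well-separated data; the eps and min_samples values are assumptions that fit this toy data only, and it says nothing about whether the clusters would be meaningful on the real 720-dimensional embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for unlabeled embeddings: 5 hidden groups in 64 dimensions
rng = np.random.default_rng(0)
n_groups, per_group, dim = 5, 40, 64
centers = rng.normal(size=(n_groups, dim))
hidden = np.repeat(np.arange(n_groups), per_group)
embeddings = centers[hidden] + 0.3 * rng.normal(size=(hidden.size, dim))

# With metric="cosine", eps is a cosine distance (1 - cosine similarity),
# so no standardization step is applied here
dbscan = DBSCAN(eps=0.3, min_samples=5, metric="cosine")
cluster_labels = dbscan.fit_predict(embeddings)

# Label -1 marks points DBSCAN treats as noise
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = int(np.sum(cluster_labels == -1))
print("clusters found:", n_clusters)
print("noise points:", n_noise)
```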