Embeddings

This guide walks through using thema for Low Dimensional Embeddings and Dimensionality Reduction of clean datasets.

Note

Embeddings are dependant on the creation of a Moon object. Please see the Data Preprocessing Guide for more info on preprocessing steps and the Outer System API documentation for info on thema.multiverse.inner.Moon objects containing cleaned, encoded, and scaled data.

Projection Handling with Oort Class

The Oort class in Thema provides functionality for handling projections. When using the Oort class for projections:

  1. Output Data: Upon execution, the Oort class will create a directory as specified in the params.yaml configuration file. This directory serves as the designated location for storing all projection-related data. The Oort class does not store the projection data within its own instance. Instead, it manages and organizes the data within the designated directory. This approach ensures that all generated data, such as output files or processed results, are stored systematically for further analysis or usage.

  2. Configuration: The params.yaml file serves as the central configuration point for specifying parameters for generating multiple projections. This file provides comprehensive control over hyperparameter selection, facilitating grid search exploration to optimize UMAP embeddings. Users can customize within the params.yaml to precisely tailor the generated data storage and algorithm behavior to their project requirements.

Note

Projections are embeddings of your data.

UMAP Embeddings

The params.yaml file allows you to configure various parameters for generating UMAP embeddings using the Oort class. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that can be customized through several parameters.

YAML Configuration for UMAP

  • Oort: The top-level key indicating the configuration for the Oort class.

  • umap: Contains the parameters for configuring the UMAP embeddings.

  • nn: Specifies the number of nearest neighbors to consider for each point. Two values are provided (2 and 4), indicating a range to explore during grid search.

  • minDist: Specifies the minimum distance between points in the embedding space. Multiple values are provided below, indicating a grid-search over multiple hyperparameters.

  • dimensions: Specifies the number of dimensions for the UMAP embedding. A value of 2 indicates a 2D embedding.

  • seed: Specifies the random seed for reproducibility. A value of 32 is provided.

  • projectiles: Lists the dimensionality reduction methods to be used. Here, umap is specified.

Here is an example of the relevant section in the params.yaml file:

Oort:
    umap:
        nn:
        - 2
        - 4
        minDist:
        - 0.1
        - 0.25
        dimensions:
        - 2
        seed:
        - 32
    projectiles:
        - umap

The parameters listed under the umap key are used in a grid search to generate multiple UMAP embeddings. Each grid search iteration will create a UMAP embedding based on different combinations of the provided parameter values. This allows you to explore a variety of configurations to identify the best one for your data.

For example, the grid search will consider the following combinations of parameters:

  • nn: 2, minDist: 0.05, dimensions: 2, seed: 32

  • nn: 4, minDist: 0.05, dimensions: 2, seed: 32

  • nn: 2, minDist: 0.1, dimensions: 2, seed: 32

  • nn: 4, minDist: 0.1, dimensions: 2, seed: 32

By adding more values to each parameter, you can expand the grid search to include more combinations.

By performing a grid search with these parameters, you can systematically evaluate different UMAP configurations to find the one that best suits your data and objectives. Feel free to add more parameters to the grid search to explore additional configurations and improve the quality of your UMAP embeddings.

Instantiate Oort with UMAP Embedding

First, import the correct module:

from thema.multiverse import Oort

Here, we instantiate a simple Oort using the yaml configs shown above:

yaml = "<PATH TO YOUR params.yaml>"
oort = Oort(YAML_PATH=yaml)
oort.fit()

The outDir member variable of an Oort object stores the location of newly created projection files:

Other Embeddings

Thema also supports PCA and T-SNE embeddings. The params.yaml file allows you to configure parameters for generating these embeddings alongside, or instead of, UMAP.

TSNE Parameters

Hint

  • An interactive overview of the key differences between UMAP and t-SNE projections: UMAP vs. t-SNE

  • perplexity: Controls the number of effective neighbors used in TSNE. Example value: 2.

  • dimensions: Specifies the number of dimensions for TSNE embeddings. Example value: 2.

  • seed: Specifies the random seed for reproducibility in TSNE. Example value: 32.

Oort:
    tsne:
        perplexity:
        - 2
        - 4
        dimensions:
        - 2
        seed:
        - 32
        - 42
    projectiles:
        - tsne

PCA Parameters

  • dimensions: Specifies the number of dimensions for PCA embeddings. Example value: 2.

  • seed: Specifies the random seed for reproducibility in PCA. Example value: 32.

Oort:
    pca:
        dimensions:
        - 2
        - 3
        - 4
        seed:
        - 32
    projectiles:
        - pca

Example YAML with Multiple Embedding Methods

To include multiple projections in your Oort object crearion, modify the projectiles section in your params.yaml file to include multiple embeddings (and dont forget to define the parameter).

Note

This is just the Oort section of the yaml, and does not affect any other keys (such as the Planet yaml key controlling data preprocessing, for example).

Oort:
    umap:
        nn:
        - 2
        minDist:
        - 0.1
        dimensions:
        - 2
        seed:
        - 32
    tsne:
        perplexity:
        - 2
        dimensions:
        - 2
        seed:
        - 32
    pca:
        dimensions:
        - 2
        seed:
        - 32
    projectiles:
        - umap
        - tsne
        - pca