Embeddings¶
This guide walks through using thema
to build low-dimensional embeddings of clean datasets through dimensionality reduction.
Note
Embeddings depend on the creation of a Moon
object. Please see the Data Preprocessing Guide for more info on preprocessing steps and the Outer System API documentation for info on thema.multiverse.inner.Moon
objects containing cleaned, encoded, and scaled data.
Projection Handling with Oort Class¶
The Oort class in Thema provides functionality for handling projections. When using the Oort class for projections:
Output Data: Upon execution, the Oort class will create a directory as specified in the
params.yaml
configuration file. This directory is the designated location for all projection-related data. The Oort class does not store projection data within its own instance; it manages and organizes the data within that directory, so that output files and processed results are stored systematically for later analysis.
Configuration: The
params.yaml
file is the central configuration point for generating multiple projections. It controls hyperparameter selection, which enables a grid search over UMAP embeddings. Users can edit params.yaml
to tailor data storage and algorithm behavior to their project requirements.
Note
Projections are embeddings of your data.
UMAP Embeddings¶
The params.yaml
file allows you to configure various parameters for generating UMAP embeddings using the Oort
class. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that can be customized through several parameters.
YAML Configuration for UMAP¶
Oort: The top-level key indicating the configuration for the
Oort
class.
umap: Contains the parameters for configuring the UMAP embeddings.
nn: Specifies the number of nearest neighbors to consider for each point. Two values are provided (2 and 4); the grid search tries each of them.
minDist: Specifies the minimum distance between points in the embedding space. Two values are provided (0.1 and 0.25); the grid search combines each of them with every nn value.
dimensions: Specifies the number of dimensions for the UMAP embedding. A value of 2 indicates a 2D embedding.
seed: Specifies the random seed for reproducibility. A value of 32 is provided.
projectiles: Lists the dimensionality reduction methods to be used. Here,
umap
is specified.
Here is an example of the relevant section in the params.yaml
file:
Oort:
  umap:
    nn:
      - 2
      - 4
    minDist:
      - 0.1
      - 0.25
    dimensions:
      - 2
    seed:
      - 32
  projectiles:
    - umap
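For orientation, the four keys above line up with constructor arguments of the umap-learn UMAP class (n_neighbors, min_dist, n_components, random_state). The sketch below shows one plausible translation. The helper name to_umap_kwargs and the key mapping are assumptions made for illustration; they are not taken from Thema's source and may differ from how Oort forwards these values internally.

```python
# Hypothetical helper: rename Thema-style UMAP keys to the keyword
# arguments accepted by umap-learn's UMAP constructor.
# The mapping is an assumption, not Thema's documented behavior.
KEY_MAP = {
    "nn": "n_neighbors",
    "minDist": "min_dist",
    "dimensions": "n_components",
    "seed": "random_state",
}


def to_umap_kwargs(params):
    """Return a dict keyed by umap-learn argument names."""
    unknown = set(params) - set(KEY_MAP)
    if unknown:
        raise KeyError(f"unrecognized UMAP parameter(s): {sorted(unknown)}")
    return {KEY_MAP[key]: value for key, value in params.items()}


kwargs = to_umap_kwargs({"nn": 2, "minDist": 0.1, "dimensions": 2, "seed": 32})
print(kwargs)
```

If umap-learn is installed, the resulting dict could be passed as umap.UMAP(**kwargs) to reproduce a single configuration outside of Thema.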
The parameters listed under the umap
key are used in a grid search to generate multiple UMAP embeddings. Each grid search iteration will create a UMAP embedding based on different combinations of the provided parameter values. This allows you to explore a variety of configurations to identify the best one for your data.
For example, the grid search will consider the following combinations of parameters:
nn: 2, minDist: 0.1, dimensions: 2, seed: 32
nn: 4, minDist: 0.1, dimensions: 2, seed: 32
nn: 2, minDist: 0.25, dimensions: 2, seed: 32
nn: 4, minDist: 0.25, dimensions: 2, seed: 32
Adding more values to any parameter expands the grid search: the number of embeddings generated is the product of the number of values listed for each parameter.
A grid search of this kind lets you systematically evaluate different UMAP configurations and select the one that best suits your data and objectives.
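The expansion described above is a Cartesian product over the listed values. The following is a minimal sketch of that idea in plain Python, using the values from the example params.yaml; the function name expand_grid is hypothetical, and this is not necessarily how Oort builds its grid internally.

```python
from itertools import product

# Values copied from the umap section of the example params.yaml.
umap_grid = {
    "nn": [2, 4],
    "minDist": [0.1, 0.25],
    "dimensions": [2],
    "seed": [32],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combinations = expand_grid(umap_grid)
for combo in combinations:
    print(combo)
```

With two values for nn, two for minDist, and one each for dimensions and seed, the sketch yields 2 x 2 x 1 x 1 = 4 parameter sets.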
Instantiate Oort with UMAP Embedding¶
First, import the Oort class:
from thema.multiverse import Oort
Here, we instantiate a simple Oort using the YAML configuration shown above:
yaml_path = "<PATH TO YOUR params.yaml>"  # path to the configuration file
oort = Oort(YAML_PATH=yaml_path)
oort.fit()  # run the projection grid search
The outDir
member variable of an Oort
object stores the location of the newly created projection files. Access it as oort.outDir.
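To check what was written, the directory can be listed with the standard library. The sketch below is only an illustration: the helper name list_projection_files is hypothetical, and it makes no assumption about how Thema names or formats the projection files.

```python
from pathlib import Path


def list_projection_files(out_dir):
    """Return all regular files under out_dir, sorted by path."""
    root = Path(out_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"projection directory not found: {root}")
    return sorted(path for path in root.rglob("*") if path.is_file())


# Usage, assuming `oort` has been fitted as shown above:
#     for path in list_projection_files(oort.outDir):
#         print(path.name)
```

Counting the returned paths is a quick way to confirm that the number of projection files matches the size of the configured grid, if Thema writes one file per combination.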
Other Embeddings¶
Thema also supports PCA and t-SNE embeddings. The params.yaml file lets you configure parameters for generating these embeddings alongside, or instead of, UMAP.
t-SNE Parameters¶
Hint
An interactive overview of the key differences between UMAP and t-SNE projections: UMAP vs. t-SNE
perplexity: Controls the effective number of neighbors used in t-SNE. Example values: 2 and 4.
dimensions: Specifies the number of dimensions for t-SNE embeddings. Example value: 2.
seed: Specifies the random seed for reproducibility in t-SNE. Example values: 32 and 42.
Oort:
  tsne:
    perplexity:
      - 2
      - 4
    dimensions:
      - 2
    seed:
      - 32
      - 42
  projectiles:
    - tsne
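Note that listing two seeds doubles the number of projections, just as a second perplexity value does. The sketch below counts the combinations for the t-SNE values above, assuming Thema expands the tsne block as a full grid in the same way this guide describes for UMAP; the function name grid_size is hypothetical.

```python
from math import prod

# Values copied from the tsne section above.
tsne_grid = {
    "perplexity": [2, 4],
    "dimensions": [2],
    "seed": [32, 42],
}


def grid_size(grid):
    """Number of parameter combinations in a full grid search."""
    return prod(len(values) for values in grid.values())


print(grid_size(tsne_grid))  # 2 * 1 * 2 = 4
```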
PCA Parameters¶
dimensions: Specifies the number of dimensions for PCA embeddings. Example value: 2.
seed: Specifies the random seed for reproducibility in PCA. Example value: 32.
Oort:
  pca:
    dimensions:
      - 2
      - 3
      - 4
    seed:
      - 32
  projectiles:
    - pca
Example YAML with Multiple Embedding Methods¶
To include multiple projections when creating your Oort object, list each method in the projectiles section of your params.yaml
file, and do not forget to define a parameter block for every method you list.
Note
This is just the Oort
section of the yaml, and does not affect any other keys (such as the Planet
yaml key controlling data preprocessing, for example).
Oort:
  umap:
    nn:
      - 2
    minDist:
      - 0.1
    dimensions:
      - 2
    seed:
      - 32
  tsne:
    perplexity:
      - 2
    dimensions:
      - 2
    seed:
      - 32
  pca:
    dimensions:
      - 2
    seed:
      - 32
  projectiles:
    - umap
    - tsne
    - pca
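Since every method listed under projectiles needs its own parameter block, a small pre-flight check can catch a missing one before fitting. The sketch below works on a plain dict shaped like the Oort section above (such a dict could be obtained with PyYAML's yaml.safe_load). The function name missing_parameter_blocks is hypothetical, and this check is not part of Thema's API.

```python
def missing_parameter_blocks(oort_config):
    """Return projectile names that have no parameter block of their own."""
    projectiles = oort_config.get("projectiles", [])
    return [name for name in projectiles if name not in oort_config]


# A dict shaped like the Oort section above, with the pca block left out
# on purpose to show what the check reports.
oort_config = {
    "umap": {"nn": [2], "minDist": [0.1], "dimensions": [2], "seed": [32]},
    "tsne": {"perplexity": [2], "dimensions": [2], "seed": [32]},
    "projectiles": ["umap", "tsne", "pca"],
}

print(missing_parameter_blocks(oort_config))  # ['pca']
```

An empty list means every listed projectile has a matching block.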