Embeddings

Oort projects high-dimensional data to lower dimensions for graph construction. Supports t-SNE and PCA.

Note

Requires Moon files from Planet. See Data Preprocessing first.

Basic Usage

from thema.multiverse import Oort

oort = Oort(
    data="/path/to/data.pkl",
    cleanDir="./outputs/my_run/clean",
    outDir="./outputs/my_run/projections",
    params={
        "tsne": {
            "perplexity": [30],
            "dimensions": [2],
            "seed": [42]
        }
    }
)
oort.fit()

Parameters

datastr or Path

Absolute path to original raw data file (same as Planet input)

cleanDirstr or Path

Absolute path to Planet output directory containing Moon files

outDirstr or Path

Absolute path for projection outputs

paramsdict

Nested dictionary of projection methods and hyperparameters

Projection Methods

t-SNE

Nonlinear dimensionality reduction preserving local structure.

params = {
    "tsne": {
        "perplexity": [15, 30, 50],
        "dimensions": [2],
        "seed": [42]
    }
}
perplexitylist of int or float

Balances local vs global structure. Typical range: 5-50.

  • 5-15: Emphasizes local neighborhoods

  • 30-50: Preserves global patterns

  • Rule of thumb: perplexity ≈ sqrt(n_samples)

dimensionslist of int

Output dimensionality. Typically 2 for Mapper graphs.

seedlist of int

Random seed for reproducibility

PCA

Linear dimensionality reduction via principal components.

params = {
    "pca": {
        "dimensions": [2, 3, 5],
        "seed": [42]  # Not used, but required
    }
}
dimensionslist of int

Number of principal components to retain

seedlist of int

Placeholder (PCA is deterministic)

Parameter Grids

Oort generates embeddings for all parameter combinations:

params = {
    "tsne": {
        "perplexity": [15, 30, 50],
        "dimensions": [2],
        "seed": [42, 13]
    }
}
# Produces: 3 perplexities × 1 dimension × 2 seeds = 6 embeddings per Moon file

Output

Comet files saved as <method>_<params>_moon_<seed>_<sample>.pkl in outDir:

  • tsne_perplexity30_dims2_seed42_moon_42_0.pkl

  • pca_dims2_seed42_moon_42_0.pkl

Each contains the reduced-dimension array and metadata.

Examples

Single Method

oort = Oort(
    data="/data/survey.pkl",
    cleanDir="./outputs/analysis/clean",
    outDir="./outputs/analysis/projections",
    params={
        "tsne": {
            "perplexity": [30],
            "dimensions": [2],
            "seed": [42]
        }
    }
)
oort.fit()

Multiple Methods

oort = Oort(
    data="/data/survey.pkl",
    cleanDir="./outputs/analysis/clean",
    outDir="./outputs/analysis/projections",
    params={
        "tsne": {
            "perplexity": [15, 30, 50],
            "dimensions": [2],
            "seed": [42]
        },
        "pca": {
            "dimensions": [2, 5],
            "seed": [42]
        }
    }
)
oort.fit()

YAML Configuration

In params.yaml:

Oort:
  tsne:
    perplexity: [30, 50]
    dimensions: [2]
    seed: [42]
  pca:
    dimensions: [2]
    seed: [42]
  projectiles: [tsne, pca]

Then:

oort = Oort(YAML_PATH="params.yaml")
oort.fit()

Best Practices

  • Start with t-SNE perplexities [15, 30, 50] to capture local and global structure

  • Use 2D embeddings for Mapper (higher dimensions increase computational cost)

  • PCA is fast and deterministic; use for baseline comparisons

  • t-SNE is stochastic; fix seeds for reproducibility

  • Grid explosion: 5 parameters × 3 Moon files = 15 embeddings