Overview

What is Thema?

Thema systematically explores hyperparameter spaces for unsupervised learning through three stages:

  1. Planet (Preprocessing): Generates multiple clean data versions with different imputation, scaling, and encoding strategies

  2. Oort (Embeddings): Creates low-dimensional projections across parameter grids (t-SNE, PCA)

  3. Galaxy (Graph Construction & Selection): Builds Mapper graphs, computes topological distances, and selects representatives

Instead of manually tuning preprocessing and embedding parameters, Thema generates candidate models systematically and uses curvature-based graph distances to identify diverse, high-quality representatives.

Pipeline Flow

flowchart LR
  A[Raw Data] --> B[Planet]
  B --> C[Moon Files]
  C --> D[Oort]
  D --> E[Comet Files]
  E --> F[Galaxy]
  F --> G[Star Graphs]
  G --> H[Curvature Distance]
  H --> I[Representatives]

  style A fill:#e1f5ff
  style C fill:#fff4e1
  style E fill:#ffe1f5
  style G fill:#e1ffe1
  style I fill:#ffe1e1
    

Core Concepts

Moon

A preprocessed dataset (cleaned, encoded, scaled, and imputed). Saved as a .pkl file in the clean/ directory. Each Moon represents one specific combination of preprocessing choices.
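To make the idea of one Moon per preprocessing combination concrete, here is a minimal sketch that uses scikit-learn directly rather than Thema's own API. The toy data, the dictionary layout, and the file names are assumptions made for illustration; Thema's actual Moon objects and naming scheme may differ.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy data with missing values (illustrative only).
raw = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])

scalers = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}
impute_strategies = ["mean", "median"]

# Write into a temporary clean/ directory so the sketch has no side effects.
clean_dir = Path(tempfile.mkdtemp()) / "clean"
clean_dir.mkdir(parents=True)

moon_paths = []
for strategy in impute_strategies:
    for scaler_name, scaler_cls in scalers.items():
        imputed = SimpleImputer(strategy=strategy).fit_transform(raw)
        scaled = scaler_cls().fit_transform(imputed)
        moon = {"data": scaled, "imputeMethod": strategy, "scaler": scaler_name}
        # Hypothetical file name; Thema's own naming scheme may differ.
        path = clean_dir / f"moon_{strategy}_{scaler_name}.pkl"
        with path.open("wb") as handle:
            pickle.dump(moon, handle)
        moon_paths.append(path)

print(len(moon_paths), "preprocessed variants written")
```

Two imputation strategies crossed with three scalers give six variants, each of which would feed the embedding stage independently.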

Comet

A low-dimensional embedding of a Moon. Contains the projection array and metadata. Saved as a .pkl file in the projections/ directory. Each Moon yields multiple Comets, one per projection method and parameter combination.
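As a rough illustration of how one Moon fans out into several Comets, the sketch below projects the same array with PCA and with t-SNE at two perplexity values using scikit-learn. The random input data and the dictionary keys (`projection`, `projectile`, `params`) are assumptions for this example, not Thema's actual Comet format.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the data held by one Moon (random, illustrative only).
rng = np.random.default_rng(42)
moon_data = rng.normal(size=(60, 5))

comets = []

# One PCA projection.
comets.append({
    "projection": PCA(n_components=2).fit_transform(moon_data),
    "projectile": "pca",
    "params": {"dimensions": 2},
})

# One t-SNE projection per perplexity value in a small grid.
for perplexity in (5, 15):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    comets.append({
        "projection": tsne.fit_transform(moon_data),
        "projectile": "tsne",
        "params": {"dimensions": 2, "perplexity": perplexity, "seed": 42},
    })

for comet in comets:
    print(comet["projectile"], comet["params"], comet["projection"].shape)
```

Note that t-SNE requires the perplexity to be smaller than the number of samples, which is one reason very small datasets limit the usable grid.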

Star

A Mapper graph built from a Comet using the Kepler Mapper implementation of the Mapper algorithm. Contains nodes (clusters), edges (overlaps), and topology. Saved as a .pkl file in the models/ directory.
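The node and edge structure described above can be sketched with a simplified, from-scratch Mapper construction. This is not Kepler Mapper's API: the function `mapper_graph`, its one-dimensional cover, and the way overlap is applied are simplifying assumptions chosen to keep the example short.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN


def mapper_graph(points, lens, n_cubes=5, perc_overlap=0.5, eps=0.1):
    """Simplified Mapper: 1-D interval cover over `lens`, DBSCAN inside each interval."""
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_cubes
    pad = perc_overlap * width / 2  # each interval is widened by this on both sides
    nodes = {}
    for i in range(n_cubes):
        start, end = lo + i * width - pad, lo + (i + 1) * width + pad
        idx = np.where((lens >= start) & (lens <= end))[0]
        if len(idx) == 0:
            continue
        # min_samples=1 means every point joins a cluster (no noise label).
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(points[idx])
        for label in np.unique(labels):
            nodes[f"cube{i}_cluster{label}"] = set(idx[labels == label].tolist())
    # Two nodes are connected when their clusters share at least one point.
    edges = {(a, b) for a, b in combinations(sorted(nodes), 2) if nodes[a] & nodes[b]}
    return nodes, edges


# Points along a line segment: the resulting graph should be a simple path.
x = np.linspace(0.0, 1.0, 50)
points = np.column_stack([x, np.zeros_like(x)])
nodes, edges = mapper_graph(points, lens=points[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```

Changing `n_cubes` or `perc_overlap` changes how many nodes appear and how densely they connect, which is why the Galaxy stage sweeps these values.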

Galaxy

Orchestrator that generates Stars across parameter grids, computes pairwise graph distances using curvature filtrations, clusters similar graphs, and selects representatives.
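The distance, clustering, and selection steps can be illustrated with a deliberately simplified stand-in. The sketch below does not reproduce Thema's curvature filtrations or its stellar_curvature_distance metric; instead it assumes a much cruder comparison (Forman curvature of unweighted edges, compared via a 1-D Wasserstein distance from SciPy), followed by hierarchical clustering and a max_nodes style pick. The toy graphs and helper names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import wasserstein_distance


def forman_curvatures(edges):
    """Forman curvature of each edge in an unweighted graph: 4 - deg(u) - deg(v)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return [4 - degree[u] - degree[v] for u, v in edges]


def node_count(edges):
    return len({vertex for edge in edges for vertex in edge})


# Toy graphs standing in for Stars: two paths and two star-shaped graphs.
graphs = {
    "path4": [(0, 1), (1, 2), (2, 3)],
    "path5": [(0, 1), (1, 2), (2, 3), (3, 4)],
    "star4": [(0, 1), (0, 2), (0, 3), (0, 4)],
    "star5": [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)],
}

# Pairwise distances between the graphs' curvature distributions.
names = list(graphs)
dist = np.zeros((len(names), len(names)))
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = wasserstein_distance(
            forman_curvatures(graphs[names[i]]), forman_curvatures(graphs[names[j]])
        )
        dist[i, j] = dist[j, i] = d

# Group similar graphs into two clusters with average-linkage clustering.
labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")

# Select one representative per cluster: the graph with the most nodes.
representatives = {}
for name, label in zip(names, labels):
    best = representatives.get(int(label))
    if best is None or node_count(graphs[name]) > node_count(graphs[best]):
        representatives[int(label)] = name

print(sorted(representatives.values()))
```

The two paths land in one cluster and the two star-shaped graphs in the other, so one representative per shape survives rather than four near-duplicates.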

Key Parameters

Planet
  • scaler: standard, minmax, robust

  • encoding: one_hot, label, ordinal

  • imputeMethods: mean, median, mode, sampleNormal

  • seeds: Random seeds for reproducible sampling

Oort
  • perplexity: t-SNE neighborhood size (5-50)

  • dimensions: Output dimensionality (typically 2)

  • projectiles: List of methods to use (tsne, pca)

Galaxy
  • nCubes: Cover resolution (5-50)

  • percOverlap: Cube overlap fraction (0-1)

  • clusterer: Algorithm for within-cube clustering (HDBSCAN, DBSCAN, KMeans)

  • metric: Graph distance (stellar_curvature_distance)

  • selector: Representative selection (max_nodes, max_edges, random)
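Because each stage multiplies the number of candidates produced by the previous one, it helps to count grid sizes before launching a run. The sketch below uses plain Python dictionaries and `itertools.product`; the grid values are arbitrary examples and the dictionary layout is an assumption, not Thema's YAML schema.

```python
from itertools import product

# Example grids using the parameter names listed above (values are illustrative).
planet_grid = {
    "scaler": ["standard", "minmax", "robust"],
    "encoding": ["one_hot"],
    "imputeMethods": ["mean", "median"],
    "seeds": [42],
}
oort_grid = {
    "tsne": {"perplexity": [5, 30, 50], "dimensions": [2]},
    "pca": {"dimensions": [2]},  # perplexity only applies to t-SNE
}
galaxy_grid = {"nCubes": [5, 10, 20], "percOverlap": [0.4, 0.6]}


def expand(grid):
    """Return every combination of values in a {name: [values]} grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


moons = expand(planet_grid)
comets_per_moon = [
    {"projectile": name, **params}
    for name, grid in oort_grid.items()
    for params in expand(grid)
]
stars_per_comet = expand(galaxy_grid)

total_stars = len(moons) * len(comets_per_moon) * len(stars_per_comet)
print(len(moons), len(comets_per_moon), len(stars_per_comet), total_stars)
```

Even these modest grids yield 6 x 4 x 6 = 144 candidate graphs, which is the scale at which automated distance-based selection becomes more practical than manual inspection.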

Output Structure

Thema organizes outputs hierarchically:

{outDir}/{runName}/
├── clean/
│   ├── moon_42_0.pkl
│   ├── moon_42_1.pkl
│   └── ...
├── projections/
│   ├── tsne_perplexity30_dims2_seed42_moon_42_0.pkl
│   ├── pca_dims2_seed42_moon_42_0.pkl
│   └── ...
└── models/
    ├── star_tsne_perplexity30_nCubes10_overlap0.6.pkl
    ├── star_pca_dims2_nCubes10_overlap0.6.pkl
    └── ...
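A layout like this is easy to inspect with the standard library. The sketch below recreates the tree in a temporary directory using the example file names above and then lists the .pkl files per stage; the run directory name `my_run` and the helper `summarize_run` are hypothetical.

```python
import tempfile
from pathlib import Path

# Recreate the example layout with empty placeholder files.
run_dir = Path(tempfile.mkdtemp()) / "my_run"
layout = {
    "clean": ["moon_42_0.pkl", "moon_42_1.pkl"],
    "projections": [
        "tsne_perplexity30_dims2_seed42_moon_42_0.pkl",
        "pca_dims2_seed42_moon_42_0.pkl",
    ],
    "models": [
        "star_tsne_perplexity30_nCubes10_overlap0.6.pkl",
        "star_pca_dims2_nCubes10_overlap0.6.pkl",
    ],
}
for subdir, file_names in layout.items():
    (run_dir / subdir).mkdir(parents=True)
    for file_name in file_names:
        (run_dir / subdir / file_name).touch()


def summarize_run(run_dir):
    """Map each stage directory to the sorted names of its .pkl files."""
    return {
        subdir: sorted(path.name for path in (run_dir / subdir).glob("*.pkl"))
        for subdir in ("clean", "projections", "models")
    }


summary = summarize_run(run_dir)
counts = {stage: len(files) for stage, files in summary.items()}
print(counts)
```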

When to Use Thema

Good Use Cases

  • Exploring preprocessing choices for unsupervised learning

  • Comparing embedding methods (t-SNE vs PCA) systematically

  • Finding robust data representations across hyperparameter grids

  • Identifying diverse graph topologies in your data

  • Validating clustering stability across multiple configurations

Not Ideal For

  • Supervised learning (Thema focuses on unsupervised tasks)

  • Single fixed preprocessing pipeline (use sklearn directly)

  • Real-time inference (Thema generates models offline)

  • Small datasets (<100 samples; topological methods need sufficient data)

Next Steps

New Users

Start with Quickstart for a 5-minute walkthrough.

YAML Workflows

See Getting Started for a complete tutorial.

Programmatic Control

Read Programmatic Pipeline for Python-only workflows.

Parameter Tuning

Check Tuning and Selection for grid strategies and filtering.

Advanced Customization

Explore Customizing Thema to write custom filters and graph builders.