Overview
What is Thema?
Thema systematically explores hyperparameter spaces for unsupervised learning in three stages:
- Planet (Preprocessing): Generates multiple cleaned versions of the data using different imputation, scaling, and encoding strategies
- Oort (Embeddings): Creates low-dimensional projections across parameter grids (t-SNE, PCA)
- Galaxy (Graph Construction & Selection): Builds Mapper graphs, computes topological distances, and selects representatives
Instead of manually tuning preprocessing and embedding parameters, Thema generates candidate models systematically and uses curvature-based graph distances to identify diverse, high-quality representatives.
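To make "generates candidate models systematically" concrete, the following is a minimal sketch of expanding a parameter grid into one configuration per combination. The option names are taken from the Key Parameters section; the dictionary layout and the `expand_grid` helper are hypothetical illustrations, not Thema's configuration schema or API.

```python
from itertools import product

# Hypothetical grid. The option values mirror those listed under Key Parameters,
# but this dict layout is an assumption, not Thema's actual config format.
planet_grid = {
    "scaler": ["standard", "minmax"],
    "encoding": ["one_hot"],
    "imputeMethod": ["mean", "median", "sampleNormal"],
    "seed": [42, 7],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(planet_grid)
print(len(candidates))  # 2 * 1 * 3 * 2 = 12
print(candidates[0])
```

Each resulting dictionary would correspond to one preprocessed dataset; the same expansion applies to embedding and graph parameters, so the number of candidates grows multiplicatively with every option added.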
Pipeline Flow
flowchart LR
    A[Raw Data] --> B[Planet]
    B --> C[Moon Files]
    C --> D[Oort]
    D --> E[Comet Files]
    E --> F[Galaxy]
    F --> G[Star Graphs]
    G --> H[Curvature Distance]
    H --> I[Representatives]
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style E fill:#ffe1f5
    style G fill:#e1ffe1
    style I fill:#ffe1e1
Core Concepts
- Moon
  Preprocessed dataset (cleaned, encoded, scaled, imputed). Saved as .pkl files in the clean/ directory. Each Moon represents a specific combination of preprocessing choices.
- Comet
  Low-dimensional embedding of a Moon. Contains the projection array and metadata. Saved as .pkl files in the projections/ directory. Each Moon yields multiple Comets (one per projection method and parameter combination).
- Star
  Mapper graph built from a Comet using the Kepler Mapper algorithm. Contains nodes (clusters), edges (overlaps), and topology. Saved as .pkl files in the models/ directory.
- Galaxy
  Orchestrator that generates Stars across parameter grids, computes pairwise graph distances using curvature filtrations, clusters similar graphs, and selects representatives.
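The Galaxy step (pairwise distances, clustering, representative selection) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the distance matrix holds invented numbers rather than curvature-based distances, the threshold-based grouping is a simple stand-in for whatever clustering Thema applies, and the helper names are hypothetical. Only the selection rule is modeled on the `max_nodes` selector named in this document.

```python
# Placeholder inputs: four hypothetical Stars with node counts and a symmetric
# distance matrix. The numbers are invented for illustration and are NOT
# outputs of stellar_curvature_distance.
node_counts = {"star_a": 12, "star_b": 15, "star_c": 8, "star_d": 9}
names = list(node_counts)
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]


def threshold_clusters(dist, threshold):
    """Group indices linked by distances <= threshold (connected components)."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] <= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def select_max_nodes(names, labels, node_counts):
    """Pick, for each cluster, the graph with the most nodes."""
    best = {}
    for name, label in zip(names, labels):
        if label not in best or node_counts[name] > node_counts[best[label]]:
            best[label] = name
    return [best[label] for label in sorted(best)]


labels = threshold_clusters(dist, threshold=0.3)
representatives = select_max_nodes(names, labels, node_counts)
print(labels)           # [0, 0, 1, 1]
print(representatives)  # ['star_b', 'star_d']
```

The point of the sketch is the shape of the computation: near-duplicate graphs collapse into one group, and a single graph per group is kept, so the final set stays small while still covering distinct topologies.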
Key Parameters
- Planet
  - scaler: standard, minmax, robust
  - encoding: one_hot, label, ordinal
  - imputeMethods: mean, median, mode, sampleNormal
  - seeds: Random seeds for reproducible sampling
- Oort
  - perplexity: t-SNE neighborhood size (5-50)
  - dimensions: Output dimensionality (typically 2)
  - projectiles: List of methods to use (tsne, pca)
- Galaxy
  - nCubes: Cover resolution (5-50)
  - percOverlap: Cube overlap fraction (0-1)
  - clusterer: Algorithm for within-cube clustering (HDBSCAN, DBSCAN, KMeans)
  - metric: Graph distance (stellar_curvature_distance)
  - selector: Representative selection (max_nodes, max_edges, random)
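To give a feel for what nCubes and percOverlap control, here is a rough one-dimensional sketch of a Mapper-style cover. The interval formula is one common convention chosen for illustration (each interval is widened so that neighbouring intervals share a fraction `perc_overlap` of their length); it is an assumption and may differ from what Kepler Mapper or Thema computes internally. The function names are hypothetical.

```python
def cover_intervals(lo, hi, n_cubes, perc_overlap):
    """Return n_cubes overlapping (start, end) intervals covering [lo, hi]."""
    if not 0 <= perc_overlap < 1:
        raise ValueError("perc_overlap must be in [0, 1)")
    base = (hi - lo) / n_cubes              # spacing between interval centres
    half = base / (2 * (1 - perc_overlap))  # widened half-length (assumed convention)
    centres = [lo + base * (i + 0.5) for i in range(n_cubes)]
    return [(c - half, c + half) for c in centres]


def members(values, interval):
    """Indices of the values that fall inside the interval."""
    start, end = interval
    return [i for i, v in enumerate(values) if start <= v <= end]


lens = [0.0, 1.0, 2.4, 2.6, 4.0, 5.0]  # toy 1-D lens (projection) values
intervals = cover_intervals(min(lens), max(lens), n_cubes=2, perc_overlap=0.2)
for interval in intervals:
    print(interval, members(lens, interval))
```

With these toy values, the points at indices 2 and 3 fall in both intervals. Points shared between cover elements are what connect Mapper nodes with edges, so a larger overlap tends to produce more edges, while more cubes produce a finer-grained graph.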
Output Structure
Thema organizes outputs hierarchically:
{outDir}/{runName}/
├── clean/
│   ├── moon_42_0.pkl
│   ├── moon_42_1.pkl
│   └── ...
├── projections/
│   ├── tsne_perplexity30_dims2_seed42_moon_42_0.pkl
│   ├── pca_dims2_seed42_moon_42_0.pkl
│   └── ...
└── models/
    ├── star_tsne_perplexity30_nCubes10_overlap0.6.pkl
    ├── star_pca_dims2_nCubes10_overlap0.6.pkl
    └── ...
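Since every stage writes .pkl files into its own directory, outputs can be inspected with the standard library alone. The sketch below builds a throwaway run directory with stand-in contents so that it runs without Thema installed; the `load_stage` helper and the `demo_run` name are hypothetical, and real files contain Thema objects rather than the plain dictionaries used here.

```python
import pickle
import tempfile
from pathlib import Path


def load_stage(run_dir, stage):
    """Load every .pkl file in run_dir/stage, keyed by file name."""
    stage_dir = Path(run_dir) / stage
    if not stage_dir.is_dir():
        return {}
    loaded = {}
    for path in sorted(stage_dir.glob("*.pkl")):
        with path.open("rb") as fh:
            loaded[path.name] = pickle.load(fh)
    return loaded


# Stand-in run directory: file names follow the layout shown above, but the
# pickled payloads are placeholder dicts, not real Moons or Comets.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "demo_run"
    for stage, name in [
        ("clean", "moon_42_0.pkl"),
        ("projections", "pca_dims2_seed42_moon_42_0.pkl"),
    ]:
        (run_dir / stage).mkdir(parents=True, exist_ok=True)
        with (run_dir / stage / name).open("wb") as fh:
            pickle.dump({"stage": stage}, fh)

    moons = load_stage(run_dir, "clean")
    comets = load_stage(run_dir, "projections")
    stars = load_stage(run_dir, "models")  # directory absent here, so empty

print(sorted(moons), sorted(comets), stars)
```

Unpickling executes code embedded in the file, so only load .pkl files from runs you trust.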
When to Use Thema
Good Use Cases
- Exploring preprocessing choices for unsupervised learning
- Comparing embedding methods (t-SNE vs PCA) systematically
- Finding robust data representations across hyperparameter grids
- Identifying diverse graph topologies in your data
- Validating clustering stability across multiple configurations
Not Ideal For
- Supervised learning (Thema focuses on unsupervised tasks)
- A single, fixed preprocessing pipeline (use sklearn directly)
- Real-time inference (Thema generates models offline)
- Small datasets (<100 samples; topological methods need sufficient data)
Next Steps
- New Users
  Start with Quickstart for a 5-minute walkthrough.
- YAML Workflows
  See Getting Started for a complete tutorial.
- Programmatic Control
  Read Programmatic Pipeline for Python-only workflows.
- Parameter Tuning
  Check Tuning and Selection for grid strategies and filtering.
- Advanced Customization
  Explore Customizing Thema to write custom filters and graph builders.