Quickstart
Run the full pipeline from raw data to representative graphs.
flowchart LR
    A[params.yaml] --> B[Planet: clean]
    B --> C[Oort: embed]
    C --> D[Galaxy: graphs]
    D --> E[Representatives]
One-Shot Pipeline
from thema.thema import Thema
T = Thema("params.yaml")
T.genesis()
print(T.selected_model_files) # Paths to representative graphs
Minimal Configuration
params.yaml
runName: my_run
data: /path/to/data.pkl
outDir: ./outputs
Planet:
  scaler: standard # Zero-mean, unit-variance scaling
  encoding: one_hot # Categorical encoding
  imputeColumns: auto # Auto-detect columns with missing values
  imputeMethods: auto # Auto-select imputation method per column
  numSamples: 1 # Number of imputed datasets per seed
  seeds: auto # Auto-generate random seeds
Oort:
  tsne:
    perplexity: [30] # t-SNE neighborhood size
    dimensions: [2] # Output dimensions
    seed: [42]
  pca:
    dimensions: [2]
    seed: [42]
  projectiles: [tsne, pca] # Methods to run
Galaxy:
  metric: stellar_curvature_distance # Graph distance metric
  selector: max_nodes # Selection strategy
  nReps: 2 # Number of representatives
  stars: [jmap] # Graph construction method
  jmap:
    nCubes: [8] # Cover resolution
    percOverlap: [0.3] # Cube overlap fraction
    minIntersection: [-1] # Edge formation threshold (-1 = weighted)
    clusterer:
      - [HDBSCAN, {min_cluster_size: 5}]
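Before launching a run, it can help to sanity-check the parsed configuration. The sketch below assumes the YAML has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`). `check_params` is a hypothetical helper, not part of Thema, and both the list of required keys and the rule that each named method needs its own settings block are inferred from the minimal example rather than taken from the library.

```python
# Hypothetical helper (not part of Thema): sanity-check a parsed params dict.
# The required keys are inferred from the minimal example, not from the library.
REQUIRED_TOP_LEVEL = ("runName", "data", "outDir", "Planet", "Oort", "Galaxy")


def check_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing top-level key: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in params]

    # Assumption: every method named in Oort.projectiles has its own settings block.
    oort = params.get("Oort", {})
    for name in oort.get("projectiles", []):
        if name not in oort:
            problems.append(f"Oort.projectiles lists '{name}' but Oort.{name} is missing")

    # Assumption: the same holds for graph construction methods in Galaxy.stars.
    galaxy = params.get("Galaxy", {})
    for name in galaxy.get("stars", []):
        if name not in galaxy:
            problems.append(f"Galaxy.stars lists '{name}' but Galaxy.{name} is missing")

    return problems


# Example dict that deliberately omits the pca block.
example = {
    "runName": "my_run",
    "data": "/path/to/data.pkl",
    "outDir": "./outputs",
    "Planet": {"scaler": "standard"},
    "Oort": {"tsne": {"perplexity": [30]}, "projectiles": ["tsne", "pca"]},
    "Galaxy": {"stars": ["jmap"], "jmap": {"nCubes": [8]}},
}
print(check_params(example))
```

Because the example leaves out the `pca` block, the check reports exactly one problem for it. An empty list means none of these particular checks failed, not that Thema will accept the file.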
Key Parameters
- runName (str)
  Output subdirectory name.
- data (str)
  Absolute path to input data (CSV, pickle, parquet).
- outDir (str)
  Base directory for all outputs. Creates: {outDir}/{runName}/{clean,projections,models}/
- Planet.scaler (str)
  Options: standard (recommended), minmax, robust, None
- Oort.projectiles (list of str)
  Projection methods to use: tsne, pca
- Galaxy.nReps (int)
  Number of representative graphs to select.
- Galaxy.selector (str)
  Options: max_nodes (largest graph), max_edges, random
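The directory layout described for outDir can be computed ahead of time, for instance to check where a run will write. This is a minimal sketch using only `pathlib`. `stage_dirs` is a hypothetical helper and does not call Thema.

```python
from pathlib import Path

# Stage subdirectories named in the outDir description.
STAGES = ("clean", "projections", "models")


def stage_dirs(out_dir: str, run_name: str) -> dict[str, Path]:
    """Map each stage name to {outDir}/{runName}/{stage}."""
    run_root = Path(out_dir) / run_name
    return {stage: run_root / stage for stage in STAGES}


dirs = stage_dirs("./outputs", "my_run")
for stage, path in dirs.items():
    print(f"{stage}: {path}")
```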
Step-by-Step Control
Run stages independently:
from thema.multiverse import Planet, Oort, Galaxy
planet = Planet(YAML_PATH="params.yaml")
planet.fit() # Outputs: {outDir}/{runName}/clean/*.pkl
oort = Oort(YAML_PATH="params.yaml")
oort.fit() # Outputs: {outDir}/{runName}/projections/*.pkl
galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit() # Outputs: {outDir}/{runName}/models/*.pkl
reps = galaxy.collapse() # Dict: {cluster_id: {"star": StarGraph, "file": Path}}
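The comment on `collapse()` describes a dict keyed by cluster id, where each value holds a `"star"` object and a `"file"` path. Assuming that shape, the sketch below gathers the file path for each cluster. The `reps` dict is a hand-built stand-in, with `None` in place of real star objects and made-up file names, so the snippet runs without Thema. `representative_files` is a hypothetical helper.

```python
from pathlib import Path


def representative_files(reps: dict) -> dict:
    """Map cluster id -> Path of its representative graph file, ordered by cluster id."""
    return {cluster_id: Path(entry["file"])
            for cluster_id, entry in sorted(reps.items())}


# Stand-in for the value returned by galaxy.collapse(); file names are made up.
reps = {
    1: {"star": None, "file": Path("outputs/my_run/models/example_b.pkl")},
    0: {"star": None, "file": Path("outputs/my_run/models/example_a.pkl")},
}

for cluster_id, path in representative_files(reps).items():
    print(cluster_id, path.name)
```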
Cleaning Outputs
Remove previous run outputs:
from thema.thema import Thema
T = Thema("params.yaml")
T.spaghettify() # Deletes {outDir}/{runName}/ directory tree
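Because `spaghettify()` removes the whole run directory, it may be worth listing what is there first. This is a minimal sketch using `pathlib`, assuming the `{outDir}/{runName}/` layout described earlier. `list_run_files` is a hypothetical helper and does not call Thema. The demo runs against a temporary directory rather than real outputs.

```python
import tempfile
from pathlib import Path


def list_run_files(out_dir: str, run_name: str) -> list[Path]:
    """Return all files under {outDir}/{runName}/, sorted; empty if the run directory is absent."""
    run_root = Path(out_dir) / run_name
    if not run_root.is_dir():
        return []
    return sorted(p for p in run_root.rglob("*") if p.is_file())


# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "my_run" / "models"
    models.mkdir(parents=True)
    (models / "example.pkl").write_bytes(b"")
    found = [p.relative_to(tmp).as_posix() for p in list_run_files(tmp, "my_run")]
    missing = list_run_files(tmp, "no_such_run")

print(found)
```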
Logging
Enable detailed logging:
import thema
thema.enable_logging('DEBUG') # Verbose output
# or
thema.enable_logging('INFO') # Progress messages