Quickstart

Run the full pipeline from raw data to representative graphs.

flowchart LR
  A[params.yaml] --> B[Planet: clean]
  B --> C[Oort: embed]
  C --> D[Galaxy: graphs]
  D --> E[Representatives]

One-Shot Pipeline

from thema.thema import Thema

T = Thema("params.yaml")
T.genesis()
print(T.selected_model_files)  # Paths to representative graphs
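
To inspect the selected graphs after the run, the returned paths can be loaded back into memory. The following is a minimal sketch, assuming that `T.selected_model_files` is an iterable of paths and that each path points to a pickle file (the stage outputs listed on this page use the `.pkl` extension, but the format is not stated here). `load_graphs` is a hypothetical helper, not part of thema.

```python
import pickle
from pathlib import Path


def load_graphs(paths):
    """Unpickle each file and return a {path: object} mapping.

    Assumption: every path points to a pickle file. Unpickling thema
    objects also requires thema to be importable in this environment.
    """
    loaded = {}
    for path in map(Path, paths):
        with path.open("rb") as fh:
            loaded[path] = pickle.load(fh)
    return loaded
```

Usage would then be `graphs = load_graphs(T.selected_model_files)`. Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.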

Minimal Configuration

params.yaml

runName: my_run
data: /path/to/data.pkl
outDir: ./outputs

Planet:
  scaler: standard           # Zero-mean, unit-variance scaling
  encoding: one_hot          # Categorical encoding
  imputeColumns: auto        # Auto-detect columns with missing values
  imputeMethods: auto        # Auto-select imputation method per column
  numSamples: 1              # Number of imputed datasets per seed
  seeds: auto                # Auto-generate random seeds

Oort:
  tsne:
    perplexity: [30]         # t-SNE neighborhood size
    dimensions: [2]          # Output dimensions
    seed: [42]
  pca:
    dimensions: [2]
    seed: [42]
  projectiles: [tsne, pca]   # Methods to run

Galaxy:
  metric: stellar_curvature_distance  # Graph distance metric
  selector: max_nodes                 # Selection strategy
  nReps: 2                            # Number of representatives
  stars: [jmap]                       # Graph construction method
  jmap:
    nCubes: [8]              # Cover resolution
    percOverlap: [0.3]       # Cube overlap fraction
    minIntersection: [-1]    # Edge formation threshold (-1 = weighted)
    clusterer:
      - [HDBSCAN, {min_cluster_size: 5}]
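
Most Oort and jmap settings in the configuration above are written as lists, which suggests that several values can be supplied per setting. The sketch below shows one way such lists could expand into individual runs, assuming a Cartesian grid over the listed values. That expansion rule is an assumption and is not confirmed on this page; `expand_grid` is a hypothetical helper, not a thema function.

```python
from itertools import product


def expand_grid(params):
    """Return one dict per combination of the list-valued settings."""
    keys = list(params)
    value_lists = [params[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Hypothetical t-SNE block with two perplexities instead of one.
tsne = {"perplexity": [15, 30], "dimensions": [2], "seed": [42]}

for combo in expand_grid(tsne):
    print(combo)
```

Under this assumption, each extra value in a list multiplies the number of runs, so grids are best kept small while experimenting.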

Key Parameters

runName (str)

Output subdirectory name

data (str)

Absolute path to input data (CSV, pickle, parquet)

outDir (str)

Base directory for all outputs. Creates: {outDir}/{runName}/{clean,projections,models}/

Planet.scaler (str)

Options: standard (recommended), minmax, robust, None

Oort.projectiles (list of str)

Projection methods to use: tsne, pca

Galaxy.nReps (int)

Number of representative graphs to select

Galaxy.selector (str)

Options: max_nodes (largest graph), max_edges, random
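
To make the `Planet.scaler` options listed above concrete, the sketch below applies the conventional definitions of standard, min-max, and robust scaling to a single numeric column using only the standard library. It assumes thema's options follow these textbook formulas, which this page does not spell out. The function names and sample values are illustrative only.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


# Illustrative column; a constant column would divide by zero in all three.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(standard_scale(values))
print(minmax_scale(values))
print(robust_scale(values))
```

Robust scaling uses the median and quartiles, so a few extreme values shift the result less than they would under standard or min-max scaling.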

Step-by-Step Control

Run stages independently:

from thema.multiverse import Planet, Oort, Galaxy

planet = Planet(YAML_PATH="params.yaml")
planet.fit()  # Outputs: {outDir}/{runName}/clean/*.pkl

oort = Oort(YAML_PATH="params.yaml")
oort.fit()    # Outputs: {outDir}/{runName}/projections/*.pkl

galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit()  # Outputs: {outDir}/{runName}/models/*.pkl

reps = galaxy.collapse()  # Dict: {cluster_id: {"star": StarGraph, "file": Path}}
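
The dictionary returned by `galaxy.collapse()` can be walked like any nested mapping. The sketch below assumes exactly the shape given in the comment above (`{cluster_id: {"star": ..., "file": ...}}`) and uses a stand-in dictionary so that it runs without thema. The file names and the helper `representative_files` are hypothetical.

```python
from pathlib import Path


def representative_files(reps):
    """Map each cluster id to the file of its representative graph."""
    return {cluster_id: Path(entry["file"]) for cluster_id, entry in reps.items()}


# Stand-in for the value returned by galaxy.collapse(). The "star" values
# are placeholders, not real StarGraph objects, and the paths are made up.
fake_reps = {
    0: {"star": object(), "file": Path("outputs/my_run/models/a.pkl")},
    1: {"star": object(), "file": Path("outputs/my_run/models/b.pkl")},
}

for cluster_id, path in sorted(representative_files(fake_reps).items()):
    print(cluster_id, path.name)
```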

Cleaning Outputs

Remove previous run outputs:

from thema.thema import Thema

T = Thema("params.yaml")
T.spaghettify()  # Deletes {outDir}/{runName}/ directory tree
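
Because the deletion covers the whole run directory, it can be worth listing its contents first. The following is a minimal sketch using only the standard library. It assumes the `{outDir}/{runName}/` layout described under Key Parameters; `run_directory` and `files_to_be_deleted` are hypothetical helpers, not thema functions, and the values passed at the bottom are the ones from the minimal configuration.

```python
from pathlib import Path


def run_directory(out_dir, run_name):
    """Build the {outDir}/{runName} path for a run."""
    return Path(out_dir) / run_name


def files_to_be_deleted(out_dir, run_name):
    """List every file under the run directory, or [] if it does not exist."""
    root = run_directory(out_dir, run_name)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


# Values taken from the minimal configuration (outDir and runName).
for path in files_to_be_deleted("./outputs", "my_run"):
    print(path)
```

This only reads the directory tree and deletes nothing, so it is safe to run before calling `T.spaghettify()`.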

Logging

Enable detailed logging:

import thema
thema.enable_logging('DEBUG')  # Verbose output
# or
thema.enable_logging('INFO')   # Progress messages