Manual Configuration Guide¶
This guide explains how to configure and run the Thema pipeline programmatically without YAML files. It covers preprocessing, dimensionality reduction, graph construction, filtering, and model selection.
Overview¶
The Thema pipeline consists of four main stages:
Planet - Data preprocessing and cleaning
Oort - Dimensionality reduction
Galaxy - TDA graph construction using Mapper
Collapse - Model selection and filtering
Each stage produces outputs consumed by the next, creating a structured workflow from raw data to representative graphs.
Directory Structure¶
Thema organizes outputs into three directories:
clean/ - Preprocessed datasets
projections/ - Dimensionality-reduced data
graphs/ - TDA Mapper graphs
Important: Thema requires absolute paths for all directory arguments.
from pathlib import Path
base_dir = Path("/absolute/path/to/thema_outputs")
clean_dir = base_dir / "clean"
projections_dir = base_dir / "projections"
graphs_dir = base_dir / "graphs"
for d in [clean_dir, projections_dir, graphs_dir]:
    d.mkdir(parents=True, exist_ok=True)
Stage 1: Preprocessing with Planet¶
Planet handles data cleaning and generates multiple preprocessed versions for robustness analysis.
Class Initialization¶
from thema.multiverse import Planet
planet = Planet(
data=input_path,
dropColumns=columns_to_drop,
imputeColumns=columns_to_impute,
imputeMethods=imputation_methods,
scaler=scaling_method,
seeds=random_seeds,
numSamples=samples_per_seed
)
planet.outDir = clean_dir
planet.fit()
Parameters¶
data (str or Path)
Absolute path to the input data file (CSV, pickle, or parquet).

dropColumns (list of str)
Column names to remove before analysis. Typically includes identifiers, dates, or non-numeric features.

imputeColumns (list of str)
Column names requiring imputation for missing values. Must align with imputeMethods.

imputeMethods (list of str)
Imputation strategy for each column in imputeColumns. Options:
"mode" - Most frequent value
"mean" - Column mean
"median" - Column median
"sampleNormal" - Sample from a normal distribution fitted to the column
"zeros" - Fill with zeros

scaler (str)
Feature scaling method. Options:
"standard" - Zero mean, unit variance (recommended)
"minmax" - Scale to [0, 1] range
"robust" - Robust to outliers using IQR
None - No scaling

seeds (list of int)
Random seeds for reproducible sampling. Each seed generates numSamples datasets.

numSamples (int)
Number of imputed datasets per seed. Creates multiple "universes" for robustness.
Output¶
Planet generates preprocessed pickle files in outDir:
moon_<seed>_<sample>.pkl - Each file contains cleaned, imputed, and scaled data
These files are automatically discovered by Oort and Galaxy.
Example¶
planet = Planet(
data="/data/raw_dataset.pkl",
dropColumns=["id", "name", "timestamp"],
imputeColumns=["age", "category", "value"],
imputeMethods=["sampleNormal", "mode", "median"],
scaler="standard",
seeds=[42, 13, 99],
numSamples=2
)
planet.outDir = clean_dir
planet.fit()
# Produces: 6 files (3 seeds × 2 samples)
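For a quick sanity check after fitting, you can list the Moon files and inspect one. A minimal sketch, assuming the moon_*.pkl naming above and the imputeData attribute used later in this guide for counting items:
import pandas as pd
from pathlib import Path

# List the preprocessed Moon files written by Planet
moon_files = sorted(Path(clean_dir).glob("moon_*.pkl"))
print(f"Planet produced {len(moon_files)} preprocessed datasets")

# Load one Moon and look at its cleaned, imputed, scaled data
moon = pd.read_pickle(moon_files[0])
print(moon.imputeData.head())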
Stage 2: Dimensionality Reduction with Oort¶
Oort projects high-dimensional data to lower dimensions for graph construction.
Class Initialization¶
from thema.multiverse import Oort
oort = Oort(
data=input_path,
cleanDir=clean_dir,
outDir=projections_dir,
params=projection_config
)
oort.fit()
Parameters¶
data (str or Path)
Path to the original raw data file (same as the Planet input).

cleanDir (str or Path)
Absolute path to the Planet output directory (clean/).

outDir (str or Path)
Absolute path for projection outputs.

params (dict)
Nested dictionary specifying projection methods and hyperparameters.
Projection Configuration¶
The params dictionary structure:
params = {
"method_name": {
"param1": [value1, value2, ...],
"param2": [value3, value4, ...],
"dimensions": [2], # Output dimensionality
"seed": [42] # Random seed
}
}
Supported Methods¶
t-SNE ("tsne")
"tsne": {
"perplexity": [15, 30, 50], # Balance local vs global structure
"dimensions": [2], # Typically 2 for Mapper
"seed": [42]
}
perplexity: Lower values (5-15) emphasize local structure, higher values (30-50) preserve global patterns
PCA ("pca")
"pca": {
"dimensions": [2, 3, 5],
"seed": [42] # Not used but required
}
UMAP ("umap")
"umap": {
"n_neighbors": [15, 30, 50],
"min_dist": [0.1, 0.3, 0.5],
"dimensions": [2],
"seed": [42]
}
Output¶
Oort generates projection files in outDir:
<method>_<params>_moon_<seed>_<sample>.pkl - Reduced data for each parameter combination and Moon
Example¶
projection_config = {
"tsne": {
"perplexity": [15, 30, 66],
"dimensions": [2],
"seed": [42]
},
"pca": {
"dimensions": [2, 5],
"seed": [42]
}
}
oort = Oort(
data="/data/raw_dataset.pkl",
cleanDir=clean_dir,
outDir=projections_dir,
params=projection_config
)
oort.fit()
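To confirm the run, you can count the projection files Oort wrote to outDir; one file is produced per parameter combination per Moon, so the example above (3 t-SNE settings plus 2 PCA settings applied to the 6 Moons from the Planet example) should yield 30 files. A minimal sketch, assuming all outputs are .pkl files:
from pathlib import Path

# One projection file per parameter combination per Moon
projection_files = sorted(Path(projections_dir).glob("*.pkl"))
print(f"Oort produced {len(projection_files)} projections")
# Expected here: (3 t-SNE + 2 PCA settings) x 6 Moons = 30 files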
Stage 3: Graph Construction with Galaxy¶
Galaxy constructs TDA Mapper graphs from projections using clustering and cover schemes.
Class Initialization¶
from thema.multiverse import Galaxy
galaxy = Galaxy(
data=input_path,
cleanDir=clean_dir,
projDir=projections_dir,
outDir=graphs_dir,
params=mapper_config
)
galaxy.fit()
Parameters¶
data (str or Path)
Path to the original raw data file.

cleanDir (str or Path)
Absolute path to Planet outputs (clean/).

projDir (str or Path)
Absolute path to Oort outputs (projections/).

outDir (str or Path)
Absolute path for graph outputs.

params (dict)
Mapper algorithm configuration.
Mapper Configuration¶
The params dictionary uses the "jmap" key:
params = {
"jmap": {
"nCubes": [5, 10, 20],
"percOverlap": [0.5, 0.6, 0.7],
"minIntersection": [-1],
"clusterer": [
["HDBSCAN", {"min_cluster_size": 3}],
["HDBSCAN", {"min_cluster_size": 10}]
]
}
}
Mapper Parameters¶
nCubes (list of int)
Number of hypercubes (intervals) covering the projection space. More cubes = finer resolution.
3-5: Coarse, few large clusters
10-20: Moderate resolution (recommended starting point)
50+: Fine-grained, many small clusters

percOverlap (list of float)
Percentage overlap between adjacent hypercubes (0-1 range).
0.3-0.5: Less overlap, more disconnected components
0.6-0.7: Moderate overlap (recommended)
0.8+: High overlap, highly connected graphs

minIntersection (list of int)
Minimum items required in a cube overlap to form an edge.
-1: No minimum (default, recommended)
Positive values: Stricter edge formation

clusterer (list of [str, dict] pairs)
Clustering algorithms and their parameters. Each entry is [algorithm_name, param_dict].
Clustering Options¶
HDBSCAN (recommended)
["HDBSCAN", {"min_cluster_size": 5, "min_samples": 3}]
min_cluster_size: Minimum items to form a cluster (2-10 typical)
min_samples: Core point requirement (optional)
DBSCAN
["DBSCAN", {"eps": 0.5, "min_samples": 5}]
KMeans
["KMeans", {"n_clusters": 8}]
Graph Interpretation¶
Mapper graphs contain:
Nodes: Clusters of data points
Edges: Overlap between clusters (shared items)
Connected components: Groups of connected nodes representing distinct patterns or “archetypes”
Output¶
Galaxy generates graph files in outDir:
star_<projection>_<mapper_params>.pkl - Each file contains a Mapper graph model
Example¶
mapper_config = {
"jmap": {
"nCubes": [5, 10, 20],
"percOverlap": [0.55, 0.65, 0.75],
"minIntersection": [-1],
"clusterer": [
["HDBSCAN", {"min_cluster_size": 2}],
["HDBSCAN", {"min_cluster_size": 5}],
["HDBSCAN", {"min_cluster_size": 10}]
]
}
}
galaxy = Galaxy(
data="/data/raw_dataset.pkl",
cleanDir=clean_dir,
projDir=projections_dir,
outDir=graphs_dir,
params=mapper_config
)
galaxy.fit()
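After galaxy.fit() completes, you can spot-check a single output by unpickling a star file and reading its component count via starGraph.nComponents (the same attribute used in the selection examples below). A minimal sketch:
import pandas as pd
from pathlib import Path

# Load one Galaxy output (star_<projection>_<mapper_params>.pkl) and report its components
star_file = next(Path(graphs_dir).glob("*.pkl"))
star = pd.read_pickle(star_file)
print(f"{star_file.name}: {star.starGraph.nComponents} connected component(s)")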
Stage 4: Filtering and Model Selection¶
After generating graphs, filter and select representative models using built-in or custom filters.
Graph Filtering¶
Built-in Filter Functions¶
from thema.multiverse.universe.utils.starFilters import (
minimum_unique_items_filter,
component_count_filter,
component_count_range_filter,
minimum_nodes_filter,
minimum_edges_filter,
nofilterfunction
)
minimum_unique_items_filter(n)
Keep graphs covering at least n unique data items.
coverage_filter = minimum_unique_items_filter(1000)

component_count_filter(k)
Keep graphs with exactly k connected components.
three_component_filter = component_count_filter(3)

component_count_range_filter(min_k, max_k)
Keep graphs with component count in the range [min_k, max_k].
mid_range_filter = component_count_range_filter(3, 8)

minimum_nodes_filter(n)
Keep graphs with at least n nodes.

minimum_edges_filter(n)
Keep graphs with at least n edges.

nofilterfunction
No filtering; keep all graphs.
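Because the built-in factories return plain callables, you can also compose them into a custom filter. A minimal sketch, assuming (as the factory pattern above suggests) that a filter is called with a single graph object and returns a bool:
from thema.multiverse.universe.utils.starFilters import (
    minimum_unique_items_filter,
    minimum_nodes_filter
)

def all_of(*filters):
    # Assumption: each filter takes one graph object and returns True/False
    def combined(graph):
        return all(f(graph) for f in filters)
    return combined

# Keep graphs that cover at least 1000 items AND have at least 20 nodes
combined_filter = all_of(
    minimum_unique_items_filter(1000),
    minimum_nodes_filter(20)
)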
Loading Filtered Graphs¶
from thema.multiverse.universe.geodesics import _load_starGraphs
filtered_graphs = _load_starGraphs(
dir=graphs_dir,
graph_filter=filter_function
)
Example: Coverage-Based Filtering¶
import pandas as pd
from pathlib import Path
# Get total item count from cleaned data
sample_file = next(Path(clean_dir).glob("*.pkl"))
total_items = len(pd.read_pickle(sample_file).imputeData)
# Filter for 85% coverage
coverage_filter = minimum_unique_items_filter(int(total_items * 0.85))
high_coverage_graphs = _load_starGraphs(
dir=graphs_dir,
graph_filter=coverage_filter
)
Model Collapse (Representative Selection)¶
The collapse() method clusters similar graphs and selects representatives.
Method Signature¶
representatives = galaxy.collapse(
metric="stellar_curvature_distance",
curvature="forman_curvature",
distance_threshold=250,
nReps=None,
selector="max_nodes",
filter_fn=filter_function,
files=list_of_graph_files
)
Parameters¶
metric (str)
Distance metric for graph comparison.
"stellar_curvature_distance" - Curvature-based (recommended)
Other metrics may be available depending on the implementation.

curvature (str)
Curvature calculation method.
"forman_curvature" - Forman-Ricci curvature (recommended)
"ollivier_curvature" - Ollivier-Ricci curvature (slower)

distance_threshold (float)
Maximum distance for two graphs to be considered similar. Lower = stricter clustering.

nReps (int or None)
Number of representatives to select. If None, uses distance_threshold instead.

selector (str)
How to choose a representative from each cluster.
"max_nodes" - Graph with the most nodes
"max_edges" - Graph with the most edges
"min_nodes" - Graph with the fewest nodes
"random" - Random selection

filter_fn (callable or None)
Filter function to apply before clustering.

files (list of Path or None)
Specific graph files to consider. If None, uses all files in outDir.
Return Value¶
Dictionary mapping cluster IDs to representative graph information:
{
0: {"star": StarGraph_object, "file": Path, ...},
1: {"star": StarGraph_object, "file": Path, ...},
...
}
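To see what was selected, you can walk the returned mapping. A minimal sketch; the "star" and "file" keys are shown above, and any other keys vary with the implementation:
# Print each cluster's ID and the file backing its representative graph
for cluster_id, info in representatives.items():
    print(f"cluster {cluster_id}: {info['file'].name}")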
Example: Component-Based Selection¶
from thema.multiverse.universe.utils.starFilters import component_count_filter
# Select representatives for graphs with exactly 5 components
filter_5_components = component_count_filter(5)
representatives = galaxy.collapse(
metric="stellar_curvature_distance",
curvature="forman_curvature",
distance_threshold=200,
selector="max_nodes",
filter_fn=filter_5_components,
files=list(high_coverage_graphs)
)
# Extract StarGraph objects
selected_graphs = [v["star"] for v in representatives.values()]
Example: Selecting Across Component Counts¶
from collections import defaultdict
# Group by component count
component_groups = defaultdict(list)
for graph_file in high_coverage_graphs:
    star = pd.read_pickle(graph_file)
    n_components = star.starGraph.nComponents
    component_groups[n_components].append(graph_file)
# Select representatives for each component count
all_representatives = {}
for n_components, files in component_groups.items():
    filter_fn = component_count_filter(n_components)
    reps = galaxy.collapse(
        metric="stellar_curvature_distance",
        curvature="forman_curvature",
        distance_threshold=250,
        selector="max_nodes",
        filter_fn=filter_fn,
        files=files
    )
    all_representatives[n_components] = [v["star"] for v in reps.values()]
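A short summary of what was kept:
# Report how many representative graphs were kept per component count
for n_components, graphs in sorted(all_representatives.items()):
    print(f"{n_components} components: {len(graphs)} representative graph(s)")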
Complete Workflow Example¶
from pathlib import Path
from thema.multiverse import Planet, Oort, Galaxy
from thema.multiverse.universe.geodesics import _load_starGraphs
from thema.multiverse.universe.utils.starFilters import (
minimum_unique_items_filter,
component_count_filter
)
import pandas as pd
# Setup
base_dir = Path("/absolute/path/to/outputs")
clean_dir = base_dir / "clean"
projections_dir = base_dir / "projections"
graphs_dir = base_dir / "graphs"
for d in [clean_dir, projections_dir, graphs_dir]:
    d.mkdir(parents=True, exist_ok=True)
# 1. Preprocessing
planet = Planet(
data="/data/dataset.pkl",
dropColumns=["id", "name"],
imputeColumns=["age", "category"],
imputeMethods=["sampleNormal", "mode"],
scaler="standard",
seeds=[42, 13],
numSamples=2
)
planet.outDir = clean_dir
planet.fit()
# 2. Dimensionality Reduction
oort = Oort(
data="/data/dataset.pkl",
cleanDir=clean_dir,
outDir=projections_dir,
params={
"tsne": {
"perplexity": [15, 30, 50],
"dimensions": [2],
"seed": [42]
}
}
)
oort.fit()
# 3. Graph Construction
galaxy = Galaxy(
data="/data/dataset.pkl",
cleanDir=clean_dir,
projDir=projections_dir,
outDir=graphs_dir,
params={
"jmap": {
"nCubes": [5, 10, 20],
"percOverlap": [0.6, 0.7],
"minIntersection": [-1],
"clusterer": [
["HDBSCAN", {"min_cluster_size": 3}],
["HDBSCAN", {"min_cluster_size": 8}]
]
}
}
)
galaxy.fit()
# 4. Filter for High Coverage
sample_file = next(Path(clean_dir).glob("*.pkl"))
total_items = len(pd.read_pickle(sample_file).imputeData)
coverage_filter = minimum_unique_items_filter(int(total_items * 0.85))
high_coverage = _load_starGraphs(
dir=graphs_dir,
graph_filter=coverage_filter
)
# 5. Select Representatives for 3-Component Graphs
filter_3_comp = component_count_filter(3)
reps = galaxy.collapse(
metric="stellar_curvature_distance",
curvature="forman_curvature",
distance_threshold=200,
selector="max_nodes",
filter_fn=filter_3_comp,
files=list(high_coverage)
)
selected = [v["star"] for v in reps.values()]
print(f"Selected {len(selected)} representative graphs")
Tips and Best Practices¶
Parameter Selection¶
Start Simple: Begin with small parameter grids and expand based on results
Preprocessing Seeds: 2-3 seeds with 2-3 samples each provides good robustness
Projection Methods: t-SNE with perplexities [15, 30, 50] covers local to global structure
Mapper Resolution: Start with nCubes=[5, 10, 20] and percOverlap=[0.6, 0.7]
Clustering: HDBSCAN with min_cluster_size=[3, 5, 10] is robust
Performance Optimization¶
Parallelization: Planet, Oort, and Galaxy automatically parallelize across parameter combinations
Incremental Analysis: Process subsets of parameters first to validate pipeline
File Management: Large parameter grids generate many files; monitor disk usage
Memory: Galaxy.collapse() loads graphs into memory; filter aggressively for large datasets
Common Pitfalls¶
Relative Paths: Always use absolute paths for directory arguments
Mismatched Parameters: Ensure the imputeColumns and imputeMethods lists align
Over-Parameterization: Combinatorial explosion occurs quickly; be selective (see the grid-size sketch after this list)
Coverage vs Resolution: Balance coverage filtering with parameter exploration
Component Count: Some parameter combinations may produce zero components
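To gauge the explosion before running Galaxy, multiply out the grid. A rough sketch using the example configs from earlier in this guide; it assumes, as the star file naming suggests, one graph per Mapper setting per projection:
# Mapper settings = nCubes x percOverlap x minIntersection x clusterer choices
mapper = mapper_config["jmap"]
n_mapper_settings = (
    len(mapper["nCubes"])
    * len(mapper["percOverlap"])
    * len(mapper["minIntersection"])
    * len(mapper["clusterer"])
)

n_moons = 3 * 2              # seeds x numSamples from the Planet example
n_projections_per_moon = 5   # 3 t-SNE + 2 PCA settings from the Oort example
n_graphs = n_mapper_settings * n_moons * n_projections_per_moon

print(f"{n_mapper_settings} Mapper settings x {n_moons * n_projections_per_moon} "
      f"projections = {n_graphs} graphs")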
Troubleshooting¶
No graphs pass the coverage filter:
Reduce the coverage threshold
Increase percOverlap in the Mapper config
Check data quality and imputation

Too many similar graphs:
Decrease distance_threshold in collapse()
Use a stricter filter_fn
Reduce the parameter grid size

Empty components:
Increase percOverlap
Decrease min_cluster_size
Use fewer nCubes

Out of memory during collapse:
Filter more aggressively before collapse
Process component counts separately
Reduce the number of graphs