Best Practices

Workflow Strategy

Start Small, Scale Up

Begin with minimal parameter grids (2-3 values per parameter) to validate the pipeline. Expand grids based on initial results.

Incremental Validation

Run Planet -> Oort -> Galaxy separately to inspect intermediate outputs before full automation.

Parameter Exploration Order
  1. Fix preprocessing (Planet)

  2. Explore embeddings (Oort)

  3. Tune graph construction (Galaxy)

  4. Apply filters and selection

Data Management

File Formats

Save raw data as pickle (.pkl) to preserve dtypes and avoid parsing issues.

Absolute Paths

Always use absolute paths for data, cleanDir, projDir, outDir parameters.

Output Organization

Use consistent naming: {outDir}/{runName}/{clean,projections,models}/

Clean Between Runs

Remove previous outputs to avoid confusion:

T = Thema("params.yaml")
T.spaghettify()  # Deletes entire {outDir}/{runName} tree

Preprocessing (Planet)

Auto-Detection

Use imputeColumns="auto" and imputeMethods="auto" for initial runs, then inspect with get_missingData_summary().

Scaling

Use scaler="standard" (zero mean, unit variance) for most cases. Use "robust" only if outliers are problematic.

Encoding

Use encoding="one_hot" for categorical variables. Avoid with high-cardinality features (>50 categories).

Imputation Sampling

Set numSamples=1 unless using randomized methods (sampleNormal, sampleCategorical). Multiple samples only help capture imputation uncertainty.

Seeds

Use 2-3 explicit seeds (e.g., [42, 13, 99]) for reproducibility. Avoid "auto" in production runs.

Embeddings (Oort)

t-SNE Perplexities

Start with [15, 30, 50] to cover local and global structure. Adjust based on dataset size (perplexity ≈ sqrt(n_samples)).

PCA Dimensions

Use 2D for speed and visualization. Higher dimensions capture more variance but slow down graph construction.

Method Combination

Run both t-SNE and PCA to compare linear vs nonlinear projections.

Reproducibility

Fix seeds for t-SNE: seed: [42]. PCA is deterministic.

Graph Construction (Galaxy)

Mapper Parameters

Start with nCubes: [5, 10, 20] and percOverlap: [0.5, 0.7]. Adjust based on graph connectivity:

  • Too many disconnected components? Increase percOverlap or decrease nCubes

  • Graphs too dense? Decrease percOverlap or increase nCubes

Clustering

Use HDBSCAN with min_cluster_size: [3, 5, 10]. Start with smaller values for finer clusters.

Edge Formation

Use minIntersection: [-1] for weighted edges (recommended). Positive values enforce stricter connectivity.

Filtering and Selection

Filter Before Distance

Apply filters before collapse() to reduce computational cost. Use minimum_unique_items_filter to ensure coverage.

Representative Selection
  • selector="max_nodes": Most interpretable (default)

  • selector="max_edges": Most connected

  • selector="min_nodes": Minimal examples

Component Count Strategy

Filter by component count to focus on specific graph topologies. Process different component counts separately for targeted selection.

Performance

Curvature Metrics
  • forman_curvature: Fast, good for large grids

  • balanced_forman_curvature: Better sensitivity, moderate speed

  • ollivier_ricci_curvature: Use only when geometry is critical (slow)

Grid Size Management

Parameter grids grow combinatorially. A grid with 4 parameters × 3 values each = 81 combinations per Moon file. Monitor disk space.

Memory Optimization
  • Filter aggressively before collapse()

  • Use distance_threshold instead of nReps for adaptive selection

  • Process large datasets in batches by component count

Parallel Execution

Planet, Oort, and Galaxy parallelize automatically. No manual configuration needed.

Reproducibility

Version Control

Track params.yaml in git. Include Thema version in commit messages.

Seed Management

Use explicit seeds for Planet and Oort: seeds: [42, 13, 99]

Environment Management

Use uv for dependency management:

uv sync --extra dev
uv run python script.py
Documentation

Save parameter configurations and filter criteria alongside outputs.

Troubleshooting

No Graphs Pass Filters
  • Reduce coverage threshold in minimum_unique_items_filter

  • Increase percOverlap in Mapper config

  • Check for imputation issues in Planet

Too Many Similar Graphs
  • Decrease distance_threshold in collapse()

  • Use stricter filters

  • Reduce parameter grid size

Disconnected Graphs
  • Increase percOverlap (try 0.7-0.8)

  • Decrease min_cluster_size in HDBSCAN

  • Use fewer nCubes

Out of Memory
  • Filter more aggressively before collapse()

  • Process component counts separately

  • Reduce parameter grid size

  • Use forman_curvature instead of slower metrics

Slow Collapse
  • Switch to forman_curvature

  • Filter to fewer graphs before distance computation

  • Use distance_threshold instead of nReps

Common Pitfalls

Relative Paths

Always use absolute paths. Relative paths may break depending on execution context.

Mismatched Parameters

Ensure imputeColumns and imputeMethods lists have the same length.

Over-Parameterization

Resist the urge to test every possible parameter value. Start small, expand strategically.

Ignoring Coverage

Graphs with low coverage miss large portions of the dataset. Always filter by minimum_unique_items.

Component Count Blindness

Different component counts represent fundamentally different topologies. Process them separately for better selection.