Best Practices

Workflow Strategy

Start Small, Scale Up

Begin with minimal parameter grids (2-3 values per parameter) to validate the pipeline. Expand grids based on initial results.

Incremental Validation

Run the Planet, Oort, and Galaxy stages one at a time and inspect each stage's outputs before automating the full pipeline.

Parameter Exploration Order

1. Fix preprocessing (Planet)
2. Explore embeddings (Oort)
3. Tune graph construction (Galaxy)
4. Apply filters and selection

Data Management

File Formats

Save raw data as pickle (.pkl) to preserve dtypes and avoid parsing issues.
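The difference is easy to see with the standard library alone. The sketch below uses a made-up record (the field names and values are illustrative, not from Thema) and compares a CSV round trip with a pickle round trip:

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical record with mixed types (illustrative data only).
record = {"id": 7, "score": 0.25, "visited": date(2024, 1, 31), "label": None}

# CSV round trip: every value comes back as a string and must be re-parsed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_record = next(csv.DictReader(buffer))

# Pickle round trip: the original Python types survive unchanged.
pkl_record = pickle.loads(pickle.dumps(record))

print({k: type(v).__name__ for k, v in csv_record.items()})
print({k: type(v).__name__ for k, v in pkl_record.items()})
```

The same reasoning applies to DataFrame dtypes such as categoricals and datetimes, which a text format would have to infer again on load.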

Absolute Paths

Always use absolute paths for the data, cleanDir, projDir, and outDir parameters.
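One way to enforce this is to resolve every path-valued parameter before writing the configuration. The following is a minimal sketch, assuming the parameters are held in a plain dictionary; the `absolutize` helper and the example values are hypothetical:

```python
from pathlib import Path

# Parameter names come from the text above; the values are hypothetical.
params = {
    "data": "data/raw.pkl",
    "cleanDir": "outputs/run1/clean",
    "projDir": "outputs/run1/projections",
    "outDir": "outputs",
}

def absolutize(params, base):
    """Return a copy of params with every path resolved against base.

    Paths that are already absolute are kept as they are, because joining
    an absolute path onto base discards base.
    """
    base = Path(base).resolve()
    return {key: str((base / value).resolve()) for key, value in params.items()}

resolved = absolutize(params, base=Path.cwd())
for key, value in resolved.items():
    print(f"{key}: {value}")
```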

Output Organization

Use consistent naming: {outDir}/{runName}/{clean,projections,models}/
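The layout can be sketched with `pathlib`. This is only an illustration of the naming scheme; Thema may create these directories itself, and the `make_run_dirs` helper is hypothetical:

```python
import tempfile
from pathlib import Path

SUBDIRS = ("clean", "projections", "models")

def make_run_dirs(out_dir, run_name):
    """Create {out_dir}/{run_name}/{clean,projections,models} and return the paths."""
    root = Path(out_dir).resolve() / run_name
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths

# Demonstrate in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    run_paths = make_run_dirs(tmp, "demo_run")
    created = sorted(p.name for p in (Path(tmp).resolve() / "demo_run").iterdir())
    print(created)
```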

Clean Between Runs

Remove previous outputs before rerunning so stale files are not mixed with new results:

T = Thema("params.yaml")
T.spaghettify()  # Deletes entire {outDir}/{runName} tree

Preprocessing (Planet)

Auto-Detection: Use imputeColumns="auto" and imputeMethods="auto" for initial runs, then inspect with get_missingData_summary().
Scaling: Use scaler="standard" (zero mean, unit variance) for most cases. Switch to "robust" only when outliers distort the mean and variance.
Encoding: Use encoding="one_hot" for categorical variables. Avoid it for high-cardinality features (more than 50 categories), since each category becomes its own column.
Imputation Sampling: Set numSamples=1 unless using randomized methods (sampleNormal, sampleCategorical). Multiple samples are only useful for capturing imputation uncertainty.
Seeds: Use 2-3 explicit seeds (e.g., [42, 13, 99]) for reproducibility. Avoid "auto" in production runs.
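To make the scaling advice concrete, the sketch below applies textbook definitions of standard and robust scaling to a small made-up column with one outlier. It assumes that "robust" means centering on the median and dividing by the interquartile range (as in scikit-learn's RobustScaler); Thema's own scalers are not called here.

```python
import statistics

def standard_scale(values):
    """Zero mean, unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # illustrative data with one outlier

print([round(v, 2) for v in standard_scale(values)])
print([round(v, 2) for v in robust_scale(values)])
```

With standard scaling the outlier inflates the standard deviation, so the five ordinary values end up squeezed into a narrow band. Robust scaling keeps them spread out because the median and quartiles barely move.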

Embeddings (Oort)

t-SNE Perplexities: Start with [15, 30, 50]; lower values emphasize local structure and higher values emphasize more global structure. Adjust based on dataset size, using perplexity ≈ sqrt(n_samples) as a rough heuristic.
PCA Dimensions: Use 2D for speed and visualization. Higher dimensions capture more variance but slow down graph construction.
Method Combination: Run both t-SNE and PCA to compare linear vs nonlinear projections.
Reproducibility: Fix the t-SNE seed, e.g. seed: [42]. PCA is deterministic.
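The perplexity heuristic above can be turned into a small helper. This is a sketch only: the `perplexity_grid` function is hypothetical, and it additionally drops values that are not smaller than the number of samples, since t-SNE implementations such as scikit-learn's reject those.

```python
import math

def perplexity_grid(n_samples, defaults=(15, 30, 50)):
    """Combine default perplexities with round(sqrt(n_samples)).

    Only values strictly below n_samples are kept.
    """
    candidates = set(defaults) | {max(2, round(math.sqrt(n_samples)))}
    return sorted(p for p in candidates if p < n_samples)

for n in (40, 400, 10_000):
    print(n, perplexity_grid(n))
```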

Graph Construction (Galaxy)

Mapper Parameters

Start with nCubes: [5, 10, 20] and percOverlap: [0.5, 0.7]. Adjust based on graph connectivity:

Too many disconnected components? Increase percOverlap or decrease nCubes
Graphs too dense? Decrease percOverlap or increase nCubes
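The effect of these two parameters can be seen in a simplified one-dimensional cover. The sketch below is not Thema's or KeplerMapper's cover code; it assumes equal-length intervals in which each neighbor pair shares a fraction `perc_overlap` of the interval length. Points that fall into more than one interval are the ones that can link Mapper nodes.

```python
def interval_cover(lo, hi, n_cubes, perc_overlap):
    """Split [lo, hi] into n_cubes equal intervals whose neighbors overlap."""
    if not 0 <= perc_overlap < 1:
        raise ValueError("perc_overlap must be in [0, 1)")
    length = (hi - lo) / (1 + (n_cubes - 1) * (1 - perc_overlap))
    step = length * (1 - perc_overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_cubes)]

def memberships(x, cover):
    """Number of intervals that contain x."""
    return sum(start <= x <= end for start, end in cover)

points = [i / 100 for i in range(101)]  # evenly spaced illustrative data on [0, 1]
for n_cubes, overlap in [(5, 0.5), (5, 0.7), (20, 0.5)]:
    cover = interval_cover(0.0, 1.0, n_cubes, overlap)
    shared = sum(memberships(x, cover) > 1 for x in points) / len(points)
    print(f"nCubes={n_cubes}, percOverlap={overlap}: "
          f"interval length={cover[0][1] - cover[0][0]:.3f}, "
          f"share of points in 2+ intervals={shared:.2f}")
```

Raising the overlap puts more points into several intervals at once, which tends to connect components. Raising the number of cubes shortens each interval, which tends to split the graph into smaller pieces.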

Clustering

Use HDBSCAN with min_cluster_size: [3, 5, 10]. Smaller values yield finer clusters, so start there and increase the value if the clusters become too fragmented.

Edge Formation

Use minIntersection: [-1] for weighted edges (recommended). Positive values enforce stricter connectivity.
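The idea behind the two modes can be sketched on toy data. Everything here is an assumption made for illustration: the node contents are made up, the `build_edges` helper is hypothetical, and the weight is taken to be the number of shared members, which may differ from how Thema actually weights edges.

```python
from itertools import combinations

# Hypothetical Mapper nodes: node id -> set of row indices it contains.
nodes = {
    "a": {0, 1, 2, 3},
    "b": {3, 4, 5},
    "c": {2, 3, 6, 7},
    "d": {8, 9},
}

def build_edges(nodes, min_intersection):
    """Return {(u, v): weight} for node pairs that share enough members.

    min_intersection == -1: keep every pair with at least one shared member,
    weighted by the shared count (assumed weighting, for illustration).
    min_intersection >= 1: keep pairs sharing at least that many members,
    each with weight 1.
    """
    edges = {}
    for u, v in combinations(sorted(nodes), 2):
        shared = len(nodes[u] & nodes[v])
        if min_intersection == -1 and shared > 0:
            edges[(u, v)] = shared
        elif min_intersection >= 1 and shared >= min_intersection:
            edges[(u, v)] = 1
    return edges

print(build_edges(nodes, -1))
print(build_edges(nodes, 2))
```

In this toy example the weighted mode keeps three edges with different strengths, while a threshold of 2 keeps only the single pair that shares two members.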

Filtering and Selection

Filter Before Distance

Apply filters before collapse() so that distances are computed only for graphs worth keeping. Use minimum_unique_items_filter to drop graphs that cover too little of the dataset.

Representative Selection

selector="max_nodes": Most interpretable (default)
selector="max_edges": Most connected
selector="min_nodes": Minimal examples

Component Count Strategy

Filter by component count to focus on specific graph topologies. Process different component counts separately for targeted selection.
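The two filtering ideas in this section, coverage and component count, can be sketched on toy graphs. This does not use Thema's filter API: the `coverage` and `component_count` helpers, the graph data, and the 80% threshold are all hypothetical.

```python
from collections import defaultdict

def coverage(nodes, n_items):
    """Fraction of dataset items that appear in at least one node."""
    return len(set().union(*nodes.values())) / n_items

def component_count(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

# Hypothetical graphs: (node -> member items, edge list); the dataset has 10 items.
graphs = {
    "g1": ({"a": {0, 1, 2, 3}, "b": {3, 4, 5}, "c": {6, 7, 8, 9}}, [("a", "b")]),
    "g2": ({"a": {0, 1}, "b": {1, 2}}, [("a", "b")]),
    "g3": ({"a": {0, 1, 2, 3, 4}, "b": {4, 5, 6, 7, 8}}, [("a", "b")]),
}

# Keep graphs covering at least 80% of items, then group by component count.
by_components = defaultdict(list)
for name, (nodes, edges) in graphs.items():
    if coverage(nodes, n_items=10) >= 0.8:
        by_components[component_count(nodes, edges)].append(name)

print(dict(by_components))
```

Each group can then be passed to selection on its own, so that a two-component graph is never compared against a single-component one.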

Performance

Curvature Metrics

forman_curvature: Fast, good for large grids
balanced_forman_curvature: Better sensitivity, moderate speed
ollivier_ricci_curvature: Use only when geometry is critical (slow)
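The speed difference follows from what each metric needs. In its simplest combinatorial form, the Forman-Ricci curvature of an edge in an unweighted graph is 4 - deg(u) - deg(v), so it only requires node degrees, whereas Ollivier-Ricci curvature involves an optimal transport problem for every edge. The sketch below implements that simplest form; Thema's implementation may use weights or a different variant, so treat this as an illustration rather than its actual code.

```python
from collections import defaultdict

def forman_curvature(edges):
    """Forman-Ricci curvature of each edge in an unweighted graph.

    Uses the simplest combinatorial form: 4 - deg(u) - deg(v).
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

# Illustrative graph: a hub "h" with three neighbors, one of which has a tail.
edges = [("h", "x"), ("h", "y"), ("h", "z"), ("z", "t")]
print(forman_curvature(edges))
```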

Grid Size Management

Parameter grids grow combinatorially. A grid of 4 parameters with 3 values each yields 3^4 = 81 combinations per Moon file. Monitor disk space.
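The count is easy to check before launching a run. In the sketch below the parameter names are taken from this page, but grouping them into one grid, the specific values, and the number of Moon files are hypothetical choices made for illustration.

```python
from itertools import product
from math import prod

# Hypothetical grid: 4 parameters with 3 values each.
grid = {
    "nCubes": [5, 10, 20],
    "percOverlap": [0.5, 0.6, 0.7],
    "min_cluster_size": [3, 5, 10],
    "minIntersection": [-1, 1, 2],
}

n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)

# Enumerate the combinations lazily instead of materializing them all at once.
first = next(iter(product(*grid.values())))
print(dict(zip(grid, first)))

# The total number of graphs scales with the number of Moon files.
n_moon_files = 6  # hypothetical count
print(n_combinations * n_moon_files)
```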

Memory Optimization

Filter aggressively before collapse()
Use distance_threshold instead of nReps for adaptive selection
Process large datasets in batches by component count

Parallel Execution

Planet, Oort, and Galaxy parallelize automatically. No manual configuration needed.

Reproducibility

Version Control

Track params.yaml in git. Include the Thema version in commit messages.

Seed Management

Use explicit seeds for Planet and Oort: seeds: [42, 13, 99]

Environment Management

Use uv for dependency management:

uv sync --extra dev
uv run python script.py

Documentation

Save parameter configurations and filter criteria alongside outputs.
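One lightweight way to do this is to write a small manifest next to the outputs. The sketch below is an assumption about how that could look, not a Thema feature: the `write_manifest` helper, the `run_manifest.json` file name, and the filter keys are all hypothetical.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(params_path, out_dir, filters):
    """Write run_manifest.json into out_dir and return its path."""
    params_path = Path(params_path)
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "params_file": params_path.name,
        # The hash lets you confirm later which params.yaml produced the outputs.
        "params_sha256": hashlib.sha256(params_path.read_bytes()).hexdigest(),
        "filters": filters,
    }
    target = Path(out_dir) / "run_manifest.json"
    target.write_text(json.dumps(manifest, indent=2))
    return target

# Demonstrate with a throwaway params file and output directory.
with tempfile.TemporaryDirectory() as tmp:
    params = Path(tmp) / "params.yaml"
    params.write_text("runName: demo\n")
    manifest_path = write_manifest(params, tmp, {"min_coverage": 0.8, "components": 1})
    saved = json.loads(manifest_path.read_text())
    print(sorted(saved))
```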

Troubleshooting

No Graphs Pass Filters

Reduce coverage threshold in minimum_unique_items_filter
Increase percOverlap in Mapper config
Check for imputation issues in Planet

Too Many Similar Graphs

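Increase distance_threshold in collapse() so that near-duplicate graphs merge into a single cluster (assuming the threshold is the cluster merge distance, where a larger value yields fewer clusters)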
Use stricter filters
Reduce parameter grid size

Disconnected Graphs

Increase percOverlap (try 0.7-0.8)
Decrease min_cluster_size in HDBSCAN
Use fewer nCubes

Out of Memory

Filter more aggressively before collapse()
Process component counts separately
Reduce parameter grid size
Use forman_curvature instead of slower metrics

Slow Collapse

Switch to forman_curvature
Filter to fewer graphs before distance computation
Use distance_threshold instead of nReps

Common Pitfalls

Relative Paths: Always use absolute paths. Relative paths resolve against the current working directory, so they break when a script is launched from a different location.
Mismatched Parameters: Ensure imputeColumns and imputeMethods lists have the same length.
Over-Parameterization: Resist the urge to test every possible parameter value. Start small, expand strategically.
Ignoring Coverage: Graphs with low coverage miss large portions of the dataset. Always apply minimum_unique_items_filter.
Component Count Blindness: Different component counts represent fundamentally different topologies. Process them separately for better selection.
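The mismatched-parameters pitfall is cheap to catch before a run starts. The following is a minimal sketch of such a check; the `check_impute_config` helper and the column names are hypothetical, and it assumes that nothing needs comparing when either setting is "auto".

```python
def check_impute_config(impute_columns, impute_methods):
    """Raise ValueError if explicit imputeColumns/imputeMethods lists differ in length."""
    if impute_columns == "auto" or impute_methods == "auto":
        return  # nothing to compare when either side is auto-detected
    if len(impute_columns) != len(impute_methods):
        raise ValueError(
            f"imputeColumns has {len(impute_columns)} entries but "
            f"imputeMethods has {len(impute_methods)}"
        )

# These two configurations pass silently.
check_impute_config("auto", "auto")
check_impute_config(["age", "income"], ["sampleNormal", "sampleCategorical"])

# This one is rejected because a method is missing for the second column.
try:
    check_impute_config(["age", "income"], ["sampleNormal"])
except ValueError as err:
    print(err)
```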