Best Practices¶
Workflow Strategy¶
- Start Small, Scale Up
Begin with minimal parameter grids (2-3 values per parameter) to validate the pipeline. Expand grids based on initial results.
- Incremental Validation
Run Planet -> Oort -> Galaxy separately to inspect intermediate outputs before full automation.
- Parameter Exploration Order
Fix preprocessing (Planet)
Explore embeddings (Oort)
Tune graph construction (Galaxy)
Apply filters and selection
Data Management¶
- File Formats
Save raw data as pickle (
.pkl) to preserve dtypes and avoid parsing issues.- Absolute Paths
Always use absolute paths for
data,cleanDir,projDir,outDirparameters.- Output Organization
Use consistent naming:
{outDir}/{runName}/{clean,projections,models}/- Clean Between Runs
Remove previous outputs to avoid confusion:
T = Thema("params.yaml") T.spaghettify() # Deletes entire {outDir}/{runName} tree
Preprocessing (Planet)¶
- Auto-Detection
Use
imputeColumns="auto"andimputeMethods="auto"for initial runs, then inspect withget_missingData_summary().- Scaling
Use
scaler="standard"(zero mean, unit variance) for most cases. Use"robust"only if outliers are problematic.- Encoding
Use
encoding="one_hot"for categorical variables. Avoid with high-cardinality features (>50 categories).- Imputation Sampling
Set
numSamples=1unless using randomized methods (sampleNormal,sampleCategorical). Multiple samples only help capture imputation uncertainty.- Seeds
Use 2-3 explicit seeds (e.g.,
[42, 13, 99]) for reproducibility. Avoid"auto"in production runs.
Embeddings (Oort)¶
- t-SNE Perplexities
Start with
[15, 30, 50]to cover local and global structure. Adjust based on dataset size (perplexity ≈ sqrt(n_samples)).- PCA Dimensions
Use 2D for speed and visualization. Higher dimensions capture more variance but slow down graph construction.
- Method Combination
Run both t-SNE and PCA to compare linear vs nonlinear projections.
- Reproducibility
Fix seeds for t-SNE:
seed: [42]. PCA is deterministic.
Graph Construction (Galaxy)¶
- Mapper Parameters
Start with
nCubes: [5, 10, 20]andpercOverlap: [0.5, 0.7]. Adjust based on graph connectivity:Too many disconnected components? Increase
percOverlapor decreasenCubesGraphs too dense? Decrease
percOverlapor increasenCubes
- Clustering
Use HDBSCAN with
min_cluster_size: [3, 5, 10]. Start with smaller values for finer clusters.- Edge Formation
Use
minIntersection: [-1]for weighted edges (recommended). Positive values enforce stricter connectivity.
Filtering and Selection¶
- Filter Before Distance
Apply filters before
collapse()to reduce computational cost. Useminimum_unique_items_filterto ensure coverage.- Representative Selection
selector="max_nodes": Most interpretable (default)selector="max_edges": Most connectedselector="min_nodes": Minimal examples
- Component Count Strategy
Filter by component count to focus on specific graph topologies. Process different component counts separately for targeted selection.
Performance¶
- Curvature Metrics
forman_curvature: Fast, good for large gridsbalanced_forman_curvature: Better sensitivity, moderate speedollivier_ricci_curvature: Use only when geometry is critical (slow)
- Grid Size Management
Parameter grids grow combinatorially. A grid with 4 parameters × 3 values each = 81 combinations per Moon file. Monitor disk space.
- Memory Optimization
Filter aggressively before
collapse()Use
distance_thresholdinstead ofnRepsfor adaptive selectionProcess large datasets in batches by component count
- Parallel Execution
Planet, Oort, and Galaxy parallelize automatically. No manual configuration needed.
Reproducibility¶
- Version Control
Track
params.yamlin git. Include Thema version in commit messages.- Seed Management
Use explicit seeds for Planet and Oort:
seeds: [42, 13, 99]- Environment Management
Use
uvfor dependency management:uv sync --extra dev uv run python script.py
- Documentation
Save parameter configurations and filter criteria alongside outputs.
Troubleshooting¶
- No Graphs Pass Filters
Reduce coverage threshold in
minimum_unique_items_filterIncrease
percOverlapin Mapper configCheck for imputation issues in Planet
- Too Many Similar Graphs
Decrease
distance_thresholdincollapse()Use stricter filters
Reduce parameter grid size
- Disconnected Graphs
Increase
percOverlap(try 0.7-0.8)Decrease
min_cluster_sizein HDBSCANUse fewer
nCubes
- Out of Memory
Filter more aggressively before
collapse()Process component counts separately
Reduce parameter grid size
Use
forman_curvatureinstead of slower metrics
- Slow Collapse
Switch to
forman_curvatureFilter to fewer graphs before distance computation
Use
distance_thresholdinstead ofnReps
Common Pitfalls¶
- Relative Paths
Always use absolute paths. Relative paths may break depending on execution context.
- Mismatched Parameters
Ensure
imputeColumnsandimputeMethodslists have the same length.- Over-Parameterization
Resist the urge to test every possible parameter value. Start small, expand strategically.
- Ignoring Coverage
Graphs with low coverage miss large portions of the dataset. Always filter by
minimum_unique_items.- Component Count Blindness
Different component counts represent fundamentally different topologies. Process them separately for better selection.