Best Practices¶
Workflow Strategy¶
- Start Small, Scale Up
Begin with minimal parameter grids (2-3 values per parameter) to validate the pipeline. Expand grids based on initial results.
- Incremental Validation
Run Planet -> Oort -> Galaxy separately to inspect intermediate outputs before full automation.
- Parameter Exploration Order
Fix preprocessing (Planet)
Explore embeddings (Oort)
Tune graph construction (Galaxy)
Apply filters and selection
Data Management¶
- File Formats
Save raw data as pickle (
.pkl
) to preserve dtypes and avoid parsing issues.- Absolute Paths
Always use absolute paths for
data
,cleanDir
,projDir
,outDir
parameters.- Output Organization
Use consistent naming:
{outDir}/{runName}/{clean,projections,models}/
- Clean Between Runs
Remove previous outputs to avoid confusion:
T = Thema("params.yaml") T.spaghettify() # Deletes entire {outDir}/{runName} tree
Preprocessing (Planet)¶
- Auto-Detection
Use
imputeColumns="auto"
andimputeMethods="auto"
for initial runs, then inspect withget_missingData_summary()
.- Scaling
Use
scaler="standard"
(zero mean, unit variance) for most cases. Use"robust"
only if outliers are problematic.- Encoding
Use
encoding="one_hot"
for categorical variables. Avoid with high-cardinality features (>50 categories).- Imputation Sampling
Set
numSamples=1
unless using randomized methods (sampleNormal
,sampleCategorical
). Multiple samples only help capture imputation uncertainty.- Seeds
Use 2-3 explicit seeds (e.g.,
[42, 13, 99]
) for reproducibility. Avoid"auto"
in production runs.
Embeddings (Oort)¶
- t-SNE Perplexities
Start with
[15, 30, 50]
to cover local and global structure. Adjust based on dataset size (perplexity ≈ sqrt(n_samples)).- PCA Dimensions
Use 2D for speed and visualization. Higher dimensions capture more variance but slow down graph construction.
- Method Combination
Run both t-SNE and PCA to compare linear vs nonlinear projections.
- Reproducibility
Fix seeds for t-SNE:
seed: [42]
. PCA is deterministic.
Graph Construction (Galaxy)¶
- Mapper Parameters
Start with
nCubes: [5, 10, 20]
andpercOverlap: [0.5, 0.7]
. Adjust based on graph connectivity:Too many disconnected components? Increase
percOverlap
or decreasenCubes
Graphs too dense? Decrease
percOverlap
or increasenCubes
- Clustering
Use HDBSCAN with
min_cluster_size: [3, 5, 10]
. Start with smaller values for finer clusters.- Edge Formation
Use
minIntersection: [-1]
for weighted edges (recommended). Positive values enforce stricter connectivity.
Filtering and Selection¶
- Filter Before Distance
Apply filters before
collapse()
to reduce computational cost. Useminimum_unique_items_filter
to ensure coverage.- Representative Selection
selector="max_nodes"
: Most interpretable (default)selector="max_edges"
: Most connectedselector="min_nodes"
: Minimal examples
- Component Count Strategy
Filter by component count to focus on specific graph topologies. Process different component counts separately for targeted selection.
Performance¶
- Curvature Metrics
forman_curvature
: Fast, good for large gridsbalanced_forman_curvature
: Better sensitivity, moderate speedollivier_ricci_curvature
: Use only when geometry is critical (slow)
- Grid Size Management
Parameter grids grow combinatorially. A grid with 4 parameters × 3 values each = 81 combinations per Moon file. Monitor disk space.
- Memory Optimization
Filter aggressively before
collapse()
Use
distance_threshold
instead ofnReps
for adaptive selectionProcess large datasets in batches by component count
- Parallel Execution
Planet, Oort, and Galaxy parallelize automatically. No manual configuration needed.
Reproducibility¶
- Version Control
Track
params.yaml
in git. Include Thema version in commit messages.- Seed Management
Use explicit seeds for Planet and Oort:
seeds: [42, 13, 99]
- Environment Management
Use
uv
for dependency management:uv sync --extra dev uv run python script.py
- Documentation
Save parameter configurations and filter criteria alongside outputs.
Troubleshooting¶
- No Graphs Pass Filters
Reduce coverage threshold in
minimum_unique_items_filter
Increase
percOverlap
in Mapper configCheck for imputation issues in Planet
- Too Many Similar Graphs
Decrease
distance_threshold
incollapse()
Use stricter filters
Reduce parameter grid size
- Disconnected Graphs
Increase
percOverlap
(try 0.7-0.8)Decrease
min_cluster_size
in HDBSCANUse fewer
nCubes
- Out of Memory
Filter more aggressively before
collapse()
Process component counts separately
Reduce parameter grid size
Use
forman_curvature
instead of slower metrics
- Slow Collapse
Switch to
forman_curvature
Filter to fewer graphs before distance computation
Use
distance_threshold
instead ofnReps
Common Pitfalls¶
- Relative Paths
Always use absolute paths. Relative paths may break depending on execution context.
- Mismatched Parameters
Ensure
imputeColumns
andimputeMethods
lists have the same length.- Over-Parameterization
Resist the urge to test every possible parameter value. Start small, expand strategically.
- Ignoring Coverage
Graphs with low coverage miss large portions of the dataset. Always filter by
minimum_unique_items
.- Component Count Blindness
Different component counts represent fundamentally different topologies. Process them separately for better selection.