Tuning and Selection
Fine-tune parameter grids, apply filters, and select representatives.
Parameter Grid Strategy
Start small, expand based on results:
Planet: 2-3 seeds, 1-2 samples each
Oort: 3 values per parameter
Galaxy: 2-3 values for nCubes and percOverlap
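Grid sizes multiply quickly, so it helps to count combinations before running anything. The sketch below is an illustration, not part of Thema: `grid_size` is a hypothetical helper, and it assumes every combination of listed values produces one output (a full Cartesian product). The example values are the t-SNE and jmap grids used on this page.

```python
from math import prod


def grid_size(grid):
    """Number of combinations in a grid given as {parameter: [values]}."""
    return prod(len(values) for values in grid.values())


# Illustrative grids; the clusterer entries are plain labels here.
oort_tsne = {"perplexity": [15, 30, 50], "dimensions": [2], "seed": [42]}
galaxy_jmap = {
    "nCubes": [5, 10, 20],
    "percOverlap": [0.5, 0.6, 0.7],
    "minIntersection": [-1],
    "clusterer": ["HDBSCAN(min_cluster_size=3)", "HDBSCAN(min_cluster_size=5)"],
}

n_embeddings = grid_size(oort_tsne)            # embeddings per input dataset
n_graphs_per_embedding = grid_size(galaxy_jmap)
print(n_embeddings, n_graphs_per_embedding, n_embeddings * n_graphs_per_embedding)
```

Adding a third value to one more parameter multiplies the total rather than adding to it, which is the reason to expand one parameter at a time.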
Oort: Embedding Parameters
t-SNE Grid
Oort:
  tsne:
    perplexity: [15, 30, 50]
    dimensions: [2]
    seed: [42]
  projectiles: [tsne]
- perplexity
5-15: Local structure (small clusters)
30-50: Global structure (large-scale patterns)
Rule of thumb: perplexity ≈ sqrt(n_samples), e.g. about 30 for 900 samples
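The rule of thumb above can be turned into a starting grid. This is a minimal sketch; `suggest_perplexities` and its default candidates are hypothetical, not part of Thema. It drops values that are not smaller than the sample count, since t-SNE implementations such as scikit-learn's require perplexity < n_samples.

```python
import math


def suggest_perplexities(n_samples, candidates=(5, 15, 30, 50)):
    """Return sorted candidate perplexities, including round(sqrt(n_samples)).

    Values outside [1, n_samples) are dropped.
    """
    rule_of_thumb = round(math.sqrt(n_samples))
    values = set(candidates) | {rule_of_thumb}
    return sorted(v for v in values if 1 <= v < n_samples)


print(suggest_perplexities(900))
print(suggest_perplexities(40))
```

The returned list can be pasted into the `perplexity` entry of the grid and trimmed to two or three values.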
PCA Grid
Oort:
  pca:
    dimensions: [2, 3, 5]
    seed: [42]
  projectiles: [pca]
- dimensions
2: Fast, good for visualization
3-5: Captures more variance, slower graph construction
Galaxy: Mapper Parameters
Mapper Configuration
Galaxy:
  metric: stellar_curvature_distance
  selector: max_nodes
  nReps: 3
  stars: [jmap]
  jmap:
    nCubes: [5, 10, 20]
    percOverlap: [0.5, 0.6, 0.7]
    minIntersection: [-1]
    clusterer:
      - [HDBSCAN, {min_cluster_size: 3}]
      - [HDBSCAN, {min_cluster_size: 5}]
- nCubes
Number of intervals covering the projection space.
5: Coarse resolution, few large clusters
10-20: Moderate (recommended starting point)
50+: Fine-grained, many small clusters
- percOverlap
Fraction of overlap between adjacent cubes (0-1).
0.3-0.5: Less connectivity, more components
0.6-0.7: Moderate connectivity (recommended)
0.8+: High connectivity, fewer components
- minIntersection
Minimum shared items to form an edge.
-1: Weighted edges (recommended)
Positive: Stricter edge requirements
- clusterer
Algorithm and parameters for clustering within cubes.
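To see how nCubes and percOverlap interact, the sketch below builds a one-dimensional cover. It assumes one common convention (equal-length intervals where each adjacent pair shares a fraction `perc_overlap` of the interval length); Thema's actual cover construction may differ, and `cover_intervals` is a hypothetical helper.

```python
def cover_intervals(lo, hi, n_cubes, perc_overlap):
    """Cover [lo, hi] with n_cubes equal intervals.

    Adjacent intervals overlap by perc_overlap * interval_length
    (an assumed convention, for illustration only).
    """
    if n_cubes < 1 or not 0 <= perc_overlap < 1:
        raise ValueError("need n_cubes >= 1 and 0 <= perc_overlap < 1")
    # length + (n_cubes - 1) * step must equal hi - lo, with step = length * (1 - p)
    length = (hi - lo) / (1 + (n_cubes - 1) * (1 - perc_overlap))
    step = length * (1 - perc_overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_cubes)]


for start, end in cover_intervals(0.0, 1.0, n_cubes=3, perc_overlap=0.5):
    print(f"[{start:.2f}, {end:.2f}]")
```

Under this convention, raising percOverlap at fixed nCubes makes every interval longer, so more points fall into several cubes at once and more edges form between nodes.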
Clustering Algorithms
HDBSCAN (recommended)
clusterer:
- [HDBSCAN, {min_cluster_size: 3}]
- [HDBSCAN, {min_cluster_size: 5, min_samples: 3}]
min_cluster_size: Minimum items per cluster (2-10 typical)
min_samples: Core point requirement (optional)
DBSCAN
clusterer:
- [DBSCAN, {eps: 0.5, min_samples: 5}]
KMeans
clusterer:
- [KMeans, {n_clusters: 8}]
Filtering Graphs
Apply filters to remove unwanted graphs before distance computation.
Built-in Filters
- component_count_filter(k)
Keep graphs with exactly k components
- component_count_range_filter(min_k, max_k)
Keep graphs with component count in [min_k, max_k]
- minimum_nodes_filter(n)
Keep graphs with at least n nodes
- minimum_edges_filter(n)
Keep graphs with at least n edges
- minimum_unique_items_filter(n)
Keep graphs covering at least n unique data points
Programmatic Filtering
from thema.multiverse import Galaxy
from thema.multiverse.universe.utils.starFilters import component_count_filter

galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit()

# Keep only graphs with exactly 3 connected components
filter_3comp = component_count_filter(3)
selection = galaxy.collapse(
    filter_fn=filter_3comp,
    selector="max_nodes",
    nReps=2,
)
Selection Strategies
selector options in collapse():
- max_nodes (recommended)
Largest graph per cluster. Good for interpretability.
- max_edges
Most connected graph per cluster.
- min_nodes
Smallest graph per cluster. Minimal representatives.
- random
Random selection per cluster.
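The selector semantics listed above can be illustrated on toy data. This is a sketch of the idea only, not Thema's implementation: `GraphSummary`, `SELECTORS`, and `pick_representatives` are hypothetical names, and graphs are reduced to their node and edge counts.

```python
import random
from dataclasses import dataclass


@dataclass
class GraphSummary:
    name: str
    n_nodes: int
    n_edges: int


# Each selector maps a list of graphs in one cluster to a single representative.
SELECTORS = {
    "max_nodes": lambda graphs, rng: max(graphs, key=lambda g: g.n_nodes),
    "max_edges": lambda graphs, rng: max(graphs, key=lambda g: g.n_edges),
    "min_nodes": lambda graphs, rng: min(graphs, key=lambda g: g.n_nodes),
    "random": lambda graphs, rng: rng.choice(graphs),
}


def pick_representatives(clusters, selector="max_nodes", seed=0):
    """Pick one graph per cluster; clusters is {label: [GraphSummary, ...]}."""
    rng = random.Random(seed)
    choose = SELECTORS[selector]
    return {label: choose(graphs, rng) for label, graphs in clusters.items()}


clusters = {
    0: [GraphSummary("a", 12, 20), GraphSummary("b", 30, 25)],
    1: [GraphSummary("c", 8, 40), GraphSummary("d", 9, 10)],
}
for name in ("max_nodes", "max_edges", "min_nodes"):
    reps = pick_representatives(clusters, selector=name)
    print(name, {label: g.name for label, g in reps.items()})
```

On this toy data max_nodes and max_edges disagree for the second cluster, which shows why the selector choice matters when graph size and connectivity are not correlated.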
Collapse Methods
Two ways to control representative count:
By Count
selection = galaxy.collapse(
    nReps=5,
    selector="max_nodes"
)
By Distance Threshold
selection = galaxy.collapse(
    distance_threshold=250,
    selector="max_nodes"
)
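The difference between the two stopping rules can be sketched with a naive agglomerative clustering over a pairwise distance matrix. This assumes collapse() groups graphs hierarchically, which is an assumption about its internals; `agglomerate` is a hypothetical, average-linkage illustration and not Thema's code.

```python
from itertools import combinations


def agglomerate(dist, n_clusters=None, distance_threshold=None):
    """Naive average-linkage clustering of items 0..n-1.

    Stops when n_clusters remain, or when the closest pair of clusters is
    at or above distance_threshold. Exactly one of the two must be given.
    """
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters or distance_threshold")
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        pairs = combinations(range(len(clusters)), 2)
        gap, ia, ib = min(
            (linkage(clusters[i], clusters[j]), i, j) for i, j in pairs
        )
        if distance_threshold is not None and gap >= distance_threshold:
            break
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Two tight pairs of graphs, far apart from each other.
dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
print(agglomerate(dist, n_clusters=2))
print(agglomerate(dist, distance_threshold=5))
print(agglomerate(dist, distance_threshold=0.5))
```

A fixed count always returns that many groups, while a threshold lets the data decide: a threshold below the smallest pairwise distance leaves every graph as its own representative.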
Curvature Metrics
Choose the curvature metric in collapse():
selection = galaxy.collapse(
    curvature="balanced_forman_curvature",
    nReps=3
)
- forman_curvature
Fast, default choice
- balanced_forman_curvature
More sensitive to structural differences
- resistance_curvature
Emphasizes connectivity patterns
- ollivier_ricci_curvature
Most detailed, slowest computation
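For intuition about why Forman curvature is the cheap option, the sketch below computes its simplest form on an unweighted simple graph, where the curvature of an edge (u, v) is 4 - deg(u) - deg(v). This is the textbook edge-only variant and may not match the exact formula Thema uses; `forman_curvature` here is a hypothetical standalone function.

```python
def forman_curvature(edges):
    """Edge-only Forman curvature for an unweighted simple graph.

    edges: list of (u, v) pairs with no duplicates or self-loops.
    Returns {(u, v): 4 - deg(u) - deg(v)}.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


path = [("a", "b"), ("b", "c")]
star = [("hub", leaf) for leaf in ("w", "x", "y", "z")]
print(forman_curvature(path))
print(forman_curvature(star))
```

Only node degrees are needed, so the cost is linear in the number of edges. Edges attached to high-degree hubs get negative values, which is how the metric separates hub-like graphs from chain-like ones.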
Complete Example
from thema.multiverse import Galaxy
from thema.multiverse.universe.utils.starFilters import (
    component_count_range_filter,
    minimum_unique_items_filter,
)

galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit()

# Placeholder: set total_items to the number of data points in your dataset
total_items = 1000

# Filter for 3-5 components with 85%+ coverage
comp_filter = component_count_range_filter(3, 5)
cov_filter = minimum_unique_items_filter(int(0.85 * total_items))

def combined_filter(star):
    # Keep a graph only if it passes both filters
    return comp_filter(star) and cov_filter(star)

selection = galaxy.collapse(
    filter_fn=combined_filter,
    curvature="balanced_forman_curvature",
    nReps=4,
    selector="max_nodes",
)
print(f"Selected {len(selection)} representatives")
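When more than two filters are involved, the hand-written combined_filter pattern can be generalized. The helper below is a hypothetical convenience, not part of Thema; it assumes only what the example above implies, namely that each filter is a callable taking a star and returning a truthy or falsy value.

```python
def all_of(*filters):
    """Combine filters so that an item passes only if every filter accepts it."""
    def combined(star):
        # all() short-circuits, so later filters are skipped after a failure.
        return all(f(star) for f in filters)
    return combined


# Stand-in filters on plain integers, just to show the behaviour.
is_positive = lambda x: x > 0
is_even = lambda x: x % 2 == 0
both = all_of(is_positive, is_even)
print([x for x in (-2, 1, 4) if both(x)])
```

With the real filters, the result would be passed as `filter_fn=all_of(comp_filter, cov_filter)`.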