Tuning and Selection¶

Fine-tune parameter grids, apply filters, and select representatives.

Parameter Grid Strategy¶

Start with a small grid and expand it based on results:

Planet: 2-3 seeds, 1-2 samples each
Oort: 3 values per parameter
Galaxy: 2-3 values for nCubes and percOverlap
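
To see how quickly a grid grows, the minimal sketch below counts combinations with `itertools.product`. The dictionaries copy the t-SNE and Mapper grids shown on this page. The helper `grid_size` is hypothetical and not part of thema, and the final product assumes that every embedding is paired with every Mapper setting.

```python
from itertools import product


def grid_size(grid):
    """Count the combinations in a dict of parameter -> list of values."""
    return sum(1 for _ in product(*grid.values()))


# Values copied from the t-SNE and Mapper grids on this page.
oort_tsne = {"perplexity": [15, 30, 50], "dimensions": [2], "seed": [42]}
galaxy_jmap = {
    "nCubes": [5, 10, 20],
    "percOverlap": [0.5, 0.6, 0.7],
    "minIntersection": [-1],
    "clusterer": ["HDBSCAN(min_cluster_size=3)", "HDBSCAN(min_cluster_size=5)"],
}

n_embeddings = grid_size(oort_tsne)          # 3 * 1 * 1
n_mapper_settings = grid_size(galaxy_jmap)   # 3 * 3 * 1 * 2
# Assumption: each embedding is combined with each Mapper setting.
graphs_per_input = n_embeddings * n_mapper_settings
print(n_embeddings, n_mapper_settings, graphs_per_input)
```

Adding one more value to any list multiplies the total, which is why a small first grid is usually the cheaper way to find the interesting region.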

Oort: Embedding Parameters¶

t-SNE Grid¶

Oort:
  tsne:
    perplexity: [15, 30, 50]
    dimensions: [2]
    seed: [42]
  projectiles: [tsne]

perplexity

5-15: Local structure (small clusters)
30-50: Global structure (large-scale patterns)
Rule of thumb: perplexity ≈ sqrt(n_samples), and always smaller than n_samples
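
The square-root heuristic can be turned into a starting grid. The sketch below is one possible way to do that: it brackets sqrt(n_samples) with half and double that value and keeps every candidate below the sample count. The function name `suggest_perplexities` and the half/double bracketing are illustrative assumptions, not thema behaviour.

```python
import math


def suggest_perplexities(n_samples):
    """Return a small perplexity grid centred on sqrt(n_samples)."""
    if n_samples < 3:
        raise ValueError("need at least 3 samples")
    centre = math.sqrt(n_samples)
    candidates = [centre / 2, centre, centre * 2]
    # Perplexity must stay below the number of samples.
    upper = n_samples - 1
    # Round, clamp to [2, upper], and drop duplicates.
    return sorted({max(2, min(upper, round(c))) for c in candidates})


print(suggest_perplexities(900))
```

The returned list can be pasted into the `perplexity` entry of the t-SNE grid.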

PCA Grid¶

Oort:
  pca:
    dimensions: [2, 3, 5]
    seed: [42]
  projectiles: [pca]

dimensions

2: Fast, good for visualization
3-5: Captures more variance, slower graph construction

Galaxy: Mapper Parameters¶

Mapper Configuration¶

Galaxy:
  metric: stellar_curvature_distance
  selector: max_nodes
  nReps: 3
  stars: [jmap]
  jmap:
    nCubes: [5, 10, 20]
    percOverlap: [0.5, 0.6, 0.7]
    minIntersection: [-1]
    clusterer:
      - [HDBSCAN, {min_cluster_size: 3}]
      - [HDBSCAN, {min_cluster_size: 5}]

nCubes

Number of intervals covering the projection space.

5: Coarse resolution, few large clusters
10-20: Moderate (recommended starting point)
50+: Fine-grained, many small clusters

percOverlap

Fraction of overlap between adjacent cubes (0-1).

0.3-0.5: Less connectivity, more components
0.6-0.7: Moderate connectivity (recommended)
0.8+: High connectivity, fewer components
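
To make the interaction between nCubes and percOverlap concrete, here is a minimal one-dimensional sketch of a Mapper cover. It uses one common convention: equal-length intervals that span the range exactly, with neighbours sharing a fraction `perc_overlap` of their length. The function `build_cover` is hypothetical, and thema's internal cover construction may differ in detail.

```python
def build_cover(lo, hi, n_cubes, perc_overlap):
    """Return n_cubes overlapping (start, end) intervals covering [lo, hi]."""
    if n_cubes < 1:
        raise ValueError("n_cubes must be at least 1")
    if not 0 <= perc_overlap < 1:
        raise ValueError("perc_overlap must be in [0, 1)")
    # Consecutive intervals are shifted by `step`, so neighbours share
    # perc_overlap * length of their extent.
    length = (hi - lo) / (1 + (n_cubes - 1) * (1 - perc_overlap))
    step = length * (1 - perc_overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_cubes)]


for start, end in build_cover(0.0, 1.0, n_cubes=5, perc_overlap=0.5):
    print(f"{start:.3f} .. {end:.3f}")
```

Raising `n_cubes` shortens every interval (finer resolution), while raising `perc_overlap` lengthens the shared region between neighbours, so more points fall into two cubes and more edges appear.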

minIntersection

Minimum shared items to form an edge.

-1: Weighted edges (recommended)
Positive: Stricter edge requirements
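
The edge rule can be sketched on plain Python sets. In the example below each node is the set of item ids it contains. The helper `build_edges` is hypothetical, and the weighting used for `-1` (weight equals the number of shared items) is an assumption made for illustration; this page does not state how thema weights edges.

```python
from itertools import combinations


def build_edges(nodes, min_intersection):
    """nodes: dict node_id -> set of item ids. Returns dict (u, v) -> weight."""
    edges = {}
    for u, v in combinations(sorted(nodes), 2):
        shared = len(nodes[u] & nodes[v])
        if min_intersection == -1:
            # Assumption: weighted mode keeps every overlapping pair and
            # uses the number of shared items as the edge weight.
            if shared > 0:
                edges[(u, v)] = shared
        elif shared >= min_intersection:
            # Thresholded mode: unweighted edge once enough items are shared.
            edges[(u, v)] = 1
    return edges


nodes = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}}
print(build_edges(nodes, -1))
print(build_edges(nodes, 2))
```

With the threshold set to 2, the single-item overlap between `b` and `c` no longer produces an edge, which is the sense in which positive values are stricter.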

clusterer

Algorithm and parameters for clustering within cubes.

Clustering Algorithms¶

HDBSCAN (recommended)

clusterer:
  - [HDBSCAN, {min_cluster_size: 3}]
  - [HDBSCAN, {min_cluster_size: 5, min_samples: 3}]

min_cluster_size: Minimum items per cluster (2-10 typical)
min_samples: Core point requirement (optional)

DBSCAN

clusterer:
  - [DBSCAN, {eps: 0.5, min_samples: 5}]

KMeans

clusterer:
  - [KMeans, {n_clusters: 8}]

Filtering Graphs¶

Apply filters to remove unwanted graphs before distance computation.

Built-in Filters¶

component_count_filter(k): Keep graphs with exactly k connected components
component_count_range_filter(min_k, max_k): Keep graphs with component count in [min_k, max_k]
minimum_nodes_filter(n): Keep graphs with at least n nodes
minimum_edges_filter(n): Keep graphs with at least n edges
minimum_unique_items_filter(n): Keep graphs covering at least n unique data points
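
As an illustration of what the component-count filters test, the sketch below counts connected components of an undirected graph stored as a plain adjacency dict. It is independent of thema: `count_components` and `in_component_range` are hypothetical helpers, and the adjacency-dict representation is an assumption chosen to keep the example self-contained.

```python
def count_components(adjacency):
    """adjacency: dict node -> iterable of neighbours (undirected graph)."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        # A new, unvisited node starts a new component.
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def in_component_range(adjacency, min_k, max_k):
    """True when the component count lies in the closed range [min_k, max_k]."""
    return min_k <= count_components(adjacency) <= max_k


graph = {"a": ["b"], "b": ["a"], "c": [], "d": ["e"], "e": ["d"]}
print(count_components(graph))
```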

Programmatic Filtering¶

from thema.multiverse import Galaxy
from thema.multiverse.universe.utils.starFilters import component_count_filter

galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit()

# Keep only graphs with exactly 3 connected components
filter_3comp = component_count_filter(3)
selection = galaxy.collapse(
    filter_fn=filter_3comp,
    selector="max_nodes",
    nReps=2
)

Selection Strategies¶

selector options in collapse():

max_nodes (recommended): Largest graph per cluster. Good for interpretability.
max_edges: Most connected graph per cluster.
min_nodes: Smallest graph per cluster. Minimal representatives.
random: Random selection per cluster.
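
The four strategies differ only in which summary statistic decides the winner within a cluster. A minimal sketch, using plain dicts as stand-ins for graphs (the keys `name`, `n_nodes`, and `n_edges` and the function `pick_representative` are hypothetical, not thema objects):

```python
import random


def pick_representative(cluster, selector, rng=None):
    """cluster: list of dicts with 'name', 'n_nodes' and 'n_edges' keys."""
    if selector == "max_nodes":
        return max(cluster, key=lambda g: g["n_nodes"])
    if selector == "max_edges":
        return max(cluster, key=lambda g: g["n_edges"])
    if selector == "min_nodes":
        return min(cluster, key=lambda g: g["n_nodes"])
    if selector == "random":
        # Pass a seeded random.Random for reproducible picks.
        return (rng or random).choice(cluster)
    raise ValueError(f"unknown selector: {selector}")


cluster = [
    {"name": "g1", "n_nodes": 12, "n_edges": 30},
    {"name": "g2", "n_nodes": 20, "n_edges": 25},
    {"name": "g3", "n_nodes": 7, "n_edges": 9},
]
for selector in ("max_nodes", "max_edges", "min_nodes"):
    print(selector, pick_representative(cluster, selector)["name"])
```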

Collapse Methods¶

Two ways to control representative count:

By Count

selection = galaxy.collapse(
    nReps=5,
    selector="max_nodes"
)

By Distance Threshold

selection = galaxy.collapse(
    distance_threshold=250,
    selector="max_nodes"
)
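
The difference between the two stopping rules can be sketched with a small average-linkage clustering over a pairwise distance matrix. This assumes that collapse() groups graphs by agglomerative clustering on their pairwise distances, which this page does not spell out. The function `agglomerate` is hypothetical, and its `n_clusters` argument plays the role of `nReps`.

```python
def agglomerate(dist, n_clusters=None, distance_threshold=None):
    """Average-linkage clustering on a symmetric distance matrix."""
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters or distance_threshold")
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break  # stop by count
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if distance_threshold is not None and d >= distance_threshold:
            break  # stop by distance: closest pair is already too far apart
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


dist = [
    [0, 10, 300, 300],
    [10, 0, 300, 300],
    [300, 300, 0, 20],
    [300, 300, 20, 0],
]
print(agglomerate(dist, n_clusters=2))
print(agglomerate(dist, distance_threshold=250))
print(agglomerate(dist, distance_threshold=15))
```

A count fixes how many representatives come back regardless of how similar the graphs are. A threshold lets the data decide, so a tight threshold can return many more representatives than expected.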

Curvature Metrics¶

Choose the curvature metric in collapse():

selection = galaxy.collapse(
    curvature="balanced_forman_curvature",
    nReps=3
)

forman_curvature: Fast, default choice
balanced_forman_curvature: More sensitive to structural differences
resistance_curvature: Emphasizes connectivity patterns
ollivier_ricci_curvature: Most detailed, slowest computation
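
For intuition about why the Forman variant is cheap, the sketch below computes the simplest combinatorial Forman-Ricci curvature of an unweighted graph, F(u, v) = 4 - deg(u) - deg(v), which ignores triangles and edge weights. Whether thema's `forman_curvature` uses exactly this form is not stated here, so treat `forman_curvature` below as an illustrative stand-in.

```python
def forman_curvature(edges):
    """edges: iterable of (u, v) pairs of an undirected, unweighted simple graph.

    Returns a dict mapping each edge to 4 - deg(u) - deg(v).
    """
    edges = list(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(forman_curvature(path))
```

Only node degrees are needed, so the cost is linear in the number of edges. Ollivier-Ricci curvature, by contrast, solves an optimal-transport problem per edge, which is consistent with it being the slowest option.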

Complete Example¶

from thema.multiverse import Galaxy
from thema.multiverse.universe.utils.starFilters import (
    component_count_range_filter,
    minimum_unique_items_filter
)

galaxy = Galaxy(YAML_PATH="params.yaml")
galaxy.fit()

# Filter for 3-5 components with 85%+ coverage
total_items = 1000  # placeholder: set this to the number of rows in your dataset
comp_filter = component_count_range_filter(3, 5)
cov_filter = minimum_unique_items_filter(int(0.85 * total_items))

def combined_filter(star):
    return comp_filter(star) and cov_filter(star)

selection = galaxy.collapse(
    filter_fn=combined_filter,
    curvature="balanced_forman_curvature",
    nReps=4,
    selector="max_nodes"
)

print(f"Selected {len(selection)} representatives")
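
When more than two filters are combined, a small helper avoids writing a new wrapper each time. The sketch below assumes only what `combined_filter` above already relies on: a filter is a callable that takes a star and returns a truthy or falsy value. The name `all_of` is hypothetical, and the dicts stand in for star objects.

```python
def all_of(*filters):
    """Return a filter that passes only when every given filter passes."""
    def combined(star):
        return all(f(star) for f in filters)
    return combined


# Stand-in filters and stars for demonstration; real filters come from starFilters.
has_three_components = lambda star: star["components"] == 3
covers_enough = lambda star: star["unique_items"] >= 850

keep = all_of(has_three_components, covers_enough)
print(keep({"components": 3, "unique_items": 900}))
print(keep({"components": 3, "unique_items": 400}))
```

Under that assumption, `filter_fn=all_of(comp_filter, cov_filter)` would be equivalent to passing `combined_filter`.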