Galaxy

Galaxy builds many Stars (graphs), measures pairwise distances between them, clusters them, and then selects representative graphs from each cluster.

Highlights

  • Fit: generate Stars over your parameter grid

  • Collapse: compute distances (curvature filtrations), cluster (Agglomerative), select reps

  • Coordinates: get a 2D layout (MDS) for quick visual sanity checks

API

class thema.multiverse.universe.galaxy.Galaxy(params=None, data=None, cleanDir=None, projDir=None, outDir=None, metric='stellar_curvature_distance', selector='max_nodes', nReps=3, filter_fn=None, YAML_PATH=None, verbose=False)

Bases: object

A space of stars.

The largest space of data representations, a galaxy can be searched to find particular stars and systems most suitable for a particular explorer.

Galaxy generates a space of star objects from the distribution of inner and outer systems.

Members

data: str

Path to the original raw data file.

cleanDir: str

Path to a populated directory containing Moons.

projDir: str

Path to a populated directory containing Comets.

outDir: str

Path to an out directory to store star objects.

selection: dict

Dictionary containing selected representative stars. Set by collapse function.

YAML_PATH: str

Path to yaml configuration file.

Functions

get_data_path() -> str

returns path to the raw data file

fit() -> None

fits a space of Stars and saves to outDir

collapse() -> dict

clusters and selects representatives of star models

get_galaxy_coordinates() -> np.ndarray

computes a 2D coordinate system of stars in the galaxy using Multidimensional Scaling (MDS)

save(file_path) -> None

Saves instance to pickle file.

Example

>>> cleanDir = <PATH TO MOON OBJECT FILES>
>>> data = <PATH TO RAW DATA FILE>
>>> projDir = <PATH TO COMET OBJECT FILES>
>>> outDir = <PATH TO OUT DIRECTORY OF PROJECTIONS>
>>> from thema.multiverse.universe.galaxy import Galaxy
>>> params = {
...     "jmap": {
...         "nCubes": [2, 5, 8],
...         "percOverlap": [0.2, 0.4],
...         "minIntersection": [-1],
...         "clusterer": [["HDBSCAN", {"minDist": 0.1}]],
...     }
... }
>>> galaxy = Galaxy(params=params,
...                 data=data,
...                 cleanDir=cleanDir,
...                 projDir=projDir,
...                 outDir=outDir)
>>> galaxy.fit()
>>> # First, compute distances and cluster the stars
>>> selected_stars = galaxy.collapse()
>>> print(f"Selected {len(selected_stars)} representative stars")
>>>
>>> # Generate and visualize the galaxy coordinates with custom plotting
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>>
>>> # Manual plotting of the galaxy coordinates (NOTE: `Thema` does not have built-in visualization dependencies)
>>> coordinates = galaxy.get_galaxy_coordinates()
>>> plt.figure(figsize=(8, 6))
>>> plt.scatter(coordinates[:, 0], coordinates[:, 1], alpha=0.7)
>>> plt.title('2D Coordinate Map of Star Models')
>>> plt.xlabel('X Coordinate')
>>> plt.ylabel('Y Coordinate')
>>> plt.show()
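The params dictionary in the example above lists candidate values per setting. One plausible reading is that fit() builds a Star for each combination of those values. The sketch below only illustrates that Cartesian product; expand_grid is a hypothetical helper, not part of Thema, and the real number of Stars may also depend on how many Moon and Comet files are found in cleanDir and projDir.

```python
from itertools import product

# Same grid as the example above; each value is a list of candidates.
params = {
    "jmap": {
        "nCubes": [2, 5, 8],
        "percOverlap": [0.2, 0.4],
        "minIntersection": [-1],
        "clusterer": [["HDBSCAN", {"minDist": 0.1}]],
    }
}


def expand_grid(grid):
    """Return one dict per combination of the candidate values in `grid`.

    Hypothetical helper for illustration; Thema's own scheduling may differ.
    """
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


configs = expand_grid(params["jmap"])
print(len(configs))  # 3 * 2 * 1 * 1 = 6
print(configs[0])
```

Counting the combinations before calling fit() is a cheap way to estimate how many Star files to expect in outDir.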
collapse(metric=None, nReps=None, selector=None, filter_fn=None, files: list | None = None, distance_threshold: float | None = None, **kwargs)

Collapses the space of Stars into representative Stars. Either nReps (number of clusters) or distance_threshold (AgglomerativeClustering) can be used.
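To make the two stopping rules concrete, here is a minimal from-scratch sketch of agglomerative clustering on a precomputed distance matrix. It is not Thema's or scikit-learn's implementation: the function name agglomerate, the average-linkage choice, and the toy distances are all assumptions for illustration. n_clusters plays the role of nReps.

```python
def agglomerate(dist, n_clusters=None, distance_threshold=None):
    """Average-linkage agglomerative clustering on a square distance matrix.

    Merging stops when `n_clusters` clusters remain, or when the closest
    pair of clusters is at least `distance_threshold` apart.
    """
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters or distance_threshold")
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        # Find the closest pair of clusters.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if distance_threshold is not None and d >= distance_threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Convert the member lists into one label per input graph.
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy pairwise distances between five graphs (made-up values).
dist = [
    [0.0, 0.1, 0.9, 1.0, 0.8],
    [0.1, 0.0, 0.8, 0.9, 0.9],
    [0.9, 0.8, 0.0, 0.2, 0.7],
    [1.0, 0.9, 0.2, 0.0, 0.6],
    [0.8, 0.9, 0.7, 0.6, 0.0],
]
print(agglomerate(dist, n_clusters=2))           # [0, 0, 1, 1, 1]
print(agglomerate(dist, distance_threshold=0.5)) # [0, 0, 1, 1, 2]
```

A fixed cluster count always yields that many groups, while a threshold lets the data decide: here the fifth graph stays alone because nothing lies within 0.5 of it.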

Parameters:
  • metric (str, optional) – Metric function name for comparing graphs. Defaults to self.metric.

  • nReps (int, optional) – Number of clusters for AgglomerativeClustering. Ignored if distance_threshold is set.

  • selector (str, optional) – Selection function name to choose representative stars. Defaults to self.selector.

  • filter_fn (callable, str, or None) – Filter function to select a subset of graphs. Defaults to no filter.

  • files (list[str] or None) – Optional list of file paths to process. Defaults to the files in self.outDir.

  • distance_threshold (float, optional) – AgglomerativeClustering distance threshold. Used if nReps is None.

  • **kwargs – Additional arguments passed to the metric function.

Returns:

Mapping from cluster labels to selected stars and cluster sizes.

Return type:

dict
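The returned mapping can be pictured with a small sketch of a max_nodes-style selector: within each cluster, keep the graph with the most nodes and record the cluster size. The helper select_max_nodes, the file names, and the cluster_size key are hypothetical; only the star key mirrors the info['star'] lookup used in this page's plotting example. How Thema breaks ties is not stated here.

```python
from collections import defaultdict


def select_max_nodes(labels, node_counts, names):
    """Pick, per cluster, the graph with the most nodes.

    Hypothetical helper; the dictionary layout is an assumption.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    selection = {}
    for label, idxs in sorted(members.items()):
        best = max(idxs, key=lambda i: node_counts[i])
        selection[label] = {"star": names[best], "cluster_size": len(idxs)}
    return selection


labels = [0, 0, 1, 1, 1]            # cluster label per graph
node_counts = [12, 30, 8, 25, 17]   # made-up node counts
names = [f"star_{i}.pkl" for i in range(5)]

selection = select_max_nodes(labels, node_counts, names)
print(selection)
# {0: {'star': 'star_1.pkl', 'cluster_size': 2},
#  1: {'star': 'star_3.pkl', 'cluster_size': 3}}
```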

fit()

Configure and generate space of Stars. Uses the function_scheduler to spawn multiple star instances and fit them in parallel.

Returns:

Saves star objects to outDir and prints a count of failed saves.

Return type:

None

getParams()

Returns the parameters of the Galaxy instance.

Returns:

A dictionary containing the parameters of the Galaxy instance.

Return type:

dict

get_galaxy_coordinates() -> ndarray

Computes a 2D coordinate system for stars in the galaxy, allowing visualization of their relative positions. This function uses Multidimensional Scaling (MDS) to project the high-dimensional distance matrix into a 2D space, preserving the relative distances between stars as much as possible.

Note: This method requires that distances have been computed first, usually by calling the collapse() method or directly computing distances with a metric function.

Returns:

A 2D array of shape (n_stars, 2) containing the X,Y coordinates of each star in the galaxy. Each row represents the 2D coordinates of one star.

Return type:

np.ndarray
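As a rough illustration of how a distance matrix becomes an (n_stars, 2) array, the sketch below implements classical (Torgerson) MDS with NumPy. This is a swapped-in variant chosen because it is short and deterministic; the MDS routine Thema actually calls may be an iterative one and can give a different layout. classical_mds and the toy points are assumptions for illustration.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in `n_components` dimensions from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to recover a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the largest eigenpairs; clip small negatives from non-Euclidean input.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


# Four points on a 3 x 4 rectangle stand in for stars with known distances.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(coords.shape)                  # (4, 2)
print(np.allclose(recovered, dist))  # True
```

The layout is only defined up to rotation and reflection, so the axes of such a plot carry no meaning on their own; only the relative distances between stars do.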

Examples

>>> # After fitting the galaxy and computing distances
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> coordinates = galaxy.get_galaxy_coordinates()
>>>
>>> # Basic scatter plot
>>> plt.figure(figsize=(10, 8))
>>> plt.scatter(coordinates[:, 0], coordinates[:, 1], alpha=0.7)
>>> plt.title('Star Map of the Galaxy')
>>> plt.xlabel('X Coordinate')
>>> plt.ylabel('Y Coordinate')
>>> plt.show()
>>>
>>> # Advanced plot highlighting the representative star of each cluster
>>> if galaxy.selection:  # Non-empty once collapse() has been called
...     plt.figure(figsize=(12, 10))
...     # Plot all stars
...     plt.scatter(coordinates[:, 0], coordinates[:, 1], c='lightgray', alpha=0.5)
...     # Highlight representative stars
...     for cluster_id, info in galaxy.selection.items():
...         # Find the index of the representative star in the keys array
...         rep_idx = np.where(galaxy.keys == info['star'])[0][0]
...         plt.scatter(coordinates[rep_idx, 0], coordinates[rep_idx, 1],
...                     s=100, c='red', edgecolor='black', label=f'Cluster {cluster_id}')
...     plt.legend()
...     plt.title('Star Map with Representative Stars')
...     plt.show()
save(file_path)

Save the current object instance to a file using pickle serialization.

Parameters:

file_path (str) – The path to the file where the object will be saved.
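Since save() relies on pickle, a saved instance can presumably be read back with pickle.load. The round trip below uses a SimpleNamespace as a hypothetical stand-in for a Galaxy, and save_instance and load_instance are illustrative helpers rather than Thema functions. Unpickling executes code from the file, so only load files you trust.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_instance(obj, file_path):
    """Write `obj` to `file_path` using pickle."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)


def load_instance(file_path):
    """Read back an object written by `save_instance` (trusted files only)."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in carrying two of the attributes documented above.
stand_in = SimpleNamespace(outDir="stars/", selection={0: {"star": "star_1.pkl"}})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "galaxy.pkl")
    save_instance(stand_in, path)
    restored = load_instance(path)

print(restored.selection)  # {0: {'star': 'star_1.pkl'}}
```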

summarize_graphClustering()

Summarizes the graph clustering results.

Returns:

A dictionary of the clusters and their corresponding graph members. The keys are the cluster names and the values are lists of graph file names.

Return type:

dict

writeParams_toYaml(YAML_PATH=None)

Write the parameters of the Galaxy instance to a YAML file.

Parameters:

YAML_PATH (str, optional) – The path to the YAML file. If not provided, the YAML_PATH attribute of the instance will be used.

Return type:

None