Galaxy

The Galaxy class is a core component of the THEMA package, designed to manage and explore a large space of data representations. It serves as a container for star objects, each representing a unique unsupervised projection of the data. Organizing these projections into a galaxy lets users compare many candidate models instead of committing to one up front, and quantify the agreement between them. In this way THEMA helps ensure that the most reliable representations of your data are chosen.

A galaxy is generated from distributions of inner and outer systems, which define the parameters for the creation of star objects. This structure allows for a flexible and scalable approach to managing complex data representations.

Key Features

  • Star Management: The Galaxy class maintains a collection of star objects, enabling easy access and manipulation of various data projections.

  • Search Functionality: Users can search for specific stars within the galaxy based on defined criteria, facilitating targeted exploration of data.

  • Distribution-Based Generation: Stars are generated from specified inner and outer system distributions, providing a customizable and systematic approach to data representation.

Use Cases

  • Data Exploration: Providing a structured way to explore different unsupervised projections of data.

  • Visualization: Enabling visualization of complex data relationships through star objects.

  • Algorithm Implementation: Serving as a foundation for implementing and managing graph generation algorithms, such as the Kepler Mapper algorithm used in the jmapStar class.

By leveraging the Galaxy class, users can create a comprehensive and organized space of data representations, facilitating deeper insights and more effective analysis of complex datasets.

Galaxy Class

class thema.multiverse.universe.galaxy.Galaxy(params=None, data=None, cleanDir=None, projDir=None, outDir=None, metric='stellar_kernel_distance', selector='random', nReps=3, YAML_PATH=None, verbose=False)[source]

Bases: object

A space of stars.

The largest space of data representations, a galaxy can be searched to find particular stars and systems most suitable for a particular explorer.

Galaxy generates a space of star objects from the distribution of inner and outer systems.

Members

data: str

Path to the original raw data file.

cleanDir: str

Path to a populated directory containing Moons.

projDir: str

Path to a populated directory containing Comets.

outDir: str

Path to an out directory to store star objects.

selection: dict

Dictionary containing selected representative stars. Set by collapse function.

YAML_PATH: str

Path to yaml configuration file.

Functions

get_data_path() -> str

returns path to the raw data file

fit() -> None

fits a space of Stars and saves to outDir

collapse() -> dict

clusters and selects representatives of star models

show_mds() -> None

plots a 2D representation of model layout

save(file_path) -> None

Saves instance to pickle file.

Example

>>> from thema.multiverse.universe.galaxy import Galaxy
>>> cleanDir = "<PATH TO MOON OBJECT FILES>"
>>> data = "<PATH TO RAW DATA FILE>"
>>> projDir = "<PATH TO COMET OBJECT FILES>"
>>> outDir = "<PATH TO OUT DIRECTORY OF PROJECTIONS>"
>>> params = {
...     "jmap": {
...         "nCubes": [2, 5, 8],
...         "percOverlap": [0.2, 0.4],
...         "minIntersection": [-1],
...         "clusterer": [["HDBSCAN", {"minDist": 0.1}]],
...     }
... }
>>> galaxy = Galaxy(params=params,
...                 data=data,
...                 cleanDir=cleanDir,
...                 projDir=projDir,
...                 outDir=outDir)
>>> galaxy.fit()
>>> galaxy.show_mds()
>>> galaxy.collapse()
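Each value in the params dictionary is a list, which suggests that every listed option contributes to the space of stars. The sketch below shows one plausible reading, a Cartesian product over the listed values. This is an assumption: the page does not state how Galaxy combines the lists, and `expand_grid` is a hypothetical helper, not part of THEMA.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Same shape as the "jmap" entry in the example above.
jmap_grid = {
    "nCubes": [2, 5, 8],
    "percOverlap": [0.2, 0.4],
    "minIntersection": [-1],
    "clusterer": [["HDBSCAN", {"minDist": 0.1}]],
}

configs = expand_grid(jmap_grid)
print(len(configs))  # 3 * 2 * 1 * 1 = 6
print(configs[0])
```

Under this reading, the number of candidate configurations grows multiplicatively with each added option, so short lists keep the fit tractable.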
collapse(metric=None, nReps=None, selector=None, **kwargs)[source]

Collapses the space of Stars into a small number of representative Stars

Parameters:
  • metric (str, optional) – The metric used when comparing graphs. Currently, we only support stellar_kernel_distance. (default: None)

  • nReps (int, optional) – The number of representative stars. (default: None)

  • selector (str, optional) – The selection criteria to choose representatives from a cluster. Currently, only “random” is supported. (default: None)

  • **kwargs (dict) – Additional arguments necessary for different metric functions.

Returns:

A dictionary containing the path to the star and the size of the group it represents.

Return type:

dict

Examples

>>> galaxy = Galaxy()
>>> galaxy.collapse(metric='stellar_kernel_distance', nReps=5, selector='random')
{'0': {'star': 'path/to/star1', 'cluster_size': 10},
    '1': {'star': 'path/to/star2', 'cluster_size': 8},
    '2': {'star': 'path/to/star3', 'cluster_size': 12},
    '3': {'star': 'path/to/star4', 'cluster_size': 9},
    '4': {'star': 'path/to/star5', 'cluster_size': 11}}
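
To make the shape of the returned dictionary concrete, here is a minimal sketch of the "random" selection step only. It assumes the stars have already been grouped into clusters; the clustering on the graph metric is not shown. `pick_representatives` and the file names are hypothetical and do not come from THEMA.

```python
import random
from collections import defaultdict


def pick_representatives(labels, seed=None):
    """Pick one random member per cluster and record the cluster size.

    `labels` maps a star file path to a cluster label (hypothetical input).
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for path, label in labels.items():
        clusters[label].append(path)

    selection = {}
    for label in sorted(clusters):
        members = clusters[label]
        selection[str(label)] = {
            "star": rng.choice(members),
            "cluster_size": len(members),
        }
    return selection


labels = {"star_a.pkl": 0, "star_b.pkl": 0, "star_c.pkl": 1}
selection = pick_representatives(labels, seed=0)
print(selection["1"])  # {'star': 'star_c.pkl', 'cluster_size': 1}
```

The cluster size stored next to each representative indicates how many fitted stars it stands for.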
fit()[source]

Configures and generates the space of Stars. Uses ProcessPoolExecutor (from concurrent.futures) to spawn multiple star instances and fit them in parallel.

Returns:

Saves star objects to outDir

Return type:

None
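
The general pattern of fitting many independent models in a worker pool can be sketched as follows. This is not THEMA's implementation: `fit_one` is a hypothetical stand-in for fitting a single star, and the `executor_cls` argument is added here only so that the same code can run with a thread pool.

```python
import concurrent.futures as cf


def fit_one(config):
    """Stand-in for fitting a single star; returns a label for the result."""
    return f"star_nCubes{config['nCubes']}"


def fit_all(configs, executor_cls=cf.ProcessPoolExecutor, max_workers=2):
    """Fit every configuration in a worker pool and collect the results."""
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # Submit one task per configuration, then gather as tasks finish.
        futures = [pool.submit(fit_one, cfg) for cfg in configs]
        for future in cf.as_completed(futures):
            results.append(future.result())
    # Completion order is not deterministic, so sort before returning.
    return sorted(results)
```

When using the default process pool from a script, call `fit_all` under an `if __name__ == "__main__":` guard, which is required on platforms that start worker processes by spawning. The worker function must also be defined at module level so that it can be pickled.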

getParams()[source]

Returns the parameters of the Galaxy instance.

Returns:

A dictionary containing the parameters of the Galaxy instance.

Return type:

dict

save(file_path)[source]

Save the current object instance to a file using pickle serialization.

Parameters:

file_path (str) – The path to the file where the object will be saved.
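
Because the instance is written with pickle, it can be read back with `pickle.load`. The round trip below is a minimal sketch using a `SimpleNamespace` as a stand-in for a fitted Galaxy. The helper names and attribute values are illustrative, and this page does not document a dedicated loader.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_instance(obj, file_path):
    """Serialize obj to file_path with pickle."""
    with open(file_path, "wb") as handle:
        pickle.dump(obj, handle)


def load_instance(file_path):
    """Read back an object written by save_instance (trusted files only)."""
    with open(file_path, "rb") as handle:
        return pickle.load(handle)


# SimpleNamespace stands in for a fitted Galaxy; its attributes are illustrative.
stand_in = SimpleNamespace(outDir="stars/", nReps=3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "galaxy.pkl")
    save_instance(stand_in, path)
    restored = load_instance(path)

print(restored.nReps)  # 3
```

Unpickling can execute arbitrary code, so only load files that come from a trusted source.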

show_mds(randomState: int = None)[source]

Generates an embedding based on the precomputed metric.

Parameters:

randomState (int, default None) – Seed passed to MDS to ensure reproducible results.

Returns:

Shows a plot of the embedding.

Return type:

None

summarize_graphClustering()[source]

Summarizes the graph clustering results.

Returns:

A dictionary of the clusters and their corresponding graph members. The keys are the cluster names and the values are lists of graph file names.

Return type:

dict

writeParams_toYaml(YAML_PATH=None)[source]

Write the parameters of the Galaxy instance to a YAML file.

Parameters:

YAML_PATH (str, optional) – The path to the YAML file. If not provided, the YAML_PATH attribute of the instance will be used.

Return type:

None