Galaxy

The Galaxy class is a core component of the THEMA package, designed to manage and explore a large space of data representations. It serves as a container for star objects, each representing a unique unsupervised projection of the data. Organizing these projections into a galaxy lets users democratize model selection, comparing many candidate models rather than committing to one, and quantify agreement between models. This helps ensure that the most reliable representations of your data are chosen.

A galaxy is generated from distributions of inner and outer systems, which define the parameters for the creation of star objects. This structure allows for a flexible and scalable approach to managing complex data representations.

Key Features

Star Management: The Galaxy class maintains a collection of star objects, enabling easy access and manipulation of various data projections.
Search Functionality: Users can search for specific stars within the galaxy based on defined criteria, facilitating targeted exploration of data.
Distribution-Based Generation: Stars are generated from specified inner and outer system distributions, providing a customizable and systematic approach to data representation.

Use Cases

Data Exploration: Providing a structured way to explore different unsupervised projections of data.
Visualization: Enabling visualization of complex data relationships through star objects.
Algorithm Implementation: Serving as a foundation for implementing and managing graph generation algorithms, such as the Kepler Mapper algorithm used in the jmapStar class.

By leveraging the Galaxy class, users can create a comprehensive and organized space of data representations, facilitating deeper insights and more effective analysis of complex datasets.

Galaxy Class

class thema.multiverse.universe.galaxy.Galaxy(params=None, data=None, cleanDir=None, projDir=None, outDir=None, metric='stellar_kernel_distance', selector='random', nReps=3, YAML_PATH=None, verbose=False)

Bases: object

A space of stars.

The largest space of data representations, a galaxy can be searched to find particular stars and systems most suitable for a particular explorer.

Galaxy generates a space of star objects from the distribution of inner and outer systems.

Members

data: str: Path to the original raw data file.
cleanDir: str: Path to a populated directory containing Moons.
projDir: str: Path to a populated directory containing Comets.
outDir: str: Path to an out directory to store star objects.
selection: dict: Dictionary containing selected representative stars. Set by collapse function.
YAML_PATH: str: Path to yaml configuration file.

Functions

get_data_path() -> str: Returns the path to the raw data file.
fit() -> None: Fits a space of Stars and saves them to outDir.
collapse() -> dict: Clusters the star models and selects representatives.
show_mds() -> None: Plots a 2D representation of the model layout.
save(file_path) -> None: Saves the instance to a pickle file.

Example

>>> cleanDir = <PATH TO MOON OBJECT FILES>
>>> data = <PATH TO RAW DATA FILE>
>>> projDir = <PATH TO COMET OBJECT FILES>
>>> outDir = <PATH TO OUT DIRECTORY OF PROJECTIONS>

>>> params = {
...   "jmap": {   "nCubes":[2,5,8],
...                "percOverlap": [0.2, 0.4],
...            "minIntersection":[-1],
...            "clusterer": [["HDBSCAN", {"minDist":0.1}]]
...            }
... }
>>> galaxy = Galaxy(params=params,
...            data=data,
...            cleanDir = cleanDir,
...            projDir = projDir,
...            outDir = outDir)

>>> galaxy.fit()
>>> galaxy.show_mds()
>>> galaxy.collapse()
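The params dictionary in the example above lists several candidate values per setting. As a rough sketch of what such a specification can describe, the snippet below expands the "jmap" entry into one dictionary per combination. Treating the lists as a Cartesian grid is an assumption made for illustration, and expand_grid is a hypothetical helper, not part of THEMA.

```python
from itertools import product

# Same shape as the "jmap" entry in the example above.
params = {
    "jmap": {
        "nCubes": [2, 5, 8],
        "percOverlap": [0.2, 0.4],
        "minIntersection": [-1],
        "clusterer": [["HDBSCAN", {"minDist": 0.1}]],
    }
}


def expand_grid(star_params):
    """Hypothetical helper: one dict per combination of candidate values.

    Assumes every value in star_params is a list of candidates.
    """
    names = list(star_params)
    return [dict(zip(names, combo)) for combo in product(*star_params.values())]


candidates = expand_grid(params["jmap"])
for candidate in candidates:
    print(candidate)
```

Under that assumption, the grid size is the product of the list lengths, so adding values to any one list multiplies the number of candidate configurations.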

collapse(metric=None, nReps=None, selector=None, **kwargs)

Collapses the space of Stars into a small number of representative Stars

Parameters:

metric (str, optional) – The metric used when comparing graphs. Currently, we only support stellar_kernel_distance. (default: None)
nReps (int, optional) – The number of representative stars. (default: None)
selector (str, optional) – The selection criteria to choose representatives from a cluster. Currently, only “random” is supported. (default: None)
**kwargs (dict) – Additional arguments necessary for different metric functions.

Returns:

A dictionary containing the path to the star and the size of the group it represents.

Return type:

dict

Examples

>>> galaxy = Galaxy()
>>> galaxy.collapse(metric='stellar_kernel_distance', nReps=5, selector='random')
{'0': {'star': 'path/to/star1', 'cluster_size': 10},
    '1': {'star': 'path/to/star2', 'cluster_size': 8},
    '2': {'star': 'path/to/star3', 'cluster_size': 12},
    '3': {'star': 'path/to/star4', 'cluster_size': 9},
    '4': {'star': 'path/to/star5', 'cluster_size': 11}}
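Once collapse has returned, the result is an ordinary dictionary and can be inspected with plain Python. The sketch below reuses the illustrative return value shown above (the paths are placeholders, not real files) and ranks the representatives by the size of the group each one stands for.

```python
# Literal copy of the illustrative return value shown above.
selection = {
    "0": {"star": "path/to/star1", "cluster_size": 10},
    "1": {"star": "path/to/star2", "cluster_size": 8},
    "2": {"star": "path/to/star3", "cluster_size": 12},
    "3": {"star": "path/to/star4", "cluster_size": 9},
    "4": {"star": "path/to/star5", "cluster_size": 11},
}

# Rank representatives from the largest group to the smallest.
ranked = sorted(
    selection.values(), key=lambda rep: rep["cluster_size"], reverse=True
)

# Total number of stars covered by all representatives.
total = sum(rep["cluster_size"] for rep in selection.values())

for rep in ranked:
    share = rep["cluster_size"] / total
    print(f"{rep['star']}: group of {rep['cluster_size']} ({share:.0%})")
```

A representative that stands for a large share of the stars reflects a structure that many parameter choices agree on, which is one way to read the result.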

fit()

Configures and generates the space of Stars. Uses ProcessPoolExecutor (from concurrent.futures) to spawn multiple star instances and fit them in parallel.

Returns: None. The fitted star objects are saved to outDir.
Return type: None

getParams()

Returns the parameters of the Galaxy instance.

Returns: A dictionary containing the parameters of the Galaxy instance.
Return type: dict

save(file_path)

Save the current object instance to a file using pickle serialization.

Parameters: file_path (str) – The path to the file where the object will be saved.
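Because the file is a standard pickle, it can be read back with pickle.load. The round trip below is a minimal sketch that uses a SimpleNamespace as a stand-in for a fitted Galaxy; the save_instance and load_instance helpers and the file name are hypothetical and only illustrate the serialization pattern, not THEMA's own code.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_instance(obj, file_path):
    """Hypothetical helper: serialize obj to file_path with pickle."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)


def load_instance(file_path):
    """Hypothetical helper: read a pickled object back from file_path."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Stand-in object with a couple of Galaxy-like attributes.
galaxy_like = SimpleNamespace(outDir="stars/", selection={})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "galaxy.pkl")  # hypothetical file name
    save_instance(galaxy_like, path)
    restored = load_instance(path)

print(restored.outDir)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.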

show_mds(randomState: int = None)

Generates an embedding based on the precomputed metric.

Parameters: randomState (int, default None) – Seed passed to MDS to ensure reproducible results.
Returns: None. Shows a plot of the embedding.
Return type: None

summarize_graphClustering()

Summarizes the graph clustering results.

Returns: A dictionary of the clusters and their corresponding graph members. The keys are the cluster names and the values are lists of graph file names.
Return type: dict
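To make the shape of that dictionary concrete, the sketch below builds one from a list of graph file names and a parallel list of cluster labels. The file names, the labels, the "cluster_<n>" key format, and the summarize helper are all hypothetical; the actual key names produced by THEMA may differ.

```python
from collections import defaultdict

# Hypothetical inputs: one cluster label per graph file.
graph_files = ["star_0.pkl", "star_1.pkl", "star_2.pkl", "star_3.pkl"]
labels = [0, 1, 0, 1]


def summarize(files, cluster_labels):
    """Hypothetical helper: group file names under their cluster name."""
    clusters = defaultdict(list)
    for name, label in zip(files, cluster_labels):
        clusters[f"cluster_{label}"].append(name)
    return dict(clusters)


summary = summarize(graph_files, labels)
for cluster_name, members in summary.items():
    print(cluster_name, members)
```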

writeParams_toYaml(YAML_PATH=None)

Write the parameters of the Galaxy instance to a YAML file.

Parameters: YAML_PATH (str, optional) – The path to the YAML file. If not provided, the YAML_PATH attribute of the instance will be used.
Return type: None