Outer System

The thema.multiverse.system.outer module provides essential functionality for managing and exploring high-dimensional data through projection algorithms. At the core of this module is the COMET class, which serves as a base template for projection algorithms, enforcing a structured approach to data management and projection. This enables a universal procedure for generating projection objects.

The Oort class, a key component of the system.outer module, generates a space of projected representations of an original, high-dimensional dataset. While navigating this space of projections can be challenging, our tools facilitate easy exploration and interpretation of the data.

Comet Class:

A base class template for projection algorithms, enforcing structure on data management and projection.

Projectiles:
Support for creating Comet subclasses. Thema currently supports three projection methods:
  • Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

  • T-distributed Stochastic Neighbor Embedding (t-SNE)

  • Principal Component Analysis (PCA)

Oort Class:

Generates a space of projected representations of high-dimensional datasets, aiding in data exploration.

Usage

  • Creating and Managing Projections: Use the Projectiles classes to create and manage a universe of different unsupervised projections of your data.

  • Unlocking the Multiverse of Representations: Use Oort to handle the space of multiple data projections.

Comet Base Class

class thema.multiverse.system.outer.comet.Comet(data_path: str, clean_path: str)[source]

Bases: Core

Collapse or Modify Existing Tabulars

A COMET is a base class template for projection (dimensionality reduction) algorithms. As a parent class, Comet enforces structure on data management and projection, enabling a ‘universal’ procedure for generating these objects.

Members

data : pd.DataFrame

a pandas dataframe of raw data

clean : pd.DataFrame

a pandas dataframe of complete, encoded, and scaled data

Functions

save()

saves Comet to .pkl serialized object file

See also

docs

see for more information on implementing a realization of Comet

Examples

>>> from thema.multiverse.system.outer import Comet
>>> class PCA(Comet):
...     def fit(self):
...         pass
>>> pca = PCA(data_path='data.csv', clean_path='clean.csv')
>>> pca.fit()

abstract fit()[source]

Abstract method to be implemented by Comet’s child.

Notes

Method must initialize the projectionArray member.

Raises:

NotImplementedError – If the method is not implemented by the child class.
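
The contract described above (a parent that fixes the constructor and leaves fit() to the child, which must set projectionArray) can be sketched with the standard library alone. SketchComet and IdentityProj are hypothetical stand-ins written for illustration; they are not thema's code, and the real Comet also loads data and clean as DataFrames.

```python
from abc import ABC, abstractmethod


class SketchComet(ABC):
    """Hypothetical stand-in for a Comet-like base class (not thema's code)."""

    def __init__(self, data_path, clean_path):
        self.data_path = data_path
        self.clean_path = clean_path
        self.projectionArray = None  # must be set by fit() in a subclass

    @abstractmethod
    def fit(self):
        """Subclasses must populate self.projectionArray."""
        raise NotImplementedError


class IdentityProj(SketchComet):
    """Toy subclass: the 'projection' keeps the first two columns of each row."""

    def __init__(self, data_path, clean_path, rows):
        super().__init__(data_path, clean_path)
        self.rows = rows

    def fit(self):
        self.projectionArray = [row[:2] for row in self.rows]


proj = IdentityProj("data.csv", "clean.csv", rows=[[1, 2, 3], [4, 5, 6]])
proj.fit()
print(proj.projectionArray)  # [[1, 2], [4, 5]]
```

With abc.ABC, a subclass that forgets to define fit() fails at instantiation with a TypeError, so the NotImplementedError in the base body is only reached if a child explicitly calls super().fit().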

save(file_path)[source]

Save the current object instance to a file using pickle serialization.

Parameters:

file_path (str) – The path to the file where the object will be saved.

Raises:

Exception – If the file cannot be saved.

Examples

>>> from thema.multiverse.system.outer import Comet
>>> class PCA(Comet):
...     def fit(self):
...         pass
>>> pca = PCA(data_path='data.csv', clean_path='clean.csv')
>>> pca.fit()
>>> pca.save('pca.pkl')
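
Because save() writes a pickle file, a saved object can be read back with pickle.load. The following is a minimal round-trip sketch; TinyProjection and load are hypothetical names used for illustration, and the error handling is an assumption about how "cannot be saved" might be reported.

```python
import os
import pickle
import tempfile


class TinyProjection:
    """Hypothetical object with a pickle-based save(), mirroring the description above."""

    def __init__(self, dimensions, seed):
        self.dimensions = dimensions
        self.seed = seed
        self.projectionArray = [[0.0] * dimensions]

    def save(self, file_path):
        # Serialize the whole instance; re-raise any failure with some context.
        try:
            with open(file_path, "wb") as f:
                pickle.dump(self, f)
        except Exception as exc:
            raise Exception(f"Could not save to {file_path}") from exc


def load(file_path):
    # Only unpickle files you created yourself: pickle can execute arbitrary code.
    with open(file_path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.pkl")
    TinyProjection(dimensions=2, seed=42).save(path)
    restored = load(path)

print(restored.dimensions, restored.seed)  # 2 42
```

Unpickling requires the class definition to be importable under the same name, so a saved projection should be loaded in an environment where the corresponding class is available.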

Projectiles

Create and manage the universe of different unsupervised projections of your data. We have decided to support three standard dimensionality reduction methods:

  • Uniform Manifold Approximation and Projection for Dimension Reduction: UMAP

  • T-distributed Stochastic Neighbor Embedding: t-SNE

  • Principal Component Analysis: PCA

Hint

  • An interactive overview of the key differences between UMAP and t-SNE projections: UMAP vs. t-SNE

UMAP

thema.multiverse.system.outer.projectiles.umapProj.initialize()[source]

Returns the umapProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.

Returns:

umapProj – The UMAP projectile object.

Return type:

object
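
Returning the class itself, rather than an instance, lets a caller pick a projectile by name and supply the constructor arguments later. The sketch below illustrates that pattern only; ToyProj, REGISTRY, and build are hypothetical, and how thema actually locates each module's initialize() is not shown in this page.

```python
class ToyProj:
    """Hypothetical projectile with a keyword-style constructor."""

    def __init__(self, dimensions, seed):
        self.dimensions = dimensions
        self.seed = seed


def initialize():
    # Return the class object (not an instance); the caller chooses the arguments.
    return ToyProj


# A caller that only knows a name can look up the factory and build an instance.
REGISTRY = {"toy": initialize}


def build(name, **kwargs):
    cls = REGISTRY[name]()
    return cls(**kwargs)


proj = build("toy", dimensions=2, seed=42)
print(type(proj).__name__, proj.dimensions, proj.seed)  # ToyProj 2 42
```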

class thema.multiverse.system.outer.projectiles.umapProj.umapProj(data_path, clean_path, nn, minDist, dimensions, seed)[source]

Bases: Comet

UMAP Projectile Class.

Inherits from Comet.

Projects data into lower dimensional space using the Uniform Manifold Approximation and Projection. See: https://umap-learn.readthedocs.io/en/latest/

Parameters:
  • data_path (str) – A path to the raw data file.

  • clean_path (str) – A path to a configured Moon object file.

  • nn (int) – The number of nearest neighbors for the UMAP algorithm.

  • minDist (float) – The minimum distance allowed between points in the embedding (UMAP’s min_dist).

  • dimensions (int) – The number of dimensions for the embedding.

  • seed (int) – The seed for randomization.

data

A pandas dataframe of raw data.

Type:

pd.DataFrame

clean

A pandas dataframe of complete, encoded, and scaled data.

Type:

pd.DataFrame

projectionArray

A projection array.

Type:

np.array

nn

Number of nearest neighbors.

Type:

int

minDist

Minimum distance allowed between points in the embedding (UMAP’s min_dist).

Type:

float

dimensions

Number of dimensions for the embedding.

Type:

int

seed

Seed for randomization.

Type:

int

fit()[source]

Fits a UMAP projection from given parameters and saves to projectionArray.

save()

Saves umapProj to .pkl serialized object file.

fit()[source]

Performs a UMAP projection based on the configuration parameters.

Returns:

Initializes projectionArray member.

Return type:

None

t-SNE

thema.multiverse.system.outer.projectiles.tsneProj.initialize()[source]

Returns the tsneProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.

Returns:

tsneProj – The t-SNE projectile object.

Return type:

object

class thema.multiverse.system.outer.projectiles.tsneProj.tsneProj(data_path, clean_path, perplexity, dimensions, seed)[source]

Bases: Comet

t-SNE Projectile Class.

Inherits from Comet.

Projects data into lower dimensional space using the T-distributed Stochastic Neighbor Embedding. See: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Members

data : pd.DataFrame

A pandas dataframe of raw data.

clean : pd.DataFrame

A pandas dataframe of complete, encoded, and scaled data.

projectionArray : np.array

A projection array.

perplexity : int

A tsne configuration parameter.

dimensions : int

Number of dimensions for the embedding.

seed : int

Seed for randomization.

Functions

fit()

Fits a tsne projection from given parameters and saves to projectionArray.

save()

Saves tsneProj to .pkl serialized object file.

fit()[source]

Performs a TSNE projection based on the configuration parameters.

Returns:

Initializes projectionArray member.

Return type:

None

PCA

thema.multiverse.system.outer.projectiles.pcaProj.initialize()[source]

Returns the pcaProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.

Returns:

pcaProj – The PCA projectile object.

Return type:

object

class thema.multiverse.system.outer.projectiles.pcaProj.pcaProj(data_path, clean_path, dimensions, seed)[source]

Bases: Comet

PCA Projectile Class.

Inherits from Comet.

Projects data into lower dimensional space using sklearn’s PCA Projection. See: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

data

A pandas dataframe of raw data.

Type:

pd.DataFrame

clean

A pandas dataframe of complete, encoded, and scaled data.

Type:

pd.DataFrame

projectionArray

A projection array.

Type:

np.array

dimensions

Number of dimensions for the embedding.

Type:

int

seed

Seed for randomization.

Type:

int

__init__(self, data_path, clean_path, dimensions, seed)[source]

Constructs a pcaProj instance.

fit(self)[source]

Performs a PCA projection based on the configuration parameters.

save(self)

Saves pcaProj to .pkl serialized object file.

fit()[source]

Performs a PCA projection based on the configuration parameters.

Returns:

Initializes projectionArray member.

Return type:

None

Oort Class

class thema.multiverse.system.outer.oort.Oort(params=None, data=None, cleanDir=None, outDir=None, YAML_PATH=None, verbose=False)[source]

Bases: Core

The space of COMET objects.

The Oort cloud, sometimes called the Öpik–Oort cloud, is theorized to be a vast cloud of icy planetesimals surrounding the Sun at distances ranging from 2,000 to 200,000 AU.

Our Oort class generates a space of projected representations of an original, high-dimensional dataset. Though it can sometimes be difficult to see through the cloud of projections, our tools allow you to navigate this terrain and explore your data.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame of raw data.

  • params (dict, optional) – A parameter dictionary. Default is None.

  • cleanDir (str, optional) – Path to the clean data directory. Default is None.

  • outDir (str, optional) – Path to the out data directory. Default is None.

  • YAML_PATH (str, optional) – Path to the YAML parameter file. Default is None.

data

A pandas DataFrame of raw data.

Type:

pd.DataFrame

params

A parameter dictionary.

Type:

dict

cleanDir

Path to the clean data directory.

Type:

str

outDir

Path to the out data directory.

Type:

str

YAML_PATH

Path to the YAML parameter file.

Type:

str

get_data_path() → str

Returns the path to the raw data file.

fit() → None[source]

Fits projection space.

save(file_path: str) → None[source]

Saves object as a pickle file.

getParams() → dict[source]

Returns a dictionary of parameters.

writeParams_toYaml(YAML_PATH: str) → None[source]

Writes out the specified parameters to a YAML file.

Examples

>>> cleanDir = "<PATH TO MOON OBJECT FILES>"
>>> data = "<PATH TO RAW DATA FILE>"
>>> outDir = "<PATH TO OUT DIRECTORY OF PROJECTIONS>"
>>> params = {
...     "umap" : {
...         "nn" : [2, 5, 10],
...         "minDist" : [0.1, 0.5],
...         "dimensions" : [2],
...         "seed" : [42]
...     }
... }
>>> oort = Oort(
...     params=params,
...     data=data,
...     cleanDir=cleanDir,
...     outDir=outDir,
...     YAML_PATH=None
... )
>>> oort.fit()

Note

oort.fit() will produce 6 * len(os.listdir(cleanDir)) files in outDir in this example.
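
The factor of 6 is consistent with taking every combination of the listed values: 3 values of nn × 2 values of minDist × 1 × 1. A minimal sketch of that expansion follows, assuming a plain Cartesian product; expand_grid is a hypothetical helper, not part of thema.

```python
from itertools import product

params = {
    "umap": {
        "nn": [2, 5, 10],
        "minDist": [0.1, 0.5],
        "dimensions": [2],
        "seed": [42],
    }
}


def expand_grid(grid):
    # One dict per combination of the listed values (Cartesian product).
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = expand_grid(params["umap"])
print(len(configs))  # 6
print(configs[0])    # {'nn': 2, 'minDist': 0.1, 'dimensions': 2, 'seed': 42}
```

Each of these six configurations is then applied to every file in cleanDir, which gives the 6 * len(os.listdir(cleanDir)) count in the note.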

fit()[source]

Configure and run your projections.

Uses ProcessPoolExecutor from Python’s concurrent.futures module to spawn multiple projectile instances and fit them in parallel.

Returns:

Saves projections to the specified outDir

Return type:

None
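
The general shape of fitting many configurations with concurrent.futures.ProcessPoolExecutor is sketched below. fit_one and fit_all are hypothetical placeholders: in this sketch a worker only builds a label, whereas the real workers would construct a projectile, call fit(), and save the result to outDir.

```python
from concurrent.futures import ProcessPoolExecutor


def fit_one(config):
    # Placeholder for "build a projectile from config, fit it, save it".
    # Returns a label only, so the sketch stays dependency-free.
    return "umap_nn{nn}_minDist{minDist}".format(**config)


def fit_all(configs, max_workers=2):
    # Each config is handled in a worker process; map() preserves input order.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, configs))


if __name__ == "__main__":
    configs = [{"nn": nn, "minDist": 0.1} for nn in (2, 5, 10)]
    print(fit_all(configs))
```

The worker function is defined at module level and the pool is started under the __main__ guard because worker processes need to import and pickle the function they run.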

Examples

>>> oort = Oort()
>>> oort.fit()

getParams()[source]

Get the parameters used to initialize the space of Comets in this Oort.

Returns:

A dictionary containing the parameters used to initialize an Oort instance.

Return type:

dict

Examples

>>> oort = Oort()
>>> params = oort.getParams()
>>> print(params)
{
    "params": {...},  # dictionary containing the parameters used to initialize the Oort instance
    "data": "/path/to/data",  # path to the data
    "cleanDir": "/path/to/clean",  # path to the clean data directory
    "outDir": "/path/to/output"  # path to the output directory
}

save(file_path)[source]

Save the current object instance to a file using pickle serialization.

Parameters:

file_path (str) – The path to the file where the object will be saved.

Raises:

IOError – If there is an error while saving the object to the file.

Examples

>>> obj = MyClass()
>>> obj.save("data.pkl")  # Save the object to a file named "data.pkl"

writeParams_toYaml(YAML_PATH=None)[source]

Write out the specified parameters to a YAML type file.

Parameters:

YAML_PATH (str (filepath), optional) – The path to an existing .yaml type file. If not provided, the value of self.YAML_PATH will be used. If self.YAML_PATH is also None, a ValueError will be raised.

Returns:

Saves a yaml file to the specified YAML_PATH.

Return type:

None

Raises:
  • ValueError – If YAML_PATH is None and self.YAML_PATH is also None.

  • TypeError – If the file path specified by YAML_PATH does not point to a YAML file.

Examples

Example usage of writeParams_toYaml:

>>> oort = Oort()
>>> oort.writeParams_toYaml('/path/to/params.yaml')
YAML file successfully updated
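
The fallback and error rules described for YAML_PATH can be sketched as a small path-resolution helper. resolve_yaml_path is hypothetical, and checking the file extension is only an assumption about how "does not point to a YAML file" might be decided; the actual check in thema may differ.

```python
import os


def resolve_yaml_path(yaml_path=None, default_path=None):
    # Prefer the explicit argument, then fall back to the instance default.
    path = yaml_path if yaml_path is not None else default_path
    if path is None:
        raise ValueError("No YAML path given and no default YAML_PATH set.")
    # Assumption: a YAML file is recognized by its extension.
    if os.path.splitext(path)[1].lower() not in (".yaml", ".yml"):
        raise TypeError(f"{path!r} does not look like a YAML file.")
    return path


print(resolve_yaml_path("/path/to/params.yaml"))        # /path/to/params.yaml
print(resolve_yaml_path(None, "/path/to/default.yml"))  # /path/to/default.yml
```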