Outer System¶
The thema.multiverse.system.outer module provides functionality for managing and exploring high-dimensional data through projection algorithms. At its core is the Comet class, a base template for projection algorithms that enforces a structured approach to data management and projection. This enables a universal procedure for generating projection objects.
The Oort class, a key component of the system.outer module, generates a space of projected representations of an original, high-dimensional dataset. While navigating this space of projections can be challenging, our tools facilitate exploration and interpretation of the data.
- Comet Class:
A base class template for projection algorithms, enforcing structure on data management and projection.
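- Projectiles:
Create and manage a universe of different unsupervised projections of your data (UMAP, t-SNE, and PCA).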
- Oort Class:
Generates a space of projected representations of high-dimensional datasets, aiding in data exploration.
Usage¶
Creating and Managing Projections: Use the Projectiles classes to create and manage a universe of different unsupervised projections of your data.
Unlocking the Multiverse of Representations: Use Oort to handle the space of multiple data projections.
Comet Base Class¶
- class thema.multiverse.system.outer.comet.Comet(data_path: str, clean_path: str)[source]¶
Bases:
Core
Collapse or Modify Existing Tabulars¶
A COMET is a base class template for projection (dimensionality reduction) algorithms. As a parent class, Comet enforces structure on data management and projection, enabling a ‘universal’ procedure for generating these objects.
Members¶
- data : pd.DataFrame
a pandas dataframe of raw data
- clean : pd.DataFrame
a pandas dataframe of complete, encoded, and scaled data
Functions¶
- save()
saves Comet to .pkl serialized object file
See also
See the docs for more information on implementing a realization of Comet.
Examples
>>> from thema.multiverse.system.outer import Comet
>>> class PCA(Comet):
...     def fit(self):
...         pass
>>> pca = PCA(data_path='data.csv', clean_path='clean.csv')
>>> pca.fit()
- abstract fit()[source]¶
Abstract method to be implemented by Comet’s child.
Notes
Method must initialize the projectionArray member.
- Raises:
NotImplementedError – If the method is not implemented by the child class.
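The contract described above (a base `fit()` that raises, and a child `fit()` that must populate `projectionArray`) can be sketched without thema installed. `CometSketch` and `FirstTwoColumns` are hypothetical stand-ins invented for illustration; they are not thema classes, and the real Comet may enforce the contract differently.

```python
class CometSketch:
    """Hypothetical stand-in for Comet; not the thema implementation."""

    def __init__(self):
        self.projectionArray = None  # to be filled in by a child's fit()

    def fit(self):
        # The base class has no projection algorithm of its own.
        raise NotImplementedError("child classes must implement fit()")


class FirstTwoColumns(CometSketch):
    """Hypothetical child that 'projects' by keeping the first two columns."""

    def __init__(self, rows):
        super().__init__()
        self.rows = rows

    def fit(self):
        # fit() must initialize projectionArray, per the note above.
        self.projectionArray = [row[:2] for row in self.rows]


proj = FirstTwoColumns([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
proj.fit()
print(proj.projectionArray)
```

Calling `fit()` on the base stand-in raises `NotImplementedError`, which mirrors the behavior documented in the Raises entry.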
- save(file_path)[source]¶
Save the current object instance to a file using pickle serialization.
- Parameters:
file_path (str) – The path to the file where the object will be saved.
- Raises:
Exception – If the file cannot be saved.
Examples
>>> from thema.multiverse.system.outer import Comet
>>> class PCA(Comet):
...     def fit(self):
...         pass
>>> pca = PCA(data_path='data.csv', clean_path='clean.csv')
>>> pca.fit()
>>> pca.save('pca.pkl')
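Since `save()` is documented as pickle serialization, a saved projection can presumably be read back with the standard `pickle` module. The round trip below is a minimal sketch under that assumption; `ProjectionSketch` is a hypothetical stand-in, not a thema class, and its `save()` only imitates the documented behavior.

```python
import os
import pickle
import tempfile


class ProjectionSketch:
    """Hypothetical stand-in for a fitted Comet child."""

    def __init__(self, projectionArray):
        self.projectionArray = projectionArray

    def save(self, file_path):
        # Serialize the whole instance, attributes included.
        with open(file_path, "wb") as f:
            pickle.dump(self, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pca.pkl")
    ProjectionSketch([[0.1, 0.2], [0.3, 0.4]]).save(path)
    # Loading requires the class definition to be importable at load time.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.projectionArray)
```

As with any pickle file, only load files from sources you trust.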
Projectiles¶
Create and manage the universe of different unsupervised projections of your data. We have decided to support three standard dimensionality reduction methods:
Uniform Manifold Approximation and Projection for Dimension Reduction: UMAP
t-distributed Stochastic Neighbor Embedding: t-SNE
Principal Component Analysis: PCA
Hint
An interactive overview of the key differences between UMAP and t-SNE projections: UMAP vs. t-SNE
UMAP¶
- thema.multiverse.system.outer.projectiles.umapProj.initialize()[source]¶
Returns the umapProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.
- Returns:
umapProj – The UMAP projectile object.
- Return type:
object
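One way to read the `initialize()` hook: it returns the class itself rather than an instance, so calling code can construct any projectile from a name and a parameter dictionary without importing the class directly. The sketch below illustrates that pattern only; `umapSketch`, `INITIALIZERS`, and `build` are hypothetical names, and how thema actually looks up and calls `initialize()` is not shown in this page.

```python
class umapSketch:
    """Hypothetical projectile class; stands in for umapProj."""

    def __init__(self, nn, minDist, dimensions, seed):
        self.nn = nn
        self.minDist = minDist
        self.dimensions = dimensions
        self.seed = seed


def initialize():
    # Return the class object, not an instance.
    return umapSketch


# Hypothetical registry mapping a projector name to its initialize() hook.
INITIALIZERS = {"umap": initialize}


def build(name, **params):
    cls = INITIALIZERS[name]()  # look up the hook and get the class object
    return cls(**params)        # construct an instance from the parameters


p = build("umap", nn=5, minDist=0.1, dimensions=2, seed=42)
print(type(p).__name__, p.nn)
```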
- class thema.multiverse.system.outer.projectiles.umapProj.umapProj(data_path, clean_path, nn, minDist, dimensions, seed)[source]¶
Bases:
Comet
UMAP Projectile Class.
Inherits from Comet.
Projects data into lower dimensional space using the Uniform Manifold Approximation and Projection. See: https://umap-learn.readthedocs.io/en/latest/
- Parameters:
data_path (str) – A path to the raw data file.
clean_path (str) – A path to a configured Moon object file.
nn (int) – The number of nearest neighbors for the UMAP algorithm.
minDist (float) – The minimum distance between points in the embedding.
dimensions (int) – The number of dimensions for the embedding.
seed (int) – The seed for randomization.
- data¶
A pandas dataframe of raw data.
- Type:
pd.DataFrame
- clean¶
A pandas dataframe of complete, encoded, and scaled data.
- Type:
pd.DataFrame
- projectionArray¶
A projection array.
- Type:
np.array
- nn¶
Number of nearest neighbors.
- Type:
int
- minDist¶
Minimum distance between points in the embedding.
- Type:
float
- dimensions¶
Number of dimensions for the embedding.
- Type:
int
- seed¶
Seed for randomization.
- Type:
int
- save()¶
Saves umapProj to .pkl serialized object file.
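For readers who know umap-learn, the projectile parameters plausibly correspond to the `n_neighbors`, `min_dist`, `n_components`, and `random_state` keywords of `umap.UMAP`. That correspondence is an assumption made for illustration and is not taken from thema's source; the hypothetical helper `umap_kwargs` below only builds the keyword dictionary, so it runs without umap-learn installed.

```python
def umap_kwargs(nn, minDist, dimensions, seed):
    """Translate projectile parameters into umap-learn keyword names.

    The mapping is an assumption for illustration, not thema's code.
    """
    return {
        "n_neighbors": nn,           # size of the local neighborhood
        "min_dist": minDist,         # how tightly points may be packed
        "n_components": dimensions,  # dimensionality of the embedding
        "random_state": seed,        # seed for reproducibility
    }


kwargs = umap_kwargs(nn=10, minDist=0.1, dimensions=2, seed=42)
print(kwargs)
# With umap-learn installed, these could be passed as umap.UMAP(**kwargs).
```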
t-SNE¶
- thema.multiverse.system.outer.projectiles.tsneProj.initialize()[source]¶
Returns the tsneProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.
- Returns:
tsneProj – The t-SNE projectile object.
- Return type:
object
- class thema.multiverse.system.outer.projectiles.tsneProj.tsneProj(data_path, clean_path, perplexity, dimensions, seed)[source]¶
Bases:
Comet
t-SNE Projectile Class.
Inherits from Comet.
Projects data into lower dimensional space using the T-distributed Stochastic Neighbor Embedding. See: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Members¶
- data : pd.DataFrame
A pandas dataframe of raw data.
- clean : pd.DataFrame
A pandas dataframe of complete, encoded, and scaled data.
- projectionArray : np.array
A projection array.
- perplexity : int
The t-SNE perplexity, which is related to the number of nearest neighbors considered for each point.
- dimensions : int
Number of dimensions for the embedding.
- seed : int
Seed for randomization.
Functions¶
- fit()
Fits a t-SNE projection from the given parameters and saves it to projectionArray.
- save()
Saves tsneProj to .pkl serialized object file.
PCA¶
- thema.multiverse.system.outer.projectiles.pcaProj.initialize()[source]¶
Returns the pcaProj class object from module. This is a general method that allows us to initialize arbitrary projectile objects.
- Returns:
pcaProj – The PCA projectile object.
- Return type:
object
- class thema.multiverse.system.outer.projectiles.pcaProj.pcaProj(data_path, clean_path, dimensions, seed)[source]¶
Bases:
Comet
PCA Projectile Class.
Inherits from Comet.
Projects data into lower dimensional space using sklearn’s PCA Projection. See: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- data¶
A pandas dataframe of raw data.
- Type:
pd.DataFrame
- clean¶
A pandas dataframe of complete, encoded, and scaled data.
- Type:
pd.DataFrame
- projectionArray¶
A projection array.
- Type:
np.array
- dimensions¶
Number of dimensions for the embedding.
- Type:
int
- seed¶
Seed for randomization.
- Type:
int
- save(self)¶
Saves pcaProj to .pkl serialized object file.
Oort Class¶
- class thema.multiverse.system.outer.oort.Oort(params=None, data=None, cleanDir=None, outDir=None, YAML_PATH=None, verbose=False)[source]¶
Bases:
Core
The space of COMET objects.¶
The Oort cloud, sometimes called the Öpik–Oort cloud, is theorized to be a vast cloud of icy planetesimals surrounding the Sun at distances ranging from 2,000 to 200,000 AU.
Our Oort class generates a space of projected representations of an original, high-dimensional dataset. Though it can sometimes be difficult to see through the cloud of projections, our tools help you navigate this terrain and explore your data.
- Parameters:
data (pd.DataFrame) – A pandas DataFrame of raw data.
params (dict, optional) – A parameter dictionary. Default is None.
cleanDir (str, optional) – Path to the clean data directory. Default is None.
outDir (str, optional) – Path to the out data directory. Default is None.
YAML_PATH (str, optional) – Path to the YAML parameter file. Default is None.
- data¶
A pandas DataFrame of raw data.
- Type:
pd.DataFrame
- params¶
A parameter dictionary.
- Type:
dict
- cleanDir¶
Path to the clean data directory.
- Type:
str
- outDir¶
Path to the out data directory.
- Type:
str
- YAML_PATH¶
Path to the YAML parameter file.
- Type:
str
- get_data_path() str ¶
Returns the path to the raw data file.
- writeParams_toYaml(YAML_PATH: str) None [source]¶
Writes out the specified parameters to a YAML file.
Examples
>>> from thema.multiverse.system.outer.oort import Oort
>>> cleanDir = "<PATH TO MOON OBJECT FILES>"
>>> data = "<PATH TO RAW DATA FILE>"
>>> outDir = "<PATH TO OUT DIRECTORY OF PROJECTIONS>"
>>> params = {
...     "umap": {
...         "nn": [2, 5, 10],
...         "minDist": [0.1, 0.5],
...         "dimensions": [2],
...         "seed": [42]
...     }
... }
>>> oort = Oort(
...     params=params,
...     data=data,
...     cleanDir=cleanDir,
...     outDir=outDir,
...     YAML_PATH=None
... )
>>> oort.fit()
Note
oort.fit() will produce 6 * len(os.listdir(cleanDir)) files in outDir in this example.
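The factor of 6 in the note is consistent with taking every combination of the listed parameter values: 3 values of nn times 2 values of minDist times 1 each for dimensions and seed. The sketch below reproduces that count with `itertools.product`; `expand` is a hypothetical helper, and treating the grid as a Cartesian product is an assumption inferred from the note rather than from thema's source.

```python
from itertools import product

# The "umap" entry from the example parameter dictionary.
umap_grid = {
    "nn": [2, 5, 10],
    "minDist": [0.1, 0.5],
    "dimensions": [2],
    "seed": [42],
}


def expand(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand(umap_grid)
print(len(combos))  # 3 * 2 * 1 * 1 = 6
print(combos[0])
```

Each combination would then be fitted once per file in cleanDir, which gives the total of 6 * len(os.listdir(cleanDir)) output files.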
- fit()[source]¶
Configure and run your projections.
Uses the ProcessPoolExecutor library to spawn multiple projectile instances and fit them.
- Returns:
Saves projections to the specified outDir
- Return type:
None
Examples
>>> oort = Oort()
>>> oort.fit()
- getParams()[source]¶
Get the parameters used to initialize the space of Comets in this Oort.
- Returns:
A dictionary containing the parameters used to initialize an Oort instance.
- Return type:
dict
Examples
>>> oort = Oort()
>>> params = oort.getParams()
>>> print(params)
{
    "params": {...},  # dictionary containing the parameters used to initialize the Oort instance
    "data": "/path/to/data",  # path to the data
    "cleanDir": "/path/to/clean",  # path to the clean data directory
    "outDir": "/path/to/output"  # path to the output directory
}
- save(file_path)[source]¶
Save the current object instance to a file using pickle serialization.
- Parameters:
file_path (str) – The path to the file where the object will be saved.
- Raises:
IOError – If there is an error while saving the object to the file.
Examples
>>> obj = MyClass()
>>> obj.save("data.pkl")  # Save the object to a file named "data.pkl"
- writeParams_toYaml(YAML_PATH=None)[source]¶
Write out the specified parameters to a YAML type file.
- Parameters:
YAML_PATH (str (filepath), optional) – The path to an existing .yaml type file. If not provided, the value of self.YAML_PATH will be used. If self.YAML_PATH is also None, a ValueError will be raised.
- Returns:
Saves a yaml file to the specified YAML_PATH.
- Return type:
None
- Raises:
ValueError – If YAML_PATH is None and self.YAML_PATH is also None.
TypeError – If the file path specified by YAML_PATH does not point to a YAML file.
Examples
Example usage of writeParams_toYaml:
>>> oort = Oort()
>>> oort.writeParams_toYaml('/path/to/params.yaml')
YAML file successfully updated
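The path-resolution rules documented for writeParams_toYaml (fall back to self.YAML_PATH, raise ValueError when both are None, raise TypeError for a non-YAML path) can be sketched as a standalone function. `resolve_yaml_path` is a hypothetical helper, and checking the file extension, including accepting `.yml`, is an assumption about how "points to a YAML file" might be tested.

```python
def resolve_yaml_path(yaml_path=None, default_path=None):
    """Pick the YAML path to write to, following the documented rules."""
    # Prefer the explicit argument, then the instance-level default.
    path = yaml_path if yaml_path is not None else default_path
    if path is None:
        raise ValueError("no YAML path given and no default is set")
    # Assumed check: decide by file extension only.
    if not path.lower().endswith((".yaml", ".yml")):
        raise TypeError(f"{path!r} is not a YAML file")
    return path


print(resolve_yaml_path("/path/to/params.yaml"))
print(resolve_yaml_path(None, default_path="/path/to/default.yaml"))
```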