Inner System¶
Thema’s multiverse.system.inner submodule offers two key classes: Moon and Planet. These classes handle the preprocessing of tabular data, easing the transition from raw datasets to cleaned, imputed, analysis-ready dataframes.
- Moon Class:
This class sits close to the original dataset, focusing on preprocessing steps crucial for downstream analysis. It streamlines data cleaning and aids in the creation of an imputeData dataframe, which is a formatted version of the data suitable for in-depth exploration. The Moon class supports standard sklearn.preprocessing operations such as scaling and encoding, with an emphasis on imputation methods for handling missing values.
- Planet Class:
Operating within the inner system, the Planet class manages the transformation of raw tabular data into processed, scaled, encoded, and complete datasets. Its key function is handling datasets with missing values: it generates Moon imputeData dataframes, using random sampling to fill in these gaps while exploring the distribution of possible missing values.
Both classes are integral to Thema’s data preprocessing capabilities, providing efficient solutions for common data cleaning and imputation tasks.
Moon Class¶
Note
thema.multiverse.system.inner.Moon
handles data preprocessing, moving from raw, tabular datasets to cleaned, scaled, and encoded python-friendly formats.
- class thema.multiverse.system.inner.moon.Moon(data, dropColumns=[], encoding='one_hot', scaler='standard', imputeColumns=[], imputeMethods=[], id=None, seed=None)[source]¶
Bases:
Core
The Moon: Modify, Omit, Oscillate and Normalize.¶
The Moon data class resides cosmically near to the original raw dataset. This class handles a multitude of individual preprocessing steps helpful for smooth computation and analysis further down the analysis pipeline.
The intended use of this class is to simplify the cleaning process and automate the production of an imputeData dataframe: a format of the data fit for more expansive exploration.
The Moon class supports standard sklearn.preprocessing measures for scaling and encoding; its primary additive feature is the set of supported imputation methods for filling N/A values.
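For orientation, the scaling and encoding that Moon automates correspond to standard pandas/sklearn-style operations. The following is a minimal sketch of the equivalent manual steps; the column names and values are illustrative only, not part of Thema's API:

```python
import pandas as pd

# Illustrative raw table: one numeric and one categorical column.
df = pd.DataFrame({"temp": [14.0, 22.0, 43.0], "animal": ["cat", "dog", "cat"]})

# "one_hot" encoding of the categorical column (pandas equivalent).
encoded = pd.get_dummies(df, columns=["animal"])

# "standard" scaling of the numeric column: zero mean, unit variance.
encoded["temp"] = (encoded["temp"] - encoded["temp"].mean()) / encoded["temp"].std(ddof=0)

print(sorted(encoded.columns))  # ['animal_cat', 'animal_dog', 'temp']
```

Moon bundles these steps (plus imputation) behind a single fit() call.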
- data¶
A pandas dataframe of raw data.
- Type:
pd.DataFrame
- imputeData¶
A pandas dataframe of complete, encoded, and scaled data.
- Type:
pd.DataFrame
- encoding¶
A list of encoding methods used for categorical variables.
- Type:
list
- scaler¶
The scaling method used.
- Type:
str
- dropColumns¶
A list of columns dropped from the raw data.
- Type:
list
- imputeColumns¶
A list of columns with missing values.
- Type:
list
- imputeMethods¶
A list of imputation methods used to fill missing values.
- Type:
list
- seed¶
The random seed used.
- Type:
int
- outDir¶
The path to the output data directory.
- Type:
str
Examples
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"],
...                      "B": ["cat", "dog", None],
...                      "C": [14, 22, 43]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> moon = Moon(data=data_path,
...             dropColumns=["A"],
...             encoding=["one_hot"],
...             scaler="standard",
...             imputeColumns=["B"],
...             imputeMethods=["mode"])
>>> moon.fit()
>>> moon.imputeData.to_pickle("myCleanData")
Planet Class¶
Note
thema.multiverse.system.inner.Planet
handles data transformation, managing and exploring the distribution of possible missing values.
- class thema.multiverse.system.inner.planet.Planet(data=None, outDir=None, scaler: str = 'standard', encoding: str = 'one_hot', dropColumns=None, imputeMethods=None, imputeColumns=None, numSamples: int = 1, seeds: list = [42], verbose: bool = False, YAML_PATH=None)[source]¶
Bases:
Core
Perturb, Label And Navigate Existing Tabulars:
Plan It. Planet!
The Planet class lives in the inner system and handles the transition from raw tabular data to scaled, encoded, and complete data. Specifically, this class is designed to handle datasets with missing values: it fills them with randomly sampled data, exploring the distribution of possible missing values.
- Parameters:
data (pd.Dataframe, optional) – A pandas dataframe of raw data. Default is None.
outDir (str, optional) – The directory path where the processed data will be saved. Default is None.
scaler (str, optional) – The method used for scaling the data. Default is “standard”.
encoding (str or list, optional) –
The method used for encoding categorical variables. Default is “one_hot” for all categorical variables.
Accepted values:
- “one_hot”
- “integer”
- “hash”
dropColumns (list, optional) – A list of columns to be dropped from the data. Default is None.
imputeMethods (list, str, or None, optional) –
The imputation method(s) used to fill missing values. Default is None.
NOTE: this parameter can take multiple types.
Behavior:
- imputeMethods: list
Iterates over all imputation methods contained in the list and creates datasets imputed with each selected method.
- imputeMethods: str (e.g. “sampleNormal”)
Uses the single named method.
- imputeMethods: None
Defaults to dropping columns with missing values rather than imputing them.
Accepted values:
- “sampleNormal”
- “drop”
- “mean”
- “median”
- “mode”
imputeColumns (list, str, or None, optional) –
The columns to be imputed. Default is None.
NOTE: this parameter can take multiple types.
Behavior:
- imputeColumns: list
Imputes only the data columns named in the list.
- imputeColumns: “all”
Imputes every column with missing values per the specified imputeMethods. NOTE: no other string value is accepted.
- imputeColumns: None
Drops all columns with missing values (any imputeMethods setting is ignored in this case).
numSamples (int, optional) – The number of samples to generate. Default is 1.
seeds (list, optional) – A list of random seeds to use for reproducibility. Default is [42].
verbose (bool, optional) – Whether to print progress messages. Default is False.
YAML_PATH (str, optional) – The path to a YAML file containing configuration settings. Default is None.
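To picture how numSamples and seeds interact, consider this small sketch (plain numpy/pandas, not Thema's implementation): each sample is one complete copy of a column, its gap filled by a seeded draw from a normal fit to the observed values.

```python
import numpy as np
import pandas as pd

# A numeric column with one missing value.
column = pd.Series([1.0, 2.0, np.nan, 4.0])

# Each (sample, seed) pair yields one complete, slightly different copy.
num_samples, seeds = 3, [42, 7, 99]
samples = []
for i in range(num_samples):
    rng = np.random.default_rng(seeds[i % len(seeds)])
    fill = rng.normal(column.mean(), column.std())
    samples.append(column.fillna(fill))

# Three plausible imputations of the same column, none with gaps left.
assert all(s.isna().sum() == 0 for s in samples)
```

Varying the seed per sample is what lets Planet explore the distribution of possible missing values rather than committing to one guess.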
- data¶
A pandas dataframe of raw data.
- Type:
pd.Dataframe
- encoding¶
The method used for encoding categorical variables.
- Type:
str or list
- scaler¶
The method used for scaling the data.
- Type:
str
- dropColumns¶
A list of columns dropped from the raw data.
- Type:
list
- imputeColumns¶
A list of columns to impute.
- Type:
list
- imputeMethods¶
The methodology used to impute columns.
- Type:
list
- numSamples¶
The number of clean data frames produced when imputing.
- Type:
int
- seeds¶
A list of random seeds.
- Type:
list
- outDir¶
The path to the output data directory.
- Type:
str
- YAML_PATH¶
The path to the YAML parameter file.
- Type:
str
- get_data_path() str ¶
Returns the path to the raw data file.
- get_recommended_sampling_method() list ¶
Returns the recommended sampling methods for a dataset's columns with missing values.
Example
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"], "B": ["cat", "dog", None], "C": [14, 22, None]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> planet = Planet(data=data_path,
...                 outDir="/<PATH TO OUT DIRECTORY>",
...                 scaler="standard",
...                 encoding="one_hot",
...                 dropColumns=None,
...                 imputeMethods="sampleNormal",
...                 imputeColumns="all")
>>> planet.fit()
>>> planet.imputeData.to_pickle("myCleanData")
- fit()[source]¶
The meat and potatoes: configure and run your Planet object based on the specified parameters.
Uses concurrent.futures.ProcessPoolExecutor to spawn multiple processes and generate results in a time-efficient manner.
- Returns:
Saves numSamples of files (cleaned, imputed, scaled etc. data) to the specified outDir.
- Return type:
None
Examples
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"], "B": ["cat", "dog", None], "C": [14, 22, None]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> planet = Planet(data=data_path,
...                 outDir="<PATH TO OUT DIRECTORY>",
...                 scaler="standard",
...                 encoding="one_hot",
...                 dropColumns=None,
...                 imputeMethods="sampleNormal",
...                 imputeColumns="all")
>>> planet.fit()
>>> planet.imputeData.to_pickle("myCleanData")
- getParams() dict [source]¶
Get the parameters used to initialize the space of Moons around this Planet.
- Returns:
A dictionary containing the parameters used to initialize this specific Planet instance.
- Return type:
dict
Examples
>>> planet = Planet()
>>> params = planet.getParams()
>>> print(params)
{'data': None, 'scaler': 'standard', 'encoding': 'one_hot', 'dropColumns': None, 'imputeColumns': None, 'imputeMethods': None, 'numSamples': 1, 'seeds': [42], 'outDir': None}
- get_missingData_summary() dict [source]¶
Get a summary of missing data in the columns of the ‘data’ dataframe.
- Returns:
summary – A dictionary containing a breakdown of columns from ‘data’ that are:
- ‘numericMissing’: numeric columns with missing values
- ‘numericComplete’: numeric columns without missing values
- ‘categoricalMissing’: categorical columns with missing values
- ‘categoricalComplete’: categorical columns without missing values
- Return type:
dict
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> summary = planet.get_missingData_summary()
>>> print(summary)
{'numericMissing': ['A', 'B'], 'numericComplete': [], 'categoricalMissing': ['C'], 'categoricalComplete': []}
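The summary above can be reproduced with plain pandas bookkeeping; this sketch illustrates the computation and is not the library's actual code:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})

# Numeric vs categorical split, then a missing/complete split of each.
numeric = df.select_dtypes(include="number").columns
summary = {
    "numericMissing": [c for c in numeric if df[c].isna().any()],
    "numericComplete": [c for c in numeric if not df[c].isna().any()],
    "categoricalMissing": [c for c in df.columns if c not in numeric and df[c].isna().any()],
    "categoricalComplete": [c for c in df.columns if c not in numeric and not df[c].isna().any()],
}
print(summary)
```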
- get_na_as_list() list [source]¶
Get a list of columns that contain NaN values.
- Returns:
A list of column names that contain NaN values.
- Return type:
list of str
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> na_columns = planet.get_na_as_list()
>>> print(na_columns)
['A', 'B', 'C']
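The same list reduces to a one-line pandas idiom, shown here to illustrate what the method computes:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})

# Names of columns containing at least one NaN.
na_columns = df.columns[df.isna().any()].tolist()
print(na_columns)  # ['A', 'B', 'C']
```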
- get_recommended_sampling_method() list [source]¶
Get a recommended sampling method for columns with missing values.
- Returns:
A list of recommended sampling methods for columns with missing values. For numeric columns, “sampleNormal” is recommended. For non-numeric columns, “sampleCategorical” (most frequent value) is recommended.
- Return type:
list
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> methods = planet.get_recommended_sampling_method()
>>> print(methods)
['sampleNormal', 'sampleNormal', 'sampleCategorical']
Inner System Utils¶
- thema.multiverse.system.inner.inner_utils.add_imputed_flags(df, impute_columns)[source]¶
Add an indicator flag column for each specified column, marking the values that are NA.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
impute_columns (list of str) – The list of column names to add imputed flags for.
- Returns:
The DataFrame with added imputed flags.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 4, 5]})
>>> impute_columns = ['A', 'B']
>>> add_imputed_flags(df, impute_columns)
     A    B  impute_A  impute_B
0  1.0  NaN         0         1
1  2.0  4.0         0         0
2  NaN  5.0         1         0
- thema.multiverse.system.inner.inner_utils.clean_data_filename(data_name, scaler=None, encoding=None, id=None)[source]¶
Generate a filename for the cleaned and preprocessed data.
- Parameters:
data_name (str) – The name of the raw dataframe.
scaler (object, optional) – A scaler object used for scaling the data. If None, the filename will not contain a scaler label. Default is None.
encoding (str, optional) – The encoding type used for categorical variables. If None, the filename will not contain an encoding label. Default is None.
id (int, optional) – An identifier appended to the filename. Default is None.
- Returns:
A filename for the clean data.
- Return type:
str
Examples
>>> clean_data_filename("raw_data", scaler=None, encoding="integer")
'raw_data_integer_imputed.pkl'
>>> clean_data_filename("raw_data", scaler="standard", encoding="one-hot")
'raw_data_standard_one-hot_imputed.pkl'
- thema.multiverse.system.inner.inner_utils.clear_previous_imputations(dir, key)[source]¶
Clear all files in the directory that contain the specified key.
- Parameters:
dir (str) – The directory path where the files are located.
key (str) – The key to search for in the file names.
Examples
>>> clear_previous_imputations('/path/to/directory', 'imputation')
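A helper like this typically reduces to scanning the directory and removing matching files. Here is a minimal sketch under that assumption (clear_files_matching is a hypothetical name, not Thema's function):

```python
import os
import tempfile

def clear_files_matching(directory: str, key: str) -> int:
    """Delete every file in `directory` whose name contains `key`;
    return how many files were removed."""
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if key in name and os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed

# Demo in a throwaway directory.
tmp = tempfile.mkdtemp()
for name in ["imputation_0.pkl", "imputation_1.pkl", "notes.txt"]:
    open(os.path.join(tmp, name), "w").close()

count = clear_files_matching(tmp, "imputation")
print(count)  # 2
```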
- thema.multiverse.system.inner.inner_utils.drop(column, seed)[source]¶
Leave the column as is and let NaN values be dropped from it.
- Parameters:
column (array_like) – The input column.
seed (int) – The seed value for random number generation.
- Returns:
The column with the element removed.
- Return type:
array_like
Examples
>>> drop([1, 2, 3, 4, 5], 42)
[1, 2, 4, 5]
>>> drop(['a', 'b', 'c', 'd'], 123)
['a', 'b', 'd']
- thema.multiverse.system.inner.inner_utils.integer_encoder(column_values: array)[source]¶
Encode the given array of categorical values into integers.
- Parameters:
column_values (numpy.ndarray) – An array of categorical values to be encoded.
- Returns:
An array of integers representing the encoded categorical values.
- Return type:
numpy.ndarray
Examples
>>> column_values = np.array(['apple', 'banana', 'apple', 'orange', 'banana'])
>>> integer_encoder(column_values)
array([0, 1, 0, 2, 1])
>>> column_values = np.array(['red', 'green', 'blue', 'red', 'green', 'blue'])
>>> integer_encoder(column_values)
array([0, 1, 2, 0, 1, 2])
- thema.multiverse.system.inner.inner_utils.mean(column, seed)[source]¶
Fill in missing data in a column based on its average.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for random number generation.
- Returns:
The column with missing values filled using the average.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> from numpy import nan
>>> column = pd.Series([1, 2, nan, 4, 5])
>>> mean(column, 42)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.median(column, seed)[source]¶
Fill in missing data in a column based on its median value.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for random number generation.
- Returns:
The column with missing values filled using the median.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> column = pd.Series([1, 2, None, 4, 5])
>>> seed = 42
>>> median(column, seed)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.mode(column, seed)[source]¶
Fill missing values in a column with its mode (most frequent value).
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int or None, optional) – Seed for the random number generator. Default is None.
- Returns:
The column with missing values filled with the mode.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series([1, 2, 2, 3, 3, np.nan])
>>> mode(column, 42)
0    1.0
1    2.0
2    2.0
3    3.0
4    3.0
5    2.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.sampleCategorical(column, seed)[source]¶
Fill in missing data in a column by sampling from the categorical distribution.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for the random number generator.
- Returns:
The column with missing values filled using random samples from the categorical distribution.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series(['apple', 'banana', np.nan, 'orange', np.nan, 'banana'])
>>> seed = 42
>>> sampleCategorical(column, seed)
0     apple
1    banana
2    banana
3    orange
4    banana
5    banana
dtype: object
- thema.multiverse.system.inner.inner_utils.sampleNormal(column, seed)[source]¶
Fill in missing data in a column by sampling from the normal distribution.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for the random number generator.
- Returns:
The column with missing values filled using random samples from the normal distribution.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series([1, 2, np.nan, 4, np.nan, 6])
>>> seed = 42
>>> sampleNormal(column, seed)
0    1.000000
1    2.000000
2    3.336112
3    4.000000
4    5.336112
5    6.000000
dtype: float64
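A normal-sampling imputation of this kind can be sketched in a few lines. This is an illustration of the technique, not the library's exact code, so the specific filled values will differ from the doctest above:

```python
import numpy as np
import pandas as pd

def sample_normal_fill(column: pd.Series, seed: int) -> pd.Series:
    """Fill NaNs with draws from a normal fit to the observed values."""
    rng = np.random.default_rng(seed)
    observed = column.dropna()
    draws = rng.normal(observed.mean(), observed.std(), size=column.isna().sum())
    filled = column.copy()
    filled[filled.isna()] = draws
    return filled

column = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
filled = sample_normal_fill(column, seed=42)

# Observed entries are untouched; only the gaps receive random draws.
assert filled.isna().sum() == 0
```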