Inner System¶
Thema’s multiverse.system.inner submodule offers two key classes: Moon and Planet. These classes handle the preprocessing of tabular data, easing the transition from raw datasets to cleaned, imputed, analysis-ready dataframes.
- Moon Class:
This class sits close to the original dataset, focusing on preprocessing steps crucial for downstream analysis. It streamlines data cleaning and aids in the creation of an imputeData dataframe, which is a formatted version of the data suitable for in-depth exploration. The Moon class supports standard sklearn.preprocessing operations such as scaling and encoding, with an emphasis on imputation methods for handling missing values.
- Planet Class:
Operating within the inner system, the Planet class manages the transformation of raw tabular data into processed, scaled, encoded, and complete datasets. Its key function is handling datasets with missing values: it generates Moon imputeData dataframes, using random sampling to fill in these gaps while exploring the distribution of possible missing values.
Both classes are integral to Thema’s data preprocessing capabilities, providing efficient solutions for common data cleaning and imputation tasks.
Moon Class¶
Note
thema.multiverse.system.inner.Moon
handles data preprocessing, moving from raw, tabular datasets to cleaned, scaled, and encoded python-friendly formats.
- class thema.multiverse.system.inner.moon.Moon(data, dropColumns=[], encoding='one_hot', scaler='standard', imputeColumns=[], imputeMethods=[], id=None, seed=None)[source]¶
Bases:
Core
The Moon: Modify, Omit, Oscillate and Normalize.¶
The Moon data class resides cosmically near to the original raw dataset. This class handles a multitude of individual preprocessing steps helpful for smooth computation and analysis further down the analysis pipeline.
The intended use of this class is to simplify the cleaning process and automate the production of an imputeData dataframe: a format of the data fit for more expansive exploration.
The Moon class supports standard sklearn.preprocessing measures for scaling and encoding; its primary additive feature is the set of supported imputation methods for filling N/A values.
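For orientation, the scaling and encoding that Moon automates correspond to standard pandas/sklearn-style operations. The following is a minimal sketch of the equivalent manual steps; the column names and values are illustrative only, not part of Thema's API:

```python
import pandas as pd

# Illustrative raw table: one numeric and one categorical column.
df = pd.DataFrame({"temp": [14.0, 22.0, 43.0], "animal": ["cat", "dog", "cat"]})

# "one_hot" encoding of the categorical column (pandas equivalent).
encoded = pd.get_dummies(df, columns=["animal"])

# "standard" scaling of the numeric column: zero mean, unit variance.
encoded["temp"] = (encoded["temp"] - encoded["temp"].mean()) / encoded["temp"].std(ddof=0)

print(sorted(encoded.columns))  # ['animal_cat', 'animal_dog', 'temp']
```

Moon bundles these steps (plus imputation) behind a single fit() call.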
- data¶
A pandas dataframe of raw data.
- Type:
pd.DataFrame
- imputeData¶
A pandas dataframe of complete, encoded, and scaled data.
- Type:
pd.DataFrame
- encoding¶
A list of encoding methods used for categorical variables.
- Type:
list
- scaler¶
The scaling method used.
- Type:
str
- dropColumns¶
A list of columns dropped from the raw data.
- Type:
list
- imputeColumns¶
A list of columns with missing values.
- Type:
list
- imputeMethods¶
A list of imputation methods used to fill missing values.
- Type:
list
- seed¶
The random seed used.
- Type:
int
- outDir¶
The path to the output data directory.
- Type:
str
Examples
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"],
...                      "B": ["cat", "dog", None],
...                      "C": [14, 22, 43]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> moon = Moon(data=data_path,
...             dropColumns=["A"],
...             encoding=["one_hot"],
...             scaler="standard",
...             imputeColumns=["B"],
...             imputeMethods=["mode"])
>>> moon.fit()
>>> moon.imputeData.to_pickle("myCleanData")
Planet Class¶
Note
thema.multiverse.system.inner.Planet
handles data transformation, managing and exploring the distribution of possible missing values.
- class thema.multiverse.system.inner.planet.Planet(data=None, outDir=None, scaler: str = 'standard', encoding: str = 'one_hot', dropColumns=None, imputeMethods=None, imputeColumns=None, numSamples: int = 1, seeds: list = [42], verbose: bool = False, YAML_PATH=None)[source]¶
Bases:
Core
Perturb, Label And Navigate Existing Tabulars:
Plan It. Planet!
The Planet class lives in the inner system and handles the transition from raw tabular data to scaled, encoded, and complete data. Specifically, this class is designed to handle datasets with missing values: it fills them with randomly sampled data, exploring the distribution of possible missing values.
- Parameters:
data (pd.Dataframe, optional) – A pandas dataframe of raw data. Default is None.
outDir (str, optional) – The directory path where the processed data will be saved. Default is None.
scaler (str, optional) – The method used for scaling the data. Default is “standard”.
encoding (str or list, optional) –
The method used for encoding categorical variables. Default is “one_hot” for all categorical variables.
Accepted values:
- “one_hot”
- “integer”
- “hash”
dropColumns (list, optional) – A list of columns to be dropped from the data. Default is None.
imputeMethods (list, str, or None, optional) –
The imputation method(s) used to fill missing values. Default is None.
NOTE: this parameter can take multiple types.
Behavior:
- imputeMethods: list
Iterates over all imputation methods contained in the list and creates datasets imputed with each selected method.
- imputeMethods: str (e.g. “sampleNormal”)
Uses the single named method.
- imputeMethods: None
Defaults to dropping columns with missing values rather than imputing them.
Accepted values:
- “sampleNormal”
- “drop”
- “mean”
- “median”
- “mode”
imputeColumns (list, str, or None, optional) –
The columns to be imputed. Default is None.
NOTE: this parameter can take multiple types.
Behavior:
- imputeColumns: list
Imputes only the data columns named in the list.
- imputeColumns: “all”
Imputes every column with missing values per the specified imputeMethods. NOTE: no other string value is accepted.
- imputeColumns: None
Drops all columns with missing values (any imputeMethods setting is ignored in this case).
numSamples (int, optional) – The number of samples to generate. Default is 1.
seeds (list, optional) – A list of random seeds to use for reproducibility. Default is [42].
verbose (bool, optional) – Whether to print progress messages. Default is False.
YAML_PATH (str, optional) – The path to a YAML file containing configuration settings. Default is None.
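To picture how numSamples and seeds interact, consider this small sketch (plain numpy/pandas, not Thema's implementation): each sample is one complete copy of a column, its gap filled by a seeded draw from a normal fit to the observed values.

```python
import numpy as np
import pandas as pd

# A numeric column with one missing value.
column = pd.Series([1.0, 2.0, np.nan, 4.0])

# Each (sample, seed) pair yields one complete, slightly different copy.
num_samples, seeds = 3, [42, 7, 99]
samples = []
for i in range(num_samples):
    rng = np.random.default_rng(seeds[i % len(seeds)])
    fill = rng.normal(column.mean(), column.std())
    samples.append(column.fillna(fill))

# Three plausible imputations of the same column, none with gaps left.
assert all(s.isna().sum() == 0 for s in samples)
```

Varying the seed per sample is what lets Planet explore the distribution of possible missing values rather than committing to one guess.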
- data¶
A pandas dataframe of raw data.
- Type:
pd.Dataframe
- encoding¶
The method used for encoding categorical variables.
- Type:
str or list
- scaler¶
The method used for scaling the data.
- Type:
str
- dropColumns¶
A list of columns dropped from the raw data.
- Type:
list
- imputeColumns¶
A list of columns to impute.
- Type:
list
- imputeMethods¶
The methodology used to impute columns.
- Type:
list
- numSamples¶
The number of clean data frames produced when imputing.
- Type:
int
- seeds¶
A list of random seeds.
- Type:
list
- outDir¶
The path to the output data directory.
- Type:
str
- YAML_PATH¶
The path to the YAML parameter file.
- Type:
str
- get_data_path() str ¶
Returns the path to the raw data file.
- get_recommended_sampling_method() list ¶
Returns the recommended sampling methods for a dataset's columns with missing values.
Example
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"], "B": ["cat", "dog", None], "C": [14, 22, None]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> planet = Planet(data=data_path,
...                 outDir="/<PATH TO OUT DIRECTORY>",
...                 scaler="standard",
...                 encoding="one_hot",
...                 dropColumns=None,
...                 imputeMethods="sampleNormal",
...                 imputeColumns="all")
>>> planet.fit()
>>> planet.imputeData.to_pickle("myCleanData")
- fit()[source]¶
The meat and potatoes: configure and run your Planet object based on the specified parameters.
Uses concurrent.futures.ProcessPoolExecutor to spawn multiple processes and generate results in a time-efficient manner.
- Returns:
Saves numSamples of files (cleaned, imputed, scaled etc. data) to the specified outDir.
- Return type:
None
Examples
>>> data = pd.DataFrame({"A": ["Sally", "Freddy", "Johnny"], "B": ["cat", "dog", None], "C": [14, 22, None]})
>>> data.to_pickle("myRawData.pkl")
>>> data_path = "myRawData.pkl"
>>> planet = Planet(data=data_path,
...                 outDir="<PATH TO OUT DIRECTORY>",
...                 scaler="standard",
...                 encoding="one_hot",
...                 dropColumns=None,
...                 imputeMethods="sampleNormal",
...                 imputeColumns="all")
>>> planet.fit()
>>> planet.imputeData.to_pickle("myCleanData")
- getParams() dict [source]¶
Get the parameters used to initialize the space of Moons around this Planet.
- Returns:
A dictionary containing the parameters used to initialize this specific Planet instance.
- Return type:
dict
Examples
>>> planet = Planet()
>>> params = planet.getParams()
>>> print(params)
{'data': None, 'scaler': 'standard', 'encoding': 'one_hot', 'dropColumns': None, 'imputeColumns': None, 'imputeMethods': None, 'numSamples': 1, 'seeds': [42], 'outDir': None}
- get_missingData_summary() dict [source]¶
Get a summary of missing data in the columns of the ‘data’ dataframe.
- Returns:
summary – A dictionary containing a breakdown of columns from ‘data’ that are:
- ‘numericMissing’: numeric columns with missing values
- ‘numericComplete’: numeric columns without missing values
- ‘categoricalMissing’: categorical columns with missing values
- ‘categoricalComplete’: categorical columns without missing values
- Return type:
dict
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> summary = planet.get_missingData_summary()
>>> print(summary)
{'numericMissing': ['A', 'B'], 'numericComplete': [], 'categoricalMissing': ['C'], 'categoricalComplete': []}
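The summary above can be reproduced with plain pandas bookkeeping; this sketch illustrates the computation and is not the library's actual code:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})

# Numeric vs categorical split, then a missing/complete split of each.
numeric = df.select_dtypes(include="number").columns
summary = {
    "numericMissing": [c for c in numeric if df[c].isna().any()],
    "numericComplete": [c for c in numeric if not df[c].isna().any()],
    "categoricalMissing": [c for c in df.columns if c not in numeric and df[c].isna().any()],
    "categoricalComplete": [c for c in df.columns if c not in numeric and not df[c].isna().any()],
}
print(summary)
```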
- get_na_as_list() list [source]¶
Get a list of columns that contain NaN values.
- Returns:
A list of column names that contain NaN values.
- Return type:
list of str
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> na_columns = planet.get_na_as_list()
>>> print(na_columns)
['A', 'B', 'C']
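The same list reduces to a one-line pandas idiom, shown here to illustrate what the method computes:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})

# Names of columns containing at least one NaN.
na_columns = df.columns[df.isna().any()].tolist()
print(na_columns)  # ['A', 'B', 'C']
```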
- get_recommended_sampling_method() list [source]¶
Get a recommended sampling method for columns with missing values.
- Returns:
A list of recommended sampling methods for columns with missing values. For numeric columns, “sampleNormal” is recommended. For non-numeric columns, “sampleCategorical” (most frequent value) is recommended.
- Return type:
list
Examples
>>> data = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["a", "b", None]})
>>> planet = Planet(data=data)
>>> methods = planet.get_recommended_sampling_method()
>>> print(methods)
['sampleNormal', 'sampleNormal', 'sampleCategorical']
Inner System Utils¶
- thema.multiverse.system.inner.inner_utils.add_imputed_flags(df, impute_columns)[source]¶
Add an indicator flag column for each specified column, marking the values that are NA.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
impute_columns (list of str) – The list of column names to add imputed flags for.
- Returns:
The DataFrame with added imputed flags.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 4, 5]})
>>> impute_columns = ['A', 'B']
>>> add_imputed_flags(df, impute_columns)
     A    B  impute_A  impute_B
0  1.0  NaN         0         1
1  2.0  4.0         0         0
2  NaN  5.0         1         0
- thema.multiverse.system.inner.inner_utils.clean_data_filename(data_name, scaler=None, encoding=None, id=None)[source]¶
Generate a filename for the cleaned and preprocessed data.
- Parameters:
data_name (str) – The name of the raw dataframe.
scaler (object, optional) – A scaler object used for scaling the data. If None, the filename will not contain a scaler label. Default is None.
encoding (str, optional) – The encoding type used for categorical variables. If None, the filename will not contain an encoding label. Default is None.
id (int, optional) – An identifier appended to the filename. Default is None.
- Returns:
A filename for the clean data.
- Return type:
str
Examples
>>> clean_data_filename("raw_data", scaler=None, encoding="integer")
'raw_data_integer_imputed.pkl'
>>> clean_data_filename("raw_data", scaler="standard", encoding="one-hot")
'raw_data_standard_one-hot_imputed.pkl'
- thema.multiverse.system.inner.inner_utils.clear_previous_imputations(dir, key)[source]¶
Clear all files in the directory that contain the specified key.
- Parameters:
dir (str) – The directory path where the files are located.
key (str) – The key to search for in the file names.
Examples
>>> clear_previous_imputations('/path/to/directory', 'imputation')
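A helper like this typically reduces to scanning the directory and removing matching files. Here is a minimal sketch under that assumption (clear_files_matching is a hypothetical name, not Thema's function):

```python
import os
import tempfile

def clear_files_matching(directory: str, key: str) -> int:
    """Delete every file in `directory` whose name contains `key`;
    return how many files were removed."""
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if key in name and os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed

# Demo in a throwaway directory.
tmp = tempfile.mkdtemp()
for name in ["imputation_0.pkl", "imputation_1.pkl", "notes.txt"]:
    open(os.path.join(tmp, name), "w").close()

count = clear_files_matching(tmp, "imputation")
print(count)  # 2
```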
- thema.multiverse.system.inner.inner_utils.drop(column, seed)[source]¶
Leave the column as is and let NaN values be dropped from it.
- Parameters:
column (array_like) – The input column.
seed (int) – The seed value for random number generation.
- Returns:
The column with the element removed.
- Return type:
array_like
Examples
>>> drop([1, 2, 3, 4, 5], 42)
[1, 2, 4, 5]
>>> drop(['a', 'b', 'c', 'd'], 123)
['a', 'b', 'd']
- thema.multiverse.system.inner.inner_utils.integer_encoder(column_values: array)[source]¶
Encode the given array of categorical values into integers.
- Parameters:
column_values (numpy.ndarray) – An array of categorical values to be encoded.
- Returns:
An array of integers representing the encoded categorical values.
- Return type:
numpy.ndarray
Examples
>>> column_values = np.array(['apple', 'banana', 'apple', 'orange', 'banana'])
>>> integer_encoder(column_values)
array([0, 1, 0, 2, 1])
>>> column_values = np.array(['red', 'green', 'blue', 'red', 'green', 'blue'])
>>> integer_encoder(column_values)
array([0, 1, 2, 0, 1, 2])
- thema.multiverse.system.inner.inner_utils.mean(column, seed)[source]¶
Fill in missing data in a column based on its average.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for random number generation.
- Returns:
The column with missing values filled using the average.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> from numpy import nan
>>> column = pd.Series([1, 2, nan, 4, 5])
>>> mean(column, 42)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.median(column, seed)[source]¶
Fill in missing data in a column based on its median value.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for random number generation.
- Returns:
The column with missing values filled using the median.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> column = pd.Series([1, 2, None, 4, 5])
>>> seed = 42
>>> median(column, seed)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.mode(column, seed)[source]¶
Fill missing values in a column with its mode (most frequent value).
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int or None, optional) – Seed for the random number generator. Default is None.
- Returns:
The column with missing values filled with the mode.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series([1, 2, 2, 3, 3, np.nan])
>>> mode(column, 42)
0    1.0
1    2.0
2    2.0
3    3.0
4    3.0
5    2.0
dtype: float64
- thema.multiverse.system.inner.inner_utils.sampleCategorical(column, seed)[source]¶
Fill in missing data in a column by sampling from the categorical distribution.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for the random number generator.
- Returns:
The column with missing values filled using random samples from the categorical distribution.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series(['apple', 'banana', np.nan, 'orange', np.nan, 'banana'])
>>> seed = 42
>>> sampleCategorical(column, seed)
0     apple
1    banana
2    banana
3    orange
4    banana
5    banana
dtype: object
- thema.multiverse.system.inner.inner_utils.sampleNormal(column, seed)[source]¶
Fill in missing data in a column by sampling from the normal distribution.
- Parameters:
column (pandas.Series) – The column containing the data to be filled.
seed (int) – Seed value for the random number generator.
- Returns:
The column with missing values filled using random samples from the normal distribution.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> column = pd.Series([1, 2, np.nan, 4, np.nan, 6])
>>> seed = 42
>>> sampleNormal(column, seed)
0    1.000000
1    2.000000
2    3.336112
3    4.000000
4    5.336112
5    6.000000
dtype: float64
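A normal-sampling imputation of this kind can be sketched in a few lines. This is an illustration of the technique, not the library's exact code, so the specific filled values will differ from the doctest above:

```python
import numpy as np
import pandas as pd

def sample_normal_fill(column: pd.Series, seed: int) -> pd.Series:
    """Fill NaNs with draws from a normal fit to the observed values."""
    rng = np.random.default_rng(seed)
    observed = column.dropna()
    draws = rng.normal(observed.mean(), observed.std(), size=column.isna().sum())
    filled = column.copy()
    filled[filled.isna()] = draws
    return filled

column = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
filled = sample_normal_fill(column, seed=42)

# Observed entries are untouched; only the gaps receive random draws.
assert filled.isna().sum() == 0
```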