Data Preprocessing

Data preprocessing is a crucial step in the data analysis pipeline. It involves cleaning, transforming, and organizing raw data to prepare it for further analysis. This section outlines the steps involved in preprocessing data using Thema.

Manual Data Exploration

Before diving into automated preprocessing, it’s essential to manually explore the data. This step helps you understand the dataset’s structure, identify missing values, detect outliers, and gain insights into the data’s distribution and relationships.
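For instance, a quick first pass with plain pandas (not Thema-specific) might look like this:

import pandas as pd

df = pd.read_pickle("myRawData.pkl")  # the raw data file used later in this guide

df.head()        # eyeball the first rows and column layout
df.info()        # dtypes and non-null counts per column
df.describe()    # summary statistics for numeric columns
df.isna().sum()  # missing-value counts per column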

Thema provides tools to summarize, clean, scale, and encode your data, making it easier to identify patterns and anomalies. These insights can guide your preprocessing decisions and help you choose the right techniques to clean and prepare the data.

Once you’ve completed the manual data exploration and identified the preprocessing steps required, Thema supports the creation of a formatted .yaml parameter file. This file captures the preprocessing steps and their parameters, making it easy to reproduce the preprocessing pipeline.

The parameter file can include instructions for handling missing values, scaling features, encoding categorical variables, and other preprocessing tasks. By creating this file, you can automate the preprocessing pipeline and apply it consistently to new datasets.

For detailed instructions on creating the .yaml parameter file and using it to preprocess your data, refer to the Params YAML Guide. This guide provides examples and best practices for defining preprocessing steps and parameters in Thema.
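As a rough sketch, a params file might look like the following. The key names mirror the getParams() output shown at the end of this walkthrough; refer to the Params YAML Guide for the authoritative schema.

# Hypothetical params.yaml; keys mirror the getParams() output below.
data: myRawData.pkl
scaler: standard
encoding: one_hot
dropColumns: []
imputeColumns: ["Diameter [1000 km]", "Has Moon"]
imputeMethods: [sampleNormal, mode]
numSamples: 1
seeds: [42]
outDir: <YOUR DIRECTORY NAME>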

Walkthrough

Step 1: Data

To demonstrate the data cleaning and preprocessing functionality, let’s create a sample DataFrame.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: data = {
   ...:     'Name': ['Mercury', 'Venus', 'Earth', 'Jupiter', 'Saturn'],
   ...:     'Diameter [1000 km]': [12.7, 121, np.nan, 142.9, 49.5],
   ...:     'Has Moon': ['No', 'No', 'Yes', 'Yes', np.nan],
   ...:     'Distance from Sun [au]': [0.39, 0.72, 1.00, 5.20, 9.58]}
   ...: 

In [3]: df = pd.DataFrame(data)

In [4]: df
Out[4]: 
      Name  Diameter [1000 km] Has Moon  Distance from Sun [au]
0  Mercury                12.7       No                    0.39
1    Venus               121.0       No                    0.72
2    Earth                 NaN      Yes                    1.00
3  Jupiter               142.9      Yes                    5.20
4   Saturn                49.5      NaN                    9.58

Step 2: Create a Planet

The Planet class is a core component of the Thema library, designed for managing tabular data. Its primary purpose is to perturb, label, and navigate existing tabular datasets.

In the context of our demo, the Planet class serves as the medium for managing data, handling the transition from raw tabular data to scaled, encoded, and complete data. It is particularly useful for datasets with missing values, as it can fill missing entries with randomly sampled data and explore the distribution of possible values for those entries.

In the following sections, we will explore how to use the Planet class to preprocess and manage our data effectively.

In [5]: from thema.multiverse import Planet

In [6]: df.to_pickle("myRawData.pkl")

In [7]: data_path = "myRawData.pkl"

In [8]: planet = Planet(data = data_path)

In [9]: planet
Out[9]: <thema.multiverse.system.inner.planet.Planet at 0x17dbc3cd0>

Step 3: Explore Dataset

Planet.get_missingData_summary() Method

The get_missingData_summary function provides a breakdown of missing and complete data in the columns of the ‘data’ DataFrame. It returns a dictionary containing the following key-value pairs:

  • 'numericMissing': Numeric columns with missing values.

  • 'numericComplete': Numeric columns without missing values.

  • 'categoricalMissing': Categorical columns with missing values.

  • 'categoricalComplete': Categorical columns without missing values.

This function is useful for quickly understanding the data quality and identifying columns that may require imputation or other preprocessing steps.

In [10]: planet.get_missingData_summary()
Out[10]: 
{'numericMissing': ['Diameter [1000 km]'],
 'numericComplete': ['Distance from Sun [au]'],
 'categoricalMissing': ['Has Moon'],
 'categoricalComplete': ['Name']}
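
Because the summary is a plain dictionary, it is easy to use programmatically. For example, to collect every column that needs attention:

summary = planet.get_missingData_summary()

# Columns flagged as missing are candidates for imputation.
cols_to_impute = summary['numericMissing'] + summary['categoricalMissing']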

Planet.get_na_as_list() Method

The get_na_as_list method returns a list of the column names that contain NaN values in the DataFrame. These are the columns to which the recommended sampling methods (as returned by get_recomended_sampling_method) should be applied, providing a convenient way to identify columns that require imputation.

By using get_na_as_list in conjunction with get_recomended_sampling_method, you can efficiently locate and address missing values in your dataset, ensuring that it is properly prepared for further analysis or modeling.

In [12]: planet.get_na_as_list()
Out[12]: ['Diameter [1000 km]', 'Has Moon']

Step 4: Impute

Assign imputeMethods and imputeColumns based on the recommended methods for each column, or define your own lists (see the sketch below).

Additionally, you can assign an imputeMethod of drop, which drops rows with missing values using pandas: .dropna(axis=0, inplace=True).

In [13]: planet.imputeMethods = planet.get_recomended_sampling_method()

In [14]: planet.imputeColumns = planet.get_na_as_list()
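
Alternatively, you can hand-specify the lists yourself. A minimal sketch (the method names 'sampleNormal' and 'drop' are taken from the note above and the getParams() output below; consult the Planet documentation for the full set of supported methods):

# Hand-specified alternative to the recommended methods.
planet.imputeColumns = ['Diameter [1000 km]', 'Has Moon']
planet.imputeMethods = ['sampleNormal', 'drop']  # 'drop' removes rows with missing values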

Why Encode and Scale?

Encoding and scaling are important preprocessing steps in data preparation. Encoding transforms categorical variables into numerical values, making them suitable for machine learning algorithms. Scaling standardizes the range of numerical features, preventing features with large scales from dominating the model.
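
The sketch below illustrates both ideas with plain pandas and scikit-learn; this is the general technique, not Thema's internal implementation.

import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'Has Moon': ['No', 'Yes', 'Yes'],
                    'Distance from Sun [au]': [0.39, 1.00, 5.20]})

# One-hot encoding: one binary column per category.
encoded = pd.get_dummies(toy['Has Moon'], prefix='Has Moon')

# Standard scaling: transform to mean 0, variance 1.
scaled = StandardScaler().fit_transform(toy[['Distance from Sun [au]']])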

Default Configuration

  • Scaler: The default scaling method is “standard”, which scales features to have a mean of 0 and a variance of 1.

  • Encoding: The default encoding method is “one_hot” for all categorical variables, which creates binary columns for each category.

Customization Options

You can customize the encoding and scaling methods to suit your needs. For scaling, you can choose from sklearn's scaler classes. For encoding, you can select "integer" encoding for ordinal variables or "hash" encoding for high-cardinality categorical variables.
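
For example, assuming attribute-style configuration analogous to imputeMethods above (the string values mirror the getParams() output below; check the Planet documentation for exact usage):

planet.scaler = 'standard'    # default: standard scaling
planet.encoding = 'one_hot'   # or 'integer' / 'hash', as described above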

Step 5: Assign an outDir

An outDir, the location where the imputed, encoded, and scaled data are written, is where a thema.multiverse.Moon object (or objects; see the guide on Random Imputation) is saved as a .pkl file.

This can also be set in the constructor or in the params .yaml file.

In [15]: planet.outDir = '<YOUR DIRECTORY NAME>'
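
Alternatively, assuming the constructor accepts the same parameter names reported by getParams() below, the output directory can be set at construction time:

planet = Planet(data='myRawData.pkl', outDir='<YOUR DIRECTORY NAME>')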

Step 6: Fit Planet

Fitting cleans, encodes, and imputes your data, then creates and pickles a Moon object containing the processed data:

planet.fit()

Note

Most objects in Thema have a fit() method, executing a procedure that can be time-consuming depending on the size of your dataset.

As the Planet.fit() method writes a thema.multiverse.Moon object to a .pkl file, we need to read that file back in to view the data here. This step is not necessary, but it is nice to see what fit() has done for the purposes of this user guide.

cleanedData = pd.read_pickle('<YOUR DIRECTORY NAME>/myRawData_standard_one_hot_imputed_0.pkl').imputeData

Note that the imputeData member variable is used to access the Moon's cleaned data. See the Moon class documentation for more info on member variables.

Step 7: Get Params

Note

We recommend using the planet.writeParams_toYaml() method to write this dictionary to a .yaml file, which can in turn be passed to the Planet constructor as a YAML_PATH for quick pipelining. This eliminates the need to explore, scale, and encode your data again.

In [16]: planet.getParams()
Out[16]: 
{'data': 'myRawData.pkl',
 'scaler': 'standard',
 'encoding': 'one_hot',
 'dropColumns': [],
 'imputeColumns': ['Diameter [1000 km]', 'Has Moon'],
 'imputeMethods': ['sampleNormal', 'mode'],
 'numSamples': 1,
 'seeds': [42],
 'outDir': '<YOUR DIRECTORY NAME>'}
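
For example, assuming writeParams_toYaml() accepts an output path (check the API reference for the exact signature):

planet.writeParams_toYaml('<PATH TO params.yaml>')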

Once you have explored your data and determined how best to preprocess it, this entire workflow can be streamlined using the following code:

from thema.multiverse import Planet

yaml = "<PATH TO params.yaml>"

planet = Planet(YAML_PATH=yaml)
planet.fit()