Data Preprocessing¶
Planet cleans, encodes, scales, and imputes tabular data, outputting Moon files for downstream analysis.
Basic Usage¶
from thema.multiverse import Planet
planet = Planet(
    data="/path/to/data.pkl",
    outDir="./outputs/my_run/clean",
    scaler="standard",
    encoding="one_hot",
    imputeColumns="auto",
    imputeMethods="auto",
    numSamples=1,
    seeds="auto"
)
planet.fit()
Parameters¶
- data : str or Path
  Absolute path to the input file. Supported formats: CSV, pickle (.pkl), parquet, Excel (.xlsx).
- outDir : str or Path
  Absolute path for output Moon files. Creates the directory if missing.
- dropColumns : list of str, optional
  Column names to remove before preprocessing. Use for IDs, timestamps, or non-predictive features.
- imputeColumns : list of str or "auto"
  Columns requiring imputation. "auto" detects all columns with missing values.
- imputeMethods : list of str or "auto"
  Imputation strategy per column in imputeColumns; must align in length with imputeColumns. Options:
  - "mean" - Column mean (numeric only)
  - "median" - Column median (numeric only)
  - "mode" - Most frequent value
  - "sampleNormal" - Sample from a fitted normal distribution (numeric only)
  - "sampleCategorical" - Sample from the category distribution
  - "zeros" - Fill with zeros
  "auto" selects sampleNormal for numeric columns and sampleCategorical for categorical ones.
- scaler : str or None
  Feature scaling method, applied after encoding and imputation. Options:
  - "standard" - Zero mean, unit variance (recommended)
  - "minmax" - Scale to the [0, 1] range
  - "robust" - Robust to outliers (uses the IQR)
  - None - No scaling
- encoding : str
  Categorical encoding method:
  - "one_hot" - Binary columns per category (recommended)
  - "label" - Integer encoding
  - "ordinal" - Ordered integer encoding
- numSamples : int
  Number of imputed datasets per seed. Use values greater than 1 only with randomized imputation methods (sampleNormal, sampleCategorical) to capture uncertainty.
- seeds : list of int or "auto"
  Random seeds for reproducibility. Each seed generates numSamples datasets; "auto" generates random seeds.
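For intuition, the imputation options above can be mimicked directly with pandas and NumPy. This is an illustrative sketch of each strategy's behavior, not Planet's internal implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
col = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])

mean_filled = col.fillna(col.mean())      # "mean"
median_filled = col.fillna(col.median())  # "median"
zeros_filled = col.fillna(0.0)            # "zeros"

# "sampleNormal": draw replacements from a normal distribution
# fitted to the observed (non-missing) values
draws = rng.normal(col.mean(), col.std(), size=col.isna().sum())
sample_filled = col.copy()
sample_filled[col.isna()] = draws

# "sampleCategorical": draw replacements from the empirical
# category distribution of the observed values
cat = pd.Series(["a", "b", np.nan, "a"])
probs = cat.value_counts(normalize=True)
cat_filled = cat.copy()
cat_filled[cat.isna()] = rng.choice(probs.index, size=cat.isna().sum(), p=probs.values)
```

Because sampleNormal and sampleCategorical are randomized, running them under several seeds (and numSamples > 1) yields different completed datasets, which is what lets Planet capture imputation uncertainty.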
Output¶
Moon files are saved as moon_<seed>_<sample>.pkl in outDir. Each contains:
- imputeData: preprocessed DataFrame
- original_columns: column names before encoding
- encoding_map: mapping for categorical variables
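Moon files are pickles, so they can be loaded with the standard pickle module. The snippet below round-trips a stand-in object carrying the three fields listed above; the real object is produced by planet.fit() and its exact layout may differ:

```python
import pickle
from types import SimpleNamespace

import pandas as pd

# Stand-in for a Moon object with the documented fields
# (illustrative values; not produced by Planet itself).
moon = SimpleNamespace(
    imputeData=pd.DataFrame({"age": [34.0, 41.0], "cat_a": [1, 0]}),
    original_columns=["age", "cat"],
    encoding_map={"cat": ["a", "b"]},
)

# Serialize and load back, as you would with a moon_<seed>_<sample>.pkl file
blob = pickle.dumps(moon)
loaded = pickle.loads(blob)
print(loaded.original_columns)
```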
Inspection Methods¶
- get_missingData_summary()
DataFrame showing missing value counts and percentages per column
- get_na_as_list()
List of column names with missing values
- get_recomended_sampling_method()
Dict mapping columns to recommended imputation methods
- getParams()
Dict of current configuration
- writeParams_toYaml(path)
Save configuration to YAML file
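As a sanity check on what get_missingData_summary() and get_na_as_list() report, the same information can be derived by hand with pandas (the toy DataFrame and column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29],
    "income": [None, None, 52000],
    "city": ["NY", "LA", "SF"],
})

# Missing-value counts and percentages per column
summary = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100,
})

# Columns with at least one missing value, analogous to get_na_as_list()
na_columns = summary.index[summary["missing_count"] > 0].tolist()
print(summary)
print(na_columns)
```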
Examples¶
Manual Configuration
planet = Planet(
    data="/data/survey.pkl",
    outDir="./outputs/analysis/clean",
    dropColumns=["id", "timestamp"],
    imputeColumns=["age", "income", "category"],
    imputeMethods=["sampleNormal", "median", "mode"],
    scaler="standard",
    encoding="one_hot",
    seeds=[42, 13, 99],
    numSamples=2
)
planet.fit() # Produces 6 Moon files (3 seeds × 2 samples)
Auto-Detection
planet = Planet(
    data="/data/survey.pkl",
    outDir="./outputs/analysis/clean",
    imputeColumns="auto",
    imputeMethods="auto",
    scaler="standard",
    encoding="one_hot",
    numSamples=1,
    seeds="auto"
)
planet.fit()
YAML Configuration
In params.yaml:
Planet:
  scaler: standard
  encoding: one_hot
  imputeColumns: auto
  imputeMethods: auto
  numSamples: 1
  seeds: auto
Then:
planet = Planet(YAML_PATH="params.yaml")
planet.fit()
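The YAML block above parses into a nested dict keyed by the class name. A quick check with PyYAML (assumed installed) confirms the value types Planet receives, e.g. that numSamples arrives as an int while seeds arrives as the string "auto":

```python
import yaml

config_text = """
Planet:
  scaler: standard
  encoding: one_hot
  imputeColumns: auto
  imputeMethods: auto
  numSamples: 1
  seeds: auto
"""

params = yaml.safe_load(config_text)["Planet"]
print(type(params["numSamples"]), type(params["seeds"]))
```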
Best Practices¶
- Use "auto" for initial runs, then inspect with get_missingData_summary()
- Set numSamples=1 unless using randomized imputation
- Use "standard" scaling for most ML pipelines
- Drop high-cardinality categoricals before encoding to avoid dimension explosion
- Save raw data as pickle to preserve dtypes across runs
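The high-cardinality tip can be applied before constructing Planet. A minimal sketch, assuming a pandas DataFrame and an arbitrary cutoff of 50 categories:

```python
import pandas as pd

def drop_high_cardinality(df: pd.DataFrame, max_categories: int = 50) -> pd.DataFrame:
    """Drop object/categorical columns whose category count would
    explode the width of a one-hot encoding."""
    too_wide = [
        col for col in df.select_dtypes(include=["object", "category"]).columns
        if df[col].nunique() > max_categories
    ]
    return df.drop(columns=too_wide)

df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(100)],  # 100 unique values: dropped
    "segment": ["a", "b"] * 50,                # 2 unique values: kept
    "score": range(100),                       # numeric: untouched
})
slim = drop_high_cardinality(df)
print(slim.columns.tolist())  # ['segment', 'score']
```

The surviving frame can then be passed to Planet's data argument (after saving it, e.g. as a pickle), or the too-wide column names can be handed to dropColumns instead.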