Getting Started

This guide walks through a complete end-to-end workflow, using uv for environment management.

Prepare Your Data

Thema accepts CSV, pickle, or parquet files. To convert a CSV file to pickle:

import pandas as pd

df = pd.read_csv("raw_data.csv")
df.to_pickle("data.pkl")  # Recommended for mixed types
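The recommendation to use pickle for mixed types can be illustrated with a small round-trip check. This is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not part of Thema): the CSV round trip loses the categorical dtype, while the pickle round trip returns the frame unchanged.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mixed-type data, for illustration only.
df = pd.DataFrame({
    "age": [34, 51, 29],
    "score": [0.5, 1.25, 3.0],
    "plan": pd.Categorical(["basic", "pro", "basic"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    pkl_path = Path(tmp) / "data.pkl"

    # CSV stores text only, so the categorical dtype is not restored on read.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Pickle serializes the DataFrame object itself, dtypes included.
    df.to_pickle(pkl_path)
    from_pkl = pd.read_pickle(pkl_path)

print("original dtypes:", dict(df.dtypes.astype(str)))
print("after CSV:      ", dict(from_csv.dtypes.astype(str)))
print("after pickle:   ", dict(from_pkl.dtypes.astype(str)))
```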

Create Configuration

Save the following as params.yaml in your project directory:

runName: my_analysis
data: /absolute/path/to/data.pkl
outDir: ./outputs

Planet:
  scaler: standard
  encoding: one_hot
  imputeColumns: auto
  imputeMethods: auto
  numSamples: 1
  seeds: auto

Oort:
  tsne:
    perplexity: [30]
    dimensions: [2]
    seed: [42]
  pca:
    dimensions: [2]
    seed: [42]
  projectiles: [tsne, pca]

Galaxy:
  metric: stellar_curvature_distance
  selector: max_nodes
  nReps: 2
  stars: [jmap]
  jmap:
    nCubes: [8]
    percOverlap: [0.3]
    minIntersection: [-1]
    clusterer:
      - [HDBSCAN, {min_cluster_size: 5}]
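Because the `data` key in the example is an absolute path, it can be convenient to resolve a relative path before writing the file. The helper below is a hypothetical convenience, not part of Thema: a minimal sketch that rewrites only the unindented top-level `data:` line of the config text and leaves everything else untouched.

```python
from pathlib import Path


def resolve_data_path(config_text: str, base_dir: str) -> str:
    """Return config_text with the top-level `data:` value made absolute."""
    base = Path(base_dir).resolve()
    lines = []
    for line in config_text.splitlines():
        # Only the unindented top-level key is rewritten; nested keys are skipped.
        if line.startswith("data:"):
            raw = line.split(":", 1)[1].strip()
            path = Path(raw)
            if not path.is_absolute():
                path = base / path
            line = f"data: {path}"
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    template = "runName: my_analysis\ndata: data.pkl\noutDir: ./outputs\n"
    print(resolve_data_path(template, "."))
```

Working on plain text keeps the sketch free of a YAML dependency; a YAML parser would be the sturdier choice if the file is edited programmatically in other ways.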

Run the Pipeline

from thema.thema import Thema

T = Thema("params.yaml")
T.genesis()

print(f"Representatives: {T.selected_model_files}")

Inspect Outputs

The pipeline creates the following directory tree:

outputs/
└── my_analysis/
    ├── clean/          # Preprocessed data (Moon files)
    ├── projections/    # Embeddings (Comet files)
    └── models/         # Graphs (Star files)
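A quick way to confirm that each stage produced output is to count the files in the three directories shown above. The function below is a hypothetical helper, not part of Thema, and a minimal sketch: it assumes only the directory layout in the tree and makes no assumption about file names or extensions.

```python
from pathlib import Path


def summarize_outputs(out_dir: str, run_name: str) -> dict:
    """Count the files in each stage directory of a run."""
    run_dir = Path(out_dir) / run_name
    counts = {}
    for stage in ("clean", "projections", "models"):
        stage_dir = run_dir / stage
        if stage_dir.is_dir():
            counts[stage] = sum(1 for p in stage_dir.iterdir() if p.is_file())
        else:
            # None marks a directory that does not exist (yet).
            counts[stage] = None
    return counts


if __name__ == "__main__":
    print(summarize_outputs("./outputs", "my_analysis"))
```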

Load representative graphs:

import pandas as pd

for file in T.selected_model_files:
    star = pd.read_pickle(file)
    print(f"Nodes: {star.starGraph.nNodes}, Components: {star.starGraph.nComponents}")

Visualize Graph Landscape (Optional)

View relationships between all generated graphs:

import matplotlib.pyplot as plt

coords = T.galaxy.get_galaxy_coordinates()  # MDS projection of distance matrix
plt.scatter(coords[:, 0], coords[:, 1], alpha=0.6)
plt.title("Model Space")
plt.show()
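To give a feel for what an MDS projection of a distance matrix does, here is a minimal sketch of classical (Torgerson) MDS in NumPy. It is an illustration only: the function name is hypothetical, and it is not confirmed here which MDS variant `get_galaxy_coordinates()` uses internally. The toy distance matrix is made up.

```python
import numpy as np


def classical_mds(dist: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Embed points so that Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues from non-Euclidean input or round-off.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


if __name__ == "__main__":
    # Toy symmetric distance matrix between four hypothetical models.
    toy = np.array([
        [0.0, 3.0, 4.0, 5.0],
        [3.0, 0.0, 5.0, 4.0],
        [4.0, 5.0, 0.0, 3.0],
        [5.0, 4.0, 3.0, 0.0],
    ])
    print(classical_mds(toy))
```

Points that land close together in the resulting scatter correspond to graphs with a small pairwise distance, which is how the "Model Space" plot should be read.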