Graph Construction¶
THEMA manages and explores a diverse range of data representations by organizing them into a unified framework. Graph-based pairwise distances quantify similarity between data points in graph models, facilitating:
Validation: Evaluate model effectiveness by measuring similarities in data structures.
Objective Insights: Gain quantitative metrics for community detection, anomaly detection, and more.
Democracy in Analysis: Ensure unbiased analysis with multiple perspectives on data complexities.
Managed by the Galaxy and jmap classes, graph construction currently supports the Kepler Mapper algorithm for building graph representations of your data. This creates multiple jmap Star objects using the parameters shown in the YAML below.
Choices in Graph Construction¶
Galaxy:
  metric: stellar_kernel_distance
  selector: random
  nReps: 3
  stars:
    - jmap
  jmap:
    nCubes:
      - 5
      - 10
    percOverlap:
      - 0.2
      - 0.5
    minIntersection:
      - -1
    clusterer:
      - [HDBSCAN, {min_cluster_size: 2}]
      - [HDBSCAN, {min_cluster_size: 10}]
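To make the jmap block concrete, its lists can be expanded into individual configurations. The sketch below assumes Thema builds one Star per combination of the listed values (a Cartesian product); that is an assumption about its behavior, not something this guide states, and `jmap_grid` is a hypothetical local name.

```python
from itertools import product

# Values copied from the jmap block above; `jmap_grid` is a hypothetical name.
jmap_grid = {
    "nCubes": [5, 10],
    "percOverlap": [0.2, 0.5],
    "minIntersection": [-1],
    "clusterer": [
        ["HDBSCAN", {"min_cluster_size": 2}],
        ["HDBSCAN", {"min_cluster_size": 10}],
    ],
}

# Assumption: one configuration per combination of the listed values.
configs = [
    dict(zip(jmap_grid, values))
    for values in product(*jmap_grid.values())
]

print(len(configs))  # number of jmap configurations in the grid
print(configs[0])    # first configuration as a plain dict
```

Under that assumption this grid describes 2 x 2 x 1 x 2 = 8 jmap configurations, so adding one more value to any list multiplies the amount of work.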
Graph Comparison¶
Managed by the Geodesics class, graph comparison uses what we call Stellar Kernel Distance to quantify similarity between and across multiple graph-based representations of your data.
Thema relies on the following to compute and quantify similarity across multiple graph models:
Stellar Kernel Distance
The “stellar kernel distance” refers to the measurement of similarity or dissimilarity between graphs based on their structural properties. It calculates how closely related or different two graphs are, considering their connectivity and node attributes.
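GraKeL Kernels
GraKeL is a Python library of graph kernels. A graph kernel is a function that scores the similarity of a pair of graphs. Evaluating it over every pair in a collection turns graph structures into a numerical representation (a kernel matrix), from which pairwise distances can be derived for comparison and analysis across a dataset.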
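As an illustration of how a kernel gives rise to a distance, here is a minimal pure-Python sketch. It uses a toy degree-histogram kernel and the standard kernel-induced distance sqrt(k(a,a) + k(b,b) - 2k(a,b)). Both the kernel and the function names are stand-ins chosen for this example; they are not Thema's stellar kernel distance or a GraKeL API.

```python
from collections import Counter
from math import sqrt


def degree_histogram(adjacency):
    """Count how many nodes have each degree (graph given as an adjacency dict)."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


def histogram_kernel(g1, g2):
    """Toy graph kernel: dot product of the two degree histograms."""
    h1, h2 = degree_histogram(g1), degree_histogram(g2)
    return sum(count * h2[degree] for degree, count in h1.items())


def kernel_distance(g1, g2, kernel=histogram_kernel):
    """Distance induced by a kernel: sqrt(k(a,a) + k(b,b) - 2*k(a,b))."""
    squared = kernel(g1, g1) + kernel(g2, g2) - 2 * kernel(g1, g2)
    return sqrt(max(squared, 0.0))  # guard against tiny negative round-off


# Three small example graphs (made up for illustration).
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(kernel_distance(path, path))      # identical graphs
print(kernel_distance(path, triangle))  # structurally different graphs
print(kernel_distance(path, star))
```

Computing this distance for every pair of graph models yields the kind of pairwise distance matrix that the model selection step later in this guide works from.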
Example¶
Here, we run through a simplified version of the Thema pipeline to create multiple graph representations of our dataset! In this example, we use the scikit-learn breast cancer dataset to demonstrate Thema’s functionality.
Step 1: YAML Setup¶
For this example, we construct a very simple hyperparameter space to explore the breast cancer dataset. Here is what our YAML looks like:
runName: demoFiles
data: <Path to your Raw Data>
outDir: <Path to your Out Directory>
Planet:
  scaler: standard
  encoding: one_hot
  dropColumns: None
  imputeColumns: None
  imputeMethods: None
  numSamples: 1
  seeds: auto
Oort:
  umap:
    nn:
      - 5
      - 10
      - 50
    minDist:
      - 0.05
      - 0.1
      - 0.15
      - 0.5
    dimensions:
      - 2
    seed:
      - 32
      - 50
  projectiles:
    - umap
Galaxy:
  metric: stellar_kernel_distance
  selector: random
  nReps: 2
  stars:
    - jmap
  jmap:
    nCubes:
      - 2
      - 5
      - 15
      - 30
    percOverlap:
      - 0.05
      - 0.1
      - 0.5
    minIntersection:
      - -1
    clusterer:
      - [HDBSCAN, {min_cluster_size: 2}]
      - [HDBSCAN, {min_cluster_size: 5}]
      - [HDBSCAN, {min_cluster_size: 10}]
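Before running anything, it helps to estimate how large this search is. The sketch below counts combinations in the Oort and Galaxy blocks. It assumes every listed value is combined with every other and that each jmap configuration is run on each UMAP embedding. Those are assumptions, so treat the result as an upper bound, not a promised number of output files. The dictionary names are hypothetical.

```python
from math import prod

# Value lists copied from the YAML above; dictionary names are hypothetical.
oort_umap = {
    "nn": [5, 10, 50],
    "minDist": [0.05, 0.1, 0.15, 0.5],
    "dimensions": [2],
    "seed": [32, 50],
}
galaxy_jmap = {
    "nCubes": [2, 5, 15, 30],
    "percOverlap": [0.05, 0.1, 0.5],
    "minIntersection": [-1],
    "clusterer": ["min_cluster_size=2", "min_cluster_size=5", "min_cluster_size=10"],
}

n_embeddings = prod(len(values) for values in oort_umap.values())
n_jmap_configs = prod(len(values) for values in galaxy_jmap.values())

# Assumption: every jmap configuration is applied to every embedding.
upper_bound = n_embeddings * n_jmap_configs

print(n_embeddings, n_jmap_configs, upper_bound)
```

Under these assumptions the demo describes 24 embeddings and 36 jmap configurations, for at most 864 candidate graph models.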
Step 2: Preprocessing¶
Handle filepaths – not necessary when running locally!
In [1]: import sys, os
In [2]: sys.path.insert(0, os.path.abspath('.'))
In [3]: yaml = os.path.join(os.path.abspath('.'),'source', 'userGuides', 'demoFiles', 'params.yaml')
In [4]: if not os.path.isfile(yaml):
   ...:     cwd = os.getcwd()
   ...:     raise FileNotFoundError(f"YAML parameter file could not be found: {yaml}\nCurrent working directory: {cwd}")
   ...:
See the Data Preprocessing guide for a detailed look at the steps involved in data preprocessing. In this example, the breast cancer dataset has no missing values, and scikit-learn has already applied most of the preprocessing steps that messier, real-world datasets require.
In [5]: from thema.multiverse import Planet
In [6]: planet = Planet(YAML_PATH=yaml)
In [7]: planet.fit()
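For intuition about the `scaler: standard` setting, here is a small pure-Python sketch of z-score standardization. It assumes the option refers to the usual standard scaling (subtract the mean, divide by the population standard deviation, as scikit-learn's StandardScaler does). How Planet applies it internally is not shown in this guide, and `standard_scale` is a hypothetical helper.

```python
from statistics import fmean, pstdev


def standard_scale(column):
    """Z-score a numeric column: subtract the mean, divide by the population std."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(value - mean) / std for value in column]


radii = [12.0, 14.0, 16.0, 18.0]  # made-up values, not taken from the dataset
scaled = standard_scale(radii)

print(scaled)
print(fmean(scaled), pstdev(scaled))  # mean near 0, std near 1
```

Scaling matters here because the later embedding and clustering steps are distance-based, so a feature measured in large units would otherwise dominate.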
Step 3: Embedding¶
See the Embeddings User Guide for more information on embedding choices and hyperparameter selection.
In [8]: from thema.multiverse import Oort
In [9]: oort = Oort(YAML_PATH=yaml)
In [10]: oort.fit()
Step 4: Graph Construction¶
Now we get to the real meat and potatoes! This step generates graphs based on the parameters shown in the demo YAML above.
In [11]: from thema.multiverse import Galaxy
In [12]: galaxy = Galaxy(YAML_PATH=yaml)
In [13]: galaxy.fit()
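To get a feel for what nCubes and percOverlap control, the sketch below builds a simplified one-dimensional cover: a fixed number of equal-length intervals in which neighbours share a fraction of their length. This is an illustration of the idea only, under the assumption of a single lens dimension; it is not Kepler Mapper's exact cube layout, and both function names are hypothetical.

```python
def interval_cover(low, high, n_cubes, perc_overlap):
    """Return n_cubes equal-length intervals spanning [low, high].

    Adjacent intervals share `perc_overlap` of their length (0 <= perc_overlap < 1).
    """
    length = (high - low) / (n_cubes - (n_cubes - 1) * perc_overlap)
    step = length * (1 - perc_overlap)
    return [(low + i * step, low + i * step + length) for i in range(n_cubes)]


def cubes_containing(value, intervals):
    """Indices of every interval that contains `value`."""
    return [i for i, (start, end) in enumerate(intervals) if start <= value <= end]


intervals = interval_cover(0.0, 10.0, n_cubes=5, perc_overlap=0.5)
for start, end in intervals:
    print(f"[{start:.2f}, {end:.2f}]")

print(cubes_containing(0.5, intervals))  # lies in one interval only
print(cubes_containing(2.0, intervals))  # lies in an overlap region
```

In Mapper, points are clustered within each cube, and clusters from different cubes that share points are joined by an edge. More cubes give finer nodes, and more overlap gives more shared points and therefore more edges.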
Step 5: Graph Model Selection¶
Plotting MDS¶
MDS (Multidimensional Scaling) is a statistical technique to visualize data similarities or dissimilarities in a lower-dimensional space while preserving relative distances as accurately as possible. It transforms complex distance data into a visual representation, revealing relationships and patterns that are hard to discern in high-dimensional spaces.
Key Benefits:
Visual Insight: Simplifies complex data relationships for easier interpretation and analysis.
Comparison: Facilitates comparing datasets or models based on distance metrics.
Communication: Communicates findings visually, aiding in decision-making and stakeholder engagement across disciplines.
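One way to make "preserving relative distances" concrete is to measure how far a low-dimensional layout strays from the original distances. The sketch below computes a normalized stress value for a layout. It illustrates the kind of quantity MDS tries to keep small and is not the code behind `show_mds()`. The function name and toy data are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def normalized_stress(target, points):
    """Mismatch between target distances and distances in the layout.

    target[i][j] is the original distance between items i and j;
    points[i] is the low-dimensional coordinate assigned to item i.
    Returns 0.0 when the layout reproduces every distance exactly.
    """
    pairs = list(combinations(range(len(points)), 2))
    mismatch = sum((target[i][j] - dist(points[i], points[j])) ** 2 for i, j in pairs)
    scale = sum(target[i][j] ** 2 for i, j in pairs)
    return sqrt(mismatch / scale)


# Toy distance matrix for three items (made up for illustration).
target = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
faithful = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # reproduces all distances
distorted = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # shortens one distance

print(normalized_stress(target, faithful))
print(normalized_stress(target, distorted))
```

A layout with low stress can be read at face value: models drawn close together are close under the chosen distance metric.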
In [14]: model_representatives = galaxy.collapse()
In [15]: galaxy.show_mds()
Hint
Each point on the above plot represents an entire graph model. Background coloring represents the density of graph models.
Based on the plot, we can decide how many representative graphs to select and analyze. In this case, we will look at 2 graph models: 1 from the densest region and 1 from the second-densest region.
We avoid selecting models that fall outside the density coloring. These models are unrepresentative of the dataset, and the hyperparameters that produced them are a poor fit for it.
Selecting Models¶
You can either use built-in functionality in the params.yaml to select representative models, based on the selector and nReps arguments passed in the YAML:
In [16]: model_representatives = galaxy.collapse()
In [17]: model_representatives
Out[17]:
{1: {'star': '/Users/gathrid/Repos/thema_light/Thema/docs/source/userGuides/demoFiles/models/jmap_clustererHDBSCANmin_cluster_size5_minIntersection-1_nCubes30_percOverlap0.5_id23_15.pkl',
'cluster_size': 474},
0: {'star': '/Users/gathrid/Repos/thema_light/Thema/docs/source/userGuides/demoFiles/models/jmap_clustererHDBSCANmin_cluster_size5_minIntersection-1_nCubes15_percOverlap0.5_id20_19.pkl',
'cluster_size': 65}}
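The shape of this output suggests that models are first grouped into nReps clusters and one member is then drawn from each. The sketch below mimics only the second half for `selector: random`, taking cluster labels as given. How `collapse()` actually clusters the models is not shown in this guide, so the labels, file names, and the `pick_representatives` helper are all hypothetical.

```python
import random
from collections import defaultdict


def pick_representatives(labels, model_files, seed=0):
    """Pick one random model per cluster label and record the cluster size."""
    rng = random.Random(seed)  # seeded so the choice is reproducible
    groups = defaultdict(list)
    for label, path in zip(labels, model_files):
        groups[label].append(path)
    return {
        label: {"star": rng.choice(members), "cluster_size": len(members)}
        for label, members in groups.items()
    }


# Hypothetical inputs: five model files assigned to two clusters.
model_files = [f"model_{i}.pkl" for i in range(5)]
labels = [0, 0, 1, 0, 1]

representatives = pick_representatives(labels, model_files)
print(representatives)
```

Read this way, cluster_size tells you how many graph models each representative stands in for, which is worth keeping in mind when comparing the two selections.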
Or you can select your own representatives by index from the show_mds() plot. Here we select index 311 (the bottom-most point) and 98 (in the middle of a high-density region):
In [18]: import os
In [19]: selection1, selection2 = os.listdir(galaxy.outDir)[311], os.listdir(galaxy.outDir)[98]
In [20]: print(f"\nSelection 1: {selection1}\n\nSelection 2: {selection2}")
Selection 1: jmap_clustererHDBSCANmin_cluster_size5_minIntersection-1_nCubes2_percOverlap0.5_id14_12.pkl
Selection 2: jmap_clustererHDBSCANmin_cluster_size2_minIntersection-1_nCubes5_percOverlap0.5_id5_23.pkl
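The file names encode the hyperparameters that produced each model, which makes it easy to check what was selected. The sketch below pulls those values out with a regular expression. The pattern is inferred only from the file names printed above and may not cover every name Thema writes. The meaning of the two trailing id numbers is not stated in this guide, so they are returned untouched. `parse_star_name` is a hypothetical helper.

```python
import re

# Pattern inferred from the file names shown above; an assumption, not a Thema API.
STAR_NAME = re.compile(
    r"min_cluster_size(?P<min_cluster_size>\d+)"
    r"_minIntersection(?P<minIntersection>-?\d+)"
    r"_nCubes(?P<nCubes>\d+)"
    r"_percOverlap(?P<percOverlap>\d*\.?\d+)"
    r"_id(?P<ids>\d+_\d+)\.pkl$"
)


def parse_star_name(filename):
    """Extract jmap hyperparameters from a model file name."""
    match = STAR_NAME.search(filename)
    if match is None:
        raise ValueError(f"Unrecognized model file name: {filename}")
    fields = match.groupdict()
    return {
        "min_cluster_size": int(fields["min_cluster_size"]),
        "minIntersection": int(fields["minIntersection"]),
        "nCubes": int(fields["nCubes"]),
        "percOverlap": float(fields["percOverlap"]),
        "ids": fields["ids"],  # left as-is; meaning not documented here
    }


name = "jmap_clustererHDBSCANmin_cluster_size5_minIntersection-1_nCubes2_percOverlap0.5_id14_12.pkl"
print(parse_star_name(name))
```

Applied to the two selections above, this shows at a glance that they differ in both nCubes and min_cluster_size.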
And that is it! You have now searched the hyperparameter space to create a landscape (galaxy!) of graph models (stars!) and selected representative models from the distribution. More advanced similarity metrics can be used here to select graphs; contact us at Krv Analytics for more info!
See the next guide for information on analyzing selected graph models.