Probe Utils
Utility functions that support the Probe module (to probe your data!)
thema.probe.data_utils
- thema.probe.data_utils.custom_Zscore(global_df, subset_df, column_name)
Calculate the z-score for a subset of a DataFrame relative to the entire DataFrame.
- Parameters:
global_df (pd.DataFrame) – The entire DataFrame containing the global dataset.
subset_df (pd.DataFrame) – The subset of the DataFrame for which to calculate the z-score.
column_name (str) – The name of the column in both DataFrames for which to calculate the z-score.
- Returns:
The z-score of the subset relative to the global dataset.
- Return type:
float
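As an illustration of what a subset-versus-global z-score can look like, here is a minimal sketch using only the standard library. It assumes the score is the subset mean minus the global mean, divided by the global sample standard deviation; the source does not state the exact formula, and `subset_zscore` is a hypothetical helper that takes plain lists rather than DataFrames.

```python
from statistics import mean, stdev


def subset_zscore(global_values, subset_values):
    """Z-score of the subset mean relative to the global distribution.

    Assumption: the global spread is the sample standard deviation
    (ddof=1); the library may use a different convention.
    """
    global_mean = mean(global_values)
    global_std = stdev(global_values)
    return (mean(subset_values) - global_mean) / global_std


# One column of the "global" data and the rows belonging to a subset.
global_column = [1.0, 2.0, 3.0, 4.0, 5.0]
subset_column = [4.0, 5.0]

z = subset_zscore(global_column, subset_column)
print(round(z, 4))
```

A positive result means the subset sits above the global mean for that column, and a negative result means it sits below.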
- thema.probe.data_utils.error(x, mu)
Calculate the error between a value and its expected value.
- Parameters:
x (float) – The value.
mu (float) – The expected value.
- Returns:
The error between the value and its expected value.
- Return type:
float
- thema.probe.data_utils.get_best_std_filter(col, global_stats: dict)
Calculate the standard deviation of a column.
- Parameters:
col (pd.Series) – The column for which to calculate the standard deviation.
global_stats (dict) – A dictionary containing global statistics for the dataset.
- Returns:
The standard deviation of the column.
- Return type:
float
- thema.probe.data_utils.get_best_zscore_filter(col, global_stats: dict)
Calculate the z-score of a column.
- Parameters:
col (pd.Series) – The column for which to calculate the z-score.
global_stats (dict) – A dictionary containing global statistics for the dataset.
- Returns:
The z-score of the column.
- Return type:
float
- thema.probe.data_utils.get_minimal_std(df: DataFrame, mask: array, density_cols=None)
Find the column with the minimal standard deviation within a subset of a DataFrame.
- Parameters:
df (pd.DataFrame) – A cleaned DataFrame.
mask (np.array) – A boolean array indicating which indices of the dataframe should be included in the computation.
- Returns:
col_label – The index identifier for the column in the DataFrame with minimal std.
- Return type:
int
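The masked minimum-spread search described above can be sketched as follows. This is not the library's implementation: `minimal_std_column` is a hypothetical helper, columns are modeled as a plain dict of lists instead of a DataFrame, the population standard deviation is an assumption, and the `density_cols` argument is not modeled because its behavior is not documented here.

```python
from statistics import pstdev


def minimal_std_column(columns, mask):
    """Return the label of the column with the smallest spread on masked rows.

    columns: dict mapping a column label to a list of numbers.
    mask: list of booleans, one per row; True keeps the row.
    """
    best_label, best_std = None, float("inf")
    for label, values in columns.items():
        # Keep only the rows selected by the mask.
        selected = [v for v, keep in zip(values, mask) if keep]
        spread = pstdev(selected)  # assumption: population std
        if spread < best_std:
            best_label, best_std = label, spread
    return best_label


columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 5.0, 6.0, 9.0],
    "c": [0.0, 10.0, 0.0, 10.0],
}
mask = [True, True, True, False]

tightest = minimal_std_column(columns, mask)
print(tightest)
```

In this example the last row is excluded by the mask, so the outlier in column `b` does not count against it.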
- thema.probe.data_utils.get_nearestTarget(G: Graph, targets: dict)
Get the nodes and corresponding distances that are closest to each target.
- Parameters:
G (nx.Graph) – The input graph.
targets (dict) – A dictionary of target nodes and their aggregated values, obtained from the sunset_dict() function.
- Returns:
nearest_target (dict) – A dictionary where keys are nodes in the graph and values are the nearest target node.
nearest_target_distance (dict) – A dictionary where keys are nodes in the graph and values are the shortest distance to the nearest target node.
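To show the shape of the two returned dictionaries, here is a minimal sketch of a multi-source shortest-path search written with the standard library. It is an assumption that the library computes distances this way; `nearest_targets` is a hypothetical helper, and the graph is modeled as a nested dict of edge weights rather than an `nx.Graph`.

```python
import heapq


def nearest_targets(adjacency, targets):
    """Multi-source Dijkstra over a weighted adjacency dict.

    adjacency: {node: {neighbor: weight}} for an undirected graph.
    targets: dict whose keys are the target nodes (values are ignored here).
    Returns (nearest, distance), both keyed by reachable node.
    """
    nearest = {t: t for t in targets}
    distance = {t: 0.0 for t in targets}
    # The counter breaks ties so nodes themselves are never compared.
    heap = [(0.0, i, t, t) for i, t in enumerate(targets)]
    heapq.heapify(heap)
    counter = len(heap)
    while heap:
        dist, _, node, source = heapq.heappop(heap)
        if dist > distance.get(node, float("inf")):
            continue  # stale entry; a shorter path was already found
        for neighbor, weight in adjacency.get(node, {}).items():
            candidate = dist + weight
            if candidate < distance.get(neighbor, float("inf")):
                distance[neighbor] = candidate
                nearest[neighbor] = source
                counter += 1
                heapq.heappush(heap, (candidate, counter, neighbor, source))
    return nearest, distance


# A path graph a - b - c - d - e with unit edge weights.
adjacency = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0},
    "e": {"d": 1.0},
}
targets = {"a": 0.9, "e": 0.8}

nearest, distance = nearest_targets(adjacency, targets)
print(nearest["b"], distance["b"])
```

Nodes that are equally close to two targets, such as `c` here, are assigned to whichever target reaches them first; nodes that cannot reach any target are left out of both dictionaries.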
- thema.probe.data_utils.select_highestZscoreCols(zscores, n_cols)
Select the columns in a DataFrame that have the highest absolute z-scores.
- Parameters:
zscores (pd.DataFrame) – A DataFrame containing z-scores.
n_cols (int) – The number of columns to select with the highest absolute z-scores.
- Returns:
A DataFrame containing the top n columns with the highest absolute z-scores.
- Return type:
pd.DataFrame
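The selection by absolute z-score can be illustrated with a small standard-library sketch. It assumes one z-score per column and models the input as a plain dict rather than a DataFrame; `top_abs_zscores` is a hypothetical name, not part of the library.

```python
def top_abs_zscores(zscores, n_cols):
    """Keep the n_cols entries whose z-scores are largest in absolute value.

    zscores: dict mapping a column label to its z-score.
    """
    ranked = sorted(zscores.items(), key=lambda item: abs(item[1]), reverse=True)
    return dict(ranked[:n_cols])


zscores = {"age": -2.5, "income": 0.3, "height": 1.7, "weight": -0.1}
selected = top_abs_zscores(zscores, 2)
print(selected)
```

Ranking by absolute value keeps strongly negative scores such as `age`, which a plain descending sort would push to the bottom.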
- thema.probe.data_utils.std_zscore_threshold_filter(col, global_stats: dict, std_threshold=1, zscore_threshold=1)
Calculate the filter value based on the standard deviation and z-score of a column.
- Parameters:
col (pd.Series) – The column for which to calculate the filter value.
global_stats (dict) – A dictionary containing global statistics for the dataset.
std_threshold (float, optional) – The threshold for the standard deviation. Columns with absolute standard deviation below this threshold will be filtered out. Default is 1.
zscore_threshold (float, optional) – The threshold for the z-score. Columns with absolute z-score above this threshold will be filtered out. Default is 1.
- Returns:
The filter value. 0 if the column should be filtered out, 1 otherwise.
- Return type:
int
- thema.probe.data_utils.sunset_dict(d: dict, percentage: float = 0.1, top: bool = True) -> dict
Return the top/bottom n percentage of a dictionary based on values.
- Parameters:
d (dict) – The dictionary to subset, with node : value mappings.
percentage (float, optional) – The percentage of entries to keep when subsetting (the top or bottom n% by value). Default is 0.1.
top (bool, optional) – If True, take the top percentage. If False, take the bottom percentage. Default is True.
- Returns:
A dictionary containing only the nodes, with their values, that fall within the selected top or bottom percentage.
- Return type:
dict
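A minimal sketch of this kind of top/bottom subsetting follows. It assumes `percentage` is a fraction between 0 and 1 (so the default 0.1 keeps 10% of entries) and that the entry count is rounded down; neither detail is stated in the source, and `take_extreme_share` is a hypothetical stand-in for the library function.

```python
def take_extreme_share(d, percentage=0.1, top=True):
    """Keep the highest (or lowest) share of entries, ranked by value.

    Assumptions: percentage is a fraction in [0, 1] and the number of
    entries kept is rounded down.
    """
    count = int(len(d) * percentage)
    # reverse=True ranks the largest values first when top is requested.
    ranked = sorted(d.items(), key=lambda item: item[1], reverse=top)
    return dict(ranked[:count])


scores = {f"n{i}": i for i in range(10)}

highest = take_extreme_share(scores, percentage=0.2, top=True)
lowest = take_extreme_share(scores, percentage=0.2, top=False)
print(highest, lowest)
```

The resulting dictionary has the node-to-value shape that get_nearestTarget expects for its `targets` argument.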