Probe Utils

Utility functions that support Probe functionality (to probe your data!).

thema.probe.data_utils

thema.probe.data_utils.custom_Zscore(global_df, subset_df, column_name)[source]

Calculate the z-score for a subset of a DataFrame relative to the entire DataFrame.

Parameters:
  • global_df (pd.DataFrame) – The entire DataFrame containing the global dataset.

  • subset_df (pd.DataFrame) – The subset of the DataFrame for which to calculate the z-score.

  • column_name (str) – The name of the column in both DataFrames for which to calculate the z-score.

Returns:

The z-score of the subset relative to the global dataset.

Return type:

float
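
As a rough illustration of the computation described above, the sketch below assumes the z-score is the subset mean minus the global mean, divided by the global standard deviation. That formula and the helper name `subset_zscore` are assumptions made for illustration; the actual `custom_Zscore` implementation may differ, for example in its choice of `ddof`.

```python
import pandas as pd


def subset_zscore(global_df: pd.DataFrame, subset_df: pd.DataFrame, column_name: str) -> float:
    """Hypothetical helper: z-score of a subset mean relative to the global column."""
    global_mean = global_df[column_name].mean()
    global_std = global_df[column_name].std()  # pandas default: sample std (ddof=1)
    subset_mean = subset_df[column_name].mean()
    return float((subset_mean - global_mean) / global_std)


global_df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
subset_df = global_df.iloc[3:]  # rows with x = 4 and x = 5
z = subset_zscore(global_df, subset_df, "x")
```

A positive result indicates that the subset sits above the global mean for that column, and a negative result indicates that it sits below.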

thema.probe.data_utils.error(x, mu)[source]

Calculate the error between a value and its expected value.

Parameters:
  • x (float) – The value.

  • mu (float) – The expected value.

Returns:

The error between the value and its expected value.

Return type:

float

thema.probe.data_utils.get_best_std_filter(col, global_stats: dict)[source]

Calculate the standard deviation of a column.

Parameters:
  • col (pd.Series) – The column for which to calculate the standard deviation.

  • global_stats (dict) – A dictionary containing global statistics for the dataset.

Returns:

The standard deviation of the column.

Return type:

float

thema.probe.data_utils.get_best_zscore_filter(col, global_stats: dict)[source]

Calculate the z-score of a column.

Parameters:
  • col (pd.Series) – The column for which to calculate the z-score.

  • global_stats (dict) – A dictionary containing global statistics for the dataset.

Returns:

The z-score of the column.

Return type:

float

thema.probe.data_utils.get_minimal_std(df: DataFrame, mask: array, density_cols=None)[source]

Find the column with the minimal standard deviation within a subset of a DataFrame.

Parameters:
  • df (pd.DataFrame) – A cleaned DataFrame.

  • mask (np.array) – A boolean array indicating which indices of the dataframe should be included in the computation.

Returns:

col_label – The index identifier for the column in the DataFrame with minimal std.

Return type:

int
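
One way to picture this selection is sketched below. The helper name `minimal_std_column` is hypothetical, and the sketch assumes that the returned integer is the column's position and that the sample standard deviation is used. It ignores `density_cols`, whose role is not described here.

```python
import numpy as np
import pandas as pd


def minimal_std_column(df: pd.DataFrame, mask: np.ndarray) -> int:
    """Hypothetical helper: position of the column with the smallest std among masked rows."""
    subset = df[mask]    # keep only the rows where mask is True
    stds = subset.std()  # per-column sample standard deviation
    return int(np.argmin(stds.to_numpy()))


df = pd.DataFrame({"a": [1.0, 5.0, 9.0, 2.0], "b": [3.0, 3.1, 2.9, 50.0]})
mask = np.array([True, True, True, False])  # exclude the last row
col_label = minimal_std_column(df, mask)
```

In this example the outlier in column `b` is masked out, so `b` is the least variable column among the remaining rows.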

thema.probe.data_utils.get_nearestTarget(G: Graph, targets: dict)[source]

Get the nodes and corresponding distances that are closest to each target.

Parameters:
  • G (nx.Graph) – The input graph.

  • targets (dict) – A dictionary of target nodes and their aggregated values, obtained from the sunset_dict() function.

Returns:

  • nearest_target (dict) – A dictionary where keys are nodes in the graph and values are the nearest target node.

  • nearest_target_distance (dict) – A dictionary where keys are nodes in the graph and values are the shortest distance to the nearest target node.
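
The two returned dictionaries can be illustrated with a multi-source breadth-first search. This is a minimal sketch, not the library's implementation: the helper name `nearest_targets` is hypothetical, distances are assumed to be unweighted hop counts, and the graph is passed as a plain mapping from each node to its neighbors. A networkx graph should work in its place, since indexing a graph by node yields its neighbors.

```python
from collections import deque


def nearest_targets(adj, targets):
    """Hypothetical helper: multi-source BFS over an unweighted graph.

    adj maps each node to an iterable of its neighbors.
    """
    # Every target is its own nearest target, at distance 0.
    nearest_target = {t: t for t in targets}
    nearest_target_distance = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in nearest_target:
                # First visit comes from the closest target (BFS order).
                nearest_target[neighbor] = nearest_target[node]
                nearest_target_distance[neighbor] = nearest_target_distance[node] + 1
                queue.append(neighbor)
    return nearest_target, nearest_target_distance


# Path graph a - b - c - d - e with targets at both ends.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
targets = {"a": 0.9, "e": 0.8}
nearest, distance = nearest_targets(adj, targets)
```

In this sketch, nodes that cannot reach any target are simply absent from both dictionaries, and ties between equally distant targets are broken by queue order.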

thema.probe.data_utils.select_highestZscoreCols(zscores, n_cols)[source]

Select the columns in a DataFrame that have the highest absolute z-scores.

Parameters:
  • zscores (pd.DataFrame) – A DataFrame containing z-scores.

  • n_cols (int) – The number of columns to select with the highest absolute z-scores.

Returns:

A DataFrame containing the top n columns with the highest absolute z-scores.

Return type:

pd.DataFrame
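
The selection can be sketched as follows. The helper name `top_abs_zscore_columns` is hypothetical, and ranking each column by its largest absolute value across rows is an assumption; the actual function may rank columns differently.

```python
import pandas as pd


def top_abs_zscore_columns(zscores: pd.DataFrame, n_cols: int) -> pd.DataFrame:
    """Hypothetical helper: keep the n_cols columns with the largest absolute z-score."""
    # Rank each column by its largest absolute z-score across rows.
    ranking = zscores.abs().max().nlargest(n_cols)
    return zscores[ranking.index]


zscores = pd.DataFrame({"a": [0.2], "b": [-2.5], "c": [1.1], "d": [-0.4]})
top = top_abs_zscore_columns(zscores, 2)
```

Using the absolute value means that strongly negative z-scores, such as column `b` here, rank alongside strongly positive ones, while the returned values keep their original sign.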

thema.probe.data_utils.std_zscore_threshold_filter(col, global_stats: dict, std_threshold=1, zscore_threshold=1)[source]

Calculate the filter value based on the standard deviation and z-score of a column.

Parameters:
  • col (pd.Series) – The column for which to calculate the filter value.

  • global_stats (dict) – A dictionary containing global statistics for the dataset.

  • std_threshold (float, optional) – The threshold for the standard deviation. Columns with absolute standard deviation below this threshold will be filtered out. Default is 1.

  • zscore_threshold (float, optional) – The threshold for the z-score. Columns with absolute z-score above this threshold will be filtered out. Default is 1.

Returns:

The filter value. 0 if the column should be filtered out, 1 otherwise.

Return type:

int
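
The decision rule stated in the parameter descriptions can be sketched as below. How the function derives a standard deviation and z-score from `col` and `global_stats` is not described here, so this hypothetical helper, `threshold_filter`, takes both as precomputed numbers. The behavior at exact equality with a threshold is also an assumption.

```python
def threshold_filter(col_std: float, col_zscore: float,
                     std_threshold: float = 1, zscore_threshold: float = 1) -> int:
    """Hypothetical helper mirroring the documented rule: 0 = filter out, 1 = keep."""
    if abs(col_std) < std_threshold:
        return 0  # standard deviation below the threshold
    if abs(col_zscore) > zscore_threshold:
        return 0  # z-score above the threshold
    return 1


kept = threshold_filter(col_std=1.5, col_zscore=0.3)
low_std = threshold_filter(col_std=0.2, col_zscore=0.3)
high_z = threshold_filter(col_std=1.5, col_zscore=2.0)
```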

thema.probe.data_utils.sunset_dict(d: dict, percentage: float = 0.1, top: bool = True) → dict[source]

Return the top or bottom n percent of a dictionary's entries, ranked by value.

Parameters:
  • d (dict) – The dictionary to subset, with node : value mappings.

  • percentage (float, optional) – The percentage of dictionary entries to keep when subsetting. Default is 0.1.

  • top (bool, optional) – If True, take the top percentage. If False, take the bottom percentage. Default is True.

Returns:

A dictionary containing only the nodes and values that fall within the selected percentage.

Return type:

dict
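
A minimal sketch of this kind of subsetting is shown below. The helper name `take_extreme_fraction` is hypothetical, and the sketch assumes that `percentage` is a fraction between 0 and 1 and that the number of entries kept is rounded down; the actual function may round differently.

```python
def take_extreme_fraction(d: dict, percentage: float = 0.1, top: bool = True) -> dict:
    """Hypothetical helper: keep the highest (or lowest) fraction of entries by value."""
    # Assumption: percentage is a fraction in (0, 1]; the count is rounded down.
    n_keep = int(len(d) * percentage)
    # Sort descending for the top entries, ascending for the bottom entries.
    ranked = sorted(d.items(), key=lambda item: item[1], reverse=top)
    return dict(ranked[:n_keep])


scores = {f"node{i}": i for i in range(10)}  # node0 -> 0, ..., node9 -> 9
top_nodes = take_extreme_fraction(scores, percentage=0.2)
bottom_nodes = take_extreme_fraction(scores, percentage=0.2, top=False)
```

A dictionary produced this way has the node : value shape that get_nearestTarget() expects for its targets argument.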