Data Module

The data module provides functions for loading coal plant datasets and network graphs.

For detailed information on the data sources, dataset construction, aggregation methods, and preprocessing choices, please refer to our Supporting Information.

retire.data.data.load_dataset()

Load the US coal plants dataset from the package resources.

Returns:

Complete US coal plants dataset containing plant characteristics, retirement status, contextual vulnerabilities, and associated metadata. Includes columns for plant location, capacity, age, retirement planning, economic factors, and environmental considerations.

Return type:

pandas.DataFrame

Raises:

FileNotFoundError – If the dataset file does not exist at the specified path.
pd.errors.EmptyDataError – If the CSV file is empty.
pd.errors.ParserError – If the CSV file cannot be parsed.

Examples

>>> from retire.data import load_dataset
>>> df = load_dataset()
>>> print(df.shape)
(914, 45)
>>> print(df.columns[:5].tolist())
['Plant Name', 'ORISPL', 'State', 'County', 'LAT']
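
The exceptions listed under Raises can be handled at the call site. The following is a minimal sketch, not part of the package: `load_or_none` is a hypothetical helper that accepts any loader callable, and the file name in the stand-in loader is invented for illustration.

```python
import pandas as pd


def load_or_none(loader):
    """Call a loader; return None and print a message on a documented failure."""
    try:
        return loader()
    except FileNotFoundError as exc:
        print(f"Dataset file missing: {exc}")
    except pd.errors.EmptyDataError:
        print("Dataset file is empty")
    except pd.errors.ParserError as exc:
        print(f"Dataset file could not be parsed: {exc}")
    return None


def _missing_file_loader():
    # Stand-in for a loader whose CSV is absent (hypothetical file name).
    raise FileNotFoundError("coal_plants.csv")


result = load_or_none(_missing_file_loader)
ok = load_or_none(lambda: pd.DataFrame({"ORISPL": [1]}))
```

In practice the callable would be `load_dataset` itself, for example `load_or_none(load_dataset)`.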

retire.data.data.load_clean_dataset()

Load the cleaned and scaled US coal plant dataset.

This dataset has undergone preprocessing including missing value imputation, feature scaling, and normalization for use in machine learning models and statistical analysis.

Returns:

Cleaned and scaled coal plant dataset with standardized numerical features and processed categorical variables. All features are normalized to facilitate clustering and similarity analysis.

Return type:

pandas.DataFrame

Examples

>>> from retire.data import load_clean_dataset
>>> clean_df = load_clean_dataset()
>>> print(clean_df.dtypes.value_counts())
float64    42
int64       3
dtype: int64
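
To make the preprocessing steps concrete, here is a minimal sketch of imputation followed by standardization on a toy frame. It is an illustration only: the median imputation strategy, the z-score scaling, the helper name `impute_and_standardize`, and the column names are all assumptions, not the package's actual pipeline.

```python
import pandas as pd


def impute_and_standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute missing numeric values, then z-score each numeric column."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # Fill gaps with the column median (assumed strategy, chosen for illustration).
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Standardize: subtract the mean and divide by the population standard deviation.
    std = out[numeric_cols].std(ddof=0).replace(0, 1.0)  # guard against constant columns
    out[numeric_cols] = (out[numeric_cols] - out[numeric_cols].mean()) / std
    return out


toy = pd.DataFrame({
    "capacity_mw": [500.0, None, 1500.0],  # hypothetical column names
    "age_years": [30.0, 40.0, 50.0],
})
scaled = impute_and_standardize(toy)
```

After this transformation every numeric column has zero mean and unit variance, so features measured in different units contribute comparably to distance-based clustering.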

retire.data.data.load_projection()

Load the projected US coal plant dataset with future scenario modeling.

This dataset contains projections and forecasts for coal plant operations under various policy and economic scenarios, including retirement timing predictions and capacity factor estimates.

Returns:

Projected coal plant dataset with scenario-based forecasts for retirement timing, capacity utilization, and economic viability under different policy environments.

Return type:

pandas.DataFrame

Examples

>>> from retire.data import load_projection
>>> proj_df = load_projection()
>>> scenario_cols = [col for col in proj_df.columns if 'scenario' in col.lower()]
>>> print(f"Available scenarios: {len(scenario_cols)}")

retire.data.data.load_graph()

Load the coal plant network graph from package resources.

Constructs a NetworkX graph representing relationships between coal plant clusters based on similarity metrics and contextual factors. Nodes represent plant clusters, and edges represent similarity relationships weighted by various plant characteristics.

Returns:

Network graph with nodes representing coal plant clusters and edges representing similarity relationships.

Node attributes include:

- membership: list of plant indices belonging to the cluster
- cluster_id: unique identifier for the cluster

Edge attributes include:

- weight: similarity strength between clusters

Return type:

networkx.Graph

Raises:

FileNotFoundError – If the graph node or edge CSV files do not exist.
ValueError – If the membership field cannot be parsed as a list.

Examples

>>> from retire.data import load_graph
>>> G = load_graph()
>>> print(f"Graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
Graph has 314 nodes and 1247 edges
>>> # Check node attributes
>>> node_attrs = list(G.nodes(data=True))[0]
>>> print(f"Node attributes: {list(node_attrs[1].keys())}")
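
The sketch below shows how a graph with these node and edge attributes could be assembled and queried. It is not the package's loader: the records are hypothetical stand-ins for the node and edge CSV files, `parse_membership` is an illustrative helper, and the `source`/`target` field names are assumptions.

```python
import ast

import networkx as nx


def parse_membership(raw: str) -> list:
    """Parse a stringified list such as '[0, 4, 7]' into a Python list."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Cannot parse membership field: {raw!r}") from exc
    if not isinstance(value, list):
        raise ValueError(f"Membership field is not a list: {raw!r}")
    return value


# Hypothetical node and edge records standing in for the CSV contents.
node_rows = [
    {"cluster_id": "c0", "membership": "[0, 4, 7]"},
    {"cluster_id": "c1", "membership": "[4, 9]"},
    {"cluster_id": "c2", "membership": "[12]"},
]
edge_rows = [{"source": "c0", "target": "c1", "weight": 0.5}]

G = nx.Graph()
for row in node_rows:
    G.add_node(
        row["cluster_id"],
        cluster_id=row["cluster_id"],
        membership=parse_membership(row["membership"]),
    )
for row in edge_rows:
    G.add_edge(row["source"], row["target"], weight=row["weight"])

# For each edge, list the plant indices that appear in both endpoint clusters.
shared = {
    (u, v): sorted(set(G.nodes[u]["membership"]) & set(G.nodes[v]["membership"]))
    for u, v in G.edges
}
```

Parsing with `ast.literal_eval` rather than `eval` keeps the membership field restricted to Python literals, and the explicit type check turns any non-list value into the documented `ValueError`.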

retire.data.data.load_generator_level_dataset()

Load the generator-level US coal plants dataset.

Provides detailed information at the individual generator unit level, including technical specifications, operational history, and retirement planning for each coal-fired generating unit in the US fleet.

Returns:

Generator-level dataset with detailed technical and operational information for individual coal-fired generating units. Includes capacity, age, efficiency metrics, emissions data, and retirement status for each generator.

Return type:

pandas.DataFrame

Raises:

FileNotFoundError – If the dataset file does not exist at the specified path.
pd.errors.EmptyDataError – If the CSV file is empty.
pd.errors.ParserError – If the CSV file cannot be parsed.

Examples

>>> from retire.data import load_generator_level_dataset
>>> gen_df = load_generator_level_dataset()
>>> print(f"Total generators: {len(gen_df)}")
>>> # Group by plant to see generator counts per plant
>>> gens_per_plant = gen_df.groupby('ORISPL').size()
>>> print(f"Average generators per plant: {gens_per_plant.mean():.1f}")
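
Generator-level records are often rolled up to plant level. The following sketch uses a toy frame to show one way to do that with a named aggregation; only `ORISPL` is taken from the example above, and the other column names are hypothetical.

```python
import pandas as pd

# Toy generator-level records; 'capacity_mw' and 'age_years' are hypothetical columns.
gen_df = pd.DataFrame({
    "ORISPL": [3, 3, 8, 8, 8],
    "capacity_mw": [200.0, 300.0, 100.0, 150.0, 250.0],
    "age_years": [40, 50, 30, 35, 55],
})

# One row per plant: generator count, summed capacity, and mean generator age.
plant_df = (
    gen_df.groupby("ORISPL")
    .agg(
        n_generators=("capacity_mw", "count"),
        total_capacity_mw=("capacity_mw", "sum"),
        mean_age_years=("age_years", "mean"),
    )
    .reset_index()
)
```

Summing capacity and averaging age are the same kinds of plant-level derivations that appear when generator data must be aligned with a plant-level dataset.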

retire.data.data.load_retired_plant_dataset()

Load the dataset of US coal plants that retired between 2020 and 2023.

Dataset Overview

This dataset contains detailed information on U.S. coal plants that retired between 2020 and 2023.

Because our analytical framework integrates static snapshots of political, demographic, public-opinion, and other contextual variables drawn from multiple data sources (see Supplemental Information A.1), we restrict the dataset to this recent four-year period. This window captures the most current wave of coal plant retirements while maintaining sufficient data coverage across sources.

Not all variables are available for every plant. Some static datasets could not be applied uniformly because data for certain years were unavailable or incomplete. Compiling this dataset required cross-referencing and harmonizing data from numerous repositories. Additional work could extend the compilation, for example by incorporating plants that retired before 2020 for validation.

Returns:

A structured dataset where each row represents a retired coal plant (2020–2023), and columns capture plant-level operational, policy, demographic, and contextual variables.

Return type:

pandas.DataFrame

Raises:

FileNotFoundError – If the dataset file does not exist at the specified path.
pd.errors.EmptyDataError – If the CSV file is empty.
pd.errors.ParserError – If the CSV file cannot be parsed.

Examples

>>> from retire.data.data import load_retired_plant_dataset
>>> retired_df = load_retired_plant_dataset()
>>> print(f"Total retired plants: {len(retired_df)}")

retire.data.data.generate_target_matching_data() → DataFrame

Prepare a unified dataset combining active coal plants with retired plants, suitable for target matching within our mapper graph framework.

This function performs several key steps:

1. Loads the raw active coal plant dataset (coal_raw) and the retired plant dataset.
2. Renames retired plant columns to align with our internal dataset conventions.
3. Derives new variables:
   - Total Nameplate Capacity (MW)
   - Average Capacity Factor
   - Mapped Fuel Type
   - Retirement status (ret_STATUS)
   - Age (averaged where multiple generators exist)
4. Aligns the retired plant dataset with the original dataset, keeping only shared columns.
5. Concatenates the retired and active datasets into a single DataFrame.

Notes

Because the retired plant dataset originates from 2020-2023, some static data sources cannot be applied consistently (data may be missing or unavailable for these years).
This preparation is necessary for mapping retired plants into our existing graph landscape: determining which component and node a new plant most strongly fits.

Returns:

A combined dataset of active and retired coal plants with harmonized column names and derived variables ready for target matching.

Return type:

pandas.DataFrame
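
The rename, derive, align, and concatenate steps can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the function's implementation: the raw retired-plant column names, the `Extra Context` column, and the 0/1 encoding of `ret_STATUS` are hypothetical, and only two of the derived variables are shown.

```python
import pandas as pd

# Toy active-plant table; 'Extra Context' stands in for columns the retired data lacks.
active = pd.DataFrame({
    "ORISPL": [1, 2],
    "Total Nameplate Capacity (MW)": [500.0, 800.0],
    "Age": [40.0, 35.0],
    "ret_STATUS": [0, 0],
    "Extra Context": ["a", "b"],
})

# Toy retired-plant table with generator-level rows and hypothetical raw column names.
retired_raw = pd.DataFrame({
    "plant_id": [9, 9, 10],
    "nameplate_mw": [100.0, 150.0, 300.0],
    "generator_age": [50, 60, 45],
})

# Rename to the shared convention, then derive plant-level variables.
retired = (
    retired_raw.rename(columns={"plant_id": "ORISPL"})
    .groupby("ORISPL", as_index=False)
    .agg(**{
        "Total Nameplate Capacity (MW)": ("nameplate_mw", "sum"),
        "Age": ("generator_age", "mean"),  # averaged across generators
    })
)
retired["ret_STATUS"] = 1  # hypothetical encoding for "retired"

# Keep only the columns both tables share, then stack them.
shared_cols = [c for c in active.columns if c in retired.columns]
combined = pd.concat([active[shared_cols], retired[shared_cols]], ignore_index=True)
```

Restricting both frames to `shared_cols` before concatenation avoids introducing columns that are entirely missing for the retired plants.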

Data Utilities

These functions help with processing and managing the datasets: