libera_utils.scene_definitions.SceneDefinition#

class libera_utils.scene_definitions.SceneDefinition(definition_path: Path)#

Bases: object

Defines scenes and their classification rules from CSV configuration.

Loads and manages scene definitions from a CSV file, providing functionality to identify which scene a given set of atmospheric measurements belongs to.

type#

Type of scene definition (e.g., ‘TRMM’, ‘ERBE’), derived from filename

Type:

str

scenes#

List of scene definitions with their variable ranges

Type:

list of Scene

required_columns#

List of variable names required for scene identification

Type:

list of str

identify(data)#

Identify scene IDs for all data points in a dataset

validate_input_data_columns(data)#

Validate that dataset contains all required variables

Notes

Expected CSV format:

scene_id,variable1_min,variable1_max,variable2_min,variable2_max,…
1,0.0,10.0,20.0,30.0,…
2,10.0,20.0,30.0,40.0,…

Each variable must have both a _min and _max column. NaN or empty values indicate unbounded ranges.

Examples

>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> print(scene_def.type)
TRMM
>>> print(len(scene_def.scenes))
42

Methods

identify_and_update(data)

Identify scene IDs for all data points.

__init__(definition_path: Path)#

Initialize scene definition from CSV file.

_compute_global_bounds(variables: list[str]) → dict[str, tuple[float | None, float | None]]#

Compute the global bounding box that contains all scenes.

Parameters:

variables (list of str) – List of variable names

Returns:

Global bounds for each variable

Return type:

dict of str to tuple of (float or None, float or None)

Notes

Global min is the minimum of all scene mins (excluding None/unbounded). Global max is the maximum of all scene maxs (excluding None/unbounded). If all scenes are unbounded in a direction, returns None for that bound.
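A minimal sketch of this reduction, assuming each scene's ranges are available as a plain dict of (min, max) tuples (the actual Scene layout may differ):

def compute_global_bounds(scene_ranges, variables):
    # Sketch only: scene_ranges is an assumed list of {variable: (min, max)} dicts.
    bounds = {}
    for var in variables:
        mins = [r[var][0] for r in scene_ranges if r[var][0] is not None]
        maxs = [r[var][1] for r in scene_ranges if r[var][1] is not None]
        bounds[var] = (min(mins) if mins else None, max(maxs) if maxs else None)
    return bounds

compute_global_bounds([{"temp": (0.0, 50.0)}, {"temp": (50.0, None)}], ["temp"])
# {'temp': (0.0, None)}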

static _compute_intersection(rect1: Scene, rect2: Scene, variables: list[str]) → dict[str, tuple[float | None, float | None]] | None#

Compute the intersection of two hyper-rectangles.

Parameters:
  • rect1 (Scene) – First rectangle to intersect

  • rect2 (Scene) – Second rectangle to intersect

  • variables (list of str) – List of variable names

Returns:

Dictionary of variable bounds for the intersection region, or None if the rectangles don’t intersect

Return type:

dict or None

Notes

For each dimension, computes the intersection of two intervals:

  • Intersection of [a1, b1] and [a2, b2] is [max(a1, a2), min(b1, b2)]

  • Intersection exists only if max(a1, a2) < min(b1, b2)

  • Special handling for unbounded (None) values
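A one-dimensional sketch of this interval logic, with None mapped to ±∞ (illustrative helper, not the actual implementation):

import math

def intersect_interval(a, b):
    # Map None to -inf/+inf so unbounded ranges behave as expected.
    a_lo, a_hi = (-math.inf if a[0] is None else a[0], math.inf if a[1] is None else a[1])
    b_lo, b_hi = (-math.inf if b[0] is None else b[0], math.inf if b[1] is None else b[1])
    lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
    if lo >= hi:
        return None  # no overlap (touching endpoints do not count)
    return (None if lo == -math.inf else lo, None if hi == math.inf else hi)

intersect_interval((0.0, 10.0), (5.0, None))  # (5.0, 10.0)
intersect_interval((0.0, 5.0), (5.0, 10.0))   # None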

static _extract_variable_names(columns: Index) → list[str]#

Extract unique variable names from min/max column pairs.

Parameters:

columns (pd.Index) – Column names from the CSV

Returns:

Sorted list of unique variable names

Return type:

list of str

Notes

Variable names are extracted by removing the ‘_min’ or ‘_max’ suffix from column names. Only columns with these suffixes are considered as variable definitions.

Examples

>>> cols = pd.Index(['scene_id', 'temp_min', 'temp_max', 'pressure_min', 'pressure_max'])
>>> scene_def._extract_variable_names(cols)
['pressure', 'temp']
static _find_gaps(scenes: list[Scene], global_bounds: dict[str, tuple[float, float]], variables: list[str]) → list[dict[str, tuple[float, float]]]#

Find gaps in property ranges defined in scenes.

Parameters:
  • scenes (list of Scene) – List of scene regions

  • global_bounds (dict) – Global bounding box

  • variables (list of str) – List of variable names

Returns:

List of uncovered regions (gaps)

Return type:

list of dict

static _generate_cells_from_boundaries(boundary_values: dict[str, list[float]], variables: list[str]) → list[dict[str, tuple[float, float]]]#

Generate all cells (hyper-rectangles) from boundary values.

Parameters:
  • boundary_values (dict) – For each variable, a sorted list of boundary values

  • variables (list of str) – List of variable names

Returns:

List of cells, where each cell is a dict mapping variable names to (min, max) tuples

Return type:

list of dict
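A compact sketch of the cell construction, assuming sorted boundary lists per variable (function name and data layout are illustrative):

from itertools import product

def generate_cells(boundary_values, variables):
    # Consecutive boundary values define the intervals along each variable.
    intervals = {v: list(zip(boundary_values[v][:-1], boundary_values[v][1:])) for v in variables}
    # The Cartesian product of those intervals enumerates every cell.
    return [dict(zip(variables, combo)) for combo in product(*(intervals[v] for v in variables))]

generate_cells({"temp": [0.0, 50.0, 100.0], "pressure": [900.0, 1100.0]}, ["temp", "pressure"])
# [{'temp': (0.0, 50.0), 'pressure': (900.0, 1100.0)},
#  {'temp': (50.0, 100.0), 'pressure': (900.0, 1100.0)}]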

_identify_classification_variables() → list[str]#

Identify variables that are actually used for classification.

A variable is used for classification if at least one scene has at least one defined bound for that variable.

Returns:

Variables used for classification

Return type:

list of str

_identify_vectorized(data: Dataset, shape: tuple[int, ...]) → ndarray#

Vectorized scene identification using numpy arrays.
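A rough sketch of the masking idea, assuming flat NumPy arrays per variable and a half-open [min, max) bound convention (the names, fill value, and bound convention are assumptions, not the actual implementation):

import numpy as np

def identify_vectorized(arrays, scene_ranges, fill_value=-1):
    n = next(iter(arrays.values())).size
    scene_ids = np.full(n, fill_value, dtype=int)
    for scene_id, ranges in scene_ranges.items():
        mask = np.ones(n, dtype=bool)
        for var, (lo, hi) in ranges.items():
            values = arrays[var]
            mask &= ~np.isnan(values)   # NaN inputs stay unclassified
            if lo is not None:
                mask &= values >= lo
            if hi is not None:
                mask &= values < hi
        scene_ids[mask] = scene_id      # non-overlapping scenes -> unambiguous assignment
    return scene_ids

identify_vectorized({"temp": np.array([10.0, 75.0, np.nan])},
                    {1: {"temp": (0.0, 50.0)}, 2: {"temp": (50.0, 100.0)}})
# array([ 1,  2, -1])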

static _parse_row_to_ranges(row: Series, variable_names: list[str]) → dict[str, tuple[float | None, float | None]]#

Parse a CSV row into variable ranges.

Parameters:
  • row (pd.Series) – Row from the scene definition DataFrame containing scene_id and variable min/max values

  • variable_names (list of str) – List of variable names to extract ranges for

Returns:

Dictionary mapping variable names to (min, max) tuples. None values indicate unbounded ranges (no constraint).

Return type:

dict of str to tuple of (float or None, float or None)

Notes

For each variable, looks for columns named {variable}_min and {variable}_max. NaN values in the CSV are converted to None to indicate unbounded ranges.

Examples

>>> row = pd.Series({'scene_id': 1, 'temp_min': 0.0, 'temp_max': 100.0,
...                  'pressure_min': np.nan, 'pressure_max': 1000.0})
>>> scene_def._parse_row_to_ranges(row, ['temp', 'pressure'])
{'temp': (0.0, 100.0), 'pressure': (None, 1000.0)}
static _point_in_scene(point: dict[str, float], scene, variables: list[str]) → bool#

Check if a point falls within a scene’s bounds.

Parameters:
  • point (dict) – Point coordinates

  • scene (Scene) – Scene to test

  • variables (list of str) – List of variable names

Returns:

True if point is in scene, False otherwise

Return type:

bool
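A sketch of the per-point test under an assumed half-open [min, max) convention, with None bounds imposing no constraint (the scene is represented here as a plain dict for illustration):

def point_in_scene(point, ranges, variables):
    for var in variables:
        lo, hi = ranges[var]
        if lo is not None and point[var] < lo:
            return False
        if hi is not None and point[var] >= hi:
            return False
    return True

point_in_scene({"temp": 25.0, "pressure": 850.0},
               {"temp": (0.0, 50.0), "pressure": (None, 1000.0)},
               ["temp", "pressure"])  # True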

_validate_column_name_format(scene_df: DataFrame) → None#

Validate that all required variable columns exist with _min and _max suffixes.

Parameters:
  • scene_df (pd.DataFrame) – Scene definition DataFrame

  • required_columns (list of str) – List of variable names that should have min/max pairs

Raises:

ValueError – If scene_id column is missing or if any required variable is missing its _min or _max column

_validate_complete_coverage() → None#

Validate that scenes completely cover the bounded parameter space.

_validate_footprint_data_columns_present(data: Dataset)#

Ensure input data contains all required FootprintVariables.

Parameters:

data (xr.Dataset) – Dataset to validate

Raises:

ValueError – If required variables are missing from the dataset, with a message listing all missing variables

Examples

>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> scene_def.required_columns = ['cloud_fraction', 'optical_depth']
>>> data = xr.Dataset({'cloud_fraction': ('footprint', [10, 20])})
>>> scene_def.validate_input_data_columns(data)
ValueError: Required columns ['optical_depth'] not in input data for TRMM scene identification.
_validate_min_max_ordering(scene_df: DataFrame) → None#

Validate that min values are less than or equal to max values for all bins.

Parameters:
  • scene_df (pd.DataFrame) – Scene definition DataFrame

  • required_columns (list of str) – List of variable names to check

Raises:

ValueError – If any bin has min > max for any variable
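A minimal sketch of the ordering check, assuming the _min/_max column naming described above (helper name is illustrative):

import pandas as pd

def check_min_max_ordering(scene_df, variables):
    for var in variables:
        # NaN (unbounded) comparisons evaluate False, so unbounded rows are skipped automatically.
        bad = scene_df[scene_df[f"{var}_min"] > scene_df[f"{var}_max"]]
        if not bad.empty:
            raise ValueError(f"{var}_min > {var}_max for scene_id(s) {bad['scene_id'].tolist()}")

df = pd.DataFrame({"scene_id": [1], "temp_min": [50.0], "temp_max": [10.0]})
check_min_max_ordering(df, ["temp"])  # raises ValueError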

_validate_no_overlaps() → None#

Validate that no two scenes in the scene definition overlap.

Raises:

ValueError – If any two scenes overlap, with details about the overlapping region

Notes

This handles unbounded ranges in the following ways:

  • None for min means -∞ (always overlaps with any max)

  • None for max means +∞ (always overlaps with any min)
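A self-contained sketch of the pairwise check, combining the per-dimension interval test shown earlier (all names and the data layout are illustrative):

import math
from itertools import combinations

def find_overlaps(scene_ranges, variables):
    def as_inf(lo, hi):
        return (-math.inf if lo is None else lo, math.inf if hi is None else hi)
    overlaps = []
    for (id_a, a), (id_b, b) in combinations(scene_ranges.items(), 2):
        # Two hyper-rectangles overlap only if they overlap in every dimension.
        if all(max(as_inf(*a[v])[0], as_inf(*b[v])[0]) < min(as_inf(*a[v])[1], as_inf(*b[v])[1])
               for v in variables):
            overlaps.append((id_a, id_b))
    return overlaps

find_overlaps({1: {"temp": (0.0, 50.0)}, 2: {"temp": (40.0, 100.0)}}, ["temp"])
# [(1, 2)] -> _validate_no_overlaps would raise ValueError here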

_validate_scene_definition_file(scene_df: DataFrame) → None#

Validate scene definition file for complete coverage and no overlaps.

Ensures that:

  1. Column names follow the expected format (variable_min, variable_max pairs)

  2. Min values are less than or equal to max values for all bins

  3. Every possible combination of variable values maps to exactly one scene ID

  4. There are no gaps in coverage (all value combinations are classified)

  5. There are no overlaps (no ambiguous classifications)

Parameters:
  • scene_df (pd.DataFrame) – DataFrame loaded from scene definition CSV with columns: scene_id, var1_min, var1_max, var2_min, var2_max, …

  • required_columns (list of str) – List of variable names that should have min/max pairs

Raises:

ValueError – If any validation check fails, with detailed description of the issue

Examples

>>> df = pd.DataFrame({
...     'scene_id': [1, 2],
...     'temp_min': [0, 50],
...     'temp_max': [50, 100],
...     'pressure_min': [900, 900],
...     'pressure_max': [1100, 1100]
... })
>>> validate_scene_definition_file(df, ['temp', 'pressure'])
# Passes validation
>>> df = pd.DataFrame({
...     'scene_id': [1, 2],
...     'temp_min': [0, 40],  # Overlap at 40-50
...     'temp_max': [50, 100],
...     'pressure_min': [900, 900],
...     'pressure_max': [1100, 1100]
... })
>>> validate_scene_definition_file(df, ['temp', 'pressure'])
ValueError: Overlapping scenes detected...
static _validate_scene_ids(scene_df: DataFrame) → None#

Validate scene_id column contains unique integer values.

Parameters:

scene_df (pd.DataFrame) – Scene definition DataFrame

Raises:

ValueError – If scene IDs are not unique or not integer-convertible
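A small sketch of the ID check (the exact error wording and handling are the implementation's; this helper is illustrative only):

import pandas as pd

def check_scene_ids(scene_df):
    ids = pd.to_numeric(scene_df["scene_id"], errors="raise")  # raises if not numeric
    if (ids != ids.astype(int)).any():
        raise ValueError("scene_id values must be integers")
    if ids.duplicated().any():
        raise ValueError("scene_id values must be unique")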

identify_and_update(data: Dataset) → Dataset#

Identify scene IDs for all data points.

Parameters:

data (xr.Dataset) – Dataset containing all required variables for scene identification

Returns:

Input dataset with scene ID variable added as f"scene_id_{self.type.lower()}"

Return type:

xr.Dataset
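An end-to-end usage sketch; the file name, dimension name, and variable names are hypothetical, but the output variable follows the scene_id_{type} naming described above:

import xarray as xr
from pathlib import Path
from libera_utils.scene_definitions import SceneDefinition

# Hypothetical inputs for illustration only.
scene_def = SceneDefinition(Path("trmm.csv"))
data = xr.Dataset({"cloud_fraction": ("footprint", [0.1, 0.8]),
                   "optical_depth": ("footprint", [2.0, 15.0])})
data = scene_def.identify_and_update(data)
print(data["scene_id_trmm"])  # one scene ID per footprint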