libera_utils.scene_definitions
Classes
Scene | Represents a single scene with its variable bin definitions.
SceneDefinition | Defines scenes and their classification rules from CSV configuration.
- class libera_utils.scene_definitions.Scene(scene_id: int, variable_ranges: dict[str, tuple[float | None, float | None]])
Represents a single scene with its variable bin definitions.
A scene defines a specific atmospheric state characterized by ranges of multiple variables (e.g., cloud fraction, optical depth, surface type). Data points are classified into scenes when all their variable values fall within the scene’s defined ranges.
- variable_ranges
Dictionary mapping variable names to (min, max) tuples defining the acceptable range for each variable. None values indicate unbounded ranges (no min or no max constraint).
- matches(data_point)
Check if a data point belongs to this scene.
Examples
>>> scene = Scene(
...     scene_id=1,
...     variable_ranges={
...         "cloud_fraction": (0.0, 50.0),
...         "optical_depth": (0.0, 10.0)
...     }
... )
>>> scene.matches({"cloud_fraction": 30.0, "optical_depth": 5.0})
True
>>> scene.matches({"cloud_fraction": 60.0, "optical_depth": 5.0})
False
Methods
Get list of variables that have at least one defined bound.
matches(data_point) | Check if a data point falls within all variable ranges for this scene.
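The matching rule described for Scene can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `SceneSketch` is a hypothetical stand-in class, and two behaviours are assumptions that the documentation does not state, namely that the upper bound is exclusive and that a data point missing a variable never matches.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SceneSketch:
    """Hypothetical stand-in for Scene, used only for illustration."""

    scene_id: int
    variable_ranges: dict[str, tuple[float | None, float | None]]

    def matches(self, data_point: dict[str, float]) -> bool:
        """Return True only if every variable falls inside its (min, max) range."""
        for name, (low, high) in self.variable_ranges.items():
            value = data_point.get(name)
            if value is None:
                return False  # assumption: a missing variable never matches
            if low is not None and value < low:
                return False
            # Assumption: the upper bound is exclusive, so adjacent bins
            # such as (0, 50) and (50, 100) never both match the value 50.
            if high is not None and value >= high:
                return False
        return True


# None as the minimum leaves optical_depth unbounded from below.
scene = SceneSketch(1, {"cloud_fraction": (0.0, 50.0), "optical_depth": (None, 10.0)})
print(scene.matches({"cloud_fraction": 30.0, "optical_depth": 5.0}))
print(scene.matches({"cloud_fraction": 60.0, "optical_depth": 5.0}))
```

Whether the real class treats the upper edge as inclusive or exclusive should be checked against the source before relying on boundary values.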
- class libera_utils.scene_definitions.SceneDefinition(definition_path: Path)
Defines scenes and their classification rules from CSV configuration.
Loads and manages scene definitions from a CSV file, providing functionality to identify which scene a given set of atmospheric measurements belongs to.
- identify(data)
Identify scene IDs for all data points in a dataset.
- validate_input_data_columns(data)
Validate that the dataset contains all required variables.
Notes
- Expected CSV format:
scene_id,variable1_min,variable1_max,variable2_min,variable2_max,…
1,0.0,10.0,20.0,30.0,…
2,10.0,20.0,30.0,40.0,…
Each variable must have both a _min and _max column. NaN or empty values indicate unbounded ranges.
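As a rough illustration of the CSV layout above, the following sketch reads a small definition table with the standard-library `csv` module and turns empty or NaN cells into `None`. The helper names `to_bound` and `load_ranges` and the sample variables are hypothetical, and the column names in the DataFrame examples suggest the library itself reads the file with pandas rather than this way.

```python
from __future__ import annotations

import csv
import io

# Sample definition file; the empty optical_depth_min cells mean "no lower bound".
CSV_TEXT = """scene_id,cloud_fraction_min,cloud_fraction_max,optical_depth_min,optical_depth_max
1,0.0,50.0,,10.0
2,50.0,100.0,,10.0
"""


def to_bound(cell: str) -> float | None:
    """Convert a CSV cell to a bound; empty or NaN cells become None."""
    text = cell.strip()
    if text == "" or text.lower() == "nan":
        return None
    return float(text)


def load_ranges(text: str) -> dict[int, dict[str, tuple[float | None, float | None]]]:
    """Map each scene_id to its {variable: (min, max)} ranges."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    # Variable names are the column names with the 4-character suffix removed.
    variables = sorted({c[:-4] for c in columns if c.endswith(("_min", "_max"))})
    scenes = {}
    for row in reader:
        scenes[int(row["scene_id"])] = {
            v: (to_bound(row[f"{v}_min"]), to_bound(row[f"{v}_max"])) for v in variables
        }
    return scenes


ranges = load_ranges(CSV_TEXT)
print(ranges[1])
```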
Examples
>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> print(scene_def.type)
'TRMM'
>>> print(len(scene_def.scenes))
42
Methods
identify_and_update(data) | Identify scene IDs for all data points.
- _compute_global_bounds(variables: list[str]) → dict[str, tuple[float | None, float | None]]
Compute the global bounding box that contains all scenes.
- Parameters:
variables (list[str])
- Returns:
Global bounds for each variable
- Return type:
dict[str, tuple[float | None, float | None]]
Notes
Global min is the minimum of all scene mins (excluding None/unbounded). Global max is the maximum of all scene maxs (excluding None/unbounded). If all scenes are unbounded in a direction, returns None for that bound.
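The rule in the notes above can be written out as a short sketch. The function below is a hypothetical free-standing version that takes a list of per-scene range dictionaries instead of reading them from the class, which is an assumption made to keep the example self-contained.

```python
from __future__ import annotations


def compute_global_bounds(
    scene_ranges: list[dict[str, tuple[float | None, float | None]]],
    variables: list[str],
) -> dict[str, tuple[float | None, float | None]]:
    """Smallest finite min and largest finite max per variable, or None if there is none."""
    bounds = {}
    for var in variables:
        mins = [r[var][0] for r in scene_ranges if var in r and r[var][0] is not None]
        maxs = [r[var][1] for r in scene_ranges if var in r and r[var][1] is not None]
        # An empty list means every scene is unbounded in that direction.
        bounds[var] = (min(mins) if mins else None, max(maxs) if maxs else None)
    return bounds


scenes = [
    {"temp": (0.0, 50.0), "pressure": (None, 1000.0)},
    {"temp": (50.0, 100.0), "pressure": (None, 1100.0)},
]
global_bounds = compute_global_bounds(scenes, ["temp", "pressure"])
print(global_bounds)
```

Here `pressure` has no lower bound in any scene, so its global minimum stays `None`.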
- static _compute_intersection(rect1: Scene, rect2: Scene, variables: list[str]) → dict[str, tuple[float | None, float | None]] | None
Compute the intersection of two hyper-rectangles.
- Parameters:
rect1 (Scene)
rect2 (Scene)
variables (list[str])
- Returns:
Dictionary of variable bounds for the intersection region, or None if the rectangles don’t intersect
- Return type:
dict or None
Notes
For each dimension, computes the intersection of two intervals:
- Intersection of [a1, b1] and [a2, b2] is [max(a1, a2), min(b1, b2)]
- Intersection exists only if max(a1, a2) < min(b1, b2)
- Special handling for unbounded (None) values
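A minimal sketch of the per-dimension interval intersection follows. It operates on plain range dictionaries rather than Scene objects, and the way `None` is handled (an unbounded side simply defers to the other interval's bound) is an assumption about what "special handling" means here.

```python
from __future__ import annotations


def tighter_low(a: float | None, b: float | None) -> float | None:
    """Larger of two lower bounds, where None means unbounded below."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def tighter_high(a: float | None, b: float | None) -> float | None:
    """Smaller of two upper bounds, where None means unbounded above."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


def intersect(ranges1, ranges2, variables):
    """Return the intersection region, or None if any dimension does not overlap."""
    region = {}
    for var in variables:
        a1, b1 = ranges1.get(var, (None, None))
        a2, b2 = ranges2.get(var, (None, None))
        low, high = tighter_low(a1, a2), tighter_high(b1, b2)
        # Strict inequality: intervals that only touch at an edge do not intersect.
        if low is not None and high is not None and low >= high:
            return None
        region[var] = (low, high)
    return region


print(intersect({"temp": (0.0, 50.0)}, {"temp": (40.0, 100.0)}, ["temp"]))
print(intersect({"temp": (0.0, 50.0)}, {"temp": (50.0, 100.0)}, ["temp"]))
```

Because the test is strict, two bins that share only the edge value 50 are treated as non-overlapping.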
- static _extract_variable_names(columns: Index) → list[str]
Extract unique variable names from min/max column pairs.
- Parameters:
columns (pd.Index) – Column names from the CSV
- Returns:
Sorted list of unique variable names
- Return type:
list[str]
Notes
Variable names are extracted by removing the ‘_min’ or ‘_max’ suffix from column names. Only columns with these suffixes are considered as variable definitions.
Examples
>>> cols = pd.Index(['scene_id', 'temp_min', 'temp_max', 'pressure_min', 'pressure_max'])
>>> scene_def._extract_variable_names(cols)
['pressure', 'temp']
- static _find_gaps(scenes: list[Scene], global_bounds: dict[str, tuple[float, float]], variables: list[str]) → list[dict[str, tuple[float, float]]]
Find gaps in property ranges defined in scenes.
- static _generate_cells_from_boundaries(boundary_values: dict[str, list[float]], variables: list[str]) → list[dict[str, tuple[float, float]]]
Generate all cells (hyper-rectangles) from boundary values.
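One plausible way to combine cell generation with gap detection is sketched below; the documentation does not describe the actual algorithm, so treat this as an assumption. Cells are built from consecutive boundary values along each variable, and a cell counts as a gap when its midpoint lies in no scene. The midpoint test is sufficient because every scene edge is also a cell edge, so a cell is either entirely inside or entirely outside each scene. The function names and the half-open membership test are hypothetical.

```python
from __future__ import annotations

from itertools import product


def generate_cells(boundary_values, variables):
    """Build every cell from consecutive boundary pairs along each variable."""
    intervals_per_variable = []
    for var in variables:
        edges = sorted(set(boundary_values[var]))
        intervals_per_variable.append(list(zip(edges[:-1], edges[1:])))
    return [dict(zip(variables, combo)) for combo in product(*intervals_per_variable)]


def point_in_ranges(point, ranges, variables):
    """Half-open membership test (assumption): min <= value < max, None is unbounded."""
    for var in variables:
        low, high = ranges.get(var, (None, None))
        if low is not None and point[var] < low:
            return False
        if high is not None and point[var] >= high:
            return False
    return True


def find_gaps(scene_ranges, variables):
    """Return the cells that no scene covers."""
    boundary_values = {
        var: [b for r in scene_ranges for b in r.get(var, (None, None)) if b is not None]
        for var in variables
    }
    gaps = []
    for cell in generate_cells(boundary_values, variables):
        midpoint = {var: (low + high) / 2 for var, (low, high) in cell.items()}
        if not any(point_in_ranges(midpoint, r, variables) for r in scene_ranges):
            gaps.append(cell)
    return gaps


scenes = [
    {"temp": (0.0, 50.0), "pressure": (900.0, 1100.0)},
    {"temp": (50.0, 100.0), "pressure": (900.0, 1000.0)},  # leaves a hole above 1000
]
gaps = find_gaps(scenes, ["temp", "pressure"])
print(gaps)
```

The number of cells grows as the product of the interval counts per variable, so this approach suits small definition tables.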
- _identify_classification_variables() → list[str]
Identify variables that are actually used for classification.
A variable is used for classification if at least one scene has at least one defined bound for that variable.
- _identify_vectorized(data: Dataset, shape: tuple[int, ...]) → ndarray
Vectorized scene identification using numpy arrays.
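A vectorized classification of this kind can be sketched with boolean masks. The example below is illustrative only: it takes a plain dictionary of NumPy arrays instead of an xarray Dataset, the fill value of -1 for unmatched points is an assumption, and so is the exclusive upper bound.

```python
from __future__ import annotations

import numpy as np


def identify_vectorized(columns, scenes, fill_value=-1):
    """Assign a scene ID to every element; unmatched elements keep fill_value."""
    shape = next(iter(columns.values())).shape
    scene_ids = np.full(shape, fill_value, dtype=np.int64)
    for scene_id, ranges in scenes.items():
        mask = np.ones(shape, dtype=bool)
        for var, (low, high) in ranges.items():
            values = columns[var]
            if low is not None:
                mask &= values >= low
            if high is not None:
                mask &= values < high  # assumption: exclusive upper bound
        scene_ids[mask] = scene_id
    return scene_ids


columns = {
    "cloud_fraction": np.array([10.0, 60.0, 120.0]),
    "optical_depth": np.array([5.0, 5.0, 5.0]),
}
scenes = {
    1: {"cloud_fraction": (0.0, 50.0), "optical_depth": (None, 10.0)},
    2: {"cloud_fraction": (50.0, 100.0), "optical_depth": (None, 10.0)},
}
result = identify_vectorized(columns, scenes)
print(result)
```

NaN inputs fail every comparison, so in this sketch they fall through to the fill value instead of raising an error.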
- static _parse_row_to_ranges(row: Series, variable_names: list[str]) → dict[str, tuple[float | None, float | None]]
Parse a CSV row into variable ranges.
- Parameters:
row (pd.Series)
variable_names (list[str])
- Returns:
Dictionary mapping variable names to (min, max) tuples. None values indicate unbounded ranges (no constraint).
- Return type:
dict[str, tuple[float | None, float | None]]
Notes
For each variable, looks for columns named {variable}_min and {variable}_max. NaN values in the CSV are converted to None to indicate unbounded ranges.
Examples
>>> row = pd.Series({'scene_id': 1, 'temp_min': 0.0, 'temp_max': 100.0,
...                  'pressure_min': np.nan, 'pressure_max': 1000.0})
>>> scene_def._parse_row_to_ranges(row, ['temp', 'pressure'])
{'temp': (0.0, 100.0), 'pressure': (None, 1000.0)}
- static _point_in_scene(point: dict[str, float], scene, variables: list[str]) → bool
Check if a point falls within a scene’s bounds.
- _validate_column_name_format(scene_df: DataFrame) → None
Validate that all required variable columns exist with _min and _max suffixes.
- Parameters:
scene_df (pd.DataFrame) – Scene definition DataFrame
- Raises:
ValueError – If scene_id column is missing or if any required variable is missing its _min or _max column
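The column-name check described here might look like the following sketch. It works on a plain list of column names rather than a DataFrame, and the function name and error messages are hypothetical.

```python
from __future__ import annotations


def validate_column_names(columns: list[str]) -> list[str]:
    """Return the sorted variable names, or raise ValueError on a malformed header."""
    if "scene_id" not in columns:
        raise ValueError("Missing required column: scene_id")
    suffixes = ("_min", "_max")
    # Both suffixes are four characters long, so slicing them off yields the variable name.
    variables = sorted({c[:-4] for c in columns if c.endswith(suffixes)})
    missing = [f"{v}{s}" for v in variables for s in suffixes if f"{v}{s}" not in columns]
    if missing:
        raise ValueError(f"Missing bound columns: {missing}")
    return variables


variables = validate_column_names(
    ["scene_id", "temp_min", "temp_max", "pressure_min", "pressure_max"]
)
print(variables)
```

A header such as `scene_id,temp_min` fails in this sketch because `temp_max` is absent.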
- _validate_complete_coverage() → None
Validate that scenes completely cover the bounded parameter space.
- _validate_footprint_data_columns_present(data: Dataset)
Ensure input data contains all required FootprintVariables.
- Parameters:
data (xr.Dataset) – Dataset to validate
- Raises:
ValueError – If required variables are missing from the dataset, with a message listing all missing variables
Examples
>>> scene_def = SceneDefinition(Path("scenes.csv"))
>>> scene_def.required_columns = ['cloud_fraction', 'optical_depth']
>>> data = xr.Dataset({'cloud_fraction': [10, 20]})
>>> scene_def.validate_input_data_columns(data)
ValueError: Required columns ['optical_depth'] not in input data for TRMM scene identification.
- _validate_min_max_ordering(scene_df: DataFrame) → None
Validate that min values are less than or equal to max values for all bins.
- Parameters:
scene_df (pd.DataFrame) – Scene definition DataFrame
- Raises:
ValueError – If any bin has min > max for any variable
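The ordering rule is simple enough to sketch directly. The version below checks a dictionary of parsed ranges instead of a DataFrame and skips any bin with an unbounded side; both choices are assumptions, as are the function name and the message format.

```python
from __future__ import annotations


def check_min_max_ordering(scenes: dict[int, dict[str, tuple[float | None, float | None]]]) -> None:
    """Raise ValueError listing every bin whose min exceeds its max."""
    problems = []
    for scene_id, ranges in scenes.items():
        for var, (low, high) in ranges.items():
            # Unbounded sides (None) cannot violate the ordering.
            if low is not None and high is not None and low > high:
                problems.append(f"scene {scene_id}: {var} min {low} > max {high}")
    if problems:
        raise ValueError("; ".join(problems))


# Valid input returns silently; min == max is allowed by the documented rule.
check_min_max_ordering({1: {"temp": (0.0, 50.0)}, 2: {"temp": (50.0, 50.0)}})

try:
    check_min_max_ordering({3: {"temp": (10.0, 5.0)}})
except ValueError as exc:
    message = str(exc)
    print(message)
```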
- _validate_no_overlaps() → None
Validate that no two scenes in the scene definition overlap.
- Raises:
ValueError – If any two scenes overlap, with details about the overlapping region
Notes
This handles unbounded ranges in the following ways:
- None for min means -∞ (always overlaps with any max)
- None for max means +∞ (always overlaps with any min)
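The infinity convention in these notes maps naturally onto `math.inf`. The sketch below substitutes ±∞ for `None` and then compares every pair of scenes; the pairwise loop and the function names are assumptions for illustration, not the library's documented algorithm.

```python
from __future__ import annotations

import math
from itertools import combinations


def as_interval(bounds):
    """Replace None with -inf or +inf so ordinary comparisons work."""
    low, high = bounds
    return (-math.inf if low is None else low, math.inf if high is None else high)


def find_overlaps(scenes, variables):
    """Return (id1, id2, region) for every pair of scenes that overlaps in all variables."""
    overlaps = []
    for (id1, r1), (id2, r2) in combinations(scenes.items(), 2):
        region = {}
        for var in variables:
            a1, b1 = as_interval(r1.get(var, (None, None)))
            a2, b2 = as_interval(r2.get(var, (None, None)))
            low, high = max(a1, a2), min(b1, b2)
            if low >= high:
                break  # disjoint in this variable, so the scenes cannot overlap
            region[var] = (low, high)
        else:
            overlaps.append((id1, id2, region))
    return overlaps


scenes = {
    1: {"temp": (None, 50.0)},
    2: {"temp": (40.0, 100.0)},
    3: {"temp": (100.0, None)},
}
overlaps = find_overlaps(scenes, ["temp"])
print(overlaps)
```

A validator built on this would raise `ValueError` whenever the returned list is non-empty and include the region in the message.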
- _validate_scene_definition_file(scene_df: DataFrame) → None
Validate scene definition file for complete coverage and no overlaps.
Ensures that:
1. Column names follow the expected format (variable_min, variable_max pairs)
2. Min values are less than or equal to max values for all bins
3. Every possible combination of variable values maps to exactly one scene ID
4. There are no gaps in coverage (all value combinations are classified)
5. There are no overlaps (no ambiguous classifications)
- Parameters:
scene_df (pd.DataFrame) – Scene definition DataFrame
- Raises:
ValueError – If any validation check fails, with detailed description of the issue
Examples
>>> df = pd.DataFrame({
...     'scene_id': [1, 2],
...     'temp_min': [0, 50],
...     'temp_max': [50, 100],
...     'pressure_min': [900, 900],
...     'pressure_max': [1100, 1100]
... })
>>> validate_scene_definition_file(df, ['temp', 'pressure'])  # Passes validation
>>> df = pd.DataFrame({
...     'scene_id': [1, 2],
...     'temp_min': [0, 40],  # Overlap at 40-50
...     'temp_max': [50, 100],
...     'pressure_min': [900, 900],
...     'pressure_max': [1100, 1100]
... })
>>> validate_scene_definition_file(df, ['temp', 'pressure'])
ValueError: Overlapping scenes detected...
- static _validate_scene_ids(scene_df: DataFrame) → None
Validate scene_id column contains unique integer values.
- Parameters:
scene_df (pd.DataFrame) – Scene definition DataFrame
- Raises:
ValueError – If scene IDs are not unique or not integer-convertible
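A check for unique, integer-convertible IDs could be sketched as follows. It takes a plain list of raw values instead of a DataFrame column, and it treats a value as integer-convertible when `float(value)` is a whole number; both are assumptions, since the documentation does not define the conversion.

```python
from __future__ import annotations


def validate_scene_ids(raw_ids: list[object]) -> list[int]:
    """Return the IDs as ints, or raise ValueError if any is non-integer or duplicated."""
    ids = []
    for raw in raw_ids:
        try:
            number = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f"scene_id {raw!r} is not numeric") from None
        # NaN and infinities fail is_integer(), so they are rejected here too.
        if not number.is_integer():
            raise ValueError(f"scene_id {raw!r} is not an integer value")
        ids.append(int(number))
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"Duplicate scene IDs: {duplicates}")
    return ids


clean_ids = validate_scene_ids([1, "2", 3.0])
print(clean_ids)
```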