libera_utils.scene_id.SceneDefinition#

class libera_utils.scene_id.SceneDefinition(definition_path: Path)#

Bases: object

Defines scenes and their classification rules from CSV configuration.

Loads and manages scene definitions from a CSV file, providing functionality to identify which scene a given set of atmospheric measurements belongs to.

type#

Type of scene definition (e.g., ‘TRMM’, ‘ERBE’), derived from the filename

Type:

str

scenes#

List of scene definitions with their variable ranges

Type:

list of Scene

required_columns#

List of variable names required for scene identification

Type:

list of str

identify(data)#

Identify scene IDs for all data points in the dataset.

validate_input_data_columns(data)#

Validate that the dataset contains all required variables.

Notes

Expected CSV format:

scene_id,variable1_min,variable1_max,variable2_min,variable2_max,…
1,0.0,10.0,20.0,30.0,…
2,10.0,20.0,30.0,40.0,…

Each variable must have both a _min and _max column. NaN or empty values indicate unbounded ranges.
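
For example, a minimal two-scene definition file (the variable name and values here are hypothetical, not from the library) might look like this, with an empty cell leaving a range unbounded:

scene_id,cloud_fraction_min,cloud_fraction_max
1,0.0,50.0
2,50.0,

Here scene 2 has no upper bound on cloud_fraction.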

Examples

>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> scene_def.type
'TRMM'
>>> print(len(scene_def.scenes))
42

Methods

identify(data)

Identify scene IDs for all data points in the dataset.

validate_input_data_columns(data)

Ensure input data contains all required FootprintVariables.

validate_scene_definition_file()

Ensure that the scene definition file has valid column names and bin ranges, that classification parameters are not duplicated across IDs, and that there are no gaps in classification bins.

__init__(definition_path: Path)#

Initialize scene definition from CSV file.

Parameters:

definition_path (pathlib.Path) – Path to CSV file containing scene definitions

Notes

The CSV file must contain:

  • A ‘scene_id’ column with unique integer identifiers

  • Pairs of columns for each variable: {var}_min and {var}_max

  • At least one variable pair

_extract_variable_names(columns: Index) → list[str]#

Extract unique variable names from min/max column pairs.

Parameters:

columns (pd.Index) – Column names from the CSV

Returns:

Sorted list of unique variable names

Return type:

list of str

Notes

Variable names are extracted by removing the ‘_min’ or ‘_max’ suffix from column names. Only columns with these suffixes are considered as variable definitions.

Examples

>>> cols = pd.Index(['scene_id', 'temp_min', 'temp_max', 'pressure_min', 'pressure_max'])
>>> scene_def._extract_variable_names(cols)
['pressure', 'temp']
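
The suffix-stripping described above can be sketched as follows (an illustrative reimplementation, not the library's source):

import pandas as pd

def extract_variable_names(columns: pd.Index) -> list[str]:
    # Strip the '_min'/'_max' suffix and keep unique base names, sorted.
    return sorted({
        col.rsplit("_", 1)[0]
        for col in columns
        if col.endswith(("_min", "_max"))
    })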
_identify_vectorized(data: Dataset, dims: list[str], shape: tuple[int, ...]) → ndarray#

Vectorized scene identification for better performance.

Uses NumPy array operations to efficiently classify all data points simultaneously rather than iterating point-by-point.

Parameters:
  • data (xr.Dataset) – Dataset containing all required variables

  • dims (list of str) – List of dimension names

  • shape (tuple of int) – Shape of the output array

Returns:

Array of scene IDs with shape matching input dimensions

Return type:

np.ndarray

Notes

For each scene, creates a boolean mask identifying all matching points, then assigns the scene ID to those points. Earlier scenes in the list have priority for overlapping classifications.
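
A minimal sketch of this masking pattern, assuming a simplified scene representation (plain dicts of (min, max) ranges rather than the library's Scene objects; the inclusive/exclusive bound convention is also an assumption):

import numpy as np

scenes = [
    {"id": 1, "ranges": {"cloud_fraction": (0.0, 50.0)}},
    {"id": 2, "ranges": {"cloud_fraction": (50.0, None)}},  # None = unbounded
]
variables = {"cloud_fraction": np.array([20.0, 60.0, np.nan])}

scene_ids = np.full(3, -1, dtype=int)      # -1 marks unmatched points
unassigned = np.ones(3, dtype=bool)
for scene in scenes:
    mask = unassigned.copy()
    for name, (lo, hi) in scene["ranges"].items():
        values = variables[name]
        mask &= ~np.isnan(values)          # NaN points never match
        if lo is not None:
            mask &= values >= lo
        if hi is not None:
            mask &= values < hi
    scene_ids[mask] = scene["id"]
    unassigned &= ~mask                    # earlier scenes keep priority
print(scene_ids)                           # [ 1  2 -1]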

_parse_row_to_ranges(row: Series, variable_names: list[str]) → dict[str, tuple[float | None, float | None]]#

Parse a CSV row into variable ranges.

Parameters:
  • row (pd.Series) – Row from the scene definition DataFrame containing scene_id and variable min/max values

  • variable_names (list of str) – List of variable names to extract ranges for

Returns:

Dictionary mapping variable names to (min, max) tuples. None values indicate unbounded ranges (no constraint).

Return type:

dict of str to tuple of (float or None, float or None)

Notes

For each variable, looks for columns named {variable}_min and {variable}_max. NaN values in the CSV are converted to None to indicate unbounded ranges.

Examples

>>> row = pd.Series({'scene_id': 1, 'temp_min': 0.0, 'temp_max': 100.0,
...                  'pressure_min': np.nan, 'pressure_max': 1000.0})
>>> scene_def._parse_row_to_ranges(row, ['temp', 'pressure'])
{'temp': (0.0, 100.0), 'pressure': (None, 1000.0)}
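
The NaN-to-None conversion described above might look roughly like this (illustrative only, not the library's source):

import pandas as pd

def parse_row_to_ranges(row: pd.Series, variable_names: list[str]) -> dict:
    # Convert each {var}_min/{var}_max pair into a (min, max) tuple,
    # mapping NaN cells to None so a missing bound means "no constraint".
    ranges = {}
    for name in variable_names:
        lo, hi = row[f"{name}_min"], row[f"{name}_max"]
        ranges[name] = (
            None if pd.isna(lo) else float(lo),
            None if pd.isna(hi) else float(hi),
        )
    return ranges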
identify(data: Dataset) → DataArray#

Identify scene IDs for all data points in the dataset.

Classifies each data point in the dataset by finding the first scene whose variable ranges match all the data point’s variable values.

Parameters:

data (xr.Dataset) – Dataset containing all required variables for scene identification

Returns:

Array of scene IDs with the same dimensions as the input data. Scene ID of -1 indicates no matching scene was found for that point.

Return type:

xr.DataArray

Raises:

ValueError – If the dataset is missing required variables

Notes

  • Scene matching uses first-match priority: if multiple scenes could match a data point, the first one in the definition list is assigned

  • Data points with NaN values in any required variable are not matched

  • The method logs statistics about matched/unmatched points and the distribution of scene IDs

Examples

>>> data = xr.Dataset({
...     'cloud_fraction': (('x',), [20.0, 60.0, 85.0]),
...     'optical_depth': (('x',), [5.0, 15.0, 25.0])
... })
>>> scene_def = SceneDefinition(Path("scenes.csv"))
>>> scene_ids = scene_def.identify(data)
>>> scene_ids.values
array([ 1,  2, -1])  # Last point didn't match any scene
validate_input_data_columns(data: Dataset)#

Ensure input data contains all required FootprintVariables.

Parameters:

data (xr.Dataset) – Dataset to validate

Raises:

ValueError – If required variables are missing from the dataset, with a message listing all missing variables

Examples

>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> scene_def.required_columns = ['cloud_fraction', 'optical_depth']
>>> data = xr.Dataset({'cloud_fraction': ('x', [10, 20])})
>>> scene_def.validate_input_data_columns(data)
Traceback (most recent call last):
    ...
ValueError: Required columns ['optical_depth'] not in input data for TRMM scene identification.
validate_scene_definition_file()#

Ensure that the scene definition file has valid column names and bin ranges, that classification parameters are not duplicated across IDs, and that there are no gaps in classification bins.

Raises:

NotImplementedError – This validation is not yet implemented

Notes

TODO: LIBSDC-589 Implement validation checks for:

  • Valid column naming conventions

  • Non-overlapping scene definitions

  • Complete coverage of parameter space (no gaps)

  • Consistent min/max value ordering
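
Purely for illustration, the min/max ordering check could take a shape like the following (a sketch assuming the CSV layout described above; LIBSDC-589 remains unimplemented, so this is not the planned implementation):

import pandas as pd

def check_min_max_ordering(df: pd.DataFrame, variable_names: list[str]) -> None:
    # Candidate check: every row with both bounds defined must satisfy min <= max.
    for name in variable_names:
        lo, hi = df[f"{name}_min"], df[f"{name}_max"]
        bad = lo.notna() & hi.notna() & (lo > hi)
        if bad.any():
            raise ValueError(
                f"{name}: min > max for scene_id(s) {df.loc[bad, 'scene_id'].tolist()}"
            )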