libera_utils.scene_id.SceneDefinition
- class libera_utils.scene_id.SceneDefinition(definition_path: Path)
Bases: object
Defines scenes and their classification rules from CSV configuration.
Loads and manages scene definitions from a CSV file, providing functionality to identify which scene a given set of atmospheric measurements belongs to.
- identify(data)
Identify scene IDs for all data points in a dataset.
- validate_input_data_columns(data)
Validate that the dataset contains all required variables.
Notes
- Expected CSV format:
scene_id,variable1_min,variable1_max,variable2_min,variable2_max,…
1,0.0,10.0,20.0,30.0,…
2,10.0,20.0,30.0,40.0,…
Each variable must have both a _min and _max column. NaN or empty values indicate unbounded ranges.
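For illustration, a file following this format could be written with pandas (the variable name below is hypothetical; only the scene_id/_min/_max convention comes from this class):
>>> import pandas as pd
>>> scenes = pd.DataFrame({
...     'scene_id': [1, 2],
...     'cloud_fraction_min': [0.0, 50.0],
...     'cloud_fraction_max': [50.0, float('nan')]})  # NaN leaves scene 2 unbounded above
>>> scenes.to_csv("scenes.csv", index=False)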
Examples
>>> scene_def = SceneDefinition(Path("trmm.csv"))
>>> scene_def.type
'TRMM'
>>> len(scene_def.scenes)
42
Methods
identify(data)
Identify scene IDs for all data points in the dataset.
validate_input_data_columns(data)
Ensure input data contains all required FootprintVariables.
validate_scene_definition_file()
Ensure the scene definition file contains valid column names and bin ranges, that classification parameters are not duplicated across IDs, and that there are no gaps in classification bins.
- __init__(definition_path: Path)
Initialize scene definition from CSV file.
- Parameters:
definition_path (pathlib.Path) – Path to CSV file containing scene definitions
- Raises:
FileNotFoundError – If the definition file does not exist
ValueError – If the CSV format is invalid or missing required columns
Notes
The CSV file must contain:
- A 'scene_id' column with unique integer identifiers
- Pairs of columns for each variable: {var}_min and {var}_max
- At least one variable pair
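As a rough pre-check of those three requirements against a loaded table (a sketch using plain pandas, not part of this class):
>>> import pandas as pd
>>> df = pd.read_csv("scenes.csv")
>>> assert 'scene_id' in df.columns and df['scene_id'].is_unique
>>> variables = {c.rsplit('_', 1)[0] for c in df.columns if c.endswith(('_min', '_max'))}
>>> assert variables  # at least one variable pair
>>> assert all({f'{v}_min', f'{v}_max'} <= set(df.columns) for v in variables)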
- _extract_variable_names(columns: Index) → list[str]
Extract unique variable names from min/max column pairs.
- Parameters:
columns (pd.Index) – Column names from the CSV
- Returns:
Sorted list of unique variable names
- Return type:
list[str]
Notes
Variable names are extracted by removing the ‘_min’ or ‘_max’ suffix from column names. Only columns with these suffixes are considered as variable definitions.
Examples
>>> cols = pd.Index(['scene_id', 'temp_min', 'temp_max', 'pressure_min', 'pressure_max'])
>>> scene_def._extract_variable_names(cols)
['pressure', 'temp']
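The suffix handling described above is equivalent to the following sketch (not the method's actual source), reusing cols from above:
>>> sorted({c.removesuffix('_min').removesuffix('_max')
...         for c in cols if c.endswith(('_min', '_max'))})
['pressure', 'temp']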
- _identify_vectorized(data: Dataset, dims: list[str], shape: tuple[int, ...]) → ndarray
Vectorized scene identification for better performance.
Uses NumPy array operations to efficiently classify all data points simultaneously rather than iterating point-by-point.
- Parameters:
data (xr.Dataset) – Dataset containing all required variables for scene identification
dims (list of str) – Dimension names for the output scene ID array
shape (tuple of int) – Shape of the output scene ID array
- Returns:
Array of scene IDs with shape matching input dimensions
- Return type:
np.ndarray
Notes
For each scene, creates a boolean mask identifying all matching points, then assigns the scene ID to those points. Earlier scenes in the list have priority for overlapping classifications.
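A minimal sketch of this masking scheme for a single variable (hypothetical ranges; the real method combines masks over all variables, and the exact bin-edge inclusivity is an assumption here):
>>> import numpy as np
>>> values = np.array([5.0, 15.0, 25.0])
>>> scene_ids = np.full(values.shape, -1)  # -1 marks unmatched points
>>> for sid, (lo, hi) in [(1, (0.0, 10.0)), (2, (10.0, 20.0))]:
...     mask = (scene_ids == -1) & (values >= lo) & (values < hi)
...     scene_ids[mask] = sid  # earlier scenes win on overlap
>>> scene_ids
array([ 1,  2, -1])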
- _parse_row_to_ranges(row: Series, variable_names: list[str]) → dict[str, tuple[float | None, float | None]]
Parse a CSV row into variable ranges.
- Parameters:
row (pd.Series) – Row from the scene definition CSV
variable_names (list of str) – Variable names to look up in the row
- Returns:
Dictionary mapping variable names to (min, max) tuples. None values indicate unbounded ranges (no constraint).
- Return type:
dict[str, tuple[float | None, float | None]]
Notes
For each variable, looks for columns named {variable}_min and {variable}_max. NaN values in the CSV are converted to None to indicate unbounded ranges.
Examples
>>> row = pd.Series({'scene_id': 1, 'temp_min': 0.0, 'temp_max': 100.0,
...                  'pressure_min': np.nan, 'pressure_max': 1000.0})
>>> scene_def._parse_row_to_ranges(row, ['temp', 'pressure'])
{'temp': (0.0, 100.0), 'pressure': (None, 1000.0)}
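The NaN-to-None conversion amounts to the following sketch (equivalent logic, not the actual source), reusing row from above:
>>> {v: (None if pd.isna(row[f'{v}_min']) else row[f'{v}_min'],
...      None if pd.isna(row[f'{v}_max']) else row[f'{v}_max'])
...  for v in ['temp', 'pressure']}
{'temp': (0.0, 100.0), 'pressure': (None, 1000.0)}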
- identify(data: Dataset) → DataArray
Identify scene IDs for all data points in the dataset.
Classifies each data point in the dataset by finding the first scene whose variable ranges match all the data point’s variable values.
- Parameters:
data (xr.Dataset) – Dataset containing all required variables for scene identification
- Returns:
Array of scene IDs with the same dimensions as the input data. Scene ID of -1 indicates no matching scene was found for that point.
- Return type:
xr.DataArray
- Raises:
ValueError – If the dataset is missing required variables
Notes
- Scene matching uses first-match priority: if multiple scenes could match a data point, the first one in the definition list is assigned.
- Data points with NaN values in any required variable are not matched.
- The method logs statistics about matched/unmatched points and the distribution of scene IDs.
Examples
>>> data = xr.Dataset({
...     'cloud_fraction': ('x', [20.0, 60.0, 85.0]),
...     'optical_depth': ('x', [5.0, 15.0, 25.0])
... })
>>> scene_def = SceneDefinition(Path("scenes.csv"))
>>> scene_ids = scene_def.identify(data)
>>> scene_ids.values  # last point didn't match any scene
array([ 1,  2, -1])
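Unmatched points can then be filtered with standard xarray operations, continuing the example above:
>>> matched = scene_ids != -1
>>> int(matched.sum()), int((~matched).sum())
(2, 1)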
- validate_input_data_columns(data: Dataset)
Ensure input data contains all required FootprintVariables.
- Parameters:
data (xr.Dataset) – Dataset to validate
- Raises:
ValueError – If required variables are missing from the dataset, with a message listing all missing variables
Examples
>>> scene_def = SceneDefinition(Path("scenes.csv"))
>>> scene_def.required_columns = ['cloud_fraction', 'optical_depth']
>>> data = xr.Dataset({'cloud_fraction': ('x', [10, 20])})
>>> scene_def.validate_input_data_columns(data)
Traceback (most recent call last):
    ...
ValueError: Required columns ['optical_depth'] not in input data for TRMM scene identification.
- validate_scene_definition_file()
Ensure the scene definition file contains valid column names and bin ranges, that classification parameters are not duplicated across IDs, and that there are no gaps in classification bins.
- Raises:
NotImplementedError – This validation is not yet implemented
Notes
TODO: LIBSDC-589 Implement validation checks for:
- Valid column naming conventions
- Non-overlapping scene definitions
- Complete coverage of parameter space (no gaps)
- Consistent min/max value ordering
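For context, the min/max ordering check (the last item above) might look roughly like this sketch against the raw pandas table (hypothetical variable name; not an implementation of LIBSDC-589):
>>> import pandas as pd
>>> df = pd.read_csv("scenes.csv")
>>> bad = df[df['cloud_fraction_min'] > df['cloud_fraction_max']]  # NaN compares False, so unbounded ranges pass
>>> if not bad.empty:
...     raise ValueError(f"min exceeds max for scene_id(s) {bad['scene_id'].tolist()}")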