Working with NetCDF4 Files#

NetCDF4 is a complete redesign of the NetCDF file format based on HDF5 data structures. i.e. all NetCDF4 files are HDF5 files with some additional requirements and limitation of functionality. Note that not all HDF5 files are NetCDF4 files. For more information on NetCDF5 and the underlying HDF5 structures, see the documentation here. There are several python packages (and libraries in other languages) that support reading and writing NetCDF4 files. The SDC is using the Xarray python library.

The official documentation for Xarray is here.. It includes a much more comprehensive user guide with code examples.

Xarray builds on the numpy package, introducing labels for multidimensional arrays in python. These labels come in the form of coordinates, dimensions, and attributes. Xarray is broken into two main data structures: DataArrays and DataSets. DataArrays are contained within DataSets, such that a single DataSet can hold multiple DataArrays. DataSets can then be written to NetCDF4 files.

Reading NetCDF4 Files#

To read NetCDF4 files we can use Xarray as well. NetCDF4 files have similar structure to HDF5 files. NetCDF4 DataSets can have DataSets nested within one another. Here is an example of how to access each DataSet/Group.

import xarray

with xarray.open_dataset("filename", group='/') as ds:
    print(ds) # print the highest level group

Creating and Using DataArrays#

See documentation on DataArray.to_netcdf here.

DataArrays are arrays that can handle multiple dimensions with named or labeled axes. These DataArray objects add metadata such as dimension names, coordinates, and attributes. DataArrays can be created from numpy arrays, numpy-like arrays, pandas Series, and pandas DataFrames.

import xarray as xr
import numpy as np
import pandas as pd

# Create the coordinates
time = pd.date_range(start='2022-01-01', periods=10, freq='D')  # 10 daily time steps
lat = np.linspace(-90, 90, 5)  # 5 latitude points from -90 to 90
lon = np.linspace(-180, 180, 8)  # 8 longitude points from -180 to 180

# Create random data
data = np.random.random((len(time), len(lat), len(lon)))

# Create the DataArray
data_array = xr.DataArray(data, 
                          coords=[time, lat, lon], 
                          dims=['time', 'lat', 'lon'])

# Display the DataArray
print(data_array)


print(data_array.values)  # the data in the object
print(data_array.dims)  # access the dimensions 
print(data_array.coords)  # access the coors attribute
print(data_array.attrs)  # access metadata about the DataArray

You can create DataArrays across more dimensions as well. The number of variables in dims and coords should be equal for multiple dimensions. You can also modify the DataArrays values with scalars.

data_array.values = data_array.values * 2 # multiply the entire array by 2

Writing DataSets as NetCDF4#

See documentation on Dataset.to_netcdf here.

Creating DataSets is similar to creating DataArrays, provide the data variables themselves, along with the coords, dims and attributes you want to include. The data variables should be a dictionary with each key being the name of the data and each value can be a DataSet, pandas dataframe or numpy array. Coords and Dims should be a dictionary as well.

For this example, we are creating the DataArrays first ,then will add them all into one DataSet and write that to a NetCDF4 file. When creating DataSets, Xarray will often infer the dims from the coords and data variables given, if you do not pass any while creating DataSets. When writing to the NetCDF4 files if different variables “lie” on different dimensions, they will smash them together and replace the extra values (when viewed in a file viewer) with zeros. When writing a file using Xarray, using the engine “h5netcdf” will write the file faster.

import random
import pandas 
import numpy 
import xarray as xr

data_length = random.randint(1000,2000) # creating random vector length to simulate data

# creating multiple data fields to simulate fake data
times = pandas.date_range("2014-09-06", periods=data_length)
short_wave = numpy.random.rand(data_length)
long_wave = numpy.random.rand(data_length)
total_radiance = numpy.random.rand(data_length)
split_radiance = numpy.random.rand(data_length)

# creating the DataSets from the created fields
short_wave = xr.DataArray(short_wave, coords=[times], dims=['times'])
long_wave = xr.DataArray(long_wave, coords=[times], dims=['times'])
total_rad = xr.DataArray(total_radiance, coords=[times], dims=['times'])
split_rad = xr.DataArray(split_radiance, coords=[times], dims=['times'])

# create the DataSet 
ds = xr.Dataset({
    'short_wave': short_wave,
    'long_wave': long_wave,
    'total_radiance': total_rad,
    'split_rad': split_rad
},
    coords={'times': times},
)

# write to a NetCDF4 file
ds.to_netcdf('filename', group="/", mode='a', engine='h5netcdf')

You can specify group structure with group keyword, similar to a filesystem path (/groups/are/paths). When writing multiple DataSets to a file or if you need to append them, use keyword “mode” with value “a” to append.