Libera Data Models and Database Schema in DynamoDB on AWS#

Under Review (Remove before release)#

This document provides an overview of the data models used in the Libera databases in AWS DynamoDB. The Libera databases are designed to take advantage of the DynamoDB pricing model to minimize costs, and to use the serverless nature of DynamoDB to facilitate an asynchronous, event-driven architecture.

See DynamoDB Basics in AWS for more information on the basics of DynamoDB before diving into this page if you are unfamiliar with this AWS service.

This documentation provides examples of the data models for the different DynamoDB tables used in the Libera SDC. Each table shown here describes the key-value structure of a DynamoDB table and how it is used in the Libera SDC, laid out in the following way:

| Partition Key | Sort Key | Attribute 1 Key | Attribute 2 Key | Attribute N Key |
| --- | --- | --- | --- | --- |
| PK Value Example | SK Value Example | Value1 Example | Value2 Example | ValueN Example |
| PK Description | SK Description | Attribute 1 Description | Attribute 2 Description | Attribute N Description |

Note: Many tables in the Libera SDC use a generic key-value entry for the two keys (partition and sort) as follows:

  • Partition Key - "PK": "value"

  • Sort Key - "SK": "value"

This is in contrast to more specifically named keys, which are easier for a human to read but can restrict the DynamoDB table design as it relates to vertical partitioning and logical grouping of data. For example, the more human-readable key set below is not used in the Libera SDC, but it is a logical example of how the keys are used.

  • Partition Key - "Filename": "Example_file.nc"

  • Sort Key - "Filetype": "L0"

This generic usage of "PK" and "SK" is a common pattern in DynamoDB. It allows easy access to the data in the table and greater flexibility in the logical design of the data model, without the key names carrying confusing semantics.
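As a minimal sketch of why the generic names help, the snippet below uses plain Python dictionaries (not code from the Libera SDC) to place two different kinds of item in the same partition. Both use the same "PK" and "SK" attribute names, so the key schema does not change when a new kind of item is added. The item contents are taken from the examples in this document; the helper `primary_key` is hypothetical.

```python
# Two illustrative items that could share one table. Only "PK" and "SK"
# belong to the key schema; every other attribute is free-form.
file_item = {
    "PK": "Example_file.nc",
    "SK": "#",
    "archive-time": "2024-01-01 00:00:00",
    "algorithm-version": "1.0.0",
}
sortable_item = {
    "PK": "Example_file.nc",
    "SK": "#L0#SPICE#AZ",
    "applicable-date": "2024-01-01",
}


def primary_key(item):
    """Return the (PK, SK) pair that identifies an item (hypothetical helper)."""
    return (item["PK"], item["SK"])


# Same partition key, different sort keys: two distinct items in one partition.
keys = {primary_key(file_item), primary_key(sortable_item)}
print(sorted(keys))
```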

Libera Metadata and Provenance Database#

Metadata Data Model#

This database is used to store metadata and provenance information about the data files generated by the Libera SDC. The metadata for each file can broadly be broken down into three categories that are all stored together in the same table:

  1. File metadata about the file itself, such as the filename, the time it was archived, and the version of the algorithm used to generate the data file

  2. Sortable metadata about the file, such as the applicable date of the data in the file, calibration versions used, and other metadata that can be used to sort and filter the data in other processing steps

  3. Additional metadata that is specific to a given data level. For example, the construction records for level 0 data files contain a wide array of additional metadata that is not present in higher level data products.

Minimum Database Examples#

Additional attributes may be added as needed for specific data products; the examples below show the expected minimum.

File Metadata Example#

| PK | SK | archive-time | algorithm-version |
| --- | --- | --- | --- |
| "Example_file.nc" | "#" | "2024-01-01 00:00:00" | "1.0.0" |
| Unique Filename | Placeholder | Archive Time | Algorithm Version |

Sortable Metadata Example#

| PK | SK | applicable-date |
| --- | --- | --- |
| "Example_file.nc" | "#L0#SPICE#AZ" | "2024-01-01" |
| Unique Filename | File Type Identifier | Applicable Date of Data |
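The sort key values in these examples ("#" and "#L0#SPICE#AZ") look like "#"-separated parts with a leading "#". The sketch below shows one way such keys could be composed and split. It assumes that convention; the authoritative rules are whatever the implementation defines, and both helper names are hypothetical.

```python
def make_sort_key(*parts):
    """Join identifier parts into a '#'-prefixed sort key.

    With no parts this returns the bare placeholder "#".
    """
    for part in parts:
        if "#" in part or not part:
            raise ValueError(f"invalid sort key part: {part!r}")
    return "#" + "#".join(parts)


def split_sort_key(sort_key):
    """Split a '#'-prefixed sort key back into its parts."""
    if not sort_key.startswith("#"):
        raise ValueError(f"sort key must start with '#': {sort_key!r}")
    body = sort_key[1:]
    return body.split("#") if body else []


placeholder = make_sort_key()
spice_key = make_sort_key("L0", "SPICE", "AZ")
print(placeholder, spice_key, split_sort_key(spice_key))
```

Keeping the most general part first (for example the data level) means a prefix match on the sort key can select a whole family of items.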

Additional Metadata Example#

No minimum requirement.

Level 0 (L0) Data Specifics#

These are the raw binary data files received from ASDC. They arrive as two separate files: construction records and PDS files.

Construction Record (CR)#

CR File Metadata#

| PK | SK | ingest-time | archive-time | algorithm-version |
| --- | --- | --- | --- | --- |
| "L0_CONS_file.PDS" | "#" | "2024-01-01 00:00:00" | "2024-01-01 00:01:00" | "1.0.0" |
| Unique Filename | Placeholder | Ingest Time | Archive Time | Algorithm Version |

CR Sortable Metadata#

| PK | SK | applicable-date | first-packet-time | last-packet-time | missing-packet-count | filled-gap-count |
| --- | --- | --- | --- | --- | --- | --- |
| "L0_CONS_file.PDS" | "#L0#APID11" | "2024-01-01" | "2024-01-01 00:00:00" | "2024-01-01 00:00:00" | 0 | 0 |
| Unique Filename | File Type Identifier | Applicable Date of Data | First Packet Time | Last Packet Time | Missing Packet Count | Filled Gap Count |
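This table mixes string and numeric attributes. In DynamoDB's low-level wire format every value is tagged with a type, and numbers are transmitted as strings under the "N" tag. The sketch below converts the example item above into that format. The helper `to_attribute_values` is hypothetical and handles only strings and integers.

```python
def to_attribute_values(item):
    """Convert a flat dict of str/int values to DynamoDB's low-level format."""
    converted = {}
    for name, value in item.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            raise TypeError(f"unsupported type for {name!r}: bool")
        if isinstance(value, str):
            converted[name] = {"S": value}
        elif isinstance(value, int):
            converted[name] = {"N": str(value)}
        else:
            raise TypeError(f"unsupported type for {name!r}: {type(value).__name__}")
    return converted


cr_sortable = {
    "PK": "L0_CONS_file.PDS",
    "SK": "#L0#APID11",
    "applicable-date": "2024-01-01",
    "first-packet-time": "2024-01-01 00:00:00",
    "last-packet-time": "2024-01-01 00:00:00",
    "missing-packet-count": 0,
    "filled-gap-count": 0,
}
low_level_item = to_attribute_values(cr_sortable)
print(low_level_item["missing-packet-count"])
```

Code that uses boto3 would more likely rely on `boto3.dynamodb.types.TypeSerializer` or the resource-level `Table` interface, which perform this conversion automatically.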

CR Additional Metadata (Not Comprehensive of all Entries)#

| PK | SK | filename | edos-version | SCID | APID |
| --- | --- | --- | --- | --- | --- |
| "L0_CONS_file.PDS" | "#PDS" | "L0_PDS_file.PDS" | | | |
| Unique Filename | Identifier | PDS Filename | | | |
| "L0_CONS_file.PDS" | "#APID11" | | "1.0.0" | 1 | 11 |
| Unique Filename | Identifier | | EDOS Version | SCID | APID |

The above example shows only two items. The construction record file has approximately 10 more items associated with it that are stored in this table. See the io/pds.py file for full details of the items stored for the construction record.
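Because all of these items share the construction record filename as their partition key, they can be read back with a single query on PK, optionally narrowed by a sort key prefix. The sketch below only builds the keyword arguments for such a query. The parameter names follow the DynamoDB `Query` API, while the function name and the table name "example-metadata-table" are assumptions, not names from the Libera SDC.

```python
def build_file_items_query(table_name, filename, sk_prefix=None):
    """Build Query arguments for all items stored under one filename."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "PK"},
        "ExpressionAttributeValues": {":pk": {"S": filename}},
    }
    if sk_prefix is not None:
        # Narrow the result to sort keys that start with the given prefix.
        params["KeyConditionExpression"] += " AND begins_with(#sk, :prefix)"
        params["ExpressionAttributeNames"]["#sk"] = "SK"
        params["ExpressionAttributeValues"][":prefix"] = {"S": sk_prefix}
    return params


all_items_query = build_file_items_query("example-metadata-table", "L0_CONS_file.PDS")
apid_items_query = build_file_items_query(
    "example-metadata-table", "L0_CONS_file.PDS", sk_prefix="#APID"
)
print(apid_items_query["KeyConditionExpression"])
```

With a boto3 DynamoDB client, the resulting dictionary could be passed as `client.query(**apid_items_query)`.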

PDS File#

PDS File Metadata#

This is the same as the construction record file metadata.

| PK | SK | ingest-time | archive-time | algorithm-version |
| --- | --- | --- | --- | --- |
| "L0_file.nc" | "#" | "2024-01-01 00:00:00" | "2024-01-01 00:01:00" | "1.0.0" |
| Unique Filename | File Type | Ingest Time | Archive Time | Algorithm Version |

PDS Sortable Metadata#

This is covered in the construction record sortable metadata.

PDS Additional Metadata#

This is covered by the construction record additional metadata.

Higher Level Data Products#

TBD

Calibration Data Products#

Cal File Metadata#

| PK | SK | archive-time | calibration-version |
| --- | --- | --- | --- |
| "Example_calibration_file.nc" | "#CAL#L0" | "2024-01-01 00:00:00" | "1.0.0" |
| Unique Filename | Identifier | Archive Time | Calibration Version |

Cal Sortable Metadata#

Not applicable.

Cal Additional Metadata#

TBD.

Using this Data Model with a Global Secondary Index (GSI)#

The Libera databases make use of a global secondary index (GSI) to allow for efficient querying of the data in the table using an alternate key. This is used to define specific “Use Cases” that allow for querying the data in different ways from the standard primary key of PK + SK.

NOTE: Unlike the primary key (PK + SK), the key values of a GSI are not required to be unique.

The GSI uses attributes that already exist in the table and provides a more efficient way to access the data than scanning the whole table and sorting the results yourself.
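The in-memory sketch below models what a GSI does conceptually: it re-keys existing items by another attribute. It is an illustration only and does not call AWS. It assumes an index keyed on applicable-date with SK as the index sort key, and the function name and the second filename are hypothetical. Note that two files can share the same index key, and that items lacking the indexed attribute do not appear in the index at all.

```python
from collections import defaultdict


def project_by_date(items):
    """Group items by applicable-date, keeping (SK, PK) pairs sorted by SK."""
    index = defaultdict(list)
    for item in items:
        # Items without the index key attribute are left out of the index.
        if "applicable-date" in item:
            index[item["applicable-date"]].append((item["SK"], item["PK"]))
    return {date: sorted(entries) for date, entries in index.items()}


table_items = [
    {"PK": "L0_CONS_file.PDS", "SK": "#L0", "applicable-date": "2024-01-01"},
    {"PK": "Other_CONS_file.PDS", "SK": "#L0", "applicable-date": "2024-01-01"},
    {"PK": "L0_CONS_file.PDS", "SK": "#", "archive-time": "2024-01-01 00:01:00"},
]
date_index = project_by_date(table_items)
print(date_index["2024-01-01"])
```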

Libera Metadata Database GSIs#

  1. Date searching GSI - This GSI is used to search for all files that have a specific date associated with them and to get the filenames associated with a particular identifier.

| applicable-date | SK | PK |
| --- | --- | --- |
| "2024-01-01" | "#L0" | "L0_CONS_file.PDS" |
| Applicable Date | Type Identifier | Unique Filename |

  2. Calibration Version searching GSI (TBD) - This GSI is used to search for all files that have a specific calibration identifier and to retrieve the version associated with them.

| SK | PK | version |
| --- | --- | --- |
| "#Cal" | "Example_calibration_file.nc" | "1.0.0" |
| Type Identifier | Unique Filename | Calibration Version |

  3. Additional GSIs are TBD.
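As a sketch of how the date searching GSI could be queried, the function below builds the keyword arguments for a DynamoDB `Query` against an index. The parameter names follow the `Query` API; the function name, the table name "example-metadata-table", and the index name "example-date-index" are assumptions, since this document does not name the table or the index. Because "applicable-date" contains a hyphen, it has to be referenced through an expression attribute name placeholder.

```python
def build_date_query(table_name, index_name, applicable_date, type_prefix):
    """Build Query arguments for files on one date with a given type prefix."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Placeholders (#d, #sk) stand in for the real attribute names.
        "KeyConditionExpression": "#d = :date AND begins_with(#sk, :prefix)",
        "ExpressionAttributeNames": {"#d": "applicable-date", "#sk": "SK"},
        "ExpressionAttributeValues": {
            ":date": {"S": applicable_date},
            ":prefix": {"S": type_prefix},
        },
    }


date_query = build_date_query(
    "example-metadata-table", "example-date-index", "2024-01-01", "#L0"
)
print(date_query["IndexName"], date_query["KeyConditionExpression"])
```

The table's primary key attributes are always projected into a GSI, so each returned item would include the PK, which here is the filename.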

Implementation#

Implementation specifics for general data products live in the db/dynamodb_utils.py file and are under active development.

Specifics for level 0 (L0) raw data files, that is, the construction records and PDS files received from ASDC, are in the io/pds.py file. They are used and refined in the io/pds-ingest.py file.

The handling of this data model for higher level products is TBD; the basic structure and tools are in place in the db/dynamodb_utils.py file.