Libera Data Models and Database Schema in DynamoDB on AWS#
Under Review (Remove before release)#
This document provides an overview of the data models used in the Libera databases in AWS DynamoDB. The Libera databases are designed to take advantage of the DynamoDB pricing model to minimize costs and to use the serverless nature of DynamoDB to facilitate an asynchronous, event-driven architecture.
If you are unfamiliar with DynamoDB, see DynamoDB Basics in AWS before reading this page.
This documentation provides examples of the data models for the different DynamoDB tables used in the Libera SDC. Each visual table below describes the key-value nature of a DynamoDB table and how it is used in the Libera SDC, in the following layout:
| Partition Key | Sort Key | Attribute 1 Key | Attribute 2 Key | … | Attribute N Key |
|---|---|---|---|---|---|
| PK Value Example | SK Value Example | Value1 Example | Value2 Example | … | ValueN Ex. |
| PK Description | SK Description | Attribute 1 Desc | Attribute 2 Desc | … | Attribute N Desc |
Note: Many tables in the Libera SDC use a generic key-value entry for the two keys (partition and sort) as follows:
Partition Key - “PK”: “value”
Sort Key - “SK”: “value”
This contrasts with more specifically named keys, which are easier for a human to read but can restrict the DynamoDB table design with respect to vertical partitioning and logical grouping of data. For example, the following more human-readable key set is not used in the Libera SDC, but it illustrates logically how the keys are used:
Partition Key - “Filename”: “Example_file.nc”
Sort Key - “Filetype”: “L0”
This generic usage of “PK” and “SK” is a common DynamoDB pattern. It allows easy access to the data in the table and greater flexibility in the logical design of the data model without confusing the semantics of the keys.
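As a rough illustration of the generic key naming described above, the sketch below builds the arguments for a table whose partition and sort keys are simply named "PK" and "SK". The dictionary follows the shape of the DynamoDB CreateTable request; the table name and the on-demand billing mode are assumptions made for the example and are not taken from the Libera SDC code.

```python
# Sketch only: CreateTable-style arguments using generic "PK"/"SK" key names.
# The table name and billing mode are assumptions for illustration.
def generic_table_definition(table_name):
    """Return keyword arguments shaped like a DynamoDB CreateTable request."""
    return {
        "TableName": table_name,
        # Only key attributes are declared; all other attributes are schemaless.
        "AttributeDefinitions": [
            {"AttributeName": "PK", "AttributeType": "S"},
            {"AttributeName": "SK", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
            {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumed on-demand pricing
    }


definition = generic_table_definition("example-metadata-table")
```

Because the key names carry no meaning of their own, the same table definition can hold file metadata, sortable metadata, and level-specific metadata items side by side.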
Libera Metadata and Provenance Database#
Metadata Data Model#
This database is used to store metadata and provenance information about the data files generated by the Libera SDC. The metadata for each file can broadly be broken down into three categories that are all stored together in the same table:
File metadata about the file itself, such as the filename, the time it was archived, and the version of the algorithm used to generate the data file
Sortable metadata about the file, such as the applicable date of the data in the file, calibration versions used, and other metadata that can be used to sort and filter the data in other processing steps
Additional metadata that is specific to a given data level. For example, the construction records for level 0 data files contain a wide array of additional metadata that is not present in higher level data products.
Minimum Database Examples#
These examples show the expected minimum set of attributes. Additional attributes may be added as needed for specific data products.
File Metadata Example#
| PK | SK | archive-time | algorithm-version |
|---|---|---|---|
| “Example_file.nc” | “#” | “2024-01-01 00:00:00” | “1.0.0” |
| Unique Filename | Placeholder | Archive Time | Algorithm Version |
Sortable Metadata Example#
| PK | SK | applicable-date |
|---|---|---|
| “Example_file.nc” | “#L0#SPICE#AZ” | “2024-01-01” |
| Unique Filename | File Type Identifier | Applicable Date of Data |
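The sort key values in these examples ("#" and "#L0#SPICE#AZ") look like "#"-delimited segments. Assuming that reading is correct, a pair of small helpers for composing and splitting such keys might look like the sketch below. The helper names are hypothetical, and the meaning of each segment is not defined here.

```python
# Sketch only: helpers for "#"-delimited sort keys, assuming each "#" starts
# a new segment. Function names are hypothetical, not from the Libera SDC code.
def build_sort_key(*segments):
    """Join segments into a sort key; no segments yields the "#" placeholder."""
    return "#" + "#".join(segments)


def split_sort_key(sort_key):
    """Return the non-empty segments of a "#"-delimited sort key."""
    return [segment for segment in sort_key.split("#") if segment]


placeholder = build_sort_key()
type_identifier = build_sort_key("L0", "SPICE", "AZ")
segments = split_sort_key(type_identifier)
```

Keeping the broadest segment first means a prefix such as "#L0" can later be matched against every sort key that begins with it.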
Additional Metadata Example#
No minimum requirement.
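To make the minimum examples above concrete, the following sketch builds the file metadata item and the sortable metadata item for one file as plain Python dictionaries. Both items share the same PK and differ only in SK and attributes. The function name is hypothetical, and the values are the example values from the tables above.

```python
# Sketch only: the two minimum items for one file, as plain dictionaries.
# The function name is hypothetical; attribute names follow the tables above.
def build_minimum_items(filename, archive_time, algorithm_version,
                        type_identifier, applicable_date):
    """Return [file metadata item, sortable metadata item] for one file."""
    file_item = {
        "PK": filename,
        "SK": "#",  # placeholder sort key for the file metadata item
        "archive-time": archive_time,
        "algorithm-version": algorithm_version,
    }
    sortable_item = {
        "PK": filename,
        "SK": type_identifier,
        "applicable-date": applicable_date,
    }
    return [file_item, sortable_item]


items = build_minimum_items(
    filename="Example_file.nc",
    archive_time="2024-01-01 00:00:00",
    algorithm_version="1.0.0",
    type_identifier="#L0#SPICE#AZ",
    applicable_date="2024-01-01",
)
```

Any additional, level-specific items would reuse the same PK with further SK values.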
Level 0 (L0) Data Specifics#
These are the raw binary data files received from ASDC. They arrive as two separate files: construction records and PDS files.
Construction Record (CR)#
CR File Metadata#
| PK | SK | ingest-time | archive-time | algorithm-version |
|---|---|---|---|---|
| “L0_CONS_file.PDS” | “#” | “2024-01-01 00:00:00” | “2024-01-01 00:01:00” | “1.0.0” |
| Unique Filename | Placeholder | Ingest Time | Archive Time | Algorithm Version |
CR Sortable Metadata#
| PK | SK | applicable-date | first-packet-time | last-packet-time | missing-packet-count | filled-gap-count |
|---|---|---|---|---|---|---|
| “L0_CONS_file.PDS” | “#L0#APID11” | “2024-01-01” | “2024-01-01 00:00:00” | “2024-01-01 00:00:00” | 0 | 0 |
| Unique Filename | File Type Identifier | Applicable Date of Data | First Packet Time | Last Packet Time | Missing Packet Count | Filled Gap Count |
CR Additional Metadata (Not Comprehensive of all Entries)#
| PK | SK | filename | edos-version | SCID | APID | … |
|---|---|---|---|---|---|---|
| “L0_CONS_file.PDS” | “#PDS” | “L0_PDS_file.PDS” | | | | |
| Unique Filename | Identifier | PDS Filename | | | | |
| “L0_CONS_file.PDS” | “#APID11” | | “1.0.0” | 1 | 11 | … |
| Unique Filename | Identifier | | EDOS Version | SCID | APID | … |
The above example shows only 2 example items. The construction record file has approximately 10 more items associated with it that are stored in this table. See the io/pds.py file for the full details of items stored in the construction record.
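Because every item belonging to a construction record shares the same PK, all of them can be fetched with a single Query on that PK, optionally narrowed by a sort key prefix. The sketch below only builds the request arguments, in the low-level format accepted by the DynamoDB Query API. The function name and table name are hypothetical, and this is not taken from io/pds.py.

```python
# Sketch only: Query arguments for all items that share one PK.
# Function and table names are hypothetical, not from the Libera SDC code.
def item_collection_query(table_name, filename, sk_prefix=None):
    """Build Query arguments for one file, optionally filtered by SK prefix."""
    names = {"#pk": "PK"}
    values = {":pk": {"S": filename}}
    condition = "#pk = :pk"
    if sk_prefix is not None:
        # begins_with is allowed on the sort key in a key condition.
        names["#sk"] = "SK"
        values[":prefix"] = {"S": sk_prefix}
        condition += " AND begins_with(#sk, :prefix)"
    return {
        "TableName": table_name,
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


all_items_query = item_collection_query("example-metadata-table", "L0_CONS_file.PDS")
apid_query = item_collection_query("example-metadata-table", "L0_CONS_file.PDS", "#APID")
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("dynamodb").query(**apid_query)`.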
PDS File#
PDS File Metadata#
This is the same as the construction record file metadata.
| PK | SK | ingest-time | archive-time | algorithm-version |
|---|---|---|---|---|
| “L0_file.nc” | “#” | “2024-01-01 00:00:00” | “2024-01-01 00:01:00” | “1.0.0” |
| Unique Filename | Placeholder | Ingest Time | Archive Time | Algorithm Version |
PDS Sortable Metadata#
This is covered in the construction record sortable metadata.
PDS Additional Metadata#
This is covered by the construction record additional metadata.
Higher Level Data Products#
TBD
Calibration Data Products#
Cal File Metadata#
| PK | SK | archive-time | calibration-version |
|---|---|---|---|
| “Example_calibration_file.nc” | “#CAL#L0” | “2024-01-01 00:00:00” | “1.0.0” |
| Unique Filename | Identifier | Archive Time | Calibration Version |
Cal Sortable Metadata#
Not applicable.
Cal Additional Metadata#
TBD.
Using this Data Model with a Global Secondary Index (GSI)#
The Libera databases use a global secondary index (GSI) to allow efficient querying of the data in the table through an alternate key. GSIs are used to define specific “Use Cases” that query the data in ways other than through the standard primary key of PK + SK.
NOTE: Unlike the table's primary key (PK + SK), the key values of a GSI are not required to be unique.
The GSI uses attributes that already exist in the table and provides an access path that is more efficient than scanning the whole table and sorting the results yourself.
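The behavior described above can be pictured with a small in-memory stand-in for an index, written in plain Python rather than against DynamoDB. It groups items by an alternate key attribute, skips items that lack that attribute (as DynamoDB does for a GSI), and allows several items under the same key value. The function name is hypothetical, and the items reuse example values from this page.

```python
from collections import defaultdict


# Sketch only: an in-memory picture of a secondary index keyed on
# "applicable-date". This is an illustration, not how DynamoDB is implemented.
def build_date_index(items):
    """Group (SK, PK) pairs by applicable-date, sorted by SK within each date."""
    index = defaultdict(list)
    for item in items:
        # Items without the index key attribute do not appear in the index.
        if "applicable-date" in item:
            index[item["applicable-date"]].append((item["SK"], item["PK"]))
    for entries in index.values():
        entries.sort()
    return dict(index)


items = [
    {"PK": "L0_CONS_file.PDS", "SK": "#", "archive-time": "2024-01-01 00:01:00"},
    {"PK": "L0_CONS_file.PDS", "SK": "#L0#APID11", "applicable-date": "2024-01-01"},
    {"PK": "Example_file.nc", "SK": "#L0#SPICE#AZ", "applicable-date": "2024-01-01"},
]
date_index = build_date_index(items)
```

Looking up one date returns only the matching entries, which is the work a full table scan followed by a manual sort would otherwise have to do.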
Libera Metadata Database GSIs#
Date searching GSI - This GSI is used to search for all files that have a specific date associated with them and to get the filenames associated with a particular identifier.
| applicable-date | SK | PK |
|---|---|---|
| “2024-01-01” | “#L0” | “L0_CONS_file.PDS” |
| Applicable Date | Type Identifier | Unique Filename |
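A date searching GSI of this shape could be declared and queried roughly as sketched below. The dictionaries follow the DynamoDB CreateTable and Query request formats. The index name, table name, and KEYS_ONLY projection are assumptions made for the example. The attribute name applicable-date contains a hyphen, so the key condition refers to it through an expression attribute name placeholder.

```python
# Sketch only: a GSI keyed on applicable-date (partition) and SK (sort).
# Index name, table name, and projection type are assumptions for illustration.
DATE_INDEX_NAME = "applicable-date-index"  # hypothetical index name


def date_index_definition():
    """Return one GlobalSecondaryIndexes entry for a CreateTable request."""
    return {
        "IndexName": DATE_INDEX_NAME,
        "KeySchema": [
            {"AttributeName": "applicable-date", "KeyType": "HASH"},
            {"AttributeName": "SK", "KeyType": "RANGE"},
        ],
        # KEYS_ONLY still projects the table's PK, which holds the filename.
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }


def files_for_date_query(table_name, applicable_date, type_prefix):
    """Build Query arguments for files on a date whose SK starts with a prefix."""
    return {
        "TableName": table_name,
        "IndexName": DATE_INDEX_NAME,
        "KeyConditionExpression": "#d = :d AND begins_with(#sk, :t)",
        "ExpressionAttributeNames": {"#d": "applicable-date", "#sk": "SK"},
        "ExpressionAttributeValues": {
            ":d": {"S": applicable_date},
            ":t": {"S": type_prefix},
        },
    }


index_definition = date_index_definition()
query = files_for_date_query("example-metadata-table", "2024-01-01", "#L0")
```

Each item returned by such a query would carry its PK, giving the filenames for that date and type identifier.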
Calibration Version searching GSI (TBD) - This GSI is used to search for all files that have a specific calibration identifier and retrieve the version associated with them.
| SK | PK | version |
|---|---|---|
| “#Cal” | “Example_calibration_file.nc” | “1.0.0” |
| Type Identifier | Unique Filename | Calibration Version |
Additional GSIs are TBD.
Implementation#
Implementation specifics for general data products live in the db/dynamodb_utils.py file and are under active development.
Specifics for level 0 (L0) raw data files are in the io/pds.py file. This covers construction records and PDS files received from ASDC, and these are used and refined in the io/pds-ingest.py file.
The handling of this data model for higher level products is TBD, with the basic structure and tools in place in the db/dynamodb_utils.py file.