# Libera Data Models and Database Schema in DynamoDB on AWS
## Under Review (Remove before release)
This document provides an overview of the data models used in the Libera databases in AWS DynamoDB. The Libera
databases are designed to take advantage of the DynamoDB pricing model to minimize costs and to use the serverless
nature of DynamoDB to facilitate an asynchronous, event-driven architecture.

If you are unfamiliar with DynamoDB, read [DynamoDB Basics in AWS](dynamodb_basics.md) before continuing with
this page.

This page gives example data models for the DynamoDB tables used in the Libera SDC.
Each table below shows the key-value structure of a DynamoDB table and how it is used in the Libera SDC,
in the following layout:

| **Partition Key**   | **Sort Key**       | Attribute 1 Key    | Attribute 2 Key    | ...   | Attribute N Key    |
|---------------------|--------------------|--------------------|--------------------|-------|--------------------|
| PK Value Example    | SK Value Example   | Value1 Example     | Value2 Example     | ...   | ValueN Ex.         |
| _PK Description_    | _SK Description_   | _Attribute 1 Desc_ | _Attribute 2 Desc_ | ...   | _Attribute N Desc_ |

Note: Many tables in the Libera SDC use a generic key-value entry for the two keys (partition and sort) as follows:
- Partition Key - "PK":"value"
- Sort Key - "SK":"value"

This is in contrast to more specifically named keys, which are easier for a human to read but can restrict
the DynamoDB table design with respect to vertical partitioning and logical grouping of data. For example, the key set
below is more human-readable. It is not used in the Libera SDC, but it illustrates the logical role the keys play.

- Partition Key - "Filename":"Example_file.nc"
- Sort Key - "Filetype":"L0"

The generic use of "PK" and "SK" is a common DynamoDB pattern. It keeps access to the data in the table simple
and allows greater flexibility in the logical design of the data model, because the key names carry no semantics that could become misleading.
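
As a minimal sketch of this pattern, the snippet below builds primary keys for two different kinds of item that share one partition. The helper name `make_key` is hypothetical and is not part of the Libera SDC code; the key values are taken from the examples on this page.

```python
def make_key(pk: str, sk: str) -> dict:
    """Return a primary key that uses the generic "PK"/"SK" attribute names."""
    # DynamoDB does not accept empty strings as key attribute values.
    if not pk or not sk:
        raise ValueError("PK and SK must be non-empty strings")
    return {"PK": pk, "SK": sk}


# Two different item types stored under the same partition (the filename).
file_key = make_key("Example_file.nc", "#")
type_key = make_key("Example_file.nc", "#L0#SPICE#AZ")
```

Because both keys use the same attribute names, items describing very different things can be grouped under one filename without renaming the key columns.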


## Libera Metadata and Provenance Database
## Metadata Data Model
This database is used to store metadata and provenance information about the data files generated by the Libera
SDC. The metadata for each file can broadly be broken down into three categories that are all stored together in the
same table:
1. **File metadata** about the file itself, such as the filename, the time it was archived, and the version of the algorithm
used to generate the data file
2. **Sortable metadata** about the file, such as the applicable date of the data in the file, calibration versions used, and
other metadata that can be used to sort and filter the data in other processing steps
3. **Additional metadata** that is specific to a given data level. For example, the construction records for level 0 data
files contain a wide array of additional metadata that is not present in higher level data products.

## Minimum Database Examples
Additional attributes may be added as needed for specific data products; the tables below show the expected minimum.

### File Metadata Example
| **PK**            | **SK**        | archive-time          | algorithm-version   |
|-------------------|---------------|-----------------------|---------------------|
| "Example_file.nc" | "#"           | "2024-01-01 00:00:00" | "1.0.0"             |
| _Unique Filename_ | _Placeholder_ | _Archive Time_        | _Algorithm Version_ |

### Sortable Metadata Example
| **PK**            | **SK**                 | applicable-date           |
|-------------------|------------------------|---------------------------|
| "Example_file.nc" | "#L0#SPICE#AZ"         | "2024-01-01"              |
| _Unique Filename_ | _File Type Identifier_ | _Applicable Date of Data_ |

### Additional Metadata Example
No minimum requirement.
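
The minimum items above could be assembled as plain dictionaries before being written to the table. The following is only a sketch: `build_item` and the two `REQUIRED_*` sets are hypothetical names chosen for illustration, and the attribute names are the ones shown in the example tables.

```python
# Minimum attributes taken from the example tables above.
REQUIRED_FILE_ATTRS = {"archive-time", "algorithm-version"}
REQUIRED_SORTABLE_ATTRS = {"applicable-date"}


def build_item(filename: str, sort_key: str, attributes: dict, required: set) -> dict:
    """Combine the generic keys with attributes after checking the minimum set."""
    missing = required - set(attributes)
    if missing:
        raise ValueError(f"missing required attributes: {sorted(missing)}")
    return {"PK": filename, "SK": sort_key, **attributes}


file_item = build_item(
    "Example_file.nc",
    "#",
    {"archive-time": "2024-01-01 00:00:00", "algorithm-version": "1.0.0"},
    REQUIRED_FILE_ATTRS,
)
sortable_item = build_item(
    "Example_file.nc",
    "#L0#SPICE#AZ",
    {"applicable-date": "2024-01-01"},
    REQUIRED_SORTABLE_ATTRS,
)
```

Extra attributes pass through unchanged, which matches the rule that only the minimum is enforced.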


## Level 0 (L0) Data Specifics
These are the raw binary data files received from ASDC. They come as two separate files: construction records and PDS files.

### Construction Record (CR)
### CR File Metadata
| **PK**             | **SK**        | ingest-time               | archive-time          | algorithm-version   |
|--------------------|---------------|---------------------------|-----------------------|---------------------|
| "L0_CONS_file.PDS" | "#"           | "2024-01-01 00:00:00"     | "2024-01-01 00:01:00" | "1.0.0"             |
| _Unique Filename_  | _Placeholder_ | _Ingest Time_             | _Archive Time_        | _Algorithm Version_ |

### CR Sortable Metadata
| **PK**             | **SK**                 | applicable-date           | first-packet-time     | last-packet-time      | missing-packet-count   | filled-gap-count   |
|--------------------|------------------------|---------------------------|-----------------------|-----------------------|------------------------|--------------------|
| "L0_CONS_file.PDS" | "#L0#APID11"           | "2024-01-01"              | "2024-01-01 00:00:00" | "2024-01-01 00:00:00" | 0                      | 0                  |
| _Unique Filename_  | _File Type Identifier_ | _Applicable Date of Data_ | _First Packet Time_   | _Last Packet Time_    | _Missing Packet Count_ | _Filled Gap Count_ |

### CR Additional Metadata (Not Comprehensive of all Entries)
| **PK**               | **SK**         | filename                  | edos-version            | SCID   | APID   | ... |
|----------------------|----------------|---------------------------|-------------------------|--------|--------|-----|
| "L0_CONS_file.PDS"   | "#PDS"         | "L0_PDS_file.PDS"         |                         |        |        |     |
| _Unique Filename_    | _Identifier_   | _PDS Filename_            |                         |        |        |     |
| -------------------- | -------------- | ------------------------- | ----------------------- | -----  | --     | --  |
| "L0_CONS_file.PDS"   | "#APID11"      |                           | "1.0.0"                 | 1      | 11     | ... |
| _Unique Filename_    | _Identifier_   |                           | _EDOS Version_          | _SCID_ | _APID_ | ... |

The above example shows only 2 example items. The construction record file has approximately 10 more items associated
with it that are stored in this table. See the `io/pds.py` file for the full details of items stored in the construction
record.
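
Because every item for a construction record shares the same partition key, all of them can be fetched with a single Query on `PK`. The sketch below only builds the request parameters for the low-level DynamoDB Query API; `build_file_query` is a hypothetical helper and `libera-metadata` is a placeholder table name, neither taken from the Libera SDC code.

```python
from typing import Optional


def build_file_query(table_name: str, filename: str,
                     sk_prefix: Optional[str] = None) -> dict:
    """Build keyword arguments for a low-level DynamoDB Query request."""
    condition = "PK = :pk"
    values = {":pk": {"S": filename}}
    if sk_prefix is not None:
        # In a key condition, begins_with may only be applied to the sort key.
        condition += " AND begins_with(SK, :prefix)"
        values[":prefix"] = {"S": sk_prefix}
    return {
        "TableName": table_name,
        "KeyConditionExpression": condition,
        "ExpressionAttributeValues": values,
    }


# Every item stored for one construction record.
all_items = build_file_query("libera-metadata", "L0_CONS_file.PDS")
# Only the items whose sort key starts with "#APID".
apid_items = build_file_query("libera-metadata", "L0_CONS_file.PDS", "#APID")

# With boto3 installed and AWS credentials configured, a request could be sent as:
#   boto3.client("dynamodb").query(**all_items)
```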

### PDS File
### PDS File Metadata
This is the same as the construction record file metadata.

| **PK**            | **SK**        | ingest-time           | archive-time          | algorithm-version   |
|-------------------|---------------|-----------------------|-----------------------|---------------------|
| "L0_file.nc"      | "#"           | "2024-01-01 00:00:00" | "2024-01-01 00:01:00" | "1.0.0"             |
| _Unique Filename_ | _Placeholder_ | _Ingest Time_         | _Archive Time_        | _Algorithm Version_ |

### PDS Sortable Metadata
This is covered in the construction record sortable metadata.

### PDS Additional Metadata
This is covered by the construction record additional metadata.

## Higher Level Data Products
TBD

## Calibration Data Products

### Cal File Metadata
| **PK**                        | **SK**       | archive-time          | calibration-version   |
|-------------------------------|--------------|-----------------------|-----------------------|
| "Example_calibration_file.nc" | "#CAL#L0"    | "2024-01-01 00:00:00" | "1.0.0"               |
| _Unique Filename_             | _Identifier_ | _Archive Time_        | _Calibration Version_ |

### Cal Sortable Metadata
Not applicable.

### Cal Additional Metadata
TBD.

## Using this Data Model with a Global Secondary Index (GSI)
The Libera databases use a global secondary index (GSI) to query the data in the table efficiently
using an alternate key. Each GSI serves a specific use case that needs to query the data in a different way from
the standard primary key of PK + SK.

**NOTE:** Unlike the table's primary key (PK + SK), the key values of a GSI are not required to be unique.

A GSI is built from attributes that already exist in the table. It provides an alternative access path that is
more efficient than scanning the whole table and sorting the results yourself.

### Libera Metadata Database GSIs

1. Date searching GSI - This GSI is used to search for all files that have a specific date associated with them and to get
the filenames associated with a particular identifier.

| **applicable-date** | **SK**            | PK                  | 
|---------------------|-------------------|---------------------|
| "2024-01-01"        | "#L0"             | "L0_CONS_file.PDS"  |
| _Applicable Date_   | _Type Identifier_ | _Unique Filename_   |

2. Calibration version searching GSI (TBD) - This GSI is used to search for all files that have a specific calibration
identifier and to retrieve the version associated with them.

| **SK**            |      | PK                            | version               |
|-------------------|------|-------------------------------|-----------------------|
| "#Cal"            |      | "Example_calibration_file.nc" | "1.0.0"               |
| _Type Identifier_ |      | _Unique Filename_             | _Calibration Version_ |

3. Additional GSIs are TBD.
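
To illustrate the date searching GSI, the sketch below builds the parameters for a Query against an index keyed on `applicable-date` and `SK`, then collects the filenames from a response. The helper names, the table name `libera-metadata`, and the index name `date-index` are all assumptions made for this example and do not come from the Libera SDC code. The sample response is hand-written to show the item shape returned by the low-level API.

```python
def build_date_query(table_name: str, index_name: str,
                     applicable_date: str, type_prefix: str) -> dict:
    """Build keyword arguments for a low-level Query against the date GSI."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # "applicable-date" contains a hyphen, so the expression refers to it
        # through a "#" placeholder defined in ExpressionAttributeNames.
        "KeyConditionExpression": "#d = :d AND begins_with(SK, :prefix)",
        "ExpressionAttributeNames": {"#d": "applicable-date"},
        "ExpressionAttributeValues": {
            ":d": {"S": applicable_date},
            ":prefix": {"S": type_prefix},
        },
        # Only the filename is needed from each matching item.
        "ProjectionExpression": "PK",
    }


def filenames_from_items(items: list) -> list:
    """Return the unique filenames found in low-level Query result items."""
    # GSI keys are not unique, so one file may appear in several items.
    return sorted({item["PK"]["S"] for item in items})


params = build_date_query("libera-metadata", "date-index", "2024-01-01", "#L0")

# Hand-written stand-in for the "Items" list of a Query response.
sample_items = [
    {"PK": {"S": "L0_CONS_file.PDS"}},
    {"PK": {"S": "L0_CONS_file.PDS"}},
]
filenames = filenames_from_items(sample_items)
```

Using `begins_with` on `SK` lets a short prefix such as `"#L0"` match longer type identifiers such as `"#L0#APID11"`.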

### Implementation
Implementation specifics for general data products live in `db/dynamodb_utils.py` and are under active development.

Specifics for level 0 (L0) raw data files, that is, the construction records and PDS files received from ASDC,
are in `io/pds.py`. They are used and refined in `io/pds-ingest.py`.

Handling of this data model for higher-level products is TBD; the basic structure and tools are in place in
`db/dynamodb_utils.py`.