Sunday 11 February 2024

Table level data lineage

Table-level lineage provides a higher-level view, focusing on how entire tables or files move and transform from one system to another. This type is beneficial for understanding the relationships between different datasets and the flow of data at a broader level.

 

Example: Let's consider a retail company that operates both online and offline stores. The company maintains two databases one for online stores and other for offline stores. The company wants to analyze its sales data to gain insights into customer behaviour and product performance.

 

The json schema for the above use case looks like below.

{
  "raw_data_tables": [
    {
      "name": "online_sales_raw",
      "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "product_id", "type": "string"},
        {"name": "quantity", "type": "integer"},
        {"name": "unit_price", "type": "number"},
        {"name": "order_date", "type": "string"}
      ]
    },
    {
      "name": "instore_sales_raw",
      "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "product_id", "type": "string"},
        {"name": "quantity", "type": "integer"},
        {"name": "unit_price", "type": "number"},
        {"name": "order_date", "type": "string"}
      ]
    }
  ],
  "transformations": [
    {
      "name": "data_integration_transformation",
      "source_tables": ["online_sales_raw", "instore_sales_raw"],
      "destination_table": "integrated_sales_data",
      "mappings": [
        {"source_field": "transaction_id", "destination_field": "transaction_id"},
        {"source_field": "customer_id", "destination_field": "customer_id"},
        {"source_field": "product_id", "destination_field": "product_id"},
        {"source_field": "quantity", "destination_field": "quantity"},
        {"source_field": "unit_price", "destination_field": "unit_price"},
        {"source_field": "order_date", "destination_field": "order_date"},
        {"source_field": "store_type", "transformation": "IF(online_sales_raw) THEN 'Online' ELSE 'Instore'"}
      ]
    }
  ],
  "analytics_tables": [
    {
      "name": "sales_summary",
      "fields": [
        {"name": "store_type", "type": "string"},
        {"name": "total_sales", "type": "number"},
        {"name": "average_price", "type": "number"},
        {"name": "total_quantity", "type": "integer"}
      ]
    },
    {
      "name": "customer_behavior",
      "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "total_orders", "type": "integer"},
        {"name": "total_spent", "type": "number"},
        {"name": "last_order_date", "type": "string"}
      ]
    },
    {
      "name": "product_performance",
      "fields": [
        {"name": "product_id", "type": "string"},
        {"name": "total_sales", "type": "number"},
        {"name": "total_orders", "type": "integer"},
        {"name": "average_price", "type": "number"}
      ]
    }
  ]
}

JSON schema in more detail:

 

Raw Data Tables:

This section defines the raw data tables where the initial sales data from online and in-store transactions is stored.

Each raw data table has a name and a list of fields representing the attributes of the transactions, such as transaction ID, customer ID, product ID, quantity, unit price, and order date.

 

Transformations:

This section captures the integration and transformation process applied to the raw data tables to create a unified dataset.

The transformation is named "data_integration_transformation".

It specifies the source tables ("online_sales_raw" and "instore_sales_raw") and the destination table ("integrated_sales_data").

The "mappings" array defines how fields from the source tables are mapped to fields in the destination table.

 

Additionally, a transformation is applied to derive the "store_type" field, indicating whether a transaction originated from an online or in-store purchase. This transformation uses a conditional statement ("IF-THEN-ELSE") based on the source table name.

 

Analytics Tables:

This section defines the tables used for analytics and reporting purposes after the integration and transformation process.

The tables include "sales_summary", "customer_behavior", and "product_performance". Each analytics table has a name and a list of fields representing various metrics or attributes relevant for analysis.

 

Overall, this JSON schema provides a structured representation of the data pipeline, from the initial raw data tables to the final analytics tables, including the transformations applied during the process. It helps to document and understand the flow of data and the relationships between different datasets in the system.


 

 

Previous                                                 Next                                                 Home

No comments:

Post a Comment