Programming for beginners: Field level data lineage

Field-level lineage is like tracing the journey of each column or field information in the entire data lineage process. It helps us to see exactly how the data in one column or field gets created, updated, transformed and used. This is super helpful because it gives us a detailed picture of how each individual bit of data gets manipulated or used, rather than just looking at the big picture of entire sets of data.

Example

Suppose we have a customer database with the following tables:

a. Customer Table:

Columns: customer_id, first_name, last_name, email, phone_number, address_id

b. Address Table:

Columns: address_id, street_address, city, state, postal_code

Now, let's say we have a business requirement to create a new field in our reporting system called "full_address", which concatenates the street_address, city, state, and postal_code from the Address table for each customer.

The field level lineage looks like below for above use case.

{
  "source_database": {
    "name": "CustomerDatabase",
    "url": "jdbc:mysql://source-database-url",
    "type": "MySQL"
  },
  "source_table": "Address",
  "source_fields": [
    "street_address",
    "city",
    "state",
    "postal_code"
  ],
  "transformation": {
    "type": "concatenation",
    "expression": "CONCAT(street_address, ', ', city, ', ', state, ' ', postal_code)"
  },
  "target_database": {
    "name": "ReportingDatabase",
    "url": "jdbc:mysql://destination-database-url",
    "type": "MySQL"
  },
  "target_table": "CustomerDetails",
  "target_field": "full_address",
  "usage": {
    "reports": ["CustomerAddressesReport", "SalesAnalysisReport"],
    "applications": ["CustomerPortal", "SalesDashboard"]
  }
}

In the above lineage

a. source_database: Contains details about the source database, including its name, URL, and type (e.g., MySQL).

b. source_table: Specifies the table where the source fields originate (in this case, "Address").

c. source_fields: Lists the individual fields from the source table that are involved in the lineage.

d. transformation: Describes how the source fields are transformed to derive the target field.

e. type: Indicates the type of transformation (in this case, "concatenation").

f. expression: Provides the expression or logic used for the transformation.

g. target_database: Contains details about the destination database, including its name, URL, and type (e.g., MySQL).

h. target_table: Identifies the table or system where the transformed data is stored (here, "ReportingSystem").

i. target_field: Specifies the specific field in the target table where the transformed data is loaded ("full_address" in this case).

j. usage: Provides information about where the target field (full_address) is used, including reports and applications. This can help stakeholders understand the significance and impact of the field in various contexts.

Previous Next Home

Programming for beginners

Sunday, 11 February 2024

Field level data lineage

No comments:

Post a Comment