Monday, 11 October 2021

Introduction to Apache Airflow

What is Apache Airflow?

a.   Airflow is software to author, schedule, and monitor batch data pipelines.

b.   Airflow is a data pipeline orchestrator.

c.   Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines.

 

What is a workflow?

A workflow is a collection of tasks; a workflow can be triggered by a scheduler or by an event.

 

A typical workflow looks like the one below.

a.   Fetch data from a database or other source

b.   Clean the data

c.    Transform the data

d.   Place the data in HDFS

e.   Send Email

 

 

 

In the traditional ETL approach, we write a script/program to fetch the data from the database, clean and transform it, and send it to HDFS. This script/program is triggered by a CRON job.
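
As a rough sketch, such a traditional pipeline script might look like the one below. The function names, file paths and the CRON entry are hypothetical placeholders, not taken from any particular project.

# etl_pipeline.py - a hypothetical hand-written ETL script for the workflow above.
# A CRON entry such as the following would trigger it every day at 2 AM:
#   0 2 * * * python3 /opt/etl/etl_pipeline.py >> /var/log/etl.log 2>&1

def fetch_data():
    """Fetch data from the database or source."""
    ...

def clean_data():
    """Clean the fetched data."""
    ...

def transform_data():
    """Transform the cleaned data."""
    ...

def load_to_hdfs():
    """Place the transformed data in HDFS."""
    ...

def send_email():
    """Send a notification email."""
    ...

if __name__ == "__main__":
    # Tasks run strictly in the order they are called here.
    fetch_data()
    clean_data()
    transform_data()
    load_to_hdfs()
    send_email()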

 

Problems with the traditional approach

a.   What if a task in the workflow fails in the middle?

It is the programmer's responsibility to implement a retry mechanism. This raises further questions: how many times should we retry, and what should the fail-over behaviour be? (A sketch of such hand-rolled retry logic follows this list.)

b.   Monitoring the workflow

The programmer must implement monitoring to track the execution status (failure/success) of each task.

c.    Dependency management

Tasks within a workflow depend on each other, and the programmer has to handle this in the script. For example, task T5 depends on task T1, and task T1 can only start once the empsInfo.csv file is available.

d.   Scalability

e.   Deployment

Apply new changes to the workflow

f.     Maintaining history of workflow runs
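
As item a above notes, in the traditional approach retries, status logging and task ordering are all the programmer's burden. The snippet below is a minimal sketch of such hand-rolled retry logic; the helper name run_with_retries and the retry numbers are only illustrative.

import logging
import time

logging.basicConfig(level=logging.INFO)

def run_with_retries(task, max_retries=3, delay_seconds=60):
    """Run a task, retrying on failure; the fail-over behaviour is still the caller's problem."""
    for attempt in range(1, max_retries + 1):
        try:
            task()
            logging.info("%s succeeded on attempt %d", task.__name__, attempt)
            return
        except Exception:
            logging.exception("%s failed on attempt %d", task.__name__, attempt)
            if attempt == max_retries:
                raise  # no retries left; fail-over behaviour is undefined
            time.sleep(delay_seconds)

# Dependencies are only implicit in the call order; nothing else enforces them,
# and the run history lives only in whatever log files we keep ourselves.
# run_with_retries(fetch_data)
# run_with_retries(clean_data)
# ...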

 

Why Apache Airflow?

All the problems mentioned in the section 'Problems with the traditional approach' are handled by Airflow. Using Apache Airflow, you can define tasks and their dependencies in Python. Apache Airflow is also scalable: you can configure multiple worker nodes to run the tasks.
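
As a minimal sketch (assuming Airflow 2.x; the callables are the same hypothetical placeholders used in the earlier script), the workflow above could be expressed as an Airflow DAG like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data(): ...        # fetch data from the database or source
def clean_data(): ...        # clean the data
def transform_data(): ...    # transform the data
def load_to_hdfs(): ...      # place the data in HDFS
def send_email(): ...        # send a notification email

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2021, 10, 11),
    schedule_interval="@daily",              # Airflow's scheduler replaces the CRON job
    catchup=False,
    default_args={
        "retries": 3,                        # retry mechanism handled by Airflow
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    load = PythonOperator(task_id="load_to_hdfs", python_callable=load_to_hdfs)
    email = PythonOperator(task_id="send_email", python_callable=send_email)

    # Dependencies are declared explicitly; the scheduler enforces the order,
    # and the web UI records the status and history of every run.
    fetch >> clean >> transform >> load >> email

With this in place, failed tasks are retried automatically, every run is recorded, and each task's status can be monitored from the Airflow web UI.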

