Saturday, 2 January 2016

Introduction to Hadoop

What is Big Data?

Big data refers to data sets so large that they usually cannot be stored or processed on a single system.

 

For example, facebook.com generates 4 petabytes of data every day. It is impossible to store and process this huge data with the help of a single computer.

 

Big data is commonly characterized by 4 V’s:

a.   Volume

b.   Velocity

c.    Variety

d.   Veracity

 

Volume

The volume of the data is huge. For example, data sizes run into terabytes, petabytes or more, which traditional systems can’t handle.

 

Velocity

Velocity represents the speed at which new data is generated. For example, facebook.com generates 4 petabytes of data every day.

 

Variety

Data can come in any of the formats given below.

a.   Structured: RDBMS table data.

b.   Semi-structured: CSV, JSON, XML data.

c.   Unstructured: images, videos, application log files etc.

 

Veracity

Veracity represents the quality of the data. Most of the time the data is messy, and we need to pre-process it before extracting any insights from it. For example, some of the columns in the data may not have a value.
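
As a minimal illustration of such a pre-processing step, the Java sketch below drops the records where a column is empty. The sample records and the id,name,age column layout are made up for this example.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of a pre-processing step: drop the records where one of the
// columns has no value. The sample data and the id,name,age layout are
// hypothetical, used only to illustrate the idea.
public class CleanRecords {
    public static void main(String[] args) {
        List<String> rawRecords = Arrays.asList(
            "1,Ram,32",
            "2,,27",        // name column is empty -> messy record
            "3,Krishna,",   // age column is empty  -> messy record
            "4,Joel,41"
        );

        List<String> cleanRecords = rawRecords.stream()
            .filter(record -> {
                String[] columns = record.split(",", -1);   // keep trailing empty columns
                return columns.length == 3
                        && Arrays.stream(columns).noneMatch(String::isEmpty);
            })
            .collect(Collectors.toList());

        cleanRecords.forEach(System.out::println);   // prints records 1 and 4 only
    }
}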

 

Why can’t traditional systems handle big data?

Big data is huge in size, and new data accumulates in terabytes every day. We can’t store this huge data on a single system. Traditional systems, which are not distributed by nature, need vertical scaling (adding more CPUs and memory to the system) to handle this data. Since the volume of data generated every day is huge, at some point this vertical scaling reaches its upper limit and can no longer solve the use case.

 

Big data also comes in different formats like structured, semi-structured, and unstructured. RDBMS systems, which require data to follow a fixed schema, can’t store all these varieties of data.


What is Hadoop?
a. Hadoop is open-source software for reliable, scalable, distributed computing.
b. Hadoop is a framework to manage and process big data.
c. Hadoop is software to address big data use cases.

What is Distributed Computing?
In distributed computing, multiple systems connected over a network communicate with each other to achieve a common task. A problem is divided into many tasks, and each task is assigned to a computer in the network.
 
Since big data runs into petabytes or more, we can store this huge data on a cluster of machines, and we can add any number of nodes to the cluster in the future, depending on the need.

What is Distributed program?
A computer program that runs in a distributed environment is called a distributed program.

Why Distributed Computing?
Today’s world is overwhelmed by an ever-growing amount of data. This data comes from many sources like social networking sites, stock markets, sensor networks, vehicle GPS traces, online transactions and spatial data. Is it possible to store and analyze all this data on a single machine? Definitely not.

In distributed computing, a distributed program can utilize the processing capability, memory and other hardware resources of all the computers connected in the network. So we can solve this huge data problem with distributed computing.
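
Hadoop applies this idea across the machines of a cluster. As a rough single-machine analogy (not Hadoop code), the Java sketch below divides one problem, summing a large array, into independent tasks, runs them in parallel and aggregates the partial results; in a real distributed system each task would run on a different computer in the network.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-machine analogy of "divide the problem into tasks, run them in
// parallel, then combine the partial results".
public class DivideAndAggregate {
    public static void main(String[] args) throws Exception {
        long[] data = new long[10_000_000];
        Arrays.fill(data, 1L);                      // pretend this is a huge data set

        int workers = 4;                            // stand-in for 4 machines
        int chunk = data.length / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = (w == workers - 1) ? data.length : start + chunk;
            // Each "worker" sums only its own slice of the data.
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = start; i < end; i++) {
                    sum += data[i];
                }
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> p : partials) {
            total += p.get();                       // aggregate the partial results
        }
        pool.shutdown();
        System.out.println("Total = " + total);     // prints Total = 10000000
    }
}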

Hadoop core components:
a.   HDFS: Hadoop Distributed File System. Data is divided into chunks (blocks) and stored across the systems in a cluster (a cluster is a collection of computers).
b.   MapReduce: Programming model to process the data stored in HDFS (see the word-count sketch after this list).
c.   YARN: Stands for Yet Another Resource Negotiator; it handles the resource management part of Hadoop.
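
As an illustration of how the three components fit together, below is a minimal sketch of the classic word-count MapReduce job in Java, assuming the Hadoop client libraries are on the classpath. The input and output paths are placeholder command-line arguments pointing to HDFS directories; the mapper emits (word, 1) pairs, the reducer sums them per word, and YARN schedules the map and reduce tasks on the cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, a job like this is typically submitted with the hadoop jar command, for example: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are just placeholders).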



References

https://kinsta.com/blog/facebook-statistics/

