Tuesday 7 January 2020

Cassandra Architecture


Keyspaces are the top-level building blocks for organizing data in Cassandra, while a cluster is a container for keyspaces.

If you come from an RDBMS (Relational Database Management System) background, you can map a keyspace to a database in the RDBMS world. Like a database in an RDBMS, a keyspace has a name and attributes (replication class, replication_factor, etc.) associated with it.

In an RDBMS, a database is a collection of tables, whereas in Cassandra, a keyspace is a collection of column families (called tables in CQL). You can map column families to tables in an RDBMS.

A keyspace can contain
a.   Tables
b.   Views
c.   User-defined Types
d.   Functions
e.   Aggregates

Apart from acting as a container for data structures such as tables and indexes, a keyspace also carries some important attributes.

For example, while creating a keyspace, you can specify the replication strategy. This attribute determines how replicas of the data are placed across the Cassandra cluster.

Example
CREATE keyspace cassandratutorial WITH REPLICATION =
{
         'class' : 'SimpleStrategy',
         'replication_factor' : 3
};

The above statement creates a keyspace 'cassandratutorial' and sets 'replication_factor' to 3. That means each row is stored on 3 different nodes.
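
To make the replication factor concrete, here is a minimal Python sketch of how SimpleStrategy picks replicas: the first replica goes to the node that owns the row's token, and the remaining replicas go to the next nodes clockwise on the ring. The node names and the ring list are hypothetical, and real Cassandra works with token ranges rather than list positions.

```python
def simple_strategy_replicas(ring, owner_index, replication_factor):
    """Return the nodes that hold a copy of a row.

    ring: node names in ring order (hypothetical names).
    owner_index: position of the node that owns the row's token.
    """
    # A cluster cannot hold more distinct replicas than it has nodes
    count = min(replication_factor, len(ring))
    # Walk clockwise from the owner, wrapping around the end of the ring
    return [ring[(owner_index + i) % len(ring)] for i in range(count)]


ring = ["node1", "node2", "node3", "node4", "node5"]
print(simple_strategy_replicas(ring, 3, 3))  # ['node4', 'node5', 'node1']
```

With 5 nodes and a replication factor of 3, a row owned by node4 is also stored on node5 and node1.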

Why do we maintain replicas of data?
To achieve high availability, Cassandra maintains replicas of the data. For example, if a node in the Cassandra cluster fails, the data is served from another node that holds a replica.

How do the nodes communicate with each other in a Cassandra cluster?
Data in Cassandra is stored across multiple nodes in a cluster. Each node frequently sends its state information to other nodes. Cassandra uses the Gossip protocol to exchange this state information between the nodes in the cluster.
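
The idea behind gossip can be sketched in a few lines of Python. This is a simplified model, not Cassandra's actual implementation: each node keeps a map of node name to heartbeat version, and in every round it swaps that map with one randomly chosen peer, keeping the higher version for each entry. The function names and the single-peer choice are assumptions made for illustration.

```python
import random


def merge(local, remote):
    """Keep, for every node, the entry with the higher heartbeat version."""
    merged = dict(local)
    for node, version in remote.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_until_converged(names, rng, max_rounds=1000):
    """Run gossip rounds until every node knows about every other node.

    Returns the number of rounds needed, or None if max_rounds is hit.
    """
    # Initially each node only knows its own heartbeat
    state = {name: {name: 0} for name in names}
    for round_no in range(1, max_rounds + 1):
        for name in names:
            state[name][name] += 1  # bump own heartbeat
            peer = rng.choice([n for n in names if n != name])
            combined = merge(state[name], state[peer])
            # Both sides end up with the merged view
            state[name] = combined
            state[peer] = dict(combined)
        if all(set(view) == set(names) for view in state.values()):
            return round_no
    return None


names = ["node1", "node2", "node3", "node4", "node5"]
print(gossip_until_converged(names, random.Random(42)))
```

Even though no node talks to all the others directly, the state spreads through the whole cluster after a few rounds.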

Primary Key, Partition Key and Clustering Key

Primary Key
A primary key uniquely identifies a row in a table. For example,

CREATE TABLE IF NOT EXISTS cassandratutorial.employee (
  id INT PRIMARY KEY,
  firstName VARCHAR,
  lastName VARCHAR,
  age int
);

In the above example, I set id as the primary key of the employee table.

A primary key can also be a combination of multiple columns. For example, I made the combination of countryCode, departmentId and id the primary key of the manager table.

CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
  id INT,
  departmentId INT,
  countryCode VARCHAR,
  firstName VARCHAR,
  lastName VARCHAR,
  age int,
  PRIMARY KEY(countryCode, departmentId, id)
);
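
Because the primary key identifies a row, writing a second row with the same primary key does not create a duplicate; it overwrites the column values of the existing row. The Python sketch below mimics that behaviour with a dictionary keyed by the primary key columns. It only illustrates the idea and is not how Cassandra stores rows internally.

```python
# Toy model of the manager table: primary key tuple -> column values
manager = {}


def insert(row):
    """Insert a row, or overwrite its columns if the primary key exists."""
    key = (row["countryCode"], row["departmentId"], row["id"])
    manager.setdefault(key, {}).update(row)


insert({"countryCode": "IN", "departmentId": 1, "id": 1,
        "firstName": "Krishna", "age": 35})
# Same (countryCode, departmentId, id), so this updates the existing row
insert({"countryCode": "IN", "departmentId": 1, "id": 1,
        "firstName": "Krishna", "age": 36})

print(len(manager))                  # 1
print(manager[("IN", 1, 1)]["age"])  # 36
```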

Partition Key
Cassandra organizes data into partitions. When you insert a row into a table, the partition key determines which node stores that row.



If a table has a single-column primary key, then the primary key is the partition key.
For the table 'cassandratutorial.employee', id is the partition key.

If a table has a multi-column primary key, then by default the first column of the primary key is the partition key.
For the table 'cassandratutorial.manager', countryCode is the partition key.

Clustering Key
Every primary key column after the partition key is a clustering key.

For example, for the table 'cassandratutorial.manager', departmentId and id are the clustering keys. Clustering keys determine how rows are ordered on disk within a partition.

Let me explain with an example.

Suppose we have a five-node Cassandra cluster.

As shown in the figure, the cluster has 5 nodes. Node 1 stores the rows whose partition key hash falls between 0 and 19, node 2 stores the rows whose partition key hash falls between 20 and 39, and so on.
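
Here is a rough Python sketch of that mapping. The 0-99 token range, the ranges for nodes 3 to 5, the node names and the CRC32-based hash are all assumptions chosen to match the simplified picture above; a real cluster uses a much larger token range and, by default, the Murmur3 partitioner.

```python
import zlib

NODES = ["node1", "node2", "node3", "node4", "node5"]
RANGE_SIZE = 100 // len(NODES)  # each node owns 20 consecutive tokens


def toy_token(partition_key):
    """Hash a partition key value into the toy token range 0-99."""
    return zlib.crc32(partition_key.encode("utf-8")) % 100


def node_for_token(token):
    """Tokens 0-19 go to node1, 20-39 to node2, and so on."""
    return NODES[token // RANGE_SIZE]


for country_code in ["IN", "AU", "US"]:
    token = toy_token(country_code)
    print(country_code, token, node_for_token(token))
```

The same partition key value always hashes to the same token, so all rows with that value end up on the same node.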

Let’s insert some data into the manager table.

INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 1, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 2, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 3, 'AU', 'Ram', 'Ponnam', 39);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (2, 1, 'US', 'Sushmita', 'Sen', 41);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (3, 2, 'IN', 'Sunil', 'Nanda', 29);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (4, 3, 'AU', 'Prapuran', 'dam', 36);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (5, 1, 'US', 'Prathuk', 'Kumar', 54);

While storing a row in the cluster, Cassandra calculates the hash of countryCode (since it is the partition key) and, depending on the hash value, stores the row on the corresponding node. Within each partition, the clustering keys keep the rows in ascending order of departmentId followed by id. You can observe this by listing the table contents using a SELECT query.

cqlsh> SELECT * FROM cassandratutorial.manager;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          IN |            1 |  1 |  35 |   Krishna |   Gurram
          IN |            2 |  1 |  35 |   Krishna |   Gurram
          IN |            2 |  3 |  29 |     Sunil |    Nanda
          AU |            3 |  1 |  39 |       Ram |   Ponnam
          AU |            3 |  4 |  36 |  Prapuran |      dam
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(7 rows)
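
The ordering in this output can be reproduced with a small Python sketch: group the inserted rows by the partition key (countryCode), then sort each group by the clustering keys (departmentId, id). The tuple layout and function name are my own choices for illustration.

```python
from collections import defaultdict

# (id, departmentId, countryCode, firstName), in insertion order
rows = [
    (1, 1, "IN", "Krishna"),
    (1, 2, "IN", "Krishna"),
    (1, 3, "AU", "Ram"),
    (2, 1, "US", "Sushmita"),
    (3, 2, "IN", "Sunil"),
    (4, 3, "AU", "Prapuran"),
    (5, 1, "US", "Prathuk"),
]


def store(rows):
    """Group rows by partition key, then sort each partition by clustering keys."""
    partitions = defaultdict(list)
    for id_, department_id, country_code, first_name in rows:
        partitions[country_code].append((department_id, id_, first_name))
    for partition in partitions.values():
        # Ascending by departmentId, then by id
        partition.sort(key=lambda r: (r[0], r[1]))
    return dict(partitions)


for country_code, partition in store(rows).items():
    print(country_code, partition)
```

Note that the sorting applies only inside a partition. The order in which the partitions themselves appear in the SELECT output depends on the hash of the partition key, not on its alphabetical order.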

Specifying clustering key ordering
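Clustering keys are sorted in ascending order by default. You can change this behaviour using the ‘CLUSTERING ORDER BY’ clause. The clustering order is fixed when the table is created, so drop the existing manager table before trying the example below; otherwise IF NOT EXISTS silently skips the statement.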

Example
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
  id INT,
  departmentId INT,
  countryCode VARCHAR,
  firstName VARCHAR,
  lastName VARCHAR,
  age int,
  PRIMARY KEY(countryCode, departmentId, id)
) WITH CLUSTERING ORDER BY (departmentId DESC, id ASC);
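
As a quick illustration of what this clause changes, the Python sketch below sorts the rows of the 'IN' partition by departmentId descending and then id ascending. It is only a model of the resulting order; the tuple layout is an assumption for the example.

```python
# (departmentId, id, firstName) rows of the 'IN' partition
in_partition = [
    (1, 1, "Krishna"),
    (2, 1, "Krishna"),
    (2, 3, "Sunil"),
]

# departmentId DESC (negated), id ASC
ordered = sorted(in_partition, key=lambda r: (-r[0], r[1]))

for row in ordered:
    print(row)
# (2, 1, 'Krishna')
# (2, 3, 'Sunil')
# (1, 1, 'Krishna')
```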

Partition key made of multiple columns
A partition key can consist of multiple columns. To do that, wrap those columns in an inner pair of parentheses inside the PRIMARY KEY clause.

For example,
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
  id INT,
  departmentId INT,
  countryCode VARCHAR,
  firstName VARCHAR,
  lastName VARCHAR,
  age int,
  PRIMARY KEY((countryCode, departmentId), id)
);


In the above case, countryCode and departmentId together form the partition key, and id is the clustering key.
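
To see the effect of a composite partition key, here is a small Python sketch that groups the same seven manager rows by (countryCode, departmentId) and sorts each group by id. The tuple layout and function name are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

# (id, departmentId, countryCode) of the seven manager rows
rows = [
    (1, 1, "IN"), (1, 2, "IN"), (1, 3, "AU"), (2, 1, "US"),
    (3, 2, "IN"), (4, 3, "AU"), (5, 1, "US"),
]


def partition_rows(rows):
    """Group rows by the composite partition key, sort each group by id."""
    partitions = defaultdict(list)
    for id_, department_id, country_code in rows:
        partitions[(country_code, department_id)].append(id_)
    for ids in partitions.values():
        ids.sort()  # id is the only clustering key
    return dict(partitions)


partitions = partition_rows(rows)
print(len(partitions))        # 4
print(partitions[("IN", 2)])  # [1, 3]
```

The rows for 'IN' are now split across two partitions, ('IN', 1) and ('IN', 2), which may live on different nodes. The flip side is that a query needs both countryCode and departmentId to locate a partition.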

