Keyspaces are the top-level building blocks for organizing data in Cassandra. A cluster, in turn, is a container for keyspaces.
If you come from an RDBMS (Relational Database Management System) background, you can map a keyspace to a database in the RDBMS world. Like a database in an RDBMS, a keyspace has a name and attributes (replication class, replication_factor, etc.) associated with it.
In an RDBMS, a database is a collection of tables, whereas in Cassandra a keyspace is a collection of column families (called tables in CQL). You can map column families to tables in an RDBMS.
A keyspace can contain
a. Tables
b. Views
c. User-defined Types
d. Functions
e. Aggregates
Apart from acting as a container for data structures like tables and indexes, a keyspace holds some important attributes of its own.
For example, while creating a keyspace you can specify the replication strategy, which determines how replicas of the data are placed across the nodes of the Cassandra cluster.
Example
CREATE KEYSPACE cassandratutorial WITH REPLICATION =
{
'class' : 'SimpleStrategy',
'replication_factor' : 3
};
The above statement creates a keyspace named 'cassandratutorial' and sets 'replication_factor' to 3. That means every row is replicated on 3 different nodes.
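To make the effect of 'replication_factor' concrete, here is a small Python sketch of how SimpleStrategy chooses replicas: the first copy goes to the node that owns the row, and the remaining copies go to the next nodes around the ring. The node names and the helper replicas_for are hypothetical and only model the idea; this is not Cassandra code.

```python
# Toy model of SimpleStrategy replica placement.
# Assumption: a five-node ring with hypothetical node names.
RING = ["node1", "node2", "node3", "node4", "node5"]


def replicas_for(owner_index, replication_factor, ring=RING):
    """Return the nodes that hold a copy of a row owned by ring[owner_index]."""
    # A ring cannot hold more distinct copies than it has nodes.
    copies = min(replication_factor, len(ring))
    # Walk the ring clockwise, wrapping around at the end.
    return [ring[(owner_index + step) % len(ring)] for step in range(copies)]


print(replicas_for(3, 3))  # ['node4', 'node5', 'node1']
```

With a replication factor of 3, any single node in this model can fail and two copies of each row remain reachable.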
Why do we maintain replicas of the data?
To achieve high availability, Cassandra maintains replicas of the data. For example, if a node in the Cassandra cluster fails, the data is served from another node that holds a replica.
How do the nodes communicate with each other in a Cassandra cluster?
Data in Cassandra is stored across multiple nodes in a cluster. Each node in the cluster frequently sends its state information to other nodes. Cassandra uses the Gossip protocol to exchange this state information between the nodes of the cluster.
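The idea behind gossip can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (each node talks to one random peer per round, and state is a single version number per node); the function names are hypothetical and it does not reproduce Cassandra's actual gossip implementation.

```python
import random


def gossip_round(views, rng):
    """Each node picks one random peer; both end up with the merged view.

    views maps a node name to {node name: highest version seen so far}.
    """
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = dict(views[node])
        for name, version in views[peer].items():
            merged[name] = max(merged.get(name, -1), version)
        views[node] = dict(merged)
        views[peer] = dict(merged)


def run_until_converged(node_names, seed=0, max_rounds=1000):
    """Gossip until every node knows about every other node."""
    rng = random.Random(seed)
    # Initially each node only knows its own state (version 1).
    views = {name: {name: 1} for name in node_names}
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(len(view) == len(node_names) for view in views.values()):
            return round_no, views
    raise RuntimeError("views did not converge")


rounds, views = run_until_converged(["node1", "node2", "node3", "node4", "node5"])
print(f"all nodes know the full cluster state after {rounds} round(s)")
```

Even though no node ever contacts all the others directly, the state spreads through the cluster within a few rounds.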
Primary Key, Partition Key and Clustering Key
Primary Key
A primary key uniquely identifies a row in a table. For example,
CREATE TABLE IF NOT EXISTS cassandratutorial.employee (
id INT PRIMARY KEY,
firstName VARCHAR,
lastName VARCHAR,
age INT
);
In the above example, id is the primary key of the employee table.
A primary key can also be a combination of multiple columns. For example, the manager table below uses the combination of countryCode, departmentId and id as its primary key.
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
id INT,
departmentId INT,
countryCode VARCHAR,
firstName VARCHAR,
lastName VARCHAR,
age INT,
PRIMARY KEY(countryCode, departmentId, id)
);
Partition Key
Cassandra organizes data into partitions. When you insert a row into a table, the partition key determines which node stores that row.
If a table has a single-column primary key, that column is the partition key. For the table 'cassandratutorial.employee', id is the partition key.
If a table has a multi-column primary key, the first column of the primary key is the partition key. For the table 'cassandratutorial.manager', countryCode is the partition key.
Clustering Key
Every primary key column after the partition key is a clustering key.
For example, for the table 'cassandratutorial.manager', departmentId and id are the clustering keys. Clustering keys determine how rows are ordered on disk within a partition.
Let me explain with an example.
Suppose we have a five-node Cassandra cluster.
As shown in the figure, there are 5 nodes in the Cassandra cluster. Node 1 stores the rows whose partition-key hash falls between 0 and 19, node 2 stores the rows whose partition-key hash falls between 20 and 39, and so on.
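The mapping from partition key to node can be sketched in Python as follows. The hash used here (SHA-256 reduced to the range 0-99) is only a stand-in chosen so the numbers match the 0-19, 20-39, ... ranges above; Cassandra's default partitioner is Murmur3Partitioner, which produces 64-bit tokens. The function names are hypothetical.

```python
import hashlib

NODE_COUNT = 5
RANGE_SIZE = 20  # node 1 owns 0-19, node 2 owns 20-39, and so on


def toy_token(partition_key):
    """Map a partition key to 0..99 (stand-in hash, not Cassandra's partitioner)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (NODE_COUNT * RANGE_SIZE)


def node_for(partition_key):
    """Return the 1-based number of the node whose range contains the token."""
    return toy_token(partition_key) // RANGE_SIZE + 1


for code in ("IN", "AU", "US"):
    print(code, "-> token", toy_token(code), "-> node", node_for(code))
```

The important property is that the same partition key always yields the same token, so all rows with the same countryCode land on the same node.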
Let's insert some data into the manager table.
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (1, 1, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (1, 2, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (1, 3, 'AU', 'Ram', 'Ponnam', 39);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (2, 1, 'US', 'Sushmita', 'Sen', 41);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (3, 2, 'IN', 'Sunil', 'Nanda', 29);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (4, 3, 'AU', 'Prapuran', 'dam', 36);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age)
VALUES (5, 1, 'US', 'Prathuk', 'Kumar', 54);
While storing a row in the cluster, Cassandra calculates the hash of countryCode (since it is the partition key) and, depending on the hash value, stores the row on the corresponding node. Within each partition, the clustering keys keep the rows in ascending order of departmentId followed by id. You can observe this by listing the table contents with a SELECT query.
cqlsh> SELECT * FROM cassandratutorial.manager;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          IN |            1 |  1 |  35 |   Krishna |   Gurram
          IN |            2 |  1 |  35 |   Krishna |   Gurram
          IN |            2 |  3 |  29 |     Sunil |    Nanda
          AU |            3 |  1 |  39 |       Ram |   Ponnam
          AU |            3 |  4 |  36 |  Prapuran |      dam
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(7 rows)
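The ordering in this output can be reproduced with a short Python sketch: rows are grouped by partition key (countryCode) and, inside each group, sorted by the clustering keys (departmentId, then id). This only models the within-partition ordering; the order in which the partitions themselves appear depends on their tokens and is not modelled. The helper store is hypothetical.

```python
from collections import defaultdict

# (id, departmentId, countryCode, firstName) in the order they were inserted
ROWS = [
    (1, 1, "IN", "Krishna"),
    (1, 2, "IN", "Krishna"),
    (1, 3, "AU", "Ram"),
    (2, 1, "US", "Sushmita"),
    (3, 2, "IN", "Sunil"),
    (4, 3, "AU", "Prapuran"),
    (5, 1, "US", "Prathuk"),
]


def store(rows):
    """Group rows by partition key, then sort each group by clustering keys."""
    partitions = defaultdict(list)
    for row_id, department_id, country_code, first_name in rows:
        partitions[country_code].append((department_id, row_id, first_name))
    return {
        country: sorted(items, key=lambda item: (item[0], item[1]))
        for country, items in partitions.items()
    }


table = store(ROWS)
for country, items in table.items():
    for department_id, row_id, first_name in items:
        print(country, department_id, row_id, first_name)
```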
Specifying clustering key ordering
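Clustering keys are sorted in ascending order by default; you can change this behaviour with the 'CLUSTERING ORDER BY' clause. The clustering order must be declared when the table is created, and because the statement below uses IF NOT EXISTS, it has no effect if the manager table already exists.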
Example
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
id INT,
departmentId INT,
countryCode VARCHAR,
firstName VARCHAR,
lastName VARCHAR,
age INT,
PRIMARY KEY(countryCode, departmentId, id)
) WITH CLUSTERING ORDER BY (departmentId DESC, id ASC);
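As a rough illustration of what this clause changes, the Python sketch below sorts the (departmentId, id) pairs of the 'IN' partition with departmentId descending and id ascending. Negating departmentId is just a sorting trick that works for numbers; the function name is hypothetical.

```python
def clustering_order(rows):
    """Sort (departmentId, id) pairs: departmentId DESC, then id ASC."""
    return sorted(rows, key=lambda row: (-row[0], row[1]))


in_partition = [(1, 1), (2, 1), (2, 3)]  # default ascending order
print(clustering_order(in_partition))  # [(2, 1), (2, 3), (1, 1)]
```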
A partition key can be made of multiple columns
A partition key can consist of multiple columns; to do so, group those columns in an inner pair of parentheses.
For example,
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
id INT,
departmentId INT,
countryCode VARCHAR,
firstName VARCHAR,
lastName VARCHAR,
age INT,
PRIMARY KEY((countryCode, departmentId), id)
);
In the above case, countryCode and departmentId together form the partition key, and id is the only clustering key.
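The rules for reading a PRIMARY KEY clause can be summarised in a small Python sketch. Here the clause is represented as a tuple of column names whose first element may itself be a tuple (the inner parentheses); this representation and the helper split_primary_key are assumptions made for illustration, not part of any Cassandra API.

```python
def split_primary_key(primary_key):
    """Split a PRIMARY KEY definition into (partition columns, clustering columns)."""
    first, *rest = primary_key
    # An inner tuple mirrors the inner parentheses of a multi-column partition key.
    partition = list(first) if isinstance(first, tuple) else [first]
    return partition, list(rest)


# PRIMARY KEY(id)
print(split_primary_key(("id",)))
# PRIMARY KEY(countryCode, departmentId, id)
print(split_primary_key(("countryCode", "departmentId", "id")))
# PRIMARY KEY((countryCode, departmentId), id)
print(split_primary_key((("countryCode", "departmentId"), "id")))
```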