Thursday, 5 March 2020

Cassandra: Secondary Index

In Cassandra, you can able to query on columns that are part of primary key or columns that has secondary index.

Let’s create a KEYSPACE cassandratutorial.
CREATE KEYSPACE cassandratutorial WITH REPLICATION = 
{ 
  'class' : 'SimpleStrategy', 
  'replication_factor' : 1 
};

Let’s create a table manager in cassandratutorial KEYSPACE.
CREATE TABLE IF NOT EXISTS cassandratutorial.manager (
  id INT,
  departmentId INT,
  countryCode VARCHAR,
  firstName VARCHAR,
  lastName VARCHAR,
  age int,
  PRIMARY KEY(countryCode, departmentId, id)
) WITH CLUSTERING ORDER BY (departmentId DESC, id ASC);

Insert data to manager table, by executing below statements.
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 1, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 2, 'IN', 'Krishna', 'Gurram', 35);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (1, 3, 'AU', 'Ram', 'Ponnam', 39);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (2, 1, 'US', 'Sushmita', 'Sen', 41);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (3, 2, 'IN', 'Sunil', 'Nanda', 29);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (4, 3, 'AU', 'Prapuran', 'dam', 36);
INSERT INTO cassandratutorial.manager (id, departmentId, countryCode, firstName, lastName, age) VALUES (5, 1, 'US', 'Prathuk', 'Kumar', 54);

cqlsh> SELECT * FROM cassandratutorial.manager;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          IN |            2 |  1 |  35 |   Krishna |   Gurram
          IN |            2 |  3 |  29 |     Sunil |    Nanda
          IN |            1 |  1 |  35 |   Krishna |   Gurram
          AU |            3 |  1 |  39 |       Ram |   Ponnam
          AU |            3 |  4 |  36 |  Prapuran |      dam
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(7 rows)

PRIMARY KEY(countryCode, departmentId, id)
Since I specified countryCode, departmentId, id as primary key, I can query with below combinations.
a. countryCode
b. countryCode AND departmentId
c. countryCode AND departmentId AND id
cqlsh> SELECT * FROM cassandratutorial.manager WHERE countryCode='US';

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(2 rows)
cqlsh> SELECT * FROM cassandratutorial.manager WHERE countryCode='US' AND departmentId=1;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(2 rows)
cqlsh> SELECT * FROM cassandratutorial.manager WHERE countryCode='US' AND departmentId=1 AND id=2;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          US |            1 |  2 |  41 |  Sushmita |      Sen

(1 rows)

SELECT * FROM cassandratutorial.manager WHERE departmentId=1;
When you try to execute above query, you will end up in below error.

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
cqlsh> SELECT * FROM cassandratutorial.manager WHERE departmentId=1;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

Even though ‘departmentId’ is part of the primary key, Cassandra is unable to query in this field. Just read the error message, it is saying that, to execute this query Cassandra require additional filtering of the data and may not be able to execute the query in efficient way.

To execute this query add ‘ALLOW FILTERING’ clause at the end of query.
cqlsh> SELECT * FROM cassandratutorial.manager WHERE departmentId=1 ALLOW FILTERING;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          IN |            1 |  1 |  35 |   Krishna |   Gurram
          US |            1 |  2 |  41 |  Sushmita |      Sen
          US |            1 |  5 |  54 |   Prathuk |    Kumar

(3 rows)

You can use ‘ALLOW FILTERING’ clause to query on columns that are not part of PRIMARY KEY.
cqlsh> SELECT * FROM cassandratutorial.manager WHERE age=35 ALLOW FILTERING;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          IN |            2 |  1 |  35 |   Krishna |   Gurram
          IN |            1 |  1 |  35 |   Krishna |   Gurram

(2 rows)
cqlsh> SELECT * FROM cassandratutorial.manager WHERE firstName='Ram' ALLOW FILTERING;

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          AU |            3 |  1 |  39 |       Ram |   Ponnam

(1 rows)

To execute these kind of queries(Queries on columns that are not part of primary key), Cassandra should invest lot of computing power. To improve the performance, better to create secondary index on a columns that you want to query on.

Syntax to create secondary index
<create-index-stmt> ::= CREATE ( CUSTOM )? INDEX ( IF NOT EXISTS )? ( <indexname> )?
                            ON <tablename> '(' <index-identifier> ')'
                            ( USING <string> ( WITH OPTIONS = <map-literal> )? )?

<index-identifier> ::= <identifier>
                     | keys( <identifier> )

Example
CREATE INDEX firstNameIndex ON cassandratutorial.manager(firstName);

Above statement create an index on firstName field. Once you created the secondary index, you can run the query without ‘ALLOW FILTERING’ clause.
cqlsh> CREATE INDEX firstNameIndex ON cassandratutorial.manager(firstName);
cqlsh> 
cqlsh> SELECT * FROM cassandratutorial.manager WHERE firstName='Ram';

 countrycode | departmentid | id | age | firstname | lastname
-------------+--------------+----+-----+-----------+----------
          AU |            3 |  1 |  39 |       Ram |   Ponnam

(1 rows)



Previous                                                    Next                                                    Home

No comments:

Post a Comment