Monday, 22 October 2012

Cassandra - TTL columns, Tombstones and complete row deletion.

This article is an investigation into Cassandra TTL columns, how it cleans up tombstoned data and gets rid of rows all together. The following investigation uses cassandra v1.0.10 + cql 2.0.0. TTL's are set on to your data at insert/upsert time and specify an expiry date (to the second) for the data, creating a tombstone. To understand why it doesn't just delete the data on expiry see here. The columns are completely deleted during compaction, but the documentation states that you need to be careful in how you manage this or you will end up with failed nodes. In order to understand this better, I decided to run some scenarios that match my current setup in which some data is TTL'd but data is never deleted.

I concluded that I will not get failed nodes. Please read on to understand my reasoning.

Test 1: Write columns with replication factor = 1, TTL 30s, gc_grace_seconds 60s.

create keyspace testks WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 1;
use testks;
create table test (ID varchar primary key, firstname varchar, surname varchar) with  gc_grace_seconds=60 and compression_parameters:sstable_compression='SnappyCompressor';

Grace period is set to 60 seconds. TTL on insert is set to 30 seconds. So after 30 seconds, the columns should be tombstoned, which means they will be marked for removal at the next compaction as long as the time of compaction LESS the TTL time exceeds 60 seconds.

Write ten records: cqlsh < 10_testrecords.cql

insert into testks.test (id, firstname, surname) values (1,'tom','jones') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (2,'tom','cruise') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (3,'tom','tom') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (4,'tom','thumb') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (5,'tom','ljones') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (6,'tom','dixon') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (7,'tom','baker') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (8,'tom','smith') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (9,'tom','jessy') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (10,'tom','froun') using consistency QUORUM and TTL 30;

node 1
cqlsh:testks> select * from test ;
 ID | firstname | surname
----+-----------+---------
  3 |      stom |    stom
  6 |      stom |   dixon
  5 |      stom |  ljones
 10 |      stom |   froun
  8 |      stom |   smith
  2 |      stom |  cruise
  1 |      stom |   jones
  9 |      stom |   jessy
  4 |      stom |   thumb
  7 |      stom |   baker

Wait 30 seconds:

node 1
cqlsh:testks> select * from test ;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9
  4
  7

As aluded to above regarding tombstones, we would still expect to see the row keys. An excellent write up on this can be found here.

So how do we get rid of the row keys and thus the records all together?

In theory we should be able to flush and compact the test column family on each node which will get rid of the rows for good. Compaction is responsible for permanently removing tombstone records once compaction time less tombstone time is beyond gc_grace_seconds.  So let's try it:

node1
nodetool -h localhost flush testks test
nodetool -h localhost compact testks test

This should have removed 1/3rd of the keys, ie keys that reside on node1:

cqlsh:testks> select * from test using consistency one;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9
  4
  7

Hmm... doesn't seem to have worked. Let's check the log:

INFO [CompactionExecutor:249] 2012-10-18 11:43:23,275 CompactionTask.java (line 245) Nothing to compact in test.  Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)

So compaction cannot delete tombstones if there is only 1 data file, as it seems to work by merging multiple files together. We'll test this soon, but for now let's follow the suggestion to use forceUserDefinedCompaction. This is available as java mbean and can be invoked using jconsole.

Connect jconsole to node1. Select mbeans tab, org.apache.cassandra.db, CompactionManager, Operations. The right hand panel now shows forceUserDefinedCompaction. This takes two parameters, keyspace name and file name to compact. So let's call:

forceUserDefinedCompaction (testks, test-hd-1-Data.db)

Check system.log:

 INFO [CompactionExecutor:251] 2012-10-18 13:26:54,507 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db')]

cqlsh:testks> select * from test using consistency one;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9

VoilĂ , we have now got rid of 2 keys which this node is responsible for. We could go around each other node and do the same, which would remove all row keys, but I won't bother. Instead I will focus on how we can invoke key deletion under 'natural selection', so to speak.

I'll reset the data to the point before I ran the forceUserDefinedCompaction.

Now, if compaction requires more that one data file, let's upsert the data, keeping the same TTL:

insert into testks.test (id, firstname, surname) values (1,'stom','jones') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (2,'stom','cruise') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (3,'stom','stom') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (4,'stom','thumb') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (5,'stom','ljones') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (6,'stom','dixon') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (7,'stom','baker') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (8,'stom','smith') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (9,'stom','jessy') using consistency QUORUM and TTL 30;
insert into testks.test (id, firstname, surname) values (10,'stom','froun') using consistency QUORUM and TTL 30;

We should only have to wait 60 seconds to run the compaction in order to remove the rows completely (gc_grace_seconds=60s), but I find that gives unreliable results. So I give it 2-3 minutes:

nodetool -h localhost flush testks test
Wait 2-3 minutes.
nodetool -h localhost compact testks test

Check system.log:

 INFO [CompactionExecutor:261] 2012-10-18 13:36:09,735 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db'), SSTableReader(path='/media/data/cassandra/data/testks/test-hd-2-Data.db')]

So it seems to have worked. Let's check the data:

cqlsh:testks> select * from test using consistency one;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9

VoilĂ  encore. Row keys 4 and 7 have gone, as in the forceUserDefinedCompaction example.

Let's run the flush and compaction on the other nodes too:

Node2
nodetool -h localhost flush testks test
nodetool -h localhost compact testks test

Node3
nodetool -h localhost flush testks test
nodetool -h localhost compact testks test

Let's check the data again:

select * from test using consistency one;
cqlsh:testks>

Hooray, no row keys!

Of course the beauty of Cassandra is being able to easily replicate data, so a more real world example would be to have a replication factor of say, 3. The inserts are quorum, therefore will block on 2 writes and replicate eventually a total 3 times.

Test 2: Let's crank up the replication factor to 3 and keep TTL 30s, gc_grace_seconds 60s.

create keyspace testks WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 3;
use testks;
create table test (ID varchar primary key, firstname varchar, surname varchar) with  gc_grace_seconds=60 and compression_parameters:sstable_compression='SnappyCompressor';


Let's load the data:

cqlsh < 10_testrecords.cql

Node1
nodetool -h localhost flush testks test

Node2
nodetool -h localhost flush testks test

Node3
nodetool -h localhost flush testks test

upsert the data
cqlsh < 10_testrecords2.cql

Now let's flush:

Node1
nodetool -h localhost flush testks test

Node2
nodetool -h localhost flush testks test

Node3
nodetool -h localhost flush testks test

Each node will now have 2 datafile. Wait 2-3 minutes then run compaction:

Node1
 nodetool -h localhost compact testks test
INFO [CompactionExecutor:342] 2012-10-18 14:55:50,228 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-2-Data.db'), SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db')]

Select on each node again:

Node1
select * from test using consistency one;
cqlsh:testks>

I guess this is because CL = one and cassandra uses a snitch mechanism to know that node is responsible for one complete set of replicas, so it returns no rows.


Node2+3
cqlsh:testks> select * from test using consistency one;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9
  4
  7


So the keys have been deleted from Node1, but not Node2+3.

Run the compaction on Nodes2+3 finally gets rid of all the keys.

So given the foregoing, what would happen if we select with a higher consistency level?

Repeat setup + compaction as in last example:

Node1:
select * from test using consistency one;
cqlsh:testks>

As expected, no rows. How about reading quorum, which requires data from 2 nodes. This should return all the row keys, as it will be combining null from the local node with successful reads from nodes 2 and/or 3 (this is probably an 'or' situation if the snitch is working efficiently):

Node1:
select * from test using consistency quorum;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9
  4
  7

... ta da! Now, what about setting CL back to one?

Node1:
select * from test using consistency one;
 ID
----
  3
  6
  5
 10
  8
  2
  1
  9
  4
  7


The log gives no clue, but it would seem a read repair has taken place to Node1. This creates a memtable structure once again. If this conjecture is correct, we should be able to flush and compacted once again:

 INFO [CompactionExecutor:455] 2012-10-18 16:47:20,631 CompactionTask.java (line 245) Nothing to compact in test.  Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)

I found that the original compaction actually removed the data files, since no keys were left for the column family. Thus the flush produces the one and only datafile that cannot be compacted as we saw earlier.

Of course, under normal operating conditions each node would remove its TTL'd data once gc_grace_seconds has past, so such scenarios are short lived.

Although this article covers TTL data, the same rules should apply data deleted programmatically.

So what are the implication of failed nodes as implied here DistributedDeletes which states "if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node"? I would caveat this statement with the following for nodes that have been down for longer than gc_grace_seconds:

1. If you never programmatically delete data, then you will not get failed nodes.
2. If you only go as far as to TTL your columns to delete data, you will not get failed nodes.