Friday 23 May 2014

Upgrading Cassandra from 1.0 to 1.2 - Warts and all!

I recently upgraded an 8-node Cassandra cluster (logically split over 2 data centres) from version 1.0 to 1.2. It took quite a while of experimentation in a sandpit environment before I came up with a formula that I had the confidence to use in production. The upgrade itself was done in a rolling fashion.

Any data that I cared about had a replication factor of 3 in each data centre.

The first step in this journey is, of course, to follow the often excellent documentation on the DataStax website:

1.1 upgrade guide
1.2 upgrade guide 

I've included the v1.1 upgrade pages for reasons that will become apparent.

These upgrade guides are rather economical with detail, and it is worth reading around the forums to fill in the gaps.

First I'll describe my journey, which started by following the v1.2 upgrade guide. Since the database was already running v1.0, upgrading straight to v1.2 is considered legal. This involved a number of prerequisite checks, such as modifying queries as stipulated in the guide, checking NEWS.txt for any upgrade-related notes, and likewise CHANGES.txt. All fairly standard.

With that out of the way, the first step was to take a snapshot per node. I strongly recommend copying the snapshots to a location outside of cassandra's control. As I discovered during testing, part of the upgrade process is to re-organise the data file structure from:

<path>/KS/filename (1.0 file name format)

to:

<path>/KS/CF/filename (1.2 file name format)

which includes the snapshot files! That leaves you with two headaches. First, should you wish to revert from 1.2 to 1.0, you need to grab all the snapshot files, convert them back to the original file name format and move them back to the original directory structure. Secondly, and perhaps more worryingly, if the upgrade fails it could leave your snapshot files in a mess, with some of them in the original format/location, some in the new and some possibly missing. I can vouch for this, having experienced this very issue during testing; there was no way back. Being a test cluster, I would have been happy enough to blow it away and start again, but in the end I elected to use cassandra's excellent ability to rebuild a node from the surviving nodes, none of which had been upgraded at that point.

Tip #1: Copy/rsync snapshots to a directory outside of cassandra's control or, better still, to another server entirely.
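
Something along these lines is what I used to get the snapshots off-node; the snapshot tag, data path and backup host are placeholders for your own environment:

nodetool snapshot -t pre-1.2-upgrade
# copy every snapshots directory, preserving the relative path, to the backup host
find /var/lib/cassandra/data -type d -name snapshots |
while read -r dir; do
  rsync -aR "$dir" "backuphost:/backups/$(hostname)/"
done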

My first upgrade attempt comes right out of the guide.

Iteration 1 Upgrade
Per Node:
1. snapshot cassandra
2. copy snapshots to a remote server (or a new directory on the same server if you have space)
3. nodetool drain
4. stop cassandra
5. update the cassandra 1.2 binaries
6. start cassandra 1.2
7. upgrade sstables: nodetool -h localhost upgradesstables -a
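
For reference, the shell commands for steps 3-7 look roughly like this; the service commands and the package step are assumptions for a typical package install rather than anything taken from the guide:

nodetool -h localhost drain                # flush memtables and stop accepting writes
sudo service cassandra stop
# install the 1.2 binaries/package here, merging cassandra.yaml changes by hand
sudo service cassandra start
tail -n 50 /var/log/cassandra/system.log   # sanity check the startup
nodetool -h localhost upgradesstables -a   # rewrite every sstable in the 1.2 format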

On my spinning-disk cluster, upgradesstables runs at about 1 GB every 2-3 minutes. Given my quantity of data per node, I would expect this task to last around 4 hours per node! The only realistic way to update an 8-node cluster at that rate is to do 1 node per day, making the complete upgrade take over a week (as the upgrade was to happen at night). N.B. the documentation also tells you not to run moves, repairs or bootstraps until the upgrade is complete, which sounds pretty impractical over that timescale.
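
If you want a rough feel for how long your own nodes will take, a back-of-the-envelope calculation based on the roughly 2.5 minutes per GB I observed is enough; the data path below is the default and only an assumption:

DATA_GB=$(du -s --block-size=1G /var/lib/cassandra/data | awk '{print $1}')
echo "~$(( DATA_GB * 5 / 2 )) minutes of upgradesstables for this node"

Working that backwards, 4 hours at 2-3 minutes per GB implies somewhere in the region of 80-120 GB of data per node.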

So what happens if you need to cut back to 1.0? There is no easy answer to this, and I think the only sensible approach is NOT to cut back to 1.0 after a few hours of operating on 1.2. That means you will only be part way through upgrading your first node when you must decide to commit. The implication is to thoroughly test your applications against cassandra 1.2, i.e. take every precaution so that you never have to roll back.

However, I have a trick up my sleeve that ameliorates the situation somewhat; I'll come on to that a bit later.

Tip #2: Functionally and non-functionally test your apps as thoroughly as you can, as you REALLY don't want to cut back to cassandra 1.0. Copy your production data to a new cluster if possible, even one made of feeble VMs, and test the upgrade (I did this).

This is where it gets interesting. On checking the log file after firing up cassandra 1.2, i.e. before the upgradesstables step, I noticed these errors:

INFO [HANDSHAKE-/[ip redacted]] 2014-01-15 12:58:48,047 OutboundTcpConnection.java (line 408) Cannot handshake version with /[ip redacted]
INFO [HANDSHAKE-/[ip redacted]] 2014-01-15 12:58:48,048 OutboundTcpConnection.java (line 399) Handshaking version with /[ip redacted]
...etc for all nodes...

Googling around, I came across an Apache JIRA issue in which Jonathan Ellis states that for rolling upgrades you cannot jump major versions, i.e. you must go 1.0 > 1.1 > 1.2 (https://issues.apache.org/jira/browse/CASSANDRA-5740).

Iteration 2 Upgrade
As per iteration 1, but done twice: once for 1.0 > 1.1 and then again for 1.1 > 1.2.

Needless to say, I experimented with such an approach, but was wholly unenamoured by the prospect of having to run upgradesstables twice per node!

So I dug a little deeper into this error. As far as I could make out, range queries would fail on a mixed-version cluster, and some nodetool commands, such as nodetool ring, would fail too. The natural question then is: can I complete the binary upgrade on all nodes first and only then upgrade the sstables? That would be the silver bullet, as I could upgrade directly from 1.0 to 1.2. Upgrading the binaries takes just a few minutes and just might be acceptable at night, when traffic drops off a cliff.

My new direction was affirmed in a posting where Aaron Morton of The Last Pickle suggests the same approach, which gave me a warm and cuddly feeling!

Things were starting to take shape. The plan was to upgrade the binaries on a rolling basis across all 8 nodes, then stagger but overlap upgradesstables such that the full end-to-end upgrade would last at most 1 day.
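
It is also worth keeping an eye on which version each node reports as the binaries roll out. Assuming your build includes the nodetool version command (if not, the startup line in system.log shows the same thing), a trivial loop over placeholder hostnames does the job:

for n in dc1n1 dc1n2 dc1n3 dc1n4 dc2n1 dc2n2 dc2n3 dc2n4; do
  echo -n "$n: "
  nodetool -h "$n" version      # prints ReleaseVersion: 1.0.x or 1.2.x
done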

Ultimately, the method I went with was as follows:

Iteration 3
Upgrade binaries (per node)
1. take a snapshot
2. copy to backup server
3. test retrieving a record
4. save a list of all datafiles
5. run nodetool ring and make sure all nodes are as expected
6. stop cassandra
7. start cassandra
8. test retrieving a record
9. clear snapshots (no need to have them lurking around as they have been copied elsewhere)
10. upgrade binaries.
11. rolling restart application nodes*
12. nodetool describecluster
13. nodetool ring

* - I found that the upgrade caused Hector to mark the node as dead and gone; it would only be recognised again after restarting the application.
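
The checks in that list boil down to a handful of commands; a rough sketch, with the data path and backup host again being placeholders:

nodetool snapshot -t pre-1.2-binaries
rsync -a /var/lib/cassandra/data/ backuphost:/backups/$(hostname)/data/
find /var/lib/cassandra/data -name '*.db' | sort > /tmp/datafiles_before.txt   # kept for reference in case of cutback
nodetool ring                 # every node should be Up/Normal before touching anything
# ...stop/start, test a record, clear snapshots (nodetool clearsnapshot) and upgrade the binaries as per the list...
nodetool describecluster      # expect a single schema version with all nodes reachable
nodetool ring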

Cutback Plan
So what about cutback options now? The cluster has 4 nodes per data centre, 8 nodes in total. Bearing in mind that each data centre contains a complete set of data (remember that I am using a replication factor of 3 per DC), the cutback plan was to reinstall the 1.0 binaries and rebuild the node's data from the other nodes. So if the upgrade of node 1 in DC1 failed, I would cut back, as I could not be certain that other nodes would not fail in the same way and I can only withstand 1 node failure per DC. The same applied to nodes 2 and 3. However, if node 4 failed I would proceed to upgrade the nodes in DC2, because I could live with one node being down in each DC and could resolve that shortly after the upgrade completed.

Upgrade sstables 
I ran this staggered, with some overlap between nodes, such that the whole process took about 7-8 hours.
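
The staggering itself was nothing clever; conceptually it is just a delayed kick-off on each node, along the lines of the sketch below, where the hostnames, log path and delay are all placeholders. nodetool compactionstats on each node gives a view of progress.

for n in dc1n1 dc1n2 dc1n3 dc1n4 dc2n1 dc2n2 dc2n3 dc2n4; do
  echo "starting upgradesstables on $n at $(date)"
  ssh "$n" 'nohup nodetool upgradesstables -a > /tmp/upgradesstables.log 2>&1 &'
  sleep 3600    # give each node a decent head start before the next one begins
done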

Test cutback (all nodes)
1. shutdown cassandra
2. clear data directories
3. restore 1.0 files from the backup server
4. restore 1.0 binaries
5. start cassandra.
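
Per node, the cutback test amounted to something like the following; paths, service names and the backup host are assumptions for a typical package install, so treat it as a sketch rather than a recipe:

sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
rsync -a backuphost:/backups/$(hostname)/data/ /var/lib/cassandra/data/
# reinstall the 1.0 binaries/package by whatever mechanism you normally use
sudo service cassandra start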







Friday 2 May 2014

Cassandra Node Grab Toolkit

Wish you had time to write a toolkit that grabs log files, configuration files and other useful stats (OS and Cassandra) from all of your nodes at a quasi point-in-time? No need, I've done it for you.

This consists of a couple of simple bash scripts, the first of which grabs logs, config and other output from a predefined list of nodes. Output is organised by server name and date, with a tar.gz archive created. This is handy for sending off to other interested parties such as your support partners. The other script runs in a loop, sleeping N seconds between each run until cancelled with Ctrl-C. This can be useful when you want to monitor some or all of your nodes for stats such as nodetool tpstats and operating system stats like iostat.

The assumption is that you have a server from which you can access all of your cassandra nodes without entering a password (i.e. with ssh keys) and that cassandra is running on *nix.
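
To give a flavour of what the first script does, here is a simplified sketch rather than the toolkit itself; node names, log and config paths are placeholders:

#!/bin/bash
NODES="node1 node2 node3"
OUT="grab_$(date +%Y%m%d_%H%M%S)"
for n in $NODES; do
  mkdir -p "$OUT/$n"
  scp -q "$n:/var/log/cassandra/system.log" "$OUT/$n/"
  scp -q "$n:/etc/cassandra/cassandra.yaml" "$OUT/$n/"
  ssh "$n" 'nodetool tpstats; nodetool info; iostat -x 1 3' > "$OUT/$n/stats.txt"
done
tar czf "$OUT.tar.gz" "$OUT"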

A little set up work is required to set paths correctly for your environment.

https://github.com/dbwrangler/cassandra_grabs

Git me up!

Tuesday 1 April 2014

org.apache.thrift.transport.TTransportException while dropping or updating a keyspace with Cassandra 1.2

Cassandra offers great flexibility and power to the developer like no database before it, IMHO. That also means the ability to screw it up. Recently I migrated Cassandra to a new datacenter, where the original set-up was spread over 2 datacenters. This meant temporarily adding a 3rd datacenter, then tearing down one of the old ones.

This meant the developers had to adjust their deployment scripts to include the new datacenter, e.g.:

create keyspace MyKS
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {DC1 : 3, DC2 : 3}
and durable_writes = true;

became this:

create keyspace MyKS
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {DC1 : 3, DC2 : 3, DC3 : 3}
and durable_writes = true;

This also meant updating the cassandra-topology.properties file on each node so that DC3 was added.
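
For anyone unfamiliar with the file, cassandra-topology.properties (used by the PropertyFileSnitch) simply maps node IPs to a data centre and rack, so during the migration each node carried entries along these lines; the IPs here are purely illustrative:

10.0.1.11=DC1:RAC1
10.0.1.12=DC1:RAC1
10.0.2.11=DC2:RAC1
10.0.3.11=DC3:RAC1
default=DC1:RAC1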

With the migration complete, DC3 was renamed to DC1 and the original DC1 was taken out of the topology files. Unfortunately, a number of deployment scripts did not get reverted to the original 2-DC set-up and continued to reference DC3 in their drop and create statements.

This mostly worked harmlessly under 1.0, but after a recent upgrade to 1.2 it became more problematic. Developers complained that they could no longer drop keyspaces, getting errors such as:

org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_system_drop_keyspace(Cassandra.java:1437)
at org.apache.cassandra.thrift.Cassandra$Client.system_drop_keyspace(Cassandra.java:1424)
at org.apache.cassandra.cli.CliClient.executeDelKeySpace(CliClient.java:1364)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:249)
at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:213)
at org.apache.cassandra.cli.CliMain.evaluateFileStatements(CliMain.java:393)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:272)

Under v1.0, about 1 in 5-10 attempts to drop a keyspace would randomly fail even though DC3 was still referenced in the scripts. Under v1.2, it failed consistently.

The developers updated their scripts and removed the references to DC3. I then attempted to update one of the keyspaces manually (using cassandra-cli) to rid it of DC3, but was left with the solitary error message:


org.apache.thrift.transport.TTransportException

Stubborn!

I then snuck the DC3 configuration back into the cassandra-topology.properties file on the node on which I was working. Even though the DC3 nodes no longer existed, that did the trick! DC3 was then removed by updating all affected keyspaces using the cli, and now the deployment scripts drop/create keyspaces without issue.
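
For completeness, the clean-up per keyspace was just an update mirroring the create statement above, minus DC3, run through the cli. Something like this, with MyKS standing in for each affected keyspace:

cat > /tmp/fix_myks.txt <<'EOF'
update keyspace MyKS
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {DC1 : 3, DC2 : 3};
EOF
cassandra-cli -h localhost -f /tmp/fix_myks.txt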

Hope that is helpful to someone out there!

UPDATE 3 April 2014

Unfortunately I spoke too soon. Although the cluster was seemingly OK (according to nodetool ring/info/describecluster), there were errors in the logs along the lines of:

(on the node where I made the cassandra topology update)

java.lang.RuntimeException: java.io.IOException: Connection reset by peer
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:392)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:377)
at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)
at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:255)
at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:201)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:149)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)
at sun.nio.ch.IOUtil.write(IOUtil.java:40)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:336)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
at java.nio.channels.Channels.writeFully(Channels.java:81)
at java.nio.channels.Channels.access$000(Channels.java:47)
at java.nio.channels.Channels$1.write(Channels.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
... 8 more

(on other nodes)
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=eb68069e-20b7-394c-9a3f-a727d45ec594

I then noticed that root had ownership of the cassandra-topology.properties file. So I quickly changed this, but it made no difference. Then I restarted the instance and that made no difference either.

Regarding the error on the other node, it looked like a schema mismatch, but running 'nodetool describecluster' indicated that the schema was consistent across all nodes. Nonetheless, I decided to shut down this node, clear out the system keyspace directory and restart. This produced an interesting result: initially I could see the same UnknownColumnFamilyException appearing, but once the node had rebuilt the system keyspace the error went away. The same went for the original node, oddly enough.
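
For reference, the "clear the system keyspace and let it rebuild" step looked roughly like this; the paths and service name assume a default package install, and I would suggest moving the directory aside rather than deleting it, just in case:

sudo service cassandra stop
sudo mv /var/lib/cassandra/data/system /var/lib/cassandra/data/system.old
sudo service cassandra start
tail -f /var/log/cassandra/system.log   # the UnknownColumnFamilyException noise stops once the system keyspace has been rebuilt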