DB Wrangler - Data infrastructure insights.
<h1>Kafka Elevator pitch</h1>
If you had 60 seconds to describe what Kafka is, how would you do it? Recently I had to do this for stakeholders, and the outcome was the 60-second slide :)<br />
<br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6GVWgHtZwMf2YxOJfK-32SqowTUbkWWWA_SCvkCkkedaVK1jfTbHNcyNFKvlsxUZ_wC_fo4mugFj1q9JRpD0EJM-BHJPEAE-1KOE_TY99OM2Ht9INiJDZqXbZAN1cIGeQJ9AYj2ksGFYp/s1600/Unicredit+-+3+suggested+OSS+projects+%25281%2529.png" imageanchor="1"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6GVWgHtZwMf2YxOJfK-32SqowTUbkWWWA_SCvkCkkedaVK1jfTbHNcyNFKvlsxUZ_wC_fo4mugFj1q9JRpD0EJM-BHJPEAE-1KOE_TY99OM2Ht9INiJDZqXbZAN1cIGeQJ9AYj2ksGFYp/s640/Unicredit+-+3+suggested+OSS+projects+%25281%2529.png" width="640" /></a>DR Wranglerhttp://www.blogger.com/profile/01847612940014191204noreply@blogger.com0tag:blogger.com,1999:blog-7633438436258959683.post-45059448618484280502016-04-03T10:43:00.001-07:002016-04-30T13:29:27.483-07:00Deployment considerations around Apache Kafka in AWS<h2>
Introduction</h2>
<div>
Apache Kafka is a high-throughput, auto-sharded/replicated data-streaming technology. It emerged from the deep tech minds at LinkedIn and is now maintained and developed by the startup Confluent.</div>
<div>
<br /></div>
<div>
Reading through their many excellent blog posts and documentation, it's clear that they had physical hardware in mind when designing and implementing a Kafka cluster. I've found no guidelines about running Kafka in the cloud, so I thought I'd share my experiences with you here. Some (but hopefully not all) of this is only useful to AWS and/or Kafka noobs. </div>
<h2>
AWS Topology</h2>
<div>
As you probably know, AWS spans geographic regions with failure zones (availability zones) within each region. My infrastructure is based in eu-west-1, which has the three availability zones a, b and c. Availability zone failures are quite rare, and region failures rarer still (see <a href="http://status.aws.amazon.com/" target="_blank"><span style="color: cyan;">AWS Service Health Dashboard</span></a>). From the <a href="https://aws.amazon.com/architecture/" target="_blank"><span style="color: cyan;">AWS Architecture Center</span></a> we have this useful schematic: </div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQwZGNjTdgm5Tnpyy2YXYcZXgnPhYSTTDtpZkC0kUCQTINunVfwZ2GPxWubJX1n8Vf1xT-kEAmdVFjRItCYxMUFHMBQRPbUebFzGkddE7S-An7H7J9qj9vnr82r72qqvgr6Lm6jNNrG8OU/s1600/AWS-availability.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="257" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQwZGNjTdgm5Tnpyy2YXYcZXgnPhYSTTDtpZkC0kUCQTINunVfwZ2GPxWubJX1n8Vf1xT-kEAmdVFjRItCYxMUFHMBQRPbUebFzGkddE7S-An7H7J9qj9vnr82r72qqvgr6Lm6jNNrG8OU/s400/AWS-availability.png" width="400" /></a></div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h2>
<span style="font-size: small; font-weight: normal;">The service I designed was bound to one region and spread over 3 AZs, such that one AZ can go down with affecting availability.</span></h2>
<h2>
Deploying Kafka</h2>
So how do we deploy a Kafka cluster such that one AZ can go down without affecting its availability? Kafka has two ways of replicating data: sharding/replicating and mirroring. Sharding/replicating is Kafka's natural running mode and pretty much its raison d'être. Mirroring (via MirrorMaker) replicates all data from a source cluster to a target cluster, much like classic RDBMS master-slave replication. So which to choose? (For the sake of brevity I will not discuss the deployment of ZooKeeper, which has been deployed as a 3-node cluster with 1 node in each AZ.)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-3ARVNrH7q0WS7pWioAp_3Wo38WIDEJGa4JQeQN60L9MXUEru7ly1-oukDGZ0jJ5u77hDaw4ZPHLmWqAPmAVLZ8GvcryAThhEC8Kd_cU8Ssk4KyQcuCLX2Zb-wb9F4naYQ1Mq1efpW7vD/s1600/Kafka+in+AWS+replication+strategies.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-3ARVNrH7q0WS7pWioAp_3Wo38WIDEJGa4JQeQN60L9MXUEru7ly1-oukDGZ0jJ5u77hDaw4ZPHLmWqAPmAVLZ8GvcryAThhEC8Kd_cU8Ssk4KyQcuCLX2Zb-wb9F4naYQ1Mq1efpW7vD/s640/Kafka+in+AWS+replication+strategies.png" width="640" /></a><br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
The diagram depicts the 2 available options, and both seem to have issues. With MirrorMaker, the sharding/replicating happens in 1 zone and is entirely replicated to the others. The downsides are that:<br />
<ul>
<li>Each zone has a full copy of the data, at 3x the cost.</li>
<li>Consumers (Kafka speak for subscribers) would need reconfiguring during a zone failure and would have to grapple with consumer offsets.</li>
</ul>
<div>
On the other hand, sharding/replication over zones doesn't seem to fit either, as it is meant to be used across closely proximate racks. In order to decide which way to go, we need to understand a little more about AWS and Kafka.</div>
<h2>
It's all about latency!</h2>
<div>
Kafka has an interesting and rather clever replication strategy called the ISR, or in-sync replica set. The ISR is a flexible replica set which determines which replica nodes are in sync, based upon how long a partition follower has been lagging behind its leader. If a follower has not been fully in sync for the last <span style="background-color: white; color: #333333; font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace; font-size: 17.3333px;">replica.lag.time.max.ms</span> milliseconds, it is ejected from the ISR. By default this is 10 seconds (10,000 ms). NB the ISR count includes the leader, which is somewhat confusing as the leader is in-sync by definition!</div>
<div>
<br /></div>
<div>
Apart from sickly machines and other local issues, latency across availability zones is key to understanding whether replicating over AZs is feasible. According to AWS, availability zones within a region are connected by low-latency links. In tests (hours, not days, unfortunately) I found latency to be of the order of milliseconds and, in a couple of instances, over a second. A nice study dating back some years can be found <a href="http://amistrongeryet.blogspot.co.uk/2010/04/three-latency-anomalies.html" target="_blank"><span style="color: cyan;">here</span></a>. One can be confident that things have only got better since then. In effect, you can treat AZs as if they are racks in a DC. </div>
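<br />
If you want to sanity-check inter-AZ latency yourself, the crude probe below is all it takes. This is a sketch, not from the original post: the peer hostnames are placeholders for instances you own in each of the other AZs.<br />
<br />
#!/bin/bash<br />
# Crude inter-AZ latency probe: ping a peer instance in each of the<br />
# other availability zones and print the round-trip summary lines.<br />
# kafka-az-b/kafka-az-c are hypothetical hostnames - substitute your own.<br />
for peer in kafka-az-b.internal kafka-az-c.internal; do<br />
  echo "== ${peer} =="<br />
  ping -c 20 -q "${peer}" | tail -2<br />
done<br />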
<h2>
How many brokers?</h2>
<div>
For my cluster, we have a message rate of about 50,000 messages a second with multiple consumer and producer jobs plus a MirrorMaker data-archiver job. Your choice is between fewer, more powerful nodes or more, lower-powered nodes; Kafka will take advantage of either. In a sharded/replicated model, the more broker nodes the better (in an <i>n</i>-node cluster, 1 node failure adds 1/(<i>n</i>^2-<i>n</i>) of the total load to each surviving node; with <i>n</i>=6 that is an extra 1/30, about 3%, each). My cluster is thus 6 lower-powered nodes. Kafka loves memory and exploits the unix filesystem cache, so the instance type chosen was r3.xlarge. I will describe my experience of cluster expansion in a future post, as initially we started out with a 3-node cluster. </div>
<div>
<h2>
Beware - Not Yet Rack Aware!</h2>
Unfortunately, Kafka has no rack-awareness (yet; <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment" target="_blank"><span style="color: cyan;">see this proposed fix</span></a>), meaning that partition leaders and followers could end up in the same AZ, which would be a bad thing in the event of a zone failure! Here I equate a rack to an AZ, as already discussed. </div>
<div>
<br /></div>
<div>
There are two strategies around this. The first is to over-replicate such that NR (replicas) > NM (max no. of nodes per zone). With 6 nodes there are 2 nodes per zone, so NR=3 is required.<br />
<br />
The second strategy is to monitor and manually assign your replicas to brokers in different AZs (see the sketch below). The advantage here is that you can run with RF:2 if you really want to. With this strategy, you'll need to run a scheduled job to pick up new topics and any cluster changes (e.g. expansion) and then shuffle partitions in order to achieve the desired anti-affinity.<br />
<br />
As a corollary, it would not make sense to have 4 nodes with the over-replicating strategy. Why? Because NR has increased by 50% (i.e. 2 to 3) but the cluster has only increased by 33% (i.e. 3 to 4). You might as well rule 5 out as well and go straight to 6.<br />
<br /></div>
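Here is a rough sketch of that second strategy, not from the original post: it assumes brokers 1 and 2 sit in AZ a, 3 and 4 in AZ b, 5 and 6 in AZ c, and mytopic is a hypothetical topic.<br />
<br />
# Pin each partition's replicas to brokers in three different AZs,<br />
# then hand the plan to the stock reassignment tool.<br />
cat > reassign.json <<'EOF'<br />
{"version":1,"partitions":[<br />
 {"topic":"mytopic","partition":0,"replicas":[1,3,5]},<br />
 {"topic":"mytopic","partition":1,"replicas":[2,4,6]}<br />
]}<br />
EOF<br />
kafka-reassign-partitions.sh --zookeeper zk1:2181 \<br />
  --reassignment-json-file reassign.json --execute<br />
# Re-run with --verify in place of --execute to confirm completion.<br />
<br />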
<div>
<br /></div>
<h2>
Chosen Design</h2>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi7RwUBvhA2ZbOjRGfFJib9w9lWCWUtLf0YKyluZPuQE5cclwd90If7HN89unyPoyhIBHkhsJC_YGVHQiEr6lKkfInW3aD1Wnq3csZDNTm2gavKlbwJHlTalEBnnlDDTk2t2nx7bxeKzuB/s1600/Kafka+Landed+Design.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi7RwUBvhA2ZbOjRGfFJib9w9lWCWUtLf0YKyluZPuQE5cclwd90If7HN89unyPoyhIBHkhsJC_YGVHQiEr6lKkfInW3aD1Wnq3csZDNTm2gavKlbwJHlTalEBnnlDDTk2t2nx7bxeKzuB/s400/Kafka+Landed+Design.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
RF:3, producer acks:-1, min in-sync replicas: 2. In other words, there must be at least 2 nodes in the ISR to acknowledge writes, and a single node failure is tolerable. Monitoring the ISR is paramount, and you should aim to keep it at 3 or, stated another way, at min.insync.replicas+1. With this set-up, a node dropping out of the ISR is tantamount to a node crashing, hence the caution about monitoring.</div>
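As a concrete (but hypothetical) example of creating a topic under this design, assuming the stock Kafka CLI tools and a zk1:2181 ZooKeeper address:<br />
<br />
# RF 3 plus a per-topic min.insync.replicas override of 2.<br />
kafka-topics.sh --create --zookeeper zk1:2181 \<br />
  --topic mytopic --partitions 12 --replication-factor 3 \<br />
  --config min.insync.replicas=2<br />
# Producers must also send with acks=-1 for the min-ISR<br />
# guarantee to mean anything.<br />
<br />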
<h2>
Code Appropriately For Your Chosen Design</h2>
<div>
With this set up we have message delivery as follows:</div>
<div>
<ul>
<li><b>Under normal running conditions:</b> Once</li>
<li><b>Under node/zone failure:</b> For some messages possibly more than once.</li>
</ul>
</div>
<div>
Since offset management and message delivery are not atomic, it is possible to re-read a message via a replica during a failure. This should not affect many messages.<br />
<br /></div>
<div>
<b>Code appropriately for your chosen design!</b></div>
<h2>
EBS</h2>
It has become something of a trend in the world of cloud computing and clustered databases (e.g. NoSQL) to treat data as a second-class citizen. You are actively discouraged from persisting data and told to reconstruct node data from surviving replicas in the event of failure.<br />
<br />
Wrong, wrong, wrong! Without digressing too much, it is more sensible to come up with an infrastructure build pattern that persists data volumes (EBS in the case of AWS). The two big drivers for doing this are listed below (a minimal recovery sketch follows the list):<br />
<ul>
<li><b>Guarding against user error</b>. Unless you police your data system carefully, someone is going to create a data structure with RF:1!</li>
<li><b>Reducing mean time to recovery</b>. Recover your nodes in seconds/minutes not hours/days.</li>
</ul>
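A minimal sketch of the recovery pattern (the volume and instance IDs are placeholders, and /var/lib/kafka is an assumed mount point):<br />
<br />
# On replacing a failed broker, re-attach its surviving EBS data<br />
# volume rather than re-streaming everything from replicas.<br />
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \<br />
  --instance-id i-0123456789abcdef0 --device /dev/xvdf<br />
mount /dev/xvdf /var/lib/kafka   # then start the broker as usual<br />
<br />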
<h2>
Pulling it all together</h2>
<div>
Our initial design looks like this :</div>
<div>
<br /></div>
<div>
Kafka nodes = 6 (2 per AZ)</div>
<div>
Zookeeper nodes = 3 (1 per AZ)</div>
<div>
Producer acks = -1 (min insync replicas acknowledge write requests)</div>
<div>
Replication Factor = 3 (which includes the leader)</div>
<div>
Min in-sync replicas = 2 (to be implemented)</div>
<div>
offset management = kafka (preferable to zookeeper)</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
I hope you've found this post useful and I'd be happy to receive your feedback. In future posts I'll talk about my other Kafka experience such as performance, consumer lag, expanding a kafka cluster and mirrormaking!</div>
<h1>Need to reset Cassandra superuser password?</h1>
If you find yourself in the unfortunate situation where you have forgotten the Cassandra superuser password, here's how to reset it (procedure tested using Cassandra 2.0 with cql 3.1):<br />
<br />
1. Stop panicking ;)<br />
<br />
2. Turn off authorisation and authentication:<br />
<br />
edit cassandra.yaml and set the following:<br />
<br />
authenticator: AllowAllAuthenticator<br />
authorizer: AllowAllAuthorizer<br />
<br />
bounce cassandra<br />
<br />
service cassandra restart<br />
<br />
3. Fire up the cli<br />
<br />
cqlsh<br />
<br />
At this point you need to identify your superusers. If you were a good girl/boy, you would have set up a fresh superuser and dumped the default cassandra user.<br />
<br />
list users;<br />
<br />
name | super<br />
-----------+-------<br />
cassandra | False<br />
myadmin | True<br />
<br />
As you can see, I've taken super privs away from the default superuser cassandra and created my own called myadmin as per the recommendations of the docs.<br />
<br />
Now, depending on how many nodes and data centres you have, the system_auth keyspace (and specifically the credentials column family) is likely to be replicated on other nodes. Manually updating this table gets you back into shape and saves you the hassle of having to visit every node in your cluster and reset authentication as above.<br />
<br />
Type in the following:<br />
<br />
update system_auth.credentials set salted_hash='$2a$10$vbfmLdkQdUz3Rmw.fF7Ygu6GuphqHndpJKTvElqAciUJ4SZ3pwquu' where username='myadmin' ;<br />
<br />
This salted_hash value is the well-known bcrypt hash of the default password 'cassandra', which is why step 5 below logs in with that password.<br />
<br />
4. Revert the cassandra.yaml:<br />
<br />
authenticator: PasswordAuthenticator<br />
authorizer: CassandraAuthorizer<br />
<br />
and restart<br />
<br />
service cassandra restart<br />
<br />
5. Now you can log in with:<br />
<br />
cqlsh -u myadmin -p cassandra<br />
<br />
Once logged in, reset your password to something less obvious.<br />
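One way to do that, reusing the heredoc trick (pick your own password, obviously):<br />
<br />
cqlsh -u myadmin -p cassandra <<'EOF'<br />
ALTER USER myadmin WITH PASSWORD 'S0meth1ngStr0nger';<br />
EOF<br />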
<h1>Upgrading Cassandra from 1.0 to 1.2 - Warts and all!</h1>
I recently upgraded an 8-node Cassandra cluster (logically split over 2 data centres) from version 1.0 to 1.2. It took me quite a while of experimentation in a sandpit environment before I came up with a formula that I had the confidence to use in production. The upgrade was done in rolling fashion.<br />
<br />
Any data that I cared about had a replication factor of 3 in each data centre.<br />
<br />
The first step in this journey is of course to follow the often excellent documentation on the Datastax website:<br />
<br />
<a href="http://www.datastax.com/docs/1.1/install/upgrading"><span style="background-color: black; color: cyan;">1.1 upgrade guide</span></a><br />
<a href="http://www.datastax.com/documentation/cassandra/1.2/cassandra/upgrade/upgradeC_c.html"><span style="color: cyan;"><span style="background-color: black;">1.2 upgrade guide</span> </span></a><br />
<br />
I've included the v1.1 upgrade pages for reasons that will become apparent.<br />
<br />
These upgrade guides are rather economical and it is worth reading around the forums in order to fill out the detail.<br />
<br />
First I'll describe my journey, which started by following the v1.2 upgrade guide. Since the database was already running v1.0, upgrading straight to v1.2 is considered legal. This involved a number of prerequisite checks, such as modifying queries as stipulated in the guide, checking the NEWS.txt file for any upgrade-related insights and likewise with the CHANGES.txt file. All fairly standard.<br />
<br />
With that out of the way, the first step was to take a snapshot per node. I strongly recommend copying the snapshots to a location outside of cassandra's control. As I discovered during testing, part of the upgrade process is to re-organise the data file structure from:<br />
<br />
<path>/KS/filename (1.0 file name format)<br />
<br />
to:<br />
<br />
<path>/KS/CF/filename (1.2 file name format)<br />
<br />
which includes the snapshot files! That leaves you with two headaches. First, should you wish to revert from 1.2 to 1.0, you need to grab all the snapshot files, convert them back to the original file-name format and move them back to the original directory structure. Secondly, and perhaps more worryingly, if the upgrade fails it could leave your snapshot files in a mess, with some of them in the original format/location, some in the new and some possibly missing. I can vouch for this, having experienced this very issue during testing. There was no way back. Being a test cluster, I would have been happy enough to blow it away and start again. In the end I elected to use Cassandra's excellent capability to rebuild a node from the surviving nodes, none of which had been upgraded at that point.<br />
<br />
<b>Tip #1</b>: Copy/rsync snapshots to a directory outside of Cassandra's control or, better still, to another server.<br />
<br />
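Something along these lines per node does the job. This is a sketch, not from the original post: the snapshot tag, the data directory layout and the backup hostname are all assumptions to adapt to your environment.<br />
<br />
# Snapshot, then copy the snapshot files off-box before the upgrade<br />
# can rearrange them.<br />
nodetool -h localhost snapshot -t pre12upgrade<br />
rsync -av --relative /var/lib/cassandra/data/./*/snapshots/pre12upgrade/ \<br />
  backupserver:/backups/$(hostname)/<br />
<br />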
My first upgrade attempt comes right out of the guide.<br />
<br />
<b>Iteration 1 Upgrade</b><br />
Per Node:<br />
1. snapshot cassandra.<br />
2. copy snapshots to remote server (or new directory on same server if you have space.)<br />
3. nodetool drain<br />
4. stop cassandra<br />
5. update the cassandra 1.2 binaries.<br />
6. start cassandra 1.2<br />
7. upgrade sstables: nodetool -h localhost upgradesstables -a<br />
<br />
On my spinning-disk cluster, upgradesstables runs at about 1G every 2-3 minutes. Given my quantity of data per node, I would expect this task to last 4 hours! The only way to update an 8-node cluster is by doing 1 node per day, making the complete upgrade take over a week (as the upgrade was to happen at night). N.B. the documentation also tells you not to move, repair or bootstrap nodes until the upgrade is complete. That sounds pretty impractical.<br />
<br />
So what happens if you need to cut back to 1.0? There is no easy answer to this, and I think the only sensible approach is NOT to cut back to 1.0 after a few hours of operating on 1.2. That means you will only be part way through upgrading your first node when you must decide to commit to it. <b><i>The implication is to thoroughly test your applications against Cassandra 1.2, i.e. take every precaution to avoid having to roll back</i>.</b><br />
<b><br /></b>
However, I have a trick up my sleeve that ameliorates circumstances somewhat and I'll come on to that a bit later.<br />
<br />
<b>Tip #2</b>: Functionally and non-functionally test your apps as thoroughly as you can, as you REALLY don't want to cut back to Cassandra 1.0. Copy your production data to a new cluster if possible, even to feeble VMs, and test the upgrade (I did this).<br />
<br />
This is where it gets interesting. On checking the log file having fired up Cassandra 1.2, i.e. before the upgradesstables step, I noticed these errors:<br />
<br />
INFO [HANDSHAKE-/[ip redacted]] 2014-01-15 12:58:48,047 OutboundTcpConnection.java (line 408) Cannot handshake version with /[ip redacted]<br />
INFO [HANDSHAKE-/[ip redacted] 2014-01-15 12:58:48,048 OutboundTcpConnection.java (line 399) Handshaking version with /[ip redacted]<br />
...etc for all nodes...<br />
<br />
Googling around I came across this issue on an Apache jira in which Jonathan Ellis states that for rolling upgrades you cannot jump major versions, i.e. you must follow 1.0>1.1>1.2 (<a href="https://issues.apache.org/jira/browse/CASSANDRA-5740"><span style="color: cyan;">https://issues.apache.org/jira/browse/CASSANDRA-5740</span></a>).<br />
<br />
<b>Iteration 2 Upgrade</b><br />
As per iteration 1, but done twice. Once for 1.0>1.1 and then 1.1>1.2<br />
<br />
Needless to say I experimented with such an approach, but was wholly unenamoured of the prospect of having to run upgradesstables twice per node!<br />
<br />
So I dug a little deeper into this error. As far as I could make out, range queries would fail on a mixed-version cluster, and some nodetool commands fail, such as nodetool ring. The natural question is then: can I not complete the binary upgrade on all nodes and then upgrade the sstables? That would be the silver bullet, as I could upgrade directly from 1.0 to 1.2. Upgrading the binaries takes just a few minutes and just might be acceptable at night when traffic drops off a cliff.<br />
<br />
My new direction was affirmed in this <a href="https://groups.google.com/forum/#!topic/nosql-databases/CaE6cbslIHw" target="_blank"><span style="color: cyan;">posting</span></a>, where Aaron Morton of The Last Pickle suggests the same, which gave me a warm and cuddly feeling!<br />
<br />
Things were starting to take shape. The plan was to upgrade the binaries on a rolling basis across all 8 nodes, then stagger but overlap upgradesstables such that the full end-to-end upgrade would last at most 1 day.<br />
<br />
Ultimately, the method I went with was as follows:<br />
<br />
<b>Iteration 3</b><br />
<b>Upgrade binaries (per node)</b><br />
1. take a snapshot<br />
2. copy to backup server<br />
3. test retrieving a record<br />
4. save a list of all datafiles<br />
5. run nodetool ring and make sure all nodes are as expected<br />
6. stop cassandra<br />
7. start cassandra<br />
8. test retrieving a record<br />
9. clear snapshots (no need to have them lurking around as they have been copied elsewhere)<br />
10. upgrade binaries.<br />
11. rolling restart application nodes*<br />
12. nodetool describecluster<br />
13. nodetool ring<br />
<br />
* - I found that the upgrade caused Hector to mark the node as dead and gone; it would only be recognised again after a restart of the application nodes.<br />
<br />
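For completeness, here is a rough scripted version of the preparatory steps above (1, 2, 4 and 5). It is a sketch under assumed paths, and the binary upgrade itself is still done by hand:<br />
<br />
#!/bin/bash<br />
# Steps 1-2: snapshot and copy off-box.<br />
nodetool -h localhost snapshot -t pre12<br />
rsync -av /var/lib/cassandra/data/ backupserver:/backups/$(hostname)/<br />
# (Step 3, test-reading a record, is done with your client of choice.)<br />
# Step 4: save a list of all data files for later comparison.<br />
find /var/lib/cassandra/data -name '*-Data.db' > /tmp/datafiles.$(date +%F)<br />
# Step 5: eyeball the ring before touching anything.<br />
nodetool -h localhost ring<br />
<br />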
<b>Cutback Plan</b><br />
So what about cutback options now? The cluster has 4 nodes per data centre, 8 nodes in total. Bearing in mind that each data centre contains a complete set of data, the cutback plan was to reinstall the 1.0 binaries and rebuild data from the other nodes (remember that I am using a replication factor of 3 per DC). So if the upgrade of node 1 in DC1 failed, I would cut back, as I could not be certain that other nodes would not similarly fail and I can only withstand 1 node failure per DC. The same applied to nodes 2 and 3. However, if node 4 failed I would proceed to upgrade the nodes in DC2. This is because I could live with one node being down in each DC and could resolve this shortly after the upgrade completed.<br />
<br />
<b>Upgrade sstables </b><br />
I did this staggered, with some overlap per node, such that the whole process took about 7-8 hours.<br />
<br />
<b>Test cutback (all nodes)</b><br />
1. shutdown cassandra<br />
2. clear data directories<br />
3. restore 1.0 files from the backup server<br />
4. restore 1.0 binaries<br />
5. start cassandra.<br />
<h1>Cassandra Node Grab Toolkit</h1>
Wish you had time to write a toolkit that grabs log files, configuration files and other useful stats (OS and Cassandra) from all of your nodes at a quasi point-in-time? No need, I've done it for you.<br />
<br />
This consists of a couple of simple bash scripts, the first of which grabs logs, config and other output from a predefined list of nodes. Output is organised by server name and date, with a tar.gz archive created. This is handy for sending off to other interested parties, such as your support partners. The other script runs in a loop, sleeping N seconds between each run until cancelled with ctrl-c. This can be useful when you want to monitor some/all of your nodes for stats such as tpstats and operating-system stats like iostat.<br />
<br />
The assumption is that you have a server from which you can access all of your Cassandra nodes without entering a password (i.e. with ssh keys) and that Cassandra is running on *nix.<br />
<br />
A little set up work is required to set paths correctly for your environment.<br />
<br />
<a href="https://github.com/dbwrangler/cassandra_grabs"><span style="color: white;">https://github.com/dbwrangler/cassandra_grabs</span></a><br />
<br />
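If you just want the flavour of it, the core of the grab script is no more than the loop below. This is a simplified sketch, not the toolkit itself: nodes.txt, the log/config paths and the output layout are assumptions.<br />
<br />
# Pull logs, config and stats from each node into a per-server,<br />
# per-date directory, then archive the lot.<br />
while read node; do<br />
  dest="grabs/${node}$(date +%Y%m%d)"<br />
  mkdir -p "${dest}"<br />
  scp "${node}:/var/log/cassandra/system.log" "${dest}/"<br />
  scp "${node}:/etc/cassandra/cassandra.yaml" "${dest}/"<br />
  ssh "${node}" "nodetool tpstats; iostat -x 1 3" > "${dest}/stats.txt"<br />
done < nodes.txt<br />
tar czf "grabs_$(date +%Y%m%d).tar.gz" grabs/<br />
<br />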
Git me up!
<h1>org.apache.thrift.transport.TTransportException while dropping or updating a keyspace with Cassandra 1.2</h1>
Cassandra offers great flexibility and power to the developer like no database before it, IMHO. That also means the ability to screw it up. Recently I migrated Cassandra to a new data centre, where the original set up was spread over 2 data centres. This meant temporarily adding a 3rd data centre, then tearing down one of the old ones.<br />
<br />
This meant the developers had to adjust their deployment scripts to include the new data centre, e.g.:<br />
<br />
<div class="p1">
create keyspace MyKS</div>
<div class="p1">
with placement_strategy = 'NetworkTopologyStrategy'</div>
<div class="p1">
and strategy_options = {DC1 : 3, DC2 : 3}</div>
<div class="p1">
and durable_writes = true;</div>
<div class="p1">
<br /></div>
<div class="p1">
became this:</div>
<div class="p1">
<br /></div>
<div class="p1">
create keyspace MyKS</div>
<div class="p1">
with placement_strategy = 'NetworkTopologyStrategy'</div>
<div class="p1">
and strategy_options = {DC1 : 3, DC2 : 3, DC3 : 3}</div>
<div class="p1">
</div>
<div class="p1">
and durable_writes = true;</div>
<br />
This meant updating the cassandra-topology.properties file on each node, with DC3 added.<br />
<br />
Migration complete, DC3 was renamed to DC1 and the original DC1 was taken out of the topology files. Unfortunately, a number of deployment scripts did not get reverted to the original 2-DC set up and continued to reference DC3 in drop and create scripts.<br />
<br />
This mostly worked harmlessly under 1.0, but after a recent upgrade to 1.2 it became more problematic. Developers complained that they could no longer drop keyspaces, with errors such as:<br />
<br />
<div class="p1">
org.apache.thrift.transport.TTransportException</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.thrift.Cassandra$Client.recv_system_drop_keyspace(Cassandra.java:1437)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.thrift.Cassandra$Client.system_drop_keyspace(Cassandra.java:1424)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.cli.CliClient.executeDelKeySpace(CliClient.java:1364)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:249)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:213)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.cli.CliMain.evaluateFileStatements(CliMain.java:393)</div>
<div class="p1">
at org.apache.cassandra.cli.CliMain.main(CliMain.java:272)</div>
<br />
Under v1.0, about 1 in 5-10 attempts to drop a keyspace would randomly fail even though DC3 was still referenced in the scripts. Under v1.2, it failed consistently.<br />
<br />
The developers updated their scripts and the references to DC3 were removed. I then attempted to update one of the keyspaces manually (using the cassandra-cli) to rid it of DC3, but was left with the solitary error message:<br />
<br />
<br />
<div class="p1">
org.apache.thrift.transport.TTransportException</div>
<br />
Stubborn!<br />
<br />
I then snuck the DC3 configuration back into the cassandra-topology.properties file on the node on which I was working. Even though the DC3 nodes no longer existed, that did the trick! DC3 was then removed by updating all affected keyspaces using the cli. Now the deployment scripts drop/create keyspaces without issue.<br />
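The keyspace updates were roughly of this shape (keyspace name hypothetical): with DC3 temporarily back in the topology file, strip DC3 from each affected keyspace via the cli.<br />
<br />
cassandra-cli -h localhost <<'EOF'<br />
update keyspace MyKS<br />
  with placement_strategy = 'NetworkTopologyStrategy'<br />
  and strategy_options = {DC1 : 3, DC2 : 3};<br />
EOF<br />
<br />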
<br />
Hope that is helpful to someone out there!<br />
<br />
<b><i>UPDATE 3 April 2014</i></b><br />
<b><br /></b>
Unfortunately, I spoke too soon. Although the cluster was seemingly OK (according to nodetool ring/info/describecluster), there were errors in the log of the following nature:<br />
<br />
(on the node where I made the cassandra topology update)<br />
<br />
<div class="p1">
java.lang.RuntimeException: java.io.IOException: Connection reset by peer</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:392)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:377)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:255)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:201)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:149)</div>
<div class="p1">
Caused by: java.io.IOException: Connection reset by peer</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at sun.nio.ch.FileDispatcher.write0(Native Method)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at sun.nio.ch.IOUtil.write(IOUtil.java:40)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:336)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.nio.channels.Channels.writeFully(Channels.java:81)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.nio.channels.Channels.access$000(Channels.java:47)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.nio.channels.Channels$1.write(Channels.java:155)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at java.io.DataOutputStream.write(DataOutputStream.java:90)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)</div>
<div class="p1">
<span class="Apple-tab-span"> </span>at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)</div>
<div class="p1">
</div>
<div class="p1">
<span class="Apple-tab-span"> </span>... 8 more</div>
<div class="p1">
<br /></div>
<div class="p1">
(on other nodes)</div>
<div class="p1">
</div>
<div class="p1">
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=eb68069e-20b7-394c-9a3f-a727d45ec594</div>
<div class="p1">
<br /></div>
I then noticed that root had ownership of the cassandra-topology.properties file. So I quickly changed this, but it made no difference. Then I restarted the instance, and that made no difference either.<br />
<br />
Regarding the error on the other nodes, it looks like a schema mismatch, but running 'nodetool describecluster' indicated that the schema was consistent across all nodes. Nonetheless, I decided to shut down this node, clear out the system keyspace directory and restart. This produced an interesting result. Initially I could see the same UnknownColumnFamilyException error appearing, but once it had rebuilt the system keyspace directory the error went away. The same goes for the original node, oddly enough.<br />
<br />
<h1>Oracle - How To Determine The Actions of noaudit / audit command</h1>
You may be familiar with the noaudit / audit commands in Oracle, but how do you check the current settings?<br />
<br />
<span style="width: 575px;">noaudit all or audit all;<br /><br />Simple. Just check the DBA_STMT_AUDIT_OPTS table.<br /><br />SQL> audit all;<br /><br />Audit succeeded<br /><br />SQL> SELECT * FROM DBA_STMT_AUDIT_OPTS order by user_name,audit_option;<br />USER_NAME | PROXY_NAME | AUDIT_OPTION | SUCCESS | FAILURE<br />----------+------------+-------------------+-----------+----------<br />NULL | NULL | ALTER SYSTEM | BY ACCESS | BY ACCESS<br />NULL | NULL | CLUSTER | BY ACCESS | BY ACCESS<br />NULL | NULL | CONTEXT | BY ACCESS | BY ACCESS<br />NULL | NULL | CREATE SESSION | BY ACCESS | BY ACCESS<br />NULL | NULL | DATABASE LINK | BY ACCESS | BY ACCESS<br />NULL | NULL | DIMENSION | BY ACCESS | BY ACCESS<br />NULL | NULL | DIRECTORY | BY ACCESS | BY ACCESS<br />NULL | NULL | INDEX | BY ACCESS | BY ACCESS<br />NULL | NULL | MATERIALIZED VIEW | BY ACCESS | BY ACCESS<br />NULL | NULL | MINING MODEL | BY ACCESS | BY ACCESS<br />NULL | NULL | NOT EXISTS | BY ACCESS | BY ACCESS<br />NULL | NULL | PROCEDURE | BY ACCESS | BY ACCESS<br />NULL | NULL | PROFILE | BY ACCESS | BY ACCESS<br />NULL | NULL | PUBLIC DATABASE LINK | BY ACCESS | BY ACCESS<br />NULL | NULL | PUBLIC SYNONYM | BY ACCESS | BY ACCESS<br />NULL | NULL | ROLE | BY ACCESS | BY ACCESS<br />NULL | NULL | ROLLBACK SEGMENT | BY ACCESS | BY ACCESS<br />NULL | NULL | SEQUENCE | BY ACCESS | BY ACCESS<br />NULL | NULL | SYNONYM | BY ACCESS | BY ACCESS<br />NULL | NULL | SYSTEM AUDIT | BY ACCESS | BY ACCESS<br />NULL | NULL | SYSTEM GRANT | BY ACCESS | BY ACCESS<br />NULL | NULL | TABLE | BY ACCESS | BY ACCESS<br />NULL | NULL | TABLESPACE | BY ACCESS | BY ACCESS<br />NULL | NULL | TRIGGER | BY ACCESS | BY ACCESS<br />NULL | NULL | TYPE | BY ACCESS | BY ACCESS<br />NULL | NULL | USER | BY ACCESS | BY ACCESS<br />NULL | NULL | VIEW | BY ACCESS | BY ACCESS<br /><br />27 row(s) returned<br /><br />SQL> SELECT * FROM DBA_PRIV_AUDIT_OPTS order by user_name,privilege;<br />USER_NAME | PROXY_NAME | PRIVILEGE | SUCCESS | FAILURE<br />----------+------------+----------------+-----------+----------<br />NULL | NULL | ALTER SYSTEM | BY ACCESS | BY ACCESS<br />NULL | NULL | AUDIT SYSTEM | BY ACCESS | BY ACCESS<br />NULL | NULL | CREATE SESSION | BY ACCESS | BY ACCESS<br /><br />3 row(s) returned<br /><br />noaudit all;<br />will remove this audit priv and statment:<br /><br />SQL> noaudit all;<br /><br />Noaudit succeeded<br /><br />SQL> SELECT * FROM DBA_PRIV_AUDIT_OPTS order by user_name,privilege;<br /><br />No rows returned<br /><br />SQL> SELECT * FROM DBA_STMT_AUDIT_OPTS order by user_name,audit_option;<br /><br />No rows returned</span>DR Wranglerhttp://www.blogger.com/profile/01847612940014191204noreply@blogger.com0tag:blogger.com,1999:blog-7633438436258959683.post-11436516262573954882013-08-15T02:44:00.000-07:002013-08-20T08:52:12.054-07:00Time to abandon NOSQL? (The name that is...)It's odd when a technology comes along that is named after something it is not, rather than something it is. Case in point: NOSQL.<br />
<br />
Here's my take on it.<br />
<br />
RDBMS has been the crowned prince of data technologies for the last few decades. The momentum behind it was so strong that it has taken something like the last 5 years for other paradigms, more suitable to web-scale data, to challenge its dominance. In order to highlight this paradigm shift, someone, somewhere coined the buzzword NOSQL. The downside of doing this is that, rather than being known for what you are, you become known for what you are not.<br />
<br />
And this is a problem. As organisations adopt NOSQL technologies, more and more want to join the bandwagon. But rather than having a real-world problem to solve with NOSQL, they just "know" that they want NOSQL, any NOSQL. Is the world of data really divided into SQL and NOSQL? Does that even make sense?<br />
<br />
Let's consider the following analogy.<br />
<br />
Imagine a town with a large lake in the middle of it. Imagine too that the car is the only form of transportation known to man. Then, one day, someone invents a way to travel over water, cutting journey times to the other side of the lake in half. But what to call this new-fangled vehicle? Well, since it needs to be made clear that this is not just another form of car, let's call it NOCAR. Sounds ridiculous? It's not long before other NOCARs come into being. One that travels through air, one that requires no fuel and just uses pedal power (aka NOPISTON), etc.<br />
<br />
Now, imagine someone saying 'Look, I don't care what NOCAR we adopt, I just know that we should have one'. Or more subtly 'Aeroplane is my NOCAR of choice', as if we should now compare NOCAR solutions on a level playing field and exclude CARs altogether. Weird. The NOCAR that travels across water is as different to the NOCAR that travels through air as it is to the CAR.<br />
<br />
Likewise with NOSQL. MongoDB is as different to Cassandra, as Cassandra is to RDBMS.<br />
<br />
As alluded to above, perhaps WEBDB is more appropriate as it describes these technologies for what they are (or claim to be), rather than what they are not.<br />
<br />
It's time to abandon the term NOSQL and start discussing these technologies for what they are rather than for what they are not.<br />
<br />
<h1>Nasty bug with Dataguard Setup via RMAN!</h1>
There is a very nasty bug that you will hit if you implement Dataguard using rman on an active database on 11.2. This set up method is described in detail in Oracle doc ID 1075908.1.<br />
<div>
<br /></div>
<div>
The main task is performed via rman command:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">duplicate target database for standby from active database</span></div>
<div>
<br /></div>
<div>
This invokes an Oracle-provided procedure and includes the creation of a controlfile copy. Unfortunately, this controlfile copy uses the same name as the controlfile snapshots that are taken as part of your routine rman backups. </div>
<div>
<br /></div>
<div>
So what is the problem?</div>
<div>
<br /></div>
<div>
The problem, though tiny, has big implications for your database, especially your production databases. This applies if you use rman to manage obsolete/redundant backup sets. If so, you will be deleting said backups that fall outside of the retention policy using the following rman command:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">delete obsolete;</span></div>
<div>
<br /></div>
<div>
Bad luck I'm afraid! This command will fail with:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">RMAN-06207: WARNING: 1 objects could not be deleted for DISK channel(s) due</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">RMAN-06208: to mismatched status. Use CROSSCHECK command to fix status</span></div>
</div>
<div>
<br /></div>
<div>
since rman has the controlfile down as a snapshot, not a copy. Unfortunately, this is the first delete it tries to run; it fails and quits. Unless you are super vigilant, your backups will fill up the flash recovery area until your database comes to a halt! VERY BAD!</div>
<div>
<br /></div>
<div>
<span style="font-size: large;">Solution</span></div>
<div>
<br /></div>
<div>
I raised a ticket with Oracle about this and currently there is no patch. So, the way I work around it is as follows. Just before running the duplicate command, I change the snapshot controlfile name. Then, once the Dataguard set up is complete, I set the snapshot controlfile name back to its default and delete the stray controlfile copy:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_temp.f';</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">duplicate target database for standby from active database</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">CONFIGURE SNAPSHOT CONTROLFILE NAME TO default;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">crosscheck controlfilecopy '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_temp.f';</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">delete force noprompt expired controlfilecopy '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/snapcf_temp.f';</span></div>
</div>
<div>
<br /></div>
<div>
This steers clear of the original snapshot controlfile and my rman deletes continue to operate as normal.</div>
<h1>Cassandra - TTL columns, Tombstones and complete row deletion</h1>
This article is an investigation into Cassandra TTL columns, how it cleans up tombstoned data and gets rid of rows altogether. The following investigation uses Cassandra v1.0.10 + cql 2.0.0. TTLs are set on your data at insert/upsert time and specify an expiry time (to the second) for the data; on expiry, the column becomes a tombstone. To understand why Cassandra doesn't just delete the data on expiry, see <a href="http://wiki.apache.org/cassandra/DistributedDeletes" target="_blank"><span style="color: cyan;">here</span></a>. The columns are completely deleted during compaction, but the documentation states that you need to be careful in how you manage this or you will end up with failed nodes. In order to understand this better, I decided to run some scenarios that match my current set-up, in which some data is TTL'd but data is never explicitly deleted.<br />
<br />
I concluded that I will not get failed nodes. Please read on to understand my reasoning.<br />
<br />
Test 1: Write columns with replication factor = 1, TTL 30s, gc_grace_seconds 60s.<br />
<br />
create keyspace testks WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 1;<br />
use testks;<br />
create table test (ID varchar primary key, firstname varchar, surname varchar) with gc_grace_seconds=60 and compression_parameters:sstable_compression='SnappyCompressor';<br />
<br />
The grace period is set to 60 seconds and the TTL on insert is set to 30 seconds. So after 30 seconds the columns should be tombstoned, which means they will be removed at the next compaction, as long as the time of compaction minus the expiry time exceeds 60 seconds.<br />
<br />
Write ten records: cqlsh < 10_testrecords.cql<br />
<br />
insert into testks.test (id, firstname, surname) values (1,'tom','jones') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (2,'tom','cruise') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (3,'tom','tom') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (4,'tom','thumb') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (5,'tom','ljones') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (6,'tom','dixon') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (7,'tom','baker') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (8,'tom','smith') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (9,'tom','jessy') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (10,'tom','froun') using consistency QUORUM and TTL 30;<br />
<br />
node 1<br />
cqlsh:testks> select * from test ;<br />
ID | firstname | surname<br />
----+-----------+---------<br />
3 | tom | tom<br />
6 | tom | dixon<br />
5 | tom | ljones<br />
10 | tom | froun<br />
8 | tom | smith<br />
2 | tom | cruise<br />
1 | tom | jones<br />
9 | tom | jessy<br />
4 | tom | thumb<br />
7 | tom | baker<br />
<br />
Wait 30 seconds:<br />
<br />
node 1<br />
cqlsh:testks> select * from test ;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
4<br />
7<br />
<br />
As alluded to above regarding tombstones, we would still expect to see the row keys. An excellent write-up on this can be found <a href="http://thelastpickle.com/blog/2011/05/15/Deletes-and-Tombstones.html" target="_blank"><span style="color: cyan;">here</span></a>.<br />
<br />
So how do we get rid of the row keys and thus the records all together?<br />
<br />
In theory, we should be able to flush and compact the test column family on each node, which will get rid of the rows for good. Compaction is responsible for permanently removing tombstoned records once the compaction time minus the tombstone time is beyond gc_grace_seconds. So let's try it:<br />
<br />
node1<br />
nodetool -h localhost flush testks test<br />
nodetool -h localhost compact testks test<br />
<br />
This should have removed 1/3rd of the keys, ie keys that reside on node1:<br />
<br />
cqlsh:testks> select * from test using consistency one;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
4<br />
7<br />
<br />
Hmm... doesn't seem to have worked. Let's check the log:<br />
<br />
INFO [CompactionExecutor:249] 2012-10-18 11:43:23,275 CompactionTask.java (line 245) Nothing to compact in test. Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)<br />
<br />
So compaction cannot delete tombstones if there is only 1 data file, as it seems to work by merging multiple files together. We'll test this soon, but for now let's follow the suggestion to use forceUserDefinedCompaction. This is available as a java mbean and can be invoked using jconsole.<br />
<br />
Connect jconsole to node1. Select the mbeans tab, org.apache.cassandra.db, CompactionManager, Operations. The right-hand panel now shows forceUserDefinedCompaction. This takes two parameters, the keyspace name and the file name to compact. So let's call:<br />
<br />
forceUserDefinedCompaction (testks, test-hd-1-Data.db)<br />
<br />
Check system.log:<br />
<br />
INFO [CompactionExecutor:251] 2012-10-18 13:26:54,507 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db')]<br />
<br />
cqlsh:testks> select * from test using consistency one;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
<br />
Voilà, we have now got rid of the 2 keys which this node is responsible for. We could go around each other node and do the same, which would remove all row keys, but I won't bother. Instead I will focus on how we can invoke key deletion under 'natural selection', so to speak.<br />
<br />
I'll reset the data to the point before I ran the forceUserDefinedCompaction.<br />
<br />
Now, since compaction requires more than one data file, let's upsert the data, keeping the same TTL:<br />
<br />
insert into testks.test (id, firstname, surname) values (1,'stom','jones') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (2,'stom','cruise') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (3,'stom','stom') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (4,'stom','thumb') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (5,'stom','ljones') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (6,'stom','dixon') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (7,'stom','baker') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (8,'stom','smith') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (9,'stom','jessy') using consistency QUORUM and TTL 30;<br />
insert into testks.test (id, firstname, surname) values (10,'stom','froun') using consistency QUORUM and TTL 30;<br />
<br />
We should only have to wait 60 seconds to run the compaction in order to remove the rows completely (gc_grace_seconds=60s), but I found that gives unreliable results, so I give it 2-3 minutes:<br />
<br />
nodetool -h localhost flush testks test<br />
Wait 2-3 minutes.<br />
nodetool -h localhost compact testks test<br />
<br />
Check system.log:<br />
<br />
INFO [CompactionExecutor:261] 2012-10-18 13:36:09,735 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db'), SSTableReader(path='/media/data/cassandra/data/testks/test-hd-2-Data.db')]<br />
<br />
So it seems to have worked. Let's check the data:<br />
<br />
cqlsh:testks> select * from test using consistency one;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
<br />
Voilà encore. Row keys 4 and 7 have gone, as in the forceUserDefinedCompaction example.<br />
<br />
Let's run the flush and compaction on the other nodes too:<br />
<br />
Node2<br />
nodetool -h localhost flush testks test<br />
nodetool -h localhost compact testks test<br />
<br />
Node3<br />
nodetool -h localhost flush testks test<br />
nodetool -h localhost compact testks test<br />
<br />
Let's check the data again:<br />
<br />
select * from test using consistency one;<br />
cqlsh:testks><br />
<br />
Hooray, no row keys!<br />
<br />
Of course, the beauty of Cassandra is being able to easily replicate data, so a more real-world example would be to have a replication factor of, say, 3. The inserts are quorum writes, so they will block on 2 acknowledgements and eventually replicate to a total of 3 nodes.<br />
<br />
Test 2: Let's crank up the replication factor to 3 and keep TTL 30s, gc_grace_seconds 60s.<br />
<br />
create keyspace testks WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 3;<br />
use testks;<br />
create table test (ID varchar primary key, firstname varchar, surname varchar) with gc_grace_seconds=60 and compression_parameters:sstable_compression='SnappyCompressor';<br />
<br />
<br />
Let's load the data:<br />
<br />
cqlsh < 10_testrecords.cql<br />
<br />
Node1<br />
nodetool -h localhost flush testks test<br />
<br />
Node2<br />
nodetool -h localhost flush testks test<br />
<br />
Node3<br />
nodetool -h localhost flush testks test<br />
<br />
upsert the data<br />
cqlsh < 10_testrecords2.cql<br />
<br />
Now let's flush:<br />
<br />
Node1<br />
nodetool -h localhost flush testks test<br />
<br />
Node2<br />
nodetool -h localhost flush testks test<br />
<br />
Node3<br />
nodetool -h localhost flush testks test<br />
<br />
Each node will now have 2 data files. Wait 2-3 minutes, then run compaction:<br />
<br />
Node1<br />
nodetool -h localhost compact testks test<br />
INFO [CompactionExecutor:342] 2012-10-18 14:55:50,228 CompactionTask.java (line 112) Compacting [SSTableReader(path='/media/data/cassandra/data/testks/test-hd-2-Data.db'), SSTableReader(path='/media/data/cassandra/data/testks/test-hd-1-Data.db')]<br />
<br />
Select on each node again:<br />
<br />
Node1<br />
select * from test using consistency one;<br />
cqlsh:testks><br />
<br />
I guess this is because CL = one and Cassandra uses a snitch mechanism to know that this node is responsible for one complete set of replicas, so it returns no rows.<br />
<br />
<br />
Node2+3<br />
cqlsh:testks> select * from test using consistency one;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
4<br />
7<br />
<br />
<br />
So the keys have been deleted from Node1, but not Node2+3.<br />
<br />
Run the compaction on Nodes2+3 finally gets rid of all the keys.<br />
<br />
So given the foregoing, what would happen if we select with a higher consistency level?<br />
<br />
Repeat setup + compaction as in last example:<br />
<br />
Node1:<br />
select * from test using consistency one;<br />
cqlsh:testks><br />
<br />
As expected, no rows. How about reading at quorum, which requires responses from 2 nodes? This should return all the row keys, as it will be combining the empty result from the local node with successful reads from nodes 2 and/or 3 (probably 'or' rather than 'and' if the snitch is working efficiently):<br />
<br />
Node1:<br />
select * from test using consistency quorum;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
4<br />
7<br />
<br />
... ta da! Now, what about setting CL back to one?<br />
<br />
Node1:<br />
select * from test using consistency one;<br />
ID<br />
----<br />
3<br />
6<br />
5<br />
10<br />
8<br />
2<br />
1<br />
9<br />
4<br />
7 <br />
<br />
<br />
The log gives no clue, but it would seem a read repair has taken place on Node1. This creates a memtable structure once again. If this conjecture is correct, we should be able to flush and compact once again:<br />
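<br />
That is, repeating the flush/compact pair from earlier on Node1, which in my case yielded the log line below:<br />
<br />
<pre>nodetool -h localhost flush testks test<br />nodetool -h localhost compact testks test</pre>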
<br />
INFO [CompactionExecutor:455] 2012-10-18 16:47:20,631 CompactionTask.java (line 245) Nothing to compact in test. Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)<br />
<br />
I found that the original compaction had actually removed the data files altogether, since no keys were left for the column family. Thus the flush produced the one and only data file, which, as we saw earlier, cannot be compacted on its own.<br />
<br />
Of course, under normal operating conditions each node would remove its TTL'd data once gc_grace_seconds has passed, so such scenarios are short-lived.<br />
<br />
Although this article covers TTL'd data, the same rules should apply to data deleted programmatically.<br />
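<br />
A programmatic delete simply writes a tombstone explicitly rather than waiting for the TTL to expire; in the CQL syntax used here it would be something like this (key chosen for illustration):<br />
<br />
<pre>delete from test using consistency quorum where ID = '4';</pre>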
<br />
So what are the implications for failed nodes, as implied by <a href="http://wiki.apache.org/cassandra/DistributedDeletes" target="_blank"><span style="color: cyan;">DistributedDeletes</span></a>, which states "if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node"? I would caveat this statement with the following for nodes that have been down for longer than gc_grace_seconds:<br />
<br />
1. If you never programmatically delete data, then you will not get failed nodes.<br />
2. If you only ever delete data by TTL'ing your columns, you will not get failed nodes.<br />
<br />
<br />
<br />DB Wranglerhttp://www.blogger.com/profile/00758282180100120539noreply@blogger.com1tag:blogger.com,1999:blog-7633438436258959683.post-89281960855431263272011-10-04T05:45:00.000-07:002011-10-31T07:49:37.366-07:00Oracle O/S Audit PurgingIt's great that Oracle have introduced an ever richer set of audit management functions in 11g release 2. I was particularly interested in auditing to the filesystem, thereby keeping the db clean and allowing standard operating system methods to archive audit data to a backup server. The solution is neatly wrapped up by purging old audit files on a regular basis, all controlled by the db scheduler.<br /><br />There is a great article on how to do so <a href="http://www.oracle-base.com/articles/11g/AuditingEnchancements_11gR2.php">here</a>. It is, however, more focussed on purging audit records in the database, not the OS. Furthermore, an additional scheduler job is required to update the timestamp against which Oracle will delete old records (see end of article).<br /><br /><br />BUG: Hands up how many folk out there use upper-case SID names? Yeah, me too. Sorry folks, O/S purging does not work if your SID is upper case unless you apply patch 9438890! Before I discovered this patch, I tried creating lower-case SID dbs, thinking this might be the way forward. I was left feeling somewhat uncomfortable when DBCA left me with a mixture of upper- and lower-case uses of the SID (lower case in the spfile, upper case in the flash recovery folder name). No such mixture results from specifying upper-case SIDs to DBCA. Perhaps it is an unspoken wisdom of being a DBA, but I say better to stick with upper-case SIDs.<br /><br />I will keep you posted on the patch, which is yet to be installed.<br /><br />Update: 4 Oct 2011. SIDs are now being generated in upper case and are therefore being identified by purging.<br /><br />Update: 31 Oct 2011. So it seems a job also needs to be scheduled to update the otherwise static LAST_ARCHIVE_TS.
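<br />
For context, the purge machinery itself is initialised and scheduled through DBMS_AUDIT_MGMT. A minimal sketch for the OS audit trail might look like the following (the 24-hour intervals and the job name are illustrative, not prescriptive):<br />
<br />
<pre>begin<br />-- one-off initialisation of the cleanup infrastructure<br />DBMS_AUDIT_MGMT.INIT_CLEANUP(<br />audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,<br />default_cleanup_interval => 24 /* hours */);<br />-- recurring purge job, honouring the last-archive timestamp<br />DBMS_AUDIT_MGMT.CREATE_PURGE_JOB(<br />audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,<br />audit_trail_purge_interval => 24 /* hours */,<br />audit_trail_purge_name => 'OS_AUDIT_PURGE_JOB',<br />use_last_arch_timestamp => TRUE);<br />end;<br />/</pre><br />
With use_last_arch_timestamp => TRUE, the purge only removes audit files older than LAST_ARCHIVE_TS, which is why that timestamp has to be kept moving.<br />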
The following code should take care of it:<br /><br /><br />-- advance LAST_ARCHIVE_TS to sysdate minus a retention period (in days)<br />create or replace procedure set_archive_retention<br />(retention in number default 7) as<br />begin<br />DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP(<br />audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,<br />last_archive_time => sysdate - retention);<br />end;<br />/<br /><br />-- drop any previous incarnation of the job, ignoring "job does not exist"<br />begin<br />DBMS_SCHEDULER.disable('advance_archive_timestamp');<br />DBMS_SCHEDULER.drop_job('advance_archive_timestamp');<br />exception<br />when others then<br />null;<br />end;<br />/<br /><br />-- schedule the procedure to run daily<br />BEGIN<br />DBMS_SCHEDULER.create_job (<br />job_name => 'advance_archive_timestamp',<br />job_type => 'STORED_PROCEDURE',<br />job_action => 'SET_ARCHIVE_RETENTION',<br />number_of_arguments => 1,<br />start_date => SYSDATE,<br />repeat_interval => 'freq=daily',<br />enabled => false,<br />auto_drop => FALSE);<br />dbms_scheduler.set_job_argument_value<br />(job_name => 'advance_archive_timestamp',<br />argument_position => 1,<br />-- one week, you can customize this line:<br />argument_value => 7);<br />DBMS_SCHEDULER.ENABLE('advance_archive_timestamp');<br />END;<br />/<br /><br />-- run it once now to prime the timestamp<br />BEGIN<br />DBMS_SCHEDULER.run_job (job_name => 'advance_archive_timestamp',<br />use_current_session => FALSE);<br />END;<br />/DB Wranglerhttp://www.blogger.com/profile/00758282180100120539noreply@blogger.com0tag:blogger.com,1999:blog-7633438436258959683.post-3191174256836200362010-07-08T02:46:00.001-07:002010-07-08T03:09:53.220-07:00Quirky Oracle DOS attack optionAs of 11g, Oracle has implemented a number of built-in password protections. One such protection is retarding login attempts when an incorrect password is used. After three failed attempts, Oracle will delay logins for several seconds, rising to approximately 10 seconds on successive attempts. <div><br /></div><div>The documentation states that if the correct password is entered, login occurs without delay. However, this is not quite true. Take the following situation: a new application is trying to connect to the database but is configured with the wrong password. Typically configured with some connection pooling, it will very quickly reach the 10-second delay. Now, whilst this new application is repeatedly failing to connect, ALL attempts (correct and incorrect) to connect as that database user are delayed. This means existing applications will be delayed even though they have the correct credentials. </div><div><br /></div><div>I experienced this on 11.1.0.7.0. Let's hope Oracle come up with a fix!</div><div><br /></div><div><br /></div><div><br /><div><br /></div></div>DB Wranglerhttp://www.blogger.com/profile/00758282180100120539noreply@blogger.com0tag:blogger.com,1999:blog-7633438436258959683.post-72848311720814152652010-03-29T06:33:00.000-07:002010-04-06T08:39:15.680-07:00Quirky Oracle LOBSRecently a client of mine took to storing volatile session-related data in the database in the form of a BLOB. Whilst I argued the virtues of storing session-related information at the app tier and warned of the reduced scalability of such a design, for the short term it looked like the only solution.<br /><br />Naturally I was inclined to test and tune a variety of configurations for the LOB, such as chunk size and storage in and out of row (see the "SecureFiles and Large Objects Developer's Guide" in www.oracle.com/pls/db111/portal.all_books).<br /><br />As I explained to the developers, when dealing with LOBs, you can just chuck it in and suffer the consequences (as they were seeing in their own tests) or do it properly.
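<br />
Doing it "properly" means being explicit in the LOB storage clause rather than taking the defaults. A sketch of the kind of DDL I mean, using the booking table that appears in the AWR snippet below (column types assumed; this is illustrative, not the client's actual DDL):<br />
<br />
<pre>create table mybooking (<br />id   number primary key,<br />data blob<br />)<br />lob (data) store as (<br />disable storage in row   -- keep the LOB out of row<br />chunk 24576              -- 24k, a multiple of the 8k block size<br />nocache nologging        -- the alternative, cache, forces logging for out-of-row LOBs<br />);</pre>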
I was fairly confident that I would be able to improve upon the performance that comes by default.<br /><br />Initially the LOB was 8k, so I wouldn't have the luxury of choosing between in-row and out-of-row design (anything over 4k forces you to the latter). My choice was then reduced to LOBs that were cached but generated logging (with Oracle LOBs you can't have one without the other for out-of-row LOBs), or non-cached without logging. Load testing revealed the load profile in the database, and here's a snippet from AWR (1-hour snapshot with the LOB configured as non-caching):<br /><br /><pre>                                Reads              CPU   Elapsed<br />Physical Reads  Executions   per Exec  %Total  Time (s)  Time (s)<br />--------------  ----------  ---------  ------  --------  --------<br />       289,226      70,151        4.1    53.7<br />update mybooking set data = :1 where id = :2</pre><br />data is the LOB (a BLOB in my case). This revealed that the app updates the LOB 70k times per hour.<br /><br />The average size was 22k, so I set the chunk size to 24k (rounded up to a multiple of the 8k db block size). You can work out the average size with, e.g.:<br /><br /><pre>select avg(dbms_lob.getlength(data)) from mybooking;</pre><br />I then tried the caching/logging option and found that physical disk reads reduced by 70-80%, but of course redo generation rose dramatically, to 500 MB/hr. Therein lies the dilemma. Admittedly the application response times were similar with both configurations, but at the time of the tests the application was being throttled with a 100ms delay.<br /><br />I didn't have to scratch my head for long, however (so I still have hair!)... The application was changed such that the average LOB size reduced to 2k. I jumped at the chance of storing the LOB in-row, knowing that I could create the table with the NOLOGGING option. I would benefit from caching and nologging, hooray! So what happened...?<br /><br />To my disappointment, Oracle kept logging away, much like before. It turns out that there is a bug in Oracle (<strong>Bug 4205241</strong>) that affects 9.2.0.4 all the way up to 11g (groan).DB Wranglerhttp://www.blogger.com/profile/00758282180100120539noreply@blogger.com0tag:blogger.com,1999:blog-7633438436258959683.post-17537395815055879242010-03-25T09:15:00.001-07:002010-04-01T02:37:45.208-07:00Quirky 11g Default ProfileSo you're used to setting up and configuring your Oracle database in a 3-tier architecture (typically a web app). As has been the case for many past releases, Oracle assigns the default profile to users unless otherwise specified. However, to help you along (...not!), the default profile in 11g is now restrictive, as follows:<br /><br /><pre>PASSWORD_LIFE_TIME 180 (days)<br />PASSWORD_GRACE_TIME 7 (days)<br />FAILED_LOGIN_ATTEMPTS 10<br />PASSWORD_LOCK_TIME 1 (days)<br /></pre>Oracle probably thought that this was smart security, but I'd wager that most Oracle databases are connected to directly by application servers, whose password management is not handled this way. The upshot is that many an unsuspecting DBA, having set up a database that is happily supporting a website or a.n.other 3-tier configuration, gets a rude awakening when the application can no longer connect after 6 months!
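<br />
You can see what the default profile will do to you with a quick query against dba_profiles:<br />
<br />
<pre>select resource_name, limit<br />from   dba_profiles<br />where  profile = 'DEFAULT'<br />and    resource_type = 'PASSWORD';</pre>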
Frankly it would have been better for Oracle to make available a secure profile with the above properties that needs to be explicitly assigned to users.<br /><br />To resolve this issue you can run the following:<br /><br />ALTER PROFILE DEFAULT LIMIT<br />FAILED_LOGIN_ATTEMPTS UNLIMITED<br />PASSWORD_LIFE_TIME UNLIMITED;DB Wranglerhttp://www.blogger.com/profile/00758282180100120539noreply@blogger.com0