Cassandra: Pros and Cons

Posted on September 17, 2012

The differences between the relational database model and NoSQL database models are vast. NoSQL is a set of technologies that address the problems that begin to plague Codd's relational model in very large systems; they have a lot of drawbacks, but also some very important advantages. I've selected Cassandra because it is a very robust, performant and decentralized system that I've had the opportunity to work with on multiple projects. It's not the only solution, but being well documented and having a strong and helpful community, it is one of the best options.

Cassandra combines ideas from two big-data technologies, Amazon's Dynamo and Google's BigTable, and was open sourced by Facebook in 2008. Cassandra is currently under very active development, and it can be downloaded from the Apache Cassandra website.

Relational model – recap and warm-up

In the relational model, the database is the outer layer. A database contains tables, and each table contains one or more named columns. A new record (row) is defined by providing values for all defined columns; if a value doesn't exist, a null value is used. Records can be accessed if the row's unique identifier (primary key) is known, or by using the SQL query language to retrieve rows that satisfy certain criteria.
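As a quick warm-up, here is a minimal sketch of those ideas using Python's built-in sqlite3 module. The table name, columns and sample rows are made up for illustration only.

```python
import sqlite3

# In-memory database: the outer layer that holds the tables.
conn = sqlite3.connect(":memory:")

# A table with named columns; "id" is the primary key.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# A full row, and a row with a missing value (city is stored as NULL).
conn.execute("INSERT INTO users (id, name, city) VALUES (1, 'Ana', 'Belgrade')")
conn.execute("INSERT INTO users (id, name) VALUES (2, 'Marko')")

# Access by primary key.
by_key = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (2,)
).fetchone()

# Access by criteria, using SQL.
by_criteria = conn.execute(
    "SELECT name FROM users WHERE city = ? ORDER BY name", ("Belgrade",)
).fetchall()

print(by_key)       # ('Marko', None)
print(by_criteria)  # [('Ana',)]
```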

If Cassandra is the answer, what is the question?

What’s wrong with RDBMS? The development of RDBMS didn’t keep pace with the expansion of IT; nowadays we have huge systems absorbing huge amounts of data every day. Doing so with technology from the 1970s can’t be effective, because of the lack of scalability and because performance degrades as the amount of data grows. Also, the world is not ideal and hardware fails, so a system needs to be fault-tolerant and scalable, without a single point of failure.

PRO...

Cassandra solves the problem of building distributed and scalable systems, and it’s built to cope with the data management challenges of modern business.

Cassandra is a decentralized system. There is no single point of failure as long as the minimum required cluster setup is present: every node in the cluster has the same role, and every node can service any request. Replication strategies can be configured. It is very easy to add new nodes to a cluster. Also, if one node fails, data can be retrieved from other nodes (the level of redundancy can be tuned). It is especially suitable for multiple data-center deployments, where replication across data centers provides redundancy, failover and disaster recovery.
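To make the idea of replicas on a ring of equal nodes more concrete, here is a toy sketch in plain Python. It is not Cassandra's actual partitioner or replication code; the ToyRing class, the node names and the MD5-based token function are assumptions made only for this illustration.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (toy hash, illustration only)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Hypothetical in-memory cluster: every node has the same role."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)
        self.storage = {n: {} for n in nodes}
        self.down = set()  # nodes that are currently unavailable

    def replicas(self, key):
        """Walk clockwise from the key's token and pick the next N nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        # Store a copy on every replica that is up.
        for node in self.replicas(key):
            if node not in self.down:
                self.storage[node][key] = value

    def read(self, key):
        # Any surviving replica can answer the request.
        for node in self.replicas(key):
            if node not in self.down and key in self.storage[node]:
                return self.storage[node][key]
        raise KeyError(key)


ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
ring.write("user:42", "Ana")
ring.down.add(ring.replicas("user:42")[0])  # simulate a failed node
print(ring.read("user:42"))  # Ana
```

With a replication factor of 3, the value stays readable while at least one of its three replicas is up; lowering or raising that number is the "tunable redundancy" trade-off in miniature.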

Very importantly, Cassandra has Hadoop integration with MapReduce support, and it also works with Apache Pig and Apache Hive.

ET CONTRA...

This level of flexibility has its price.

  • there is no referential integrity, and there is no concept of JOIN in Cassandra
  • querying options for retrieving data are very limited
  • sorting of data is a design decision; it can be done in one of the predefined ways, and data is retrieved in that same order; that’s all - there are no things like ORDER BY or GROUP BY
  • denormalization is good; you want to denormalize your data and to have redundancy (a big no-no according to Codd) - data is stored in the way it will be retrieved
  • different database design; in RDBMS we think about data modeling first, and after that we create queries; here, we think first about the most common queries, and after that the data is modeled around those queries.
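The last three points can be sketched with plain Python dictionaries. This is only an illustration of the query-first way of thinking, not Cassandra code; the two "tables", the function names and the sample posts are all hypothetical.

```python
from bisect import insort
from collections import defaultdict

# One hypothetical "table" per query the application has to answer.
# The same post is stored twice on purpose (denormalization).
posts_by_author = defaultdict(list)  # query: latest posts by an author
posts_by_tag = defaultdict(list)     # query: latest posts for a tag


def add_post(author, tag, timestamp, title):
    """Write the post into every structure that a query will read from."""
    # The sort order is decided at write time: newest first.
    insort(posts_by_author[author], (-timestamp, title))
    insort(posts_by_tag[tag], (-timestamp, title))


def latest_by_author(author, limit=10):
    # No ORDER BY at read time: rows come back in their stored order.
    return [title for _, title in posts_by_author.get(author, [])[:limit]]


def latest_by_tag(tag, limit=10):
    return [title for _, title in posts_by_tag.get(tag, [])[:limit]]


add_post("ana", "cassandra", 100, "Intro")
add_post("ana", "nosql", 200, "Data modeling")
add_post("marko", "cassandra", 150, "Replication")

print(latest_by_author("ana"))     # ['Data modeling', 'Intro']
print(latest_by_tag("cassandra"))  # ['Replication', 'Intro']
```

Note that there is no way to ask "all posts by ana tagged cassandra" without adding a third structure for it; that is the price of modeling around queries.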

Worth trying?

Definitely yes.

NoSQL database models won't and can't completely replace RDBMS technology, but the importance of NoSQL will grow because of its scale, flexibility and ease of use. We are dealing with more and more data; we want durable and fault-tolerant applications; we want apps that scale and apps that are fast. Because of all this, NoSQL will be around us more and more, and it's definitely a technology worth exploring.
