One of the most popular NoSQL databases at the present moment is Apache Cassandra. In this article, we will look into one of the most important pieces of a Cassandra cluster: The node.
Cassandra is a distributed database and relies on several nodes to be high-available and fault-tolerant. Ironically, I’m not talking about distribution here. I’ll be describing the purpose of a single node in Cassandra. And why would I do that?
Selecting a database technology that will impact the overall behavior of the products we build is not something to take lightly. Before choosing between Cassandra or something else, it’s important to understand its architecture, starting with the most fundamental element, the node, because the node is the most important player in the cluster when it comes to do what a database needs to do: store data. After understanding the responsibility of single node, understanding an entire Cassandra cluster becomes a lot easier.
That said, this article does not intend to touch any distribution aspect of Apache Cassandra, although I’ll be writing about that another time. There’s a lot going on within a single node, let’s just understand that first before going into how it fits in a distributed cluster.
What is a node?
Everything starts with a node. A node is a JVM (Java Virtual Machine) process running inside a machine. It will have its own heap allocation and will interact with off-heap memory and disk files.
The node is responsible for the data that it stores. In Cassandra, data is partitioned by key.
The client accesses the node by using a driver that is language-specific. The driver provides the means for writing/reading data in the node. Actually, it does a bit more than that when acting as a part of a cluster of nodes, but that is out of the scope of this article.
To manage a node, a tool exist that allows an operator to interact with the node. That tool is co-located with the node in the machine it lives, and it was brilliantly called
The below is the output of the command
nodetool info ran in a brand-new node with just a few records of data added.
The configuration file for the node is the
cassandra.yaml that lives under the
conf folder in the Cassandra distribution.
Partitions are an ordering scheme on disk and they allow one to take certain sets of data and order them.
The partition key is defined when creating a table. A primary key is composed by a partition key followed by clustering columns. If partition key is not unique, then we can add another column to enforce uniqueness of the primary key.
Let’s say Simon and Anna are from the Engineering department, John and Luise are from the People department and David is from the Support Desk department.
This data will end up on 3 different partitions, one for each department (ENG, PPL and DSK). Clustering columns provide ordering, which means that inside each partition, rows will be sorted by name in ascending order. The id then provides the uniqueness of the row.
The partitioner job is to translate a partition key into a partition token value. Internally, token values will be used to write, read and compact data.
To each node in a cluster there’s a token value range assigned. A node is responsible for holding the partitions having a token value inside its range.
What happens when we write to a node?
When we write data into a node, two structures are involved before returning back to the client: The Commit Log and the Memtable.
The first stop for data is the commit log, which ensures durability. If this node crashes, the state can be recovered from the log. This is an append-only log, which makes the writes fast.
The next stop for data is the Memtable. Memtable is a representation of the state in-memory. It lives in the heap by default but, in order to alleviate the pressure of GC (Garbage Collection), it can be configured to live off-heap by changing the value of the property
After data is written to the Memtable, the node acknowledges back to the client.
From a synchronous standpoint, that’s about it, and that’s why Cassandra is so famous for its write performance. If you access that node for retrieving data just after that, it’s there.
The commit log is only there for crash recovery. The source of truth for data inside that node is what’s called an SSTable. Those are generated from a flush operation that runs asynchronously from time to time.
When the SSTable is written, the commit log for that segment is removed. Unlike traditional relational databases, which are all about random reads and writes to a data file, the SSTables are immutable structures, which means that after being created, they are never modified.
When another flush happens, it will store another SSTable out of the Memtable. That may contain overlapping information for the same partition of the previous SSTable. From a reading standpoint, that’s solved using a LWW (Last Write Wins) approach by comparing timestamps.
Using spinning disks (HDD), it’s recommended to have the commit log and the SSTables living in separate disks, to avoid disk head contention. In general, SSDs are the recommended drives to use with Cassandra.
What happens when we read from a node?
Reads are a lot more complicated. From a read standpoint, let’s say we have a Memtable and one or more SSTables. When we ask for a partition of data, the Memtable is checked, which contains the most recent data, but it may not contain every piece of data in that partition, so the SSTable also needs to be checked.
An SSTable contains sequential data belonging to multiple partitions. Imagine what would be if every single read hits potentially multiple SSTables on disk, scans all of them to find the data on that partition so that it can be returned to the client. Slow, huh?
Associated to a SSTable, Cassandra has multiple data structures stored on heap, off-heap and disk to optimize the access to that SSTable.
- Row Cache
- Bloom Filter
- Partition Key Cache
- Partition Summary
- Partition Index
- Compression Offset Map
Starting from what’s more close to the SSTable, the compression offset map holds the offset information for compressed blocks, so it knows the exact location of data on disk.
The partition index stores a mapping between partition tokens and their offsets. The information is used within the compression offset map to find the block on disk containing the data.
The partition summary stores a mapping between a range of partition tokens into their position in the partition index.
The partition key cache is an ordinary cache for frequently accessed partition indexes. If a key is frequently accessed, and it’s stored in the key cache, the partition summary and index are bypassed and the compression offset map will be used.
The bloom filter is a simple probabilistic structure that is able to answer No or Maybe, regarding the existence of the partition on the underlying SSTable. If well tuned, it may have just a few false positives and can prevent a lot of files from being checked. Feel free to check this simple bloom filter implementation to have an idea how it works.
The row cache is an ordinary cache that stores the whole partition. It’s there for heavily accessed partitions. It’s optional, and it’s recommended only for read-intensive scenarios.
If a row is not in the row cache, all the bloom filters associated with the column family will be checked. Let’s say that multiple of them return “Maybe”.
Then, the partition key cache is checked to verify whether the key is mapped or not. If so, the compression map offset is checked and the data is returned from its location on disk. If not, the partition summary will be checked to find the position in the partition index so that the compression offset map can be checked to return the data from its location on disk.
Tuning read structures
To tune the bloom filter false positive probability one can set the
bloom_filter_fp_chance for a column family.
Caching (both row cache and key cache) can be enabled when creating/altering a table.
keys attribute can take the values
ALL. Those enable only row cache, key cache, none of them or both, respectively.
What happens when I delete data from a node?
When deleting data from a node, data is marked with a tombstone. Deletes follow pretty much the flow as writes, except that it’s a special write case where we mark data as deleted.
The tombstone has an expiration time after which it is considered expired. It’s then removed as part of the compaction process as described below.
Compaction is a clean-up process for the data. It merges SSTables, resolving conflicts by using the timestamp, and cleans up data marked with expired tombstones.
A new SSTable comes out of the merge of other two. It might still leave some tombstones that are not yet expired.
A single node does not suffice
True, a single node does not suffice, as well as a soccer player does not win the game alone. If our database needs to be high-available and fault-tolerant, we need more nodes working as a team, and these nodes must communicate with each other to share data as well as cluster metadata. I’ll be leaving that to another article.
A node in Cassandra is able to do pretty much everything regarding the store of data. There’s no special or leader nodes, although operations may require coordination, but any node can act as the coordinator. If we have a cluster, all nodes work as a team to make a healthy cluster, and that means no SPoF (Single Point of Failure).
A node has unique capabilities that were described in this article that make writes and reads fast. It splits data into partitions and stores them quickly by using a Commit Log (for durability) and a Memtable (for data access). The magic happens in the background when storing and compacting SSTables. As for the read flow, a set of data structures exist to make the reads out of SSTables fast, which makes Cassandra really popular for data storage.
I’ll be writing another article about the Cassandra Ring, or a cluster of nodes that work together to achieve a fault-tolerant and resilient storage.
This article was an introduction to the subject. For more details feel free to checkout Datastax Cassandra documentation, which is pretty well structured and complete.
Thanks for the time reading this, any feedback is appreciated.
subscribe via RSS