We have been witnessing a paradigm shift across many companies to the world of data streaming. But is that just hype, or is there a stronger reason behind it?
The traditional approach to web application architecture is a 3-tier design: some clients, a (typically stateless) backend service that holds the business logic and scales out easily, and a database.
The stateless nature of the backend service means that it does not have to remember anything; instead, it delegates that responsibility to the database. This makes it easy to deploy and scale by adding commodity hardware.
The database is always something one needs to care about when scaling out the backend service, because it must handle the requests coming through. A lot of technologies and paradigms have emerged to help us move from a central, hard-to-scale data store to something that also scales out along with the services, although sometimes sacrificing or trading off important properties like ACID (atomicity, consistency, isolation, and durability).
So any change in the underlying data infrastructure should be carefully thought through alongside the functional and non-functional requirements for our product.
And even then, this is still a simple 3-tier approach; we are just making things more or less scalable by using the right tool for the job.
From simple 3-tier approach to operational hell
If you work on large-scale data systems you know, as I do, that it soon stops being this simple: the backend service starts feeding data to a variety of other systems that you need for multiple purposes (full-text search, analytics, etc.).
Such an architecture is based on dual writes, where the application itself writes each change to every system, and it leads to data inconsistencies that are difficult, if not impossible, to recover from.
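To make the dual-write problem concrete, here is a minimal sketch. The two "systems" are plain dictionaries and the helper `apply_in_order` is hypothetical; the only assumption is that two concurrent writes to the same key may reach each destination in a different order.

```python
def apply_in_order(writes):
    """Apply (key, value) writes in arrival order; the last write wins."""
    store = {}
    for key, value in writes:
        store[key] = value
    return store

# Two clients update the same key at roughly the same time.
write_a = ("x", "A")
write_b = ("x", "B")

# Each client writes to both systems itself, so nothing guarantees that
# the two writes arrive in the same order at both destinations.
database = apply_in_order([write_a, write_b])
search_index = apply_in_order([write_b, write_a])

print(database)      # {'x': 'B'}
print(search_index)  # {'x': 'A'}
```

Both writes succeeded everywhere, yet the two systems now disagree permanently, and neither of them holds enough information to tell which value is the right one.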
Based on my own experience with data architectures and the brilliant work of Martin Kleppmann and Ben Stopford, this post intends to show what’s behind streaming architectures and why the shift is happening naturally in many businesses.
A very common pattern for data integration is to push the data into a platform that is then responsible for processing it in a certain way (e.g. filtering, transformation, enrichment). The result is then pushed into different systems (e.g. calling a REST endpoint, feeding a database or some sort of analytics filesystem).
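As a rough illustration of that filter, transform and enrich pattern, the sketch below chains Python generators. The event shape, the stage names and the `countries` lookup table are all made up for the example; a real platform would read from and write to external systems.

```python
def only_purchases(events):
    """Filtering: drop every event that is not a purchase."""
    for event in events:
        if event["type"] == "purchase":
            yield event

def to_cents(events):
    """Transformation: add the amount expressed in cents."""
    for event in events:
        yield {**event, "amount_cents": round(event["amount"] * 100)}

def enrich_with_country(events, countries):
    """Enrichment: look up extra data that the event does not carry."""
    for event in events:
        yield {**event, "country": countries.get(event["user"], "unknown")}

def run_pipeline(events, countries):
    # Stages are lazy, so each event flows through all of them one at a time.
    stages = enrich_with_country(to_cents(only_purchases(events)), countries)
    return list(stages)

events = [
    {"type": "purchase", "user": "ana", "amount": 12.5},
    {"type": "page_view", "user": "bob"},
    {"type": "purchase", "user": "bob", "amount": 3.0},
]
countries = {"ana": "PT"}

results = run_pipeline(events, countries)
for result in results:
    print(result)
```

The final `list(...)` stands in for the push to a destination; swapping it for a call to a database or an HTTP endpoint would not change the stages themselves.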
Stream processing is not a new concept at all. Our CPUs have been doing that for ages to process native instructions through pipelines.
More recently, despite having a totally different purpose, Akka actors and Go’s goroutines follow the same concept of sharing data between multiple units of execution through communication. Please feel free to check out a simple data processing pipeline implemented on top of Go channels and goroutines.
A simple metaphor
If we think about human reasoning, we get a lot of data from the environment surrounding us, we process it using our brain, and we take actions based on our reasoning over that data. From that perspective, assuming time travel is not possible, or at least a temporal do-over is not possible, our life is simply a sequence of immutable events or facts that happened at some point in time during our life stream.
The result of that reasoning is a knowledge base that can be used for totally different purposes. We can use it to teach other people, we can use it to answer questions when somebody asks us, we can simply discard it, we can enrich it by correlating that knowledge with other past events, or we can use it simply to learn and improve our knowledge of ourselves and the world surrounding us.
It’s not likely that we have a big database inside our brains. We have a long-term memory for historic lookups, a short-term memory for quick lookups, a bunch of other knowledge structures that we use in order to be versatile when we speak to different people or just to optimise our own life, and a set of neurons that work together to make the information flow to these structures.
write once, read multiple times
One thing we may think we have is an event stream. It’s common that, when we lose something like our car keys (similar to an inconsistency or state loss), we replay the past set of events leading up to the last time we saw the keys, in order to recover them.
So this seems pretty much the natural way of processing everything.
What about databases?
One may argue that many database-backed systems don’t do any such thing and still have been around for ages, working pretty well.
TL;DR they actually do that (sort of), hidden behind the database itself.
Databases, whether they are traditional relational databases, NoSQL or whatever, are a big abstraction whose only purpose is to store data in some format for later use. For that they offer extremely powerful capabilities and APIs for getting the data, joining different pieces of it, aggregating results, etc.
Different technologies differ mostly in the capabilities and guarantees they provide around this simple idea of remembering state. If we check out the anatomy of a database, depending on the technology, there’s a high chance that we will find what’s described below.
Write ahead log
To favor write throughput, many databases don’t flush state to disk immediately. To allow for crash recovery, they first write instructions to a WAL (Write Ahead Log), which only supports appends, so the performance is pretty much limited by the sequential write throughput of the disk.
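The idea can be sketched with a toy key-value store. This is not how any particular database implements its WAL: `TinyStore`, the JSON record format and the file name are assumptions made for illustration. Every change is appended to the log before the in-memory state is touched, and recovery simply replays the log from the beginning.

```python
import json
import os
import tempfile

class TinyStore:
    """Toy key-value store: log first, then update the in-memory state."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.state = {}
        self._recover()

    def _recover(self):
        # Rebuild the in-memory state by replaying the log from the start.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                self._apply(json.loads(line))

    def _apply(self, record):
        if record["op"] == "set":
            self.state[record["key"]] = record["value"]
        elif record["op"] == "delete":
            self.state.pop(record["key"], None)

    def _log(self, record):
        # Append-only, sequential write; fsync so the record survives a crash.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())

    def set(self, key, value):
        record = {"op": "set", "key": key, "value": value}
        self._log(record)
        self._apply(record)

    def delete(self, key):
        record = {"op": "delete", "key": key}
        self._log(record)
        self._apply(record)

wal_path = os.path.join(tempfile.mkdtemp(), "store.wal")
store = TinyStore(wal_path)
store.set("a", 1)
store.set("b", 2)
store.delete("a")

# Simulate a crash: the in-memory state is gone, only the log file remains.
recovered = TinyStore(wal_path)
print(recovered.state)  # {'b': 2}
```

A real database would also checkpoint and truncate the log so that recovery does not have to start from the very first record; that part is left out here.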
For high-availability and scaling purposes, databases have the concept of replicas. If we think of a leader/follower approach, writes are only processed by the leader node to ensure consistency, while a set of read-only replicas is available to serve that very same data.
Those replicas must be kept consistent (or at least eventually consistent). This is typically done by having a replication log with raw instructions (inserts, updates, deletes) that the replicas apply in order to stay consistent with the leader. In some cases this log may be the same as the above-mentioned WAL.
Indexes, Caches and Materialized Views
Before SSDs, hard drives were labeled as slow, which is a misconception. Disks were only perceived as slow because most of the time they were accessed randomly. But hard drives are optimised for sequential access, and so are most filesystems.
For faster lookups, indexes were created so we can quickly know in which position or range of positions we may find the records we want.
Caches are a way of keeping our data (or part of it) in memory, because RAM is faster for random access than disks.
Materialized views, in turn, are simply pre-computed queries stored on disk (hence, materialized). This pre-computation allows one to expose the information in advance in a way that is optimised for some kind of lookup.
What these structures have in common is that they are used as “tricks” to optimise reads, but they need to be kept up to date each time a record is updated. That penalises writes in order to favor reads. If the cost isn’t paid in write performance (and thus availability), it is paid in data consistency. So it’s always a trade-off.
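A small sketch of that trade-off, using in-memory dictionaries in place of on-disk structures. The `Orders` class and its fields are hypothetical: one dictionary is the primary storage, one plays the role of an index, and one plays the role of a materialized view.

```python
from collections import defaultdict

class Orders:
    """Primary storage plus two read-optimised structures derived from it."""

    def __init__(self):
        self.rows = {}                              # order_id -> row
        self.by_customer = defaultdict(set)         # index: customer -> order ids
        self.total_by_customer = defaultdict(int)   # view: customer -> sum of amounts

    def upsert(self, order_id, customer, amount):
        # Every write has to touch all derived structures, not just the row.
        old = self.rows.get(order_id)
        if old is not None:
            self.by_customer[old["customer"]].discard(order_id)
            self.total_by_customer[old["customer"]] -= old["amount"]
        self.rows[order_id] = {"customer": customer, "amount": amount}
        self.by_customer[customer].add(order_id)
        self.total_by_customer[customer] += amount

orders = Orders()
orders.upsert(1, "ana", 10)
orders.upsert(2, "ana", 5)
orders.upsert(1, "ana", 7)  # an update of order 1, not a new order

# Reads are now simple lookups instead of scans or aggregations.
print(sorted(orders.by_customer["ana"]))   # [1, 2]
print(orders.total_by_customer["ana"])     # 12
```

Doing the extra bookkeeping inside `upsert` is what slows writes down; deferring it to a background job would speed writes up at the price of reads seeing stale totals for a while.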
So, really, what about databases?
Internally, what databases are doing is a rudimentary form of a Kappa architecture, but living inside a monolithic abstraction, which makes it hard to scale.
Simply because reads and writes share the same schema, we always need to trade off between the two.
If we could just have access to that internal log…
Database logs are seen by most databases as an internal implementation detail. If we could access that raw set of instructions, it would be simpler to implement a mechanism to fill our multiple destinations without any kind of concurrency or consistency problems, because we could always replay the log to “recover our lost car keys”.
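Here is a minimal sketch of what access to such a log would buy us. The log is a plain Python list with a made-up record format, and the two builder functions are hypothetical destinations (a key-value store and a lookup index by city). Both are pure functions of the same ordered log.

```python
log = [
    {"op": "set", "key": "user:1", "value": {"name": "Ana", "city": "Porto"}},
    {"op": "set", "key": "user:2", "value": {"name": "Bob", "city": "Lisbon"}},
    {"op": "set", "key": "user:1", "value": {"name": "Ana", "city": "Lisbon"}},
    {"op": "delete", "key": "user:2"},
]

def build_key_value_store(log):
    """Destination 1: the current value for every key."""
    store = {}
    for record in log:
        if record["op"] == "set":
            store[record["key"]] = record["value"]
        else:
            store.pop(record["key"], None)
    return store

def build_city_index(log):
    """Destination 2: city -> sorted list of the keys currently in that city."""
    city_of = {}
    for record in log:
        if record["op"] == "set":
            city_of[record["key"]] = record["value"]["city"]
        else:
            city_of.pop(record["key"], None)
    index = {}
    for key, city in sorted(city_of.items()):
        index.setdefault(city, []).append(key)
    return index

store = build_key_value_store(log)
city_index = build_city_index(log)

# Losing a destination is not fatal: replaying the log rebuilds it.
city_index = None
city_index = build_city_index(log)

print(store)       # {'user:1': {'name': 'Ana', 'city': 'Lisbon'}}
print(city_index)  # {'Lisbon': ['user:1']}
```

Because every destination reads the same records in the same order, the destinations cannot drift apart the way they do with dual writes.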
While data streaming technologies like Kafka allow us to achieve such a goal by treating the log as a key element in the architecture, I’ll show below that it’s sometimes useful or convenient (e.g. when we need transactions) to keep the database as the source of truth and collect data changes from it in order to feed the multiple destinations.
More recently, efforts have been made to expose the internal log of databases. A good example is PostgreSQL’s Logical Decoding feature, which exposes a totally ordered set of the successful transactional changes that happened on the database, by presenting the internal log in an enhanced way.
This is good because many times we don’t have the budget or opportunity to redesign our internal system and use a different database or move into a full streaming solution.
The streaming paradigm shift
What we can see these days is basically a database deployed as a set of smaller components that each do fewer things very well and together form what used to be a big monolith for storing data:
- A data stream (e.g. Kafka).
- A stream processing layer (e.g. Kafka Streams, Storm, Spark).
- A set of destinations to serve multiple use cases (e.g. ElasticSearch, Hadoop, PostgreSQL, Cassandra).
These products allow us to build scalable and robust solutions around the data our business cares about.
For us to be able to use our data stream as the source of truth, we cannot afford to expire data by time. Kafka supports this through its log compaction feature, which retains at least the latest value for each key instead of discarding records by age.
Kafka splits each log partition into segments. Each time you produce to a Kafka partition, the data is appended to the end of the active segment. Depending on the configuration, the segment file is rolled and a new one is created (as we traditionally do with application logs). The inactive segments become eligible for compaction, i.e., removing older records whose key has a more recent value. This way Kafka may as well act as a simple key-value store.
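The effect of compaction can be modelled in a few lines. This is a simplified model of the idea, not Kafka’s implementation: the `compact` function and the list of `(key, value)` tuples are assumptions, and details such as delete markers and the preservation of original offsets are ignored.

```python
def compact(segment):
    """Keep only the most recent record for each key, preserving log order."""
    last_position = {}
    for position, (key, _value) in enumerate(segment):
        last_position[key] = position
    return [
        record
        for position, record in enumerate(segment)
        if last_position[record[0]] == position
    ]

segment = [
    ("k1", "a"),
    ("k2", "b"),
    ("k1", "c"),
    ("k3", "d"),
    ("k2", "e"),
]

compacted = compact(segment)
print(compacted)        # [('k1', 'c'), ('k3', 'd'), ('k2', 'e')]

# Reading the compacted log end to end yields the latest value per key,
# which is why it can double as a simple key-value store.
print(dict(compacted))  # {'k1': 'c', 'k3': 'd', 'k2': 'e'}
```

Replaying the compacted log produces the same final state as replaying the full one, only with less data to read.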
If we place a processing layer downstream (using Kafka Streams, Storm, or whatever fits your use case), we get a solution that scales well and which can feed our multiple destinations from that source of truth, and recover them easily if necessary by replaying the stream, avoiding the inconsistency recovery problems inherent to the dual-writes done at the source.
And what about transactions?
This may all smell like roses, but what about transactions? Some applications simply can’t discard transactional behaviour just because it’s faster to append to the end of the stream. Mission-critical systems cannot afford it. Multiple solutions may be used to achieve such a goal.
We can use some sort of transactional layer built on top of the stream. For a Kafka-based approach, please refer to KIP-98 - Exactly Once Delivery and Transactional Messaging as an example.
An alternative that keeps your database (almost) untouched is to use Change Data Capture (CDC), which lets us use a traditional database as before and capture the instructions issued to mutate its data.
Some databases already simplify this process. Examples are:
Having the same database schema for writes and all sorts of complex queries is what gives rise to the assumption that databases are slow.
One tool to rule them all
We have seen how streaming has been part of our lives (literally) forever and how databases implement a similar pattern.
By having an immutable sequence of events with a fixed order, we can greatly simplify the way we access it. That’s what Kafka does, and the Kafka Streams API helps us deal with that data easily.
Picking up from the first sentence of this post, what we are witnessing is a gradual extraction of the elements that form a database, to be used as independent, scalable and robust components that work together in a data processing flow.
Multiple right tools to rule them well
Is this replacing other common architectures based on micro-services, JMS queues, and so on? In my opinion the answer is no. Instead, all these elements can complement each other to serve different use cases. We just need to use the right tool for the job.
Happy new year everyone!