You might be wondering why I would write a post about Big Data when there are already plenty of them on the web. Whenever I discuss the concept with someone, the idea is always vague and most of the time the discussion revolves around the size of the data. The name does suggest it that way, but it's not only the size that matters.
Everyone talks about Big Data, but there is no single agreed-upon definition of it. Every big company ends up dealing with Big Data at some point. During my professional career I worked in the telco and betting industries. As you can imagine, these types of companies see different traffic shapes over time and store a huge amount of data that can be used to improve the customer experience, grow the customer base, and keep customers happy.
So I actually started hanging out with Big Data before it became this popular. But you might ask, as I once asked: what is it? Big Data is a fancy term, sometimes wrongly overused to describe data generically.
There are a lot of definitions of Big Data, so you should pick the one that helps you truly understand it. I do not intend to invent yet another definition here, so I'll use the one that suits me best.
What is it all about?
According to industry analyst Doug Laney, Big Data can be described in terms of its properties, known as the 3 V's of Big Data: volume, velocity and variety. From my perspective there is a fourth property worth adding: accuracy. So let's think a bit about each of these properties.
Volume
Volume is about the amount of data and how it grows over time. It is the first property that comes to mind when people think of Big Data. Every day, enormous amounts of data are generated by millions of people (e.g. when you make a voice call, post to your favourite social network, or store your pictures in the cloud).
Every single voice call or message record is stored in a database. Not the content of the call itself, but its details (e.g. duration, location, cost). This information can be used, for example, to suggest a cheaper rate plan if you send more messages than you make calls.
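As a minimal sketch of this idea (the field names and plan names are my own assumptions, not any telco's actual schema), a call detail record and a naive rate-plan suggestion might look like this:

```python
from dataclasses import dataclass

@dataclass
class CallDetailRecord:
    """Details of a call or message -- not its content (hypothetical fields)."""
    kind: str        # "call" or "sms"
    duration_s: int  # 0 for messages
    location: str
    cost: float

def suggest_plan(records: list[CallDetailRecord]) -> str:
    """Naive rule: more messages than calls suggests an SMS-heavy bundle."""
    calls = sum(1 for r in records if r.kind == "call")
    sms = sum(1 for r in records if r.kind == "sms")
    return "sms-bundle" if sms > calls else "voice-bundle"
```

A real system would of course weigh durations, costs and time windows, not just counts.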
Velocity
Velocity is about the pace at which data flows through a system. A telco charges millions of pre-paid calls in real time every day, meaning it not only has to cope with the huge number of calls made every day, but also to process every single one of them in real time, even during peak hours, at a proper pace for cost control, discount application, and so on.
Variety
Variety is about the diversity of data and of the sources feeding data into a system. Data may come from different sources in a wide variety of formats. Think of a system that collects every purchase made in different supermarkets so we can build a heat map of the most-bought food per location: you might end up with a lot of data arriving at great speed, but also in many different formats.
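To make the variety problem concrete, here is a small sketch (the formats and field names are invented for illustration) that normalizes purchase records arriving as CSV from one supermarket and as JSON from another into one common shape:

```python
import csv
import io
import json

def normalize_csv(text: str) -> list[dict]:
    """Parse 'store,item,qty' CSV rows into common purchase dicts."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"store": r["store"], "item": r["item"], "qty": int(r["qty"])}
            for r in rows]

def normalize_json(text: str) -> list[dict]:
    """Parse a JSON list of purchases into the same common shape."""
    return [{"store": p["shop"], "item": p["product"], "qty": int(p["amount"])}
            for p in json.loads(text)]

csv_data = "store,item,qty\nLisbon,bread,2"
json_data = '[{"shop": "Porto", "product": "cheese", "amount": 1}]'
purchases = normalize_csv(csv_data) + normalize_json(json_data)
```

Once everything shares one shape, the heat-map aggregation no longer cares where each record came from.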
Accuracy
I personally like to add accuracy because the data you have to deal with is not always accurate, in the sense that it might not reflect reality. Imagine a system that collects information based on non-automated (and thus non-deterministic) actions.
Taking the previous supermarket example: if the source of data is a customer tweet instead of the automatic output of the supermarket's cash register, you might not want to rely on it to build the heat map. Since not all customers will publish to Twitter, the final processed information would be bogus.
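One naive way to guard against inaccurate sources is to tag each record with its origin and only count records from deterministic sources when aggregating (the source names here are invented for illustration):

```python
from collections import Counter

# (location, item, source) tuples; "register" is automated, "tweet" is not
events = [
    ("Lisbon", "bread", "register"),
    ("Lisbon", "bread", "tweet"),
    ("Porto", "cheese", "register"),
]

TRUSTED_SOURCES = {"register"}

def heat_map(events):
    """Count purchases per (location, item), keeping only trusted sources."""
    return Counter((loc, item)
                   for loc, item, src in events
                   if src in TRUSTED_SOURCES)
```

The tweet about bread in Lisbon is simply dropped, so it cannot skew the counts.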
How to handle it?
If you are in the IT industry, you can easily hit the wall of Big Data and have to deal with it in some manner. Tools for dealing with data can generally be categorised into the following groups:
Communication
These tools are used to move data between the systems that will process it. The most popular are message brokers such as Kafka, ActiveMQ, ZeroMQ, RabbitMQ and Flume.
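The point of a broker is to decouple producers from consumers. The idea can be sketched in-process with Python's standard queue module (a real broker such as Kafka adds persistence, partitioning and distribution on top of this):

```python
import queue
import threading

broker = queue.Queue()  # stands in for a topic on a message broker

def producer():
    """Publish a few messages, then a sentinel meaning 'no more data'."""
    for i in range(3):
        broker.put(f"event-{i}")
    broker.put(None)

received = []

def consumer():
    """Consume messages until the sentinel arrives."""
    while (msg := broker.get()) is not None:
        received.append(msg)

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()
```

The producer never waits for the consumer, which is exactly the property that lets systems with different paces talk to each other.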
Processing
As part of the ETL (Extraction, Transformation, Loading) pipeline, data processing tools are responsible for computations over data such as transformation, filtering and aggregation. These tools are typically divided into batch processing (e.g. Hadoop), stream processing (e.g. Storm, Flink) and micro-batch processing (e.g. Spark).
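The difference between batch and stream processing can be illustrated with a toy aggregation (plain Python, not an actual Hadoop, Storm or Spark job): batch computes once over the full dataset, while streaming updates a running result as each record arrives:

```python
def batch_total(amounts: list[float]) -> float:
    """Batch: the whole dataset is available; one computation over all of it."""
    return sum(amounts)

def stream_totals(amounts):
    """Stream: records arrive one by one; emit the running total per record."""
    total = 0.0
    for a in amounts:
        total += a
        yield total

data = [10.0, 5.0, 2.5]
# batch_total(data) -> 17.5
# list(stream_totals(data)) -> [10.0, 15.0, 17.5]
```

Micro-batching sits in between: it runs the batch computation repeatedly over small windows of arriving records.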
Despite the name, size is not all that matters. When talking about Big Data, we also need to consider how fast the data flows through the system, where it comes from, in what format it arrives, and how reliable it is. Data can be handled by different tools, including processing, storage and communication tools.
One does not simply choose the technology to process, store and communicate data without understanding the data's characteristics and patterns; otherwise you might end up having to refactor your entire architecture later on, when the data reveals itself.