The term Big Data is on everyone’s lips these days. It describes the phenomenon that companies and public organizations in particular have an ever-increasing amount of data at their disposal, which pushes traditional databases to their limits.
Definition of Big Data
The Gartner IT Dictionary defines Big Data as follows:
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
The 4 V’s of Big Data
Although it is difficult to capture exactly what makes a Big Data system “big”, four concepts are commonly used to identify such systems. Because each of their names begins with the letter V, they are also referred to as the 4 V’s of Big Data:
- Volume: Big Data applications at companies such as Netflix, Amazon, or Facebook comprise enormous amounts of data, measured in terabytes, petabytes, or even exabytes. In some cases, thousands of machines are required to store and process such large amounts of data. In addition, the data is replicated across machines for fault tolerance so that it remains accessible when a machine fails, which increases the stored volume even further.
- Variety: The data not only comes from a variety of sources (e.g., image or audio data) but also has a wide variety of structures. It must be converted into compatible formats so that data from different sources can be used together. For example, the sources must agree on a uniform schema for describing the data.
- Velocity: Velocity refers to the speed of data processing. Typical Big Data systems store and manage large amounts of data at ever-increasing speeds. The speed at which new data is generated, modified, and processed is a challenge. Users of social networks such as Twitter, Facebook, or YouTube are constantly producing new content. This includes not only millions of tweets posted every hour, but also tracking information, for example, the number of views or users’ GPS data.
- Veracity: Human-produced data can be unreliable. Posts on social networks or blogs can contain incorrect information, contradictions, or just plain typos. All of this makes it difficult for algorithms to extract value from the data. The challenge is to identify which data is trustworthy and which is not. Algorithms are used to measure data quality and perform data cleansing steps.
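The points on Variety and Veracity can be illustrated with a minimal Python sketch. It assumes two made-up sources that deliver the same kind of record under different field names, maps both onto one uniform schema, and then applies toy quality rules. The `Post` schema, the field names, and the rules are all hypothetical and only serve as an illustration, not as a recommended cleansing procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    """Hypothetical uniform schema shared by all sources."""
    user: str
    text: str
    likes: Optional[int]  # None when the source did not report a value


def from_source_a(raw: dict) -> Post:
    # Assumption: source A uses the keys "username", "message", "like_count".
    return Post(user=raw["username"], text=raw["message"], likes=raw.get("like_count"))


def from_source_b(raw: dict) -> Post:
    # Assumption: source B uses the keys "author", "body", "hearts".
    return Post(user=raw["author"], text=raw["body"], likes=raw.get("hearts"))


def is_trustworthy(post: Post) -> bool:
    # Toy quality rules: non-empty text and a plausible (non-negative) like count.
    if not post.text.strip():
        return False
    if post.likes is None or post.likes < 0:
        return False
    return True


def quality_ratio(posts: List[Post]) -> float:
    # Share of records that pass the quality rules (0.0 for an empty input).
    if not posts:
        return 0.0
    return sum(is_trustworthy(p) for p in posts) / len(posts)


# Invented sample records from both sources, converted to the uniform schema.
posts = [
    from_source_a({"username": "anna", "message": "Hello world", "like_count": 3}),
    from_source_a({"username": "ben", "message": "   ", "like_count": 1}),
    from_source_b({"author": "cara", "body": "Nice video", "hearts": -5}),
    from_source_b({"author": "dan", "body": "Great talk", "hearts": 12}),
]

clean = [p for p in posts if is_trustworthy(p)]  # data cleansing step
print(quality_ratio(posts))  # 0.5
```

Real systems would use far richer rules, but the structure stays the same: first convert every source into the agreed schema, then measure and filter quality on the unified records.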
Depending on the literature, Big Data is defined with only three V’s, namely Volume, Velocity, and Variety. Other definitions mention even more V’s. One example is “Value”, which means that Big Data should be used to extract meaningful value from the data, e.g., by applying machine learning algorithms.
Where does the data come from?
In traditional information systems (such as those used in banks or insurance companies), the data was mainly collected by the company’s employees. In Big Data applications, the data originates from more diverse sources. Today, a wide variety of data is generated in almost all industries and at companies of every size. In addition to digitally active companies, the manufacturing sector also collects information from a wide variety of sources:
- “Classic” data: This is data that companies are required to collect by law anyway, or have been collecting for some time out of their own business interest. This includes, for example, all information about an order that is found on an invoice (order number, sales amount, customer, products purchased, etc.).
- Multimedia sources: Videos, music, voice recordings, and multimedia documents such as presentation slides are even more difficult to analyze than textual input. Proper preprocessing of these input formats is one of the most important steps required to store such data. Simple image preprocessing algorithms can be used to extract the size or main color of an image. More complex algorithms that use machine learning techniques can identify what is in an image or who the people in an image are.
- Sensor data and other data for monitoring: Servers, smartphones, and many other devices produce so-called log entries during use. A web server logs every single request for a web page. Such a log entry contains a lot of information about the visitor, for example the IP address, from which the country and city can be approximated, as well as the browser and operating system. Further details such as the screen resolution are typically collected by client-side tracking scripts rather than by the server itself. This makes it possible to analyze click behavior, the length of time spent on certain web pages, and whether a visitor is a returning or a new visitor. Smartphones contain sensors that collect data such as GPS position or battery status. A proximity sensor and the gyroscope can be used in combination to detect whether the phone is in a pocket, whether the user is holding it, or whether it is on a desk.
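To make the web server example more concrete, the following Python sketch parses a few log lines and counts page views and repeat visitors. It assumes the widely used Combined Log Format of Apache and nginx. The sample lines are invented, and treating an IP address that appears more than once as a returning visitor is a deliberately crude assumption, since real analytics rely on cookies or sessions.

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample data; the IP addresses come from ranges reserved for documentation.
lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '198.51.100.23 - - [10/Oct/2023:13:56:01 +0000] "GET /pricing.html HTTP/1.1" 200 1780 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /pricing.html HTTP/1.1" 200 1780 '
    '"https://example.com/index.html" "Mozilla/5.0 (X11; Linux x86_64)"',
    'not a log line',
]

# Skip lines that do not match the expected format.
entries = [e for e in map(parse_line, lines) if e is not None]

views_per_path = Counter(e["path"] for e in entries)
requests_per_ip = Counter(e["ip"] for e in entries)

# Crude heuristic: an IP with more than one request counts as a repeat visitor.
repeat_ips = sorted(ip for ip, n in requests_per_ip.items() if n > 1)

print(views_per_path["/pricing.html"])  # 2
print(repeat_ips)  # ['203.0.113.7']
```

Browser and operating system are not separate log fields. They would have to be derived from the `user_agent` string in a further step.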
This is what you should take with you
- Big Data refers to high-volume, high-velocity, and/or high-variety information assets. Processing them requires new forms of information processing.
- Big Data can be characterized by the so-called 4 V’s: Volume, Variety, Velocity, and Veracity.
- In many cases, the data originates from classic business data, multimedia sources, or monitoring data (such as sensor data and logs).