By Poulomi Damany, BitYota
It’s a new era for data warehousing. Enterprises now draw upon multi-source, semi-structured data to facilitate new and more useful products and services. These diverse data sets also enable business models built on a deeper understanding of customers and constituents, problems and phenomena.
Big Data encompasses social data (comments, likes, shares and posts); the Internet of Things (sensor data, motion detection, GPS); and advertising data (search, display, click-throughs) from billions of discrete sources. Leveraging all this information at the speed and versatility analysts and other users expect requires a rethinking of what data warehousing is all about -- beginning with these five key principles:
Leave No Data Behind
Today’s data warehouses can’t limit themselves to relational, "clean" data from well-structured sources like ERP systems and Oracle OLTP databases. To deliver the deep understanding that’s driving the Big Data revolution, enterprises must be able to bring in heterogeneous data from multiple sources, in multiple formats, speeds and sizes.
Alone, a structured data warehouse fed by traditional extract-transform-load (ETL) processes can’t reasonably keep up with data volumes, velocity and variety. Furthermore, such processes are expensive and time consuming to maintain.
In today’s world, it’s important to deal with data in its raw format, treating semi-structured formats such as JSON and XML as first-class citizens. Data warehouses must be able to not only bulk-load large data sets as is, but also scale linearly in a cost-effective manner as volumes and formats grow, without upfront planning. This ability preserves the richness of the data while also removing the need for (and cost of) writing custom code to collect, model and integrate data from various streaming and semi-structured sources before analysis.
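As a rough illustration of loading data "as is," the Python sketch below reads a feed of JSON lines without requiring a predefined schema. The `load_raw` helper, the `SAMPLE_FEED` records and their field names are hypothetical, invented for this example; a real warehouse loader would work at far larger scale.

```python
import json


def load_raw(lines):
    """Parse each non-empty line as JSON and keep every record unchanged."""
    records = []
    rejected = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        try:
            records.append(json.loads(text))
        except json.JSONDecodeError:
            # Keep unparseable input for later inspection instead of dropping it.
            rejected.append((line_number, text))
    return records, rejected


# Hypothetical feed mixing social, sensor and advertising events.
SAMPLE_FEED = [
    '{"source": "social", "user": "u1", "action": "like"}',
    '{"source": "sensor", "device": 17, "gps": {"lat": 37.4, "lon": -122.1}}',
    '{"source": "ads", "campaign": "c9", "clicks": 3}',
    'not json at all',
]

records, rejected = load_raw(SAMPLE_FEED)
```

Note that the three valid records have different fields and nesting, yet none of them is reshaped or discarded at load time; only the malformed line is set aside.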
A Real-Time, Fresh Data World
Today’s business analysis can’t be accomplished with yesterday’s data. Enterprises need insights based on real-time or near-real-time information. And that’s a challenge.
Big Data solutions like Hadoop attempt analytics mostly through batch processing. This requires MapReduce jobs that must be created, submitted and processed to produce an output file. That’s a non-starter when you want fast answers from quickly streaming data, or when you need to iterate quickly on ad-hoc, exploratory analytics.
Companies can unlock significant value by making information usable at much higher frequency, be it through more personalized services or solutions that support faster decision-making in order to adjust business levers just-in-time. The new data warehouse should be able to support fast, interactive queries at an affordable price/performance ratio -- even those that require the scanning and joining of data across billions of records and multiple data sources. When that is possible, "what if" questions take on a whole new dimension -- and can yield insights that batch processing would deliver too late to act on.
Emphasize Data Discovery Before Modeling
From their inception, traditional data warehouses operate on a single data design model -- a requirement to which all future data must conform in order to be loaded and analyzed. Unfortunately, Big Data structures vary, sources change, and the "insight value" of various data stores is often unknown at the time the data is collected.
New data warehouses must allow for discovery of heterogeneous data for analysis without the expense of first normalizing, cleansing and modeling the data. Analysts need to be able to ask intelligent questions -- using multiple values, conditions and ranges -- on all this un-normalized, undefined data in order to understand its true nature and value. Only then can a model/schema be designed to generate KPIs and related reporting.
That’s not all. To be effective, data warehouses should be able to support query optimizations on polymorphic data, while delaying analysis of the data structure until runtime. (Extra points if the solution is able to handle constrained I/O and flaky networks.)
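A minimal Python sketch of discovery before modeling might look like the following. It first profiles which fields and value types actually occur, then runs a range query that resolves each value's type only at query time. The function names, the `EVENTS` sample and the `price` field are assumptions made for this illustration, not part of any particular product.

```python
from collections import Counter, defaultdict


def profile_fields(records):
    """Count, per field name, which value types appear across the records."""
    profile = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            profile[field][type(value).__name__] += 1
    return {field: dict(counts) for field, counts in profile.items()}


def as_number(value):
    """Read a polymorphic value as a float at query time, or return None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return None
    return None


def where_between(records, field, low, high):
    """Return records whose field, read as a number, lies in [low, high]."""
    matches = []
    for record in records:
        number = as_number(record.get(field))
        if number is not None and low <= number <= high:
            matches.append(record)
    return matches


# Hypothetical events in which "price" is a number, a string, null or absent.
EVENTS = [
    {"user": "u1", "price": 19.99},
    {"user": "u2", "price": "24.50"},
    {"user": "u3", "price": None, "coupon": "SPRING"},
    {"user": "u4"},
]

field_profile = profile_fields(EVENTS)
mid_priced = where_between(EVENTS, "price", 15, 25)
```

The profile shows the analyst that `price` arrives in several shapes before any schema is committed to, and the range query still returns both the numeric and the string-encoded prices that fall within bounds.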
SQL Is the Lingua Franca of Analytics
Without a data model in place, you can’t create a relational database. As a result, most data resides in file systems on cheap disk. That, in turn, means you can no longer analyze this data using well-understood and widely used SQL-based tools.
SQL is the language of choice for analytics because it’s declarative in nature, allowing users to focus on their queries and not worry about the underlying data access and retrieval. Without SQL, enterprises are forced to spend precious resources on a Hadoop or Netezza whisperer whose sole job is to write custom code, or tweak the inner workings of arcane technology just to collect and warehouse data. Furthermore, analysts have to discard their preferred statistical tools and learn yet another analytical grammar and syntax just to do their jobs.
Data warehouses should be able to run SQL queries against data of an unknown nature, residing in an unstructured file system. They must be able to adapt to the existing ecosystem of data tools and programming languages so that analysts can use their favorite SQL-based business intelligence (BI) apps and user-defined functions.
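To make the idea concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse engine. Raw JSON documents are stored as text, and a user-defined function lets ordinary SQL reach into them. The `doc_field` function, the `raw_events` table and the sample documents are hypothetical names chosen for this example.

```python
import json
import sqlite3


def doc_field(document, key):
    """SQL user-defined function: return one top-level field from JSON text."""
    try:
        value = json.loads(document).get(key)
    except (TypeError, ValueError, AttributeError):
        # Null input, invalid JSON or a non-object document yields NULL.
        return None
    # SQLite stores only scalars, so nested values are returned as JSON text.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


# Hypothetical raw documents with differing fields.
RAW_DOCS = [
    '{"source": "ads", "campaign": "c9", "clicks": 3}',
    '{"source": "ads", "campaign": "c7", "clicks": 5}',
    '{"source": "social", "user": "u1", "action": "like"}',
]

conn = sqlite3.connect(":memory:")
conn.create_function("doc_field", 2, doc_field)
conn.execute("CREATE TABLE raw_events (doc TEXT)")
conn.executemany(
    "INSERT INTO raw_events (doc) VALUES (?)", [(doc,) for doc in RAW_DOCS]
)

rows = conn.execute(
    """
    SELECT doc_field(doc, 'source') AS source,
           COUNT(*) AS events,
           SUM(doc_field(doc, 'clicks')) AS clicks
    FROM raw_events
    GROUP BY source
    ORDER BY source
    """
).fetchall()
```

The analyst writes a familiar `GROUP BY` query, and the field lookup happens inside the function at query time, so no table schema for `source` or `clicks` has to exist beforehand.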
Data Analysis Shouldn’t Cost a Fortune
Traditional data warehouses were designed for small amounts of well-structured, relational data. They run on older SMP-style computing infrastructures with proprietary backplanes. Throw high volumes and speeds at them, and costs escalate quickly.
Newer MPP-style analytics platforms using columnar storage can handle high volumes of data with more predictable performance, but their engineered systems don’t come cheap either. To adapt to today’s high-demand environment, some enterprises are moving to a data warehouse service. Such a service handles the three Vs (volume, velocity and variety) while running in the cloud on commodity hardware, with the flexibility to adapt as users and data sources grow.
Big Data is already taking us to places we’ve never been before. Everyone, from data scientists and engineers to SQL-savvy analysts and businesspeople at enterprises of all sizes, needs a data warehouse that allows them to understand their data, build better products and services, and create new avenues of growth and productivity. By instituting a data warehousing strategy that supports the rise in Big Data, enterprises can stay in step -- and, at the same time, create opportunities that benefit us all.
Poulomi Damany is VP/Products at BitYota.