Data Virtualization and Big Data Business Intelligence

by Wayne Kernochan

Data virtualization could be the glue that ties together data warehouses and social media data.

When I set out to write this article, I thought I could assume a basic understanding of data virtualization (DV) in readers and focus on the neat new benefits to agile business intelligence from using DV to combine Big Data with the data warehouse dynamically. Then I found, reading what's out there on the 'Net, that there's still a lack of understanding of what data virtualization really is and what it does. For example, as recently as a couple of months ago, Wikipedia was unsure of the relationship between data virtualization and "Enterprise Information Integration" (EII), which it said "failed in the market."

So before I note data virtualization's upcoming benefits, I am going to ask the reader to review the definition, history and existing benefits of DV, briefly, with me. I promise: the review will help.

Data, data everywhere

The basic aim of data virtualization, whose first solutions arrived around 2001, is to allow users to query across differing data sources in real time. That means that any DV solution needs three brand spanking new (in 2001) technologies:

1. A way to gather data, in real time, from any data store accessed by different vendors' databases, file systems, or applications;

2. A global metadata repository that shows not only what data is out there, anywhere, but also the relationships between the data in the various data stores;

3. A common format across any and all data types, so that when DV combines the data it can present the relationships between the data to the end user, not just differing formats side by side.
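To make the three technologies listed above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any DV product is actually built: the in-memory SQLite table stands in for a data warehouse, the list of posts stands in for a social media feed, and every name in it (`MetadataRepository`, `virtual_join`, `read_warehouse`, `read_social_feed`, the source names) is hypothetical.

```python
import sqlite3
from dataclasses import dataclass, field

# 1. Source adapters: each reads its own kind of store at call time and
#    returns rows in one common format (a list of plain dicts).
def read_warehouse(conn):
    """Hypothetical relational source, stood in for by in-memory SQLite."""
    rows = conn.execute(
        "SELECT customer_id, name FROM customers ORDER BY customer_id")
    return [{"customer_id": cid, "name": name} for cid, name in rows]

def read_social_feed(feed):
    """Hypothetical non-relational source: a list of JSON-like posts."""
    return [{"customer_id": p["about"], "text": p["body"]} for p in feed]

# 2. Global metadata repository: records which sources exist and how
#    they relate to each other (here, a shared key name).
@dataclass
class MetadataRepository:
    sources: dict = field(default_factory=dict)        # name -> reader
    relationships: list = field(default_factory=list)  # (left, right, key)

    def register(self, name, reader):
        self.sources[name] = reader

    def relate(self, left, right, key):
        self.relationships.append((left, right, key))

# 3. Combination in the common format: the join uses the recorded
#    relationship and reads both sources only when the query is issued.
def virtual_join(repo, left, right):
    key = next(k for l, r, k in repo.relationships if (l, r) == (left, right))
    left_rows = repo.sources[left]()
    right_rows = repo.sources[right]()
    return [{**lrow, **rrow}
            for lrow in left_rows
            for rrow in right_rows
            if lrow[key] == rrow[key]]

# Example wiring with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
feed = [{"about": 1, "body": "Love the new Acme widget"}]

repo = MetadataRepository()
repo.register("warehouse.customers", lambda: read_warehouse(conn))
repo.register("social.posts", lambda: read_social_feed(feed))
repo.relate("warehouse.customers", "social.posts", "customer_id")

combined = virtual_join(repo, "warehouse.customers", "social.posts")
```

Because the readers run at query time, a post added to `feed` after the wiring is done shows up in the next call to `virtual_join` with no reload step. A real DV layer would also push filters down to each source rather than reading everything, which this sketch does not attempt.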

And that has been the core of DV's value proposition. But not the whole story. Because, by definition, data virtualization aims to be (to stretch a much-abused word) "agile." That is, its ongoing value lies in its ability to keep pace with the proliferating number of data types out there in the world. A data warehouse, or a file system, achieves performance above all by refining its ability to process a particular type of data. DV piggybacks on this performance, but focuses on its own performance improvements in combining data types where the existing database has not done so. Over time, the "rich have become richer": the gap between what's stored in a data warehouse and a zettabyte's worth of a wide array of other data types stored all over the world has become ever wider, and DV continues to bridge that gap and assemble a richer and richer set of combined data and metadata.

But that is by no means the only DV value delivery that has surfaced since it arrived, because it turns out that data virtualization is a superb "Swiss army knife." You can take any of the three technologies I cited above and use it for other purposes, as well, simultaneously. You can combine DV as a whole with other infrastructure software, especially data management software, and create a full enterprise database architecture or global data architecture that makes everything look like it's in one consistent, real-time-data-available "virtual" database, complete with common XQuery data access. Here are a few more cute things you can do, with some tweaking:

  • Data discovery – discover all the data you didn't know you had in your enterprise, and store their relationships in your very own global metadata repository;

  • Master data management (MDM) – store at least some of the combined common-format data in a permanent data store;

  • Real-time beyond-warehouse business intelligence – combine queries of data types not in the data warehouse, or stuff not yet in the data warehouse, with data warehouse queries;

  • Merge corporate data as you merge corporations, immediately, without having to try to physically move and merge the databases and all their applications;

  • And, of course, relate social media Big Data to data warehouse data without the major risks of downtime, poor performance, and inaccurate data that come with trying to move the Big Data into the data warehouse.

This last, of course, is what data virtualization vendor products such as Composite Software's Composite Information Server 6 are now offering. The rest is the bulk of actual uses of data virtualization, because vendors such as IBM put their DV technology inside their MDM, BI, "information server", "global metadata repository", and "data integration" solutions.

Now we come to the confusion about definitions. Originally, data virtualization was called "Enterprise Information Integration" – actually, a better description of its full capabilities, although it failed to capture the ability to access the data in real time. Over time, most of the original DV companies were bought by large infrastructure-software vendors, who continue to offer their products as standalones but whose major sales are as part of larger products – for example, Metamatrix to Red Hat, or Venetica to IBM. Then, when virtualization became the rage, other, smaller DV companies like Composite Software or Denodo Technologies were able to point out that, as noted above, data virtualization technology lets you mimic one gigantic "virtual" database, and EII was successfully re-christened as Data Virtualization. As you can see from the above, calling it data virtualization is completely justified by what DV technology does; but it can do much more.

And that is why saying "EII failed in the market" is a joke. Because if you have an MDM solution, DV technology is there. If you have real-time business intelligence that can reach data outside of the data warehouse, DV technology is there. And, of course, over the years, endless DV "projects" have accumulated inside large and medium-sized enterprises and in government. The DV vendors you see are the tip of the iceberg. Data virtualization is everywhere.


Wayne Kernochan of Infostructure Associates has been an IT industry analyst focused on infrastructure software for more than 20 years.


There's also a lot of confusion out there about data integration, Enterprise Application Integration (EAI), and ETL (Extract, Transform, Load) tools. Loosely speaking, DV, ETL, and EAI all perform data integration, or the combination of data from different data sources. But data virtualization delivers the combined data in real time, and gives you the choice of whether to put the combined data in a physical data store or not. EAI allows users to exchange data between their enterprise apps' data stores, without trying to keep these apps' data in sync by real-time exchange. ETL takes relational data from multiple data sources, puts it in the format of the data warehouse, cleanses it, and bulk loads it into the warehouse's data store, with a significant delay between the time it arrives at the data source and the time it arrives in the data warehouse – often a day or more. So the key differences are:

  • Data virtualization is real-time;
  • DV handles a wide variety of data types; and
  • DV gives you the choice of storing the combined data or not.
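The differences listed above can be sketched in a few lines of Python. This is a toy contrast under simplifying assumptions, not a description of any ETL or DV product: `Source`, `etl_load`, and `VirtualView` are hypothetical names, and plain in-memory lists stand in for real data stores.

```python
class Source:
    """Hypothetical live data source backed by an in-memory list."""
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)

def etl_load(sources):
    """ETL-style: copy everything once into a physical store.
    The result is a snapshot and goes stale until the next load."""
    return [dict(row) for src in sources for row in src.read()]

class VirtualView:
    """DV-style: read the sources at query time.
    Storing the combined data is a choice, not a requirement."""
    def __init__(self, sources):
        self.sources = sources
        self._stored = None

    def query(self):
        if self._stored is not None:
            return self._stored          # caller chose a physical copy
        return [dict(row) for src in self.sources for row in src.read()]

    def materialize(self):
        """Optionally persist the current combined result."""
        self._stored = self.query()

# Made-up data: one source, one ETL snapshot, one virtual view.
orders = Source([{"order_id": 1}])
snapshot = etl_load([orders])
view = VirtualView([orders])

orders.rows.append({"order_id": 2})      # new data arrives at the source
stale_count = len(snapshot)              # the snapshot has not seen it
live_count = len(view.query())           # the virtual view has
```

After the append, the snapshot still reflects the state at load time, while the virtual view reflects the source as it is now. Calling `view.materialize()` freezes the view in the same way, which is the "store it or not" choice in the last bullet.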

Above all, data virtualization plus ETL and EAI gives you a full spectrum of choices between consolidating your data in one or a few data stores, or keeping it in a broad array of data sources, with a DV "veneer" that makes it all look like one gigantic real-time-accessible database and data store to end users, programmers, and administrators.

The IT buyer bottom line

The implications for IT buyers trying to implement better business intelligence, and particularly trying to figure out how to combine access to Big Data with their existing BI, are really straightforward. As in the past, data virtualization stands ready to let you not only query across both sets of data in near real time, but also make them look like one gigantic database and one gigantic data store: for MDM, for data discovery, and for isolating the data warehouse from the very different characteristics of Hadoop-accessed Big Data. And you will need it, because you have about as much chance of downloading Big Data in bulk into your data warehouse for near-real-time access as a snowball does of surviving hell.

Data virtualization applied to Big Data is an interesting case in point. As I have noted, Big Data accessed by interfaces such as Hadoop offers unprecedented scalability via "delayed consistency," especially when you try to access Facebook-type social media data. However, to achieve this scalability it sacrifices consistency between the various data copies, so that what you see may be inaccurate or out of date, and it risks downtime as the system catches up. Therefore, you need to isolate the rich-data, risky querying of Big Data from your existing mission-critical data warehouse while combining business intelligence directed at each. Data virtualization is exactly the way to do this. Its long experience in supplementing data warehouse business intelligence and its flexibility in incorporating new data types allow you both to isolate the two data sources from each other when necessary and to combine their output in real time when appropriate.
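One way to picture the isolation described above is a combining function that never lets trouble on the Big Data side touch the warehouse result. The Python sketch below is only an illustration under assumptions: `query_isolated` and its parameters are hypothetical, the two queries are passed in as plain callables, and a real DV product would handle this inside its own query engine.

```python
from concurrent.futures import ThreadPoolExecutor

def query_isolated(warehouse_query, big_data_query, timeout_s=1.0):
    """Combine results from both sides, keeping the warehouse result
    intact even if the Big Data query fails or runs too long."""
    # The mission-critical warehouse query runs on its own.
    warehouse_rows = warehouse_query()

    # The riskier Big Data query runs in a separate worker with a deadline.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(big_data_query)
    try:
        social_rows = future.result(timeout=timeout_s)
        social_available = True
    except Exception:
        # Covers both a timeout and an error raised by the query itself.
        social_rows = []
        social_available = False
    finally:
        # Do not block on a slow worker; note that this does not kill it.
        pool.shutdown(wait=False)

    return {
        "warehouse": warehouse_rows,
        "social": social_rows,
        "social_available": social_available,
    }
```

A caller can then show the warehouse figures in every case and add the social media view only when `social_available` is true, flagging it as possibly out of date given the consistency trade-off.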

What data virtualization solutions are out there? Well, as you can see from the above discussion, there are plenty of large vendors to pick from, including IBM, SAP/Sybase, Red Hat, and Oracle/BEA. However, I have a special fondness for the smaller ones like Composite and Denodo that are still plugging away independently – because, more than others, they have focused their efforts for many years on delivering top performance for this particular use case. Their long alliances with folks like MicroStrategy ensure that your integration into more general solutions, and your service and support, should also be top-notch. Still, your mileage may vary, and your existing vendor of other solutions may fit your needs best.

One more point: I stress again that data virtualization also makes your business intelligence more agile, no matter where you use it – if you are open to each new data type as it arrives. Don't just put data virtualization in there and let it hum along automatically. Constantly ask, are there new data types out there, new forms of Big Data, new types of "sensor data" from smartphone videos, that I can apply this to? Because if you don't, some of your competitors will. That's what agile business intelligence is all about.


  This article was originally published on Monday Nov 7th 2011