Business Intelligence, Data Warehousing and Data Virtualization

by Wayne Kernochan

'Move or stay' is the big decision facing business intelligence implementations – should data be local or centralized? Some new technologies make the decision a lot more interesting than it used to be.

One of the most fundamental decisions that business intelligence implementers in IT make, at the beginning of every new BI initiative, is whether the new data involved should be copied into a central data mart or data warehouse or accessed where it is.

The advent of software as a service (SaaS) and the public cloud has added a new dimension to this decision: now the BI implementer must also decide whether to move the data physically into the cloud and "de-link" the cloud and internal data stores. In fact, this decision is no longer solely the purview of the CTO – the security concerns of moving data to a public service provider mean that corporate needs to have input into the decision. Fundamentally, however, the decision is the same: move the one copy of the data; keep the one copy where it is; or copy the data, move one copy, and synchronize between copies.

Almost two decades of experience with data warehousing has shown that these decisions have serious long-term consequences. On the one hand, for customer buying-pattern insights that demand a petabyte of related data, failure to copy to a central location can mean serious performance degradation – as in, it takes hours instead of minutes to detect a major cross-geography shift in buying behavior. On the other hand, attempting to stuff Big Data like the graphics and video involved in social networking into a data warehouse means an exceptionally long lag before the data yields insights. The firm's existing data warehouse wasn't designed for this data; it is not fine-tuned for good performance on this data; and despite the best efforts of vendors, periodic movement or replication of such massive amounts of data to the data warehouse has a large impact on the warehouse's ability to perform its usual tasks. Above all, these consequences are long-term – applications come to depend for their performance on the data's existing location, and rewriting all of those applications in order to move to a different database engine or a different location is beyond the powers of most IT shops.

The reason that it is time to revisit the "move or stay" decision now is that business intelligence users, and therefore the BI IT that supports them, are faced with an unprecedented opportunity and an unprecedented problem. The opportunity, which now as never before is available not only to large but also to medium-sized firms, is to gather mammoth amounts of new customer data on the Web and use rapid-fire BI on that data to drive faster "customer-of-one" and agile product development and sales. The problem is that much of this new data is large (by some estimates, unstructured data alone is approaching 90% of organizational data by size), is generated outside the organization, and changes very rapidly due to sudden and dangerous shifts in the company's environment: fads, sudden brand-impacting environmental concerns, and/or competitors who are playing the same BI game as you.

How do you get performance without moving the data into the organization's data warehouse? How can the data warehouse support both new and old types of data and deliver performance on both? Most importantly, how do you keep from making the same mistakes in "move or stay" decisions that make present-day data warehousing so expensive and sub-optimal for your new needs?

Business Intelligence Political Wars

Today's business intelligence users, wowed by case studies of great analytics insights leading to major cost-cutting and add-on sales, are likely to view these "move or stay" decisions as the property of IT, to be decided after the CEO or CMO decides how best to use the latest dashboard, Facebook data miner, or performance management tool. In turn, BI IT has tended to view these decisions as short-term and ad hoc, meant to meet immediate urgent needs. Alas, past experience has shown that not only is such an approach to business intelligence implementation unwise, it is also futile.

An old (slightly "altared") story concerns a new pastor, some of whose congregation ask him to change the location of the altar. He asks an older pastor what the tradition is, and the older pastor, instead of answering, tells him to try it that way. The result is a massive argument among the congregation. He goes back, and the older pastor tells him to put it back. Instead of dying down, the argument gets even hotter. He goes back a third time and asks, "What is the tradition? My congregation is fighting like mad about this." "Ah," says the older pastor, "that's the tradition."

In the same way, when data warehousing was first introduced, CEOs deferred the "move or stay" decision to IT, which attempted to shoehorn all data into the central data warehouse. Lines of business, of course, resisted the idea that corporate IT should be gatekeepers over their data, making them wait weeks for reports on their local sales that used to take a day. The result was that CEOs were frequently being appealed to by IT or by lines of business over the matter – that became the tradition. Eventually, the advent of data marts and the fact on the ground that data warehouses could not handle new data from acquisitions ended the arguments, at the cost of BI that was poorly equipped to handle new data from outside the organization and of executives and lines of business that under-used corporate BI.

To avoid these political wars, the BI user needs to set out a long-term approach to "move or stay" that should inform implementation decisions at both the corporate and IT level. Instead of "I have a hammer, everything looks like a nail", this approach stresses maximum flexibility and agility of any BI solution – which, in turn, translates to asking BI IT to deliver, not the highest performance, but reasonable performance and maximum architectural flexibility to accommodate new types of data.

Wayne Kernochan of Infostructure Associates has been an IT industry analyst focused on infrastructure software for more than 20 years.


The Business Intelligence Technology of the Decade

To make such an approach effective, the BI-using enterprise needs to understand just what IT can and cannot do to make the architecture flexible through "move or stay" decisions. Here, the good news is that software technology developed over the last decade has given IT a lot more ability to make its architecture flexible – if it will use it.

In my view, the greatest business intelligence technology advance of the last decade is not cloud, analytics, BI for the masses, or "near-real-time" BI, but a technology originally called Enterprise Information Integration (EII) and now called data virtualization. This technology makes many disparate data stores appear as one to the user, developer, and administrator.  In order to do so, it also demands that the data virtualization user develop a global metadata directory that catalogs not only data copies, but also variant storage formats. Thus, in master data management, data virtualization allows a wide variety of ways of representing a customer, many of them from acquired companies with their own approaches to customer data, to have a common "master record." In business intelligence, data virtualization allows Big Data from outside the organization to be combined with new types of operational data that are not yet stored in the data warehouse and with data-warehouse data in carrying out near-real-time data mining and analytics.

The practical effect of data virtualization is that in the real world it delivers the "move or stay" flexibility that data warehousing alone never could. It does this in two ways:

1. It gives the BI end user a third option: not just "copy new data to the data warehouse and wait until the data warehouse gives you access to it" or "don't copy it and have no access to it at all," but also "don't copy it and accept slower-performing access to it."

2. It makes IT and the wider organization aware of the company's data assets, allowing IT to provide high-level BI interfaces beneath which the data's location can be changed as needed, and allowing the organization to better understand the BI opportunities afforded by new data it hadn't known about.

There are data virtualization products available today not only from specialists like Composite Software but also within the master data management solutions of vendors like IBM, and embedded in the solutions of business intelligence vendors like MicroStrategy. At the same time, BI buyers should bear in mind that a metadata officer is needed to oversee the storage of metadata in the repository and to enforce corporate standards for master data management – a role that is a good idea in the long run anyway.
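To make the idea concrete, here is a minimal sketch of the data virtualization pattern described above: a global metadata directory maps each source's variant field names onto one logical "master" customer schema, so a query sees a single virtual store even though no data has been moved. All store names, field names, and records here are hypothetical illustrations, not any vendor's actual API.

```python
# Two disparate stores - e.g. the in-house CRM and an acquired
# company's customer list, each with its own field names.
CRM_STORE = [
    {"cust_id": 1, "cust_name": "Acme Corp", "region": "New York"},
]
ACQUIRED_STORE = [
    {"customer_number": 2, "full_name": "Globex", "territory": "Shanghai"},
]

# Global metadata directory: a per-source mapping from the master
# schema's fields to that source's variant field names.
METADATA_DIRECTORY = {
    "crm": {"id": "cust_id", "name": "cust_name", "location": "region"},
    "acquired": {"id": "customer_number", "name": "full_name",
                 "location": "territory"},
}

SOURCES = {"crm": CRM_STORE, "acquired": ACQUIRED_STORE}


def virtual_query():
    """Yield every customer in the master format, without moving any data."""
    for source_name, rows in SOURCES.items():
        mapping = METADATA_DIRECTORY[source_name]
        for row in rows:
            # Translate each variant record into the common master record.
            yield {master_field: row[source_field]
                   for master_field, source_field in mapping.items()}


for record in virtual_query():
    print(record)
```

The point of the sketch is that adding a new source (say, from the next acquisition) only requires a new metadata-directory entry, not a bulk copy into the warehouse – which is exactly the "move or stay" flexibility at issue.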

The second advance that BI users should know about is a work in progress, and is also a bit trickier to describe. The essence of the problem with scattering data copies across geographies is that either every update to one copy must appear to users as if all other copies were updated simultaneously, or you must deal with the difficulties that result when one user thinks the data has one value and another user, another. For example, suppose your bank receives a deposit to your checking account in New York to cover a withdrawal, and then the withdrawal itself in Shanghai immediately after. If the Shanghai branch doesn't receive notification of the first update in time, you will face overdraft fees and will be rightfully annoyed at the bank.

For at least the last thirty years, software folks have been wrestling with this problem. The solution that covers all cases is something called the two-phase commit, but it requires two back-and-forth rounds of communication between New York and Shanghai, and is therefore so slow in real life that it can only handle a small percentage of today's data. In the late 1990s, Microsoft and others found a way to delay some of the updates and still make it look as if all the data is in synchronization, in the local area networks that support today's global organizations. Above all, over the last few years, the need to support distributed cloud data stores has led to identification of many use cases (often associated with the "noSQL" movement) in which two-phase commit isn't needed and so-called "eventual consistency" is OK. The result is that, on the cloud, you are much more able to keep multiple copies of data in different locations without slowing down business intelligence performance unacceptably. Of course, this is still a long way from the cloud hype of "your data is somewhere in the cloud, and you don't need to know the location" – and it is likely that in the real world we will never get to the point where data location doesn't matter.
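The bank scenario above can be sketched as a toy two-phase commit keeping two copies of an account balance (say, NY and Shanghai) in agreement. The class and function names are illustrative, not a real database API, and a production protocol would also need write-ahead logging and failure recovery, which are omitted here; the sketch simply shows the two message rounds that make the protocol slow.

```python
class Replica:
    """One copy of an account balance at a site."""

    def __init__(self, site):
        self.site = site
        self.balance = 0
        self._pending = None

    def prepare(self, delta):
        # Phase 1: vote yes only if applying delta would keep the
        # balance valid; hold the change without making it visible.
        if self.balance + delta < 0:
            return False  # would overdraw at this site: vote no
        self._pending = delta
        return True

    def commit(self):
        # Phase 2: make the prepared change visible.
        self.balance += self._pending
        self._pending = None

    def abort(self):
        # Discard any prepared-but-uncommitted change.
        self._pending = None


def two_phase_commit(replicas, delta):
    """Apply delta to every copy or to none - note the two full
    message rounds (prepare, then commit/abort)."""
    if all(r.prepare(delta) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False


ny, shanghai = Replica("NY"), Replica("Shanghai")
two_phase_commit([ny, shanghai], 100)   # NY deposit, seen everywhere
two_phase_commit([ny, shanghai], -50)   # Shanghai withdrawal succeeds
print(ny.balance, shanghai.balance)     # both copies agree: 50 50
```

Because neither copy commits until both have voted yes, Shanghai can never see a balance that NY has not also seen – which is exactly the guarantee eventual-consistency systems relax in exchange for speed.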

Thus, the business intelligence-using organization should expect IT to be able to use data virtualization to deliver much greater BI flexibility, in the cloud or outside it, and should demand that IT consider the flexibility benefits of noSQL-type data copying in certain use cases – but should not expect cloud data-location nirvana.

The Business Intelligence User Bottom Line: Think Long Term

So "move or stay" is an important decision for the business intelligence buyer or upgrader; it should be made up front by corporate as well as IT; there are much better solutions today that allow smart BI implementers to avoid many past mistakes; and the key to "move or stay" decision success is to emphasize flexibility over raw performance. Suppose you do all that; now what?

The first thing you find is that decisions like "cloud BI or not cloud BI" become a lot easier. With less dependence on data-location-dependent apps, moving these apps and their data from location to location becomes, if not easy, at least doable in a lot more cases. So where previously it made sense to move only a small subset of mission-critical Big Data to a small cloud BI provider – because otherwise the coordination between it and your data warehouse became unwieldy – now you can make the decision based more on the potential cost savings of the cloud versus the overall performance advantages (smaller than before) of a single massive data warehouse.

The second thing you discover is that you have created some new problems – but also some new opportunities. The old, dying "data warehouse plus operational database" model of handling enterprise data had its drawbacks; but compared to the new architectures, complexity was not one of them. However, well-designed new architectures that include moved, copied, and accessed-in-place data also allow the BI user to change the proportions rapidly to adapt to new data. In this case, the agility benefits far outweigh the complexity costs.

The third thing you see is that corporate is being forced to think about BI as a long-term investment, not an endless series of "fad" tools – and that's an excellent thing. All too often, today, the market for BI in general and analytics in particular is driven by one-time cost-cutting or immediate needs for insights and rapid decisions. The key difference between organizational flexibility and organizational agility is that the latter makes the organization change constantly, because it is always focused on the change after this one. "Move or stay" decisions by IT can make your business intelligence and enterprise architecture more flexible; a long-term mindset by corporate that drives the "move or stay" and other BI decisions makes BI and the whole organization more agile. And data on agile software development suggests that agility delivers greater cost, revenue, and customer satisfaction benefits than flexibility, both in the short and long term.

A final thought: a popular term a few years ago was "glocal," in which the most effective enterprise was the one that best combined global and local insights. "Move or stay" is the equivalent in business intelligence, a way of tuning BI to combine the best of internal/local and Web/cloud/global data for better insights. In essence, "move or stay" success is a way to achieve information glocality. Given the importance today of long-term, customer-information-driven competitive advantage, a key action item for every business intelligence user, in corporate and IT, should be to redesign the BI architecture in accordance with more agile "move or stay" norms. It's not administrivia; it's about the long term.


This article was originally published on Monday Aug 29th 2011