Business intelligence analyst Wayne Kernochan, of Infostructure Associates, says it's a shame exploratory data analysis (EDA) doesn’t get more attention from enterprises looking to increase their competitive edge.
Business intelligence has taken a step forward in maturity over the last few years, as statistical packages have become more associated with analytics. SAS has for years distinguished itself by its statistics-focused business intelligence solution; but when IBM acquired SPSS, the grand-daddy of statistical packages, the importance of more rigorous analysis of company and customer data seemed both confirmed and more obvious.
Moreover, over the years, data miners have begun to draw on the insights of university researchers about things like “data mining bias” and Bayesian statistics – and the most in-depth, competitive-advantage-determining analyses have benefited as a result.
So it would seem that data miners, business analysts and IT are on a nice query-technology glide path. Statistics completes the flexibility of analytics by covering one extreme of certainty and analytical complexity, while traditional analytics tools cover the rest of the spectrum up from situations where shallow and imprecise analysis is appropriate. And statistical techniques filter down by technology evolution to the “unwashed masses” of end users.
And yet there is a glaring gap in this picture – or at least a gap that should be glaring. This gap might be summed up as Alice in Wonderland’s “verdict first, then the trial.” Both the business and the researcher start with their own narrow picture of what the customer or research subject should look like, and the analytics and statistics that accompany such hypotheses are designed to narrow in on a solution rather than to expand in response to unexpected data. Thus, the business/researcher is likely to miss key customer insights, psychological and otherwise.
Pile on top of this the “not invented here” syndrome characteristic of most enterprises, and the “confirmation bias” that recent research has shown to be prevalent among individuals and organizations, and you have a real analytical problem on your hands.
In our excitement about the real advances in our ability to understand the customer via social media, we often fail to notice how the recent popularity of “qualitative methods” in psychology has exposed, to those who are willing to see, the enormous number of insights about customer psychology, sociology and behavior that traditional statistics fails to capture. In the world of business, as I can personally attest, the same type of problem exists.
For more than a decade, I have run total cost-of-ownership (TCO) studies, particularly on SMB use of databases. I discovered early on that open-ended interviews of relatively few sys admins were far more effective in capturing the real costs of databases than broader, inflexible on-a-scale-from-one-to-five surveys of CIOs. Moreover, if I just included the ability of the interviewee to tell a story from his or her point of view, the respondent would consistently come up with an insight of extraordinary value. One such insight, for example, is the idea that SMBs often care less about technology that saves operational costs than about technology that saves an office head time by requiring him or her to just press a button as he or she shuts off the lights on Saturday night.
EDA: Getting Preliminary Data Analysis Right
The key to success for my “surveys” was that they were designed to be:
- Open-ended (They were able to go in a new direction during the interview, and they left space for whatever the interviewer might have left out.)
- Interviewee-driven (They started by letting the interviewee tell a story as he or she saw it.)
- Flexible in the kind of data collected (Typically, an IT organization did not know its overall costs of database administration and in a survey would have guessed, badly, but it almost invariably knew how many database instances it had per administrator.)
As it turns out, there is a comparable statistical approach for the data analysis side of things. It’s called exploratory data analysis, or EDA.
EDA is about analyzing smaller initial amounts of data to generate as many plausible hypotheses (or “patterns in the data”) as possible, before winnowing them down with further data. For example, the technique creates abstract, unlabeled visualizations (“data visualization”) of possible patterns, such as the strangely named box-and-whisker plot, and then picks the ones with the most promise.
In practice, EDA identifies far more hypotheses than pre-defined models alone. Therefore, applying EDA right after partial data collection typically results in quite a few more key insights.
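To make the box-and-whisker plot mentioned above a little more concrete, here is a minimal Python sketch of the numbers such a plot draws for a single variable. It is only an illustration under stated assumptions: the function name `box_summary` is hypothetical, the sample figures (database instances per administrator) are invented, and the 1.5 × interquartile-range cutoff for flagging outliers is the common Tukey convention rather than anything specified here.

```python
from statistics import quantiles


def box_summary(values):
    """Return the numbers a box-and-whisker plot draws for one variable.

    Hypothetical helper for illustration; uses the common 1.5 * IQR rule
    to decide which points count as outliers.
    """
    data = sorted(values)
    # Quartiles: the edges of the box and the line inside it.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach to the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Invented sample: database instances per administrator at nine shops.
instances_per_admin = [4, 5, 5, 6, 6, 7, 7, 8, 30]
summary = box_summary(instances_per_admin)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In a sketch like this, the one shop far outside the whiskers is exactly the kind of unexpected data point that might prompt a new hypothesis worth testing with further data, rather than something to be discarded before the analysis starts.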
The automation of EDA over the last couple of decades means the average analyst can insert EDA into his/her typical analysis routine with minimal training and minor added analysis time. Preliminary use cases in academia suggest that effective use of EDA should yield a major improvement in analytics effectiveness “at the margin” (in resulting in-depth analyses) for a small “time overhead” cost.
Where Is It?
You would think, given its potential advantages, that EDA would be used more often in corporate business intelligence. It’s not necessarily the software’s fault. There is, for example, an open-source solution, Orange, which has merged full EDA capabilities with a fully capable data-mining tool.
And yet, in both academia and business, business intelligence is a hot topic while EDA is rarely mentioned. Even vendor EDA products don’t appear to be flourishing as they ought. For a while, SAS’ JMP product stood bravely and alone as a tool that could at least potentially be used by businesses. However, according to Wikipedia, SAS has recently discontinued support for its use on Linux. Still, Wikipedia notes 14 such EDA suites, and there may be more out there.
So let’s summarize: EDA is out there. It’s easy to use. Now that statistical analysis in general is creeping into greater use in analytics, users are ready for it. I fully anticipate that it would have major positive effects on in-depth analytics for enterprises from the largest down at least to many medium-sized ones.
IT shops will have to do some customization and integration themselves, because most if not all vendors have not yet fully integrated it as part of the analytics process in their business intelligence suites, but with open-source and other “standard” EDA tools that’s not inordinately difficult. The only thing lacking is for somebody, anybody, to wake up and pay attention.
IT would be a great driver of EDA use. The most effective initial use of EDA is in support of the longer-term efforts of today’s business analysts, and not in IT-driven agile business intelligence. However, IT should find these business analysts to be surprisingly receptive to an IT EDA pitch – or, at the least, amazed that IT isn’t being a “boat anchor” yet again.
You see, EDA has a sheen of “innovation” about it, and so folks who are in some way associated with the business’ “innovation” efforts should like it a lot. The rest is simply a matter of its becoming part of these business analysts' growing toolkit of rapid-query-generation and statistical in-depth-insight-at-the-margin tools. EDA may not in the normal course of usage get the glory of notice as the source of a new competition-killer, but with a little assiduous use-case monitoring by IT the business case can be made.
It is equally important for IT to note that EDA is twice as effective if it is joined at the front end by a data-gathering process that is to a much greater extent open-ended, customer-driven and flexible in the type of data gathered. Remember, there are ways of doing this – such as parallel in-depth customer interviews or Internet surveys that don’t just parrot SurveyMonkey – that add little “overhead” to data-gathering. IT should seriously consider doing this as well, and preferably design the data-gathering process so as to feed the gathered data to EDA tools, where in-depth statistical analysis of that data will probably be an appropriate next step.
The overall effect will be like replacing a steadily narrowing view of the data with one that expands the potential analyses until the right balance between “data blindness” and “paralysis by analysis” risks is reached.
Making Data Sing
To view EDA as comparable to other business intelligence technologies/solutions is to miss the point. EDA is much more like agile development; its main value lies in changing our analytics methodology, not in improving traditional analyses. It helps the organization itself to think not “outside the box” but “outside the organization” – to be able to combine the viewpoint of the vendor with the viewpoint and reality of the customer, rather than trying to force customer interactions into corporate fantasies of the way customers should think and act for maximum vendor profit.
We all witnessed the public-relations disaster that ensued when Bank of America announced new charges for debit cards – a situation that, if we were honest, we would admit most other enterprises find it all too easy to stumble into. If EDA (or, better still, EDA plus open-ended, customer-driven data-gathering) prevents only one such misstep, it will pay for itself 10 times over, no matter what the numbers say.
EDA seems like it’s about competitive advantage. That’s true as far as it goes, but EDA is actually much more about business risk.
The reference in my title is to an old, bad joke in which a Mob boss, to gain culture, attends an opera, accompanied by his lieutenants. As the long second act draws to a close, the henchmen start getting restless. “Wait!” shouts the boss with cultural authority, silencing their complaints. “It ain’t over ‘til the fat lady sings!” And, of course, that is what EDA should do for you: make your hypotheses broader and meatier so that, ultimately, the data really sings.
Companies might find out through EDA of their Facebook pages that several key customers-of-one are in opera-lovers’ groups and will respond to an Internet commercial with an aria in it. Sing PROFIT! That's a happy ending.
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. Wayne has been an IT industry analyst for 22 years. During that time, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI, and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.