How important is data preparation when it comes to analytics success?
It is so important, said Mike Gualtieri, an analyst at Forrester Research, that some data analysts spend more than three quarters of their time preparing data: calculating aggregate fields, stripping extraneous characters, filling in missing data or merging multiple data sources.
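As a rough illustration of the four tasks just listed, the sketch below uses only the Python standard library. The record layouts, the field names (customer_id, amount, region) and the decision to fill gaps with the mean of known values are assumptions made for this example, not anything Gualtieri described.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical raw inputs: two sources that share a customer_id key.
orders = [
    {"customer_id": "C1", "amount": "$1,200.50"},
    {"customer_id": "C1", "amount": "300"},
    {"customer_id": "C2", "amount": None},       # missing value
    {"customer_id": "C2", "amount": " 99.50 "},
]
customers = {"C1": {"region": "West"}, "C2": {"region": "East"}}


def strip_extraneous(value):
    """Drop currency symbols, commas and whitespace; return a float or None."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", value)
    return float(cleaned) if cleaned else None


def prepare(orders, customers):
    # 1. Strip extraneous characters.
    amounts = [strip_extraneous(o["amount"]) for o in orders]
    # 2. Fill missing data with the mean of the known values (one simple choice).
    known = [a for a in amounts if a is not None]
    fill = mean(known) if known else 0.0
    amounts = [fill if a is None else a for a in amounts]
    # 3. Calculate an aggregate field: total spend per customer.
    totals = defaultdict(float)
    for order, amount in zip(orders, amounts):
        totals[order["customer_id"]] += amount
    # 4. Merge the two sources into one row per customer.
    return {
        cid: {
            "region": customers.get(cid, {}).get("region"),
            "total_spend": round(total, 2),
        }
        for cid, total in totals.items()
    }


prepared = prepare(orders, customers)
```

Even in a toy case like this, each step embeds a judgment call (how to fill a gap, which key to join on), which is part of why preparation consumes so much analyst time.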
While most observers agree that predictive analytics will have a growing presence in the coming years, one aspect of it still scares people: the accuracy of the data on which conclusions are based. Imagine making forecasts about the oil and gas industry based on 2014 data, when two years later the per-barrel price of oil has dropped dramatically. The results would be disastrous.
It isn't just inaccuracy that can lay waste to an otherwise well-executed predictive analytics forecast. Sometimes the data is in so many places and formats that it can't all be consolidated.
Why Data Preparation for Analytics Is Important
"Predictive models are only as accurate as the data fed into them, and over time they may degrade or increase their effectiveness," Gualtieri said.
So what does it take to get it right when it comes to data preparation for analytics?
Adjust Predictive Analytics Algorithms
The best way to monitor a model's effectiveness, Gualtieri said, is to take newly accumulated data, run it through existing predictive analytics algorithms and keep a close eye on results. If results falter, parameters within the algorithms will have to be adjusted and additional data sought out.
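One way to picture that monitoring loop is the minimal Python sketch below. The error metric (mean absolute error), the 25 percent tolerance and the toy model are all assumptions chosen for illustration; a real deployment would pick a metric and threshold suited to its own predictions.

```python
from statistics import mean


def mean_absolute_error(actuals, predictions):
    """Average size of the gap between what happened and what was predicted."""
    return mean(abs(a - p) for a, p in zip(actuals, predictions))


def needs_retuning(model, new_inputs, new_actuals, baseline_error, tolerance=0.25):
    """Score an existing model on newly accumulated data.

    Returns (flag, error). The flag is True when the error on the new data
    exceeds the baseline error by more than the tolerance fraction.
    """
    predictions = [model(x) for x in new_inputs]
    error = mean_absolute_error(new_actuals, predictions)
    return error > baseline_error * (1 + tolerance), error


# Hypothetical model fitted on older data: output = 2 * input.
def old_model(x):
    return 2.0 * x


# New data that still matches the old relationship.
flag, error = needs_retuning(old_model, [1, 2, 3], [2.1, 3.9, 6.0], baseline_error=0.1)

# New data where the relationship has shifted, so results falter.
drifted_flag, drifted_error = needs_retuning(
    old_model, [1, 2, 3], [1.0, 2.0, 3.0], baseline_error=0.1
)
```

When the flag comes back True, the sketch only signals that something has changed; deciding whether to adjust parameters or seek additional data remains the analyst's job.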
Start with Already Consolidated Data Sources
After a few initial wins on predictions, organizations often ramp up the scope of their projects rapidly. This is a mistake.
"One of the top reasons why many Big Data predictive analytics projects fail is they attempt to boil the ocean, resulting in a long time to ROI and waning organizational support," said Jeff Erhardt, CEO of predictive analytics provider Wise.io. "There is more opportunity in leveraging predictive analytics on the smaller, already consolidated data sources that exist within the enterprise, especially those within SaaS-based business tools."
This approach largely avoids, or at least minimizes, the data preparation headaches that can stultify any predictive analytics initiative.
Take Agile Approach to Predictive Analytics
One discipline that appears to be applicable to predictive analytics is agile software development. Some of its best practices can be carried over into analytics initiatives, if done in a targeted manner with as little potentially unclean data as possible.
"It is smart to apply an agile approach to the implementation of predictive analytics, rapid demonstration of ROI and incremental expansion of complexity and capability," Erhardt said.
Get the Right Data Quality Tools
Some business intelligence and analytics vendors have built their businesses around data quality.
"If you use an integrated platform that allows for data access to hundreds of data sources and a full data quality suite, you can be sure the results being delivered to the business are based on high quality data, while avoiding the risks of linking disparate technology," said Bruce Kolodziej, predictive analytics sales manager, Information Builders.
Automate Data Preparation, When Possible
You have to manage your data and connect to many disparate data sources. Doing that efficiently means cutting down on highly manual data preparation tasks.
"Companies are struggling with multi-month cycles involving heavy IT resources to normalize the data, which often produces inconsistent ways of implementing models into production," said Steven Hillion, co-founder and chief product officer for predictive analytics provider Alpine Data.
It is possible, he said, to add automation to the data preparation process. Some predictive analytics tools, for instance, include a model management and collaboration layer to centralize data preparation and modeling. This allows business users to maintain governance over the entire process without putting stress on their IT or data science teams.
Don't Drown in Data Lakes
Many organizations look to Hadoop to address the issues of disparate databases. They are building vast data lakes to provide analysts with access to diverse data. But some embarking on this journey find themselves beset by pitfalls similar to those facing sailors who try to conquer the Old Man of the Sea; more than a few drown without ever reaching dry land.
Sascha Schubert, director of analytics product marketing at SAS, said the way to navigate effectively is not to sail off into uncharted waters but to make a few test runs until you know where you are going and what to expect. Experiment with the data until you know what you are dealing with and where it might lead you.
"Users of predictive analytics applications need to be able to experiment with the data and easily apply different approaches to find new insights," Schubert said.
Use Natural Language Processing
Natural language processing is a good way to incorporate unstructured data.
"Users can use this to transform call center notes or Twitter feeds into structured data that can be used as input to predictive analytics," said Schubert. "With this new type of data available, we are seeing an increase in the accuracy of predictive analytics in many customer-focused applications."
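To make the idea of turning free text into structured data concrete, here is a deliberately simple Python sketch. It only counts tracked keywords, which is the most basic form of text processing; production natural language processing would add steps such as stemming, entity extraction or sentiment scoring. The tracked terms and the feature names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical vocabulary; a real project would derive this from its own data.
TRACKED_TERMS = ("refund", "cancel", "late", "thanks")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def note_to_features(text):
    """Turn one free-text note into a flat, structured row of counts."""
    counts = Counter(tokenize(text))
    features = {f"mentions_{term}": counts[term] for term in TRACKED_TERMS}
    features["word_count"] = sum(counts.values())
    return features


notes = [
    "Customer wants a REFUND, order arrived late. Refund requested twice.",
    "Thanks for the quick fix!",
]
rows = [note_to_features(note) for note in notes]
```

Each resulting row has the same columns regardless of what the customer wrote, which is what lets it sit alongside other structured inputs to a predictive model.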
Offer Self-Service Data Preparation
Data preparation capabilities should not only be integrated into predictive analytics applications but should also be made available to data analysts in a self-service manner, Schubert recommended. You don't want data analysts waiting around for clean data when a major market shakeup demands a rapid response.
Know, Then Go
Many of the cautions above apply to those just establishing their predictive analytics capabilities, learning the ropes and understanding what technology can and cannot do. But once you have the right predictive analytics platform, known methods of data cleansing and confidence in your ability to get results, unleash it!
The real power of predictive analytics, after all, is to scan thousands of data attributes quickly and find the relationships that provide predictive capabilities. For this, analysts must be able to combine data from different sources into a data object that can be processed by predictive analytics algorithms.
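A small Python sketch can show both halves of that idea: combining sources into one row per entity, then scanning every attribute for its relationship to an outcome. The three sources, the attribute names and the use of Pearson correlation as the measure of predictive power are assumptions for illustration; commercial tools apply far more sophisticated techniques, and the sketch assumes every attribute is numeric and present for every key.

```python
from math import sqrt


def combine(sources, keys):
    """Merge several {key: {attribute: value}} sources into one row per key."""
    rows = {}
    for key in keys:
        row = {}
        for source in sources:
            row.update(source.get(key, {}))
        rows[key] = row
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_attributes(rows, target):
    """Sort attributes by absolute correlation with the target, strongest first."""
    attributes = sorted({a for row in rows.values() for a in row} - {target})
    ys = [row[target] for row in rows.values()]
    scores = {a: pearson([row[a] for row in rows.values()], ys) for a in attributes}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical sources keyed by customer ID.
crm = {"C1": {"tenure_months": 3}, "C2": {"tenure_months": 12},
       "C3": {"tenure_months": 24}, "C4": {"tenure_months": 36}}
support = {"C1": {"tickets": 5}, "C2": {"tickets": 1},
           "C3": {"tickets": 4}, "C4": {"tickets": 2}}
outcomes = {"C1": {"churned": 1}, "C2": {"churned": 1},
            "C3": {"churned": 0}, "C4": {"churned": 0}}

rows = combine([crm, support, outcomes], keys=["C1", "C2", "C3", "C4"])
ranked = rank_attributes(rows, target="churned")
```

In this made-up data, tenure turns out to be strongly related to churn while ticket volume is not, something that would not have been known before scanning both attributes.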
"As it is unknown upfront which data attributes have predictive power, it is beneficial to feed as many data attributes as possible to minimize the risk to miss important relationships," Schubert said. "In general, more data is better. Only using a subset of data that is known to be good would probably limit the insights that can be extracted."
That said, the larger the data pool, the greater the risk of error. So the lesson is to get your feet wet, learn what you can and can’t achieve with your predictions, and gradually work your way up to larger-scale projects.