Companies have no problem accumulating data. According to an IDC study sponsored by EMC Corporation, the amount of digital data will reach 40,000 exabytes by 2020, up from 130 exabytes in 2005.
However, they do have problems analyzing it. The same study found that less than 1 percent of the so-called "digital universe" is analyzed – despite research indicating that 25 percent of the data could yield valuable information if it were.
Though there are likely multiple reasons for this relative lack of analysis, one widely cited reason is a lack of data scientists. Members of the International Institute for Analytics (IIA) made "a continued shortage of data scientists" one of their eight analytics predictions for 2013.
In a study from New Vantage Partners, 70 percent of surveyed organizations said they planned on hiring data scientists, a role the survey defined as "applying varying degrees of statistics, data visualizations, computer programming, data mining, machine learning, and database engineering to solve complex data problems." A whopping 80 percent of respondents found it challenging to fill these positions.
Go Team Data Science
The IIA's recommended approach is to create analytics teams rather than looking for individual data scientists. "It's too hard to find all the requisite skills that comprise a data scientist in one person," said Tom Davenport, the IIA's research director and a visiting professor at the Harvard Business School.
Opera Solutions, a provider of data analytics software and consulting services, uses a team approach. Jacob Spoelstra, the company's global head of R&D, explained that data science teams include both hands-on scientists responsible for creating algorithms and analytics managers who provide guidance to scientists, a role he likened to that of an academic advisor at a university.
Teams often also include more traditional project managers, who manage scheduling, costs and deliverables from a broader business perspective.
The company is composed of business units that serve specific verticals, such as health care, insurance and supply chain/operations. Leadership for each unit is shared by a business head, who handles business development and budgets, and a science head, who helps develop analytic technologies and works closely with analytic teams.
"People who can manage the business and develop the products are typically not the people who can do the analytics," Spoelstra said.
Because of all of the different skill sets that potentially come into play, Spoelstra said it's important for companies to define what they want from their data analysis projects and then drill down into the specific skills required to deliver it.
"Will data scientists be organizing data or deriving models for specific decision functions? Once you know that and begin looking for people to staff, this team concept comes into play because one person cannot do all of these functions," Spoelstra said. "One person needs to understand the decision function part of it, then you might have someone like a software engineer who knows how to deploy models, and someone else better at capturing data and working with different platforms."
Opera Solutions grabbed headlines when a team of its data scientists nabbed the runner-up prize in a 2009 contest to create an improved recommendation algorithm for the Netflix site. After that performance, Opera Solutions CEO Arnab Gupta launched a "consistent and intensive recruiting effort" for data scientists, said Laura Teller, the company's chief strategy officer.
The effort grew after Opera Solutions raised $84 million in an equity financing round in late 2011. The company's ranks of data scientists swelled from about 130 to the current 230.
Who Is a Data Scientist?
How does Opera Solutions recruit and retain these scarce and sought-after data professionals?
It can be tough, at least in part because no one seems able to agree on exactly what a data scientist is or does, Spoelstra said. "It's become almost a catch-all term for anyone who works with data and provides some analysis."
As consultant Neil Raden, CEO of Hired Brains, wrote in a blog post, the term was first used to describe professionals at companies like Amazon and Google, often trained mathematicians or statisticians, tackling data management and analysis duties that typically involved "massive data volumes, unruly data formats and sources that are beyond the typical enterprise data flows."
Since then, as Spoelstra pointed out, the term has begun to encompass a much wider swathe of data professionals. Opera Solutions employs data pros from 20-plus disciplines, including computer science, engineering, statistics and physics. Many of them, including Spoelstra, have PhDs.
This diversity is intentional, said Teller. "We can use this 'brain trust' as a means of crowdsourcing answers and coming up with very innovative models and techniques that simply work better."
Some programming background, involving languages like Python and Java, is essential, Spoelstra said, and Opera Solutions prefers people with a background in machine learning for its data scientists. After that, he said, the requirements get a bit more subjective.
"Real data, especially Big Data, is messy. You have to be able to understand how the data got together and what could possibly be wrong with it – so you need a practical side," he said.
People with a background in physics tend to make good data scientists, he said, because their training has typically involved some programming, mathematics and, most important, hands-on experience conducting experiments that involve gathering large amounts of real-world data.
"Physicists tend to bring the kind of skepticism that comes with measuring real data and understanding all of the things that can go wrong with your measurements," Spoelstra said.
Though many companies look to academia for potential data scientists, Spoelstra said the lack of a business background can be an issue. And, he added, "People in academia almost never work with really big data sets. They've often just used Excel, so there is a learning curve with programming and with using other tools to analyze data."
Care and Feeding of Data Scientists
Opera Solutions must be doing something right with its data scientists. Teller is proud of the company's low turnover rate, which she attributes in part to the key role data scientists play in the company's business.
"Scientists are central to what we do. They are co-heads of our Solutions Groups; they work with our business domain experts and consultants to solve problems. They are not relegated to a back room and told to build models – so they get to see the results of their work and solve difficult, real-life problems," she said. "This is applied science, but innovation and sophisticated scientific thinking is rewarded."
The sense of community created by employing a "critical mass" of scientists is important, too. "It's a real community that we nurture carefully," Teller said. "They know they are going to have peers and others they can turn to for help solving problems."
Ann All has been writing about technology and business for 15 years. She is the editor of Enterprise Apps Today and eSecurity Planet.